GROUP BY에 서브쿼리를: 된다고 막 쓰지 말자!

오늘은 오픈소스DBMS에서 GROUP BY에 서브쿼리를 사용하는 패턴을 살펴보겠습니다. 성능 관련해서 주의가 필요한 내용입니다.

MySQL

먼저 MySQL입니다. 아래 SQL은 tr_ord에서 2025년 1월 데이터를 조회하고 있습니다. 해당 데이터는 14,579건이 존재합니다. 14,579건에 대해 shop_id로 GROUP BY 처리하고 있습니다. 또한, SELECT 절에서는 shop_nm을 처리하기 위해 서브쿼리를 사용하고 있습니다. 그런데, GROUP BY 절에도 동일하게 서브쿼리를 사용했습니다. 아시는 분은 아시겠지만, GROUP BY 부분에 서브쿼리는 생략해도 됩니다. 하지만 개발 일정에 쫓겨 개발하다 보면 이렇게 개발을 하게 되는 경우들이 종종 있습니다. (저는, 절대 아닙니다만…) 또는 GROUP BY의 사용법을 잘 모르는 개발자일수도 있죠.

[SQL-1:MySQL]

EXPLAIN ANALYZE
SELECT  t1.shop_id
        ,(SELECT x.shop_nm FROM startdbmy.ms_shop x WHERE x.shop_id = t1.shop_id) shop_nm
        ,COUNT(*) cnt
FROM    startdbmy.tr_ord t1
WHERE   t1.ord_dtm >= STR_TO_DATE('20250101','%Y%m%d')
AND     t1.ord_dtm <  STR_TO_DATE('20250201','%Y%m%d')
GROUP BY t1.shop_id
        ,(SELECT x.shop_nm FROM startdbmy.ms_shop x WHERE x.shop_id = t1.shop_id);
        
-> Table scan on <temporary>  (actual time=113..113 rows=289 loops=1)
    -> Aggregate using temporary table  (actual time=113..113 rows=289 loops=1)
        -> Index range scan on t1 using tr_ord_x01 over ('2025-01-01 00:00:00' <= ord_dtm < '2025-02-01 00:00:00'), with index condition: ((startdbmy.t1.ord_dtm >= <cache>(str_to_date('20250101','%Y%m%d'))) and (startdbmy.t1.ord_dtm < <cache>(str_to_date('20250201','%Y%m%d'))))  (cost=32073 rows=27062) (actual time=0.0499..54.1 rows=14290 loops=1)
    -> Select #3 (subquery in projection; dependent)
        -> Single-row index lookup on x using PRIMARY (shop_id=startdbmy.t1.shop_id)  (cost=0.35 rows=1) (actual time=0.00182..0.00184 rows=1 loops=14579)
    -> Select #2 (subquery in projection; dependent)
        -> Single-row index lookup on x using PRIMARY (shop_id=startdbmy.t1.shop_id)  (cost=0.35 rows=1) (actual time=0.00327..0.0033 rows=1 loops=289)

위 SQL의 실행계획을 살펴보면, 서브쿼리가 2번 사용되고 있습니다. 이미 성능에서 큰 손해를 보고 있는 것이죠. 또한, Select #3에 대한 서브쿼리의 actual loops 수치를 보면 14,579입니다. 서브쿼리가 GROUP BY 이전의 데이터 건수인 14,579건만큼 반복 실행된 것입니다. 다행히도, Select #2에 대한 서브쿼리는 GROUP BY 이후의 건수만큼인 289번(loops=289) 반복 실행되었네요. MySQL에서는 위 SQL을 다음과 같이 변경할 수 있습니다. GROUP BY 절에, SELECT 절의 별칭인 shop_nm을 그대로 사용하는 것이죠. ORACLE에서 MySQL로 전환하다 보면, 아래와 같은 문법이 지원된다는 이유로, 아래와 같이 SQL을 개발하는 경우가 종종 발견됩니다. 다행히도 실행계획에는 서브쿼리가 하나만 출력되었습니다. 하지만, 여전히 서브쿼리가 14,579번 반복 실행된 것을 알 수 있습니다.

[SQL-2:MySQL]

EXPLAIN ANALYZE
SELECT  t1.shop_id
        ,(SELECT x.shop_nm FROM startdbmy.ms_shop x WHERE x.shop_id = t1.shop_id) shop_nm
        ,COUNT(*) cnt
FROM    startdbmy.tr_ord t1
WHERE   t1.ord_dtm >= STR_TO_DATE('20250101','%Y%m%d')
AND     t1.ord_dtm <  STR_TO_DATE('20250201','%Y%m%d')
GROUP BY t1.shop_id
        ,shop_nm;

-> Table scan on <temporary>  (actual time=75.6..75.7 rows=289 loops=1)
    -> Aggregate using temporary table  (actual time=75.6..75.6 rows=289 loops=1)
        -> Index range scan on t1 using tr_ord_x01 over ('2025-01-01 00:00:00' <= ord_dtm < '2025-02-01 00:00:00'), with index condition: ((startdbmy.t1.ord_dtm >= <cache>(str_to_date('20250101','%Y%m%d'))) and (startdbmy.t1.ord_dtm < <cache>(str_to_date('20250201','%Y%m%d'))))  (cost=32073 rows=27062) (actual time=0.0422..38.6 rows=14290 loops=1)
    -> Select #2 (subquery in projection; dependent)
        -> Single-row index lookup on x using PRIMARY (shop_id=startdbmy.t1.shop_id)  (cost=0.35 rows=1) (actual time=0.00128..0.0013 rows=1 loops=14579)

[SQL-1:MySQL]과 [SQL-2:MySQL]의 올바른 해법은 아래와 같습니다. GROUP BY에 불필요한 shop_nm 부분을 제거하는 것입니다. shop_nm은 어차피 shop_id에 종속되니까요. 서브쿼리가 하나만 출력되면서, 실행횟수도 GROUP BY 이후 건수인 289번만 반복 실행됩니다.

[SQL-3:MySQL]

EXPLAIN ANALYZE
SELECT  t1.shop_id
        ,(SELECT x.shop_nm FROM startdbmy.ms_shop x WHERE x.shop_id = t1.shop_id) shop_nm
        ,COUNT(*) cnt
FROM    startdbmy.tr_ord t1
WHERE   t1.ord_dtm >= STR_TO_DATE('20250101','%Y%m%d')
AND     t1.ord_dtm <  STR_TO_DATE('20250201','%Y%m%d')
GROUP BY t1.shop_id;

-> Table scan on <temporary>  (actual time=70.3..70.4 rows=289 loops=1)
    -> Aggregate using temporary table  (actual time=70.3..70.3 rows=289 loops=1)
        -> Index range scan on t1 using tr_ord_x01 over ('2025-01-01 00:00:00' <= ord_dtm < '2025-02-01 00:00:00'), with index condition: ((startdbmy.t1.ord_dtm >= <cache>(str_to_date('20250101','%Y%m%d'))) and (startdbmy.t1.ord_dtm < <cache>(str_to_date('20250201','%Y%m%d'))))  (cost=32073 rows=27062) (actual time=0.0545..54.8 rows=14290 loops=1)
    -> Select #2 (subquery in projection; dependent)
        -> Single-row index lookup on x using PRIMARY (shop_id=startdbmy.t1.shop_id)  (cost=0.35 rows=1) (actual time=0.00526..0.0053 rows=1 loops=289)

PostgreSQL

PostgreSQL도 퀵하게 살펴봅니다. MySQL과 동일한 방식으로 GROUP BY 절에 서브쿼리를 사용하거나, SELECT 절의 서브쿼리 별칭을 GROUP BY 절에 그대로 사용할 수 있습니다. 아래와 같습니다.

[SQL-1: PostgreSQL]
EXPLAIN (ANALYZE,BUFFERS,COSTS OFF)
SELECT  t1.shop_id
        ,(SELECT x.shop_nm FROM startdbpg.ms_shop x WHERE x.shop_id = t1.shop_id) shop_nm
        ,COUNT(*) cnt
FROM    startdbpg.tr_ord t1
WHERE   t1.ord_dtm >= '20250101'::date
AND     t1.ord_dtm <  '20250201'::date
GROUP BY t1.shop_id
        ,(SELECT x.shop_nm FROM startdbpg.ms_shop x WHERE x.shop_id = t1.shop_id);
GroupAggregate (actual time=212.145..213.626 rows=289 loops=1)
  Group Key: t1.shop_id, ((SubPlan 1))
  Buffers: shared hit=57362
  ->  Sort (actual time=212.080..212.522 rows=14290 loops=1)
        Sort Key: t1.shop_id, ((SubPlan 1))
        Sort Method: quicksort  Memory: 1045kB
        Buffers: shared hit=57362
        ->  Index Scan using tr_ord_x01 on tr_ord t1 (actual time=19.697..202.655 rows=14290 loops=1)
              Index Cond: ((ord_dtm >= '2025-01-01'::date) AND (ord_dtm < '2025-02-01'::date))
              Buffers: shared hit=57362
              SubPlan 1
                ->  Seq Scan on ms_shop x (actual time=0.005..0.012 rows=1 loops=14290)
                      Filter: ((shop_id)::text = (t1.shop_id)::text)
                      Rows Removed by Filter: 299
                      Buffers: shared hit=57160
Planning:
  Buffers: shared hit=8
Planning Time: 0.294 ms
Execution Time: 221.128 ms

[SQL-2: PostgreSQL]
EXPLAIN (ANALYZE,BUFFERS,COSTS OFF)
SELECT  t1.shop_id
        ,(SELECT x.shop_nm FROM startdbpg.ms_shop x WHERE x.shop_id = t1.shop_id) shop_nm
        ,COUNT(*) cnt
FROM    startdbpg.tr_ord t1
WHERE   t1.ord_dtm >= '20250101'::date
AND     t1.ord_dtm <  '20250201'::date
GROUP BY t1.shop_id
        ,shop_nm;
GroupAggregate (actual time=199.159..200.591 rows=289 loops=1)
  Group Key: t1.shop_id, ((SubPlan 1))
  Buffers: shared hit=57362
  ->  Sort (actual time=199.103..199.571 rows=14290 loops=1)
        Sort Key: t1.shop_id, ((SubPlan 1))
        Sort Method: quicksort  Memory: 1045kB
        Buffers: shared hit=57362
        ->  Index Scan using tr_ord_x01 on tr_ord t1 (actual time=6.021..191.003 rows=14290 loops=1)
              Index Cond: ((ord_dtm >= '2025-01-01'::date) AND (ord_dtm < '2025-02-01'::date))
              Buffers: shared hit=57362
              SubPlan 1
                ->  Seq Scan on ms_shop x (actual time=0.005..0.012 rows=1 loops=14290)
                      Filter: ((shop_id)::text = (t1.shop_id)::text)
                      Rows Removed by Filter: 299
                      Buffers: shared hit=57160
Planning:
  Buffers: shared hit=8
Planning Time: 0.133 ms
Execution Time: 201.384 ms

[SQL-3: PostgreSQL]
EXPLAIN (ANALYZE,BUFFERS,COSTS OFF)
SELECT  t1.shop_id
        ,(SELECT x.shop_nm FROM startdbpg.ms_shop x WHERE x.shop_id = t1.shop_id) shop_nm
        ,COUNT(*) cnt
FROM    startdbpg.tr_ord t1
WHERE   t1.ord_dtm >= '20250101'::date
AND     t1.ord_dtm <  '20250201'::date
GROUP BY t1.shop_id;
HashAggregate (actual time=3.831..7.565 rows=289 loops=1)
  Group Key: t1.shop_id
  Batches: 1  Memory Usage: 61kB
  Buffers: shared hit=1358
  ->  Index Scan using tr_ord_x01 on tr_ord t1 (actual time=0.013..1.581 rows=14290 loops=1)
        Index Cond: ((ord_dtm >= '2025-01-01'::date) AND (ord_dtm < '2025-02-01'::date))
        Buffers: shared hit=202
  SubPlan 1
    ->  Seq Scan on ms_shop x (actual time=0.007..0.013 rows=1 loops=289)
          Filter: ((shop_id)::text = (t1.shop_id)::text)
          Rows Removed by Filter: 299
          Buffers: shared hit=1156
Planning:
  Buffers: shared hit=8
Planning Time: 0.314 ms
Execution Time: 7.619 ms

MySQL과 마찬가지로, GROUP BY 절에 스칼라 서브쿼리가 사용되면, GROUP BY 이전의 레코드 건수만큼 반복 실행됩니다.