Here we return to polishing the cutting edge, the high geekcraft of databases. We will look at more of the wonders of TPC-H and cover two more tricks. Experts can skip the preliminaries and go straight to the query profiles; for the others, there is some explanation first.
From the TPC-H specification:
SELECT TOP 100
         c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice,
         SUM ( l_quantity )
    FROM customer,
         orders,
         lineitem
   WHERE o_orderkey IN
         ( SELECT l_orderkey
             FROM lineitem
            GROUP BY l_orderkey
           HAVING SUM ( l_quantity ) > 312
         )
     AND c_custkey = o_custkey
     AND o_orderkey = l_orderkey
GROUP BY c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice
ORDER BY o_totalprice DESC,
         o_orderdate
The intent of the query is to return order and customer information for cases where an order involves a large quantity of items, with the highest-value orders first.
We note that the only restriction in the query is the one on the SUM of l_quantity in the IN subquery. Everything else is a full scan or a JOIN on a foreign key.
Now, the first query optimization rule of thumb could be summarized as "start from the small." Small here means something that is restricted; it does not mean a small table. The smallest is the one from which the highest percentage of rows is dropped by a condition that does not depend on other tables.
The next rule of thumb is to try starting from the large, if the large has a restricting join; for example, scan all the lineitems and hash join to the parts that are green and of a given brand. In this case, the idea is to make a hash table from the small side and sequentially scan the large side, dropping everything that does not match something in the hash table.
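As a concrete illustration of this pattern, a query of roughly the following shape (a sketch of my own; the brand literal and the LIKE pattern are illustrative, not from the text above) would build the hash table from the restricted part side and probe it while scanning lineitem:

SELECT SUM ( l_extendedprice * ( 1 - l_discount ) )
  FROM lineitem, part
 WHERE l_partkey = p_partkey          -- foreign key join lineitem -> part
   AND p_brand = 'Brand#12'           -- illustrative brand restriction
   AND p_name LIKE '%green%'          -- "green" parts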
The only restriction here is on orders via a join on lineitem. So, the IN subquery can be flattened, so as to read like --
SELECT ...
  FROM ( SELECT l_orderkey,
                SUM ( l_quantity )
           FROM lineitem
          GROUP BY l_orderkey
         HAVING SUM ( l_quantity ) > 312
       ) f,
       orders,
       customer,
       lineitem
 WHERE f.l_orderkey = o_orderkey ....
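Spelled out in full, the flattened query might read as follows. This is a sketch of my own that keeps the select list, grouping, and ordering of the original Q18; only l_orderkey is kept in the derived table's select list, since that is all the outer query references, and the lineitem column in the outer join condition is qualified to avoid ambiguity with f:

SELECT TOP 100
         c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice,
         SUM ( l_quantity )
    FROM ( SELECT l_orderkey
             FROM lineitem
            GROUP BY l_orderkey
           HAVING SUM ( l_quantity ) > 312
         ) f,
         orders,
         customer,
         lineitem
   WHERE f.l_orderkey = o_orderkey
     AND c_custkey = o_custkey
     AND o_orderkey = lineitem.l_orderkey
GROUP BY c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice
ORDER BY o_totalprice DESC,
         o_orderdate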
The above (left to right) is the best JOIN order for this type of plan. We start from the restriction, and for all the rest the JOIN is foreign key to primary key, sometimes n:1 (orders to customer), sometimes 1:n (orders to lineitem). A 1:n is usually best by index; an n:1 can be better by hash if there are enough tuples on the n side to make it worthwhile to build the hash table.
We note that the first GROUP BY makes a very large number of groups, e.g., 150M at 100 GB scale. We also note that if lineitem is ordered so that the lineitems of a single order are together, the GROUP BY is ordered. In other words, once you have seen a specific value of l_orderkey change to the next, you will not see the old value again. In this way, the groups do not have to be remembered for all time. The GROUP BY produces a stream of results as the scan of lineitem proceeds.
Considering vectored execution, the GROUP BY does remember a bunch of groups, up to a vector size's worth, so that output from the GROUP BY is done in large enough batches, not a tuple at a time.
Considering parallelization, the scan of lineitem must be split in such a way that all lineitems with the same l_orderkey get processed by the same thread. If this is the case, all threads will produce an independent stream of results that is guaranteed to need no merge with the output of another thread.
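Under such a partitioning, each thread in effect evaluates the derived table over its own slice of l_orderkey values. A sketch of what one slice amounts to, with :range_start and :range_end as hypothetical placeholders for the slice boundaries (they are not part of the actual query text):

SELECT l_orderkey,
       SUM ( l_quantity )
  FROM lineitem
 WHERE l_orderkey >= :range_start    -- hypothetical lower bound of this thread's slice
   AND l_orderkey <  :range_end      -- hypothetical upper bound
 GROUP BY l_orderkey
HAVING SUM ( l_quantity ) > 312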
So, we can try this:
{
time 6e-06% fanout 1 input 1 rows
time 4.5% fanout 1 input 1 rows
{ hash filler
-- Make a hash table from c_custkey to c_name
time 0.99% fanout 1.5e+07 input 1 rows
CUSTOMER 1.5e+07 rows(.C_CUSTKEY, .C_NAME)
time 0.81% fanout 0 input 1.5e+07 rows
Sort hf 35 (.C_CUSTKEY) -> (.C_NAME)
}
time 2.2e-05% fanout 1 input 1 rows
time 1.6e-05% fanout 1 input 1 rows
{ fork
time 5.2e-06% fanout 1 input 1 rows
{ fork
-- Scan lineitem
time 10% fanout 6.00038e+08 input 1 rows
LINEITEM 6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)
-- Ordered GROUP BY (streaming with duplicates)
time 73% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)
-- The ordered aggregation above emits a batch of results every so often, having accumulated 20K or so groups (distinct l_orderkey values).
-- The operator below reads the batch and sends it onward, freeing the GROUP BY hash table for the next batch.
time 10% fanout 21231.4 input 7065 rows
group by read node
(t5.L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
After code:
0: L_ORDERKEY := := artm t5.L_ORDERKEY
4: BReturn 0
-- This marks the end of the flattened IN subquery. 1063 out of 150M groups survive the test on the SUM of l_quantity.
-- The main difficulty of Q18 is guessing that this condition is this selective.
time 0.0013% fanout 1 input 1063 rows
Subquery Select(L_ORDERKEY)
time 0.058% fanout 1 input 1063 rows
ORDERS unq 0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
inlined O_ORDERKEY = L_ORDERKEY
hash partition+bloom by 42 (tmp)hash join merged always card 0.99 -> (.C_NAME)
time 0.0029% fanout 1 input 1063 rows
Hash source 35 merged into ts 0.99 rows(.O_CUSTKEY) -> (.C_NAME)
After code:
0: .C_CUSTKEY := := artm .O_CUSTKEY
4: BReturn 0
time 0.018% fanout 7 input 1063 rows
LINEITEM 4.3 rows(.L_QUANTITY)
inlined L_ORDERKEY = .O_ORDERKEY
time 0.011% fanout 0 input 7441 rows
Sort (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
}
time 0.00026% fanout 1063 input 1 rows
group by read node
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.00061% fanout 0 input 1063 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
}
time 1.7e-05% fanout 100 input 1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time 1.2e-06% fanout 0 input 100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}
6351 msec 1470% cpu, 2151 rnd 6.14898e+08 seq 0.185874% same seg 1.57993% same pg
What is wrong with this? The result is not bad, in the ballpark of VectorWise's published results (4.9s on a slightly faster box), but better is possible. We note that there is a hash join from orders to customer. Only 1K customers of 15M get hit; the whole hash table of 15M entries is built in vain. Let's cheat and declare the join to be by index. Cheats like this are not allowed in an official run, but here we are just looking. So we change the mention of the customer table in the FROM clause from FROM ... customer, ... to FROM ... customer TABLE OPTION (loop), ...
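Applied to the flattened query shown earlier, the FROM clause would then read roughly as follows (a sketch; the rest of the query is unchanged):

FROM ( SELECT l_orderkey
         FROM lineitem
        GROUP BY l_orderkey
       HAVING SUM ( l_quantity ) > 312
     ) f,
     orders,
     customer TABLE OPTION (loop),
     lineitem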
{
time 1.4e-06% fanout 1 input 1 rows
time 9e-07% fanout 1 input 1 rows
-- Here was the hash build in the previous plan; now we start directly with the scan of lineitem.
time 2.2e-06% fanout 1 input 1 rows
{ fork
time 2.3e-06% fanout 1 input 1 rows
{ fork
time 11% fanout 6.00038e+08 input 1 rows
LINEITEM 6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)
time 78% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)
time 11% fanout 21231.4 input 7065 rows
group by read node
(t5.L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
After code:
0: L_ORDERKEY := := artm t5.L_ORDERKEY
4: BReturn 0
time 0.0014% fanout 1 input 1063 rows
Subquery Select(L_ORDERKEY)
time 0.051% fanout 1 input 1063 rows
ORDERS unq 0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
inlined O_ORDERKEY = L_ORDERKEY
-- We note that getting the 1063 customers by index takes no time, and there is no hash table to build.
time 0.023% fanout 1 input 1063 rows
CUSTOMER unq 0.99 rows (.C_CUSTKEY, .C_NAME)
inlined C_CUSTKEY = .O_CUSTKEY
time 0.021% fanout 7 input 1063 rows
LINEITEM 4.3 rows(.L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
-- The rest is identical to the previous plan, cut for brevity
3852 msec 2311% cpu, 3213 rnd 5.99907e+08 seq 0.124456% same seg 1.08899% same pg
Compilation: 1 msec 0 reads 0% read 0 messages 0% clw
We save over 2s of real time. But the problem is how to know that very few customers will be hit. One could make a calculation: l_quantity is between 1 and 50, and an order has an average of 4 lineitems with a maximum of 7. For the SUM to be over 312, only orders with 7 lineitems are eligible, and even then all the l_quantity values must be high: the maximum possible sum is 7 × 50 = 350, so the 7 quantities must average above 44.6. Assuming flat distributions, which here happens to be the case, one could estimate that the condition selects very few orders. The problem is that real data with this kind of regularity is hardly ever seen, so such a trick, while allowed, would only work for benchmarks.
* * * * *
As it happens, there is a better way. We also note that the query selects the TOP 100 orders with the highest o_totalprice. This is a very common pattern; there is almost always a TOP k clause in analytics queries, unless they GROUP BY something that is known to be of low cardinality, like nation or year.
If the ordering falls on a grouping column, as soon as there are enough groups generated to fill a TOP 100, one can take the lowest o_totalprice as a limit and add this into the query as an extra restriction. Every time the TOP 100 changes, the condition becomes more selective, as the 100th highest o_totalprice increases.
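In effect, once the TOP 100 is full, the engine behaves as if the query carried an extra predicate on o_totalprice. The following sketch shows that effective query, with :price_100 as a hypothetical placeholder for the lowest o_totalprice currently in the TOP 100; the engine tightens this bound as higher-priced orders arrive, and the user never writes it:

SELECT TOP 100
         c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice,
         SUM ( l_quantity )
    FROM customer,
         orders,
         lineitem
   WHERE o_totalprice >= :price_100   -- implicit bound maintained by the TOP k operator
     AND o_orderkey IN
         ( SELECT l_orderkey
             FROM lineitem
            GROUP BY l_orderkey
           HAVING SUM ( l_quantity ) > 312
         )
     AND c_custkey = o_custkey
     AND o_orderkey = l_orderkey
GROUP BY c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice
ORDER BY o_totalprice DESC,
         o_orderdate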
Sometimes the ordering falls on the aggregation result, which is not known until the aggregation is finished. However, in lookup-style queries, it is common to take the latest so-many events or just the TOP k items by some metric. In these cases, pushing the TOP k restriction down into the selection always works.
So, we try this:
{
time 4e-06% fanout 1 input 1 rows
time 6.1e-06% fanout 1 input 1 rows
{ fork
-- The plan begins with orders, as we now expect a selection on o_totalprice.
-- We see that out of 150M orders, a little over 10M survive the o_totalprice selection, which gets more restrictive as the query proceeds.
time 33% fanout 1.00628e+07 input 1 rows
ORDERS 4.3e+04 rows(.O_TOTALPRICE, .O_ORDERKEY, .O_CUSTKEY, .O_ORDERDATE)
top k on O_TOTALPRICE
time 32% fanout 3.50797e-05 input 1.00628e+07 rows
END Node
After test:
0: if ({
-- The IN subquery is here kept as a subquery, not flattened.
time 0.42% fanout 1 input 1.00628e+07 rows
time 11% fanout 4.00136 input 1.00628e+07 rows
LINEITEM 4 rows(.L_ORDERKEY, .L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
time 21% fanout 2.55806e-05 input 4.02649e+07 rows
Sort streaming with duplicates (set_ctr, .L_ORDERKEY) -> (.L_QUANTITY)
time 2.4% fanout 9769.72 input 1030 rows
group by read node
(gb_set_no, .L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
time 0.00047% fanout 0 input 353 rows
Subquery Select( )
}
) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
-- Here we see that fewer customers are accessed than in the non-TOP k plans, since there is an extra cut on o_totalprice that takes effect earlier.
time 0.013% fanout 1 input 353 rows
CUSTOMER unq 1 rows (.C_CUSTKEY, .C_NAME)
inlined C_CUSTKEY = k_.O_CUSTKEY
time 0.0079% fanout 7 input 353 rows
LINEITEM 4 rows(.L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
time 0.0063% fanout 0.0477539 input 2471 rows
Sort streaming with duplicates (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.0088% fanout 2.99153 input 118 rows
group by read node
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.0063% fanout 0 input 353 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
}
time 8.5e-05% fanout 100 input 1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time 2.7e-06% fanout 0 input 100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}
949 msec 2179% cpu, 1.00486e+07 rnd 4.71013e+07 seq 99.9267% same seg 0.0318055% same pg
Here we see that the time is about 4x better than with the cheat version. We note that about 10M of 150M orders get considered. After going through the first 10% or so of orders, there is a TOP 100, and a condition on o_totalprice that will drop most orders can be introduced.
If we set the condition on the SUM of l_quantity so that no orders match, there is no TOP k cut at any point, and we get a time of 6.8s, which is a little worse than the initial time with the flattened IN. But since the TOP k trick does not allocate memory, it is relatively safe even in cases where it does not help.
We can argue that the TOP k pushdown trick is more robust than guessing the selectivity of a SUM of l_quantity. Further, it applies to a broad range of lookup queries, while the SUM trick applies only to TPC-H Q18, or close enough. Thus, the TOP k trick is safer and more generic.
We are approaching the end of the TPC-H blog series, with still two families of tricks to consider, namely, moving predicates between subqueries, and late projection. After this we will look at results and the overall picture.
To be continued...