Here we return to polishing the cutting edge, the high geekcraft of databases. We will look at more of the wonders of TPC-H and cover two more tricks. Experts can skip the preliminaries and go straight to the query profiles; for the others, there is some explanation first.
From the TPC-H specification:
SELECT TOP 100
         c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice,
         SUM ( l_quantity )
    FROM customer,
         orders,
         lineitem
   WHERE o_orderkey IN
         ( SELECT l_orderkey
             FROM lineitem
            GROUP BY l_orderkey
           HAVING SUM ( l_quantity ) > 312
         )
     AND c_custkey = o_custkey
     AND o_orderkey = l_orderkey
GROUP BY c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice
ORDER BY o_totalprice DESC,
         o_orderdate
The intent of the query is to return order and customer information for cases where an order involves a large quantity of items, with the highest-value orders first.
We note that the only restriction in the query is the one on the SUM of l_quantity in the IN subquery. Everything else is a full scan or a JOIN on a foreign key.
Now, the first query optimization rule of thumb could be summarized as "start from the small." Small here means something that is restricted; it does not mean a small table. The smallest is the one from which the highest percentage of rows is dropped by a condition that does not depend on other tables.
The next rule of thumb is to try starting from the large, if the large has a restricting join; for example, scan all the lineitems and hash join to the parts that are green and of a given brand. In this case, the idea is to make a hash table from the small side and sequentially scan the large side, dropping everything that does not match something in the hash table.
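As a concrete illustration of this pattern, a query of roughly the following shape (a sketch of my own; the brand literal and the LIKE pattern are illustrative, not from the text above) would build the hash table from the restricted part side and probe it while scanning lineitem:

SELECT SUM ( l_extendedprice * ( 1 - l_discount ) )
  FROM lineitem, part
 WHERE l_partkey = p_partkey          -- foreign key join lineitem -> part
   AND p_brand = 'Brand#12'           -- illustrative brand restriction
   AND p_name LIKE '%green%'          -- "green" parts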
The only restriction here is on orders via a join on lineitem. So, the IN subquery can be flattened, so as to read like --
SELECT ...
  FROM ( SELECT l_orderkey,
                SUM ( l_quantity )
           FROM lineitem
          GROUP BY l_orderkey
         HAVING SUM ( l_quantity ) > 312
       ) f,
       orders,
       customer,
       lineitem
 WHERE f.l_orderkey = o_orderkey ....
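Spelled out in full, the flattened query might read as follows. This is a sketch of my own that keeps the select list, grouping, and ordering of the original Q18; only l_orderkey is kept in the derived table's select list, since that is all the outer query references, and the lineitem column in the outer join condition is qualified to avoid ambiguity with f:

SELECT TOP 100
         c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice,
         SUM ( l_quantity )
    FROM ( SELECT l_orderkey
             FROM lineitem
            GROUP BY l_orderkey
           HAVING SUM ( l_quantity ) > 312
         ) f,
         orders,
         customer,
         lineitem
   WHERE f.l_orderkey = o_orderkey
     AND c_custkey = o_custkey
     AND o_orderkey = lineitem.l_orderkey
GROUP BY c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice
ORDER BY o_totalprice DESC,
         o_orderdate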
The above (left to right) is the best JOIN order for this type of plan. We start from the restriction, and for all the rest the JOIN is foreign key to primary key, sometimes n:1 (orders to customer), sometimes 1:n (orders to lineitem). A 1:n is usually best by index; an n:1 can be better by hash if there are enough tuples on the n side to make it worthwhile to build the hash table.
We note that the first GROUP BY makes a very large number of groups, e.g., 150M at 100 GB scale. We also note that if lineitem is ordered so that the lineitems of a single order are together, the GROUP BY is ordered. In other words, once you have seen a specific value of l_orderkey change to the next, you will not see the old value again. In this way, the groups do not have to be remembered for all time. The GROUP BY produces a stream of results as the scan of lineitem proceeds.
Considering vectored execution, the GROUP BY does remember a bunch of groups, up to a vector size's worth, so that output from the GROUP BY is done in large enough batches, not a tuple at a time.
Considering parallelization, the scan of lineitem must be split in such a way that all lineitems with the same l_orderkey get processed by the same thread. If this is the case, all threads will produce an independent stream of results that is guaranteed to need no merge with the output of another thread.
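Under such a partitioning, each thread in effect evaluates the derived table over its own slice of l_orderkey values. A sketch of what one slice amounts to, with :range_start and :range_end as hypothetical placeholders for the slice boundaries (they are not part of the actual query text):

SELECT l_orderkey,
       SUM ( l_quantity )
  FROM lineitem
 WHERE l_orderkey >= :range_start    -- hypothetical lower bound of this thread's slice
   AND l_orderkey <  :range_end      -- hypothetical upper bound
 GROUP BY l_orderkey
HAVING SUM ( l_quantity ) > 312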
So, we can try this:
{
time 6e-06% fanout 1 input 1 rows
time 4.5% fanout 1 input 1 rows
{ hash filler
-- Make a hash table from c_custkey to c_name
time 0.99% fanout 1.5e+07 input 1 rows
CUSTOMER 1.5e+07 rows(.C_CUSTKEY, .C_NAME)
time 0.81% fanout 0 input 1.5e+07 rows
Sort hf 35 (.C_CUSTKEY) -> (.C_NAME)
}
time 2.2e-05% fanout 1 input 1 rows
time 1.6e-05% fanout 1 input 1 rows
{ fork
time 5.2e-06% fanout 1 input 1 rows
{ fork
-- Scan lineitem
time 10% fanout 6.00038e+08 input 1 rows
LINEITEM 6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)
-- Ordered GROUP BY (streaming with duplicates)
time 73% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)
-- The ordered aggregation above emits a batch of results every so often, having accumulated 20K or so groups (distinct l_orderkey values).
-- The operator below reads the batch and sends it onward, freeing the GROUP BY hash table for the next batch.
time 10% fanout 21231.4 input 7065 rows
group by read node
(t5.L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
After code:
0: L_ORDERKEY := := artm t5.L_ORDERKEY
4: BReturn 0
-- This marks the end of the flattened IN subquery. 1063 out of 150M groups survive the test on the SUM of l_quantity.
-- The main difficulty of Q18 is guessing that this condition is this selective.
time 0.0013% fanout 1 input 1063 rows
Subquery Select(L_ORDERKEY)
time 0.058% fanout 1 input 1063 rows
ORDERS unq 0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
inlined O_ORDERKEY = L_ORDERKEY
hash partition+bloom by 42 (tmp)hash join merged always card 0.99 -> (.C_NAME)
time 0.0029% fanout 1 input 1063 rows
Hash source 35 merged into ts 0.99 rows(.O_CUSTKEY) -> (.C_NAME)
After code:
0: .C_CUSTKEY := := artm .O_CUSTKEY
4: BReturn 0
time 0.018% fanout 7 input 1063 rows
LINEITEM 4.3 rows(.L_QUANTITY)
inlined L_ORDERKEY = .O_ORDERKEY
time 0.011% fanout 0 input 7441 rows
Sort (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
}
time 0.00026% fanout 1063 input 1 rows
group by read node
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.00061% fanout 0 input 1063 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
}
time 1.7e-05% fanout 100 input 1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time 1.2e-06% fanout 0 input 100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}
6351 msec 1470% cpu, 2151 rnd 6.14898e+08 seq 0.185874% same seg 1.57993% same pg
What is wrong with this? The result is not bad, in the ballpark of VectorWise's published results (4.9s on a slightly faster box), but better is possible. We note that there is a hash join from orders to customer. Only 1K customers of 15M get hit; the whole hash table of 15M entries is built in vain. Let's cheat and declare the join to be by index. Cheats like this are not allowed in an official run, but here we are just looking. So we change the mention of the customer table in the FROM clause from FROM ... customer, ... to FROM ... customer TABLE OPTION (loop), ...
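Applied to the flattened query shown earlier, the FROM clause would then read roughly as follows (a sketch; the rest of the query is unchanged):

FROM ( SELECT l_orderkey
         FROM lineitem
        GROUP BY l_orderkey
       HAVING SUM ( l_quantity ) > 312
     ) f,
     orders,
     customer TABLE OPTION (loop),
     lineitem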
{
time 1.4e-06% fanout 1 input 1 rows
time 9e-07% fanout 1 input 1 rows
-- Here was the hash build in the previous plan; now we start directly with the scan of lineitem.
time 2.2e-06% fanout 1 input 1 rows
{ fork
time 2.3e-06% fanout 1 input 1 rows
{ fork
time 11% fanout 6.00038e+08 input 1 rows
LINEITEM 6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)
time 78% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)
time 11% fanout 21231.4 input 7065 rows
group by read node
(t5.L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
After code:
0: L_ORDERKEY := := artm t5.L_ORDERKEY
4: BReturn 0
time 0.0014% fanout 1 input 1063 rows
Subquery Select(L_ORDERKEY)
time 0.051% fanout 1 input 1063 rows
ORDERS unq 0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
inlined O_ORDERKEY = L_ORDERKEY
-- We note that getting the 1063 customers by index takes no time, and there is no hash table to build.
time 0.023% fanout 1 input 1063 rows
CUSTOMER unq 0.99 rows (.C_CUSTKEY, .C_NAME)
inlined C_CUSTKEY = .O_CUSTKEY
time 0.021% fanout 7 input 1063 rows
LINEITEM 4.3 rows(.L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
-- The rest is identical to the previous plan, cut for brevity
3852 msec 2311% cpu, 3213 rnd 5.99907e+08 seq 0.124456% same seg 1.08899% same pg
Compilation: 1 msec 0 reads 0% read 0 messages 0% clw
We save over 2s of real time. But the problem is how to know that very few customers will be hit. One could make a calculation: l_quantity is between 1 and 50, and an order has an average of 4 lineitems with a maximum of 7. For the SUM to be over 312, only orders with 7 lineitems are eligible, and even then all the l_quantity values must be high: the maximum possible sum is 7 × 50 = 350, so the 7 quantities must average above 44.6. Assuming flat distributions, which here happens to be the case, one could estimate that the condition selects very few orders. The problem is that real data with this kind of regularity is hardly ever seen, so such a trick, while allowed, would only work for benchmarks.
* * * * *
As it happens, there is a better way. We also note that the query selects the TOP 100 orders with the highest o_totalprice. This is a very common pattern; there is almost always a TOP k clause in analytics queries, unless they GROUP BY something that is known to be of low cardinality, like nation or year.
If the ordering falls on a grouping column, as soon as there are enough groups generated to fill a TOP 100, one can take the lowest o_totalprice as a limit and add this into the query as an extra restriction. Every time the TOP 100 changes, the condition becomes more selective, as the 100th highest o_totalprice increases.
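In effect, once the TOP 100 is full, the engine behaves as if the query carried an extra predicate on o_totalprice. The following sketch shows that effective query, with :price_100 as a hypothetical placeholder for the lowest o_totalprice currently in the TOP 100; the engine tightens this bound as higher-priced orders arrive, and the user never writes it:

SELECT TOP 100
         c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice,
         SUM ( l_quantity )
    FROM customer,
         orders,
         lineitem
   WHERE o_totalprice >= :price_100   -- implicit bound maintained by the TOP k operator
     AND o_orderkey IN
         ( SELECT l_orderkey
             FROM lineitem
            GROUP BY l_orderkey
           HAVING SUM ( l_quantity ) > 312
         )
     AND c_custkey = o_custkey
     AND o_orderkey = l_orderkey
GROUP BY c_name,
         c_custkey,
         o_orderkey,
         o_orderdate,
         o_totalprice
ORDER BY o_totalprice DESC,
         o_orderdate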
Sometimes the ordering falls on the aggregation result, which is not known until the aggregation is finished. However, in lookup-style queries, it is common to take the latest so-many events or just the TOP k items by some metric. In these cases, pushing the TOP k restriction down into the selection always works.
So, we try this:
{
time 4e-06% fanout 1 input 1 rows
time 6.1e-06% fanout 1 input 1 rows
{ fork
-- The plan begins with orders, as we now expect a selection on o_totalprice.
-- We see that out of 150M orders, a little over 10M survive the o_totalprice selection, which gets more restrictive as the query proceeds.
time 33% fanout 1.00628e+07 input 1 rows
ORDERS 4.3e+04 rows(.O_TOTALPRICE, .O_ORDERKEY, .O_CUSTKEY, .O_ORDERDATE)
top k on O_TOTALPRICE
time 32% fanout 3.50797e-05 input 1.00628e+07 rows
END Node
After test:
0: if ({
-- The IN subquery is here kept as a subquery, not flattened.
time 0.42% fanout 1 input 1.00628e+07 rows
time 11% fanout 4.00136 input 1.00628e+07 rows
LINEITEM 4 rows(.L_ORDERKEY, .L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
time 21% fanout 2.55806e-05 input 4.02649e+07 rows
Sort streaming with duplicates (set_ctr, .L_ORDERKEY) -> (.L_QUANTITY)
time 2.4% fanout 9769.72 input 1030 rows
group by read node
(gb_set_no, .L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
time 0.00047% fanout 0 input 353 rows
Subquery Select( )
}
) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
-- Here we see that fewer customers are accessed than in the non-TOP k plans, since there is an extra cut on o_totalprice that takes effect earlier.
time 0.013% fanout 1 input 353 rows
CUSTOMER unq 1 rows (.C_CUSTKEY, .C_NAME)
inlined C_CUSTKEY = k_.O_CUSTKEY
time 0.0079% fanout 7 input 353 rows
LINEITEM 4 rows(.L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
time 0.0063% fanout 0.0477539 input 2471 rows
Sort streaming with duplicates (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.0088% fanout 2.99153 input 118 rows
group by read node
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.0063% fanout 0 input 353 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
}
time 8.5e-05% fanout 100 input 1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time 2.7e-06% fanout 0 input 100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}
949 msec 2179% cpu, 1.00486e+07 rnd 4.71013e+07 seq 99.9267% same seg 0.0318055% same pg
Here we see that the time is about 4x better than with the cheat version. We note that about 10M of 150M orders get considered. After going through the first 10% or so of orders, there is a TOP 100, and a condition on o_totalprice that will drop most orders can be introduced.
If we set the condition on the SUM of l_quantity so that no orders match, there is no TOP k cut at any point, and we get a time of 6.8s, which is a little worse than the initial time with the flattened IN. But since the TOP k trick does not allocate memory, it is relatively safe even in cases where it does not help.
We can argue that the TOP k pushdown trick is more robust than guessing the selectivity of a SUM of l_quantity. Further, it applies to a broad range of lookup queries, while the SUM trick applies only to TPC-H Q18, or close enough. Thus, the TOP k trick is safer and more generic.
We are approaching the end of the TPC-H blog series, with still two families of tricks to consider, namely, moving predicates between subqueries, and late projection. After this we will look at results and the overall picture.
To be continued...