Analytics is generally about making something small out of something large. This reduction is obtained by a TOP k operator (e.g., show only the 10 best by some metric) and/or by grouping and aggregation (e.g., for a set of items, show some attributes of these items and a sum, count, or other aggregate of dependent items for each).
In this installment we will look at late projection, also sometimes known as late materialization. If many attributes are returned and there is a cutoff of some sort, then the query does not need to be concerned with attributes on which there are no conditions, except to fetch them at the last moment, and only for the entities that will actually be returned to the user.
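The pattern can be sketched outside any database engine. Below is a minimal Python sketch on assumed toy in-memory data (the table, column names, and the helper `topk_late_projection` are all illustrative, not part of TPC-H): rank on the narrow sort column alone, and fetch the wide attribute only for the k winners.

```python
import heapq

# Toy "table" stored row-wise: a narrow ranking column (score) and a
# wide column (comment) that carries most of the bytes.
rows = [
    {"id": i, "score": (i * 37) % 100, "comment": "x" * 1000}
    for i in range(10_000)
]

def topk_late_projection(rows, k):
    # Phase 1: rank using only the key and the sort column.
    best = heapq.nlargest(k, ((r["score"], r["id"]) for r in rows))
    ids = [rid for _, rid in best]
    # Phase 2: fetch the wide column for just the k winners.
    # The dict stands in for an index; only k lookups touch "comment".
    by_id = {r["id"]: r for r in rows}
    return [(rid, by_id[rid]["score"], by_id[rid]["comment"]) for rid in ids]

top = topk_late_projection(rows, 10)
```

The wide `comment` values never enter the sort; they are materialized for 10 rows instead of 10,000.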
We look at TPC-H Q2 and Q10.
Q2:
SELECT TOP 100
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
FROM part,
supplier,
partsupp,
nation,
region
WHERE p_partkey = ps_partkey
AND s_suppkey = ps_suppkey
AND p_size = 15
AND p_type LIKE '%BRASS'
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'EUROPE'
AND ps_supplycost =
( SELECT MIN(ps_supplycost)
FROM partsupp,
supplier,
nation,
region
WHERE p_partkey = ps_partkey
AND s_suppkey = ps_suppkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'EUROPE'
)
ORDER BY s_acctbal DESC,
n_name,
s_name,
p_partkey
The intent is to return information about parts and suppliers, such that the part is available from a supplier in Europe, and the supplier has the lowest price for the part among all European suppliers.
Q10:
SELECT TOP 20
c_custkey,
c_name,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
FROM customer,
orders,
lineitem,
nation
WHERE c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate >= CAST ('1993-10-01' AS DATE)
AND o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
AND l_returnflag = 'R'
AND c_nationkey = n_nationkey
GROUP BY c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
ORDER BY revenue DESC
The intent is to list the customers who cause the greatest loss of revenue in a given quarter by returning items ordered in said quarter.
We notice that both queries return many columns on which there are no conditions, and that both have a cap on returned rows. The difference is that in Q2 the major ORDER BY is on a grouping column, and in Q10 it is on the aggregate of the GROUP BY. Thus the TOP k trick discussed in the previous article does apply to Q2 but not to Q10.
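The distinction can be illustrated with a small Python sketch on assumed toy data (the keys and values are made up, not TPC-H rows): when the ordering column is stored per row, a cutoff can reject rows as they stream by; when the ordering is on an aggregate, no cutoff is valid until every group is complete, because a group's sum can still grow.

```python
import heapq
from collections import defaultdict

# Toy rows: (grouping key, value).
items = [("a", 5), ("b", 1), ("a", 7), ("c", 9), ("b", 2)]

# Q10-style: ORDER BY an aggregate (SUM). Every row must be grouped
# before any cutoff can apply -- "a" only wins once both its rows are in.
sums = defaultdict(int)
for key, val in items:
    sums[key] += val
top_by_sum = heapq.nlargest(1, sums.items(), key=lambda kv: kv[1])

# Q2-style: ORDER BY a stored column. A row below the current k-th best
# value can be dropped the moment it is scanned.
top_by_col = heapq.nlargest(1, items, key=lambda kv: kv[1])
```

Here `top_by_sum` picks "a" (5 + 7 = 12), while `top_by_col` picks the single row ("c", 9).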
The profile for Q2 follows:
{
time 6.1e-05% fanout 1 input 1 rows
time 1.1% fanout 1 input 1 rows
{ hash filler
Subquery 27
{
time 0.0012% fanout 1 input 1 rows
REGION 1 rows(t10.R_REGIONKEY)
R_NAME = <c EUROPE>
time 0.00045% fanout 5 input 1 rows
NATION 5 rows(t9.N_NATIONKEY)
N_REGIONKEY = t10.R_REGIONKEY
time 1.6% fanout 40107 input 5 rows
SUPPLIER 4.2e+04 rows(t8.S_SUPPKEY)
S_NATIONKEY = t9.N_NATIONKEY
After code:
0: t8.S_SUPPKEY := := artm t8.S_SUPPKEY
4: BReturn 0
time 0.1% fanout 0 input 200535 rows
Sort hf 49 (t8.S_SUPPKEY)
}
}
time 0.0004% fanout 1 input 1 rows
{ fork
time 21% fanout 79591 input 1 rows
PART 8e+04 rows(.P_PARTKEY)
P_TYPE LIKE <c %BRASS> LIKE <c > , P_SIZE = 15
time 44% fanout 0.591889 input 79591 rows
Precode:
0: {
time 0.083% fanout 1 input 79591 rows
time 0.13% fanout 1 input 79591 rows
{ fork
time 24% fanout 0.801912 input 79591 rows
PARTSUPP 3.5 rows(.PS_SUPPKEY, .PS_SUPPLYCOST)
inlined PS_PARTKEY = k_.P_PARTKEY
hash partition+bloom by 62 (tmp)hash join merged always card 0.2 -> ()
time 1.3% fanout 0 input 63825 rows
Hash source 49 merged into ts not partitionable 0.2 rows(.PS_SUPPKEY) -> ()
After code:
0: min min.PS_SUPPLYCOSTset no set_ctr
5: BReturn 0
}
After code:
0: aggregate := := artm min
4: BReturn 0
time 0.19% fanout 0 input 79591 rows
Subquery Select(aggregate)
}
8: BReturn 0
PARTSUPP 5e-08 rows(.PS_SUPPKEY)
inlined PS_PARTKEY = k_.P_PARTKEY PS_SUPPLYCOST = k_scalar
time 5.9% fanout 0.247023 input 47109 rows
SUPPLIER unq 0.9 rows (.S_ACCTBAL, .S_NATIONKEY, .S_NAME, .S_SUPPKEY)
inlined S_SUPPKEY = .PS_SUPPKEY
top k on S_ACCTBAL
time 0.077% fanout 1 input 11637 rows
NATION unq 1 rows (.N_REGIONKEY, .N_NAME)
inlined N_NATIONKEY = .S_NATIONKEY
time 0.051% fanout 1 input 11637 rows
REGION unq 0.2 rows ()
inlined R_REGIONKEY = .N_REGIONKEY R_NAME = <c EUROPE>
time 0.42% fanout 0 input 11637 rows
Sort (.S_ACCTBAL, .N_NAME, .S_NAME, .P_PARTKEY) -> (.S_SUPPKEY)
}
time 0.0016% fanout 100 input 1 rows
top order by read (.S_SUPPKEY, .P_PARTKEY, .N_NAME, .S_NAME, .S_ACCTBAL)
time 0.02% fanout 1 input 100 rows
PART unq 0.95 rows (.P_MFGR)
inlined P_PARTKEY = .P_PARTKEY
time 0.054% fanout 1 input 100 rows
SUPPLIER unq 1 rows (.S_PHONE, .S_ADDRESS, .S_COMMENT)
inlined S_SUPPKEY = k_.S_SUPPKEY
time 6.7e-05% fanout 0 input 100 rows
Select (.S_ACCTBAL, .S_NAME, .N_NAME, .P_PARTKEY, .P_MFGR, .S_ADDRESS, .S_PHONE, .S_COMMENT)
}
128 msec 1007% cpu, 196992 rnd 2.53367e+07 seq 50.4135% same seg 45.3574% same pg
The query starts with a scan looking for the qualifying parts. It then looks for the best price for each part from a European supplier. All the European suppliers have been previously put in a hash table by the hash filler subquery at the start of the plan. Thus, to find the minimum price, the query takes the partsupp rows for the part by index, and then eliminates all non-European suppliers by a selective hash join. After this, there is a second index lookup on partsupp where we look for the part and the price equal to the minimum price found earlier. These operations could in principle be merged, as the minimum-price partsupp has already been seen. The gain would not be very large, though.
Here we note that the cost model guesses that very few rows will survive the check of ps_supplycost = minimum cost. It does not know that the minimum is not just any value, but one of the values that do occur in the ps_supplycost column for the part. Because of this, the remainder of the plan is carried out by index, which is just as well. The point is that if very few rows of input are expected, it is not worthwhile to make a hash table for a hash join. The hash table made for the European suppliers could be reused here, maybe with some small gain. It would however need more columns, which might make it not worthwhile. We note that the major order with the TOP k is on the supplier s_acctbal; hence, as soon as 100 suppliers have been found, one can add a restriction on s_acctbal for subsequent ones.
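The s_acctbal cutoff can be sketched as a bounded min-heap in Python (toy balance values, not TPC-H data; `topk_with_cutoff` is an illustrative name): once k rows are held, any candidate below the current k-th best balance is rejected on the sort column alone, before anything else about the supplier is fetched.

```python
import heapq

def topk_with_cutoff(acctbals, k):
    heap, skipped = [], 0   # min-heap of the k best balances seen so far
    for bal in acctbals:
        if len(heap) < k:
            heapq.heappush(heap, bal)
        elif bal > heap[0]:
            heapq.heapreplace(heap, bal)  # raises the k-th best threshold
        else:
            skipped += 1                  # cut off; no further columns fetched
    return sorted(heap, reverse=True), skipped

# Descending toy input: the heap fills with the first 100 values and every
# later candidate fails the threshold check.
best, skipped = topk_with_cutoff(list(range(1000, 0, -1)), 100)
```

In the worst case (ascending input) the threshold never rejects anything, so the trick is a pruning opportunity rather than a guarantee.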
At the end of the plan, after the TOP k ORDER BY and the reading of the results, we have a separate index-based lookup for getting only the columns that are returned. We note that this is done on 100 rows, whereas the previous operations are done on tens of thousands of rows. The TOP k restriction produces some benefit, but it comes relatively late in the plan, and not many operations follow it.
The plan is easily good enough, with only a little room for improvement. Q2 is one of the fastest queries of the set.
Let us now consider the execution of Q10:
{
time 1.1e-06% fanout 1 input 1 rows
time 4.4e-05% fanout 1 input 1 rows
{ hash filler
time 1.6e-05% fanout 25 input 1 rows
NATION 25 rows(.N_NATIONKEY, .N_NAME)
time 6.7e-06% fanout 0 input 25 rows
Sort hf 35 (.N_NATIONKEY) -> (.N_NAME)
}
time 1.5e-06% fanout 1 input 1 rows
{ fork
time 2.4e-06% fanout 1 input 1 rows
{ fork
time 13% fanout 5.73038e+06 input 1 rows
ORDERS 5.1e+06 rows(.O_ORDERKEY, .O_CUSTKEY)
O_ORDERDATE >= <c 1993-10-01> < <c 1994-01-01>
time 4.8% fanout 2.00042 input 5.73038e+06 rows
LINEITEM 1.1 rows(.L_EXTENDEDPRICE, .L_DISCOUNT)
inlined L_ORDERKEY = .O_ORDERKEY L_RETURNFLAG = <c R>
time 25% fanout 1 input 1.14632e+07 rows
Precode:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
CUSTOMER unq 1 rows (.C_NATIONKEY, .C_CUSTKEY)
inlined C_CUSTKEY = k_.O_CUSTKEY
hash partition+bloom by 39 (tmp)hash join merged always card 1 -> (.N_NAME)
time 0.0023% fanout 1 input 1.14632e+07 rows
Hash source 35 merged into ts 1 rows(.C_NATIONKEY) -> (.N_NAME)
time 2.3% fanout 1 input 1.14632e+07 rows
Stage 2
time 3.6% fanout 0 input 1.14632e+07 rows
Sort (q_.C_CUSTKEY, .N_NAME) -> (temp)
}
time 0.6% fanout 3.88422e+06 input 1 rows
group by read node
(.C_CUSTKEY, .N_NAME, revenue)in each partition slice
time 0.57% fanout 0 input 3.88422e+06 rows
Sort (revenue) -> (.N_NAME, .C_CUSTKEY)
}
time 6.9e-06% fanout 20 input 1 rows
top order by read (.N_NAME, revenue, .C_CUSTKEY)
time 0.00036% fanout 1 input 20 rows
CUSTOMER unq 1 rows (.C_PHONE, .C_NAME, .C_ACCTBAL, .C_ADDRESS, .C_COMMENT)
inlined C_CUSTKEY = .C_CUSTKEY
time 1.1e-06% fanout 0 input 20 rows
Select (.C_CUSTKEY, .C_NAME, revenue, .C_ACCTBAL, .N_NAME, .C_ADDRESS, .C_PHONE, .C_COMMENT)
}
2153 msec 2457% cpu, 1.71845e+07 rnd 1.67177e+08 seq 76.3221% same seg 21.1204% same pg
The plan is by index, except for the lookup of the nation name for the customer. The most selective condition is on order date, followed by the returnflag on lineitem. Getting the customer by index turns out to be better than by hash, even though almost all customers are hit; see the input cardinality above the first customer entry in the plan -- over 10M. The key point here is that only c_custkey and c_nationkey get fetched, which saves a lot of time. In fact, the c_custkey is needless since it is anyway equal to o_custkey, but this makes little difference.
One could argue that customer should come between lineitem and orders in join order. Doing this would lose the ORDER BY on orders and lineitem, but would prevent some customer rows from being hit twice for a single order. The difference would not be large, though. For a scale-out setting, one definitely wants to have orders and lineitem without customer in between if the former are partitioned on the same key.
The c_nationkey is next translated into a n_name by hash, and there is a partitioned GROUP BY on c_custkey. The GROUP BY is partitioned because there are many distinct c_custkey values (15M for 100G scale).
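A partitioned GROUP BY can be sketched in a few lines of Python (toy custkey/revenue pairs, and integer keys routed by modulo for determinism; a real engine would hash): each row goes to exactly one slice by its grouping key, each slice aggregates independently, and since a key lands in exactly one slice, the per-slice results can simply be concatenated without a cross-partition merge pass.

```python
from collections import defaultdict

def partitioned_group_by(rows, n_parts=4):
    # One aggregation dict per partition slice.
    parts = [defaultdict(float) for _ in range(n_parts)]
    for custkey, revenue in rows:
        # Route by grouping key: a given custkey always hits the same slice.
        parts[custkey % n_parts][custkey] += revenue
    merged = {}
    for p in parts:
        merged.update(p)   # key sets are disjoint across slices
    return merged

rows = [(1, 10.0), (2, 5.0), (1, 2.5), (7, 1.0)]
totals = partitioned_group_by(rows)
```

In an engine, the slices also bound per-thread memory and let the aggregation run in parallel, which matters when there are millions of distinct keys.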
The most important trick is fetching the many dependent columns of c_custkey only after the TOP k ORDER BY. The last access to customer in the plan does this and is executed on only 20 rows.
Without the TOP k trick, the plan is identical, except that the dependent columns are fetched for nearly all customers. If this is done, the run time is 16s, which is bad enough to sink the whole score.
There is another approach to the challenge of this query: if foreign keys are declared and enforced, the system will know that every order has an actually existing customer and that every customer has a country. If so, the whole GROUP BY and TOP k can be done without any reference to customer, which is a notch better still, at least for this query. In this implementation, we do not declare foreign keys; thus the database must check that the customer and its country in fact exist before doing the GROUP BY. This makes the late projection trick mandatory, but does save the expense of checking foreign keys on updates. In both cases, the optimizer must recognize that the columns to be fetched at the end (late projected) are functionally dependent on a grouping key (c_custkey).
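Putting the pieces together, the shape of Q10 with late projection can be sketched end to end in Python on assumed toy data (the customer dict, the revenue formula, and all names are illustrative): aggregate revenue per custkey, take the TOP 20 on the aggregate, and only then touch the wide customer columns, which is legal precisely because those columns are functionally dependent on c_custkey.

```python
import heapq
from collections import defaultdict

# Toy dimension and fact data.
customers = {ck: {"name": f"Customer#{ck}", "comment": "x" * 100}
             for ck in range(1000)}
lineitems = [(ck, float(ck % 50)) for ck in range(1000) for _ in range(3)]

# GROUP BY c_custkey over the narrow columns only.
revenue = defaultdict(float)
for custkey, rev in lineitems:
    revenue[custkey] += rev

# TOP 20 ORDER BY the aggregate.
top20 = heapq.nlargest(20, revenue.items(), key=lambda kv: kv[1])

# Late projection: only 20 lookups ever read the wide columns.
result = [(ck, customers[ck]["name"], rev) for ck, rev in top20]
```

The grouping and sort see only (custkey, revenue) pairs; names and comments are materialized for 20 rows out of 1000.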
The late projection trick is generally useful, since almost all applications, aside from bulk data export, have some sort of limit on result-set size. A column store especially benefits from this, since some columns of a row can be read without even coming near the other ones. A row store can also benefit, in the form of decreased intermediate result size. This is especially valuable when returning long columns, such as text fields or blobs, on which there are most often no search conditions. If there are conditions on such columns, these will most often be implemented via a special text index and not a scan.
* * * * *
In the next installment we will have a look at the overall 100G single server performance. After this we will recap the tricks so far. Then it will be time to look at implications of scale out for performance and run at larger scales. After the relational ground has been covered, we can look at implications of schema-lastness, i.e., triples for this type of workload.
So, while the most salient tricks have been at least briefly mentioned, we are far from having exhausted this most foundational of database topics.
To be continued...
In Hoc Signo Vinces (TPC-H) Series