Analytics is generally about making something small out of something large. This reduction is obtained by a TOP k operator (e.g., show only the 10 best by some metric) and/or by grouping and aggregation (e.g., for a set of items, show some attributes of these items and a sum, count, or other aggregate of dependent items for each).
In this installment we will look at late projection, also sometimes known as late materialization. If many attributes are returned and there is a cutoff of some sort, then the query does not need to be concerned with attributes on which there are no conditions, except to fetch them at the last moment, and only for the entities that will actually be returned to the user.
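The pattern can be sketched outside any database engine. Below is a minimal Python sketch on assumed toy in-memory data (the table, column names, and the helper `topk_late_projection` are all illustrative, not part of TPC-H): rank on the narrow sort column alone, and fetch the wide attribute only for the k winners.

```python
import heapq

# Toy "table" stored row-wise: a narrow ranking column (score) and a
# wide column (comment) that carries most of the bytes.
rows = [
    {"id": i, "score": (i * 37) % 100, "comment": "x" * 1000}
    for i in range(10_000)
]

def topk_late_projection(rows, k):
    # Phase 1: rank using only the key and the sort column.
    best = heapq.nlargest(k, ((r["score"], r["id"]) for r in rows))
    ids = [rid for _, rid in best]
    # Phase 2: fetch the wide column for just the k winners.
    # The dict stands in for an index; only k lookups touch "comment".
    by_id = {r["id"]: r for r in rows}
    return [(rid, by_id[rid]["score"], by_id[rid]["comment"]) for rid in ids]

top = topk_late_projection(rows, 10)
```

The wide `comment` values never enter the sort; they are materialized for 10 rows instead of 10,000.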
We look at TPC-H Q2 and Q10.
Q2:
SELECT TOP 100
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
FROM part,
supplier,
partsupp,
nation,
region
WHERE p_partkey = ps_partkey
AND s_suppkey = ps_suppkey
AND p_size = 15
AND p_type LIKE '%BRASS'
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'EUROPE'
AND ps_supplycost =
( SELECT MIN(ps_supplycost)
FROM partsupp,
supplier,
nation,
region
WHERE p_partkey = ps_partkey
AND s_suppkey = ps_suppkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'EUROPE'
)
ORDER BY s_acctbal DESC,
n_name,
s_name,
p_partkey
The intent is to return information about parts and suppliers, such that the part is available from a supplier in Europe, and the supplier has the lowest price for the part among all European suppliers.
Q10:
SELECT TOP 20
c_custkey,
c_name,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
FROM customer,
orders,
lineitem,
nation
WHERE c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate >= CAST ('1993-10-01' AS DATE)
AND o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
AND l_returnflag = 'R'
AND c_nationkey = n_nationkey
GROUP BY c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
ORDER BY revenue DESC
The intent is to list the customers who cause the greatest loss of revenue in a given quarter by returning items ordered in said quarter.
We notice that both queries return many columns on which there are no conditions, and that both have a cap on returned rows. The difference is that in Q2 the major ORDER BY is on a grouping column, and in Q10 it is on the aggregate of the GROUP BY. Thus the TOP k trick discussed in the previous article does apply to Q2 but not to Q10.
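The distinction can be illustrated with a small Python sketch on assumed toy data (the keys and values are made up, not TPC-H rows): when the ordering column is stored per row, a cutoff can reject rows as they stream by; when the ordering is on an aggregate, no cutoff is valid until every group is complete, because a group's sum can still grow.

```python
import heapq
from collections import defaultdict

# Toy rows: (grouping key, value).
items = [("a", 5), ("b", 1), ("a", 7), ("c", 9), ("b", 2)]

# Q10-style: ORDER BY an aggregate (SUM). Every row must be grouped
# before any cutoff can apply -- "a" only wins once both its rows are in.
sums = defaultdict(int)
for key, val in items:
    sums[key] += val
top_by_sum = heapq.nlargest(1, sums.items(), key=lambda kv: kv[1])

# Q2-style: ORDER BY a stored column. A row below the current k-th best
# value can be dropped the moment it is scanned.
top_by_col = heapq.nlargest(1, items, key=lambda kv: kv[1])
```

Here `top_by_sum` picks "a" (5 + 7 = 12), while `top_by_col` picks the single row ("c", 9).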
The profile for Q2 follows:
{
time 6.1e-05% fanout 1 input 1 rows
time 1.1% fanout 1 input 1 rows
{ hash filler
Subquery 27
{
time 0.0012% fanout 1 input 1 rows
REGION 1 rows(t10.R_REGIONKEY)
R_NAME = <c EUROPE>
time 0.00045% fanout 5 input 1 rows
NATION 5 rows(t9.N_NATIONKEY)
N_REGIONKEY = t10.R_REGIONKEY
time 1.6% fanout 40107 input 5 rows
SUPPLIER 4.2e+04 rows(t8.S_SUPPKEY)
S_NATIONKEY = t9.N_NATIONKEY
After code:
0: t8.S_SUPPKEY := := artm t8.S_SUPPKEY
4: BReturn 0
time 0.1% fanout 0 input 200535 rows
Sort hf 49 (t8.S_SUPPKEY)
}
}
time 0.0004% fanout 1 input 1 rows
{ fork
time 21% fanout 79591 input 1 rows
PART 8e+04 rows(.P_PARTKEY)
P_TYPE LIKE <c %BRASS> LIKE <c > , P_SIZE = 15
time 44% fanout 0.591889 input 79591 rows
Precode:
0: {
time 0.083% fanout 1 input 79591 rows
time 0.13% fanout 1 input 79591 rows
{ fork
time 24% fanout 0.801912 input 79591 rows
PARTSUPP 3.5 rows(.PS_SUPPKEY, .PS_SUPPLYCOST)
inlined PS_PARTKEY = k_.P_PARTKEY
hash partition+bloom by 62 (tmp)hash join merged always card 0.2 -> ()
time 1.3% fanout 0 input 63825 rows
Hash source 49 merged into ts not partitionable 0.2 rows(.PS_SUPPKEY) -> ()
After code:
0: min min.PS_SUPPLYCOSTset no set_ctr
5: BReturn 0
}
After code:
0: aggregate := := artm min
4: BReturn 0
time 0.19% fanout 0 input 79591 rows
Subquery Select(aggregate)
}
8: BReturn 0
PARTSUPP 5e-08 rows(.PS_SUPPKEY)
inlined PS_PARTKEY = k_.P_PARTKEY PS_SUPPLYCOST = k_scalar
time 5.9% fanout 0.247023 input 47109 rows
SUPPLIER unq 0.9 rows (.S_ACCTBAL, .S_NATIONKEY, .S_NAME, .S_SUPPKEY)
inlined S_SUPPKEY = .PS_SUPPKEY
top k on S_ACCTBAL
time 0.077% fanout 1 input 11637 rows
NATION unq 1 rows (.N_REGIONKEY, .N_NAME)
inlined N_NATIONKEY = .S_NATIONKEY
time 0.051% fanout 1 input 11637 rows
REGION unq 0.2 rows ()
inlined R_REGIONKEY = .N_REGIONKEY R_NAME = <c EUROPE>
time 0.42% fanout 0 input 11637 rows
Sort (.S_ACCTBAL, .N_NAME, .S_NAME, .P_PARTKEY) -> (.S_SUPPKEY)
}
time 0.0016% fanout 100 input 1 rows
top order by read (.S_SUPPKEY, .P_PARTKEY, .N_NAME, .S_NAME, .S_ACCTBAL)
time 0.02% fanout 1 input 100 rows
PART unq 0.95 rows (.P_MFGR)
inlined P_PARTKEY = .P_PARTKEY
time 0.054% fanout 1 input 100 rows
SUPPLIER unq 1 rows (.S_PHONE, .S_ADDRESS, .S_COMMENT)
inlined S_SUPPKEY = k_.S_SUPPKEY
time 6.7e-05% fanout 0 input 100 rows
Select (.S_ACCTBAL, .S_NAME, .N_NAME, .P_PARTKEY, .P_MFGR, .S_ADDRESS, .S_PHONE, .S_COMMENT)
}
128 msec 1007% cpu, 196992 rnd 2.53367e+07 seq 50.4135% same seg 45.3574% same pg
The query starts with a scan looking for the qualifying parts. It then looks for the best price for each part from a European supplier. All the European suppliers have been previously put in a hash table by the hash filler subquery at the start of the plan. Thus, to find the minimum price, the query takes the partsupp rows for the part by index, and then eliminates all non-European suppliers by a selective hash join. After this, there is a second index lookup on partsupp where we look for the part and the price equal to the minimum price found earlier. These operations could in principle be merged, as the minimum-price partsupp has already been seen. The gain would not be very large, though.
Here we note that the cost model guesses that very few rows will survive the check of ps_supplycost = minimum cost. It does not know that the minimum is not just any value, but one of the values that do occur in the ps_supplycost column for the part. Because of this, the remainder of the plan is carried out by index, which is just as well. The point is that if very few rows of input are expected, it is not worthwhile to make a hash table for a hash join. The hash table made for the European suppliers could be reused here, maybe with some small gain. It would however need more columns, which might make it not worthwhile. We note that the major order with the TOP k is on the supplier s_acctbal; hence, as soon as 100 suppliers have been found, one can add a restriction on s_acctbal for subsequent ones.
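The s_acctbal cutoff can be sketched as a bounded min-heap in Python (toy balance values, not TPC-H data; `topk_with_cutoff` is an illustrative name): once k rows are held, any candidate below the current k-th best balance is rejected on the sort column alone, before anything else about the supplier is fetched.

```python
import heapq

def topk_with_cutoff(acctbals, k):
    heap, skipped = [], 0   # min-heap of the k best balances seen so far
    for bal in acctbals:
        if len(heap) < k:
            heapq.heappush(heap, bal)
        elif bal > heap[0]:
            heapq.heapreplace(heap, bal)  # raises the k-th best threshold
        else:
            skipped += 1                  # cut off; no further columns fetched
    return sorted(heap, reverse=True), skipped

# Descending toy input: the heap fills with the first 100 values and every
# later candidate fails the threshold check.
best, skipped = topk_with_cutoff(list(range(1000, 0, -1)), 100)
```

In the worst case (ascending input) the threshold never rejects anything, so the trick is a pruning opportunity rather than a guarantee.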
At the end of the plan, after the TOP k ORDER BY and the reading of the results, we have a separate index-based lookup for getting only the columns that are returned. We note that this is done on 100 rows, whereas the previous operations are done on tens of thousands of rows. The TOP k restriction produces some benefit, but it comes relatively late in the plan, and not many operations follow it.
The plan is easily good enough, with only a little room for improvement. Q2 is one of the fastest queries of the set.
Let us now consider the execution of Q10:
{
time 1.1e-06% fanout 1 input 1 rows
time 4.4e-05% fanout 1 input 1 rows
{ hash filler
time 1.6e-05% fanout 25 input 1 rows
NATION 25 rows(.N_NATIONKEY, .N_NAME)
time 6.7e-06% fanout 0 input 25 rows
Sort hf 35 (.N_NATIONKEY) -> (.N_NAME)
}
time 1.5e-06% fanout 1 input 1 rows
{ fork
time 2.4e-06% fanout 1 input 1 rows
{ fork
time 13% fanout 5.73038e+06 input 1 rows
ORDERS 5.1e+06 rows(.O_ORDERKEY, .O_CUSTKEY)
O_ORDERDATE >= <c 1993-10-01> < <c 1994-01-01>
time 4.8% fanout 2.00042 input 5.73038e+06 rows
LINEITEM 1.1 rows(.L_EXTENDEDPRICE, .L_DISCOUNT)
inlined L_ORDERKEY = .O_ORDERKEY L_RETURNFLAG = <c R>
time 25% fanout 1 input 1.14632e+07 rows
Precode:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
CUSTOMER unq 1 rows (.C_NATIONKEY, .C_CUSTKEY)
inlined C_CUSTKEY = k_.O_CUSTKEY
hash partition+bloom by 39 (tmp)hash join merged always card 1 -> (.N_NAME)
time 0.0023% fanout 1 input 1.14632e+07 rows
Hash source 35 merged into ts 1 rows(.C_NATIONKEY) -> (.N_NAME)
time 2.3% fanout 1 input 1.14632e+07 rows
Stage 2
time 3.6% fanout 0 input 1.14632e+07 rows
Sort (q_.C_CUSTKEY, .N_NAME) -> (temp)
}
time 0.6% fanout 3.88422e+06 input 1 rows
group by read node
(.C_CUSTKEY, .N_NAME, revenue)in each partition slice
time 0.57% fanout 0 input 3.88422e+06 rows
Sort (revenue) -> (.N_NAME, .C_CUSTKEY)
}
time 6.9e-06% fanout 20 input 1 rows
top order by read (.N_NAME, revenue, .C_CUSTKEY)
time 0.00036% fanout 1 input 20 rows
CUSTOMER unq 1 rows (.C_PHONE, .C_NAME, .C_ACCTBAL, .C_ADDRESS, .C_COMMENT)
inlined C_CUSTKEY = .C_CUSTKEY
time 1.1e-06% fanout 0 input 20 rows
Select (.C_CUSTKEY, .C_NAME, revenue, .C_ACCTBAL, .N_NAME, .C_ADDRESS, .C_PHONE, .C_COMMENT)
}
2153 msec 2457% cpu, 1.71845e+07 rnd 1.67177e+08 seq 76.3221% same seg 21.1204% same pg
The plan is by index, except for the lookup of the nation name for the customer. The most selective condition is on order date, followed by the returnflag on lineitem. Getting the customer by index turns out to be better than by hash, even though almost all customers are hit; see the input cardinality above the first customer entry in the plan -- over 10M. The key point here is that only c_custkey and c_nationkey get fetched, which saves a lot of time. In fact, the c_custkey is needless since it is anyway equal to o_custkey, but this makes little difference.
One could argue that customer should come between lineitem and orders in join order. Doing this would lose the ORDER BY on orders and lineitem, but would prevent some customer rows from being hit twice for a single order. The difference would not be large, though. For a scale-out setting, one definitely wants to have orders and lineitem without customer in between if the former are partitioned on the same key.
The c_nationkey is next translated into a n_name by hash, and there is a partitioned GROUP BY on c_custkey. The GROUP BY is partitioned because there are many distinct c_custkey values (15M for 100G scale).
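A partitioned GROUP BY can be sketched in a few lines of Python (toy custkey/revenue pairs, and integer keys routed by modulo for determinism; a real engine would hash): each row goes to exactly one slice by its grouping key, each slice aggregates independently, and since a key lands in exactly one slice, the per-slice results can simply be concatenated without a cross-partition merge pass.

```python
from collections import defaultdict

def partitioned_group_by(rows, n_parts=4):
    # One aggregation dict per partition slice.
    parts = [defaultdict(float) for _ in range(n_parts)]
    for custkey, revenue in rows:
        # Route by grouping key: a given custkey always hits the same slice.
        parts[custkey % n_parts][custkey] += revenue
    merged = {}
    for p in parts:
        merged.update(p)   # key sets are disjoint across slices
    return merged

rows = [(1, 10.0), (2, 5.0), (1, 2.5), (7, 1.0)]
totals = partitioned_group_by(rows)
```

In an engine, the slices also bound per-thread memory and let the aggregation run in parallel, which matters when there are millions of distinct keys.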
The most important trick is fetching the many dependent columns of c_custkey only after the TOP k ORDER BY. The last access to customer in the plan does this and is executed on only 20 rows.
Without the TOP k trick, the plan is identical, except that the dependent columns are fetched for nearly all customers. If this is done, the run time is 16s, which is bad enough to sink the whole score.
There is another approach to the challenge of this query: if foreign keys are declared and enforced, the system will know that every order has an actually existing customer and that every customer has a country. If so, the whole GROUP BY and TOP k can be done without any reference to customer, which is a notch better still, at least for this query. In this implementation, we do not declare foreign keys; thus the database must check that the customer and its country in fact exist before doing the GROUP BY. This makes the late projection trick mandatory, but does save the expense of checking foreign keys on updates. In both cases, the optimizer must recognize that the columns to be fetched at the end (late projected) are functionally dependent on a grouping key (c_custkey).
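Putting the pieces together, the shape of Q10 with late projection can be sketched end to end in Python on assumed toy data (the customer dict, the revenue formula, and all names are illustrative): aggregate revenue per custkey, take the TOP 20 on the aggregate, and only then touch the wide customer columns, which is legal precisely because those columns are functionally dependent on c_custkey.

```python
import heapq
from collections import defaultdict

# Toy dimension and fact data.
customers = {ck: {"name": f"Customer#{ck}", "comment": "x" * 100}
             for ck in range(1000)}
lineitems = [(ck, float(ck % 50)) for ck in range(1000) for _ in range(3)]

# GROUP BY c_custkey over the narrow columns only.
revenue = defaultdict(float)
for custkey, rev in lineitems:
    revenue[custkey] += rev

# TOP 20 ORDER BY the aggregate.
top20 = heapq.nlargest(20, revenue.items(), key=lambda kv: kv[1])

# Late projection: only 20 lookups ever read the wide columns.
result = [(ck, customers[ck]["name"], rev) for ck, rev in top20]
```

The grouping and sort see only (custkey, revenue) pairs; names and comments are materialized for 20 rows out of 1000.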
The late projection trick is generally useful, since almost all applications, aside from bulk data export, have some sort of limit on result-set size. A column store especially benefits from this, since some columns of a row can be read without even coming near the other ones. A row store can also benefit, in the form of decreased intermediate result size. This is especially valuable when returning long columns, such as text fields or blobs, on which there are most often no search conditions. If there are conditions on such columns, these will most often be implemented via a special text index and not a scan.
* * * * *
In the next installment we will have a look at the overall 100G single server performance. After this we will recap the tricks so far. Then it will be time to look at implications of scale out for performance and run at larger scales. After the relational ground has been covered, we can look at implications of schema-lastness, i.e., triples for this type of workload.
So, while the most salient tricks have been at least briefly mentioned, we are far from having exhausted this most foundational of database topics.
To be continued...
In Hoc Signo Vinces (TPC-H) Series