Here we look at two TPC-H queries whose execution plans are relatively straightforward, and examine the Virtuoso performance metrics for them. On one hand, this is an introduction to query plans; on the other, it is a case study in tuning Virtuoso and understanding what goes on. The choke points outlined in TPC-H Analyzed are mentioned where applicable, with some extra commentary.
Q1 -- Scan, I/O, Aggregation
The query is below. The date is a parameter (a value near the end of the l_shipdate range is used), so nearly all of the rows are selected.
SELECT l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= dateadd('DAY', -90, CAST ('1998-12-01' AS DATE))
GROUP BY l_returnflag,
l_linestatus
ORDER BY l_returnflag,
l_linestatus
;
We note that a count of a non-nullable column is the same as COUNT (*), and that AVG of a non-null column is SUM (column) / COUNT (*). So COUNT (*) occurs 4 times, and SUM (l_extendedprice) and SUM (l_quantity) each occur twice. The grouping columns have few distinct values.
TPC-H Analyzed suggests using an array-based GROUP BY, because there can only be 64K combinations of two single-character values; the grouping keys are declared CHAR (1) and non-nullable. The Virtuoso implementation does not do this, though.
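To make the idea concrete, below is a minimal C sketch of such an array-based GROUP BY (an illustration under assumed names, not the Virtuoso code): with two non-nullable CHAR (1) keys, the pair of key bytes indexes directly into an array of at most 64K accumulators, so the aggregation path has no hashing and no key comparison --
/* Hypothetical array-based GROUP BY for two non-nullable CHAR (1) keys.
   The two key bytes form a 16-bit index into a 64K-slot accumulator array. */

#include <stdint.h>

typedef struct q1_acc_s
{
  int64_t count;           /* COUNT (*) of the group */
  double  sum_qty;         /* SUM (l_quantity) */
  double  sum_base_price;  /* SUM (l_extendedprice) */
  double  sum_disc_price;  /* SUM (l_extendedprice * (1 - l_discount)) */
  double  sum_charge;      /* SUM (l_extendedprice * (1 - l_discount) * (1 + l_tax)) */
  double  sum_disc;        /* SUM (l_discount), for AVG (l_discount) */
} q1_acc_t;

static q1_acc_t groups[65536];  /* one slot per (l_returnflag, l_linestatus) pair */

void
q1_agg_row (char rf, char ls, double qty, double price, double disc, double tax)
{
  q1_acc_t *g = &groups[((uint8_t) rf << 8) | (uint8_t) ls];
  double disc_price = price * (1 - disc);
  g->count++;
  g->sum_qty += qty;
  g->sum_base_price += price;
  g->sum_disc_price += disc_price;
  g->sum_charge += disc_price * (1 + tax);
  g->sum_disc += disc;
}
At the end, only the few slots with a non-zero count are read out, and the AVG columns are the corresponding SUMs divided by the count.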
This query has been treated in many papers because it cannot be implemented in very many ways, it is easy to understand, and it still illustrates some basic metrics.
One execution on warm cache is between 3.9s and 4.7s. One execution with the data coming from OS disk cache is 11.8s. One execution with the data coming from 2 SSDs is 22s. Five concurrent executions from warm cache are 17.3s for the fastest and 20.5s for the slowest. A single threaded execution from warm cache is 58.4s.
We see that scaling is linear; i.e., 5 times the work takes a little under 5 times longer. The parallelism is reasonable, with a 14.6x speedup from 24 threads on 12 cores. Splitting the work into 48 software threads, time-sliced on 24 hardware threads, does not affect execution time. The work thus appears to be evenly spread over the threads.
It may be interesting to see how much data is transferred. To see the space consumption per column --
SELECT TOP 20 *
FROM sys_index_space_stats
ORDER BY iss_pages DESC ;
-- followed by --
SELECT coi_column,
SUM (coi_pages) / 128
FROM sys_col_info
GROUP BY coi_column
ORDER BY 2 DESC ;
-- gives us the following --
L_COMMENT 17893
PS_COMMENT 10371
O_COMMENT 7824
L_EXTENDEDPRICE 4771
O_CLERK 2744
L_PARTKEY 2432
L_SUPPKEY 2432
L_COMMITDATE 1784
L_SHIPDATE 1551
L_RECEIPTDATE 1537
O_TOTALPRICE 1181
C_COMMENT 1150
L_QUANTITY 960
O_ORDERKEY 736
P_NAME 729
PS_SUPPLYCOST 647
O_CUSTKEY 624
C_ADDRESS 427
L_DISCOUNT 424
L_TAX 419
L_SHIPINSTRUCT 412
L_SHIPMODE 410
L_LINENUMBER 394
L_RETURNFLAG 394
L_LINESTATUS 394
O_ORDERDATE 389
P_COMMENT 341
PS_SUPPKEY 323
P_TYPE 293
C_PHONE 274
L_ORDERKEY 268
PS_AVAILQTY 201
P_RETAILPRICE 161
C_ACCTBAL 123
O_ORDERPRIORITY 95
O_ORDERSTATUS 94
S_COMMENT 66
...
The total in allocated pages is 65.6 GB, of which 34.1 GB are accessed by the workload. The comment strings could be stream-compressed, bringing some speedup in load time due to less I/O. Also l_extendedprice, a frequently accessed column, could be represented with 4 bytes instead of 8. The working set could thus be cut down to about 28 GB, which may offer some benefit at larger scales. At any rate, for system sizing, the space utilization report is very useful.
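As an aside, the 4-byte representation is easy to picture: TPC-H prices have two decimal places and stay far below 2^31 cents, so a scaled 32-bit integer would do. A trivial sketch (an assumption for illustration, not what Virtuoso currently does) --
#include <stdint.h>

/* Hypothetical fixed-point representation of l_extendedprice as 32-bit cents,
   halving the 8-byte DOUBLE used today. */
typedef int32_t price_cents_t;

static inline price_cents_t
price_to_cents (double price)
{
  return (price_cents_t) (price * 100.0 + 0.5);   /* round to the nearest cent */
}

static inline double
cents_to_price (price_cents_t cents)
{
  return cents / 100.0;
}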
The query execution profile is as below, with comments inline. The profile here is obvious, but we show this as a guide to reading future profiles which will be more interesting.
{
time 2.6e-06% fanout 1 input 1 rows
time 1.7e-06% fanout 1 input 1 rows
{ fork
time 2.1e-06% fanout 1 input 1 rows
{ fork
The time xx% line above each operator is the actual percentage of execution time taken by it, followed by the count of rows of output per row of input (the fanout), followed by the actual rows of input. The operator below produced 591M rows of output for one row of input --
time 34% fanout 5.91599e+08 input 1 rows
LINEITEM 5.9e+08 rows(.L_RETURNFLAG, .L_LINESTATUS, .L_DISCOUNT, .L_EXTENDEDPRICE, .L_QUANTITY, .L_TAX)
L_SHIPDATE <= <c 1998-09-02>
Below is the arithmetic of the query, followed by a sort (GROUP BY) operator.
After code:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: temp := artm 1 + .L_TAX
12: temp := artm temp * temp
16: BReturn 0
Most of the time is spent below, in the GROUP BY. We notice that each needed aggregation is done only once, so the common subexpressions are correctly detected.
time 66% fanout 0 input 5.91599e+08 rows
Sort (.L_RETURNFLAG, .L_LINESTATUS) -> (inc, .L_DISCOUNT, .L_EXTENDEDPRICE, .L_QUANTITY, temp, temp)
}
time 4e-05% fanout 4 input 1 rows
group by read node
(.L_RETURNFLAG, .L_LINESTATUS, count_order, aggregate, sum_base_price, sum_qty, sum_charge, sum_disc_price)
time 6.2e-05% fanout 0 input 4 rows
The SUMs are divided by the COUNTs, and the rows are sorted.
Precode:
0: avg_qty := artm sum_qty / count_order
4: avg_price := artm sum_base_price / count_order
8: avg_disc := artm aggregate / count_order
12: BReturn 0
Sort (.L_RETURNFLAG, .L_LINESTATUS) -> (sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
}
time 1.2e-05% fanout 4 input 1 rows
Key from temp (.L_RETURNFLAG, .L_LINESTATUS, sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
The data is returned to the client.
time 4.4e-06% fanout 0 input 4 rows
Select (.L_RETURNFLAG, .L_LINESTATUS, sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
}
Elapsed time and CPU%.
3947 msec 2365% cpu, 6 rnd 5.99841e+08 seq 0% same seg 0% same pg
Compilation: 0 msec 0 reads 0% read 0 messages 0% clw
This output is produced by the following sequence on the iSQL command line --
SQL> SET blobs ON;
SQL> PROFILE ('SELECT .... FROM .....');
We will next consider the CPU profile:
704173 33.0087 setp_chash_run
275039 12.8927 gb_aggregate
252751 11.8479 ce_dict_any_sets_decode
178304 8.3581 cha_cmp_2a
170819 8.0073 ce_dict_int64_sets_decode
127827 5.9920 chash_array_0
120994 5.6717 chash_array
60902 2.8548 ce_intd_any_range_lte
47865 2.2437 artm_mpy_double
38634 1.8110 ce_vec_int64_sets_decode
26411 1.2380 artm_sub_double
24794 1.1622 artm_add_double
13600 0.6375 cs_decode
For hardcore aficionados, the code may be found in the Virtuoso develop/7.x branch on github.com. The version is not exactly the same but close enough for the parts above. artm_* is arithmetic on typed vectors. As pointed out before, the arithmetic is done with DOUBLEs, although users would prefer fixed point. There is, I believe, a MS SQL Server result with DOUBLEs, so using DOUBLEs would not disqualify a 100 GB TPC-H result.
The moral of the story is that an array-based aggregation, which would drop the chash_array* and cha_cmp* functions and keep only about 1/3 of setp_chash_run, would save upwards of a second of real time. The setp_* and cha_* functions are aggregation; the ce_* functions are column decompression and filtering. The arithmetic is not high in the sample, but it could be sped up 2-4x by SIMD, especially since AVX on Sandy Bridge and later does 4 DOUBLEs in a single instruction.
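For example, the sum_disc_price expression of Q1 could in principle be vectorized along the following lines (a sketch with AVX intrinsics, not Virtuoso code; the function and array names are invented for the example) --
#include <immintrin.h>

/* Hypothetical AVX version of out[i] = price[i] * (1 - disc[i]),
   processing 4 DOUBLEs per iteration; n is assumed to be a multiple of 4. */
void
disc_price_avx (const double *price, const double *disc, double *out, int n)
{
  __m256d one = _mm256_set1_pd (1.0);
  for (int i = 0; i < n; i += 4)
    {
      __m256d p = _mm256_loadu_pd (price + i);
      __m256d d = _mm256_loadu_pd (disc + i);
      _mm256_storeu_pd (out + i, _mm256_mul_pd (p, _mm256_sub_pd (one, d)));
    }
}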
We note that the ce_filter_* function would drop off if the table were stored in date order, as the top-level index would then show that all the values in the column matched, making it unnecessary to even read the l_shipdate column, except for the last part of the table. However, this is a marginal slice of the time even now.
Q1 Conclusions
We have demonstrated good load balance and passed the required common-subexpressions exam. The array-based GROUP BY trick is unused; it would save over 1s of real time, hence would be good value for only 100-200 lines of code.
Q3 -- Hash and Merge Joins
Next we look at JOINs by hash and index. Q3 is a relatively straightforward example, so we will go over the basics of JOIN type (i.e., whether by index or hash) and JOIN order. This will also show some scheduling effects.
The definition is:
SELECT TOP 10 l_orderkey,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
o_orderdate,
o_shippriority
FROM customer,
orders,
lineitem
WHERE c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < CAST ('1995-03-15' AS DATE)
AND l_shipdate > CAST ('1995-03-15' AS DATE)
GROUP BY l_orderkey,
o_orderdate,
o_shippriority
ORDER BY revenue DESC,
o_orderdate
;
The profile, comments inline, is:
{
time 6.7e-06% fanout 1 input 1 rows
Make a hash table keyed on c_custkey for all customers with c_mktsegment 'BUILDING'. For a hash join build side, the time above the hash filler line is the time for making the hash table from the buffered rows. The time above the sort ... hf ... line is the time for buffering the rows that go into the hash table. The other times in the hash filler block are for the operators that get the data, and are not related to making the hash table.
time 0.72% fanout 1 input 1 rows
{ hash filler
We see that the actual cardinality of customer is close to what was predicted. The actual number is on the line with the time; the predicted is on the line with the index name customer.
time 0.63% fanout 3.00019e+06 input 1 rows
CUSTOMER 3e+06 rows(.C_CUSTKEY)
C_MKTSEGMENT =
time 0.089% fanout 0 input 3.00019e+06 rows
Sort hf 34 (.C_CUSTKEY)
}
time 9.2e-06% fanout 1 input 1 rows
{ fork
time 5.2e-06% fanout 1 input 1 rows
{ fork
The below is a merge of a scan of orders and a hash join to customer. The orders table is scanned, first reading o_orderdate and o_custkey, on which there are selections. The condition on o_orderdate is a range check that is true of about 1/2 of the rows. The other condition is an invisible hash join against the customer hash table built above; this selects 1/5 of the rows on average. So we see that for a total of 150M orders, the fanout is 14.5M, about 1/10, as predicted. The 6.1e7 rows on the line with orders represents the estimate based on the orderdate condition. The card 0.2 on the hash filter line is the prediction for the hash join selectivity.
We note that since no order has more than one customer, the JOIN is always cardinality-restricting, hence can be merged into a scan. Being merged into a scan, it becomes run-time re-orderable with the condition on o_orderdate. The conditions are evaluated and arranged at run time in the order of rows eliminated per unit of time.
The expression "hash partition + bloom" means that the hash join could be partitioned if the hash table did not fit in memory; i.e., there could be several passes over the data. This is not here the case, nor is this generally desirable. The bloom means that the hash is pre-filtered with a Bloom filter, which we will see in the CPU profile.
time 30% fanout 1.45679e+07 input 1 rows
ORDERS 6.1e+07 rows(.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY)
O_ORDERDATE < <c 1995-03-15>
hash partition+bloom by 41 (tmp)hash join merged always card 0.2 -> ()
The below is the hash join operator that in fact was merged into the table scan above.
time 0.0016% fanout 1 input 1.45679e+07 rows
Hash source 34 merged into ts 0.2 rows(.O_CUSTKEY) -> ()
Below is the index-based access to lineitem. This is a de facto merge-join, since the o_orderkeys are generated in order by the scan. One in 10 l_orderkeys is selected. Each of these has an average of 4 lineitems. Of these 4, the cost model predicts that 2.7 will be selected based on the additional condition on l_shipdate. The actual number of rows matched is in fact much lower, since the date selection is heavily anti-correlated with the date selection on orders. In other words, an order tends to be shipped soon after the orderdate.
time 16% fanout 0.20508 input 1.45679e+07 rows
LINEITEM 2.6 rows(.L_ORDERKEY, .L_EXTENDEDPRICE, .L_DISCOUNT)
inlined L_ORDERKEY = .O_ORDERKEY L_SHIPDATE >
After code:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
The query has a GROUP BY that includes the high-cardinality column l_orderkey, with 150M distinct values. The GROUP BY is therefore partitioned. This means that the previous part of the query is run on multiple threads, so that each thread gets an approximately equal number of rows of orders. For the GROUP BY, the threads pass each other chunks of data so that each grouping key can only end up in one partition. This means that at the end of the GROUP BY, there are multiple hash tables of grouping results that are guaranteed non-overlapping, hence there is no need to add up (re-aggregate) the per-thread results. The stage operator passes data between the threads; it is also known as an exchange operator.
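A minimal sketch of the partitioning step (hypothetical names, not the actual stage operator) is below: each producer thread routes a row to the partition chosen by a hash of l_orderkey, so any given key is aggregated by exactly one consumer thread --
#include <stdint.h>

#define N_PART 24   /* assumed number of GROUP BY partitions / worker threads */

/* Assumed per-partition GROUP BY insert; each partition is owned by a single
   thread, so the partial results never need to be re-aggregated. */
extern void partition_gby_add (int part, int64_t l_orderkey, int32_t o_orderdate,
                               int32_t o_shippriority, double revenue);

static inline int
group_partition (int64_t l_orderkey)
{
  /* Route each grouping key to exactly one partition. */
  return (int) (((uint64_t) l_orderkey * 0x9e3779b97f4a7c15ULL >> 32) % N_PART);
}

void
stage_row (int64_t l_orderkey, int32_t o_orderdate, int32_t o_shippriority,
           double revenue)
{
  /* In the real engine rows move between threads in batches; this call stands
     in for handing the row to the thread that owns the partition. */
  partition_gby_add (group_partition (l_orderkey), l_orderkey,
                     o_orderdate, o_shippriority, revenue);
}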
time 1% fanout 1 input 2.98758e+06 rows
Stage 2
time 1.4% fanout 0 input 2.98758e+06 rows
Sort (q_.L_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY) -> (temp)
}
time 0.4% fanout 1.13104e+06 input 1 rows
group by read node
(.L_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY, revenue)in each partition slice
time 0.36% fanout 0 input 1.13104e+06 rows
Sort (revenue, .O_ORDERDATE) -> (.L_ORDERKEY, .O_SHIPPRIORITY)
}
time 3.1e-05% fanout 10 input 1 rows
top order by read (.L_ORDERKEY, revenue, .O_ORDERDATE, .O_SHIPPRIORITY)
time 6.2e-06% fanout 0 input 10 rows
Select (.L_ORDERKEY, revenue, .O_ORDERDATE, .O_SHIPPRIORITY)
}
1189 msec 2042% cpu, 1.45513e+07 rnd 2.08588e+08 seq 98.9053% same seg 0.952364% same pg
Compilation: 1 msec 0 reads 0% read 0 messages 0% clw
This query also illustrates the meaning of the random and sequential access meters in the profile: For 1/10 of the orders, there is a random lookup from lineitem, hence 14.5M random lookups. The sequential scan is 150M rows of orders, plus an average of 3 extra rows for each of the 14M random accesses of lineitem. The locality metric, 98.9% same segment, means that the JOIN has a merge-join pattern, since 99% of lookups fall in the same segment as the previous one. A segment is a column store structure that, in the case of lineitem, corresponds to about 4500 consecutive rows.
This is one of the queries where storing the data in date order would be advantageous. A zone map on the date would eliminate the second half of the orders without even looking at the columns. A zone map is a summary data structure that keeps, for example, a minimum and a maximum value of an attribute for a range of consecutive rows. Also, for all but the lineitems at the end of the range of orders, a zone map would disqualify the items without looking at the column. VectorWise, for example, profits from this. However, the CPU profile below shows that the time spent in date compares is not very long even now.
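In outline, the zone map check is as simple as the following sketch (an invented structure, not Virtuoso's index format): for each zone of consecutive rows, the predicate constant is compared against the stored minimum and maximum, and the zone is skipped when no row in it can qualify --
#include <stdint.h>

/* Hypothetical zone map over a date column stored as day numbers:
   one min/max pair per zone of consecutive rows. */
typedef struct zone_s
{
  int32_t min_date;
  int32_t max_date;
} zone_t;

extern void scan_zone (int z);   /* assumed: read and filter the rows of zone z */

void
scan_date_lt (const zone_t *zones, int n_zones, int32_t date_const)
{
  for (int z = 0; z < n_zones; z++)
    {
      if (zones[z].min_date >= date_const)
        continue;                /* no row in this zone has date < date_const */
      /* If zones[z].max_date < date_const, every row qualifies and the
         per-row compare could be dropped too; otherwise scan normally. */
      scan_zone (z);
    }
}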
On further analysis, we see that the query is run in the order of o_orderkey, so that each o_orderkey is seen once. Hence the partitioned GROUP BY can be changed into an ordered GROUP BY, as all the grouping columns are functionally dependent on o_orderkey. An ordered GROUP BY is more efficient than a partitioned or re-aggregated one, since it does not have to remember grouping keys: once a new key comes in, the previous key will not be seen again, and the aggregation for it can be sent onwards in the pipeline.
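A sketch of an ordered GROUP BY for Q3's keys (hypothetical names; the real operator is vectorized) keeps a single running group and flushes it downstream whenever l_orderkey changes --
#include <stdint.h>

/* Hypothetical ordered GROUP BY: rows arrive in l_orderkey order, so one
   running accumulator suffices and no table of groups is kept. */
extern void emit_group (int64_t l_orderkey, int32_t o_orderdate,
                        int32_t o_shippriority, double revenue);

static int64_t cur_key = -1;
static int32_t cur_date, cur_prio;
static double  cur_revenue;

void
ordered_gby_row (int64_t l_orderkey, int32_t o_orderdate,
                 int32_t o_shippriority, double revenue)
{
  if (l_orderkey != cur_key)
    {
      if (cur_key != -1)
        emit_group (cur_key, cur_date, cur_prio, cur_revenue);  /* key will not recur */
      cur_key = l_orderkey;
      cur_date = o_orderdate;     /* functionally dependent on the key */
      cur_prio = o_shippriority;
      cur_revenue = 0;
    }
  cur_revenue += revenue;
}

void
ordered_gby_finish (void)
{
  if (cur_key != -1)
    emit_group (cur_key, cur_date, cur_prio, cur_revenue);  /* flush the last group */
}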
However, this last transformation has little effect here, as the count of rows passing to the aggregation is small. Use of ordered aggregation has much higher impact in other queries and will be revisited there. There is also a chance for late projection, as o_shippriority is in fact only needed for the top 10 rows returned. The impact is small in this case, though. This too will be revisited later.
We now consider the CPU profile:
93087 20.9350 cha_inline_1i_n
63224 14.2189 cha_bloom_unroll
30162 6.7834 cha_insert_1i_n
29178 6.5621 ce_search_rld
28519 6.4139 ce_intd_range_ltgt
17319 3.8950 cs_decode
15848 3.5642 ce_intd_sets_ltgt
11731 2.6383 ce_skip_bits_2
8212 1.8469 ce_vec_int_sets_decode
7946 1.7870 itc_single_row_opt
7886 1.7735 ce_intd_any_sets_decode
7072 1.5905 itc_fetch_col_vec
6474 1.4560 setp_chash_run
5304 1.1929 itc_ce_value_offset
5263 1.1836 itc_col_seg
The top 3 items are for the orders x customer hash join -- the top 2 for the probe, and the 3rd for the build. The 4th item is the index lookup on lineitem. The one below that is the date condition on orders; below this is the condition on the date of lineitem.
The functions working on a compressed column are usually called ce_<compression type>_<sets or range>_<filter or decode>. ce means compression entry; the compression types are rl (run length), rld (run length with delta), bits (densely ascending values as bitmap), intd (16 bit deltas on a base), and dict (dictionary). The sets vs. range part determines whether the operation works on a range of contiguous values in the entry or takes a vector of row numbers as context: the first predicate works on a range; the next one works on the sets (row numbers) selected by the previous one. Filter means selection, and decode means extracting a value for processing by a downstream operator.
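To make the naming concrete, here is a schematic sketch of the difference between a range and a sets filter over one compression entry (invented signatures, not the actual ce_* functions; the entry is shown already decoded for simplicity, whereas the real functions compare on the compressed form) --
#include <stdint.h>

/* First predicate: test every row in [start, end) of the entry and collect
   the qualifying row numbers ("range" variant). */
int
filter_lte_range (const int32_t *vals, int start, int end,
                  int32_t cmp, int *sets_out)
{
  int n = 0;
  for (int r = start; r < end; r++)
    if (vals[r] <= cmp)
      sets_out[n++] = r;
  return n;
}

/* Next predicate: test only the row numbers selected by the previous one
   ("sets" variant), narrowing the selection further. */
int
filter_lte_sets (const int32_t *vals, const int *sets_in, int n_in,
                 int32_t cmp, int *sets_out)
{
  int n = 0;
  for (int i = 0; i < n_in; i++)
    if (vals[sets_in[i]] <= cmp)
      sets_out[n++] = sets_in[i];
  return n;
}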
We run 5 of these concurrently: the fastest returns in 2.8s; the slowest in 5.4s. The executions are staggered, so that each divides into up to 24 independent fragments which are then multiplexed on 48 worker threads, with each query guaranteed at least one thread. The slices of the first query are prioritized, so that when a worker thread has a choice of next unit of work, it will prefer one from an older queue. Each query in this setting has one queue of independently executable fragments. Thus the first to come in gets the most threads and finishes sooner. The rationale for this is that a query may have large transient memory consumption, e.g., for GROUP BYs or hash join build sides. The sooner such a query finishes, the less likely it is that there will be many concurrent queries with high peak-memory demand. This does not block short queries, since in any case a runnable query will have one thread, which will get scheduled by the OS from time to time.
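The policy can be modeled in a few lines (a simplified sketch, not the actual Virtuoso thread pool): each running query owns one queue of runnable fragments, the queues are kept in arrival order, and a free worker serves the oldest queue that has work --
#include <stddef.h>

/* Hypothetical model of the fragment scheduler. */
typedef struct frag_queue_s
{
  struct frag_queue_s *next;                  /* next younger query's queue */
  void *(*pop) (struct frag_queue_s *q);      /* returns a fragment or NULL */
} frag_queue_t;

void *
next_work (frag_queue_t *oldest)
{
  for (frag_queue_t *q = oldest; q; q = q->next)
    {
      void *frag = q->pop (q);
      if (frag)
        return frag;   /* prefer fragments of the oldest query still running */
    }
  return NULL;         /* nothing runnable; the worker goes idle */
}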
Q3 Conclusions
The balance is that the unused tricks (ordered aggregation, late projection) would gain little. Date order would gain about 0.4s of the 1.3s, but would lose in other queries.
We have treated Q1 and Q3 at some length in order to introduce the reading of query profiles and the meaning of some of the meters. For the handful of people who are deep into this sport, the information is rather obvious, but it will still give an idea of the specific feature mix of Virtuoso. Column stores are similar up to a point, but not all make exactly the same choices.
If you are a developer, what, if anything, should you remember of this? Never mind the finer points of column store science -- if you understand join order and join type, then you have a good chance of understanding why some queries are fast and some slow. Most support questions are about this. If you know what the DBMS does, or should do, you are in control. This is why the metrics and concepts here are also of some interest outside the very small group that actually builds DBMSs.