Here we look at two TPC-H queries whose execution plans are relatively straightforward, and examine the Virtuoso performance metrics for them. On one hand, this is an introduction to query plans; on the other, it is a case study in tuning Virtuoso and understanding what goes on. The choke points outlined in TPC-H Analyzed are mentioned where applicable, with some extra commentary.
Q1 -- Scan, I/O, Aggregation
The query is below. The date is a parameter (a value near the end of the l_shipdate range is used), so nearly all of the rows are selected.
SELECT l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= dateadd('DAY', -90, CAST ('1998-12-01' AS DATE))
GROUP BY l_returnflag,
l_linestatus
ORDER BY l_returnflag,
l_linestatus
;
We note that a count of a non-nullable column is the same as COUNT (*), and that AVG of a non-null column is SUM (column) / COUNT (*). So COUNT (*) occurs 4 times, and SUM (l_extendedprice) and SUM (l_quantity) each occur twice. The grouping columns have few distinct values.
TPC-H Analyzed suggests using an array-based GROUP BY, because there can only be 64K combinations of two single-character values; the grouping keys are declared CHAR (1) and non-nullable. The Virtuoso implementation does not do this, though.
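To make the idea concrete, below is a minimal C sketch of such an array-based GROUP BY (an illustration under assumed names, not the Virtuoso code): with two non-nullable CHAR (1) keys, the pair of key bytes indexes directly into an array of at most 64K accumulators, so the aggregation path has no hashing and no key comparison --
/* Hypothetical array-based GROUP BY for two non-nullable CHAR (1) keys.
   The two key bytes form a 16-bit index into a 64K-slot accumulator array. */

#include <stdint.h>

typedef struct q1_acc_s
{
  int64_t count;           /* COUNT (*) of the group */
  double  sum_qty;         /* SUM (l_quantity) */
  double  sum_base_price;  /* SUM (l_extendedprice) */
  double  sum_disc_price;  /* SUM (l_extendedprice * (1 - l_discount)) */
  double  sum_charge;      /* SUM (l_extendedprice * (1 - l_discount) * (1 + l_tax)) */
  double  sum_disc;        /* SUM (l_discount), for AVG (l_discount) */
} q1_acc_t;

static q1_acc_t groups[65536];  /* one slot per (l_returnflag, l_linestatus) pair */

void
q1_agg_row (char rf, char ls, double qty, double price, double disc, double tax)
{
  q1_acc_t *g = &groups[((uint8_t) rf << 8) | (uint8_t) ls];
  double disc_price = price * (1 - disc);
  g->count++;
  g->sum_qty += qty;
  g->sum_base_price += price;
  g->sum_disc_price += disc_price;
  g->sum_charge += disc_price * (1 + tax);
  g->sum_disc += disc;
}
At the end, only the few slots with a non-zero count are read out, and the AVG columns are the corresponding SUMs divided by the count.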
This query has been treated in many papers because it cannot be implemented in very many ways, it is easy to understand, and it still illustrates some basic metrics.
One execution on warm cache is between 3.9s and 4.7s. One execution with the data coming from OS disk cache is 11.8s. One execution with the data coming from 2 SSDs is 22s. Five concurrent executions from warm cache are 17.3s for the fastest and 20.5s for the slowest. A single threaded execution from warm cache is 58.4s.
We see that scaling is linear; i.e., 5 times the work takes a little under 5 times longer. The parallelism is reasonable, with a 14.6x speedup from 24 threads on 12 cores. Splitting the work into 48 software threads, time-sliced on 24 hardware threads, does not affect execution time. The work thus appears to be evenly spread over the threads.
It may be interesting to see how much data is transferred. To see the space consumption per column --
SELECT TOP 20 *
FROM sys_index_space_stats
ORDER BY iss_pages DESC ;
-- followed by --
SELECT coi_column,
SUM (coi_pages) / 128
FROM sys_col_info
GROUP BY coi_column
ORDER BY 2 DESC ;
-- gives us the following --
L_COMMENT 17893
PS_COMMENT 10371
O_COMMENT 7824
L_EXTENDEDPRICE 4771
O_CLERK 2744
L_PARTKEY 2432
L_SUPPKEY 2432
L_COMMITDATE 1784
L_SHIPDATE 1551
L_RECEIPTDATE 1537
O_TOTALPRICE 1181
C_COMMENT 1150
L_QUANTITY 960
O_ORDERKEY 736
P_NAME 729
PS_SUPPLYCOST 647
O_CUSTKEY 624
C_ADDRESS 427
L_DISCOUNT 424
L_TAX 419
L_SHIPINSTRUCT 412
L_SHIPMODE 410
L_LINENUMBER 394
L_RETURNFLAG 394
L_LINESTATUS 394
O_ORDERDATE 389
P_COMMENT 341
PS_SUPPKEY 323
P_TYPE 293
C_PHONE 274
L_ORDERKEY 268
PS_AVAILQTY 201
P_RETAILPRICE 161
C_ACCTBAL 123
O_ORDERPRIORITY 95
O_ORDERSTATUS 94
S_COMMENT 66
...
The total in allocated pages is 65.6 GB, of which 34.1 GB are accessed by the workload. The comment strings could be stream-compressed, bringing some speedup in load time due to less I/O. Also l_extendedprice, a frequently accessed column, could be represented with 4 bytes instead of 8. The working set could thus be cut down to about 28 GB, which may offer some benefit at larger scales. At any rate, for system sizing, the space utilization report is very useful.
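As an aside, the 4-byte representation is easy to picture: TPC-H prices have two decimal places and stay far below 2^31 cents, so a scaled 32-bit integer would do. A trivial sketch (an assumption for illustration, not what Virtuoso currently does) --
#include <stdint.h>

/* Hypothetical fixed-point representation of l_extendedprice as 32-bit cents,
   halving the 8-byte DOUBLE used today. */
typedef int32_t price_cents_t;

static inline price_cents_t
price_to_cents (double price)
{
  return (price_cents_t) (price * 100.0 + 0.5);   /* round to the nearest cent */
}

static inline double
cents_to_price (price_cents_t cents)
{
  return cents / 100.0;
}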
The query execution profile is as below, with comments inline. The profile here is obvious, but we show this as a guide to reading future profiles which will be more interesting.
{
time 2.6e-06% fanout 1 input 1 rows
time 1.7e-06% fanout 1 input 1 rows
{ fork
time 2.1e-06% fanout 1 input 1 rows
{ fork
The time xx% line above each operator is the actual percentage of execution time taken by it, followed by the count of rows of output per row of input (the fanout), followed by the actual rows of input. The operator below produced 591M rows of output for one row of input --
time 34% fanout 5.91599e+08 input 1 rows
LINEITEM 5.9e+08 rows(.L_RETURNFLAG, .L_LINESTATUS, .L_DISCOUNT, .L_EXTENDEDPRICE, .L_QUANTITY, .L_TAX)
L_SHIPDATE <= <c 1998-09-02>
Below is the arithmetic of the query, followed by a sort (GROUP BY) operator.
After code:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: temp := artm 1 + .L_TAX
12: temp := artm temp * temp
16: BReturn 0
Most of the time is spent below, in the GROUP BY. We notice that each needed aggregation is done only once, so the common subexpressions are correctly detected.
time 66% fanout 0 input 5.91599e+08 rows
Sort (.L_RETURNFLAG, .L_LINESTATUS) -> (inc, .L_DISCOUNT, .L_EXTENDEDPRICE, .L_QUANTITY, temp, temp)
}
time 4e-05% fanout 4 input 1 rows
group by read node
(.L_RETURNFLAG, .L_LINESTATUS, count_order, aggregate, sum_base_price, sum_qty, sum_charge, sum_disc_price)
time 6.2e-05% fanout 0 input 4 rows
The SUMs are divided by the COUNTs, and the rows are sorted.
Precode:
0: avg_qty := artm sum_qty / count_order
4: avg_price := artm sum_base_price / count_order
8: avg_disc := artm aggregate / count_order
12: BReturn 0
Sort (.L_RETURNFLAG, .L_LINESTATUS) -> (sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
}
time 1.2e-05% fanout 4 input 1 rows
Key from temp (.L_RETURNFLAG, .L_LINESTATUS, sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
The data is returned to the client.
time 4.4e-06% fanout 0 input 4 rows
Select (.L_RETURNFLAG, .L_LINESTATUS, sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
}
Elapsed time and CPU%.
3947 msec 2365% cpu, 6 rnd 5.99841e+08 seq 0% same seg 0% same pg
Compilation: 0 msec 0 reads 0% read 0 messages 0% clw
This output is produced by the following sequence on the iSQL command line --
SQL> SET blobs ON;
SQL> PROFILE ('SELECT .... FROM .....');
We will next consider the CPU profile:
704173 33.0087 setp_chash_run
275039 12.8927 gb_aggregate
252751 11.8479 ce_dict_any_sets_decode
178304 8.3581 cha_cmp_2a
170819 8.0073 ce_dict_int64_sets_decode
127827 5.9920 chash_array_0
120994 5.6717 chash_array
60902 2.8548 ce_intd_any_range_lte
47865 2.2437 artm_mpy_double
38634 1.8110 ce_vec_int64_sets_decode
26411 1.2380 artm_sub_double
24794 1.1622 artm_add_double
13600 0.6375 cs_decode
For hardcore aficionados, the code may be found in the Virtuoso develop/7.x branch on github.com. The version is not exactly the same but close enough for the parts above. artm_* is arithmetic on typed vectors. As pointed out before, the arithmetic is done with DOUBLEs, although users would prefer fixed point. There is, I believe, a MS SQL Server result with DOUBLEs, so using DOUBLEs would not disqualify a 100 GB TPC-H result.
The moral of the story is that an array-based aggregation, which would drop the chash_array* and cha_cmp* functions and keep only about 1/3 of setp_chash_run, would save upwards of a second of real time. The setp_* and cha_* functions are aggregation; the ce_* functions are column decompression and filtering. The arithmetic is not high in the sample, but it could be sped up 2-4x by SIMD, especially since AVX on Sandy Bridge and later does 4 DOUBLEs in a single instruction.
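For example, the sum_disc_price expression of Q1 could in principle be vectorized along the following lines (a sketch with AVX intrinsics, not Virtuoso code; the function and array names are invented for the example) --
#include <immintrin.h>

/* Hypothetical AVX version of out[i] = price[i] * (1 - disc[i]),
   processing 4 DOUBLEs per iteration; n is assumed to be a multiple of 4. */
void
disc_price_avx (const double *price, const double *disc, double *out, int n)
{
  __m256d one = _mm256_set1_pd (1.0);
  for (int i = 0; i < n; i += 4)
    {
      __m256d p = _mm256_loadu_pd (price + i);
      __m256d d = _mm256_loadu_pd (disc + i);
      _mm256_storeu_pd (out + i, _mm256_mul_pd (p, _mm256_sub_pd (one, d)));
    }
}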
We note that the ce_filter_* function would drop off if the table were stored in date order, as the top-level index would then show that all the values in the column matched, making it unnecessary to even read the l_shipdate column, except for the last part of the table. However, this is a marginal slice of the time even now.
Q1 Conclusions
We have demonstrated good load balance and passed the required common-subexpressions exam. The array-based GROUP BY trick is unused; it would save over 1s of real time, hence would be good value for only 100-200 lines of code.
Q3 -- Hash and Merge Joins
Next we look at JOINs by hash and index. Q3 is a relatively straightforward example, so we will go over the basics of JOIN type (i.e., whether by index or hash) and JOIN order. This will also show some scheduling effects.
The definition is:
SELECT TOP 10 l_orderkey,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
o_orderdate,
o_shippriority
FROM customer,
orders,
lineitem
WHERE c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < CAST ('1995-03-15' AS DATE)
AND l_shipdate > CAST ('1995-03-15' AS DATE)
GROUP BY l_orderkey,
o_orderdate,
o_shippriority
ORDER BY revenue DESC,
o_orderdate
;
The profile, comments inline, is:
{
time 6.7e-06% fanout 1 input 1 rows
Make a hash table keyed on c_custkey for all customers with c_mktsegment 'BUILDING'. For a hash join build side, the time above the hash filler line is the time for making the hash table from the buffered rows. The time above the sort ... hf ... line is the time for buffering the rows that go into the hash table. The other times in the hash filler block are for the operators that get the data, and are not related to making the hash table.
time 0.72% fanout 1 input 1 rows
{ hash filler
We see that the actual cardinality of customer is close to what was predicted. The actual number is on the line with the time; the predicted is on the line with the index name customer.
time 0.63% fanout 3.00019e+06 input 1 rows
CUSTOMER 3e+06 rows(.C_CUSTKEY)
C_MKTSEGMENT =
time 0.089% fanout 0 input 3.00019e+06 rows
Sort hf 34 (.C_CUSTKEY)
}
time 9.2e-06% fanout 1 input 1 rows
{ fork
time 5.2e-06% fanout 1 input 1 rows
{ fork
The below is a merge of a scan of orders and a hash join to customer. The orders table is scanned, first reading o_orderdate and o_custkey, on which there are selections. The condition on o_orderdate is a range check that is true of about 1/2 of the rows. The other condition is an invisible hash join against the customer hash table built above; this selects 1/5 of the rows on average. So we see that for a total of 150M orders, the fanout is 14.5M, about 1/10, as predicted. The 6.1e7 rows on the line with orders represents the estimate based on the orderdate condition. The card 0.2 on the hash filter line is the prediction for the hash join selectivity.
We note that since no order has more than one customer, the JOIN is always cardinality-restricting, hence can be merged into a scan. Being merged into a scan, it becomes run-time re-orderable with the condition on o_orderdate. The conditions are evaluated and arranged at run time in the order of rows eliminated per unit of time.
The expression "hash partition + bloom" means that the hash join could be partitioned if the hash table did not fit in memory; i.e., there could be several passes over the data. This is not here the case, nor is this generally desirable. The bloom means that the hash is pre-filtered with a Bloom filter, which we will see in the CPU profile.
time 30% fanout 1.45679e+07 input 1 rows
ORDERS 6.1e+07 rows(.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY)
O_ORDERDATE < <c 1995-03-15>
hash partition+bloom by 41 (tmp)hash join merged always card 0.2 -> ()
The below is the hash join operator that in fact was merged into the table scan above.
time 0.0016% fanout 1 input 1.45679e+07 rows
Hash source 34 merged into ts 0.2 rows(.O_CUSTKEY) -> ()
Below is the index-based access to lineitem. This is a de facto merge-join, since the o_orderkeys are generated in order by the scan. One in 10 l_orderkeys is selected. Each of these has an average of 4 lineitems. Of these 4, the cost model predicts that 2.7 will be selected based on the additional condition on l_shipdate. The actual number of rows matched is in fact much lower, since the date selection is heavily anti-correlated with the date selection on orders. In other words, an order tends to be shipped soon after the orderdate.
time 16% fanout 0.20508 input 1.45679e+07 rows
LINEITEM 2.6 rows(.L_ORDERKEY, .L_EXTENDEDPRICE, .L_DISCOUNT)
inlined L_ORDERKEY = .O_ORDERKEY L_SHIPDATE >
After code:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
The query has a GROUP BY that includes the high-cardinality column l_orderkey, with 150M distinct values. The GROUP BY is therefore partitioned. This means that the previous part of the query is run on multiple threads, so that each thread gets an approximately equal number of rows of orders. For the GROUP BY, the threads pass each other chunks of data so that each grouping key can only end up in one partition. This means that at the end of the GROUP BY, there are multiple hash tables of grouping results that are guaranteed non-overlapping, hence there is no need to add up (re-aggregate) the per-thread results. The stage operator passes data between the threads; it is also known as an exchange operator.
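A minimal sketch of the partitioning step (hypothetical names, not the actual stage operator) is below: each producer thread routes a row to the partition chosen by a hash of l_orderkey, so any given key is aggregated by exactly one consumer thread --
#include <stdint.h>

#define N_PART 24   /* assumed number of GROUP BY partitions / worker threads */

/* Assumed per-partition GROUP BY insert; each partition is owned by a single
   thread, so the partial results never need to be re-aggregated. */
extern void partition_gby_add (int part, int64_t l_orderkey, int32_t o_orderdate,
                               int32_t o_shippriority, double revenue);

static inline int
group_partition (int64_t l_orderkey)
{
  /* Route each grouping key to exactly one partition. */
  return (int) (((uint64_t) l_orderkey * 0x9e3779b97f4a7c15ULL >> 32) % N_PART);
}

void
stage_row (int64_t l_orderkey, int32_t o_orderdate, int32_t o_shippriority,
           double revenue)
{
  /* In the real engine rows move between threads in batches; this call stands
     in for handing the row to the thread that owns the partition. */
  partition_gby_add (group_partition (l_orderkey), l_orderkey,
                     o_orderdate, o_shippriority, revenue);
}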
time 1% fanout 1 input 2.98758e+06 rows
Stage 2
time 1.4% fanout 0 input 2.98758e+06 rows
Sort (q_.L_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY) -> (temp)
}
time 0.4% fanout 1.13104e+06 input 1 rows
group by read node
(.L_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY, revenue)in each partition slice
time 0.36% fanout 0 input 1.13104e+06 rows
Sort (revenue, .O_ORDERDATE) -> (.L_ORDERKEY, .O_SHIPPRIORITY)
}
time 3.1e-05% fanout 10 input 1 rows
top order by read (.L_ORDERKEY, revenue, .O_ORDERDATE, .O_SHIPPRIORITY)
time 6.2e-06% fanout 0 input 10 rows
Select (.L_ORDERKEY, revenue, .O_ORDERDATE, .O_SHIPPRIORITY)
}
1189 msec 2042% cpu, 1.45513e+07 rnd 2.08588e+08 seq 98.9053% same seg 0.952364% same pg
Compilation: 1 msec 0 reads 0% read 0 messages 0% clw
This query also illustrates the meaning of the random and sequential access meters in the profile: For 1/10 of the orders, there is a random lookup from lineitem, hence 14.5M random lookups. The sequential scan is 150M rows of orders, plus an average of 3 extra rows for each of the 14M random accesses of lineitem. The locality metric, 98.9% same segment, means that the JOIN has a merge-join pattern, since 99% of lookups fall in the same segment as the previous one. A segment is a column store structure that, in the case of lineitem, corresponds to about 4500 consecutive rows.
This is one of the queries where storing the data in date order would be advantageous. A zone map on the date would eliminate the second half of the orders without even looking at the columns. A zone map is a summary data structure that keeps, for example, a minimum and a maximum value of an attribute for a range of consecutive rows. Also, for all but the lineitems at the end of the range of orders, a zone map would disqualify the items without looking at the column. VectorWise, for example, profits from this. However, the CPU profile below shows that the time spent in date compares is not very long even now.
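In outline, the zone map check is as simple as the following sketch (an invented structure, not Virtuoso's index format): for each zone of consecutive rows, the predicate constant is compared against the stored minimum and maximum, and the zone is skipped when no row in it can qualify --
#include <stdint.h>

/* Hypothetical zone map over a date column stored as day numbers:
   one min/max pair per zone of consecutive rows. */
typedef struct zone_s
{
  int32_t min_date;
  int32_t max_date;
} zone_t;

extern void scan_zone (int z);   /* assumed: read and filter the rows of zone z */

void
scan_date_lt (const zone_t *zones, int n_zones, int32_t date_const)
{
  for (int z = 0; z < n_zones; z++)
    {
      if (zones[z].min_date >= date_const)
        continue;                /* no row in this zone has date < date_const */
      /* If zones[z].max_date < date_const, every row qualifies and the
         per-row compare could be dropped too; otherwise scan normally. */
      scan_zone (z);
    }
}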
On further analysis, we see that the query is run in the order of o_orderkey, so that each o_orderkey is seen once. Hence the partitioned GROUP BY can be changed into an ordered GROUP BY, as all the grouping columns are functionally dependent on o_orderkey. An ordered GROUP BY is more efficient than a partitioned or re-aggregated one, since it does not have to remember grouping keys: once a new key comes in, the previous key will not be seen again, and the aggregation for it can be sent onwards in the pipeline.
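A sketch of an ordered GROUP BY for Q3's keys (hypothetical names; the real operator is vectorized) keeps a single running group and flushes it downstream whenever l_orderkey changes --
#include <stdint.h>

/* Hypothetical ordered GROUP BY: rows arrive in l_orderkey order, so one
   running accumulator suffices and no table of groups is kept. */
extern void emit_group (int64_t l_orderkey, int32_t o_orderdate,
                        int32_t o_shippriority, double revenue);

static int64_t cur_key = -1;
static int32_t cur_date, cur_prio;
static double  cur_revenue;

void
ordered_gby_row (int64_t l_orderkey, int32_t o_orderdate,
                 int32_t o_shippriority, double revenue)
{
  if (l_orderkey != cur_key)
    {
      if (cur_key != -1)
        emit_group (cur_key, cur_date, cur_prio, cur_revenue);  /* key will not recur */
      cur_key = l_orderkey;
      cur_date = o_orderdate;     /* functionally dependent on the key */
      cur_prio = o_shippriority;
      cur_revenue = 0;
    }
  cur_revenue += revenue;
}

void
ordered_gby_finish (void)
{
  if (cur_key != -1)
    emit_group (cur_key, cur_date, cur_prio, cur_revenue);  /* flush the last group */
}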
However, this last transformation has little effect here, as the count of rows passing to the aggregation is small. Use of ordered aggregation has much higher impact in other queries and will be revisited there. There is also a chance for late projection, as o_shippriority is in fact only needed for the top 10 rows returned. The impact is small in this case, though. This too will be revisited later.
We now consider the CPU profile:
93087 20.9350 cha_inline_1i_n
63224 14.2189 cha_bloom_unroll
30162 6.7834 cha_insert_1i_n
29178 6.5621 ce_search_rld
28519 6.4139 ce_intd_range_ltgt
17319 3.8950 cs_decode
15848 3.5642 ce_intd_sets_ltgt
11731 2.6383 ce_skip_bits_2
8212 1.8469 ce_vec_int_sets_decode
7946 1.7870 itc_single_row_opt
7886 1.7735 ce_intd_any_sets_decode
7072 1.5905 itc_fetch_col_vec
6474 1.4560 setp_chash_run
5304 1.1929 itc_ce_value_offset
5263 1.1836 itc_col_seg
The top 3 items are for the orders x customer hash join -- the top 2 for the probe, and the 3rd for the build. The 4th item is the index lookup on lineitem. The one below that is the date condition on orders; below this is the condition on the date of lineitem.
The functions working on a compressed column are usually called ce_<compression type>_<sets or range>_<filter or decode>. ce means compression entry; the compression types are rl (run length), rld (run length with delta), bits (densely ascending values as bitmap), intd (16 bit deltas on a base), and dict (dictionary). The sets vs. range part determines whether the operation works on a range of contiguous values in the entry or takes a vector of row numbers as context: the first predicate works on a range; the next one works on the sets (row numbers) selected by the previous one. Filter means selection, and decode means extracting a value for processing by a downstream operator.
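To make the naming concrete, here is a schematic sketch of the difference between a range and a sets filter over one compression entry (invented signatures, not the actual ce_* functions; the entry is shown already decoded for simplicity, whereas the real functions compare on the compressed form) --
#include <stdint.h>

/* First predicate: test every row in [start, end) of the entry and collect
   the qualifying row numbers ("range" variant). */
int
filter_lte_range (const int32_t *vals, int start, int end,
                  int32_t cmp, int *sets_out)
{
  int n = 0;
  for (int r = start; r < end; r++)
    if (vals[r] <= cmp)
      sets_out[n++] = r;
  return n;
}

/* Next predicate: test only the row numbers selected by the previous one
   ("sets" variant), narrowing the selection further. */
int
filter_lte_sets (const int32_t *vals, const int *sets_in, int n_in,
                 int32_t cmp, int *sets_out)
{
  int n = 0;
  for (int i = 0; i < n_in; i++)
    if (vals[sets_in[i]] <= cmp)
      sets_out[n++] = sets_in[i];
  return n;
}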
We run 5 of these concurrently: the fastest returns in 2.8s; the slowest in 5.4s. The executions are staggered, so that each divides into up to 24 independent fragments which are then multiplexed on 48 worker threads, with each query guaranteed at least one thread. The slices of the first query are prioritized, so that when a worker thread has a choice of next unit of work, it will prefer one from an older queue. Each query in this setting has one queue of independently executable fragments. Thus the first to come in gets the most threads and finishes sooner. The rationale for this is that a query may have large transient memory consumption, e.g., for GROUP BYs or hash join build sides. The sooner such a query finishes, the less likely it is that there will be many concurrent queries with high peak-memory demand. This does not block short queries, since in any case a runnable query will have one thread, which will get scheduled by the OS from time to time.
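The policy can be modeled in a few lines (a simplified sketch, not the actual Virtuoso thread pool): each running query owns one queue of runnable fragments, the queues are kept in arrival order, and a free worker serves the oldest queue that has work --
#include <stddef.h>

/* Hypothetical model of the fragment scheduler. */
typedef struct frag_queue_s
{
  struct frag_queue_s *next;                  /* next younger query's queue */
  void *(*pop) (struct frag_queue_s *q);      /* returns a fragment or NULL */
} frag_queue_t;

void *
next_work (frag_queue_t *oldest)
{
  for (frag_queue_t *q = oldest; q; q = q->next)
    {
      void *frag = q->pop (q);
      if (frag)
        return frag;   /* prefer fragments of the oldest query still running */
    }
  return NULL;         /* nothing runnable; the worker goes idle */
}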
Q3 Conclusions
The balance is that the unused tricks (ordered aggregation, late projection) would gain little. Date order would gain about 0.4s of the 1.3s, but would lose in other queries.
We have treated Q1 and Q3 at some length in order to introduce the reading of query profiles and the meaning of some of the meters. For the handful of people who are deep into this sport, the information is rather obvious, but it will still give an idea of the specific feature mix of Virtuoso. Column stores are similar up to a point, but not all make exactly the same choices.
If you are a developer, what, if anything, should you remember of this? Never mind the finer points of column store science -- if you understand join order and join type, then you have a good chance of understanding why some queries are fast and some slow. Most support questions are about this. If you know what the DBMS does, or should do, you are in control. This is why the metrics and concepts here are also of some interest outside the very small group that actually builds DBMSs.