Q13 is one of the longest-running of the 22 queries. The TPC-H metric is a geometric mean of two scores, power and throughput, where the throughput score is the elapsed time of the multiuser part of the test divided by the number of queries executed. In this part of the score, Q13 can be up to 1/5 of the total. The power score, on the other hand, is a geometric mean of the run times of all the queries, scaled into queries per hour; there, all queries carry equal weight. A bad Q13 will sink a whole result.
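To see in rough numbers how one slow query weighs on the two scores, here is a back-of-the-envelope sketch; the query times are invented and the formulas are simplified (the real metric also involves the scale factor and stream count):

from math import prod

# Simplified sketch of the TPC-H composite; all numbers are invented.
def geometric_mean(xs):
    return prod(xs) ** (1.0 / len(xs))

# 22 hypothetical single-stream times in seconds; Q13 is the slow one.
good = [2.0] * 21 + [6.7]
bad  = [2.0] * 21 + [25.0]   # Q13 without the optimizations discussed below

print(geometric_mean(good), geometric_mean(bad))  # ~2.11 vs ~2.24: the power side moves a little
print(sum(good), sum(bad))                        # 48.7 vs 67.0: the elapsed-time side takes the full hit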
Q13 counts the orders of each customer and then shows, for each distinct count of orders, how many customers have this number of orders. 1/3 of the customers have no orders; hence this is an outer join between customer and orders, as follows:
SELECT c_count,
       COUNT(*) AS custdist
  FROM ( SELECT c_custkey,
                COUNT(o_orderkey) AS c_count
           FROM ( SELECT *
                    FROM customer
                         LEFT OUTER JOIN orders
                            ON c_custkey = o_custkey
                           AND o_comment NOT LIKE '%special%requests%'
                ) c_customer
          GROUP BY c_custkey
       ) c_orders
 GROUP BY c_count
 ORDER BY custdist DESC,
          c_count DESC
;
The only parameter of the query is the pattern in the NOT LIKE condition. The NOT LIKE is very unselective, so almost all orders will be considered.
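The pattern matches comments that contain 'special' followed, somewhere later, by 'requests'. A minimal sketch of that check (the engine does the same with SSE 4.2 substring instructions, as the profile further down shows); the function name and sample strings are ours:

def matches_special_requests(comment, word1="special", word2="requests"):
    """Semantics of LIKE '%special%requests%': word1, then word2 somewhere after it."""
    i = comment.find(word1)
    return i >= 0 and comment.find(word2, i + len(word1)) >= 0

# Orders whose comment matches are excluded by the NOT LIKE; very few do,
# which is why the predicate is so unselective.
print(matches_special_requests("carefully packed special requests"))  # True
print(matches_special_requests("requests sleep. special packages"))   # False: wrong order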
The Virtuoso run time for Q13 is 6.7s, which we can consider a good result. Running 5 of these at the same time, the fastest execution finishes in 23.7s and the slowest in 35.3s. Doing 5x the work takes 5.2x the time. This is not bad, considering that the query has a high transient memory consumption. In a second execution of 5 concurrent Q13s, the fastest finishes in 22s and the slowest in 29.8s. The difference comes from already having the needed memory blocks cached, so there are no calls to the OS for mapping more memory.
To measure the peak memory consumption, which is a factor with this query, there is the mp_max_large_in_use counter. To reset:
__dbf_set ('mp_max_large_in_use', 0);
To read:
SELECT sys_stat ('mp_max_large_in_use');
For the 5 concurrent executions of Q13, the counter goes to 10GB. This is easily accommodated at 100GB; but at ten times the scale, this will be a significant quantity, even in a scale-out setting. The memory allocation time is recorded in the counter mp_mmap_clocks, read with sys_stat. This is a count of cycles spent waiting for mmap or munmap and allows tracking whether the process is being slowed down by transient memory allocation.
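Since mp_mmap_clocks is a raw cycle count, turning it into wall-clock time means dividing by the clock rate; a trivial sketch with an assumed 2.3 GHz clock (the counter value below is made up, not a measurement):

mp_mmap_clocks = 4_600_000_000    # hypothetical value read via sys_stat
clock_hz       = 2_300_000_000    # assumed 2.3 GHz core clock
print(mp_mmap_clocks / clock_hz)  # -> 2.0 seconds spent waiting in mmap/munmap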
Let us consider how this works. The plan is as follows:
{
{ hash filler
CUSTOMER 1.5e+07 rows(t3.C_CUSTKEY)
-- Make a hash table of the 15M customers. The stage 2 operator below means that the customers are partitioned in a number of distinct partitions based on the c_custkey, which is the key in the hash table. This means that a number of disjoint hash tables are built, as many as there are concurrent threads. This corresponds to the ThreadsPerQuery ini file setting, or the enable_qp setting with __dbf_set and sys_stat.
Stage 2
Sort hf 34 (q_t3.C_CUSTKEY)
}
{ fork
{ fork
{ fork
END Node
outer {
-- Here we start a RIGHT OUTER JOIN block. The below operator scans the orders table and picks out the orders which do not contain the mentioned LIKE pattern.
ORDERS 1.5e+08 rows(t4.O_CUSTKEY, t4.O_ORDERKEY)
O_COMMENT LIKE LIKE
hash partition+bloom by 80 ()
-- Below is a partitioning operator, also known as an exchange operator, which will divide the stream of o_custkeys from the previous scan into different partitions, each served by a different thread.
Stage 2
-- Below is a lookup in the customer hash table. The lookup takes place in the partition determined by the o_custkey being looked up.
Hash source 34 not partitionable 1 rows(q_t4.O_CUSTKEY) -> ()
right oj, key out ssls: (t3.C_CUSTKEY)
After code:
0: t3.C_CUSTKEY := := artm t4.O_CUSTKEY
4: BReturn 0
-- The below is a RIGHT OUTER JOIN end operator; see below for further description.
end of outer}
set_ctr
out: (t4.O_ORDERKEY, t4.O_CUSTKEY)
shadow: (t4.O_ORDERKEY, t4.O_CUSTKEY)
Precode:
0: isnotnull := Call isnotnull (t4.O_ORDERKEY)
5: BReturn 0
-- The below sort is the innermost GROUP BY. The ISNOTNULL above makes a 0 or a 1, depending on whether there was a found o_custkey for the c_custkey of the customer.
Sort (t3.C_CUSTKEY) -> (isnotnull)
}
-- The below operators start after the above have executed to completion on every partition. We read the first aggregation, containing for each customer the COUNT of orders.
group by read node
(t3.C_CUSTKEY, aggregate)in each partition slice
After code:
0: c_custkey := := artm t3.C_CUSTKEY
4: c_count := := artm aggregate
8: BReturn 0
Subquery Select(c_custkey, c_count)
-- Below is the second GROUP BY; for each COUNT, we count how many customers have this many orders.
Sort (c_count) -> (inc)
}
group by read node
(c_count, custdist)
-- Below is the final ORDER BY.
Sort (custdist, c_count)
}
Key from temp (c_count, custdist)
Select (c_count, custdist)
}
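To make the stage 2 partitioning and the disjoint per-partition hash tables concrete, here is a schematic Python sketch of the build and probe pipeline. It is an illustration only; the helper names are invented, and the real operators are vectored and multithreaded.

from collections import defaultdict

N_PART = 4                        # stands in for ThreadsPerQuery / enable_qp

def part(key):
    return hash(key) % N_PART     # the stage 2 exchange: route a key to its partition

def build_customer_hash(custkeys):
    # hash filler: one disjoint hash table of customers per partition
    tables = [set() for _ in range(N_PART)]
    for ck in custkeys:
        tables[part(ck)].add(ck)
    return tables

def probe_and_group(orders, tables, excluded):
    # scan orders, drop comments matching the LIKE pattern, partition the o_custkeys,
    # probe the matching table, and feed the innermost GROUP BY (orders per customer)
    counts = [defaultdict(int) for _ in range(N_PART)]
    for o_custkey, o_comment in orders:
        if excluded(o_comment):            # the NOT LIKE condition
            continue
        p = part(o_custkey)
        if o_custkey in tables[p]:
            counts[p][o_custkey] += 1
    return counts
    # Customers with no orders are not in counts here; they are produced
    # by the right outer join end operator, explained below.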
The CPU profile starts as follows:
971537 31.8329 setp_chash_run
494300 16.1960 hash_source_chash_input_1i_n
262218 8.5917 clrg_partition_dc
162773 5.3333 strstr_sse42
68049 2.2297 memcpy_16
65883 2.1587 cha_insert_1i_n
57515 1.8845 hs_send_output
56093 1.8379 cmp_like_const
53752 1.7612 gb_aggregate
51274 1.6800 cha_rehash_ents
...
The GROUP BY is on top, with 31%. This is the first GROUP BY, which has one group per customer, for a total of 15M groups. Below the GROUP BY is the hash lookup of the hash join from orders to customer. The third item is partitioning of a data column (dc, or vectored query variable). The partitioning refers to the operator labeled stage 2 above. From one column of values, it makes several. In the 4th place, we have the NOT LIKE predicate on o_comment. This is a substring search implemented using SSE 4.2 instructions. Finally, in the last place, there is a function for resizing a hash table; in the present case, the hash table for the innermost GROUP BY.
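The partitioning step corresponds to splitting one column of values into several smaller ones, one per partition; a minimal sketch of that step, with invented names:

def partition_dc(values, n_part):
    # from one column (dc) of custkeys, make n_part smaller columns by hash
    out = [[] for _ in range(n_part)]
    for v in values:
        out[hash(v) % n_part].append(v)
    return out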
At this point, we have to explain the RIGHT OUTER JOIN: Generally, when making a hash join, the larger table is on the probe side and the smaller on the build side. This means that the rows on the build side get put in a hash table and then, for each row on the probe side, there is a lookup to see if there is a match in the hash table.

However, here the bigger table is on the right side of the LEFT OUTER JOIN. Normally, one would have to make the hash table from the orders table and then probe it with customer, so that one would find no match for the customers with no orders and several matches for customers with many orders. However, this would be much slower. So there is a trick for reversing the process: You still build the hash from the smaller set in the JOIN, but now for each key that does get probed, you set a bit in a bit mask, in addition to sending the match as output. After all outputs have been generated, you look in the hash table for the entries where the bit is not set. These correspond to the customers with no orders. For these, you send the c_custkey with a null o_orderkey to the next operator in the pipeline, which is the GROUP BY on c_custkey with the count of non-null o_orderkeys.
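A minimal single-threaded sketch of this reversed (right outer) hash join; the names are ours, and the bit mask is modeled as a per-key flag:

from collections import defaultdict

def right_outer_count(custkeys, orders, excluded):
    # build on the small side; the flag plays the role of the bit mask
    probed  = {ck: False for ck in custkeys}
    c_count = defaultdict(int)              # innermost GROUP BY on c_custkey

    for o_custkey, o_comment in orders:
        if excluded(o_comment):             # the NOT LIKE condition
            continue
        if o_custkey in probed:
            probed[o_custkey] = True        # mark the entry as probed
            c_count[o_custkey] += 1         # a non-null o_orderkey counts as 1

    for ck, hit in probed.items():
        if not hit:                         # never probed: a customer with no orders
            c_count[ck] = 0                 # stands for the null o_orderkey row

    return c_count

The per-customer counts then feed the second GROUP BY, which counts customers per distinct c_count.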
One might at first think that such a backwards way of doing an outer join is good for nothing but this benchmark and should be considered a benchmark special. This is not so, though, as there are accepted implementations that do this very thing.
Furthermore, getting a competitive score in any other way is impossible, as we shall see below.
We further note that the grouping key in the innermost GROUP BY is the same as the hash key in the last hash join, i.e., o_custkey. This means that the GROUP BY and the hash join could be combined into a single operator called GROUPJOIN. If this were done, the hash would be built from customer with extra space left for the counters. This would in fact remove the hash join from the profile, as well as the rehash of the GROUP BY hash table, for a gain of about 20%. The outer join behavior is not a problem here, since untouched buckets, e.g., customers without orders, would be initialized with a COUNT of 0. For an inner join behavior, one would simply leave out the zero counts when reading the GROUP BY. At the end of the series, we will see what the DBT3 score will be. We remember that there is a 1.5s saving to be had here for the throughput score if the score is not high enough otherwise. The effect on the power score will be less, because that only cares about relative speedup, not absolute time.
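A sketch of the GROUPJOIN idea under the same simplifying assumptions: the hash built from customer carries the counter directly, so probe and aggregation become one pass. This illustrates the technique in general, not an existing Virtuoso operator.

def groupjoin_count(custkeys, orders, excluded):
    # one hash table keyed on c_custkey with room for the counter;
    # untouched buckets keep their initial count of 0 (outer join behavior)
    counts = {ck: 0 for ck in custkeys}
    for o_custkey, o_comment in orders:
        if not excluded(o_comment) and o_custkey in counts:
            counts[o_custkey] += 1           # probe and aggregate in one step
    # for inner join behavior, read only the non-zero counts:
    # inner = {ck: n for ck, n in counts.items() if n > 0}
    return counts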
Next, we disable the RIGHT OUTER JOIN optimization and force the JOIN to build a hash on orders and to probe it with customer. The execution time is 25s. Most of the time goes into building the hash table of orders. The memory consumption also goes up to around 8G. Then we try the JOIN by index, with a scan of customer and, for each customer, an index lookup of orders based on an index on o_custkey. Here we note that there is a condition on a dependent part of the primary key, namely o_comment, which requires joining to the main row from the o_ck index. There is a gain, however, because the GROUP BY becomes ordered; i.e., there is no need to keep groups around for customers that have already been seen, since we know they will not come again, the outer scan being in order of c_custkey. For this reason, the memory consumption for the GROUP BY goes away. However, the index-based plan is extremely sensitive to vector size: The execution takes 29.4s if the vector size is allowed to grow to 1MB, but 413s if it stays at the default of 10KB. The difference is that the 1MB vector hits 1/150 of the rows (1 million lookups against a 150 million row table), whereas the 10KB vector hits 1/15000. Thus, the benefits from vectoring lookups are largely lost, since there are hardly ever hits in the same segment, in this case within 2000 rows. But this is not the main problem: The condition on the main row is a LIKE on a long column. Thus, the whole column for the segment in question must be accessed for reading, meaning 2000 or so o_comments, of which one will be checked. If, instead of a condition on o_comment, we have one on o_totalprice > 0, we get 93s with 10KB vector size and 15s with dynamic vector size up to 1MB.
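The hit-density arithmetic behind these numbers, using the figures quoted above (the segment size of roughly 2000 rows is from the text; the lookup counts per vector are approximations):

table_rows   = 150_000_000
segment_rows = 2_000

for lookups in (1_000_000, 10_000):        # roughly the 1MB and 10KB vector cases
    density = lookups / table_rows
    print(f"1/{round(1 / density)} of rows hit,",
          f"{density * segment_rows:.2f} hits per {segment_rows}-row segment")
# 1/150 of rows hit, 13.33 hits per 2000-row segment  -> vectoring pays off
# 1/15000 of rows hit, 0.13 hits per 2000-row segment -> nearly every hit touches a new segment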
If we now remove the condition on dependent columns of orders, the index plan becomes faster, since the whole condition is resolved within the o_custkey index -- 2.5s with 10KB vector size, 2.6s with dynamic vector size up to 1MB. The point here is that the access from customer to orders on the o_custkey index is ordered, like a merge join.
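A sketch of why the ordered access lets the GROUP BY stream: with both sides ascending on custkey, each group is complete as soon as the scan moves past its key, so no finished group needs to be kept in memory. The function below is an illustration under that assumption.

def merge_count(custkeys_sorted, order_custkeys_sorted):
    # both inputs ascending by custkey; count orders per customer, merge-join style
    it  = iter(order_custkeys_sorted)
    cur = next(it, None)
    for ck in custkeys_sorted:
        n = 0
        while cur is not None and cur < ck:   # order keys below the current customer
            cur = next(it, None)
        while cur is not None and cur == ck:
            n += 1
            cur = next(it, None)
        yield ck, n                           # emitted as soon as ck is done

# e.g. list(merge_count([1, 2, 3], [1, 1, 3])) == [(1, 2), (2, 0), (3, 1)]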
Q13 Conclusions
Q13 is a combo of many choke points in the TPC-H Analyzed paper. The most important is special JOIN types, i.e., RIGHT OUTER JOIN and GROUPJOIN. Then there is string operation performance for the substring matching with LIKE. This needs to be implemented with the SSE 4.2 string instructions; otherwise there is a hit of about 0.5s on query speed.
The TPC-H Analyzed paper was written against the background of analytical DB tradition, where the dominant JOIN type is hash, except when there is a merge between two sets that are ordered or at least clustered on the same key. Clustered here means physical order, but without the need to be strictly in key order.
Here I have added some index-based variants to show that hash join indeed wins and to point out the sensitivity of random access to vector size. As column stores go, Virtuoso is especially good at random access. This must be so, since it was optimized to do RDF well, which entails a lot of lookup. Also note how a big string column is read with great ease in a sequential scan, but kills performance in a non-local random access pattern.