This article is about how scale-out differs from single-server. It shows large effects from parameters whose very existence most would not anticipate, and introduces some low-level metrics for assessing them. The moral of the story is that this is the stuff that makes the difference between merely surviving scale-out and winning with it. The developer and DBA would not normally know about any of this; these things therefore fall into the category of adaptive self-configuration expected from the DBMS. But since this series is about what makes performance, I will discuss the dynamics such as they are and how to play them.
We take the prototypical cross-partition join in Q13: Make a hash table of all customers, partitioned by c_custkey. This is done independently, with full parallelism, in each partition. Scan orders, get the customer (in a different partition), and flag the customers that had at least one order. Then, to get the customers with no orders, return the customers that were not flagged in the previous pass.
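For reference, the standard TPC-H Q13 text is shown below, with the usual substitution parameters filled in; the plan sketched above corresponds to the part of the work that identifies the customers ending up with zero orders.

-- TPC-H Q13 (Customer Distribution), substitution parameters 'special' and 'requests'
select c_count, count(*) as custdist
from ( select c_custkey, count(o_orderkey)
       from customer left outer join orders
         on c_custkey = o_custkey
           and o_comment not like '%special%requests%'
       group by c_custkey
     ) as c_orders (c_custkey, c_count)
group by c_count
order by custdist desc, c_count desc;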
The single-server times in part 12 were 7.8 s and 6.0 s with a single user; we consider the better of the two. The difference is due to allocating memory on the first go; on the second go the memory is already in reserve.
With default settings, we get 4595 ms, with per-node resource utilization at:
Cluster 4 nodes, 4 s. 112405 m/s 742602 KB/s 2749% cpu 0% read 4% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 27867 m/s 185654 KB/s 733% cpu 0% read 4% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 28149 m/s 185372 KB/s 672% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 28220 m/s 185621 KB/s 675% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 28150 m/s 185837 KB/s 667% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs
The top line is the summary; the lines below are per-process. m/s is messages per second; KB/s is interconnect traffic per second; clw % is idle time spent waiting for a reply from another process. The cluster is set up with 4 processes across 2 machines, each machine with 2 NUMA nodes. Each process has affinity to one NUMA node, so it uses local memory only. The time is reasonable in light of the overall CPU of about 2700%; the maximum would be 4800%, with all threads of all cores busy all the time.
The catch here is that we do not have a steady half-platform utilization all the time, but rather full-platform peaks followed by synchronization barriers with very low utilization. So, we set the batch size differently:
cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 50000000)');
This means that we set, on each process, cl_dfg_batch_bytes to 50M from a default of 10M. The effect is that each scan of orders, one thread per slice, 48 slices total, will produce 50 MB worth of o_custkeys to be sent to the other partition for getting the customer. After each 50 MB, the thread stops and will produce the next batch when all are done and a global continue message is sent by the coordinator.
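The same mechanism undoes the change between experiments: re-issuing the call with 10M restores the default noted above. The run below is still with the 50M setting.

cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 10000000)');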
The time is now 3173 ms with:
Cluster 4 nodes, 3 s. 158220 m/s 1054944 KB/s 3676% cpu 0% read 1% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 39594 m/s 263962 KB/s 947% cpu 0% read 1% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 39531 m/s 263476 KB/s 894% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 39523 m/s 263684 KB/s 933% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 39535 m/s 263586 KB/s 900% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs
The platform utilization is better, as we see. The throughput is nearly double that of the single server, which is pretty good for a communication-heavy query.
This was done with a vector size of 10K. In other words, each partition gets 10K o_custkeys and splits these 48 ways, one way per recipient slice: 1/4 stay in the same process, 1/4 go to a different process on the same machine, and 2/4 go to a different machine. The recipient gets messages with an average of 208 o_custkey values (10,000 / 48 ≈ 208), puts them back together into batches of 10K, and passes these to the hash join with customer.
We try different vector sizes, such as 100K:
cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');
There are two metrics of interest here: the write block time and the scheduling overhead. The write block time is a count of microseconds that increases whenever a thread must wait before it can write to a connection. The scheduling overhead is the cumulative clocks spent by threads waiting for the critical section that dispatches messages to consumer threads. Long messages cause blocking on write; short messages cause frequent scheduling decisions.
SELECT cl_sys_stat ('local_cll_clk', clr=>1),
cl_sys_stat ('write_block_usec', clr=>1)
;
cl_sys_stat gets the counters from all processes and returns the sum; clr=>1 means that the counter is cleared after being read.
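As a sketch of the measurement sequence, using only the calls shown above (the Q13 text itself is abbreviated and the elapsed time is taken at the client), a single run per vector size looks roughly like this:

-- set the vector size on each process
cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');
-- clear the counters before the run
SELECT cl_sys_stat ('local_cll_clk', clr=>1), cl_sys_stat ('write_block_usec', clr=>1);
-- ... run Q13 here, taking the elapsed time at the client ...
-- read the counters accumulated during the run (mtx and wblock in the table below)
SELECT cl_sys_stat ('local_cll_clk', clr=>1), cl_sys_stat ('write_block_usec', clr=>1);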
We do Q13 with vector sizes of 10K, 100K, and 1000K.
Vector size | msec | mtx (clocks)   | wblock (µs)
10K         | 3297 | 10,829,910,329 | 0
100K        | 3150 | 1,663,238,367  | 59,132
1000K       | 3876 | 414,631,129    | 4,578,003
So, 100K seems to strike the best balance between scheduling and blocking on write.
The times are measured over several samples with each setting; they stabilize after a few runs, once memory blocks of the appropriate size are in reserve. Calling mmap to allocate these on the first run with each size has a very high penalty, e.g., 60 s for the first run with a 1M vector size. We note that blocking on write is really bad even though, of the traffic leaving a process, 1/3 involves no network at all and 2/3 goes over a fast network (QDR InfiniBand) with no other load. Further, the affinities are set so that the thread responsible for incoming messages is always on core. Result variability across consecutive runs is under 5%, which is similar to single-server behavior.
It would seem that a mutex, as bad as it is, is still better than a distributed cause for going off core (blocking on write). The latency for continuing a thread thus blocked is of course higher than the latency for continuing one that is waiting for a mutex.
We note that a cluster with more machines can take a larger vector size, because a vector spreads out over more recipients. The key seems to be to set the message size so that blocking on write is not common; this is a possible adaptive execution feature, as sketched below. We have seen no particular benefit from SDP (Sockets Direct Protocol) and its zero copy, a TCP replacement that comes with the InfiniBand drivers.
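As a purely hypothetical sketch of such adaptation, not an existing feature: the procedure name and the threshold below are invented, while cl_sys_stat, cl_exec, and __dbf_set are the calls used earlier in this article.

-- Hypothetical sketch only: shrink the vector size when blocking on write
-- is observed, otherwise keep it large to cut scheduling overhead.
create procedure adapt_vector_size ()
{
  declare blk_usec integer;
  -- microseconds spent blocked on writes since the last read
  blk_usec := cl_sys_stat ('write_block_usec', clr=>1);
  if (blk_usec > 1000000)
    {
      -- blocking is common: fall back to the smaller vector size
      cl_exec ('__dbf_set (''dc_batch_sz'', 10000)');
    }
  else
    {
      -- little blocking: the larger vector size keeps scheduling overhead down
      cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');
    }
}
;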
We will next look at replication/partitioning tradeoffs for hash joins. Then we can look at full runs.
To be continued...
In Hoc Signo Vinces (TPC-H) Series