To date, TPC-H is the definitive data warehousing benchmark. Here I will cover the Virtuoso implementation of it in detail. The primary audience is database experts, but this should also be very educational for DBAs and advanced application developers: life becomes much more predictable if one can tell a good query plan from a bad one. Alongside a commentary on database science, you will also find here a guided tour of Virtuoso performance tuning and diagnostics. To follow this, it is useful to have the official TPC-H spec at hand (download links are on the far right of this page).
By now, TPC-H is an old game and it is safe to say that pretty much any player in the analytics database domain has had a go at it, even though some have never published a result. So, the bar for new entrants is very high.
In particular, VectorWise and EXASolution have taken performance in this workload close to the limits of the achievable. A challenger has to do everything right in order to win. One wrong move will lose the whole race.
This presentation has many objectives:
- To illustrate how Virtuoso is an excellent SQL analytics engine
- To provide an in-depth discussion on the science of query optimization and execution
- To outline avenues of future development, specifically as concerns analytics with schema-less data
At the TPC TC workshop at VLDB 2013, there was a paper, "TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark", by Peter Boncz, Thomas Neumann, and myself, concerning what the database world has learned from this very tough exercise. Peter Boncz is the original architect of Actian VectorWise, the current champion in TPC-H performance per core. Thomas Neumann is the author of HyPer, most likely the best entry in DBMS research for simultaneously supporting analytics and OLTP. Peter and Thomas are among the most renowned names in database science. I am the Program Manager of the Virtuoso column store, overseeing core engineering tasks such as SQL query optimization, execution, storage, and scale-out.
In this series, I will go over the Virtuoso implementation of TPC-H and elaborate further on the points discussed in the paper. The subject is broader than any single paper can cover in detail, although there are plenty of papers that address only one or two of the 22 queries.
Virtuoso is mostly known for RDF. Here we will cover the whole benchmark in SQL first, with both single-server and cluster implementations, and a discussion of where these differ. A state-of-the-art SQL implementation is the necessary basis for discussing how the same can be accomplished in RDF. Comparing good RDF to bad SQL is not interesting.
The earlier articles on the Star Schema Benchmark (SSB), "Annuit Coeptis, or, Star Schema and The Cost of Freedom" and "E Pluribus Unum, or, Star Schema Meets Cluster", demonstrated how the most basic analytical database operations perform in Virtuoso. All the techniques used there are also directly applicable to TPC-H, but the latter adds a good 20 more tricks one needs to see through.
Future installments will discuss TPC-H query by query. We will conclude with a full run of OSDL-DBT-3. DBT-3™ is an unofficial TPC-H: it has the same workload but is not audited.
In Hoc Signo Vinces Series