Performance of RDF and other graph workloads may easily be limited by query compilation time. Query compilation time tends to increase as more complex optimizations are added, so simply streamlining compilation is not the answer. All applications we have come across make queries by instantiating templates and plugging in different literals and different combinations of predefined search conditions. The DBMS will see a few hundred distinct query templates, but each may arrive with different literals every time.
So reusing query plans between invocations is a natural optimization. This works especially well with lookup workloads, or when the data is small. Analytics tends to be dominated by run time, but lookups that touch at most a million or so triples will often be bound by compilation time, especially if these have tens of triple patterns.
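To make the idea of a query template concrete, here is a minimal Python sketch of how two query texts that differ only in their constants could be mapped to the same cache key. This is an illustration under assumptions, not a description of how Virtuoso recognizes templates: the function name `split_template`, the `$N` placeholder syntax, and the regex-based approach are all hypothetical, and a real engine would more plausibly work on the parse tree than on raw text.

```python
import re
from typing import List, Tuple

# Hypothetical helper, not part of any Virtuoso API.
# One combined pattern so that digits inside an IRI or a string are
# consumed as part of that IRI or string, never as separate numbers.
_CONSTANT = re.compile(
    r'<[^<>\s]+>'             # full IRI such as <http://example.org/p>
    r'|"(?:[^"\\]|\\.)*"'     # double-quoted string literal
    r'|\b\d+(?:\.\d+)?\b'     # integer or decimal literal
)


def split_template(query: str) -> Tuple[str, List[str]]:
    """Return (template, constants) for a query text.

    Every IRI, string, and number is replaced by a numbered placeholder,
    and whitespace is collapsed, so texts that differ only in constants
    yield the same template string.
    """
    constants: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        constants.append(match.group(0))
        return "$" + str(len(constants))

    template = _CONSTANT.sub(replace, query)
    return " ".join(template.split()), constants


if __name__ == "__main__":
    q1 = 'SELECT ?s WHERE { ?s <http://example.org/p> "a" } LIMIT 10'
    q2 = 'SELECT ?s WHERE { ?s <http://example.org/p> "b" }   LIMIT 20'
    t1, c1 = split_template(q1)
    t2, c2 = split_template(q2)
    print(t1 == t2)   # both texts share one template
    print(c1, c2)     # the constants that were plugged in
```

The sketch treats every constant as a parameter, including predicate and graph IRIs that are really fixed parts of the template. That is harmless for building a key, but a text-level pattern like this can misread a tightly written filter such as `?a<5&&?b>3` as an IRI, which is one reason to prefer a parsed representation in practice.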
Let's consider the following from Open PHACTS:
sparql
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX cheminf: <http://semanticscience.org/resource/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX qudt: <http://qudt.org/1.1/schema/qudt#>
SELECT DISTINCT ?item
WHERE
{ VALUES ?chembl_target_uri
{ <http://rdf.ebi.ac.uk/resource/chembl/target/CHEMBL5451> }
GRAPH <http://www.ebi.ac.uk/chembl>
{
?assay_uri chembl:hasTarget ?chembl_target_uri .
?assay_uri chembl:hasActivity ?item .
?item chembl:hasMolecule ?compound_chembl .
?chembl_target_uri a ?target_type .
OPTIONAL { ?chembl_target_uri dcterms:title ?target_name_chembl }
OPTIONAL { ?chembl_target_uri chembl:organismName ?target_organism }
OPTIONAL { ?chembl_target_uri chembl:hasTargetComponent ?protein .
GRAPH <http://www.conceptwiki.org>
{
?cw_target skos:exactMatch ?protein
; skos:prefLabel ?protein_name
}
}
OPTIONAL { ?assay_uri chembl:organismName ?assay_organism }
OPTIONAL { ?assay_uri dcterms:description ?assay_description }
OPTIONAL { ?assay_uri chembl:assayTestType ?assay_type }
OPTIONAL { ?item chembl:publishedType ?published_type }
OPTIONAL { ?item chembl:publishedRelation ?published_relation }
OPTIONAL { ?item chembl:publishedValue ?published_value }
OPTIONAL { ?item chembl:publishedUnits ?published_unit }
OPTIONAL { ?item chembl:standardType ?activity_type }
OPTIONAL { ?item chembl:standardRelation ?activity_relation }
OPTIONAL { ?item chembl:standardValue ?standard_value .
BIND ( xsd:decimal( ?standard_value ) AS ?activity_value )
}
OPTIONAL { ?item chembl:standardUnits ?activity_unit }
OPTIONAL { ?item chembl:hasQUDT ?qudt_uri }
OPTIONAL { ?item chembl:pChembl ?pChembl }
OPTIONAL { ?item chembl:activityComment ?act_comment }
OPTIONAL { ?item chembl:hasDocument ?doc_uri .
OPTIONAL { ?doc_uri owl:sameAs ?doi }
OPTIONAL { ?doc_uri bibo:pmid ?pmid }
}
}
GRAPH <http://ops.rsc.org>
{
?compound_ocrs skos:exactMatch ?compound_chembl .
?compound_ocrs cheminf:CHEMINF_000396 ?inchi
; cheminf:CHEMINF_000399 ?inchi_key
; cheminf:CHEMINF_000018 ?smiles .
OPTIONAL { [] obo:IAO_0000136 ?compound_ocrs
; a cheminf:CHEMINF_000484
; qudt:numericValue ?molweight .
}
OPTIONAL { [] obo:IAO_0000136 ?compound_ocrs
; a cheminf:CHEMINF_000367
; qudt:numericValue ?num_ro5_violations .
}
}
?compound_cw skos:exactMatch ?compound_ocrs
; skos:prefLabel ?compound_name
}
ORDER BY ?item
LIMIT 10
OFFSET 0
;
This query is quite typical. We have seen more complex ones, with many unions on top.
We run this with profiling on a warm cache, with no plan reuse in effect. The database is Open PHACTS OPS from this January.
profile ('sparql prefix ....');
...
15 msec 4% cpu, 15688 rnd 9547 seq 91.8669% same seg 4.62107% same pg
Compilation: 313 msec 0 reads 0% read 0 messages 0% clw
The compilation time (313 msec) is over 20x longer than the execution time (15 msec). We see from the top line that the execution did about 15.7K random lookups and retrieved about 9.5K rows sequentially.
We enable plan reuse and rerun:
14 msec 58% cpu, 15688 rnd 9547 seq 91.8669% same seg 4.62107% same pg
Compilation: 0 msec 0 reads 0% read 0 messages 0% clw
The compile time is now gone. This query is an especially large win. With a set of 31 queries from Open PHACTS, each repeated over many different parameter bindings, query caching gives a speedup of 1.5x. More details may be published after VU Amsterdam, which does the data management for Open PHACTS, publishes the benchmark data and queries. The present figures are an order of magnitude better than the figures from last fall, which will also be in the publication.
With query plan caching, the same plan will be reused as long as the literals in the new query have approximately the same selectivity as the ones which were present when the plan was first made. In this way, if a different plan is in fact needed, one will be made. The same query text can have many alternative plans for different selectivities of search conditions.
For this reason, plan reuse may work better than prepared statements. In any case, prepared statements do not exist in the SPARQL query language. In SQL they do, but there the optimizer does not know the values the parameters will have when it compiles the statement.
The overhead of plan reuse, compared to parameterized prepared statements, is relatively low. The cache remembers the sampling that was done when the plan was first made. The same samples are taken with the new literals plugged in. If the cardinalities are within a settable percentage (e.g., 20%) of the original, the plan is assumed to be applicable. With prepared parameterized statements, on the other hand, there is no sampling at all, but then the plan might be worse because the optimizer has less information available.
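The applicability check described above can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not Virtuoso's implementation: the class and function names are hypothetical, a plan is represented by a plain string, the tolerance is interpreted as a maximum relative deviation per sample, and the `sample` and `compile_fn` callables stand in for the engine's sampling and optimizer. Several plans may be kept per query template, one per selectivity profile seen so far.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

# All names here are hypothetical stand-ins for engine internals.
Sampler = Callable[[Sequence[str]], List[int]]   # literals -> sampled cardinalities
Compiler = Callable[[str, Sequence[str]], str]   # (template, literals) -> plan


@dataclass
class CachedPlan:
    plan: str                  # stand-in for a compiled query plan
    cardinalities: List[int]   # samples recorded when the plan was compiled


@dataclass
class PlanCache:
    tolerance: float = 0.20    # assumed: allowed relative deviation per sample
    plans: Dict[str, List[CachedPlan]] = field(default_factory=dict)
    compilations: int = 0      # counts how often the optimizer had to run

    def _applicable(self, old: List[int], new: List[int]) -> bool:
        # A cached plan applies if every new sample is close to the old one.
        return len(old) == len(new) and all(
            abs(n - o) <= self.tolerance * max(o, 1) for o, n in zip(old, new)
        )

    def get_plan(self, template: str, literals: Sequence[str],
                 sample: Sampler, compile_fn: Compiler) -> str:
        # Repeat the original sampling with the new literals plugged in.
        observed = sample(literals)
        for cached in self.plans.get(template, []):
            if self._applicable(cached.cardinalities, observed):
                return cached.plan
        # No cached plan fits these selectivities: compile and remember.
        self.compilations += 1
        plan = compile_fn(template, literals)
        self.plans.setdefault(template, []).append(CachedPlan(plan, observed))
        return plan


if __name__ == "__main__":
    # Made-up cardinalities for three made-up parameter values.
    stats = {"a": [1000, 9000], "b": [1100, 8000], "c": [5, 40]}
    cache = PlanCache()
    for value in ["a", "b", "c", "a"]:
        plan = cache.get_plan("T", [value],
                              lambda lits: stats[lits[0]],
                              lambda tmpl, lits: "plan-for-" + lits[0])
        print(value, plan)
    print("compilations:", cache.compilations)
```

In this sketch the sampling cost is paid on every call, while the optimizer runs only when no cached plan matches, which mirrors the trade-off against prepared statements: those skip the sampling but compile without knowing the literal values.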
Publishing is another type of workload where compile times easily form a large percentage of the total. The queries are shorter than in the biology case, since the modeling tends to be simpler and there are fewer distinct sources being queried. The frequency of the queries is higher, though, and each might touch only some tens of triples.
The query caching feature will be included in forthcoming Virtuoso updates, and will not require operator intervention or changes to configurations or applications. The feature will be controlled by a few settings in the configuration file, but the defaults will work for almost all cases.