Query Plan Cache and Lookups

Performance of RDF and other graph workloads can easily be limited by query compilation time. Compilation time tends to grow as more complex optimizations are added, so simply streamlining compilation is not the answer. All applications we have come across make queries by instantiating templates, plugging in different literals and different combinations of predefined search conditions. The DBMS thus sees only a few hundred distinct queries, but each may arrive with different literals every time.
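To make the template idea concrete, here is a minimal Python sketch of how differently instantiated query texts can be reduced to one shared shape plus a list of constants. This is an illustration only, not how Virtuoso does it internally: the shortened template, the `split_constants` helper, the `%L` placeholder, and the second target identifier are all assumptions made for the example, and the regular expression is far cruder than a real SPARQL tokenizer.

```python
import re
from string import Template

# Hypothetical, heavily shortened application-side query template.
QUERY_TEMPLATE = Template(
    "SELECT DISTINCT ?item WHERE { "
    "VALUES ?chembl_target_uri "
    "{ <http://rdf.ebi.ac.uk/resource/chembl/target/$target> } "
    "GRAPH <http://www.ebi.ac.uk/chembl> { "
    "?assay_uri chembl:hasTarget ?chembl_target_uri . "
    "?assay_uri chembl:hasActivity ?item } } "
    "ORDER BY ?item LIMIT $limit"
)

# Rough approximation of "a constant": IRI, quoted string, or number.
_CONSTANT = re.compile(
    r'<[^<>\s]*>'            # IRI reference
    r'|"(?:[^"\\]|\\.)*"'    # double-quoted string
    r'|\b\d+(?:\.\d+)?\b'    # integer or decimal
)

def split_constants(query_text):
    """Return (shape, constants): the text with every constant replaced
    by the placeholder %L, and the constants in order of appearance."""
    constants = []

    def _swap(match):
        constants.append(match.group(0))
        return "%L"

    shape = _CONSTANT.sub(_swap, query_text)
    return shape, tuple(constants)

# Two instantiations of the same template; the second ID is arbitrary.
q1 = QUERY_TEMPLATE.substitute(target="CHEMBL5451", limit=10)
q2 = QUERY_TEMPLATE.substitute(target="CHEMBL240", limit=25)

shape1, consts1 = split_constants(q1)
shape2, consts2 = split_constants(q2)
```

The two query texts differ, but they collapse to the same shape, which is the kind of key a plan cache could be indexed by; only the constants vary between calls.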

So reusing query plans between invocations is a natural optimization. This works especially well with lookup workloads, or when the data is small. Analytics tends to be dominated by run time, but lookups that touch at most a million or so triples will often be bound by compilation time, especially if these have tens of triple patterns.

Let's consider the following from Open PHACTS:

sparql 
PREFIX   chembl:  <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX  dcterms:  <http://purl.org/dc/terms/>
PREFIX     skos:  <http://www.w3.org/2004/02/skos/core#>
PREFIX      xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX      owl:  <http://www.w3.org/2002/07/owl#>
PREFIX     bibo:  <http://purl.org/ontology/bibo/>
PREFIX  cheminf:  <http://semanticscience.org/resource/>
PREFIX      obo:  <http://purl.obolibrary.org/obo/>
PREFIX     qudt:  <http://qudt.org/1.1/schema/qudt#>

SELECT DISTINCT ?item 
  WHERE
    { VALUES ?chembl_target_uri 
               { <http://rdf.ebi.ac.uk/resource/chembl/target/CHEMBL5451> } 
      GRAPH <http://www.ebi.ac.uk/chembl> 
        {
                     ?assay_uri          chembl:hasTarget           ?chembl_target_uri   .
                     ?assay_uri          chembl:hasActivity         ?item                .
                     ?item               chembl:hasMolecule         ?compound_chembl     .
                     ?chembl_target_uri  a                          ?target_type         .
          OPTIONAL { ?chembl_target_uri  dcterms:title              ?target_name_chembl  }
          OPTIONAL { ?chembl_target_uri  chembl:organismName        ?target_organism     }
          OPTIONAL { ?chembl_target_uri  chembl:hasTargetComponent  ?protein             .
                     GRAPH <http://www.conceptwiki.org> 
                             {
                               ?cw_target  skos:exactMatch  ?protein
                                        ;  skos:prefLabel   ?protein_name
                             }
                   }
          OPTIONAL { ?assay_uri  chembl:organismName       ?assay_organism               }
          OPTIONAL { ?assay_uri  dcterms:description       ?assay_description            }
          OPTIONAL { ?assay_uri  chembl:assayTestType      ?assay_type                   }
          OPTIONAL { ?item       chembl:publishedType      ?published_type               }
          OPTIONAL { ?item       chembl:publishedRelation  ?published_relation           }
          OPTIONAL { ?item       chembl:publishedValue     ?published_value              }
          OPTIONAL { ?item       chembl:publishedUnits     ?published_unit               }
          OPTIONAL { ?item       chembl:standardType       ?activity_type                }
          OPTIONAL { ?item       chembl:standardRelation   ?activity_relation            }
          OPTIONAL { ?item       chembl:standardValue      ?standard_value               .
                     BIND ( xsd:decimal( ?standard_value ) AS ?activity_value )
                   }
          OPTIONAL { ?item       chembl:standardUnits      ?activity_unit                }
          OPTIONAL { ?item       chembl:hasQUDT            ?qudt_uri                     }
          OPTIONAL { ?item       chembl:pChembl            ?pChembl                      }
          OPTIONAL { ?item       chembl:activityComment    ?act_comment                  }
          OPTIONAL { ?item       chembl:hasDocument        ?doc_uri                      .
                     OPTIONAL { ?doc_uri  owl:sameAs  ?doi  }
                     OPTIONAL { ?doc_uri  bibo:pmid   ?pmid }
                   }
        }
       GRAPH <http://ops.rsc.org> 
        {
                     ?compound_ocrs  skos:exactMatch           ?compound_chembl          .
                     ?compound_ocrs  cheminf:CHEMINF_000396    ?inchi
                                  ;  cheminf:CHEMINF_000399    ?inchi_key
                                  ;  cheminf:CHEMINF_000018    ?smiles                   .
          OPTIONAL { []              obo:IAO_0000136           ?compound_ocrs
                                  ;  a                         cheminf:CHEMINF_000484
                                  ;  qudt:numericValue         ?molweight                . 
                   }
          OPTIONAL { []              obo:IAO_0000136           ?compound_ocrs
                                  ;  a                         cheminf:CHEMINF_000367
                                  ;  qudt:numericValue         ?num_ro5_violations       . 
                   }
        }
                     ?compound_cw    skos:exactMatch           ?compound_ocrs
                                ;    skos:prefLabel            ?compound_name
    } 
  ORDER BY  ?item  
  LIMIT     10 
  OFFSET    0
;

This is quite typical. More complex ones have been seen, with many unions on top.

We run this with profile on a warm cache, with no plan reuse in effect. The database is Open PHACTS OPS from this January.

profile ('sparql prefix ....'); 

...
 15 msec 4% cpu,     15688 rnd      9547 seq   91.8669% same seg   4.62107% same pg 
Compilation: 313 msec 0 reads         0% read 0 messages         0% clw

The compilation time is over 20x longer than the execution. We see from the top line that the execution did 15K random lookups and retrieved 9K rows sequentially.

We enable plan reuse and rerun:

 14 msec 58% cpu,     15688 rnd      9547 seq   91.8669% same seg   4.62107% same pg 
Compilation: 0 msec 0 reads         0% read 0 messages         0% clw

The compile time is now gone. This is an especially large win. With a set of 31 queries from Open PHACTS, each repeated over many different parameter bindings, query caching gives an overall speedup of 1.5x. More details may be published after VU Amsterdam, which does the data management for Open PHACTS, publishes the benchmark data and queries. The present figures are an order of magnitude better than the figures from last fall, which will also be in the publication.
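For the single query profiled above, the arithmetic can be spelled out as follows. The numbers are taken directly from the two profile outputs; the variable names are ours, and the result applies to this one query only, not to the 31-query set.

```python
# Timings in milliseconds, read off the two profile outputs above.
exec_no_reuse, compile_no_reuse = 15, 313
exec_reuse, compile_reuse = 14, 0

total_no_reuse = exec_no_reuse + compile_no_reuse   # 328 ms per call
total_reuse = exec_reuse + compile_reuse            # 14 ms per call

compile_to_exec = compile_no_reuse / exec_no_reuse  # compile vs. run time
compile_share = compile_no_reuse / total_no_reuse   # fraction spent compiling
speedup = total_no_reuse / total_reuse              # per-call gain from reuse

print(f"compile/exec ratio: {compile_to_exec:.1f}x")
print(f"share of time spent compiling: {compile_share:.1%}")
print(f"speedup with plan reuse: {speedup:.1f}x")
```

Without reuse, roughly 95% of the elapsed time for this query goes to compilation, which is why removing it makes the call a little over 23 times faster.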

With query plan caching, the same plan will be reused as long as the literals in the new query have approximately the same selectivity as the ones which were present when the plan was first made. In this way, if a different plan is in fact needed, one will be made. The same query text can have many alternative plans for different selectivities of search conditions.

In this way, plan reuse may work better than prepared statements. In any case, prepared statements do not exist in the SPARQL query language. They do exist in SQL, but there the optimizer does not know the values the parameters will have.

The overhead of plan reuse, as opposed to parameterized prepared statements, is relatively low. The cache remembers the sampling that was done when the plan was first made. The same samples are taken with the new literals plugged in. If the cardinalities are within a settable percentage of the original (e.g., 20%), the plan is assumed to be applicable. On the other hand, with prepared parameterized statements, there is no sampling at all, but then the plan might be worse due to less information being available to the optimizer.
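The reuse test described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Virtuoso's implementation: `PlanCache`, `CachedPlan`, `get_plan`, the callback signatures, and all cardinality figures are hypothetical, and a compiled plan is stood in for by a string.

```python
from dataclasses import dataclass

@dataclass
class CachedPlan:
    plan: str                 # stand-in for a compiled plan
    cardinalities: tuple      # samples taken when the plan was compiled

class PlanCache:
    def __init__(self, tolerance=0.20):
        self.tolerance = tolerance   # the "settable percentage"
        self._plans = {}             # query shape -> list of CachedPlan
        self.compilations = 0

    def _applicable(self, original, fresh):
        # Every fresh sample must lie within tolerance of the original.
        return len(original) == len(fresh) and all(
            abs(new - old) <= self.tolerance * old
            for old, new in zip(original, fresh)
        )

    def get_plan(self, shape, constants, sample, compile_plan):
        # Repeat the remembered sampling with the new literals plugged in.
        fresh = tuple(sample(constants))
        for entry in self._plans.get(shape, []):
            if self._applicable(entry.cardinalities, fresh):
                return entry.plan
        # No stored plan fits these selectivities: compile another one.
        self.compilations += 1
        plan = compile_plan(shape, constants, fresh)
        self._plans.setdefault(shape, []).append(CachedPlan(plan, fresh))
        return plan

# Made-up cardinalities for three hypothetical targets.
FAKE_SAMPLES = {"T1": (9500,), "T2": (10000,), "T3": (2_000_000,)}

def sample(constants):
    return FAKE_SAMPLES[constants[0]]

cache = PlanCache(tolerance=0.20)

def compile_plan(shape, constants, cardinalities):
    return f"plan#{cache.compilations}"

shape = "activities-by-target"   # query text with literals removed
p1 = cache.get_plan(shape, ("T1",), sample, compile_plan)
p2 = cache.get_plan(shape, ("T2",), sample, compile_plan)  # within 20%
p3 = cache.get_plan(shape, ("T3",), sample, compile_plan)  # far off
p4 = cache.get_plan(shape, ("T3",), sample, compile_plan)  # reuses p3
```

In this sketch one query shape ends up with two plans, one per selectivity range, and only two of the four calls pay for compilation. The sampling is still done on every call, which is the overhead that prepared statements avoid at the cost of a less informed optimizer.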

Publishing is another type of workload where compile times easily form a large percentage of the total. The queries are shorter than in the biology case, since the modeling tends to be simpler and there are fewer distinct sources being queried. The frequency of the queries is higher though, and each might touch only some tens of triples.

The query caching feature will be included in forthcoming Virtuoso updates, and will require neither operator intervention nor changes to configurations or applications. The feature will be controlled by a few settings in the configuration file, but the defaults will work for almost all cases.

Orri Erling's Weblog
