Virtuoso Anytime: No Query Is Too Complex (updated)

A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL endpoints, should the linked data web gain serious mass and traffic.

As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think about what we will do if, for example, somebody decides to count all the triples in the set.

How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?

Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Simply returning a timeout after the first second (or any other arbitrary period) leaves people in the dark and does not give an impression of responsiveness. So we decided to allow arbitrary queries: if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.

Here we are looking for the top 10 people whom others claim to know without being known in return. The same query is run twice in succession:

SQL> sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;

celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________

http://twitter.com/BarackObama             252
http://twitter.com/brianshaler             183
http://twitter.com/newmediajim             101
http://twitter.com/HenryRollins            95
http://twitter.com/wilw                    81
http://twitter.com/stevegarfield           78
http://twitter.com/cote                    66
mailto:adam.westerski@deri.org             66
mailto:michal.zaremba@deri.org             66
http://twitter.com/dsifry                  65

*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:      1R rnd      0R seq      0P disk  1.346KB /      3 messages

SQL> sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;

celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________

http://twitter.com/JasonCalacanis          496
http://twitter.com/Twitterrific            466
http://twitter.com/ev                      442
http://twitter.com/BarackObama             356
http://twitter.com/laughingsquid           317
http://twitter.com/gruber                  294
http://twitter.com/chrispirillo            259
http://twitter.com/ambermacarthur          224
http://twitter.com/t                       219
http://twitter.com/johnedwards             188

*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:    329R rnd   44.6KR seq    342P disk  638.4KB /     46 messages

The first run read all its data from disk; the second run had the working set left by the first and could read some more before time ran out, so its results were better. The response time was the same in both cases.

If one has a query that just loops over consecutive joins, as in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. Giving meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query or aggregation at the first timeout, then extend the timeout a bit so that the accumulated results get fed to the next aggregation, for example from the GROUP BY to the ORDER BY. If this times out again, we continue with the next outer layer. This guarantees that results are delivered whenever any matches for the query pattern were found before the cutoff. False results are not produced, except where the query compares against a count and the partial count is smaller than it would be under full evaluation.
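The layered cutoff described above can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Virtuoso's implementation: the function name `anytime_top_k`, the `timeout` and `grace` parameters, and the in-memory list of (claimant, celeb) pairs are all hypothetical, and the clock is injectable only so that the behavior can be replayed deterministically. The inner stage (scan, filter, GROUP BY) stops at the first deadline; the outer stage (ORDER BY, LIMIT) then gets a short extension to process whatever groups were accumulated.

```python
import time
from collections import Counter


def anytime_top_k(pairs, k, timeout, grace, clock=time.monotonic):
    """Toy model of a layered cutoff (hypothetical, not Virtuoso's code).

    pairs:   (claimant, celeb) edges standing in for foaf:knows triples
    timeout: time allowed for the innermost stage (scan + GROUP BY)
    grace:   extra time granted to the outer stage (ORDER BY + LIMIT)
    Returns (rows, complete, scanned).
    """
    pairs = list(pairs)
    known = set(pairs)  # stands in for the index a real engine would probe
    counts = Counter()
    scanned = 0
    complete = True

    # Innermost stage: scan, filter, and aggregate until the deadline.
    deadline = clock() + timeout
    for claimant, celeb in pairs:
        if clock() >= deadline:
            complete = False  # stop scanning, keep the groups built so far
            break
        scanned += 1
        if (celeb, claimant) not in known:  # celeb does not know claimant back
            counts[celeb] += 1

    # Outer stage: extend the deadline so the accumulated groups get ordered.
    deadline = clock() + grace
    top = []
    for celeb, n in counts.items():
        if clock() >= deadline:
            complete = False
            break
        top.append((celeb, n))
        top.sort(key=lambda row: -row[1])
        del top[k:]  # keep only the best k seen so far
    return top, complete, scanned
```

In this sketch every row returned is a genuine match, but a partial count can only be lower than or equal to the count a full evaluation would give, which mirrors the caveat about comparisons against counts.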

One can also use this as a basis for paid services. The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.
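A cutoff in units other than time might look like the following minimal sketch. The `WorkBudget` class and `count_with_budget` function are hypothetical names introduced for illustration; which units Virtuoso actually meters is not stated here, so the sketch simply charges one abstract unit per row.

```python
class WorkBudget:
    """Hypothetical quota counted in abstract work units rather than seconds."""

    def __init__(self, units):
        self.remaining = units
        self.spent = 0

    def charge(self, cost=1):
        """Spend `cost` units; return False if the quota cannot cover it."""
        if cost > self.remaining:
            return False
        self.remaining -= cost
        self.spent += cost
        return True


def count_with_budget(rows, budget):
    """Count rows until the budget runs out.

    Returns (count, complete); complete is False for a partial count.
    """
    n = 0
    for _ in rows:
        if not budget.charge():
            return n, False
        n += 1
    return n, True
```

Because the cutoff in this model depends only on the units charged, two runs of the same query stop at the same point regardless of server load, and `budget.spent` gives a natural figure to report or bill.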

This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single-server and cluster deployments.

Orri Erling's Weblog
