A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL endpoints, should the linked data web gain serious mass and traffic.
As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we'll do if, for example, somebody decides to count all the triples in the set.
How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?
Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.
As an example, here we look for the top 10 people whom others claim to know without being known in return. The same query is run twice in succession:
SQL> sparql
     SELECT ?celeb,
            COUNT (*)
     WHERE { ?claimant foaf:knows ?celeb .
             FILTER (!bif:exists ( SELECT (1)
                                   WHERE { ?celeb foaf:knows ?claimant }
                                 )
                    )
           }
     GROUP BY ?celeb
     ORDER BY DESC 2
     LIMIT 10;
celeb                                     callret-1
VARCHAR                                   VARCHAR
________________________________________  _________
http://twitter.com/BarackObama            252
http://twitter.com/brianshaler            183
http://twitter.com/newmediajim            101
http://twitter.com/HenryRollins           95
http://twitter.com/wilw                   81
http://twitter.com/stevegarfield          78
http://twitter.com/cote                   66
mailto:adam.westerski@deri.org            66
mailto:michal.zaremba@deri.org            66
http://twitter.com/dsifry                 65
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete
results, query interrupted by result timeout.
Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages
SQL> sparql
     SELECT ?celeb,
            COUNT (*)
     WHERE { ?claimant foaf:knows ?celeb .
             FILTER (!bif:exists ( SELECT (1)
                                   WHERE { ?celeb foaf:knows ?claimant }
                                 )
                    )
           }
     GROUP BY ?celeb
     ORDER BY DESC 2
     LIMIT 10;
celeb                                     callret-1
VARCHAR                                   VARCHAR
________________________________________  _________
http://twitter.com/JasonCalacanis         496
http://twitter.com/Twitterrific           466
http://twitter.com/ev                     442
http://twitter.com/BarackObama            356
http://twitter.com/laughingsquid          317
http://twitter.com/gruber                 294
http://twitter.com/chrispirillo           259
http://twitter.com/ambermacarthur         224
http://twitter.com/t                      219
http://twitter.com/johnedwards            188
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete
results, query interrupted by result timeout.
Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages
The first run read all of its data from disk; the second run found the working set left by the first already in memory and so could read some more before time ran out, hence the better results. The response time was the same in both cases.
If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation.
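The layered timeout can be illustrated with a small Python sketch. This is not Virtuoso's implementation, only a toy model of the idea under stated assumptions: the names `Deadline` and `top_celebs`, the `grace` parameter, and the in-memory list of `(claimant, celeb)` pairs are all hypothetical. The innermost loop (the join feeding the GROUP BY) stops at the first timeout, the deadline is extended, and whatever groups have accumulated are passed on to the ORDER BY and LIMIT step.

```python
import time
from collections import Counter


class Deadline:
    """Deadline measured against an injectable clock (hypothetical helper)."""

    def __init__(self, clock, seconds):
        self.clock = clock
        self.expires = clock() + seconds

    def expired(self):
        return self.clock() >= self.expires

    def extend(self, seconds):
        # Grant a fresh grace period starting from now.
        self.expires = self.clock() + seconds


def top_celebs(knows, clock=time.monotonic, timeout=1.0, grace=0.2, limit=10):
    """Return (rows, complete) for 'known by others but not knowing them back'."""
    edges = set(knows)  # stands in for an index on foaf:knows
    deadline = Deadline(clock, timeout)
    counts = Counter()
    complete = True

    # Innermost layer: the join loop that feeds the GROUP BY.
    for claimant, celeb in knows:
        if deadline.expired():
            complete = False
            break
        if (celeb, claimant) not in edges:  # FILTER (!bif:exists ...)
            counts[celeb] += 1

    # First timeout: extend a bit so accumulated groups reach the next layer.
    if not complete:
        deadline.extend(grace)

    # Next outer layer: ORDER BY consumes groups until it, too, times out.
    consumed = []
    for group in counts.items():
        if deadline.expired():
            complete = False
            break
        consumed.append(group)

    rows = sorted(consumed, key=lambda kv: kv[1], reverse=True)[:limit]
    return rows, complete


if __name__ == "__main__":
    sample = [("a", "x"), ("b", "x"), ("c", "y"), ("x", "c"), ("y", "c")]
    print(top_celebs(sample))
```

In this sketch every returned row satisfies the query pattern, but a count in an interrupted run is only a lower bound on the count a full evaluation would produce, which mirrors the caveat about comparisons with counts above.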
One can also use this as a basis for paid services. The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.
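As a rough illustration of a cutoff expressed in units other than time, the following Python sketch meters a scan against a fixed number of abstract work units. The `WorkBudget` class, the `metered` generator, and the cost of one unit per row are assumptions made for the example, not part of Virtuoso. Because the quota counts work rather than seconds, the same query with the same budget stops at the same point regardless of server load or cache state.

```python
class WorkBudget:
    """Hypothetical quota counted in abstract work units instead of seconds."""

    def __init__(self, units):
        self.limit = units
        self.used = 0
        self.exhausted = False

    def charge(self, cost=1):
        # Refuse the charge (and remember it) once the quota would be exceeded.
        if self.used + cost > self.limit:
            self.exhausted = True
            return False
        self.used += cost
        return True


def metered(rows, budget, cost=1):
    """Yield rows until the budget runs out, charging `cost` units per row."""
    for row in rows:
        if not budget.charge(cost):
            return
        yield row


# Toy data: 1000 made-up triples, counted with a budget of 250 units.
triples = [("s%d" % i, "p", "o") for i in range(1000)]
budget = WorkBudget(units=250)
partial_count = sum(1 for _ in metered(triples, budget))
print(partial_count, budget.exhausted, budget.used)  # prints: 250 True 250
```

A paid service could, under the same assumptions, sell budgets of different sizes and report `used` back to the client alongside the partial result.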
This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.