Virtuoso has an extensive collection of RDF-izers called Sponger Cartridges. These take a web resource in one of 30+ formats (so far) and extract RDF from it. The Virtuoso Sponger is a mechanism that evaluates a query, finds dereferenceable links along the way, dereferences them, and iteratively re-evaluates the query until either nothing new is found or some limit is reached.
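
In outline, the loop is simple. The Python sketch below is only an illustration of the idea, not the Virtuoso implementation; the evaluate and dereference callables, the is_iri test, and the round limit are all assumptions of the sketch.

def sponge(query, evaluate, dereference, max_rounds=10):
    # evaluate(query) yields result rows; dereference(iri) retrieves the
    # resource and loads any extracted RDF into the store. Both are
    # hypothetical stand-ins for the real machinery.
    def is_iri(value):
        return isinstance(value, str) and value.startswith("http")
    seen, rows = set(), []
    for _ in range(max_rounds):                  # "some limit is reached"
        rows = list(evaluate(query))
        new = {v for row in rows for v in row if is_iri(v)} - seen
        if not new:                              # "nothing new is found"
            break
        for iri in new:
            dereference(iri)                     # may add triples to the store
        seen |= new
    return rows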

We could call this query-driven crawling. The idea is intuitive: what one looks for determines what one finds.

This does, however, raise certain questions pertaining to the nature and ultimate possibility of knowledge, i.e., epistemology.

The process of querying could be said to go from the few to the many, just like the process of harvesting data from the web, the way any search engine does. One follows links or makes joins and thereby increases one's reach.

The difference is that a query has no a priori direction. If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all. Closed world, as it is said. Never mind that the friends would have had a "see also" link to a retrievable document that did have a phone number.
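
To make the example concrete, here is a toy version in Python using rdflib; the IRIs are invented, and any SPARQL processor would behave the same way. The query comes back empty, even though a see-also link to a retrievable document with the phone number is sitting right there in the data.

from rdflib import Graph, URIRef
from rdflib.namespace import FOAF, RDFS

g = Graph()
me, john = URIRef("http://example.org/me"), URIRef("http://example.org/john")
g.add((me, FOAF.knows, john))
# john points at a retrievable document that does contain his phone number,
# but closed-world evaluation never follows the link.
g.add((john, RDFS.seeAlso, URIRef("http://example.org/john.rdf")))

q = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?friend ?phone WHERE {
  <http://example.org/me> foaf:knows ?friend .
  ?friend foaf:phone ?phone .
}
"""
print(list(g.query(q)))   # [] -- a valid, and empty, closed-world answer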

The problem is that the query execution plan determines what dereferenceable material the query can possibly encounter during its execution. What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering anything dereferenceable along the way. Where query and crawl appeared similar, they in fact have opposite goals.

The user generally has no idea of the execution plan; in the general case, the user cannot have one. There are valid reasons, over 40 years old, for leaving query planning to the database. In exceptional situations the user can read or direct the plans, but this is quite tedious and requires understanding that is basically never present.

So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything? This is certainly a desirable goal, entirely in the open-world, distributed spirit of the web.

Let us limit ourselves to queries that have some constants, i.e., IRIs or literals, in the subject or object positions. A SPARQL query is basically a graph: its vertices are variables and constants, and its edges are triple patterns, each labeled by its predicate. For now, we will consider the predicate to always be a constant. From each constant, we can grow a tree, following each edge starting at this constant and descending until we find another constant. An individual tree is not always a spanning tree of the graph, but all the trees collectively span it.

Consider the query

{ <john> knows ?x . <mary> knows ?x . ?x label ?l }.
The starting points are the constants <john> and <mary>. The <john> tree has one child, ?x, which in turn has the children <mary> and ?l. One could notate this as
{ <john> knows ?x . {{ <mary> knows ?x} UNION {?x label ?l}}}
That is, the head comes first and, if it has more than one child, a union listing the children follows, recursively.
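
The construction is mechanical enough to sketch. The Python below is one possible rendering, with triple patterns as plain tuples and variables marked by a leading question mark; these representation choices are the sketch's own, not anything standard. It grows the tree from a starting constant and prints the head-then-UNION notation just described.

from collections import deque

def is_var(term):
    return term.startswith("?")

def build_tree(root, patterns):
    # Breadth-first tree from a constant; each branch stops at the next
    # constant it reaches. Edges may be walked in either direction.
    adj = {}
    for s, p, o in patterns:
        adj.setdefault(s, []).append((o, (s, p, o)))
        adj.setdefault(o, []).append((s, (s, p, o)))
    tree, seen, queue = {}, {root}, deque([root])
    while queue:
        v = queue.popleft()
        for child, pattern in adj.get(v, []):
            if child in seen:
                continue
            seen.add(child)
            tree.setdefault(v, []).append((child, pattern))
            if is_var(child):          # descend through variables only;
                queue.append(child)    # another constant ends the branch
    return tree

def notate(tree, v):
    # The head pattern first; if there is more than one child,
    # a UNION listing them, recursively.
    branches = []
    for child, (s, p, o) in tree.get(v, []):
        head = "%s %s %s" % (s, p, o)
        rest = notate(tree, child)
        branches.append("{ %s . %s }" % (head, rest) if rest else "{ %s }" % head)
    if len(branches) > 1:
        return "{ %s }" % " UNION ".join(branches)
    return branches[0] if branches else ""

patterns = [("<john>", "knows", "?x"),
            ("<mary>", "knows", "?x"),
            ("?x", "label", "?l")]
print(notate(build_tree("<john>", patterns), "<john>"))
# { <john> knows ?x . { { <mary> knows ?x } UNION { ?x label ?l } } }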

If one composed such a query for each constant in the original pattern, evaluated each as a breadth-first walk of the tree, with no query optimization tricks, and recorded for each binding of each variable whether there was something to dereference, one would in finite time have reached all the directly reachable data. Then one could evaluate the original query, using whatever plan was preferred.

The check for dereferenceable data, applied to each IRI-valued binding formed in the above evaluation, would consist of looking for "see also", "same as", and other such properties of the IRI. It could also consult text-based search engines. Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant; i.e., it will not die if it must retrieve a few pages from remote sources. We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on as an exercise; the general idea should be clear enough.
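
As a toy, single-threaded illustration of such a walk (a real implementation would issue the retrievals in parallel, as just noted), the Python below keeps a local set of triples and a dictionary standing in for the web; the data, the seeAlso convention, and all names are inventions of the example. The tree is simplified to a single chain of patterns; every IRI binding is checked for dereferenceable links before the next level is evaluated, so the walk pulls in the label that only exists behind the link.

store = {("<john>", "knows", "<jack>"),
         ("<mary>", "knows", "<jack>"),
         ("<jack>", "seeAlso", "<http://example.org/jack.rdf>")}

web = {"<http://example.org/jack.rdf>": {("<jack>", "label", '"Jack"')}}

def is_var(t): return t.startswith("?")
def is_iri(t): return t.startswith("<")

def matches(pattern, binding):
    # Extend 'binding' with each triple in the store matching 'pattern'.
    for triple in list(store):
        b, ok = dict(binding), True
        for p, v in zip(pattern, triple):
            if is_var(p):
                if b.setdefault(p, v) != v:
                    ok = False
                    break
            elif p != v:
                ok = False
                break
        if ok:
            yield b

def deref(iri):
    # Follow "see also"-style links of an IRI; load anything retrievable.
    for s, p, o in list(store):
        if s == iri and p in ("seeAlso", "sameAs") and o in web:
            store.update(web.pop(o))

def walk(chain):
    # Breadth first, level by level; every new IRI binding is dereferenced
    # before the next level runs, so later patterns see the new triples.
    level = [{}]
    for pattern in chain:
        nxt = []
        for binding in level:
            for b in matches(pattern, binding):
                for v in b.values():
                    if is_iri(v):
                        deref(v)
                nxt.append(b)
        level = nxt
    return level

print(walk([("<john>", "knows", "?x"), ("?x", "label", "?l")]))
# [{'?x': '<jack>', '?l': '"Jack"'}] -- the label arrived via the seeAlso link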

We have shown a way of transforming SPARQL queries so as to guarantee the dereferencing of findable links, without requiring the end user either to explicitly specify or to understand query plans.

The present Sponger does not work exactly in this manner, but it will be developed in this direction. Fortunately, the algorithms outlined above are not complicated.