We regularly encounter user scenarios where there is a need for selective access to RDF and other graph-model data. In a schema-less situation, the graph is the natural unit of access control. A common design pattern is to put all the triples that represent one logical entity, possibly including immediately dependent entities, into the same graph. The application will then usually replace the whole graph when some part of it changes.
In this usage pattern, the graph is akin to a relational row, possibly together with the dependent rows of one-to-many tables; for example, an order and its order lines would make a single graph. In a publishing setting, what goes into a graph is whatever is approved for publication as a unit. So, a graph is conceptually somewhere between a row and a document.
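As a minimal sketch of that update pattern -- with made-up IRIs and an assumed ex: vocabulary, not anything from a real application -- replacing the whole graph that holds an order could look like this in SPARQL 1.1 Update:

PREFIX ex: <http://example.com/ns#>

# drop and rewrite the graph holding order 123 and its order lines
CLEAR SILENT GRAPH <http://example.com/orders/123> ;
INSERT DATA
  {
    GRAPH <http://example.com/orders/123>
      {
        <http://example.com/orders/123>        a            ex:Order ;
                                               ex:customer  <http://example.com/customers/42> .
        <http://example.com/orders/123#line1>  ex:order     <http://example.com/orders/123> ;
                                               ex:product   <http://example.com/products/7> ;
                                               ex:quantity  2 .
      }
  }

Clearing and re-inserting in a single request keeps the graph as the atomic unit of update, which is precisely what makes it a workable unit of access control.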
The access control situations further fall into two principal types.
- In a hosted application setting, the user sees only his own data; most of the database is invisible to him.
- In a data publishing setting, there will typically be a large quantity of data visible to all, plus some premium content visible only to users who pay for that specific content.
OpenPHACTS, for example, falls into the latter category, though there the matter is not so much charging for premium content as keeping proprietary data separate from public data.
The graph is thus a combined provenance and security label. We sometimes hear about applications that would want quints (a quad with an extra field) for separating these concerns, but so far no actionable need has materialized; one could do quints if necessary. We have also, in times past, talked to Systap, the vendor of the Bigdata® RDF store, about doing reification right, which would resolve many of these things in one go.
So anyway, it is time to do graph-level security right. The technical approaches depend on how many grantable objects one has, and how many distinct security principals ("grantees") there are. If the separately grantable objects are numerous (e.g., a billion distinct graphs), and if the typical principal has access to most, it makes sense to enumerate for each principal the items that are denied instead of enumerating the granted ones.
There is a related case of scoping queries to a variable set of graphs. This corresponds to cases like "give me information from sources with an over-average reliability rating." There is also the security-related special case of scoping to non-classified material. This has the special feature that one part of the query must know what the classified material is, so as to exclude it. However, since from the user's viewpoint the classified material does not exist, the query must run at two different access-control levels and must prevent leakage between the two. This is routinely done with SQL views and SQL policy functions, so there is nothing new here; it is just not something the RDF world has thought much about.
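The variable-scope case can at least be written down in plain SPARQL. In the sketch below, the idea of describing each graph's reliability with an ex:reliability triple in the default graph is an assumption made purely for illustration:

PREFIX ex: <http://example.com/ns#>

SELECT ?s ?p ?o
WHERE
  {
    # compute the average reliability once, in a subquery
    { SELECT (AVG(?r) AS ?avgRel) WHERE { ?anyG ex:reliability ?r } }
    # keep only graphs whose rating beats the average
    ?g ex:reliability ?rel .
    FILTER ( ?rel > ?avgRel )
    GRAPH ?g { ?s ?p ?o }
  }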
In the hosted-application scenario, there is a slowly changing, relatively small set of graphs in scope for a user session. Queries will only consider data where the graph is in this set. The set is typically different for each principal. The distinct end users are not extremely numerous; maybe in the thousands.
In a publishing setting, there is a small set of restricted content and a large number of principals; however, many principals will have exactly the same exclusion list (e.g., no special access; paid access to content class A, to class B, or to both). The number of access compartments is small relative to the number of principals, and many independent principals will share the same set of premium content. As a result, the number of distinct exclusion lists will not be extremely large, even if the lists themselves may be mid-size.
How does one build this into the database? This is done with a selective hash join for the granted lists, and a hash anti-join (NOT EXISTS operator) for the exclusion lists. From the TPC-H series, we recall that there is an invisible hash join operation that can be merged into a table scan or index lookup. This is the ideal building block for this task, except that it needs to be extended to also have a negative form. The negative form also occurs in TPC-H -- for example, in Q16, where one considers only stock from suppliers with no customer complaints.
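At the query level, the exclusion side is the familiar anti-join shape. In SPARQL, with a made-up sec:deniedTo property standing in for however the exclusion list is really represented:

PREFIX sec: <http://example.com/security#>

SELECT ?s ?p ?o
WHERE
  {
    GRAPH ?g { ?s ?p ?o }
    # anti-join: drop bindings whose graph is on this session's exclusion list
    FILTER NOT EXISTS { ?g sec:deniedTo <http://example.com/users/u1> }
  }

The point of the engine-side work is that this NOT EXISTS gets evaluated as the invisible hash anti-join merged into the scan, not as a separate join after the fact.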
There are a few distinct cases of application behavior where the cost of enforcing graph-level access will be different. For example, in OpenPHACTS, a query is often a union/join between graph patterns where the graph is fixed. In this way, each triple pattern has exactly one named graph where it can be matched, but many named graphs are mentioned in the query. In this case, the access check has no cost, or the cost is a vanishingly small constant. In the case of the hosted application, where the named graph corresponds to a unit of update ("row" or "document," as discussed above), the query typically does not specify a named graph, so each triple pattern can match anywhere within the visible graphs. There the check is enforced at each triple pattern, but this is always against the same list (i.e., hash table), and the hash table will most often fit in CPU cache. Most of the accesses will find a match in the hash table.
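A fixed-graph query of this kind might look like the following (the graph IRIs and predicates are invented for the example). Each triple pattern can match in exactly one named graph, so each graph's visibility is decided once, when the query is compiled:

PREFIX ex: <http://example.com/ns#>

SELECT ?compound ?target
WHERE
  {
    GRAPH <http://example.com/graphs/assays>    { ?compound ex:activeAgainst ?target }
    GRAPH <http://example.com/graphs/approvals> { ?compound ex:approved true }
  }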
In the publishing case, where relatively small premium content is to be excluded, the named graph will also generally not be specified; but it can happen that some parts of the query are required to match within the same named graph -- i.e., there is a pattern like
graph ?g
  {
    ?x  xx:about  <stuff> .
    ?x  xx:date   ?dt .
    filter ( ?dt > ... )
  }
Here, the named graph needs to be checked only once for the two patterns. We also note that since most of the content is not restricted, the check against the restriction will nearly always fail -- i.e., the hash lookup will miss. Missing is cheaper than hitting, as mentioned in the TPC-H series. One can use a Bloom filter for low-cost miss detection. So it follows that an exclusion list that seldom hits can be maybe 10x bigger than a near-always-hitting inclusion list at the same cost of checking.
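A back-of-envelope cost model -- my own, with assumed symbols, not a measurement -- makes this concrete. Let $c_b$ be the cost of a Bloom probe, $c_h$ the cost of a hash-table probe, $p$ the true hit rate, and $f$ the Bloom false-positive rate. Then

$$ E[\textrm{cost per check}] \;\approx\; c_b + (p + f)\,c_h $$

For a near-always-hitting inclusion list, $p \approx 1$, so nearly every check pays $c_b + c_h$; for a rarely-hitting exclusion list, $p \approx 0$, so nearly every check stops at the Bloom filter and pays only $c_b + f\,c_h$. Since the big hash table is almost never touched, it is what buys the headroom for the roughly 10x larger table at the same expected cost per check.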
So, if inclusion lists are in the 100,000-entry range and exclusion lists in the 1,000,000-entry range, we are set. What if this is not so? In that case, we will rely on two-level tricks and on encoding application information into supposedly meaning-free identifiers. This is not sem-head material but is well within the domain of database science. For example, if the graph is a marker of provenance or update locality, then all graphs of the same owner or security classification may have an identifier in a certain range or ranges. Then we may check only the range, i.e., omit the low few bits from the hash lookup, knowing that any graph ID in the range will have the same security behavior. This does introduce some extra burden on the application, and will add some special cases when moving assets between security principals, but then life is like that. If there is a big-ticket application that depends on this, then such things can be provided as custom development.
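As a sketch of the encoding trick -- the concrete bit layout here is an assumption for illustration -- suppose the low $k$ bits of a graph ID $g$ are a meaning-free serial number and the bits above them identify the owner or security class $o$. Then

$$ o \cdot 2^k \;\le\; g \;<\; (o + 1) \cdot 2^k \quad\Longleftrightarrow\quad \lfloor g / 2^k \rfloor = o $$

so the per-triple check hashes $\lfloor g / 2^k \rfloor$ -- that is, $g$ shifted right by $k$ bits -- and a single hash entry covers $2^k$ graphs.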
I will here refer you to David Karger's immortal insight at the Semantic Big Data panel at ESWC 2013, where he said that "big data is very much about performance, and performance is usually obtained by sacrificing the general for the specific." So it is.
So far, I have given a summary in which the many people with whom this matter has been discussed may recognize their specific case. Next I will show some experiments. We use a copy of the URIBurner dataset, as this is a collection with large numbers of tiny graphs that can be assigned different security attributes.
To be continued.