I have been away from the world for a few weeks, concentrating on technology.

We have now implemented an entirely new storage layout. With RDF data, this effectively doubles the working set.

This means that the number of triples that fit in memory is doubled for any configuration. For any database in the hundreds of millions of triples, this is very significant. For LUBM data, we go from 75 bytes to 35 bytes per triple with the default indices.
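To put the per-triple numbers in perspective, here is a back-of-the-envelope calculation in Python. The 75- and 35-byte figures are the LUBM measurements above; the 16 GB of RAM is an arbitrary example configuration, not one we have benchmarked.

```python
# How many triples fit in memory at the old vs. new per-triple footprint.
# The byte counts are the LUBM figures; 16 GB is just an example amount of RAM.

GB = 2 ** 30

def triples_in_memory(ram_bytes, bytes_per_triple):
    return ram_bytes // bytes_per_triple

for label, bpt in (("75 bytes/triple (v5.0.1)", 75),
                   ("35 bytes/triple (new layout)", 35)):
    print(f"{label}: ~{triples_in_memory(16 * GB, bpt):,} triples in 16 GB")
```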

This is obtained without gzip or any other stream compression, so no decompression is needed at read time. Random access speeds are within 5% of those of Virtuoso v5.0.1, while the space requirement is halved, and a random triple in cache can still be located in a few microseconds.

Better still, when using 8-byte IDs for IRIs instead of 4-byte ones, the space consumption stays almost the same since unique values are stored only once per page.
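To make the principle concrete, here is a minimal sketch, not the actual page format: each distinct IRI ID is kept once on the page and the rows refer to it by a short local index, so widening the IDs from 4 to 8 bytes only grows the short list of distinct values.

```python
# Sketch of per-page deduplication of IRI IDs (illustrative only, not the
# real page layout).  Rows store a small local index into `distinct`, so the
# width of the IDs themselves matters only for the distinct-value list.

def build_page(ids):
    distinct, position, refs = [], {}, []
    for iri_id in ids:
        if iri_id not in position:
            position[iri_id] = len(distinct)
            distinct.append(iri_id)
        refs.append(position[iri_id])
    return distinct, refs

distinct, refs = build_page([1001, 1001, 1001, 2002, 1001, 2002, 3003])
print(distinct)  # [1001, 2002, 3003] -- stored once each, 4 or 8 bytes wide
print(refs)      # [0, 0, 0, 1, 0, 1, 2] -- small per-row references
```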

When applying gzip to the new storage layout, we usually get 3x compression; in practice, 99% of 8K pages fit in 3K after compression. This is no real surprise, since an index is repetitive pretty much by definition, even if the repeated sections are now shorter than in v5.0.1.
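A toy version of this observation is easy to reproduce with zlib on a synthetic, repetitive page; the data below is made up, so the exact ratio will not match what we see on real index pages.

```python
# Compress a synthetic 8K "index page" of sorted, closely spaced 32-bit keys.
# Real pages differ, but repetitive sorted data compresses in the same way.

import struct
import zlib

PAGE_SIZE = 8 * 1024
keys = [100000 + 3 * i for i in range(PAGE_SIZE // 4)]
page = struct.pack(f"{len(keys)}I", *keys)

compressed = zlib.compress(page, 6)
print(f"{len(page)} bytes -> {len(compressed)} bytes "
      f"({len(page) / len(compressed):.1f}x)")
```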

Gzip applied to pages does nothing for the working set, since a page must remain randomly accessible in memory for fast search, but it will cut disk usage to between a half and a third. We will make this an option later. There are other tricks to be done with compression, such as using a separate dictionary for non-key text columns in relational applications. This would improve the working set in TPC-C and TPC-D quite a bit, so we may do this as well while we are on the subject.
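For illustration, dictionary-encoding a repetitive non-key text column looks roughly like the sketch below; the column contents and the 4-byte code size are made up, and this is not a description of our implementation.

```python
# Dictionary encoding of a non-key text column: each distinct string is
# stored once, the column itself holds small integer codes (illustration only).

def dictionary_encode(values):
    dictionary, code_of, codes = [], {}, []
    for v in values:
        if v not in code_of:
            code_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(code_of[v])
    return dictionary, codes

# A warehouse-style column with very few distinct values.
column = ["PROMO BURNISHED BRASS", "STANDARD POLISHED TIN"] * 500

dictionary, codes = dictionary_encode(column)
raw = sum(len(v) for v in column)
encoded = sum(len(v) for v in dictionary) + 4 * len(codes)  # assume 4-byte codes
print(f"raw: {raw} bytes, dictionary-encoded: {encoded} bytes")
```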

Right now we are writing the clustering support, revising all internal APIs to run with batches of rows instead of single rows. We will most likely release clustering and the new storage layout together, towards the end of summer, at least in internal deployments.
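As a rough sketch of what batching buys, assume probes against a sorted in-memory index (everything below, names included, is invented for illustration): when the rows arrive as a sorted batch, each lookup can continue from where the previous one stopped instead of searching from scratch.

```python
# Per-row vs. batched lookups against a sorted in-memory index (illustrative).

from bisect import bisect_left

index_keys = list(range(0, 1_000_000, 7))   # stand-in for an index

def lookup_one(key):
    """Single-row style: a full search per call, one call per row."""
    i = bisect_left(index_keys, key)
    return i < len(index_keys) and index_keys[i] == key

def lookup_batch(keys):
    """Batched style: sort the probes, sweep the index once, and let each
    probe start from the position found for the previous one."""
    hits, pos = [], 0
    for key in sorted(keys):
        pos = bisect_left(index_keys, key, lo=pos)
        hits.append(pos < len(index_keys) and index_keys[pos] == key)
    return hits

print([lookup_one(k) for k in (14, 15, 21)])  # [True, False, True]
print(lookup_batch([14, 15, 21]))             # [True, False, True]
```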

I will blog about results as and when they are obtained, over the next few weeks.