I have been away from the world for a few weeks, concentrating on technology.

We have now implemented an entirely new storage layout. With RDF data, this effectively doubles the working set.

This means that the number of triples that fit in memory is doubled for any configuration. For any database in the hundreds of millions of triples, this is very significant. For LUBM data, we go from 75 bytes to 35 bytes per triple with the default indices.
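To put the per-triple numbers in perspective, here is a back-of-the-envelope calculation in Python. The 75- and 35-byte figures are the LUBM measurements above; the 16 GB of RAM is an arbitrary example configuration, not one we have benchmarked.

```python
# How many triples fit in memory at the old vs. new per-triple footprint.
# The byte counts are the LUBM figures; 16 GB is just an example amount of RAM.

GB = 2 ** 30

def triples_in_memory(ram_bytes, bytes_per_triple):
    return ram_bytes // bytes_per_triple

for label, bpt in (("75 bytes/triple (v5.0.1)", 75),
                   ("35 bytes/triple (new layout)", 35)):
    print(f"{label}: ~{triples_in_memory(16 * GB, bpt):,} triples in 16 GB")
```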

This is obtained without gzip or any other stream compression, so no decompression is needed at read time. Random access speeds are within 5% of those of Virtuoso v5.0.1, while the space requirement is halved, and a random triple in cache can still be located in a few microseconds.

Better still, when using 8-byte IDs for IRIs instead of 4-byte ones, the space consumption stays almost the same since unique values are stored only once per page.
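To make the principle concrete, here is a minimal sketch, not the actual page format: each distinct IRI ID is kept once on the page and the rows refer to it by a short local index, so widening the IDs from 4 to 8 bytes only grows the short list of distinct values.

```python
# Sketch of per-page deduplication of IRI IDs (illustrative only, not the
# real page layout).  Rows store a small local index into `distinct`, so the
# width of the IDs themselves matters only for the distinct-value list.

def build_page(ids):
    distinct, position, refs = [], {}, []
    for iri_id in ids:
        if iri_id not in position:
            position[iri_id] = len(distinct)
            distinct.append(iri_id)
        refs.append(position[iri_id])
    return distinct, refs

distinct, refs = build_page([1001, 1001, 1001, 2002, 1001, 2002, 3003])
print(distinct)  # [1001, 2002, 3003] -- stored once each, 4 or 8 bytes wide
print(refs)      # [0, 0, 0, 1, 0, 1, 2] -- small per-row references
```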

When applying gzip to the new storage layout, we usually get 3x compression; in practice, 99% of 8K pages fit in 3K after compression. This is no real surprise, since an index is repetitive pretty much by definition, even if the repeated sections are now shorter than in v5.0.1.
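A toy version of this observation is easy to reproduce with zlib on a synthetic, repetitive page; the data below is made up, so the exact ratio will not match what we see on real index pages.

```python
# Compress a synthetic 8K "index page" of sorted, closely spaced 32-bit keys.
# Real pages differ, but repetitive sorted data compresses in the same way.

import struct
import zlib

PAGE_SIZE = 8 * 1024
keys = [100000 + 3 * i for i in range(PAGE_SIZE // 4)]
page = struct.pack(f"{len(keys)}I", *keys)

compressed = zlib.compress(page, 6)
print(f"{len(page)} bytes -> {len(compressed)} bytes "
      f"({len(page) / len(compressed):.1f}x)")
```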

Gzip applied to pages does nothing for the working set, since a page must remain randomly accessible in memory for fast search, but it will cut disk usage to between a half and a third. We will make this an option later. There are other tricks to be done with compression, such as using a separate dictionary for non-key text columns in relational applications. This would improve the working set in TPC-C and TPC-D quite a bit, so we may do this as well while we are on the subject.
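For illustration, dictionary-encoding a repetitive non-key text column looks roughly like the sketch below; the column contents and the 4-byte code size are made up, and this is not a description of our implementation.

```python
# Dictionary encoding of a non-key text column: each distinct string is
# stored once, the column itself holds small integer codes (illustration only).

def dictionary_encode(values):
    dictionary, code_of, codes = [], {}, []
    for v in values:
        if v not in code_of:
            code_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(code_of[v])
    return dictionary, codes

# A warehouse-style column with very few distinct values.
column = ["PROMO BURNISHED BRASS", "STANDARD POLISHED TIN"] * 500

dictionary, codes = dictionary_encode(column)
raw = sum(len(v) for v in column)
encoded = sum(len(v) for v in dictionary) + 4 * len(codes)  # assume 4-byte codes
print(f"raw: {raw} bytes, dictionary-encoded: {encoded} bytes")
```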

Right now we are writing the clustering support, revising all internal APIs to run with batches of rows instead of single rows. We will most likely release clustering and the new storage layout together, towards the end of summer, at least in internal deployments.
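As a rough sketch of what batching buys, assume probes against a sorted in-memory index (everything below, names included, is invented for illustration): when the rows arrive as a sorted batch, each lookup can continue from where the previous one stopped instead of searching from scratch.

```python
# Per-row vs. batched lookups against a sorted in-memory index (illustrative).

from bisect import bisect_left

index_keys = list(range(0, 1_000_000, 7))   # stand-in for an index

def lookup_one(key):
    """Single-row style: a full search per call, one call per row."""
    i = bisect_left(index_keys, key)
    return i < len(index_keys) and index_keys[i] == key

def lookup_batch(keys):
    """Batched style: sort the probes, sweep the index once, and let each
    probe start from the position found for the previous one."""
    hits, pos = [], 0
    for key in sorted(keys):
        pos = bisect_left(index_keys, key, lo=pos)
        hits.append(pos < len(index_keys) and index_keys[pos] == key)
    return hits

print([lookup_one(k) for k in (14, 15, 21)])  # [True, False, True]
print(lookup_batch([14, 15, 21]))             # [True, False, True]
```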

I will blog about results as and when they are obtained, over the next few weeks.