The "file tables" feature first introduced in the TPC-H bulk load article is now available on the v7fasttrack clone of the Virtuoso repository on GitHub.
To check out —
$ git clone https://github.com/v7fasttrack/virtuoso-opensource.git v7fasttrack
The v7fasttrack tree compiles just like the main Virtuoso tree. Its content is substantially the same today, except for the file tables feature. The diff against the main tree now consists mostly of whitespace, since the Fast Track tree is automatically re-indented with the Linux indent utility each time it is updated, while the Virtuoso tree is not.
Ongoing maintenance and previews of new features will be added to this tree as and when they become available.
Let's now look at the "file tables" feature.
You can use any CSV file like a table, as described in the documentation. The TPC-H data generator (complete source ZIP; ZIP of just the dbgen source) is a convenient way to get data for trying things out.
To generate the qualification database, run —
dbgen -s 1
This makes 8 CSV files called *.tbl. You can use these scripts to load them into Virtuoso —
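Purely as an illustration of what such a load boils down to, and not the scripts themselves: assuming a script attaches lineitem.tbl as the file table lineitem_f and declares a matching database table lineitem, the essential step is a plain INSERT ... SELECT from the file table into the database table —
-- sketch only, not the linked script: copy rows from the attached file into the database table
INSERT INTO lineitem
SELECT *
FROM lineitem_f
;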
To verify the load, do —
SELECT COUNT (*)
FROM lineitem_f
;
SELECT COUNT (*)
FROM lineitem
;
To try different combinations of tables and CSV files, you can, for example, do —
SELECT COUNT (*)
FROM lineitem,
part_f
WHERE l_partkey = p_partkey
AND p_name LIKE '%green%'
;
This counts shipments of green parts, using the file as the part table. You can then replace the part_f with part to join against the database. The database will be a little faster, but the file is also pretty fast, since the smaller table (part) is on the build side of a hash join and the scan of lineitem is the same in either case.
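For instance, the all-database version of the same count is —
SELECT COUNT (*)
FROM lineitem,
part
WHERE l_partkey = p_partkey
AND p_name LIKE '%green%'
;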
You can now replace lineitem with lineitem_f, and you will see a larger difference. This is still reasonably fast, since the lineitem file is scanned in parallel.
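The fully file-based combination, for example, reads both inputs straight from the .tbl files —
SELECT COUNT (*)
FROM lineitem_f,
part_f
WHERE l_partkey = p_partkey
AND p_name LIKE '%green%'
;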
You can try the different TPC-H queries against tables and files. To get the perfect plans, you will need the analytics branch, which will be made available shortly via this same GitHub channel.
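As one concrete sketch (not taken from the article), here is TPC-H Q1 run directly against the lineitem file, with the standard 90-day validation parameter already folded into the date literal; it assumes the file table declares the usual TPC-H column names and a date type for l_shipdate —
-- TPC-H Q1 (pricing summary report); 1998-12-01 minus 90 days = 1998-09-02
SELECT l_returnflag,
       l_linestatus,
       SUM (l_quantity) AS sum_qty,
       SUM (l_extendedprice) AS sum_base_price,
       SUM (l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       SUM (l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       AVG (l_quantity) AS avg_qty,
       AVG (l_extendedprice) AS avg_price,
       AVG (l_discount) AS avg_disc,
       COUNT (*) AS count_order
FROM lineitem_f
WHERE l_shipdate <= CAST ('1998-09-02' AS DATE)
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
;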
You can also try RDFizing the files using the scripts in the Enterprise Linked Data article from earlier this year. The qualification database should load in about 15 minutes on a commodity server and produce some 120M triples. In the article, the data came from another server, but it can just as well come from files. These two scripts from that article have been adapted for loading from files —
To try this, execute the following commands in iSQL —
LOAD sql_rdf.sql;
RDF_VIEW_SYNC_TO_PHYSICAL
( 'http://example.com/tpcd'
, 1
,
, "urn:example.com:tpcd"
, 2
, 0
)
;
To verify the result —
sparql
SELECT ?p
COUNT (*)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC 2
;