Posted to users@jena.apache.org by "A. Soroka" <aj...@virginia.edu> on 2016/03/05 19:04:55 UTC

StreamRDF or similar for a TDB bulk load?

I’m writing an ETL-ish utility that extracts triples from some directories of application-specific XML to assemble into a from-scratch TDB database. Of course I want to take advantage of the bulk loader facilities for best results. The TDBLoader methods that I’m looking at all accept InputStreams or URIs from which to get serialized RDF. It happens that I am already using Jena to transform the XML into RDF, so I’ve got actual Jena Triples in hand when I come to the bulk loading apparatus. It seems silly to serialize the triples only for the bulk loader to deserialize them, so I’d like to get at a StreamRDF instance or something similar that I can use to give Triples in a flow directly to the bulk loader, but at a first glance it looks like that’s hidden as BulkLoader.DestinationGraphs.

As additional context, the extraction is easily parallelized, but I do not see any note that the bulk loading is threadsafe, so I had intended to run a couple of threads of extraction loading a queue with a thread feeding the bulk loading gear from that queue.

Am I misunderstanding the action of the bulk loader, and more to the point, what is the most efficient way I can build a from-scratch TDB database from Triples?

Thanks for any help or advice!

---
A. Soroka
The University of Virginia Library

Re: StreamRDF or similar for a TDB bulk load?

Posted by "A. Soroka" <aj...@virginia.edu>.

> On Mar 6, 2016, at 11:36 AM, Andy Seaborne <an...@apache.org> wrote:
> Hi,
> 
> StreamRDF came after BulkLoader, so it might not be fully exposed, though note it uses "BulkStreamRDF", which adds to the StreamRDF contract. Because parsing many files causes a start/finish call pair for each file, there has to be some handling of the overall bulk process, which is what startBulk/finishBulk adds.
> 
> Bulkloading is not thread safe.
> 
> Serializing isn't so bad.  It makes the parallel extraction simple.
> 
> A bonus here is that you are running two processes in parallel and also you can check the data. Checking before a large bulk load is a good idea for a reliable process.
> 
> Realistically, on one general purpose machine, running the extractor process at the same time as the bulk load is going to slow down bulkloading due to I/O interactions (even if separate disks). Write/parse is CPU-dominated and faster than the bulkloader.
> 
>    Andy

Okay, sounds like for the moment, serializing is the thing to do. In that case, I can drive the bulk loader with a PipedInputStream that I feed with N-Triples. I think I still might use the queue because a large enough number of Triple instances will take up less space than their serialization, assuming that they share enough nodes, which is a safe assumption here. I will take a crack at some point at getting an exposure of BulkStreamRDF out of the bulk loader, after everything else I’m supposed to do for Jena is done. [grin] I know that bandwidth will get divided between the "sides" of the process-as-a-whole, but there’s not much I can do about that in the particular circumstances.
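For illustration, the queue-plus-pipe arrangement described above might look roughly like the following. This is only a sketch under stated assumptions, using the JDK alone so that it runs without Jena on the classpath: the class name PipedLoadSketch, the example.org IRIs, and the END marker are hypothetical, the queue carries already-serialized N-Triples lines rather than Triple instances, and consume() is a stand-in that merely counts lines where the real utility would hand the InputStream to one of the TDBLoader methods that accept one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PipedLoadSketch {
    // Hypothetical end-of-work marker; it is not a valid N-Triples line.
    private static final String END = "<<END>>";

    // Stand-in for the bulk loader (assumption): counts the non-empty lines it reads.
    public static long consume(InputStream in) throws IOException {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return r.lines().filter(line -> !line.isEmpty()).count();
        }
    }

    public static long run(int extractors, int triplesEach) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut, 64 * 1024);
        // One thread per extractor, plus the pipe writer and the loader stand-in.
        ExecutorService pool = Executors.newFixedThreadPool(extractors + 2);
        try {
            // Single consumer: the only thread that touches the "loader" side.
            Callable<Long> loaderTask = () -> consume(pipeIn);
            Future<Long> loaded = pool.submit(loaderTask);

            // Single writer: drains the queue into the pipe until every extractor is done.
            Callable<Void> writerTask = () -> {
                try (Writer w = new OutputStreamWriter(pipeOut, StandardCharsets.UTF_8)) {
                    int finished = 0;
                    while (finished < extractors) {
                        String line = queue.take();
                        if (END.equals(line)) {
                            finished++;
                        } else {
                            w.write(line);
                            w.write('\n');
                        }
                    }
                } // closing the writer closes the pipe, which signals end of input
                return null;
            };
            Future<Void> written = pool.submit(writerTask);

            // Parallel extractors: each puts its lines on the queue, then the END marker.
            for (int id = 0; id < extractors; id++) {
                final int extractorId = id;
                Callable<Void> extractorTask = () -> {
                    for (int i = 0; i < triplesEach; i++) {
                        queue.put(String.format(
                            "<http://example.org/s/%d-%d> <http://example.org/p> \"v%d\" .",
                            extractorId, i, i));
                    }
                    queue.put(END);
                    return null;
                };
                pool.submit(extractorTask);
            }

            written.get();        // surfaces any writer failure
            return loaded.get();  // number of lines the stand-in loader saw
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("lines consumed: " + run(2, 50));
    }
}
```

The bounded queue and the fixed-size pipe buffer both apply back-pressure, so the extractors cannot run far ahead of the single thread that feeds the loader. In the arrangement I actually have in mind, the queue would hold Triple instances and the writer thread would do the N-Triples serialization.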

---
A. Soroka
The University of Virginia Library

Re: StreamRDF or similar for a TDB bulk load?

Posted by Andy Seaborne <an...@apache.org>.

On 05/03/16 18:04, A. Soroka wrote:
> I’m writing an ETL-ish utility that extracts triples from some directories of application-specific XML to assemble into a from-scratch TDB database. Of course I want to take advantage of the bulk loader facilities for best results. The TDBLoader methods that I’m looking at all accept InputStreams or URIs from which to get serialized RDF. It happens that I am already using Jena to transform the XML into RDF, so I’ve got actual Jena Triples in hand when I come to the bulk loading apparatus. It seems silly to serialize the triples only for the bulk loader to deserialize them, so I’d like to get at a StreamRDF instance or something similar that I can use to give Triples in a flow directly to the bulk loader, but at a first glance it looks like that’s hidden as BulkLoader.DestinationGraphs.
>
> As additional context, the extraction is easily parallelized, but I do not see any note that the bulk loading is threadsafe, so I had intended to run a couple of threads of extraction loading a queue with a thread feeding the bulk loading gear from that queue.
>
> Am I misunderstanding the action of the bulk loader, and more to the point, what is the most efficient way I can build a from-scratch TDB database from Triples?
>
> Thanks for any help or advice!
>
> ---
> A. Soroka
> The University of Virginia Library
>

Hi,

StreamRDF came after BulkLoader, so it might not be fully exposed, though
note it uses "BulkStreamRDF", which adds to the StreamRDF contract. Because
parsing many files causes a start/finish call pair for each file, there has
to be some handling of the overall bulk process, which is what
startBulk/finishBulk adds.

Bulkloading is not thread safe.

Serializing isn't so bad.  It makes the parallel extraction simple.

A bonus here is that you are running two processes in parallel and also 
you can check the data. Checking before a large bulk load is a good idea 
for a reliable process.

Realistically, on one general purpose machine, running the extractor 
process at the same time as the bulk load is going to slow down 
bulkloading due to I/O interactions (even if separate disks). 
Write/parse is CPU-dominated and faster than the bulkloader.

     Andy