Posted to users@jena.apache.org by Zak Mc Kracken <za...@yahoo.it.INVALID> on 2017/11/26 01:40:36 UTC

Is StreamRDFWriter.write() thread-safe?

Hi all,

as per subject, is this operation (seen at 
https://jena.apache.org/documentation/io/streaming-io.html) thread-safe?

StreamRDFWriter.write(output, model.getGraph(), lang);

i.e., can it be run by multiple threads in parallel, assuming each thread 
creates its own Model instance and invokes the operation above at the end, 
when its model is complete?

Will the write() above manage the concurrent access to the output stream?

At which level? The triple, some sort of block, or the entire 
operation/graph? (The latter case implying there aren't benefits from 
this kind of parallelism).

From the first tests I've done, it seems to be thread-safe, but I don't 
understand the details.

Thank you in advance,

Marco.




Re: Is StreamRDFWriter.write() thread-safe?

Posted by Zak Mc Kracken <za...@yahoo.it.INVALID>.
Hi ajs6f,

good suggestion, thank you. For the moment, I don't think that writing 
sequentially to a single file is going to be the slowest part, but I'll 
see how it goes.

Marco

On 26/11/2017 15:56, ajs6f wrote:
> I had a similar task a while ago: I did wrap the StreamRDF in a wrapper that synchronized the relevant methods, and that worked fine. Then I tried using several independent output files, one for each thread, and performance improved enormously.
>
> Keep in mind that if you use NTriples or Trig, merging two files (for later processing) is just concatenating them.
>
> ajs6f
>
>


Re: Is StreamRDFWriter.write() thread-safe?

Posted by Rob Vesse <rv...@dotnetrdf.org>.
I’ve taken similar approaches in data generators in the past. Using one file per thread is by far the best way to do things and requires the least coordination.

If you are using a concatenable format you can always have a secondary thread which tracks which files you are done writing to and generates the final concatenated output partly in parallel. Whether that approach will work depends on the exact structure of your data generator, i.e. whether there are logical points at which a given output file can be considered complete.

Rob
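
A rough sketch of the secondary-thread idea above, using only the Java standard library. The class name BackgroundMerger, the part-file naming and the one-line N-Triples-like content are illustrative assumptions, not anything from Jena or from the posters' code. Producers announce a part file on a queue only once it is completely written, and a single merger thread appends each announced file to the final output.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: a merger thread concatenates part files as they are finished. */
final class BackgroundMerger {
    // Sentinel that tells the merger to stop; it is compared by identity only.
    private static final Path DONE = Paths.get("done-sentinel");

    static List<String> demo(int workers, int filesPerWorker) {
        try {
            Path dir = Files.createTempDirectory("rdf-merge");
            Path target = dir.resolve("merged.nt");
            BlockingQueue<Path> finished = new LinkedBlockingQueue<>();

            // The only thread that ever touches the final output.
            Thread merger = new Thread(() -> {
                try (OutputStream out = Files.newOutputStream(target)) {
                    for (Path p = finished.take(); p != DONE; p = finished.take()) {
                        Files.copy(p, out);   // append the whole part file
                        Files.delete(p);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            merger.start();

            List<Thread> producers = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int id = w;
                Thread t = new Thread(() -> {
                    for (int f = 0; f < filesPerWorker; f++) {
                        Path part = dir.resolve("part-" + id + "-" + f + ".nt");
                        String line = "<urn:ex:s" + id + "> <urn:ex:p> \"" + f + "\" .\n";
                        try {
                            Files.write(part, line.getBytes(StandardCharsets.UTF_8));
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                        finished.add(part);   // announced only after the file is complete
                    }
                });
                producers.add(t);
                t.start();
            }
            for (Thread t : producers) {
                t.join();
            }
            finished.add(DONE);               // no more parts will arrive
            merger.join();
            return Files.readAllLines(target, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The order of parts in the merged file depends on which producer finishes first, so this only suits outputs where statement order does not matter.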

On 26/11/2017, 15:56, "ajs6f" <aj...@apache.org> wrote:

    I had a similar task a while ago: I did wrap the StreamRDF in a wrapper that synchronized the relevant methods, and that worked fine. Then I tried using several independent output files, one for each thread, and performance improved enormously.
    
    Keep in mind that if you use NTriples or Trig, merging two files (for later processing) is just concatenating them.
    
    ajs6f
    





Re: Is StreamRDFWriter.write() thread-safe?

Posted by ajs6f <aj...@apache.org>.
I had a similar task a while ago: I did wrap the StreamRDF in a wrapper that synchronized the relevant methods, and that worked fine. Then I tried using several independent output files, one for each thread, and performance improved enormously.

Keep in mind that if you use N-Triples or TriG, merging two files (for later processing) is just a matter of concatenating them.

ajs6f
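
The one-file-per-thread approach described above might look roughly like the following. This is a sketch under assumptions: PerThreadFiles, the part-file names and the string-built lines are made up for illustration, and a real exporter would serialize its own model into each part file instead. Each worker owns its file, so no locking is needed, and the merge step is plain byte concatenation.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one output file per worker thread, concatenated afterwards. */
final class PerThreadFiles {

    /** Starts one thread per worker; each writes only to its own part file. */
    static List<Path> writeParts(Path dir, int workers, int linesPerWorker) {
        List<Path> parts = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int id = w;
            Path part = dir.resolve("part-" + id + ".nt");
            parts.add(part);
            Thread t = new Thread(() -> {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < linesPerWorker; i++) {
                    sb.append("<urn:ex:s").append(id)
                      .append("> <urn:ex:p> \"").append(i).append("\" .\n");
                }
                try {
                    Files.write(part, sb.toString().getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return parts;
    }

    /** Merges the part files by appending their bytes in list order. */
    static void concatenate(List<Path> parts, Path target) {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path p : parts) {
                Files.copy(p, out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes parts into a temporary directory, merges them, and returns the merged lines. */
    static List<String> demo(int workers, int linesPerWorker) {
        try {
            Path dir = Files.createTempDirectory("rdf-parts");
            List<Path> parts = writeParts(dir, workers, linesPerWorker);
            Path merged = dir.resolve("merged.nt");
            concatenate(parts, merged);
            return Files.readAllLines(merged, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```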

> On Nov 26, 2017, at 9:15 AM, Zak Mc Kracken <za...@yahoo.it.INVALID> wrote:
> 
> Hi Andy,
> 
> thank you for your reply. Good to know. My use case is an RDF exporter that takes data from a relatively slow data source (like a DBMS). In order to speed things up, it has multiple threads reading data, converting it to RDF and then sending generated RDF to their own Jena Model (one per thread). At the end, they stream the model to a common sink/stream, such as a file.
> 
> Actually I'm designing this with some flexibility: one can choose to pass a java.util.function.Consumer<Model> to the exporter, that is, a handler that does something with a thread's model once it is ready. That's because I want to reuse the upstream processing for either an RDF file exporter, or a Neo4J uploader (which should be able to manage concurrent writes at a finer-grained level), or, in general, some other kind of converter.
> 
> That said, I'm OK with making the file writing part synchronized and hence not really parallel; my question was meant to understand better how Jena works with this.
> 
> Best,
> Marco.
> 
> On 26/11/2017 11:14, Andy Seaborne wrote:
>> If the output stream is shared, then no.  It's buffered internally.
>> 
>> So at small scale, it'll look safe because the whole output is one buffer or the order was OK.  But beyond that, the buffered flushes will be interleaved, and buffer boundaries are based on characters, not on logical units of the RDF output.
>> 
>> Parallel writing to a shared OutputStream is a bad idea.
>> 
>> What's the use case you have for a shared output stream?
>> 
>>     Andy
>> 
>> 
> 


Re: Is StreamRDFWriter.write() thread-safe?

Posted by Zak Mc Kracken <za...@yahoo.it.INVALID>.
Hi Andy,

thank you for your reply. Good to know. My use case is an RDF exporter 
that takes data from a relatively slow data source (like a DBMS). In 
order to speed things up, it has multiple threads reading data, 
converting it to RDF and then sending generated RDF to their own Jena 
Model (one per thread). At the end, they stream the model to a common 
sink/stream, such as a file.

Actually I'm designing this with some flexibility: one can choose to pass 
a java.util.function.Consumer<Model> to the exporter, that is, a 
handler that does something with a thread's model once it is ready. 
That's because I want to reuse the upstream processing for either an 
RDF file exporter, or a Neo4J uploader (which should be able to manage 
concurrent writes at a finer-grained level), or, in general, some other 
kind of converter.

That said, I'm OK with making the file writing part synchronized and 
hence not really parallel; my question was meant to understand better how 
Jena works with this.

Best,
Marco.
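
The handler-based design described above could be sketched along these lines. Everything here is hypothetical: ExporterSketch, export() and serialized() are invented names, and the type parameter M stands in for a Jena Model so that the example needs no library. The exporter only knows it has a Consumer, and the caller decides whether that consumer must be serialized (a file writer) or can run concurrently (a sink that handles concurrent writes itself).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntFunction;

/** Hypothetical sketch of an exporter that hands each finished model to a handler. */
final class ExporterSketch {

    /** Wraps a handler so that only one thread at a time can run it. */
    static <M> Consumer<M> serialized(Consumer<M> handler) {
        Object lock = new Object();
        return model -> {
            synchronized (lock) {
                handler.accept(model);
            }
        };
    }

    /** Builds one model per worker in parallel and passes each one to the handler. */
    static <M> void export(int workers, IntFunction<M> buildModel, Consumer<M> handler) {
        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int id = w;
            Thread t = new Thread(() -> handler.accept(buildModel.apply(id)));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    /** Demo: strings stand in for models, and a plain ArrayList stands in for the file. */
    static List<String> demo(int workers) {
        List<String> sink = new ArrayList<>();              // not thread-safe on its own
        Consumer<String> handler = serialized(sink::add);   // so calls are serialized
        export(workers, id -> "model-" + id, handler);
        return sink;
    }
}
```

With this shape, model building still runs in parallel and only the hand-off to the sink is serialized.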

On 26/11/2017 11:14, Andy Seaborne wrote:
> If the output stream is shared, then no.  It's buffered internally.
>
> So at small scale, it'll look safe because the whole output is one 
> buffer or the order was OK.  But beyond that, the buffered flushes 
> will be interleaved, and buffer boundaries are based on characters, not 
> on logical units of the RDF output.
>
> Parallel writing to a shared OutputStream is a bad idea.
>
> What's the use case you have for a shared output stream?
>
>     Andy
>
>


Re: Is StreamRDFWriter.write() thread-safe?

Posted by Andy Seaborne <an...@apache.org>.
If the output stream is shared, then no.  It's buffered internally.

So at small scale, it'll look safe because the whole output is one 
buffer or the order was OK.  But beyond that, the buffered flushes will 
be interleaved, and buffer boundaries are based on characters, not 
on logical units of the RDF output.

Parallel writing to a shared OutputStream is a bad idea.

What's the use case you have for a shared output stream?

     Andy
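
One possible way around the interleaving described above, sketched with only the Java standard library: each thread serializes into a private buffer, and only the final copy into the shared stream happens under a lock, so a whole document lands as one contiguous block. AtomicBlockSink and its methods are made-up names, and the string-built lines stand in for real RDF output. In an actual exporter the serializer callback would presumably wrap the call from the original question, along the lines of out -> StreamRDFWriter.write(out, model.getGraph(), lang), which is an assumption not tested here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Hypothetical helper: lets many threads append whole documents to one shared stream. */
final class AtomicBlockSink {
    private final OutputStream shared;
    private final Object lock = new Object();

    AtomicBlockSink(OutputStream shared) {
        this.shared = shared;
    }

    /** Runs the serializer against a private buffer, then copies the buffer under the lock. */
    void writeBlock(Consumer<OutputStream> serializer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serializer.accept(buffer);            // no lock is held while serializing
        synchronized (lock) {
            try {
                buffer.writeTo(shared);       // the whole block goes out in one piece
                shared.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Writes text as UTF-8, converting the checked IOException for use in lambdas. */
    static void writeUtf8(OutputStream out, String text) {
        try {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo: several threads each write a block of N-Triples-like lines to one stream. */
    static String demo(int threads, int linesPerThread) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        AtomicBlockSink sink = new AtomicBlockSink(out);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> sink.writeBlock(buf -> {
                for (int i = 0; i < linesPerThread; i++) {
                    writeUtf8(buf, "<urn:ex:s" + id + "> <urn:ex:p> \"" + i + "\" .\n");
                }
            }));
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The trade-off is memory: each thread holds its complete serialized output until it gets the lock, and the result is only valid for formats in which concatenated documents remain well-formed.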

On 26/11/17 01:40, Zak Mc Kracken wrote:
> Hi all,
> 
> as per subject, is this operation (seen at 
> https://jena.apache.org/documentation/io/streaming-io.html) thread-safe?
> 
> StreamRDFWriter.write(output, model.getGraph(), lang);
> 
> ie, can it be run by multiple threads in parallel, assuming each spawns 
> its own Model instance and invokes the operation above at their end, 
> when the model is complete?
> 
> Will the write() above manage the concurrent access to the output stream?
> 
> At which level? The triple, some sort of block, or the entire 
> operation/graph? (The latter case implying there aren't benefits from 
> this kind of parallelism).
> 
> From the first tests I've done, it seems it's thread-safe, but I don't 
> get the details.
> 
> Thank you in advance,
> 
> Marco.
> 
> 
>