Posted to dev@jena.apache.org by "Coughlan, Barry" <ba...@deri.org> on 2013/11/01 10:15:38 UTC

Single-threaded RIOT parsing of InputStream

Hi all,

According to the RIOT docs, iterating over triples/quads with piped streams requires separate threads for producer/consumer.

For some applications this isn't practical. In my case I am running a Hadoop job on N-Triples datasets, so I am parsing one triple at a time. The overhead and extra code complexity of kicking off a thread to parse each triple is too high, and this may be true for other use cases involving small datasets.

I wrote some StreamRDF implementations which store the results in Java Collections, so that parsing can be run on a single thread. Attached is a patch with the implementations, tests and an example (I borrowed the term 'Collector' from Apache Lucene). But I now suspect that I've overlooked some simple existing API call to do this.
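For readers following the thread, the collector idea can be sketched in plain Java. This is a simplified, self-contained stand-in, not the actual Jena API: the real StreamRDF interface has start/triple/quad/base/prefix/finish methods and works with Triple objects, but the shape of the pattern is the same — the parser pushes into the sink, the sink stores into a Collection, and everything stays on one thread.

```java
import java.util.ArrayList;
import java.util.List;

public class CollectorSketch {
    // Simplified stand-in for Jena's StreamRDF callback interface.
    interface TripleSink {
        void triple(String subject, String predicate, String object);
        void finish();
    }

    // Collector that stores everything it receives in a List, so the
    // parser and the consumer can share a single thread.
    static class CollectingSink implements TripleSink {
        private final List<String[]> triples = new ArrayList<>();
        @Override public void triple(String s, String p, String o) {
            triples.add(new String[] { s, p, o });
        }
        @Override public void finish() { /* nothing to release */ }
        List<String[]> getCollected() { return triples; }
    }

    public static void main(String[] args) {
        CollectingSink sink = new CollectingSink();
        // In real code the RIOT parser drives the sink; here we call it directly.
        sink.triple("<s>", "<p>", "<o>");
        sink.finish();
        System.out.println(sink.getCollected().size()); // prints 1
    }
}
```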

Any feedback appreciated.

Regards,
Barry

Re: Single-threaded RIOT parsing of InputStream

Posted by Andy Seaborne <an...@apache.org>.
On 01/11/13 09:15, Coughlan, Barry wrote:
> Hi all,
>
> According to the RIOT docs, iterating over triples/quads with piped
> streams requires separate threads for producer/consumer.
>
> For some applications this isn't practical. In my case I am running a
> Hadoop job on N-Triples datasets, so I am parsing one triple at a time.
> The overhead and extra code complexity of kicking off a thread to parse
> each triple is too high, and this may be true for other use cases
> involving small datasets.
>
> I wrote some StreamRDF implementations which store the results in Java
> Collections, so that parsing can be run on a single thread. Attached is
> a patch with the implementations, tests and an example (I borrowed the
> term 'Collector' from Apache Lucene). But I now suspect that I've
> overlooked some simple existing API call to do this.
>
> Any feedback appreciated.
>
> Regards,
> Barry

Barry,

Thanks for the contribution - I've created JENA-581 [1] and attached your 
patch.  Looks like a useful thing to add to StreamRDFLib.

You could use a graph or model to collect your triples but at the 
granularity of one-by-one, even that may incur some overhead.

(You can pass the same StreamRDF to multiple calls of the parser 
machinery to aggregate triples. e.g. RDFDataMgr.parse)
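Andy's aggregation point can be illustrated with a toy stand-in: reuse one collecting sink across several parse calls and the results accumulate in the same collection. The `parse` method below is a naive whitespace splitter standing in for a real parser call such as RDFDataMgr.parse — it is not an N-Triples tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

public class AggregateSketch {
    // Toy stand-in for a parser call: feed each line of one input to the sink.
    // A real N-Triples tokenizer would handle quoting, escapes, etc.
    static void parse(String input, List<String[]> sink) {
        for (String line : input.split("\n")) {
            String[] terms = line.trim().split("\\s+");
            if (terms.length >= 3) {
                sink.add(new String[] { terms[0], terms[1], terms[2] });
            }
        }
    }

    public static void main(String[] args) {
        List<String[]> collected = new ArrayList<>(); // one shared sink
        parse("<a> <p> <b> .", collected);            // first input
        parse("<c> <p> <d> .", collected);            // second input
        System.out.println(collected.size());         // prints 2: results aggregated
    }
}
```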

The documentation needs to spell out the implicit assumption that it's 
about parallel processing of data (typically a very large file); your 
use case isn't that.	

	thanks
	Andy

[1] https://issues.apache.org/jira/browse/JENA-581

RE: Single-threaded RIOT parsing of InputStream

Posted by "Coughlan, Barry" <ba...@deri.org>.
Hi Rob,

The real target is Hadoop, where each split will contain one line from an N-Triples file that is
parsed in the mapper, so it's parsing one triple at a time. It doesn't matter if it blocks since
there is nothing to do until it completes.

I also noted that it could be used for easier parsing of small datasets (which is why I used
a collection instead of a single instance) but after some thought I don't see a realistic use
case for that.

I wasn't aware of jena-grande. Creating input formats for triple parsing seems like a much
better solution than mine. If there are no other use cases for my patch, I think it should be
removed from the trunk.

Regards,
Barry

________________________________________
From: Rob Vesse [rvesse@dotnetrdf.org]
Sent: 04 November 2013 09:54
To: dev@jena.apache.org
Subject: Re: Single-threaded RIOT parsing of InputStream

Barry

As Andy stated in his replies, no, we didn't have this functionality
already, and he has now added it to trunk.

As far as your described use case goes, I would point out that this mode of
operation will not be scalable unless you have appropriately partitioned the
data.  Parsing is inherently a blocking process, which is why the iterator model
provided by RIOT relies on having a producer and a consumer thread
with a bounded thread-safe queue between them, to stop the producer filling
memory with as much data as it can read before the consumer ever gets to
start processing the data.

In your described model you will need to parse the entirety of the data into
memory before you can start consuming it, which risks OOM errors with larger
datasets.  If your real target is Hadoop input formats then you may want to
instead take a look at Paolo Castagna's jena-grande repository on GitHub -
https://github.com/castagna/jena-grande - which is a little out of date with
respect to the latest Hadoop versions but demonstrates how to create input
formats for RDF -
https://github.com/castagna/jena-grande/tree/master/src/main/java/org/apache
/jena/grande/mapreduce/io

Hope this helps,

Rob

From:  "Coughlan, Barry" <ba...@deri.org>
Reply-To:  <de...@jena.apache.org>
Date:  Friday, 1 November 2013 09:15
To:  "dev@jena.apache.org" <de...@jena.apache.org>
Subject:  Single-threaded RIOT parsing of InputStream

> Hi all,
>
> According to the RIOT docs, iterating over triples/quads with piped streams
> requires separate threads for producer/consumer.
>
> For some applications this isn't practical. In my case I am running a Hadoop
> job on N-Triples datasets, so I am parsing one triple at a time. The overhead
> and extra code complexity of kicking off a thread to parse each triple is too
> high, and this may be true for other use cases involving small datasets.
>
> I wrote some StreamRDF implementations which store the results in Java
> Collections, so that parsing can be run on a single thread. Attached is a
> patch with the implementations, tests and an example (I borrowed the term
> 'Collector' from Apache Lucene). But I now suspect that I've overlooked some
> simple existing API call to do this.
>
> Any feedback appreciated.
>
> Regards,
> Barry



Re: Single-threaded RIOT parsing of InputStream

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Barry

As Andy stated in his replies, no, we didn't have this functionality
already, and he has now added it to trunk.

As far as your described use case goes, I would point out that this mode of
operation will not be scalable unless you have appropriately partitioned the
data.  Parsing is inherently a blocking process, which is why the iterator model
provided by RIOT relies on having a producer and a consumer thread
with a bounded thread-safe queue between them, to stop the producer filling
memory with as much data as it can read before the consumer ever gets to
start processing the data.
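The bounded-queue arrangement described here can be illustrated with plain java.util.concurrent types. This is a sketch of the general pattern, not RIOT's actual implementation: the producer (the parser side) blocks once the queue is full, so memory stays bounded regardless of input size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipedSketch {
    private static final String EOF = "\u0000EOF"; // sentinel marking end of stream

    public static void main(String[] args) throws InterruptedException {
        // Small capacity: the producer blocks once 2 items are waiting,
        // so it can never race ahead and fill memory with parsed data.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    queue.put("triple-" + i); // parser side: emits as it reads
                }
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer side, running on the main thread, pulls until the sentinel.
        List<String> consumed = new ArrayList<>();
        for (String item = queue.take(); !item.equals(EOF); item = queue.take()) {
            consumed.add(item);
        }
        producer.join();
        System.out.println(consumed.size()); // prints 5
    }
}
```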

In your described model you will need to parse the entirety of the data into
memory before you can start consuming it, which risks OOM errors with larger
datasets.  If your real target is Hadoop input formats then you may want to
instead take a look at Paolo Castagna's jena-grande repository on GitHub -
https://github.com/castagna/jena-grande - which is a little out of date with
respect to the latest Hadoop versions but demonstrates how to create input
formats for RDF - 
https://github.com/castagna/jena-grande/tree/master/src/main/java/org/apache
/jena/grande/mapreduce/io

Hope this helps,

Rob

From:  "Coughlan, Barry" <ba...@deri.org>
Reply-To:  <de...@jena.apache.org>
Date:  Friday, 1 November 2013 09:15
To:  "dev@jena.apache.org" <de...@jena.apache.org>
Subject:  Single-threaded RIOT parsing of InputStream

> Hi all,
> 
> According to the RIOT docs, iterating over triples/quads with piped streams
> requires separate threads for producer/consumer.
> 
> For some applications this isn't practical. In my case I am running a Hadoop
> job on N-Triples datasets, so I am parsing one triple at a time. The overhead
> and extra code complexity of kicking off a thread to parse each triple is too
> high, and this may be true for other use cases involving small datasets.
> 
> I wrote some StreamRDF implementations which store the results in Java
> Collections, so that parsing can be run on a single thread. Attached is a
> patch with the implementations, tests and an example (I borrowed the term
> 'Collector' from Apache Lucene). But I now suspect that I've overlooked some
> simple existing API call to do this.
> 
> Any feedback appreciated.
> 
> Regards,
> Barry