Posted to users@jena.apache.org by Adrian Gschwend <ml...@netlabs.org> on 2016/03/29 14:53:12 UTC

riot, streaming & memory

Hi group,

I'm trying to convert an 800MB JSON-LD file to something more readable (NT or
Turtle) using riot. Unfortunately I run into memory issues:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
exceeded

I tried -Xmx up to 16GB but no luck so far. According to the
documentation riot should try streaming if possible; is that not
available for JSON-LD, or am I missing something else?
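
For reference, this is roughly how I understand the equivalent call
through Jena's RIOT API would look; a minimal, untested sketch (class and
method names from my reading of the javadoc, file names are placeholders):

import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFLib;

import java.io.FileOutputStream;
import java.io.OutputStream;

public class JsonLdToNt {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("data.nt")) {
            // StreamRDFLib.writer emits N-Triples as triples arrive, so the
            // output side streams; whether the JSON-LD input side streams is
            // exactly the question here.
            StreamRDF sink = StreamRDFLib.writer(out);
            RDFDataMgr.parse(sink, "data.jsonld");
        }
    }
}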

regards

Adrian

-- 
Adrian Gschwend
Bern University of Applied Sciences



Re: riot, streaming & memory

Posted by Paul Houle <on...@gmail.com>.
If you're serious about streaming JSON,  look at

http://www.rfc-editor.org/rfc/rfc7464.txt

In fact,  the same approach works for Turtle.
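
Roughly, the idea is that each record in a JSON text sequence is prefixed
with an RS byte (0x1E) and terminated by a newline, so a reader can hand one
small JSON document at a time to an ordinary parser instead of buffering the
whole file. A minimal, untested sketch of such a reader using Jackson (file
name made up):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.FileReader;

public class JsonSeqReader {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        StringBuilder record = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader("data.json-seq"))) {
            int c;
            while ((c = in.read()) != -1) {
                if (c == 0x1E) {                // RS marks the start of a new record
                    handle(mapper, record);
                    record.setLength(0);
                } else {
                    record.append((char) c);
                }
            }
            handle(mapper, record);             // flush the final record
        }
    }

    static void handle(ObjectMapper mapper, StringBuilder record) throws Exception {
        String text = record.toString().trim();
        if (text.isEmpty()) return;
        // Each record is a complete, small JSON document, parsed on its own.
        JsonNode node = mapper.readTree(text);
        System.out.println(node);
    }
}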

On Sun, Apr 3, 2016 at 6:53 AM, Adrian Gschwend <ml...@netlabs.org> wrote:

> On 31.03.16 11:28, Andy Seaborne wrote:
>
> Hi Andy,
>
> > The whole chain JSON->JSON-LD->RDF is designed around small/medium sized
> > data.
> >
> > If you find a JSON-LD parser that can convert this file to N-triples,
> > could you let the list know please?
>
> Will do; it might be that we have a streaming parser sooner or later
> in the Node.js/JavaScript world. There are some ongoing discussions.
>
> regards
>
> Adrian
>



-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype   ontology2@gmail.com

:BaseKB -- Query Freebase Data With SPARQL
http://basekb.com/gold/

Legal Entity Identifier Lookup
https://legalentityidentifier.info/lei/lookup/

Join our Data Lakes group on LinkedIn
https://www.linkedin.com/grp/home?gid=8267275

Re: riot, streaming & memory

Posted by Adrian Gschwend <ml...@netlabs.org>.
On 31.03.16 11:28, Andy Seaborne wrote:

Hi Andy,

> The whole chain JSON->JSON-LD->RDF is designed around small/medium sized
> data.
> 
> If you find a JSON-LD parser that can convert this file to N-triples,
> could you let the list know please?

Will do; it might be that we have a streaming parser sooner or later
in the Node.js/JavaScript world. There are some ongoing discussions.

regards

Adrian

Re: riot, streaming & memory

Posted by Adrian Gschwend <ml...@netlabs.org>.
On 31.03.16 11:28, Andy Seaborne wrote:

Hi Andy,

> If you find a JSON-LD parser that can convert this file to N-triples,
> could you let the list know please?

Talked to my colleague Reto last week and he had mercy:

https://github.com/zazuko/jsonldparser

See the README; currently it only supports a subset of JSON-LD. Pull
requests are welcome :) It's MIT licensed.

regards

Adrian

Re: riot, streaming & memory

Posted by Andy Seaborne <an...@apache.org>.
On 30/03/16 21:39, Adrian Gschwend wrote:
> On 30.03.16 18:49, Andy Seaborne wrote:
>
> Hi Andy,
>
>> The start of the file is somewhat heavy with long literals for the geo data.
>> There are some serious
>
> yeah it is a Swiss geodata set which should be published later this year
> as RDF. It gets generated as JSON-LD from its original INTERLIS format
> (mainly used in Switzerland)
>
>> It does seem to get into some kind of GC hell as the oldest GC
>> generation grows, which I think is the cause of lost CPU cycles and
>> little real progress.
>
> ok that sounds like bad code, will I run into this issue as well when I
> try loading it to Fuseki?

Yes. Same code path.

The JSON-LD algorithms assume complete access to the JSON (they walk over 
the JSON tree looking for things), and jsonld-java is a faithful 
implementation of the JSON-LD spec, even down to comments in the code 
noting which part of the spec the code is implementing at that point.

The whole chain JSON->JSON-LD->RDF is designed around small/medium sized 
data.

If you find a JSON-LD parser that can convert this file to N-triples, 
could you let the list know please?

	Andy

>
> regards
>
> Adrian
>


Re: riot, streaming & memory

Posted by Adrian Gschwend <ml...@netlabs.org>.
On 30.03.16 18:49, Andy Seaborne wrote:

Hi Andy,

> The start of the file is somewhat heavy with long literals for the geo data.
> There are some serious

yeah it is a Swiss geodata set which should be published later this year
as RDF. It gets generated as JSON-LD from its original INTERLIS format
(mainly used in Switzerland)

> It does seem to get into some kind of GC hell as the oldest GC
> generation grows, which I think is the cause of lost CPU cycles and
> little real progress.

ok that sounds like bad code, will I run into this issue as well when I
try loading it to Fuseki?

regards

Adrian

Re: riot, streaming & memory

Posted by Andy Seaborne <an...@apache.org>.
On 29/03/16 18:19, Adrian Gschwend wrote:
> On 29.03.16 19:13, Andy Seaborne wrote:
>
> Hi Andy,
>
>> So JSON-LD does not stream end-to-end.
>
> Ok I thought something like this. We have the same problem with the
> JavaScript JSON-LD library.
>
>> At 800Mb I would have expected a large enough heap to work for N-triples
>> output.  Is the file available online anywhere?
>
> the generated file is here:
>
> http://www.eisenhutinformatik.ch/tmp/swissNAMES3D_LV03.zip
>
>> (and is the JSON-LD one big object?  It is not really JSON's sweet spot
>> for large objects)
>
> I'm not really into JSON-LD details so not sure

The start of the file is somewhat heavy with long literals for the geo data. 
There are some serious

I traced the parsing and it dives into 
com.fasterxml.jackson.core.JsonParser and starts to build a Java object 
for the JSON input.

So even a large heap does not seem to get the JSON in, and the processing 
from parsed JSON through JSON-LD to RDF never happens.  I fed it through 
YourKit to profile it and the Java heap is getting hit hard.  My guess is 
that the Java data structure is not particularly compact.

It does seem to get into some kind of GC hell as the oldest GC 
generation grows, which I think is the cause of lost CPU cycles and 
little real progress.
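
To illustrate the difference (my own sketch, not code from jsonld-java;
the file name is assumed): Jackson can either materialise the whole
document as a tree of JsonNode objects, which is roughly what the JSON-LD
code path needs, or walk it token by token in constant memory, which the
JSON-LD algorithms cannot make use of.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;

public class TreeVsStreaming {
    public static void main(String[] args) throws Exception {
        File input = new File("swissNAMES3D_LV03.jsonld");   // file name assumed

        // Tree model: the whole document becomes one Java object graph,
        // so an 800MB file can easily need several GB of heap.
        JsonNode root = new ObjectMapper().readTree(input);
        System.out.println("top-level fields: " + root.size());

        // Streaming API: constant memory, but the caller only sees a flat
        // sequence of tokens rather than a tree it can walk over.
        try (JsonParser p = new JsonFactory().createParser(input)) {
            long tokens = 0;
            while (p.nextToken() != null) {
                tokens++;
            }
            System.out.println("tokens: " + tokens);
        }
    }
}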

	Andy

>
> regards
>
> Adrian
>


Re: riot, streaming & memory

Posted by Adrian Gschwend <ml...@netlabs.org>.
On 29.03.16 19:13, Andy Seaborne wrote:

Hi Andy,

> So JSON-LD does not stream end-to-end.

Ok I thought something like this. We have the same problem with the
JavaScript JSON-LD library.

> At 800Mb I would have expected a large enough heap to work for N-triples
> output.  Is the file available online anywhere?

the generated file is here:

http://www.eisenhutinformatik.ch/tmp/swissNAMES3D_LV03.zip

> (and is the JSON-LD one big object?  It is not really JSON's sweet spot
> for large objects)

I'm not really into JSON-LD details so not sure

regards

Adrian

Re: riot, streaming & memory

Posted by Andy Seaborne <an...@apache.org>.
On 29/03/16 13:53, Adrian Gschwend wrote:
> Hi group,
>
> I'm trying to convert an 800MB JSON-LD file to something more readable (NT or
> Turtle) using riot. Unfortunately I run into memory issues:
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> exceeded
>
> I tried -Xmx up to 16GB but no luck so far. According to the
> documentation riot should try streaming if possible; is that not
> available for JSON-LD, or am I missing something else?
>
> regards
>
> Adrian
>

Jena uses jsonld-java [1] for JSON-LD.  jsonld-java uses Jackson, which 
reads the entire file before letting the client operate on it. 
The actual JSON-to-RDF step is streaming.

So JSON-LD does not stream end-to-end.

(If the JSON-LD is arranged carefully, with @context before the data, a 
streaming parser is theoretically possible.  Jena does this for SPARQL 
results in JSON: if the headers are seen before the results, it streams, 
otherwise it has to buffer.)
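
For example, a made-up fragment (not from the dataset): with @context first 
and the instances in a @graph array after it, a parser could in principle 
turn each array element into triples as it reads it; if the @context only 
turns up after the data, everything has to be buffered first.

{
  "@context": { "name": "http://schema.org/name" },
  "@graph": [
    { "@id": "http://example.org/a", "name": "A" },
    { "@id": "http://example.org/b", "name": "B" }
  ]
}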

At 800Mb I would have expected a large enough heap to work for N-triples 
output.  Is the file available online anywhere?

	Andy

(and is the JSON-LD one big object?  It is not really JSON's sweet spot 
for large objects)


[1] https://github.com/jsonld-java/jsonld-java