Posted to users@jena.apache.org by Adrian Gschwend <ml...@netlabs.org> on 2016/03/29 14:53:12 UTC
riot, streaming & memory
Hi group,
I'm trying to convert an 800MB JSON-LD file to something more readable (NT or
Turtle) using riot. Unfortunately I run into memory issues:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
exceeded
I tried -Xmx up to 16GB but no luck so far. According to the
documentation, riot should stream if possible; is that not
available for JSON-LD, or am I missing something else?
regards
Adrian
--
Adrian Gschwend
Bern University of Applied Sciences
Re: riot, streaming & memory
Posted by Paul Houle <on...@gmail.com>.
If you're serious about streaming JSON, look at
http://www.rfc-editor.org/rfc/rfc7464.txt
In fact, the same approach works for Turtle.
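For context, RFC 7464 ("JSON text sequences") prefixes each JSON text with a
record separator byte (0x1E), so a consumer can process one record at a time
with memory bounded by the largest record rather than the whole file. A minimal
JDK-only sketch of such a reader (the class name is mine, and parsing each
record is left to whatever JSON library you prefer):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class JsonSeqReader {
    private static final int RS = 0x1E; // record separator, per RFC 7464

    // Splits an RS-delimited stream into individual JSON texts.
    // Memory use is bounded by the largest single record, not the file.
    public static List<String> records(InputStream in) throws IOException {
        List<String> out = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == RS) {
                if (buf.size() > 0) {
                    out.add(buf.toString(StandardCharsets.UTF_8.name()).trim());
                    buf.reset();
                }
            } else {
                buf.write(b);
            }
        }
        if (buf.size() > 0) {
            out.add(buf.toString(StandardCharsets.UTF_8.name()).trim());
        }
        return out;
    }
}
```

In a real pipeline you would hand each record to a JSON parser and convert it
separately, instead of collecting all records into a list as this sketch does.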
On Sun, Apr 3, 2016 at 6:53 AM, Adrian Gschwend <ml...@netlabs.org> wrote:
> On 31.03.16 11:28, Andy Seaborne wrote:
>
> Hi Andy,
>
> > The whole chain JSON->JSON-LD->RDF is designed around small/medium sized
> > data.
> >
> > If you find a JSON-LD parser that can convert this file to N-triples,
> > could you let the list know please?
>
> will do so, might be that we have some streaming parser sooner or later
> in the Node.js/JavaScript world. There are some ongoing discussions.
>
> regards
>
> Adrian
>
--
Paul Houle
Re: riot, streaming & memory
Posted by Adrian Gschwend <ml...@netlabs.org>.
On 31.03.16 11:28, Andy Seaborne wrote:
Hi Andy,
> The whole chain JSON->JSON-LD->RDF is designed around small/medium sized
> data.
>
> If you find a JSON-LD parser that can convert this file to N-triples,
> could you let the list know please?
Will do. It might be that we get a streaming parser sooner or later
in the Node.js/JavaScript world; there are some ongoing discussions.
regards
Adrian
Re: riot, streaming & memory
Posted by Adrian Gschwend <ml...@netlabs.org>.
On 31.03.16 11:28, Andy Seaborne wrote:
Hi Andy,
> If you find a JSON-LD parser that can convert this file to N-triples,
> could you let the list know please?
Talked to my colleague Reto last week and he had mercy:
https://github.com/zazuko/jsonldparser
See the README; currently it only supports a subset of JSON-LD. Pull
requests are welcome :) It's MIT-licensed.
regards
Adrian
Re: riot, streaming & memory
Posted by Andy Seaborne <an...@apache.org>.
On 30/03/16 21:39, Adrian Gschwend wrote:
> On 30.03.16 18:49, Andy Seaborne wrote:
>
> Hi Andy,
>
>> The start of the file is somewhat long literal heavy for the geo data.
>> There are some serious
>
> yeah it is a Swiss geodata set which should be published later this year
> as RDF. It gets generated as JSON-LD from its original INTERLIS format
> (mainly used in Switzerland)
>
>> It does seem to get into some kind of GC hell as the oldest GC
>> generation grows which I think is the cause of lost of CPU cycles and
>> little real progress.
>
> ok that sounds like bad code, will I run into this issue as well when I
> try loading it to Fuseki?
Yes. Same code path.
The JSON-LD algorithms assume complete access to the JSON (they walk over
the JSON tree looking for things), and jsonld-java is a faithful
implementation of the JSON-LD spec, even down to comments in the code
noting which part of the spec each passage implements.
The whole chain JSON->JSON-LD->RDF is designed around small/medium sized
data.
If you find a JSON-LD parser that can convert this file to N-triples,
could you let the list know please?
Andy
>
> regards
>
> Adrian
>
Re: riot, streaming & memory
Posted by Adrian Gschwend <ml...@netlabs.org>.
On 30.03.16 18:49, Andy Seaborne wrote:
Hi Andy,
> The start of the file is somewhat long literal heavy for the geo data.
> There are some serious
yeah it is a Swiss geodata set which should be published later this year
as RDF. It gets generated as JSON-LD from its original INTERLIS format
(mainly used in Switzerland)
> It does seem to get into some kind of GC hell as the oldest GC
> generation grows which I think is the cause of lost of CPU cycles and
> little real progress.
OK, that sounds like bad code. Will I run into this issue as well when I
try loading it into Fuseki?
regards
Adrian
Re: riot, streaming & memory
Posted by Andy Seaborne <an...@apache.org>.
On 29/03/16 18:19, Adrian Gschwend wrote:
> On 29.03.16 19:13, Andy Seaborne wrote:
>
> Hi Andy,
>
>> So JSON-LD does not stream end-to-end.
>
> Ok I thought something like this. We have the same problem with the
> JavaScript JSON-LD library.
>
>> At 800Mb I would have expected a large enough heap to work for N-triples
>> output. Is the file available online anywhere?
>
> the generated file is here:
>
> http://www.eisenhutinformatik.ch/tmp/swissNAMES3D_LV03.zip
>
>> (and is the JSON-LD one big object? It is not really JSON sweet spot
>> for large objects)
>
> I'm not really into JSON-LD details so not sure
The start of the file is somewhat long literal heavy for the geo data.
There are some serious
I traced the parsing and it dives into
com.fasterxml.jackson.core.JsonParser and starts to build a Java object
for the JSON input.
So even a large heap does not seem to be enough to get the JSON in, and
processing from parsed JSON through JSON-LD to RDF never happens. I fed
it through YourKit to profile it and the Java heap is getting hit hard.
My guess is that the Java data structure is not particularly compact.
It does seem to get into some kind of GC hell as the oldest GC
generation grows, which I think is the cause of lots of CPU cycles being
spent with little real progress.
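Some back-of-envelope arithmetic shows why the parsed tree hits the heap so
hard. The 8-12x expansion factor below is my rough assumption, not a
measurement: generic java.util.Map/List/String trees pay for object headers,
HashMap entries, boxed numbers, and 2-bytes-per-char Strings, and the JSON-LD
algorithms then build further intermediate copies on top of that:

```java
public class HeapEstimate {
    // Rough guess: bytes of heap for a JSON file parsed into generic
    // Map/List/String objects, given an assumed expansion factor.
    public static long estimate(long fileBytes, int factor) {
        return fileBytes * factor;
    }

    public static void main(String[] args) {
        long fileBytes = 800L << 20; // the 800 MB input file
        System.out.println("parsed-tree heap, rough guess: "
                + (estimate(fileBytes, 8) >> 30) + "-"
                + (estimate(fileBytes, 12) >> 30)
                + " GB, before JSON-LD expansion makes further copies");
    }
}
```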
Andy
>
> regards
>
> Adrian
>
Re: riot, streaming & memory
Posted by Adrian Gschwend <ml...@netlabs.org>.
On 29.03.16 19:13, Andy Seaborne wrote:
Hi Andy,
> So JSON-LD does not stream end-to-end.
Ok I thought something like this. We have the same problem with the
JavaScript JSON-LD library.
> At 800Mb I would have expected a large enough heap to work for N-triples
> output. Is the file available online anywhere?
the generated file is here:
http://www.eisenhutinformatik.ch/tmp/swissNAMES3D_LV03.zip
> (and is the JSON-LD one big object? It is not really JSON sweet spot
> for large objects)
I'm not really into JSON-LD details so not sure
regards
Adrian
Re: riot, streaming & memory
Posted by Andy Seaborne <an...@apache.org>.
On 29/03/16 13:53, Adrian Gschwend wrote:
> Hi group,
>
> I try to convert a 800MB JSON-LD file to something more readable (NT or
> Turtle) using riot. Unfortunately I run into memory issues:
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> exceeded
>
> I tried -Xmx up to 16GB but no luck so far. According to the
> documentation riot should try streaming if possible, is that not
> available for JSON-LD or am I missing something else?
>
> regards
>
> Adrian
>
Jena uses jsonld-java [1] for JSON-LD. jsonld-java uses Jackson, which
reads the entire file before letting the client operate on it.
The actual JSON-to-RDF step is streaming.
So JSON-LD does not stream end-to-end.
(If the JSON-LD is arranged carefully, @context before data, a streaming
parser is theoretically possible. Jena does this for SPARQL results in
JSON: if the headers are seen before the results, it streams; otherwise
it has to buffer.)
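That "stream if the context/header arrives first, otherwise buffer" behaviour
can be sketched abstractly. All names here are mine and this is not Jena's
actual code, just the shape of the ordering constraint:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Output can be produced eagerly only once the @context (or, for SPARQL
// results in JSON, the header) has been seen; data arriving before it
// must be buffered and flushed later.
public class OrderedStreamer {
    private String context = null;
    private final List<String> pending = new ArrayList<>();
    private final Consumer<String> sink;

    public OrderedStreamer(Consumer<String> sink) {
        this.sink = sink;
    }

    public void onContext(String ctx) {
        context = ctx;
        for (String d : pending) sink.accept(interpret(d)); // flush buffer
        pending.clear();
    }

    public void onData(String datum) {
        if (context != null) sink.accept(interpret(datum)); // stream
        else pending.add(datum);                            // buffer
    }

    // Stand-in for "interpret this datum using the context".
    private String interpret(String datum) {
        return context + ":" + datum;
    }
}
```

If every input has the context first, the pending buffer stays empty and the
whole pipeline is constant-memory; context-last input degrades to buffering
everything, which is the situation with this file.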
At 800Mb I would have expected a large enough heap to work for N-triples
output. Is the file available online anywhere?
Andy
(And is the JSON-LD one big object? Large objects are not really JSON's
sweet spot.)
[1] https://github.com/jsonld-java/jsonld-java