Posted to dev@jena.apache.org by "Erman Korkut (BLOOMBERG/ 120 PARK)" <ek...@bloomberg.net> on 2017/06/15 21:58:33 UTC

Patch to support streaming in riot for rdf/xml and jsonld formats

Hi all,

We are using riot to convert large N-Triples (.nt) files into Turtle, and so far it works great thanks to the streaming support for Turtle. For a 500-million-triple file, the conversion takes on the order of minutes, without running into any memory issues.

We are interested in converting to RDF/XML and JSON-LD in a similar streaming fashion. These formats do not seem to support streaming at the moment. I saw Andy's response to a question on Stack Overflow saying that "JSON-LD is not a streaming output language (the writer needs all the data available before calling the jsonld-java code)" (https://stackoverflow.com/questions/26287432/json-ld-in-jena-riot)

Is it conceptually impossible to support RDF/XML and JSON-LD in riot with streaming? It looks like it can be made to work, particularly when the input file is Turtle, where each subject's predicates and objects are already grouped together. We are willing to work on this patch and contribute it back to Jena, but wanted to check with you first to see what you think. Is it really impossible, or would it just take very significant effort in the current codebase?

Please let me know what you think of this patch idea.

Thanks,
Erman Korkut
Bloomberg L.P.
ekorkut1@bloomberg.net



Re: Patch to support streaming in riot for rdf/xml and jsonld formats

Posted by Andy Seaborne <an...@apache.org>.
Writing basic RDF/XML (not RDF/XML-ABBREV) is streaming, IIRC. Plain 
RDF/XML is quite verbose and unreadable.

There could be a new writer that did something like 
RDFFormat.TURTLE_BLOCKS. That writer needs writing - it does not exist 
currently. Better to write a new one than try to modify the existing 
"pretty" one, which analyses the whole data before starting to write (as 
does the prettiest Turtle writer).
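
As a rough illustration of the block idea (a sketch only, not Jena code): a block-style writer only has to buffer the run of consecutive triples that share a subject, so memory is bounded by the largest block rather than by the whole graph. The Python below assumes the terms are already serialised strings, and the names subject_blocks and format_block are hypothetical, not RIOT APIs.

```python
from itertools import groupby
from operator import itemgetter


def subject_blocks(triples):
    """Yield (subject, [(predicate, object), ...]) for each run of
    consecutive triples with the same subject.

    Only one run is held in memory at a time. A subject that reappears
    later in the stream simply starts a new block.
    """
    for subject, run in groupby(triples, key=itemgetter(0)):
        yield subject, [(p, o) for _, p, o in run]


def format_block(subject, pairs):
    """Render one block in a Turtle-like layout (terms are pre-serialised)."""
    body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
    return f"{subject}\n    {body} .\n"
```

Because the grouping is purely local, the output is less compact than a "pretty" writer's when a subject is scattered through the input, but nothing has to be analysed up front.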


On 16/06/17 10:06, Rob Vesse wrote:
> Both are technically possible
> 
>   Please bear in mind that the resulting outputs may be exceptionally verbose, because generating either of those formats in a streaming fashion will prevent you from using many of the available syntactic sugars. In the case of JSON-LD we don’t maintain the core functionality ourselves, so you would need to provide contributions upstream to the third-party library we use. In the RDF/XML case you can contribute directly to Jena.
> 
>   Contributions are always welcome
> 
>   As an aside, what is the value of producing such large data sets in those formats? There is a reason that the community has standardised on other, more compact formats for large-scale data exchange.

I agree - N-Triples is faster to ingest. RDF-Thrift is fastest but not 
standard. N-Triples, compressed, is a reasonable size. It does not slow 
down parsing very much.
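
To illustrate why compression costs little here, a minimal sketch (plain Python standard library, not RIOT): gzip-compressed N-Triples can be decompressed and consumed line by line, so the file is never fully expanded in memory or on disk. The function names and the .nt.gz path are illustrative assumptions.

```python
import gzip


def iter_ntriples_lines(path):
    """Yield non-empty, non-comment lines from a gzip-compressed
    N-Triples file, decompressing incrementally."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line


def count_triples(path):
    """Count triples without holding more than one line in memory.

    Assumes one triple per line, as N-Triples requires.
    """
    return sum(1 for _ in iter_ntriples_lines(path))
```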

Layering over an XML parser makes the parsing process slow - there is a 
lot of work to do.

Similarly for JSON-LD: Jena uses "jsonld-java", which in turn uses 
Jackson, so for reading it reads the whole file, then converts JSON->RDF.

> 
> Rob
> 
> On 15/06/2017 22:58, "Erman Korkut (BLOOMBERG/ 120 PARK)" <ek...@bloomberg.net> wrote:
> 
>      Hi all,
>      
>      We are using riot to convert large N-Triples (.nt) files into Turtle, and so far it works great thanks to the streaming support for Turtle. For a 500-million-triple file, the conversion takes on the order of minutes, without running into any memory issues.
>      
>      We are interested in converting to rdf/xml and jsonld formats in a similar fashion with streaming. These formats do not seem to support streaming at the moment. I saw Andy's response to a question in stack overflow saying that "JSON-LD is not a streaming output language (the writer needs all the data available before calling the jsonld-java code)" (https://stackoverflow.com/questions/26287432/json-ld-in-jena-riot)

Do you need it to be a single file? Maybe there can be parallel processing.

How often are you going to be doing the conversion?

Maybe Redland can be used for RDF/XML.

>      
>      Is it conceptually impossible to support RDF/XML and JSON-LD in riot with streaming? It looks like it can be made to work, particularly when the input file is Turtle, where each subject's predicates and objects are already grouped together. We are willing to work on this patch and contribute it back to Jena, but wanted to check with you first to see what you think. Is it really impossible, or would it just take very significant effort in the current codebase?

For RDF/XML it's possible to do better.

You have to make some other assumptions, such as the prefixes being 
known at the start.  In practice, that's normal, but it isn't required.

In RDF/XML you can reset the namespaces (with care) so you can do some 
stuff as you stream whereas for JSON-LD, the context is fixed per 
document, I think.
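
A minimal sketch of that namespace trick, in Python rather than Jena code and under loud assumptions: no blank nodes, no datatypes or language tags, and every predicate's local name is a valid XML name. The prefix p is redeclared on each property element, so no prefix table is needed before the first triple arrives, at the cost of verbosity. All names here (split_iri, write_rdfxml_blocks, the tuple layout) are hypothetical.

```python
from itertools import groupby
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def split_iri(iri):
    """Split an IRI into (namespace, local name) at the last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/")) + 1
    return iri[:cut], iri[cut:]


def write_rdfxml_blocks(triples, out):
    """Stream triples to `out` as plain RDF/XML.

    `triples` yields (subject_iri, predicate_iri, (kind, value)) where
    kind is "iri" or "lit". Consecutive triples with the same subject
    share one rdf:Description element.
    """
    out.write('<?xml version="1.0" encoding="utf-8"?>\n')
    out.write(f'<rdf:RDF xmlns:rdf="{RDF_NS}">\n')
    for subject, block in groupby(triples, key=lambda t: t[0]):
        out.write(f"  <rdf:Description rdf:about={quoteattr(subject)}>\n")
        for _, predicate, (kind, value) in block:
            ns, local = split_iri(predicate)
            # Redeclare the prefix locally instead of relying on the root.
            start = f"<p:{local} xmlns:p={quoteattr(ns)}"
            if kind == "iri":
                out.write(f"    {start} rdf:resource={quoteattr(value)}/>\n")
            else:
                out.write(f"    {start}>{escape(value)}</p:{local}>\n")
        out.write("  </rdf:Description>\n")
    out.write("</rdf:RDF>\n")
```

If the prefixes are known at the start, they could instead be declared once on rdf:RDF and the per-element declarations dropped.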

	Andy

>      
>      Please let me know what you think on this patch idea.

If it solves the problem, it would be good to have.  Rob and I have 
doubts about the approach to the problem based on common practice, but 
your business requirements may mean you need RDF/XML or JSON-LD.  Can 
you say more?

>      
>      Thanks,
>      Erman Korkut
>      Bloomberg L.P.
>      ekorkut1@bloomberg.net

Re: Patch to support streaming in riot for rdf/xml and jsonld formats

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Both are technically possible

 Please bear in mind that the resulting outputs may be exceptionally verbose, because generating either of those formats in a streaming fashion will prevent you from using many of the available syntactic sugars. In the case of JSON-LD we don’t maintain the core functionality ourselves, so you would need to provide contributions upstream to the third-party library we use. In the RDF/XML case you can contribute directly to Jena.

 Contributions are always welcome

 As an aside, what is the value of producing such large data sets in those formats? There is a reason that the community has standardised on other, more compact formats for large-scale data exchange.

Rob
