Posted to users@jena.apache.org by Martynas Jusevičius <ma...@graphity.org> on 2015/05/10 22:48:26 UTC

Implementing RDF reader

Hey all,

I want to refactor my RDF/POST parser into a Jena-compatible reader.
An example of the format can be found here:
http://www.lsrn.org/semweb/rdfpost.html#sec-examples

The documentation suggests implementing ReaderRIOT interface:
https://github.com/apache/jena/blob/master/jena-arq/src-examples/arq/examples/riot/ExRIOT_5.java

However, if I look at (what I think is) existing readers such as
Turtle for example, they do not seem to implement ReaderRIOT:
https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/riot/lang/LangTurtleBase.java

What is the explanation for that?

Do I need to tokenize the InputStream myself or is there some
machinery I can reuse?

Martynas
graphityhq.com

Re: Implementing RDF reader

Posted by Andy Seaborne <an...@apache.org>.
On 14/05/15 20:27, Martynas Jusevičius wrote:
> Andy,
>
> I took a crack at it:
> https://github.com/Graphity/graphity-core/blob/master/src/main/java/org/graphity/core/riot/lang/RDFPostReader.java
> https://github.com/Graphity/graphity-core/blob/master/src/main/java/org/graphity/core/riot/lang/TokenizerText.java

TokenizerRDFPost

I'd drop the "extends TokenizerText", or at least write an
AbstractTokenizerText with the machinery you want and
"abstract protected Token parseToken".
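
As a rough illustration of that split (shared machinery in an abstract base, one abstract parseToken per language), here is a minimal sketch in plain Java. It does not use Jena's classes: AbstractTokenizerSketch and FormTokenizerSketch are hypothetical names, tokens are plain Strings standing in for Jena's Token, and the "=" / "&" token rules are only an assumption about what a form-encoded tokenizer might need.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical base class: holds only the shared iterator and lookahead machinery.
abstract class AbstractTokenizerSketch implements Iterator<String> {
    protected static final int EOF = -1;
    private final Reader reader;
    private int lookahead;      // next unread character, or EOF
    private String nextToken;   // token prepared by hasNext(), or null

    protected AbstractTokenizerSketch(Reader reader) {
        this.reader = reader;
        this.lookahead = readRaw();
    }

    private int readRaw() {
        try { return reader.read(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Look at the next character without consuming it.
    protected final int peekChar() { return lookahead; }

    // Consume and return the next character (EOF stays EOF).
    protected final int readChar() {
        int c = lookahead;
        if (c != EOF) lookahead = readRaw();
        return c;
    }

    // Language-specific part: consume characters and return one token.
    // Only called when at least one character is left.
    protected abstract String parseToken();

    @Override public final boolean hasNext() {
        if (nextToken == null && lookahead != EOF) nextToken = parseToken();
        return nextToken != null;
    }

    @Override public final String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String t = nextToken;
        nextToken = null;
        return t;
    }
}

// Hypothetical subclass: tokens are "=", "&", and runs of any other characters.
class FormTokenizerSketch extends AbstractTokenizerSketch {
    FormTokenizerSketch(Reader reader) { super(reader); }

    @Override protected String parseToken() {
        int c = peekChar();
        if (c == '=' || c == '&') { readChar(); return String.valueOf((char) c); }
        StringBuilder sb = new StringBuilder();
        while (peekChar() != EOF && peekChar() != '=' && peekChar() != '&')
            sb.append((char) readChar());
        return sb.toString();
    }

    // Convenience helper for trying the tokenizer on a String.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        new FormTokenizerSketch(new StringReader(input)).forEachRemaining(out::add);
        return out;
    }
}
```

With this shape the subclass contains nothing but its own token rules, so there is no inherited Turtle-specific code left to get in the way.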

Throw out all unused code so it won't accidentally get in the way in 
the future.

(If you do this, please contribute it - it would be useful and maybe 
should have been done originally if it makes no speed difference.)


>
> It was surely one of the more labor-intensive pieces of code in a while...

That means you are on the right track!  When a parser isn't tedious it 
is either not helpful or slow :-)

>
> Works with the example from RDF/POST spec, but I need to do more
> testing. Probably could be more DRY as well. If you have some advice,
> please let me know.

For grammars and tokenizers, comprehensive testing of each pays big 
rewards.  There is not much worse than chasing bugs when the core 
machinery is not doing the right thing.  Tests pin that down and make 
you think of every case that can come up.

For speed, the tokenizer is more likely to be the bottleneck. 
PeekReader should give reasonable (for Java) I/O speed for 
one-character lookahead tokenizing.
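
To make "one-character lookahead" concrete, here is a minimal sketch of the idea in plain Java. It is not Jena's PeekReader: LookaheadReaderSketch and its method names are hypothetical. The block buffer is one common way to keep per-character cost low, and the line/column counters are there only because parser code in this thread passes currLine and currCol along when building IRIs.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a peek-style reader: block reads, one-char lookahead.
final class LookaheadReaderSketch {
    static final int EOF = -1;
    private final Reader source;
    private final char[] buffer;
    private int pos = 0;               // index of the next character in buffer
    private int limit = 0;             // number of valid characters in buffer
    private boolean exhausted = false; // true once the source has reported end of input
    private long line = 1;
    private long col = 1;

    LookaheadReaderSketch(Reader source, int bufferSize) {
        if (bufferSize < 1) throw new IllegalArgumentException("bufferSize must be >= 1");
        this.source = source;
        this.buffer = new char[bufferSize];
    }

    // Ensure at least one character is buffered; returns false at end of input.
    private boolean fill() {
        if (pos < limit) return true;
        if (exhausted) return false;
        try {
            int n;
            do { n = source.read(buffer, 0, buffer.length); } while (n == 0);
            if (n < 0) { exhausted = true; return false; }
            pos = 0;
            limit = n;
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Return the next character without consuming it, or EOF.
    int peekChar() { return fill() ? buffer[pos] : EOF; }

    // Consume and return the next character, or EOF; updates line and column.
    int readChar() {
        if (!fill()) return EOF;
        char c = buffer[pos++];
        if (c == '\n') { line++; col = 1; } else { col++; }
        return c;
    }

    long getLine() { return line; }
    long getCol() { return col; }
}
```

A tokenizer built on this decides what kind of token comes next from peekChar() alone and then consumes with readChar(), so it never needs to push characters back.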

	Andy




Re: Implementing RDF reader

Posted by Martynas Jusevičius <ma...@graphity.org>.
Andy,

I took a crack at it:
https://github.com/Graphity/graphity-core/blob/master/src/main/java/org/graphity/core/riot/lang/RDFPostReader.java
https://github.com/Graphity/graphity-core/blob/master/src/main/java/org/graphity/core/riot/lang/TokenizerText.java

It was surely one of the more labor-intensive pieces of code in a while...

Works with the example from RDF/POST spec, but I need to do more
testing. Probably could be more DRY as well. If you have some advice,
please let me know.

Martynas
graphityhq.com


Re: Implementing RDF reader

Posted by Andy Seaborne <an...@apache.org>.
On 11/05/15 20:28, Martynas Jusevičius wrote:
> Thanks Andy.
>
> I have a parser that works on String, but this time I want to do it
> right and make it streaming and plug it into Jena at the low level.
>
> It seems that I should be able to reuse some code from TokenizerText.
>
> I understand StreamRDF is used to sink the triples, but what about
> ParserProfile? I see LangTurtleBase uses it:
>
>          org.apache.jena.iri.IRI iri = profile.makeIRI(iriStr,
> currLine, currCol) ;
>
> How do I construct an instance of ParserProfile? Or is there an
> alternative way to construct IRIs etc.?

RiotLib.profile

	Andy

>
> Martynas
>


Re: Implementing RDF reader

Posted by Martynas Jusevičius <ma...@graphity.org>.
Thanks Andy.

I have a parser that works on String, but this time I want to do it
right and make it streaming and plug it into Jena at the low level.

It seems that I should be able to reuse some code from TokenizerText.

I understand StreamRDF is used to sink the triples, but what about
ParserProfile? I see LangTurtleBase uses it:

        org.apache.jena.iri.IRI iri = profile.makeIRI(iriStr,
currLine, currCol) ;

How do I construct an instance of ParserProfile? Or is there an
alternative way to construct IRIs etc.?

Martynas


Re: Implementing RDF reader

Posted by Andy Seaborne <an...@apache.org>.
On 10/05/15 21:48, Martynas Jusevičius wrote:
> Hey all,
>
> I want to refactor my RDF/POST parser into a Jena-compatible reader.
> An example of the format can be found here:
> http://www.lsrn.org/semweb/rdfpost.html#sec-examples
>
> The documentation suggests implementing ReaderRIOT interface:
> https://github.com/apache/jena/blob/master/jena-arq/src-examples/arq/examples/riot/ExRIOT_5.java
>
> However, if I look at (what I think is) existing readers such as
> Turtle for example, they do not seem to implement ReaderRIOT:
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/riot/lang/LangTurtleBase.java
>
> What is the explanation for that?

Hi Martynas,

It is historical - the Turtle-derived parsers emerged with the 
RiotReader interface and some code is/was around that used that interface.

ReaderRIOTLang is the cross-over code from the proper interface 
ReaderRIOT to RiotReader. RiotReader is a fixed set of parsers.

This can be sorted out in Jena3.

>
> Do I need to tokenize the InputStream myself or is there some
> machinery I can reuse?

The Turtle-world tokenizer is TokenizerText.  It is specific to Turtle terms.

Any tokenizing for a new language is often, in my experience, very 
sensitive to the language details.

If you are used to javacc, and performance isn't critical at scale, 
that's a good tool.

RIOT uses custom I/O for speed; Jena used to have a javacc parser for 
Turtle but Turtle is sufficiently simple that a hand-written parser is 
doable.  A hand-written tokenizer is for speed at scale (big files - 
about 2x faster than basic javacc tokenizing) but you need large input 
to make it worthwhile.  N-Triples dumps of databases make it worthwhile.

If you do rdfpost -> Turtle (string manipulation), then you can parse 
the Turtle as normal.  Downside: Error messages may be confusing as they 
refer to the Turtle, not the input string.

Splitting up the query string, with all the HTTP escaping rules, can be 
done with library code (see FusekiLib.parseQueryString [no longer used, 
but it works without consuming the body, unlike the servlet operations 
which combine form and query string processing]), and there are probably 
lots of better code examples on the web.
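
For the splitting step, a minimal sketch using only the JDK might look like the following. It is not FusekiLib.parseQueryString: FormPairsSketch is a hypothetical name, UTF-8 is assumed as the encoding, and URLDecoder.decode(String, Charset) needs Java 10 or later. The input is split on "&" and "=" before decoding, so escaped separators inside values survive.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits application/x-www-form-urlencoded text into ordered pairs.
final class FormPairsSketch {
    private FormPairsSketch() {}

    // Returns name/value pairs in input order; repeated names are kept.
    static List<Map.Entry<String, String>> parse(String encoded) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        if (encoded == null || encoded.isEmpty()) return pairs;
        for (String part : encoded.split("&")) {
            if (part.isEmpty()) continue;              // tolerate "a=1&&b=2"
            int eq = part.indexOf('=');                // first '=' separates name from value
            String rawName = eq < 0 ? part : part.substring(0, eq);
            String rawValue = eq < 0 ? "" : part.substring(eq + 1);
            pairs.add(new SimpleImmutableEntry<>(decode(rawName), decode(rawValue)));
        }
        return pairs;
    }

    // Decodes %XX escapes and '+' (as space); throws IllegalArgumentException on bad escapes.
    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }
}
```

For example, FormPairsSketch.parse("a=1&b=x%20y&a=2") gives [a=1, b=x y, a=2]. A list of pairs is used rather than a Map so that order and repeated names are preserved, which matters if the format depends on field order.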

	Andy
>
> Martynas
> graphityhq.com
>