Posted to users@jena.apache.org by Laurent Pellegrino <la...@gmail.com> on 2011/05/26 16:37:23 UTC

Reverse operation for FmtUtils.stringForNode(...)

Hi all,

I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
String. Now, I have to perform the reverse operation: from the String
I want to create the Node. Is there a class and method to do that from
the ARQ library?

It seems that NodecLib.decode(...) does the trick, but it is in the TDB
library and I am not sure whether it works with any output from
FmtUtils.stringForNode(...).

Kind Regards,

Laurent

Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Andy Seaborne <an...@epimorphics.com>.

On 27/05/11 10:38, Andy Seaborne wrote:
>
>
> On 26/05/11 15:37, Laurent Pellegrino wrote:
>> Hi all,
>>
>> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
>> String. Now, I have to perform the reverse operation: from the String
>> I want to create the Node. Is there a class and method to do that from
>> the ARQ library?
>>
>> It seems that NodecLib.decode(...) does the trick, but it is in the TDB
>> library and I am not sure whether it works with any output from
>> FmtUtils.stringForNode(...).
>>
>> Kind Regards,
>>
>> Laurent
>
> There are ways to reverse the process - too many in fact.
>
> Simple: SSE.parseNode: String -> Node
>
> It uses a javacc parser so the overall efficiency isn't ideal.
>
> But RIOT is in the process of reworking I/O for efficiency; the input
> side is the area that is most finished. The tokenizer will do what you
> want.
>
> What's missing in RIOT is Node to stream writing without using FmtUtils
> -- this is OutputLangUtils which is unfinished. FmtUtils creates
> intermediate strings, when the output could be straight to a stream,
> avoiding a copy and the temporary object allocation.
>
> The Tokenizer is:
>
> interface Tokenizer extends Iterator<Token>
>
> and see org.openjena.riot.tokens.TokenizerFactory
>
> especially if you have a sequence of them to parse ... like a TSV file.
> But you will have to manage newlines since, to the tokenizer, they are
> whitespace like anything else.

This does not stop Tokenizer being used as-is, because you can check the 
line number with Tokenizer.getLine(). When it changes, you're on a new 
line of the TSV file.
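
That pattern can be sketched without the RIOT classes. The snippet below is a stdlib-only stand-in (the Tok record and groupRows method are hypothetical, standing in for RIOT's Token and a Tokenizer that reports getLine()): it groups a flat token stream back into TSV rows by watching for the line number to change.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class TsvRowGrouper {
    // Stand-in for a RIOT Token: its text plus the line it was read on,
    // as Tokenizer.getLine() would report it.
    record Tok(String text, long line) {}

    // Group a flat token stream into rows: start a new row whenever the
    // reported line number changes, since newlines are otherwise just
    // whitespace to the tokenizer.
    static List<List<String>> groupRows(Iterator<Tok> tokens) {
        List<List<String>> rows = new ArrayList<>();
        long currentLine = Long.MIN_VALUE;
        List<String> row = null;
        while (tokens.hasNext()) {
            Tok t = tokens.next();
            if (t.line() != currentLine) {   // line changed => new TSV row
                row = new ArrayList<>();
                rows.add(row);
                currentLine = t.line();
            }
            row.add(t.text());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two TSV rows of two terms each, presented as one flat stream.
        List<Tok> toks = List.of(
                new Tok("<http://example/a>", 1), new Tok("\"x\"", 1),
                new Tok("<http://example/b>", 2), new Tok("\"y\"", 2));
        System.out.println(groupRows(toks.iterator()));
    }
}
```

With the real Tokenizer the same loop would call tokenizer.getLine() instead of carrying the line number inside the token.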

	Andy

>
>
> There is some stuff in my scratch area for streams of tuples of RDF
> terms and variables:
>
> https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/
>
>
> TokenInputStream and TokenOutputStream might be useful.
>
> Unlike TSV, a tuple of terms is a number of RDF terms, terminated by a
> DOT (not newline).
>
> This could be useful to JENA-44, JENA-45 and JENA-69
>
> I'm keen that we create a single solid I/O layer so it can be tested and
> optimized, then shared amongst all the code doing I/O-related things.
>
> Nodec is an interface specialized to ByteBuffers for file, not
> stream, I/O. File I/O can be random access.
>
> Andy

Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Andy Seaborne <an...@epimorphics.com>.

On 14/06/11 12:11, Paolo Castagna wrote:
> Andy Seaborne wrote:
>>
>>
>> On 14/06/11 10:24, Paolo Castagna wrote:
>>> Thank you Andy.
>>>
>>> Andy Seaborne wrote:
>>>> Missed the important part ....
>>>>
>>>> Any blank node written as _:label will be subject to label scope
>>>> rules, that is, per file, and not bNode preserving (that's why TDB
>>>> does its own thing).
>>>>
>>>> The tokenizer knows <_:xyz> "URIs" which create bNodes with the xyz as
>>>> the internal label.
>>>
>>> Is ":" a legal character in the xyz part of the bNode internal label?
>>
>> Yes. See Tokenizer.
>
> If ':' is a legal character in the bNode internal label, I don't understand
> why "_:foo:bar" is tokenized into [BNODE:foo][PREFIXED_NAME::bar] rather
> than [BNODE:foo:bar].

Internal label != external label.

":" is illegal in the external label.  It is the prefix name divider.

"a:b:c" is "a:b :c"

That's how Turtle is defined.
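
That greedy split can be reproduced with a stdlib-only sketch of the label scan (an approximation of the TokenizerText loop Paolo quotes below; Character.isLetterOrDigit stands in for its isAlphaNumeric):

```java
public class BNodeLabelScan {
    // Approximation of TokenizerText's blank node label loop:
    // accept letters, digits, '-' and '_'; anything else (including ':')
    // ends the label.
    static String readLabel(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (!Character.isLetterOrDigit(ch) && ch != '-' && ch != '_')
                break;                       // ':' stops the scan here
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // For "_:foo:bar" the scan after "_:" reads only "foo";
        // ":bar" is then tokenized separately as a prefixed name.
        System.out.println(readLabel("foo:bar"));
    }
}
```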

>
> Looking at TokenizerText.java:
>
> 599 // Blank node label: letters, numbers and '-', '_'
> 600 // Strictly, can't start with "-" or digits.
> ...
> 620 for(;;)
> 621 {
> 622 int ch = reader.peekChar() ;
> 623 if ( ch == EOF )
> 624 break ;
> 625 if ( ! isAlphaNumeric(ch) && ch != '-' && ch != '_' )
> 626 break ;
> 627 reader.readChar() ;
> 628 stringBuilder.append((char)ch) ;
> 629 }
>
> If ch is ':' the for(;;) loop ends at line 626, therefore the bNode
> internal label is read only up to the first ':'.
>
> Should we add ch != ':' at line 625?

It would not be compliant with Turtle - see example above.

>
> 625 if ( ! isAlphaNumeric(ch) && ch != '-' && ch != '_' && ch != ':' )
>
> Even if I do that, I am left with the problem that when I create bNodes
> with
> Node_Blank.createAnon() the bNode internal label generated by Jena can
> be of
> the form "-4ceedaaf:1308da2cdd0:-7fff" (i.e. it starts with '-').
>
> TokenizerText explicitly forbids a bNode internal label from starting with '-':

Implement the <_:xyz> form - that's what it is there for.  See 
ParserBase.createNode in SPARQL.

The RDF WG is going to publish a skolemization scheme but it's a bit 
verbose for this usage.

Any _:label is subject to the label rules of the parser and the syntax 
rules of the token language, both of which mean _:internal will not 
work. Labels are scoped to the file; they are NOT the internal label. 
Messing around with ParserProfile can only fix one issue, but the lexical 
rule for bNode labels does not allow the full set of characters that an 
internal label might use, and it's not configurable.
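
One illustrative way around the restricted alphabet (not what RIOT or TDB actually does; encode/decode below are hypothetical helpers) is a reversible escape: map an arbitrary internal label into letters and digits only, so it fits the lexical rule, and decode it on the way back in.

```java
public class SafeLabel {
    // Encode an arbitrary internal label into letters and digits only,
    // reversibly, by hex-escaping everything else. 'X' is the escape
    // character and is itself escaped, so decoding is unambiguous.
    static String encode(String label) {
        StringBuilder sb = new StringBuilder("B");  // ensure a legal first char
        for (char ch : label.toCharArray()) {
            if (Character.isLetterOrDigit(ch) && ch != 'X')
                sb.append(ch);
            else
                sb.append('X').append(String.format("%04x", (int) ch));
        }
        return sb.toString();
    }

    static String decode(String safe) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < safe.length(); i++) {   // skip the leading 'B'
            char ch = safe.charAt(i);
            if (ch == 'X') {                        // 'X' + 4 hex digits
                sb.append((char) Integer.parseInt(safe.substring(i + 1, i + 5), 16));
                i += 4;
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String label = "-4ceedaaf:1308da2cdd0:-7fff";   // a UID-style internal label
        String safe = encode(label);
        System.out.println(safe + " -> " + decode(safe));
    }
}
```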

>
> 614 if ( ! isAlphaNumeric(ch) && ch != '_' )
> 615 exception("Blank node label does not start with alphabetic or _
> :"+(char)ch) ;
>
> Alternatively, JenaParameters.disableBNodeUIDGeneration can be set to true
> so that AnonId does not use java.rmi.server.UID to generate bNode internal
> labels (which can start with a negative number and are rejected by the
> current TokenizerText).

It's a system global and there only for testing.  Any concurrent use of 
the parser must continue to work as per Turtle etc.  And it does not 
address the fact that the lexical rule for bNode labels does not allow 
all the characters you want.

	Andy

>
> Paolo
>
>>
>>> I am asking because the generated bNode internal labels seem to have ":"
>>> in it and if I use RIOT's Tokenizer there is a problem, I think.
>>>
>>> For example:
>>>
>>> 1 AnonId id = new AnonId("foo:bar");
>>> 2 Node node1 = Node_Blank.createAnon(id);
>>> 3 String str = NodeFmtLib.serialize(node1);
>>> 4 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
>>> 5 assertTrue (tokenizer.hasNext());
>>> 6 assertEquals("[BNODE:foo]", tokenizer.next().toString());
>>> 7 assertTrue (tokenizer.hasNext());
>>> 8 assertEquals("[PREFIXED_NAME::bar]", tokenizer.next().toString());
>>> 9 assertFalse (tokenizer.hasNext());
>>>
>>> At line 6, I would expect [BNODE:foo:bar] instead.
>>>
>>> Now, I am looking at Token{Input|Output}Stream, TSV{Input|Output} and
>>> OutputLangUtils.
>>>
>>> Paolo
>>>
>>>>
>>>> Andy
>>>>
>>>> On 13/06/11 21:33, Andy Seaborne wrote:
>>>>>
>>>>>
>>>>> On 13/06/11 16:55, Paolo Castagna wrote:
>>>>>> Andy Seaborne wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 26/05/11 15:37, Laurent Pellegrino wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node
>>>>>>>> to a
>>>>>>>> String. Now, I have to perform the reverse operation: from the
>>>>>>>> String
>>>>>>>> I want to create the Node. Is there a class and method to do that
>>>>>>>> from
>>>>>>>> the ARQ library?
>>>>>>>>
>>>>>>>> It seems that NodecLib.decode(...) does the trick, but it is in the
>>>>>>>> TDB
>>>>>>>> library and I am not sure whether it works with any output from
>>>>>>>> FmtUtils.stringForNode(...).
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>>
>>>>>>>> Laurent
>>>>>>>
>>>>>>> There are ways to reverse the process - too many in fact.
>>>>>>>
>>>>>>> Simple: SSE.parseNode: String -> Node
>>>>>>>
>>>>>>> It uses a javacc parser so the overall efficiency isn't ideal.
>>>>>>>
>>>>>>> But RIOT is in the process of reworking I/O for efficiency; the
>>>>>>> input
>>>>>>> side is the area that is most finished. The tokenizer will do
>>>>>>> what you
>>>>>>> want.
>>>>>>>
>>>>>>> What's missing in RIOT is Node to stream writing without using
>>>>>>> FmtUtils -- this is OutputLangUtils which is unfinished. FmtUtils
>>>>>>> creates intermediate strings, when the output could be straight to a
>>>>>>> stream, avoiding a copy and the temporary object allocation.
>>>>>>>
>>>>>>> The Tokenizer is:
>>>>>>>
>>>>>>> interface Tokenizer extends Iterator<Token>
>>>>>>>
>>>>>>> and see org.openjena.riot.tokens.TokenizerFactory
>>>>>>>
>>>>>>> especially if you have a sequence of them to parse ... like a TSV
>>>>>>> file. But you will have to manage newlines since, to the tokenizer,
>>>>>>> they are whitespace like anything else.
>>>>>>>
>>>>>>>
>>>>>>> There is some stuff in my scratch area for streams of tuples of RDF
>>>>>>> terms and variables:
>>>>>>>
>>>>>>> https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> TokenInputStream and TokenOutputStream might be useful.
>>>>>>>
>>>>>>> Unlike TSV, a tuple of terms is a number of RDF terms, terminated
>>>>>>> by a
>>>>>>> DOT (not newline).
>>>>>>>
>>>>>>> This could be useful to JENA-44, JENA-45 and JENA-69
>>>>>>
>>>>>> Hi,
>>>>>> I am looking at the code to serialize bindings (in relation to
>>>>>> JENA-44
>>>>>> and JENA-45) and I would like to use as much as I can of what is already
>>>>>> available in RIOT (and/or help to add what's missing, once I
>>>>>> understand
>>>>>> what is the right thing to do).
>>>>>>
>>>>>> I am having a few problems with blank nodes.
>>>>>>
>>>>>> This is a snippet of code which explains my problem:
>>>>>>
>>>>>> 1 Node node1 = Node_Blank.createAnon();
>>>>>> 2 String str = NodeFmtLib.serialize(node1);
>>>>>> 3 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
>>>>>> 4 Token token = tokenizer.next();
>>>>>> 5 Node node2 = token.asNode();
>>>>>> 6 assertEquals(node1, node2);
>>>>>>
>>>>>> I have two different problems.
>>>>>>
>>>>>> In the case the blank node id starts with a digit, the assertion at
>>>>>> line 6 fails with, for example:
>>>>>> "expected:<1c7b85b4:13089a0cb42:-7fff>
>>>>>> but was:<1c7b85b4>".
>>>>>>
>>>>>> If the blank node id is a negative number (i.e. it starts with a
>>>>>> '-'),
>>>>>> I have a RiotParseException: "org.openjena.riot.RiotParseException:
>>>>>> [line: 1, col: 3 ] Blank node label does not start with alphabetic or
>>>>>> _ :-" from TokenizerText.java line 1067.
>>>>>
>>>>> Setting onlySafeBNodeLabels to true might help.
>>>>>
>>>>> Because TDB does not use the tokenizer for decode, the raw path may be
>>>>> buggy.
>>>>>
>>>>> See OutputLangUtils because that has the prospect of streaming.
>>>>>
>>>>> We may need to switch OutputStream but there is OutStreamUTF8 if UTF-8
>>>>> encoding by std Java is costly.
>>>>>
>>>>>> What I am trying to do is to rewrite the BindingSerializer in the
>>>>>> patch
>>>>>> for JENA-44. These are the signatures of the two methods I am
>>>>>> implementing:
>>>>>>
>>>>>> public void serialize(Binding b, DataOutputStream out) throws
>>>>>> IOException
>>>>>> public Binding deserialize(DataInputStream in) throws IOException
>>>>>
>>>>> What's wrong with TokenOutputStream, which even does some buffering?
>>>>>
>>>>> Binding -> Nodes (you're only writing the RDF term values), beware of
>>>>> missing bindings. See the TSV output format that Laurent has been
>>>>> looking
>>>>> at.
>>>>>
>>>>> DataOutputStream can only write 16-bit lengths for strings - so you use
>>>>> write(byte[]) and much of the point of DataOutputStream is lost. It seems
>>>>> better to use our own internal interface and map to whatever mechanism is
>>>>> most appropriate, with the round-tripping between TokenOutputStream and
>>>>> TokenInputStream then tested.
>>>>>
>>>>>> At the moment, I am assuming all the bindings written in the same
>>>>>> file
>>>>>> have
>>>>>> the same variables and I am writing them only once at the
>>>>>> beginning of
>>>>>> the
>>>>>> file and after that I am serializing binding values only:
>>>>>>
>>>>>> for (Var var : vars) {
>>>>>> Node node = b.get(var);
>>>>>> byte[] buf = NodeFmtLib.serialize(node).getBytes("UTF-8");
>>>>>
>>>>> Whether this is faster than converting to UTF-8 directly into the
>>>>> stream will need testing but it's a point optimization. For now, it's
>>>>> the design that matters.
>>>>>
>>>>>> out.writeInt(buf.length);
>>>>>> out.write(buf);
>>>>>> }
>>>>>>
>>>>>> Should I try to use OutputLangUtils instead? And Writer(s) instead of
>>>>>> DataOutputStream(s)?
>>>>>>
>>>>>> Thanks,
>>>>>> Paolo
>>>>>>
>>>>>>>
>>>>>>> I'm keen that we create a single solid I/O layer so it can be tested and
>>>>>>> optimized, then shared amongst all the code doing I/O-related things.
>>>>>>>
>>>>>>> Nodec is an interface specialized to ByteBuffers for
>>>>>>> file, not
>>>>>>> stream, I/O. File I/O can be random access.
>>>>>>>
>>>>>>> Andy
>>>>>>
>>>
>

Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> 
> 
> On 14/06/11 10:24, Paolo Castagna wrote:
>> Thank you Andy.
>>
>> Andy Seaborne wrote:
>>> Missed the important part ....
>>>
>>> Any blank node written as _:label will be subject to label scope
>>> rules, that is, per file, and not bNode preserving (that's why TDB
>>> does its own thing).
>>>
>>> The tokenizer knows <_:xyz> "URIs" which create bNodes with the xyz as
>>> the internal label.
>>
>> Is ":" a legal character in the xyz part of the bNode internal label?
> 
> Yes.  See Tokenizer.

If ':' is a legal character in the bNode internal label, I don't understand
why "_:foo:bar" is tokenized into [BNODE:foo][PREFIXED_NAME::bar] rather
than [BNODE:foo:bar].

Looking at TokenizerText.java:

599    // Blank node label: letters, numbers and '-', '_'
600    // Strictly, can't start with "-" or digits.
...
620   for(;;)
621   {
622       int ch = reader.peekChar() ;
623       if ( ch == EOF )
624           break ;
625       if ( ! isAlphaNumeric(ch) && ch != '-' && ch != '_' )
626           break ;
627       reader.readChar() ;
628       stringBuilder.append((char)ch) ;
629   }

If ch is ':' the for(;;) loop ends at line 626, therefore the bNode
internal label is read only up to the first ':'.

Should we add ch != ':' at line 625?

625      if ( ! isAlphaNumeric(ch) && ch != '-' && ch != '_' && ch != ':' )

Even if I do that, I am left with the problem that when I create bNodes with
Node_Blank.createAnon() the bNode internal label generated by Jena can be of
the form "-4ceedaaf:1308da2cdd0:-7fff" (i.e. it starts with '-').

TokenizerText explicitly forbids a bNode internal label from starting with '-':

614      if ( ! isAlphaNumeric(ch) && ch != '_' )
615          exception("Blank node label does not start with alphabetic or _ 
:"+(char)ch) ;

Alternatively, JenaParameters.disableBNodeUIDGeneration can be set to true
so that AnonId does not use java.rmi.server.UID to generate bNode internal
labels (which can start with a negative number and are rejected by the
current TokenizerText).
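
For context, the shape of those UID labels can be seen directly (a small sketch: in practice java.rmi.server.UID renders as signed hex fields joined by ':', which is why ':' always appears and a leading '-' is possible):

```java
import java.rmi.server.UID;

public class UidLabelShape {
    public static void main(String[] args) {
        // e.g. "-4ceedaaf:1308da2cdd0:-7fff" - hex fields joined by ':',
        // both of which clash with the tokenizer's _:label lexical rule.
        String label = new UID().toString();
        System.out.println(label);
        System.out.println("contains ':' = " + label.contains(":"));
    }
}
```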

Paolo

> 
>> I am asking because the generated bNode internal labels seem to have ":"
>> in it and if I use RIOT's Tokenizer there is a problem, I think.
>>
>> For example:
>>
>> 1 AnonId id = new AnonId("foo:bar");
>> 2 Node node1 = Node_Blank.createAnon(id);
>> 3 String str = NodeFmtLib.serialize(node1);
>> 4 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
>> 5 assertTrue (tokenizer.hasNext());
>> 6 assertEquals("[BNODE:foo]", tokenizer.next().toString());
>> 7 assertTrue (tokenizer.hasNext());
>> 8 assertEquals("[PREFIXED_NAME::bar]", tokenizer.next().toString());
>> 9 assertFalse (tokenizer.hasNext());
>>
>> At line 6, I would expect [BNODE:foo:bar] instead.
>>
>> Now, I am looking at Token{Input|Output}Stream, TSV{Input|Output} and
>> OutputLangUtils.
>>
>> Paolo
>>
>>>
>>> Andy
>>>
>>> On 13/06/11 21:33, Andy Seaborne wrote:
>>>>
>>>>
>>>> On 13/06/11 16:55, Paolo Castagna wrote:
>>>>> Andy Seaborne wrote:
>>>>>>
>>>>>>
>>>>>> On 26/05/11 15:37, Laurent Pellegrino wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node 
>>>>>>> to a
>>>>>>> String. Now, I have to perform the reverse operation: from the 
>>>>>>> String
>>>>>>> I want to create the Node. Is there a class and method to do that
>>>>>>> from
>>>>>>> the ARQ library?
>>>>>>>
>>>>>>> It seems that NodecLib.decode(...) does the trick, but it is in the TDB
>>>>>>> library and I am not sure whether it works with any output from
>>>>>>> FmtUtils.stringForNode(...).
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>>
>>>>>>> Laurent
>>>>>>
>>>>>> There are ways to reverse the process - too many in fact.
>>>>>>
>>>>>> Simple: SSE.parseNode: String -> Node
>>>>>>
>>>>>> It uses a javacc parser so the overall efficiency isn't ideal.
>>>>>>
>>>>>> But RIOT is in the process of reworking I/O for efficiency; the input
>>>>>> side is the area that is most finished. The tokenizer will do what 
>>>>>> you
>>>>>> want.
>>>>>>
>>>>>> What's missing in RIOT is Node to stream writing without using
>>>>>> FmtUtils -- this is OutputLangUtils which is unfinished. FmtUtils
>>>>>> creates intermediate strings, when the output could be straight to a
>>>>>> stream, avoiding a copy and the temporary object allocation.
>>>>>>
>>>>>> The Tokenizer is:
>>>>>>
>>>>>> interface Tokenizer extends Iterator<Token>
>>>>>>
>>>>>> and see org.openjena.riot.tokens.TokenizerFactory
>>>>>>
>>>>>> especially if you have a sequence of them to parse ... like a TSV
>>>>>> file. But you will have to manage newlines since, to the tokenizer,
>>>>>> they are whitespace like anything else.
>>>>>>
>>>>>>
>>>>>> There is some stuff in my scratch area for streams of tuples of RDF
>>>>>> terms and variables:
>>>>>>
>>>>>> https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/ 
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> TokenInputStream and TokenOutputStream might be useful.
>>>>>>
>>>>>> Unlike TSV, a tuple of terms is a number of RDF terms, terminated by a
>>>>>> DOT (not newline).
>>>>>>
>>>>>> This could be useful to JENA-44, JENA-45 and JENA-69
>>>>>
>>>>> Hi,
>>>>> I am looking at the code to serialize bindings (in relation to JENA-44
>>>>> and JENA-45) and I would like to use as much as I can of what is already
>>>>> available in RIOT (and/or help to add what's missing, once I 
>>>>> understand
>>>>> what is the right thing to do).
>>>>>
>>>>> I am having a few problems with blank nodes.
>>>>>
>>>>> This is a snippet of code which explains my problem:
>>>>>
>>>>> 1 Node node1 = Node_Blank.createAnon();
>>>>> 2 String str = NodeFmtLib.serialize(node1);
>>>>> 3 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
>>>>> 4 Token token = tokenizer.next();
>>>>> 5 Node node2 = token.asNode();
>>>>> 6 assertEquals(node1, node2);
>>>>>
>>>>> I have two different problems.
>>>>>
>>>>> In the case the blank node id starts with a digit, the assertion at
>>>>> line 6 fails with, for example: "expected:<1c7b85b4:13089a0cb42:-7fff>
>>>>> but was:<1c7b85b4>".
>>>>>
>>>>> If the blank node id is a negative number (i.e. it starts with a '-'),
>>>>> I have a RiotParseException: "org.openjena.riot.RiotParseException:
>>>>> [line: 1, col: 3 ] Blank node label does not start with alphabetic or
>>>>> _ :-" from TokenizerText.java line 1067.
>>>>
>>>> Setting onlySafeBNodeLabels to true might help.
>>>>
>>>> Because TDB does not use the tokenizer for decode, the raw path may be
>>>> buggy.
>>>>
>>>> See OutputLangUtils because that has the prospect of streaming.
>>>>
>>>> We may need to switch OutputStream but there is OutStreamUTF8 if UTF-8
>>>> encoding by std Java is costly.
>>>>
>>>>> What I am trying to do is to rewrite the BindingSerializer in the 
>>>>> patch
>>>>> for JENA-44. These are the signatures of the two methods I am
>>>>> implementing:
>>>>>
>>>>> public void serialize(Binding b, DataOutputStream out) throws
>>>>> IOException
>>>>> public Binding deserialize(DataInputStream in) throws IOException
>>>>
>>>> What's wrong with TokenOutputStream, which even does some buffering?
>>>>
>>>> Binding -> Nodes (you're only writing the RDF term values), beware of
>>>> missing bindings. See the TSV output format that Laurent has been 
>>>> looking
>>>> at.
>>>>
>>>> DataOutputStream can only write 16-bit lengths for strings - so you use
>>>> write(byte[]) and much of the point of DataOutputStream is lost. It seems
>>>> better to use our own internal interface and map to whatever mechanism is
>>>> most appropriate, with the round-tripping between TokenOutputStream and
>>>> TokenInputStream then tested.
>>>>
>>>>> At the moment, I am assuming all the bindings written in the same file
>>>>> have
>>>>> the same variables and I am writing them only once at the beginning of
>>>>> the
>>>>> file and after that I am serializing binding values only:
>>>>>
>>>>> for (Var var : vars) {
>>>>> Node node = b.get(var);
>>>>> byte[] buf = NodeFmtLib.serialize(node).getBytes("UTF-8");
>>>>
>>>> Whether this is faster than converting to UTF-8 directly into the
>>>> stream will need testing but it's a point optimization. For now, it's
>>>> the design that matters.
>>>>
>>>>> out.writeInt(buf.length);
>>>>> out.write(buf);
>>>>> }
>>>>>
>>>>> Should I try to use OutputLangUtils instead? And Writer(s) instead of
>>>>> DataOutputStream(s)?
>>>>>
>>>>> Thanks,
>>>>> Paolo
>>>>>
>>>>>>
>>>>>> I'm keen that we create a single solid I/O layer so it can be tested and
>>>>>> optimized, then shared amongst all the code doing I/O-related things.
>>>>>>
>>>>>> Nodec is an interface specialized to ByteBuffers for file,
>>>>>> not
>>>>>> stream, I/O. File I/O can be random access.
>>>>>>
>>>>>> Andy
>>>>>
>>


Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Andy Seaborne <an...@epimorphics.com>.

On 14/06/11 10:24, Paolo Castagna wrote:
> Thank you Andy.
>
> Andy Seaborne wrote:
>> Missed the important part ....
>>
>> Any blank node written as _:label will be subject to label scope
>> rules, that is, per file, and not bNode preserving (that's why TDB
>> does it's own thing).
>>
>> The tokenizer knows <_:xyz> "URIs" which create bNodes with the xyz as
>> the internal label.
>
> Is ":" a legal character in the xyz part of the bNode internal label?

Yes.  See Tokenizer.

> I am asking because the generated bNode internal labels seem to have ":"
> in it and if I use RIOT's Tokenizer there is a problem, I think.
>
> For example:
>
> 1 AnonId id = new AnonId("foo:bar");
> 2 Node node1 = Node_Blank.createAnon(id);
> 3 String str = NodeFmtLib.serialize(node1);
> 4 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
> 5 assertTrue (tokenizer.hasNext());
> 6 assertEquals("[BNODE:foo]", tokenizer.next().toString());
> 7 assertTrue (tokenizer.hasNext());
> 8 assertEquals("[PREFIXED_NAME::bar]", tokenizer.next().toString());
> 9 assertFalse (tokenizer.hasNext());
>
> At line 6, I would expect [BNODE:foo:bar] instead.
>
> Now, I am looking at Token{Input|Output}Stream, TSV{Input|Output} and
> OutputLangUtils.
>
> Paolo
>
>>
>> Andy
>>
>> On 13/06/11 21:33, Andy Seaborne wrote:
>>>
>>>
>>> On 13/06/11 16:55, Paolo Castagna wrote:
>>>> Andy Seaborne wrote:
>>>>>
>>>>>
>>>>> On 26/05/11 15:37, Laurent Pellegrino wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
>>>>>> String. Now, I have to perform the reverse operation: from the String
>>>>>> I want to create the Node. Is there a class and method to do that
>>>>>> from
>>>>>> the ARQ library?
>>>>>>
>>>>>> It seems that NodecLib.decode(...) does the trick, but it is in the TDB
>>>>>> library and I am not sure whether it works with any output from
>>>>>> FmtUtils.stringForNode(...).
>>>>>>
>>>>>> Kind Regards,
>>>>>>
>>>>>> Laurent
>>>>>
>>>>> There are ways to reverse the process - too many in fact.
>>>>>
>>>>> Simple: SSE.parseNode: String -> Node
>>>>>
>>>>> It uses a javacc parser so the overall efficiency isn't ideal.
>>>>>
>>>>> But RIOT is in the process of reworking I/O for efficiency; the input
>>>>> side is the area that is most finished. The tokenizer will do what you
>>>>> want.
>>>>>
>>>>> What's missing in RIOT is Node to stream writing without using
>>>>> FmtUtils -- this is OutputLangUtils which is unfinished. FmtUtils
>>>>> creates intermediate strings, when the output could be straight to a
>>>>> stream, avoiding a copy and the temporary object allocation.
>>>>>
>>>>> The Tokenizer is:
>>>>>
>>>>> interface Tokenizer extends Iterator<Token>
>>>>>
>>>>> and see org.openjena.riot.tokens.TokenizerFactory
>>>>>
>>>>> especially if you have a sequence of them to parse ... like a TSV
>>>>> file. But you will have to manage newlines since, to the tokenizer,
>>>>> they are whitespace like anything else.
>>>>>
>>>>>
>>>>> There is some stuff in my scratch area for streams of tuples of RDF
>>>>> terms and variables:
>>>>>
>>>>> https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> TokenInputStream and TokenOutputStream might be useful.
>>>>>
>>>>> Unlike TSV, a tuple of terms is a number of RDF terms, terminated by a
>>>>> DOT (not newline).
>>>>>
>>>>> This could be useful to JENA-44, JENA-45 and JENA-69
>>>>
>>>> Hi,
>>>> I am looking at the code to serialize bindings (in relation to JENA-44
>>>> and JENA-45) and I would like to use as much as I can of what is already
>>>> available in RIOT (and/or help to add what's missing, once I understand
>>>> what is the right thing to do).
>>>>
>>>> I am having a few problems with blank nodes.
>>>>
>>>> This is a snippet of code which explains my problem:
>>>>
>>>> 1 Node node1 = Node_Blank.createAnon();
>>>> 2 String str = NodeFmtLib.serialize(node1);
>>>> 3 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
>>>> 4 Token token = tokenizer.next();
>>>> 5 Node node2 = token.asNode();
>>>> 6 assertEquals(node1, node2);
>>>>
>>>> I have two different problems.
>>>>
>>>> In the case the blank node id starts with a digit, the assertion at
>>>> line 6 fails with, for example: "expected:<1c7b85b4:13089a0cb42:-7fff>
>>>> but was:<1c7b85b4>".
>>>>
>>>> If the blank node id is a negative number (i.e. it starts with a '-'),
>>>> I have a RiotParseException: "org.openjena.riot.RiotParseException:
>>>> [line: 1, col: 3 ] Blank node label does not start with alphabetic or
>>>> _ :-" from TokenizerText.java line 1067.
>>>
>>> Setting onlySafeBNodeLabels to true might help.
>>>
>>> Because TDB does not use the tokenizer for decode, the raw path may be
>>> buggy.
>>>
>>> See OutputLangUtils because that has the prospect of streaming.
>>>
>>> We may need to switch OutputStream but there is OutStreamUTF8 if UTF-8
>>> encoding by std Java is costly.
>>>
>>>> What I am trying to do is to rewrite the BindingSerializer in the patch
>>>> for JENA-44. These are the signatures of the two methods I am
>>>> implementing:
>>>>
>>>> public void serialize(Binding b, DataOutputStream out) throws
>>>> IOException
>>>> public Binding deserialize(DataInputStream in) throws IOException
>>>
>>> What's wrong with TokenOutputStream, which even does some buffering?
>>>
>>> Binding -> Nodes (you're only writing the RDF term values), beware of
>>> missing bindings. See the TSV output format that Laurent has been looking
>>> at.
>>>
>>> DataOutputStream can only write 16-bit lengths for strings - so you use
>>> write(byte[]) and much of the point of DataOutputStream is lost. It seems
>>> better to use our own internal interface and map to whatever mechanism is
>>> most appropriate, with the round-tripping between TokenOutputStream and
>>> TokenInputStream then tested.
>>>
>>>> At the moment, I am assuming all the bindings written in the same file
>>>> have
>>>> the same variables and I am writing them only once at the beginning of
>>>> the
>>>> file and after that I am serializing binding values only:
>>>>
>>>> for (Var var : vars) {
>>>> Node node = b.get(var);
>>>> byte[] buf = NodeFmtLib.serialize(node).getBytes("UTF-8");
>>>
>>> Whether this is faster than converting to UTF-8 directly into the
>>> stream will need testing but it's a point optimization. For now, it's
>>> the design that matters.
>>>
>>>> out.writeInt(buf.length);
>>>> out.write(buf);
>>>> }
>>>>
>>>> Should I try to use OutputLangUtils instead? And Writer(s) instead of
>>>> DataOutputStream(s)?
>>>>
>>>> Thanks,
>>>> Paolo
>>>>
>>>>>
>>>>> I'm keen that we create a single solid I/O layer so it can be tested and
>>>>> optimized, then shared amongst all the code doing I/O-related things.
>>>>>
>>>>> Nodec is an interface specialized to ByteBuffers for file, not
>>>>> stream, I/O. File I/O can be random access.
>>>>>
>>>>> Andy
>>>>
>

Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Paolo Castagna <ca...@googlemail.com>.
Thank you Andy.

Andy Seaborne wrote:
> Missed the important part ....
> 
> Any blank node written as _:label will be subject to label scope rules, 
> that is, per file, and not bNode preserving (that's why TDB does its 
> own thing).
> 
> The tokenizer knows <_:xyz> "URIs" which create bNodes with the xyz as 
> the internal label.

Is ":" a legal character in the xyz part of the bNode internal label?

I am asking because the generated bNode internal labels seem to have ":"
in it and if I use RIOT's Tokenizer there is a problem, I think.

For example:

   1    AnonId id = new AnonId("foo:bar");
   2    Node node1 = Node_Blank.createAnon(id);
   3    String str = NodeFmtLib.serialize(node1);
   4    Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
   5    assertTrue (tokenizer.hasNext());
   6    assertEquals("[BNODE:foo]", tokenizer.next().toString());
   7    assertTrue (tokenizer.hasNext());
   8    assertEquals("[PREFIXED_NAME::bar]", tokenizer.next().toString());
   9    assertFalse (tokenizer.hasNext());

At line 6, I would expect [BNODE:foo:bar] instead.

Now, I am looking at Token{Input|Output}Stream, TSV{Input|Output} and
OutputLangUtils.

Paolo

> 
>     Andy
> 
> On 13/06/11 21:33, Andy Seaborne wrote:
>>
>>
>> On 13/06/11 16:55, Paolo Castagna wrote:
>>> Andy Seaborne wrote:
>>>>
>>>>
>>>> On 26/05/11 15:37, Laurent Pellegrino wrote:
>>>>> Hi all,
>>>>>
>>>>> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
>>>>> String. Now, I have to perform the reverse operation: from the String
>>>>> I want to create the Node. Is there a class and method to do that from
>>>>> the ARQ library?
>>>>>
>>>>> It seems that NodecLib.decode(...) does the trick, but it is in the TDB
>>>>> library and I am not sure whether it works with any output from
>>>>> FmtUtils.stringForNode(...).
>>>>>
>>>>> Kind Regards,
>>>>>
>>>>> Laurent
>>>>
>>>> There are ways to reverse the process - too many in fact.
>>>>
>>>> Simple: SSE.parseNode: String -> Node
>>>>
>>>> It uses a javacc parser so the overall efficiency isn't ideal.
>>>>
>>>> But RIOT is in the process of reworking I/O for efficiency; the input
>>>> side is the area that is most finished. The tokenizer will do what you
>>>> want.
>>>>
>>>> What's missing in RIOT is Node to stream writing without using
>>>> FmtUtils -- this is OutputLangUtils which is unfinished. FmtUtils
>>>> creates intermediate strings, when the output could be straight to a
>>>> stream, avoiding a copy and the temporary object allocation.
>>>>
>>>> The Tokenizer is:
>>>>
>>>> interface Tokenizer extends Iterator<Token>
>>>>
>>>> and see org.openjena.riot.tokens.TokenizerFactory
>>>>
>>>> especially if you have a sequence of them to parse ... like a TSV
>>>> file. But you will have to manage newlines since, to the tokenizer,
>>>> they are whitespace like anything else.
>>>>
>>>>
>>>> There is some stuff in my scratch area for streams of tuples of RDF
>>>> terms and variables:
>>>>
>>>> https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/ 
>>>>
>>>>
>>>>
>>>>
>>>> TokenInputStream and TokenOutputStream might be useful.
>>>>
>>>> Until TSV, a tuple of terms is a number of RDF terms, terminated by a
>>>> DOT (not newline).
>>>>
>>>> This could be useful to JENA-44, JENA-45 and JENA-69
>>>
>>> Hi,
>>> I am looking at the code to serialize bindings (in relation to JENA-44
>>> and JENA-45) and I would like to use as much as I can what is already
>>> available in RIOT (and/or help to add what's missing, once I understand
>>> what is the right thing to do).
>>>
>>> I am having a few problems with blank nodes.
>>>
>>> This is a snipped of code which explains my problem:
>>>
>>> 1 Node node1 = Node_Blank.createAnon();
>>> 2 String str = NodeFmtLib.serialize(node1);
>>> 3 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
>>> 4 Token token = tokenizer.next();
>>> 5 Node node2 = token.asNode();
>>> 6 assertEquals(node1, node2);
>>>
>>> I have two different problems.
>>>
>>> In the case the blank node id starts with a digit, the assertion at
>>> line 6 fails with, for example: "expected:<1c7b85b4:13089a0cb42:-7fff>
>>> but was:<1c7b85b4>".
>>>
>>> If the blank node id is a negative number (i.e. it starts with a '-'),
>>> I have a RiotParserException: "org.openjena.riot.RiotParseException:
>>> [line: 1, col: 3 ] Blank node label does not start with alphabetic or
>>> _ :-" from TokenizerText.java line 1067.
>>
>> Setting onlySafeBNodeLabels to true might help.
>>
>> Because TDB does not use the tokenizer for decode, the raw path may be
>> buggy.
>>
>> See OutputLangUtils because that has the prospect of streaming.
>>
>> We may need to switch OutputStream but there is OutStreamUTF8 if UTF-8
>> encoding by std Java is costly.
>>
>>> What I am trying to do is to rewrite the BindingSerializer in the patch
>>> for JENA-44. These are the signatures of the two methods I am
>>> implementing:
>>>
>>> public void serialize(Binding b, DataOutputStream out) throws 
>>> IOException
>>> public Binding deserialize(DataInputStream in) throws IOException
>>
>> What's wrong with TokenOutputStream which even does some buffering.
>>
>> Binding -> Nodes (you're only writing the RDF term values), beware of
>> missingbindings. See the TSV output format that Laurent has been looking
>> at.
>>
>> DataOutputStream can only write 16bit lengths for strings - so you use
>> write(byte[]) and much of the point of DataOutputStream is lost. Seems
>> better to be to use our own internal interface and map to whatever
>> mechanism is most appropriate. testing the round-tripping between
>> TokenOutputStream and TokenInputStream being then done.
>>
>>> At the moment, I am assuming all the bindings written in the same file
>>> have
>>> the same variables and I am writing them only once at the beginning of
>>> the
>>> file and after that I am serializing binding values only:
>>>
>>> for (Var var : vars) {
>>> Node node = b.get(var);
>>> byte[] buf = NodeFmtLib.serialize(node).getBytes("UTF-8");
>>
>> whether this is faster that converting to UTF-8 duirectly into the
>> stream will need testing but it's a point optimization. For now, it's
>> the design that matters.
>>
>>> out.writeInt(buf.length);
>>> out.write(buf);
>>> }
>>>
>>> Should I try to use OutputLangUtils instead? And Writer(s) instead of
>>> DataOutputStream(s)?
>>>
>>> Thanks,
>>> Paolo
>>>
>>>>
>>>> I'm keen that we create a single solid I/O layer so it can teste and
>>>> optimized then shared amongst all the code doing I/O related things.
>>>>
>>>> Nodec is an interface specializes attempt to ByteBuffers for file, not
>>>> stream I/O. File I/O can be random access.
>>>>
>>>> Andy
>>>


Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Andy Seaborne <an...@epimorphics.com>.
Missed the important part ....

Any blank node written as _:label will be subject to label scope rules, 
that is, per file, and not bNode-preserving (that's why TDB does its 
own thing).

The tokenizer knows <_:xyz> "URIs" which create bNodes with the xyz as 
the internal label.

	Andy

On 13/06/11 21:33, Andy Seaborne wrote:
>
>
> On 13/06/11 16:55, Paolo Castagna wrote:
>> Andy Seaborne wrote:
>>>
>>>
>>> On 26/05/11 15:37, Laurent Pellegrino wrote:
>>>> Hi all,
>>>>
>>>> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
>>>> String. Now, I have to perform the reverse operation: from the String
>>>> I want to create the Node. Is there a class and method to do that from
>>>> the ARQ library?
>>>>
>>>> It seems that NodecLib.decode(...) do the trick but it is in the TDB
>>>> library and I am not sure that it works with any output from
>>>> FmtUtils.stringForNode(...)?
>>>>
>>>> Kind Regards,
>>>>
>>>> Laurent
>>>
>>> There are ways to reverse the process - too many in fact.
>>>
>>> Simple: SSE.parseNode: String -> Node
>>>
>>> It uses a javacc parser so the overall efficiency isn't ideal.
>>>
>>> But RIOT is in the process of reworking I/O for efficiency; the input
>>> side is the area that is most finished. The tokenizer will do what you
>>> want.
>>>
>>> What's missing in RIOT is Node to stream writing without using
>>> FmtUtils -- this is OutputLangUtils which is unfinished. FmtUtils
>>> creates intermediate strings, when the output could be straight to a
>>> stream, avoiding a copy and the temporary object allocation.
>>>
>>> The Tokenizer is:
>>>
>>> interface Tokenizer extends Iterator<Token>
>>>
>>> and see org.openjena.riot.tokens.TokenizerFactory
>>>
>>> especially if you have a sequence of them to parse ... like a TSV
>>> file. But you will have to manage newlines as to the tokenizer they
>>> are whitespace like anything else.
>>>
>>>
>>> There is some stuff in my scratch area for streams of tuples of RDF
>>> terms and variables:
>>>
>>> https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/
>>>
>>>
>>>
>>> TokenInputStream and TokenOutputStream might be useful.
>>>
>>> Until TSV, a tuple of terms is a number of RDF terms, terminated by a
>>> DOT (not newline).
>>>
>>> This could be useful to JENA-44, JENA-45 and JENA-69
>>
>> Hi,
>> I am looking at the code to serialize bindings (in relation to JENA-44
>> and JENA-45) and I would like to use as much as I can what is already
>> available in RIOT (and/or help to add what's missing, once I understand
>> what is the right thing to do).
>>
>> I am having a few problems with blank nodes.
>>
>> This is a snipped of code which explains my problem:
>>
>> 1 Node node1 = Node_Blank.createAnon();
>> 2 String str = NodeFmtLib.serialize(node1);
>> 3 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
>> 4 Token token = tokenizer.next();
>> 5 Node node2 = token.asNode();
>> 6 assertEquals(node1, node2);
>>
>> I have two different problems.
>>
>> In the case the blank node id starts with a digit, the assertion at
>> line 6 fails with, for example: "expected:<1c7b85b4:13089a0cb42:-7fff>
>> but was:<1c7b85b4>".
>>
>> If the blank node id is a negative number (i.e. it starts with a '-'),
>> I have a RiotParserException: "org.openjena.riot.RiotParseException:
>> [line: 1, col: 3 ] Blank node label does not start with alphabetic or
>> _ :-" from TokenizerText.java line 1067.
>
> Setting onlySafeBNodeLabels to true might help.
>
> Because TDB does not use the tokenizer for decode, the raw path may be
> buggy.
>
> See OutputLangUtils because that has the prospect of streaming.
>
> We may need to switch OutputStream but there is OutStreamUTF8 if UTF-8
> encoding by std Java is costly.
>
>> What I am trying to do is to rewrite the BindingSerializer in the patch
>> for JENA-44. These are the signatures of the two methods I am
>> implementing:
>>
>> public void serialize(Binding b, DataOutputStream out) throws IOException
>> public Binding deserialize(DataInputStream in) throws IOException
>
> What's wrong with TokenOutputStream which even does some buffering.
>
> Binding -> Nodes (you're only writing the RDF term values), beware of
> missingbindings. See the TSV output format that Laurent has been looking
> at.
>
> DataOutputStream can only write 16bit lengths for strings - so you use
> write(byte[]) and much of the point of DataOutputStream is lost. Seems
> better to be to use our own internal interface and map to whatever
> mechanism is most appropriate. testing the round-tripping between
> TokenOutputStream and TokenInputStream being then done.
>
>> At the moment, I am assuming all the bindings written in the same file
>> have
>> the same variables and I am writing them only once at the beginning of
>> the
>> file and after that I am serializing binding values only:
>>
>> for (Var var : vars) {
>> Node node = b.get(var);
>> byte[] buf = NodeFmtLib.serialize(node).getBytes("UTF-8");
>
> whether this is faster that converting to UTF-8 duirectly into the
> stream will need testing but it's a point optimization. For now, it's
> the design that matters.
>
>> out.writeInt(buf.length);
>> out.write(buf);
>> }
>>
>> Should I try to use OutputLangUtils instead? And Writer(s) instead of
>> DataOutputStream(s)?
>>
>> Thanks,
>> Paolo
>>
>>>
>>> I'm keen that we create a single solid I/O layer so it can teste and
>>> optimized then shared amongst all the code doing I/O related things.
>>>
>>> Nodec is an interface specializes attempt to ByteBuffers for file, not
>>> stream I/O. File I/O can be random access.
>>>
>>> Andy
>>

Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Andy Seaborne <an...@epimorphics.com>.

On 13/06/11 16:55, Paolo Castagna wrote:
> Andy Seaborne wrote:
>>
>>
>> On 26/05/11 15:37, Laurent Pellegrino wrote:
>>> Hi all,
>>>
>>> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
>>> String. Now, I have to perform the reverse operation: from the String
>>> I want to create the Node. Is there a class and method to do that from
>>> the ARQ library?
>>>
>>> It seems that NodecLib.decode(...) do the trick but it is in the TDB
>>> library and I am not sure that it works with any output from
>>> FmtUtils.stringForNode(...)?
>>>
>>> Kind Regards,
>>>
>>> Laurent
>>
>> There are ways to reverse the process - too many in fact.
>>
>> Simple: SSE.parseNode: String -> Node
>>
>> It uses a javacc parser so the overall efficiency isn't ideal.
>>
>> But RIOT is in the process of reworking I/O for efficiency; the input
>> side is the area that is most finished. The tokenizer will do what you
>> want.
>>
>> What's missing in RIOT is Node to stream writing without using
>> FmtUtils -- this is OutputLangUtils which is unfinished. FmtUtils
>> creates intermediate strings, when the output could be straight to a
>> stream, avoiding a copy and the temporary object allocation.
>>
>> The Tokenizer is:
>>
>> interface Tokenizer extends Iterator<Token>
>>
>> and see org.openjena.riot.tokens.TokenizerFactory
>>
>> especially if you have a sequence of them to parse ... like a TSV
>> file. But you will have to manage newlines as to the tokenizer they
>> are whitespace like anything else.
>>
>>
>> There is some stuff in my scratch area for streams of tuples of RDF
>> terms and variables:
>>
>> https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/
>>
>>
>> TokenInputStream and TokenOutputStream might be useful.
>>
>> Until TSV, a tuple of terms is a number of RDF terms, terminated by a
>> DOT (not newline).
>>
>> This could be useful to JENA-44, JENA-45 and JENA-69
>
> Hi,
> I am looking at the code to serialize bindings (in relation to JENA-44
> and JENA-45) and I would like to use as much as I can what is already
> available in RIOT (and/or help to add what's missing, once I understand
> what is the right thing to do).
>
> I am having a few problems with blank nodes.
>
> This is a snipped of code which explains my problem:
>
> 1 Node node1 = Node_Blank.createAnon();
> 2 String str = NodeFmtLib.serialize(node1);
> 3 Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
> 4 Token token = tokenizer.next();
> 5 Node node2 = token.asNode();
> 6 assertEquals(node1, node2);
>
> I have two different problems.
>
> In the case the blank node id starts with a digit, the assertion at
> line 6 fails with, for example: "expected:<1c7b85b4:13089a0cb42:-7fff>
> but was:<1c7b85b4>".
>
> If the blank node id is a negative number (i.e. it starts with a '-'),
> I have a RiotParserException: "org.openjena.riot.RiotParseException:
> [line: 1, col: 3 ] Blank node label does not start with alphabetic or
> _ :-" from TokenizerText.java line 1067.

Setting onlySafeBNodeLabels to true might help.

Because TDB does not use the tokenizer for decode, the raw path may be 
buggy.

See OutputLangUtils because that has the prospect of streaming.

We may need to switch to OutputStream, but there is OutStreamUTF8 if UTF-8 
encoding by std Java is costly.

> What I am trying to do is to rewrite the BindingSerializer in the patch
> for JENA-44. These are the signatures of the two methods I am implementing:
>
> public void serialize(Binding b, DataOutputStream out) throws IOException
> public Binding deserialize(DataInputStream in) throws IOException

What's wrong with TokenOutputStream, which even does some buffering?

Binding -> Nodes (you're only writing the RDF term values), beware of 
missing bindings.  See the TSV output format that Laurent has been 
looking at.

DataOutputStream can only write 16-bit lengths for strings - so you use 
write(byte[]) and much of the point of DataOutputStream is lost.  It seems 
better to use our own internal interface and map to whatever 
mechanism is most appropriate, with the round-tripping between 
TokenOutputStream and TokenInputStream then tested.
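
The 16-bit limit is easy to demonstrate with the JDK alone. In this sketch
(the class name LengthLimitDemo is mine, not from any patch), writeUTF
rejects a string whose UTF-8 encoding exceeds 65535 bytes, while the
explicit 32-bit length prefix plus write(byte[]) approach has no such limit:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LengthLimitDemo {
    // writeUTF uses an unsigned 16-bit length prefix, so strings whose
    // UTF-8 encoding exceeds 65535 bytes throw UTFDataFormatException.
    static boolean writeUtfFails(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (IOException e) {  // UTFDataFormatException extends IOException
            return true;
        }
    }

    // The workaround discussed above: a 32-bit length, then the raw bytes.
    static byte[] writeRecord(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            byte[] buf = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(buf.length);
            out.write(buf);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String big = new String(new char[70000]).replace('\0', 'x');
        System.out.println(writeUtfFails(big));       // 70000 bytes > 65535
        System.out.println(writeRecord(big).length);  // 4-byte length + payload
    }
}
```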

> At the moment, I am assuming all the bindings written in the same file have
> the same variables and I am writing them only once at the beginning of the
> file and after that I am serializing binding values only:
>
> for (Var var : vars) {
> Node node = b.get(var);
> byte[] buf = NodeFmtLib.serialize(node).getBytes("UTF-8");

Whether this is faster than converting to UTF-8 directly into the 
stream will need testing, but it's a point optimization.   For now, it's 
the design that matters.

> out.writeInt(buf.length);
> out.write(buf);
> }
>
> Should I try to use OutputLangUtils instead? And Writer(s) instead of
> DataOutputStream(s)?
>
> Thanks,
> Paolo
>
>>
>> I'm keen that we create a single solid I/O layer so it can teste and
>> optimized then shared amongst all the code doing I/O related things.
>>
>> Nodec is an interface specializes attempt to ByteBuffers for file, not
>> stream I/O. File I/O can be random access.
>>
>> Andy
>

Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> 
> 
> On 26/05/11 15:37, Laurent Pellegrino wrote:
>> Hi all,
>>
>> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
>> String. Now, I have to perform the reverse operation: from the String
>> I want to create the Node. Is there a class and method to do that from
>> the ARQ library?
>>
>> It seems that NodecLib.decode(...) do the trick but it is in the TDB
>> library and I am not sure that it works with any output from
>> FmtUtils.stringForNode(...)?
>>
>> Kind Regards,
>>
>> Laurent
> 
> There are ways to reverse the process - too many in fact.
> 
> Simple: SSE.parseNode: String -> Node
> 
> It uses a javacc parser so the overall efficiency isn't ideal.
> 
> But RIOT is in the process of reworking I/O for efficiency; the input 
> side is the area that is most finished.  The tokenizer will do what you 
> want.
> 
> What's missing in RIOT is Node to stream writing without using FmtUtils 
> -- this is OutputLangUtils which is unfinished.  FmtUtils creates 
> intermediate strings, when the output could be straight to a stream, 
> avoiding a copy and the temporary object allocation.
> 
> The Tokenizer is:
> 
>     interface Tokenizer extends Iterator<Token>
> 
> and see org.openjena.riot.tokens.TokenizerFactory
> 
> especially if you have a sequence of them to parse ... like a TSV file. 
>  But you will have to manage newlines as to the tokenizer they are 
> whitespace like anything else.
> 
> 
> There is some stuff in my scratch area for streams of tuples of RDF 
> terms and variables:
> 
> https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/ 
> 
> 
> TokenInputStream and TokenOutputStream might be useful.
> 
> Until TSV, a tuple of terms is a number of RDF terms, terminated by a 
> DOT (not newline).
> 
> This could be useful to JENA-44, JENA-45 and JENA-69

Hi,
I am looking at the code to serialize bindings (in relation to JENA-44
and JENA-45) and I would like to use as much as I can what is already
available in RIOT (and/or help to add what's missing, once I understand
what is the right thing to do).

I am having a few problems with blank nodes.

This is a snippet of code which explains my problem:

1		Node node1 = Node_Blank.createAnon();
2		String str = NodeFmtLib.serialize(node1);
3		Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(str);
4		Token token = tokenizer.next();
5		Node node2 = token.asNode();
6		assertEquals(node1, node2);

I have two different problems.

In the case where the blank node id starts with a digit, the assertion at
line 6 fails with, for example: "expected:<1c7b85b4:13089a0cb42:-7fff>
but was:<1c7b85b4>".

If the blank node id is a negative number (i.e. it starts with a '-'),
I get a RiotParseException: "org.openjena.riot.RiotParseException:
[line: 1, col: 3 ] Blank node label does not start with alphabetic or
_ :-" from TokenizerText.java line 1067.

What I am trying to do is to rewrite the BindingSerializer in the patch
for JENA-44. These are the signatures of the two methods I am implementing:

   public void serialize(Binding b, DataOutputStream out) throws IOException
   public Binding deserialize(DataInputStream in) throws IOException

At the moment, I am assuming all the bindings written in the same file have
the same variables and I am writing them only once at the beginning of the
file and after that I am serializing binding values only:

   for (Var var : vars) {
       Node node = b.get(var);
       byte[] buf = NodeFmtLib.serialize(node).getBytes("UTF-8");
       out.writeInt(buf.length);
       out.write(buf);
   }
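
As a self-contained illustration of this record layout (JDK only; plain
strings stand in for the serialized node values, and the read method is
hypothetical, not part of the JENA-44 patch), the length-prefixed format
round-trips like this:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordRoundTrip {
    // Mirrors the loop above: 32-bit length prefix, then UTF-8 bytes.
    static byte[] write(List<String> values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String v : values) {
                byte[] buf = v.getBytes(StandardCharsets.UTF_8);
                out.writeInt(buf.length);
                out.write(buf);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reverse: read n length-prefixed records back.
    static List<String> read(byte[] data, int n) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            List<String> values = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                byte[] buf = new byte[in.readInt()];
                in.readFully(buf);
                values.add(new String(buf, StandardCharsets.UTF_8));
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("<http://example.org/s>", "\"lit\"");
        System.out.println(read(write(terms), 2).equals(terms));
    }
}
```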

Should I try to use OutputLangUtils instead? And Writer(s) instead of
DataOutputStream(s)?

Thanks,
Paolo

> 
> I'm keen that we create a single solid I/O layer so it can teste and 
> optimized then shared amongst all the code doing I/O related things.
> 
> Nodec is an interface specializes attempt to ByteBuffers for file, not 
> stream I/O.  File I/O can be random access.
> 
>     Andy


Re: Reverse operation for FmtUtils.stringForNode(...)

Posted by Andy Seaborne <an...@epimorphics.com>.

On 26/05/11 15:37, Laurent Pellegrino wrote:
> Hi all,
>
> I am using FmtUtils.stringForNode(...) from ARQ to encode a Node to a
> String. Now, I have to perform the reverse operation: from the String
> I want to create the Node. Is there a class and method to do that from
> the ARQ library?
>
> It seems that NodecLib.decode(...) do the trick but it is in the TDB
> library and I am not sure that it works with any output from
> FmtUtils.stringForNode(...)?
>
> Kind Regards,
>
> Laurent

There are ways to reverse the process - too many in fact.

Simple: SSE.parseNode: String -> Node

It uses a javacc parser so the overall efficiency isn't ideal.
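
For example, a minimal round-trip sketch (assumes ARQ of this era on the 
classpath; the com.hp.hpl.jena package names and class name RoundTrip are 
my assumptions, not from this thread):

```java
// Sketch only: requires ARQ on the classpath.
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.sse.SSE;
import com.hp.hpl.jena.sparql.util.FmtUtils;

public class RoundTrip {
    // Encode with FmtUtils.stringForNode, decode with SSE.parseNode.
    static boolean roundTrips(String nodeStr) {
        Node n = SSE.parseNode(nodeStr);
        String s = FmtUtils.stringForNode(n);  // Node -> String
        return SSE.parseNode(s).equals(n);     // String -> Node
    }

    public static void main(String[] args) {
        System.out.println(roundTrips("<http://example.org/x>"));
        System.out.println(roundTrips("\"abc\"@en"));
    }
}
```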

But RIOT is in the process of reworking I/O for efficiency; the input 
side is the area that is most finished.  The tokenizer will do what you 
want.

What's missing in RIOT is Node to stream writing without using FmtUtils 
-- this is OutputLangUtils which is unfinished.  FmtUtils creates 
intermediate strings, when the output could be straight to a stream, 
avoiding a copy and the temporary object allocation.

The Tokenizer is:

     interface Tokenizer extends Iterator<Token>

and see org.openjena.riot.tokens.TokenizerFactory

especially if you have a sequence of them to parse ... like a TSV file. 
  But you will have to manage newlines as, to the tokenizer, they are 
whitespace like anything else.


There is some stuff in my scratch area for streams of tuples of RDF 
terms and variables:

https://svn.apache.org/repos/asf/incubator/jena/Scratch/AFS/trunk/src/riot/io/

TokenInputStream and TokenOutputStream might be useful.

Unlike TSV, a tuple of terms is a number of RDF terms, terminated by a 
DOT (not a newline).

This could be useful to JENA-44, JENA-45 and JENA-69

I'm keen that we create a single solid I/O layer so it can be tested and 
optimized, then shared amongst all the code doing I/O-related things.

Nodec is an interface specialized to ByteBuffers for file, not stream, 
I/O.  File I/O can be random access.

	Andy