Posted to dev@jena.apache.org by "Paolo Castagna (JIRA)" <ji...@apache.org> on 2011/08/03 09:47:27 UTC

[jira] [Created] (JENA-85) Common bindings I/O

Common bindings I/O
-------------------

                 Key: JENA-85
                 URL: https://issues.apache.org/jira/browse/JENA-85
             Project: Jena
          Issue Type: New Feature
          Components: ARQ
            Reporter: Paolo Castagna


(Text taken from: http://markmail.org/thread/ljjrsiun3oxtrchw)

There are a number of activities that require being able to serialize, and read back, bindings.  They use different serializations.  A shared "bindings I/O" would mean all activities could use one, tuned, set of serialization and I/O classes.

JENA-44 (External sort) encodes a binding as a length-denoted byte array.  The byte array uses length-denoted byte arrays within the bindings.  I/O is done using Data(In|Out)putStream, specifically putInt/getInt() and put/get(byte[]), and ByteBuffer putInt/getInt() and put/get(byte[]), for the per-row serialization as (var, Turtle string form) pairs.  It uses a null for no such value.
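
As a rough illustration of the length-denoted idea (not the JENA-44 code itself), the sketch below writes each (var, term) pair as a 4-byte length followed by UTF-8 bytes.  The class name, the leading pair count, and the use of length -1 to stand for the null "no such value" case are all assumptions made for the example.  It relies on the writeInt/readInt methods of java.io.DataOutputStream and java.io.DataInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not the JENA-44 code: a row as length-denoted (var, term) pairs. */
final class LengthDenotedBindingSketch {

    /** Writes the pair count, then each var and term as a length-prefixed UTF-8 field. */
    static byte[] encode(Map<String, String> row) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(row.size());
            for (Map.Entry<String, String> e : row.entrySet()) {
                writeField(out, e.getKey());
                writeField(out, e.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Reads back exactly what encode() wrote, preserving pair order. */
    static Map<String, String> decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int pairs = in.readInt();
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < pairs; i++) {
                String var = readField(in);
                String term = readField(in);
                row.put(var, term);
            }
            return row;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Assumption for this sketch: a length of -1 marks a null (no such value).
    private static void writeField(DataOutputStream out, String s) throws IOException {
        if (s == null) {
            out.writeInt(-1);
            return;
        }
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(b.length);
        out.write(b);
    }

    private static String readField(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) return null;
        byte[] b = new byte[length];
        in.readFully(b);
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

Note that the variable name is repeated in every row here, which is the overhead the proposal below removes.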

JENA-45 (Spill to disk SPARQL Update) uses a more textual representation based on a binding encoded as (var, Turtle term). End of row is denoted by a DOT.  It uses modified RIOT for input reading.

There is also use of TSV I/O for writing and reading result sets.  In this form, the variables are written once at the start, and not in each line.

== Proposed mini-language

This proposal takes those separate designs, and adds high-level compression.

A sequence of bindings is written assuming there is a list of variables in force.  Position in the row determines which variable is bound to which term (=> compression of variable names).  Turtle-style prefixes can be used (=> compression for IRIs) and the value of a slot in a row can be "same as the row before" (=> compression for repeated terms) or undefined.

Rows end in a DOT - this is not strictly necessary but adds robustness against truncated data and bugs.
Every row has the same length, in number of terms, as the list of variables in force.

Directives are lines starting with a keyword and ending in a DOT.

The directives are:

  PREFIX : <http://example> .

  Like Turtle, except keyword-based to fit with being a keyword-driven mini-language.


  VARS ?x ?y .

  Set the variables in force for subsequent rows,
  until the next VARS directive.
  We need VARS because it's not always possible to determine all
  the possible variables before starting to write out bindings.

A binding row is a sequence of terms, encoded like Turtle, including prefixed names and short forms for numbers (more compression).  In addition STAR ("*") means "same term as the row before" and DASH ("-") means undef.  Do not use * to repeat a - (undef) from the previous row.

Rows end in DOT. Preferred style is one space after each term.  This makes writing safe.
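
A minimal sketch of a writer following the row rules above, assuming terms arrive already encoded as Turtle strings and that null stands for an unbound variable.  BindingRowWriterSketch and its write method are hypothetical names, not part of ARQ, and PREFIX handling is left out to keep it short.

```java
import java.util.List;

/** Hypothetical writer for the proposed row format; not the ARQ implementation. */
final class BindingRowWriterSketch {

    /** Emits one VARS directive, then one DOT-terminated line per row. */
    static String write(List<String> vars, List<List<String>> rows) {
        StringBuilder out = new StringBuilder("VARS");
        for (String v : vars) out.append(" ?").append(v);
        out.append(" .\n");

        List<String> previous = null;
        for (List<String> row : rows) {
            if (row.size() != vars.size())
                throw new IllegalArgumentException("row length must match VARS");
            for (int i = 0; i < row.size(); i++) {
                String term = row.get(i);
                if (term == null) {
                    out.append("- ");              // DASH: undef
                } else if (previous != null && term.equals(previous.get(i))) {
                    out.append("* ");              // STAR: same term as the row before
                } else {
                    out.append(term).append(' ');  // one space after each term
                }
            }
            out.append(".\n");
            previous = row;
        }
        return out.toString();
    }
}
```

Because a null term is always written as DASH before the STAR test is reached, an undef in the previous row is never repeated with "*".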

Terms can be written without intermediate copies (except local name processing) or buffers.  The OutputLangUtils does not do this currently but it should.

For presentation reasons only, blank lines are allowed (this would all get lost in the lexing/tokenization anyway).

Example:

-------------
VARS ?x ?y .
PREFIX : <http://example/> .
:local1 <http://example.other/text> .
* - .
* 123 .
-------------
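
To make the example concrete, here is a sketch of a reader for it.  It is only an illustration under simplifying assumptions: tokens are split on whitespace (so literals containing spaces are not handled), prefixed names are expanded to <IRI> form, a new VARS directive resets the "row before", and the class and method names are invented for the example rather than taken from ARQ or RIOT.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for the proposed mini-language; not the ARQ implementation. */
final class BindingRowReaderSketch {

    /** Returns one map per row, variable name to term string; unbound variables are absent. */
    static List<Map<String, String>> read(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        List<String> vars = new ArrayList<>();
        Map<String, String> prefixes = new HashMap<>();
        String[] previous = new String[0];
        List<String> statement = new ArrayList<>();

        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (!token.equals(".")) {
                statement.add(token);
                continue;
            }
            // A DOT closes the current directive or row.
            if (statement.isEmpty()) throw new IllegalArgumentException("empty statement");
            String head = statement.get(0);
            if (head.equals("VARS")) {
                vars = new ArrayList<>();
                for (String v : statement.subList(1, statement.size())) vars.add(v.substring(1));
                previous = new String[vars.size()];   // assumption: VARS resets the previous row
            } else if (head.equals("PREFIX")) {
                if (statement.size() != 3) throw new IllegalArgumentException("bad PREFIX directive");
                String iri = statement.get(2);
                prefixes.put(statement.get(1), iri.substring(1, iri.length() - 1));
            } else {
                if (statement.size() != vars.size())
                    throw new IllegalArgumentException("row length differs from VARS");
                String[] current = new String[vars.size()];
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < current.length; i++) {
                    String t = statement.get(i);
                    if (t.equals("*")) {
                        if (previous[i] == null)
                            throw new IllegalArgumentException("* with no previous term");
                        current[i] = previous[i];
                    } else if (!t.equals("-")) {
                        current[i] = expand(t, prefixes);
                    }
                    if (current[i] != null) row.put(vars.get(i), current[i]);
                }
                rows.add(row);
                previous = current;
            }
            statement.clear();
        }
        if (!statement.isEmpty()) throw new IllegalArgumentException("truncated input: missing DOT");
        return rows;
    }

    /** Expands prefix:local to <namespace + local>; other terms are returned unchanged. */
    private static String expand(String term, Map<String, String> prefixes) {
        int colon = term.indexOf(':');
        if (colon < 0 || term.startsWith("<") || term.startsWith("\"") || term.startsWith("_:"))
            return term;
        String ns = prefixes.get(term.substring(0, colon + 1));
        return ns == null ? term : "<" + ns + term.substring(colon + 1) + ">";
    }
}
```

The final check on a non-empty pending statement is where the trailing DOT pays off: input cut off mid-row is rejected instead of being read as a short row.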

== Discussion

The format is text, but we're writing strings anyway, so a binary form, rather than a delimited text form, is unlikely to give much advantage, and it can't reuse the standard bytes<->chars stuff without intermediate copies.

This would all be hidden behind an interface anyway.  A binary tokenizer and binary OutputLangUtils would enable binary output.

Dynamic choosing of prefixes can be done. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-85) Common bindings I/O

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081899#comment-13081899 ] 

Andy Seaborne commented on JENA-85:
-----------------------------------

Yes - that's one way of doing it and looks good to me.

Another way is to write bNodes as <_:label> IRIs and use a parser profile that reconstructs bNodes from skolemized forms.
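
A tiny sketch of that idea, assuming the internal label contains no ">" or whitespace.  The class and method names are made up for illustration; a real version would sit in the output code and in the parser profile.

```java
/** Hypothetical helper showing the <_:label> form for blank nodes. */
final class BNodeIriSketch {

    /** Writes an internal blank node label in the <_:label> form. */
    static String toIriForm(String internalLabel) {
        return "<_:" + internalLabel + ">";
    }

    /** True if the token looks like a blank node written in the <_:label> form. */
    static boolean isBNodeIri(String token) {
        return token.startsWith("<_:") && token.endsWith(">");
    }

    /** Recovers the internal label so the reader can rebuild the same blank node. */
    static String toInternalLabel(String token) {
        if (!isBNodeIri(token))
            throw new IllegalArgumentException("not a <_:label> token: " + token);
        return token.substring(3, token.length() - 1);
    }
}
```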

A third is to make the tokenizer less fussy about bNode labels (e.g. whitespace delimited, or simply include digits, "-" and ":" anywhere uniformly).  Conformance to e.g. Turtle can be done at checking time.

I'd like to get the complete cycle of read/write or write/read working.  Currently, I'm finding code exists but it's not configurable into an input profile, and the lack of policy-driven output means changes are global.

The tests only cover the legal options for Turtle/N-Triples, which do not need a bNode-preserving round trip.  TDB does its own thing for storing nodes anyway.



> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>

[jira] [Updated] (JENA-85) Common bindings I/O

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Allen updated JENA-85:
------------------------------

    Attachment: JENA-85-Blank-Node-Test.patch

I'm having some issues with blank nodes not coming back with the same internal label.  I've attached a testcase that fails when writing and then reading back in a binding with a blank node [1].

The issue seems to be the serializer is writing a mapped blank node label (like _:b0) instead of the internal label.

[1] JENA-85-Blank-Node-Test.patch


[jira] [Updated] (JENA-85) Common bindings I/O

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna updated JENA-85:
-------------------------------

    Description: 
( from: http://markmail.org/thread/ljjrsiun3oxtrchw )


  was:
(Text taken from: http://markmail.org/thread/ljjrsiun3oxtrchw)




[jira] [Commented] (JENA-85) Common bindings I/O

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082309#comment-13082309 ] 

Andy Seaborne commented on JENA-85:
-----------------------------------

BindingIO doc updated.  Code updated.

For now, I've put in an encoding for labels (which is taken from N-Triples):

The first letter of the label is "B" (this ensures a letter is first).
Any character outside A-Za-z0-9 is encoded as Xnn where nn is the byte value.
X is encoded as XX.

The Unicode implications of this need properly sorting out but it will work for all Jena-allocated blank nodes for now.
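
A sketch of that encoding and the matching decoder, to show the round trip.  Two details are assumptions on my part and not stated above: nn is written as two upper-case hex digits, and characters above 0xFF are rejected rather than encoded (the Unicode question is left open).  The class name is hypothetical.

```java
/** Hypothetical codec for the "B" + Xnn label encoding; details marked below are assumptions. */
final class BNodeLabelCodecSketch {

    static String encode(String label) {
        StringBuilder sb = new StringBuilder("B");        // ensures a letter comes first
        for (int i = 0; i < label.length(); i++) {
            char c = label.charAt(i);
            if (c == 'X') {
                sb.append("XX");                          // X is encoded as XX
            } else if (isPlain(c)) {
                sb.append(c);
            } else {
                if (c > 0xFF)
                    throw new IllegalArgumentException("character outside single-byte range: " + c);
                // Assumption: nn is the byte value as two upper-case hex digits.
                sb.append('X').append(String.format("%02X", (int) c));
            }
        }
        return sb.toString();
    }

    static String decode(String encoded) {
        if (!encoded.startsWith("B"))
            throw new IllegalArgumentException("encoded label must start with B");
        StringBuilder sb = new StringBuilder();
        int i = 1;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c != 'X') {
                sb.append(c);
                i++;
            } else if (i + 1 < encoded.length() && encoded.charAt(i + 1) == 'X') {
                sb.append('X');
                i += 2;
            } else {
                if (i + 3 > encoded.length())
                    throw new IllegalArgumentException("truncated escape");
                sb.append((char) Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3;
            }
        }
        return sb.toString();
    }

    private static boolean isPlain(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }
}
```

With hex digits for nn, "XX" cannot be confused with an Xnn escape, since X is not a hex digit.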



[jira] [Assigned] (JENA-85) Common bindings I/O

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne reassigned JENA-85:
---------------------------------

    Assignee: Andy Seaborne

> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Andy Seaborne
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>
> ( from: http://markmail.org/thread/ljjrsiun3oxtrchw )
> There are a number of activities that require being about to serialize, and read back, bindings.  They use different serializations.  A shared "bindings I/O" would mean all activities could use one, tuned, set of serialization and I/O classes.
> JENA-44 (External sort) encodes a binding as a length-denoted byte array.  The byte arry uses lengh-denoted byte arrays within the bindings.  I/O is done using Data(In|Out)putStream, specifically. putInt/getInt() and put/get(byte[]) and ByteBuffer putInt/getInt() and put/get(byte[]) for the per-row serialization as (var,Turtle string form) pairs.  It uses a null for no such value.
> JENA-45 (Spill to disk SPARQL Update) uses a more textual representation based on a binding endcoded as (var, Turtle term). End of row is denoted by a DOT.  It uses modified RIOT for input reading.
> There is also use of TSV I/O for writing and reading result sets.  In this form, the variables are written once at the start, and not in each line.
> == Proposed mini-language
> This proposal takes those separate designs, and adds high-level compression.
> A sequence of bindings is written assuming there is a list of variables in force.  Position in the row determines which variable is bound to which variable (=> compression of variable names).  Turtle-style prefixes can be used (=> compression for IRIs) and the value of a slot in a row can "same as the row before" (=> compression for repeated terms) or undefined.
> Rows end in a DOT - this is not stricly necessary but adds a robustness against truncated data and bugs.
> Every row is the length, in number of terms, as the list variables in force.
> Directives are lines starting with a keyword.  End on DOT.
> The directives are:
>   PREFIX : <http://example> .
>   Like Turtles, except keyword based to fit with being a keyword-driven mini-language.
>   VARS ?x ?y .
>   Set the variables in force for subsequent rows,
>   until the next VARS directive.
>   We need VARS because it's not always possible to determine all
>   the possible variables before starting to write out bindings.
> A binding row is a sequence of terms, encoded like Turtle, including prefixed names and short forms for numbers (more compression).  In addition STAR ("*") means "same term as the row before" and DASH ("-") means undef.  Don't use * for - from previous row.
> Rows end in DOT. Preferred style is one space after each term.  This makes writing safe.
> Terms can be written without intermediate copies (except local name processing) or buffers.  The OutputLangUtils does not do this currently but it should.
> For presentation reasons only, blank lines are allowed (this would all get lost in the lexing/tokenization anyway).
> Example:
> -------------
> VARS ?x ?y .
> PREFIX : <http://example/> .
> :local1 <http://example.other/text> .
> * - .
> * 123 .
> -------------
> == Discussion
> The format is text, but we are writing strings anyway, so a binary form, rather than a delimited text form, is unlikely to give much advantage, and it could not reuse the standard bytes<->chars machinery without intermediate copies.
> This would all be hidden behind an interface anyway.  A binary tokenizer and binary OutputLangUtils would enable binary output.
> Dynamic choosing of prefixes can be done. 
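
The row rules quoted above (variables in force set by VARS, STAR for "same as the row before", DASH for undef, every row ending in DOT) can be illustrated with a small decoder. This is only a sketch under stated assumptions, not the ARQ implementation: the class name BindingRowSketch is made up, terms are split on whitespace (so literals containing spaces are not handled), PREFIX lines are skipped rather than expanded, and a VARS directive is assumed to reset the "previous row".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical decoder for the proposed mini-language; not Jena code. */
public class BindingRowSketch {
    /** Decode lines into variable -> term maps; undefined slots are omitted. */
    public static List<Map<String, String>> decode(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        List<String> vars = new ArrayList<>();
        String[] previous = new String[0];
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;               // blank lines are allowed
            if (!line.endsWith("."))
                throw new IllegalArgumentException("No terminating DOT: " + line);
            String body = line.substring(0, line.length() - 1).trim();
            String[] tokens = body.isEmpty() ? new String[0] : body.split("\\s+");
            if (tokens.length > 0 && tokens[0].equals("VARS")) {
                vars = new ArrayList<>();
                for (int i = 1; i < tokens.length; i++) vars.add(tokens[i]);
                previous = new String[vars.size()];     // assumption: VARS resets "*"
                continue;
            }
            if (tokens.length > 0 && tokens[0].equals("PREFIX")) continue; // not expanded here
            if (tokens.length != vars.size())
                throw new IllegalArgumentException("Row length differs from VARS: " + line);
            String[] current = new String[tokens.length];
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < tokens.length; i++) {
                String term = tokens[i];
                if (term.equals("*")) {
                    // "*" must repeat a real term, never an undef slot.
                    if (previous[i] == null)
                        throw new IllegalArgumentException("* with no previous term: " + line);
                    term = previous[i];
                } else if (term.equals("-")) {
                    term = null;                        // undef
                }
                current[i] = term;
                if (term != null) row.put(vars.get(i), term);
            }
            previous = current;
            rows.add(row);
        }
        return rows;
    }
}
```

Fed the five-line example from the description, this yields three rows: the second binds only ?x (to :local1), and the third binds ?x to :local1 and ?y to 123.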


[jira] [Updated] (JENA-85) Common bindings I/O

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Allen updated JENA-85:
------------------------------

    Attachment: JENA-85-DecodeBlankNodeLabels.patch

Could we use the NodeFmtLib.safeBNodeLabel(String) method to create a legal label?  I wrote a corresponding method to decode that format [1].

[1] JENA-85-DecodeBlankNodeLabels.patch


> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>

[jira] [Updated] (JENA-85) Common bindings I/O

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Allen updated JENA-85:
------------------------------

    Attachment: JENA-85-BindingOutputStream-Changes.patch

I've attached a patch with some proposed changes to BindingOutputStream.  Basically the change is to implement Sink<Binding>.
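
To make the idea concrete, here is a rough sketch of a sink-style writer for the row format in the issue description. It is not the attached patch and not the ARQ class: RowSink is a hypothetical stand-in for a Sink-like interface, bindings are modelled as plain maps from variable name to an already-formatted term, and the policy of emitting a new VARS line whenever a row uses a variable not currently in force is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sink-style writer for the bindings mini-language; not Jena code. */
public class BindingWriterSketch {
    /** Stand-in for a Sink-like interface (names are illustrative). */
    public interface RowSink<T> {
        void send(T item);
        void flush();
        void close();
    }

    public static class TextBindingSink implements RowSink<Map<String, String>> {
        private final StringBuilder out;
        private List<String> vars = null;            // variables in force
        private Map<String, String> previous = null; // last row written

        public TextBindingSink(StringBuilder out) { this.out = out; }

        @Override
        public void send(Map<String, String> row) {
            // Emit VARS when nothing is in force yet or the row needs a new variable.
            if (vars == null || !vars.containsAll(row.keySet())) {
                vars = new ArrayList<>(row.keySet());
                out.append("VARS");
                for (String v : vars) out.append(' ').append(v);
                out.append(" .\n");
                previous = null;                     // assumption: VARS resets "*"
            }
            for (String v : vars) {
                String term = row.get(v);
                if (term == null) out.append("- ");                  // undef
                else if (previous != null && term.equals(previous.get(v)))
                    out.append("* ");                                // same as row before
                else out.append(term).append(' ');                   // one space after each term
            }
            out.append(".\n");
            previous = new LinkedHashMap<>(row);
        }

        /** Alias for send, for familiarity of naming. */
        public void write(Map<String, String> row) { send(row); }

        @Override public void flush() { }
        @Override public void close() { }
    }
}
```

Because an undefined slot is never stored in the previous row, the writer cannot emit * to repeat a -, which matches the rule in the description.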


> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>         Attachments: JENA-85-BindingOutputStream-Changes.patch
>
>

[jira] [Resolved] (JENA-85) Common bindings I/O

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne resolved JENA-85.
-------------------------------

    Resolution: Fixed

Implementation done.  Please raise specific problems as new JIRA issues.

> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Andy Seaborne
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>

[jira] [Closed] (JENA-85) Common bindings I/O

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne closed JENA-85.
-----------------------------


> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Andy Seaborne
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>

[jira] [Commented] (JENA-85) Common bindings I/O

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081709#comment-13081709 ] 

Andy Seaborne commented on JENA-85:
-----------------------------------

Sink<Binding> patch applied.  BindingOutputStream also has write(Binding), equivalent to send(Binding), for familiarity of naming.

For the bNodes, a complication is that just writing "_:bnodelabel" isn't legal.  The tokenizer needs reversible bNode mapping.

One approach is to enable the <_:label> syntax for both input and output.
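
As an illustration of what a reversible bNode label mapping could look like, here is a minimal sketch. It is not the encoding used by NodeFmtLib.safeBNodeLabel or by the attached patch; the scheme (a leading "B", with every character outside [a-zA-Z0-9] and the escape character "X" itself written as "X" plus four hex digits) and the class name are assumptions chosen only to show that the mapping can be undone exactly.

```java
/** Hypothetical reversible blank node label encoding; not the Jena scheme. */
public class BNodeLabelSketch {
    /** Encode an arbitrary internal label using only ASCII letters and digits. */
    public static String encode(String label) {
        StringBuilder sb = new StringBuilder("B");   // result never starts with a digit
        for (int i = 0; i < label.length(); i++) {
            char ch = label.charAt(i);
            boolean plain = (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z' && ch != 'X')  // 'X' is the escape char
                    || (ch >= '0' && ch <= '9');
            if (plain) sb.append(ch);
            else sb.append('X').append(String.format("%04X", (int) ch));
        }
        return sb.toString();
    }

    /** Reverse encode(): strip the leading "B" and expand each "Xhhhh" escape. */
    public static String decode(String encoded) {
        if (encoded.isEmpty() || encoded.charAt(0) != 'B')
            throw new IllegalArgumentException("Not an encoded label: " + encoded);
        StringBuilder sb = new StringBuilder();
        int i = 1;
        while (i < encoded.length()) {
            char ch = encoded.charAt(i);
            if (ch == 'X') {
                if (i + 5 > encoded.length())
                    throw new IllegalArgumentException("Truncated escape: " + encoded);
                sb.append((char) Integer.parseInt(encoded.substring(i + 1, i + 5), 16));
                i += 5;
            } else {
                sb.append(ch);
                i++;
            }
        }
        return sb.toString();
    }
}
```

With this scheme, decode(encode(s)) returns s for any string, so a reader could recover the original internal label from what the writer emitted.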


> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch
>
>

[jira] [Commented] (JENA-85) Common bindings I/O

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079932#comment-13079932 ] 

Andy Seaborne commented on JENA-85:
-----------------------------------

Documentation copied to cwiki:
https://cwiki.apache.org/confluence/display/JENA/BindingIO

Some possible code in
ARQ: com.hp.hpl.jena.sparql.engine.binding
  BindingInputStream
  BindingOutputStream

> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>