Posted to dev@jena.apache.org by "Andy Seaborne (JIRA)" <ji...@apache.org> on 2011/08/15 13:51:27 UTC

[jira] [Closed] (JENA-85) Common bindings I/O

     [ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne closed JENA-85.
-----------------------------


> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Andy Seaborne
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>
> ( from: http://markmail.org/thread/ljjrsiun3oxtrchw )
> There are a number of activities that require being able to serialize, and read back, bindings.  They use different serializations.  A shared "bindings I/O" would mean all activities could use one, tuned, set of serialization and I/O classes.
> JENA-44 (External sort) encodes a binding as a length-denoted byte array.  The byte array uses length-denoted byte arrays within the bindings.  I/O is done using Data(In|Out)putStream, specifically putInt/getInt() and put/get(byte[]), and ByteBuffer putInt/getInt() and put/get(byte[]), for the per-row serialization as (var, Turtle string form) pairs.  It uses a null for no such value.
> JENA-45 (Spill to disk SPARQL Update) uses a more textual representation based on a binding encoded as (var, Turtle term). End of row is denoted by a DOT.  It uses modified RIOT for input reading.
> There is also use of TSV I/O for writing and reading result sets.  In this form, the variables are written once at the start, and not in each line.
> == Proposed mini-language
> This proposal takes those separate designs, and adds high-level compression.
> A sequence of bindings is written assuming there is a list of variables in force.  Position in the row determines which variable is bound to which value (=> compression of variable names).  Turtle-style prefixes can be used (=> compression for IRIs) and the value of a slot in a row can be "same as the row before" (=> compression for repeated terms) or undefined.
> Rows end in a DOT - this is not strictly necessary but adds robustness against truncated data and bugs.
> Every row is the same length, in number of terms, as the list of variables in force.
> Directives are lines starting with a keyword and ending in a DOT.
> The directives are:
>   PREFIX : <http://example> .
>   Like Turtle, except keyword-based to fit with being a keyword-driven mini-language.
>   VARS ?x ?y .
>   Set the variables in force for subsequent rows,
>   until the next VARS directive.
>   We need VARS because it's not always possible to determine all
>   the possible variables before starting to write out bindings.
> A binding row is a sequence of terms, encoded like Turtle, including prefixed names and short forms for numbers (more compression).  In addition, STAR ("*") means "same term as the row before" and DASH ("-") means undef.  Do not use * to repeat an undef (-) from the previous row; write - again.
> Rows end in DOT. Preferred style is one space after each term.  This makes writing safe.
> Terms can be written without intermediate copies (except local name processing) or buffers.  The OutputLangUtils does not do this currently but it should.
> For presentation reasons only, blank lines are allowed (this would all get lost in the lexing/tokenization anyway).
> Example:
> -------------
> VARS ?x ?y .
> PREFIX : <http://example/> .
> :local1 <http://example.other/text> .
> * - .
> * 123 .
> -------------
> == Discussion
> The format is text, but we're writing strings anyway, so a binary form, rather than a delimited text form, is unlikely to give much advantage and can't reuse the standard bytes<->chars stuff without intermediate copies.
> This would all be hidden behind an interface anyway.  A binary tokenizer and binary OutputLangUtils would enable binary output.
> Dynamic choosing of prefixes can be done. 
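
The row encoding proposed above can be sketched in Java roughly as follows. This is an illustrative sketch only, not the ARQ implementation: the class BindingRowsSketch and its encode/decode methods are hypothetical names, terms are assumed to be pre-formatted Turtle strings containing no whitespace, each directive or row is assumed to sit on its own line, PREFIX lines are skipped rather than expanded, and resetting the "previous row" at each VARS directive is an assumption the proposal does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed mini-language; not part of ARQ.
final class BindingRowsSketch {

    /** Writes rows against one VARS list. A null slot means "unbound". */
    static String encode(List<String> vars, List<String[]> rows) {
        StringBuilder out = new StringBuilder("VARS");
        for (String v : vars) out.append(" ?").append(v);
        out.append(" .\n");
        String[] prev = null;
        for (String[] row : rows) {
            if (row.length != vars.size())
                throw new IllegalArgumentException("row length must match VARS");
            for (int i = 0; i < row.length; i++) {
                String term = row[i];
                if (term == null) {
                    out.append("- ");            // undef is always written as DASH
                } else if (prev != null && term.equals(prev[i])) {
                    out.append("* ");            // same term as the row before
                } else {
                    out.append(term).append(' ');
                }
            }
            out.append(".\n");                   // rows end in DOT
            prev = row;
        }
        return out.toString();
    }

    /** Reads rows back; assumes whitespace-free terms, one row per line. */
    static List<String[]> decode(String text) {
        List<String[]> rows = new ArrayList<>();
        int width = -1;                          // no VARS seen yet
        String[] prev = null;
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;        // blank lines are allowed
            String[] tok = line.split("\\s+");
            if (!tok[tok.length - 1].equals("."))
                throw new IllegalArgumentException("missing DOT: " + line);
            if (tok[0].equals("VARS")) {
                width = tok.length - 2;          // drop keyword and DOT
                prev = null;                     // assumption: VARS resets STAR
                continue;
            }
            if (tok[0].equals("PREFIX")) continue; // prefix expansion omitted
            if (tok.length - 1 != width)
                throw new IllegalArgumentException("wrong row length: " + line);
            String[] row = new String[width];
            for (int i = 0; i < width; i++) {
                if (tok[i].equals("-")) {
                    row[i] = null;
                } else if (tok[i].equals("*")) {
                    if (prev == null)
                        throw new IllegalArgumentException("STAR without previous row");
                    row[i] = prev[i];
                } else {
                    row[i] = tok[i];
                }
            }
            rows.add(row);
            prev = row;
        }
        return rows;
    }
}
```

Under these assumptions, encoding the three rows of the example for variables x and y yields the VARS line followed by the three row lines shown in the example, and decoding restores the repeated :local1 and the unbound slot. A real implementation would need a proper tokenizer, since literals may contain spaces and dots.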

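For comparison, the length-denoted layout that the description attributes to JENA-44 might look something like the sketch below. It is not the JENA-44 code: LengthDenotedSketch is a hypothetical name, only the term strings are stored (the variable names are left out for brevity), and writing a length of -1 for "no such value" is an assumption, since the description only says a null is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a length-denoted row encoding; not the JENA-44 code.
final class LengthDenotedSketch {

    /** Writes the slot count, then each term as (length, UTF-8 bytes). */
    static byte[] write(String[] terms) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(terms.length);
            for (String term : terms) {
                if (term == null) {
                    out.writeInt(-1);            // assumption: -1 marks "no value"
                    continue;
                }
                byte[] b = term.getBytes(StandardCharsets.UTF_8);
                out.writeInt(b.length);
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads one row written by write(). */
    static String[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String[] terms = new String[in.readInt()];
            for (int i = 0; i < terms.length; i++) {
                int len = in.readInt();
                if (len < 0) continue;           // slot stays null
                byte[] b = new byte[len];
                in.readFully(b);
                terms[i] = new String(b, StandardCharsets.UTF_8);
            }
            return terms;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each term costs a four-byte length plus its bytes and is repeated in full on every row, which is the overhead the positional, STAR-compressed text form is meant to reduce.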