Posted to dev@jena.apache.org by "Andy Seaborne (JIRA)" <ji...@apache.org> on 2011/08/15 13:51:27 UTC
[jira] [Closed] (JENA-85) Common bindings I/O
[ https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Seaborne closed JENA-85.
-----------------------------
> Common bindings I/O
> -------------------
>
> Key: JENA-85
> URL: https://issues.apache.org/jira/browse/JENA-85
> Project: Jena
> Issue Type: New Feature
> Components: ARQ
> Reporter: Paolo Castagna
> Assignee: Andy Seaborne
> Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>
> ( from: http://markmail.org/thread/ljjrsiun3oxtrchw )
> There are a number of activities that require being able to serialize, and read back, bindings. They use different serializations. A shared "bindings I/O" would mean all activities could use one, tuned, set of serialization and I/O classes.
> JENA-44 (External sort) encodes a binding as a length-denoted byte array. The byte array uses length-denoted byte arrays within the bindings. I/O is done using Data(In|Out)putStream and ByteBuffer, specifically putInt/getInt() and put/get(byte[]), for the per-row serialization as (var, Turtle string form) pairs. It uses a null for no such value.
> JENA-45 (Spill to disk SPARQL Update) uses a more textual representation based on a binding encoded as (var, Turtle term). End of row is denoted by a DOT. It uses modified RIOT for input reading.
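For readers unfamiliar with length-denoted encodings, here is a minimal sketch of the general idea in Python. It is not the JENA-44 code: the function names, the 4-byte big-endian length prefix, and the use of -1 as the "null" marker are all assumptions made for illustration.

```python
import struct

def encode_row(pairs):
    """Encode [(var, term_or_None), ...] as one length-prefixed byte array.

    Assumptions (not taken from JENA-44): 4-byte big-endian lengths,
    UTF-8 strings, and a length of -1 standing in for "no such value".
    """
    body = bytearray()
    for var, term in pairs:
        for field in (var, term):
            if field is None:
                body += struct.pack(">i", -1)          # null marker
            else:
                data = field.encode("utf-8")
                body += struct.pack(">i", len(data)) + data
    # The whole row is itself prefixed with its length in bytes.
    return struct.pack(">i", len(body)) + bytes(body)

def decode_row(buf):
    """Inverse of encode_row: bytes -> [(var, term_or_None), ...]."""
    (total,) = struct.unpack_from(">i", buf, 0)
    pos, end = 4, 4 + total
    fields = []
    while pos < end:
        (n,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        if n < 0:
            fields.append(None)
        else:
            fields.append(buf[pos:pos + n].decode("utf-8"))
            pos += n
    # Fields alternate var, term, var, term, ...
    return list(zip(fields[0::2], fields[1::2]))
```

The point of the sketch is only that every row repeats the variable names and full term strings, which is what the proposal below tries to compress away.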
> There is also use of TSV I/O for writing and reading result sets. In this form, the variables are written once at the start, and not in each line.
> == Proposed mini-language
> This proposal takes those separate designs, and adds high-level compression.
> A sequence of bindings is written assuming there is a list of variables in force. Position in the row determines which term is bound to which variable (=> compression of variable names). Turtle-style prefixes can be used (=> compression for IRIs) and the value of a slot in a row can be "same as the row before" (=> compression for repeated terms) or undefined.
> Rows end in a DOT - this is not strictly necessary but adds robustness against truncated data and bugs.
> Every row is the same length, in number of terms, as the list of variables in force.
> Directives are lines starting with a keyword and ending in a DOT.
> The directives are:
> PREFIX : <http://example> .
> Like Turtle, except keyword based to fit with being a keyword-driven mini-language.
> VARS ?x ?y .
> Set the variables in force for subsequent rows,
> until the next VARS directive.
> We need VARS because it's not always possible to determine all
> the possible variables before starting to write out bindings.
> A binding row is a sequence of terms, encoded like Turtle, including prefixed names and short forms for numbers (more compression). In addition STAR ("*") means "same term as the row before" and DASH ("-") means undef. Don't use * to repeat a - from the previous row; write - again.
> Rows end in DOT. Preferred style is one space after each term. This makes writing safe.
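A rough sketch of the writing side, in Python rather than Java for brevity. The function name and the dict-based row representation are hypothetical; terms are assumed to be already in Turtle form, and prefix compression is left out.

```python
def write_bindings(variables, rows):
    """Serialize rows as proposed: one VARS directive, then one line per row.

    variables: list of variable names such as ["?x", "?y"].
    rows: list of dicts mapping variable name -> Turtle-form term string;
          a missing variable means the slot is undefined.
    """
    lines = ["VARS " + " ".join(variables) + " ."]
    previous = None  # terms of the previous row, in variable order
    for row in rows:
        current = [row.get(var) for var in variables]
        out = []
        for i, term in enumerate(current):
            if term is None:
                out.append("-")   # undef is always written as DASH, never STAR
            elif previous is not None and previous[i] == term:
                out.append("*")   # same term as the row before
            else:
                out.append(term)
        # One space after each term, then the DOT that ends the row.
        lines.append(" ".join(out) + " .")
        previous = current
    return "\n".join(lines) + "\n"
```

Checking for undef before checking for a repeat is what keeps STAR from ever standing in for a DASH from the previous row.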
> Terms can be written without intermediate copies (except local name processing) or buffers. The OutputLangUtils does not do this currently but it should.
> For presentation reasons only, blank lines are allowed (this would all get lost in the lexing/tokenization anyway).
> Example:
> -------------
> VARS ?x ?y .
> PREFIX : <http://example/> .
> :local1 <http://example.other/text> .
> * - .
> * 123 .
> -------------
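To make the example concrete, here is a minimal reader sketch in Python. It is only an illustration of the rules above, not a proposed implementation: the names are hypothetical, and it assumes tokens are separated by whitespace (so literals containing spaces are not handled; a real reader would use a Turtle tokenizer).

```python
EXAMPLE = """\
VARS ?x ?y .
PREFIX : <http://example/> .
:local1 <http://example.other/text> .
* - .
* 123 .
"""

def expand(token, prefixes):
    """Expand a prefixed name to a full IRI; leave other tokens untouched."""
    if token.startswith("<") or token.startswith('"'):
        return token
    label, colon, local = token.partition(":")
    if colon and label in prefixes:
        return "<" + prefixes[label] + local + ">"
    return token

def read_bindings(text):
    """Parse the mini-language into a list of {variable: term} dicts."""
    prefixes, variables, previous, rows = {}, [], None, []
    statement = []
    for token in text.split():
        if token != ".":
            statement.append(token)
            continue
        # A DOT ends either a directive or a row.
        if statement and statement[0] == "VARS":
            variables, previous = statement[1:], None
        elif statement and statement[0] == "PREFIX":
            prefixes[statement[1].rstrip(":")] = statement[2][1:-1]
        else:
            if len(statement) != len(variables):
                raise ValueError("row length does not match VARS")
            terms = []
            for i, tok in enumerate(statement):
                if tok == "-":
                    terms.append(None)                # undefined slot
                elif tok == "*":
                    if previous is None or previous[i] is None:
                        raise ValueError("* with no term in the row before")
                    terms.append(previous[i])         # same as the row before
                else:
                    terms.append(expand(tok, prefixes))
            rows.append({v: t for v, t in zip(variables, terms) if t is not None})
            previous = terms
        statement = []
    if statement:
        raise ValueError("truncated input: last statement has no DOT")
    return rows
```

Under these assumptions, `read_bindings(EXAMPLE)` gives three bindings: the first binds both ?x and ?y, the second binds only ?x (to the same IRI), and the third binds ?x to that IRI again and ?y to 123. The final check shows how the DOT catches truncated data.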
> == Discussion
> The format is text - but we're writing strings anyway, so a binary form, rather than a delimited text form, is unlikely to give much advantage, but can't reuse the standard bytes<->chars stuff without intermediate copies.
> This would all be hidden behind an interface anyway. A binary tokenizer and binary OutputLangUtils would enable binary output.
> Dynamic choosing of prefixes can be done.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira