Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2014/06/19 18:06:32 UTC

Binary RDF

Lizard needs to do network transfer of RDF data.  Rather than just doing 
something specific to Lizard, I've started on a general binary RDF 
module using Apache Thrift.

== RDF-Thrift
Work in Progress :: https://github.com/afs/rdf-thrift/

Discussion welcome.


The current plan is to have three supported abstractions:

1. StreamRDF
2. SPARQL Result Sets
3. RDF patch (which is very like StreamRDF but with A and D markers).

A first pass for StreamRDF is done, including some attempts to reduce 
object churn when crossing the abstraction boundaries. Abstraction is all 
very well, but repeated conversion of data structures can slow things down.

Using StreamRDF means that prefix compression can be done.

See
   https://github.com/afs/rdf-thrift/blob/master/RDF.thrift
for the encoding as it stands at the moment; it covers just RDF.
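
As a sketch of why that falls out of the interface (not code from the 
repo - the println calls stand in for whatever the Thrift writer would 
emit, and the package names are the Jena 2.x ones):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.jena.riot.system.StreamRDF;

    import com.hp.hpl.jena.graph.Node;
    import com.hp.hpl.jena.graph.Triple;
    import com.hp.hpl.jena.sparql.core.Quad;

    /** Prefix declarations arrive as in-stream events, so IRIs can be
     *  abbreviated on the fly before they reach the binary encoder. */
    public class PrefixCompressingSink implements StreamRDF {
        private final Map<String, String> prefixes = new HashMap<>();

        @Override public void start() {}
        @Override public void base(String base) {}
        @Override public void finish() {}

        @Override public void prefix(String prefix, String iri) {
            // A binary writer would emit a prefix-declaration record here.
            prefixes.put(prefix, iri);
        }

        @Override public void triple(Triple triple) {
            emit(triple.getSubject());
            emit(triple.getPredicate());
            emit(triple.getObject());
        }

        @Override public void quad(Quad quad) { triple(quad.asTriple()); }

        private void emit(Node node) {
            if ( node.isURI() ) {
                String uri = node.getURI();
                for ( Map.Entry<String, String> e : prefixes.entrySet() ) {
                    if ( uri.startsWith(e.getValue()) ) {
                        // Stand-in for writing a prefixed-name term to the wire.
                        System.out.println(e.getKey() + ":" + uri.substring(e.getValue().length()));
                        return;
                    }
                }
            }
            System.out.println(node);   // Stand-in for writing a full term.
        }
    }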

== In Jena

There are a number of places this might be useful:

1/ Fuseki and "application/sparql-results+thrift", "application/x-thrift"

(oh dear, "application/x-thrift", "x-" is not encouraged any more due to 
the transition problem c.f. "application/x-www-form-urlencoded")

2/ Hadoop-RDF

This is currently using N-Triples/N-Quads.  Rob - presumably this would 
be useful eventually.  AbstractNodeTupleWritable / 
AbstractNLineFileInputFormat look about right to me, but that's from 
code-reading, not code-doing.

(I know you/Cray have some internal binary RDF)

3/ Data bags and spill to disk

4/ RDF patch

5/ TDB (v2 - it would be a disk change) could usefully use the RDF term 
encoding for the node table.

6/ Files.  Add to RIOT as a new syntax (fairly direct access to 
StreamRDF+Thrift), which then helps TDB loading.

7/ Caching result sets in queries in Fuseki.

In an ideal world, the Thrift format could be shared across toolkits. 
There is nothing Jena specific about the wire encoding.

== Thrift vs Protocol Buffer(+netty)

The Lizard prototype currently uses Protocol Buffers + Netty.  Doing RDF 
Thrift has been a way to learn about Thrift.

All the reviews and comparisons on the interweb seem to be borne out.
There isn't a huge difference between the two.

Thrift's initial entry costs are higher: documentation is still weak, and 
the Maven artifact does not have a Maven-compatible source artifact (!!!), 
so you have to mangle one yourself, which isn't hard; the source is 
there, but in a non-standard form.

Thrift has its own networking; I'm unlikely to use the service (RPC) 
layer from Thrift in Lizard itself, as it is not fully streaming, but 
driving the next layer down directly is quite easy (as it is in PB+N).

Protocol Buffers does not have a network layer, it's just the byte 
encoding, but Netty comes with built-in protocol buffer handling (PB+N). 
That works fine as well, and I have gone back and found the equivalent 
functionality to what I have used in RDF Thrift.
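
(For reference, the PB+N wiring is just the standard Netty 4 pipeline.  A 
sketch only - MyRecord stands for a protoc-generated message class and 
MyRecordHandler for the application handler; both are placeholders, not 
real classes:)

    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.ChannelPipeline;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.handler.codec.protobuf.ProtobufDecoder;
    import io.netty.handler.codec.protobuf.ProtobufEncoder;
    import io.netty.handler.codec.protobuf.ProtobufVarint32FrameDecoder;
    import io.netty.handler.codec.protobuf.ProtobufVarint32LengthFieldPrepender;

    public class ProtobufPipeline extends ChannelInitializer<SocketChannel> {
        @Override
        protected void initChannel(SocketChannel ch) {
            ChannelPipeline p = ch.pipeline();
            p.addLast(new ProtobufVarint32FrameDecoder());                  // split incoming frames
            p.addLast(new ProtobufDecoder(MyRecord.getDefaultInstance()));  // bytes -> message (placeholder type)
            p.addLast(new ProtobufVarint32LengthFieldPrepender());          // prepend length on write
            p.addLast(new ProtobufEncoder());                               // message -> bytes
            p.addLast(new MyRecordHandler());                               // application logic (placeholder)
        }
    }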

For binary RDF and its general use, Thrift's wider language coverage is a 
plus point.

	Andy

Re: Binary RDF

Posted by Paul Houle <on...@gmail.com>.
Cool!

On Thu, Jun 19, 2014 at 12:06 PM, Andy Seaborne <an...@apache.org> wrote:
> Lizard needs to do network transfer of RDF data.  Rather than just doing
> something specific to Lizard, I've started on a general binary RDF module
> using Apache Thrift.
>
> == RDF-Thrift
> Work in Progress :: https://github.com/afs/rdf-thrift/
>
> Discussion welcome.
>
>
> The current is to have three supported abstractions:
>
> 1. StreamRDF
> 2. SPARQL Result Sets
> 3. RDF patch (which is very like StreamRDF but with A and D markers).
>
> A first pass for StreamRDF is done including some attempts to reduce objetc
> churn when crossing the abstract boundaries. Abstract is all very well but
> repeated conversion of datastructures can slow things down.
>
> Using StreamRDF means that prefix compression can be done.
>
> See
>   https://github.com/afs/rdf-thrift/blob/master/RDF.thrift
> for the encoding at the moment for just RDF.
>
> == In Jena
>
> There are a number of places this might be useful:
>
> 1/ Fuseki and "application/sparql-results+thrift", "application/x-thrift"
>
> (oh dear, "application/x-thrift", "x-" is not encouraged any more due to the
> transition problem c.f. "application/x-www-form-urlencoded")
>
> 2/ Hadoop-RDF
>
> This is currently using N-Triple/N-Quads.  Rob - presumably this would be
> useful eventually.  AbstractNodeTupleWritable / AbstractNLineFileInputFormat
> look about right to be but that's from code-reading not code-doing.
>
> (I know you/Cray have some internal binary RDF)
>
> 3/ Data bags and spill to disk
>
> 4/ RDF patch
>
> 5/ TDB (v2 - it would be a disk change) could useful use the RDF term
> encoding for the node table.
>
> 5/ Files.  Add to RIOT as a new syntax (a fairly direct access to
> StreamRDF+Thrift) which then helps TDB loading.
>
> 6/ Caching results set in queries in Fuseki.
>
> In an ideal world, the Thrift format could be shared across toolkits. There
> is nothing Jena specific about the wire encoding.
>
> == Thrift vs Protocol Buffer(+netty)
>
> The Lizard prototype currently uses Protocol Buffer + netty.  Doing RDF
> Thrift has a way to learn about Thrift.
>
> All the reviews and comparisons on the interweb seem to be born out.
> There isn't a huge difference between the two.
>
> Thrift's initial entry costs are higher (document is still weak, the maven
> artifact does not have a maven compatible source artifact (!!!) so you have
> to mangle one yourself which isn't hard; there is the source but in a
> non-standard form.
>
> Thrift has it's own networking; I'm unlikely to use the service (RPC) layer
> from Thrift in Lizard itself as it is not fully streaming but driving the
> next layer down directly is quite easy (as it is in PB+N).
>
> Protocol Buffers does not have a network layer, it's just the byte encoding,
> but Netty comes with built in protocol buffer handling (PB+N).  That works
> fine as well and I have done back and found the equivalent functionality I
> have used in RDF Thrift.
>
> For binary RDF and it's general use, thrift's wider language cover is a plus
> point.
>
>         Andy



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com

Re: Binary RDF

Posted by Andy Seaborne <an...@apache.org>.
On 30/06/14 14:07, Rob Vesse wrote:
> Setup and code?

https://github.com/afs/rdf-thrift

(caution - I have swapped the encoding scheme to see if a different one 
is better/worse and haven't rerun the timing tests).

There are a couple of scripts rdf2thrift (writes thrift) and thrift2rdf.

In theory, if you now call LangThrift.init() it wires itself into RIOT, 
but I ran out of time to test that properly.
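
If the wiring works, reading should be as simple as the sketch below 
(the ".rt" extension matches the file naming in the timing runs; the 
exact registration is assumed, and the import for LangThrift is omitted):

    import org.apache.jena.riot.RDFDataMgr;

    import com.hp.hpl.jena.rdf.model.Model;

    public class ReadThriftFile {
        public static void main(String[] args) {
            LangThrift.init();    // register the Thrift language with RIOT (class from rdf-thrift)
            // Assumes init() registers a file extension so RIOT can pick the parser.
            Model model = RDFDataMgr.loadModel("data.rt");
            System.out.println("Triples read: " + model.size());
        }
    }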

I don't know what the writing speed is yet. It should be much better 
than the string-based N-Triples etc.

	Andy

>
> I'd be interested in seeing how the internal binary rdf stuff we have
> compares
>
> Rob
>
> On 21/06/2014 22:19, "Andy Seaborne" <an...@apache.org> wrote:
>
>> First pass results for parsing from a file to a null sink, no tuning or
>> profiling. Jena java level Triple objects and all nodes are created.
>>
>> RIOT (128K IO buffer)
>> bsbm-25m.nt.gz : 127,082 Triples per second (TPS)
>> bsbm-25m.nt:     133,104 TPS
>>
>> RDF Thrift (32K IO buffer)
>> bsbm-25m.rt:     357,101 TPS  x2.8
>> bsbm-25m.rt.gz:  390,578 TPS  x2.9
>>
>> RDF Thrift (128K IO buffer)
>> bsbm-25m.rt:     409,788 TPS  x3.2
>> bsbm-25m.rt.gz:  389,969 TPS  x2.9
>>
>> and best
>> gzip -d bsbm-25m.rt.gz | thrift2rdf (128K IO buffer)
>>    490,138 TPS
>>
>> File sizes:
>> bsbm-25m.nt:     6,505,289,318 bytes (6.1G)
>> bsbm-25m.nt.gz:    691,429,780 bytes (660M)
>>
>> bsbm-25m.rt:     6,684,543,995 bytes (6.3G)
>> bsbm-25m.rt.gz:    700,639,242 bytes (669M)
>>
>> 	Andy
>
>
>
>


Re: Binary RDF

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Setup and code?

I'd be interested in seeing how the internal binary rdf stuff we have
compares

Rob

On 21/06/2014 22:19, "Andy Seaborne" <an...@apache.org> wrote:

>First pass results for parsing from a file to a null sink, no tuning or
>profiling. Jena java level Triple objects and all nodes are created.
>
>RIOT (128K IO buffer)
>bsbm-25m.nt.gz : 127,082 Triples per second (TPS)
>bsbm-25m.nt:     133,104 TPS
>
>RDF Thrift (32K IO buffer)
>bsbm-25m.rt:     357,101 TPS  x2.8
>bsbm-25m.rt.gz:  390,578 TPS  x2.9
>
>RDF Thrift (128K IO buffer)
>bsbm-25m.rt:     409,788 TPS  x3.2
>bsbm-25m.rt.gz:  389,969 TPS  x2.9
>
>and best
>gzip -d bsbm-25m.rt.gz | thrift2rdf (128K IO buffer)
>   490,138 TPS
>
>File sizes:
>bsbm-25m.nt:     6,505,289,318 bytes (6.1G)
>bsbm-25m.nt.gz:    691,429,780 bytes (660M)
>
>bsbm-25m.rt:     6,684,543,995 bytes (6.3G)
>bsbm-25m.rt.gz:    700,639,242 bytes (669M)
>
>	Andy





Re: Binary RDF

Posted by Andy Seaborne <an...@apache.org>.
First pass results for parsing from a file to a null sink, no tuning or 
profiling. Jena java level Triple objects and all nodes are created.

RIOT (128K IO buffer)
bsbm-25m.nt.gz : 127,082 Triples per second (TPS)
bsbm-25m.nt:     133,104 TPS

RDF Thrift (32K IO buffer)
bsbm-25m.rt:     357,101 TPS  x2.8
bsbm-25m.rt.gz:  390,578 TPS  x2.9

RDF Thrift (128K IO buffer)
bsbm-25m.rt:     409,788 TPS  x3.2
bsbm-25m.rt.gz:  389,969 TPS  x2.9

and best
gzip -d bsbm-25m.rt.gz | thrift2rdf (128K IO buffer)
   490,138 TPS

File sizes:
bsbm-25m.nt:     6,505,289,318 bytes (6.1G)
bsbm-25m.nt.gz:    691,429,780 bytes (660M)

bsbm-25m.rt:     6,684,543,995 bytes (6.3G)
bsbm-25m.rt.gz:    700,639,242 bytes (669M)
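
The harness is roughly this shape (a sketch, not the exact benchmark 
code; triple counting and the IO buffer setup are left out):

    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.riot.system.StreamRDFLib;

    public class ParseToNullSink {
        public static void main(String[] args) {
            String file = args[0];                     // e.g. bsbm-25m.nt.gz or bsbm-25m.rt
            StreamRDF sink = StreamRDFLib.sinkNull();  // every triple/quad is discarded
            long start = System.nanoTime();
            RDFDataMgr.parse(sink, file);              // parser chosen from the file extension
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%s parsed in %.1fs%n", file, seconds);
        }
    }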

	Andy

Re: Binary RDF

Posted by Paul Houle <on...@gmail.com>.
For what I do in Hadoop,  I don't care about the sort order so long
as,  in some controlled domain,  nodes always sort in the same order.
This is sufficient to group triples that have the same ?s,  ?p,  or ?o
together which is good for grouping on relationships,  joining,  etc.
Something stupid but fast would be good for that.

The next step up is the SPARQL ordering,  which is a bit iffy

http://www.w3.org/TR/sparql11-query/#modOrderBy

People who are picky about the answers they get will need to define
their own sort order,  either by putting in a custom sort order (which
usually won't be fast) or using a static data type (i.e.
WritableInteger) for the key.

---

I'd like to have some way of processing triples in Hadoop which avoids
UTF8 -> String conversion if at all possible.  Often a map job filters
out triples or tuples with a selectivity of 1% or so, so in a case
like that you don't want to do any more work than is needed to, say,
test the predicate.

---

As for representing URIs, a very efficient way is to compute the
cumulative probability distribution of the URIs, which, surprisingly,
can be computed in parallel for real-world cases:

https://github.com/paulhoule/infovore/wiki/Design-of-a-data-processing-path

You can then code these with a variable-length code, and the result can
be treated as an opaque identifier.  This is insanely fast if you want to
do PageRank-style graph calculations, but it does mean joining if you
want to ask questions about the string representation of the URI.
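
To make the variable-length coding step concrete, a toy sketch (not
Infovore code; it assumes the URIs have already been ranked so that
frequent URIs get small numbers):

    import java.io.ByteArrayOutputStream;

    /** Toy encoder: URIs ranked by descending frequency get small ranks,
     *  and small ranks encode to fewer bytes with a base-128 varint. */
    public class VarintUriCode {

        /** Encode a non-negative rank as an unsigned LEB128 varint. */
        static void writeVarint(long rank, ByteArrayOutputStream out) {
            while ((rank & ~0x7FL) != 0) {
                out.write((int) ((rank & 0x7F) | 0x80));  // low 7 bits, continuation bit set
                rank >>>= 7;
            }
            out.write((int) rank);                        // final byte, continuation bit clear
        }

        public static void main(String[] args) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeVarint(5, out);          // very common URI -> 1 byte
            writeVarint(300, out);        // -> 2 bytes
            writeVarint(2_000_000, out);  // rare URI -> 3 bytes
            System.out.println(out.size() + " bytes for three term ids");
        }
    }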


On Fri, Jun 20, 2014 at 6:18 AM, Andy Seaborne <an...@apache.org> wrote:
> On 20/06/14 09:48, Rob Vesse wrote:
>>
>> Andy
>>
>> Comments inline:
>
>
> Ditto.
>
>
>>
>> On 19/06/2014 17:06, "Andy Seaborne" <an...@apache.org> wrote:
>>
>>> Lizard needs to do network transfer of RDF data.  Rather than just doing
>>> something specific to Lizard, I've started on a general binary RDF
>>> module using Apache Thrift.
>>>
>>> == RDF-Thrift
>>> Work in Progress :: https://github.com/afs/rdf-thrift/
>>>
>>> Discussion welcome.
>>>
>>>
>>> The current is to have three supported abstractions:
>>>
>>> 1. StreamRDF
>>> 2. SPARQL Result Sets
>>> 3. RDF patch (which is very like StreamRDF but with A and D markers).
>>>
>>> A first pass for StreamRDF is done including some attempts to reduce
>>> objetc churn when crossing the abstract boundaries. Abstract is all very
>>> well but repeated conversion of datastructures can slow things down.
>>>
>>> Using StreamRDF means that prefix compression can be done.
>>>
>>> See
>>>    https://github.com/afs/rdf-thrift/blob/master/RDF.thrift
>>> for the encoding at the moment for just RDF.
>>
>>
>> Looks like a sane encoding from what I understand of Thrift
>
>
> Thanks - it's my first real use of Thrift.  There are choices and I hope to
> do a similar-but-different design.  This one flattens everything into a
> tagged RDF_Term - that skips a layer of objects that a union of RDF_IRI,
> RDF_BNODE, RDF_Literal,... has.  Little on-the-wire difference, less Java
> object churn, maybe over engineering :-)
>
>
>>> == In Jena
>>>
>>> There are a number of places this might be useful:
>>>
>>> 1/ Fuseki and "application/sparql-results+thrift", "application/x-thrift"
>>>
>>> (oh dear, "application/x-thrift", "x-" is not encouraged any more due to
>>> the transition problem c.f. "application/x-www-form-urlencoded")
>>>
>>> 2/ Hadoop-RDF
>>>
>>> This is currently using N-Triple/N-Quads.  Rob - presumably this would
>>> be useful eventually.  AbstractNodeTupleWritable /
>>> AbstractNLineFileInputFormat look about right to be but that's from
>>> code-reading not code-doing.
>>
>>
>> Yes and No
>>
>> The concerns on Hadoop are somewhat different.  It is
>> advantageous/required that the Hadoop code has direct control over the
>> binary serialisation because of the contract for Writable.  This is needed
>> both to support serialisation and deserialisation of values and in order
>> to optionally provide direct comparisons on the binary representation
>> terms which has substantial performance benefits because it avoids having
>> to unnecessarily deserialise terms.
>>
>> It is unclear to me whether using RDF Thrift would allow this or not?  Or
>> if the overhead of Thrift would be more overall?
>
>
> The RDF thrift format is binary comparable if the same TProtocol is used.
> TProtocol is Thriftism for the choice of wire layout - Binary, Comopact JSON
> or Tuples (more compact with less resilience) - and ends have to agree the
> TProtocol for interworking.  Normally, one would just use "compact".
>
> As Thrift is used in a Hadoop setting, there should be places to go and
> learn from other people's practical experience.
>
>
>> Certainly it would be possible to support a RDF Thrift based binary RDF as
>> an input & output format regardless of how the writables are defined
>>
>>>
>>> (I know you/Cray have some internal binary RDF)
>>
>>
>> Yes though the intent of that format is somewhat different.  It was
>> designed to be a parallel friendly RDF specific compression format so
>> besides a global header at the start of the stream it is then block
>> oriented such that each block is entirely independent of each other and
>> requires only the data in the global header and itself in order to permit
>> decompression.
>>
>> For small data there will be little/no benefit, for large data the
>> compression achieved is roughly equivalent to GZipped NTriples with the
>> primary advantage that it is substantially faster to produce (about 5x)
>> and potentially even faster given a good parallel implementation.  Of
>> course what we have is mostly just a prototype and it hasn't been heavily
>> optimised so there may be more performance to be had.
>
>
> Thanks for the description.  RDF binary uses include several ones that are
> write-once-read-once.  Compression other than applying prefixes is not the
> target here (it's orthogonal?).  "snappy" would be the obvious choice to
> look at for a single stream because of the compression time costs of gzip.
>
>
>>
>>>
>>> 3/ Data bags and spill to disk
>>>
>>> 4/ RDF patch
>>>
>>> 5/ TDB (v2 - it would be a disk change) could useful use the RDF term
>>> encoding for the node table.
>>
>>
>> Would this actually save much space?
>>
>> It looks like you'd only save a few bytes because you still have to store
>> the bulk of the term encoding you just lose some of the surface syntax
>> that something like a NTriples encoding would give you
>
>
> For TDB the big win is speed, not space. At the moment, the on-disk node
> format is a string that needs parsing and producing by string bashing.
>
> Both are relative expensive and the thing that limits load performance for
> medium sized
> datasets is the node table. The node cache largely hides the cost during
> SPARQL execution.
>
> In Lizard, storing Thrift means that remote retrieval is simply
> disk-bytes to network-bytes - no decode-encode in the node table storage
> server.
>
>         Andy
>
>
>>
>> Rob
>>
>>>
>>> 5/ Files.  Add to RIOT as a new syntax (a fairly direct access to
>>> StreamRDF+Thrift) which then helps TDB loading.
>>>
>>> 6/ Caching results set in queries in Fuseki.
>>>
>>> In an ideal world, the Thrift format could be shared across toolkits.
>>> There is nothing Jena specific about the wire encoding.
>
> ...
>>>
>>>
>>>         Andy
>>
>>
>>
>>
>>
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com

Re: Binary RDF

Posted by Andy Seaborne <an...@apache.org>.
On 20/06/14 09:48, Rob Vesse wrote:
> Andy
>
> Comments inline:

Ditto.

>
> On 19/06/2014 17:06, "Andy Seaborne" <an...@apache.org> wrote:
>
>> Lizard needs to do network transfer of RDF data.  Rather than just doing
>> something specific to Lizard, I've started on a general binary RDF
>> module using Apache Thrift.
>>
>> == RDF-Thrift
>> Work in Progress :: https://github.com/afs/rdf-thrift/
>>
>> Discussion welcome.
>>
>>
>> The current is to have three supported abstractions:
>>
>> 1. StreamRDF
>> 2. SPARQL Result Sets
>> 3. RDF patch (which is very like StreamRDF but with A and D markers).
>>
>> A first pass for StreamRDF is done including some attempts to reduce
>> objetc churn when crossing the abstract boundaries. Abstract is all very
>> well but repeated conversion of datastructures can slow things down.
>>
>> Using StreamRDF means that prefix compression can be done.
>>
>> See
>>    https://github.com/afs/rdf-thrift/blob/master/RDF.thrift
>> for the encoding at the moment for just RDF.
>
> Looks like a sane encoding from what I understand of Thrift

Thanks - it's my first real use of Thrift.  There are choices and I hope 
to do a similar-but-different design.  This one flattens everything into 
a tagged RDF_Term - that skips a layer of objects that a union of 
RDF_IRI, RDF_BNODE, RDF_Literal,... has.  Little on-the-wire difference, 
less Java object churn, maybe over engineering :-)

>> == In Jena
>>
>> There are a number of places this might be useful:
>>
>> 1/ Fuseki and "application/sparql-results+thrift", "application/x-thrift"
>>
>> (oh dear, "application/x-thrift", "x-" is not encouraged any more due to
>> the transition problem c.f. "application/x-www-form-urlencoded")
>>
>> 2/ Hadoop-RDF
>>
>> This is currently using N-Triple/N-Quads.  Rob - presumably this would
>> be useful eventually.  AbstractNodeTupleWritable /
>> AbstractNLineFileInputFormat look about right to be but that's from
>> code-reading not code-doing.
>
> Yes and No
>
> The concerns on Hadoop are somewhat different.  It is
> advantageous/required that the Hadoop code has direct control over the
> binary serialisation because of the contract for Writable.  This is needed
> both to support serialisation and deserialisation of values and in order
> to optionally provide direct comparisons on the binary representation
> terms which has substantial performance benefits because it avoids having
> to unnecessarily deserialise terms.
>
> It is unclear to me whether using RDF Thrift would allow this or not?  Or
> if the overhead of Thrift would be more overall?

The RDF Thrift format is binary-comparable if the same TProtocol is 
used.  TProtocol is the Thrift-ism for the choice of wire layout - Binary, 
Compact, JSON or Tuple (more compact with less resilience) - and both 
ends have to agree on the TProtocol for interworking.  Normally, one would 
just use "compact".

As Thrift is used in a Hadoop setting, there should be places to go and 
learn from other people's practical experience.

> Certainly it would be possible to support a RDF Thrift based binary RDF as
> an input & output format regardless of how the writables are defined
>
>>
>> (I know you/Cray have some internal binary RDF)
>
> Yes though the intent of that format is somewhat different.  It was
> designed to be a parallel friendly RDF specific compression format so
> besides a global header at the start of the stream it is then block
> oriented such that each block is entirely independent of each other and
> requires only the data in the global header and itself in order to permit
> decompression.
>
> For small data there will be little/no benefit, for large data the
> compression achieved is roughly equivalent to GZipped NTriples with the
> primary advantage that it is substantially faster to produce (about 5x)
> and potentially even faster given a good parallel implementation.  Of
> course what we have is mostly just a prototype and it hasn't been heavily
> optimised so there may be more performance to be had.

Thanks for the description.  Several of the intended uses of binary RDF 
are write-once-read-once.  Compression other than applying prefixes is 
not the target here (it's orthogonal?).  "snappy" would be the obvious 
choice to look at for a single stream, because of the compression-time 
costs of gzip.

>
>>
>> 3/ Data bags and spill to disk
>>
>> 4/ RDF patch
>>
>> 5/ TDB (v2 - it would be a disk change) could useful use the RDF term
>> encoding for the node table.
>
> Would this actually save much space?
>
> It looks like you'd only save a few bytes because you still have to store
> the bulk of the term encoding you just lose some of the surface syntax
> that something like a NTriples encoding would give you

For TDB the big win is speed, not space.  At the moment, the on-disk node 
format is a string that has to be parsed and produced by string bashing.

Both are relatively expensive, and the thing that limits load performance 
for medium-sized datasets is the node table.  The node cache largely hides 
the cost during SPARQL execution.

In Lizard, storing Thrift means that remote retrieval is simply
disk-bytes to network-bytes - no decode-encode in the node table storage
server.

	Andy

>
> Rob
>
>>
>> 5/ Files.  Add to RIOT as a new syntax (a fairly direct access to
>> StreamRDF+Thrift) which then helps TDB loading.
>>
>> 6/ Caching results set in queries in Fuseki.
>>
>> In an ideal world, the Thrift format could be shared across toolkits.
>> There is nothing Jena specific about the wire encoding.
...
>>
>> 	Andy
>
>
>
>


Re: Binary RDF

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Andy

Comments inline:

On 19/06/2014 17:06, "Andy Seaborne" <an...@apache.org> wrote:

>Lizard needs to do network transfer of RDF data.  Rather than just doing
>something specific to Lizard, I've started on a general binary RDF
>module using Apache Thrift.
>
>== RDF-Thrift
>Work in Progress :: https://github.com/afs/rdf-thrift/
>
>Discussion welcome.
>
>
>The current is to have three supported abstractions:
>
>1. StreamRDF
>2. SPARQL Result Sets
>3. RDF patch (which is very like StreamRDF but with A and D markers).
>
>A first pass for StreamRDF is done including some attempts to reduce
>objetc churn when crossing the abstract boundaries. Abstract is all very
>well but repeated conversion of datastructures can slow things down.
>
>Using StreamRDF means that prefix compression can be done.
>
>See
>   https://github.com/afs/rdf-thrift/blob/master/RDF.thrift
>for the encoding at the moment for just RDF.

Looks like a sane encoding from what I understand of Thrift

>
>== In Jena
>
>There are a number of places this might be useful:
>
>1/ Fuseki and "application/sparql-results+thrift", "application/x-thrift"
>
>(oh dear, "application/x-thrift", "x-" is not encouraged any more due to
>the transition problem c.f. "application/x-www-form-urlencoded")
>
>2/ Hadoop-RDF
>
>This is currently using N-Triple/N-Quads.  Rob - presumably this would
>be useful eventually.  AbstractNodeTupleWritable /
>AbstractNLineFileInputFormat look about right to be but that's from
>code-reading not code-doing.

Yes and No

The concerns on Hadoop are somewhat different.  It is
advantageous/required that the Hadoop code has direct control over the
binary serialisation because of the contract for Writable.  This is needed
both to support serialisation and deserialisation of values and, optionally,
to provide direct comparisons on the binary representations of terms, which
has substantial performance benefits because it avoids having to
unnecessarily deserialise terms.

It is unclear to me whether using RDF Thrift would allow this or not?  Or
if the overhead of Thrift would be more overall?

Certainly it would be possible to support an RDF Thrift based binary RDF as
an input & output format regardless of how the writables are defined.

>
>(I know you/Cray have some internal binary RDF)

Yes, though the intent of that format is somewhat different.  It was
designed to be a parallel-friendly, RDF-specific compression format:
besides a global header at the start of the stream, it is block-oriented,
such that each block is entirely independent of the others and requires
only the data in the global header and itself in order to permit
decompression.

For small data there will be little/no benefit; for large data the
compression achieved is roughly equivalent to gzipped N-Triples, with the
primary advantage that it is substantially faster to produce (about 5x),
and potentially even faster given a good parallel implementation.  Of
course, what we have is mostly just a prototype and it hasn't been heavily
optimised, so there may be more performance to be had.

>
>3/ Data bags and spill to disk
>
>4/ RDF patch
>
>5/ TDB (v2 - it would be a disk change) could useful use the RDF term
>encoding for the node table.

Would this actually save much space?

It looks like you'd only save a few bytes, because you still have to store
the bulk of the term encoding; you just lose some of the surface syntax
that something like an N-Triples encoding would give you.

Rob

>
>5/ Files.  Add to RIOT as a new syntax (a fairly direct access to
>StreamRDF+Thrift) which then helps TDB loading.
>
>6/ Caching results set in queries in Fuseki.
>
>In an ideal world, the Thrift format could be shared across toolkits.
>There is nothing Jena specific about the wire encoding.
>
>== Thrift vs Protocol Buffer(+netty)
>
>The Lizard prototype currently uses Protocol Buffer + netty.  Doing RDF
>Thrift has a way to learn about Thrift.
>
>All the reviews and comparisons on the interweb seem to be born out.
>There isn't a huge difference between the two.
>
>Thrift's initial entry costs are higher (document is still weak, the
>maven artifact does not have a maven compatible source artifact (!!!) so
>you have to mangle one yourself which isn't hard; there is the source
>but in a non-standard form.
>
>Thrift has it's own networking; I'm unlikely to use the service (RPC)
>layer from Thrift in Lizard itself as it is not fully streaming but
>driving the next layer down directly is quite easy (as it is in PB+N).
>
>Protocol Buffers does not have a network layer, it's just the byte
>encoding, but Netty comes with built in protocol buffer handling (PB+N).
>  That works fine as well and I have done back and found the equivalent
>functionality I have used in RDF Thrift.
>
>For binary RDF and it's general use, thrift's wider language cover is a
>plus point.
>
>	Andy