Posted to dev@jena.apache.org by Stephen Allen <sa...@bbn.com> on 2011/02/01 17:36:30 UTC

RE: SPARQL 1.1 Update in ARQ 2.8.7

Andy,

I have started implementing the serializer (SinkBindingOutput) by using
org.openjena.riot.SinkQuadOutput as a guide and using OutputLangUtils to
print out the variable/values.  I created the deserializer (LangBindings)
by extending org.openjena.riot.lang.LangNTuple.  I'm using the paired
var/value format you described below.  For now I'll start with a
straightforward implementation with no compression, but I like your ideas
in this area.  I'll try to do some measurements to see if any other
compression is beneficial.
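
Roughly, the serializer looks like this (a sketch only - I'm using
FmtUtils for the node output here instead of OutputLangUtils to keep the
example short, and assuming the atlas Sink send/flush/close shape):

    import java.io.IOException;
    import java.io.Writer;
    import java.util.Iterator;

    import org.openjena.atlas.lib.Sink;

    import com.hp.hpl.jena.sparql.core.Var;
    import com.hp.hpl.jena.sparql.engine.binding.Binding;
    import com.hp.hpl.jena.sparql.util.FmtUtils;

    // Writes each binding as one row of alternating ?var / value tokens,
    // terminated with "." - the paired format described below.
    public class SinkBindingOutput implements Sink<Binding>
    {
        private final Writer out ;

        public SinkBindingOutput(Writer out) { this.out = out ; }

        public void send(Binding binding)
        {
            try
            {
                for ( Iterator<Var> vars = binding.vars() ; vars.hasNext() ; )
                {
                    Var var = vars.next() ;
                    out.write("?" + var.getVarName() + " ") ;
                    out.write(FmtUtils.stringForNode(binding.get(var)) + " ") ;
                }
                out.write(".\n") ;
            }
            catch (IOException ex) { throw new RuntimeException(ex) ; }
        }

        public void flush() { try { out.flush() ; } catch (IOException ex) {} }
        public void close() { flush() ; }
    }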

I did not define an org.openjena.riot.Lang enum for the deserializer
(because it isn't an RDF language) but I was planning on putting the
LangBindings class in the org.openjena.riot.lang package.

For determining when to spill bindings to disk, there are a few options (in
order of least difficulty):
1) Store binding objects in a list, and then spill them to disk once the
list size passes a threshold
2) Start serializing bindings immediately into something like
DeferredFileOutputStream [1] that will retain the data in memory until it
passes a memory threshold
3) Do 1), but try to calculate the size of the bindings in memory and use a
memory threshold instead of a number-of-bindings threshold

I think 1) should be sufficient if we come up with a reasonable guess for
the threshold (a rough sketch is below).  Option 2) gives much better
control over the memory management, but I think the cost of unnecessarily
serializing/deserializing small queries may be too high.
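
Something like this is what I have in mind for 1) - a sketch only, with
illustrative names, reusing the SinkBindingOutput sketch above:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.util.ArrayList;
    import java.util.List;

    import com.hp.hpl.jena.sparql.engine.binding.Binding;

    // Option 1: buffer bindings in memory; once the count passes the
    // threshold, serialize the whole batch to a temp file and clear it.
    public class SpillableBindingBuffer
    {
        private final int threshold ;
        private final File spillFile ;
        private final List<Binding> buffer = new ArrayList<Binding>() ;

        public SpillableBindingBuffer(int threshold) throws IOException
        {
            this.threshold = threshold ;
            this.spillFile = File.createTempFile("bindings", ".tmp") ;
        }

        public void add(Binding b) throws IOException
        {
            buffer.add(b) ;
            if ( buffer.size() >= threshold )
                spill() ;
        }

        private void spill() throws IOException
        {
            // Append the buffered bindings to the spill file.
            Writer w = new OutputStreamWriter(new FileOutputStream(spillFile, true), "UTF-8") ;
            SinkBindingOutput sink = new SinkBindingOutput(w) ;
            for ( Binding b : buffer )
                sink.send(b) ;
            sink.close() ;
            w.close() ;
            buffer.clear() ;
        }
    }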

-Stephen

[1] http://commons.apache.org/io/api-release/org/apache/commons/io/output/DeferredFileOutputStream.html


-----Original Message-----
From: Andy Seaborne [mailto:andy.seaborne@epimorphics.com] 
Sent: Monday, January 31, 2011 7:22 AM
To: jena-dev@incubator.apache.org
Subject: Re: SPARQL 1.1 Update in ARQ 2.8.7

Hi Stephen,

We have resisted making Node serializable, although that has usually been 
requested for RPC purposes in conjunction with serializable graphs, and it 
is better to use RDF syntaxes for that.  Automatic Java serialization 
should not, in the current design, end up pulling in a lot of stuff, but 
(long term, no plans) Node may become an interface so that things like 
Parliament can have Nodes carrying internal data around (no, Nodes will not 
be tied to storage layers - that would break the use of inference and 
probably lots of other things).  To repeat: there are no plans to make Node 
an interface - an experiment has been done.

ObjectOutputStream might be too complicated.  In writing to an 
ObjectOutputStream, there may be more than one Java object written per 
Node (e.g. for literals, the lexical form and the datatype), and the whole 
shared-reference mechanism of ObjectOutputStream may or may not be a win.  
It needs state during writing to do that, so it might be a loss when the 
objective is to have beyond-memory data structures.

And ObjectOutputStream.writeUTF is limited to a 2-byte length field.  While 
64K strings seems a lot, for a general mechanism it's a bit of a pain 
(Uniprot has 69Kb literals; Clerezza have been experimenting with 3Mb 
literals).
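
(If a DataOutputStream were used anyway, the limit is easy to sidestep 
with an explicit int length prefix and raw UTF-8 bytes - a sketch:)

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    // writeUTF caps strings at 65535 encoded bytes; an int length prefix
    // plus raw UTF-8 bytes removes the limit.
    static void writeLongString(DataOutputStream out, String s) throws IOException
    {
        byte[] bytes = s.getBytes("UTF-8") ;
        out.writeInt(bytes.length) ;
        out.write(bytes) ;
    }

    static String readLongString(DataInputStream in) throws IOException
    {
        byte[] bytes = new byte[in.readInt()] ;
        in.readFully(bytes) ;
        return new String(bytes, "UTF-8") ;
    }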

XML is not fast to parse - a lot of bytes, and the layering of the 
processing in order to reuse a standard parser can incur costs.  StAX is 
better, and the streaming model is going to suit this use, but there seems 
to me to be little value in XML internally - it's not about transfer 
between applications.  Instead, JSON or RIOT (which parse faster) can be 
used.

 > However,
 > having serializable binding objects has the potential benefit of being
 > useful for other parts of query execution that could be memory bound 
 > (sort, join, distinct, group by, etc.).

I agree - a general mechanism for spill-to-disk iterators of bindings 
could be so useful that even an implementation specific to this case is 
worthwhile.  The functionality already in Java looks to be not that 
helpful, although using DataOutputStream as one implementation of a 
"BindingOutputStream" would be interesting to compare to doing it with 
the parser functionality.
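
The shape of it need not be much more than this (illustrative names only):

    import java.util.Iterator;

    import com.hp.hpl.jena.sparql.engine.binding.Binding;

    // Minimal contract for streaming bindings out to disk ...
    interface BindingOutputStream
    {
        void write(Binding binding) ;   // append one row
        void flush() ;
        void close() ;
    }

    // ... and for reading them back.
    interface BindingInputStream extends Iterator<Binding>
    {
        void close() ;
    }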

There are some building blocks already.

The RIOT parser suite is built on top of a general tokenizer that 
understands Turtle-style tokens and a few extras.  The extras were put 
in for exactly this sort of situation, for extensibility.  The tokenizer 
is tuned for speed, so it natively recognizes IRIs and literals with 
language tags, for example.

Writing a binding could be done as one row of alternating var and value 
(because bindings may have different variables in them in different rows).
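
For example, the binding { ?x = <http://example/a>, ?y = "foo" } would be 
written as the row:

?x <http://example/a> ?y "foo" .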

Alternatively, a table, with declared columns could be done.  The 
possible columns can be calculated from the syntax of a query although 
this isn't easily available currently.

As well as the usual RDF tokens (IRIs, literals, bnodes), it does 
variables and "symbols", where symbols are things that can be used to 
extend the language.

A simple run-length-encoding compression scheme would be low-space: we can 
use "*" to mean "same as the row before in this position", and prefixes 
can be used to compress URIs:

?s <http://example/> ?p <http://example/p> ?o 123 .
* ?p <http://example/p1> ?o "hello" .
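
With a prefix declared once at the head of the stream, the same rows 
compress further (hypothetical syntax, borrowing Turtle's @prefix):

@prefix ex: <http://example/> .
?s ex: ?p ex:p ?o 123 .
* ?p ex:p1 ?o "hello" .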

Further token replacement with common strings (e.g. "<http://") would 
also get the size down quite easily.  That also compresses numerical 
data (the datatype is syntax, not an explicit declaration).

The fact that the output is sort of "human readable" helps debugging :-)

Aside: in working on RIOT, I have found that reading from gzip streams 
is slightly slower than working on the uncompressed data, even though 
the uncompressed case involves more I/O bytes.  If it's across a network, 
I'm sure the reverse would be true.  But compression is a lot more 
expensive than decompression for gzip.  My guess is that gz compression 
will not be a win.

The output side of write-node-to-stream is something I've been meaning 
to do better for a while.  There is FmtUtils, which can turn RDF terms 
into Strings, but which really should turn RDF terms into output, one 
form of which is a wrapped StringBuilder or ByteArrayOutputStream. 
Unfortunately, FmtUtils does the job quite well, even if it makes a copy, 
and it has the advantage that it can provide the length of the output 
before the actual output, which is sometimes needed, or at least 
convenient (c.f. TDB NodecSSE).

There is some code in an experimental system to write streams of RIOT 
tokens.  It may even do the binding/row stuff. I can't look it up very 
easily at the moment, because some SF services like "view SVN" are offline.

So something in the class of option 1) would be very valuable.

As for integration with TDB, where it works in NodeIds, not Nodes: things 
like sort need the full node.

 >   - Does it matter that Bindings coming out of the deserializer
 > will be flat and lose any notion of their original types?

No, it shouldn't.  It's used because, in query processing, adding bindings 
is done by sharing the previous results, avoiding a copy and saving some 
space.

 >   - Should the Binding that comes out of the deserializer be
 > immutable or should it properly implement .add() and .addAll()
 > (for the SPARQL Update case it can definitely be immutable,
 >  but I'm not sure if it needs to be elsewhere in the query
 > execution process)?

Immutable is probably fine.  It's not possible to "set" a binding 
currently, only add one.  Once assigned, it can't be changed.

The general style is that some stage creates the binding or extends its 
input and finishes its work; the binding is not changed after that stage.
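
In code, that pattern is roughly (a sketch from memory - the exact 
constructors of the binding classes may differ):

    import com.hp.hpl.jena.graph.Node;
    import com.hp.hpl.jena.sparql.core.Var;
    import com.hp.hpl.jena.sparql.engine.binding.Binding;
    import com.hp.hpl.jena.sparql.engine.binding.BindingMap;

    // A stage extends its input binding by wrapping the parent, never by
    // mutating it; once handed on, the new binding is left alone.
    Binding extend(Binding parent)
    {
        BindingMap b = new BindingMap(parent) ;   // shares the parent's bindings
        b.add(Var.alloc("x"), Node.createURI("http://example/x")) ;
        return b ;
    }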

	Andy


On 27/01/11 23:35, Stephen Allen wrote:
> 1) Serialize the Binding objects as they are generated and before they are
> applied to the triple template(s).  Two methods of doing so are:
>
> 1a) Create a Serializable Binding object and use Java's ObjectOutputStream.
> Here I could check to see if the Binding object implemented Serializable
> already and just write it out, or copy it into a new Serializable Binding
> object if it didn't.  This would allow stores to serialize the binding
> object themselves, which could be of benefit to systems like TDB which would
> store its internal NodeIds instead of Nodes (some mechanism of passing the
> serialized Binding object type and other important objects, such as the
> NodeTable reference, around the serialization gap would probably be needed).
> I would also have to make new Serializable Node objects to parallel the Node
> subclasses (or modify the existing ones to use Serializable instead of
> Object in the "label" field).
>
> 1b) Implement a custom serializer for Binding and Node_* objects.  Could be
> binary or XML based.  Maybe leveraging the
> com.hp.hpl.jena.sparql.resultset.XMLOutputResultSet class if we wanted to
> use XML.
>
> 2) Serialize the generated Triples after applying the Bindings to the
> insert/delete templates.  This has the benefit of using a slightly modified
> N-Quads serializer/deserializer (changed to restore blank nodes back to
> their internal Jena IDs).  A further optimization would be to wrap this in a
> compressed input/output stream.
>
> I'm not sure which approach would be better for space efficiency, I guess it
> would really depend on the specific query as to whether the list of bindings
> or list of triples would be larger or smaller.  As of now it seems like 2)
> would be slightly easier to implement since I wouldn't have to create a
> serializer/deserializer.  However, it has the drawback of being less general
> and also forcing the generated triples to be materialized to Nodes and would
> mean that store implementations would not be able to leverage it if they
> wanted to generate triples of NodeIds when applying the templates.  Also it
> could be fragile in relying on internal blank node ids passing through the
> RDF writer and reader.  1a) does not look too difficult if I can make Node
> serializable, but then this change affects both Jena and ARQ.  However,
> having serializable binding objects has the potential benefit of being
> useful for other parts of query execution that could be memory bound (sort,
> join, distinct, group by, etc.).
>
>
> I would like to tackle option 1a), but I have a few questions:
>
>   - I want to make sure that there would be no major adverse effects from
> making the Node classes Serializable and the Node label field Serializable.
>   - The Binding.getParent() method.  What is this used for?  I think I can
> ignore this and store just the results of .vars(), and results of .get(var)
> for each variable since these will retrieve any required info from the
> parents as necessary.
>   - Does it matter that Bindings coming out of the deserializer will be flat
> and lose any notion of their original types?
>   - Should the Binding that comes out of the deserializer be immutable or
> should it properly implement .add() and .addAll() (for the SPARQL Update
> case it can definitely be immutable, but I'm not sure if it needs to be
> elsewhere in the query execution process)?


Re: SPARQL 1.1 Update in ARQ 2.8.7

Posted by Andy Seaborne <an...@epimorphics.com>.

On 03/02/11 14:37, Andy Seaborne wrote:
>
>
> On 01/02/11 16:36, Stephen Allen wrote:
>> Andy,
>>
>> I have started implementing the serializer (SinkBindingOutput) by using
>> org.openjena.riot.SinkQuadOutput as a guide and using OutputLangUtils to
>> print out the variable/values. I created the deserializer
>> (LangBindings) by
>> extending org.openjena.riot.lang.LangNTuple. I'm using the paired
>> var/value
>> format you described below. For now I'll start with a straightforward
>> implementation with no compression, but I like your ideas in this area.
>> I'll
>> try to do some measurements to see if any other compression is
>> beneficial.
>
> Sounds good.
>
>>
>> I did not define an org.openjena.riot.Lang enum for the deserializer
>> (because it isn't an RDF language) but I was planning on putting the
>> LangBindings class in the org.openjena.riot.lang package.
>
> As good a place as any at the moment.
>
> I've just been digging out some code that does tuple I/O from an
> experimental system a while ago (a clustered query engine ..).

I've dug the code out :

https://jena.svn.sf.net/svnroot/jena/Scratch/AFS/trunk/src/riot/io/

Has TokenInputStream, TokenOutputStream as interfaces.

	Andy

Re: SPARQL 1.1 Update in ARQ 2.8.7

Posted by Andy Seaborne <an...@epimorphics.com>.

On 01/02/11 16:36, Stephen Allen wrote:
> Andy,
>
> I have started implementing the serializer (SinkBindingOutput) by using
> org.openjena.riot.SinkQuadOutput as a guide and using OutputLangUtils to
> print out the variable/values.  I created the deserializer (LangBindings) by
> extending org.openjena.riot.lang.LangNTuple.  I'm using the paired var/value
> format you described below.  For now I'll start with a straightforward
> implementation with no compression, but I like your ideas in this area.  I'll
> try to do some measurements to see if any other compression is beneficial.

Sounds good.

>
> I did not define an org.openjena.riot.Lang enum for the deserializer
> (because it isn't an RDF language) but I was planning on putting the
> LangBindings class in the org.openjena.riot.lang package.

As good a place as any at the moment.

I've just been digging out some code that does tuple I/O from an 
experimental system a while ago (a clustered query engine ..).

>
> For determining when to spill bindings to disk, there are a few options (in
> order of least difficulty):
> 1) Store binding objects in an list, and then spill them to disk once the
> list size passes a threshold
> 2) Start serializing bindings immediately into something like
> DeferredFileOutputStream [1] that will retain the data in memory until it
> passes a memory threshold
> 3) Do 1), but try to calculate the size of the bindings in memory and use a
> memory threshold instead of a number of bindings threshold
>
> I think 1) should be sufficient if we come up with a reasonable guess for
> the threshold.  Option 2) lets you get much better control over the memory
> management, but I think the cost of unnecessarily serializing/deserializing
> small queries may be too high.

Personally, I'd encapsulate this in a policy object and have different 
implementations.  Well, maybe just one implementation - case 1 with a 
settable threshold, for testing.  (3) then becomes a smarter policy 
object, to be done later if needed.
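
Sketch of what I mean (illustrative names):

    import com.hp.hpl.jena.sparql.engine.binding.Binding;

    // The policy decides when the buffer should spill; (1) is just a count
    // threshold, (3) becomes a smarter, size-estimating implementation.
    interface SpillPolicy
    {
        void bindingAdded(Binding binding) ;
        boolean shouldSpill() ;
        void reset() ;                     // called after each spill
    }

    class ThresholdSpillPolicy implements SpillPolicy
    {
        private final int threshold ;
        private int count = 0 ;

        ThresholdSpillPolicy(int threshold) { this.threshold = threshold ; }

        public void bindingAdded(Binding binding)  { count++ ; }
        public boolean shouldSpill()               { return count >= threshold ; }
        public void reset()                        { count = 0 ; }
    }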

I share your concern on (2) about the serialization to memory costs.

	Andy