Posted to users@jena.apache.org by Chris Dollin <ch...@epimorphics.com> on 2016/01/26 16:08:29 UTC

promises (or not) about result order in tdb find calls

Dear All

If a dataset graph G backed by TDB runs .find(ANY, ANY, ANY, ANY), are
there any promises made about the order in which quads come out of
the iterator? Failing a promise, how about a strong likelihood of
some specific order? [1]

I ask because we have a (large) dataset for which we wish to apply
an operation (as it happens, text indexing) to each subject+graph
in the graph exactly once. Currently we write code that runs the above
find() call and processes the graph+subject if it has not already
seen it, using a Set<Node> to remember subject Nodes.
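
A minimal sketch of that seen-set approach, for illustration only. To keep it
self-contained it does not use Jena: each quad is a hypothetical stand-in, a
String[] of {graph, subject, predicate, object}, where real code would take
Quad and Node values from the find() iterator. The class and method names are
made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class SeenPairs {
    /**
     * Returns each distinct (graph, subject) pair once, in first-seen order.
     * Each quad is a stand-in String[] {graph, subject, predicate, object}.
     */
    static List<String> firstSightings(Iterator<String[]> quads) {
        // This set grows with the number of distinct pairs, which is the
        // overhead in question; it works whatever order the quads arrive in.
        Set<List<String>> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        while (quads.hasNext()) {
            String[] q = quads.next();
            // Set.add returns true only the first time a pair is added.
            if (seen.add(Arrays.asList(q[0], q[1]))) {
                out.add(q[0] + " " + q[1]); // "process" the pair here
            }
        }
        return out;
    }
}
```

The key is the (graph, subject) pair rather than the subject alone, so the
same subject in two graphs is handled twice.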

If all the quads with the same graph+subject turned up together we could
dispense with this machinery and its overhead.

If not, well, we have other approaches in mind (to avoid big sets).

Chris

[1] I'm not expecting such a promise but it would be remiss of me
     not to check and dismiss it a priori ...

-- 
"It's just the beginning we've seen"            - Colosseum, /Tomorrow's Blues/

Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)

Re: promises (or not) about result order in tdb find calls

Posted by Chris Dollin <ch...@epimorphics.com>.
Hi Andy

> On 26/01/16 15:08, Chris Dollin wrote:
>> Dear All
>>
>> If a dataset graph G backed by TDB runs .find(ANY, ANY, ANY, ANY),
>
>
> Is that Graph.find(?,?,?) on graph G or DatasetGraph.find(?,?,?,?)? There isn't
> a find/4 on Graph. Or is it a default union graph G?

It's a DatasetGraph and it's configured as a default union graph.

>> Failing a promise, how about a strong likelihood of
>> some specific order? [1]
>
> Yes. Currently.
>
> DatasetGraph.find(?,?,?,?) covers
>
> the default graph first, using SPO
> then
> the named graphs, using SPOG (caution - this is a special case)
>
> But DatasetGraph.find(G,?,?,?) uses GSPO for fixed G (not SPOG)

OK.

> The special case is because of default union graph.  Normally, GSPO is the
> "primary" index.

...
> This confuses me - where has the graph name gone?

Careless typing.

> Or are you assuming subjects only in one graph?

No, we're tracking by subject x graph.

>> If all the quads with the same graph+subject turned up together we could
>> dispense with this machinery and its overhead.
>
> As you are prepared to make version- and TDB-specific assumptions, you could
> access the specific TDB index you are interested in.  You will need to
> reconstruct Nodes.  Make a QuadTable or TripleTable of one index.
>
> That way, you will see GSPO which is the index you want.
> GSPO is sorted by G then S then P then O.
>
> Caveat emptor.

Yes. I think that would be more than would be wise for us to do.

> Caveat emptor^2 if this is a live database with write transactions around
> (still possible but harder).

In this case the only activity on the database is reading it for this
indexing.

>> If not, well, we have other approaches in mind (to avoid big sets).
>
> Do a backup, sort the n-quads so (S,G) are adjacent, and read that as input.
> This avoids in-memory workspace for (S,G) or (S) depending on which case we are in.

Exactly.

> Even more "caveat emptor" - it all depends.

Yes.

Thanks Andy.

Chris


Re: promises (or not) about result order in tdb find calls

Posted by Andy Seaborne <an...@apache.org>.
On 26/01/16 15:08, Chris Dollin wrote:
> Dear All
>
> If a dataset graph G backed by TDB runs .find(ANY, ANY, ANY, ANY),


Is that Graph.find(?,?,?) on graph G or DatasetGraph.find(?,?,?,?)?
There isn't a find/4 on Graph. Or is it a default union graph G?

These all make a difference to what non-promise you get.

> are
> there any promises made about the order in which quads come out of
> the iterator?

Promised - no.

(It has even changed between versions in one case.)

> Failing a promise, how about a strong likelihood of
> some specific order? [1]

Yes. Currently.

DatasetGraph.find(?,?,?,?) covers

the default graph first, using SPO
then
the named graphs, using SPOG (caution - this is a special case)

But DatasetGraph.find(G,?,?,?) uses GSPO for fixed G (not SPOG)

The special case is because of default union graph.  Normally, GSPO is 
the "primary" index.

> I ask because we have a (large) dataset for which we wish to apply
> an operation (as it happens, text indexing) to each subject+graph
> in the graph exactly once. Currently we write code that runs the above
> find() call and processes the graph+subject if it has not already
> seen it, using a Set<Node> to remember subject Nodes.

This confuses me - where has the graph name gone?
Or are you assuming subjects only in one graph?

> If all the quads with the same graph+subject turned up together we could
> dispense with this machinery and its overhead.

As you are prepared to make version- and TDB-specific assumptions, you 
could access the specific TDB index you are interested in.  You will 
need to reconstruct Nodes.  Make a QuadTable or TripleTable of one index.

That way, you will see GSPO which is the index you want.
GSPO is sorted by G then S then P then O.
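
To illustrate why a G-then-S-then-P-then-O order puts all quads for one
graph+subject next to each other, here is a small sketch. It does not touch
TDB: quads are hypothetical String[] {graph, subject, predicate, object}
values compared as plain strings, and the names are invented for the example.
As far as I know the real index is ordered by internal node ids rather than
by strings, so the actual sequence would differ; only the grouping carries over.

```java
import java.util.ArrayList;
import java.util.List;

class GspoOrder {
    /** Compares two stand-in quads position by position: G, then S, then P, then O. */
    static int compareGspo(String[] a, String[] b) {
        for (int i = 0; i < 4; i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) {
                return c; // first differing position decides
            }
        }
        return 0;
    }

    /** Returns a copy of the quads sorted in GSPO order. */
    static List<String[]> sortGspo(List<String[]> quads) {
        List<String[]> copy = new ArrayList<>(quads);
        copy.sort(GspoOrder::compareGspo);
        return copy;
    }
}
```

Because G and S are the two leading sort positions, every quad sharing a
graph and subject lands in one contiguous run.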

Caveat emptor.

Caveat emptor^2 if this is a live database with write transactions around 
(still possible but harder).

> If not, well, we have other approaches in mind (to avoid big sets).

Do a backup, sort the n-quads so (S,G) are adjacent, and read that as 
input.  This avoids in-memory workspace for (S,G) or (S) depending on 
which case we are in.
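
A sketch of how the sorted input could be consumed, assuming the quads have
already been parsed into hypothetical String[] {graph, subject, predicate,
object} values (parsing the n-quads is left out, and the names are invented
for the example). Only the previous (graph, subject) key is kept in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class AdjacentGroups {
    /**
     * Reports a (graph, subject) pair each time the key changes from the
     * previous quad. This is "exactly once per pair" only if the input is
     * ordered so that equal pairs are adjacent.
     */
    static List<String> groupStarts(Iterator<String[]> quads) {
        List<String> out = new ArrayList<>();
        String prevG = null;
        String prevS = null;
        while (quads.hasNext()) {
            String[] q = quads.next();
            // A new run starts whenever graph or subject differs from the last quad.
            if (!q[0].equals(prevG) || !q[1].equals(prevS)) {
                out.add(q[0] + " " + q[1]); // "process" the pair here
                prevG = q[0];
                prevS = q[1];
            }
        }
        return out;
    }
}
```

If the input is not grouped, the same pair is reported again each time it
reappears, which is why this depends on the sort.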

Even more "caveat emptor" - it all depends.

	Andy

>
> Chris
>
> [1] I'm not expecting such a promise but it would be remiss of me
>      not to check and dismiss it a priori ...
>