Posted to dev@jena.apache.org by Rob Vesse <rv...@dotnetrdf.org> on 2013/11/27 12:07:18 UTC

Possible bug in LeftJoin implementation?

Hey Andy

This was prompted by a bug originally reported for dotNetRDF (CORE-386 [1]).
I initially rejected it as Invalid based on my understanding of how LeftJoin
behaves, then reopened it because the user reporting it gets different
behaviour in ARQ (which I have reproduced).  So I am unclear which of
dotNetRDF or ARQ is doing things wrong based on my understanding of the
specification.

The test data is the trivial Turtle document as follows:

<http://r1> <http://r1> <http://r1> .
<http://r2> <http://r2> <http://r2> .

And the query is as follows:

SELECT *
WHERE
{
  GRAPH <http://a>
  {
    ?s ?p ?o .
  }
  OPTIONAL
  {
    GRAPH <http://b> { ?s0 ?p0 ?o0 . }
    FILTER (SAMETERM(?s, ?s0) && SAMETERM(?p, ?p0) && SAMETERM(?o, ?o0))
  }
  FILTER(!BOUND(?s0))
}


And for reference the unoptimised algebra is as follows:

(base <http://example/base/>
 (filter (! (bound ?s0))
  (leftjoin
   (graph <http://a>
    (bgp (triple ?s ?p ?o)))
   (graph <http://b>
    (bgp (triple ?s0 ?p0 ?o0)))
   (&& (&& (sameTerm ?s ?s0) (sameTerm ?p ?p0)) (sameTerm ?o ?o0)))))


The intent of the query is to calculate the delta of the graphs, i.e. the
triples that are present in <http://a> that are not present in <http://b>.
 So given two identical graphs it was intended to return 0 results,
however the behaviour in dotNetRDF is that it returns 2 results whereas
ARQ returns 0 results.

My belief was that dotNetRDF is correct and I'll explain why, though I think
I may be wrong and if so I'd love to understand why.  My understanding of
the flow of execution is as follows:

Step 1 - Execute the LHS of the left join which finds all triples in graph
<http://a> and thus returns the following:

s = r1, p = r1, o = r1
s = r2, p = r2, o = r2

Step 2 - Execute the RHS of the left join which finds all triples in graph
<http://b> and thus returns the following:

s0 = r1, p0 = r1, o0 = r1
s0 = r2, p0 = r2, o0 = r2


Step 3 - Calculate the possible join

s = r1, p = r1, o = r1, s0 = r1, p0 = r1, o0 = r1
s = r1, p = r1, o = r1, s0 = r2, p0 = r2, o0 = r2
s = r2, p = r2, o = r2, s0 = r1, p0 = r1, o0 = r1
s = r2, p = r2, o = r2, s0 = r2, p0 = r2, o0 = r2


Step 4 - Apply the filter on the left join


s = r1, p = r1, o = r1
s = r1, p = r1, o = r1, s0 = r1, p0 = r1, o0 = r1
s = r2, p = r2, o = r2
s = r2, p = r2, o = r2, s0 = r2, p0 = r2, o0 = r2

This I think is where ARQ and dotNetRDF differ in behaviour and where I
suspect my implementation is wrong.  Where the FILTER fails for some (but
not all) of the joined rows for a LHS solution, I retain the LHS solution
on its own whereas ARQ does not.  I'm guessing that I'm missing some bit
of the SPARQL specification for LeftJoin that says that where there is at
least one valid joinable solution for a LHS solution then the LHS does not
need to be preserved on its own?


If you could point me to this I would much appreciate it.

Step 5 - Apply the outer filter


s = r1, p = r1, o = r1
s = r2, p = r2, o = r2


So dotNetRDF returns 2 results but ARQ returns 0 results for this query.
Am I correct in thinking I've got a bug in my LeftJoin implementation over
in dotNetRDF?  Or is this actually a subtle bug in ARQ?

Thanks,

Rob

p.s. code for this test case and variations on it is committed as
TestGraphDeltas

[1] http://dotnetrdf.org/tracker/Issues/IssueDetail.aspx?id=386

Re: Possible bug in LeftJoin implementation?

Posted by Mike Grove <mi...@clarkparsia.com>.

On Wed, Nov 27, 2013 at 6:07 AM, Rob Vesse <rv...@dotnetrdf.org> wrote:

> Step 4 - Apply the filter on the left join
>
>
> s = r1, p = r1, o = r1
> s = r1, p = r1, o = r1, s0 = r1, p0 = r1, o0 = r1
> s = r2, p = r2, o = r2
> s = r2, p = r2, o = r2, s0 = r2, p0 = r2, o0 = r2
>
> This I think is where ARQ and dotNetRDF differ in behaviour and where I
> suspect my implementation is wrong.  Where the FILTER fails for some (but
> not all) of the joined rows for a LHS solution, I retain the LHS solution
> on its own whereas ARQ does not.  I'm guessing that I'm missing some bit
> of the SPARQL specification for LeftJoin that says that where there is at
> least one valid joinable solution for a LHS solution then the LHS does not
> need to be preserved on its own?
>
>
Yes, it's this.  If at any point a solution on the LHS joins with something
on the RHS, you don't need to preserve the unjoined solution from the LHS.
That's only for when it completely fails to join.

So the output from that should be the 2nd & 4th row from Step 4,
neither of which passes the subsequent filter, yielding zero results.
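
For reference, here is a minimal Python sketch of that rule, following the
LeftJoin/Diff definitions in the SPARQL 1.1 algebra and run over the data in
this thread.  Solutions are modelled as plain dicts, and the helper names
(compatible, left_join, same_terms) are made up for illustration; they are
not taken from ARQ or dotNetRDF.

```python
# Solutions are plain dicts mapping variable name -> term (an assumption
# made for this sketch; real engines use richer binding structures).

def compatible(m1, m2):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(m2[v] == t for v, t in m1.items() if v in m2)

def left_join(lhs, rhs, expr):
    """LeftJoin(lhs, rhs, expr): for each LHS solution keep every merged
    solution that passes expr; keep the bare LHS solution only when no
    compatible RHS solution passes expr."""
    out = []
    for m1 in lhs:
        merged = [{**m1, **m2} for m2 in rhs if compatible(m1, m2)]
        passed = [m for m in merged if expr(m)]
        out.extend(passed if passed else [m1])
    return out

def same_terms(m):
    """Stand-in for the sameTerm(...) && sameTerm(...) && sameTerm(...) condition."""
    return m["s"] == m["s0"] and m["p"] == m["p0"] and m["o"] == m["o0"]

lhs = [{"s": "r1", "p": "r1", "o": "r1"}, {"s": "r2", "p": "r2", "o": "r2"}]
rhs = [{"s0": "r1", "p0": "r1", "o0": "r1"}, {"s0": "r2", "p0": "r2", "o0": "r2"}]

joined = left_join(lhs, rhs, same_terms)
# Outer FILTER(!BOUND(?s0)): keep only solutions where ?s0 is unbound.
result = [m for m in joined if "s0" not in m]
print(len(joined), len(result))
```

Each LHS solution finds exactly one RHS solution that passes the condition,
so no bare LHS solution survives and the outer filter removes everything.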

Cheers,

Mike



Re: Possible bug in LeftJoin implementation?

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Andy -

Yes, we are talking about the same thing.  I understand the scoping of
FILTER in OPTIONAL and when it applies over the join rather than over the
inner operator.

Mike -

Thanks for confirming my suspicion, turns out to be a trivial bug in the
handling of left joins in dotNetRDF only when there is a cross product.
The normal join case was already correctly handling this part of the spec
and I just somehow missed it in the cross product case.

Cheers,

Rob

On 27/11/2013 15:39, "Andy Seaborne" <an...@apache.org> wrote:

>Hi Rob,
>
>Partial answer - I'm about to go into a RDF-WG telecon but I'll work
>through the details later.  I just wanted to check we are talking about
>the same thing because "OPTIONAL{ ... FILTER ... }" is special.
>
>You'll see in the algebra there is no (filter) in this part.
>
> >   (leftjoin
> >     (graph <http://a>
> >      (bgp (triple ?s ?p ?o)))
> >     (graph <http://b>
> >      (bgp (triple ?s0 ?p0 ?o0)))
> >     (&& (&& (sameTerm ?s ?s0) (sameTerm ?p ?p0)) (sameTerm ?o ?o0)))))
>
>The (&&...) is the 3rd argument to the leftJoin operation and forms part
>of the join condition.  It is not a filter over the GRAPH <http://b> { ?s0 ?p0
>?o0 . }, nor is it applied after the LeftJoin - in SQL terms, it's the ON
>condition for a leftjoin.  Scope-wise, the (&&) can see the ?s which
>it could not otherwise.
>
>For example this is a different query:
>
>SELECT *
>WHERE
>{
>   GRAPH <http://a>
>   {
>     ?s ?p ?o .
>   }
>   OPTIONAL
>   {
>     {
>       GRAPH <http://b> { ?s0 ?p0 ?o0 . }
>       FILTER (SAMETERM(?s, ?s0) && SAMETERM(?p, ?p0) && SAMETERM(?o, ?o0))
>     }
>   }
>   FILTER(!BOUND(?s0))
>}
>
>there is an additional {} inside the OPTIONAL {}.
>
>(filter (! (bound ?s0))
>   (leftjoin
>     (graph <http://a>
>       (bgp (triple ?s ?p ?o)))
>     (filter (&& (&& (sameTerm ?s ?s0) (sameTerm ?p ?p0))
>                 (sameTerm ?o ?o0))
>       (graph <http://b>
>         (bgp (triple ?s0 ?p0 ?o0))))))
>
>Now has 2* (filter)
>
>ARQ then gives 2 rows
>
>----------------------------------------------------------
>| s           | p           | o           | s0 | p0 | o0 |
>==========================================================
>| <http://r2> | <http://r2> | <http://r2> |    |    |    |
>| <http://r1> | <http://r1> | <http://r1> |    |    |    |
>----------------------------------------------------------
>
>The &&-filter is always false (?s not defined)
>
>Is that what dotNetRDF returns?
>
>I get this with the normal and ref query engines in ARQ.
>
>	Andy
>
><http://r1> <http://r1> <http://r1> <http://a> .
><http://r2> <http://r2> <http://r2> <http://a> .
><http://r1> <http://r1> <http://r1> <http://b> .
><http://r2> <http://r2> <http://r2> <http://b> .

Re: Possible bug in LeftJoin implementation?

Posted by Andy Seaborne <an...@apache.org>.

Hi Rob,

Partial answer - I'm about to go into a RDF-WG telecon but I'll work
through the details later.  I just wanted to check we are talking about
the same thing because "OPTIONAL{ ... FILTER ... }" is special.

You'll see in the algebra there is no (filter) in this part.

 >   (leftjoin
 >     (graph <http://a>
 >      (bgp (triple ?s ?p ?o)))
 >     (graph <http://b>
 >      (bgp (triple ?s0 ?p0 ?o0)))
 >     (&& (&& (sameTerm ?s ?s0) (sameTerm ?p ?p0)) (sameTerm ?o ?o0)))))

The (&&...) is the 3rd argument to the leftJoin operation and forms part
of the join condition.  It is not a filter over the GRAPH <http://b> { ?s0 ?p0
?o0 . }, nor is it applied after the LeftJoin - in SQL terms, it's the ON
condition for a leftjoin.  Scope-wise, the (&&) can see the ?s which
it could not otherwise.

For example this is a different query:

SELECT *
WHERE
{
   GRAPH <http://a>
   {
     ?s ?p ?o .
   }
   OPTIONAL
   {
     {
       GRAPH <http://b> { ?s0 ?p0 ?o0 . }
       FILTER (SAMETERM(?s, ?s0) && SAMETERM(?p, ?p0) && SAMETERM(?o, ?o0))
     }
   }
   FILTER(!BOUND(?s0))
}

there is an additional {} inside the OPTIONAL {}.

(filter (! (bound ?s0))
   (leftjoin
     (graph <http://a>
       (bgp (triple ?s ?p ?o)))
     (filter (&& (&& (sameTerm ?s ?s0) (sameTerm ?p ?p0))
                 (sameTerm ?o ?o0))
       (graph <http://b>
         (bgp (triple ?s0 ?p0 ?o0))))))

Now has 2* (filter)

ARQ then gives 2 rows

----------------------------------------------------------
| s           | p           | o           | s0 | p0 | o0 |
==========================================================
| <http://r2> | <http://r2> | <http://r2> |    |    |    |
| <http://r1> | <http://r1> | <http://r1> |    |    |    |
----------------------------------------------------------

The &&-filter is always false (?s not defined)
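
A rough Python sketch of why the nested-group form keeps both rows, assuming
solutions are plain dicts and that an unbound variable raises an error which
the filter treats as a rejection.  The helper names (filter_solutions,
left_join_unconditional, same_terms) are hypothetical and not ARQ or
dotNetRDF APIs.

```python
def compatible(m1, m2):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(m2[v] == t for v, t in m1.items() if v in m2)

def same_terms(m):
    # A KeyError on an unbound variable stands in for a SPARQL expression error.
    return m["s"] == m["s0"] and m["p"] == m["p0"] and m["o"] == m["o0"]

def filter_solutions(solutions, expr):
    """Keep solutions where expr is true; an error eliminates the solution."""
    kept = []
    for m in solutions:
        try:
            if expr(m):
                kept.append(m)
        except KeyError:
            pass
    return kept

def left_join_unconditional(lhs, rhs):
    """LeftJoin with a trivially true condition, as in the second algebra."""
    out = []
    for m1 in lhs:
        merged = [{**m1, **m2} for m2 in rhs if compatible(m1, m2)]
        out.extend(merged if merged else [m1])
    return out

graph_a = [{"s": "r1", "p": "r1", "o": "r1"}, {"s": "r2", "p": "r2", "o": "r2"}]
graph_b = [{"s0": "r1", "p0": "r1", "o0": "r1"}, {"s0": "r2", "p0": "r2", "o0": "r2"}]

# The inner filter runs over graph <http://b> alone, where ?s, ?p, ?o are unbound.
inner = filter_solutions(graph_b, same_terms)
joined = left_join_unconditional(graph_a, inner)
# Outer FILTER(!BOUND(?s0)).
result = filter_solutions(joined, lambda m: "s0" not in m)
print(len(inner), len(result))
```

Because the inner filter rejects every RHS solution, the left join has
nothing to extend with and both LHS rows pass through with ?s0 unbound.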

Is that what dotNetRDF returns?

I get this with the normal and ref query engines in ARQ.

	Andy

<http://r1> <http://r1> <http://r1> <http://a> .
<http://r2> <http://r2> <http://r2> <http://a> .
<http://r1> <http://r1> <http://r1> <http://b> .
<http://r2> <http://r2> <http://r2> <http://b> .

