Posted to dev@jena.apache.org by "Andy Seaborne (JIRA)" <ji...@apache.org> on 2015/05/28 14:05:19 UTC

[jira] [Commented] (JENA-949) DISTINCT spilling to a data bag leads to wrong answers.

    [ https://issues.apache.org/jira/browse/JENA-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562768#comment-14562768 ] 

Andy Seaborne commented on JENA-949:
------------------------------------

Analysis:

The problem is that the return value from the "distinct data net" is misused:

{code:title=QueryIterDistinct}
    @Override
    protected boolean isFreshSighting(Binding binding)
    {
        return db.netAdd(binding) ;
    }
{code}

A return of true means the binding is definitely new; false covers two cases. While the first part of the bag is being filled, the distinct data net returns false if the item is a duplicate. Once it starts spilling, it always returns false, meaning "unknown". {{QueryIterDistinct}} does not go back to check the data bag when the input iterator closes. What is more, some results have already been yielded, so the data bag iterator on its own is also the wrong answer.

The effect on {{QueryIterDistinct}} is that it will always skip over items added to the spilled data.
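
A minimal sketch of this failure mode, assuming a simplified in-memory model. {{ToyDistinctNet}}, {{naiveDistinct}} and the string "bindings" are hypothetical stand-ins for illustration, not Jena's classes; the only behaviour carried over from the analysis above is that {{netAdd}} returns true for "definitely new" and false for both "duplicate" and "unknown after spilling".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SpillDemo {
    /** Hypothetical stand-in for the distinct data net; NOT Jena's implementation. */
    static final class ToyDistinctNet {
        private final int threshold;
        private final Set<String> memory = new HashSet<>();
        private final List<String> spilled = new ArrayList<>();

        ToyDistinctNet(int threshold) { this.threshold = threshold; }

        /** true = definitely new; false = duplicate OR "unknown" (item was spilled). */
        boolean netAdd(String item) {
            if (memory.contains(item)) {
                return false;               // known duplicate
            }
            if (memory.size() < threshold) {
                memory.add(item);
                return true;                // definitely new
            }
            spilled.add(item);              // assumption: everything else spills
            return false;                   // "unknown", indistinguishable from duplicate
        }
    }

    /** Mirrors the misuse: treats netAdd's result as "is this a fresh sighting?". */
    static List<String> naiveDistinct(List<String> input, int threshold) {
        ToyDistinctNet net = new ToyDistinctNet(threshold);
        List<String> out = new ArrayList<>();
        for (String item : input) {
            if (net.netAdd(item)) {
                out.add(item);
            }
            // Spilled items are never revisited, so they are silently dropped.
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("g1", "g2", "g3", "g4", "g5",
                                     "g6", "g7", "g8", "g9", "g0");
        System.out.println(naiveDistinct(input, 100)); // no spill: all ten items
        System.out.println(naiveDistinct(input, 2));   // spill after two: [g1, g2]
    }
}
```

With a threshold of 2 the toy model yields only the first two items, which has the same shape as the second result table in the report below. Any fix would need to distinguish "duplicate" from "unknown" and reconcile the spilled items against results already yielded; the sketch does not attempt that.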


> DISTINCT spilling to a data bag leads to wrong answers.
> -------------------------------------------------------
>
>                 Key: JENA-949
>                 URL: https://issues.apache.org/jira/browse/JENA-949
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: ARQ
>    Affects Versions: Jena 3.0.0
>            Reporter: Andy Seaborne
>         Attachments: Jena949_1.java
>
>
> In the attached example, the same query is made twice. The second time {{ARQ.spillToDiskThreshold}} is set to 2L.  The first results are correct.
> [email 2015-05-20|http://mail-archives.apache.org/mod_mbox/jena-users/201505.mbox/%3C34B3B313-EAE4-4498-875F-A9674A8B3B2D%40interition.net%3E]
> reports a possibly similar situation at scale.
> The presence of {{DISTINCT}} is the key factor.
> Output:
> {noformat}
> -----------------------
> | g                   |
> =======================
> | <http://example/g1> |
> | <http://example/g2> |
> | <http://example/g3> |
> | <http://example/g4> |
> | <http://example/g5> |
> | <http://example/g6> |
> | <http://example/g7> |
> | <http://example/g8> |
> | <http://example/g9> |
> | <http://example/g0> |
> -----------------------
> -----------------------
> | g                   |
> =======================
> | <http://example/g1> |
> | <http://example/g2> |
> -----------------------
> {noformat}


