You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jena.apache.org by "Paolo Castagna (JIRA)" <ji...@apache.org> on 2011/08/08 11:26:27 UTC

[jira] [Created] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
------------------------------------------------------------------

                 Key: JENA-90
                 URL: https://issues.apache.org/jira/browse/JENA-90
             Project: Jena
          Issue Type: Improvement
          Components: ARQ
            Reporter: Paolo Castagna
            Priority: Trivial


ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084105#comment-13084105 ] 

Andy Seaborne commented on JENA-90:
-----------------------------------

There are a number of ways to approach this:

1/ Add a parameter to the "top" operator  e.g.(top distinct 10 (sort) (sub))
2/ Add a new operator e.g. top_distinct
3/ Decide when choosing QueryIterator based on (sub)

In (3), if the OpExecutor(OpTop) sees the algebra pattern:

(top N (?x)
  (distinct
    ....))

by 

opTop.getSubOp() instanceof OpDistinct 

then it chooses QueryIterTopNDistinct over the sub-expression of OpDistinct
else choose QueryIterTopN.  

This is treating it as a lower-level optimization.

I think 3 is neater.


> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092086#comment-13092086 ] 

Stephen Allen commented on JENA-90:
-----------------------------------

Hi Paolo,

I think the approach you want is to use QueryIterReduced instead of the new QueryIterDistinctSort class you propose (also, an important note: [1]).  Perhaps QueryIterReduced could possibly be optimized a little bit by eliminating the general purpose window array and using a single variable in this particular case of a sorted input.

Although, in my mind, a better approach would be to modify the algebra as part of a query optimization step (replace the OpDistinct with an OpReduced) when it is known that the QueryIterator to which it is applied to is sorted (either because of an underlying OpOrder or a sorted triple/quad index).  This makes it easier to determine what is going on during a query execution by examining the transformed algebra instead of having branches in the physical operators themselves.


[1]  DistinctDataBag is not guaranteed to be sorted.  The in-memory bindings are stored in a HashSet, thus if the bag does not spill to disk then no attempt is made to sort the bindings in the iterator (so as not to perform extra effort).  It would not be hard to create a DistinctSortedDataBag, but I'm not sure that it is necessary (and IMO limiting the number of primitive operations helps simplify the system).

> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>         Attachments: ARQ_JENA-90_r1159636.patch
>
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083102#comment-13083102 ] 

Paolo Castagna commented on JENA-90:
------------------------------------

Let's take, for example, this SPARQL query:

SELECT DISTINCT  *
WHERE
  { ?s ?p ?o }
ORDER BY ?p
LIMIT   10

The correspondent algebra expression is:

(slice _ 10
  (distinct
    (order (?p)
      (bgp (triple ?s ?p ?o)))))

Which is equivalent to:

(slice _ 10
  (reduced
    (order (?p)
      (bgp (triple ?s ?p ?o)))))

However, the distinct or reduced operators forbid the optimization described in JENA-89. Maybe we can modify the 'top' operator to yields only distinct bindings or add a new 'top_distinct' operator for that:

(top_distinct (10 ?p ?s)
  (bgp (triple ?s ?p ?o))) 

SPARQL queries of the type SELECT DISTINCT ... WHERE {...} ORDER BY ...  LIMIT 10 are common when people want to display the 10 most 'something' things in their dataset.

The implementation of a QueryIterTopNDistinct is almost the same as QueryIterTopN (see: JENA-89) but we add bindings to the PriorityQueue if and only if they are not already there (using .contains() to check).

Is it worth adding a top_distinct operator or it just pollutes the algebra?

> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna resolved JENA-90.
--------------------------------

    Resolution: Fixed

Thanks Stephen and Andy for your comments and guidance.

I now want to think (again!) at how the optimizations in JENA-89 and JENA-90 interact each others when we have a DISTINCT + ORDER BY + LIMIT query (i.e. people want to find the 10 most something things in their data).

> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>         Attachments: ARQ_JENA-90_r1159636.patch
>
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Closed] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna closed JENA-90.
------------------------------


> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>         Attachments: ARQ_JENA-90_r1159636.patch
>
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Andreas Schultz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084081#comment-13084081 ] 

Andreas Schultz commented on JENA-90:
-------------------------------------

About not needing a set of already seen bindings: This would only be true if all variables in the Select clause were also part of the Order By. In general a set would still be needed. The optimization in the general case would be to clear this set every time the variable bindings used in the Order By change.

> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092095#comment-13092095 ] 

Paolo Castagna commented on JENA-90:
------------------------------------

Stephen, thanks for your comments and suggestions. They make sense, I'll do that.

> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>         Attachments: ARQ_JENA-90_r1159636.patch
>
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna updated JENA-90:
-------------------------------

    Attachment: ARQ_JENA-90_r1159636.patch

This is a first attempt which I am sharing for feedback (now, more testing is necessary... after the lesson learned from JENA-89 :-)).

Another option would have been to replace distinct with reduce via an algebra transformation and used the QueryIterDistinctSort in OpExecutor.execute(OpReduced, ...) method instead. But I did not see advantages in doing so, unless we want to control the optimization via a symbol.

Both are reasonable and very simple... and the tests will be the same.

> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>         Attachments: ARQ_JENA-90_r1159636.patch
>
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna reassigned JENA-90:
----------------------------------

    Assignee: Paolo Castagna

> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-90) Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084187#comment-13084187 ] 

Paolo Castagna commented on JENA-90:
------------------------------------

> I think 3 is neater. 

Ok.

However, I still have a doubt.

With this algebra expression: 

(slice _ 10
  (distinct
    (order (?p) 
      ...)))

The optimization in JENA-89 will not trigger. Therefore there will be no (top ...)

Could we have the distinct before the order?

(slice _ 10
  (order (?p) 
    (distinct
      ...)))

Or, should we do it in OpExecutor(OpSlice) checking if it sees the algebra patter (slice ... (distinct ... (order (...)))?

> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
>                 Key: JENA-90
>                 URL: https://issues.apache.org/jira/browse/JENA-90
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Priority: Trivial
>              Labels: arq, optimizer, sparql
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira