You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jena.apache.org by "Paolo Castagna (JIRA)" <ji...@apache.org> on 2011/08/08 10:38:27 UTC

[jira] [Created] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Avoid a total sort for ORDER BY + LIMIT queries
-----------------------------------------------

Key: JENA-89
URL: https://issues.apache.org/jira/browse/JENA-89
Project: Jena
Issue Type: Improvement
Components: ARQ
Reporter: Paolo Castagna
Assignee: Paolo Castagna
Priority: Minor

In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
As an alternative, we can use a [PriorityQueue|http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html] (which is based on a priority heap) to avoid a sort operation.

ARQ's algebra package contains already a [OpTopN|https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java] operator. The [OpExecutor|https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084139#comment-13084139 ] 

Andy Seaborne commented on JENA-89:
-----------------------------------

new PriorityQueue<Binding>(TOPN_LIMIT_THRESHOLD, comparator) ;

Should this be:

new PriorityQueue<Binding>(numItems, comparator) ;

?
The PQ only needs to be the size of N.


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna updated JENA-89:
-------------------------------

    Attachment: ARQ_JENA-89_r1156212.patch

I am attaching a patch for this to get some feedback before committing it. There are a couple of changes I am not 100% sure about.

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-89:
------------------------------

    Attachment: JENA-89-TopTests.patch

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-89:
------------------------------

    Comment: was deleted

(was: OpBuilder was broken in another was as well: "Op sub = build(list, 3) ;" now fixed.
)

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084640#comment-13084640 ] 

Andy Seaborne commented on JENA-89:
-----------------------------------

The existing tests are for other things and don't gaurantee coverage.  Having a specific set of tests for the feature means it's easier to see what is and is not covered.

Patch JENA-89-TopTests attached for tests in testing/ARQ/Optimizer.



> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-TopTests.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna resolved JENA-89.
--------------------------------

    Resolution: Fixed

Thanks Andy (new tests have been added and bug fixed).

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-Top-FixHeapCollation.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Closed] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna closed JENA-89.
------------------------------


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-Top-FixHeapCollation.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna updated JENA-89:
-------------------------------

    Description: 
In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.

ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 

  [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
  [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
  [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
  [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

  was:
In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
As an alternative, we can use a PriorityQueue [1] (which is based on a priority heap) to avoid a sort operation.

ARQ's algebra package contains already a OpTopN [2] operator. The OpExecutor [3] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 

  [1] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
  [2] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
  [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084641#comment-13084641 ] 

Andy Seaborne commented on JENA-89:
-----------------------------------

In creating the tests, a bug was discovered.

The heap needs to keep the least N items. So when a new item to keep comes in, the greatest item so far needs to be evicted, not the least.  In other words, the heap is kept in maximum order and the test to to look at the max of the N items (heap top) to see if the new binding is less than the max least N so far.

Patch attached.  heap uses a reversed comparator.

Also (minor) avoid a copy by Arrays.asList.


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-Top-FixHeapCollation.patch, JENA-89-TopTests.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-89:
------------------------------

    Attachment: JENA-89-TopTests.patch

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-TopTests.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-89:
------------------------------

    Attachment: JENA-89-Top-FixHeapCollation.patch

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-Top-FixHeapCollation.patch, JENA-89-TopTests.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084120#comment-13084120 ] 

Andy Seaborne commented on JENA-89:
-----------------------------------

OpBuilder was broken in another was as well: "Op sub = build(list, 3) ;" now fixed.


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084123#comment-13084123 ] 

Andy Seaborne commented on JENA-89:
-----------------------------------

A fixed threshold seems sensible.  It can also always be made adjustable later.  I think the key use case is small values (e.g. 10) so a fixed limit more than that is fine.

BuilderOp fixes look right.

qparse will test this out - it prefers various checks, including round-tripping through printed algebra (as you've probably already discovered :-))

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-89:
------------------------------

    Attachment:     (was: JENA-89-TopTests.patch)

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-Top-FixHeapCollation.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083012#comment-13083012 ] 

Paolo Castagna edited comment on JENA-89 at 8/11/11 8:44 AM:
-------------------------------------------------------------

The optimizer applies the transformation if and only if LIMIT is less than the threshold and no OFFSET is used.
Are we happy with a fixed TOPN_LIMIT_THRESHOLD = 100? 

The optimization is active by default, but it can be turned off via --set http://jena.hpl.hp.com/ARQ#optTopNSorting=false

The changes in BuilderOp are the ones I am less sure about:

-            BuilderLib.checkLength(4, list,  Tags.tagTopN) ;
-            int N = BuilderNode.buildInt(list, 1, -1) ;
-            ItemList conditions = list.get(2).getList() ;
+            BuilderLib.checkLength(3, list,  Tags.tagTopN) ;
+            long N = BuilderNode.buildInt(list.get(1).getList(), 0, -1) ;
+            ItemList conditions = list.get(1).getList().cdr() ;

An example of the optimization in action follows:

SPARQL query:

SELECT  *
WHERE
  { ?s ?p ?o }
ORDER BY ?p ?s
LIMIT   10

SPARQL algebra expression unchanged:

(slice _ 10
  (order (?p ?s)
    (bgp (triple ?s ?p ?o))))

SPARQL algebra expression with the optimization applied:

(top (10 ?p ?s)
  (bgp (triple ?s ?p ?o)))


      was (Author: castagna):
    The optimizer applies the transformation if and only if LIMIT is less than the threshold and no OFFSET is used.
Are we happy with a fixed TOPN_LIMIT_THRESHOLD = 100? 

The optimization is active by default, but it can be turned off via --set http://jena.hpl.hp.com/ARQ#optTopNSorting=false

The changes in BuilderOp are the ones I am less sure about:

-            BuilderLib.checkLength(4, list,  Tags.tagTopN) ;

-            int N = BuilderNode.buildInt(list, 1, -1) ;

-            ItemList conditions = list.get(2).getList() ;

+            BuilderLib.checkLength(3, list,  Tags.tagTopN) ;

+            long N = BuilderNode.buildInt(list.get(1).getList(), 0, -1) ;

+            ItemList conditions = list.get(1).getList().cdr() ;

An example of the optimization in action follows:

SPARQL query:

SELECT  *
WHERE
  { ?s ?p ?o }
ORDER BY ?p ?s
LIMIT   10

SPARQL algebra expression unchanged:

(slice _ 10
  (order (?p ?s)
    (bgp (triple ?s ?p ?o))))

SPARQL algebra expression with the optimization applied:

(top (10 ?p ?s)
  (bgp (triple ?s ?p ?o)))

  
> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna updated JENA-89:
-------------------------------

    Description: 
In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
As an alternative, we can use a PriorityQueue [1] (which is based on a priority heap) to avoid a sort operation.

ARQ's algebra package contains already a OpTopN [2] operator. The OpExecutor [3] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 

  [1] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
  [2] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
  [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

  was:
In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
As an alternative, we can use a [PriorityQueue|http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html] (which is based on a priority heap) to avoid a sort operation.

ARQ's algebra package contains already a [OpTopN|https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java] operator. The [OpExecutor|https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, we can use a PriorityQueue [1] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [2] operator. The OpExecutor [3] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [2] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna resolved JENA-89.
--------------------------------

    Resolution: Fixed

Thanks Andy.

By the way, I had a look at the scripted tests: ORDER BY + LIMIT queries are present in various places. So, I did not duplicate those tests in testing/ARQ/Optimization. 

However, it would be good to have something similar to the scripted tests but to check the SPARQL algebra before optimization and after optimization. Is this something already there? I did not find it. If not, is it worth doing? I can give it a try.

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083012#comment-13083012 ] 

Paolo Castagna commented on JENA-89:
------------------------------------

The optimizer applies the transformation if and only if LIMIT is less than the threshold and no OFFSET is used.
Are we happy with a fixed TOPN_LIMIT_THRESHOLD = 100? 

The optimization is active by default, but it can be turned off via --set http://jena.hpl.hp.com/ARQ#optTopNSorting=false

The changes in BuilderOp are the ones I am less sure about:

-            BuilderLib.checkLength(4, list,  Tags.tagTopN) ;

-            int N = BuilderNode.buildInt(list, 1, -1) ;

-            ItemList conditions = list.get(2).getList() ;

+            BuilderLib.checkLength(3, list,  Tags.tagTopN) ;

+            long N = BuilderNode.buildInt(list.get(1).getList(), 0, -1) ;

+            ItemList conditions = list.get(1).getList().cdr() ;

An example of the optimization in action follows:

SPARQL query:

SELECT  *
WHERE
  { ?s ?p ?o }
ORDER BY ?p ?s
LIMIT   10

SPARQL algebra expression unchanged:

(slice _ 10
  (order (?p ?s)
    (bgp (triple ?s ?p ?o))))

SPARQL algebra expression with the optimization applied:

(top (10 ?p ?s)
  (bgp (triple ?s ?p ?o)))


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Reopened] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna reopened JENA-89:
--------------------------------


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-Top-FixHeapCollation.patch, JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084212#comment-13084212 ] 

Andy Seaborne commented on JENA-89:
-----------------------------------

TestOptimizer (which you've already found).

Could add op->op tests.

c.f. TestVarRename

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084143#comment-13084143 ] 

Andy Seaborne commented on JENA-89:
-----------------------------------

To test this, we could have scripted queries in testing/ARQ/Optimization although it might incidently be partially covered elsewhere.


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira