You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jackrabbit.apache.org by "Robert Seidel (JIRA)" <ji...@apache.org> on 2011/05/04 16:06:03 UTC

[jira] [Created] (JCR-2959) SQL2 "order by" can't be used - QueryEngine should use Lucene for sort instead of Collections.sort (line 327)

SQL2 "order by" can't be used - QueryEngine should use Lucene for sort instead of Collections.sort (line 327)
-------------------------------------------------------------------------------------------------------------

                 Key: JCR-2959
                 URL: https://issues.apache.org/jira/browse/JCR-2959
             Project: Jackrabbit Content Repository
          Issue Type: Improvement
          Components: jackrabbit-core
    Affects Versions: 2.2.1
            Reporter: Robert Seidel


Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090903#comment-13090903 ] 

Alex Parvulescu commented on JCR-2959:
--------------------------------------

I pushed the first iteration based on the patch in revision #1161475.

It does not currently use native lucene sorts (we're missing the apis to figure out if a given field is indexed or not).
Also it is untested against joins of any kind.

The code is disabled by default, to keep things working. You can enable it by setting the system property "useNativeSort" to true. 
This property will be removed once the implementation is stable enough.

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu updated JCR-2959:
---------------------------------

    Priority: Minor  (was: Major)
     Summary: SQL2 QueryEngine should use Lucene for sort  (was: SQL2 "order by" can't be used - QueryEngine should use Lucene for sort instead of Collections.sort (line 327))

I've changed the name, it was a bit too inflammatory :)

I agree with the idea that we should investigate native lucene search, with custom hooks into JR for fetching Node data.

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084294#comment-13084294 ] 

Alex Parvulescu commented on JCR-2959:
--------------------------------------

I've updated JCR-3047, so in order for the patch to still work, please replace 'String p = evaluator.getAffectedPropertyName' with 'String p = o.toString()'

Yes, currently the sort is still done in JR, but it is extremely easy to fallback on native lucene fields (I just have to figure out how to know when that is available or not)

I believe that having a native sort would benefit more in the long term, as it will allow for other performance improvements (see also JCR-2830)
I'd aim at having equivalent run times in worst case scenarios, and be good on some of the worst performing scenarios, so that sql2 stops being the underdog in this query language wars :)
And I find it hard to believe that the improvement is just a fluke.

I agree with the fact that we need to test some more, this is why the patch is a working prototype.
I'll go for the test scenario you proposed, so we can see what turns up.

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090912#comment-13090912 ] 

Alex Parvulescu commented on JCR-2959:
--------------------------------------

I've added a new test suite that includes the changes to the 2.3 performance tests in revision #1161481.

So we can compare the 2 sort implementations. (it will also help see the evolution of the new impl).

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084564#comment-13084564 ] 

Jukka Zitting commented on JCR-2959:
------------------------------------

BTW, +1 on applying the patch as a work in progress. It's then much easier for us to propose improvements or alternatives.

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084206#comment-13084206 ] 

Jukka Zitting commented on JCR-2959:
------------------------------------

Just measuring the Query.execute() time doesn't give you very meaningful results. You'll need to include the time it takes to iterate through the returned nodes, as that's the time that an end user would be seeing.

I'd expect the Collections.sort() overhead to be a problem only for big query results of which the client only traverses a small subset of first results (i.e. the typical Google search result use case). For such cases it would indeed be great if the SQL2 engine could leverage the result ordering features of the underlying Lucene index.

As for the implementation, I'd rethink the rather complicated DynamicOperandFieldComparator mechanism that can make the Lucene index call back through layers of abstraction with Session.getNodeByIdentifier(). Instead we could simply divide the given Orderings to those that the Lucene index can take care of and those that need to be applied with the current Collections.sort() implementation. Any Lucene-compatible Orderings at the beginning of the list would be converted and passed to Lucene, and the remaining Orderings (if any) would be handled with the current mechanism.

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu updated JCR-2959:
---------------------------------

    Attachment: JCR-2959.patch

watch out, big patch coming through!

A first stab at a more integrated approach to sort. It doesn't cover all the bases yet, but it should be enough to get us started.

Just to increase your interest, I've attached a perf test (SQL2OrderByTest#testBig), there are the first results I see:

Old style (10 iterations)
ran select, took 941 ms
ran select, took 310 ms
ran select, took 199 ms
ran select, took 102 ms
ran select, took 112 ms
ran select, took 734 ms
ran select, took 189 ms
ran select, took 104 ms
ran select, took 105 ms
ran select, took 125 ms

New style (10 iterations)
ran select, took 629 ms
ran select, took 134 ms
ran select, took 117 ms
ran select, took 117 ms
ran select, took 133 ms
ran select, took 618 ms
ran select, took 108 ms
ran select, took 94 ms
ran select, took 84 ms
ran select, took 80 ms

It looks a bit interesting ~30% faster run times.

You can switch between implementations via a global flag (I know it's not nice): 
QueryEngine.NATIVE_SORT true for lucene native, false for Collections.sort.

It doesn't know how to sort joins yet, just simple queries.
Also, it could do better (like falling back on lucene native properties when they are available in the index instead of loading the entire node).

Feedback very welcome!




> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084232#comment-13084232 ] 

Alex Parvulescu commented on JCR-2959:
--------------------------------------

Ok, I'm all for testing and profiling

This is with iteration included (node.getpath called on each entry)

lucene sort
select took 507 ms
fetch took 45 ms
select took 106 ms
fetch took 7 ms
select took 102 ms
fetch took 6 ms
select took 84 ms
fetch took 7 ms
select took 100 ms
fetch took 6 ms
select took 84 ms
fetch took 5 ms
select took 61 ms
fetch took 5 ms
select took 55 ms
fetch took 6 ms
select took 46 ms
fetch took 5 ms
select took 39 ms
fetch took 5 ms


  collections.sort
select took 600 ms
fetch took 43 ms
select took 104 ms
fetch took 6 ms
select took 88 ms
fetch took 7 ms
select took 92 ms
fetch took 7 ms
select took 109 ms
fetch took 5 ms
select took 90 ms
fetch took 5 ms
select took 89 ms
fetch took 6 ms
select took 89 ms
fetch took 5 ms
select took 90 ms
fetch took 5 ms
select took 98 ms
fetch took 6 ms

It still looks better.

While I agree that this solution has a different access pattern, which could be in the end not that good, I'd like to see a test scenario that can surface the problems.
Also, this is a rough draft, so I'd like to validate (or not) the approach using more math rather than history :)

Like you pointed out, having a native approach to sorting can allow us to improve certain use-cases that are really poor performing in sql2 right now. Sorting at the end on the entire result set does limit what we can do in the way of optimization and performance.


> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JCR-2959) SQL2 QueryEngine should use Lucene for sort

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084283#comment-13084283 ] 

Jukka Zitting commented on JCR-2959:
------------------------------------

If I understand correctly, the implementation in the patch still doesn't use the actual fields in Lucene for searching, it just loads the node using Session.getNodeByIdentifier() and uses the OperandEvaluator to get the values by which documents are to be compared. It seems to me that this is pretty much equivalent in terms of disk accesses to what the existing sorting mechanism does and the speedup you're seeing is probably simply caused by the sorting mechanism memorizing the values returned by the OperandEvaluator. I would expect that difference to fade away if the test set becomes larger than what fits into memory.

Some math to give us more background: The execution of a single-selector query currently consists of the following main step:

    1. Constructing the Lucene query
    2. Executing the Lucene query
    3. Fetching all matching nodes to memory
    4. Sorting the nodes

Step 1 is just an in-memory mapping, so not too time-consuming. Steps 2 and 3 are probably the most time-consuming bits here, but both outside the scope of this issue. Finally, step 4 consists of O(n log n) in-memory operations, where n is the number of returned nodes. That's some time, but as these are in-memory operations the overhead should be pretty small compared to the previous steps, whose worst case behavior is O(n) disk accesses.

The proposed change here is to switch steps 3 and 4 around and use Lucene for the sorting. The time required to sort all the results is still O(n log n) and may well require some extra disk accesses to fetch the sort fields from the Lucene index. However, the time taken in this step should still be significantly lower than in fetching all the matching nodes to memory. Thus the big benefit that we can gain from this optimization comes in cases where we *don't* want to fetch all the matching nodes to memory.

A good test case for such a scenario could be a data set of 1m nodes, of which a query targets a subset of 100k nodes, and only the first 100 nodes of the sorted results are actually being read. With a proper implementation of this mechanism, the Lucene search would find all the matching 100k documents and read the sort fields (in the Lucene index) of those documents for quickly ordering the results. That should still be pretty fast compared to fetching all the 100k nodes into memory, which is what the current implementation (and the current patch) does. And then fetching only 100 first results should take no time at all. I'd expect us to reach up to an order of magnitude of performance improvement for such a scenario.

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira