Posted to users@jackrabbit.apache.org by Lukas Kahwe Smith <ml...@pooteeweet.org> on 2012/04/10 11:21:18 UTC

dealing with large result sets

Hi,

Currently I see some big issues with queries that return large result sets. A lot of the work is not done inside Lucene, which will probably not be fixed soon (or maybe never inside 2.x). However, I think it's important to make some intermediate improvements.

Here are some suggestions. I hope we can brainstorm together on ideas that are feasible to implement in a shorter time frame than waiting for Oak:

1) there should be a way to get a count

That way, if I need to run a query that has to be ordered, I can first check whether the count is too high and decide if I should even bother running the search. In other words, in most cases a search leading to 100+ results means that whoever ran the search needs to narrow it down further.

I guess the most sensible thing would be to simply offer a way to do SELECT COUNT(*) FROM ..
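
For illustration, the closest client-side approximation today with the
plain JCR 2.0 API is probably something like the sketch below. Note that
NodeIterator.getSize() is allowed to return -1 when the implementation
does not know the size, so this is only an approximation, not a real
SELECT COUNT(*):

import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class CountSketch {

    // Approximate a COUNT(*) by executing the query and asking the
    // result iterator for its size. Returns -1 if the size is unknown.
    public static long approximateCount(Session session, String sql2)
            throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(sql2, Query.JCR_SQL2);
        NodeIterator nodes = query.execute().getNodes();
        return nodes.getSize();
    }
}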

2) a way to automatically stop long running queries

It would be great if one could define a timeout for queries: if a query takes longer than X, it should just fail. This should be a global setting, but ideally it should be possible to override it on a per-query basis.
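
As far as I know there is no such setting in Jackrabbit today. A purely
client-side stopgap is to run the query on another thread and stop
waiting after a deadline, as in the sketch below. This only unblocks the
caller; the query keeps running inside the repository, so it is a
workaround, not a substitute for real support:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import javax.jcr.query.Query;
import javax.jcr.query.QueryResult;

public class QueryTimeoutSketch {

    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Wait at most timeoutMillis for the query, then give up. The
    // repository-side work is NOT cancelled, only the caller unblocks.
    public static QueryResult executeWithTimeout(final Query query,
            long timeoutMillis) throws Exception {
        Future<QueryResult> future = POOL.submit(new Callable<QueryResult>() {
            public QueryResult call() throws Exception {
                return query.execute();
            }
        });
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort only
            throw e;
        }
    }
}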

3) .. ?

regards,
Lukas Kahwe Smith
mls@pooteeweet.org




Re: dealing with large result sets

Posted by Ard Schrijvers <a....@onehippo.com>.
On Tue, Apr 10, 2012 at 11:55 AM, Christian Stocker
<ch...@liip.ch> wrote:
>
>
> On 10.04.12 11:51, Ard Schrijvers wrote:
>> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
>> <ch...@liip.ch> wrote:
>>>
>>>
>>> On 10.04.12 11:32, Ard Schrijvers wrote:
>>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>>>>> Hi,
>>>>>
>>>>> Currently I see some big issues with queries that return large result sets. A lot of the work is not done inside Lucene, which will probably not be fixed soon (or maybe never inside 2.x). However, I think it's important to make some intermediate improvements.
>>>>>
>>>>> Here are some suggestions. I hope we can brainstorm together on ideas that are feasible to implement in a shorter time frame than waiting for Oak:
>>>>>
>>>>> 1) there should be a way to get a count
>>>>>
>>>>> That way, if I need to run a query that has to be ordered, I can first check whether the count is too high and decide if I should even bother running the search. In other words, in most cases a search leading to 100+ results means that whoever ran the search needs to narrow it down further.
>>>>
>>>> The CPU is not spent ordering the results: that is done quite fast
>>>> in Lucene, unless you have millions of hits
>>>
>>> I read the code and also read
>>> https://issues.apache.org/jira/browse/JCR-2959, and it looks to me like
>>> Jackrabbit always sorts the result set by itself and not in Lucene (or
>>> maybe additionally). This makes it slow even if you have a limit set,
>>> because it first sorts all nodes (fetching them from the PM if
>>> necessary) and only then applies the limit. Maybe I have missed
>>> something, but real-life tests showed exactly this behaviour.
>>
>> Ah, I don't know about that part: we have always stuck to XPath queries.
>> Sorting is done in Lucene (more precisely, in some Lucene extensions in
>> jr, which are equally fast), at least for XPath, I am quite sure
>
> Is the search part done differently in SQL2 and XPath? Can't remember ;)

I think in some areas, but it should pretty much result in the same
Lucene queries. I've never looked into SQL2. But because you pointed
at [1], which explicitly mentions SQL2, and since we don't have this
problem, I thought it would be SQL2 only

[1] https://issues.apache.org/jira/browse/JCR-2959




-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Fwd: Re: dealing with large result sets

Posted by Christian Stocker <ch...@liip.ch>.
Hi

Posting this to the dev list for further discussion.

The problem mentioned below has hit us hard several times lately. If
there are large result sets (with SQL2), Jackrabbit gets very slow.

I know it's on your radar and some work has been done already, e.g. in
https://issues.apache.org/jira/browse/JCR-2959,
but there's still an issue with the use cases below.

I looked at the Jackrabbit code and there's obviously no easy fix
(otherwise there would already be a solution), but since it was fast
with the old (XPath/SQL1) queries (at least as long as there are no
JOINs involved), I guess it should be technically possible somehow.

I also tried on our side to convert some trivial SQL2 queries into
XPath (via our PHP-based QOM), but that got messy pretty fast ;) (by
hand it's of course easy, but I'd need something automated for our app)

All I basically need to know is: are there any plans to fix the issue
of slow (but "easy") queries, or do we have to work around it on the
client side somehow by converting our SQL2 queries to XPath for some
special cases?

And the last question: does the old SQL query "engine" use the same
engine as XPath, meaning it has the same performance characteristics?
If we have to solve the problem on our side for the time being, it may
be easier to copy and adjust our QOM2SQL class to output "old" SQL
rather than XPath (of course with fewer features than SQL2).

Greetings, and thanks already

chregu






-------- Original Message --------
Subject: Re: dealing with large result sets
Date: Wed, 11 Apr 2012 09:46:36 +0200
From: Christian Stocker <ch...@liip.ch>
Reply-To: users@jackrabbit.apache.org
Organization: Liip AG
To: users@jackrabbit.apache.org
CC: Ard Schrijvers <a....@onehippo.com>

OK, that made me wonder, so I did some short tests on my MacBook;
there are approx. 600'000 nodes which match those queries.

With xpath, without ordering


<d:searchrequest xmlns:d="DAV:"
xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
<dcr:xpath>

/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']

</dcr:xpath>
<d:limit>
     <d:nresults>10</d:nresults>
</d:limit>
</d:searchrequest>


1st run: 455 ms
2nd run:  42 ms

With xpath, with order by

<d:searchrequest xmlns:d="DAV:"
xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
<dcr:xpath>

/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
order by @firstImportDate

</dcr:xpath>
<d:limit>
     <d:nresults>10</d:nresults>
</d:limit>
</d:searchrequest>


1st run: 2555 ms
2nd run:   16 ms


Those numbers seem to be reasonable.

With SQL2 without ordering:

<D:searchrequest xmlns:D="DAV:">
	<JCR-SQL2>
		<![CDATA[
		SELECT data.* FROM [nt:base] AS data WHERE data.[phpcr:class] =
'Own\ApiBundle\Document\Article'  AND  ISDESCENDANTNODE(data, '/article')
		]]>
	</JCR-SQL2>
	<D:limit>
		<D:nresults>10</D:nresults>
	</D:limit>
</D:searchrequest>

1st run: 2'006'634 ms (about 33 minutes)

From the log

 SQL2 SELECT took 2004498 ms. selector: [nt:base] AS data, columns:
[data.jcr:primaryType], constraint: (data.[phpcr:class] =
'Own\ApiBundle\Document\Article') AND (ISDESCENDANTNODE(data,
[/article])), offset 0, limit 10
 SQL2 SORT took 1479 ms.
 SQL2 QUERY execute took 2006634 ms. native sort is false.


With those results, I didn't even try a 2nd time (caches are full
anyway) or with ordering.

Something seems to be quite wrong here. If you want more measurements,
just tell me

Greetings

chregu






Re: dealing with large result sets

Posted by Ard Schrijvers <a....@onehippo.com>.
On Wed, Apr 11, 2012 at 9:46 AM, Christian Stocker
<ch...@liip.ch> wrote:
> OK, that made me wonder, so I did some short tests on my MacBook;
> there are approx. 600'000 nodes which match those queries.
>
> With xpath, without ordering
>
>
> <d:searchrequest xmlns:d="DAV:"
> xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
> <dcr:xpath>
>
> /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
>
> </dcr:xpath>
> <d:limit>
>     <d:nresults>10</d:nresults>
> </d:limit>
> </d:searchrequest>
>
>
> 1st run: 455 ms
> 2nd run:  42 ms
>
> With xpath, with order by
>
> <d:searchrequest xmlns:d="DAV:"
> xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
> <dcr:xpath>
>
> /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
> order by @firstImportDate
>
> </dcr:xpath>
> <d:limit>
>     <d:nresults>10</d:nresults>
> </d:limit>
> </d:searchrequest>
>
>
> 1st run: 2555 ms
> 2nd run:   16 ms

This makes sense: only on the first run does Lucene need to load the
unique terms of the 600'000 nodes you want to sort on, which means many
FS lookups. After that, they are cached.
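
As an illustration of that warm-up effect, the sort terms could even be
pre-loaded at application startup with a throwaway ordered query, as in
the sketch below (reusing the query from above; this is not something
jr does for you):

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class SortWarmup {

    // Run a cheap ordered query once at startup so Lucene loads the
    // terms of the sort field into memory before real traffic arrives.
    public static void warmSortField(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query q = qm.createQuery(
                "/jcr:root/article//* order by @firstImportDate",
                Query.XPATH); // deprecated in JCR 2.0, still supported by jr
        q.setLimit(1);           // we only care about the side effect
        q.execute().getNodes();  // the result itself is thrown away
    }
}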

About SQL2, I have no experience with it, nor have I ever looked at it,
so I can't help you out there.

We stick to XPath on jr 2.x

Regards Ard




-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: dealing with large result sets

Posted by Christian Stocker <ch...@liip.ch>.
OK, that made me wonder, so I did some short tests on my MacBook;
there are approx. 600'000 nodes which match those queries.

With xpath, without ordering


<d:searchrequest xmlns:d="DAV:"
xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
<dcr:xpath>

/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']

</dcr:xpath>
<d:limit>
     <d:nresults>10</d:nresults>
</d:limit>
</d:searchrequest>


1st run: 455 ms
2nd run:  42 ms

With xpath, with order by

<d:searchrequest xmlns:d="DAV:"
xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
<dcr:xpath>

/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
order by @firstImportDate

</dcr:xpath>
<d:limit>
     <d:nresults>10</d:nresults>
</d:limit>
</d:searchrequest>


1st run: 2555 ms
2nd run:   16 ms


Those numbers seem to be reasonable.

With SQL2 without ordering:

<D:searchrequest xmlns:D="DAV:">
	<JCR-SQL2>
		<![CDATA[
		SELECT data.* FROM [nt:base] AS data WHERE data.[phpcr:class] =
'Own\ApiBundle\Document\Article'  AND  ISDESCENDANTNODE(data, '/article')
		]]>
	</JCR-SQL2>
	<D:limit>
		<D:nresults>10</D:nresults>
	</D:limit>
</D:searchrequest>

1st run: 2'006'634 ms (about 33 minutes)

From the log

 SQL2 SELECT took 2004498 ms. selector: [nt:base] AS data, columns:
[data.jcr:primaryType], constraint: (data.[phpcr:class] =
'Own\ApiBundle\Document\Article') AND (ISDESCENDANTNODE(data,
[/article])), offset 0, limit 10
 SQL2 SORT took 1479 ms.
 SQL2 QUERY execute took 2006634 ms. native sort is false.


With those results, I didn't even try a 2nd time (caches are full
anyway) or with ordering.

Something seems to be quite wrong here. If you want more measurements,
just tell me

Greetings

chregu

Re: dealing with large result sets

Posted by Christian Stocker <ch...@liip.ch>.

On 10.04.12 11:51, Ard Schrijvers wrote:
> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
> <ch...@liip.ch> wrote:
>>
>>
>> On 10.04.12 11:32, Ard Schrijvers wrote:
>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>>>> Hi,
>>>>
>>>> Currently I see some big issues with queries that return large result sets. A lot of the work is not done inside Lucene, which will probably not be fixed soon (or maybe never inside 2.x). However, I think it's important to make some intermediate improvements.
>>>>
>>>> Here are some suggestions. I hope we can brainstorm together on ideas that are feasible to implement in a shorter time frame than waiting for Oak:
>>>>
>>>> 1) there should be a way to get a count
>>>>
>>>> That way, if I need to run a query that has to be ordered, I can first check whether the count is too high and decide if I should even bother running the search. In other words, in most cases a search leading to 100+ results means that whoever ran the search needs to narrow it down further.
>>>
>>> The CPU is not spent ordering the results: that is done quite fast
>>> in Lucene, unless you have millions of hits
>>
>> I read the code and also read
>> https://issues.apache.org/jira/browse/JCR-2959, and it looks to me like
>> Jackrabbit always sorts the result set by itself and not in Lucene (or
>> maybe additionally). This makes it slow even if you have a limit set,
>> because it first sorts all nodes (fetching them from the PM if
>> necessary) and only then applies the limit. Maybe I have missed
>> something, but real-life tests showed exactly this behaviour.
> 
> Ah, I don't know about that part: we have always stuck to XPath queries.
> Sorting is done in Lucene (more precisely, in some Lucene extensions in
> jr, which are equally fast), at least for XPath, I am quite sure

Is the search part done differently in SQL2 and XPath? Can't remember ;)

>>> The problem with getting a correct count is authorization: the total
>>> search index count itself is fast (if you try to avoid some known slow
>>> searches). However, authorizing for example 100k+ nodes when they are
>>> not in the Jackrabbit caches is very expensive.
>>>
>>> Either way: you get a correct count if you make sure you include at
>>> least an order by clause in your (XPath) search. Then, to avoid 100k+
>>> hits, make sure you also set a limit. For example, with a limit of 501
>>> you can show 50 pages of 10 hits, and if the count is 501 you state
>>> that there are at least 500+ hits
>>
>> That's what we do now, but it doesn't help (as said above) if we have
>> thousands of results which have to be ordered first.
> 
> And the second sort is also slow? The first sort is also slow with
> Lucene, as Lucene needs to load all the terms to sort on from the FS
> into memory. However, consecutive searches are fast. We have no
> problems sorting result sets of a million hits

It definitely loaded all nodes from the PM before sorting them. The
Lucene part itself was fast enough; that wasn't the issue.

> 
>>
>>>
>>> We also wanted to get around this, so we hooked a 'getTotalSize()'
>>> into our API, which returns the unauthorized Lucene count
>>
>> That would help us a lot, since we currently don't use the ACLs of
>> Jackrabbit, so the Lucene count would be pretty correct for our use case.
> 
> Yes, however, you would have to hook into jr itself to get this done

Yep, saw that, that's somewhere deep in the code. That's why I haven't
tried to address that yet

chregu


-- 
Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE


Re: dealing with large result sets

Posted by Ard Schrijvers <a....@onehippo.com>.
On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
<ch...@liip.ch> wrote:
>
>
> On 10.04.12 11:32, Ard Schrijvers wrote:
>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>>> Hi,
>>>
>>> Currently I see some big issues with queries that return large result sets. A lot of the work is not done inside Lucene, which will probably not be fixed soon (or maybe never inside 2.x). However, I think it's important to make some intermediate improvements.
>>>
>>> Here are some suggestions. I hope we can brainstorm together on ideas that are feasible to implement in a shorter time frame than waiting for Oak:
>>>
>>> 1) there should be a way to get a count
>>>
>>> That way, if I need to run a query that has to be ordered, I can first check whether the count is too high and decide if I should even bother running the search. In other words, in most cases a search leading to 100+ results means that whoever ran the search needs to narrow it down further.
>>
>> The CPU is not spent ordering the results: that is done quite fast
>> in Lucene, unless you have millions of hits
>
> I read the code and also read
> https://issues.apache.org/jira/browse/JCR-2959, and it looks to me like
> Jackrabbit always sorts the result set by itself and not in Lucene (or
> maybe additionally). This makes it slow even if you have a limit set,
> because it first sorts all nodes (fetching them from the PM if
> necessary) and only then applies the limit. Maybe I have missed
> something, but real-life tests showed exactly this behaviour.

Ah, I don't know about that part: we have always stuck to XPath queries.
Sorting is done in Lucene (more precisely, in some Lucene extensions in
jr, which are equally fast), at least for XPath, I am quite sure

>
>>
>> The problem with getting a correct count is authorization: the total
>> search index count itself is fast (if you try to avoid some known slow
>> searches). However, authorizing for example 100k+ nodes when they are
>> not in the Jackrabbit caches is very expensive.
>>
>> Either way: you get a correct count if you make sure you include at
>> least an order by clause in your (XPath) search. Then, to avoid 100k+
>> hits, make sure you also set a limit. For example, with a limit of 501
>> you can show 50 pages of 10 hits, and if the count is 501 you state
>> that there are at least 500+ hits
>
> That's what we do now, but it doesn't help (as said above) if we have
> thousands of results which have to be ordered first.

And the second sort is also slow? The first sort is also slow with
Lucene, as Lucene needs to load all the terms to sort on from the FS
into memory. However, consecutive searches are fast. We have no
problems sorting result sets of a million hits

>
>>
>> We also wanted to get around this, so we hooked a 'getTotalSize()'
>> into our API, which returns the unauthorized Lucene count
>
> That would help us a lot, since we currently don't use the ACLs of
> Jackrabbit, so the Lucene count would be pretty correct for our use case.

Yes, however, you would have to hook into jr itself to get this done

Regards Ard




-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: dealing with large result sets

Posted by Christian Stocker <ch...@liip.ch>.

On 10.04.12 11:32, Ard Schrijvers wrote:
> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>> Hi,
>>
>> Currently I see some big issues with queries that return large result sets. A lot of the work is not done inside Lucene, which will probably not be fixed soon (or maybe never inside 2.x). However, I think it's important to make some intermediate improvements.
>>
>> Here are some suggestions. I hope we can brainstorm together on ideas that are feasible to implement in a shorter time frame than waiting for Oak:
>>
>> 1) there should be a way to get a count
>>
>> That way, if I need to run a query that has to be ordered, I can first check whether the count is too high and decide if I should even bother running the search. In other words, in most cases a search leading to 100+ results means that whoever ran the search needs to narrow it down further.
> 
> The CPU is not spent ordering the results: that is done quite fast
> in Lucene, unless you have millions of hits

I read the code and also read
https://issues.apache.org/jira/browse/JCR-2959, and it looks to me like
Jackrabbit always sorts the result set by itself and not in Lucene (or
maybe additionally). This makes it slow even if you have a limit set,
because it first sorts all nodes (fetching them from the PM if
necessary) and only then applies the limit. Maybe I have missed
something, but real-life tests showed exactly this behaviour.

> 
> The problem with getting a correct count is authorization: the total
> search index count itself is fast (if you try to avoid some known slow
> searches). However, authorizing for example 100k+ nodes when they are
> not in the Jackrabbit caches is very expensive.
>
> Either way: you get a correct count if you make sure you include at
> least an order by clause in your (XPath) search. Then, to avoid 100k+
> hits, make sure you also set a limit. For example, with a limit of 501
> you can show 50 pages of 10 hits, and if the count is 501 you state
> that there are at least 500+ hits

That's what we do now, but it doesn't help (as said above) if we have
thousands of results which have to be ordered first.

> 
> We also wanted to get around this, so we hooked a 'getTotalSize()'
> into our API, which returns the unauthorized Lucene count

That would help us a lot, since we currently don't use the ACLs of
Jackrabbit, so the Lucene count would be pretty correct for our use case.

chregu


-- 
Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE


Re: dealing with large result sets

Posted by Ard Schrijvers <a....@onehippo.com>.
On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
> Hi,
>
> Currently I see some big issues with queries that return large result sets. A lot of the work is not done inside Lucene, which will probably not be fixed soon (or maybe never inside 2.x). However, I think it's important to make some intermediate improvements.
>
> Here are some suggestions. I hope we can brainstorm together on ideas that are feasible to implement in a shorter time frame than waiting for Oak:
>
> 1) there should be a way to get a count
>
> That way, if I need to run a query that has to be ordered, I can first check whether the count is too high and decide if I should even bother running the search. In other words, in most cases a search leading to 100+ results means that whoever ran the search needs to narrow it down further.

The CPU is not spent ordering the results: that is done quite fast
in Lucene, unless you have millions of hits

The problem with getting a correct count is authorization: the total
search index count itself is fast (if you try to avoid some known slow
searches). However, authorizing for example 100k+ nodes when they are
not in the Jackrabbit caches is very expensive.

Either way: you get a correct count if you make sure you include at
least an order by clause in your (XPath) search. Then, to avoid 100k+
hits, make sure you also set a limit. For example, with a limit of 501
you can show 50 pages of 10 hits, and if the count is 501 you state
that there are at least 500+ hits
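
In JCR 2.0 terms that advice could look roughly like the sketch below
(setLimit() is standard on javax.jcr.query.Query; the query reuses the
@firstImportDate example from the benchmark in this thread):

import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class PagingSketch {

    // Cap the result at 501 hits: enough for 50 pages of 10 hits,
    // where a size of 501 simply means "more than 500 hits".
    public static void showHitCount(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query q = qm.createQuery(
                "/jcr:root/article//* order by @firstImportDate",
                Query.XPATH);
        q.setLimit(501);
        NodeIterator hits = q.execute().getNodes();
        long size = hits.getSize(); // exact, thanks to order by + limit
        System.out.println(size > 500 ? "500+ hits" : size + " hits");
    }
}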

We also wanted to get around this, so we hooked a 'getTotalSize()'
into our API, which returns the unauthorized Lucene count
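
For the curious: the unauthorized count is exposed by Jackrabbit's
internal Lucene result class, so such a hook can be as small as the
downcast sketched below. This relies on internal, non-public API
(org.apache.jackrabbit.core.query.lucene.QueryResultImpl), so it may
break between jr versions:

import javax.jcr.query.QueryResult;

import org.apache.jackrabbit.core.query.lucene.QueryResultImpl;

public class TotalSizeSketch {

    // Returns the raw Lucene hit count before access control filtering,
    // or -1 if the result is not the expected Lucene implementation.
    public static int getTotalSize(QueryResult result) {
        if (result instanceof QueryResultImpl) {
            return ((QueryResultImpl) result).getTotalSize();
        }
        return -1;
    }
}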

>
> I guess the most sensible thing would be to simply offer a way to do SELECT COUNT(*) FROM ..
>
> 2) a way to automatically stop long running queries

It is not just about 'long'. Some queries easily blow up and bring
your app to an OOM before they can be stopped. For example jcr:like is
such a thing, or range queries on many unique values.
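
To illustrate with a hypothetical example (the @title property is made
up), an XPath query with a leading wildcard such as

/jcr:root/article//*[jcr:like(@title, '%foo%')]

has to be expanded against every unique term of @title in the index, so
memory use grows with the number of unique values rather than with the
number of hits.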

Regards Ard




-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com