Posted to oak-dev@jackrabbit.apache.org by Davide Giannella <gi...@gmail.com> on 2014/02/25 15:33:42 UTC

OAK-1263 query benchmarking review

Good afternoon everyone,

(long message warning)

I managed to create benchmarks for the query side of the ordered index
I'm (trying to) implement. The benchmarks compare the performance of
queries executed in three cases: no index, a standard property index,
and the ordered property index.

A proper cost computation would require changing the QueryEngine API to
pass the ORDER BY clause down to getCost(). For now I found a
workaround: executing an open query with and without the ORDER BY
clause[0][1]. This way, when there is no ORDER BY and only the ordered
index is present, we know the result set is ordered, but the
QueryEngine won't sort the data as it normally would in the final
stage.

(0) http://goo.gl/XhRnJ3
(1) http://goo.gl/0FS5fm
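The two query variants can be sketched as follows. This is a minimal
illustration, not the actual benchmark code; the property name [foo] is
a stand-in for whatever property the benchmark actually indexes:

```java
public class OrderByVariants {

    // Builds the two benchmark statements. [foo] is an illustrative
    // property name, not taken from the real benchmark classes.
    static String query(boolean withOrderBy) {
        String stmt = "SELECT * FROM [nt:base] WHERE [foo] IS NOT NULL";
        // With the clause, the QueryEngine sorts the whole result set
        // in its final stage; without it, only the ordered index
        // guarantees the order of the returned rows.
        return withOrderBy ? stmt + " ORDER BY [foo]" : stmt;
    }

    public static void main(String[] args) {
        System.out.println(query(false));
        System.out.println(query(true));
    }
}
```

Running both variants against the same content lets the benchmark
compare the engine's in-memory sort against the index-provided order.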

The benchmark is currently not iterating through the returned result
set, as it measures the time it takes Oak to return the result set to
the client application for whatever use it has in mind. It's obvious,
IMO, that the bigger the result, the longer it takes to process.

Now I'm getting widely divergent numbers that I wasn't expecting[2].
For example, when querying 500 nodes, the ordered index is 30 times
faster than the other two cases.

(2) https://drive.google.com/file/d/0B4fycUuvk2qqQnEyX1hUcTR4Z2s/

I could be doing something wrong, so I'm asking whether anyone could
take a look at the benchmarks themselves and tell me if I'm making a
mistake or if the difference really is that big. These are the classes
involved in the benchmarking:

OrderedIndexQueryNoIndexTest: http://goo.gl/WB2gxA
OrderedIndexQueryStandardIndexTest: http://goo.gl/Gf4U1t
OrderedIndexInsertOrderedPropertyTest: http://goo.gl/SDZGJ8

Thank you
Regards
Davide

Re: OAK-1263 query benchmarking review

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Feb 26, 2014 at 9:11 AM, Davide Giannella
<gi...@gmail.com> wrote:
>> Furthermore I'd be interested in "big numbers". That is, how does the
>> query with an ordered index perform with 10000, 100000, 1000000 nodes?
> I don't think my laptop can handle those numbers, but we'll try to
> figure out something ;)

1m nodes shouldn't be much of a problem:

    $ java -jar oak-run/target/oak-run-0.18-SNAPSHOT.jar \
          benchmark CreateNodesBenchmark Oak-Tar
    Apache Jackrabbit Oak 0.18-SNAPSHOT
    Oak-Tar: Create nodes benchmark
    [...]
    Created 1000000 nodes in 77 seconds (0.08ms/node)...

:-)

BR,

Jukka Zitting

Re: OAK-1263 query benchmarking review

Posted by Davide Giannella <gi...@gmail.com>.
On 26/02/2014 14:47, Michael Dürig wrote:
> This looks very promising indeed! I couldn't see anything wrong with
> the benchmarks. Still, it might be interesting to find out why the
> gap is so wide. Probably a good starting point is to look into the
> standard and no-index cases with a profiler to see where the CPU
> cycles are actually burnt.
My gut feeling tells me it's because of the ORDER BY clause. The
QueryEngine has to fetch the whole result set in order to sort it in
memory, which means expensive round trips down to the persistence
layer. With the ordered index the order is persisted, shifting the
sorting effort to the commit hook, so the QueryEngine doesn't have to
pre-process the result set before returning it to the client.
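A toy sketch of that difference, using plain Java collections rather
than Oak internals (all method and variable names here are
illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class SortCostSketch {

    // Without an ordered index: the engine must pull every row before
    // it can emit the first one, because the last row fetched could
    // sort first. Full fetch plus an O(n log n) in-memory sort.
    static List<String> sortedInMemory(Iterator<String> rows) {
        List<String> all = new ArrayList<>();
        rows.forEachRemaining(all::add); // full round trip per row
        Collections.sort(all);           // in-memory sort
        return all;
    }

    // With an ordered index: the store already keeps the keys sorted
    // (the commit hook paid the cost), so the first N rows can be
    // streamed without touching the rest of the result.
    static List<String> firstNPreOrdered(Iterator<String> orderedRows, int n) {
        List<String> out = new ArrayList<>();
        while (orderedRows.hasNext() && out.size() < n) {
            out.add(orderedRows.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortedInMemory(List.of("c", "a", "b").iterator()));
        System.out.println(firstNPreOrdered(List.of("a", "b", "c").iterator(), 2));
    }
}
```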
> Furthermore I'd be interested in "big numbers". That is, how does the
> query with an ordered index perform with 10000, 100000, 1000000 nodes?
I don't think my laptop can handle those numbers, but we'll try to
figure out something ;)

> I think we should quickly get this into the Oak code base and follow
> up from there with further improvements. This would also allow us to
> run those benchmarks on a CI system to have more continuous feedback.
> Could you prepare a patch and attach that to the issue? Thomas might
> be a good person to then have another look into this since he knows
> much more about query and indexing than I do.
I still have to complete some use cases and check the full end-to-end
behaviour one last time. Then it could be committed to the codebase for
future improvements (the insert has to be improved, and I already know
how). Currently it can't be used fully, as it will require changes to
the QueryEngine API for managing the cost computation (still to be
implemented) with the ORDER BY clause, and a way to tell the
QueryEngine: "don't sort this set, it's already done for you".
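A purely hypothetical sketch of the kind of API change meant here; this
is not actual Oak API, and every name in it (the interface, the
methods, the [foo] property) is invented for illustration only:

```java
public class OrderAwareIndexSketch {

    // Hypothetical extension: the engine passes the ORDER BY columns
    // down so the index can price a pre-sorted traversal, and the
    // index can declare its result already sorted.
    interface OrderAwareIndex {
        double getCost(String filter, String[] orderBy);
        // If true, the QueryEngine skips its final in-memory sort.
        boolean isResultSorted(String[] orderBy);
    }

    // Toy implementation: an ordered index on "foo" is cheap when the
    // requested order matches, unusable otherwise.
    static class OrderedFooIndex implements OrderAwareIndex {
        public double getCost(String filter, String[] orderBy) {
            return isResultSorted(orderBy) ? 1.0 : Double.POSITIVE_INFINITY;
        }
        public boolean isResultSorted(String[] orderBy) {
            return orderBy.length == 1 && orderBy[0].equals("foo");
        }
    }

    public static void main(String[] args) {
        OrderAwareIndex idx = new OrderedFooIndex();
        System.out.println(idx.isResultSorted(new String[] {"foo"}));
        System.out.println(idx.getCost("", new String[] {"bar"}));
    }
}
```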

D.



Re: OAK-1263 query benchmarking review

Posted by Michael Dürig <md...@apache.org>.

On 25.2.14 10:37, Davide Giannella wrote:
> On 25/02/2014 16:24, Davide Giannella wrote:
>> On 25/02/2014 15:44, Jukka Zitting wrote:
>>> As a general rule it's good to always fetch at least some results from
>>> the query in a benchmark, as many parts of the query execution are
>>> done lazily. Depending on the application use case we're interested in
>>> benchmarking, we should fetch either all of the results or just first
>>> N results, where N is some predefined constant like 10 or 100.
>>>
>> Trivial. Will do as soon as I'm out of a meeting and post updated numbers :)
>>
> Running the whole benchmark suite[0] takes a while. Added an explicit
> iteration over the first 100 nodes[1] and the numbers didn't change[2].
>
> (0) http://goo.gl/oI1AfN
> (1) http://goo.gl/Ky0qH6
> (2) https://drive.google.com/file/d/0B4fycUuvk2qqZnNaeEV4OGxGV2s/

This looks very promising indeed! I couldn't see anything wrong with the 
benchmarks. Still, it might be interesting to find out why the gap is so 
wide. Probably a good starting point is to look into the standard and 
no-index cases with a profiler to see where the CPU cycles are actually 
burnt.

Furthermore I'd be interested in "big numbers". That is, how does the 
query with an ordered index perform with 10000, 100000, 1000000 nodes?

I think we should quickly get this into the Oak code base and follow up 
from there with further improvements. This would also allow us to run 
those benchmarks on a CI system to have more continuous feedback. Could 
you prepare a patch and attach that to the issue? Thomas might be a good 
person to then have another look into this since he knows much more 
about query and indexing than I do.

Michael

>
> Cheers
> Davide
>
>

Re: OAK-1263 query benchmarking review

Posted by Davide Giannella <gi...@gmail.com>.
On 25/02/2014 16:24, Davide Giannella wrote:
> On 25/02/2014 15:44, Jukka Zitting wrote:
>> As a general rule it's good to always fetch at least some results from
>> the query in a benchmark, as many parts of the query execution are
>> done lazily. Depending on the application use case we're interested in
>> benchmarking, we should fetch either all of the results or just first
>> N results, where N is some predefined constant like 10 or 100.
>>
> Trivial. Will do as soon as I'm out of a meeting and post updated numbers :)
>
Running the whole benchmark suite[0] takes a while. I added an explicit
iteration over the first 100 nodes[1] and the numbers didn't change[2].

(0) http://goo.gl/oI1AfN
(1) http://goo.gl/Ky0qH6
(2) https://drive.google.com/file/d/0B4fycUuvk2qqZnNaeEV4OGxGV2s/

Cheers
Davide



Re: OAK-1263 query benchmarking review

Posted by Davide Giannella <gi...@gmail.com>.
On 25/02/2014 15:44, Jukka Zitting wrote:
> As a general rule it's good to always fetch at least some results from
> the query in a benchmark, as many parts of the query execution are
> done lazily. Depending on the application use case we're interested in
> benchmarking, we should fetch either all of the results or just first
> N results, where N is some predefined constant like 10 or 100.
>
Trivial. Will do as soon as I'm out of a meeting and post updated numbers :)

D.







Re: OAK-1263 query benchmarking review

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

I didn't look at the details yet; just a quick comment first.

On Tue, Feb 25, 2014 at 9:33 AM, Davide Giannella
<gi...@gmail.com> wrote:
> The benchmark is currently not iterating through the returned result
> set, as it measures the time it takes Oak to return the result set to
> the client application for whatever use it has in mind. It's obvious,
> IMO, that the bigger the result, the longer it takes to process.

As a general rule it's good to always fetch at least some results from
the query in a benchmark, as many parts of the query execution are
done lazily. Depending on the application use case we're interested in
benchmarking, we should fetch either all of the results or just first
N results, where N is some predefined constant like 10 or 100.
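A minimal sketch of why fetching matters, using a plain Java lazy
stream as a stand-in for Oak's result set (all names here are
illustrative, not Oak API):

```java
import java.util.Iterator;
import java.util.stream.IntStream;

public class LazyResultSketch {

    // A lazily evaluated "result set": execute() returns immediately,
    // and the per-row cost is only paid when next() is called, much
    // like a lazily executed query.
    static Iterator<Integer> execute(int size) {
        return IntStream.range(0, size)
                .map(LazyResultSketch::expensiveLookup) // evaluated on demand
                .iterator();
    }

    static int expensiveLookup(int i) {
        return i * i; // stand-in for a per-row persistence read
    }

    public static void main(String[] args) {
        Iterator<Integer> rows = execute(1_000_000); // returns at once
        int n = 100, sum = 0;
        // Fetching the first N rows forces the lazy work to happen,
        // so a timed loop like this reflects real query cost.
        for (int i = 0; i < n && rows.hasNext(); i++) {
            sum += rows.next();
        }
        System.out.println(sum); // sum of the first 100 squares
    }
}
```

Timing only execute() on such a result set measures almost nothing;
timing the loop that pulls the first N rows measures the real work.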

BR,

Jukka Zitting