Posted to users@sling.apache.org by Roy Teeuwen <ro...@teeuwen.be> on 2016/06/16 13:28:51 UTC

Querying vs iterating

Hello all,

Let's say I have a resource with around 10-20 child/grandchild resources, no more than 3 levels deep. Which is more performant when searching for the child resources that contain a specific property (the property is configurable with OSGi, so it is hard to put an index on it): iterating over the child/grandchild resources until you find it, or making an XPath/JCR-SQL2 query? At what point would one option start to outperform the other?

Thanks!
Roy

Re: Querying vs iterating

Posted by Bertrand Delacretaz <bd...@apache.org>.

On Mon, Jun 20, 2016 at 4:47 PM, Jason E Bailey <ja...@24601.org> wrote:
> ...Maybe I should file a bug report. "Hey why is it that I can iterate and
> get results faster than doing an indexed query?"...

That's not a bug, that's a feature ;-)

JCR naturally enables very efficient navigation between nodes that are
close to each other (in terms of tree hops) so if you're working in a
fairly compact subtree JCR navigation is often much more efficient
than querying.

The trick is to design your tree structures with this in mind - things
that usually go together should stay close to each other in the JCR
content tree, whenever possible.

As David's model [1] says, "drive the content hierarchy, don't let it happen".

-Bertrand

[1] https://wiki.apache.org/jackrabbit/DavidsModel

Re: Querying vs iterating

Posted by Jason E Bailey <ja...@24601.org>.

I have seen significant gains, both in obtaining a list of results and
in the speed of my services, by iterating instead of querying. In one
case, a query looking for an indexed node type took 10 minutes, and
switching to iteration brought that down to a minute and a half.

I should point out that that makes no sense.

When it was first suggested to me that I iterate rather than use a
query, I looked at the person in question as if they had never studied
computers. It has been beaten into my head for close to 2 decades that
if you want performance from a data store, you use a query and you
create indexes. Iterating struck me as something that only someone who
didn't know what they were doing would suggest, or as a sign that they
had done something wrong in their setup, i.e. failed to set up an
index.

I was wrong.

Maybe I should file a bug report. "Hey why is it that I can iterate and
get results faster than doing an indexed query?" I'm also sure that
indexing and running a query is the right way to address some needs.
However, right now, every time I've done a comparison between executing
a query and just going to the source and checking myself, the iteration
style has been significantly faster.

--
Jason

Re: Querying vs iterating

Posted by Julian Sedding <js...@gmail.com>.

Hi Roy

Yes, I would expect that you cannot measure any meaningful difference.
Using a query may be marginally faster, because it can traverse using
internal Oak APIs. On the other hand it may be slightly slower,
because of possible QueryEngine overhead.

Personally I would test whether it works sufficiently well with a
query, because it is less code.

Note also that Sling Query
(https://sling.apache.org/documentation/bundles/sling-query.html)
allows you to express a query and choose traversal vs query as a
strategy. This may or may not help.

Regards
Julian


Re: Querying vs iterating

Posted by Roy Teeuwen <ro...@teeuwen.be>.

Hey Julian,

OK, cool. For me the context is querying a page in AEM: I am creating a query for one cq:Page node, so most of the time that will be at most 10-20 nodes.
So what you are saying is that, performance-wise, it shouldn't really matter whether I traverse manually or run a query to check whether a specific property name exists on the page,
because behind the scenes the query will most likely do a traversal anyway, right?

Thanks!
Roy

Re: Querying vs iterating

Posted by Julian Sedding <js...@gmail.com>.

Hi Roy

From your question ("hard to put an index to it") I assume that you are
running on an Oak repository. If that is incorrect, my answer does not
apply.

Oak will always consider traversal as an alternative to existing
indexes. For most queries the cost of traversal is so high that an
index is chosen. However, if no suitable index exists (and
theoretically also if the traversal is cheaper than a lookup in a
matching index), it will do a traversal behind the scenes. Note that
traversal logs a warning every 10000 traversed nodes. So if you plan
to traverse more than that you should really consider creating an
index.

In short: with Oak using a query on a small subtree should give you
what you want, even without an index.

Regards
Julian
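
For illustration, a minimal sketch of what such a query on a small
subtree might look like in JCR-SQL2. The class name, the path and the
property name are hypothetical and not taken from this thread. The
sketch only builds the statement; it assumes you would then hand it to
the repository yourself, for example via
QueryManager.createQuery(statement, Query.JCR_SQL2).

```java
public class SubtreePropertyQuery {

    /** Escapes a value for use inside a single-quoted JCR-SQL2 string literal. */
    static String escapeLiteral(String value) {
        return value.replace("'", "''");
    }

    /**
     * Builds a JCR-SQL2 statement that selects every node below rootPath
     * on which the given property exists.
     */
    public static String build(String rootPath, String propertyName) {
        // A closing bracket would break the bracketed property name.
        if (propertyName.contains("]")) {
            throw new IllegalArgumentException("unsupported property name: " + propertyName);
        }
        return "SELECT * FROM [nt:base] AS n"
                + " WHERE ISDESCENDANTNODE(n, '" + escapeLiteral(rootPath) + "')"
                + " AND n.[" + propertyName + "] IS NOT NULL";
    }

    public static void main(String[] args) {
        // Hypothetical page path and property name, for illustration only.
        System.out.println(build("/content/site/page", "myProp"));
    }
}
```

Because the property name is just a parameter here, the statement can
follow whatever name is configured, without a dedicated index.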


Re: Querying vs iterating

Posted by Steven Walters <ke...@gmail.com>.

Hopefully other people will chime in here. I've only had bad
experiences with queries, and as a result I practically never use
them - I always end up iterating/navigating myself.

Theoretically, if you have a REALLY GOOD index then you may get
similar performance, but if your index(es) are inefficient, then it's
just wasted CPU cycles (you'd wish those CPU cycles were going to a
good cause, but they're not).

The transition of Sling (and AEM) from Jackrabbit 2.x to Oak made this
experience worse, given the awkward indexing policies/process in Oak
and the fact that Oak never seemed to use multiple indexes.
Oak always seemed to calculate the cost of the entire query against
all the available indexes and then choose only the ONE best index.
This sounds like a good idea in theory, but most DBMSs I've used
in the past utilize ALL the indexes they can - not just one.

So basically I guess this comes down to: "If you have a good index (in
that it can apply to ALL the conditions/attributes/properties of your
query) then using a query should be fine, otherwise iterate yourself."
Having any condition missing from the index can be fatal for
performance. One example is an index lacking
evaluatePathRestrictions = true; without it, a query is basically the
death of the system if you have a lot of content.

But really, I hope some other people with more positive experiences
can provide some better advice.

Re: Querying vs iterating

Posted by Roy Teeuwen <ro...@teeuwen.be>.

OK. It would be handy to have a rough estimate of the number / levels of resources at which iterating or querying becomes the better choice :).

Greets
Roy

Re: Querying vs iterating

Posted by Steven Walters <ke...@gmail.com>.

If you know there are that few resources, then I'd say iterating will
perform better than XPath / JCR-SQL2 queries.
This is primarily past experience speaking: queries have generally
turned out (often MUCH) slower than directly iterating when you know
what you're actually looking for.
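
As a rough sketch of the iterating approach for a subtree this small,
something like the following depth-limited walk could be used. The
Node class here is a hypothetical stand-in so the example runs on its
own; with the Sling API the same shape would presumably use
Resource.getChildren() and getValueMap().containsKey(name) instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PropertyFinder {

    /** Hypothetical stand-in for a resource: a path, its properties and its children. */
    public static final class Node {
        final String path;
        final Map<String, Object> properties;
        final List<Node> children;

        public Node(String path, Map<String, Object> properties, List<Node> children) {
            this.path = path;
            this.properties = properties;
            this.children = children;
        }
    }

    /** Returns the paths of all descendants, at most maxDepth levels below root, that have the property. */
    public static List<String> findWithProperty(Node root, String propertyName, int maxDepth) {
        List<String> hits = new ArrayList<>();
        collect(root, propertyName, maxDepth, hits);
        return hits;
    }

    private static void collect(Node parent, String propertyName, int remainingDepth, List<String> hits) {
        if (remainingDepth <= 0) {
            return; // depth limit reached, do not descend further
        }
        for (Node child : parent.children) {
            if (child.properties.containsKey(propertyName)) {
                hits.add(child.path);
            }
            collect(child, propertyName, remainingDepth - 1, hits);
        }
    }

    public static void main(String[] args) {
        // Tiny hypothetical page tree: /page -> a -> x, and /page -> b.
        Node x = new Node("/page/a/x", Map.of("myProp", "v"), List.of());
        Node a = new Node("/page/a", Map.of(), List.of(x));
        Node b = new Node("/page/b", Map.of("myProp", 1), List.of());
        Node root = new Node("/page", Map.of(), List.of(a, b));
        System.out.println(findWithProperty(root, "myProp", 3)); // prints [/page/a/x, /page/b]
    }
}
```

The depth limit keeps the walk bounded to the 3 levels mentioned in
the question, so the cost stays proportional to the handful of nodes
under the starting resource.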
