You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oak-dev@jackrabbit.apache.org by Tommaso Teofili <to...@gmail.com> on 2014/05/26 10:25:27 UTC

[DISCUSS] - QueryIndex selection

Hi all,

I'd like to start discussing how we may improve / simplify current way of
selecting a query engine to use for a certain query.

In the QueryIndex interface we have the plain old getCost method which
selects the index returning the lower cost for the given query but,
recently, also an AdvancedQueryIndex interface has been introduced which,
if I understood things correctly, uses the IndexPlan(s) returned by each
query index for the given query to select which one has to be used.
So I would like to discuss if it's possible to clean up things a bit in
order to have a unified query selection mechanism.

At the moment, in my opinion, one problem with the getCost() method is that
it inherently merges the following topics:
- index capability to handle a certain query (can the QueryIndex handle
that query?)
- index efficiency in handling a certain query (how fast will the
QueryIndex will be in handling that query?)

Also the efficiency is not evaluated on a "cost model", each QueryIndex
implementation can return an arbitrary different number; on one hand this
is ok as it allows to take very index specific constraint into account: on
the other hand if one has to write a new QueryIndex implementation he/she
will have to look into each other query index implementation to understand
(and design) if / when its index is picked up; and even with already
existing indexes it's not easy to say upfront which one will be selected
(e.g. for debugging purposes).

With the AdvancedQueryIndex, if I understood it correctly (I just had a
look at it on Friday), a QueryIndex is selected upon its IndexPlan, which
is supposed to address better both the cost (as it explicitly exposes the
cost per execution, cost per entry and estimated entry count metrics) and
the query index capability to handle a certain query (e.g. this is used for
ordered property index).
However, at the moment, only the OrderedPropertyIndex is using it so I
think it'd be good to decide if we want to go further with the
AdvancedQueryIndex also for the other QueryIndex implementations (and get
rid for example of the FullTextQueryIndex interface as it seems useless to
me) or not.

One final question on query index selection, should we always select the
fastest index ?
Especially for full text ones this should be in some way configurable.

What do others think?
Regards,
Tommaso

p.s.:
As discussed also offline last week with some other folks maybe one further
metric to be taken into consideration for the index selection is if the
index is synchronous or not

Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
>>
>>We could let the
>> user decide if using an asynchronous index is OK or not.
>
>Another option is if there is no synch index available but an asynch
>index is available then QueryEngine should use that instead of
>resorting to traversal.

Well, this is the current behavior. The query engine doesn't know which
index is async. If an async index reports a reasonable cost, the it is
already used. But this might not be what the user wants: Maybe having
up-to-date results is important for him. So, maybe instead of "async_ok",
we need a flag "async_not_ok".

Regards,
Thomas


Re: [DISCUSS] - QueryIndex selection

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Wed, Jun 4, 2014 at 1:06 PM, Thomas Mueller <mu...@adobe.com> wrote:
> We could let the
> user decide if using an asynchronous index is OK or not.

Another option is if there is no synch index available but an asynch
index is available then QueryEngine should use that instead of
resorting to traversal. This should enable us to support existing
codebase where people have not created index for all possible cases.

Chetan Mehrotra

Re: [DISCUSS] - QueryIndex selection

Posted by Tommaso Teofili <to...@gmail.com>.
ok, thanks Davide for the pointers.

Regards,
Tommaso


2014-06-18 13:36 GMT+02:00 Davide Giannella <gi...@gmail.com>:

> On 18/06/2014 10:26, Tommaso Teofili wrote:
> > it would be ok for me to either deprecate it or improve the semantics
> > of the cost calculation (e.g. explicitly introduce other metrics to be
> > taken into account in the cost calculation: local / remote index,
> With the IndexPlan.isDelayed() we instruct the query engine if the
> current index is asynchronous or not
>
>
> http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isDelayed%28%29
>
> > since the FullTextQueryIndex seems to be a marker interface, could we
> > "merge" it some way into the AdvancedQueryIndex? Just to make things
> > simpler.
> I think we already have this aspect in the IndexPlan.isFullText()
>
>
> http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isFulltextIndex%28%29
>
> Cheers
> Davide
>
>
>

Re: [DISCUSS] - QueryIndex selection

Posted by Davide Giannella <gi...@gmail.com>.
On 18/06/2014 10:26, Tommaso Teofili wrote:
> it would be ok for me to either deprecate it or improve the semantics
> of the cost calculation (e.g. explicitly introduce other metrics to be
> taken into account in the cost calculation: local / remote index, 
With the IndexPlan.isDelayed() we instruct the query engine if the
current index is asynchronous or not

http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isDelayed%28%29

> since the FullTextQueryIndex seems to be a marker interface, could we
> "merge" it some way into the AdvancedQueryIndex? Just to make things
> simpler.
I think we already have this aspect in the IndexPlan.isFullText()

http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isFulltextIndex%28%29

Cheers
Davide



Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jun 18, 2014 at 7:44 AM, Thomas Mueller <mu...@adobe.com> wrote:
>>My other concern on this point is that it's not granted, in my opinion,
>>that the index returning less entries would be the faster.
>
> Yes, it's not that much about less entries or more entries, it's about
> lower or higher cost. If there are more entries, but they are all in
> memory, then that's better than less entries, but you need a disk read for
> each entry. That's what the javadocs talk about.

There's no way for a query index to know whether a particular node is
cached in memory or not.

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Tommaso Teofili <to...@gmail.com>.
Hi,

2014-06-18 13:44 GMT+02:00 Thomas Mueller <mu...@adobe.com>:

> Hi,
>
> >>QueryIndex.getCost
>
> >my doubt is what
> >this heuristic function to estimate the "traversed entries" should look
> >like in general
>
> Relational databases typically know the number of entries in the index
> (total indexed entries), plus the selectivity of a column. See also
> http://www.akadia.com/services/ora_index_selectivity.html . That way it's
> quite easy to estimate the number of returned values, for a given filter.
> I think Lucene and Solr have a way to get those numbers, but I didn't look
> into that yet. I guess the problem might be, the cost method needs to be
> very fast, otherwise all the queries would be affected. Relational
> databases typically have those index statistics fully in memory.
>

ok thanks for clarifying, yes we could get such numbers from Lucene and
Solr.


>
> Some databases additionally have a histogram that describes the
> distribution of the values. See also
> http://www.dba-oracle.com/t_histograms.htm
>
> >, but that may be implementation specific
>
> The cost returned by an index should not be implementation specific,
> otherwise the cost between different index implementations can't be
> compared.
>

exactly, that's one of my main concerns when starting this discussion.
I wonder then if we should abstract the getCost method and just ask the
indexes to return the estimated entries and let the QueryEngine assign a
cost to each of them.


>
> >, and how that
> >estimate should be used, should we just return the number of estimated
> >entries for the cost? So that cost is 1 when we estimate to return 1
> >entry,
> >2 when we estimate to return 2 entries, 1000 when we estimate to return
> >1000 entries?
>
> Generally yes, but if you keep everything in memory or know things are
> cached, then you could return less (as documented).
>
> >My other concern on this point is that it's not granted, in my opinion,
> >that the index returning less entries would be the faster.
>
> Yes, it's not that much about less entries or more entries, it's about
> lower or higher cost. If there are more entries, but they are all in
> memory, then that's better than less entries, but you need a disk read for
> each entry. That's what the javadocs talk about.
>

yes, but I think the problem is that we don't have a defined "function"
that implements that, so how much should we smoothen the cost if the
entries are all in memory or viceversa increase the cost if there're
network roundtrips? That's what it seems to me to be implementation
specific and could lead to not comparable costs.


>
> >it would be ok for me to either deprecate it or improve the semantics of
> >the cost calculation (e.g. explicitly introduce other metrics to be taken
> >into account in the cost calculation: local / remote index,
>
> The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is
> used, the documented behavior should be used: "...and if there could in
> theory be one network roundtrip or disk read operation per node (this
> method may return a lower number if the data is known to be fully in
> memory)."
>
> >since the FullTextQueryIndex seems to be a marker interface, could we
> >"merge" it some way into the AdvancedQueryIndex? Just to make things
> >simpler.
>
> Yes. The IndexPlan of the AdvancedQueryIndex has a method
> "isFulltextIndex". Once we can switch to the AdvancedQueryIndex, the
> FullTextQueryIndex is no longer needed.
>

ok


>
> >In the end my very generic take on this discussion is that we should try
> >to
> >have a clearer API/documentation for the part that involves picking the
> >index via cost / index plans (but maybe that's only me), implementations
> >can be then adjusted accordingly if needed.
>
> Yes, sure.
>

thanks for your comments Thomas.

Regards,
Tommaso


>
> Regards,
> Thomas
>
>

Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>>QueryIndex.getCost

>my doubt is what
>this heuristic function to estimate the "traversed entries" should look
>like in general

Relational databases typically know the number of entries in the index
(total indexed entries), plus the selectivity of a column. See also
http://www.akadia.com/services/ora_index_selectivity.html . That way it's
quite easy to estimate the number of returned values, for a given filter.
I think Lucene and Solr have a way to get those numbers, but I didn't look
into that yet. I guess the problem might be, the cost method needs to be
very fast, otherwise all the queries would be affected. Relational
databases typically have those index statistics fully in memory.

Some databases additionally have a histogram that describes the
distribution of the values. See also
http://www.dba-oracle.com/t_histograms.htm

>, but that may be implementation specific

The cost returned by an index should not be implementation specific,
otherwise the cost between different index implementations can't be
compared.

>, and how that
>estimate should be used, should we just return the number of estimated
>entries for the cost? So that cost is 1 when we estimate to return 1
>entry,
>2 when we estimate to return 2 entries, 1000 when we estimate to return
>1000 entries?

Generally yes, but if you keep everything in memory or know things are
cached, then you could return less (as documented).

>My other concern on this point is that it's not granted, in my opinion,
>that the index returning less entries would be the faster.

Yes, it's not that much about less entries or more entries, it's about
lower or higher cost. If there are more entries, but they are all in
memory, then that's better than less entries, but you need a disk read for
each entry. That's what the javadocs talk about.

>it would be ok for me to either deprecate it or improve the semantics of
>the cost calculation (e.g. explicitly introduce other metrics to be taken
>into account in the cost calculation: local / remote index,

The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is
used, the documented behavior should be used: "...and if there could in
theory be one network roundtrip or disk read operation per node (this
method may return a lower number if the data is known to be fully in
memory)."

>since the FullTextQueryIndex seems to be a marker interface, could we
>"merge" it some way into the AdvancedQueryIndex? Just to make things
>simpler.

Yes. The IndexPlan of the AdvancedQueryIndex has a method
"isFulltextIndex". Once we can switch to the AdvancedQueryIndex, the
FullTextQueryIndex is no longer needed.

>In the end my very generic take on this discussion is that we should try
>to
>have a clearer API/documentation for the part that involves picking the
>index via cost / index plans (but maybe that's only me), implementations
>can be then adjusted accordingly if needed.

Yes, sure.

Regards,
Thomas


Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Jun 26, 2014 at 4:10 AM, Angela Schreiber <an...@adobe.com> wrote:
> however, please be aware that one key feature of oak (compared to
> jackrabbit which only allowed permission evaluation by path) is that
> it always needs to be clear if the target for the permission evaluation
> is a node or a property. similarly, the restrictions may require the
> item being retrieved in order to perform the required evaluation (e.g.
> restricting access by node type).

Agreed. Here with the covered index idea we however have a valid use
case for considering changes to the internal access control API. Such
a change would indeed make it harder to implement some access control
features (like type-dependent restrictions), so we'd need to carefully
weigh the benefits and drawbacks of such a change before implementing
it. For now I tend to agree with you that we shouldn't change anything
here, but I've been wrong before so I think it would be a good idea to
keep our options open here.

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Angela Schreiber <an...@adobe.com>.
hi jukka

this is not quite true. as i will explain below.

first i would strongly recommend not to rely on the current implementation.
if we have the requirement to evaluated permissions based on the path
we may extend the permissionprovider which IMO is the key API for these
cases; not the treepermissions.

however, please be aware that one key feature of oak (compared to
jackrabbit
which only allowed permission evaluation by path) is that it always
needs to be clear if the target for the permission evaluation is a
node or a property. similarly, the restrictions may require the item
being retrieved in order to perform the required evaluation (e.g.
restricting
access by node type). also the evaluation is obligated to respect the
different
types of items when it comes to read access.

in other words: extending the  the permissionprovider with something like
isGranted(@Nonnull String treePath, @Nullable String propertyName, long
permissions).
will just be an optimization for those cases where full access is granted
everywhere
which can equally be determined by calling TreePermission#canReadAll on
the root node.
so, i would recommend to first explore if the method would not just serve
the purpose.

kind regards
angela

On 25/06/14 16:48, "Jukka Zitting" <ju...@gmail.com> wrote:

>Hi,
>
>On Wed, Jun 25, 2014 at 10:16 AM, Thomas Mueller <mu...@adobe.com>
>wrote:
>> Yes, we would need to use a different access control API. The ability to
>> check whether a session has access to a path/node/property, without
>> actually loading the node from the storage backend. Maybe that API is
>> already there?
>
>The TreePermission interface is the key API here, and the way we've
>designed it requires loading the nodes being accessed (see the
>getChildPermission method). The current implementation of the API
>actually *doesn't* strictly require the loading of the underlying
>nodes, so we might be able to refactor the API to better accommodate
>such indexes.
>
>BR,
>
>Jukka Zitting


Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>Can't we do the ACL check lazily?

That's what we do right now.

Regards,
Thomas


Re: [DISCUSS] - QueryIndex selection

Posted by Michael Marth <mm...@adobe.com>.
Hi,

I looked a bit into how MongoDB selects indexes (query plans) and think we could take some inspiration.

So, the way MongoDB does it afaiu:
* query gets parsed into Abstract Syntax Tree (so that parameters can get stripped out)
* the first time this query is performed then the query is executed against *all* available indexes
* the fastest index is put into a cache, so that when the same query (abstracted, regardless of parameters) comes in, then only that fastest index will be used (will be looked up from cache)
* after a number of modifications that index-selection-cache is flushed. Process starts at beginning.

What I dislike about this process is that the first query puts a lot more into the system (due to the fact that all indexes perform the query). Moreover, the first execution of that query could be disturbed by noise, so the selection could be wrong.

What I like, though, (if we ignore the noise issue from above) is that the selected index is the one that has actually proven to be the fastest.

So, for Oak: maybe we could enhance the deterministic selection process we have right now. We could run queries in the background to determine if the cost factors that the indexers claim to have are actually correct (and if not, correct them in the query engine). Those background queries could be the ones “most often executed” by users on that repo that have multiple indexes capable of answering the query.

Consider such a scenario: you have the same nodes indexed in the local property index (on the same machine that also serves requests) and a remote SolrCloud cluster. If we only reason about index size etc then we can never account for the fact that the local machine’s index might be much slower than those external machines that are used exclusively for answering queries. We could though, if we actually run those queries a number of times on both indexes.

Cheers
Michael



Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Jun 26, 2014 at 2:55 AM, Davide Giannella <da...@apache.org> wrote:
> Can't we do the ACL check lazily? Instead of the query engine looping
> through the nodes and check, if there's no need of doing so already (IE
> sorting), why not returning the set and then filter out the ACLs while
> the user load the node using the NodeIterator?

As Thomas said, we already do that.

Since we want to estimate full cost of executing a query, including
the iteration over the query results, we need to take also those costs
into account when making a decision on which index to use.

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Davide Giannella <da...@apache.org>.
On 25/06/2014 16:48, Jukka Zitting wrote:
> The TreePermission interface is the key API here, and the way we've
> designed it requires loading the nodes being accessed (see the
> getChildPermission method). The current implementation of the API
> actually *doesn't* strictly require the loading of the underlying
> nodes, so we might be able to refactor the API to better accommodate
> such indexes.
>
Never looked into the details so I'm most probably wrong.

Can't we do the ACL check lazily? Instead of the query engine looping
through the nodes and check, if there's no need of doing so already (IE
sorting), why not returning the set and then filter out the ACLs while
the user load the node using the NodeIterator?

D.



Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jun 25, 2014 at 10:16 AM, Thomas Mueller <mu...@adobe.com> wrote:
> Yes, we would need to use a different access control API. The ability to
> check whether a session has access to a path/node/property, without
> actually loading the node from the storage backend. Maybe that API is
> already there?

The TreePermission interface is the key API here, and the way we've
designed it requires loading the nodes being accessed (see the
getChildPermission method). The current implementation of the API
actually *doesn't* strictly require the loading of the underlying
nodes, so we might be able to refactor the API to better accommodate
such indexes.

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>But getting
>to that point may be a bit tricky, especially because of access
>control.

Yes, we would need to use a different access control API. The ability to
check whether a session has access to a path/node/property, without
actually loading the node from the storage backend. Maybe that API is
already there?

Regards,
Thomas


Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Jun 23, 2014 at 4:23 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Sorry, sure, the condition is verified again. But this might be an
> in-memory operation. The index may return the property value for each
> entry as part of running the query (QueryIndex - Cursor - IndexRow). I
> think the index implementations don't do that currently, so in reality the
> node is always loaded with the current version of Oak, that's true.
> Javadoc of Cursor.next(): "The index should return a row with those
> properties that are stored in the index itself, so that the query engine
> doesn't have to load the whole row / node unnecessarily."

Ah, I see where you're going!

It would indeed be nice if the index could also provide the data to
back more than the virtual properties (jcr:score, etc.) it now does.
Then a query like "SELECT a, b, c FROM ..." with an index that covers
a, b and c could avoid the (1 + ...) term from OAK-1910. But getting
to that point may be a bit tricky, especially because of access
control.

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>>>It's more than access control. The query engine needs to double-check
>>>the constraints of the query for each matching path before passing
>>>that node to the client (see the constraint.evaluate() call in [1]). I
>>>don't see any easy way to avoid that step without major refactoring.
>>
>> If there is no other constraint, then no additional checks are needed.
>
>That's not correct. For example if I query for a value with more than
>100 characters, the PropertyIndex may return paths that actually won't
>match the query.

Sorry, sure, the condition is verified again. But this might be an
in-memory operation. The index may return the property value for each
entry as part of running the query (QueryIndex - Cursor - IndexRow). I
think the index implementations don't do that currently, so in reality the
node is always loaded with the current version of Oak, that's true.
Javadoc of Cursor.next(): "The index should return a row with those
properties that are stored in the index itself, so that the query engine
doesn't have to load the whole row / node unnecessarily."

Regards,
Thomas




Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Jun 23, 2014 at 1:58 PM, Thomas Mueller <mu...@adobe.com> wrote:
>>It's more than access control. The query engine needs to double-check
>>the constraints of the query for each matching path before passing
>>that node to the client (see the constraint.evaluate() call in [1]). I
>>don't see any easy way to avoid that step without major refactoring.
>
> If there is no other constraint, then no additional checks are needed.

That's not correct. For example if I query for a value with more than
100 characters, the PropertyIndex may return paths that actually won't
match the query.

This requirement to double-check the results is also described in the
QueryIndex.guery() contract: "An implementation should only filter the
result if it can do so easily and efficiently; the query engine will
verify the data again (in memory) and check for access rights."

>>Are there any potential indexes where the
>>AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now
>>used in [2]) should return a value that's different from 1?
>
> Yes, for example an index that keep all (relevant) entries in memory, the
> cost should be close to zero.

Makes sense if we adapt the calculation as suggested in OAK-1910.

> Yes. The AdvancedQueryIndex has separate methods so that the query engine
> can better calculate the cost (getCostPerExecution, getCostPerEntry,
> getEstimatedEntryCount). The query could contain a limit (let's say 100)
> which should also be taken into account, and possibly an "order by"
> restriction. Plus the query engine could take into account that typically,
> only the first 50 entries are read (optimize for "fast first 50 entries" -
> see also
> http://stackoverflow.com/questions/1308946/should-i-use-query-hint-fast-num
> ber-rows-fastfirstrow ).

Agreed.

>>The index-level entry cost estimates only become relevant when the
>>cost of returning a path is more than a fraction of the cost of
>>loading a node. I don't believe that's the case for any reasonable
>>index implementations.
>
> It depends on whether (and when) the node needs to be loaded.

The node always needs to be loaded, see above.

>>I'm just worried about potential confusion about what the
>>getCostPerEntry() method (as used in [2]) should return. The value is
>>currently only set in [3], but there the estimate seems to be based on
>>the relative performance of the *index lookup*, not the overall
>>performance of a query. I believe either [2] or [3] should be adjusted
>>to fix the cost calculation.
>
> Yes, you are right. Currently the formula assumes that the query engine
> doesn't load the node. That's not correct. I created OAK-1910 to track
> this.

Thanks!

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>It's more than access control. The query engine needs to double-check
>the constraints of the query for each matching path before passing
>that node to the client (see the constraint.evaluate() call in [1]). I
>don't see any easy way to avoid that step without major refactoring.

If there is no other constraint, then no additional checks are needed.

>Are there any potential indexes where the
>AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now
>used in [2]) should return a value that's different from 1?

Yes, for example an index that keep all (relevant) entries in memory, the
cost should be close to zero.

>Say I have 10k nodes distributed uniformly across two sizes and ten
>colors. Querying for "size=L and color=red" would return 5000 paths
>when using the size index, or 1000 paths when using the color index (a
>multi-property index would return only the 500 exact matches). If the
>size index is really fast and returns those 5000 paths in just 10ms
>whereas the color index takes 100ms to return the 1000 matching paths,
>it'll still be much slower to use the size index for the query if
>accessing a single node takes say 1ms on average.

Yes. The AdvancedQueryIndex has separate methods so that the query engine
can better calculate the cost (getCostPerExecution, getCostPerEntry,
getEstimatedEntryCount). The query could contain a limit (let's say 100)
which should also be taken into account, and possibly an "order by"
restriction. Plus the query engine could take into account that typically,
only the first 50 entries are read (optimize for "fast first 50 entries" -
see also 
http://stackoverflow.com/questions/1308946/should-i-use-query-hint-fast-num
ber-rows-fastfirstrow ).

>The index-level entry cost estimates only become relevant when the
>cost of returning a path is more than a fraction of the cost of
>loading a node. I don't believe that's the case for any reasonable
>index implementations.

It depends on whether (and when) the node needs to be loaded.

>I'm just worried about potential confusion about what the
>getCostPerEntry() method (as used in [2]) should return. The value is
>currently only set in [3], but there the estimate seems to be based on
>the relative performance of the *index lookup*, not the overall
>performance of a query. I believe either [2] or [3] should be adjusted
>to fix the cost calculation.

Yes, you are right. Currently the formula assumes that the query engine
doesn't load the node. That's not correct. I created OAK-1910 to track
this.

Regards,
Thomas


Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Jun 23, 2014 at 11:18 AM, Thomas Mueller <mu...@adobe.com> wrote:
>>Sure, but we don't use a covered index.
>
> Yes, we are not there yet. The node is currently loaded to check access
> rights, but that's an implementation detail of access control part. And
> it's not needed for the admin. If (when) this is changed, it depends on
> the query. If the query only returns the path (for example), and the
> indexed property, then the node does not need to be loaded.

It's more than access control. The query engine needs to double-check
the constraints of the query for each matching path before passing
that node to the client (see the constraint.evaluate() call in [1]). I
don't see any easy way to avoid that step without major refactoring.

>> At least in the current design
>>the query engine will always load all the matching nodes, regardless
>>of any extra information stored in the index. Thus we can't use the
>>performance of the index lookup as an accurate estimate of the overall
>>query performance.
>
> Yes, the query engine knows the cost overhead per entry, that's why the
> AdvancedQueryIndex interface is a bit different (does not just contain the
> cost).

Are there any potential indexes where the
AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now
used in [2]) should return a value that's different from 1?

>>The overhead of the index lookup is probably significant when only few
>>matching paths are returned (the UUID index would be the ultimate
>>example), but in those cases query performance is probably best
>>optimized in other ways (caching, configuration, profiling, etc.) than
>>making a more accurate estimate of the index lookup performance,
>>especially since in most such cases there is only a single index with
>>exact matches.
>
> We have queries that have multiple property constraints (for example "size
> = 'L' and color = 'red'"). If there are two indexes, one for size and one
> for color, it's important to know which one is faster. (If there is a
> multi-property index, that one might be faster.)

Say I have 10k nodes distributed uniformly across two sizes and ten
colors. Querying for "size=L and color=red" would return 5000 paths
when using the size index, or 1000 paths when using the color index (a
multi-property index would return only the 500 exact matches). If the
size index is really fast and returns those 5000 paths in just 10ms
whereas the color index takes 100ms to return the 1000 matching paths,
it'll still be much slower to use the size index for the query if
accessing a single node takes say 1ms on average.

The index-level entry cost estimates only become relevant when the
cost of returning a path is more than a fraction of the cost of
loading a node. I don't believe that's the case for any reasonable
index implementations.

>>Agreed, but we should be making guesses about the overall query
>>performance, not just the index lookup time.
>
> Yes, but the index implementation doesn't know (can't know) the cost of
> the query engine. The query engine needs to calculate its own cost.

Agreed. I'm just worried about potential confusion about what the
getCostPerEntry() method (as used in [2]) should return. The value is
currently only set in [3], but there the estimate seems to be based on
the relative performance of the *index lookup*, not the overall
performance of a query. I believe either [2] or [3] should be adjusted
to fix the cost calculation.

[1] https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/query/QueryImpl.java#L628
[2] https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/query/QueryImpl.java#L810
[3] https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/property/OrderedPropertyIndex.java#L73

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>The problem with that assumption is that typically a single disk read
>to the index would return n paths, whereas loading those n nodes might
>well take n more disk reads.

Ideally, the cost returned of the index would reflect that. For
single-property indexes (all property indexes are single property right
now), the cost should be lower than for a multi-property index, because
one disk read operation would fetch more paths.

>Sure, but we don't use a covered index.

Yes, we are not there yet. The node is currently loaded to check access
rights, but that's an implementation detail of access control part. And
it's not needed for the admin. If (when) this is changed, it depends on
the query. If the query only returns the path (for example), and the
indexed property, then the node does not need to be loaded.


> At least in the current design
>the query engine will always load all the matching nodes, regardless
>of any extra information stored in the index. Thus we can't use the
>performance of the index lookup as an accurate estimate of the overall
>query performance.

Yes, the query engine knows the cost overhead per entry, that's why the
AdvancedQueryIndex interface is a bit different (does not just contain the
cost).

>The overhead of the index lookup is probably significant when only few
>matching paths are returned (the UUID index would be the ultimate
>example), but in those cases query performance is probably best
>optimized in other ways (caching, configuration, profiling, etc.) than
>making a more accurate estimate of the index lookup performance,
>especially since in most such cases there is only a single index with
>exact matches.

We have queries that have multiple property constraints (for example "size
= 'L' and color = 'red'"). If there are two indexes, one for size and one
for color, it's important to know which one is faster. (If there is a
multi-property index, that one might be faster.)

>Agreed, but we should be making guesses about the overall query
>performance, not just the index lookup time.

Yes, but the index implementation doesn't know (can't know) the cost of
the query engine. The query engine needs to calculate its own cost.

Regards,
Thomas


Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Jun 23, 2014 at 3:30 AM, Thomas Mueller <mu...@adobe.com> wrote:
>>Right. I don't believe the cost of the index lookup is significant (at
>>least in the asymptotic sense) compared to the overall cost of
>>executing a query.
>
> Sorry, I don't understand. The cost of the index lookup *is* significant
> of course, specially if the nodes are not in the cache (which is very
> common). In which case, index lookup can contribute to about 50% of the
> overall cost of a query (if, let's assume, one disk read is needed for the
> index lookup, and one disk read is needed to read the actual node).

The problem with that assumption is that typically a single disk read
to the index would return n paths, whereas loading those n nodes might
well take n more disk reads.

Similarly for the Solr case you mention: There is some overhead in the
network roundtrip, but the search server will typically reply with the
list of all matching paths in a single roundtrip, whereas then loading
those matching nodes will (at least with our current code) require
O(n) network roundtrips or disk reads.

> For a covering index -
> http://stackoverflow.com/questions/62137/what-is-a-covered-index - the
> index lookup is nearly 100% of the query cost.

Sure, but we don't use a covered index. At least in the current design
the query engine will always load all the matching nodes, regardless
of any extra information stored in the index. Thus we can't use the
performance of the index lookup as an accurate estimate of the overall
query performance.

>> in the asymptotic sense
>
> Sorry could you translate that for me please? :-)

In the Big-O sense.

The overhead of the index lookup is probably significant when only few
matching paths are returned (the UUID index would be the ultimate
example), but in those cases query performance is probably best
optimized in other ways (caching, configuration, profiling, etc.) than
making a more accurate estimate of the index lookup performance,
especially since in most such cases there is only a single index with
exact matches.

When the number of matching paths becomes large, say in the 1k-1M
range, the relative performance overhead of different query indexes
becomes less of an issue. For example, say index A returns 10k
matching paths in 100ms and index B returns 20k less accurately
matching paths in just 10ms. If accessing each node takes 1ms on
average, the significantly faster performance of index B is totally
irrelevant to the overall time it takes to execute the full query.

> Trying to make an informed guess about the cost is the whole point of a
> cost based optimizer - http://en.wikipedia.org/wiki/Query_optimization

Agreed, but we should be making guesses about the overall query
performance, not just the index lookup time.

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>should we just return the number of estimated entries for the cost?

For Lucene, the property index, the ordered index, and the node type
index: yes. 

For Solr, the cost per index lookup (not per entry) is probably a bit
higher, because there is a network round trip. Specially if Solr is
remote. That's a fixed offset, so the cost could be: 1 +
estimatedEntryCount.

For in-memory indexes (that keep all entries in memory), no. One example
is an in-memory index of transient UUIDs. In this case it might depend on
whether the UUID is in memory or not. Calculating the cost in this case is
quick, as it's just an in-memory lookup. Therefore, the cost could be very
close to 0.

>Right. I don't believe the cost of the index lookup is significant (at
>least in the asymptotic sense) compared to the overall cost of
>executing a query.

Sorry, I don't understand. The cost of the index lookup *is* significant
of course, specially if the nodes are not in the cache (which is very
common). In which case, index lookup can contribute to about 50% of the
overall cost of a query (if, let's assume, one disk read is needed for the
index lookup, and one disk read is needed to read the actual node). For a
covering index - 
http://stackoverflow.com/questions/62137/what-is-a-covered-index - the
index lookup is nearly 100% of the query cost.

> in the asymptotic sense

Sorry could you translate that for me please? :-)

>>ok, under such perspective the index is not returning a cost, but how
>>many
>> nodes it will provide to the engine, the cost of the query is then a
>> function of the number of entries.
>
>Exactly.

As I wrote above, it depends on the nature of the index (in-memory, disk
based, remote or local).

>instead of trying to make informed guesses about expected index
>performance.

Trying to make an informed guess about the cost is the whole point of a
cost based optimizer - http://en.wikipedia.org/wiki/Query_optimization

>it shouldn't really matter much which one of equally
>costly indexes is being selected.

Sure, if the cost of two indexes is the same, then it doesn't matter which
index is used. That's the reason to make sure the returned cost is
somewhat accurate, and the reason to use the same scale (units of
measuring) in all index implementations.

Regards,
Thomas


Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jun 18, 2014 at 11:31 AM, Tommaso Teofili
<to...@gmail.com> wrote:
> 2014-06-18 16:02 GMT+02:00 Jukka Zitting <ju...@gmail.com>:
>> On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili
>> <to...@gmail.com> wrote:
>> > should we just return the number of estimated entries for the cost?
>>
>> Yes, that's what I think the contract should be.
>
> ok, that's different from what Thomas suggests, right? Just entry
> estimates, no network roundtrips / asynchronous index penalties, etc.

Right. I don't believe the cost of the index lookup is significant (at
least in the asymptotic sense) compared to the overall cost of
executing a query.

> ok, under such perspective the index is not returning a cost, but how many
> nodes it will provide to the engine, the cost of the query is then a
> function of the number of entries.

Exactly.

> At the moment node number estimates and performance of the index aspects
> seem kind of merged into the "getCost".
> Then we should probably decouple (at least) the concepts of:
> 1. how many nodes the index will return for this query (as an estimate)
> 2. how fast in retrieving the estimated nodes the index is

I would further argue that point 2 is mostly irrelevant for any decent
index. The only case where I would expect index performance to show up
as a significant factor is when n is small, but the best way to
optimize such queries is probably to just cache the results per query
instead of trying to make informed guesses about expected index
performance.

> Even with this distinction we would have to make some choices as given two
> indices returning the same number of estimated nodes for the same query, (I
> assume) the fastest should be chosen, but if two indices return two
> different node number estimates (e.g. that's likely if you have two
> different full text indices being able to handle the same query), which one
> should be chosen and why?

Unless there are other contributing factors (like preferring a
synchronous index over an asynchronous one, or an explicit preference
by a client), it shouldn't really matter much which one of equally
costly indexes is being selected.

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Tommaso Teofili <to...@gmail.com>.
Hi,

2014-06-18 16:02 GMT+02:00 Jukka Zitting <ju...@gmail.com>:

> Hi,
>
> On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili
> <to...@gmail.com> wrote:
> > should we just return the number of estimated entries for the cost?
>
> Yes, that's what I think the contract should be.
>

ok, that's different from what Thomas suggests, right? Just entry
estimates, no network roundtrips / asynchronous index penalties, etc.


>
> > My other concern on this point is that it's not granted, in my opinion,
> > that the index returning less entries would be the faster.
>
> The key concern here isn't the performance of the index, but the
> overall performance of the query. If an index returns n paths for a
> given query, then the amount of work done by the query engine will be
> O(n) as it needs to look up each path and double-check the constraints
> on those nodes (or O(n log n) if it also needs to sort the results).
> Since I don't expect any reasonable index to perform worse than O(n),
> returning n as the cost estimate seems like the best thing to do.
>

ok, under such perspective the index is not returning a cost, but how many
nodes it will provide to the engine, the cost of the query is then a
function of the number of entries.
At the moment node number estimates and performance of the index aspects
seem kind of merged into the "getCost".
Then we should probably decouple (at least) the concepts of:
1. how many nodes the index will return for this query (as an estimate)
2. how fast in retrieving the estimated nodes the index is

Even with this distinction we would have to make some choices as given two
indices returning the same number of estimated nodes for the same query, (I
assume) the fastest should be chosen, but if two indices return two
different node number estimates (e.g. that's likely if you have two
different full text indices being able to handle the same query), which one
should be chosen and why?


>
> On a related note, I tend to disagree with the additional cost factors
> introduced in AdvancedQueryIndex/IndexPlan. In practically all cases
> the index lookups will be asymptotically faster than looking up all
> the paths returned by the index.


true, but then again the same question comes to my mind: how we select the
index to be used? Are we just interested in how many nodes it will return
or how much time it takes to do that plays some role in selection?


> Thus I don't think there is value in
> trying to make detailed estimates about the cost of the index lookup.
>

Thanks to all, I will probably need some more time to digest all the
thoughts and maybe I'll come back with more questions / ideas.
Regards,
Tommaso


>
> BR,
>
> Jukka Zitting
>

Re: [DISCUSS] - QueryIndex selection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili
<to...@gmail.com> wrote:
> should we just return the number of estimated entries for the cost?

Yes, that's what I think the contract should be.

> My other concern on this point is that it's not granted, in my opinion,
> that the index returning less entries would be the faster.

The key concern here isn't the performance of the index, but the
overall performance of the query. If an index returns n paths for a
given query, then the amount of work done by the query engine will be
O(n) as it needs to look up each path and double-check the constraints
on those nodes (or O(n log n) if it also needs to sort the results).
Since I don't expect any reasonable index to perform worse than O(n),
returning n as the cost estimate seems like the best thing to do.

On a related note, I tend to disagree with the additional cost factors
introduced in AdvancedQueryIndex/IndexPlan. In practically all cases
the index lookups will be asymptotically faster than looking up all
the paths returned by the index. Thus I don't think there is value in
trying to make detailed estimates about the cost of the index lookup.

BR,

Jukka Zitting

Re: [DISCUSS] - QueryIndex selection

Posted by Tommaso Teofili <to...@gmail.com>.
2014-06-04 9:36 GMT+02:00 Thomas Mueller <mu...@adobe.com>:

> Hi,
>
> QueryIndex.getCost: this is actually quite well documented (see the
> Javadocs). But the implementations might not fully follow the contract :-)
>

this is probably just my opinion but the contract is not much clear; to me
finding "the worst-case cost to query with the given filter" defines what
should be calculated and "the returned cost is a value between 1 (very
fast; lookup of a unique node) and the estimated number of entries to
traverse" defines the output range as a function of the number of the,
estimated, number of results that the query would eventually return using
that index but, regardless of existing implementations, my doubt is what
this heuristic function to estimate the "traversed entries" should look
like in general, but that may be implementation specific, and how that
estimate should be used, should we just return the number of estimated
entries for the cost? So that cost is 1 when we estimate to return 1 entry,
2 when we estimate to return 2 entries, 1000 when we estimate to return
1000 entries?

My other concern on this point is that it's not granted, in my opinion,
that the index returning less entries would be the faster.


> But anyway, I think it's anyway the better to deprecate it and use
> AdvancedQueryIndex, as it has more features (specially important for
> ordered indexes).


it would be ok for me to either deprecate it or improve the semantics of
the cost calculation (e.g. explicitly introduce other metrics to be taken
into account in the cost calculation: local / remote index,


> Currently, both QueryIndex and AdvancedQueryIndex are
> supported, in the future I hope we can switch all index implementations to
> AdvancedQueryIndex. I didn't want to do that just before the 1.0 release
> however.
>

right, good point.


>
> > get rid for example of the FullTextQueryIndex interface
>
> The FullTextQueryIndex allows to chose a full-text index for full-text
> queries, if one is available, even if the cost is higher. The problem is
> that only full-text indexes can return the correctly data, if full-text
> constraints are used. If we want to get rid of the FullTextQueryIndex
> interface, we need to address this problem in some other way.
>

since the FullTextQueryIndex seems to be a marker interface, could we
"merge" it some way into the AdvancedQueryIndex? Just to make things
simpler.


>
> > should we always select the fastest index ? Especially for full text
> >ones this should be in some way configurable.
>
> There are multiple aspects to this: (a) "let the user decide which index
> to use", and (b) "synchronous versus async indexes":
>
> (a) The user should be able to decide which index to use for a certain
> query. There are some problems with that: The index the user has in mind
> might not be available (in a certain environment, or with a later version
> of Oak, because for example the index implementation was replaced, or when
> not using Oak).


right, in this case we can fallback to the current mechanism where the
query engine decides which index to use


> Hardcoding the index to use (in the query) is problematic.
> Relational databases: Oracle supports "hints" (search for "oracle database
> hint"). SQLite supports something similar, but there it's actually an
> assertion that an index is available, not a hint:
> http://www.sqlite.org/lang_indexedby.html . My position is that we should
> avoid such a mechanism, and instead improve the query engine, the indexes
> implementations, and the documentation instead.
>

+1, I would not want to specify the index to be used in the (JCR) query,
the approach I was thinking too was to let the user to be able to give an
order of preference of the index to be used (at a repository level) which
would be respected in case those indexes exist and again that would
fallback to current selection mechanism in case of missing index(es).


>
> (b) Synchronous versus async indexes: some indexes are updated
> asynchronously, and therefore will not include the very latest additions
> to the content. (Recently removed nodes are not a problem, as the query
> engine will anyway have to check if a node is available). We could let the
> user decide if using an asynchronous index is OK or not.
>
> For both (a) and (b), one problem is that the JCR spec doesn't allow for
> extensions in the query syntax, so if the user would use "select ...
> option async_ok", the query would not work with Jackrabbit 2.x and other
> JCR implementations. Maybe we should create a JCR commons utility method,
> so that one could use:
>
>     QueryManager qm = ...
>
>     Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK);
>
>
> The method JcrUtils.createQuery could then use "instanceof" to decide
> whether it's OK to modify the query (for Oak) or not (non-Oak).
>

yes, that's an option but if we could avoid having it in the query that
would be better.


>
> > for full text queries for example, one may be interested in having a
> >higher recall (more documents matching the query) which may eventually
> >lead to a slightly slower query execution / higher cost evaluation
>
> That would also need to be specified by the user in some way, right? For
> example in the query itself? We could use a similar mechanism than
> ASYNC_OK above, so the application would still work for Jackrabbit 2.x.
>

I am not sure, I would like to avoid having to specify such things in the
query (as per the index to be used).
With the current implementation one reliable way to define the index to be
used is (or should be) to use native query language [1].

In the end my very generic take on this discussion is that we should try to
have a clearer API/documentation for the part that involves picking the
index via cost / index plans (but maybe that's only me), implementations
can be then adjusted accordingly if needed.

Regards,
Tommaso

[1] : http://jackrabbit.apache.org/oak/docs/query.html#Native_Queries


> Regards,
> Thomas
>
>
>
>
>
>
>
>
>
> On 26/05/14 10:25, "Tommaso Teofili" <to...@gmail.com> wrote:
>
> >Hi all,
> >
> >I'd like to start discussing how we may improve / simplify current way of
> >selecting a query engine to use for a certain query.
> >
> >In the QueryIndex interface we have the plain old getCost method which
> >selects the index returning the lower cost for the given query but,
> >recently, also an AdvancedQueryIndex interface has been introduced which,
> >if I understood things correctly, uses the IndexPlan(s) returned by each
> >query index for the given query to select which one has to be used.
> >So I would like to discuss if it's possible to clean up things a bit in
> >order to have a unified query selection mechanism.
> >
> >At the moment, in my opinion, one problem with the getCost() method is
> >that
> >it inherently merges the following topics:
> >- index capability to handle a certain query (can the QueryIndex handle
> >that query?)
> >- index efficiency in handling a certain query (how fast will the
> >QueryIndex will be in handling that query?)
> >
> >Also the efficiency is not evaluated on a "cost model", each QueryIndex
> >implementation can return an arbitrary different number; on one hand this
> >is ok as it allows to take very index specific constraint into account: on
> >the other hand if one has to write a new QueryIndex implementation he/she
> >will have to look into each other query index implementation to understand
> >(and design) if / when its index is picked up; and even with already
> >existing indexes it's not easy to say upfront which one will be selected
> >(e.g. for debugging purposes).
> >
> >With the AdvancedQueryIndex, if I understood it correctly (I just had a
> >look at it on Friday), a QueryIndex is selected upon its IndexPlan, which
> >is supposed to address better both the cost (as it explicitly exposes the
> >cost per execution, cost per entry and estimated entry count metrics) and
> >the query index capability to handle a certain query (e.g. this is used
> >for
> >ordered property index).
> >However, at the moment, only the OrderedPropertyIndex is using it so I
> >think it'd be good to decide if we want to go further with the
> >AdvancedQueryIndex also for the other QueryIndex implementations (and get
> >rid for example of the FullTextQueryIndex interface as it seems useless to
> >me) or not.
> >
> >One final question on query index selection, should we always select the
> >fastest index ?
> >Especially for full text ones this should be in some way configurable.
> >
> >What do others think?
> >Regards,
> >Tommaso
> >
> >p.s.:
> >As discussed also offline last week with some other folks maybe one
> >further
> >metric to be taken into consideration for the index selection is if the
> >index is synchronous or not
>
>

Re: [DISCUSS] - QueryIndex selection

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

QueryIndex.getCost: this is actually quite well documented (see the
Javadocs). But the implementations might not fully follow the contract :-)
But anyway, I think it's anyway the better to deprecate it and use
AdvancedQueryIndex, as it has more features (specially important for
ordered indexes). Currently, both QueryIndex and AdvancedQueryIndex are
supported, in the future I hope we can switch all index implementations to
AdvancedQueryIndex. I didn't want to do that just before the 1.0 release
however.

> get rid for example of the FullTextQueryIndex interface

The FullTextQueryIndex allows to chose a full-text index for full-text
queries, if one is available, even if the cost is higher. The problem is
that only full-text indexes can return the correctly data, if full-text
constraints are used. If we want to get rid of the FullTextQueryIndex
interface, we need to address this problem in some other way.

> should we always select the fastest index ? Especially for full text
>ones this should be in some way configurable.

There are multiple aspects to this: (a) "let the user decide which index
to use", and (b) "synchronous versus async indexes":

(a) The user should be able to decide which index to use for a certain
query. There are some problems with that: The index the user has in mind
might not be available (in a certain environment, or with a later version
of Oak, because for example the index implementation was replaced, or when
not using Oak). Hardcoding the index to use (in the query) is problematic.
Relational databases: Oracle supports "hints" (search for "oracle database
hint"). SQLite supports something similar, but there it's actually an
assertion that an index is available, not a hint:
http://www.sqlite.org/lang_indexedby.html . My position is that we should
avoid such a mechanism, and instead improve the query engine, the indexes
implementations, and the documentation instead.

(b) Synchronous versus async indexes: some indexes are updated
asynchronously, and therefore will not include the very latest additions
to the content. (Recently removed nodes are not a problem, as the query
engine will anyway have to check if a node is available). We could let the
user decide if using an asynchronous index is OK or not.

For both (a) and (b), one problem is that the JCR spec doesn't allow for
extensions in the query syntax, so if the user would use "select ...
option async_ok", the query would not work with Jackrabbit 2.x and other
JCR implementations. Maybe we should create a JCR commons utility method,
so that one could use:

    QueryManager qm = ...

    Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK);


The method JcrUtils.createQuery could then use "instanceof" to decide
whether it's OK to modify the query (for Oak) or not (non-Oak).

> for full text queries for example, one may be interested in having a
>higher recall (more documents matching the query) which may eventually
>lead to a slightly slower query execution / higher cost evaluation

That would also need to be specified by the user in some way, right? For
example in the query itself? We could use a similar mechanism than
ASYNC_OK above, so the application would still work for Jackrabbit 2.x.

Regards,
Thomas









On 26/05/14 10:25, "Tommaso Teofili" <to...@gmail.com> wrote:

>Hi all,
>
>I'd like to start discussing how we may improve / simplify current way of
>selecting a query engine to use for a certain query.
>
>In the QueryIndex interface we have the plain old getCost method which
>selects the index returning the lower cost for the given query but,
>recently, also an AdvancedQueryIndex interface has been introduced which,
>if I understood things correctly, uses the IndexPlan(s) returned by each
>query index for the given query to select which one has to be used.
>So I would like to discuss if it's possible to clean up things a bit in
>order to have a unified query selection mechanism.
>
>At the moment, in my opinion, one problem with the getCost() method is
>that
>it inherently merges the following topics:
>- index capability to handle a certain query (can the QueryIndex handle
>that query?)
>- index efficiency in handling a certain query (how fast will the
>QueryIndex will be in handling that query?)
>
>Also the efficiency is not evaluated on a "cost model", each QueryIndex
>implementation can return an arbitrary different number; on one hand this
>is ok as it allows to take very index specific constraint into account: on
>the other hand if one has to write a new QueryIndex implementation he/she
>will have to look into each other query index implementation to understand
>(and design) if / when its index is picked up; and even with already
>existing indexes it's not easy to say upfront which one will be selected
>(e.g. for debugging purposes).
>
>With the AdvancedQueryIndex, if I understood it correctly (I just had a
>look at it on Friday), a QueryIndex is selected upon its IndexPlan, which
>is supposed to address better both the cost (as it explicitly exposes the
>cost per execution, cost per entry and estimated entry count metrics) and
>the query index capability to handle a certain query (e.g. this is used
>for
>ordered property index).
>However, at the moment, only the OrderedPropertyIndex is using it so I
>think it'd be good to decide if we want to go further with the
>AdvancedQueryIndex also for the other QueryIndex implementations (and get
>rid for example of the FullTextQueryIndex interface as it seems useless to
>me) or not.
>
>One final question on query index selection, should we always select the
>fastest index ?
>Especially for full text ones this should be in some way configurable.
>
>What do others think?
>Regards,
>Tommaso
>
>p.s.:
>As discussed also offline last week with some other folks maybe one
>further
>metric to be taken into consideration for the index selection is if the
>index is synchronous or not


Re: [DISCUSS] - QueryIndex selection

Posted by Tommaso Teofili <to...@gmail.com>.
2014-05-27 11:21 GMT+02:00 Davide Giannella <gi...@gmail.com>:

> On 26/05/2014 09:25, Tommaso Teofili wrote:
> > ...
> > Also the efficiency is not evaluated on a "cost model", each QueryIndex
> > implementation can return an arbitrary different number; on one hand this
> > is ok as it allows to take very index specific constraint into account:
> on
> > the other hand if one has to write a new QueryIndex implementation he/she
> > will have to look into each other query index implementation to
> understand
> > (and design) if / when its index is picked up; and even with already
> > existing indexes it's not easy to say upfront which one will be selected
> > (e.g. for debugging purposes).
>
> I don't know if by using the IndexPlans it will be possible to says
> beforehand which index will be pick up. It really depends by the query
> engine logics and if there are other indexes that could perform faster
> why not choose them?
>

for full text queries for example, one may be interested in having a higher
recall (more documents matching the query) which may eventually lead to a
slightly slower query execution / higher cost evaluation, then if we only
select the fastest that use case cannot be addressed.


>
> > With the AdvancedQueryIndex, if I understood it correctly (I just had a
> > look at it on Friday), a QueryIndex is selected upon its IndexPlan, which
> > is supposed to address better both the cost (as it explicitly exposes the
> > cost per execution, cost per entry and estimated entry count metrics) and
> > the query index capability to handle a certain query (e.g. this is used
> for
> > ordered property index).
> > However, at the moment, only the OrderedPropertyIndex is using it so I
> > think it'd be good to decide if we want to go further with the
> > AdvancedQueryIndex also for the other QueryIndex implementations (and get
> > rid for example of the FullTextQueryIndex interface as it seems useless
> to
> > me) or not.
> >
> > One final question on query index selection, should we always select the
> > fastest index ?
> > Especially for full text ones this should be in some way configurable.
>
> Yes we discussed off-line last time. It seems that it would be good for
> the query engine to expose an API in which the client application could
> state: run with this index. Something like
> queryengine.forceIndex("path/to/oak:QueryIndexDefinition").  If
> specified the query engine will know about it and skip the index
> selection by forcing it.
>
> Useful definitely for debugging as well as in edge cases where the
> client application will know that the query has to be run with a
> specific index for a reason or another.
> > ...
> > As discussed also offline last week with some other folks maybe one
> further
> > metric to be taken into consideration for the index selection is if the
> > index is synchronous or not
> >
> The current index plan of the AdvancedQueryIndex expose the sync/async
> aspect by IndexPlan.isDelayed(). If that is taken into account yet I
> don't know :)
>

ok


>
> Another aspect of improvement we were discussing is on the query engine
> side. As the index selections and plan are expensive if the query engine
> is asked to execute a query with bindings[0] it could cache the selected
> index for either a fixed amount of time or other logics.
>
> (0) http://goo.gl/LDQw1Y
>
> Last one as we where discussing about the possibility of serving queries
> for "all" the properties from Lucene/Solr a metrics of evaluation could
> be that in case of the same property served by a sync property index and
> a Lucene, the first one will have should be chosen as it would be local.
>

right.

Thanks,
Tommaso


>
> D.
>
>
>

Re: [DISCUSS] - QueryIndex selection

Posted by Davide Giannella <gi...@gmail.com>.
On 26/05/2014 09:25, Tommaso Teofili wrote:
> ...
> Also the efficiency is not evaluated on a "cost model", each QueryIndex
> implementation can return an arbitrary different number; on one hand this
> is ok as it allows to take very index specific constraint into account: on
> the other hand if one has to write a new QueryIndex implementation he/she
> will have to look into each other query index implementation to understand
> (and design) if / when its index is picked up; and even with already
> existing indexes it's not easy to say upfront which one will be selected
> (e.g. for debugging purposes).

I don't know if by using the IndexPlans it will be possible to says
beforehand which index will be pick up. It really depends by the query
engine logics and if there are other indexes that could perform faster
why not choose them?

> With the AdvancedQueryIndex, if I understood it correctly (I just had a
> look at it on Friday), a QueryIndex is selected upon its IndexPlan, which
> is supposed to address better both the cost (as it explicitly exposes the
> cost per execution, cost per entry and estimated entry count metrics) and
> the query index capability to handle a certain query (e.g. this is used for
> ordered property index).
> However, at the moment, only the OrderedPropertyIndex is using it so I
> think it'd be good to decide if we want to go further with the
> AdvancedQueryIndex also for the other QueryIndex implementations (and get
> rid for example of the FullTextQueryIndex interface as it seems useless to
> me) or not.
>
> One final question on query index selection, should we always select the
> fastest index ?
> Especially for full text ones this should be in some way configurable.

Yes we discussed off-line last time. It seems that it would be good for
the query engine to expose an API in which the client application could
state: run with this index. Something like
queryengine.forceIndex("path/to/oak:QueryIndexDefinition").  If
specified the query engine will know about it and skip the index
selection by forcing it.

Useful definitely for debugging as well as in edge cases where the
client application will know that the query has to be run with a
specific index for a reason or another.
> ...
> As discussed also offline last week with some other folks maybe one further
> metric to be taken into consideration for the index selection is if the
> index is synchronous or not
>
The current index plan of the AdvancedQueryIndex expose the sync/async
aspect by IndexPlan.isDelayed(). If that is taken into account yet I
don't know :)

Another aspect of improvement we were discussing is on the query engine
side. As the index selections and plan are expensive if the query engine
is asked to execute a query with bindings[0] it could cache the selected
index for either a fixed amount of time or other logics.

(0) http://goo.gl/LDQw1Y

Last one as we where discussing about the possibility of serving queries
for "all" the properties from Lucene/Solr a metrics of evaluation could
be that in case of the same property served by a sync property index and
a Lucene, the first one will have should be chosen as it would be local.

D.