You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2013/11/28 10:37:36 UTC

[jira] [Commented] (OAK-894) Query: better cost estimates for indexes

    [ https://issues.apache.org/jira/browse/OAK-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834651#comment-13834651 ] 

Thomas Mueller commented on OAK-894:
------------------------------------

It would be nice to further improve cost estimates, but I think the current behavior is OK for the known cases.

> Query: better cost estimates for indexes
> ----------------------------------------
>
>                 Key: OAK-894
>                 URL: https://issues.apache.org/jira/browse/OAK-894
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>             Fix For: 0.14
>
>
> Currently, the cost estimation for the PropertyIndex is somewhat limited, so that the estimated cost for the index on "jcr:primaryType", with filter "jcr:primaryType is not null" is relatively low. However, there are many nodes with this property (well, _most_ nodes have a primary type). Because of that, this index is used for the following query, which tests the existence of a child node:
> {code}
> -- xpath:
> /jcr:root/home/users//*[a/b/c/@jcr:primaryType]
> -- sql2:
> select [jcr:path], [jcr:score], * from [nt:base] as a 
> where [a/b/c/jcr:primaryType] is not null 
> and isdescendantnode(a, '/home/users')
> {code}
> With SQL-2, the query could be written as an inner join, but I guess the query would end up rather complex (4 selectors as far as I see).
> It would be nice if the index would return a better cost estimate. There are multiple ways to do that:
> * Automatically update the index statistics when modifying the index
> * Update the statistics in a background thread
> * Hardcode the statistics in the index (when declaring the index)
> With statistics I mean: number of entries in total within the index, and selectivity (or number of distinct index keys), and potentially number of entries for each index key.
> To automatically update the statistics at runtime, the following algorithms could be used: whenever adding an entry to the index, increment a counter; when removing an entry, decrement. This would result in many write operations and have a high risk of conflicts. Slightly improved: when adding/removing an entry, and if Math.random()<0.001, increment/decrement the counter by 1000 (which is a good enough accuracy for our case). This would improve performance and reduce the risk of conflicts; but there would still be a risk.
> Updating the statistics in a background thread would require a background thread, but it could be done rather efficiently. We wouldn't need to read and count _all_ child node entries; instead, a random walk could be used to estimate the number of entries. I guess it would be somewhat less accurate than the automatic update.
> Hardcoding the statistics should be relatively easy to do. I guess this is what I will try for now.



--
This message was sent by Atlassian JIRA
(v6.1#6144)