Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2016/11/29 14:58:59 UTC

[jira] [Commented] (OAK-3219) Lucene IndexPlanner should also account for number of property constraints evaluated while giving cost estimation

    [ https://issues.apache.org/jira/browse/OAK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15705511#comment-15705511 ] 

Thomas Mueller commented on OAK-3219:
-------------------------------------

It's probably a bit hard to fully resolve this issue for Oak 1.6.

[~chetanm], which part of the issue is the most important for you:

* (a) If there are two indexes, a Property index and a Lucene index, for the same properties, then I think it's reasonable to disable one of those, as it doesn't make sense to index the same data twice.

* (b) The Lucene index doesn't know {{estimateOfNodesUnderGivenPath}} right now. For a good estimate, that would be needed I think. That could be added, but it would require API changes (making the node counter info available to the Lucene index).

* (c) In case similar, but not the same properties are indexed, and the query contains multiple conditions:

** Lucene index on property a and b
** Lucene index on property c and d
** Query with conditions for a, b, c, d. Which index to pick? It should depend on the number of documents for a, b, c, d.

For (c), we would need to retrieve the number of documents for a field. I don't know exactly which Lucene API makes the most sense, as this needs to be fast. I suggest adding a feature to the LuceneIndexMBean to retrieve the data, using the following algorithm. Then we can verify how fast this is for a large repository, and whether that info is useful or not:

{noformat}
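    // Returns one entry per indexed field, in the form "<field> <docCount>".
    private static String[] getFieldInfo(IndexSearcher searcher) throws IOException {
        ArrayList<String> list = new ArrayList<String>();
        IndexReader reader = searcher.getIndexReader();
        // Merged view of the fields across all segments; null if there are none
        Fields fields = MultiFields.getFields(reader);
        if (fields != null) {
            for (String f : fields) {
                // Number of documents that have at least one term for this field
                list.add(f + " " + reader.getDocCount(f));
            }
        }
        return list.toArray(new String[0]);
    }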
{noformat}
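
To illustrate how the per-field counts could be used for (c), here is a minimal sketch. It is not Oak code: the class {{FieldCountEstimator}} and its methods are hypothetical names, and the heuristic (take the smallest document count among the queried fields an index covers, as an upper bound for AND-ed conditions) is an assumption that would still need to be verified against a real repository. The input format is the "<field> <docCount>" strings returned by {{getFieldInfo}} above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Oak: compares candidate indexes
// by the per-field document counts reported for each of them.
final class FieldCountEstimator {

    // Parses entries of the form "<field> <docCount>".
    static Map<String, Integer> parse(String[] fieldInfo) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String entry : fieldInfo) {
            int split = entry.lastIndexOf(' ');
            counts.put(entry.substring(0, split),
                    Integer.parseInt(entry.substring(split + 1)));
        }
        return counts;
    }

    // Upper bound on the matching documents for AND-ed conditions:
    // the smallest count among the queried fields this index covers.
    // Negative counts are treated as "unknown" and skipped (assumption).
    // Returns Integer.MAX_VALUE if no queried field is covered.
    static int estimate(Map<String, Integer> counts, List<String> queriedFields) {
        int min = Integer.MAX_VALUE;
        for (String field : queriedFields) {
            Integer count = counts.get(field);
            if (count != null && count >= 0) {
                min = Math.min(min, count);
            }
        }
        return min;
    }
}
```

With made-up counts, an index on a and b reporting {{"a 1000", "b 50"}} would get an estimate of 50 for a query on a, b, c, d, while an index on c and d reporting {{"c 20000", "d 7000"}} would get 7000, so the first index would be preferred.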

> Lucene IndexPlanner should also account for number of property constraints evaluated while giving cost estimation
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-3219
>                 URL: https://issues.apache.org/jira/browse/OAK-3219
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Thomas Mueller
>            Priority: Minor
>              Labels: performance
>             Fix For: 1.6
>
>
> Currently the cost returned by the Lucene index is a function of the number of indexed documents present in the index. If the number of indexed entries is high, this might reduce the chances of this index being selected if some property index also supports the property constraint.
> {noformat}
> /jcr:root/content/freestyle-cms/customers//element(*, cq:Page)
> [(jcr:content/@title = 'm' or jcr:like(jcr:content/@title, 'm%')) 
> and jcr:content/@sling:resourceType = '/components/page/customer']
> {noformat}
> Consider the above query with the following index definitions:
> * A property index on resourceType
> * A Lucene index for cq:Page with properties {{jcr:content/title}}, {{jcr:content/sling:resourceType}} indexed and also path restriction evaluation enabled
> The restrictions each of the two indexes can help with:
> # Property index
> ## Path restriction
> ## Property restriction on  {{sling:resourceType}}
> # Lucene index
> ## NodeType restriction
> ## Property restriction on  {{sling:resourceType}}
> ## Property restriction on  {{title}}
> ## Path restriction
> The cost estimate currently works like this:
> * Property index - {{f(indexedValueEstimate, estimateOfNodesUnderGivenPath)}}
> ** indexedValueEstimate - For 'sling:resourceType=foo', it's the approximate count of nodes having that property set to 'foo'
> ** estimateOfNodesUnderGivenPath - It's derived from an approximate estimate of the nodes present under the given path
> * Lucene Index - {{f(totalIndexedEntries)}}
> As the Lucene cost function is too simple, it does not reflect reality. The following 2 changes can be made to improve it:
> * Given that the Lucene index can handle more constraints (4) than the property index (2), the cost estimate it returns should reflect this. This can be done by setting costPerEntry to 1/(number of property restrictions evaluated)
> * Get the count for the queried property value - This is similar to what PropertyIndex does, and assumes that Lucene can provide that information at O(1) cost. In case of multiple supported property restrictions, this can be the minimum of all
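
The two changes proposed in the description can be sketched roughly as follows. This is only an illustration: the class {{LuceneCostSketch}} and its method names are hypothetical, and combining the two values as costPerEntry times the entry count is an assumption about how the planner would use them, not a description of the existing Oak code.

```java
// Hypothetical sketch of the proposed cost estimate, not the Oak implementation.
final class LuceneCostSketch {

    // costPerEntry = 1 / (number of property restrictions evaluated);
    // guards against division by zero when no restriction is evaluated.
    static double costPerEntry(int propertyRestrictionsEvaluated) {
        return 1.0 / Math.max(1, propertyRestrictionsEvaluated);
    }

    // Entry count = minimum of the per-property counts,
    // never more than the total number of indexed entries.
    static long entryCount(long totalIndexedEntries, long... perPropertyCounts) {
        long min = totalIndexedEntries;
        for (long count : perPropertyCounts) {
            min = Math.min(min, count);
        }
        return min;
    }

    // Assumed combination: cost grows with the entries to read,
    // scaled down when more restrictions are evaluated by the index.
    static double cost(int propertyRestrictionsEvaluated,
                       long totalIndexedEntries, long... perPropertyCounts) {
        return costPerEntry(propertyRestrictionsEvaluated)
                * entryCount(totalIndexedEntries, perPropertyCounts);
    }
}
```

For example, with made-up numbers: an index with 100000 entries that evaluates two property restrictions matching 4000 and 250 entries would get an entry count of 250 and a costPerEntry of 0.5.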


