Posted to oak-issues@jackrabbit.apache.org by "Vikas Saurabh (JIRA)" <ji...@apache.org> on 2017/10/11 18:05:00 UTC

[jira] [Updated] (OAK-6735) Lucene Index: improved cost estimation by using document count per field

     [ https://issues.apache.org/jira/browse/OAK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikas Saurabh updated OAK-6735:
-------------------------------
    Attachment: IndexReadPattern.txt
                LuceneIndexReadPattern.java

I wanted to see how much reading Lucene incurs while calculating various stats. For that I used [^LuceneIndexReadPattern.java] (it has a few hard-coded paths for indexed data on my setup).

These are the sizes of the indices I extracted the stats from:
{noformat}
$ du -sh */data
364K	damAssetLucene-1505227087108/data
36M	lucene-1505227210399/data
4.2G	PetabyteDamAssetLucene/data
19G	PetabyteLucene/data
46M	someLuceneIdx/data
{noformat}

The complete output is at [^IndexReadPattern.txt].

A few interesting things to note:
* opening a reader reads quite a bit, but we open a reader only on index refresh (and we've been incurring this cost even today anyway)
* reading numDocs and numTermsPerField didn't incur any reads, even on the /oak:index/lucene index that AEM provisions (index size 19G)
* reading numDocsAgainstATerm does require reads (although only in large indices)

So I think we'd need to limit ourselves to termsPerField if we tie stats collection to index refresh.

If we want deeper stats collection, it would have to happen infrequently in a background thread.
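
To illustrate the background-thread idea, here is a minimal Java sketch. It is not Oak code: `BackgroundStats`, its methods, the thread name and the collector passed in are all hypothetical names chosen for the example. The expensive collection runs on a daemon thread at a fixed delay, while readers only ever see the last completed snapshot.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: holds the most recent result of an expensive stats
// collection and refreshes it infrequently on a background thread.
final class BackgroundStats<T> implements AutoCloseable {
    private final Supplier<T> collector;          // the expensive collection (assumed)
    private final ScheduledExecutorService scheduler;
    private volatile T latest;                    // last completed snapshot

    BackgroundStats(Supplier<T> collector, T initial) {
        this.collector = collector;
        this.latest = initial;
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "index-deep-stats");
            t.setDaemon(true);                    // do not block JVM shutdown
            return t;
        });
    }

    // Schedule periodic collection; the first run happens after one period.
    void start(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(this::refresh, period, period, unit);
    }

    // Run one collection. On failure the previous snapshot is kept, and the
    // exception is swallowed so that later scheduled runs are not cancelled.
    void refresh() {
        try {
            latest = collector.get();
        } catch (RuntimeException e) {
            // keep the previous snapshot
        }
    }

    // Cheap read used by callers such as a cost estimator; never blocks.
    T latest() {
        return latest;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A caller would construct it with whatever deep collection it needs (for example per-term document counts), call `start(...)` with a long period, and read `latest()` during cost calculation.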

> Lucene Index: improved cost estimation by using document count per field
> ------------------------------------------------------------------------
>
>                 Key: OAK-6735
>                 URL: https://issues.apache.org/jira/browse/OAK-6735
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, query
>    Affects Versions: 1.7.4
>            Reporter: Thomas Mueller
>             Fix For: 1.8
>
>         Attachments: IndexReadPattern.txt, LuceneIndexReadPattern.java
>
>
> The cost estimation of the Lucene index is somewhat inaccurate because, by default, it uses just the number of documents in the index (as of Oak 1.7.4, due to OAK-6333).
> Instead, it should use the number of documents for the given fields (the minimum, if there are multiple fields with restrictions),
> divided by the number of restrictions (as we already do now).
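
The estimate proposed in the description can be sketched in plain Java as follows. This is only an illustration of the formula, not Oak's implementation: `FieldCostEstimator`, `estimateCost`, and the fallback to the total document count for fields without stats are assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the proposed estimate:
// min(document count over the restricted fields) / number of restrictions.
final class FieldCostEstimator {

    static double estimateCost(long totalDocs,
                               Map<String, Long> docCountPerField,
                               Set<String> restrictedFields) {
        if (restrictedFields.isEmpty()) {
            // No field restrictions: fall back to the whole index.
            return totalDocs;
        }
        long min = totalDocs;
        for (String field : restrictedFields) {
            // Assumption: a field without stats is treated as if every
            // document had it, so it never lowers the estimate.
            Long count = docCountPerField.get(field);
            if (count != null) {
                min = Math.min(min, count);
            }
        }
        return (double) min / restrictedFields.size();
    }
}
```

For example, with 1000 documents in the index, 200 documents carrying `title` and 50 carrying `tags`, a query restricting both fields would be estimated at 50 / 2 = 25, whereas the index-wide count would give 1000 / 2 = 500.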



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)