Posted to java-user@lucene.apache.org by Randy Tidd <rc...@tidd.cc> on 2016/07/05 15:20:16 UTC

Document retrieval, performance, and DocValues

My Lucene index has about 3 million documents and result sets can be large, often in the thousands and sometimes as many as 100,000.  I am expecting the index size to grow 5-10x as the system matures.

I index 5 fields, and per recommendations I’ve read, am storing the minimal data in Lucene, currently just a 12 byte numeric identifier (a Mongo ObjectId) per document.  I store the rest of the data separately and use the id I get from Lucene to look it up there.

In my load testing, a search like this:

    TopDocs docs = indexSearcher.search(query, maxResults, sort);

takes about 50-75 msec which is good.  Retrieving documents with a loop like this:

    for(int i=0; i<docs.scoreDocs.length; i++) {
        ScoreDoc sdoc = docs.scoreDocs[i];
        String id = indexReader.document(sdoc.doc, Collections.singleton("pos_id")).getField("pos_id").stringValue();
        // … retrieve data with id
    }

takes around 350-400 msec, sometimes as long as 800 msec.  I’m looking for ways to try to decrease this time if possible.

I’ve read up on DocValues and am not sure if that is intended to help with this.  I understand that it is a separate store/mapping of Lucene’s internal document id’s to my “pos_id” which sounds like it may help but I am not sure.  I tried getting the id’s from my reader like this:

    String id = MultiDocValues.getBinaryValues(indexReader, "pos_id").get(sdoc.doc).utf8ToString();

But performance was no better.  However I saw in the docs for MultiDocValues that I may get better performance by getting the "atomic leaves" and then operating per-LeafReader.  I searched around and could not find documentation on how to do that.  I see some examples using leaf readers in the Solr project but they were just examples and I don't think they were written specifically to optimize performance.  It would be great to find an explanation of why there are multiple leaf readers per reader and how to use them.

So my questions are 1) are DocValues a possibility for improving my document retrieval performance, and 2) if so, where can I find an example of this that is written for best performance?

Thanks in advance!

Randy


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Document retrieval, performance, and DocValues

Posted by Michael McCandless <lu...@mikemccandless.com>.
You should do the MultiDocValues.getBinaryValues(indexReader, "pos_id")
lookup once up front, not per hit.
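For example, a sketch against the Lucene 6.x API current at the time (the wrapper class and method name here are illustrative, not from the thread):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Resolve the docvalues view once up front; each hit then costs only a
// get() on the already-resolved BinaryDocValues.
public class IdLookup {
    public static List<String> lookupIds(IndexReader indexReader, TopDocs docs) throws IOException {
        BinaryDocValues idValues = MultiDocValues.getBinaryValues(indexReader, "pos_id");
        List<String> ids = new ArrayList<>(docs.scoreDocs.length);
        for (ScoreDoc sdoc : docs.scoreDocs) {
            ids.add(idValues.get(sdoc.doc).utf8ToString());
        }
        return ids;
    }
}
```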

You could operate per-segment instead by making a custom Collector.
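A minimal sketch of such a collector, again against the Lucene 6.x API where BinaryDocValues is still random-access (the class name is illustrative; in Lucene 7+ the docvalues API became iterator-based, so the per-leaf lookup would use advanceExact() instead of get()):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

// Collects pos_id per segment: the docvalues for each leaf are resolved
// once in doSetNextReader, so collect() only does a segment-local get().
public class PosIdCollector extends SimpleCollector {
    private final List<String> ids = new ArrayList<>();
    private BinaryDocValues leafValues;

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
        // Doc ids passed to collect() are relative to this segment,
        // which is exactly what the per-leaf docvalues expect.
        leafValues = DocValues.getBinary(context.reader(), "pos_id");
    }

    @Override
    public void collect(int doc) throws IOException {
        ids.add(leafValues.get(doc).utf8ToString());
    }

    @Override
    public boolean needsScores() {
        return false; // we only need the ids, not the scores
    }

    public List<String> getIds() {
        return ids;
    }
}
```

You would pass it as indexSearcher.search(query, new PosIdCollector()), at the cost of losing the top-N sorting that search(query, n, sort) does.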

Are you sorting by your pos_id field?  If so, the value is already
available in each FieldDoc and you don't need to separately look it up.
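For example, assuming the search is sorted with new Sort(new SortField("pos_id", SortField.Type.STRING)) over a field indexed as SortedDocValuesField, and pos_id is the primary sort key (a sketch; the class name is illustrative):

```java
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.util.BytesRef;

// With a field sort, the searcher returns FieldDocs whose fields[] array
// already carries the per-hit sort values, so no extra stored-field or
// docvalues lookup is needed per hit.
public class SortedHitId {
    public static String idFromHit(ScoreDoc sdoc) {
        // fields[0] is the value of the primary sort key for this hit.
        BytesRef raw = (BytesRef) ((FieldDoc) sdoc).fields[0];
        return raw.utf8ToString();
    }
}
```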

How many hits are you collecting for each search?

Mike McCandless

http://blog.mikemccandless.com


Re: Document retrieval, performance, and DocValues

Posted by Sanne Grinovero <sa...@gmail.com>.
Hi Randy,

a first quick and easy win would be to hoist the field set out of the loop:

    Set<String> fields = Collections.singleton("pos_id");
    for (int i = 0; i < docs.scoreDocs.length; i++) {
        ScoreDoc sdoc = docs.scoreDocs[i];
        DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor(fields);
        indexReader.document(sdoc.doc, visitor);
        String id = visitor.getDocument().getField("pos_id").stringValue();
        // … retrieve data with id
    }

as re-creating the field set on every hit isn't free.  Note the visitor
itself does have to be created per hit: document(int, StoredFieldVisitor)
returns void, and a reused DocumentStoredFieldVisitor keeps accumulating
fields into the same Document, so the result is read back via
visitor.getDocument().

Next, I'd suggest measuring this loop with the "retrieve data with id"
step stubbed out, so you can see how much of the time is Lucene versus
the external lookup.
It may also be possible to just fill an array with the ids, and then load
them all as a second step, in batches?
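That two-step idea could look something like this plain-Java sketch; the batch size and the bulk lookup it would feed are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Split the collected ids into fixed-size batches so the external store
// can be hit with one bulk lookup (e.g. a MongoDB $in query) per batch
// instead of one round trip per document.
public final class IdBatcher {
    public static <T> List<List<T>> batches(List<T> ids, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += size) {
            out.add(new ArrayList<>(ids.subList(from, Math.min(from + size, ids.size()))));
        }
        return out;
    }
}
```

With 100,000 hits and a batch size of 500, that is 200 bulk queries instead of 100,000 individual lookups.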

I'm not sure about your other questions, will leave that to others on the list.

-- Sanne




