Posted to java-user@lucene.apache.org by "Rose, Stuart J" <st...@pnnl.gov> on 2012/02/24 22:18:52 UTC

retrieved doc field values being cached?

Lucene (I am using 3.5) seems to be caching document field values after they have been retrieved, and I am hoping someone can explain how and where exactly those values are stored.

The table below lists the time (in milliseconds) taken to retrieve a single stored value from each document in the set of documents matching a particular query. Results are shown for three queries (A, B, and C), each submitted multiple times. The first time each query is submitted, retrieving its matching documents' values takes considerably longer than on any later run.

 1) search A    nDocs =   489    time =  1342
 2) search A    nDocs =   489    time =   811
 3) search B    nDocs = 47038    time = 76658
 4) search B    nDocs = 47038    time =  1062
 5) search C    nDocs =  5256    time = 22741
 6) search C    nDocs =  5256    time =   578
 7) search A    nDocs =   489    time =   515
 8) search A    nDocs =   489    time =   514
 9) search B    nDocs = 47038    time =  1000
10) search B    nDocs = 47038    time =   967
11) search C    nDocs =  5256    time =   563
12) search C    nDocs =  5256    time =   562


Whatever is being cached is available across separate processes, so presumably it resides somewhere in the file system (and/or virtual memory). I have seen the same behavior when retrieving TermFreqVector information.

Any additional insight is appreciated!

Thanks,
Stuart


__________________________________________________
Stuart Rose
Senior Research Engineer
Pacific Northwest National Laboratory
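
One way to check whether a first-run versus repeat-run gap like the one in the timings above comes from the operating system rather than from Lucene is to time two plain reads of the same file with no Lucene code involved. The sketch below is only an illustration using the Java standard library; the class name PageCacheTiming and its methods are hypothetical and are not part of Lucene. A gap will only show up if the file has not been read recently, and the numbers depend on the OS, the disk, and available memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PageCacheTiming {

    /** Reads the whole file through a fixed buffer and returns the number of bytes read. */
    public static long readFully(Path file) throws IOException {
        byte[] buffer = new byte[1 << 16];
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    /** Times one full read of the file, in milliseconds. */
    public static long timeReadMillis(Path file) throws IOException {
        long start = System.nanoTime();
        readFully(file);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PageCacheTiming <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        // If the OS cache is responsible, the second read is usually much faster,
        // even when it is run from a different process than the first.
        System.out.println("first read:  " + timeReadMillis(file) + " ms");
        System.out.println("second read: " + timeReadMillis(file) + " ms");
    }
}
```

Running it twice in separate JVMs against a large file that has not been touched recently should show whether the fast second read survives a process restart, which is the behavior described above.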

RE: retrieved doc field values being cached?

Posted by "Rose, Stuart J" <st...@pnnl.gov>.

Thanks Simon, 

Warming up all the docs in the index took less time and space than I expected (28 million doc titles, ~60 seconds, ~4GB in RAM). Do you know if the speedup is solely due to the doc fields being loaded into RAM?

Regards,
Stuart
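
For comparison with a warm-up that goes through Lucene, the files of an index directory can also be read once from start to end so that the OS has a chance to keep them in its file-system cache. This is a rough sketch using only the Java standard library, not the warm-up that produced the figures above; the class name IndexFileWarmer and the assumption that the index is a flat directory of regular files are mine. Whether the data stays cached depends on how much free memory the machine has.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IndexFileWarmer {

    /** Reads every regular file directly inside indexDir once; returns total bytes read. */
    public static long warm(Path indexDir) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(indexDir)) {
            files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        byte[] buffer = new byte[1 << 20];
        long total = 0;
        for (Path file : files) {
            try (InputStream in = Files.newInputStream(file)) {
                int n;
                // The bytes are discarded; the point is only to make the OS read them.
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java IndexFileWarmer <indexDir>");
            return;
        }
        long start = System.nanoTime();
        long bytes = warm(Paths.get(args[0]));
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("read " + bytes + " bytes in " + millis + " ms");
    }
}
```

If retrieval is just as fast after this kind of warm-up as after a warm-up through Lucene, that would suggest the speedup comes from the file-system cache rather than from anything held inside the JVM.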


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: retrieved doc field values being cached?

Posted by Simon Willnauer <si...@googlemail.com>.

Hey Stuart,

Lucene relies solely on the FS cache, with some exceptions: the
term dictionary and FieldCache are pulled entirely into memory.
FieldCache is not used to retrieve stored fields, though; it's rather an
uninverted view (docID -> value) of an indexed (inverted) field. So
basically what you see is likely the filesystem cache, i.e. your
documents are "hot". In general you should fire up some warmup queries
before you swap your searcher in to serve user queries, to get the best
performance.

hope that helps

simon
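
To make the "uninverted view (docID -> value)" idea concrete, here is a toy sketch in plain Java. It does not use Lucene and is not how FieldCache is implemented; the class name UninvertSketch, the map-based postings, and the one-term-per-document assumption are all illustrative choices.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UninvertSketch {

    /**
     * Turns an inverted mapping (term -> docIDs containing it) into a
     * per-document array (docID -> term). Assumes at most one term per
     * document for the field; documents without a term stay null.
     */
    public static String[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    public static void main(String[] args) {
        // Inverted form: each term points at the documents that contain it.
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put("blue", Arrays.asList(0, 3));
        postings.put("red", Arrays.asList(1));

        // Uninverted form: one slot per document, looked up directly by docID.
        System.out.println(Arrays.toString(uninvert(postings, 5)));
        // prints [blue, red, null, blue, null]
    }
}
```

The array form answers "what is the value for document N" with a single lookup, which is useful for things like sorting, but it holds a copy of the indexed terms and is a separate structure from the stored fields on disk.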
