Posted to solr-user@lucene.apache.org by Walter Ferrara <wa...@gmail.com> on 2007/09/20 17:30:28 UTC

Solr and FieldCache

I have an index with several fields, but just one stored: ID (string,
unique).
I need to access that ID field for each of the top "nodes" docs in my
results (this is done inside a handler I wrote). The code looks like:

     Hits hits = searcher.search(query);
     for(int i=0; i<nodes; i++) {
            id[i]=hits.doc(i).get("ID");
            score[i]=hits.score(i);
     }

I noticed that retrieving the ID this way is slow.

If I use the FieldCache instead, like:
id[i]=FieldCache.DEFAULT.getStrings(searcher.getReader(),
"ID")[hits.id(i)];
then after the first execution (initializing the cache takes some
time), it seems to run much faster.

But what happens when Solr reloads the index (after a commit or an
optimize, for example)?
Will it refresh the cache with the new reader (in the warmup process?), or
will the first query execution of that code (with the new reader)
force the refresh? (This could mean that every first query
after a reload will be slower.)
Is there any way to tell Solr to cache and warm up this "ID" field
when needed?
 
Thanks,
Walter


Re: Solr and FieldCache

Posted by Yonik Seeley <yo...@apache.org>.
On 9/20/07, Walter Ferrara <wa...@gmail.com> wrote:
> I have an index with several fields, but just one stored: ID (string,
> unique).
> I need to access that ID field for each of the top "nodes" docs in my
> results (this is done inside a handler I wrote). The code looks like:
>
>      Hits hits = searcher.search(query);
>      for(int i=0; i<nodes; i++) {
>             id[i]=hits.doc(i).get("ID");
>             score[i]=hits.score(i);
>      }

What is the higher level use-case you are trying to address that makes
it necessary to write a plugin?

-Yonik

Re: Solr and FieldCache

Posted by Yonik Seeley <yo...@apache.org>.
On 9/20/07, Walter Ferrara <wa...@gmail.com> wrote:
> I'm just wondering, as this cached object could be (theoretically)
> pretty big, do I need to worry about OOM errors? I know that FieldCache
> uses weak maps, so I presume the cached array for the older reader(s) will
> be GC-ed when the reader is no longer referenced (i.e. when Solr loads
> the new one, after its warmup and so on). Is that right?

Right.  You will need room for two entries (one for the current
searcher and one for the warming searcher).
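
As a rough back-of-the-envelope sketch of what "room for two entries" can mean: the helper below just multiplies caller-supplied numbers. The class name FieldCacheHeapEstimate, the document count, and the per-entry byte cost are all made-up illustration values, not measurements and not part of Lucene or Solr.

```java
// Hypothetical helper: estimates heap needed for a per-reader String[] cache.
// All inputs are assumptions supplied by the caller; nothing here queries Lucene.
class FieldCacheHeapEstimate {

    /**
     * @param maxDoc        number of document slots in the index
     * @param bytesPerEntry assumed average heap cost of one cached value,
     *                      including the array slot and the String object
     * @param liveSearchers searchers alive at once (2 while a new one warms)
     */
    static long estimateBytes(long maxDoc, long bytesPerEntry, int liveSearchers) {
        return maxDoc * bytesPerEntry * liveSearchers;
    }

    public static void main(String[] args) {
        // Illustration only: 10 million docs at an assumed 64 bytes per ID.
        long bytes = estimateBytes(10_000_000L, 64L, 2);
        System.out.println(bytes / (1024 * 1024) + " MiB");
    }
}
```

The real per-entry cost depends on the JVM and on the length of the ID strings, so it is worth measuring on the actual index rather than trusting a guess like the 64 bytes used here.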

-Yonik

Re: Solr and FieldCache

Posted by Walter Ferrara <wa...@gmail.com>.
About the stored/indexed difference: ID is a string (solr.StrField), so
FieldCache gives me what I need.

I'm just wondering, as this cached object could be (theoretically)
pretty big, do I need to worry about OOM errors? I know that FieldCache
uses weak maps, so I presume the cached array for the older reader(s) will
be GC-ed when the reader is no longer referenced (i.e. when Solr loads
the new one, after its warmup and so on). Is that right?

Thanks

Re: Solr and FieldCache

Posted by "J.J. Larrea" <jj...@panix.com>.
At 5:30 PM +0200 9/20/07, Walter Ferrara wrote:
>I have an index with several fields, but just one stored: ID (string,
>unique).
>I need to access that ID field for each of the top "nodes" docs in my
>results (this is done inside a handler I wrote). The code looks like:
>
>     Hits hits = searcher.search(query);
>     for(int i=0; i<nodes; i++) {
>            id[i]=hits.doc(i).get("ID");
>            score[i]=hits.score(i);
>     }
>
>I noticed that retrieving the ID this way is slow.
>
>If I use the FieldCache instead, like:
>id[i]=FieldCache.DEFAULT.getStrings(searcher.getReader(),
>"ID")[hits.id(i)];

I assume you're putting FieldCache.DEFAULT.getStrings(searcher.getReader(),
"ID") in an array outside the loop, saving 2 redundant method calls per iteration.

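A minimal sketch of that hoisting, using only the JDK so it runs on its own. The getStrings() method here is a hypothetical stand-in for FieldCache.DEFAULT.getStrings(searcher.getReader(), "ID"), and docNums plays the role of the hits.id(i) values; neither is the real Lucene API.

```java
class HoistedLookup {
    static int lookups = 0;

    // Hypothetical stand-in for FieldCache.DEFAULT.getStrings(reader, "ID"):
    // returns one value per internal doc number and counts how often it is called.
    static String[] getStrings() {
        lookups++;
        return new String[] {"id-a", "id-b", "id-c", "id-d"};
    }

    // docNums plays the role of hits.id(i) for the top results.
    static String[] idsFor(int[] docNums) {
        String[] all = getStrings();           // single lookup, outside the loop
        String[] ids = new String[docNums.length];
        for (int i = 0; i < docNums.length; i++) {
            ids[i] = all[docNums[i]];          // plain array access per hit
        }
        return ids;
    }

    public static void main(String[] args) {
        String[] ids = idsFor(new int[] {2, 0, 3});
        System.out.println(String.join(",", ids) + " lookups=" + lookups);
    }
}
```

However many hits are processed, the lookup counter goes up by one per call to idsFor, which is the point of keeping the array reference outside the loop.
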
>then after the first execution (initializing the cache takes some
>time), it seems to run much faster.

Do note that FieldCache.DEFAULT is caching the indexed values, not the stored values.  Since your field is an ID you are probably indexing it in such a way that both are identical, e.g. with KeywordTokenizer, so you're not seeing a difference.

>But what happens when Solr reloads the index (after a commit or an
>optimize, for example)?
>Will it refresh the cache with the new reader (in the warmup process?), or
>will the first query execution of that code (with the new reader)
>force the refresh? (This could mean that every first query
>after a reload will be slower.)

It is refreshed by Lucene the first time the FieldCache array is requested from the new IndexReader.
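
To illustrate the behaviour described here, below is a simplified sketch of a cache keyed weakly on the reader and filled lazily on first request. It is not Lucene's implementation: PerReaderCache and its nested Reader class are hypothetical stand-ins built on java.util.WeakHashMap, and the "expensive scan" is just an array copy.

```java
import java.util.Map;
import java.util.WeakHashMap;

class PerReaderCache {
    // Hypothetical stand-in for an IndexReader; holds the ID of each document.
    static final class Reader {
        final String[] ids;
        Reader(String... ids) { this.ids = ids; }
    }

    // Weak keys: an entry becomes collectable once its reader is unreferenced.
    private final Map<Reader, String[]> cache = new WeakHashMap<>();
    int loads = 0; // how many times the expensive fill has run

    synchronized String[] getStrings(Reader reader) {
        String[] values = cache.get(reader);
        if (values == null) {
            loads++;                       // first request for this reader
            values = reader.ids.clone();   // stands in for scanning the index
            cache.put(reader, values);
        }
        return values;
    }

    public static void main(String[] args) {
        PerReaderCache fc = new PerReaderCache();
        Reader oldReader = new Reader("a", "b");
        fc.getStrings(oldReader);
        fc.getStrings(oldReader);          // served from the cache
        Reader newReader = new Reader("a", "b", "c");
        fc.getStrings(newReader);          // new reader, so the fill runs again
        System.out.println("loads=" + fc.loads);
    }
}
```

The cached array deliberately holds no reference back to its reader; in a WeakHashMap a value that strongly references its own key would keep the entry from ever being collected.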

>Is there any way to tell Solr to cache and warm up this "ID" field
>when needed?

Absolutely, just put a warmup query in solrconfig.xml that makes a request which invokes FieldCache.DEFAULT.getStrings on that field.

Simplest would probably be to invoke your custom handler, perhaps passing arguments that limit it to only processing one document to limit the data which gets cached; since getStrings returns the entire array, one pass through your loop is fine.

If that's not easy with your handler, you could achieve the same effect by setting up a handler which facets on the ID field, sorting by ID (facet.sort=false), and only asks for a single value (facet.limit=1) (the entire id[docid] array will get scanned to count references to that ID, but that ensures it gets paged in).
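
The idea behind warming can be sketched in plain Java: touch the cache for the new reader before that reader is published to live requests, so no user query pays for the fill. This is only an illustration of the ordering, not how Solr is implemented; WarmBeforeSwap, Reader, register and idOf are hypothetical names.

```java
import java.util.Map;
import java.util.WeakHashMap;

class WarmBeforeSwap {
    // Hypothetical stand-in for an IndexReader.
    static final class Reader {
        final String[] ids;
        Reader(String... ids) { this.ids = ids; }
    }

    private final Map<Reader, String[]> cache = new WeakHashMap<>();
    private volatile Reader current;   // reader that live requests use
    int loads = 0;                     // number of expensive cache fills

    synchronized String[] getStrings(Reader reader) {
        String[] values = cache.get(reader);
        if (values == null) {
            loads++;
            values = reader.ids.clone();   // stands in for scanning the index
            cache.put(reader, values);
        }
        return values;
    }

    // Plays the role of the warmup query: fill the cache, then publish.
    void register(Reader fresh) {
        getStrings(fresh);
        current = fresh;
    }

    // What a live request does; finds the array already cached.
    String idOf(int docNum) {
        return getStrings(current)[docNum];
    }

    public static void main(String[] args) {
        WarmBeforeSwap w = new WarmBeforeSwap();
        w.register(new Reader("x", "y"));
        System.out.println(w.idOf(1) + " loads=" + w.loads);
    }
}
```

While register is filling the cache for the new reader, the old reader's array is still referenced, which is why two entries need to fit in memory at the same time.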

- J.J.

Re: Solr and FieldCache

Posted by Chris Hostetter <ho...@fucit.org>.
:      Hits hits = searcher.search(query);
:      for(int i=0; i<nodes; i++) {
:             id[i]=hits.doc(i).get("ID");
:             score[i]=hits.score(i);

Please note that because of your use of the Hits class, you are not
getting any of the benefits of Solr: no DocSet, DocList, or Document
caching.  Depending on the value of the "nodes" variable, you may even be
re-executing the query multiple times (and none of those attempts will be
cached).

You would probably be *much* happier if you used a method returning a
DocSet to get the docIds and score ... you might even find that the
FieldSelector methods are faster for getting the "ID" field than using the
FieldCache (but maybe not).



-Hoss