Posted to solr-user@lucene.apache.org by Jay Luker <lb...@reallywow.com> on 2010/10/27 16:39:44 UTC

documentCache clarification

Hi all,

The solr wiki says this about the documentCache: "The more fields you
store in your documents, the higher the memory usage of this cache
will be."

OK, but if i have enableLazyFieldLoading set to true and in my request
parameters specify "fl=id", then the number of fields per document
shouldn't affect the memory usage of the document cache, right?

Thanks,
--jay

RE: documentCache clarification

Posted by Jonathan Rochkind <ro...@jhu.edu>.

This is a great explanation, thanks.  I'm going to add it to the wiki somewhere that seems relevant, if no-one minds and the wiki lets me. 
________________________________________
From: Chris Hostetter [hossman_lucene@fucit.org]
Sent: Thursday, October 28, 2010 7:27 PM
To: solr-user@lucene.apache.org
Subject: Re: documentCache clarification

: the documentCache: "(Note: This cache cannot be used as a source for
: autowarming because document IDs will change when anything in the
: index changes so they can't be used by a new searcher.)"
:
: Can anyone elaborate a bit on that. I think I've read it at least 10
: times and I'm still unable to draw a mental picture. I'm wondering if
: the document IDs referred to are the ones I'm defining in my schema,
: or are they the underlying lucene ids, i.e. the ones that, according
: to the Lucene in Action book, are "relative within each segment"?

they are the underlying lucene docIds that change as segments are merged.

: queryResultCache. However, if I issue a request with rows=10, I will
: get an insert, and then a later request for rows=500 would re-use and
: update that original cached docList object. Right? And would it be
: updated with the full list of 500 ordered doc ids or only 200?

not quite.

The queryResultCache is keyed on <Query,Sort,Start,Rows,Filters> and the
value is a "DocList" object ...

http://lucene.apache.org/solr/api/org/apache/solr/search/DocList.html

Unlike the Document objects in the documentCache, the DocLists in the
queryResultCache never get modified (technically Solr doesn't actually
modify the Documents either, the Document just keeps track of its fields
and updates itself as Lazy Load fields are needed)

if a DocList containing results 0-10 is put in the cache, it's not
going to be of any use for a query with start=50.  but if it contains 0-50
it *can* be used if start + rows <= 50 -- that's where the
queryResultWindowSize comes in.  if you use start=0&rows=10, but your
window size is 50, SolrIndexSearcher will (under the covers) use
start=0&rows=50 and put that in the cache, returning a "slice" from 0-10
for your query.  the next query asking for 10-20 will be a cache hit.

-Hoss

Re: documentCache clarification

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Fri, Oct 29, 2010 at 4:21 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : > Why don't we just include the start & rows (modulo the window size) in
> : > the cache key?
> :
> : The implementation of equals() would be rather difficult... actually
> : impossible w/o abusing the semantics.
> : It would also be impossible w/o the Map implementation guaranteeing
> : what object was on the LHS vs the RHS when equals was called.
> :
> : Unless I'm missing something obvious?
>
> You've totally confused me.
>
> What i'm saying is that SolrIndexSearcher should consult the window size
> before consulting the cache -- the start param should be rounded down to
> the nearest multiple of the window size, and start+rows (ie: end) should
> be rounded up to one less than the nearest multiple of the window size,
> and then that should be looked up in the cache.

That's already done.
In "example", do
q=*:*&rows=12
q=*:*&rows=16
and you should see a queryResultCache hit since queryResultWindowSize
is 20 and both requests round up to that.
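
To make the rounding concrete, here is a minimal sketch, assuming the
searcher bumps the number of results a request needs (start+rows) up to
the next multiple of queryResultWindowSize. The class and method names are
made up for illustration and are not Solr's API.

```java
// Hypothetical helper, not part of Solr: rounds the number of results a
// request needs (start + rows) up to the next multiple of the window size.
final class WindowRoundingSketch {
    static int roundUpToWindow(int nResults, int windowSize) {
        if (nResults <= 0) {
            return windowSize; // assumption: always collect at least one window
        }
        return ((nResults - 1) / windowSize + 1) * windowSize;
    }

    public static void main(String[] args) {
        int window = 20; // the window size mentioned for "example"
        System.out.println(roundUpToWindow(12, window)); // 20
        System.out.println(roundUpToWindow(16, window)); // 20
        System.out.println(roundUpToWindow(25, window)); // 40
    }
}
```

With a window of 20, rows=12 and rows=16 land on the same rounded size,
while rows=25 spills into the next window.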

*but* if you do this (with an index with more than 20  docs in it)
q=*:*&rows=25

Currently that query will round up to 40, but since nResults
(start+rows) isn't in the key, it will still get a cache hit but then
not be usable.

Now, if your proposal is to put nResults into the key, we then have a
worse problem.
Assume we're starting over with a clean cache.
q=*:*&rows=25   // cached under a key including nResults=40
q=*:*&rows=15  // looked up under a key including nResults=20... not found!
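
The miss in that scenario can be reproduced with a toy model. This is only
a sketch under the assumption that the key is the query plus the
window-rounded nResults; the class, the string key format, and the use of
a HashMap are illustrative and not how Solr builds its keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not Solr code: a cache whose key includes the
// window-rounded nResults, as in the proposal being discussed.
final class NResultsKeySketch {
    static int roundUpToWindow(int nResults, int windowSize) {
        return ((Math.max(nResults, 1) - 1) / windowSize + 1) * windowSize;
    }

    // Key is "query|roundedNResults" (start is assumed to be 0 here).
    static String key(String query, int rows, int windowSize) {
        return query + "|" + roundUpToWindow(rows, windowSize);
    }

    public static void main(String[] args) {
        int window = 20;
        Map<String, Integer> cache = new HashMap<>();
        // rows=25 is cached under a key including nResults=40
        cache.put(key("*:*", 25, window), 40);
        // rows=15 is looked up under a key including nResults=20
        boolean found = cache.containsKey(key("*:*", 15, window));
        System.out.println(found); // false, although 40 cached ids would cover rows=15
    }
}
```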

> but that's why people are supposed to pick a window size greater
> than the largest number of rows typically requested)

Hmmm, I don't think so.  If that were the case, there would be no need
for two parameters (no need for queryResultWindowSize) since we would
always just pick queryResultMaxDocsCached.

-Yonik
http://www.lucidimagination.com

Re: documentCache clarification

Posted by Chris Hostetter <ho...@fucit.org>.

: > Why don't we just include the start & rows (modulo the window size) in
: > the cache key?
: 
: The implementation of equals() would be rather difficult... actually
: impossible w/o abusing the semantics.
: It would also be impossible w/o the Map implementation guaranteeing
: what object was on the LHS vs the RHS when equals was called.
: 
: Unless I'm missing something obvious?

You've totally confused me.

What i'm saying is that SolrIndexSearcher should consult the window size 
before consulting the cache -- the start param should be rounded down to 
the nearest multiple of the window size, and start+rows (ie: end) should
be rounded up to one less than the nearest multiple of the window size,
and then that should be looked up in the cache.

equality on the cache key is straightforward...
   this.q==that.q && this.start==that.start && this.end==that.end && 
   this.sort == that.sort && this.filters == that.filters

so if the window size is "50" and SolrIndexSearcher gets a request like
q=x&start=33&rows=10&sort=y&fq=... it should
generate a cache key where start=0 and end=49.  (if start=33&rows=42, then
the key would contain start=0 and end=99 ... which could result in some
overlap, but that's why people are supposed to pick a window size greater
than the largest number of rows typically requested)
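
A minimal sketch of such a key, assuming query, sort and filters can be
compared as plain strings. The class below is hypothetical and is not
Solr's actual key class; it only illustrates aligning start and end to the
window before equals() and hashCode() are applied.

```java
import java.util.Objects;

// Hypothetical key class, not Solr's: start and end are aligned to the
// window before the cache is consulted.
final class WindowAlignedKeySketch {
    final String q, sort, filters;
    final int start, end;

    WindowAlignedKeySketch(String q, String sort, String filters,
                           int start, int rows, int windowSize) {
        this.q = q;
        this.sort = sort;
        this.filters = filters;
        // round start down to a multiple of the window size
        this.start = (start / windowSize) * windowSize;
        // round start+rows up to a multiple of the window size, minus one
        int nResults = Math.max(start + rows, 1);
        this.end = ((nResults - 1) / windowSize + 1) * windowSize - 1;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WindowAlignedKeySketch)) return false;
        WindowAlignedKeySketch that = (WindowAlignedKeySketch) o;
        return start == that.start && end == that.end
            && Objects.equals(q, that.q)
            && Objects.equals(sort, that.sort)
            && Objects.equals(filters, that.filters);
    }

    @Override
    public int hashCode() {
        return Objects.hash(q, sort, filters, start, end);
    }

    public static void main(String[] args) {
        WindowAlignedKeySketch k =
            new WindowAlignedKeySketch("x", "y", "fq", 33, 10, 50);
        System.out.println(k.start + "-" + k.end); // 0-49
    }
}
```

Two requests whose ranges fall inside the same window produce equal keys,
so the second one can find the entry stored by the first.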



-Hoss

Re: documentCache clarification

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Fri, Oct 29, 2010 at 3:49 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : This is a limitation in the SolrCache API.
> : The key into the cache does not contain rows, so the cache returns the
> : first 10 docs and increments its hit count.  Then the cache user
> : (SolrIndexSearcher) looks at the entry and determines it can't use it.
>
> Wow, I never realized that.
>
> Why don't we just include the start & rows (modulo the window size) in
> the cache key?

The implementation of equals() would be rather difficult... actually
impossible w/o abusing the semantics.
It would also be impossible w/o the Map implementation guaranteeing
what object was on the LHS vs the RHS when equals was called.

Unless I'm missing something obvious?

-Yonik
http://www.lucidimagination.com

Re: documentCache clarification

Posted by Chris Hostetter <ho...@fucit.org>.

: This is a limitation in the SolrCache API.
: The key into the cache does not contain rows, so the cache returns the
: first 10 docs and increments its hit count.  Then the cache user
: (SolrIndexSearcher) looks at the entry and determines it can't use it.

Wow, I never realized that.

Why don't we just include the start & rows (modulo the window size) in 
the cache key?

-Hoss

Re: documentCache clarification

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Fri, Oct 29, 2010 at 2:31 PM, Jay Luker <lb...@reallywow.com> wrote:
> This makes sense but still doesn't explain what I'm seeing in my cache
> stats. When I issue a request with rows=10 the stats show an insert
> into the queryResultCache. If I send the same query, this time with
> rows=1000, I would not expect to see a cache hit but I do.

This is a limitation in the SolrCache API.
The key into the cache does not contain rows, so the cache returns the
first 10 docs and increments its hit count.  Then the cache user
(SolrIndexSearcher) looks at the entry and determines it can't use it.
 One way to fix this would be to add a method that says "that was
actually a miss" to the cache API.
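
The effect on the statistics can be shown with a toy cache. This is a
sketch only: the class is hypothetical and not the SolrCache API, and the
usability check ignores the case where the query matches fewer documents
than were requested.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not the SolrCache API: the key omits rows, so the
// cache reports a hit before the caller knows whether the entry is usable.
final class CountedHitSketch {
    private final Map<String, int[]> entries = new HashMap<>();
    long hits = 0;

    void put(String queryKey, int[] docIds) {
        entries.put(queryKey, docIds);
    }

    // The cache only sees the key, so any entry found counts as a hit.
    int[] get(String queryKey) {
        int[] cached = entries.get(queryKey);
        if (cached != null) hits++;
        return cached;
    }

    // The caller's check: does the cached list cover start..start+rows?
    static boolean usable(int[] cached, int start, int rows) {
        return cached != null && cached.length >= start + rows;
    }

    public static void main(String[] args) {
        CountedHitSketch cache = new CountedHitSketch();
        cache.put("q=foo", new int[10]);           // first request, rows=10
        int[] cached = cache.get("q=foo");         // second request, rows=1000
        System.out.println(cache.hits);            // 1
        System.out.println(usable(cached, 0, 1000)); // false
    }
}
```

The hit counter goes up even though the caller has to run the query again,
which matches the statistics described in this thread.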

-Yonik
http://www.lucidimagination.com

Re: documentCache clarification

Posted by Jay Luker <lb...@reallywow.com>.

On Thu, Oct 28, 2010 at 7:27 PM, Chris Hostetter
<ho...@fucit.org> wrote:

> The queryResultCache is keyed on <Query,Sort,Start,Rows,Filters> and the
> value is a "DocList" object ...
>
> http://lucene.apache.org/solr/api/org/apache/solr/search/DocList.html
>
> Unlike the Document objects in the documentCache, the DocLists in the
> queryResultCache never get modified (technically Solr doesn't actually
> modify the Documents either, the Document just keeps track of its fields
> and updates itself as Lazy Load fields are needed)
>
> if a DocList containing results 0-10 is put in the cache, it's not
> going to be of any use for a query with start=50.  but if it contains 0-50
> it *can* be used if start + rows <= 50 -- that's where the
> queryResultWindowSize comes in.  if you use start=0&rows=10, but your
> window size is 50, SolrIndexSearcher will (under the covers) use
> start=0&rows=50 and put that in the cache, returning a "slice" from 0-10
> for your query.  the next query asking for 10-20 will be a cache hit.

This makes sense but still doesn't explain what I'm seeing in my cache
stats. When I issue a request with rows=10 the stats show an insert
into the queryResultCache. If I send the same query, this time with
rows=1000, I would not expect to see a cache hit but I do. So it seems
like there must be something useful in whatever gets cached on the
first request for rows=10 for it to be re-used by the request for
rows=1000.

--jay

Re: documentCache clarification

Posted by Chris Hostetter <ho...@fucit.org>.

: the documentCache: "(Note: This cache cannot be used as a source for
: autowarming because document IDs will change when anything in the
: index changes so they can't be used by a new searcher.)"
: 
: Can anyone elaborate a bit on that. I think I've read it at least 10
: times and I'm still unable to draw a mental picture. I'm wondering if
: the document IDs referred to are the ones I'm defining in my schema,
: or are they the underlying lucene ids, i.e. the ones that, according
: to the Lucene in Action book, are "relative within each segment"?

they are the underlying lucene docIds that change as segments are merged.

: queryResultCache. However, if I issue a request with rows=10, I will
: get an insert, and then a later request for rows=500 would re-use and
: update that original cached docList object. Right? And would it be
: updated with the full list of 500 ordered doc ids or only 200?

not quite.

The queryResultCache is keyed on <Query,Sort,Start,Rows,Filters> and the 
value is a "DocList" object ...

http://lucene.apache.org/solr/api/org/apache/solr/search/DocList.html

Unlike the Document objects in the documentCache, the DocLists in the
queryResultCache never get modified (technically Solr doesn't actually
modify the Documents either, the Document just keeps track of its fields
and updates itself as Lazy Load fields are needed)

if a DocList containing results 0-10 is put in the cache, it's not 
going to be of any use for a query with start=50.  but if it contains 0-50 
it *can* be used if start + rows <= 50 -- that's where the
queryResultWindowSize comes in.  if you use start=0&rows=10, but your 
window size is 50, SolrIndexSearcher will (under the covers) use 
start=0&rows=50 and put that in the cache, returning a "slice" from 0-10 
for your query.  the next query asking for 10-20 will be a cache hit.
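
The window behaviour can be sketched as follows, assuming a cached list is
usable whenever it covers the requested range. The class is hypothetical,
and the doc ids are just 0..n-1 standing in for a ranked result list.

```java
import java.util.Arrays;

// Hypothetical model, not Solr code: collect a full window once, then
// answer later pages of the same query from the cached array.
final class WindowSliceSketch {
    // Stand-in for running the query: doc ids are simply 0..n-1 in rank order.
    static int[] collect(int n) {
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) ids[i] = i;
        return ids;
    }

    // Returns rows ids beginning at start, or null when the cached list is too short.
    static int[] slice(int[] cached, int start, int rows) {
        if (start + rows > cached.length) return null;
        return Arrays.copyOfRange(cached, start, start + rows);
    }

    public static void main(String[] args) {
        int[] cached = collect(50);                  // start=0&rows=10, window size 50
        System.out.println(slice(cached, 0, 10).length);  // 10
        System.out.println(slice(cached, 10, 10)[0]);     // 10, served from the cache
        System.out.println(slice(cached, 50, 10) == null); // true, outside the window
    }
}
```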


-Hoss

Re: documentCache clarification

Posted by Jay Luker <lb...@reallywow.com>.

On Wed, Oct 27, 2010 at 9:13 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : schema.) My evidence for this is the documentCache stats reported by
> : solr/admin. If I request "rows=10&fl=id" followed by
> : "rows=10&fl=id,title" I would expect to see the 2nd request result in
> : a 2nd insert to the cache, but instead I see that the 2nd request hits
> : the cache from the 1st request. "rows=10&fl=*" does the same thing.
>
> your evidence is correct, but your interpretation is incorrect.
>
> the objects in the documentCache are lucene Documents, which contain a
> List of Field references.  when enableLazyFieldLoading=true is set and
> there is a documentCache, the Document fetched from the IndexReader only
> contains the Fields specified in the fl, and all other Fields are marked
> as "LOAD_LAZY".
>
> When there is a cache hit on that uniqueKey at a later date, the Fields
> already loaded are used directly if requested, but the Fields marked
> LOAD_LAZY are (you guessed it) lazy loaded from the IndexReader and then
> the Document updates the reference to the newly actualized fields (which
> are no longer marked LOAD_LAZY)
>
> So with different "fl" params, the same Document Object is continually
> used, but the Fields in that Document grow as the fields requested (using
> the "fl" param) change.

Great stuff. Makes sense. Thanks for the clarification, and if no one
objects I'll update the wiki with some of this info.

I'm still not clear on this statement from the wiki's description of
the documentCache: "(Note: This cache cannot be used as a source for
autowarming because document IDs will change when anything in the
index changes so they can't be used by a new searcher.)"

Can anyone elaborate a bit on that. I think I've read it at least 10
times and I'm still unable to draw a mental picture. I'm wondering if
the document IDs referred to are the ones I'm defining in my schema,
or are they the underlying lucene ids, i.e. the ones that, according
to the Lucene in Action book, are "relative within each segment"?

> : will *not* result in an insert to queryResultCache. I have tried
> : various increments--10, 100, 200, 500--and it seems the magic number
> : is somewhere between 200 (cache insert) and 500 (no insert). Can
> : someone explain this?
>
> In addition to the <queryResultMaxDocsCached> config option already
> mentioned (which controls whether a DocList is cached based on its size)
> there is also the <queryResultWindowSize> config option which may confuse
> your cache observations.  if the window size is "50" and you ask for
> start=0&rows=10 what actually gets cached is "0-50" (assuming there are
> more than 50 results) so a subsequent request for start=10&rows=10 will be
> a cache hit.

Just so I'm clear, does the queryResultCache operate in a similar
manner as the documentCache as to what is actually cached? In other
words, is it the caching of the docList object that is reported in the
cache statistics hits/inserts numbers? And that object would get
updated with a new set of ordered doc ids on subsequent, larger
requests. (I'm flailing a bit to articulate the question, I know). For
example, if my queryResultMaxDocsCached is set to 200 and I issue a
request with rows=500, then I won't get a docList object entry in the
queryResultCache. However, if I issue a request with rows=10, I will
get an insert, and then a later request for rows=500 would re-use and
update that original cached docList object. Right? And would it be
updated with the full list of 500 ordered doc ids or only 200?

Thanks,
--jay

Re: documentCache clarification

Posted by Chris Hostetter <ho...@fucit.org>.

: schema.) My evidence for this is the documentCache stats reported by
: solr/admin. If I request "rows=10&fl=id" followed by
: "rows=10&fl=id,title" I would expect to see the 2nd request result in
: a 2nd insert to the cache, but instead I see that the 2nd request hits
: the cache from the 1st request. "rows=10&fl=*" does the same thing.

your evidence is correct, but your interpretation is incorrect.

the objects in the documentCache are lucene Documents, which contain a
List of Field references.  when enableLazyFieldLoading=true is set and
there is a documentCache, the Document fetched from the IndexReader only
contains the Fields specified in the fl, and all other Fields are marked
as "LOAD_LAZY".

When there is a cache hit on that uniqueKey at a later date, the Fields
already loaded are used directly if requested, but the Fields marked
LOAD_LAZY are (you guessed it) lazy loaded from the IndexReader and then
the Document updates the reference to the newly actualized fields (which
are no longer marked LOAD_LAZY)

So with different "fl" params, the same Document Object is continually 
used, but the Fields in that Document grow as the fields requested (using 
the "fl" param) change.
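
A rough model of that behaviour, for illustration only: the class below is
hypothetical and does not use Lucene's Document or Field classes. It
assumes one cached object per document, with a map standing in for the
stored fields on disk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not Lucene's Document/Field classes: one cached
// object per document whose loaded fields grow as requests ask for more.
final class LazyDocSketch {
    private final Map<String, String> stored;   // stand-in for the index on disk
    private final Map<String, String> loaded = new HashMap<>();
    int diskReads = 0;

    LazyDocSketch(Map<String, String> stored) {
        this.stored = stored;
    }

    // Returns a field value, reading it from "disk" only the first time.
    String get(String field) {
        if (!loaded.containsKey(field)) {
            diskReads++;
            loaded.put(field, stored.get(field));
        }
        return loaded.get(field);
    }

    int loadedFieldCount() {
        return loaded.size();
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("id", "1");
        stored.put("title", "some title");
        LazyDocSketch doc = new LazyDocSketch(stored);
        doc.get("id");                               // fl=id
        System.out.println(doc.loadedFieldCount());  // 1
        doc.get("id");
        doc.get("title");                            // fl=id,title on the same cached object
        System.out.println(doc.loadedFieldCount());  // 2
    }
}
```

The second request is a cache hit on the same object, yet the memory held
by that object grows with the fields that have been requested so far.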

: will *not* result in an insert to queryResultCache. I have tried
: various increments--10, 100, 200, 500--and it seems the magic number
: is somewhere between 200 (cache insert) and 500 (no insert). Can
: someone explain this?

In addition to the <queryResultMaxDocsCached> config option already 
mentioned (which controls whether a DocList is cached based on its size)
there is also the <queryResultWindowSize> config option which may confuse
your cache observations.  if the window size is "50" and you ask for
start=0&rows=10 what actually gets cached is "0-50" (assuming there are
more than 50 results) so a subsequent request for start=10&rows=10 will be
a cache hit.

-Hoss

Re: documentCache clarification

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

(10/10/28 6:32), Jonathan Rochkind wrote:
> Woah, I hadn't known about that. queryResultMaxDocsCached is actually a part of Solr 1.4? Is it
> documented anywhere at all? I guess it is included in the example solrconfig.xml, but is not in my
> own personal solrconfig.xml.

The feature has been available since 1.3. Please see:
https://issues.apache.org/jira/browse/SOLR-291

> Anyone know if it has a default size if left unspecified?

I believe the default is no limit.

Koji

-- 
http://www.rondhuit.com/en/

Re: documentCache clarification

Posted by Jonathan Rochkind <ro...@jhu.edu>.

Woah, I hadn't known about that. queryResultMaxDocsCached is actually a 
part of Solr 1.4?   Is it documented anywhere at all?  I guess it is 
included in the example solrconfig.xml, but is not in my own personal 
solrconfig.xml.

Anyone know if it has a default size if left unspecified?

Shawn Heisey wrote:
> On 10/27/2010 12:17 PM, Jay Luker wrote:
>   
>> A 2nd question: while watching these stats I noticed something else
>> weird with the queryResultCache. It seems that inserts to the
>> queryResultCache depend on the number of rows requested. For example,
>> an initial request (solr restarted, clean cache, etc) with rows=10
>> will result in an insert. A 2nd request of the same query with
>> rows=1000 will result in a cache hit. However if you reverse that
>> order, starting with a clean cache, an initial request for rows=1000
>> will *not* result in an insert to queryResultCache. I have tried
>> various increments--10, 100, 200, 500--and it seems the magic number
>> is somewhere between 200 (cache insert) and 500 (no insert). Can
>> someone explain this?
>>     
>
> Perhaps it's this setting in the <query> section of solrconfig.xml?
>
> <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
>
> See SOLR-291.
>
> Shawn
>
>

Re: documentCache clarification

Posted by Shawn Heisey <so...@elyograg.org>.

On 10/27/2010 12:17 PM, Jay Luker wrote:
> A 2nd question: while watching these stats I noticed something else
> weird with the queryResultCache. It seems that inserts to the
> queryResultCache depend on the number of rows requested. For example,
> an initial request (solr restarted, clean cache, etc) with rows=10
> will result in an insert. A 2nd request of the same query with
> rows=1000 will result in a cache hit. However if you reverse that
> order, starting with a clean cache, an initial request for rows=1000
> will *not* result in an insert to queryResultCache. I have tried
> various increments--10, 100, 200, 500--and it seems the magic number
> is somewhere between 200 (cache insert) and 500 (no insert). Can
> someone explain this?

Perhaps it's this setting in the <query> section of solrconfig.xml?

<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

See SOLR-291.
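
As a sketch of how such a limit would produce the observed pattern,
assuming a result list is inserted only when its size does not exceed
queryResultMaxDocsCached (the class and method are hypothetical, and the
exact comparison Solr performs is an assumption here):

```java
// Hypothetical model, not Solr code: decide whether a collected result list
// is inserted into the queryResultCache.
final class MaxDocsCachedSketch {
    // Assumption: a list is cached only when its size does not exceed the limit.
    static boolean shouldCache(int collectedDocs, int queryResultMaxDocsCached) {
        return collectedDocs <= queryResultMaxDocsCached;
    }

    public static void main(String[] args) {
        int limit = 200; // value from the solrconfig.xml line above
        System.out.println(shouldCache(10, limit));   // true
        System.out.println(shouldCache(200, limit));  // true
        System.out.println(shouldCache(500, limit));  // false
    }
}
```

Under that assumption, a limit of 200 would explain an insert at rows=200
and no insert at rows=500.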

Shawn

Re: documentCache clarification

Posted by Jay Luker <lb...@reallywow.com>.

(btw, I'm running 1.4.1)

It looks like my assumption was wrong. Regardless of the fields
selected using the "fl" parameter and the enableLazyFieldLoading
setting, solr apparently fetches from disk and caches all the fields
in the document (or maybe just those that are stored="true" in my
schema.) My evidence for this is the documentCache stats reported by
solr/admin. If I request "rows=10&fl=id" followed by
"rows=10&fl=id,title" I would expect to see the 2nd request result in
a 2nd insert to the cache, but instead I see that the 2nd request hits
the cache from the 1st request. "rows=10&fl=*" does the same thing.
i.e., the first request, even though I have
enableLazyFieldLoading=true and I'm only asking for the ids, fetches
the entire document from disk and inserts into the documentCache.
Subsequent requests, regardless of which fields I actually select,
don't hit the disk but are loaded from the documentCache. Is this
really the expected behavior and/or am I misunderstanding something?

A 2nd question: while watching these stats I noticed something else
weird with the queryResultCache. It seems that inserts to the
queryResultCache depend on the number of rows requested. For example,
an initial request (solr restarted, clean cache, etc) with rows=10
will result in an insert. A 2nd request of the same query with
rows=1000 will result in a cache hit. However if you reverse that
order, starting with a clean cache, an initial request for rows=1000
will *not* result in an insert to queryResultCache. I have tried
various increments--10, 100, 200, 500--and it seems the magic number
is somewhere between 200 (cache insert) and 500 (no insert). Can
someone explain this?

Thanks,
--jay

On Wed, Oct 27, 2010 at 10:54 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> I've been wondering about this too some time ago. I've found more information
> in SOLR-52 and some correspondence on this one but it didn't give me a
> definitive answer..
>
> [1]: https://issues.apache.org/jira/browse/SOLR-52
> [2]: http://www.mail-archive.com/solr-dev@lucene.apache.org/msg01185.html
>
> On Wednesday 27 October 2010 16:39:44 Jay Luker wrote:
>> Hi all,
>>
>> The solr wiki says this about the documentCache: "The more fields you
>> store in your documents, the higher the memory usage of this cache
>> will be."
>>
>> OK, but if i have enableLazyFieldLoading set to true and in my request
>> parameters specify "fl=id", then the number of fields per document
>> shouldn't affect the memory usage of the document cache, right?
>>
>> Thanks,
>> --jay
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>

Re: documentCache clarification

Posted by Markus Jelsma <ma...@openindex.io>.

I've been wondering about this too some time ago. I've found more information 
in SOLR-52 and some correspondence on this one but it didn't give me a 
definitive answer..

[1]: https://issues.apache.org/jira/browse/SOLR-52
[2]: http://www.mail-archive.com/solr-dev@lucene.apache.org/msg01185.html

On Wednesday 27 October 2010 16:39:44 Jay Luker wrote:
> Hi all,
> 
> The solr wiki says this about the documentCache: "The more fields you
> store in your documents, the higher the memory usage of this cache
> will be."
> 
> OK, but if i have enableLazyFieldLoading set to true and in my request
> parameters specify "fl=id", then the number of fields per document
> shouldn't affect the memory usage of the document cache, right?
> 
> Thanks,
> --jay

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350