Posted to solr-user@lucene.apache.org by Eoghan Ó Carragáin <eo...@gmail.com> on 2013/08/15 21:58:32 UTC

Large cache settings values - sanity check

Hi,

I’m involved in an open source project called Vufind which uses Solr to
search across library catalogue records [1].

The project uses what seem to be very high default cache settings in
solrconfig.xml [2]:

   - filterCache (size="300000" initialSize="300000" autowarmCount="50000")
   - queryResultCache (size="100000" initialSize="100000" autowarmCount="50000")
   - documentCache (size="50000" initialSize="50000")
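
In solrconfig.xml terms that corresponds to something like the following
(the cache class attributes here are just the common ones; the actual file
[2] may use different classes):

   <filterCache class="solr.FastLRUCache"
                size="300000"
                initialSize="300000"
                autowarmCount="50000"/>
   <queryResultCache class="solr.LRUCache"
                     size="100000"
                     initialSize="100000"
                     autowarmCount="50000"/>
   <documentCache class="solr.LRUCache"
                  size="50000"
                  initialSize="50000"/>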


These settings haven’t been reviewed since early in the project history (c.
2007) but came up in a recent discussion around out-of-memory issues and
garbage collection.

Of course decisions on cache configuration (along with jvm settings,
sharding etc) vary depending on the instance (index size, query/sec etc),
but I wanted to run these values past this list as a sanity check for what
you’d consider good default settings, given that most adopters of the
software will not touch the defaults.

Some characteristics of library data & Vufind’s schema [3] which may have a
bearing on the issue:

   - quite a few facet fields & filtering (~12 facets configured by default)
   - a high number of unique facet values (e.g. several hundred thousand
     values in a facet field for authors or subjects)
   - most libraries would do only one or two incremental commits a day
     (which may justify high auto-warming settings, since the next commit
     isn’t for 24 hours)
   - sorting: relevance by default, but other options configured by default
     (title, author, callnumber, year, etc.)
   - mostly small, sparse documents (MARC records containing title, author,
     description etc. but no full-text content)
   - quite a few stored fields, including a field which stores the full MARC
     record for additional parsing by the application
   - average number of documents for most adopters probably somewhere
     between 500K and 2 million MARC records (Vufind has several adopters
     with up to 50m full-text docs, but these make considerable
     customisations to their Solr setup)
   - query/sec will vary from library to library, but shouldn't be anything
     too taxing for most adopters


Do the current cache settings make sense in this context, or should we
consider dropping back to the much lower values given in the Solr example
and wiki?

Many thanks

Eoghan


[1] vufind.org

[2]
https://github.com/vufind-org/vufind/blob/master/solr/biblio/conf/solrconfig.xml
[3]
https://github.com/vufind-org/vufind/blob/master/solr/biblio/conf/schema.xml

Re: Large cache settings values - sanity check

Posted by Erick Erickson <er...@gmail.com>.
Waaaay too high :). Hmmmm, not much detail there...

bq:   filterCache (size="300000" initialSize="300000"
autowarmCount="50000"),

This is an OOM waiting to happen. Each filterCache entry is a key/value
pair. The key is the fq clause, but the value is a bitmap of all the docs in
the index, i.e. maxDoc/8 bytes.

So for a 32M doc corpus, each entry is about 4MB, and with size="300000"
the filter cache could potentially reach 300,000 x 4MB = 1.2TB if I've
done my math right. The entries aren't allocated until an fq is used, so
at startup it's not very big.

Then it gets worse. You say "most libraries would do only
one or two incremental commits a day". So the filter cache will gradually
accumulate entries until the next commit that opens a new searcher
(openSearcher=true in 4.0, any hard commit in 3.x), leading to
unpredictable OOMs.

I claim you can create an algorithmic query generator that keeps adding
unique fq clauses and crash your app at will.

And when you _do_ do a commit, the most recent 50,000 fq clauses are
re-executed, leading to what I suspect are very long startup times.

As I wrote in another context, evictions and hit ratios are the key
statistics. Drop these WAAAAAY back and monitor these to see what
the sizes _should_ be. If you have no evictions, it's probably too large. If
you have lots of evictions _and_ the hit ratio is small (< 75% or so) then
think of making it larger.
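
For scale, the stock Solr example config starts with something on this
order (a sketch, not a prescription; grow it only if the eviction and
hit-ratio numbers say so):

   <filterCache class="solr.FastLRUCache"
                size="512"
                initialSize="512"
                autowarmCount="0"/>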

If you're doing date-based fq clauses, beware of NOW clauses, see:
http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
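
The gist: a bare NOW is millisecond-precise, so every request creates a
brand-new filterCache entry. Rounding it lets repeated requests share one
entry (the field name below is just an example, not necessarily one of
yours):

   fq=publishDate:[NOW/DAY-1YEAR TO NOW/DAY]    <- cache-friendly
   fq=publishDate:[NOW-1YEAR TO NOW]            <- new cache entry every time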


bq: queryResultCache (size="100000" initialSize="100000"
 autowarmCount="50000")

Not as big an offender as the filterCache, but still. The usual case for
this cache is when a user pages through results. But the autowarmCount is
huge: again, whenever a new searcher is opened, the last 50,000 queries
will be re-executed before the searcher handles new queries.

The size here isn't as bad as the filterCache, but it's still far too large
IMO. The key is the query and the value is a couple of window-sizes' worth
of ints (where the window size is, say, 20). But the autowarmCount is a
killer.

Again, look at the admin page for this cache and its hit ratio. The hit
ratio is usually quite small, so this cache is often (but you need to
measure it) not all that valuable.
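
As a rough sketch of a more modest configuration (values illustrative;
queryResultWindowSize is the related setting that controls how many
contiguous results each cache entry holds):

   <queryResultCache class="solr.LRUCache"
                     size="512"
                     initialSize="512"
                     autowarmCount="32"/>
   <queryResultWindowSize>20</queryResultWindowSize>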

I think the documentCache is also kind of big; the usual recommendation is
something like (max simultaneous queries) * (&rows parameter) as a starting
point.
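
For example, assuming something like 50 simultaneous queries and rows=20
(both numbers are guesses you'd replace with your own traffic figures),
that rule of thumb gives roughly 1,000 entries:

   <!-- ~50 concurrent queries * rows=20 => ~1000 docs; adjust to taste -->
   <documentCache class="solr.LRUCache"
                  size="1024"
                  initialSize="1024"
                  autowarmCount="0"/>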

Best
Erick

