Posted to solr-user@lucene.apache.org by aaronireland <aa...@gmail.com> on 2014/11/02 13:49:46 UTC

Solr filterCache and autoWarming memory requirements

I have a Solr server set up on CentOS that's being queried from a Flask app in
a very specific, controlled way. Basically, I have a large amount (200 million
records) of largely static name/address data, along with an internal record ID
field and a few integer fields. I'm running 50 threads that need to do a
search on name/address/birth-date and return an ID value and an integer
modeling score as quickly as possible.
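
For context, here's roughly what each thread's request looks like (a minimal
sketch; the host, core name, and use of the requests library are my own
assumptions, and the field names come from the schema below):

    import requests  # assumption: plain HTTP against Solr's select handler

    SOLR_SELECT = "http://localhost:8983/solr/people/select"  # host/core made up

    def match_person(first, last, city, state, year, month, day):
        # Query on name/city, filter on state and birth date, and ask for
        # only the two fields the app actually needs back.
        params = {
            "q": 'first_name:"%s" AND last_name:"%s" AND city:"%s"'
                 % (first, last, city),
            "fq": ["state:%s" % state,
                   "birth_year:%s AND birth_month:%s AND birth_day:%s"
                   % (year, month, day)],
            "fl": "internal_id,score",
            "rows": 1,
            "wt": "json",
        }
        docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
        return docs[0] if docs else None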

Here is the schema.xml information for the fields I'm using:

   <field name="external_id" type="string" indexed="true" stored="false"
required="false" multiValued="false" />
   <field name="internal_id" type="string" indexed="false" stored="true"
multiValued="false" />
   <field name="score" type="int" indexed="false" stored="true" />

   <field name="first_name" type="text_general" indexed="true"
stored="true"/>
   <field name="last_name" type="text_general" indexed="true"
stored="true"/>
   <field name="city" type="text_general" indexed="true" stored="true"/>
   <field name="state" type="string" indexed="true" stored="true"/>

   <field name="birth_year" type="string" indexed="true" stored="false" />
   <field name="birth_month" type="string" indexed="true" stored="false" />
   <field name="birth_day" type="string" indexed="true" stored="false" />

I had a similar set-up working well when I was using 1-4 threads, but since
upping the number of threads querying the Solr server I've been running into
Out Of Memory errors. I removed the autoWarming filter queries from
solrconfig.xml, upped the RAM on the box to 24 GB and the JVM heap to 8 GB,
and changed the directoryFactory from MMapDirectoryFactory to
NIOFSDirectoryFactory. That solved the memory problems, but performance is
pretty bad, with most queries taking over 1 second to return a response.
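
For reference, the directory change amounts to swapping one class name in
solrconfig.xml; roughly:

    <directoryFactory name="DirectoryFactory" class="solr.NIOFSDirectoryFactory"/>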

Here's a screenshot showing the breakdown of a heap dump I did before I
upped the RAM/JVM the first time: 
<http://lucene.472066.n3.nabble.com/file/n4167111/Screen_Shot_2014-10-23_at_11.png> 

Since I'm only querying Solr in a very specific way, I'd like to set up the
filterCache so that filters on U.S. state abbreviation and birth month are
cached. But how much memory would I need?
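
The knob I'm looking at is the filterCache entry in solrconfig.xml, something
like this (the sizes here are guesses on my part, not recommendations):

    <!-- 50 states + 12 birth months = 62 distinct fq values, so size is generous -->
    <filterCache class="solr.FastLRUCache"
                 size="128"
                 initialSize="64"
                 autowarmCount="0"/>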

Here's an example of what I had previously (now commented out) in the
QuerySenderListener to auto-warm the filterCache:

        <lst><str name="q">*:*</str><str name="fq">state:CA</str><str name="fq">birth_month:1</str></lst>
        <lst><str name="q">*:*</str><str name="fq">state:CA</str><str name="fq">birth_month:2</str></lst>
        <lst><str name="q">*:*</str><str name="fq">state:CA</str><str name="fq">birth_month:3</str></lst>
        <lst><str name="q">*:*</str><str name="fq">state:CA</str><str name="fq">birth_month:4</str></lst>

The number of documents matching each of these queries ranges from a few
thousand to one million.
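
For completeness, those entries sit inside the stock newSearcher listener,
roughly like this:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str><str name="fq">state:CA</str><str name="fq">birth_month:1</str></lst>
        <!-- ... one entry per state/month combination ... -->
      </arr>
    </listener>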






Re: Solr filterCache and autoWarming memory requirements

Posted by Erick Erickson <er...@gmail.com>.
The upper limit for a filterCache entry is roughly
(size of the fq clause) + (maxDoc/8) bytes.

You can get maxDoc from your admin/overview page.

The filterCache is just a map. The key is the fq clause,
so the size there is just the string length plus around 40 bytes.

The value is, at most, a bitset representing all the docs
in the index. These are the internal Lucene doc IDs,
so they start at 0 and go up to maxDoc. You can force
maxDoc to be equal to numDocs with a force merge, which I'd
only suggest if this index doesn't change much, like maybe
daily. Take a look at your admin page to see if the
number of deleted documents is a significant percentage
of your maxDoc, to decide whether doing a force merge
makes any sense.
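
If it does, you can kick off the merge through the update handler; a quick
sketch, assuming the default port and a core named collection1:

    import requests

    # Merge down to one segment so maxDoc == numDocs. This rewrites the
    # whole index, so it's expensive -- run it off-peak.
    requests.get("http://localhost:8983/solr/collection1/update",
                 params={"optimize": "true", "maxSegments": "1"})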

I'm cheating a bit here: the actual value _may_ be just a
list of integers if the fq clause matches very few documents,
but with your setup I figure that's unlikely.

So, each entry will take on the order of 25 MB just for the
bitset, plus the query size plus some overhead. But at
that scale, you can just figure 25 MB per entry, I'd think.

So you're talking on the order of 1.5 GB for the filterCache alone,
assuming one filter for every state and one for every month,
which seems tractable.
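
Spelling out the arithmetic (your 200M docs, 50 states, 12 months):

    max_doc = 200_000_000                    # from the admin/overview page
    bytes_per_entry = max_doc / 8            # one bit per document in the bitset
    entries = 50 + 12                        # one fq per state, one per birth month
    print(bytes_per_entry / 1e6)             # 25.0  -> ~25 MB per entry
    print(entries * bytes_per_entry / 1e9)   # 1.55  -> ~1.5 GB total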

BTW, it makes no difference to the fq size, but you can squeeze
some more memory out of this if you change the year/month/day
fields to int (or tint) types; they store much more efficiently than strings.
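
Something like this in schema.xml, assuming the tint fieldType from the
stock example schema:

    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
               positionIncrementGap="0"/>

    <field name="birth_year"  type="tint" indexed="true" stored="false" />
    <field name="birth_month" type="tint" indexed="true" stored="false" />
    <field name="birth_day"   type="tint" indexed="true" stored="false" />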

Best,
Erick
