Posted to solr-user@lucene.apache.org by Arun Rangarajan <ar...@gmail.com> on 2012/11/28 03:25:20 UTC

Help with sort on dynamic field and out of memory error

We have a Solr 3.6 core with about 250 TrieIntFields (declared via
dynamicField). There are about 14M docs in our Solr index, and many
documents have values in many of these fields. Over time, we need to
sort on all 250 of these fields.

The issue we are facing is that the underlying Lucene fieldCache fills
up very quickly. We have a 4 GB box and the index size is 18 GB. After
sorting on 40 or 45 of these dynamic fields, memory consumption is
about 90% (Tomcat is set up with a max heap size of 3.6 GB) and we
start getting OutOfMemory errors.

For now, we have a cron job running every minute that restarts Tomcat
if total memory consumption exceeds 80%.

We thought that if we used boosting instead of sorting, the query would
not go through the fieldCache. So instead of issuing a query like

select?q=name:alba&sort=relevance_11 desc

we tried

select?q={!boost relevance_11}name:alba

but unfortunately boosting also populates the field cache.

From what I have read, I understand that restricting the number of distinct
values in sortable Solr fields will bring down the fieldCache space. The
values in these sortable fields can be any integer from 0 to 33000 and are
quite widely distributed. We have a few scaling solutions in mind, but what
is the best way to handle this whole issue?

thanks.

Re: Help with sort on dynamic field and out of memory error

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2012-11-28 at 03:25 +0100, Arun Rangarajan wrote:

[Sorting on 14M docs, 250 fields]

> From what I have read, I understand that restricting the number of distinct
> values in sortable Solr fields will bring down the fieldCache space. The
> values in these sortable fields can be any integer from 0 to 33000 and are
> quite widely distributed. We have a few scaling solutions in mind, but what
> is the best way to handle this whole issue?

Since the number of documents exceeds the maximum value in your fields,
the lowest memory consumption for a fast implementation I can come up
with is #docs * #fields * max bits/value: 14M * 250 * 16 bits ~= 7GB.
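Spelled out (a rough worked calculation; 16 bits/value is itself an
optimization, since values from 0 to 33000 fit in 16 bits while the
stock Lucene 3.6 FieldCache keeps ints in plain 32-bit int[] arrays;
hence the hacking below):

14,000,000 docs * 250 fields * 16 bits/value
  = 14,000,000 * 250 * 2 bytes
  = 7,000,000,000 bytes ~= 7 GB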

So it's time to get creative and hack Solr.

First off, is the number of unique values per field significantly lower
than your maximum of 33000 for a non-trivial number of your fields? If so,
they can be mapped to a contiguous range when storing the data (this
could be done dynamically when creating the field cache). If an average
field holds < 1024 unique values, the total memory consumption would be
about 14M * 250 * 10 bits ~= 4.4GB.
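A minimal sketch of that mapping (a hypothetical standalone class, not
an existing Lucene or Solr API, and it assumes every document has a
value): assign ordinals in ascending value order, so that comparing
ordinals is the same as comparing values, and bit-pack one ordinal per
document at ceil(log2(#unique)) bits.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: bit-packed, order-preserving ordinals for one field.
public class PackedOrdinalField {
  private final long[] packed;    // one bitsPerOrd-wide ordinal per document
  private final int[] ordToValue; // ordinal -> original value, value-ordered
  private final int bitsPerOrd;

  public PackedOrdinalField(int[] valuePerDoc) {
    // Assign ordinals in ascending value order, so that sorting by
    // ordinal yields the same order as sorting by value.
    TreeSet<Integer> distinct = new TreeSet<Integer>();
    for (int v : valuePerDoc) distinct.add(v);
    ordToValue = new int[distinct.size()];
    Map<Integer, Integer> valueToOrd = new HashMap<Integer, Integer>();
    int ord = 0;
    for (int v : distinct) { ordToValue[ord] = v; valueToOrd.put(v, ord++); }
    // ceil(log2(#unique)) bits, at least one; 1024 unique values -> 10 bits.
    bitsPerOrd = Math.max(1, 32 - Integer.numberOfLeadingZeros(ordToValue.length - 1));
    packed = new long[(int) (((long) valuePerDoc.length * bitsPerOrd + 63) / 64)];
    for (int doc = 0; doc < valuePerDoc.length; doc++) {
      set(doc, valueToOrd.get(valuePerDoc[doc]));
    }
  }

  private void set(int doc, int ord) {
    long bitPos = (long) doc * bitsPerOrd;
    int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
    packed[word] |= (long) ord << shift;
    if (shift + bitsPerOrd > 64) {          // ordinal straddles two words
      packed[word + 1] |= (long) ord >>> (64 - shift);
    }
  }

  public int ordinal(int doc) {             // what a sort comparator would use
    long bitPos = (long) doc * bitsPerOrd;
    int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
    long bits = packed[word] >>> shift;
    if (shift + bitsPerOrd > 64) {
      bits |= packed[word + 1] << (64 - shift);
    }
    return (int) (bits & ((1L << bitsPerOrd) - 1));
  }

  public int value(int doc) { return ordToValue[ordinal(doc)]; }
}

With <= 1024 unique values this needs 10 bits per document instead of
the 32 bits of a plain int[] cache entry.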

Secondly, if you normally use only a few fields for sorting at a time,
which I suspect is the case, you could compress each field's values as a
single block and uncompress them when they are requested from the field
cache. Keeping a fixed-size cache of uncompressed values in the field
cache should ensure that there is no slowdown for most requests.
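A sketch of that second idea, again with made-up class and method names
rather than the real FieldCache plumbing: keep each field's values
Deflater-compressed, and hold the most recently used fields uncompressed
in a small access-ordered LRU.

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: compressed per-field sort values plus a small
// fixed-size cache of uncompressed arrays for the fields in active use.
public class CompressedFieldStore {
  private static final int MAX_HOT = 4; // fields kept uncompressed at once

  private final Map<String, byte[]> compressed = new HashMap<String, byte[]>();
  private final Map<String, Integer> docCounts = new HashMap<String, Integer>();
  private final LinkedHashMap<String, int[]> hot =
      new LinkedHashMap<String, int[]>(16, 0.75f, true) { // access order
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
          return size() > MAX_HOT; // evict the least recently used field
        }
      };

  public synchronized void put(String field, int[] valuePerDoc) {
    docCounts.put(field, valuePerDoc.length);
    ByteBuffer buf = ByteBuffer.allocate(valuePerDoc.length * 4);
    buf.asIntBuffer().put(valuePerDoc);
    compressed.put(field, deflate(buf.array()));
  }

  public synchronized int[] get(String field) throws DataFormatException {
    int[] values = hot.get(field);
    if (values == null) { // miss: pay the decompression cost once
      byte[] raw = inflate(compressed.get(field), docCounts.get(field) * 4);
      values = new int[raw.length / 4];
      ByteBuffer.wrap(raw).asIntBuffer().get(values);
      hot.put(field, values);
    }
    return values;
  }

  private static byte[] deflate(byte[] input) {
    Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
    d.setInput(input);
    d.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    while (!d.finished()) out.write(buf, 0, d.deflate(buf));
    d.end();
    return out.toByteArray();
  }

  private static byte[] inflate(byte[] input, int origLen) throws DataFormatException {
    Inflater inf = new Inflater();
    inf.setInput(input);
    byte[] out = new byte[origLen];
    int off = 0;
    while (off < origLen) off += inf.inflate(out, off, origLen - off);
    inf.end();
    return out;
  }
}

Plain Deflate over raw 32-bit ints is just a stand-in; the savings
depend entirely on the value distribution, so any codec that fits the
data would do.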

It is very hard to estimate the memory savings from this, but I would not
be surprised if you could reduce memory consumption to 1/10 of the
worst-case 7GB if the values are fairly uniform. Of course, if the values
are all over the place, this gains you nothing at all.

Regards,
Toke Eskildsen


Re: Help with sort on dynamic field and out of memory error

Posted by Arun Rangarajan <ar...@gmail.com>.
Erick,

Thanks for your reply. So there is no easy way to get around this problem.

We have a way to rework the schema to use a single sort field. The
dynamic fields we have are named like relevance_CLASSID. The current
schema has a unique key NODEID and a multi-valued field CLASSID; the
relevance scores are per class id. If we instead keep one document per
classId per nodeId, i.e. the new schema has NODEID:CLASSID as the unique
key and stores some redundant information across documents with the same
NODEID, then we can sort on a single relevance field and do a filter
query on classId, as in the example below.
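For example, the earlier sort on relevance_11 would become something
like (hypothetical field names for the reworked schema):

select?q=name:alba&fq=classid:11&sort=relevance desc

That way only the single relevance field ever enters the fieldCache,
and the classid filter queries can be cached and reused via the
filterCache.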


Re: Help with sort on dynamic field and out of memory error

Posted by Erick Erickson <er...@gmail.com>.
I sure don't see how this can work given the constraints. Just to hold the
values, assuming that each doc holds a value in 150 fields, you have 150 *
4 * 14,000,000 or 8.4G of memory required, and you just don't have that
much memory to play around with.

Sharding seems silly for 14M docs, but that might be what's necessary. Or
get hardware with lots of memory.

Or redefine the problem so you don't have to sort on so many fields. Not
quite sure how to do that off the top of my head, but.....

Best
Erick

