Posted to solr-user@lucene.apache.org by Damien Kamerman <da...@gmail.com> on 2014/04/10 04:23:22 UTC

Facet search and growing memory usage

Hi All,

What I have found with Solr 4.6.0 to 4.7.1 is that memory usage continues
to grow with facet queries.

Originally I saw the issue with 40 facets over 60 collections (distributed
search). Memory usage would spike and solr would become unresponsive, as in
https://issues.apache.org/jira/browse/SOLR-2855

Then I tried to determine a safe limit at which the search would work
without breaking solr. But what I found is that I can break solr in the
same way with one facet (with many distinct values) and one collection. By
holding F5 (reload) in the browser for 10 seconds, memory usage continues to
grow.

e.g.
http://localhost:8000/solr/collection/select?facet=true&facet.mincount=1&q=*:*&facet.threads=5&facet.field=id

I realize that faceting on 'id' is extreme but it seems to highlight the
issue that memory usage continues to grow (leak?) with each new query until
solr eventually breaks.

This does not happen with the 'old' method 'facet.method=enum' - memory
usage is stable and solr is unbreakable with my hold-reload test.
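
e.g. the same query, with the old method forced via the facet.method parameter:
http://localhost:8000/solr/collection/select?facet=true&facet.mincount=1&q=*:*&facet.threads=5&facet.field=id&facet.method=enum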

This post
http://shal.in/post/285908948/inside-solr-improvements-in-faceted-search-performance
describes the new/current facet method and states
"The structure is thrown away and re-created lazily on a commit. There
might be a few concerns around the garbage accumulated by the (re)-creation
of the many arrays needed for this structure. However, the performance gain
is significant enough to warrant the trade-off."

The wiki http://wiki.apache.org/solr/SimpleFacetParameters#facet.method
says the new/default method 'tends to use less memory'.

I use autoCommit (1min) on my collections - does that mean there's a one-minute
(or longer with no new docs) window where facet queries will effectively
'leak'?
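
For reference, the autoCommit section in my solrconfig.xml is roughly this
(the openSearcher value shown is illustrative, not copied verbatim):

<autoCommit>
  <maxTime>60000</maxTime>   <!-- 1 minute -->
  <openSearcher>true</openSearcher>
</autoCommit>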

Test setup: JDK 1.7.0u40 64-bit; Solr 4.7.1; 3 instances; 64GB each; 17M
docs; 2 replicas.

Cheers,
Damien.

Re: Facet search and growing memory usage

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
fwiw,
Facets are much less heap-greedy when counted on docValues-enabled fields;
they should not hit UnInvertedField in this case. Try them.
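
e.g. something like this in schema.xml (the field name is just an example;
a re-index is needed after enabling docValues):

<field name="category" type="string" indexed="true" stored="false"
       docValues="true"/>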


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


RE: Facet search and growing memory usage

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Shawn Heisey [solr@elyograg.org] wrote:
>On 4/9/2014 11:53 PM, Toke Eskildsen wrote:
>> The memory allocation for enum is both low and independent of the amount
>> of unique values in the facets. The trade-off is that it is very slow
>> for medium- to high-cardinality fields.

> This is where it is extremely beneficial to have enough RAM to cache
> your entire index.  The term list must be enumerated for every facet
> request, but if the data is already in the OS disk cache, this is very
> fast.

Very fast compared to not cached, yes, but still slow compared to fc, for high-cardinality fields. The processing overhead per term is a great deal larger for enum. I recently ran some tests with Solr's different faceting methods for 50M+ values, but stopped measuring for enum as it took so much longer than the other methods. That was for a fully cached index.

> If facets are happening on lots of fields and are heavily utilized,
> facet.method=enum should be used, and there must be plenty of RAM to
> cache all or most of the index data on the machine.

I do not understand how the number of facets has any influence on the choice between enum and fc. As Solr (sadly) does not support combined structures for multiple facets, each facet is independent from the others. Shouldn't the choice be done for each individual facet?
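
If some fields suit enum and others suit fc, the method can also be set per
field with the f.<fieldname>.facet.method parameter, e.g. (field names made
up for illustration):
http://localhost:8000/solr/collection/select?q=*:*&facet=true&facet.field=low_card&facet.field=high_card&f.low_card.facet.method=enum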

- Toke Eskildsen

Re: Facet search and growing memory usage

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/9/2014 11:53 PM, Toke Eskildsen wrote:
>> This does not happen with the 'old' method 'facet.method=enum' - memory
>> usage is stable and solr is unbreakable with my hold-reload test.
> 
> The memory allocation for enum is both low and independent of the amount
> of unique values in the facets. The trade-off is that it is very slow
> for medium- to high-cardinality fields.

This is where it is extremely beneficial to have enough RAM to cache
your entire index.  The term list must be enumerated for every facet
request, but if the data is already in the OS disk cache, this is very
fast.  If the operating system has to read the data off the disk, it
will be *very* slow.

If facets are happening on lots of fields and are heavily utilized,
facet.method=enum should be used, and there must be plenty of RAM to
cache all or most of the index data on the machine.  The default method
(fc) will create the memory structure that Toke has mentioned for
*every* field that gets used for facets.  If there are only a few fields
used for faceting and they have low cardinality, this is not a problem,
and the speedup is usually worth the extra heap memory usage.  With 40
facets, that is not supportable.

Thanks,
Shawn


Re: Facet search and growing memory usage

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2014-04-10 at 04:23 +0200, Damien Kamerman wrote:
> What I have found with Solr 4.6.0 to 4.7.1 is that memory usage continues
> to grow with facet queries.

It allocates (potentially significant) temporary structures, yes.

> Then I tried to determine a safe limit at which the search would work
> without breaking solr. But what I found is that I can break solr in the
> same way with one facet (with many distinct values) and one collection. By
> holding F5 (reload) in the browser for 10 seconds, memory usage continues to
> grow.
> 
> e.g.
> http://localhost:8000/solr/collection/select?facet=true&facet.mincount=1&q=*:*&facet.threads=5&facet.field=id
> 
> I realize that faceting on 'id' is extreme but it seems to highlight the
> issue that memory usage continues to grow (leak?) with each new query until
> solr eventually breaks.

Qualified guess: Keyboard-repeat kicks in and your browser will break
the existing connections and establish new ones very quickly.

Each faceted call allocates temporary memory. For standard searches, the
amount is small, but faceting on a high-cardinality field like id is
more expensive: 4 bytes/unique ID for String field cache. The overhead
lives until the faceting call has been fully processed - breaking the
connection to Solr does not stop that.


You state that you have 17M+ documents in your index. That is 68MB+
temporary overhead for each call. Let's say your keyboard repeat is
about 50/second. That means 50*68MB+ ~= 3.4GB+ for temporary structures
in Solr when you hold F5.

I have recently learned that the Jetty provided with Solr is tweaked to
accept 1000 concurrent incoming requests (which in your case would
require 40GB of heap), so it will happily dispatch those 50 requests to
Solr.

To avoid this, lower your maxThreads setting for Jetty to an amount that
can be handled with your heap size. The F5 test seems like a very quick
and easy way to determine whether it works: you should start getting errors
on the browser end instead of the Solr end.
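
With the jetty.xml that ships with Solr 4.x, that would mean lowering the
maxThreads value in the thread pool section, along these lines (50 is just
an example; size it to your heap):

<Set name="ThreadPool">
  <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
    <Set name="minThreads">10</Set>
    <Set name="maxThreads">50</Set>
  </New>
</Set>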

> This does not happen with the 'old' method 'facet.method=enum' - memory
> usage is stable and solr is unbreakable with my hold-reload test.

The memory allocation for enum is both low and independent of the amount
of unique values in the facets. The trade-off is that it is very slow
for medium- to high-cardinality fields.

> This post
> http://shal.in/post/285908948/inside-solr-improvements-in-faceted-search-performance
> describes the new/current facet method and states
> "The structure is thrown away and re-created lazily on a commit. There
> might be a few concerns around the garbage accumulated by the (re)-creation
> of the many arrays needed for this structure. However, the performance gain
> is significant enough to warrant the trade-off."

I investigated the garbage issue as part of SOLR-5894 and found it to be
significant. See
https://sbdevel.wordpress.com/2014/04/04/sparse-facet-counting-without-the-downsides/
for some numbers. Solving that does not help with the temporary allocations, though.

> The wiki http://wiki.apache.org/solr/SimpleFacetParameters#facet.method
> says the new/default method 'tends to use less memory'.

I do not agree on that part, but it is of course possible that I have
misunderstood something. fc allocates the array I described in
UnInvertedField.getCounts (look for counts = new int[...]).

> I use autoCommit (1min) on my collections - does that mean there's a one-minute
> (or longer with no new docs) window where facet queries will effectively
> 'leak'?

It does worsen the problem due to the resources used for warmup of the
facet.

- Toke Eskildsen, State and University Library, Denmark