Posted to solr-user@lucene.apache.org by Per Steffensen <st...@designware.dk> on 2013/09/11 09:11:09 UTC

No or limited use of FieldCache

Hi

We have a SolrCloud setup handling huge amounts of data. When we do
group, facet or sort searches Solr will use its FieldCache, and add data
to it for every single document we have. For us it is not realistic that
this will ever fit in memory and we get OOM exceptions. Is there some
way of disabling the FieldCache (taking the performance penalty of
course) or making it behave in a nicer way where it only uses up to e.g.
80% of the memory available to the JVM? Or other suggestions?

Regards, Per Steffensen
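
A back-of-envelope sketch of why the cache may not fit on the heap. It assumes (this is not stated in the thread) that each cached numeric field costs roughly one fixed-width value per document. The document count, value width and number of fields are hypothetical, and string fields would cost more because of per-term data.

```python
def estimate_fieldcache_bytes(num_docs, bytes_per_value, num_fields):
    """Rough lower bound: one fixed-width value per document for each cached field.

    Assumption: ignores per-term data for string fields and JVM object overhead.
    """
    return num_docs * bytes_per_value * num_fields


def as_gib(num_bytes):
    """Convert a byte count to GiB for easier comparison with a JVM heap size."""
    return num_bytes / (1024 ** 3)


# Hypothetical node: 1 billion documents, three 8-byte (long) fields
# used for sort/group/facet.
estimated = estimate_fieldcache_bytes(1_000_000_000, 8, 3)
print(f"{as_gib(estimated):.1f} GiB")
```

Comparing the printed figure with the heap actually given to the JVM gives a first indication of whether limiting the cache is realistic at all.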

Re: No or limited use of FieldCache

Posted by Per Steffensen <st...@designware.dk>.

On 9/12/13 3:28 PM, Toke Eskildsen wrote:
> On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote:
>> Actually some months back I made a PoC of a FieldCache that could expand
>> beyond the heap. Basically imagine a FieldCache with room for
>> "unlimited" data-arrays, that just behind the scenes goes to
>> memory-mapped files when there is no more room on heap.
> That sounds a lot like disk-based DocValues.
>
He he
>> But that solution will also have the "running out of swap space"-problems.
> Not really. Memory mapping works like the disk cache: There is no
> requirement that a certain amount of physical memory needs to be
> available, it just takes what it can get. If there is not a lot of
> physical memory, it will require a lot of storage access, but it will
> not over-allocate swap space.
That was also my impression, but during the work I experienced some
problems around swap space. I do not remember exactly what I saw, or
how I concluded that everything in mm-files actually has to fit in
physical mem + swap. I might very well have been wrong in that
conclusion.
> It seems that different setups vary quite a lot in this area and some
> systems are prone to aggressive use of the swap file, which can severely
> harm responsiveness of applications with out-swapped data.
>
> However, this should still not result in any OOMs, as the system can
> always discard some of the memory mapped data if it needs more physical
> memory.
I saw no OOMs
> - Toke Eskildsen, State and University Library, Denmark
>

Re: No or limited use of FieldCache

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.

On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote:
> Actually some months back I made a PoC of a FieldCache that could expand
> beyond the heap. Basically imagine a FieldCache with room for 
> "unlimited" data-arrays, that just behind the scenes goes to 
> memory-mapped files when there is no more room on heap.

That sounds a lot like disk-based DocValues.

[...]

> But that solution will also have the "running out of swap space"-problems.

Not really. Memory mapping works like the disk cache: There is no
requirement that a certain amount of physical memory needs to be
available, it just takes what it can get. If there is not a lot of
physical memory, it will require a lot of storage access, but it will
not over-allocate swap space.

It seems that different setups vary quite a lot in this area and some
systems are prone to aggressive use of the swap file, which can severely
harm responsiveness of applications with out-swapped data.

However, this should still not result in any OOMs, as the system can
always discard some of the memory mapped data if it needs more physical
memory.

- Toke Eskildsen, State and University Library, Denmark
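
A minimal sketch of the memory-mapping idea discussed above, assuming a hypothetical file layout of one fixed-width 64-bit value per document. This is not Lucene's DocValues format; the function names and layout are made up for illustration. The operating system decides which pages of the file stay in physical memory, so the reader never has to allocate the whole column on the heap.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: one little-endian 64-bit integer per document, in doc-id order.
VALUE_FORMAT = "<q"
VALUE_SIZE = struct.calcsize(VALUE_FORMAT)


def write_column(path, values):
    """Write one fixed-width value per document, in document-id order."""
    with open(path, "wb") as f:
        for value in values:
            f.write(struct.pack(VALUE_FORMAT, value))


def read_value(mapped, doc_id):
    """Read the value for doc_id directly from the mapping."""
    return struct.unpack_from(VALUE_FORMAT, mapped, doc_id * VALUE_SIZE)[0]


def demo():
    """Write a small column to a temp file, map it read-only, and look up two documents."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_column(path, [doc_id * 10 for doc_id in range(1000)])
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                return read_value(mapped, 0), read_value(mapped, 999)
    finally:
        os.remove(path)
```

Lookups touch only the pages that hold the requested documents, which is why a shortage of physical memory shows up as extra storage access rather than as a failed allocation.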

Re: No or limited use of FieldCache

Posted by Per Steffensen <st...@designware.dk>.

Yes, thanks.

Actually some months back I made a PoC of a FieldCache that could expand
beyond the heap. Basically imagine a FieldCache with room for
"unlimited" data-arrays, that just behind the scenes goes to
memory-mapped files when there is no more room on heap. I never finished
it, and it might be kinda stupid because you actually just go read the
data from Lucene indices and write them to memory-mapped files in order
to use them. It is better to just use the data in the Lucene indices
instead. But it had some nice features. But that solution will also have
the "running out of swap space"-problems.

Regards, Per Steffensen

On 9/12/13 12:48 PM, Erick Erickson wrote:
> Per:
>
> One thing I'll be curious about. From my reading of DocValues, it uses
> little or no heap. But it _will_ use memory from the OS if I followed
> Simon's slides correctly. So I wonder if you'll hit swapping issues...
> Which are better than OOMs, certainly...
>
> Thanks,
> Erick

Re: No or limited use of FieldCache

Posted by Erick Erickson <er...@gmail.com>.

Per:

One thing I'll be curious about. From my reading of DocValues, it uses
little or no heap. But it _will_ use memory from the OS if I followed
Simon's slides correctly. So I wonder if you'll hit swapping issues...
Which are better than OOMs, certainly...

Thanks,
Erick


On Thu, Sep 12, 2013 at 2:07 AM, Per Steffensen <st...@designware.dk> wrote:

> Thanks, guys. Now I know a little more about DocValues and realize that
> they will do the job wrt FieldCache.
>
> Regards, Per Steffensen
>
>
> On 9/12/13 3:11 AM, Otis Gospodnetic wrote:
>
>> Per, check zee Wiki, there is a page describing DocValues. We used them
>> successfully in a Solr-for-analytics scenario.
>>
>> Otis
>> Solr & ElasticSearch Support
>> http://sematext.com/
>> On Sep 11, 2013 9:15 AM, "Michael Sokolov" <ms...@safaribooksonline.com>
>> wrote:
>>
>>> On 09/11/2013 08:40 AM, Per Steffensen wrote:
>>>
>>>> The reason I mention sort is that in my project, half a year ago, we
>>>> dealt with the FieldCache->OOM-problem when doing sort-requests. We
>>>> basically just reject sort-requests unless they hit fewer than X
>>>> documents - if they do, we just find them without sorting and sort
>>>> them ourselves afterwards.
>>>>
>>>> Currently our problem is that we have to do a group/distinct (in
>>>> SQL-language) query and we have found that we can do what we want to
>>>> do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet
>>>> - either will work for us. The problem is that they both use
>>>> FieldCache and we "know" that using FieldCache will lead to OOM
>>>> exceptions with the amount of data each of our Solr nodes
>>>> administers. This time we really have no option of just limiting
>>>> usage as we did with sort. Therefore we need a group/distinct
>>>> functionality that works even on huge amounts of data (and an
>>>> algorithm using FieldCache will not).
>>>>
>>>> I believe setting facet.method=enum will actually make facet not use the
>>>> FieldCache. Is that true? Is it a bad idea?
>>>>
>>>> I do not know much about DocValues, but I do not believe that you will
>>>> avoid FieldCache by using DocValues? Please elaborate, or point to
>>>> documentation where I will be able to read that I am wrong. Thanks!
>>>>
>>> There is Simon Willnauer's presentation
>>> http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
>>>
>>> and this blog post
>>> http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
>>>
>>> and this one that shows some performance comparisons:
>>> http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>

Re: No or limited use of FieldCache

Posted by Per Steffensen <st...@designware.dk>.

Thanks, guys. Now I know a little more about DocValues and realize that 
they will do the job wrt FieldCache.

Regards, Per Steffensen

On 9/12/13 3:11 AM, Otis Gospodnetic wrote:
> Per, check zee Wiki, there is a page describing DocValues. We used them
> successfully in a Solr-for-analytics scenario.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Sep 11, 2013 9:15 AM, "Michael Sokolov" <ms...@safaribooksonline.com>
> wrote:
>
>> On 09/11/2013 08:40 AM, Per Steffensen wrote:
>>
>>> The reason I mention sort is that in my project, half a year ago, we
>>> dealt with the FieldCache->OOM-problem when doing sort-requests. We
>>> basically just reject sort-requests unless they hit fewer than X
>>> documents - if they do, we just find them without sorting and sort
>>> them ourselves afterwards.
>>>
>>> Currently our problem is that we have to do a group/distinct (in
>>> SQL-language) query and we have found that we can do what we want to
>>> do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet
>>> - either will work for us. The problem is that they both use
>>> FieldCache and we "know" that using FieldCache will lead to OOM
>>> exceptions with the amount of data each of our Solr nodes administers.
>>> This time we really have no option of just limiting usage as we did
>>> with sort. Therefore we need a group/distinct functionality that works
>>> even on huge amounts of data (and an algorithm using FieldCache will
>>> not).
>>>
>>> I believe setting facet.method=enum will actually make facet not use the
>>> FieldCache. Is that true? Is it a bad idea?
>>>
>>> I do not know much about DocValues, but I do not believe that you will
>>> avoid FieldCache by using DocValues? Please elaborate, or point to
>>> documentation where I will be able to read that I am wrong. Thanks!
>>>
>> There is Simon Willnauer's presentation
>> http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
>>
>> and this blog post
>> http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
>>
>> and this one that shows some performance comparisons:
>> http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>>
>>
>>
>>

Re: No or limited use of FieldCache

Posted by Otis Gospodnetic <ot...@gmail.com>.

Per, check zee Wiki, there is a page describing DocValues. We used them
successfully in a Solr-for-analytics scenario.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Sep 11, 2013 9:15 AM, "Michael Sokolov" <ms...@safaribooksonline.com>
wrote:

> On 09/11/2013 08:40 AM, Per Steffensen wrote:
>
>> The reason I mention sort is that in my project, half a year ago, we
>> dealt with the FieldCache->OOM-problem when doing sort-requests. We
>> basically just reject sort-requests unless they hit fewer than X
>> documents - if they do, we just find them without sorting and sort
>> them ourselves afterwards.
>>
>> Currently our problem is that we have to do a group/distinct (in
>> SQL-language) query and we have found that we can do what we want to
>> do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet
>> - either will work for us. The problem is that they both use
>> FieldCache and we "know" that using FieldCache will lead to OOM
>> exceptions with the amount of data each of our Solr nodes administers.
>> This time we really have no option of just limiting usage as we did
>> with sort. Therefore we need a group/distinct functionality that works
>> even on huge amounts of data (and an algorithm using FieldCache will
>> not).
>>
>> I believe setting facet.method=enum will actually make facet not use the
>> FieldCache. Is that true? Is it a bad idea?
>>
>> I do not know much about DocValues, but I do not believe that you will
>> avoid FieldCache by using DocValues? Please elaborate, or point to
>> documentation where I will be able to read that I am wrong. Thanks!
>>
> There is Simon Willnauer's presentation
> http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
>
> and this blog post
> http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
>
> and this one that shows some performance comparisons:
> http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>
>
>
>

Re: No or limited use of FieldCache

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

On 09/11/2013 08:40 AM, Per Steffensen wrote:
> The reason I mention sort is that in my project, half a year ago, we
> dealt with the FieldCache->OOM-problem when doing sort-requests. We
> basically just reject sort-requests unless they hit fewer than X
> documents - if they do, we just find them without sorting and sort
> them ourselves afterwards.
>
> Currently our problem is that we have to do a group/distinct (in
> SQL-language) query and we have found that we can do what we want to
> do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet
> - either will work for us. The problem is that they both use
> FieldCache and we "know" that using FieldCache will lead to OOM
> exceptions with the amount of data each of our Solr nodes administers.
> This time we really have no option of just limiting usage as we did
> with sort. Therefore we need a group/distinct functionality that works
> even on huge amounts of data (and an algorithm using FieldCache will
> not).
>
> I believe setting facet.method=enum will actually make facet not use 
> the FieldCache. Is that true? Is it a bad idea?
>
> I do not know much about DocValues, but I do not believe that you will 
> avoid FieldCache by using DocValues? Please elaborate, or point to 
> documentation where I will be able to read that I am wrong. Thanks!
There is Simon Willnauer's presentation 
http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene

and this blog post 
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/

and this one that shows some performance comparisons: 
http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/

Re: No or limited use of FieldCache

Posted by Per Steffensen <st...@designware.dk>.

The reason I mention sort is that in my project, half a year ago, we
dealt with the FieldCache->OOM-problem when doing sort-requests. We
basically just reject sort-requests unless they hit fewer than X
documents - if they do, we just find them without sorting and sort them
ourselves afterwards.
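
The workaround described above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: `guarded_sorted_search`, `count_hits` and `fetch_unsorted` are hypothetical names, and the two callables stand in for an unsorted count query and an unsorted fetch against the search server.

```python
def guarded_sorted_search(count_hits, fetch_unsorted, sort_key, max_sortable_hits):
    """Sort client-side, but only when the query hits fewer than max_sortable_hits documents.

    count_hits and fetch_unsorted are hypothetical callables standing in for
    an unsorted count query and an unsorted fetch against the search server.
    """
    hit_count = count_hits()
    if hit_count >= max_sortable_hits:
        # Too many hits: reject rather than ask the server to sort.
        raise ValueError(
            f"sort rejected: {hit_count} hits, limit is {max_sortable_hits}"
        )
    # Few enough hits: fetch without sorting and sort in the client.
    return sorted(fetch_unsorted(), key=sort_key)


# Usage with an in-memory stand-in for the search server.
docs = [{"id": 3, "price": 30}, {"id": 1, "price": 10}, {"id": 2, "price": 20}]
result = guarded_sorted_search(
    count_hits=lambda: len(docs),
    fetch_unsorted=lambda: list(docs),
    sort_key=lambda doc: doc["price"],
    max_sortable_hits=1000,
)
```

The limit plays the role of X in the description above; the server never sees a sort parameter, so it has no reason to populate a sort cache for the field.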

Currently our problem is that we have to do a group/distinct (in
SQL-language) query and we have found that we can do what we want to do
using group (http://wiki.apache.org/solr/FieldCollapsing) or facet -
either will work for us. The problem is that they both use FieldCache
and we "know" that using FieldCache will lead to OOM exceptions with the
amount of data each of our Solr nodes administers. This time we really
have no option of just limiting usage as we did with sort. Therefore we
need a group/distinct functionality that works even on huge amounts of
data (and an algorithm using FieldCache will not).

I believe setting facet.method=enum will actually make facet not use the 
FieldCache. Is that true? Is it a bad idea?
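
For anyone wanting to try this, a small sketch of building such a request. The host, collection and field names are hypothetical. `facet`, `facet.field` and `facet.method` are standard Solr request parameters, but whether the enum method really avoids the FieldCache in a given Solr version, and whether it is fast enough on a field with many distinct values, is an assumption to verify rather than something this sketch shows.

```python
from urllib.parse import urlencode


def build_enum_facet_url(base_url, query, facet_field):
    """Build a select URL that asks for facet counts using facet.method=enum."""
    params = [
        ("q", query),
        ("rows", 0),  # only facet counts are wanted, no documents
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.method", "enum"),
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, collection, and field names.
url = build_enum_facet_url(
    "http://localhost:8983/solr/collection1", "*:*", "customer_id"
)
print(url)
```

Watching heap usage while running such a request against a test node would be one way to check the behaviour before relying on it.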

I do not know much about DocValues, but I do not believe that you will 
avoid FieldCache by using DocValues? Please elaborate, or point to 
documentation where I will be able to read that I am wrong. Thanks!

Regards, Per Steffensen

On 9/11/13 1:38 PM, Erick Erickson wrote:
> I don't know any more than Michael, but I'd _love_ some reports from the
> field.
>
> There are some restrictions on DocValues though; I believe one of them
> is that they don't really work on analyzed data...
>
> FWIW,
> Erick

Re: No or limited use of FieldCache

Posted by Erick Erickson <er...@gmail.com>.

I don't know any more than Michael, but I'd _love_ some reports from the
field.

There are some restrictions on DocValues though; I believe one of them
is that they don't really work on analyzed data...

FWIW,
Erick


On Wed, Sep 11, 2013 at 7:00 AM, Michael Sokolov <
msokolov@safaribooksonline.com> wrote:

> On 9/11/13 3:11 AM, Per Steffensen wrote:
>
>> Hi
>>
>> We have a SolrCloud setup handling huge amounts of data. When we do
>> group, facet or sort searches Solr will use its FieldCache, and add data to
>> it for every single document we have. For us it is not realistic that this
>> will ever fit in memory and we get OOM exceptions. Is there some way of
>> disabling the FieldCache (taking the performance penalty of course) or making
>> it behave in a nicer way where it only uses up to e.g. 80% of the memory
>> available to the JVM? Or other suggestions?
>>
>> Regards, Per Steffensen
>>
> I think you might want to look into using DocValues fields, which are
> column-stride fields stored as compressed arrays - one value per document
> -- for the fields on which you are sorting and faceting. My understanding
> (which is limited) is that these avoid the use of the field cache, and I
> believe you have the option to control whether they are held in memory or
> on disk.  I hope someone who knows more will elaborate...
>
> -Mike
>

Re: No or limited use of FieldCache

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

On 9/11/13 3:11 AM, Per Steffensen wrote:
> Hi
>
> We have a SolrCloud setup handling huge amounts of data. When we do
> group, facet or sort searches Solr will use its FieldCache, and add
> data to it for every single document we have. For us it is not
> realistic that this will ever fit in memory and we get OOM exceptions.
> Is there some way of disabling the FieldCache (taking the performance
> penalty of course) or making it behave in a nicer way where it only uses
> up to e.g. 80% of the memory available to the JVM? Or other suggestions?
>
> Regards, Per Steffensen
I think you might want to look into using DocValues fields, which are 
column-stride fields stored as compressed arrays - one value per 
document -- for the fields on which you are sorting and faceting. My 
understanding (which is limited) is that these avoid the use of the 
field cache, and I believe you have the option to control whether they 
are held in memory or on disk.  I hope someone who knows more will 
elaborate...

-Mike