Posted to solr-user@lucene.apache.org by Jeongseok Son <in...@gmail.com> on 2014/05/15 12:39:09 UTC

Sorting problem in Solr due to Lucene Field Cache

Hello, I'm struggling with large data indexed and searched by Solr.

The document schema consists of a date (YYYY-MM-DD), text (tokenized and
indexed with the Natural Language Toolkit), and several numerical fields.

Each document is small, but the number of documents is very large: around
10 million per date. The server has 32GB of memory, of which I allocated
around 30GB to the Solr JVM.

My Solr server has to return documents sorted by one of the numerical
fields when it receives a request with a specific date and text (e.g.
q=date:YYYY-MM-DD+text:KEYWORD). The problem is that sorting in Lucene
requires a lot of Field Cache, and Solr can't handle the Field Cache well.
The Field Cache grows as more queries are executed and is never evicted.
When all of the memory is filled with Field Cache, the Solr server stops
or throws an Out of Memory exception.

Solr cannot control the Lucene field cache at all, so I'm having a
difficult time solving this problem. I'm considering these three options.

1) Add more memory.
This can relieve the problem, but I don't think it can solve it completely.
The memory would still fill up with field cache as the server handles
search requests.
2) Separate numerical data from text data
I find Solr/Lucene isn't suitable for sorting large numerical data.
Therefore I'm thinking of storing the numerical data in another DB (HBase,
MongoDB ...), so that the Solr server only does the text search.
3) Switching to Elasticsearch
According to this page(
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html)
Elasticsearch can control field cache. I think ES could solve my
problem.

I'm likely to try the 2nd or 3rd option. Are these appropriate solutions?
If you have any better ideas, please let me know. I've gone through too
many troubles, so it's time to make a decision. I want my choices reviewed
by the many excellent Solr users and developers here, and I also want to
find better solutions.
I really appreciate any help you can provide.

Re: Sorting problem in Solr due to Lucene Field Cache

Posted by Joel Bernstein <jo...@gmail.com>.
Take a look at Solr's use of DocValues:
https://cwiki.apache.org/confluence/display/solr/DocValues.

There are docValues options that use less memory than the FieldCache.
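
As a rough sketch of what that could look like in schema.xml (the field
name `sort_value` is hypothetical, and the `long` type is assumed to mirror
the one in the Solr 4.x example schema; adjust both to your own schema), a
numeric sort field with docValues enabled might be declared like this:

```xml
<!-- Numeric type with precisionStep="0", as in the Solr 4.x example schema -->
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>

<!-- Hypothetical sort field. docValues="true" writes a column-oriented
     structure at index time, which sorting can read instead of
     un-inverting the field into the FieldCache. -->
<field name="sort_value" type="long" indexed="true" stored="true" docValues="true"/>
```

Existing documents would need to be reindexed after this change before
sorting can use the docValues data.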

Joel Bernstein
Search Engineer at Heliosearch

