Posted to solr-user@lucene.apache.org by Jeongseok Son <in...@gmail.com> on 2014/05/20 10:01:58 UTC

Re: solr-user Digest of: get.100322

Thank you for your reply! I also found docValues after sending my
email, and your suggestion seems like the best solution for me.

Now I'm configuring schema.xml to use docValues and have a question
about docValuesFormat.

According to this thread
(http://lucene.472066.n3.nabble.com/Trade-offs-in-choosing-DocValuesFormat-td4114758.html),
Solr 4.6 only holds some hash structures in memory with the default
docValuesFormat configuration.

Although that uses only a small amount of memory, I'm worried about
memory usage because I have to store so many documents (32GB RAM /
5 billion docs total across all cores).

Which docValuesFormat is more appropriate in my case (Default or
Disk)? Can I change it later without re-indexing?
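
For what it's worth, here is a sketch of what I'm trying in schema.xml
(the field and type names are made up for illustration; as I understand
it, non-default docValuesFormat values in Solr 4.x also require enabling
solr.SchemaCodecFactory in solrconfig.xml):

```xml
<!-- Hypothetical example field. docValues="true" enables docValues, and
     docValuesFormat on the fieldType picks the implementation;
     "Disk" keeps most of the structures out of the Java heap. -->
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="0"
           docValuesFormat="Disk"/>
<field name="score_field" type="tlong" indexed="true" stored="true"
       docValues="true"/>
```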

On Sat, May 17, 2014 at 9:45 PM,  <so...@lucene.apache.org> wrote:
>
> solr-user Digest of: get.100322
>
> Topics (messages 100322 through 100322)
>
> Re: Sorting problem in Solr due to Lucene Field Cache
>         100322 by: Joel Bernstein
>
> ---------- Forwarded message ----------
> From: Joel Bernstein <jo...@gmail.com>
> To: solr-user@lucene.apache.org
> Cc:
> Date: Fri, 16 May 2014 17:49:51 -0400
> Subject: Re: Sorting problem in Solr due to Lucene Field Cache
> Take a look at Solr's use of DocValues:
> https://cwiki.apache.org/confluence/display/solr/DocValues.
>
> There are docValues options that use less memory than the FieldCache.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Thu, May 15, 2014 at 6:39 AM, Jeongseok Son <in...@gmail.com> wrote:
>
>> Hello, I'm struggling with large data indexed and searched by Solr.
>>
>> The schema of the documents consists of date (YYYY-MM-DD), text (tokenized
>> and indexed with the Natural Language Toolkit), and several numerical fields.
>>
>> Each document is small, but the number of docs is very large: around
>> 10 million per date. The server has 32GB of memory and I allocated
>> around 30GB to the Solr JVM.
>>
>> My Solr server has to return documents sorted by one of the numerical
>> fields when requested with a specific date and text (e.g.
>> q=date:YYYY-MM-DD+text:KEYWORD). The problem is that sorting in Lucene
>> requires a lot of FieldCache and Solr can't manage the FieldCache well. The
>> FieldCache grows as more queries are executed and is never evicted. When
>> the whole memory is filled with FieldCache, the Solr server stops or
>> throws an OutOfMemoryError.
>>
>> Solr cannot control the Lucene FieldCache at all, so I'm having a
>> difficult time solving this problem. I'm considering these three options:
>>
>> 1) Add more memory.
>> This can relieve the problem, but I don't think it can completely solve it.
>> The memory would still fill up with FieldCache as the server handles
>> search requests.
>> 2) Separate the numerical data from the text data.
>> I find Solr/Lucene isn't suitable for sorting large numerical data, so I'm
>> thinking of storing the numerical data in another DB (HBase, MongoDB, ...)
>> and letting the Solr server do just the text search.
>> 3) Switch to Elasticsearch.
>> According to this page
>> (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html),
>> Elasticsearch can control its field data cache, so I think ES could solve
>> my problem.
>>
>> I'm likely to try the 2nd or 3rd option. Are these appropriate solutions?
>> If you have any better ideas, please let me know. I've gone through too
>> many troubles, so it's time to make a decision. I'd like my choices
>> reviewed by other excellent Solr users and developers, and I also want
>> to find better solutions.
>> I really appreciate any help you can provide.
>>
>
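
(For illustration, the kind of sorted request described in the quoted
message might look like the following; the host, core, and field names
here are made up:

```
http://localhost:8983/solr/mycore/select?q=date:2014-05-15+AND+text:keyword&sort=price+desc&rows=10
```

Every distinct field used in such a sort is what populates the Lucene
FieldCache, or, once docValues are enabled, the docValues structures.)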

Re: solr-user Digest of: get.100322

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/21/2014 7:28 AM, Jack Krupansky wrote:
> Just to re-emphasize the point - when provisioning Solr, you need to
> ASSURE that the system has enough system memory so that the Solr index
> on that system fits entirely in the OS file system cache. No ifs,
> ands, or buts. If you fail to follow that RULE, all bets are off for
> performance and don't even bother complaining about poor performance
> on this mailing list!! Either get more memory or shard your index more
> heavily - again, no ifs, ands, or buts!!
>
> Any questions on that rule?
>
> Maybe somebody else can phrase this "guidance" more clearly, so that
> fewer people will fail to follow it.
>
> Or, maybe we should enhance Solr to check available memory and log a
> stern warning if the index size exceeds system memory when Solr is
> started.

If the amount of free and cached RAM can be detected by Java in a
cross-platform way, it would be awesome to log a performance warning
when the total of that memory is less than 50% of the total index size.
That is the point where I generally feel comfortable saying that lack of
memory is a likely problem.  Depending on the exact index composition
and the types of queries being run, a Solr server may still run very well
when only half the index can be cached.
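
A minimal sketch of the kind of check described above, assuming a
HotSpot JVM: the com.sun.management extension of OperatingSystemMXBean
exposes total physical memory, but free/cached OS memory is not portably
visible from Java, so the 50% threshold test takes those figures as
plain inputs.

```java
import java.lang.management.ManagementFactory;

public class MemoryCheck {

    // Total physical RAM in bytes, via the com.sun.management extension of
    // OperatingSystemMXBean (present on HotSpot; not part of the Java SE spec).
    static long totalPhysicalMemory() {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        return os.getTotalPhysicalMemorySize();
    }

    // The 50% rule of thumb: warn when the RAM available for the OS page
    // cache is less than half the on-disk index size.
    static boolean shouldWarn(long cacheableRamBytes, long indexSizeBytes) {
        return cacheableRamBytes < indexSizeBytes / 2;
    }

    public static void main(String[] args) {
        System.out.println("Physical RAM: "
                + (totalPhysicalMemory() >> 30) + " GB");
        // 16 GB usable for caching vs. a 40 GB index -> warn
        System.out.println(shouldWarn(16L << 30, 40L << 30)); // prints "true"
    }
}
```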

I've seen some discussion of a documentation section (and supporting
scripts/data in the download) that describes how to set up a
production-ready and fault tolerant install.  That would be a good place
to put this information.  An install script on *NIX systems would be
able to easily gather memory information and display various index sizes
that the hardware is likely to handle efficiently.

If nothing else, we can beef up the SYSTEM_REQUIREMENTS.txt file.  Later
today I'll file an issue and cook up a patch for that.

Thanks,
Shawn


Re: solr-user Digest of: get.100322

Posted by Jack Krupansky <ja...@basetechnology.com>.
Just to re-emphasize the point - when provisioning Solr, you need to ASSURE 
that the system has enough system memory so that the Solr index on that 
system fits entirely in the OS file system cache. No ifs, ands, or buts. If 
you fail to follow that RULE, all bets are off for performance and don't 
even bother complaining about poor performance on this mailing list!! Either 
get more memory or shard your index more heavily - again, no ifs, ands, or 
buts!!

Any questions on that rule?

Maybe somebody else can phrase this "guidance" more clearly, so that fewer 
people will fail to follow it.

Or, maybe we should enhance Solr to check available memory and log a stern 
warning if the index size exceeds system memory when Solr is started.

-- Jack Krupansky



Re: solr-user Digest of: get.100322

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/20/2014 2:01 AM, Jeongseok Son wrote:
> Although it uses only a small amount of memory, I'm worried about memory
> usage because I have to store so many documents. (32GB RAM / 5 billion
> docs total across all cores)

If you've only got 32GB of RAM and there are five billion docs on the
system, Solr performance will be dismal no matter what you do with
docValues.  Your index will be FAR larger than the amount of available
RAM for caching.

http://wiki.apache.org/solr/SolrPerformanceProblems#RAM

With that many documents, even if you don't use RAM-hungry features like
sorting and facets, you'll need a significant heap size, which will
further reduce the amount of RAM on the system that the OS can use to
cache the index.

For good performance, Solr *relies* on the operating system caching a
significant portion of the index.

Thanks,
Shawn
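
To put numbers on that warning, a rough back-of-envelope calculation,
where the per-document index size is purely an assumed figure for
illustration:

```java
public class IndexSizing {
    public static void main(String[] args) {
        long docs = 5_000_000_000L;            // 5 billion docs across all cores
        long bytesPerDoc = 100L;               // assumed average index bytes/doc
        long indexBytes = docs * bytesPerDoc;  // 500 GB of index
        long ramBytes = 32L << 30;             // 32 GB of RAM
        System.out.println(indexBytes / ramBytes); // index is ~14x total RAM
    }
}
```

Even at an optimistic 100 bytes of index per document, the index dwarfs
the RAM available for the OS page cache, which is why no choice of
docValuesFormat alone can rescue performance on this hardware.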