Posted to solr-user@lucene.apache.org by Susheel Kumar <su...@gmail.com> on 2016/11/08 19:49:52 UTC

Re: OOM Error

Hello,

We ran into an OOM error again, after just two weeks. Below is the GC log
viewer graph. The first time we ran into this was after 3 months, and the
second time after two weeks. After the first incident we reduced the cache
size and increased the heap from 8 to 10G. Interestingly, the query and
ingestion load was like any other normal day: heap utilisation remained
stable and then suddenly jumped to 2x.

We are looking to reproduce this in a test environment by producing similar
queries/ingestion, but we are wondering whether we are running into a memory
leak or a bug like "SOLR-8922 - DocSetCollector can allocate massive garbage
on large indexes", which could cause this issue. We also have frequent
updates and are wondering whether not optimizing the index could lead to
this situation.

Any thoughts?

GC Viewer
====
https://www.dropbox.com/s/bb29ub5q2naljdl/gc_log_snapshot.png?dl=0




On Wed, Oct 26, 2016 at 10:47 AM, Susheel Kumar <su...@gmail.com>
wrote:

> Hi Toke,
>
> I think your guess is right.  We have ingestion running in batches.  We
> have 6 shards and 6 replicas on 12 VMs, with around 40+ million docs on
> each shard.
>
> Thanks everyone for the suggestions/pointers.
>
> Thanks,
> Susheel
>
> On Wed, Oct 26, 2016 at 1:52 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
> wrote:
>
>> On Tue, 2016-10-25 at 15:04 -0400, Susheel Kumar wrote:
>> > Thanks, Toke.  Analyzing GC logs helped to determine that it was a
>> > sudden
>> > death.
>>
>> > The peaks in last 20 mins... See   http://tinypic.com/r/n2zonb/9
>>
>> Peaks yes, but there is a pattern of
>>
>> 1) Stable memory use
>> 2) Temporary doubling of the memory used and a lot of GC
>> 3) Increased (relative to last stable period) but stable memory use
>> 4) Goto 2
>>
>> If I were to guess, I would say that you are running ingests in batches,
>> which temporarily causes 2 searchers to be open at the same time. That
>> is step 2 in the list above. After the batch ingest, the baseline moves
>> up, presumably because you have added quite a lot of documents relative
>> to the overall number of documents.
>>
>>
>> The temporary doubling of the baseline is hard to avoid, but I am
>> surprised by the amount of heap that you need in the stable periods.
>> Just to be clear: This is from a Solr with 8GB of heap handling only 1
>> shard of 20GB and you are using DocValues? How many documents do you
>> have in such a shard?
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>
>

Re: OOM Error

Posted by Susheel Kumar <su...@gmail.com>.
Thanks, Shawn, for looking into this. Your assumption is right: the end of
the graph is the OOM. I am trying to collect all the query and ingestion
numbers around 9:12, but I have one more observation and a question from
today.

I observed that 2-3 of the 12 VMs show high heap usage even though heavy
ingestion stopped more than an hour ago, while the other machines show
normal usage.  Does that tell us anything?

Snapshot 1 showing high usage of heap
===
https://www.dropbox.com/s/c1qy1s5nc9uo6cp/2016-11-09_15-55-24.png?dl=0

Snapshot  2 showing normal usage of heap
===
https://www.dropbox.com/s/9v016ilmhcahs28/2016-11-09_15-58-28.png?dl=0

The other question: we found that our ingestion batch size varies (from 200
to 4000+ docs, depending on queue size). I am asking the ingestion folks to
fix the batch size, but I am wondering whether it matters in terms of load
on Solr and heap usage if we submit small batches (say 500 docs max) more
frequently rather than bigger batches less frequently.  So far the bigger
batch size has not caused any issues except for these two incidents.
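
For illustration, capping the batch size on the ingestion side could look
roughly like the sketch below. The names fixed_batches and max_batch are
hypothetical, and the call that actually sends a batch to Solr is left out
on purpose; this only shows the chunking of a variable-sized queue drain
into batches of at most 500 docs.

```python
from itertools import islice


def fixed_batches(docs, max_batch=500):
    """Yield lists of at most `max_batch` docs from any iterable of docs."""
    it = iter(docs)
    while True:
        batch = list(islice(it, max_batch))
        if not batch:
            return
        yield batch


# Example: a queue drain of 4200 docs becomes eight full batches and one
# partial batch. Each batch would then be submitted separately (not shown).
sizes = [len(batch) for batch in fixed_batches(range(4200), max_batch=500)]
print(sizes)
```

Whether many small batches are easier on the heap than a few large ones is
exactly the open question; the sketch only makes the batch size predictable.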

Thanks,
Susheel





On Wed, Nov 9, 2016 at 10:19 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 11/8/2016 12:49 PM, Susheel Kumar wrote:
> > Ran into OOM Error again right after two weeks. Below is the GC log
> > viewer graph. The first time we run into this was after 3 months and
> > then second time in two weeks. After first incident reduced the cache
> > size and increase heap from 8 to 10G. Interestingly query and
> > ingestion load is like normal other days and heap utilisation remains
> > stable and suddenly jumps to x2.
>
> It looks like something happened at about 9:12:30 on that graph.  Do you
> know what that was?  Starting at about that time, GC times went through
> the roof and the allocated heap began a steady rise.  At about 9:15, a
> lot of garbage was freed up and GC times dropped way down again.  At
> about 9:18, the GC once again started taking a long time, and the used
> heap was still going up steadily. At about 9:21, the full GCs started --
> the wide black bars.  I assume that the end of the graph is the OOM.
>
> > We are looking to reproduce this in test environment by producing
> > similar queries/ingestion but wondering if running into some memory
> > leak or bug like "SOLR-8922 - DocSetCollector can allocate massive
> > garbage on large indexes" which can cause this issue. Also we have
> > frequent updates and wondering if not optimizing the index can result
> > into this situation
>
> It looks more like a problem with allocated memory that's NOT garbage
> than a problem with garbage, but I can't really rule anything out, and
> even what I've said below could be wrong.
>
> Most of the allocated heap is in the old generation.  If there's a bug
> in Solr causing this problem, it would probably be a memory leak, but
> SOLR-8922 doesn't talk about a leak.  A memory leak is always possible,
> but those have been rare in Solr.  The most likely problem is that
> something changed in your indexing or query patterns which required a
> lot more memory than what happened before that point.
>
> Thanks,
> Shawn
>
>

Re: OOM Error

Posted by Shawn Heisey <ap...@elyograg.org>.
On 11/8/2016 12:49 PM, Susheel Kumar wrote:
> Ran into OOM Error again right after two weeks. Below is the GC log
> viewer graph. The first time we run into this was after 3 months and
> then second time in two weeks. After first incident reduced the cache
> size and increase heap from 8 to 10G. Interestingly query and
> ingestion load is like normal other days and heap utilisation remains
> stable and suddenly jumps to x2. 

It looks like something happened at about 9:12:30 on that graph.  Do you
know what that was?  Starting at about that time, GC times went through
the roof and the allocated heap began a steady rise.  At about 9:15, a
lot of garbage was freed up and GC times dropped way down again.  At
about 9:18, the GC once again started taking a long time, and the used
heap was still going up steadily. At about 9:21, the full GCs started --
the wide black bars.  I assume that the end of the graph is the OOM.
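
If the used-heap-after-GC values can be exported from the GC log as
(time, used MB) pairs, the point where the rise starts could be located
with something like the rough sketch below. The function name, the
window size, the 2x factor, and the sample numbers are all assumptions
made up for illustration; they are not taken from the actual graph.

```python
from statistics import median


def first_jump(samples, factor=2.0, window=5):
    """Return the time of the first sample whose used heap is at least
    `factor` times the median of the preceding `window` samples.
    Returns None if no such sample exists."""
    for i in range(window, len(samples)):
        baseline = median(used for _, used in samples[i - window:i])
        when, used = samples[i]
        if used >= factor * baseline:
            return when
    return None


# Made-up data: (minutes since start, used heap in MB after GC).
samples = [(0, 3000), (1, 3100), (2, 3050), (3, 3000),
           (4, 3100), (5, 6400), (6, 7000), (7, 9500)]
print(first_jump(samples))
```

Knowing the time of the jump makes it easier to line it up with the
query and indexing logs from the same minute.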

> We are looking to reproduce this in test environment by producing
> similar queries/ingestion but wondering if running into some memory
> leak or bug like "SOLR-8922 - DocSetCollector can allocate massive
> garbage on large indexes" which can cause this issue. Also we have
> frequent updates and wondering if not optimizing the index can result
> into this situation

It looks more like a problem with allocated memory that's NOT garbage
than a problem with garbage, but I can't really rule anything out, and
even what I've said below could be wrong.

Most of the allocated heap is in the old generation.  If there's a bug
in Solr causing this problem, it would probably be a memory leak, but
SOLR-8922 doesn't talk about a leak.  A memory leak is always possible,
but those have been rare in Solr.  The most likely problem is that
something changed in your indexing or query patterns which required a
lot more memory than what happened before that point.

Thanks,
Shawn