Posted to solr-user@lucene.apache.org by Susheel Kumar <su...@gmail.com> on 2016/10/24 22:27:09 UTC

OOM Error

Hello,

I am seeing the OOM script kill Solr (Solr 6.0.0) on a couple of our VMs
today. So far our Solr cluster has been running fine, but today the Solr
instances on many of the VMs got killed. I have 8G of heap allocated on 64
GB machines, with 20+ GB of index on each shard.

What should we look at to find the exact root cause? I suspect some
query (wildcard prefix query, etc.) might have caused this issue.  The
ingestion and query load look normal, the same as other days.  I have the
Solr GC logs as well.

Thanks,
Susheel

Re: OOM Error

Posted by Susheel Kumar <su...@gmail.com>.
Thanks, Shawn, for looking into this. Your assumption is right: the end of
the graph is the OOM. I am trying to collect all the query & ingestion
numbers around 9:12, but one more observation and a question from today.

Observed that 2-3 VMs out of 12 show high heap usage even though heavy
ingestion stopped more than an hour ago, while the other machines show
normal usage.  Does that tell us anything?

Snapshot 1 showing high usage of heap
===
https://www.dropbox.com/s/c1qy1s5nc9uo6cp/2016-11-09_15-55-24.png?dl=0

Snapshot  2 showing normal usage of heap
===
https://www.dropbox.com/s/9v016ilmhcahs28/2016-11-09_15-58-28.png?dl=0

The other question: we found that our ingestion batch size varies (goes
from 200 to 4000+ docs depending on queue size). I am asking the ingestion
folks to fix the batch size, but I'm wondering whether it matters, in terms
of load on Solr and heap usage, if we submit small batches (like 500 docs
max) more frequently rather than submitting bigger batches less frequently.
So far the bigger batch size has not caused any issues except these two
incidents.
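For what it's worth, capping the batch size is cheap to do on the client side. A minimal sketch (the Solr URL and collection name below are placeholders, not our actual setup):

```python
import json
import urllib.request

# Hypothetical endpoint -- substitute your own host and collection.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"

def chunks(docs, size=500):
    """Yield fixed-size batches so a large queue never becomes one huge request."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def index_in_batches(docs, size=500):
    """POST each batch to Solr's JSON update handler."""
    for batch in chunks(docs, size):
        req = urllib.request.Request(
            SOLR_UPDATE_URL,
            data=json.dumps(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

Whether 500 or 4000 docs per request matters much for heap depends on document size; the point of the sketch is only that the cap need not wait on the ingestion pipeline itself.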

Thanks,
Susheel





On Wed, Nov 9, 2016 at 10:19 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 11/8/2016 12:49 PM, Susheel Kumar wrote:
> > Ran into OOM Error again right after two weeks. Below is the GC log
> > viewer graph. The first time we run into this was after 3 months and
> > then second time in two weeks. After first incident reduced the cache
> > size and increase heap from 8 to 10G. Interestingly query and
> > ingestion load is like normal other days and heap utilisation remains
> > stable and suddenly jumps to x2.
>
> It looks like something happened at about 9:12:30 on that graph.  Do you
> know what that was?  Starting at about that time, GC times went through
> the roof and the allocated heap began a steady rise.  At about 9:15, a
> lot of garbage was freed up and GC times dropped way down again.  At
> about 9:18, the GC once again started taking a long time, and the used
> heap was still going up steadily. At about 9:21, the full GCs started --
> the wide black bars.  I assume that the end of the graph is the OOM.
>
> > We are looking to reproduce this in test environment by producing
> > similar queries/ingestion but wondering if running into some memory
> > leak or bug like "SOLR-8922 - DocSetCollector can allocate massive
> > garbage on large indexes" which can cause this issue. Also we have
> > frequent updates and wondering if not optimizing the index can result
> > into this situation
>
> It looks more like a problem with allocated memory that's NOT garbage
> than a problem with garbage, but I can't really rule anything out, and
> even what I've said below could be wrong.
>
> Most of the allocated heap is in the old generation.  If there's a bug
> in Solr causing this problem, it would probably be a memory leak, but
> SOLR-8922 doesn't talk about a leak.  A memory leak is always possible,
> but those have been rare in Solr.  The most likely problem is that
> something changed in your indexing or query patterns which required a
> lot more memory than what happened before that point.
>
> Thanks,
> Shawn
>
>

Re: OOM Error

Posted by Shawn Heisey <ap...@elyograg.org>.
On 11/8/2016 12:49 PM, Susheel Kumar wrote:
> Ran into OOM Error again right after two weeks. Below is the GC log
> viewer graph. The first time we run into this was after 3 months and
> then second time in two weeks. After first incident reduced the cache
> size and increase heap from 8 to 10G. Interestingly query and
> ingestion load is like normal other days and heap utilisation remains
> stable and suddenly jumps to x2. 

It looks like something happened at about 9:12:30 on that graph.  Do you
know what that was?  Starting at about that time, GC times went through
the roof and the allocated heap began a steady rise.  At about 9:15, a
lot of garbage was freed up and GC times dropped way down again.  At
about 9:18, the GC once again started taking a long time, and the used
heap was still going up steadily. At about 9:21, the full GCs started --
the wide black bars.  I assume that the end of the graph is the OOM.

> We are looking to reproduce this in test environment by producing
> similar queries/ingestion but wondering if running into some memory
> leak or bug like "SOLR-8922 - DocSetCollector can allocate massive
> garbage on large indexes" which can cause this issue. Also we have
> frequent updates and wondering if not optimizing the index can result
> into this situation

It looks more like a problem with allocated memory that's NOT garbage
than a problem with garbage, but I can't really rule anything out, and
even what I've said below could be wrong.

Most of the allocated heap is in the old generation.  If there's a bug
in Solr causing this problem, it would probably be a memory leak, but
SOLR-8922 doesn't talk about a leak.  A memory leak is always possible,
but those have been rare in Solr.  The most likely problem is that
something changed in your indexing or query patterns which required a
lot more memory than what happened before that point.

Thanks,
Shawn


Re: OOM Error

Posted by Susheel Kumar <su...@gmail.com>.
Hello,

Ran into the OOM error again, right after two weeks. Below is the GC log
viewer graph.  The first time we ran into this was after 3 months, and the
second time was two weeks later. After the first incident we reduced the
cache size and increased the heap from 8 to 10G.  Interestingly, query and
ingestion load is normal, like other days, and heap utilisation remains
stable, then suddenly jumps to 2x.

We are looking to reproduce this in a test environment by producing similar
queries/ingestion, but wondering if we are running into some memory leak or
bug like "SOLR-8922 - DocSetCollector can allocate massive garbage on large
indexes" which could cause this issue.  Also, we have frequent updates and
wonder if not optimizing the index can result in this situation.

Any thoughts ?

GC Viewer
====
https://www.dropbox.com/s/bb29ub5q2naljdl/gc_log_snapshot.png?dl=0




On Wed, Oct 26, 2016 at 10:47 AM, Susheel Kumar <su...@gmail.com>
wrote:

> Hi Toke,
>
> I think your guess is right.  We have ingestion running in batches.  We
> have 6 shards & 6 replicas on 12 VM's each around 40+ million docs on each
> shard.
>
> Thanks everyone for the suggestions/pointers.
>
> Thanks,
> Susheel
>
> On Wed, Oct 26, 2016 at 1:52 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
> wrote:
>
>> On Tue, 2016-10-25 at 15:04 -0400, Susheel Kumar wrote:
>> > Thanks, Toke.  Analyzing GC logs helped to determine that it was a
>> > sudden
>> > death.
>>
>> > The peaks in last 20 mins... See   http://tinypic.com/r/n2zonb/9
>>
>> Peaks yes, but there is a pattern of
>>
>> 1) Stable memory use
>> 2) Temporary doubling of the memory used and a lot of GC
>> 3) Increased (relative to last stable period) but stable memory use
>> 4) Goto 2
>>
>> Should I guess, I would say that you are running ingests in batches,
>> which temporarily causes 2 searchers to be open at the same time. That
>> is 2 in the list above. After the batch ingest, the baseline moves up,
>> assumedly because your have added quite a lot of documents, relative to
>> the overall number of documents.
>>
>>
>> The temporary doubling of the baseline is hard to avoid, but I am
>> surprised of the amount of heap that you need in the stable periods.
>> Just to be clear: This is from a Solr with 8GB of heap handling only 1
>> shard of 20GB and you are using DocValues? How many documents do you
>> have in such a shard?
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>
>

Re: OOM Error

Posted by Susheel Kumar <su...@gmail.com>.
Hi Toke,

I think your guess is right.  We have ingestion running in batches.  We
have 6 shards & 6 replicas on 12 VMs, with around 40+ million docs on each
shard.

Thanks everyone for the suggestions/pointers.

Thanks,
Susheel

On Wed, Oct 26, 2016 at 1:52 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> On Tue, 2016-10-25 at 15:04 -0400, Susheel Kumar wrote:
> > Thanks, Toke.  Analyzing GC logs helped to determine that it was a
> > sudden
> > death.
>
> > The peaks in last 20 mins... See   http://tinypic.com/r/n2zonb/9
>
> Peaks yes, but there is a pattern of
>
> 1) Stable memory use
> 2) Temporary doubling of the memory used and a lot of GC
> 3) Increased (relative to last stable period) but stable memory use
> 4) Goto 2
>
> Should I guess, I would say that you are running ingests in batches,
> which temporarily causes 2 searchers to be open at the same time. That
> is 2 in the list above. After the batch ingest, the baseline moves up,
> assumedly because your have added quite a lot of documents, relative to
> the overall number of documents.
>
>
> The temporary doubling of the baseline is hard to avoid, but I am
> surprised of the amount of heap that you need in the stable periods.
> Just to be clear: This is from a Solr with 8GB of heap handling only 1
> shard of 20GB and you are using DocValues? How many documents do you
> have in such a shard?
>
> - Toke Eskildsen, State and University Library, Denmark
>

Re: OOM Error

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2016-10-25 at 15:04 -0400, Susheel Kumar wrote:
> Thanks, Toke.  Analyzing GC logs helped to determine that it was a
> sudden
> death.  

> The peaks in last 20 mins... See   http://tinypic.com/r/n2zonb/9

Peaks yes, but there is a pattern of 

1) Stable memory use
2) Temporary doubling of the memory used and a lot of GC
3) Increased (relative to last stable period) but stable memory use
4) Goto 2

If I had to guess, I would say that you are running ingests in batches,
which temporarily causes 2 searchers to be open at the same time. That
is step 2 in the list above. After the batch ingest, the baseline moves up,
presumably because you have added quite a lot of documents, relative to
the overall number of documents.


The temporary doubling of the baseline is hard to avoid, but I am
surprised by the amount of heap that you need in the stable periods.
Just to be clear: This is from a Solr with 8GB of heap handling only 1
shard of 20GB and you are using DocValues? How many documents do you
have in such a shard?

- Toke Eskildsen, State and University Library, Denmark

Re: OOM Error

Posted by Erick Erickson <er...@gmail.com>.
Off the top of my head:

a) Should the below JVM parameter be included for Prod to get heap dump

Makes sense. It may produce quite a large dump file, but then this is
an extraordinary situation so that's probably OK.

b) Currently OOM script just kills the Solr instance. Shouldn't it be
enhanced to wait and restart Solr instance

Personally, I don't think so. IMO there's no real point in restarting
Solr; you have to address the issue, as this situation is likely to
recur. Restarting Solr may hide this very serious problem, and how
would you even know to look? Restarting Solr could potentially lead to
a long involved process of wondering why selected queries seem to fail,
without noticing that the OOM script killed Solr. Having the default
_not_ restart Solr forces you to notice.

If you have to change the script to restart Solr, you also know that
you made the change and you should _really_ notify ops that they
should monitor this situation.

I admit this can be argued either way; personally, I'd rather "fail
fast and often".

Best,
Erick

On Tue, Oct 25, 2016 at 7:03 PM, Susheel Kumar <su...@gmail.com> wrote:
> Agree, Pushkar.  I had docValues for sorting / faceting fields from
> begining (since I setup Solr 6.0).  So good on that side. I am going to
> analyze the queries to find any potential issue. Two questions which I am
> puzzling with
>
> a) Should the below JVM parameter be included for Prod to get heap dump
>
> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump"
>
> b) Currently OOM script just kills the Solr instance. Shouldn't it be
> enhanced to wait and restart Solr instance
>
> Thanks,
> Susheel
>
>
>
>
> On Tue, Oct 25, 2016 at 7:35 PM, Pushkar Raste <pu...@gmail.com>
> wrote:
>
>> You should look into using docValues.  docValues are stored off heap and
>> hence you would be better off than just bumping up the heap.
>>
>> Don't enable docValues on existing fields unless you plan to reindex data
>> from scratch.
>>
>> On Oct 25, 2016 3:04 PM, "Susheel Kumar" <su...@gmail.com> wrote:
>>
>> > Thanks, Toke.  Analyzing GC logs helped to determine that it was a sudden
>> > death.  The peaks in last 20 mins... See   http://tinypic.com/r/n2zonb/9
>> >
>> > Will look into the queries more closer and also adjusting the cache
>> sizing.
>> >
>> >
>> > Thanks,
>> > Susheel
>> >
>> > On Tue, Oct 25, 2016 at 3:37 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
>> > wrote:
>> >
>> > > On Mon, 2016-10-24 at 18:27 -0400, Susheel Kumar wrote:
>> > > > I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
>> > > > today. So far our solr cluster has been running fine but suddenly
>> > > > today many of the VM's Solr instance got killed.
>> > >
>> > > As you have the GC-logs, you should be able to determine if it was a
>> > > slow death (e.g. caches gradually being filled) or a sudden one (e.g.
>> > > grouping or faceting on a large new non-DocValued field).
>> > >
>> > > Try plotting the GC logs with time on the x-axis and free memory after
>> > > GC on the y-axis. It it happens to be a sudden death, the last lines in
>> > > solr.log might hold a clue after all.
>> > >
>> > > - Toke Eskildsen, State and University Library, Denmark
>> > >
>> >
>>

Re: OOM Error

Posted by Tom Evans <te...@googlemail.com>.
On Wed, Oct 26, 2016 at 4:53 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 10/25/2016 8:03 PM, Susheel Kumar wrote:
>> Agree, Pushkar.  I had docValues for sorting / faceting fields from
>> begining (since I setup Solr 6.0).  So good on that side. I am going to
>> analyze the queries to find any potential issue. Two questions which I am
>> puzzling with
>>
>> a) Should the below JVM parameter be included for Prod to get heap dump
>>
>> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump"
>
> A heap dump can take a very long time to complete, and there may not be
> enough memory in the machine to start another instance of Solr until the
> first one has finished the heap dump.  Also, I do not know whether Java
> would release the listening port before the heap dump finishes.  If not,
> then a new instance would not be able to start immediately.
>
> If a different heap dump file is created each time, that might lead to
> problems with disk space after repeated dumps.  I don't know how the
> option works.
>
>> b) Currently OOM script just kills the Solr instance. Shouldn't it be
>> enhanced to wait and restart Solr instance
>
> As long as there is a problem causing OOMs, it seems rather pointless to
> start Solr right back up, as another OOM is likely.  The safest thing to
> do is kill Solr (since its operation would be unpredictable after OOM)
> and let the admin sort the problem out.
>

Occasionally our cloud nodes can OOM when particularly complex
faceting is performed. The current OOM management can be exceedingly
annoying: a user makes a too-complex analysis request, bringing
down one server and taking it out of the balancer. The user gets fed up
with no response, so reloads the page, re-submitting the analysis and
bringing down the next server in the cluster.

Lather, rinse, repeat - and then you get to have a meeting to discuss
why we invest so much in HA infrastructure that can be made non-HA by
one user with a complex query. In those meetings it is much harder to
justify not restarting.

Cheers

Tom

Re: OOM Error

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/25/2016 8:03 PM, Susheel Kumar wrote:
> Agree, Pushkar.  I had docValues for sorting / faceting fields from
> begining (since I setup Solr 6.0).  So good on that side. I am going to
> analyze the queries to find any potential issue. Two questions which I am
> puzzling with
>
> a) Should the below JVM parameter be included for Prod to get heap dump
>
> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump"

A heap dump can take a very long time to complete, and there may not be
enough memory in the machine to start another instance of Solr until the
first one has finished the heap dump.  Also, I do not know whether Java
would release the listening port before the heap dump finishes.  If not,
then a new instance would not be able to start immediately.

If a different heap dump file is created each time, that might lead to
problems with disk space after repeated dumps.  I don't know how the
option works.

> b) Currently OOM script just kills the Solr instance. Shouldn't it be
> enhanced to wait and restart Solr instance

As long as there is a problem causing OOMs, it seems rather pointless to
start Solr right back up, as another OOM is likely.  The safest thing to
do is kill Solr (since its operation would be unpredictable after OOM)
and let the admin sort the problem out.

Thanks,
Shawn


Re: OOM Error

Posted by Susheel Kumar <su...@gmail.com>.
Agree, Pushkar.  I have had docValues on the sorting / faceting fields from
the beginning (since I set up Solr 6.0), so good on that side. I am going
to analyze the queries to find any potential issues. Two questions I am
puzzling over:

a) Should the below JVM parameter be included for Prod to get heap dump

"-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump"

b) Currently OOM script just kills the Solr instance. Shouldn't it be
enhanced to wait and restart Solr instance

Thanks,
Susheel




On Tue, Oct 25, 2016 at 7:35 PM, Pushkar Raste <pu...@gmail.com>
wrote:

> You should look into using docValues.  docValues are stored off heap and
> hence you would be better off than just bumping up the heap.
>
> Don't enable docValues on existing fields unless you plan to reindex data
> from scratch.
>
> On Oct 25, 2016 3:04 PM, "Susheel Kumar" <su...@gmail.com> wrote:
>
> > Thanks, Toke.  Analyzing GC logs helped to determine that it was a sudden
> > death.  The peaks in last 20 mins... See   http://tinypic.com/r/n2zonb/9
> >
> > Will look into the queries more closer and also adjusting the cache
> sizing.
> >
> >
> > Thanks,
> > Susheel
> >
> > On Tue, Oct 25, 2016 at 3:37 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
> > wrote:
> >
> > > On Mon, 2016-10-24 at 18:27 -0400, Susheel Kumar wrote:
> > > > I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
> > > > today. So far our solr cluster has been running fine but suddenly
> > > > today many of the VM's Solr instance got killed.
> > >
> > > As you have the GC-logs, you should be able to determine if it was a
> > > slow death (e.g. caches gradually being filled) or a sudden one (e.g.
> > > grouping or faceting on a large new non-DocValued field).
> > >
> > > Try plotting the GC logs with time on the x-axis and free memory after
> > > GC on the y-axis. It it happens to be a sudden death, the last lines in
> > > solr.log might hold a clue after all.
> > >
> > > - Toke Eskildsen, State and University Library, Denmark
> > >
> >
>

Re: OOM Error

Posted by Pushkar Raste <pu...@gmail.com>.
You should look into using docValues.  docValues are stored off heap and
hence you would be better off than just bumping up the heap.

Don't enable docValues on existing fields unless you plan to reindex data
from scratch.
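For reference, a docValues field in the schema looks roughly like this (the field name is only an example; per the caveat above, don't flip it on for an existing field without reindexing):

```xml
<!-- hypothetical facet/sort field; docValues="true" keeps its values off-heap -->
<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
```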

On Oct 25, 2016 3:04 PM, "Susheel Kumar" <su...@gmail.com> wrote:

> Thanks, Toke.  Analyzing GC logs helped to determine that it was a sudden
> death.  The peaks in last 20 mins... See   http://tinypic.com/r/n2zonb/9
>
> Will look into the queries more closer and also adjusting the cache sizing.
>
>
> Thanks,
> Susheel
>
> On Tue, Oct 25, 2016 at 3:37 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
> wrote:
>
> > On Mon, 2016-10-24 at 18:27 -0400, Susheel Kumar wrote:
> > > I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
> > > today. So far our solr cluster has been running fine but suddenly
> > > today many of the VM's Solr instance got killed.
> >
> > As you have the GC-logs, you should be able to determine if it was a
> > slow death (e.g. caches gradually being filled) or a sudden one (e.g.
> > grouping or faceting on a large new non-DocValued field).
> >
> > Try plotting the GC logs with time on the x-axis and free memory after
> > GC on the y-axis. It it happens to be a sudden death, the last lines in
> > solr.log might hold a clue after all.
> >
> > - Toke Eskildsen, State and University Library, Denmark
> >
>

Re: OOM Error

Posted by Susheel Kumar <su...@gmail.com>.
Thanks, Toke.  Analyzing the GC logs helped to determine that it was a
sudden death.  The peaks are in the last 20 mins... See   http://tinypic.com/r/n2zonb/9

Will look into the queries more closely and also adjust the cache sizing.


Thanks,
Susheel

On Tue, Oct 25, 2016 at 3:37 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> On Mon, 2016-10-24 at 18:27 -0400, Susheel Kumar wrote:
> > I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
> > today. So far our solr cluster has been running fine but suddenly
> > today many of the VM's Solr instance got killed.
>
> As you have the GC-logs, you should be able to determine if it was a
> slow death (e.g. caches gradually being filled) or a sudden one (e.g.
> grouping or faceting on a large new non-DocValued field).
>
> Try plotting the GC logs with time on the x-axis and free memory after
> GC on the y-axis. It it happens to be a sudden death, the last lines in
> solr.log might hold a clue after all.
>
> - Toke Eskildsen, State and University Library, Denmark
>

Re: OOM Error

Posted by William Bell <bi...@gmail.com>.
I would also note that 8GB is cutting it close for a Java 8 JVM running
Solr. We use 12GB and have had issues with 8GB, but your mileage may vary.

On Tue, Oct 25, 2016 at 1:37 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> On Mon, 2016-10-24 at 18:27 -0400, Susheel Kumar wrote:
> > I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
> > today. So far our solr cluster has been running fine but suddenly
> > today many of the VM's Solr instance got killed.
>
> As you have the GC-logs, you should be able to determine if it was a
> slow death (e.g. caches gradually being filled) or a sudden one (e.g.
> grouping or faceting on a large new non-DocValued field).
>
> Try plotting the GC logs with time on the x-axis and free memory after
> GC on the y-axis. It it happens to be a sudden death, the last lines in
> solr.log might hold a clue after all.
>
> - Toke Eskildsen, State and University Library, Denmark
>



-- 
Bill Bell
billnbell@gmail.com
cell 720-256-8076

Re: OOM Error

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2016-10-24 at 18:27 -0400, Susheel Kumar wrote:
> I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
> today. So far our solr cluster has been running fine but suddenly
> today many of the VM's Solr instance got killed.

As you have the GC-logs, you should be able to determine if it was a
slow death (e.g. caches gradually being filled) or a sudden one (e.g.
grouping or faceting on a large new non-DocValued field).

Try plotting the GC logs with time on the x-axis and free memory after
GC on the y-axis. If it happens to be a sudden death, the last lines in
solr.log might hold a clue after all.

- Toke Eskildsen, State and University Library, Denmark
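That plot can be scripted. A rough sketch that pulls (uptime, heap-after-GC) points out of Java 8-style GC log lines; the regexes are tuned to that format and may need adjusting for other JVM versions or flags:

```python
import re

# Heap transitions look like "5242880K->1048576K(8388608K)". A single GC line
# may contain several (young gen, old gen, whole heap); the last one on the
# line is usually the whole heap.
HEAP = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")
# JVM uptime in seconds, e.g. "12.345: [GC ..."
UPTIME = re.compile(r"(\d+\.\d+): \[")

def heap_after_gc(lines):
    """Return (uptime_seconds, heap_used_after_gc_kb) points for plotting."""
    points = []
    for line in lines:
        heaps = HEAP.findall(line)
        uptime = UPTIME.search(line)
        if heaps and uptime:
            before, after, total = heaps[-1]
            points.append((float(uptime.group(1)), int(after)))
    return points
```

Feed the points to any plotting tool: a baseline that climbs gradually is the slow-death signature (caches filling), while a sudden jump points at one expensive request.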

Re: OOM Error

Posted by Pushkar Raste <pu...@gmail.com>.
Did you look into the heap dump ?

On Mon, Oct 24, 2016 at 6:27 PM, Susheel Kumar <su...@gmail.com>
wrote:

> Hello,
>
> I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
> today. So far our solr cluster has been running fine but suddenly today
> many of the VM's Solr instance got killed. I had 8G of heap allocated on 64
> GB machines with 20+ GB of index size on each shards.
>
> What could be looked to find the exact root cause. I am suspecting of any
> query (wildcard prefix query etc.) might have caused this issue.  The
> ingestion and query load looks normal as other days.  I have the solr GC
> logs as well.
>
> Thanks,
> Susheel
>

Re: OOM Error

Posted by Susheel Kumar <su...@gmail.com>.
Thanks, Pushkar. Solr was already killed by the OOM script, so I believe we
can't get a heap dump.

Hi Shawn, I used the Solr service scripts to launch Solr, and it looks like
bin/solr doesn't include the below JVM parameter by default.

"-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump"

Is that something we should add to the Solr launch scripts, or maybe at
least include it commented out (disabled)?
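If others want to experiment meanwhile, one place the flag could live is the include file that bin/solr sources (path varies by install; the dump directory below is a placeholder):

```shell
# In solr.in.sh -- SOLR_OPTS is appended to the JVM arguments at startup.
# Make sure the dump directory exists and has room for a heap-sized file.
SOLR_OPTS="$SOLR_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/dumps"
```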

Thanks,
Susheel

On Mon, Oct 24, 2016 at 8:20 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/24/2016 4:27 PM, Susheel Kumar wrote:
> > I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
> > today. So far our solr cluster has been running fine but suddenly today
> > many of the VM's Solr instance got killed. I had 8G of heap allocated on
> 64
> > GB machines with 20+ GB of index size on each shards.
> >
> > What could be looked to find the exact root cause. I am suspecting of any
> > query (wildcard prefix query etc.) might have caused this issue.  The
> > ingestion and query load looks normal as other days.  I have the solr GC
> > logs as well.
>
> It is unlikely that you will be able to figure out exactly what is using
> too much memory from Solr logs.  The place where the OOM happens may be
> completely unrelated to the parts of the system that are using large
> amounts of memory.  That point is just the place where Java ran out of
> memory to allocate, which could happen when allocating a tiny amount of
> memory just as easily as it could happen when allocating a large amount
> of memory.
>
> What I can tell you has been placed on this wiki page:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
>
> Thanks,
> Shawn
>
>

Re: OOM Error

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/24/2016 4:27 PM, Susheel Kumar wrote:
> I am seeing OOM script killed solr (solr 6.0.0) on couple of our VM's
> today. So far our solr cluster has been running fine but suddenly today
> many of the VM's Solr instance got killed. I had 8G of heap allocated on 64
> GB machines with 20+ GB of index size on each shards.
>
> What could be looked to find the exact root cause. I am suspecting of any
> query (wildcard prefix query etc.) might have caused this issue.  The
> ingestion and query load looks normal as other days.  I have the solr GC
> logs as well.

It is unlikely that you will be able to figure out exactly what is using
too much memory from Solr logs.  The place where the OOM happens may be
completely unrelated to the parts of the system that are using large
amounts of memory.  That point is just the place where Java ran out of
memory to allocate, which could happen when allocating a tiny amount of
memory just as easily as it could happen when allocating a large amount
of memory.

What I can tell you has been placed on this wiki page:

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Thanks,
Shawn