Posted to solr-user@lucene.apache.org by Dominique Bejean <do...@eolya.fr> on 2020/05/17 08:05:16 UTC

Solrcloud 6.6 becomes nuts

Hi,

I have a six-node SolrCloud cluster whose six nodes suddenly all fail with
OOM at the same time.
This can happen even when the cluster is not under heavy load and there
is no indexing.

I do not see any reason for this to happen. Here is a description of the
issue. Thank you for your suggestions and advice.


One or two hours before the nodes stop with OOM, we see this scenario on
all six nodes during the same five-minute time frame:
* a little more young GC: from one per second (duration < 0.05 s) to
one every two or three seconds (duration < 0.15 s)
* full GCs start occurring every 5 s with 0 bytes reclaimed
* young GCs start reclaiming fewer bytes
* long full GCs start reclaiming bytes, but less and less each time
* then no more young GC
Here are the GC graphs: https://www.eolya.fr/solr_issue_gc.png
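
For reference, the graphs come from the JVM GC logs. With Java 8 the logging
options in solr.in.sh look roughly like this (a sketch from memory; bin/solr
appends the -Xloggc path and rotation flags itself):

  # Java 8 GC logging flags; the log file location and rotation are added by bin/solr
  GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution \
    -XX:+PrintGCApplicationStoppedTime"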


Just before the problem occurs:
* no increase in requests per second
* no update/commit/merge
* CPU usage and load are low
* disk I/O is low
After the problem starts, requests become longer and longer, but still with
no increase in CPU usage or disk I/O.


During the last occurrence, we dumped the threads on one node just before
the OOM, but unfortunately more than one hour after the problem started.
85% of the threads (more than 3000) are BLOCKED and related to log4j;
Solr is either trying to log a slow query or to log a problem in the
request handler:
at org.apache.solr.common.SolrException.log(SolrException.java:148)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:204)

This high count of BLOCKED threads is more a consequence than a cause. We
will dump threads every minute until the next occurrence.
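
For the record, the periodic dumps will be taken with a small loop along
these lines (the pid file and output paths are placeholders):

  # dump the Solr JVM threads every 60 seconds until the next incident
  SOLR_PID=$(cat /var/solr/solr-8983.pid)
  while true; do
    jstack -l "$SOLR_PID" > /var/log/solr/threads-$(date +%Y%m%d-%H%M%S).txt
    sleep 60
  done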


About the Solr environment:
* Solr 6.6
* Oracle Java 1.8.0_112 (25.112-b15)

* 1 collection with 10 million small documents
* 3 shards x 2 replicas
* 3.5 million docs per core
* 90 GB index size per core

* Servers with 6 processors and 90 GB of RAM
* Swappiness set to 1, nearly no swap used
* 4 GB heap, with usage roughly between 25% and 60% before young GC, and one
full GC (3 seconds) every 15 to 30 minutes when all is fine

* Default JVM settings with CMS GC (sketched below)
* JMX enabled
* Peak average requests per second on one core: 170, but during the last
issue the average was only 30!
* Average request time: < 30 ms
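
("Default JVM settings" means the stock GC_TUNE shipped with bin/solr, which
from memory is roughly the following:)

  GC_TUNE="-XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 \
    -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark \
    -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 \
    -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled"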

About updates:
* Very few adds/updates in general
* Some deleteByQuery calls (nearly 2000 per day), but none just before the
problem occurs
* autoCommit maxTime: 15000 ms
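
In solrconfig.xml this corresponds roughly to (the openSearcher value is an
assumption on my side):

  <autoCommit>
    <maxTime>15000</maxTime>
    <!-- assumed; with false, hard commits only flush segments to disk -->
    <openSearcher>false</openSearcher>
  </autoCommit>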

About queries:
* Queries are standard queries or suggesters
* Queries generate facets, but no field has a very high number of unique
values
* No grouping
* Heavy use of function queries for relevance computation (a typical request
is sketched below)
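
To give an idea of the query shape, a typical request looks something like
this (field names and the boost function are invented for illustration):

  http://host:8983/solr/<collection>/select?q=running+shoes
      &defType=edismax&qf=title+description
      &boost=product(popularity,recip(ms(NOW,publish_date),3.16e-11,1,1))
      &facet=true&facet.field=brand_s&facet.field=category_s&facet.limit=20
      &rows=10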


Thank you.

Dominique

Re: Solrcloud 6.6 becomes nuts

Posted by Dominique Bejean <do...@eolya.fr>.
Mikhail,


Thank you for your response.


--- For the logs

On the non-leader replicas there are no errors in the log, only WARNs due to
slow queries.

On the leader replica, there are these errors:

* Twice per minute during the whole day before the problem starts, and also
after it starts:
RequestHandlerBase org.apache.solr.common.SolrException: Collection: xxxxxx
not found
where xxxxxx is the alias name pointing to the collection

* Just after the problem starts:
2020-05-13 15:24:41.450 ERROR (qtp1682092198-315202) [c:xxxxxx_2 s:shard3
r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:[
http://XXXXXX127:8983/solr/xxxxxx_2_shard1_replica1,
http://XXXXXX132:8983/solr/xxxxxx_2_shard2_replica0]
2020-05-13 15:24:41.451 ERROR (qtp1682092198-315202) [c:xxxxxx_2 s:shard3
r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.s.HttpSolrCall
null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:[
http://XXXXXX127:8983/solr/xxxxxx_2_shard1_replica1,
http://XXXXXX132:8983/solr/xxxxxx_2_shard2_replica0]

2020-05-13 15:25:49.642 ERROR (qtp1682092198-315193) [c:xxxxxx_2 s:shard3
r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.s.HttpSolrCall
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
timeout expired: 51815/50000 ms

and later, until the JVM hangs:
2020-05-13 15:58:54.397 ERROR (qtp1682092198-316314) [c:xxxxxx_2 s:shard3
r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: no servers hosting shard:
xxxxxx_2_shard2

No OOM errors in the Solr logs, just the OOM killer script log:
Running OOM killer script for process 4488 for Solr on port 8983
Killed process 4488
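
(That message comes from bin/oom_solr.sh, which bin/solr registers with the
JVM roughly as below, so the heap really was exhausted even though Solr
itself logged no OOM exception; the install path is an assumption:)

  # registered by bin/solr at startup; the script logs the message above and kill -9s the pid
  -XX:OnOutOfMemoryError="/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs"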


--- For the heap dump

I have a dump for one shard leader, taken just before the OOM script killed
the JVM but more than one hour after the problem started. I will take a look.
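
For the next occurrence we can also capture heap dumps more systematically,
for example (paths are placeholders):

  # automatic .hprof dump when the heap is exhausted, via solr.in.sh
  SOLR_OPTS="$SOLR_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/dumps"

  # or an on-demand dump of live objects while a node is degrading
  jmap -dump:live,format=b,file=/var/solr/dumps/solr-$(date +%s).hprof $SOLR_PID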

Regards.

Dominique

Re: Solrcloud 6.6 becomes nuts

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello, Dominique.
What did it log? Which exception?
Do you have a chance to review a heap dump? What consumed the whole heap?

-- 
Sincerely yours
Mikhail Khludnev

Re: Solrcloud 6.6 becomes nuts

Posted by Dominique Bejean <do...@eolya.fr>.
Hi,

We had the problem again a few days ago.

I have noticed that each time the problem occurs, the old generation of the
heap suddenly grows. Its size is generally between 0.5 and 1.5 GB against a
3 GB limit. Within 4 minutes the old generation grows to 3 GB and never goes
down, as consecutive GCs reclaim 0 bytes.

I have a heap dump from a previous occurrence. Nearly 1.7 GB of memory is an
array of 6 BoostQuery objects, each containing a huge array of
PhraseQuery$PhraseWeight objects.

I also noticed that a few queries (nearly 10 per minute) use the parameter
"debug=track&debug=timing". I have asked the customer not to enable debug in
production. Can query debug cause such high memory usage?
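
If the client cannot be fixed quickly, debug could also be forced off on the
server side with an invariant on the handler, something like:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="invariants">
      <!-- invariants override whatever debug=... the clients send -->
      <str name="debug">false</str>
    </lst>
  </requestHandler>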

Regards.

Dominique

Re: Solrcloud 6.6 becomes nuts

Posted by Dominique Bejean <do...@eolya.fr>.
Hi Shawn,

In fact, I was using logs from a core at WARN log level, so with only slow
queries (> 500 ms).

I just checked the logs of a core at INFO level, with all queries (we set
the log level to INFO for one core after the previous crash), and there is
no increase in queries using these two facets when the problem starts. There
are nearly 150 queries per minute faceting on the 750K unique terms field
during the 3 hours before the problem occurs, and no increase during the few
minutes before and when the problem starts.

I can't see anything specific in the logs at the time the problem starts.

Regards

Dominique

Re: Solrcloud 6.6 becomes nuts

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/17/2020 4:18 PM, Dominique Bejean wrote:
> I was not thinking that queries using facets on fields with a high number
> of unique values but with low hit counts could be the origin of this problem.

Performance for most things does not depend on numFound (hit count) or
the rows parameter.  The number of terms in the field and the total
number of documents in the index matter a lot more.

If you do facets or grouping on a field with 750K unique terms, it's 
going to be very slow and require a LOT of memory.  I would not be 
surprised to see it require more than 4GB.  These features are designed 
to work best with fields that have a relatively small number of possible 
values.
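
One thing that often helps is making sure such fields have docValues, so that
faceting reads column-oriented data through the OS page cache instead of
building large on-heap structures.  That requires re-indexing, and the field
name below is only an example:

  <field name="product_ref_s" type="string" indexed="true" stored="false"
         docValues="true"/>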

Thanks,
Shawn

Re: Solrcloud 6.6 becomes nuts

Posted by Dominique Bejean <do...@eolya.fr>.
Hi Shawn,

There is no OOM error in the logs. I gave more details in my response to Mikhail.

The problem starts with full GCs near 15:20, but young GC behavior changed a
little starting at 15:10.
Here is the heap usage before and after GC during this period:
https://www.eolya.fr/solr_issue_heap_before_after.png

There is no grouping, but there is faceting.
The collection contains 10,000,000 documents.

Two fields contain 60,000 and 750,000 unique values respectively.

These two fields were used for faceting 1 to 10 times per hour before the
problem starts.
They are used a lot during the 20 minutes after the problem starts:
* 50 times for the field with 750,000 unique values
* 250 times for the field with 60,000 unique values

The hit counts for these queries are mostly under 10, a couple of times
between 100 and 1000.
Once, the hit count is 2000 for the field with 60,000 unique values.

On the other hand, these queries are very long.

We will investigate this!

I was not thinking that queries using facets on fields with a high number
of unique values but with low hit counts could be the origin of this problem.


Regards

Dominique

Re: Solrcloud 6.6 becomes nuts

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/17/2020 2:05 AM, Dominique Bejean wrote:
> One or two hours before the nodes stop with OOM, we see this scenario on
> all six nodes during the same five minutes time frame :
> * a little bit more young gc : from one each second (duration<0.05secs) to
> one each two or three seconds (duration <0.15 sec)
> * full gc start occurs each 5sec with 0 bytes reclaimed
> * young gc start reclaim less bytes
> * long full gc start reclaim bytes but with less and less reclaimed bytes
> * then no more young GC
> Here are GC graphs : https://www.eolya.fr/solr_issue_gc.png

Do you have the OutOfMemoryException in the solr log?  From the graph 
you provided, it does look likely that it was heap memory on the OOME, 
I'd just like to be sure, by seeing the logged exception.

Between 15:00 and 15:30, something happened which suddenly required 
additional heap memory.  Do you have any idea what that was?  If you can 
zoom in on the graph, you could get a more accurate time for this.  I am 
looking specifically at the "heap usage before GC" graph.  The "heap 
usage after GC" graph that gceasy makes, which has not been included 
here, is potentially more useful.

I found that I most frequently ran into memory problems when I executed 
a data mining query -- doing facets or grouping on a high cardinality 
field, for example.  Those kinds of queries required a LOT of extra memory.

If the servers have any memory left, you might need to increase the max 
heap beyond where it currently sits.  To handle your indexes and 
queries, Solr may simply require more memory than you have allowed.
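
For example, in solr.in.sh (the value is only an illustration; size it to
what the servers can actually spare):

  # raise the JVM heap from the current 4g
  SOLR_HEAP="8g"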

Thanks,
Shawn