Posted to solr-user@lucene.apache.org by Maulin Rathod <mr...@asite.com> on 2019/05/13 14:26:31 UTC

Solr node goes into recovery mode

Hi,

We are using Solr 6.1 with 2 shards, each shard with 1 replica, i.e. we have 4 server nodes in total (each node is assigned 60 GB of RAM).

Recently we have been observing an issue where a Solr node (any random node) goes into recovery mode and stops responding.

We have enough memory allocated to Solr (a 60 GB heap), and the system also has enough memory (300 GB).

We analyzed the GC logs and found a GC pause of 29.6583943 seconds when the problem happened. Can this GC pause make the node unavailable and send it into recovery mode, or could there be some other reason?

Please note we have set zkClientTimeout to 10 minutes (zkClientTimeout=600000) so that ZooKeeper will not consider this node unavailable during long GC pauses.

Solr GC Logs
==========

{Heap before GC invocations=10940 (full 14):
par new generation   total 17476288K, used 14724911K [0x0000000080000000, 0x0000000580000000, 0x0000000580000000)
  eden space 13981056K, 100% used [0x0000000080000000, 0x00000003d5560000, 0x00000003d5560000)
  from space 3495232K,  21% used [0x00000003d5560000, 0x0000000402bcbdb0, 0x00000004aaab0000)
  to   space 3495232K,   0% used [0x00000004aaab0000, 0x00000004aaab0000, 0x0000000580000000)
concurrent mark-sweep generation total 62914560K, used 27668932K [0x0000000580000000, 0x0000001480000000, 0x0000001480000000)
Metaspace       used 47602K, capacity 48370K, committed 49860K, reserved 51200K
2019-05-13T12:23:19.103+0100: 174643.550: [GC (Allocation Failure) 174643.550: [ParNew
Desired survivor size 3221205808 bytes, new threshold 8 (max 8)
- age   1:   52251504 bytes,   52251504 total
- age   2:  208183784 bytes,  260435288 total
- age   3:  274752960 bytes,  535188248 total
- age   4:   12176528 bytes,  547364776 total
- age   5:    6135968 bytes,  553500744 total
- age   6:    3903152 bytes,  557403896 total
- age   7:   15341896 bytes,  572745792 total
- age   8:    5518880 bytes,  578264672 total
: 14724911K->762845K(17476288K), 24.7822734 secs] 42393844K->28434889K(80390848K), 24.7825687 secs] [Times: user=157.97 sys=25.63, real=24.78 secs]
Heap after GC invocations=10941 (full 14):
par new generation   total 17476288K, used 762845K [0x0000000080000000, 0x0000000580000000, 0x0000000580000000)
  eden space 13981056K,   0% used [0x0000000080000000, 0x0000000080000000, 0x00000003d5560000)
  from space 3495232K,  21% used [0x00000004aaab0000, 0x00000004d93a76a8, 0x0000000580000000)
  to   space 3495232K,   0% used [0x00000003d5560000, 0x00000003d5560000, 0x00000004aaab0000)
concurrent mark-sweep generation total 62914560K, used 27672043K [0x0000000580000000, 0x0000001480000000, 0x0000001480000000)
Metaspace       used 47602K, capacity 48370K, committed 49860K, reserved 51200K
}
2019-05-13T12:23:44.456+0100: 174668.901: Total time for which application threads were stopped: 29.6583943 seconds, Stopping threads took: 4.3050775 seconds


==============================



Regards,

Maulin



Re: Solr node goes into recovery mode

Posted by Erick Erickson <er...@gmail.com>.
There are a number of timeouts that can trip this; the ZK timeout is only one of them.

For instance, when a leader sends an update to a follower and the request times out, the leader may put the follower into “Leader-Initiated Recovery” (LIR).

60 GB heaps are, by and large, not recommended for this very reason. Consider running more JVMs with smaller heaps, each hosting fewer Solr replicas.
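As a rough illustration (values, port, and path are hypothetical, not a recommendation for this specific cluster), splitting one 60 GB JVM into several smaller instances might look like this in each instance's solr.in.sh:

```shell
# Hypothetical solr.in.sh fragment for one of several smaller Solr instances.
# Heaps below ~32 GB keep compressed object pointers (CompressedOops) enabled,
# and each GC cycle is much shorter than on a 60 GB heap.
SOLR_HEAP="16g"
SOLR_PORT="8983"            # a second instance on the same box would use e.g. 8984
SOLR_HOME="/var/solr/node1" # each instance needs its own SOLR_HOME
```

Each instance then hosts a subset of the replicas, so no single full GC can stall them all at once.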

Best,
Erick

> On May 13, 2019, at 9:26 AM, Maulin Rathod <mr...@asite.com> wrote:
> 
> Hi,
> 
> We are using Solr 6.1 with 2 shards, each shard with 1 replica, i.e. we have 4 server nodes in total (each node is assigned 60 GB of RAM).
> 
> Recently we have been observing an issue where a Solr node (any random node) goes into recovery mode and stops responding.
> 
> We have enough memory allocated to Solr (a 60 GB heap), and the system also has enough memory (300 GB).
> 
> We analyzed the GC logs and found a GC pause of 29.6583943 seconds when the problem happened. Can this GC pause make the node unavailable and send it into recovery mode, or could there be some other reason?
> 
> Please note we have set zkClientTimeout to 10 minutes (zkClientTimeout=600000) so that ZooKeeper will not consider this node unavailable during long GC pauses.
> 
> Solr GC Logs
> ==========
> 
> {Heap before GC invocations=10940 (full 14):
> par new generation   total 17476288K, used 14724911K [0x0000000080000000, 0x0000000580000000, 0x0000000580000000)
>  eden space 13981056K, 100% used [0x0000000080000000, 0x00000003d5560000, 0x00000003d5560000)
>  from space 3495232K,  21% used [0x00000003d5560000, 0x0000000402bcbdb0, 0x00000004aaab0000)
>  to   space 3495232K,   0% used [0x00000004aaab0000, 0x00000004aaab0000, 0x0000000580000000)
> concurrent mark-sweep generation total 62914560K, used 27668932K [0x0000000580000000, 0x0000001480000000, 0x0000001480000000)
> Metaspace       used 47602K, capacity 48370K, committed 49860K, reserved 51200K
> 2019-05-13T12:23:19.103+0100: 174643.550: [GC (Allocation Failure) 174643.550: [ParNew
> Desired survivor size 3221205808 bytes, new threshold 8 (max 8)
> - age   1:   52251504 bytes,   52251504 total
> - age   2:  208183784 bytes,  260435288 total
> - age   3:  274752960 bytes,  535188248 total
> - age   4:   12176528 bytes,  547364776 total
> - age   5:    6135968 bytes,  553500744 total
> - age   6:    3903152 bytes,  557403896 total
> - age   7:   15341896 bytes,  572745792 total
> - age   8:    5518880 bytes,  578264672 total
> : 14724911K->762845K(17476288K), 24.7822734 secs] 42393844K->28434889K(80390848K), 24.7825687 secs] [Times: user=157.97 sys=25.63, real=24.78 secs]
> Heap after GC invocations=10941 (full 14):
> par new generation   total 17476288K, used 762845K [0x0000000080000000, 0x0000000580000000, 0x0000000580000000)
>  eden space 13981056K,   0% used [0x0000000080000000, 0x0000000080000000, 0x00000003d5560000)
>  from space 3495232K,  21% used [0x00000004aaab0000, 0x00000004d93a76a8, 0x0000000580000000)
>  to   space 3495232K,   0% used [0x00000003d5560000, 0x00000003d5560000, 0x00000004aaab0000)
> concurrent mark-sweep generation total 62914560K, used 27672043K [0x0000000580000000, 0x0000001480000000, 0x0000001480000000)
> Metaspace       used 47602K, capacity 48370K, committed 49860K, reserved 51200K
> }
> 2019-05-13T12:23:44.456+0100: 174668.901: Total time for which application threads were stopped: 29.6583943 seconds, Stopping threads took: 4.3050775 seconds
> 
> 
> ==============================
> 
> 
> 
> Regards,
> 
> Maulin
> 
> 


Re: Solr node goes into recovery mode

Posted by Erick Erickson <er...@gmail.com>.
Use docValues on all fields you group, facet or sort on.

NOTE: you _must_ re-index from scratch; I’d index to a new collection and start over. Paradoxically, your index size _on disk_ will increase, but your JVM will need drastically less heap. See: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
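A minimal sketch of what that looks like in schema.xml (the field names here are made up for illustration; apply it to whatever fields you actually sort, facet, or group on):

```xml
<!-- Hypothetical schema.xml fragment: docValues="true" on every field used for
     sorting, faceting or grouping, so Solr reads the column data from disk via
     MMapDirectory instead of un-inverting it into the on-heap FieldCache. -->
<field name="created_date" type="tdate"  indexed="true" stored="false" docValues="true"/>
<field name="project_id"   type="tlong"  indexed="true" stored="false" docValues="true"/>
<field name="status"       type="string" indexed="true" stored="false" docValues="true"/>
```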

Best,
Erick

> On May 14, 2019, at 1:11 AM, Maulin Rathod <mr...@asite.com> wrote:
> 
> Thanks for the reply.
> 
> Our Solr node normally uses 30-45 GB, hence we allocated a 60 GB heap. We analyzed a heap dump and found that around 85% of the heap was used by org.apache.solr.uninverting.FieldCacheImpl.
> --------------------
> One instance of
> "org.apache.solr.uninverting.FieldCacheImpl" loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x48fe5e9b0" occupies 19,72,04,15,160 (86.28%) bytes. The memory is accumulated in one instance of "java.util.HashMap$Node[]" loaded by "<system class loader>".
> --------------------
> 
> Please note we are not using any Solr caches, since in our scenario new documents are added to the index quite fast (at least 10 documents per second) and we need to reopen the searcher to make these new documents visible.
> 
> We are not using docValues. Our understanding is that using docValues should improve query performance and also reduce the memory requirement, since we use lots of sorting/faceting in our queries. Please let me know your thoughts. Please also suggest any other ways to reduce the memory requirement or optimize performance.
> 
> 
> Regards,
> 
> Maulin
> 
> -----Original Message-----
> From: Shawn Heisey <ap...@elyograg.org>
> Sent: 14 May 2019 01:04
> To: solr-user@lucene.apache.org
> Subject: Re: Solr node goes into recovery mode
> 
> On 5/13/2019 8:26 AM, Maulin Rathod wrote:
>> Recently we have been observing an issue where a Solr node (any random node) goes into recovery mode and stops responding.
> 
> Do you KNOW that these Solr instances actually need a 60GB heap?  That's a HUGE heap.  When a full GC happens on a heap that large, it's going to be a long pause, and there's nothing that can be done about it.
> 
>> We have enough memory allocated to Solr (a 60 GB heap), and the system also has enough memory (300 GB).
> 
> As just mentioned, unless you are CERTAIN that you need a 60GB heap, which most users do not, don't set it that high.  Any advice you read that says "set the heap to XX percent of the installed system memory"
> will frequently result in a setting that's incorrect for your specific setup.
> 
> And if you really DO need a 60GB heap, it would be recommended to either add more servers and put less of your index on each one, or to split your replicas between two Solr instances each running 31GB or less -- as Erick mentioned in his reply.
> 
>> We analyzed the GC logs and found a GC pause of 29.6583943 seconds when the problem happened. Can this GC pause make the node unavailable and send it into recovery mode, or could there be some other reason?
> 
>> Please note we have set zkClientTimeout to 10 minutes (zkClientTimeout=600000) so that ZooKeeper will not consider this node unavailable during long GC pauses.
> 
> You can't actually set that timeout that high.  I believe that ZooKeeper limits the session timeout to 20 times the tickTime, which is typically set to 2 seconds.  So 40 seconds is typically the maximum you can have for that timeout.  Solr's zkClientTimeout value is used to set ZooKeeper's session timeout.
> 
> And, as Erick also mentioned, there are other ways that a long GC pause can cause problems other than that specific timeout.  SolrCloud is not going to work well with a huge heap ... eventually a full GC is going to happen, and if it takes more than a few seconds, it's going to cause issues.
> 
> Thanks,
> Shawn
> 
> 


RE: Solr node goes into recovery mode

Posted by Maulin Rathod <mr...@asite.com>.
Thanks for the reply.

Our Solr node normally uses 30-45 GB, hence we allocated a 60 GB heap. We analyzed a heap dump and found that around 85% of the heap was used by org.apache.solr.uninverting.FieldCacheImpl.
--------------------
One instance of
"org.apache.solr.uninverting.FieldCacheImpl" loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x48fe5e9b0" occupies 19,72,04,15,160 (86.28%) bytes. The memory is accumulated in one instance of "java.util.HashMap$Node[]" loaded by "<system class loader>".
--------------------

Please note we are not using any Solr caches, since in our scenario new documents are added to the index quite fast (at least 10 documents per second) and we need to reopen the searcher to make these new documents visible.

We are not using docValues. Our understanding is that using docValues should improve query performance and also reduce the memory requirement, since we use lots of sorting/faceting in our queries. Please let me know your thoughts. Please also suggest any other ways to reduce the memory requirement or optimize performance.
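For what it's worth, the usual way to get near-real-time visibility under fast indexing is soft commits rather than frequent explicit searcher reopens; a sketch of the relevant solrconfig.xml section (the intervals are illustrative, not tuned for this workload):

```xml
<!-- Hypothetical solrconfig.xml fragment: soft commits open a new searcher
     cheaply for NRT visibility, while hard commits flush the transaction log
     without opening a searcher. -->
<autoCommit>
  <maxTime>60000</maxTime>          <!-- hard commit every 60 s -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>2000</maxTime>           <!-- new documents visible within ~2 s -->
</autoSoftCommit>
```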


Regards,

Maulin

-----Original Message-----
From: Shawn Heisey <ap...@elyograg.org>
Sent: 14 May 2019 01:04
To: solr-user@lucene.apache.org
Subject: Re: Solr node goes into recovery mode

On 5/13/2019 8:26 AM, Maulin Rathod wrote:
> Recently we have been observing an issue where a Solr node (any random node) goes into recovery mode and stops responding.

Do you KNOW that these Solr instances actually need a 60GB heap?  That's a HUGE heap.  When a full GC happens on a heap that large, it's going to be a long pause, and there's nothing that can be done about it.

> We have enough memory allocated to Solr (a 60 GB heap), and the system also has enough memory (300 GB).

As just mentioned, unless you are CERTAIN that you need a 60GB heap, which most users do not, don't set it that high.  Any advice you read that says "set the heap to XX percent of the installed system memory"
will frequently result in a setting that's incorrect for your specific setup.

And if you really DO need a 60GB heap, it would be recommended to either add more servers and put less of your index on each one, or to split your replicas between two Solr instances each running 31GB or less -- as Erick mentioned in his reply.

> We analyzed the GC logs and found a GC pause of 29.6583943 seconds when the problem happened. Can this GC pause make the node unavailable and send it into recovery mode, or could there be some other reason?

> Please note we have set zkClientTimeout to 10 minutes (zkClientTimeout=600000) so that ZooKeeper will not consider this node unavailable during long GC pauses.

You can't actually set that timeout that high.  I believe that ZooKeeper limits the session timeout to 20 times the tickTime, which is typically set to 2 seconds.  So 40 seconds is typically the maximum you can have for that timeout.  Solr's zkClientTimeout value is used to set ZooKeeper's session timeout.

And, as Erick also mentioned, there are other ways that a long GC pause can cause problems other than that specific timeout.  SolrCloud is not going to work well with a huge heap ... eventually a full GC is going to happen, and if it takes more than a few seconds, it's going to cause issues.

Thanks,
Shawn



Re: Solr node goes into recovery mode

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/13/2019 8:26 AM, Maulin Rathod wrote:
> Recently we have been observing an issue where a Solr node (any random node) goes into recovery mode and stops responding.

Do you KNOW that these Solr instances actually need a 60GB heap?  That's 
a HUGE heap.  When a full GC happens on a heap that large, it's going to 
be a long pause, and there's nothing that can be done about it.

> We have enough memory allocated to Solr (a 60 GB heap), and the system also has enough memory (300 GB).

As just mentioned, unless you are CERTAIN that you need a 60GB heap, 
which most users do not, don't set it that high.  Any advice you read 
that says "set the heap to XX percent of the installed system memory" 
will frequently result in a setting that's incorrect for your specific 
setup.

And if you really DO need a 60GB heap, it would be recommended to either 
add more servers and put less of your index on each one, or to split 
your replicas between two Solr instances each running 31GB or less -- as 
Erick mentioned in his reply.

> We analyzed the GC logs and found a GC pause of 29.6583943 seconds when the problem happened. Can this GC pause make the node unavailable and send it into recovery mode, or could there be some other reason?

> Please note we have set zkClientTimeout to 10 minutes (zkClientTimeout=600000) so that ZooKeeper will not consider this node unavailable during long GC pauses.

You can't actually set that timeout that high.  I believe that ZooKeeper 
limits the session timeout to 20 times the tickTime, which is typically 
set to 2 seconds.  So 40 seconds is typically the maximum you can have 
for that timeout.  Solr's zkClientTimeout value is used to set 
ZooKeeper's session timeout.
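To make the arithmetic concrete, here is a sketch of the relevant zoo.cfg settings (the values shown are the common defaults, not necessarily what this cluster runs):

```properties
# Hypothetical zoo.cfg fragment. ZooKeeper negotiates each client session
# timeout into the range [2 * tickTime, maxSessionTimeout], and
# maxSessionTimeout defaults to 20 * tickTime. With tickTime=2000 that is
# 40000 ms, so a requested zkClientTimeout of 600000 is silently reduced
# to 40 seconds.
tickTime=2000
maxSessionTimeout=40000
```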

And, as Erick also mentioned, there are other ways that a long GC pause 
can cause problems other than that specific timeout.  SolrCloud is not 
going to work well with a huge heap ... eventually a full GC is going to 
happen, and if it takes more than a few seconds, it's going to cause issues.

Thanks,
Shawn