Posted to solr-user@lucene.apache.org by "Kimber, Mike" <Mi...@verint.com> on 2018/09/25 15:48:24 UTC

Auto recovery of a failed Solr Cloud Node?

Hi,

Is there a recommended design pattern or best practice for auto recovery of a failed Solr Node?

Am I correct to assume there is nothing out of the box for this and we have to code our own solution?

Thanks

Michael Kimber


This electronic message may contain proprietary and confidential information of Verint Systems Inc., its affiliates and/or subsidiaries. The information is intended to be for the use of the individual(s) or entity(ies) named above. If you are not the intended recipient (or authorized to receive this e-mail for the intended recipient), you may not use, copy, disclose or distribute to anyone this message or any information contained in this message. If you have received this electronic message in error, please notify us by replying to this e-mail.

Re: Auto recovery of a failed Solr Cloud Node?

Posted by James Keeney <ne...@gmail.com>.
There is another thing to consider as well ...

When a node goes offline and then comes back online, the ensemble may
have trouble responding to the cluster unless ZooKeeper has been
configured properly.


Jim Keeney
President, FitterWeb
E: NextVestor@gmail.com
M: 703-568-5887

*FitterWeb Consulting*
*Are you lean and agile enough for the web? *


On Thu, Sep 27, 2018 at 4:12 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 9/27/2018 8:00 AM, Shawn Heisey wrote:
> > On 9/27/2018 7:24 AM, Kimber, Mike wrote:
> >> I'm trying to determine if there is any health check available to
> >> determine the above and then if the issue happens then an automated
> >> mechanism in SolrCloud to restart the instance. Or is this something
> >> we have to code ourselves?
> >
> > As shipped by the project, Solr will never restart itself
> > automatically.  If it dies, it's dead until you start it again, unless
> > you implement something to restart it automatically.  This is
> > intentional -- Solr almost never dies unless there's some kind of
> > problem -- not enough memory, corrupt software, etc.  If Solr *does*
> > die, you need to figure out why and fix it, not rely on an automatic
> > restart.
>
> Replying to myself.  Probably a sign of insanity!
>
> The other side of that coin is a completely unresponsive server.  Here's
> the thing about that situation:  If it's really unresponsive, it
> probably wouldn't be possible to send Solr a message to tell it to
> restart itself.  When a server in SolrCloud becomes unresponsive,
> SolrCloud will attempt to have it do an index recovery, but this does
> NOT involve a restart.  Solr cannot restart itself automatically.  It
> might be possible to write that functionality into Solr, but I think
> that using such functionality for automatic restarts on problem
> detection is a very bad idea. The root of the problem must be found and
> fixed; a restart probably isn't going to get rid of it.
>
> If a SolrCloud server remains unresponsive, then any recovery operation
> that is initiated is going to fail.  Typically, problems that lead to an
> unresponsive server are not the kind of problems that will go away
> without action by the administrator -- adding memory, reducing the index
> size, etc.  If the admin restarts the server to clear that kind of
> problem, it's very likely that the problem will happen again.
>
> Thanks,
> Shawn
>
>

Re: Auto recovery of a failed Solr Cloud Node?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/27/2018 8:00 AM, Shawn Heisey wrote:
> On 9/27/2018 7:24 AM, Kimber, Mike wrote:
>> I'm trying to determine if there is any health check available to 
>> determine the above and then if the issue happens then an automated 
>> mechanism in SolrCloud to restart the instance. Or is this something 
>> we have to code ourselves?
>
> As shipped by the project, Solr will never restart itself 
> automatically.  If it dies, it's dead until you start it again, unless 
> you implement something to restart it automatically.  This is
> intentional -- Solr almost never dies unless there's some kind of
> problem -- not enough memory, corrupt software, etc.  If Solr *does*
> die, you need to figure out why and fix it, not rely on an automatic 
> restart. 

Replying to myself.  Probably a sign of insanity!

The other side of that coin is a completely unresponsive server.  Here's 
the thing about that situation:  If it's really unresponsive, it 
probably wouldn't be possible to send Solr a message to tell it to 
restart itself.  When a server in SolrCloud becomes unresponsive, 
SolrCloud will attempt to have it do an index recovery, but this does 
NOT involve a restart.  Solr cannot restart itself automatically.  It 
might be possible to write that functionality into Solr, but I think 
that using such functionality for automatic restarts on problem 
detection is a very bad idea. The root of the problem must be found and
fixed; a restart probably isn't going to get rid of it.

If a SolrCloud server remains unresponsive, then any recovery operation 
that is initiated is going to fail.  Typically, problems that lead to an 
unresponsive server are not the kind of problems that will go away 
without action by the administrator -- adding memory, reducing the index 
size, etc.  If the admin restarts the server to clear that kind of 
problem, it's very likely that the problem will happen again.

Thanks,
Shawn


Re: Auto recovery of a failed Solr Cloud Node?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/28/2018 4:18 PM, Christopher Schultz wrote:
> I thought someone recently mentioned (but I cannot find a reference,
> sorry) that Solr would automatically restart if an OutOfMemoryError
> was encountered.
>
> Is that only for single-node Solr (i.e. non-cloud/ZK)?

On non-Windows systems, Solr includes an "oom killer" script.  This will
find the Solr PID and execute "kill -9" on it if OutOfMemoryError is 
thrown.  We have an issue to add this capability for Windows as well, 
but the work hasn't been done yet.
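
For illustration only, here is a minimal Python sketch of the behavior
described above: find the Solr PID and send it the equivalent of
"kill -9".  It is not the script that ships with Solr.  The function
names are hypothetical, and so is the assumption that a Solr process can
be recognized in ps-style output by "start.jar" plus a
"-Djetty.port=<port>" argument, with the PID in the second column.

```python
import os
import signal


def find_solr_pids(ps_output, port):
    """Return PIDs from ps-style text that look like Solr listening on `port`.

    Assumption (not taken from the Solr script): the PID is the second
    whitespace-separated column, and the command line contains start.jar
    and the exact token -Djetty.port=<port>.
    """
    marker = "-Djetty.port={}".format(port)
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # header row or a line we cannot parse
        if "start.jar" in line and marker in fields:
            pids.append(int(fields[1]))
    return pids


def kill_hard(pids):
    """Send SIGKILL (what "kill -9" sends) to each PID; non-Windows only."""
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
```

Note that nothing in this sketch starts Solr again.  It only stops the
process, which matches the point that a stock install does not restart
itself.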

If somebody has a Solr install that automatically restarts itself, that 
install has been altered and isn't what the project shipped.

Thanks,
Shawn


Re: Auto recovery of a failed Solr Cloud Node?

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Shawn,

On 9/27/18 10:00, Shawn Heisey wrote:
> On 9/27/2018 7:24 AM, Kimber, Mike wrote:
>> I'm trying to determine if there is any health check available
>> to determine the above and then if the issue happens then an
>> automated mechanism in SolrCloud to restart the instance. Or is
>> this something we have to code ourselves?
> 
> As shipped by the project, Solr will never restart itself 
> automatically.  If it dies, it's dead until you start it again,
> unless you implement something to restart it automatically.  This is
> intentional -- Solr almost never dies unless there's some kind of
> problem -- not enough memory, corrupt software, etc.  If Solr *does*
> die, you need to figure out why and fix it, not rely on an
> automatic restart.

I thought someone recently mentioned (but I cannot find a reference,
sorry) that Solr would automatically restart if an OutOfMemoryError
was encountered.

Is that only for single-node Solr (i.e. non-cloud/ZK)?

-chris

Re: Auto recovery of a failed Solr Cloud Node?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/27/2018 7:24 AM, Kimber, Mike wrote:
> I'm trying to determine if there is any health check available to determine the above and then if the issue happens then an automated mechanism in SolrCloud to restart the instance. Or is this something we have to code ourselves?

As shipped by the project, Solr will never restart itself 
automatically.  If it dies, it's dead until you start it again, unless 
you implement something to restart it automatically.  This is intentional
-- Solr almost never dies unless there's some kind of problem -- not
enough memory, corrupt software, etc.  If Solr *does* die, you need to
figure out why and fix it, not rely on an automatic restart.

Thanks,
Shawn


RE: Auto recovery of a failed Solr Cloud Node?

Posted by "Kimber, Mike" <Mi...@verint.com>.
Erick,

Apologies, I should have been more specific. "Failed solr node" means:

1. SolrCloud instance has crashed
2. SolrCloud instance is up but not responding
3. SolrCloud cluster is not responding

I'm trying to determine if there is any health check available to determine the above and then if the issue happens then an automated mechanism in SolrCloud to restart the instance. Or is this something we have to code ourselves?
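
As a rough illustration of what a self-coded check for the first two cases might look like, here is a minimal Python sketch. It is an assumption-laden example, not something Solr ships: the function names, the status labels, the host/port in the usage note and the timeout are all hypothetical. A refused connection is treated as "down" (nothing is listening, so the process has probably crashed) and a timeout as "unresponsive" (the process is up but not answering). The third case would need the same probe run against every node in the cluster.

```python
import socket
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Build a zero-argument probe that fetches `url` or raises on failure."""
    def probe():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    return probe


def classify_probe(probe):
    """Run `probe` and map the outcome to a coarse, hypothetical status label."""
    try:
        probe()
        return "ok"
    except urllib.error.HTTPError:
        # The server answered, but with an HTTP error status.
        return "error-response"
    except OSError as exc:
        # URLError wraps the underlying cause in .reason; plain socket
        # errors have no such attribute, so fall back to the error itself.
        cause = getattr(exc, "reason", exc)
        if isinstance(cause, socket.timeout):
            return "unresponsive"
        if isinstance(cause, ConnectionRefusedError):
            return "down"
        return "unreachable"
```

A call such as classify_probe(http_probe("http://localhost:8983/solr/admin/info/system?wt=json")) would check one node, assuming Solr listens on that host and port. The sketch only reports a status; what to do with it is left to whatever runs the check.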

Thanks

Mike

-----Original Message-----
From: Erick Erickson <er...@gmail.com>
Sent: 25 September 2018 18:25
To: solr-user <so...@lucene.apache.org>
Subject: Re: Auto recovery of a failed Solr Cloud Node?

What does "Failed solr node" mean? How do you mean it fails? There's lots of recovery built in for a replica that gets out-of-sync somehow (is shut down while indexing is going on etc). All that relies on having more than one replica per shard of course.

If the node completely dies, due to hardware for instance, then yes, the best solution now is to spin up another Solr node. I'm not sure what REPLACENODE does in this scenario.

If you're using HDFS there's an option to do this since the index is replicated by HDFS.

Best,
Erick
On Tue, Sep 25, 2018 at 8:48 AM Kimber, Mike <Mi...@verint.com> wrote:
>
> Hi,
>
> Is there a recommended design pattern or best practice for auto recovery of a failed Solr Node?
>
> Am I correct to assume there is nothing out of the box for this and we have to code our own solution?
>
> Thanks
>
> Michael Kimber



Re: Auto recovery of a failed Solr Cloud Node?

Posted by Erick Erickson <er...@gmail.com>.
What does "Failed solr node" mean? How do you mean it fails? There's
lots of recovery built in for a replica that gets out-of-sync somehow
(is shut down while indexing is going on etc). All that relies on
having more than one replica per shard of course.
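
To make the replica-per-shard point concrete, here is a minimal Python
sketch that scans a parsed Collections API CLUSTERSTATUS response for
shards that have fewer than two usable replicas.  The function name is
hypothetical.  The sketch assumes the usual JSON layout
(cluster -> collections -> shards -> replicas, plus cluster ->
live_nodes) and counts a replica only if its state is "active" and its
node is listed in live_nodes.

```python
def shards_without_redundancy(cluster_status):
    """Return sorted (collection, shard, usable_replicas) for shards with < 2.

    `cluster_status` is the already-parsed JSON body of a CLUSTERSTATUS
    call.  Missing keys are treated as empty rather than as errors.
    """
    cluster = cluster_status.get("cluster", {})
    live = set(cluster.get("live_nodes", []))
    at_risk = []
    for coll_name, coll in cluster.get("collections", {}).items():
        for shard_name, shard in coll.get("shards", {}).items():
            usable = [
                replica
                for replica in shard.get("replicas", {}).values()
                if replica.get("state") == "active"
                and replica.get("node_name") in live
            ]
            if len(usable) < 2:
                at_risk.append((coll_name, shard_name, len(usable)))
    return sorted(at_risk)
```

A shard reported here has no second copy to recover from, so the
built-in recovery has nothing to work with if its remaining replica is
lost.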

If the node completely dies, due to hardware for instance, then yes, the
best solution now is to spin up another Solr node. I'm not sure what
REPLACENODE does in this scenario.

If you're using HDFS there's an option to do this since the index is
replicated by HDFS.

Best,
Erick
On Tue, Sep 25, 2018 at 8:48 AM Kimber, Mike <Mi...@verint.com> wrote:
>
> Hi,
>
> Is there a recommended design pattern or best practice for auto recovery of a failed Solr Node?
>
> Am I correct to assume there is nothing out of the box for this and we have to code our own solution?
>
> Thanks
>
> Michael Kimber