
Posted to solr-user@lucene.apache.org by Gili Nachum <gi...@gmail.com> on 2016/01/14 11:08:44 UTC

Solr cluster doesn't recover from a ZK disconnect if collection.reload() was issued

Hi,

Our Solr cluster runs on VMs that can freeze for longer than the ZK tick
time (it's a non-critical CI/CD pipeline running on an overloaded ESX host).
When this happens, the node's shards are registered as down. Then, when
the node comes back, recovery takes place and all shard replicas end up in
the active state. Everyone is happy.

However, we noticed that recovery doesn't take place if the collection was
reloaded and the server hasn't been restarted since. Shards end up in the
down state. Before providing log messages, I wonder if this is a known issue?

Reproduction recipe (assuming two nodes):
1. Before starting, restart both solr1 and solr2; all shards are active.
2. Reload the collection.
3. Cause a disconnect by freezing the Java process. On solr2, run
   kill -SIGSTOP <solr server pid>, and then 2 minutes later
   kill -SIGCONT <solr server pid>.
4. The solr2 shard replicas are *Down* forever. No recovery.

If we omit step #2, the cluster recovers as expected.
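
The recipe above could be scripted roughly as follows. This is only a
minimal Python sketch and not part of the original report: the helper
names reload_url and freeze_process, the base URL, and the collection
name are all hypothetical. It assumes the reload in step 2 is issued
through the Collections API RELOAD action, and that the script runs on
the same Linux host as the Solr process it freezes.

```python
import os
import signal
import time
from urllib.parse import urlencode


def reload_url(base_url, collection):
    """Build a Collections API RELOAD URL (step 2).

    base_url and collection are placeholders for your own setup.
    """
    query = urlencode({"action": "RELOAD", "name": collection})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def freeze_process(pid, seconds):
    """Pause a process with SIGSTOP, then resume it with SIGCONT (step 3)."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)
```

For example, one would request reload_url("http://solr1:8983/solr",
"mycollection") with any HTTP client, then call
freeze_process(<solr2 pid>, 120) to mimic the 2 minute freeze.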

Re: Solr cluster doesn't recover from a ZK disconnect if collection.reload() was issued

Posted by Gili Nachum <gi...@gmail.com>.

Oops, that got omitted.
v4.7.2, and it kept reproducing after upgrading to v4.9 (I was trying to
see whether it had been fixed later on).


On Thu, Jan 14, 2016 at 5:26 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Which version of Solr is this on?

Re: Solr cluster doesn't recover from a ZK disconnect if collection.reload() was issued

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Which version of Solr is this on?

On Thu, Jan 14, 2016 at 4:10 PM, Gili Nachum <gi...@gmail.com> wrote:
> Clarification: if we restart the nodes after reloading the collection and
> before pausing, then recovery works fine.



-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr cluster doesn't recover from a ZK disconnect if collection.reload() was issued

Posted by Gili Nachum <gi...@gmail.com>.

Clarification: if we restart the nodes after reloading the collection and
before pausing, then recovery works fine.
