Posted to solr-user@lucene.apache.org by "Joshi, Shital" <Sh...@gs.com> on 2015/01/27 22:51:55 UTC

replica never takes leader role

Hello,

We have a SolrCloud cluster (5 shards and 2 replicas) on 10 boxes, with three ZooKeeper instances. We have noticed that when a leader node goes down, the replica never takes over as leader. The cloud becomes unusable, and we have to bounce the entire cloud for the replica to assume the leader role. Is this the default behavior? How can we change this?

Thanks. 



Re: replica never takes leader role

Posted by Mark Miller <ma...@gmail.com>.
Yes, after 45 seconds a replica should take over as leader. The logs of
the replica that should be taking over will likely explain why this is
not happening.

- Mark
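
One way to confirm from the outside that a shard has been left without a leader is to inspect the cluster state stored in ZooKeeper. The following is a minimal sketch, not a definitive tool: it assumes the state has already been exported as JSON, and the sample document, node names, and the helper leaderless_shards are hypothetical. The layout (collection, "shards", "replicas", and a "leader": "true" flag on the leading replica) is modeled on the Solr 4.x clusterstate.json and should be checked against your own cluster.

```python
import json

# Hypothetical sample shaped like a Solr 4.x clusterstate.json.
# shard1 has lost its leader; shard2 still has one.
SAMPLE = """
{
  "collection1": {
    "shards": {
      "shard1": {
        "state": "active",
        "replicas": {
          "core_node1": {"state": "down", "node_name": "box1:8983_solr"},
          "core_node2": {"state": "active", "node_name": "box2:8983_solr"}
        }
      },
      "shard2": {
        "state": "active",
        "replicas": {
          "core_node3": {"state": "active", "node_name": "box3:8983_solr",
                         "leader": "true"},
          "core_node4": {"state": "active", "node_name": "box4:8983_solr"}
        }
      }
    }
  }
}
"""


def leaderless_shards(cluster_state):
    """Return sorted (collection, shard) pairs that have no active leader."""
    missing = []
    for collection, info in cluster_state.items():
        for shard, shard_info in info.get("shards", {}).items():
            replicas = shard_info.get("replicas", {}).values()
            has_leader = any(
                r.get("leader") == "true" and r.get("state") == "active"
                for r in replicas
            )
            if not has_leader:
                missing.append((collection, shard))
    return sorted(missing)


state = json.loads(SAMPLE)
print(leaderless_shards(state))
```

If a shard still shows up here well after the ZooKeeper session timeout has elapsed, the logs of its surviving replica are the place to look next.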

On Wed Jan 28 2015 at 2:52:32 PM Joshi, Shital <Sh...@gs.com> wrote:

> When the leader reaches 99% physical memory on the box and starts swapping
> (and stops replicating), we forcefully bring it down (first kill -15, then
> kill -9 if kill -15 doesn't work). This is when we expect the replica to
> assume the leader's role, and it never happens.
>
> The ZooKeeper timeout is 45 seconds. We can increase it to 2 minutes and
> test.
>
> <cores adminPath="/admin/cores" defaultCoreName="collection1"
> host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}"
> zkClientTimeout="${zkClientTimeout:45000}">
>
> As per the definition of zkClientTimeout, after the leader is brought down
> and doesn't talk to ZooKeeper for 45 seconds, shouldn't ZK promote the
> replica to leader? I am not sure how increasing the ZK timeout will help.

RE: replica never takes leader role

Posted by "Joshi, Shital" <Sh...@gs.com>.
When the leader reaches 99% physical memory on the box and starts swapping (and stops replicating), we forcefully bring it down (first kill -15, then kill -9 if kill -15 doesn't work). This is when we expect the replica to assume the leader's role, and it never happens.

The ZooKeeper timeout is 45 seconds. We can increase it to 2 minutes and test.

<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}" zkClientTimeout="${zkClientTimeout:45000}">

As per the definition of zkClientTimeout, after the leader is brought down and doesn't talk to ZooKeeper for 45 seconds, shouldn't ZK promote the replica to leader? I am not sure how increasing the ZK timeout will help.
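
For readers unfamiliar with the ${name:default} syntax in the solr.xml snippet above: the attribute takes the value of the named property if one is set (typically a Java system property such as -DzkClientTimeout=120000 at startup) and falls back to the default after the colon otherwise. The sketch below only imitates that substitution rule in Python to show which timeout would be in effect; the function resolve and the props dictionary are hypothetical stand-ins, not Solr code.

```python
import re

# Matches ${name:default} or ${name}; the ":default" part is optional.
_PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def resolve(text, props):
    """Replace each ${name:default} with props[name], else with the default."""
    def _sub(match):
        name, default = match.group(1), match.group(2)
        if name in props:
            return props[name]
        if default is None:
            raise KeyError(f"no value or default for property {name!r}")
        return default
    return _PLACEHOLDER.sub(_sub, text)


attr = "${zkClientTimeout:45000}"

# No property set: the default after the colon applies.
default_ms = int(resolve(attr, {}))

# Property set (hypothetical override to 2 minutes): it wins over the default.
override_ms = int(resolve(attr, {"zkClientTimeout": "120000"}))

print(default_ms, override_ms)
```

Under this reading, the 45-second value quoted above is in effect only if nothing overrides zkClientTimeout when the node is started.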

 

Re: replica never takes leader role

Posted by Erick Erickson <er...@gmail.com>.
This is not the desired behavior at all. I know there have been
improvements in this area since 4.8, but can't seem to locate the JIRAs.

I'm curious _why_ the nodes are going down, though: is it happening at
random, or are you taking them down? One problem has been that the ZooKeeper
timeout used to default to 15 seconds, and occasionally a node would be
unresponsive (sometimes due to GC pauses) for longer than the timeout. So
raising the ZK timeout has helped some people avoid this...

FWIW,
Erick
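
The point about GC pauses and the timeout can be illustrated with a small sketch. It is a deliberate simplification: it treats any pause longer than the configured timeout as a lost ZooKeeper session and ignores how ZooKeeper actually negotiates and tracks sessions. The pause durations and the helper pauses_exceeding are hypothetical; only the 15-second and 45-second timeouts come from this thread.

```python
def pauses_exceeding(pauses_ms, timeout_ms):
    """Return the pauses long enough to outlast the given session timeout."""
    return [p for p in pauses_ms if p > timeout_ms]


# Hypothetical stop-the-world pause durations, in milliseconds.
pauses_ms = [800, 12000, 20000, 50000]

# Compare the old 15-second default with the 45-second setting.
for timeout_ms in (15000, 45000):
    print(timeout_ms, pauses_exceeding(pauses_ms, timeout_ms))
```

A longer timeout tolerates more of these pauses, at the cost of taking longer to notice a node that is really gone.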


RE: replica never takes leader role

Posted by "Joshi, Shital" <Sh...@gs.com>.
We're using Solr 4.8.0



Re: replica never takes leader role

Posted by Erick Erickson <er...@gmail.com>.
What version of Solr? This is an area of ongoing improvement, and several
of the fixes are very recent.

Try searching the Solr JIRA for details.

Best,
Erick
