Posted to solr-user@lucene.apache.org by Walter Underwood <wu...@wunderwood.org> on 2019/05/21 23:10:25 UTC

Cluster with no overseer?

We have a 6.6.2 cluster in prod that appears to have no overseer. In /overseer_elect on ZK, there is an election folder, but no leader document. An OVERSEERSTATUS request fails with a timeout.

I’m going to try ADDROLE, but I’d be delighted to hear any other ideas. We’ve diverted all the traffic to the backing cluster, so we can blow this one away and rebuild.

Looking at the Zookeeper logs, I see a few instances of network failures across all three nodes.
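The OVERSEERSTATUS timeout is easy to probe from a script. A minimal sketch using only the Python standard library; the base URL is a placeholder, and a `None` return simply means the request failed or timed out (the symptom described here):

```python
import json
import urllib.error
import urllib.request

def overseer_status(base_url, timeout=10.0):
    """Call the Collections API OVERSEERSTATUS action.

    Returns the parsed JSON response, or None when the request fails
    or times out (the symptom described in the message above).
    """
    url = f"{base_url}/solr/admin/collections?action=OVERSEERSTATUS&wt=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError):
        return None

# Nothing is listening on this placeholder port, so this returns None.
print(overseer_status("http://127.0.0.1:1", timeout=0.5))
```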

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Cluster with no overseer?

Posted by Erick Erickson <er...@gmail.com>.
110 isn’t all that many; it’s well within the normal range _assuming_ the items are being processed. When you restart Solr, every state change writes an operation to the work queue, so entries can pile up.

Perhaps you’re hitting https://issues.apache.org/jira/browse/SOLR-13416?

In which case restarting ZK should fix it, since all the work items are ephemeral and will go away when ZK restarts. Of course, shut them all down rather than doing a rolling restart.

You shouldn’t need to clean anything in ZK associated with this, since those are all (I’m pretty sure) “ephemeral nodes” and should just disappear when ZK shuts down.

If you’re feeling really brave, you could try using “bin/solr zk rm” to nuke the ephemeral nodes and then try a rolling restart of the Solr nodes, but only as a last resort IMO.
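The full-stop sequence (all Solr down, all ZooKeeper down together, then ZooKeeper back up before Solr) can be scripted. A dry-run sketch; the host names and install paths are made up, so adjust them and flip `DRY_RUN` before using it for real:

```python
import subprocess

DRY_RUN = True  # flip to False only after fixing hosts and paths
SOLR_NODES = ["solr1", "solr2", "solr3"]   # hypothetical hosts
ZK_NODES = ["zk1", "zk2", "zk3"]           # hypothetical hosts

def run(cmd):
    """Echo the command; execute it over ssh only when DRY_RUN is off."""
    line = " ".join(cmd)
    print(line)
    if not DRY_RUN:
        subprocess.run(cmd, check=True)
    return line

def full_restart():
    """Return the ordered command list (and execute it unless DRY_RUN)."""
    steps = []
    # 1. Stop every Solr node first, so nothing re-queues overseer work.
    steps += [run(["ssh", h, "/opt/solr/bin/solr", "stop"]) for h in SOLR_NODES]
    # 2. Stop ALL ZooKeepers together, not a rolling restart; per the
    #    advice above, the ephemeral work items then disappear.
    steps += [run(["ssh", h, "/opt/zookeeper/bin/zkServer.sh", "stop"]) for h in ZK_NODES]
    # 3. Start ZooKeeper and let the ensemble sync before Solr returns.
    steps += [run(["ssh", h, "/opt/zookeeper/bin/zkServer.sh", "start"]) for h in ZK_NODES]
    # 4. Start Solr; the first node up normally wins the overseer election.
    steps += [run(["ssh", h, "/opt/solr/bin/solr", "start"]) for h in SOLR_NODES]
    return steps

commands = full_restart()
```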

Best,
Erick



Re: Cluster with no overseer?

Posted by Walter Underwood <wu...@wunderwood.org>.
The ZK ensemble appears to be OK. It is the Solr-related stuff that is borked. There are 110 items in /overseer/collection-queue-work/, which doesn’t seem healthy.

If it is really hosed, I’ll shut down all the nodes, clean out the files in Zookeeper and start over.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Cluster with no overseer?

Posted by Erick Erickson <er...@gmail.com>.
Good luck. This kind of assumes that your ZK ensemble is healthy, of course...



Re: Cluster with no overseer?

Posted by Walter Underwood <wu...@wunderwood.org>.
Thanks, we’ll try that. Bouncing one Solr node doesn’t fix it, because we did a rolling restart yesterday.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Cluster with no overseer?

Posted by Erick Erickson <er...@gmail.com>.
Walter:

I have no idea what the root cause is here; this really shouldn’t happen. But the Overseer role (and I’m assuming you’re talking about Solr’s Overseer) is assigned similarly to a shard leader; the same election process happens. All the election nodes are ephemeral ZK nodes.

Solr’s Overseer is _not_ fixed to a particular Solr node, although you can assign a preferred Overseer role in those (rare) cases where there are so many state changes for ZooKeeper that it’s advisable to run it on a dedicated machine.

Overseer assignment is automatic. This should work:
1> shut everything down, Solr and ZooKeeper
2> start your ZooKeepers and let them all get in sync with each other
3> start your Solr nodes. It might take 3 minutes or more to bring up the first Solr node; there’s up to a 180-second delay if leaders are not easily findable.

That should cause Solr to elect an overseer, probably the first Solr node to come up.
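This election follows the standard ZooKeeper leader-election recipe (an assumption worth stating): each candidate registers an ephemeral *sequential* znode under /overseer_elect/election, and the holder of the lowest sequence number acts as Overseer. A sketch of that selection logic; the znode names are made up but follow the usual sessionid-nodename-n_sequence pattern:

```python
def elected_overseer(election_znodes):
    """Return the node name whose election znode has the lowest sequence."""
    def seq(name):
        # The sequence is the number after the final "-n_".
        return int(name.rsplit("-n_", 1)[1])
    winner = min(election_znodes, key=seq)
    # The middle field is the registering node's name.
    return winner.split("-")[1]

znodes = [
    "72059918434238466-solr2:8983_solr-n_0000000012",
    "72059918434238464-solr1:8983_solr-n_0000000010",
    "72059918434238467-solr3:8983_solr-n_0000000013",
]
print(elected_overseer(znodes))  # prints "solr1:8983_solr"
```

This is why the first node restarted usually wins: it registers first and so gets the lowest sequence number.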

It _might_ work to bounce just one Solr node; seeing the Overseer election queue empty, it may elect itself. That said, the overseer election queue won’t contain the rest of the Solr nodes as it should, so if that works you should probably bounce the remaining Solr servers one by one to restore a proper election queue.

Not a fix for the root cause, of course, but it should get things operating again. I’ll add that I haven’t seen this happen in the field, to my recollection, if at all.

Best,
Erick



Re: Cluster with no overseer?

Posted by Will Martin <wm...@urgent.ly>.
Worked with Fusion and ZooKeeper at GSA for 18 months in an admin role.

Before blowing it away, you could try:

- Identify a candidate node with a snapshot you think is old enough to be robust.
- Clean the data for the other ZK nodes.
- Bring up the chosen node and wait for it to settle [wish I could remember why I called what I saw that].
- Bring up the other nodes one at a time; let each one fully sync as a follower of the new leader.
- They should each in turn request the snapshot from the leader.

Then align your collections with the ensemble. For the life of me I can't remember there being anything particularly tricky about that with Fusion, which means I can't remember what I did... or have it doc'd at home. ;-)


Will Martin
DEVOPS ENGINEER
540.454.9565

8609 WESTWOOD CENTER DR, SUITE 475
VIENNA, VA 22182
geturgently.com



Re: Cluster with no overseer?

Posted by Walter Underwood <wu...@wunderwood.org>.
Yes, please. I have the logs from each of the Zookeepers. 

We are running 3.4.12.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 21, 2019, at 6:49 PM, Will Martin <wm...@urgent.ly> wrote:
> 
> Walter. Can I cross-post to zk-dev?


Re: Cluster with no overseer?

Posted by Will Martin <wm...@urgent.ly>.
Walter. Can I cross-post to zk-dev?



Will Martin
DEVOPS ENGINEER
540.454.9565



8609 WESTWOOD CENTER DR, SUITE 475
VIENNA, VA 22182
geturgently.com <http://geturgently.com/>





Re: Cluster with no overseer?

Posted by Will Martin <wm...@urgent.ly>.
+1



On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wu...@wunderwood.org>
wrote:

> ADDROLE times out after 180 seconds. This seems to be an unrecoverable
> state for the cluster, so that is a pretty serious bug.

Re: Cluster with no overseer?

Posted by Walter Underwood <wu...@wunderwood.org>.
ADDROLE times out after 180 seconds. This seems to be an unrecoverable state for the cluster, so that is a pretty serious bug.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
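
[Editor's note: the thread does not show the exact request used. A sketch of the documented Collections API ADDROLE call follows; the host, port, and node name are placeholders that depend on the cluster.]

```shell
# Collections API ADDROLE request (placeholders: adjust host and node).
# "role=overseer" asks Solr to prefer the named node as overseer.
curl "http://localhost:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=host1:8983_solr"
```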

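
[Editor's note: the missing leader znode described above can be confirmed directly against ZooKeeper, e.g. with the zkCli.sh shipped with ZooKeeper. This is a sketch; the ZK host and port are placeholders, and the paths are the standard SolrCloud layout.]

```shell
# Inspect overseer election state in ZooKeeper. A healthy cluster shows
# both an election/ child and a leader znode under /overseer_elect.
zkCli.sh -server localhost:2181 <<'EOF'
ls /overseer_elect
get /overseer_elect/leader
ls /overseer/collection-queue-work
EOF
```

If `get /overseer_elect/leader` fails with "Node does not exist" while queue items keep accumulating under /overseer/collection-queue-work, that matches the no-overseer symptom reported in this thread.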