Posted to solr-user@lucene.apache.org by Ravi Solr <ra...@gmail.com> on 2017/02/02 07:27:19 UTC

6.4.0 collection leader election and recovery issues

Hello,
         Yesterday I upgraded from 6.0.1 to 6.4.0, and it's been a straight
12-hour debugging spree!! Can somebody kindly help me out of this misery.

I have a set of 8 single-shard collections with 3 replicas each. As soon as I
updated the configs and started the servers, one of my collections got stuck
with no leader. I have restarted Solr to no avail, and I also tried to force a
leader via the Collections API, but that didn't work either. I also see that,
from time to time, multiple Solr nodes go down all at the same time, and only
a restart resolves the issue.

The error snippets are shown below

2017-02-02 01:43:42.785 ERROR
(recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr
x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1)
[c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1]
o.a.s.c.RecoveryStrategy Error while trying to recover.
core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException: No
registered leader was found after waiting for 4000ms , collection:
clicktrack slice: shard1

solr.log.9:2017-02-02 01:43:41.336 INFO
(zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [   ]
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/collections/clicktrack/state.json] for collection [clicktrack] has
occurred - updating... (live nodes size: [1])
solr.log.9:2017-02-02 01:43:42.224 INFO
(zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [   ]
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/collections/clicktrack/state.json] for collection [clicktrack] has
occurred - updating... (live nodes size: [1])
solr.log.9:2017-02-02 01:43:43.767 INFO
(zkCallback-4-thread-23-processing-n:10.128.159.245:9001_solr) [   ]
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/collections/clicktrack/state.json] for collection [clicktrack] has
occurred - updating... (live nodes size: [1])


Suspecting the worst, I backed up the index, renamed the collection's
data folder, and restarted the servers; this time the collection got a
proper leader. So is my index really corrupted? The Solr UI showed live nodes
just like the logs, but without any leader. Even with the leader issue
somewhat alleviated after renaming the data folder and letting Solr create
a new data folder, my servers did go down a couple of times.

I am not all that well versed with ZooKeeper... any trick to make ZooKeeper
pick a leader and be happy? Did anybody have Solr/ZooKeeper issues with
6.4.0?

Thanks

Ravi Kiran Bhaskar

Re: 6.4.0 collection leader election and recovery issues

Posted by Ravi Solr <ra...@gmail.com>.
Thanks Hendrik. I am baffled as to why I did not hit this issue prior to
moving to 6.4.0.

On Thu, Feb 2, 2017 at 7:58 AM, Hendrik Haddorp <he...@gmx.net>
wrote:

> Might be that your overseer queue is overloaded. Similar to what is described
> here:
> https://support.lucidworks.com/hc/en-us/articles/203959903-Bringing-up-downed-Solr-servers-that-don-t-want-to-come-up
>
> If the overseer queue gets too long you get hit by this:
> https://github.com/Netflix/curator/wiki/Tech-Note-4
>
> Try to request the overseer status (/solr/admin/collections?action=OVERSEERSTATUS).
> If that fails you likely hit that problem. If so, you can also no longer use the
> ZooKeeper command line client. You can now restart all your ZK
> nodes with an increased jute.maxbuffer value. Once ZK is restarted you can
> use the ZK command line client with the same jute.maxbuffer value and check
> how many entries /overseer/queue has in ZK. Normally there should be a few
> entries, but if you see thousands then you should delete them. I used a few
> lines of Java code for that, again setting jute.maxbuffer to the same
> value. Once cleaned up, restart the Solr nodes one by one and keep an eye on
> the overseer status.

Re: 6.4.0 collection leader election and recovery issues

Posted by Hendrik Haddorp <he...@gmx.net>.
Might be that your overseer queue is overloaded. Similar to what is
described here:
https://support.lucidworks.com/hc/en-us/articles/203959903-Bringing-up-downed-Solr-servers-that-don-t-want-to-come-up

If the overseer queue gets too long you get hit by this:
https://github.com/Netflix/curator/wiki/Tech-Note-4

Try to request the overseer status
(/solr/admin/collections?action=OVERSEERSTATUS). If that fails you
likely hit that problem. If so, you can also no longer use the ZooKeeper
command line client. You can now restart all your ZK nodes with
an increased jute.maxbuffer value. Once ZK is restarted you can use the
ZK command line client with the same jute.maxbuffer value and check how
many entries /overseer/queue has in ZK. Normally there should be a few
entries, but if you see thousands then you should delete them. I used a
few lines of Java code for that, again setting jute.maxbuffer to the
same value. Once cleaned up, restart the Solr nodes one by one and keep
an eye on the overseer status.
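
For what it's worth, those "few lines of Java" could look roughly like the
sketch below. The class name, connect string and session timeout are just
placeholders, and the only Solr-specific piece is the /overseer/queue path.
Run it with -Djute.maxbuffer set to the same increased value as on the ZK
servers, and only delete entries if the queue really is flooded:

import org.apache.zookeeper.ZooKeeper;
import java.util.List;

public class OverseerQueueCleanup {
    public static void main(String[] args) throws Exception {
        // args[0] = ZK connect string, e.g. "zk1:2181,zk2:2181,zk3:2181" (placeholder)
        ZooKeeper zk = new ZooKeeper(args[0], 30000, event -> {});
        try {
            List<String> entries = zk.getChildren("/overseer/queue", false);
            System.out.println("overseer queue entries: " + entries.size());
            // Delete only when the queue is clearly flooded (thousands of entries).
            for (String entry : entries) {
                zk.delete("/overseer/queue/" + entry, -1); // -1 = any znode version
            }
        } finally {
            zk.close();
        }
    }
}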

On 02.02.2017 10:52, Ravi Solr wrote:
> Following up on my previous email, the intermittent server unavailability
> seems to be linked to the interaction between Solr and Zookeeper. Can
> somebody help me understand what this error means and how to recover from
> it.
>
> 2017-02-02 09:44:24.648 ERROR
> (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr
> x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3)
> [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4]
> o.a.s.c.RecoveryStrategy Error while trying to recover.
> core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer/queue/qn-
>      at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>      at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>      at
> org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>      at
> org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>      at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>      at
> org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>      at
> org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
>      at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
>      at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
>      at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
>      at
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
>      at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>      at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>      at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>      at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>      at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>      at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>      at java.lang.Thread.run(Thread.java:745)
>
> Thanks
>
> Ravi Kiran Bhaskar


Re: 6.4.0 collection leader election and recovery issues

Posted by Ravi Solr <ra...@gmail.com>.
Thanks Shawn. Yes, I did index some docs after moving to 6.4.0. The release
notes did not mention anything about the format being changed, so I thought it
would be backward compatible. Yeah, my only recourse is to re-index the data.
Apart from that, there were weird problems overall with 6.4.0. I was excited
about using the unified highlighter, but the ZooKeeper flakiness, the constant
disconnections of Solr, and the occasional failure to elect a leader for some
collections made me roll back.

Anyway, thanks for promptly responding; I will be more careful next time.

Thanks

Ravi Kiran Bhaskar



On Thu, Feb 2, 2017 at 9:41 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> This sounds like you did some indexing after the upgrade, or possibly
> some index optimizing, so the parts of the index that were written (or
> merged) by the newer version are now in a format that the older version
> cannot use.  Perhaps the merge policy was changed, causing Solr to do
> some automatic merges once it started up.  I am not aware of anything in
> Solr that would write new segments without indexing input or a merge
> policy change.
>
> As far as I know, there is no straightforward way to go backwards with
> the index format.  If you want to downgrade and don't have a backup of
> your indexes from before the upgrade, you'll probably need to wipe the
> index directory and completely reindex.
>
> Solr will always use the newest default index format for new segments
> when you upgrade.  Contrary to many user expectations, setting
> luceneMatchVersion will *NOT* affect the index format, only the behavior
> of components that do field analysis.
>
> Downgrading the index format would involve writing a custom Lucene
> program that changes the active index format to the older version, then
> runs a forceMerge on the index.  It would be completely separate from
> Solr, and definitely not straightforward.
>
> Thanks,
> Shawn
>
>

Re: 6.4.0 collection leader election and recovery issues

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/2/2017 7:23 AM, Ravi Solr wrote:
> When I try to roll back from 6.4.0 to my original version of 6.0.1 it now
> throws another issue. Now I can't go to 6.4.0, nor can I roll back to 6.0.1:
>
> Could not load codec 'Lucene62'.  Did you forget to add
> lucene-backward-codecs.jar?
>     at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:429)
>     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:349)
>     at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)
>
> Hope this doesn't cost me dearly. Any ideas at least on how to roll back
> safely.

This sounds like you did some indexing after the upgrade, or possibly
some index optimizing, so the parts of the index that were written (or
merged) by the newer version are now in a format that the older version
cannot use.  Perhaps the merge policy was changed, causing Solr to do
some automatic merges once it started up.  I am not aware of anything in
Solr that would write new segments without indexing input or a merge
policy change.
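
If it helps to confirm which segments are affected, a small read-only check
along the lines of the sketch below will print the codec and Lucene version
that wrote each segment. Run it with the 6.4 lucene-core jar on the classpath;
the class name and argument handling are just placeholders:

import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class PrintSegmentCodecs {
    public static void main(String[] args) throws Exception {
        // args[0] = path to the core's data/index directory (placeholder)
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]))) {
            SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
            for (SegmentCommitInfo sci : infos) {
                // Segments reported with the newer codec are the ones 6.0.1 cannot read.
                System.out.println(sci.info.name
                        + "  codec=" + sci.info.getCodec().getName()
                        + "  writtenBy=" + sci.info.getVersion());
            }
        }
    }
}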

As far as I know, there is no straightforward way to go backwards with
the index format.  If you want to downgrade and don't have a backup of
your indexes from before the upgrade, you'll probably need to wipe the
index directory and completely reindex.

Solr will always use the newest default index format for new segments
when you upgrade.  Contrary to many user expectations, setting
luceneMatchVersion will *NOT* affect the index format, only the behavior
of components that do field analysis.

Downgrading the index format would involve writing a custom Lucene
program that changes the active index format to the older version, then
runs a forceMerge on the index.  It would be completely separate from
Solr, and definitely not straightforward.
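
Purely as an illustration of that last idea, such a program might be
structured like the sketch below. This is hypothetical and not a supported
downgrade path: the "Lucene60" codec name is an assumption, and the codecs
shipped for backward compatibility are generally read-only, which is a big
part of why this is not straightforward. If you experiment at all, do it on a
copy of the index:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class RewriteIndexWithOlderCodec {
    public static void main(String[] args) throws Exception {
        // args[0] = path to a COPY of the index directory (placeholder)
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]))) {
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            // Assumed codec name; this fails at runtime if no write-capable
            // codec is registered under it.
            cfg.setCodec(Codec.forName("Lucene60"));
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                // forceMerge rewrites every segment using the codec configured above.
                writer.forceMerge(1);
            }
        }
    }
}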

Thanks,
Shawn


Re: 6.4.0 collection leader election and recovery issues

Posted by Ravi Solr <ra...@gmail.com>.
When I try to roll back from 6.4.0 to my original version of 6.0.1 it now
throws another issue. Now I can't go to 6.4.0, nor can I roll back to 6.0.1:

Could not load codec 'Lucene62'.  Did you forget to add
lucene-backward-codecs.jar?
    at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:429)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:349)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)

Hope this doesn't cost me dearly. Any ideas at least on how to roll back
safely.

Thanks

Ravi Kiran Bhaskar

On Thu, Feb 2, 2017 at 4:52 AM, Ravi Solr <ra...@gmail.com> wrote:

> Following up on my previous email, the intermittent server unavailability
> seems to be linked to the interaction between Solr and Zookeeper. Can
> somebody help me understand what this error means and how to recover from
> it.
>
> 2017-02-02 09:44:24.648 ERROR (recoveryExecutor-3-thread-16-
> processing-n:xx.xxx.xxx.xxx:1234_solr x:clicktrack_shard1_replica4
> s:shard1 c:clicktrack r:core_node3) [c:clicktrack s:shard1 r:core_node3
> x:clicktrack_shard1_replica4] o.a.s.c.RecoveryStrategy Error while trying
> to recover. core=clicktrack_shard1_replica4:org.apache.zookeeper.
> KeeperException$SessionExpiredException: KeeperErrorCode = Session
> expired for /overseer/queue/qn-
>     at org.apache.zookeeper.KeeperException.create(
> KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(
> KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(
> SolrZkClient.java:391)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(
> SolrZkClient.java:388)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(
> ZkCmdExecutor.java:60)
>     at org.apache.solr.common.cloud.SolrZkClient.create(
> SolrZkClient.java:388)
>     at org.apache.solr.cloud.DistributedQueue.offer(
> DistributedQueue.java:244)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(
> RecoveryStrategy.java:334)
>     at org.apache.solr.cloud.RecoveryStrategy.run(
> RecoveryStrategy.java:222)
>     at com.codahale.metrics.InstrumentedExecutorService$
> InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>     at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.solr.common.util.ExecutorUtil$
> MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> Thanks
>
> Ravi Kiran Bhaskar

Re: 6.4.0 collection leader election and recovery issues

Posted by Ravi Solr <ra...@gmail.com>.
Following up on my previous email, the intermittent server unavailability
seems to be linked to the interaction between Solr and ZooKeeper. Can
somebody help me understand what this error means and how to recover from
it?

2017-02-02 09:44:24.648 ERROR
(recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr
x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3)
[c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4]
o.a.s.c.RecoveryStrategy Error while trying to recover.
core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer/queue/qn-
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
    at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Thanks

Ravi Kiran Bhaskar

On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr <ra...@gmail.com> wrote:

> Hello,
>          Yesterday I upgraded from 6.0.1 to 6.4.0, and it's been a straight
> 12-hour debugging spree!! Can somebody kindly help me out of this misery.
>
> I have a set of 8 single-shard collections with 3 replicas each. As soon as I
> updated the configs and started the servers, one of my collections got stuck
> with no leader. I have restarted Solr to no avail, and I also tried to force a
> leader via the Collections API, but that didn't work either. I also see that,
> from time to time, multiple Solr nodes go down all at the same time, and only
> a restart resolves the issue.
>
> The error snippets are shown below
>
> 2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:
> 10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1
> c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1
> x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying
> to recover. core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException:
> No registered leader was found after waiting for 4000ms , collection:
> clicktrack slice: shard1
>
> solr.log.9:2017-02-02 01:43:41.336 INFO  (zkCallback-4-thread-29-
> processing-n:10.128.159.245:9001_solr) [   ] o.a.s.c.c.ZkStateReader A
> cluster state change: [WatchedEvent state:SyncConnected
> type:NodeDataChanged path:/collections/clicktrack/state.json] for
> collection [clicktrack] has occurred - updating... (live nodes size: [1])
> solr.log.9:2017-02-02 01:43:42.224 INFO  (zkCallback-4-thread-29-
> processing-n:10.128.159.245:9001_solr) [   ] o.a.s.c.c.ZkStateReader A
> cluster state change: [WatchedEvent state:SyncConnected
> type:NodeDataChanged path:/collections/clicktrack/state.json] for
> collection [clicktrack] has occurred - updating... (live nodes size: [1])
> solr.log.9:2017-02-02 01:43:43.767 INFO  (zkCallback-4-thread-23-
> processing-n:10.128.159.245:9001_solr) [   ] o.a.s.c.c.ZkStateReader A
> cluster state change: [WatchedEvent state:SyncConnected
> type:NodeDataChanged path:/collections/clicktrack/state.json] for
> collection [clicktrack] has occurred - updating... (live nodes size: [1])
>
>
> Suspecting the worst, I backed up the index, renamed the collection's
> data folder, and restarted the servers; this time the collection got a
> proper leader. So is my index really corrupted? The Solr UI showed live nodes
> just like the logs, but without any leader. Even with the leader issue
> somewhat alleviated after renaming the data folder and letting Solr create
> a new data folder, my servers did go down a couple of times.
>
> I am not all that well versed with ZooKeeper... any trick to make ZooKeeper
> pick a leader and be happy? Did anybody have Solr/ZooKeeper issues with
> 6.4.0?
>
> Thanks
>
> Ravi Kiran Bhaskar
>