Posted to dev@lucene.apache.org by "Ben DeMott (JIRA)" <ji...@apache.org> on 2017/08/23 22:43:00 UTC

[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

    [ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139275#comment-16139275 ] 

Ben DeMott commented on SOLR-6707:
----------------------------------

We have experienced this multiple times.  We host inside AWS and ZooKeeper is spread across different availability zones...
This means the connections between the ZK nodes occasionally see high latency, which ZooKeeper doesn't seem to tolerate well. I wonder if anyone else is in this situation.
We never had so many ZooKeeper issues until we moved our infrastructure into AWS.

What triggered a backed-up overseer queue for us was a hung ephemeral node in ZooKeeper, which I discuss here:
https://stackoverflow.com/questions/23743424/solr-issue-clusterstate-says-we-are-the-leader-but-locally-we-dont-think-so/42210844#42210844

As the OP said, once this goes on long enough Solr runs out of file descriptors, which eventually brings down the whole cluster.

This bug in ZooKeeper appears to be the cause of the hung ephemeral node:
https://issues.apache.org/jira/browse/ZOOKEEPER-2355
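
To illustrate why the rapid-fire re-attempts clog the queue so quickly, here is a back-of-the-envelope sketch comparing the observed fixed-rate retries against a capped exponential backoff. The numbers and function names are hypothetical, not Solr's actual recovery logic:

```python
# Sketch: event volume hitting /overseer/queue during an outage window.
# Hypothetical model, not Solr's actual recovery code.

def naive_retries(window_s: float, rate_per_s: float) -> int:
    """Fixed-rate retries, as observed (~20-30 attempts/sec)."""
    return int(window_s * rate_per_s)

def backoff_retries(window_s: float, base_s: float = 0.5, cap_s: float = 30.0) -> int:
    """Retries under exponential backoff with a ceiling on the delay."""
    t, delay, attempts = 0.0, base_s, 0
    while t < window_s:
        attempts += 1
        t += delay
        delay = min(delay * 2, cap_s)
    return attempts

# Over a 10-minute outage at 25 attempts/sec:
print(naive_retries(600, 25))   # 15000 overseer events
print(backoff_retries(600))     # 25 events over the same window
```

At fixed-rate retry, a ten-minute outage alone generates thousands of queue entries, which matches the "clogged /overseer/queue" symptom; any bounded backoff keeps the event count trivial.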

> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6707
>                 URL: https://issues.apache.org/jira/browse/SOLR-6707
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>             Fix For: 5.2, 6.0
>
>
> We experienced an issue the other day that brought a production solr server down, and this is what we found after investigating:
> - A running Solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for SolrCloud. This was thought to be harmless since it isn't currently in use. 
> - Solr experienced an "internal server error" supposedly because of "No space left on device" even though we appeared to have ~10GB free. 
> - Solr immediately went into recovery, and subsequent leader election for each shard of each core. 
> - Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs. 
> - Solr then began rapid-fire reattempting recovery of said node, trying maybe 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster coordination can no longer play out, and Solr topples over. 
> I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However, I can see other potential scenarios that might cause the same issue to arise. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org