Posted to issues@solr.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2023/03/23 19:59:00 UTC
[jira] [Resolved] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley resolved SOLR-6707.
--------------------------------
Resolution: Abandoned
Closing as "Abandoned" (not sure whether a better status applies) because so much time has passed that it is doubtful the originally observed behavior would still occur today.
> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
> Key: SOLR-6707
> URL: https://issues.apache.org/jira/browse/SOLR-6707
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.10
> Reporter: James Hardwick
> Priority: Major
> Fix For: 6.0, 5.2
>
>
> We experienced an issue the other day that brought a production Solr server down, and this is what we found after investigating:
> - A running Solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for SolrCloud. This was thought to be harmless since it's not currently in use.
> - Solr experienced an "internal server error" supposedly because of "No space left on device" even though we appeared to have ~10GB free.
> - Solr immediately went into recovery, followed by leader election for each shard of each core.
> - Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs.
> - Solr then began rapid-fire reattempting recovery of said node, trying maybe 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster coordination can no longer play out, and Solr topples over.
> I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However, I can see other potential scenarios that might cause the same issue to arise.
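For illustration only: the rapid-fire re-attempts described in the report are the kind of behavior that a capped exponential backoff between retries is usually meant to prevent. The sketch below is not Solr's recovery code and does not describe how (or whether) Solr addressed this; the class name RetryBackoff, the method delayMillis, and its parameters are all hypothetical.

```java
/** Hypothetical helper, not part of Solr: capped exponential backoff between retry attempts. */
final class RetryBackoff {
    private RetryBackoff() {}

    /**
     * Returns the wait before retry number {@code attempt} (0-based):
     * baseMillis doubled once per prior attempt, never exceeding maxMillis.
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Clamp to the cap before doubling so the multiplication cannot overflow.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }
}
```

With an assumed base of 100 ms and a cap of 60 s, a caller would sleep for RetryBackoff.delayMillis(attempt, 100, 60_000) before each new attempt, so a core that can never recover settles at roughly one attempt per minute instead of 20-30 per second.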