Posted to issues@solr.apache.org by "Mayya Sharipova (Jira)" <ji...@apache.org> on 2021/06/23 13:50:08 UTC
[jira] [Updated] (SOLR-15386) Internal DOWNNODE request will mark replicas down even if their host node is now live
[ https://issues.apache.org/jira/browse/SOLR-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mayya Sharipova updated SOLR-15386:
-----------------------------------
Security: (was: Public)
> Internal DOWNNODE request will mark replicas down even if their host node is now live
> -------------------------------------------------------------------------------------
>
> Key: SOLR-15386
> URL: https://issues.apache.org/jira/browse/SOLR-15386
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 8.6
> Reporter: Megan Carey
> Priority: Major
>
> When a node is shutting down, it calls into:
> # [CoreContainer.shutdown()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1026]
> # [ZkController.preClose()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L612]
> # [ZkController.publishNodeAsDown()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2753]
> This sends a DOWNNODE request to the Overseer, asking it to mark all replicas on the soon-to-be-down node as DOWN. The Overseer handles the request in:
> # [Overseer.processMessage()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/Overseer.java#L459]
> # [NodeMutator.downNode()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/overseer/NodeMutator.java#L48]
> The issue we encountered was as follows:
> # Solr node shuts down
> # DOWNNODE message is enqueued for Overseer
> # Solr node comes back up (we run on K8s, so a new node is auto-started as soon as the old node is detected as down)
> # DOWNNODE is dequeued for processing and marks all replicas DOWN for the node, which is now live.
> The only place where these replicas would later be marked ACTIVE again is after ShardLeaderElection, but we did not reach that case. An easy fix is to check node liveness before marking replicas down, but many tests fail with this change. Is the current behavior intended?
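The sequence described in the report, and the liveness check it proposes, can be sketched with a toy model. This is only an illustration under stated assumptions: DownNodeSketch, processDownNode, replay and the node and replica names are hypothetical and are not Solr classes or APIs, the queue and live-node set stand in for the Overseer queue and the live nodes tracked in ZooKeeper, and the restarted node is assumed to come back under the same node name. It does not address why existing tests fail with such a check.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy model only: every name here is hypothetical, none of this is Solr code.
class DownNodeSketch {
    enum State { ACTIVE, DOWN }

    // Stand-in for the set of live nodes.
    final Set<String> liveNodes = new HashSet<>();
    // Replica name -> name of the node hosting it.
    final Map<String, String> replicaNode = new HashMap<>();
    // Replica name -> last published state.
    final Map<String, State> replicaState = new HashMap<>();
    // Stand-in for the Overseer queue; each entry is the node named in a DOWNNODE message.
    final Queue<String> downNodeQueue = new ArrayDeque<>();

    void addReplica(String replica, String node) {
        replicaNode.put(replica, node);
        replicaState.put(replica, State.ACTIVE);
    }

    // Processes one queued DOWNNODE message and returns how many replicas were marked DOWN.
    // With checkLiveness set, a message for a node that is live again is treated as stale.
    int processDownNode(boolean checkLiveness) {
        String node = downNodeQueue.remove();
        if (checkLiveness && liveNodes.contains(node)) {
            return 0; // node came back before the message was processed
        }
        int marked = 0;
        for (Map.Entry<String, String> entry : replicaNode.entrySet()) {
            if (entry.getValue().equals(node)) {
                replicaState.put(entry.getKey(), State.DOWN);
                marked++;
            }
        }
        return marked;
    }

    // Replays the four steps from the report and returns the replica's final state.
    static State replay(boolean checkLiveness) {
        DownNodeSketch cluster = new DownNodeSketch();
        cluster.liveNodes.add("node1");
        cluster.addReplica("replica1", "node1");

        cluster.liveNodes.remove("node1");      // 1. node shuts down
        cluster.downNodeQueue.add("node1");     // 2. DOWNNODE is enqueued
        cluster.liveNodes.add("node1");         // 3. node comes back up
        cluster.processDownNode(checkLiveness); // 4. DOWNNODE is dequeued and processed
        return cluster.replicaState.get("replica1");
    }

    public static void main(String[] args) {
        System.out.println("without liveness check: " + replay(false));
        System.out.println("with liveness check:    " + replay(true));
    }
}
```

In this model the replica ends up DOWN without the check and stays ACTIVE with it. The check only decides whether a queued message is stale at processing time; whether that is safe in every Solr code path is exactly the open question raised above.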
--
This message was sent by Atlassian Jira
(v8.3.4#803005)