Posted to issues@solr.apache.org by "Mayya Sharipova (Jira)" <ji...@apache.org> on 2021/06/23 13:50:08 UTC

[jira] [Updated] (SOLR-15386) Internal DOWNNODE request will mark replicas down even if their host node is now live

     [ https://issues.apache.org/jira/browse/SOLR-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayya Sharipova updated SOLR-15386:
-----------------------------------
    Security:     (was: Public)

> Internal DOWNNODE request will mark replicas down even if their host node is now live
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-15386
>                 URL: https://issues.apache.org/jira/browse/SOLR-15386
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 8.6
>            Reporter: Megan Carey
>            Priority: Major
>
> When a node is shutting down, it calls into:
>  # [CoreContainer.shutdown()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1026]
>  # [ZkController.preClose()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L612]
>  # [ZkController.publishNodeAsDown|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2753]
> This sends a request to the Overseer to mark all replicas on the soon-to-be-down node as DOWN, which is handled in:
> # [Overseer.processMessage()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/Overseer.java#L459]
> # [NodeMutator.downNode()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/overseer/NodeMutator.java#L48]
> The issue we encountered was as follows:
> # Solr node shuts down
> # DOWNNODE message is enqueued for Overseer
> # Solr node comes back up (running on K8s, so a new node is auto-started as soon as the old node is detected as down)
> # DOWNNODE is dequeued for processing and marks all replicas DOWN for the node, which is now live again.
> The only place where these replicas would later be marked ACTIVE again is after ShardLeaderElection, but we did not reach that case. An easy fix is to check node liveness before marking replicas down, but many tests fail with this change. Was this the intended functionality?
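
The sequence of events and the proposed liveness check can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: DownNodeSketch, its fields, and its methods are hypothetical names invented for illustration and are not Solr's ZkController, Overseer, or NodeMutator code. It also assumes the restarted node registers under the same node name as the old one.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy model only: none of these types are Solr's real classes.
class DownNodeSketch {
    enum State { ACTIVE, DOWN }

    // replica name -> name of the node hosting it
    final Map<String, String> replicaToNode = new HashMap<>();
    // replica name -> last published state
    final Map<String, State> replicaState = new HashMap<>();
    // nodes currently registered as live
    final Set<String> liveNodes = new HashSet<>();
    // stand-in for the queue of pending DOWNNODE messages (one node name each)
    final Queue<String> downNodeQueue = new ArrayDeque<>();

    void addReplica(String replica, String node) {
        replicaToNode.put(replica, node);
        replicaState.put(replica, State.ACTIVE);
    }

    // Behaviour described in the report: every replica on the node is
    // marked DOWN, regardless of whether the node is live right now.
    void downNodeUnconditional(String node) {
        for (Map.Entry<String, String> e : replicaToNode.entrySet()) {
            if (e.getValue().equals(node)) {
                replicaState.put(e.getKey(), State.DOWN);
            }
        }
    }

    // Hypothetical guarded variant: treat the message as stale and skip it
    // if the node is live at processing time.
    void downNodeIfNotLive(String node) {
        if (liveNodes.contains(node)) {
            return;
        }
        downNodeUnconditional(node);
    }

    // Replays the four steps from the report and returns the final
    // state of the single replica hosted on the restarted node.
    static State replay(boolean guarded) {
        DownNodeSketch cluster = new DownNodeSketch();
        cluster.liveNodes.add("node1");
        cluster.addReplica("replica1", "node1");

        cluster.liveNodes.remove("node1");            // 1. node shuts down
        cluster.downNodeQueue.add("node1");           // 2. DOWNNODE is enqueued
        cluster.liveNodes.add("node1");               // 3. node comes back up
        String node = cluster.downNodeQueue.remove(); // 4. DOWNNODE is dequeued
        if (guarded) {
            cluster.downNodeIfNotLive(node);
        } else {
            cluster.downNodeUnconditional(node);
        }
        return cluster.replicaState.get("replica1");
    }

    public static void main(String[] args) {
        System.out.println("unconditional: " + replay(false));
        System.out.println("guarded:       " + replay(true));
    }
}
```

In this model the unconditional path leaves replica1 DOWN on a live node, while the guarded path leaves it ACTIVE. The sketch does not address why many existing tests fail with such a check, which the report leaves as an open question.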



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
