Posted to dev@lucene.apache.org by "Mikhail Khludnev (JIRA)" <ji...@apache.org> on 2019/04/24 06:53:00 UTC

[jira] [Commented] (SOLR-12291) Async prematurely reports completed status that causes severe shard loss

    [ https://issues.apache.org/jira/browse/SOLR-12291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824853#comment-16824853 ] 

Mikhail Khludnev commented on SOLR-12291:
-----------------------------------------

Let's start from scratch. Today's patch just adds a JUnit assume that requires getting statuses from all nodes. For now, this assumption mostly fails, which causes the test to be reported as skipped. I believe that once the core problem (overlapping keys in the async IDs map) is fixed, the test should pass, since every node will then report its status. I'm going to commit just this test amendment soon; shout out if you want to veto.

> Async prematurely reports completed status that causes severe shard loss
> ------------------------------------------------------------------------
>
>                 Key: SOLR-12291
>                 URL: https://issues.apache.org/jira/browse/SOLR-12291
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore, SolrCloud
>            Reporter: Varun Thacker
>            Assignee: Mikhail Khludnev
>            Priority: Major
>         Attachments: SOLR-12291.patch, SOLR-12291.patch, SOLR-12291.patch, SOLR-122911.patch
>
>
> The OverseerCollectionMessageHandler sliceCmd assumes that only one replica exists on each node.
> When multiple replicas of a slice are on the same node, we only track one replica's async request. This happens because the async requestMap's key is "node_name".
> I discovered this when [~alabax] shared some logs of a restore issue, where the second replica got added before the first replica had completed its restorecore action.
> While looking at the logs, I noticed that the overseer never called REQUESTSTATUS for the restorecore action, almost as if it had missed tracking that particular async request.
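To make the failure mode concrete, here is a minimal Java sketch under a simplified model of the overseer's bookkeeping. The `Replica` class, the `restore-<core>` async ID scheme, and the node and core names are all hypothetical and are not Solr's actual code. The sketch only contrasts a map keyed by node name with one keyed by the (unique) async request ID, which is one possible way to avoid the overlap.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Solr's OverseerCollectionMessageHandler implementation.
public class AsyncRequestTracking {

    /** Illustrative replica placement (hypothetical, not a Solr class). */
    static final class Replica {
        final String nodeName;
        final String coreName;

        Replica(String nodeName, String coreName) {
            this.nodeName = nodeName;
            this.coreName = coreName;
        }
    }

    /** Problematic pattern: keyed by node name, so replicas sharing a node overwrite each other. */
    static Map<String, String> trackByNodeName(List<Replica> replicas) {
        Map<String, String> nodeToAsyncId = new HashMap<>();
        for (Replica r : replicas) {
            String asyncId = "restore-" + r.coreName; // hypothetical async ID scheme
            nodeToAsyncId.put(r.nodeName, asyncId);   // replaces any earlier entry for this node
        }
        return nodeToAsyncId;
    }

    /** Alternative pattern: keyed by the unique async ID, so every request is kept for polling. */
    static Map<String, String> trackByAsyncId(List<Replica> replicas) {
        Map<String, String> asyncIdToNode = new LinkedHashMap<>();
        for (Replica r : replicas) {
            asyncIdToNode.put("restore-" + r.coreName, r.nodeName);
        }
        return asyncIdToNode;
    }

    public static void main(String[] args) {
        // Two replicas of the same slice placed on the same node, as in the reported restore.
        List<Replica> replicas = List.of(
                new Replica("node1:8983_solr", "coll_shard1_replica_n1"),
                new Replica("node1:8983_solr", "coll_shard1_replica_n2"));

        System.out.println("keyed by node:     " + trackByNodeName(replicas).size() + " request(s) tracked");
        System.out.println("keyed by async ID: " + trackByAsyncId(replicas).size() + " request(s) tracked");
    }
}
```

Run as written, the node-keyed map tracks 1 request while the async-ID-keyed map tracks 2. The first replica's request is the one dropped, so in this model a status poll would never be issued for it, which matches the missing REQUESTSTATUS call seen in the logs.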



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
