Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2019/10/02 16:46:00 UTC

[jira] [Created] (SOLR-13811) possible autoAddReplicas bug and/or (Hdfs)AutoAddReplicasIntegrationTest refactoring / fixes

Chris M. Hostetter created SOLR-13811:
-----------------------------------------

             Summary: possible autoAddReplicas bug and/or (Hdfs)AutoAddReplicasIntegrationTest refactoring / fixes
                 Key: SOLR-13811
                 URL: https://issues.apache.org/jira/browse/SOLR-13811
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Chris M. Hostetter


I've noticed a pattern of failure behavior in Jenkins runs of {{AutoAddReplicasIntegrationTest}} (which mostly manifests in the subclass {{HdfsAutoAddReplicasIntegrationTest}}, probably due to timing) which indicates either:
 # the test is too contrived, and expects {{autoAddReplicas}} to kick in in a situation that the current impl of {{NodeLostTrigger}} isn't smart enough to handle
 # {{NodeLostTrigger}} _should_ be smart enough to handle this, but isn't.

The test failure is currently somewhat finicky to reproduce, and depends on a node being stopped, restarted, and stopped again, with an affected collection changed from {{autoAddReplicas=false}} to {{autoAddReplicas=true}} before the second "stop".

Regardless of which of the 2 above is true: the test itself is somewhat convoluted. It creates a sequence of events (some randomized, some static) and asserts specific outcomes after each, but the timing of scheduled triggers like {{NodeLostTrigger}}, and the interplay of things like "pick a random node to shut down" with a subsequent "explicitly shut down node2" (even if it was the node randomly shut down earlier), is confusing.

I'm creating this issue to track two tightly dependent objectives:
 # refactoring this test to:
 ** better isolate the specific things it's trying to test in individual test methods.
 ** have a singular test method that triggers the specific sequence of events that is currently problematic (ideally in such a way that it reliably fails).
 # AwaitsFix this new test method until someone with a better understanding of the {{autoAddReplicas}} / {{NodeLostTrigger}} code can assess whether the test is faulty or the code being tested is faulty.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
