Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2019/10/02 17:32:00 UTC
[jira] [Updated] (SOLR-13811) possible autoAddReplicas bug and/or (Hdfs)AutoAddReplicasIntegrationTest refactoring / fixes

     [ https://issues.apache.org/jira/browse/SOLR-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated SOLR-13811:
--------------------------------------
    Attachment: hoss_local_failure_after_refactoring.log.txt
                apache_Lucene-Solr-NightlyTests-8.x_221.log.txt
        Status: Open  (was: Open)

As noted by gitbot, I've committed some refactoring to help clean this up and isolate the problematic test logic.
----
I'm attaching two files:
 * {{apache_Lucene-Solr-NightlyTests-8.x_221.log.txt}} - showing an example of how the problem has manifested in jenkins builds _prior_ to the refactoring I've just committed.
 * {{hoss_local_failure_after_refactoring.log.txt}} - showing how the newly refactored {{testRapidStopStartStopWithPropChange()}} can fail demonstrating the same problem in isolation.

Note that {{testRapidStopStartStopWithPropChange()}} does not fail deterministically – the behavior depends on exactly when {{NodeLostTrigger}} fires _after_ the node is restarted, but before it is stopped again. Perhaps there is a way to "pause" the triggers to increase the odds of this happening? ... not sure.  (It also seems to fail much more often in the Hdfs version of the test ... I'm not sure if that's because the MOVEREPLICA logic works faster/slower than in the non-hdfs situation? ... I actually haven't been able to trigger the failure with the refactoring in place)
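
To illustrate why the outcome depends on timing, here is a toy model (not Solr code, and not how {{NodeLostTrigger}} is actually implemented). It assumes a trigger that runs on a fixed period and asks whether any run lands in the window between the node restart and the second stop. The class name, method name, and all the millisecond values are hypothetical and chosen only for illustration.

```java
// Toy model only: a periodically scheduled trigger and a short "node is back up" window.
// Nothing here uses or mirrors Solr's real classes; all names and numbers are hypothetical.
class TriggerWindowSketch {

    /**
     * Returns true if a trigger that runs at offsetMs, offsetMs + periodMs, ...
     * runs at least once in the half-open window [restartMs, secondStopMs).
     */
    static boolean firesInWindow(long offsetMs, long periodMs, long restartMs, long secondStopMs) {
        if (periodMs <= 0) {
            throw new IllegalArgumentException("periodMs must be positive");
        }
        long first;
        if (offsetMs >= restartMs) {
            // The very first run already happens at or after the restart.
            first = offsetMs;
        } else {
            // Number of whole periods needed to reach restartMs, rounded up.
            long k = (restartMs - offsetMs + periodMs - 1) / periodMs;
            first = offsetMs + k * periodMs;
        }
        return first < secondStopMs;
    }

    public static void main(String[] args) {
        long periodMs = 1000;      // assumed trigger period
        long restartMs = 2500;     // node comes back up
        long secondStopMs = 2900;  // node is stopped again

        // Same sequence of events, different trigger phase: the answer changes.
        for (long offsetMs = 0; offsetMs < periodMs; offsetMs += 200) {
            System.out.println("offset=" + offsetMs + "ms -> trigger runs while node is up: "
                + firesInWindow(offsetMs, periodMs, restartMs, secondStopMs));
        }
    }
}
```

With these made-up numbers, only the offsets 600 and 800 put a trigger run inside the 400ms window; the other offsets miss it entirely. The same stop/start/stop sequence can therefore be observed differently from run to run, depending on the phase of the scheduled trigger.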

[~ab] : can you please take a look at this and chime in with whether you think the current code in {{testRapidStopStartStopWithPropChange()}} is something that should pass reliably given the way the code is designed to work? ... if so please update the jira summary/description to make it clear what the underlying bug is, if not we should go ahead and: delete this test method, reclassify this issue as a "Test" task, and resolve as "DONE".

> possible autoAddReplicas bug and/or (Hdfs)AutoAddReplicasIntegrationTest refactoring / fixes
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13811
>                 URL: https://issues.apache.org/jira/browse/SOLR-13811
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: apache_Lucene-Solr-NightlyTests-8.x_221.log.txt, hoss_local_failure_after_refactoring.log.txt
>
>
> I've noticed a pattern of failure behavior in jenkins runs of {{AutoAddReplicasIntegrationTest}} (which mostly manifests in the subclass {{HdfsAutoAddReplicasIntegrationTest}}, probably due to timing) which indicates either:
>  # the test is too contrived, and expects {{autoAddReplicas}} to kick in in a situation where the current impl of {{NodeLostTrigger}} isn't smart enough to handle
>  # {{NodeLostTrigger}} _should_ be smart enough to handle this, but isn't.
> The test failure is currently somewhat finicky to reproduce, and depends on a node being stopped, restarted, and stopped again – while an affected collection is changed from {{autoAddReplicas=false}} to {{autoAddReplicas=true}} before the second "stop".
> Regardless of which of the 2 above is true: the test itself is somewhat convoluted. It creates a sequence of events (some randomized, some static) and asserts specific outcomes after each – but the timing of scheduled triggers like {{NodeLostTrigger}}, and the interplay of things like "pick a random node to shutdown" with a subsequent "explicitly shut down node2" (even if it was the node randomly shut down earlier) is confusing.
> I'm creating this issue to track two tightly dependent objectives:
>  # refactoring this test to:
>  ** better isolate the specific things it's trying to test in individual test methods.
>  ** have a singular test method that triggers the specific sequence of events that is currently problematic (ideally in such a way that it reliably fails).
>  # AwaitsFix this new test method until someone with a better understanding of the {{autoAddReplicas}} / {{NodeLostTrigger}} code can assess if the test is faulty or the code being tested is faulty.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
