Posted to dev@lucene.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2018/06/12 21:01:00 UTC

[jira] [Created] (SOLR-12479) TriggerAction failures may cause inconsistent trigger behavior

Andrzej Bialecki  created SOLR-12479:
----------------------------------------

             Summary: TriggerAction failures may cause inconsistent trigger behavior
                 Key: SOLR-12479
                 URL: https://issues.apache.org/jira/browse/SOLR-12479
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: AutoScaling
    Affects Versions: 7.4, master (8.0)
            Reporter: Andrzej Bialecki 


The following issue occasionally appears when running {{TestLargeCluster.testNodeLost}}.

The test kills a large number of nodes, waiting for a certain time between kills. Depending on the kill sequence and the length of {{waitFor}}, the target node of a MOVEREPLICA operation may itself have just been killed by the time {{ExecutePlanAction}} processes it. This results in an exception and a FAILED status for the action.

However, this failure is not reported back to the trigger as an unprocessed event, because it happens asynchronously in the action executor (in {{ScheduledTriggers}}). The trigger therefore resets its internal state and stops tracking the lost node. As a result, the replicas remain lost: even if there’s a Policy violation, the event will not be generated again, and the replica count won’t return to its original value.
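The race described above can be illustrated with a minimal, self-contained sketch. It does not use any Solr classes: {{AsyncActionFailureSketch}}, {{failingAction}}, {{resetBeforeOutcome}} and {{resetOnlyOnSuccess}} are hypothetical names invented for illustration, and the tracked-node set merely stands in for the trigger's internal state. It is one possible way to picture the problem, not the actual {{ScheduledTriggers}} code or a proposed fix.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names exist in Solr.
class AsyncActionFailureSketch {

    // Stand-in for an action that fails, e.g. MOVEREPLICA to a dead target node.
    static void failingAction(String node) {
        throw new IllegalStateException("target node is gone, cannot restore replicas of " + node);
    }

    // Mirrors the reported behavior: state is reset before the outcome is known.
    // Returns true if the node is still tracked afterwards.
    static boolean resetBeforeOutcome(Set<String> tracked, String node, ExecutorService executor)
            throws InterruptedException {
        Future<?> outcome = executor.submit(() -> failingAction(node));
        tracked.remove(node); // trigger forgets the lost node immediately
        try {
            outcome.get(); // wait only so the sketch is deterministic
        } catch (ExecutionException e) {
            // failure is swallowed (only logged in the reported scenario)
        }
        return tracked.contains(node);
    }

    // Alternative: forget the node only after the action is known to have succeeded.
    static boolean resetOnlyOnSuccess(Set<String> tracked, String node, ExecutorService executor)
            throws InterruptedException {
        Future<?> outcome = executor.submit(() -> failingAction(node));
        try {
            outcome.get();
            tracked.remove(node); // reached only when the action succeeded
        } catch (ExecutionException e) {
            // keep tracking the node so a later event can be generated
        }
        return tracked.contains(node);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Set<String> a = ConcurrentHashMap.newKeySet();
            a.add("node1");
            System.out.println("still tracked (reset before outcome): "
                    + resetBeforeOutcome(a, "node1", executor));

            Set<String> b = ConcurrentHashMap.newKeySet();
            b.add("node1");
            System.out.println("still tracked (reset only on success): "
                    + resetOnlyOnSuccess(b, "node1", executor));
        } finally {
            executor.shutdown();
        }
    }
}
```

In the first variant the node is forgotten even though the action failed, which corresponds to the lost replicas never being restored. The second variant blocks on the result purely for simplicity; a real implementation would more likely use a callback or re-queue the event.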

Also, {{ScheduledTriggers:311}} and {{ScheduledTriggers:323}} only log the exception and don’t fire listeners with a FAILED status, which is a bug.
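As a rough illustration of the difference between logging a failure and also notifying listeners, here is a small sketch. The {{Stage}} enum, the {{Listener}} interface and {{runAction}} are hypothetical and only loosely modeled on the idea of trigger listeners; they are not the Solr API.

```java
import java.util.List;

// Hypothetical sketch: these types are not Solr's listener API.
class FailedStageNotificationSketch {

    enum Stage { SUCCEEDED, FAILED }

    interface Listener {
        void onEvent(Stage stage, String message);
    }

    // Runs an action and reports the outcome to every listener,
    // instead of only logging the exception.
    static void runAction(Runnable action, List<Listener> listeners) {
        try {
            action.run();
            for (Listener l : listeners) {
                l.onEvent(Stage.SUCCEEDED, "action completed");
            }
        } catch (RuntimeException e) {
            System.err.println("action failed: " + e.getMessage()); // logging alone is not enough
            for (Listener l : listeners) {
                l.onEvent(Stage.FAILED, e.getMessage()); // listeners also learn about the failure
            }
        }
    }
}
```

With this shape, any component that subscribes as a listener can react to a FAILED outcome, for example by recording it or scheduling a retry.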


