Posted to issues@tez.apache.org by "Tassapol Athiapinya (JIRA)" <ji...@apache.org> on 2014/04/16 12:40:18 UTC

[jira] [Created] (TEZ-1060) Add randomness to fault tolerance tests

Tassapol Athiapinya created TEZ-1060:
----------------------------------------

             Summary: Add randomness to fault tolerance tests
                 Key: TEZ-1060
                 URL: https://issues.apache.org/jira/browse/TEZ-1060
             Project: Apache Tez
          Issue Type: Improvement
    Affects Versions: 0.5.0
            Reporter: Tassapol Athiapinya



We have TestFaultTolerance unit tests that verify whether the AM correctly handles processor failures and input failures. TestFaultTolerance uses TestProcessor and TestInput to simulate controlled failure scenarios for a DAG. In each test, on the processor side, we select which tasks fail (do-fail), which physical task indexes fail (failing-task-index), and up to which attempt these physical tasks fail (failing-upto-task-attempt). On the input side, we select which tasks have failed inputs (do-fail), which physical task indexes fail (failing-task-index), up to which attempt these physical tasks have failed inputs (failing-task-attempt), which physical inputs to fail (failing-input-index), and up to which version of the physical inputs tasks reject (failing-upto-input-attempt). In addition to processor and input failures, we also check the values produced by specific physical tasks to see whether the inputs of downstream vertices match the outputs of upstream vertices (verify-value, verify-task-index).

These tests were added during 0.3.0 and 0.4.0. They helped us find several issues in the Tez AM, fix them, and improve the AM's stability. Although the current unit tests are useful, they are limited to scenarios carefully chosen by individual contributors. When Tez is used under heavy load, more issues are likely to arise. To bring fault tolerance testing to the next level, we should add tests that generate randomized failure scenarios. Every time a contributor runs the unit tests, a new scenario would be generated, giving the community more opportunities to report and fix new issues.
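To make the knobs above concrete, here is a minimal sketch of how one controlled scenario might be expressed as configuration. The property keys (e.g. "tez.failing-processor.do-fail") are illustrative placeholders for the parameters described above, not the actual constant names defined by TestProcessor/TestInput:

    import org.apache.hadoop.conf.Configuration;

    public class FailureScenarioSketch {
      // Sketch only: property names are hypothetical stand-ins for the
      // do-fail / failing-task-index / failing-upto-* knobs described above.
      public static Configuration buildScenario() {
        Configuration conf = new Configuration();

        // Processor side: task index 1 fails on attempts 0 and 1, then succeeds.
        conf.setBoolean("tez.failing-processor.do-fail", true);
        conf.set("tez.failing-processor.failing-task-index", "1");
        conf.setInt("tez.failing-processor.failing-upto-task-attempt", 1);

        // Input side: task index 0 reports a failed fetch of physical input 0
        // on its first attempt, and rejects input versions up to 1.
        conf.setBoolean("tez.failing-input.do-fail", true);
        conf.set("tez.failing-input.failing-task-index", "0");
        conf.setInt("tez.failing-input.failing-task-attempt", 0);
        conf.set("tez.failing-input.failing-input-index", "0");
        conf.setInt("tez.failing-input.failing-upto-input-attempt", 1);

        // Verification: downstream task 0 is expected to see this value,
        // confirming its inputs match the upstream vertex's outputs.
        conf.set("tez.failing-processor.verify-task-index", "0");
        conf.setInt("tez.failing-processor.verify-value", 3);
        return conf;
      }
    }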

There are a few criteria for the new tests:
- We want to keep the time needed to run the unit tests minimal. Contributors run on different hardware, and it is inconvenient if people with slower machines have to spend too much time running tests for a patch.
- A random scenario needs to be constrained enough that its expected behavior is known. This means the test itself has to validate the generated parameters first (see the sketch after this list).
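As a starting point for the randomized tests, here is a rough sketch, assuming a hypothetical Scenario holder class, of how a scenario could be generated from a logged seed and kept within the AM's attempt limit so the expected outcome (DAG success) is still known and the runtime stays bounded:

    import java.util.Random;

    public class RandomScenarioGenerator {
      // Hypothetical holder for the failure knobs described above.
      static class Scenario {
        long seed;
        int failingTaskIndex;
        int failUptoAttempt;
        boolean failInput;
      }

      static Scenario generate(int numTasks, int maxTaskAttempts) {
        long seed = System.nanoTime();
        // Log the seed so a failing run can be reproduced and reported.
        System.out.println("Fault tolerance scenario seed: " + seed);
        Random rand = new Random(seed);

        Scenario s = new Scenario();
        s.seed = seed;
        s.failingTaskIndex = rand.nextInt(numTasks);
        // Validate the parameters up front: keep the last failing attempt
        // strictly below the attempt limit so the DAG is still expected to
        // succeed, and so the test does not run for too many attempts.
        s.failUptoAttempt = rand.nextInt(Math.max(1, maxTaskAttempts - 1));
        s.failInput = rand.nextBoolean();
        return s;
      }
    }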




--
This message was sent by Atlassian JIRA
(v6.2#6252)