Posted to dev@reef.apache.org by "Mariia Mykhailova (JIRA)" <ji...@apache.org> on 2016/10/25 23:59:59 UTC

[jira] [Commented] (REEF-1625) Fix TestFailMapperEvaluatorsOnDispose failures in AppVeyor

    [ https://issues.apache.org/jira/browse/REEF-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606856#comment-15606856 ] 

Mariia Mykhailova commented on REEF-1625:
-----------------------------------------

Sometimes we get {{Actual: 6}}. I suspect what happens here is the following.

The test is supposed to fail the evaluators only after all tasks have completed, so that IMRU FT doesn't start a retry. We use {{Dispose}} to simulate the failure at that point, since we don't want to modify IMRU code and thus need some task-initiated failure.

However, we don't wait for all tasks to complete before we start disposing of them. Tasks are disposed of immediately after they report completion, following the normal REEF task lifecycle. So there is a race condition: if all tasks complete before the ones with failure injected get disposed of, the test succeeds; but if one of the tasks with failure injected completes early and proceeds to dispose, the system sees an evaluator failure before the remaining task completions and goes on to retry.
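
The race can be illustrated with a small timing model. This is a language-neutral Python sketch, not REEF code: the task ids, the timestamps, and the function {{evaluator_fails_early}} are all hypothetical, and it assumes a task is disposed of at the very moment it reports completion.

```python
def evaluator_fails_early(completion_times, failing_tasks):
    """Return True if a failure-injected task is disposed of before all tasks complete.

    completion_times: dict mapping a (hypothetical) task id to its completion time.
    failing_tasks: ids of the tasks whose Dispose simulates an evaluator failure.
    Assumption: a task is disposed of at the moment it reports completion.
    """
    last_completion = max(completion_times.values())
    return any(completion_times[t] < last_completion for t in failing_tasks)


# Failure-injected tasks finish last: no failure is seen before all completions.
ok_run = {"task-0": 1.0, "task-1": 1.25, "task-2": 1.5, "task-3": 1.5}
# A failure-injected task finishes first: its failure precedes other completions.
racy_run = {"task-0": 1.0, "task-1": 1.25, "task-2": 0.5, "task-3": 1.5}

print(evaluator_fails_early(ok_run, ["task-2", "task-3"]))    # False
print(evaluator_fails_early(racy_run, ["task-2", "task-3"]))  # True
```

In this model the second run is the one that would end in a retry, because the evaluator failure is observed while other tasks are still running.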

This is a bit tricky to fix. I see two options:
* analyze the number of retries done and amend our test verification to account for them. But this is imprecise, because we don't know how many tasks had a chance to complete before the failed-evaluator event. So we can only check that the number of failed evaluators = 2 * numberOfRetriesDone (i.e. at the last retry there were also 2 failed evaluators) and that the job succeeded. Also, there is a non-zero probability of the failing task being fast every time (this can be reduced by injecting only 1 failure each time instead of 2).
* delay the failure. Can we do a short {{Sleep}} before the failure in the failing evaluators? This will make the tests faster than they are now, because no retry would be involved. Synchronizing via the driver with the completion of all other evaluators would bring in a lot of complexity, which I'd rather avoid.
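
Both options can be sketched in a few lines. This is an illustrative Python model rather than REEF or IMRU code, and the function names and parameters are hypothetical. It assumes that {{attempts}} counts the initial run plus every retry (my reading of numberOfRetriesDone above), that every attempt ends with the same number of failed evaluators, and that a task is disposed of as soon as it completes plus any injected delay.

```python
def failed_count_is_consistent(failed_evaluators, attempts, failures_per_attempt=2):
    """Option 1: relaxed check that tolerates retries.

    Assumption: `attempts` counts the initial run plus each retry, and every
    attempt contributes exactly `failures_per_attempt` failed evaluators.
    """
    return attempts >= 1 and failed_evaluators == failures_per_attempt * attempts


def min_safe_delay(completion_times, failing_tasks):
    """Option 2: smallest delay before the injected failure that avoids the race.

    In this model the delay must cover the gap between the earliest
    failure-injected task completing and the last task completing.
    """
    last_completion = max(completion_times.values())
    earliest_failing = min(completion_times[t] for t in failing_tasks)
    return max(0.0, last_completion - earliest_failing)


# Hypothetical completion times; task-2 and task-3 carry the injected failure.
run = {"task-0": 1.0, "task-1": 1.25, "task-2": 0.5, "task-3": 1.5}

print(failed_count_is_consistent(4, attempts=2))  # True
print(failed_count_is_consistent(5, attempts=2))  # False
print(min_safe_delay(run, ["task-2", "task-3"]))  # 1.0
```

The second function also shows the limit of the {{Sleep}} approach: a fixed delay only removes the race if it is at least as long as the spread in task completion times, which the test cannot know in advance on a slow build machine.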

> Fix TestFailMapperEvaluatorsOnDispose failures in AppVeyor
> ----------------------------------------------------------
>
>                 Key: REEF-1625
>                 URL: https://issues.apache.org/jira/browse/REEF-1625
>             Project: REEF
>          Issue Type: Sub-task
>          Components: IMRU, REEF.NET
>            Reporter: Mariia Mykhailova
>
> {noformat}
> Assert.Equal() Failure
> Expected: 2
> Actual:   4
>    at Org.Apache.REEF.Tests.Functional.IMRU.TestFailMapperEvaluatorsOnDispose.TestFailedMapperOnLocalRuntime() in C:\projects\reef\lang\cs\Org.Apache.REEF.Tests\Functional\IMRU\TestFailMapperEvaluatorsOnDispose.cs:line 66
> {noformat}


