Posted to dev@reef.apache.org by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org> on 2016/11/19 01:27:58 UTC

[jira] [Comment Edited] (REEF-1674) Random Failures in Broadcast and Reduce Fault Tolerance tests

    [ https://issues.apache.org/jira/browse/REEF-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15678249#comment-15678249 ] 

Shravan Matthur Narayanamurthy edited comment on REEF-1674 at 11/19/16 1:27 AM:
--------------------------------------------------------------------------------

I will be adding a map function whose constructor launches a background thread that calls Environment.Exit() after a random timeout. The test will accept three additional parameters apart from the generic ones:
# The failure probability,
# A minimum timeout in seconds, and
# The expected throughput in MBps.

The background thread is launched in only a fraction of the map tasks, controlled by the failure probability. On each retry attempt, a different task can be chosen to fail.
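
The selection step above can be sketched roughly as follows. This is only an illustration in Python, not the REEF.NET code: the names should_inject_failure, maybe_schedule_failure and on_failure are hypothetical, and in a real test the callback would be a hard process exit (the analogue of Environment.Exit()).

```python
import random
import threading


def should_inject_failure(failure_probability, rng):
    """Return True for roughly `failure_probability` of the calls."""
    # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
    return rng.random() < failure_probability


def maybe_schedule_failure(failure_probability, timeout_seconds, on_failure, rng=None):
    """Start a background timer that calls `on_failure` after the timeout.

    Returns the timer, or None when this task was not selected to fail.
    All names here are hypothetical and chosen for illustration only.
    """
    if rng is None:
        rng = random.Random()
    if not should_inject_failure(failure_probability, rng):
        return None
    timer = threading.Timer(timeout_seconds, on_failure)
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer
```

Because the draw happens every time a task is constructed, a retry naturally picks a fresh set of failing tasks, which matches the behaviour described above.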

The minimum timeout ensures that no failure is injected before that much time has elapsed.

The expected throughput controls the maximum timeout. It is the throughput we expect to observe per iteration of IMRU. A rough estimate is fine; the default is 1 MBps, which is quite low and leads to generous maximum timeouts. Values of 5 to 10 MBps are also good. The random timeout is picked uniformly between the minimum and maximum timeouts.
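
A minimal sketch of the timeout choice, in Python for illustration. The comment above does not say exactly how the maximum timeout is derived from the throughput; the sketch assumes it is the data size divided by the expected throughput, clamped so it never drops below the minimum. The function names and the data_size_mb parameter are hypothetical.

```python
import random


def max_timeout_seconds(data_size_mb, expected_throughput_mbps, min_timeout_seconds):
    """Estimate the maximum timeout from the expected throughput.

    Assumption (not stated in the original comment): one iteration takes
    about data_size / throughput seconds, so a lower throughput estimate
    yields a more generous maximum timeout.
    """
    if expected_throughput_mbps <= 0:
        raise ValueError("expected throughput must be positive")
    estimate = data_size_mb / expected_throughput_mbps
    # Never let the maximum fall below the minimum timeout.
    return max(estimate, min_timeout_seconds)


def pick_failure_timeout(min_timeout_seconds, max_timeout, rng=None):
    """Pick the failure time uniformly between the minimum and maximum."""
    if rng is None:
        rng = random.Random()
    return rng.uniform(min_timeout_seconds, max_timeout)
```

Under this assumption, 100 MB of data at the default 1 MBps gives a maximum timeout of 100 seconds, while an estimate of 10 MBps shrinks it to 10 seconds.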

This seems to me like a good model to simulate real failure.


was (Author: shravanmn):
I will be adding a map function that launches a thread in the background in the constructor that will call Environment.Exit() after a random timeout. This test will accept three additional parameters apart from the generic ones: 
# Failure Probability, 
# A minimum timeout in seconds & 
# The expected throughput in MBps. 

The background thread is launched only in a fraction of the map tasks, controlled by failure probability.

The minimum timeout ensures that failure does not happen before the specified timeout has elapsed.

The expected throughput is a parameter that controls the maximum timeout. This is our expected throughput we observe per iteration of IMRU. A rough estimate is fine and default is set to 1 MBps which is quite low and leads to generous max timeouts. Values of 5 to 10 are also good. The random timeout is picked uniformly between min timeout & max timeout.

This seems to me like a good model to simulate real failure.

> Random Failures in Broadcast and Reduce Fault Tolerance tests
> -------------------------------------------------------------
>
>                 Key: REEF-1674
>                 URL: https://issues.apache.org/jira/browse/REEF-1674
>             Project: REEF
>          Issue Type: Improvement
>          Components: REEF.NET IO
>    Affects Versions: 0.16
>            Reporter: Shravan Matthur Narayanamurthy
>            Assignee: Shravan Matthur Narayanamurthy
>            Priority: Minor
>             Fix For: 0.16
>
>
> The current fault tolerance tests inject simulated failures in a controlled manner and hence do not provide the right failure model for testing our fault tolerance work. It would be good to have failures injected randomly rather than only at specific points, as is done in the current code.


