Posted to dev@hama.apache.org by "Thomas Jungblut (JIRA)" <ji...@apache.org> on 2012/05/29 08:24:23 UTC

[jira] [Commented] (HAMA-505) Fault Tolerant Job Processing

    [ https://issues.apache.org/jira/browse/HAMA-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284628#comment-13284628 ] 

Thomas Jungblut commented on HAMA-505:
--------------------------------------

Suraj and I had a short meeting on fault tolerance (FT) in 0.6.0. Here are the results of our first iteration:

First focus on 0.6.0

# Checkpointing on receive side HAMA-557
## ZK stores the checkpoint files / paths of successfully completed supersteps
# When a fault happens:
## Single task recovery (when the fault happens during computation)
### The Groom detects the failure, flags the task as failed, and redirects the scheduling of a new task to the scheduler (HAMA-534). BSPTask#run takes care of correctly refilling the message queue in BSPPeerImpl and MessageManager.
## Global recovery (when the fault happens during sync or checkpointing)
### All tasks must fail and be rescheduled from the last successful superstep
# Restart the task(s) with Superstep API HAMA-533
## Improve Superstep API with HAMA-546
## Improve the Superstep API, or rather the BSP API, with the following features:
### deregister/close (empty the BSP slot)
### relieve from sync: the task keeps running but no longer syncs
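
The recovery rules above can be sketched roughly as follows. This is only an illustration of the decision (single task vs. global recovery), not Hama code: RecoverySketch, FaultPhase, CheckpointRegistry, RecoveryPlan and plan() are all hypothetical names, the registry is an in-memory stand-in for what ZK would store, and restarting a single failed task from the last checkpointed superstep is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of the recovery decision; none of these types exist in Hama. */
class RecoverySketch {

  /** Where in the superstep the fault was detected. */
  enum FaultPhase { COMPUTATION, SYNC, CHECKPOINTING }

  /** In-memory stand-in for the ZK record of successful checkpoints. */
  static class CheckpointRegistry {
    // superstep number -> checkpoint path
    private final TreeMap<Long, String> completed = new TreeMap<>();

    void markSuccessful(long superstep, String path) {
      completed.put(superstep, path);
    }

    /** Last successfully checkpointed superstep, or -1 if there is none yet. */
    long lastSuccessfulSuperstep() {
      return completed.isEmpty() ? -1L : completed.lastKey();
    }
  }

  /** Which tasks to restart and from which superstep. */
  static class RecoveryPlan {
    final List<Integer> tasksToRestart;
    final long restartSuperstep;

    RecoveryPlan(List<Integer> tasksToRestart, long restartSuperstep) {
      this.tasksToRestart = tasksToRestart;
      this.restartSuperstep = restartSuperstep;
    }
  }

  static RecoveryPlan plan(FaultPhase phase, int failedTask, int numTasks,
                           CheckpointRegistry registry) {
    List<Integer> tasks = new ArrayList<>();
    if (phase == FaultPhase.COMPUTATION) {
      // Single task recovery: only the failed task is rescheduled.
      tasks.add(failedTask);
    } else {
      // Global recovery: a fault during sync or checkpointing fails all tasks.
      for (int i = 0; i < numTasks; i++) {
        tasks.add(i);
      }
    }
    // Assumption: both modes resume from the last successful checkpoint.
    return new RecoveryPlan(tasks, registry.lastSuccessfulSuperstep());
  }

  public static void main(String[] args) {
    CheckpointRegistry registry = new CheckpointRegistry();
    registry.markSuccessful(3, "/checkpoint/job_0001/3"); // hypothetical path
    RecoveryPlan p = plan(FaultPhase.SYNC, 2, 4, registry);
    System.out.println(p.tasksToRestart + " from superstep " + p.restartSuperstep);
  }
}
```

Keeping the completed checkpoints in a sorted map makes "last successful superstep" a single lookup; in the real design that lookup would go to ZK instead.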
                
> Fault Tolerant Job Processing
> -----------------------------
>
>                 Key: HAMA-505
>                 URL: https://issues.apache.org/jira/browse/HAMA-505
>             Project: Hama
>          Issue Type: Umbrella
>            Reporter: Thomas Jungblut
>
> This umbrella summarizes all issues related to checkpointing and task restarting to achieve fault tolerance on the job level.
