
Posted to user@hama.apache.org by Praveen Sripati <pr...@gmail.com> on 2012/04/05 09:10:25 UTC

Hama Fault Tolerance

1) If a BSPJob has 10 supersteps and a task fails at superstep 5, does the
job need to be run again? Is HAMA-503 the solution? Is the state of the job
stored in HDFS between supersteps?

2) What other fault tolerance features are implemented in Hama?

3) What is checkpointing in Hama?

Praveen

Re: Hama Fault Tolerance

Posted by Praveen Sripati <pr...@gmail.com>.

Thanks for the clarification. So the messages are stored in HDFS whenever
there is a checkpoint, and in case of a failure the tasks will resume
from the last checkpointed state.

Praveen
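
The recovery behaviour summarised above can be illustrated with a small
simulation. This is only a sketch, not Hama code: the class RecoverySketch,
its run method, and the in-memory map standing in for HDFS are all
hypothetical, and it assumes a checkpoint is taken after every superstep,
which the thread does not state. A single long value stands in for the
checkpointed messages.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation; none of these names come from Hama.
class RecoverySketch {
    // Stand-in for checkpoint storage (HDFS in the thread): superstep -> saved state.
    private final Map<Integer, Long> checkpoints = new HashMap<>();

    // Counts every superstep attempt, including the one lost to the failure.
    int superstepsExecuted = 0;

    // Runs supersteps 1..total with one simulated failure during superstep failAt
    // (use 0 for no failure). If resumeFromCheckpoint is true, the task continues
    // from the last checkpoint; otherwise everything starts over from superstep 1.
    long run(int total, int failAt, boolean resumeFromCheckpoint) {
        long state = 0;
        int superstep = 1;
        boolean alreadyFailed = false;
        while (superstep <= total) {
            superstepsExecuted++;
            if (superstep == failAt && !alreadyFailed) {
                alreadyFailed = true;
                if (resumeFromCheckpoint) {
                    // In-memory state is lost; reload what the last checkpoint saved.
                    int last = superstep - 1;
                    state = checkpoints.getOrDefault(last, 0L);
                    superstep = last + 1;
                } else {
                    // No checkpoint support: the whole job is run again.
                    state = 0;
                    superstep = 1;
                }
                continue;
            }
            // Toy computation: fold the superstep number into the state.
            state += superstep;
            // Assumed policy: checkpoint after every completed superstep.
            checkpoints.put(superstep, state);
            superstep++;
        }
        return state;
    }

    public static void main(String[] args) {
        RecoverySketch resumed = new RecoverySketch();
        resumed.run(10, 5, true);
        RecoverySketch restarted = new RecoverySketch();
        restarted.run(10, 5, false);
        System.out.println(resumed.superstepsExecuted + " vs " + restarted.superstepsExecuted);
    }
}
```

With 10 supersteps and a failure during superstep 5, resuming from the
checkpoint executes 11 supersteps in total (4 completed, 1 lost attempt,
6 after recovery), while starting over executes 15. Both runs end with the
same final state.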

On Thu, Apr 5, 2012 at 5:09 PM, Suraj Menon <su...@apache.org> wrote:

> Hey Praveen,
>
> https://issues.apache.org/jira/browse/HAMA-505 is an umbrella issue for all
> the fault tolerance design and implementation issues.
> Please read the discussion thread "Recovering issues" here:
>
> http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201203.mbox/browser
>
> It has the gist of where we are headed on this issue.
>
> Fault tolerance in task execution is scheduled for 0.6. I will be updating
> the wiki with the design at some point.
>
> -Suraj

Re: Hama Fault Tolerance

Posted by Suraj Menon <su...@apache.org>.

Hey Praveen,

https://issues.apache.org/jira/browse/HAMA-505 is an umbrella issue for all
the fault tolerance design and implementation issues.
Please read the discussion thread "Recovering issues" here:
http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201203.mbox/browser
It has the gist of where we are headed on this issue.

Fault tolerance in task execution is scheduled for 0.6. I will be updating
the wiki with the design at some point.

-Suraj

On Thu, Apr 5, 2012 at 7:14 AM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> Currently, if a failure occurs, the whole job is killed.
> After HAMA-503, only the failed task will be restarted when it fails at
> superstep 5.
> Yes, the state (the messages) is stored in the sync() method.
>
> > 2) What other fault tolerance features are implemented in Hama?
>
> None yet.
>
> > 3) What is checkpointing in Hama?
>
> Writing sent messages to HDFS after a computation phase.

Re: Hama Fault Tolerance

Posted by Thomas Jungblut <th...@googlemail.com>.

Currently, if a failure occurs, the whole job is killed.
After HAMA-503, only the failed task will be restarted when it fails at
superstep 5.
Yes, the state (the messages) is stored in the sync() method.

> 2) What other fault tolerance features are implemented in Hama?

None yet.

> 3) What is checkpointing in Hama?

Writing sent messages to HDFS after a computation phase.
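
As a rough illustration of what "writing sent messages after a computation
phase" could look like, here is a minimal sketch. It does not use Hama's or
HDFS's API: the class MessageCheckpointer and its methods are hypothetical,
a local directory stands in for HDFS, and messages are assumed to be
single-line strings. The checkpoint call is meant to run where sync() would
be called, once the computation phase has produced its outgoing messages.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch; a local directory stands in for HDFS.
class MessageCheckpointer {
    private final Path dir;

    MessageCheckpointer(Path dir) {
        this.dir = dir;
    }

    // Convenience factory that checkpoints into a fresh temporary directory.
    static MessageCheckpointer inTempDir() {
        try {
            return new MessageCheckpointer(Files.createTempDirectory("ckpt"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One file per task and superstep (naming scheme is an assumption).
    Path fileFor(int taskId, long superstep) {
        return dir.resolve("task-" + taskId + "-superstep-" + superstep + ".ckpt");
    }

    // Persists the messages a task sent during the given superstep, one per line.
    void checkpoint(int taskId, long superstep, List<String> sentMessages) {
        try {
            Files.createDirectories(dir);
            Files.write(fileFor(taskId, superstep), sentMessages, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back the messages checkpointed for a task and superstep.
    List<String> restore(int taskId, long superstep) {
        try {
            return Files.readAllLines(fileFor(taskId, superstep), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MessageCheckpointer cp = MessageCheckpointer.inTempDir();
        // End of the computation phase of superstep 4 for task 0.
        cp.checkpoint(0, 4, List.of("to=1:hello", "to=2:world"));
        System.out.println(cp.restore(0, 4));
    }
}
```

A restarted task could call restore for the last completed superstep to get
back the messages that were in flight when the failure happened.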

On 5 April 2012 at 09:10, Praveen Sripati <pr...@gmail.com> wrote:

> 1) If a BSPJob has 10 supersteps and a task fails at superstep 5, does the
> job need to be run again? Is HAMA-503 the solution? Is the state of the job
> stored in HDFS between supersteps?
>
> 2) What other fault tolerance features are implemented in Hama?
>
> 3) What is checkpointing in Hama?
>
> Praveen
>



-- 
Thomas Jungblut
Berlin <th...@gmail.com>