Posted to dev@hama.apache.org by "Thomas Jungblut (JIRA)" <ji...@apache.org> on 2011/07/06 07:32:16 UTC

[jira] [Created] (HAMA-411) Support checkpoint based on HDFS

Support checkpoint based on HDFS
--------------------------------

                 Key: HAMA-411
                 URL: https://issues.apache.org/jira/browse/HAMA-411
             Project: Hama
          Issue Type: New Feature
          Components: bsp
            Reporter: Thomas Jungblut


We need to add checkpointing to Hama to deal with faults in the future.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062147#comment-13062147 ] 

Thomas Jungblut commented on HAMA-411:
--------------------------------------

I thought over the logic of overriding again.
I don't think this is going to work. Let's assume the user sets the checkpoint to true in every third superstep.
Now a task has failed (two supersteps after checkpointing), and we don't actually have the state saved to revert the calculation to where it was 3 steps ago.

Also, turning checkpointing on and off should be configurable in the Configuration, not via a method.
So scratch all that fanciness I thought of; it isn't going to work.
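
To illustrate the point about driving checkpointing from the Configuration rather than from a user method, here is a minimal sketch. It is only an assumption of how this could look: java.util.Properties stands in for the Hadoop/Hama Configuration, and the class name and both property names are hypothetical, not taken from Hama.

```java
import java.util.Properties;

/** Hypothetical sketch: this class and the property names are assumptions, not Hama's API. */
final class CheckpointSettings {
  static final String ENABLED_KEY = "bsp.checkpoint.enabled";   // assumed property name
  static final String INTERVAL_KEY = "bsp.checkpoint.interval"; // assumed property name

  final boolean enabled;
  final long interval;

  CheckpointSettings(Properties conf) {
    // Checkpointing is off unless the job configuration switches it on.
    this.enabled = Boolean.parseBoolean(conf.getProperty(ENABLED_KEY, "false"));
    this.interval = Long.parseLong(conf.getProperty(INTERVAL_KEY, "1"));
    if (this.interval <= 0) {
      throw new IllegalArgumentException("checkpoint interval must be positive");
    }
  }

  /** The framework, not user code, decides per superstep based on the configuration. */
  boolean shouldCheckpoint(long superStep) {
    return enabled && superStep % interval == 0;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(ENABLED_KEY, "true");
    conf.setProperty(INTERVAL_KEY, "3");
    CheckpointSettings settings = new CheckpointSettings(conf);
    for (long step = 0; step < 7; step++) {
      System.out.println("superstep " + step + " checkpoint=" + settings.shouldCheckpoint(step));
    }
  }
}
```

With this shape the user only sets two values on the job, and the interval is known to the framework up front, so it can tell which superstep holds the latest saved state.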


        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061092#comment-13061092 ] 

ChiaHung Lin commented on HAMA-411:
-----------------------------------

Thanks for the reminder; I had lost track of that issue.


        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060334#comment-13060334 ] 

Thomas Jungblut commented on HAMA-411:
--------------------------------------

Correct, I think we should split this into two different tasks.
This issue adds the function that allows the user to checkpoint, and the checkpointing itself.

The way the BSPMaster handles faults belongs to the other task, since we are not sure whether MR2 or your phi accrual detector should be used (see the mailing list).
But both ideas need the same checkpointing, so we can implement this now and use it later once we reach an agreement.


        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104383#comment-13104383 ] 

ChiaHung Lin commented on HAMA-411:
-----------------------------------

I think so. HAMA-398 already adds checkpointing of messages to HDFS.


        

[jira] [Issue Comment Edited] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062147#comment-13062147 ] 

Thomas Jungblut edited comment on HAMA-411 at 7/8/11 8:03 PM:
--------------------------------------------------------------

I thought again about the logic of overriding the default implementation.
I don't think this is going to work. Let's assume the user sets the checkpoint to true in every third superstep.
Now a task has failed (two supersteps after checkpointing), and we don't actually have the state saved to revert the calculation to where it was 3 steps ago.

Also, turning checkpointing on and off should be configurable in the Configuration, not via a method.
So scratch all that fanciness I thought of; it isn't going to work.

      was (Author: thomas.jungblut):
    I overthought the overriding of the logic.
I don't think this gonna work, let's assume the user is going to set the checkpoint to true in every thirds superstep. 
And now a task failed (two supersteps after checkpointing) and we don't acutally have the state safed to revert onto the calculation it was 3 steps ago.

And turning on and off the checkpointing should be configurable in the Configuration not via a method.
So scratch all that fancyness I thought of, it isn't going to work.
  

        

[jira] [Resolved] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut resolved HAMA-411.
----------------------------------

    Resolution: Won't Fix

Great. 

-> Won't fix.


        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062843#comment-13062843 ] 

ChiaHung Lin commented on HAMA-411:
-----------------------------------

With the BSP model, we can take checkpoints when the computation reaches the barrier synchronization, which forms a consistent global state. So if a user configures a checkpoint every 3 supersteps, then once a task fails the computation can roll back to a global state from a few supersteps earlier.

The drawback of such a global checkpoint is that, as the number of processes involved in the computation grows, rolling back to a consistent global state becomes an overhead.
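
As a rough illustration of the rollback described above, the sketch below computes which barrier checkpoint a failed superstep falls back to and how many supersteps have to be redone. The class, the method names, and the assumption that checkpoints are taken at supersteps 0, interval, 2 * interval, and so on are all hypothetical; nothing here is existing Hama code.

```java
/** Hypothetical helper; the interval-based policy and all names are assumptions. */
final class RollbackPlanner {
  private RollbackPlanner() {}

  /** Latest checkpointed superstep at or before the failed one. */
  static long lastCheckpoint(long failedSuperstep, long interval) {
    if (interval <= 0 || failedSuperstep < 0) {
      throw new IllegalArgumentException("interval must be > 0 and superstep >= 0");
    }
    return failedSuperstep - (failedSuperstep % interval);
  }

  /** Number of supersteps every peer has to recompute after the rollback. */
  static long superstepsToRecompute(long failedSuperstep, long interval) {
    return failedSuperstep - lastCheckpoint(failedSuperstep, interval);
  }

  public static void main(String[] args) {
    long interval = 3;
    long failed = 8;
    System.out.println("roll back to superstep " + lastCheckpoint(failed, interval));
    System.out.println("supersteps to recompute: " + superstepsToRecompute(failed, interval));
  }
}
```

With an interval of 3 and a failure in superstep 8, this gives a rollback to superstep 6 and 2 supersteps to recompute. Because the checkpoint is global, every peer pays that recomputation cost, not just the failed task, which is the overhead mentioned above.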


        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060318#comment-13060318 ] 

ChiaHung Lin commented on HAMA-411:
-----------------------------------

Is this going to be applied in the scenario of fault tolerance for checkpoint & recovery? I'm just a bit confused, as this seems to be addressed in HAMA-199 [1], where the BSPPeer would save its state so that it can be restored if the groom server crashes.

[1]. https://issues.apache.org/jira/browse/HAMA-199


        

[jira] [Issue Comment Edited] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060302#comment-13060302 ] 

Thomas Jungblut edited comment on HAMA-411 at 7/6/11 5:45 AM:
--------------------------------------------------------------

My first idea would be to use an abstract method in BSP (or we could think of adding this to our new BSPPeer):

{noformat}
abstract boolean checkpoint(long superStep);
{noformat}

Or we provide a default implementation that always returns true, which the user can override with their own logic.

So it is up to the user to decide in which supersteps checkpointing happens, i.e. checkpoint(X) = true.
This method gets called before a superstep starts.

If it returns true, we save all the messages in the queues to disk.

Additionally, we should think about a method in the BSP class that helps users save their own computation state, for example the tentative PageRank map in the PageRank example. Otherwise the user has to take care of it themselves when returning true from the method.
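
A minimal sketch of the proposal above, assuming the variant with a default implementation. CheckpointingBSP is a hypothetical stand-in for the BSP class and EveryThirdSuperstepBSP is an invented user class; only the checkpoint(long superStep) signature comes from the proposal itself.

```java
/** Hypothetical stand-in for the BSP class; not Hama's actual API. */
abstract class CheckpointingBSP {
  /**
   * Asked by the framework before each superstep starts.
   * The default checkpoints in every superstep.
   */
  boolean checkpoint(long superStep) {
    return true;
  }
}

/** Example user override: checkpoint only in supersteps 0, 3, 6, ... */
class EveryThirdSuperstepBSP extends CheckpointingBSP {
  @Override
  boolean checkpoint(long superStep) {
    return superStep % 3 == 0;
  }
}

class CheckpointPolicyDemo {
  public static void main(String[] args) {
    CheckpointingBSP bsp = new EveryThirdSuperstepBSP();
    for (long step = 0; step < 7; step++) {
      // When this is true, the framework would save the queued messages to disk.
      System.out.println("superstep " + step + " checkpoint=" + bsp.checkpoint(step));
    }
  }
}
```

Users who do not care about the policy simply do not override the method and get a checkpoint in every superstep.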


      was (Author: thomas.jungblut):
    My first idea would be to use an abstract method in BSP (Or we can think of adding this to our new BSPPeer.):

{noformat}
abstract boolean checkpoint(long superStep);
{noformat}

So the user is up to handle in which superstep he want the checkpointing / checkpoint(X) = true. 
This method gets called before a superstep starts.

If true we are going to save all the messages in the queues to disk. 

Additionally we should think of a method in BSP class which is helping the user to save his own computation- for example the tentative pagerank map in PageRank Example. Or the user has to take care of it himself when returning true in the method.

  

        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061042#comment-13061042 ] 

Thomas Jungblut commented on HAMA-411:
--------------------------------------

I am not a fan of forcing people to override methods, so I think we'll provide a default implementation that always returns true.
Since checkpointing can cause overhead, some people may not need to checkpoint in every step of the calculation.

We should consider implementing https://issues.apache.org/jira/browse/HAMA-398 first. That issue is going to put messages to disk.

But first I'll let Edward refactor the task management in https://issues.apache.org/jira/browse/HAMA-410.
Afterwards we'll see where to implement these things.


        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061028#comment-13061028 ] 

ChiaHung Lin commented on HAMA-411:
-----------------------------------

I am inclined to go with application/user-level checkpointing. But several issues may be worth checking beforehand, such as safety and the fact that programmers may be reluctant to write such code because it is nontrivial.


        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060302#comment-13060302 ] 

Thomas Jungblut commented on HAMA-411:
--------------------------------------

My first idea would be to use an abstract method in BSP (or we could think of adding this to our new BSPPeer):

{noformat}
abstract boolean checkpoint(long superStep);
{noformat}

So it is up to the user to decide in which supersteps checkpointing happens, i.e. checkpoint(X) = true.
This method gets called before a superstep starts.

If it returns true, we save all the messages in the queues to disk.

Additionally, we should think about a method in the BSP class that helps users save their own computation state, for example the tentative PageRank map in the PageRank example. Otherwise the user has to take care of it themselves when returning true from the method.
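
To make the last point concrete, here is a rough sketch of what saving and restoring user state such as a tentative PageRank map could look like. It is an assumption throughout: the class and method names are invented, and an in-memory byte array stands in for a checkpoint file on disk or HDFS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for user-level state; not part of Hama. */
final class RankStateCheckpoint {
  private RankStateCheckpoint() {}

  /** Serializes the map as an entry count followed by (vertex, rank) pairs. */
  static byte[] save(Map<String, Double> ranks) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(ranks.size());
      for (Map.Entry<String, Double> entry : ranks.entrySet()) {
        out.writeUTF(entry.getKey());
        out.writeDouble(entry.getValue());
      }
    }
    return bytes.toByteArray();
  }

  /** Reads back a map written by save(). */
  static Map<String, Double> restore(byte[] data) throws IOException {
    Map<String, Double> ranks = new HashMap<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int size = in.readInt();
      for (int i = 0; i < size; i++) {
        String vertex = in.readUTF();
        ranks.put(vertex, in.readDouble());
      }
    }
    return ranks;
  }

  public static void main(String[] args) throws IOException {
    Map<String, Double> tentative = new HashMap<>();
    tentative.put("a", 0.15);
    tentative.put("b", 0.85);
    Map<String, Double> restored = restore(save(tentative));
    System.out.println("restored equals original: " + restored.equals(tentative));
  }
}
```

A framework-provided helper along these lines would spare users from writing the serialization themselves each time checkpoint(X) returns true.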



        

[jira] [Commented] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103553#comment-13103553 ] 

Thomas Jungblut commented on HAMA-411:
--------------------------------------

@ChiaHung, we have some kind of checkpoint system, right? (HAMA-398)

It runs as a separate process on every groom, listens on a TCP socket, and currently writes to HDFS, correct?
Is anything left to do here? If not, we should close this.


        

[jira] [Issue Comment Edited] (HAMA-411) Support checkpoint based on HDFS

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062147#comment-13062147 ] 

Thomas Jungblut edited comment on HAMA-411 at 7/8/11 8:04 PM:
--------------------------------------------------------------

I thought again about the logic of overriding the default implementation.
I don't think this is going to work. Let's assume the user sets the checkpoint to true in every third superstep.
Now a task fails (two supersteps after checkpointing), and we don't actually have the state saved to revert the calculation to where it was 2 steps ago.

Also, turning checkpointing on and off should be configurable in the Configuration, not via a method.
So scratch all that fanciness I thought of; it isn't going to work.

      was (Author: thomas.jungblut):
    I overthought the logic with overriding the default implementation.
I don't think this gonna work, let's assume the user is going to set the checkpoint to true in every thirds superstep. 
And now a task failed (two supersteps after checkpointing) and we don't acutally have the state safed to revert onto the calculation it was 3 steps ago.

And turning on and off the checkpointing should be configurable in the Configuration not via a method.
So scratch all that fancyness I thought of, it isn't going to work.
  