Posted to dev@giraph.apache.org by "Maja Kabiljo (JIRA)" <ji...@apache.org> on 2012/08/13 11:50:38 UTC

[jira] [Created] (GIRAPH-298) TestAutoCheckpoint doesn't restart from checkpoint

Maja Kabiljo created GIRAPH-298:
-----------------------------------

             Summary: TestAutoCheckpoint doesn't restart from checkpoint
                 Key: GIRAPH-298
                 URL: https://issues.apache.org/jira/browse/GIRAPH-298
             Project: Giraph
          Issue Type: Bug
            Reporter: Maja Kabiljo


When we run TestAutoCheckpoint, after one worker fails, the master and all other workers also fail. They all get restarted, but they restart from the beginning rather than from the last checkpointed superstep.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-298) TestAutoCheckpoint doesn't restart from checkpoint

Posted by "Maja Kabiljo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433048#comment-13433048 ] 

Maja Kabiljo commented on GIRAPH-298:
-------------------------------------

I've been (unsuccessfully) trying to figure out how automatic restart from a checkpoint works. Please correct me where I'm wrong; this is how I understand it after experimenting with the example and reading the code:
Each worker registers its health at the beginning of a superstep. The master enters BspServiceMaster.barrierOnWorkerList, which returns false only if some worker didn't register its health, i.e. crashed before starting the superstep computation. This is the only case in which we reach SuperstepState.WORKER_FAILURE. If a worker crashes during the superstep computation, the master stays in the loop in barrierOnWorkerList and eventually crashes because of ZooKeeper. All the other workers then crash as well. Hadoop restarts them, but I don't see where we set which superstep to restart from after that.
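
The behaviour described in this comment can be modelled with a small, self-contained sketch. This is not Giraph code: the class BarrierSketch, the Outcome enum, the method signature, and the maxPolls parameter (a stand-in for whatever timeout eventually ends the wait) are all hypothetical names chosen for illustration. It only shows why a worker that never registered is detected immediately, while a worker that registered and then died leaves the master waiting until a timeout.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the barrier behaviour described above; not Giraph code. */
public class BarrierSketch {
    /** Possible outcomes of waiting on the worker list (names are assumptions). */
    enum Outcome { ALL_DONE, WORKER_FAILURE, TIMED_OUT }

    /**
     * @param expected   workers the master expects in this superstep
     * @param registered workers that registered their health at superstep start
     * @param finished   workers that reported finishing the superstep
     * @param maxPolls   stand-in for the timeout that eventually ends the wait
     */
    static Outcome barrierOnWorkerList(List<String> expected, Set<String> registered,
                                       Set<String> finished, int maxPolls) {
        // A worker that never registered its health is detected right away.
        if (!registered.containsAll(expected)) {
            return Outcome.WORKER_FAILURE;
        }
        // Otherwise the master keeps polling until every worker has finished.
        for (int poll = 0; poll < maxPolls; poll++) {
            if (finished.containsAll(expected)) {
                return Outcome.ALL_DONE;
            }
        }
        // A worker that died mid-superstep never finishes, so only the timeout ends the wait.
        return Outcome.TIMED_OUT;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("w1", "w2");
        Set<String> onlyW1 = new HashSet<>(Arrays.asList("w1"));
        Set<String> both = new HashSet<>(expected);

        // w2 crashed before registering: reported as a worker failure.
        System.out.println(barrierOnWorkerList(expected, onlyW1, onlyW1, 5));
        // w2 registered but crashed during computation: the wait only times out.
        System.out.println(barrierOnWorkerList(expected, both, onlyW1, 5));
        // Both workers registered and finished: the barrier completes.
        System.out.println(barrierOnWorkerList(expected, both, both, 5));
    }
}
```

In this model only the first case maps to the WORKER_FAILURE path mentioned in the comment; the second case corresponds to the master staying in the loop until something external ends it.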
                

[jira] [Updated] (GIRAPH-298) TestAutoCheckpoint doesn't restart from checkpoint

Posted by "Maja Kabiljo (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maja Kabiljo updated GIRAPH-298:
--------------------------------

    Priority: Minor  (was: Major)
    

[jira] [Updated] (GIRAPH-298) TestAutoCheckpoint doesn't restart from checkpoint

Posted by "Maja Kabiljo (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maja Kabiljo updated GIRAPH-298:
--------------------------------

    Attachment: GIRAPH-298.patch

This doesn't happen anymore; I'm not sure which patch fixed it.

Anyway, the test now runs for over 5 minutes. I've attached a patch that shortens the timeouts, so it now takes about half a minute.
                

[jira] [Updated] (GIRAPH-298) Reduce timeout for TestAutoCheckpoint

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-298:
-------------------------------

    Summary: Reduce timeout for TestAutoCheckpoint  (was: TestAutoCheckpoint doesn't restart from checkpoint)
    

[jira] [Commented] (GIRAPH-298) Reduce timeout for TestAutoCheckpoint

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454348#comment-13454348 ] 

Hudson commented on GIRAPH-298:
-------------------------------

Integrated in Giraph-trunk-Commit #194 (See [https://builds.apache.org/job/Giraph-trunk-Commit/194/])
    GIRAPH-298: Reduce timeout for TestAutoCheckpoint. (majakabiljo via aching) (Revision 1384115)

     Result = SUCCESS
aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1384115
Files : 
* /giraph/trunk/CHANGELOG
* /giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
* /giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperManager.java
* /giraph/trunk/src/test/java/org/apache/giraph/TestAutoCheckpoint.java

                

[jira] [Resolved] (GIRAPH-298) Reduce timeout for TestAutoCheckpoint

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching resolved GIRAPH-298.
--------------------------------

    Resolution: Fixed
      Assignee: Maja Kabiljo

Thanks Maja!  +1 and committed.
                

[jira] [Updated] (GIRAPH-298) TestAutoCheckpoint doesn't restart from checkpoint

Posted by "Maja Kabiljo (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maja Kabiljo updated GIRAPH-298:
--------------------------------

    Issue Type: Wish  (was: Bug)
    