Posted to dev@giraph.apache.org by "Nic Eggert (JIRA)" <ji...@apache.org> on 2017/03/27 20:20:41 UTC
[jira] [Created] (GIRAPH-1139) Resuming from checkpoint doesn't work
Nic Eggert created GIRAPH-1139:
----------------------------------
Summary: Resuming from checkpoint doesn't work
Key: GIRAPH-1139
URL: https://issues.apache.org/jira/browse/GIRAPH-1139
Project: Giraph
Issue Type: Bug
Components: bsp
Affects Versions: 1.2.0
Reporter: Nic Eggert
I ran into a couple of issues when trying to get Giraph to resume from checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
* If we just wrote a checkpoint, the master expects the workers to checkpoint again, while the workers (correctly) clear the checkpointing flag.
* When workers restart, they take their task id from the partition number, which stays the same across multiple attempts. This gets transferred to the Netty clientId, and the server starts ignoring messages from restarted workers because it thinks it processed them already.
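
To illustrate the second point, here is a minimal, self-contained sketch of how this kind of duplicate suppression can go wrong. It is not Giraph's actual code: the class and method names are hypothetical, the server is reduced to a set of (clientId, requestId) pairs, and it assumes a restarted worker begins numbering its requests from 0 again. The attempt-aware id at the end is just one possible remedy, not necessarily what the PR does.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the duplicate-request problem; not Giraph's real classes. */
class DuplicateRequestSketch {

  /** Stand-in for the server side: remembers which (clientId, requestId) pairs it has seen. */
  static final class Server {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the request is processed, false if it is dropped as a duplicate. */
    boolean receive(int clientId, long requestId) {
      return seen.add(clientId + ":" + requestId);
    }
  }

  /** Client id derived only from the partition number: identical across task attempts. */
  static int clientIdFromPartition(int partition) {
    return partition;
  }

  /**
   * One possible remedy (an assumption for illustration): fold the attempt
   * number into the id so that every attempt gets a distinct client id.
   */
  static int clientIdFromPartitionAndAttempt(int partition, int attempt, int maxAttempts) {
    return partition * maxAttempts + attempt;
  }

  public static void main(String[] args) {
    Server server = new Server();

    // First attempt of the worker for partition 3 sends request 0: processed.
    System.out.println(server.receive(clientIdFromPartition(3), 0L));

    // The restarted worker reuses client id 3 and (by assumption) restarts its
    // request ids at 0, so the server treats the request as already processed.
    System.out.println(server.receive(clientIdFromPartition(3), 0L));

    // With an attempt-aware client id, the same request id is accepted.
    System.out.println(server.receive(clientIdFromPartitionAndAttempt(3, 1, 4), 0L));
  }
}
```

In this sketch the second call is rejected even though it comes from a fresh attempt, which mirrors the symptom described above: the server ignores messages from restarted workers.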
I believe I've fixed these issues. I'll send a GitHub PR shortly.