Posted to dev@giraph.apache.org by "Maja Kabiljo (JIRA)" <ji...@apache.org> on 2012/08/15 16:28:38 UTC

[jira] [Updated] (GIRAPH-293) Should aggregators be checkpointed?

     [ https://issues.apache.org/jira/browse/GIRAPH-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maja Kabiljo updated GIRAPH-293:
--------------------------------

    Attachment: GIRAPH-293.patch

This patch makes aggregators work correctly with checkpointing by saving each aggregator's name, class, value, and whether it is persistent. I also moved the aggregator-handling code out of BspServiceWorker and BspServiceMaster into separate classes, since I think it's cleaner this way and those two classes already do too many different things. That is why the patch looks big. Later, with GIRAPH-273, the AggregatorHandler classes should become more independent of BspServices.
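
To illustrate the idea of saving the name, class, value and persistent flag per aggregator, here is a minimal sketch. It is not the code from the patch: the class AggregatorCheckpointSketch, the SavedAggregator holder, the "example.LongSumAggregator" class name and the use of a plain long as the value are all assumptions made so the example runs on the JDK alone (real aggregator values would be Writables).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only; these are not the classes from GIRAPH-293.patch.
final class AggregatorCheckpointSketch {

  // One saved aggregator. Assumption: the value is a plain long here.
  static final class SavedAggregator {
    final String className;
    final long value;
    final boolean persistent;

    SavedAggregator(String className, long value, boolean persistent) {
      this.className = className;
      this.value = value;
      this.persistent = persistent;
    }
  }

  // Writes a count, then name, class, value and persistent flag per aggregator.
  static byte[] write(Map<String, SavedAggregator> aggregators) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeInt(aggregators.size());
      for (Map.Entry<String, SavedAggregator> e : aggregators.entrySet()) {
        out.writeUTF(e.getKey());
        out.writeUTF(e.getValue().className);
        out.writeLong(e.getValue().value);
        out.writeBoolean(e.getValue().persistent);
      }
    }
    return buffer.toByteArray();
  }

  // Reads the fields back in exactly the order write() produced them.
  static Map<String, SavedAggregator> read(byte[] data) throws IOException {
    Map<String, SavedAggregator> aggregators = new LinkedHashMap<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int count = in.readInt();
      for (int i = 0; i < count; i++) {
        String name = in.readUTF();
        String className = in.readUTF();
        long value = in.readLong();
        boolean persistent = in.readBoolean();
        aggregators.put(name, new SavedAggregator(className, value, persistent));
      }
    }
    return aggregators;
  }

  public static void main(String[] args) throws IOException {
    Map<String, SavedAggregator> before = new LinkedHashMap<>();
    before.put("sum", new SavedAggregator("example.LongSumAggregator", 42L, true));
    Map<String, SavedAggregator> after = read(write(before));
    SavedAggregator sum = after.get("sum");
    System.out.println(sum.className + " " + sum.value + " " + sum.persistent);
  }
}
```

Storing the class name next to the value is what would let a restarted job re-create the right aggregator type before restoring its value, and the persistent flag tells it whether the value should survive across supersteps.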

I added tests for aggregator serialization and for manually restarting from a checkpoint (the latter also relies on the recent GIRAPH-296 and GIRAPH-298 changes working). The patch passes mvn verify and the tests in pseudo-distributed mode.
                
> Should aggregators be checkpointed?
> -----------------------------------
>
>                 Key: GIRAPH-293
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-293
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Alessandro Presta
>            Assignee: Maja Kabiljo
>         Attachments: GIRAPH-293.patch
>
>
> As I understand it, we don't include aggregators in checkpoints because they are kept in ZooKeeper.
> One of our bootcampers is working on fixing TestManualCheckpoint, which currently involves starting a new job from a checkpoint from a previous job*.
> If this is functionality we want going forward, then persistent aggregators should be checkpointed.
> [*] That test relies on aggregators either being checkpointed or always being reset at each superstep. Neither is happening, but the error is masked by the fact that we are not actually resuming from a checkpoint, but re-running the job from scratch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira