Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2018/05/13 01:00:00 UTC

[jira] [Commented] (SYSTEMML-2087) Initial version of distributed spark backend

    [ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473322#comment-16473322 ] 

Matthias Boehm commented on SYSTEMML-2087:
------------------------------------------

Once we come closer to this task, it would be good to flesh out the details in terms of subtasks. For example, we need to decide (1) how to distribute the data (for the different distribution schemes) to the individual workers, (2) how to implement the parameter exchange, and (3) how to handle task failures and preemption. Regarding the latter, I would recommend starting simple: once a worker is brought up, it pulls the current state of the model, and checkpointing is done in a centralized manner.
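To make the "start simple" proposal for point (3) concrete, here is a minimal, single-process Python sketch. It is not SystemML code: the class names ParameterServer and Worker, the JSON checkpoint file, and the plain gradient-descent update are all assumptions chosen for illustration. It only shows the two ideas from the comment: a worker pulls the current model state whenever it is brought up, and checkpointing happens in one place, at the server.

```python
import json
import os


class ParameterServer:
    """Hypothetical in-process stand-in for a remote parameter server."""

    def __init__(self, model, checkpoint_path):
        self.model = dict(model)          # parameter name -> float value
        self.checkpoint_path = checkpoint_path
        self.step = 0                     # number of updates applied so far

    def pull(self):
        # Hand out a copy so workers cannot mutate the server state directly.
        return self.step, dict(self.model)

    def push(self, gradients, lr=0.1):
        # Assumed update rule: plain gradient descent on each parameter.
        for name, grad in gradients.items():
            self.model[name] -= lr * grad
        self.step += 1

    def checkpoint(self):
        # Centralized checkpoint: only the server persists state.
        # Write to a temporary file first, then rename it into place.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": self.step, "model": self.model}, f)
        os.replace(tmp_path, self.checkpoint_path)

    @classmethod
    def restore(cls, checkpoint_path):
        # Rebuild the server from the last checkpoint after a server failure.
        with open(checkpoint_path) as f:
            state = json.load(f)
        server = cls(state["model"], checkpoint_path)
        server.step = state["step"]
        return server


class Worker:
    """Hypothetical worker that keeps no durable state of its own."""

    def __init__(self, server):
        self.server = server
        # On start-up (first launch, or relaunch after failure or preemption)
        # the worker simply pulls the current model from the server.
        self.step, self.model = server.pull()

    def train_step(self, gradients):
        # Send locally computed gradients, then refresh the local model copy.
        self.server.push(gradients)
        self.step, self.model = self.server.pull()
```

Under these assumptions a replacement worker needs no recovery logic of its own: constructing a new Worker against the same server is the whole restart path, and only the server ever has to be restored from disk.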

> Initial version of distributed spark backend
> --------------------------------------------
>
>                 Key: SYSTEMML-2087
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> This part aims to implement BSP (bulk synchronous parallel) for the distributed Spark backend. The idea is to be able to launch a remote parameter server and the workers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)