Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2018/02/13 06:24:02 UTC

[jira] [Commented] (SYSTEMML-2083) Language and runtime for parameter servers

    [ https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361878#comment-16361878 ] 

Matthias Boehm commented on SYSTEMML-2083:
------------------------------------------

Great - thanks for your interest [~mpgovinda]. Below I try to give you a couple of pointers and a more concrete idea of the project. Additional details regarding the individual sub tasks will follow later and likely evolve over time. Note that we're happy to help out and guide you where needed. 

SystemML is an existing system for a broad range of machine learning algorithms. Users write their algorithms in an R- or Python-like syntax with abstract data types for scalars, matrices, and frames, as well as operations such as linear algebra, element-wise operations, aggregations, indexing, and statistical functions. SystemML then automatically compiles these scripts into hybrid runtime plans of single-node and distributed operations on MapReduce or Spark according to data and cluster characteristics. For more details, please refer to our website (https://systemml.apache.org/) as well as our "SystemML on Spark" paper (http://www.vldb.org/pvldb/vol9/p1425-boehm.pdf).
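For a first impression of the language, here is a minimal sketch of a DML script that solves a regularized least-squares problem in closed form (the file paths are placeholders):
{code}
# read training features and labels (placeholder paths)
X = read("data/X.csv");
y = read("data/y.csv");

# closed-form ridge regression: beta = (X'X + lambda*I)^-1 X'y
lambda = 0.001;
A = t(X) %*% X + lambda * diag(matrix(1, rows=ncol(X), cols=1));
b = t(X) %*% y;
beta = solve(A, b);

write(beta, "data/beta.csv");
{code}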

In the past, we primarily focused on data- and task-parallel execution strategies (as described above), but in recent years we also added support for deep learning, including an nn script library, various builtin functions for specific layers, as well as native and GPU operations. 

This epic aims to extend these capabilities with execution strategies for parameter servers. We want to build alternative runtime backends as a foundation, which would already enable users to easily select their preferred strategy for local or distributed execution. Later (not part of this project), we would like to further extend this to the automatic selection of these strategies.

Specifically, this project aims to introduce a new builtin function {{paramserv}} that can be called at script level.
{code}
[model'] = paramserv(model, X, y, X_val, y_val, fun1,
   mode=ASYNC, freq=EPOCH, agg=..., epochs=100, batchsize=64, k=7, checkpointing=...)
{code}
where we pass an existing (e.g., for transfer learning) or otherwise initialized {{model}}, the training feature and label matrices {{X}} and {{y}}, the validation features and labels {{X_val}} and {{y_val}}, a batch update function specified in SystemML's R- or Python-like language, an update strategy {{mode}} along with its frequency {{freq}} (e.g., per batch or epoch), an aggregation function {{agg}}, the number of {{epochs}}, the {{batchsize}}, the degree of parallelism {{k}}, and a checkpointing strategy. 
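To make this more concrete, the batch update and aggregation functions might look as follows. Note that these signatures are a hypothetical sketch; their exact form is part of the design work in this epic:
{code}
# hypothetical batch update function: gradient of squared loss for a linear model
gradients = function(matrix[double] model, matrix[double] X_batch, matrix[double] y_batch)
  return (matrix[double] grads)
{
  grads = t(X_batch) %*% (X_batch %*% model - y_batch);
}

# hypothetical aggregation function: simple SGD step with learning rate lr
aggregation = function(matrix[double] model, matrix[double] grads, double lr)
  return (matrix[double] model)
{
  model = model - lr * grads;
}
{code}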

The core of the project then deals with implementing the runtime for this builtin function in Java, for both local, multi-threaded execution and distributed execution on top of Spark. The advantage of building the distributed parameter servers on top of the data-parallel Spark framework is a seamless integration with the rest of SystemML (e.g., where the input feature matrix {{X}} can be a large RDD). Since the update and aggregation functions are expressed in SystemML's language, we can simply reuse the existing runtime (control flow, instructions, and matrix operations) and concentrate on building alternative parameter update mechanisms.


> Language and runtime for parameter servers
> ------------------------------------------
>
>                 Key: SYSTEMML-2083
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2083
>             Project: SystemML
>          Issue Type: Epic
>            Reporter: Matthias Boehm
>            Priority: Major
>              Labels: gsoc2018
>
> SystemML already provides a rich set of execution strategies ranging from local operations to large-scale computation on MapReduce or Spark. In this context, we support both data-parallel (multi-threaded or distributed operations) as well as task-parallel computation (multi-threaded or distributed parfor loops), as shown in the sketch below. This epic aims to complement the existing execution strategies with language and runtime primitives for parameter servers, i.e., model-parallel execution. We use the terminology of model-parallel execution with distributed data and distributed model to differentiate it from the existing data-parallel operations. Target applications are distributed deep learning and mini-batch algorithms in general. These new abstractions will help make SystemML a unified framework for small- and large-scale machine learning that supports all three major execution strategies in a single framework.
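> A minimal sketch of an existing task-parallel parfor loop in DML, for contrast with the new model-parallel primitives:
> {code}
> # compute per-column standard deviations; parfor iterations
> # run multi-threaded or distributed depending on the plan
> R = matrix(0, rows=1, cols=ncol(X));
> parfor(j in 1:ncol(X)) {
>   R[1, j] = sd(X[, j]);
> }
> {code}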
>  
> A major challenge is the integration of stateful parameter servers and their common push/pull primitives into an otherwise functional (and thus, stateless) language. We will approach this challenge via a new builtin function {{paramserv}}, which internally maintains state but at the same time fits into the runtime framework of stateless operations.
> Furthermore, we are interested in providing (1) different runtime backends (local and distributed), (2) different parameter server modes (synchronous, asynchronous, Hogwild!, stale-synchronous), (3) different update frequencies (batch, multi-batch, epoch), as well as (4) different architectures for distributed data (1 parameter server, k workers) and distributed model (k1 parameter servers, k2 workers). 


