Posted to commits@singa.apache.org by wa...@apache.org on 2015/05/21 09:34:44 UTC

svn commit: r1680751 - in /incubator/singa/site/trunk/content: markdown/docs.md markdown/docs/code-structure.md markdown/introduction.md resources/images/arch.png resources/images/software_stack.jpg site.xml

Author: wangsh
Date: Thu May 21 07:34:44 2015
New Revision: 1680751

URL: http://svn.apache.org/r1680751
Log:
update introduction page

Added:
    incubator/singa/site/trunk/content/resources/images/arch.png   (with props)
Removed:
    incubator/singa/site/trunk/content/resources/images/software_stack.jpg
Modified:
    incubator/singa/site/trunk/content/markdown/docs.md
    incubator/singa/site/trunk/content/markdown/docs/code-structure.md
    incubator/singa/site/trunk/content/markdown/introduction.md
    incubator/singa/site/trunk/content/site.xml

Modified: incubator/singa/site/trunk/content/markdown/docs.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs.md?rev=1680751&r1=1680750&r2=1680751&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs.md Thu May 21 07:34:44 2015
@@ -5,6 +5,5 @@ ___
 * [Installation](docs/installation.html)
 * [System Architecture](docs/architecture.html)
 * [Communication](docs/communication.html)
-* [Code Structure](docs/code-structure.html)
 * [Neural Network Partition](docs/neuralnet-partition.html)
 * [Programming Model](docs/programming-model.html)

Modified: incubator/singa/site/trunk/content/markdown/docs/code-structure.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/code-structure.md?rev=1680751&r1=1680750&r2=1680751&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/code-structure.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/code-structure.md Thu May 21 07:34:44 2015
@@ -2,6 +2,8 @@
 
 ___
 
+<!--
+
 ### Worker Side
 
 #### Main Classes
@@ -70,3 +72,5 @@ table server. The control flow for other
 the server side, there are at least 3 threads running at any time: two by
 NetworkService for sending and receiving message, and at least one by the
 RequestDispatcher for dispatching requests.
+
+-->

Modified: incubator/singa/site/trunk/content/markdown/introduction.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/introduction.md?rev=1680751&r1=1680750&r2=1680751&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/introduction.md (original)
+++ incubator/singa/site/trunk/content/markdown/introduction.md Thu May 21 07:34:44 2015
@@ -2,94 +2,77 @@
 
 ___
 
-SINGA is a distributed deep learning platform, for training large-scale deep
-learning models. Our design is driven by two key observations. First, the
-structures and training algorithms of deep learning models can be expressed
-using simple abstractions, e.g., the layer. SINGA allows users
-to write their own training algorithms by exposing intuitive programming abstractions
-and hiding complex details pertaining distributed execution of the training.
-Specifically, our programming model consists of data objects (layer and network)
-that define the model, and of computation functions over the data objects. Our
-second observation is that there are multiple approaches to partitioning the model
-and the training data onto multiple machines to achieve model parallelism, data
-parallelism or both. Each approach incurs different communication and synchronization
-overhead which directly affects the system’s scalability. We analyze the fundamental
-trade-offs of existing parallelism approaches, and propose an optimization algorithm
-that generates the parallelism scheme with minimal overhead.
+### Overview
+
+SINGA is designed to be general enough to implement the distributed training algorithms of existing systems.
+Distributed deep learning training is an ongoing research challenge in terms of scalability.
+There is no established scalable distributed training algorithm. Different algorithms are used by
+existing systems, e.g., Hogwild used by Caffe, AllReduce used by Baidu’s DeepImage, and the Downpour
+algorithm proposed by Google Brain and used by Microsoft’s Adam. SINGA gives users the chance to
+select the algorithm that is most scalable for their model and data.
+
+To provide good usability, SINGA offers a simple programming model based on the layer structure
+that is common in deep learning models. Users override the base layer class to implement their own
+layer logic for feature transformation. A model is constructed by configuring each layer and its
+connections, as in Caffe. SINGA takes care of data and model partitioning, and makes the underlying
+distributed communication (almost) transparent to users. A set of built-in layers and example models
+is provided.
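+
+To make the layer abstraction concrete, the following is a minimal sketch of a user-defined layer.
+It assumes a base `Layer` class with setup, feature and gradient hooks; the class and method names
+(`Setup`, `ComputeFeature`, `ComputeGradient`) are illustrative and may differ from the actual SINGA API.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical base class for this sketch; the real SINGA Layer interface may differ.
+class Layer {
+ public:
+  virtual ~Layer() {}
+  // Allocate internal state given the source layers.
+  virtual void Setup(const std::vector<Layer*>& srclayers) = 0;
+  // Transform features of the source layers into this layer's feature vector.
+  virtual void ComputeFeature(const std::vector<Layer*>& srclayers) = 0;
+  // Back-propagate gradients to the source layers.
+  virtual void ComputeGradient(const std::vector<Layer*>& srclayers) = 0;
+
+  std::vector<float> data;  // feature values
+  std::vector<float> grad;  // gradients w.r.t. the feature values
+};
+
+// A user-defined layer: element-wise ReLU over a single source layer.
+class ReLULayer : public Layer {
+ public:
+  void Setup(const std::vector<Layer*>& srclayers) override {
+    data.resize(srclayers[0]->data.size());
+    grad.resize(srclayers[0]->data.size());
+  }
+  void ComputeFeature(const std::vector<Layer*>& srclayers) override {
+    const std::vector<float>& src = srclayers[0]->data;
+    for (std::size_t i = 0; i < src.size(); ++i)
+      data[i] = src[i] > 0 ? src[i] : 0.0f;
+  }
+  void ComputeGradient(const std::vector<Layer*>& srclayers) override {
+    const std::vector<float>& src = srclayers[0]->data;
+    for (std::size_t i = 0; i < src.size(); ++i)
+      srclayers[0]->grad[i] = src[i] > 0 ? grad[i] : 0.0f;
+  }
+};
+```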
+
+SINGA is an [Apache incubator project](http://singa.incubator.apache.org/), released under the Apache
+License 2.0. It is mainly developed by the DBSystem group at the National University of Singapore.
+A diverse community is being built to welcome open-source contributions.
 
 ### Goals and Principles
 
 #### Goals
-* Scalability: A distributed platform that can scale to a large model and training
-    dataset, e.g., 1 Billion parameters and 10M images.
-* Usability: To provide abstraction and easy to use interface so that users can
-    implement their deep learning model/algorithm without much awareness of the
-    underlying distributed platform.
-* Extensibility: We try to make SINGA extensible for implementing different consistency
-    models, training algorithms and deep learning models.
+* Scalability: a distributed platform that can scale to large models and training datasets.
+* Usability: to provide abstractions and an easy-to-use interface
+	so that users can implement their deep learning models/algorithms
+	without much awareness of the underlying distributed platform.
+* Extensibility: to make SINGA extensible for implementing different consistency models,
+	training algorithms and deep learning models.
 
 #### Principles
-To achieve the scalability goal, we parallelize the computation across a cluster
-of nodes by the following partitioning approaches:
-
-* Model Partition---one model replica spreads across multiple machines to handle large
-    models, which have too many parameters to be kept in the memory of a single machine.
-    Overhead: synchronize layer data across machines within one model replica Partition.
+Scalability is a challenging research problem for distributed deep learning training.
+SINGA provides a general architecture to exploit the scalability of different training algorithms. 
+Different parallelism approaches are also supported:
+
+* Model Partition---one model replica spreads across multiple machines to handle large models,
+	which have too many parameters to be kept in the memory of a single machine. Overhead:
+	synchronize layer data across machines within one model replica partition.
 * Data Partition---one model replica trains against a partition of the whole training dataset.
-    This approach can handle large training dataset.
-    Overhead: synchronize parameters among model replicas.
-* Hybrid Partition---exploit a cost model to find optimal model and data partitions which
-    would reduce both overheads.
+	This approach can handle large training datasets.
+	Overhead: synchronize parameters among model replicas.
+* Hybrid Partition---exploit a cost model to find optimal model and data partitions
+	that reduce both overheads (a toy sketch of such a cost model follows this list).
 
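+The cost-model idea behind hybrid partitioning can be illustrated with a toy calculation. The
+variables, formulas and numbers below are illustrative assumptions only, not SINGA's actual cost
+model; the sketch merely shows how candidate plans can be compared by their estimated communication
+overheads.
+
+```cpp
+#include <cstdio>
+
+// Toy cost model for choosing a partitioning plan on a fixed number of machines.
+// All quantities and formulas here are illustrative assumptions.
+struct Plan {
+  int model_replicas;        // degree of data parallelism
+  int machines_per_replica;  // degree of model parallelism
+};
+
+// Estimated per-iteration communication volume (number of values transferred).
+double CommCost(const Plan& p, double num_params, double boundary_ratio) {
+  // Data-parallel overhead: every model replica exchanges all parameters
+  // with the parameter servers once per iteration.
+  double data_cost = p.model_replicas * num_params;
+  // Model-parallel overhead: machines inside one replica exchange the layer
+  // data lying on partition boundaries (assumed to be a fraction of all parameters).
+  double model_cost = p.model_replicas * (p.machines_per_replica - 1)
+                      * boundary_ratio * num_params;
+  return data_cost + model_cost;
+}
+
+int main() {
+  const double num_params = 1e8;       // 100M parameters
+  const double boundary_ratio = 0.05;  // assumed relative size of boundary layer data
+  // Compare pure data, pure model and one hybrid plan on 8 machines.
+  const Plan plans[] = {{8, 1}, {1, 8}, {4, 2}};
+  for (const Plan& p : plans)
+    std::printf("replicas=%d, machines/replica=%d, estimated cost=%.2e\n",
+                p.model_replicas, p.machines_per_replica,
+                CommCost(p, num_params, boundary_ratio));
+  return 0;
+}
+```
+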
-To achieve the usability goal, we propose our programming model with the following
+To achieve the usability goal, we propose our programming model with the following 
 two major considerations:
 
-* Extract common data structures and operations for deep learning training algorithms, i.e.,
-    Back Propagation and Contrastive Divergence. Users implement their models by
-    inheriting these data structures and overriding the operations.
-* Manage model partition and data partition automatically through distributed array.
-    Users write code against the distributed array, without much awareness of the array partition
-     (which part is stored on which machine).
-
-Considering extensibility, we make our core data structures (e.g., Layer) and operations general enough
-for programmers to override.
-
-### System Overview
-
-<img src="images/software_stack.jpg" alt="SINGA software stack" style="width: 700px"/>
-
-Three goals are considered in designing SINGA, namely ease of use, scalability and extensibility.
-We will introduce them together with the software stack as shown in the above figure.
-Algorithms for deep learning models are complex to code and hard to train. To make
-it ease of use, we provide a simple concept ‘Layer’ to construct deep complex models.
-Built-in Layer implementations include common layers, e.g., convolution layer
-and fully connected layer. Users can configure their models by combining these
-built-in layers through web interface or configuration files. Once the model and
-training data is configured, we start SINGA to conduct the training using the
-standard training algorithm (Back-Propagation,BP or Contrastive Divergence, CD)
-on a cluster of nodes and visualize the training performance to users (e.g.,
-through web interface). Advanced users can also implement their own layers by
-overloading the base Layer class through Python, Matlab, etc wrappers. DistributedArray
-is proposed for easy array operations that are heavily used for realizing layer
-logics. SINGA manages the distributed arrays (stored across multiple nodes)
-automatically and efficiently based on MPI. Training scalability is achieved by
-partitioning the training data and model onto multiple computing nodes and parallelizing
-the computation. A logically centralized parameter server maintains the model
-parameters in a ParameterTable. Computing nodes work according to the consistency
-policy and send information to the parameter server which updates the parameters
-based on SGD (stochastic gradient descent) algorithms. Besides the Layer class,
-other components like SGD algorithms and consistency module are also extensible.
-<!---
-The above figure shows the basic components of SINGA. It starts training a deep
-learning model by parsing a model configuration, which specifies the layer and
-network structure at the every worker node. After that, it initializes the table servers and starts
-workers to run their tasks. Each table server maintains a partition (i.e., a set
-of rows) of a distributed parameter table where model parameters are stored.
-Worker groups consisting one or more worker nodes run in parallel to compute the
-gradients of parameters. In one iteration, every group fetches fresh parameters
-from the table servers, runs BP or CD algorithm to compute gradients against a
-mini-batch from the local data shard (a partition of the training dataset), and
-then sends gradients to the table servers. The data shard is created by loading
-training data from HDFS off-line. The master monitors the training progress and
-stops the workers and table servers once the model has converged to a given loss.
--->
+* Extract common data structures and operations for deep learning training algorithms, i.e., 
+	Back Propagation and Contrastive Divergence. Users implement their models by inheriting 
+	these data structures and overriding the operations.
+* Handle model partitioning and data partitioning automatically, making them (almost) transparent to users.
+
+Considering extensibility, we make our core data structures (e.g., Layer) and operations
+general enough for programmers to override.
+
+### System Architecture
+
+<img src="images/arch.png" alt="SINGA Logical Architecture" style="width: 500px"/>
+<p><strong>SINGA Logical Architecture</strong></p>
+
+The logical system architecture is shown in the above figure. There are two types of execution units,
+namely workers and servers. They are grouped according to the cluster configuration. Each worker 
+group runs against a partition of the training dataset to compute the updates (e.g., the gradients) 
+of parameters on one model replica, denoted as ParamShard. Worker groups run asynchronously, while 
+workers within one group run synchronously with each worker computing (partial) updates for a subset 
+of model parameters. Each server group also maintains one replica of the model parameters 
+(i.e., ParamShard). It receives and handles requests (e.g., Get/Put/Update) from workers. Every server 
+group synchronizes with neighboring server groups periodically or according to some specified rules.
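+
+The server-side handling described above can be sketched as follows. The message layout and method
+names here are assumptions made for illustration; the actual SINGA messages and server classes differ.
+
+```cpp
+#include <cstddef>
+#include <unordered_map>
+#include <vector>
+
+// Illustrative request/reply type; the real SINGA message format differs.
+enum class MsgType { kPut, kGet, kUpdate };
+
+struct Msg {
+  MsgType type;
+  int param_id;
+  std::vector<float> values;  // parameter values or gradients
+};
+
+// One server maintains a partition of the model parameters (its ParamShard).
+class Server {
+ public:
+  // Handle one request from a worker; the reply is only meaningful for Get.
+  Msg Handle(const Msg& req) {
+    switch (req.type) {
+      case MsgType::kPut:  // the first worker group initializes parameters
+        shard_[req.param_id] = req.values;
+        break;
+      case MsgType::kGet:  // a worker fetches fresh parameter values
+        return {MsgType::kGet, req.param_id, shard_[req.param_id]};
+      case MsgType::kUpdate: {  // apply SGD using the received gradients
+        std::vector<float>& param = shard_[req.param_id];
+        for (std::size_t i = 0; i < req.values.size(); ++i)
+          param[i] -= learning_rate_ * req.values[i];
+        break;
+      }
+    }
+    return {MsgType::kPut, req.param_id, {}};  // empty acknowledgement
+  }
+
+ private:
+  std::unordered_map<int, std::vector<float>> shard_;  // this server's ParamShard partition
+  float learning_rate_ = 0.01f;  // fixed learning rate, for simplicity
+};
+```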
+
+SINGA starts by parsing the cluster and model configurations. The first worker group initializes model 
+parameters and sends Put requests to put them into the ParamShards of servers. Then every worker group 
+runs the training algorithm by iterating over its training data in mini-batches. Each worker collects
+fresh parameters from the servers before computing the updates (e.g., gradients) for them. Once it finishes
+the computation, it issues Update requests to the servers.
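+
+The per-worker control flow can be summarized by the sketch below. The interface is again an
+assumption for illustration only, not the actual SINGA worker or communication API.
+
+```cpp
+#include <vector>
+
+// Illustrative stand-ins for SINGA components; the real interfaces differ.
+struct Param {
+  int id;
+  std::vector<float> values;  // current parameter values
+  std::vector<float> grad;    // gradients computed by this worker
+};
+
+// Minimal view of the servers that a worker needs for the flow above.
+class ParamServerClient {
+ public:
+  virtual ~ParamServerClient() {}
+  virtual void Put(const std::vector<Param>& params) = 0;     // initialize parameters
+  virtual void Get(std::vector<Param>* params) = 0;           // fetch fresh values
+  virtual void Update(const std::vector<Param>& params) = 0;  // push gradients
+};
+
+class Worker {
+ public:
+  explicit Worker(ParamServerClient* client) : client_(client) {}
+
+  // Run `iters` training iterations over the local data shard.
+  void Train(std::vector<Param>* params, int iters) {
+    for (int i = 0; i < iters; ++i) {
+      client_->Get(params);      // collect fresh parameters from the servers
+      ComputeGradients(params);  // run BP/CD on one mini-batch (stubbed below)
+      client_->Update(*params);  // issue Update requests with the gradients
+    }
+  }
+
+ private:
+  // Placeholder: a real worker reads a mini-batch from its local data shard and
+  // runs Back Propagation or Contrastive Divergence to fill Param::grad.
+  void ComputeGradients(std::vector<Param>* params) {
+    for (Param& p : *params) p.grad.assign(p.values.size(), 0.0f);
+  }
+
+  ParamServerClient* client_;
+};
+```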

Added: incubator/singa/site/trunk/content/resources/images/arch.png
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/resources/images/arch.png?rev=1680751&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/singa/site/trunk/content/resources/images/arch.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: incubator/singa/site/trunk/content/site.xml
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/site.xml?rev=1680751&r1=1680750&r2=1680751&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/site.xml (original)
+++ incubator/singa/site/trunk/content/site.xml Thu May 21 07:34:44 2015
@@ -57,7 +57,6 @@
       <item name="Installation" href="docs/installation.html"/>
       <item name="System Architecture" href="docs/architecture.html"/>
       <item name="Communication" href="docs/communication.html"/>
-      <item name="Code Structure" href="docs/code-structure.html"/>
       <item name="Neural Network Partition" href="docs/neuralnet-partition.html"/>
       <item name="Programming Model" href="docs/programming-model.html"/>
     </menu>