Posted to commits@singa.apache.org by wa...@apache.org on 2015/05/07 15:08:13 UTC

[49/57] [partial] incubator-singa git commit: create github pages

http://git-wip-us.apache.org/repos/asf/incubator-singa/blob/666a841d/_posts/docs/2015-02-10-communication.md
----------------------------------------------------------------------
diff --git a/_posts/docs/2015-02-10-communication.md b/_posts/docs/2015-02-10-communication.md
new file mode 100644
index 0000000..9701442
--- /dev/null
+++ b/_posts/docs/2015-02-10-communication.md
@@ -0,0 +1,354 @@
+---
+layout: post
+title: Communication
+category : docs
+tagline:
+tags : [development, documentation, communication]
+---
+{% include JB/setup %}
+
+Different messaging libraries have different benefits and drawbacks. For instance,
+MPI provides fast message passing between GPUs (using GPUDirect), but does not
+support fault tolerance well. On the contrary, systems using ZeroMQ can be
+fault-tolerant, but ZeroMQ does not support GPUDirect. ZeroMQ also lacks MPI's
+AllReduce function, which is efficient for data aggregation in
+distributed training. In Singa, we provide general messaging APIs for
+communication between threads within a process and across processes, and let
+users choose the underlying implementation (MPI or ZeroMQ) that meets their requirements.
+
+Singa's messaging library consists of two components, namely the message and
+the socket used to send and receive messages. **Socket** here refers to a
+Singa-defined data structure instead of the Linux socket.
+We will introduce the two components in detail, using the following figure as an
+example architecture.
+
+<figure>
+<img src="{{ BASE_PATH }}/assets/image/arch/arch2.png" align="center" width="550px"/>
+<img src="{{ BASE_PATH }}/assets/image/arch/comm.png" align="center" width="550px"/>
+<figcaption><strong> Fig.1 - Example physical architecture and network connection</strong></figcaption>
+</figure>
+
+Fig.1 shows an example physical architecture and its network connection.
+[Section-partition server side ParamShard]({{ BASE_PATH }}{% post_url /docs/2015-01-30-architecture %}) has a detailed description of the
+architecture. Each process consists of one main thread running the stub and multiple
+background threads running the worker and server tasks. The stub of the main
+thread forwards messages among threads, while the worker and
+server tasks are performed by the background threads.
+
+## Message
+
+<figure >
+<object type="image/svg+xml" align="center" width="100" data="{{ BASE_PATH }}/assets/image/msg.svg" > Not
+supported </object>
+<figcaption><strong> Fig.2 - Logical message format</strong></figcaption>
+</figure>
+
+Fig.2 shows the logical message format which has two parts, the header and the
+content. The message header includes the sender and receiver's ID consisting of
+the group ID and the worker/server ID within the group. The stub forwards
+messages by looking up an address table based on the receiver's ID.
+There are two sets of messages; their message types are:
+
+  * kGet/kPut/kRequest/kSync for messages about parameters
+
+  * kFeaBlob/kGradBlob for messages about transferring feature and gradient
+  blobs of one layer to its neighboring layer
+
+The header also contains a target ID. If the message is about parameters,
+the target ID is the parameter ID. Otherwise the message is about a layer's
+feature or gradient, and the target ID consists of the layer ID and the
+blob ID of that layer. The message content has multiple frames to store the
+parameter or feature data.
+
+The API for the base Msg is:
+
+    class Msg{
+     public:
+      /**
+       * Destructor to free memory
+       */
+      virtual ~Msg()=0;
+      /**
+       * @param group_id worker/server group id
+       * @param id worker/server id within the group
+       * @param flag 0 for server, 1 for worker, 2 for stub
+       */
+      virtual void set_src(int group_id, int id, int flag)=0;
+      virtual void set_dst(int group_id, int id, int flag)=0;
+      virtual int src_group_id() const=0;
+      virtual int src_id() const=0;
+      virtual int dst_group_id() const=0;
+      virtual int dst_id() const=0;
+      virtual int src_flag() const=0;
+      virtual int dst_flag() const=0;
+      virtual void set_type(int type)=0;
+      virtual void set_target(int param_id)=0;
+      virtual void set_target(int layer_id, int blob_id)=0;
+      virtual int type() const=0;
+      /**
+       * @return true if the msg is about parameter, otherwise false
+       */
+      virtual bool is_param_msg() const=0;
+      virtual int param_id() const=0;
+      virtual int layer_id() const=0;
+      virtual int blob_id() const=0;
+
+      virtual void add_frame(void*, int nBytes)=0;
+      virtual int frame_size()=0;
+      virtual void* frame_data()=0;
+      /**
+       * Move the cursor to the next frame
+       * @return true if the next frame is not NULL; otherwise false
+       */
+      virtual bool next_frame()=0;
+      virtual int SerializeTo(string* buf);
+      virtual int ParseFrom(const string& buf);
+    };
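+
+As a usage illustration (assuming a concrete subclass, here called `ZmqMsg`, and
+the message type constant `kGet` mentioned above; `ZmqMsg` and the variable
+names are placeholders, not part of the API), a worker could compose and send a
+parameter request as follows:
+
+    // Illustrative sketch only; ZmqMsg and the IDs are assumed names.
+    Msg* msg = new ZmqMsg();
+    msg->set_src(worker_group_id, worker_id, 1);  // flag 1: sender is a worker
+    msg->set_dst(server_group_id, server_id, 0);  // flag 0: receiver is a server
+    msg->set_type(kGet);                          // parameter request
+    msg->set_target(param_id);                    // parameter message => target is the param ID
+    // for kPut/kUpdate, content frames would be added via msg->add_frame(data, nbytes)
+    dealer->Send(&msg);                           // Send() deallocates the message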
+
+## Socket
+
+In Singa, there are two types of sockets, the Dealer Socket and the Router
+Socket. The names are from ZeroMQ. All connections are of the same type, i.e.,
+Dealer<-->Router. The communication between dealers and routers is
+asynchronous. In other words, one Dealer
+socket can talk with multiple Router sockets, and one Router socket can talk
+with multiple Dealer sockets.
+
+### Base Socket
+
+The basic functions of a Singa Socket are to send and receive messages. The APIs
+are:
+
+    class Socket{
+     public:
+      /**
+       * @param args depending on the underlying implementation.
+       */
+      Socket(void* args);
+      /**
+       * Send a message to connected socket(s), non-blocking. The message will
+       * be deallocated after sending, thus should not be used after calling Send();
+       * @param msg the message to be sent
+       * @param  dst the identifier of the connected socket. By default, it is
+       * -1, which means sending this message to all connected sockets.
+       * @return 1 for success queuing the message for sending, 0 for failure
+       */
+      virtual int Send(Msg** msg, int dst=-1)=0;
+      /**
+       * Receive a message
+       * @return a message pointer if success; nullptr if failure
+       */
+      virtual Msg* Receive()=0;
+    };
+
+    class Poller{
+     public:
+      /**
+       * Add a socket for polling; Multiple sockets can be polled together by
+       * adding them into the same poller.
+       */
+      void Add(Socket* socket);
+      /**
+       * Poll for all sockets added into this poller.
+       * @param duration stop after this number of milliseconds
+       * @return pointer to the socket if it has one message in the receiving
+       * queue; nullptr if there is no message in any socket.
+       */
+      Socket* Poll(int duration);
+    };
+
+### Dealer Socket
+
+The Dealer socket inherits from the base Socket. In Singa, every Dealer socket
+only connects to one Router socket as shown in Fig.1.  The connection is set up
+by connecting the Dealer socket to the endpoint of a Router socket.
+
+    class Dealer : public Socket{
+     public:
+      /**
+       * Blocking operation to setup the connection with the router, called
+       * only once.
+       * @param endpoint identifier of the router. For an intra-process
+       * connection, the endpoint follows ZeroMQ's format, i.e., it
+       * starts with "inproc://"; since each process in Singa has one
+       * router, we can fix the endpoint to "inproc://router" for
+       * intra-process connections. For inter-process connections, the
+       * endpoint follows ZeroMQ's format, i.e., IP:port, where IP is
+       * the address of the connected process.
+       * @return 1 if the connection is set up successfully; 0 otherwise
+       */
+      int Connect(string endpoint);
+      /**
+       * Since the Dealer socket connects to only one router, it must send to
+       * the connected router, thus the dst argument is useless.
+       */
+      virtual int Send(Msg** msg, int dst=-1);
+      virtual Msg* Receive();
+    };
+
+### Router Socket
+
+The Router socket inherits from the base Socket. One Router socket connects to
+at least one Dealer socket.
+
+    class Router : public Socket{
+     public:
+     /**
+      * Blocking operation to setup the connection with dealers.
+      * It automatically binds to the endpoint for intra-process communication,
+      * i.e., "inproc://router".
+      *
+      * @param endpoint the identifier for Dealer sockets in other processes
+      * to connect to. It has the format IP:port, where IP is the address of
+      * the host machine. If endpoint is empty, all connections are
+      * intra-process connections.
+      * @param expected_connections total number of connections. This function
+      * exits after receiving this number of connections from dealers or after
+      * a timeout (1 minute).
+      * @return number of connected dealers.
+      */
+      int Bind(string endpoint, int expected_connections);
+      virtual int Send(Msg** msg, int dst=-1);
+      virtual Msg* Receive();
+    };
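+
+As a usage illustration (assuming the Dealer, Router and Poller classes above;
+the endpoint strings and counts are only examples), the stub and a background
+thread could be wired up roughly as follows:
+
+    // Illustrative sketch of the intra-process wiring in Fig.1.
+    // Main (stub) thread: bind the router and wait for local connections.
+    Router router;
+    router.Bind("", nworkers + nservers);  // empty endpoint => intra-process only
+
+    // Background worker/server thread: connect a dealer to the stub's router.
+    Dealer dealer;
+    dealer.Connect("inproc://router");
+
+    // Main thread: poll the router and forward messages by receiver ID.
+    Poller poller;
+    poller.Add(&router);
+    while (Socket* socket = poller.Poll(100)) {  // wait up to 100 ms per iteration
+      Msg* msg = socket->Receive();
+      // look up the address table using msg->dst_group_id()/dst_id() and forward it
+    }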
+
+## Implementation
+
+### ZeroMQ
+
+**Why [ZeroMQ](http://zeromq.org/)?** Our previous design used MPI for
+communication between Singa processes. But MPI is a poor choice when it comes
+to fault-tolerance, because failure at one node brings down the entire MPI
+cluster. ZeroMQ, on the other hand, is fault tolerant in the sense that one
+node failure does not affect the other nodes. ZeroMQ consists of several basic
+communication patterns that can be easily combined to create more complex
+network topologies.
+
+<figure>
+<img src="{{ BASE_PATH }}/assets/image/msg-flow.png" align="center" width="550px"/>
+<figcaption><strong> Fig.3 - Messages flow for ZeroMQ</strong></figcaption>
+</figure>
+
+The communication APIs of Singa are similar to the DEALER-ROUTER pattern of
+ZeroMQ. Hence we can easily implement the Dealer socket using ZeroMQ's DEALER
+socket, and Router socket using ZeroMQ's ROUTER socket.
+Intra-process communication can be implemented using ZeroMQ's inproc transport, and
+inter-process communication using the tcp transport (to exploit
+Infiniband, we can use the sdp transport). Fig.3 shows the message flow using
+ZeroMQ as the underlying implementation. The messages sent from dealers have two
+frames for the message header, and one or more frames for the message content.
+The messages sent from routers have another frame for the identifier of the
+destination dealer.
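+
+The following is a minimal sketch (using the raw libzmq C API; it is not the
+actual SINGA implementation) of how a dealer could send such a multi-frame
+message, with two header frames followed by one content frame:
+
+    // Hedged sketch of sending a Singa-style message over a ZeroMQ DEALER socket.
+    #include <zmq.h>
+
+    int main() {
+      void* ctx = zmq_ctx_new();
+      void* dealer = zmq_socket(ctx, ZMQ_DEALER);
+      zmq_connect(dealer, "inproc://router");      // or "tcp://IP:port" across processes
+
+      const char addr[]   = "src|dst";             // frame 1: addressing header
+      const char target[] = "type|target";         // frame 2: type and target IDs
+      const char data[]   = "parameter bytes";     // frame 3+: message content
+      zmq_send(dealer, addr, sizeof(addr) - 1, ZMQ_SNDMORE);
+      zmq_send(dealer, target, sizeof(target) - 1, ZMQ_SNDMORE);
+      zmq_send(dealer, data, sizeof(data) - 1, 0);
+
+      zmq_close(dealer);
+      zmq_ctx_destroy(ctx);
+      return 0;
+    }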
+
+Besides the DEALER-ROUTER pattern, we may also implement the Dealer socket and
+Router socket using other ZeroMQ patterns. To be continued.
+
+### MPI
+
+Since MPI does not provide intra-process communication, we have to implement
+it inside the Router and Dealer sockets. A simple solution is to allocate one
+message queue for each socket. Messages sent to a socket are inserted into the
+queue of that socket. We create a SafeQueue class to ensure the consistency of
+the queue. All queues are created by the main thread and
+passed to all sockets' constructors via *args*.
+
+    /**
+     * A thread safe queue class.
+     * There would be multiple threads pushing messages into
+     * the queue and only one thread reading and popping the queue.
+     */
+    class SafeQueue{
+     public:
+      void Push(Msg* msg);
+      Msg* Front();
+      void Pop();
+      bool empty();
+    };
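+
+A minimal possible implementation (assuming C++11; not necessarily the one used
+in SINGA) simply guards an STL queue with a mutex:
+
+    // Sketch of a mutex-protected queue; Msg is the message class defined earlier.
+    #include <mutex>
+    #include <queue>
+
+    class SafeQueue {
+     public:
+      void Push(Msg* msg) {
+        std::lock_guard<std::mutex> lock(mutex_);
+        queue_.push(msg);
+      }
+      Msg* Front() {
+        std::lock_guard<std::mutex> lock(mutex_);
+        return queue_.empty() ? nullptr : queue_.front();
+      }
+      void Pop() {
+        std::lock_guard<std::mutex> lock(mutex_);
+        if (!queue_.empty()) queue_.pop();
+      }
+      bool empty() {
+        std::lock_guard<std::mutex> lock(mutex_);
+        return queue_.empty();
+      }
+
+     private:
+      std::queue<Msg*> queue_;
+      std::mutex mutex_;
+    };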
+
+For inter-process communication, we serialize the message and call MPI's
+send/receive functions to transfer it. All inter-process connections are
+set up by MPI at the beginning. Consequently, the Connect and Bind functions do
+nothing for both inter-process and intra-process communication.
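+
+As a sketch of the inter-process path (assuming the SerializeTo/ParseFrom
+methods of the Msg API above; helper names are placeholders and this is not the
+actual implementation), a message can be shipped with plain MPI point-to-point calls:
+
+    // Hedged sketch: serialize a Msg and transfer it via MPI_Send/MPI_Recv.
+    #include <mpi.h>
+    #include <string>
+
+    void MpiSendMsg(Msg* msg, int dst_rank, int tag) {
+      std::string buf;
+      msg->SerializeTo(&buf);
+      MPI_Send(buf.data(), static_cast<int>(buf.size()), MPI_CHAR,
+               dst_rank, tag, MPI_COMM_WORLD);
+    }
+
+    std::string MpiRecvMsg(int src_rank, int tag) {
+      MPI_Status status;
+      MPI_Probe(src_rank, tag, MPI_COMM_WORLD, &status);
+      int nbytes = 0;
+      MPI_Get_count(&status, MPI_CHAR, &nbytes);
+      std::string buf(nbytes, '\0');
+      MPI_Recv(&buf[0], nbytes, MPI_CHAR, src_rank, tag, MPI_COMM_WORLD,
+               MPI_STATUS_IGNORE);
+      return buf;  // the receiver then reconstructs the Msg via ParseFrom(buf)
+    }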
+
+MPI's AllReduce function is efficient for data aggregation in distributed
+training. For example, [DeepImage of Baidu](http://arxiv.org/abs/1501.02876)
+uses AllReduce to aggregate the parameter updates from all workers. It has a
+similar architecture to [Fig.2]({{ BASE_PATH }}{% post_url /docs/2015-01-30-architecture %}),
+where every process has a server group and is connected with all other processes.
+Hence, we can implement DeepImage in Singa simply by using MPI's AllReduce function for
+inter-process communication.
+
+<!--
+### Server socket
+
+Each server has a DEALER socket to communicate with the stub in the main
+thread via an _in-proc_ socket. It receives requests issued from workers and
+other servers, and forwarded by the ROUTER of the stub. Since the requests are forwarded by the
+stub, we can make the location of workers transparent to server threads. The
+stub records the locations of workers and servers.
+
+As explained previously in the
+[APIs]({{ BASE_PATH }}{% post_url /docs/2015-03-20-parameter-management %})
+for parameter management, some requests may
+not be processed immediately but have to be re-queued. For instance, the Get
+request cannot be processed if the requested parameter is not available, i.e.,
+the parameter has not been put into the server's ParamShard. The re-queueing
+operation is implemented by sending the messages to the ROUTER
+socket of the stub which treats the message as a newly arrived request
+and queues it for processing.
+
+### Worker socket
+
+Each worker thread has a DEALER socket to communicate with the stub in the main
+thread via an _in-proc_ socket. It sends (Get/Update) requests to the ROUTER in
+the stub which forwards the request to (local or remote) processes. In case of
+the partition of ParamShard of worker side, it may also transfer data with other
+workers via the DEALER socket. Again, the location of the other side (a server
+or worker) of the communication is transparent to the worker. The stub handles
+the addressing.
+
+PMClient executes the training logic, during which it generates GET and UPDATE
+requests. A request received at the worker's main thread contains ID of the
+PMClient instance. The worker determines which server to send the request based
+on its content, then sends it via the corresponding socket. Response messages
+received from any of the server socket are forwarded to the in-proc ROUTER
+socket. Since each response header contains the PMClient ID, it is routed to
+the correct instance.
+
+### Stub sockets
+
+#### ROUTER socket
+The main thread has a ROUTER socket to communicate with background threads.
+
+It forwards the requests from workers to background servers. There can be
+multiple servers. If all servers maintain the same (sub) ParamShard, then the
+request can be forwarded to any of them. Load-balance (like round-robin) can be
+implemented in the stub to improve the performance. If each server maintains a
+sub-set of the local ParamShard, then the stub forwards each request to the
+corresponding server.  It also forwards the synchronization requests from
+remote servers to local servers in the same way.
+
+In the case of neural network partition (i.e., model partition), neighbor
+layers would transfer data with each other. Hence, the ROUTER also forwards
+data transfer requests from one worker to another worker. The stub looks up the
+location table to decide where to forward each request.
+
+#### DEALER sockets
+
+The main thread has multiple DEALER sockets to communicate with other
+processes, one socket per process. Two processes are connected if one of the
+following cases exists:
+
+  * one worker group spans across the two processes;
+  * two connected server groups are separated in the two processes;
+  * workers and the subscribed servers are separated in the two processes.
+
+
+All messages in SINGA are of multi-frame ZeroMQ format. The figure above demonstrates different types of messages exchanged in the system.
+
+  1. Requests generated by PMClient consist of the parameter content (which could be empty), followed by the parameter ID (key) and the request type (GET/PUT/REQUEST). Responses received by PMClient are also of this format.
+  2. Messages received by the worker's main thread from PMClient instances contain another frame identifying the PMClient connection (or PMClient ID).
+  3. Requests originating form a worker and arriving at the server contain another frame identifying the worker's connection (or Worker ID).
+  4. Requests originating from another server and arriving at the server have the same format as (3), but the first frame identifies the server connection (or Server ID).
+  5. After a PMServer processes a request, it generates a message with the format similar to (3) but with extra frame indicating if the message is to be routed back to a worker (a response message) or to route to another server (a SYNC request).
+  6. When a request is re-queued, the PMServer generates a message and sends it directly to the server's front-end socket. The re-queued request seen by the server's main thread consists of all the frames in (3), followed by a REQUEUED frame, and finally by another frame generated by the ROUTER socket identifying connection from the PMServer instance. The main thread then strips off these additional two frames before  forwarding it to another PMServer instance like another ordinary request.
+
+-->

http://git-wip-us.apache.org/repos/asf/incubator-singa/blob/666a841d/_posts/docs/2015-02-20-code-structure.md
----------------------------------------------------------------------
diff --git a/_posts/docs/2015-02-20-code-structure.md b/_posts/docs/2015-02-20-code-structure.md
new file mode 100644
index 0000000..198e658
--- /dev/null
+++ b/_posts/docs/2015-02-20-code-structure.md
@@ -0,0 +1,80 @@
+---
+layout: post
+title: Code Structure
+category : docs
+tagline:
+tags : [code, structure, singa]
+---
+{% include JB/setup %}
+
+## Worker Side
+
+### Main Classes
+
+<img src="{{ BASE_PATH }}/assets/image/code-structure/main.jpg" style="width:70%;" align="center"/>
+
+* **Worker**: starts the solver to conduct training or resume from previous training snapshots.
+* **Solver**: constructs the neural network and runs training algorithms over it. Validation and testing are also done by the solver along with the training.
+* **TableDelegate**: delegate for the parameter table physically stored in parameter servers.
+    It runs a thread to communicate with table servers for parameter transfer.
+* **Net**: the neural network, which consists of multiple layers constructed from the input configuration file.
+* **Layer**: the core abstraction; reads data (neurons) from connected layers and computes its own data
+    according to layer-specific ComputeFeature functions. Data from the bottom layer is forwarded
+    layer by layer to the top.
+
+### Data types
+
+<img src="{{ BASE_PATH }}/assets/image/code-structure/layer.jpg" style="width:90%;" align="center"/>
+
+* **ComputeFeature**: reads data (neurons) from in-coming layers, and computes the data
+    of this layer according to the layer type. This function can be overridden to implement different
+    types of layers.
+* **ComputeGradient**: reads gradients (and data) from in-coming layers and computes the
+    gradients of parameters and data w.r.t. the learning objective (loss).
+
+We adapt the implementations of **PoolingLayer**, **Im2colLayer** and **LRNLayer** from [Caffe](http://caffe.berkeleyvision.org/).
+
+
+<img src="{{ BASE_PATH }}/assets/image/code-structure/darray.jpg" style="width:55%;" align="center"/>
+
+* **DArray**: provides the abstraction of a distributed array over multiple nodes,
+    supporting array/matrix operations and element-wise operations. Users can use it as a local structure.
+* **LArray**: the local part of a DArray. Each LArray is treated as an
+    independent array, and supports all array-related operations.
+* **MemSpace**: manages the memory used by DArray. Distributed memory is allocated
+    and managed by ARMCI. Multiple DArrays can share the same MemSpace; the memory
+    is released when no DArray uses it anymore.
+* **Partition**: maintains both the global shape and local partition information;
+    used when two DArrays are going to interact.
+* **Shape**: basic class for representing the scope of a DArray/LArray.
+* **Range**: basic class for representing the scope of a Partition.
+
+## Parameter Server
+
+### Main classes
+
+<img src="{{ BASE_PATH }}/assets/image/code-structure/uml.jpg" style="width:90%;" align="center"/>
+
+* **NetworkService**: provides access to the network (sending and receiving messages). It maintains a queue for received messages, implemented by NetworkQueue.
+* **RequestDispatcher**: picks up the next message (request) from the queue, and invokes a method (callback) to process it.
+* **TableServer**: provides access to the data table (parameters). It registers callbacks for different types of requests with the RequestDispatcher.
+* **GlobalTable**: implements the table. Data is partitioned into multiple Shard objects per table. User-defined consistency models are supported by extending TableServerHandler for each table.
+
+### Data types
+
+<img src="{{ BASE_PATH }}/assets/image/code-structure/type.jpg" style="width:400px;" align="middle"/>
+
+Table-related messages are either of type **RequestBase**, which contains different types of requests, or of type **TableData**, which contains a key-value tuple.
+
+### Control flow and thread model
+
+![uml]({{ BASE_PATH }}/assets/image/code-structure/threads.jpg)
+
+The figure above shows how a GET request sent from a worker is processed by the
+table server. The control flow for other types of requests is similar. At
+the server side, there are at least 3 threads running at any time: two by
+NetworkService for sending and receiving messages, and at least one by the
+RequestDispatcher for dispatching requests.
+
+
+

http://git-wip-us.apache.org/repos/asf/incubator-singa/blob/666a841d/_posts/docs/2015-02-30-neuralnet-partition.md
----------------------------------------------------------------------
diff --git a/_posts/docs/2015-02-30-neuralnet-partition.md b/_posts/docs/2015-02-30-neuralnet-partition.md
new file mode 100644
index 0000000..a91cae0
--- /dev/null
+++ b/_posts/docs/2015-02-30-neuralnet-partition.md
@@ -0,0 +1,59 @@
+---
+layout: post
+title: Neural Network Partition
+category : docs
+tagline:
+tags : [partition, neuralnet]
+---
+{% include JB/setup %}
+
+The purpose of partitioning a neural network is to distribute the partitions onto
+different working units (e.g., threads or nodes, called workers in this article)
+and parallelize the processing.
+Another reason for partitioning is to handle a large neural network which cannot be
+held in a single node. For instance, to train models against images with high
+resolution we need large neural networks (in terms of training parameters).
+
+Since *Layer* is the first-class citizen in SINGA, we do the partitioning against
+layers. Specifically, we support partitioning at two levels. First, users can configure
+the location (i.e., worker ID) of each layer. In this way, users assign one worker
+to each layer. Second, for one layer, we can partition its neurons or partition
+the instances (e.g., images). These are called layer partition and data partition
+respectively. We illustrate the two types of partitions using a simple convolutional neural network.
+
+<img src="{{ BASE_PATH }}/assets/image/conv-mnist.png" align="center" width="200px"/>
+
+The above figure shows a convolutional neural network without any partition. It
+has 8 layers in total (one rectangle represents one layer). The first layer is
+a DataLayer (data) which reads data from local disk files/databases (or HDFS). The second layer
+is a MnistLayer which parses the records from MNIST data to get the pixels of a batch
+of 8 images (each image is of size 28x28). The LabelLayer (label) parses the records to get the label
+of each image in the batch. The ConvolutionalLayer (conv1) transforms the input image to the
+shape of 8x27x27. The ReLULayer (relu1) conducts elementwise transformations. The PoolingLayer (pool1)
+sub-samples the images. The fc1 layer is fully connected with the pool1 layer. It
+multiplies each image with a weight matrix to generate a 10-dimensional hidden feature which
+is then normalized by a SoftmaxLossLayer to get the prediction.
+
+<img src="{{ BASE_PATH }}/assets/image/conv-mnist-datap.png" align="center" width="400px"/>
+
+The above figure shows the convolutional neural network after partitioning all layers,
+except the DataLayer and ParserLayers, into 3 partitions using data partition.
+The red layers process 4 images of the batch, while the black and blue layers process 2 images
+each. Some helper layers, i.e., SliceLayer, ConcateLayer, BridgeSrcLayer,
+BridgeDstLayer and SplitLayer, are added automatically by our partition algorithm.
+Layers of the same color reside in the same worker. There would be data transferring
+across different workers at the boundary layers (i.e., BridgeSrcLayer and BridgeDstLayer),
+e.g., between s-slice-mnist-conv1 and d-slice-mnist-conv1.
+
+<img src="{{ BASE_PATH }}/assets/image/conv-mnist-layerp.png" align="center" width="400px"/>
+
+The above figure shows the convolutional neural network after partitioning all layers,
+except the DataLayer and ParserLayers, into 2 partitions using layer partition. We can
+see that each layer processes all 8 images from the batch, but different partitions process
+different parts of one image. For instance, the layer conv1-00 processes only 4 channels. The other
+4 channels are processed by conv1-01, which resides in another worker.
+
+
+Since the partition is done at the layer level, we can apply different partitions to
+different layers to get a hybrid partition for the whole neural network. Moreover,
+we can also specify the layer locations to place different layers on different workers.

http://git-wip-us.apache.org/repos/asf/incubator-singa/blob/666a841d/_posts/docs/2015-03-10-programming-model.md
----------------------------------------------------------------------
diff --git a/_posts/docs/2015-03-10-programming-model.md b/_posts/docs/2015-03-10-programming-model.md
new file mode 100644
index 0000000..d2aae97
--- /dev/null
+++ b/_posts/docs/2015-03-10-programming-model.md
@@ -0,0 +1,132 @@
+---
+layout: post
+title: Programming Model
+category : docs
+tagline:
+tags : [programming model, API]
+---
+{% include JB/setup %}
+
+We describe the programming model of SINGA in this article.
+Base data structures are introduced first, and then we show examples for
+users with different levels of deep learning background.
+
+### Base Data Structures
+
+#### Layer
+
+Layer is the first class citizen in SINGA. Users construct their deep learning
+models by creating layer objects and combining them. SINGA
+takes care of running BackPropagation (or Contrastive Divergence) algorithms
+to calculate the gradients for parameters and calling [Updaters](#updater) to
+update them.
+
+    class Layer{
+      /**
+       * Setup layer properties.
+       * Setup the shapes for data and parameters, also setup some properties
+       * based on the layer configuration and connected src layers.
+       * @param conf user defined layer configuration of type [LayerProto](#netproto)
+       * @param srclayers layers connecting to this layer
+       */
+      Setup(conf, srclayers);
+      /**
+       * Setup the layer properties.
+       * This function is called if the model is partitioned due to distributed
+       * training. Shape of the layer is already set by the partition algorithm,
+       * and is passed in to set other properties.
+       * @param conf user defined layer configuration of type [LayerProto](#netproto)
+       * @param shape shape set by partition algorithm (for distributed training).
+       * @param srclayers layers connecting to this layer
+       */
+      SetupAfterPartition(conf, shape, srclayers);
+      /**
+       * Compute features of this layer based on connected layers.
+       * BP and CD will call this to compute features
+       * @param training boolean phase indicator for training or test
+       * @param srclayers layers connecting to this layer
+       */
+      ComputeFeature(training, srclayers);
+      /**
+       * Compute gradients for parameters and connected layers.
+       * BP and CD will call this to calculate gradients
+       * @param srclayers layers connecting to this layer.
+       */
+      ComputeGradient(srclayers)=0;
+    }
+
+The above pseudo code shows the base Layer class. Users override these
+methods to implement their own layer classes. For example, we have implemented
+popular layers like ConvolutionLayer and InnerProductLayer. We also provide a
+DataLayer which is a base layer for loading (and prefetching) data from disk or HDFS. A base ParserLayer
+is created for parsing the raw data and converting it into records that are recognizable by SINGA.
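+
+As an illustration (pseudo code in the same style as above, not taken from the
+SINGA code base), a ReLU-like layer could override the two compute methods
+roughly as follows:
+
+    // Illustrative pseudo code following the base Layer API above.
+    class ReLULayer: public Layer {
+      ComputeFeature(training, srclayers) {
+        src = srclayers[0].data();
+        for (i = 0; i < src.count(); i++)
+          data_[i] = max(src[i], 0);        // elementwise rectification
+      }
+      ComputeGradient(srclayers) {
+        src = srclayers[0].data();
+        for (i = 0; i < src.count(); i++)   // pass gradients back only where src > 0
+          srclayers[0].grad()[i] = src[i] > 0 ? grad_[i] : 0;
+      }
+    }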
+
+#### NetProto
+
+Deep learning models consist of multiple layers, so the model structure includes
+the properties of each layer and the connections between layers. SINGA uses
+Google protocol buffers for users to configure the model structure. The protocol
+buffer message for the model structure is defined as:
+
+    NetProto{
+      repeated LayerProto layer;
+    }
+
+    LayerProto{
+      string name; // user defined layer name for displaying
+      string type; // One layer class has a unique type.
+      repeated string srclayer_name; // connected layer names;
+      repeated ParamProto param; // parameter configurations
+      ...
+    }
+
+Users can create a plain text file and fill it with the configurations. SINGA
+parses it according to the user-provided path.
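+
+For illustration, such a configuration file in protocol buffer text format
+could look like the snippet below (the layer type names and param fields here
+are placeholders, not the exact SINGA schema):
+
+    # Illustrative configuration only.
+    layer {
+      name: "data"
+      type: "DataLayer"
+    }
+    layer {
+      name: "fc1"
+      type: "InnerProductLayer"
+      srclayer_name: "data"
+      param { name: "weight" }
+      param { name: "bias" }
+    }
+    layer {
+      name: "loss"
+      type: "SoftmaxLossLayer"
+      srclayer_name: "fc1"
+    }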
+
+#### Param
+
+The Param class is shown below. Users do not need to extend the Param class in
+most cases. We make it a base class just for future extension. For example,
+if a new initialization trick is proposed in the future, we can override the `Init`
+method to implement it.
+
+    Param{
+      /**
+       * Set properties of the parameter.
+       * @param conf user defined parameter configuration of type ParamProto
+       * @param shape shape of the parameter
+       */
+      Setup(conf, shape);
+      /**
+       * Initialize the data of the parameter.
+       */
+      Init();
+      ...// methods to handle synchronization with parameter servers and other workers
+    }
+
+#### Updater
+
+There are many SGD extensions for updating parameters,
+like [AdaDelta](http://arxiv.org/pdf/1212.5701v1.pdf),
+[AdaGrad](http://www.magicbroom.info/Papers/DuchiHaSi10.pdf),
+[RMSProp](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf),
+[Nesterov](http://scholar.google.com/citations?view_op=view_citation&hl=en&user=DJ8Ep8YAAAAJ&citation_for_view=DJ8Ep8YAAAAJ:hkOj_22Ku90C)
+and SGD with momentum. We provide a base Updater to deal with these algorithms.
+New parameter updating algorithms can be added by extending the base Updater.
+
+    Updater{
+      /**
+      * @param conf user configuration for the updater.
+      */
+      Init(conf);
+      /**
+      * Update a parameter based on its gradient.
+      * @param step training step
+      * @param param the Param object
+      */
+      Update(step, param);
+    }
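+
+As an illustration (pseudo code only; the field and method names are
+placeholders, not the actual SINGA implementation), SGD with momentum could be
+added by extending the base Updater:
+
+    // Illustrative pseudo code for an SGD-with-momentum updater.
+    class MomentumUpdater: public Updater{
+      Init(conf) {
+        lr_ = conf.learning_rate();
+        momentum_ = conf.momentum();
+      }
+      Update(step, param) {
+        // history <- momentum * history - lr * gradient, then apply it
+        param.history() = momentum_ * param.history() - lr_ * param.grad();
+        param.data() += param.history();
+      }
+    }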
+
+### Examples
+
+The [MLP example]({{ BASE_PATH }})
+shows how to configure the model through Google protocol buffers.
+

http://git-wip-us.apache.org/repos/asf/incubator-singa/blob/666a841d/_posts/docs/2015-03-20-parameter-management.md
----------------------------------------------------------------------
diff --git a/_posts/docs/2015-03-20-parameter-management.md b/_posts/docs/2015-03-20-parameter-management.md
new file mode 100644
index 0000000..f912335
--- /dev/null
+++ b/_posts/docs/2015-03-20-parameter-management.md
@@ -0,0 +1,174 @@
+---
+layout: post
+title: Parameter Management
+category : docs
+tagline:
+tags : [development, documentation, parameter]
+---
+{% include JB/setup %}
+
+In this article, we describe the parameter management in SINGA.
+
+## Base Classes (Abstractions)
+
+### Param class
+
+Parameters in SINGA are represented by Param objects. (todo: add description for Param class).
+
+
+### ParameterManager class
+
+ParameterManager (PM) is responsible for synchronizing Param objects between workers
+and parameter servers.
+
+
+**Draft**: the PM provides APIs for both workers and servers to get and update Param objects.
+
+        /**
+         * Blocking operation called by a worker to get the parameter.
+         */
+        bool Get(Param*);
+        /**
+         * Processes a Get request and responds to the sender.
+         * Returns true if successful, otherwise returns false.
+         */
+        bool HandleGet(int paramId, msg_t* msg);
+        /**
+         * Non-blocking operation. It passes the parameter to the PM that maintains it.
+         */
+        bool Put(Param*);
+        /**
+         * Non-blocking operation to process a Put request.
+         * Returns true if successful, otherwise returns false.
+         */
+        bool HandlePut(int paramId, msg_t* msg);
+        /**
+         * Non-blocking operation for updating parameters.
+         * It may synchronize the updates to other PMs;
+         */
+        bool Update(Param*);
+        /**
+         * Blocking operation to collect results from the previous update.
+         * E.g., receive responses from the remote PM. If HandleUpdate does not
+         * respond automatically, Collect should request the result explicitly from the PM.
+         */
+        bool Collect(Param*);
+        /**
+         * Processes Update requests.
+         * It may return responses, e.g., send the parameter back to the sender.
+         */
+        bool HandleUpdate(int paramId, msg_t* msg);
+        /**
+         * Returns the node ID to which the Get/Put/Update request should be sent.
+         */
+        int Shard(int paramId);
+        /**
+         * Returns whether to synchronize updates to other PMs.
+         */
+        bool SyncNow(int paramId);
+
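+As a usage illustration (pseudo code over the draft API above; the `pm` and
+`param` objects are assumed to be created elsewhere), a worker-side training
+loop could look like:
+
+        // Illustrative worker-side flow: blocking Get, local compute,
+        // non-blocking Update, then Collect before the next access.
+        for (step = 0; step < max_steps; step++) {
+          pm.Get(param);            // make sure the latest values are available locally
+          ComputeGradient(param);   // worker-side training logic (placeholder)
+          pm.Update(param);         // push the update to the PM that maintains it
+          pm.Collect(param);        // wait for the result of the previous update
+        }
+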
+  With this general PM design, we can support different cluster topologies by
+  implementing different PMs. The following figure shows the logical components of three topologies.
+  In fact, one server may consist of multiple nodes.
+
+  <img src="{{ BASE_PATH }}/assets/image/history/pm-topology.png" align="center" width="500px"/>
+
+  * Worker-Server. This is the current topology we are using.
+
+  * Worker-Replicated Server. This topology reduces the communication load on a single server (group).
+
+  * Worker-Worker. This topology does not have any parameter servers.
+
+
+## Worker-Server Architecture
+
+In this section, we describe our implementations for the worker-server architecture.
+Workers and servers are multi-threaded. We use ZeroMQ for communication.
+
+**Why ZeroMQ?** Our previous design used MPI for communication between SINGA
+processes. But MPI is a poor choice when it comes to failure recovery, because
+failure at one node brings down the entire MPI cluster. ZeroMQ, on the other
+hand, is fault tolerant in the sense that one node failure does not affect
+the other nodes. ZeroMQ consists of several basic communication patterns
+that can be easily combined to create more complex network topologies.
+
+### Overview
+
+![Figure 1](http://www.comp.nus.edu.sg/~dinhtta/param_server.jpg)
+
+Both workers and servers consist of one main thread and multiple background threads. The main thread sets up communication sockets with other processes and with the background threads. Its main job is to forward messages between different sockets. The worker and server logic are performed by the background threads.
+
+### Server sockets
+
+The server has one front-end ROUTER socket to receive requests from the workers and other servers. A ROUTER socket is used because it allows response messages to be routed back to the correct recipients.
+
+Server background threads are called **PMServers**. They connect to the main thread via an _in-proc_ DEALER socket. When the server receives a request, it distributes the request fairly to one of the background threads by simply forwarding to the DEALER socket. This DEALER-DEALER pattern implements a simple form of load balancing.
+
+As explained previously in the [APIs](http://www.comp.nus.edu.sg/~dbsystem/singa/development,%20documentation/2015/03/12/parameter-management/) for parameter management, depending on the consistency model some requests may not be processed immediately but have to be re-queued. Such re-queueing is implemented by having each PMServer instance connect directly to the front-end ROUTER socket. More specifically, when the PMServer decides that a message has to be re-queued, it sends the message to the front-end ROUTER socket, which treats the message as another request arriving from the network and queues it for processing.
+
+A server communicates with another server, for example in the replicated server architecture, by having  DEALER sockets connecting to each of its neighbors' front-end sockets. Note that we opt for DEALER-ROUTER  instead of ROUTER-ROUTER pattern to avoid complexity caused by the handshake protocol between ROUTER sockets. Furthermore, the request-reply pattern supported by DEALER-ROUTER is sufficient to implement synchronization between servers.
+
+### Worker sockets
+
+Each worker connects to a server via a DEALER socket enabling the request-reply communication pattern initiated by the worker. The main thread binds an _in-proc_ ROUTER socket to which
+background threads (called **PMClients**) are connected. Note that the internal communication pattern here is DEALER-ROUTER, as opposed to DEALER-DEALER used in the server, because each PMClient must receive the correct response for each request it sent.
+
+PMClient executes the training logic, during which it generates GET and UPDATE requests. A request received at the worker's main thread contains the ID of the PMClient instance. The worker determines which server to send the request to based on its content, then sends it via the corresponding socket. Response messages received from any of the server sockets are forwarded to the in-proc ROUTER socket. Since each response header contains the PMClient ID, it is routed to the correct instance.
+
+
+### Message Formats
+
+  <img src="http://www.comp.nus.edu.sg/~dinhtta/messages.jpg?1222259157.415" alt="">
+
+All messages in SINGA are of multi-frame ZeroMQ format. The figure above demonstrates different types of messages exchanged in the system.
+
+  1. Requests generated by PMClient consist of the parameter content (which could be empty), followed by the parameter ID (key) and the request type (GET/PUT/REQUEST). Responses received by PMClient are also of this format.
+  2. Messages received by the worker's main thread from PMClient instances contain another frame identifying the PMClient connection (or PMClient ID).
+  3. Requests originating from a worker and arriving at the server contain another frame identifying the worker's connection (or Worker ID).
+  4. Requests originating from another server and arriving at the server have the same format as (3), but the first frame identifies the server connection (or Server ID).
+  5. After a PMServer processes a request, it generates a message with a format similar to (3), but with an extra frame indicating whether the message is to be routed back to a worker (a response message) or to another server (a SYNC request).
+  6. When a request is re-queued, the PMServer generates a message and sends it directly to the server's front-end socket. The re-queued request seen by the server's main thread consists of all the frames in (3), followed by a REQUEUED frame, and finally by another frame generated by the ROUTER socket identifying the connection from the PMServer instance. The main thread then strips off these two additional frames before forwarding it to another PMServer instance like an ordinary request.
+
+### Parameter Shard
+
+Background threads at the server have access to a shared list of parameter objects (or parameter shard) maintained by a ParamShard object. When processing a request message, PMServer first looks at the request type, then invokes a method from ParamShard according to the request type. PMServer transfers ownership of the message to the ParamShard method.
+
+Each ParamShard method takes as input an ID and the frame containing the request content. It retrieves the Param object with the specified ID, then invokes the handler provided by the Param object. Consistency models are implemented by Param objects.
+
+The ParamShard APIs are similar to those of Param, except for the extra parameter specifying the ID. Additionally, the ParamShard APIs contain a method for determining if a Param object needs to be synchronized with other servers:
+
+    bool sync_now(int paramID);
+
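+As a rough sketch of the dispatch logic described above (pseudo code; the
+method and constant names are placeholders, not actual SINGA identifiers):
+
+    // Illustrative dispatch of a request to the ParamShard based on its type.
+    ProcessRequest(type, param_id, msg) {
+      if (type == GET)         shard.HandleGet(param_id, msg);
+      else if (type == PUT)    shard.HandlePut(param_id, msg);
+      else if (type == UPDATE) shard.HandleUpdate(param_id, msg);
+      if (shard.sync_now(param_id))
+        SendSyncRequest(param_id);   // forward a SYNC request to neighboring servers
+    }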
+
+###  Topology
+
+Setting a different network topology for workers and servers, as shown in [previous article](http://www.comp.nus.edu.sg/~dbsystem/singa/development,%20documentation/2015/03/12/parameter-management/), is done via a configuration file. In particular, the config file is read into a ProtoBuf message called Topology and is known to all SINGA processes. The Topology message contains multiple ServerConfig, WorkerConfig and ServerSet messages.
+
+    message Topology{
+      repeated ServerConfig server = 5; //parameter server network
+      repeated WorkerConfig worker = 6;
+      repeated ServerSet server_group = 7;
+    }
+
+Each ServerConfig message specifies the properties of a server process, namely its ID, network address, neighbor IDs, synchronization interval and number of threads (or PMServer instances). Each WorkerConfig message represents a worker process, containing the worker's global ID, group and local ID (for the replicated server architecture), and the number of threads (or PMClient instances). Finally, the ServerSet message represents a server group in the replicated server architecture. Each server group is identified by a group ID and the set of server IDs.
+
+    message ServerConfig{
+      required int32 id = 1;
+      required string ip = 2;
+      required int32 port = 3;
+      repeated int32 neighbor = 4; //upstream neighbors
+      required int32 sync_interval = 5; //how many updates (per Param) before syncing
+      required int32 threads = 6;
+    }
+
+    message WorkerConfig{
+      required int32 global_id = 1; //node id
+      required int32 local_id = 2; //id in the group
+      required int32 group_id = 3; //ServerSet ID
+      required int32 threads = 4;
+    }
+
+    message ServerSet{
+      required int32 id = 1;
+      repeated int32 neighbor = 2; //set of primary server
+    }
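+
+For illustration, a text-format Topology configuration for two server processes
+and one worker process might look as follows (all values are examples only):
+
+    # Illustrative values only.
+    server {
+      id: 0
+      ip: "192.168.0.10"
+      port: 5555
+      neighbor: 1
+      sync_interval: 10
+      threads: 2
+    }
+    server {
+      id: 1
+      ip: "192.168.0.11"
+      port: 5555
+      neighbor: 0
+      sync_interval: 10
+      threads: 2
+    }
+    worker {
+      global_id: 2
+      local_id: 0
+      group_id: 0
+      threads: 4
+    }
+    server_group {
+      id: 0
+      neighbor: 0
+      neighbor: 1
+    }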

http://git-wip-us.apache.org/repos/asf/incubator-singa/blob/666a841d/_posts/research/00-publication.md
----------------------------------------------------------------------
diff --git a/_posts/research/00-publication.md b/_posts/research/00-publication.md
new file mode 100644
index 0000000..c94c085
--- /dev/null
+++ b/_posts/research/00-publication.md
@@ -0,0 +1,19 @@
+---
+layout: post
+title: Publication
+category : research
+tagline:
+tags : [Publication]
+---
+{% include JB/setup %}
+
+### Conference Paper
+
+* W. Wang, B.C. Ooi, X. Yang, D. Zhang, Y. Zhuang:
+  [Effective MultiModal Retrieval based on Stacked AutoEncoders](http://www.comp.nus.edu.sg/~ooibc/crossmodalvldb14.pdf).
+  Int'l Conference on Very Large Data Bases (VLDB), 2014.
+
+### Technical Report
+
+* W. Wang, G. Chen, T.T.A. Dinh, J. Gao, B.C. Ooi, K.L. Tan:
+  [SINGA: A Distributed System for Deep Learning]({{ BASE_PATH }}/assets/file/singa.pdf)