Posted to commits@singa.apache.org by bu...@apache.org on 2015/07/28 14:16:16 UTC

svn commit: r959888 - in /websites/staging/singa/trunk/content: ./ docs/data.html docs/neuralnet-partition.html

Author: buildbot
Date: Tue Jul 28 12:16:16 2015
New Revision: 959888

Log:
Staging update by buildbot for singa

Modified:
    websites/staging/singa/trunk/content/   (props changed)
    websites/staging/singa/trunk/content/docs/data.html
    websites/staging/singa/trunk/content/docs/neuralnet-partition.html

Propchange: websites/staging/singa/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Jul 28 12:16:16 2015
@@ -1 +1 @@
-1693074
+1693077

Modified: websites/staging/singa/trunk/content/docs/data.html
==============================================================================
--- websites/staging/singa/trunk/content/docs/data.html (original)
+++ websites/staging/singa/trunk/content/docs/data.html Tue Jul 28 12:16:16 2015
@@ -403,9 +403,114 @@
                                   
             <div class="section">
 <h2><a name="Data_Preparation"></a>Data Preparation</h2>
-<p>To submit a training job, users need to convert raw data (e.g., images, text documents) into records that can be recognized by SINGA. SINGA uses a DataLayer to load these records into memory and uses ParserLayer to parse features (e.g., image pixels and labels) from these records. The records could be organized and stored using many different ways, e.g., using a light database, or a file or HDFS, as long as there is a corresponding DataLayer that can load the records.</p>
+<p>To submit a training job, users need to convert raw data (e.g., images, text documents) into records that can be recognized by SINGA. SINGA uses a DataLayer to load these records into memory and a ParserLayer to parse features (e.g., image pixels and labels) from them. The records can be organized and stored in many different ways, e.g., a file, a light database, or HDFS, as long as there is a corresponding DataLayer that can load the records.</p>
 <div class="section">
-<h3><a name="DataShard"></a>DataShard</h3></div>
+<h3><a name="DataShard"></a>DataShard</h3>
+<p>To create a shard for your own data, users may need to implement or modify the following files:</p>
+
+<ul>
+  
+<li>common.proto</li>
+  
+<li>create_shard.cc</li>
+  
+<li>Makefile</li>
+</ul>
+<p><b>1. Define record</b></p>
+<p>The Record class inherits from the protocol buffer Message class, whose format follows Google Protocol Buffers. Please refer to the <a class="externalLink" href="https://developers.google.com/protocol-buffers/docs/cpptutorial">Tutorial</a>. </p>
+<p>Your record is defined in SINGAfolder/src/proto/common.proto.</p>
+<p>(a) Define the record</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">message UserRecord {
+    repeated int userVAR1 = 1; // unique id
+    optional string userVAR2 = 2; // unique id
+    ...
+}
+</pre></div></div>
+<p>(b) Declare your own record inside Record</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">message Record {
+    optional UserRecord user_record = 1; // unique field tag
+    ...
+}
+</pre></div></div>
+<p>(c) Compile SINGA</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">cd SINGAfolder
+./configure
+make
+</pre></div></div>
+<p><b>2. Create shard</b></p>
+<p>(a) Create a folder for the dataset; here we call it &#x201c;USERDATAfolder&#x201d;.</p>
+<p>(b) The source files for creating the shard will be in SINGAfolder/USERDATAfolder/.</p>
+
+<ul>
+  
+<li>For example, for RNNLM, create_shard.cc is in SINGAfolder/examples/rnnlm</li>
+</ul>
+<p>(c) Create shard</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">singa::DataShard myShard( outputpath, mode);
+</pre></div></div>
+
+<ul>
+  
+<li><tt>string outputpath</tt>, the path where the user wants to create the shard.</li>
+  
+<li><tt>int mode</tt>, one of <tt>kRead</tt>, <tt>kCreate</tt>, or <tt>kAppend</tt>, defined in SINGAfolder/include/utils/data_shard.h. Use <tt>kCreate</tt> when building a new shard (see the sketch below).</li>
+</ul>
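+<p>For instance, a minimal sketch of step 2-(c) could look like the code below; the include path, the <tt>singa::DataShard::kCreate</tt> scoping, and the output path &#x201c;USERDATAfolder/train_shard&#x201d; are assumptions for illustration, so check data_shard.h for the exact declarations:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">#include &lt;string&gt;
+#include "utils/data_shard.h"
+
+// Open (create) the shard directory that a DataLayer will later read from.
+std::string outputpath = "USERDATAfolder/train_shard";  // hypothetical path
+singa::DataShard myShard(outputpath, singa::DataShard::kCreate);
+</pre></div></div>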
+<p><b>3. Store records into the shard</b></p>
+<p>(a) Create a Record and get a pointer to your own record</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">singa::Record record;
+singa::UserRecord *myRecord = record.mutable_user_record();
+</pre></div></div>
+<p>The <tt>mutable_user_record()</tt> method is automatically generated when SINGA is compiled in Step 1-(c).</p>
+<p>(b) Set/Add values into the record</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">myRecord-&gt;add_userVAR1( int_val );
+myRecord-&gt;set_userVAR2( string_val );
+</pre></div></div>
+<p>(c) Store the record to shard</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">myShard.Insert( key, myRecord );
+</pre></div></div>
+
+<ul>
+  
+<li><tt>std::string key</tt>, a unique id for the record; a complete example of step 3 is sketched below.</li>
+</ul>
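+<p>Putting step 3 together, a hedged sketch of the main loop in a user-written create_shard.cc could look like the following. The field names come from the UserRecord defined above; the loop bound <tt>num_samples</tt>, the key format, and the final <tt>Flush()</tt> call are assumptions for illustration, so check data_shard.h for the exact API:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">// Assumes myShard was opened with kCreate as in step 2-(c).
+for (int i = 0; i &lt; num_samples; ++i) {        // num_samples is hypothetical
+    singa::Record record;
+    singa::UserRecord *myRecord = record.mutable_user_record();
+    myRecord-&gt;add_user_var1(int_val);          // add a feature value
+    myRecord-&gt;set_user_var2(string_val);       // set a string value, e.g., a label
+    myShard.Insert(std::to_string(i), record); // the key must be unique per record
+}
+myShard.Flush();  // assumption: persists buffered records to disk
+</pre></div></div>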
+<p><b>Example of RNNLM</b></p>
+<p>You can refer to the RNNLM example in SINGAfolder/examples/rnnlm/.</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">message SingleWordRecord {
+    optional string word = 1;
+    optional int32 word_index = 2;
+    optional int32 class_index =3;`
+}
+
+message Record {
+    optional SingleWordRecord word_record = 4;
+}
+
+make download
+to download raw data from https://www.rnnlm.org
+</pre></div></div>
+<p>In this example, rnnlm-0.4b is used.</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">make create
+</pre></div></div>
+<p>to process the input text file, create the records, and store them into shards.</p>
+<p>We create 3 shards for the training data: class_shard, vocab_shard, and word_shard.</p>
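+<p>As a rough illustration (not the exact code in examples/rnnlm/create_shard.cc), inserting one word into the word shard could look like the sketch below; the shard path, the key, and the sample values are assumptions for illustration:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">singa::DataShard wordShard("word_shard", singa::DataShard::kCreate);  // hypothetical path
+singa::Record record;
+singa::SingleWordRecord *word = record.mutable_word_record();
+word-&gt;set_word("apple");      // the token itself
+word-&gt;set_word_index(1234);   // index of the token in the vocabulary
+word-&gt;set_class_index(7);     // class assigned to this token
+wordShard.Insert("1234", record);
+wordShard.Flush();            // assumption: same API as in the sketch above
+</pre></div></div></div>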
 <div class="section">
 <h3><a name="LMDB"></a>LMDB</h3></div>
 <div class="section">

Modified: websites/staging/singa/trunk/content/docs/neuralnet-partition.html
==============================================================================
--- websites/staging/singa/trunk/content/docs/neuralnet-partition.html (original)
+++ websites/staging/singa/trunk/content/docs/neuralnet-partition.html Tue Jul 28 12:16:16 2015
@@ -409,7 +409,7 @@
 <p>The purpose of partitioning a neural network is to distribute the partitions onto different working units (e.g., threads or nodes, called workers in this article) and parallelize the processing. Another reason for partitioning is to handle a large neural network that cannot be held in a single node. For instance, to train models on high-resolution images we need large neural networks (in terms of training parameters).</p>
 <p>Since <i>Layer</i> is the first-class citizen in SINGA, we partition against layers. Specifically, we support partitioning at two levels. First, users can configure the location (i.e., worker ID) of each layer; in this way, users assign one worker to each layer. Second, for one layer, we can partition its neurons or partition the instances (e.g., images). These are called layer partition and data partition respectively. We illustrate the two types of partition using a simple convolutional neural network.</p>
 <p><img src="../images/conv-mnist.png" style="width: 220px" alt="" /></p>
-<p>The above figure shows a convolutional neural network without any partition. It has 8 layers in total (one rectangular represents one layer). The first layer is DataLayer (data) which reads data from local disk files/databases (or HDFS). The second layer is a MnistLayer which parses the records from MNIST data to get the pixels of a batch of 28 images (each image is of size 28x28). The LabelLayer (label) parses the records to get the label of each image in the batch. The ConvolutionalLayer (conv1) transforms the input image to the shape of 8x27x27. The ReLULayer (relu1) conducts elementwise transformations. The PoolingLayer (pool1) sub-samples the images. The fc1 layer is fully connected with pool1 layer. It mulitplies each image with a weight matrix to generate a 10 dimension hidden feature which is then normalized by a SoftmaxLossLayer to get the prediction.</p>
+<p>The above figure shows a convolutional neural network without any partition. It has 8 layers in total (one rectangle represents one layer). The first layer is DataLayer (data), which reads data from local disk files/databases (or HDFS). The second layer is a MnistLayer, which parses the records from MNIST data to get the pixels of a batch of 8 images (each image is of size 28x28). The LabelLayer (label) parses the records to get the label of each image in the batch. The ConvolutionalLayer (conv1) transforms the input image to the shape of 8x27x27. The ReLULayer (relu1) conducts elementwise transformations. The PoolingLayer (pool1) sub-samples the images. The fc1 layer is fully connected with the pool1 layer. It multiplies each image with a weight matrix to generate a 10-dimensional hidden feature, which is then normalized by a SoftmaxLossLayer to get the prediction.</p>
 <p><img src="../images/conv-mnist-datap.png" style="width: 1000px" alt="" /></p>
 <p>The above figure shows the convolutional neural network after partitioning all layers except the DataLayer and ParserLayers into 3 partitions using data partition. The red layers process 4 images of the batch, and the black and blue layers process 2 images each. Some helper layers, i.e., SliceLayer, ConcateLayer, BridgeSrcLayer, BridgeDstLayer and SplitLayer, are added automatically by our partition algorithm. Layers of the same color reside in the same worker. Data is transferred across workers at the boundary layers (i.e., BridgeSrcLayer and BridgeDstLayer), e.g., between s-slice-mnist-conv1 and d-slice-mnist-conv1.</p>
 <p><img src="../images/conv-mnist-layerp.png" style="width: 1000px" alt="" /></p>