Posted to commits@singa.apache.org by wa...@apache.org on 2015/07/28 14:15:58 UTC

svn commit: r1693077 - in /incubator/singa/site/trunk/content/markdown/docs: data.md neuralnet-partition.md

Author: wangwei
Date: Tue Jul 28 12:15:58 2015
New Revision: 1693077

URL: http://svn.apache.org/r1693077
Log:
add docs for data preparation from Chonho

Modified:
    incubator/singa/site/trunk/content/markdown/docs/data.md
    incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md

Modified: incubator/singa/site/trunk/content/markdown/docs/data.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/data.md?rev=1693077&r1=1693076&r2=1693077&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/data.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/data.md Tue Jul 28 12:15:58 2015
@@ -1,18 +1,110 @@
 ## Data Preparation
 
-To submit a training job, users need to convert raw data (e.g., images, text
-documents) into records that can be recognized by SINGA. SINGA uses a DataLayer
-to load these records into memory and uses ParserLayer to parse features (e.g.,
-image pixels and labels) from these records. The records could be organized and
-stored using many different ways, e.g., using a light database, or a file or
-HDFS, as long as there is a corresponding DataLayer that can load the records.
+To submit a training job, users need to convert raw data (e.g., images, text documents) into records that can be recognized by SINGA. SINGA uses a DataLayer
+to load these records into memory and a ParserLayer to parse features (e.g., image pixels and labels) from them. The records can be organized and
+stored in many different ways, e.g., in a file, a light database, or HDFS, as long as there is a corresponding DataLayer that can load them.
 
 ### DataShard
 
+To create a shard for your own data, you may need to implement or modify the following files:
+  
+- common.proto
+- create_shard.cc
+- Makefile
 
+**1. Define record**
+
+The Record class inherits from the Message class, whose format follows Google Protocol Buffers. Please refer to the [Tutorial][1].
+
+Your record is defined in SINGAfolder/src/proto/common.proto.
+
+(a) Define the record
+
+    message UserRecord {
+        repeated int32 userVAR1 = 1;   // 1 is a unique field tag
+        optional string userVAR2 = 2;  // 2 is a unique field tag
+        ...
+    }
+
+(b) Declare your own record inside Record
+
+    message Record {
+        optional UserRecord user_record = 1; // 1 is a unique field tag
+        ...
+    }
+
+(c) Compile SINGA
+
+    cd SINGAfolder
+    ./configure
+    make
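+
+After compilation, protoc generates a C++ class with accessor methods for
+each field. A rough sketch of the interface it would generate for the
+UserRecord example above (illustrative only; note that protoc lowercases
+field names when naming the C++ accessors):
+
+    // Sketch of the generated interface (not actual SINGA code).
+    class UserRecord : public google::protobuf::Message {
+     public:
+        void add_uservar1(int32_t value);         // append to repeated userVAR1
+        int uservar1_size() const;                // number of userVAR1 entries
+        void set_uservar2(const std::string& v);  // set optional userVAR2
+        const std::string& uservar2() const;      // read optional userVAR2
+    };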
+
+
+**2. Create shard**
+
+(a) Create a folder for your dataset, e.g., call it "USERDATAfolder".
+
+(b) Put the source files for creating the shard in SINGAfolder/USERDATAfolder/
+
+- In the RNNLM example, for instance, create_shard.cc is in SINGAfolder/examples/rnnlm
+
+(c) Create shard
+
+    singa::DataShard myShard( outputpath, mode);
+
+- `string outputpath` is the path where the user wants to create the shard.
+- `int mode` is one of `kRead`, `kCreate`, and `kAppend`, defined in SINGAfolder/include/utils/data_shard.h. A minimal sketch of opening a shard for writing follows below.
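+
+For instance, a minimal sketch of opening a shard for writing, assuming the
+mode constants are scoped inside the DataShard class as in the bundled
+examples (the output path is illustrative):
+
+    #include "utils/data_shard.h"
+
+    // Open (or create) a shard directory to write new records into.
+    singa::DataShard myShard("USERDATAfolder/train_shard",
+                             singa::DataShard::kCreate);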
+
+
+**3. Store record into shard**
+
+(a) Create a Record and obtain a pointer to your user-defined record
+
+    singa::Record record;
+    singa::UserRecord *myRecord = record.mutable_user_record();
+
+The `mutable_user_record()` method is automatically generated when SINGA is compiled in Step 1-(c).
+
+(b) Set/Add values into the record
+
+    myRecord->add_uservar1( int_val );    // accessors use the lowercased field name
+    myRecord->set_uservar2( string_val );
+
+(c) Store the record to shard
+
+    myShard.Insert( key, record );
+
+- `string key` is a unique id for the message
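+
+Putting the steps together, a minimal sketch of a complete create_shard.cc,
+assuming the header paths, output path, key, and values shown here (the
+bundled examples under SINGAfolder/examples/ are the authoritative reference):
+
+    #include <string>
+    #include "utils/data_shard.h"
+    #include "proto/common.pb.h"  // assumed header generated from common.proto
+
+    int main() {
+        // Step 2: open a shard directory for writing (illustrative path).
+        singa::DataShard myShard("USERDATAfolder/train_shard",
+                                 singa::DataShard::kCreate);
+        // Step 3(a): create a Record and get the user-defined sub-record.
+        singa::Record record;
+        singa::UserRecord* myRecord = record.mutable_user_record();
+        // Step 3(b): fill in the fields (illustrative values).
+        myRecord->add_uservar1(42);
+        myRecord->set_uservar2("hello");
+        // Step 3(c): insert the record under a unique key.
+        myShard.Insert("record_0", record);
+        myShard.Flush();  // flush buffered writes (assumed DataShard API)
+        return 0;
+    }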
+
+**Example of RNNLM**
+
+You can refer to the RNNLM example at SINGAfolder/examples/rnnlm/
+
+    message SingleWordRecord {
+        optional string word = 1;
+        optional int32 word_index = 2;
+        optional int32 class_index = 3;
+    }
+
+    message Record {
+        optional SingleWordRecord word_record = 4;
+    }
+
+Run
+
+    make download
+
+to download the raw data from https://www.rnnlm.org. In this example, rnnlm-0.4b is used.
+
+Run
+
+    make create
+
+to process the input text files, create records, and store them into shards.
+
+Three shards are created for the training data: class_shard, vocab_shard, and word_shard.
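+
+For illustration, a hedged sketch of how a single word might be stored into
+word_shard using the SingleWordRecord above (paths, key format, and values
+are illustrative; see examples/rnnlm/create_shard.cc for the actual code):
+
+    // wordShard opened as in Step 2 (path is illustrative).
+    singa::DataShard wordShard("rnnlm_data/word_shard",
+                               singa::DataShard::kCreate);
+    singa::Record record;
+    singa::SingleWordRecord* w = record.mutable_word_record();
+    w->set_word("apple");     // the word itself
+    w->set_word_index(42);    // index of the word in the vocabulary
+    w->set_class_index(3);    // index of the class the word belongs to
+    wordShard.Insert("00042", record);  // key: an illustrative unique id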
 
 ### LMDB
 
 
 
 ### HDFS
+
+
+  [1]: https://developers.google.com/protocol-buffers/docs/cpptutorial
\ No newline at end of file

Modified: incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md?rev=1693077&r1=1693076&r2=1693077&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md Tue Jul 28 12:15:58 2015
@@ -22,7 +22,7 @@ The above figure shows a convolutional n
 has 8 layers in total (one rectangle represents one layer). The first layer is
 DataLayer (data) which reads data from local disk files/databases (or HDFS). The second layer
 is a MnistLayer which parses the records from MNIST data to get the pixels of a batch
-of 28 images (each image is of size 28x28). The LabelLayer (label) parses the records to get the label
+of 8 images (each image is of size 28x28). The LabelLayer (label) parses the records to get the label
 of each image in the batch. The ConvolutionalLayer (conv1) transforms the input image to the
 shape of 8x27x27. The ReLULayer (relu1) conducts elementwise transformations. The PoolingLayer (pool1)
 sub-samples the images. The fc1 layer is fully connected with pool1 layer. It