Posted to dev@singa.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2015/10/07 09:26:26 UTC

[jira] [Commented] (SINGA-82) Refactor input layers using data store abstraction

    [ https://issues.apache.org/jira/browse/SINGA-82?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946427#comment-14946427 ] 

ASF subversion and git services commented on SINGA-82:
------------------------------------------------------

Commit d99b24cb75def9fdbdc59273c4297abb75813c36 in incubator-singa's branch refs/heads/master from [~flytosky]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=d99b24c ]

SINGA-82 Refactor input layers using data store abstraction

Add Store abstraction for reading (writing) data. Implemented two backends:

1. KVFile, formerly named DataShard. It is a binary file in which each
tuple has a unique key.
2. TextFile, which is a plain text file with each line being the value
field of a tuple (the key is the line number).

TODO: implement HDFS and image folder backends.


> Refactor input layers using data store abstraction
> --------------------------------------------------
>
>                 Key: SINGA-82
>                 URL: https://issues.apache.org/jira/browse/SINGA-82
>             Project: Singa
>          Issue Type: Improvement
>            Reporter: wangwei
>            Assignee: wangwei
>
> 1. Separate the data storage from Layer. Currently, SINGA creates one layer to read data from one storage, e.g., DataShard, CSV, LMDB. One problem is that only read operations are provided. When users prepare the training data, they have to become familiar with the read/write operations of each storage. Inspired by caffe::db::DB, we can provide a storage abstraction with simple read/write operation interfaces. Users then call these operations to prepare their training data. In particular, training data is stored as (string key, string value) tuples. The base Store class:
> {code}
> // open the store for reading, writing or appending
> virtual bool Open(const string& source, Mode mode);
> // for reading tuples
> virtual bool Read(string* key, string* value) = 0;
> // for writing tuples
> virtual bool Write(const string& key, const string& value) = 0;
> {code}
> The specific storage, e.g., CSV, LMDB, image folder or HDFS (will be supported soon), inherits Store and overrides the functions. 
> Consequently, a single KVInputLayer (like the SequenceFile.Reader from Hadoop) can read from different sources by configuring the *store* field (e.g., store=csv). 
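As an illustration of the abstraction above, a TextFile-backed store might be sketched as follows. This is a hypothetical sketch, not SINGA's actual code: the Mode enum values, the Close-less lifecycle, and the TextFileStore class name are assumptions.

```cpp
#include <cassert>
#include <fstream>
#include <string>

enum class Mode { kRead, kWrite, kAppend };

// Base Store abstraction: tuples of (string key, string value).
class Store {
 public:
  virtual ~Store() {}
  virtual bool Open(const std::string& source, Mode mode) = 0;
  virtual bool Read(std::string* key, std::string* value) = 0;
  virtual bool Write(const std::string& key, const std::string& value) = 0;
};

// TextFileStore: each line is the value field of one tuple;
// the key is the line number.
class TextFileStore : public Store {
 public:
  bool Open(const std::string& source, Mode mode) override {
    if (mode == Mode::kRead) {
      in_.open(source);
      return in_.is_open();
    }
    out_.open(source, mode == Mode::kAppend ? std::ios::app : std::ios::out);
    return out_.is_open();
  }

  bool Read(std::string* key, std::string* value) override {
    std::string line;
    if (!std::getline(in_, line)) return false;  // end of file
    *key = std::to_string(line_no_++);           // key is the line number
    *value = line;
    return true;
  }

  bool Write(const std::string& key, const std::string& value) override {
    // The key is implicit (the line number), so only the value is stored.
    out_ << value << '\n';
    out_.flush();
    return static_cast<bool>(out_);
  }

 private:
  std::ifstream in_;
  std::ofstream out_;
  int line_no_ = 0;
};
```

A CSV- or LMDB-backed store would subclass Store the same way, overriding only the three virtual functions.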
> With the Store class, we can implement a KVInputLayer to read batchsize tuples in its ComputeFeature function. Each tuple is parsed by a virtual function whose implementation depends on the application (or the format of the tuple). 
> {code}
> // parse the tuple as the k-th instance for one mini-batch
> virtual bool Parse(int k, const string& key, const string& tuple) = 0;
> {code}
> For example, a CSVKVInputLayer may parse the key into a line ID, and parse the label and feature from the value field. An ImageKVInputLayer may parse a SingleLabelImageRecord from the value field.
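To make the Parse contract concrete, here is a hypothetical sketch of a CSV-style parser. The CSVParser class, its constructor arguments, and the "label,f1,f2,..." value layout are illustrative assumptions, not the actual CSVKVInputLayer interface.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parses one tuple as the k-th instance of a mini-batch.
// Assumed value layout: "label,f1,f2,...,fdim"; the key is the line ID.
class CSVParser {
 public:
  CSVParser(int batchsize, int dim)
      : labels_(batchsize), features_(batchsize, std::vector<float>(dim)) {}

  // Mirrors: virtual bool Parse(int k, const string& key, const string& tuple)
  bool Parse(int k, const std::string& key, const std::string& tuple) {
    std::istringstream ss(tuple);
    std::string field;
    // First field is the label.
    if (!std::getline(ss, field, ',')) return false;
    labels_[k] = std::stoi(field);
    // Remaining fields are the feature vector.
    for (float& f : features_[k]) {
      if (!std::getline(ss, field, ',')) return false;  // too few fields
      f = std::stof(field);
    }
    return true;
  }

  int label(int k) const { return labels_[k]; }
  const std::vector<float>& feature(int k) const { return features_[k]; }

 private:
  std::vector<int> labels_;
  std::vector<std::vector<float>> features_;
};
```

An image-oriented parser would instead deserialize a record (e.g., a SingleLabelImageRecord) from the value field, leaving the surrounding batching logic unchanged.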
> 2. There will be a set of layers for data preprocessing, e.g., normalization and image augmentation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)