Posted to dev@singa.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2016/01/02 16:20:39 UTC

[jira] [Commented] (SINGA-97) SINGA-97 Add HDFS Store

    [ https://issues.apache.org/jira/browse/SINGA-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076538#comment-15076538 ] 

ASF subversion and git services commented on SINGA-97:
------------------------------------------------------

Commit 8a07a29462c6d8ad1d2da17da4a018dfc327c121 in incubator-singa's branch refs/heads/master from [~ug93tad]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=8a07a29 ]

SINGA-97 Add HDFS Store

This ticket implements an HDFS Store for reading data from HDFS. It
complements the existing CSV Store, which reads data from CSV files. HDFS is a
popular distributed file system with high (sequential) I/O throughput, so
supporting it is necessary for SINGA to scale.

HDFS usage in SINGA differs from that in standard MapReduce applications.
Specifically, each SINGA worker may train on sequences of records that do not
lie within block boundaries, whereas in MapReduce each Mapper processes a
number of complete blocks. In MapReduce, the runtime engine may fetch and
cache an entire block over the network, knowing that the block will be
processed in full. In SINGA, such a pre-fetching and caching strategy would be
sub-optimal because it wastes I/O and network bandwidth on records that are
never used.
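
To make this access pattern concrete, the sketch below shows how a worker
might seek to its first record and then read records sequentially through the
Store interface proposed in the ticket description quoted below. The Store
class here is a minimal stand-in, and names such as TrainPartition and
records_per_worker are illustrative, not part of SINGA.

    #include <string>

    // Minimal stand-in for the Store interface proposed in the ticket quoted
    // below; the real singa::io::Store declaration may differ.
    class Store {
     public:
      enum Mode { kRead, kCreate };
      virtual ~Store() {}
      virtual bool Open(const std::string& file, Mode mode) = 0;
      virtual bool Close() = 0;
      virtual int Seek(int record_idx) = 0;
      virtual int Read(std::string* content) = 0;
    };

    // Hypothetical worker loop: jump to the worker's first record, which may
    // fall anywhere inside an HDFS block, then consume records sequentially.
    // Only the bytes actually read are pulled from HDFS, unlike block-granular
    // prefetching in MapReduce.
    void TrainPartition(Store* store, int worker_id, int records_per_worker) {
      store->Seek(worker_id * records_per_worker);
      std::string record;
      for (int i = 0; i < records_per_worker; ++i) {
        if (store->Read(&record) <= 0) break;  // end of data or read error
        // ... feed `record` into one training step ...
      }
    }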

We defer I/O optimization to a future ticket.

For the implementation, we choose `libhdfs3` from Pivotal as the HDFS client
library for C++. It is built natively for C++, so it is more optimized and
easier to deploy than the original `libhdfs` library shipped with Hadoop.
libhdfs3 makes extensive use of short-circuit reads to speed up local reads,
and it often complains when this option is not enabled.
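
As a rough illustration (not SINGA code), the sketch below connects to HDFS
through the C API shipped with libhdfs3 and requests short-circuit reads via
client configuration. The host, port, file path, and domain socket path are
placeholders, and the exact configuration keys depend on the cluster setup.

    #include <hdfs/hdfs.h>  // libhdfs3 header; plain "hdfs.h" for Apache libhdfs
    #include <fcntl.h>
    #include <cstdio>

    int main() {
      struct hdfsBuilder* builder = hdfsNewBuilder();
      hdfsBuilderSetNameNode(builder, "namenode");   // placeholder host
      hdfsBuilderSetNameNodePort(builder, 9000);     // placeholder port
      // Short-circuit reads let a client co-located with a DataNode read block
      // data directly from local disk; libhdfs3 warns when this is disabled.
      hdfsBuilderConfSetStr(builder, "dfs.client.read.shortcircuit", "true");
      hdfsBuilderConfSetStr(builder, "dfs.domain.socket.path",
                            "/var/lib/hadoop-hdfs/dn_socket");  // cluster-specific

      hdfsFS fs = hdfsBuilderConnect(builder);  // also frees the builder
      if (fs == NULL) { fprintf(stderr, "connect failed\n"); return 1; }

      hdfsFile file = hdfsOpenFile(fs, "/data/train.dat", O_RDONLY, 0, 0, 0);
      if (file == NULL) { hdfsDisconnect(fs); return 1; }

      char buf[4096];
      tSize n = hdfsRead(fs, file, buf, sizeof(buf));  // read the first chunk
      fprintf(stderr, "read %d bytes\n", (int)n);

      hdfsCloseFile(fs, file);
      hdfsDisconnect(fs);
      return 0;
    }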

Finally, we test the implementation in a distributed environment set up from a
number of Docker containers. We test with both the CIFAR and MNIST examples.


> SINGA-97 Add HDFS Store 
> ------------------------
>
>                 Key: SINGA-97
>                 URL: https://issues.apache.org/jira/browse/SINGA-97
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: Anh Dinh
>            Assignee: Anh Dinh
>
> This ticket implements an HDFS Store for reading data from HDFS. It complements the existing CSV Store, which reads data from CSV files. HDFS is a popular distributed file system with high (sequential) I/O throughput, so supporting it is necessary for SINGA to scale. 
> The implementation will extend the singa::io::Store class, which is declared in `singa/io/store.h`. In particular, it will support the following I/O operations (see the sketch following this quoted description):
> + `bool Open(string& file, Mode mode)`
> + `bool Close()`
> + `bool Flush()`
> + `int Seek(int record_idx)`
> + `int Read(string *content)`
> + `int Write(string& content)`
> HDFS usage in SINGA differs from that in standard MapReduce applications. Specifically, each SINGA worker may train on sequences of records that do not lie within block boundaries, whereas in MapReduce each Mapper processes a number of complete blocks. In MapReduce, the runtime engine may fetch and cache an entire block over the network, knowing that the block will be processed in full. In SINGA, such a pre-fetching and caching strategy would be sub-optimal because it wastes I/O and network bandwidth on records that are never used. 
> We defer I/O optimization to a future ticket. 
> For the implementation, we choose `libhdfs3` from Pivotal as the HDFS client library for C++. It is built natively for C++, so it is more optimized and easier to deploy than the original `libhdfs` library shipped with Hadoop. Finally, we test the implementation in a distributed environment set up from a number of Docker containers (see SINGA-11). 
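
As a rough sketch of the interface listed above (not the actual SINGA code),
an HDFSStore built on libhdfs3 might look as follows. The length-prefixed
record format, the "default" namenode setting, and the member names are
assumptions for illustration only; in SINGA the class would derive from
singa::io::Store.

    #include <hdfs/hdfs.h>
    #include <fcntl.h>
    #include <cstdint>
    #include <string>

    // Sketch only: in SINGA this class would derive from singa::io::Store.
    class HDFSStore {
     public:
      enum Mode { kRead, kCreate };

      bool Open(const std::string& file, Mode mode) {
        struct hdfsBuilder* builder = hdfsNewBuilder();
        // "default" picks up the namenode from the client configuration files.
        hdfsBuilderSetNameNode(builder, "default");
        fs_ = hdfsBuilderConnect(builder);
        if (fs_ == nullptr) return false;
        int flags = (mode == kRead) ? O_RDONLY : O_WRONLY;
        file_ = hdfsOpenFile(fs_, file.c_str(), flags, 0, 0, 0);
        return file_ != nullptr;
      }

      bool Close() {
        if (file_ != nullptr) hdfsCloseFile(fs_, file_);
        if (fs_ != nullptr) hdfsDisconnect(fs_);
        file_ = nullptr; fs_ = nullptr;
        return true;
      }

      bool Flush() { return file_ != nullptr && hdfsFlush(fs_, file_) == 0; }

      // Naive Seek: rewind and skip record_idx records. A real implementation
      // would need fixed-size records or an offset index to avoid re-reading.
      int Seek(int record_idx) {
        if (hdfsSeek(fs_, file_, 0) != 0) return -1;
        std::string skipped;
        for (int i = 0; i < record_idx; ++i)
          if (Read(&skipped) <= 0) return -1;
        return record_idx;
      }

      // Assumes each record is a 4-byte length followed by its payload; this
      // framing is an illustrative choice, not necessarily SINGA's format.
      int Read(std::string* content) {
        uint32_t len = 0;
        if (hdfsRead(fs_, file_, &len, sizeof(len)) != (tSize)sizeof(len)) return 0;
        content->resize(len);
        if (len == 0) return 0;
        return hdfsRead(fs_, file_, &(*content)[0], len);
      }

      int Write(const std::string& content) {
        uint32_t len = static_cast<uint32_t>(content.size());
        hdfsWrite(fs_, file_, &len, sizeof(len));
        return hdfsWrite(fs_, file_, content.data(), len);
      }

     private:
      hdfsFS fs_ = nullptr;
      hdfsFile file_ = nullptr;
    };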



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)