Posted to commits@samza.apache.org by "Vinoth Chandar (JIRA)" <ji...@apache.org> on 2015/04/01 22:54:53 UTC
[jira] [Commented] (SAMZA-622) Persisting Samza State on HDFS
[ https://issues.apache.org/jira/browse/SAMZA-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391443#comment-14391443 ]
Vinoth Chandar commented on SAMZA-622:
--------------------------------------
[~criccomini] Yes, I started off by cloning SAMZA-263.
Anyway, let me start by trying to get RocksDB to write into HDFS; then we can work our way back to JNI land. I will look at the other tickets too.
P.S.: I have an in-place file rewrite version of Parquet storage, but it will perform very badly in any sort of high-throughput scenario. :P
> Persisting Samza State on HDFS
> ------------------------------
>
> Key: SAMZA-622
> URL: https://issues.apache.org/jira/browse/SAMZA-622
> Project: Samza
> Issue Type: Improvement
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
>
> Samza's state currently lives in Kafka as a (compacted) change log and in a local RocksDB KV store.
> It would be nice to save this onto HDFS directly, for the following reasons:
> - HDFS is a fault-tolerant FS. Thus, restarting Samza tasks can be achieved by placing the task where the other copies are.
> - HDFS virtualizes storage and thus one would not have to worry explicitly about balancing disk usage across different tiers (I don't know what the right word is) in a data flow graph.
> - Storing the state in HDFS makes it easier to share it with other processing systems in Hadoop land.
> RocksDB seems to have an option to store files on HDFS: https://github.com/facebook/rocksdb/tree/master/hdfs (Has anyone played with this?)
> Context: I am working on producing compacted DB snapshots on HDFS for Spark/MR jobs to use, and thus am super interested in this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)