Posted to commits@samza.apache.org by "Vinoth Chandar (JIRA)" <ji...@apache.org> on 2015/03/31 17:28:53 UTC

[jira] [Updated] (SAMZA-622) Persisting Samza State on HDFS

     [ https://issues.apache.org/jira/browse/SAMZA-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated SAMZA-622:
---------------------------------
    Description: 
Samza's state currently lives in Kafka as a compacted changelog and in a local RocksDB key-value store.

It would be nice to save this state onto HDFS directly, for the following reasons:

- HDFS is a fault-tolerant FS. Thus, a Samza task can be restarted by scheduling it on a node where another copy of its state resides.
- HDFS virtualizes storage, so one would not have to worry explicitly about balancing disk usage across different tiers (I don't know what the right word is) in a data flow graph.
- Storing the state in HDFS makes it easier to share it with other processing systems in Hadoop land.

RocksDB seems to have an option to store files on HDFS: https://github.com/facebook/rocksdb/tree/master/hdfs (has anyone played with this?).

Context: I am working on producing compacted DB snapshots on HDFS for Spark/MR jobs to use, and am thus super interested in this.

  was:
It would be nice to be able to read/write from HDFS, particularly for bootstrapping purposes.  A few points:

* Per the discussion [about leveldb|https://issues.apache.org/jira/browse/SAMZA-236?focusedCommentId=13985982&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13985982] this support should be separated into its own package and project (jar) for easy testing and severability.
* Similar to the Kafka RegexTopicGenerator, we can enumerate (recursively or not) the files in an HDFS directory during job startup.
* Connectivity with HCatalog would be interesting as well, but should be handled in a separate JIRA.


> Persisting Samza State on HDFS
> ------------------------------
>
>                 Key: SAMZA-622
>                 URL: https://issues.apache.org/jira/browse/SAMZA-622
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Vinoth Chandar
>            Assignee: Jakob Homan


