Posted to commits@samza.apache.org by "Vinoth Chandar (JIRA)" <ji...@apache.org> on 2015/04/01 22:54:53 UTC

[jira] [Commented] (SAMZA-622) Persisting Samza State on HDFS

    [ https://issues.apache.org/jira/browse/SAMZA-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391443#comment-14391443 ] 

Vinoth Chandar commented on SAMZA-622:
--------------------------------------

[~criccomini] Yes, I started off by cloning SAMZA-263.

Anyway, let me start by trying to get RocksDB to write into HDFS; then we can work our way back to JNI land. I will look at the other tickets too.

P.S.: I have an in-place file rewrite version of Parquet storage, but it will perform very badly in any sort of high-throughput scenario. :P

> Persisting Samza State on HDFS
> ------------------------------
>
>                 Key: SAMZA-622
>                 URL: https://issues.apache.org/jira/browse/SAMZA-622
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>
> Samza's state currently lives in Kafka as a (compacted) changelog and in a local RocksDB KV store.
> It would be nice to save this onto HDFS directly, for the following reasons:
> - HDFS is a fault-tolerant FS. Thus, restarting Samza tasks can be achieved by locating the task on a node where the other copies are.
> - HDFS virtualizes storage, so one would not have to worry explicitly about balancing disk usage across different tiers (I don't know what the right word is) in a data flow graph.
> - Storing the state in HDFS makes it easier to share it with other processing systems in Hadoop land.
> RocksDB seems to have an option to store files on HDFS: https://github.com/facebook/rocksdb/tree/master/hdfs (has anyone played with this?).
> Context: I am working on producing compacted DB snapshots on HDFS for Spark/MR jobs to use, and am thus super interested in this.


