Posted to issues@spark.apache.org by "Krzysztof Zarzycki (JIRA)" <ji...@apache.org> on 2015/06/30 09:13:04 UTC

[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

    [ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607645#comment-14607645 ] 

Krzysztof Zarzycki commented on SPARK-2365:
-------------------------------------------

IndexedRDD seems to be a really great tool, and I'm looking forward to using it. One thing I still find missing is persistence: if IndexedRDD could be backed by an embedded database such as RocksDB or LevelDB, that would be a great enabler. It would allow storing large, long-term key-value application state in IndexedRDD, something that is currently only possible, or at least natively supported, in the Samza framework.

Currently, the only option in Spark Streaming for storing such state is an external DB like Cassandra. But communicating with an external service slows down the whole computation, especially if you want to reprocess your stream. It makes things at least about 10x slower, not to mention that Cassandra gets hit massively, which hurts your other Cassandra applications.


> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL.
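
The three implementation steps in the quoted description (hash-partitioning by key, a hash index inside each partition, and immutable updates) can be illustrated with a small single-process sketch. This is not the IndexedRDD API and does not use Spark at all: the class name `ToyIndexedStore` and its methods are hypothetical, a plain Python dict stands in for the per-partition hash index, and copying one partition on write is only a rough stand-in for the purely functional data structures the proposal mentions (a real persistent structure would also share most of the modified partition).

```python
from typing import Dict, Generic, Iterable, Optional, Tuple, TypeVar

V = TypeVar("V")


class ToyIndexedStore(Generic[V]):
    """Hypothetical in-memory illustration; not the proposed IndexedRDD."""

    def __init__(self, partitions: Tuple[Dict[int, V], ...]) -> None:
        # Each partition is a dict acting as that partition's hash index.
        self._partitions = partitions

    @classmethod
    def from_pairs(cls, pairs: Iterable[Tuple[int, V]],
                   num_partitions: int = 4) -> "ToyIndexedStore[V]":
        parts: Tuple[Dict[int, V], ...] = tuple({} for _ in range(num_partitions))
        for key, value in pairs:
            # (1) Hash-partition by key; a later duplicate overwrites an
            # earlier one, which enforces key uniqueness.
            parts[key % num_partitions][key] = value
        return cls(parts)

    def _partition_of(self, key: int) -> int:
        return key % len(self._partitions)

    def get(self, key: int) -> Optional[V]:
        # (2) Point lookup: pick one partition, then one index probe,
        # instead of scanning every element.
        return self._partitions[self._partition_of(key)].get(key)

    def _with_partition(self, idx: int, part: Dict[int, V]) -> "ToyIndexedStore[V]":
        # (3) Build a new store that shares every untouched partition
        # with the old one; the old store is never mutated.
        return ToyIndexedStore(
            self._partitions[:idx] + (part,) + self._partitions[idx + 1:])

    def put(self, key: int, value: V) -> "ToyIndexedStore[V]":
        idx = self._partition_of(key)
        part = dict(self._partitions[idx])  # copy only the affected partition
        part[key] = value
        return self._with_partition(idx, part)

    def delete(self, key: int) -> "ToyIndexedStore[V]":
        idx = self._partition_of(key)
        part = dict(self._partitions[idx])
        part.pop(key, None)  # deleting a missing key is a no-op
        return self._with_partition(idx, part)
```

Because `put` and `delete` return a new store and leave the old one unchanged, an earlier version stays readable after an update, which is the property that makes such a structure compatible with immutable RDDs.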


