You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2014/09/19 16:33:33 UTC

[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

    [ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140647#comment-14140647 ] 

Imran Rashid commented on SPARK-2365:
-------------------------------------

This looks fantastic.  I think it will also see heavy use outside of GraphX as well.

I only have one question about the api -- I think the name `IndexedRDD` is more appropriate for an interface, not this concrete implementation.  I can imagine other indexing strategies that would also feel like a `IndexedRDD`.  Its always hard to know when its worth putting in an interface and you will actually need to put in more concrete implementations, but this seems like a good candidate to me.  You could partially deal with this by having a companion object to the interface, which constructs this particular implementation.  (Eg., the way the `IndexedSeq` companion object's `apply()` methods build a `Vector`).

Not that I have any great recommendations for a better name ... `UpdateableHashIndexedRDD` maybe?

my other comments on the design can probably be handled in future work, but while they are on my mind:

# Is there any way to save & load an `IndexedRDD` from hdfs?  It seems like you'll always need to reshuffle the data when you load it if you just do `IndexedRDD(sc.hadoopFile(...))`. It seems we need a way for a hadoop file to be loaded with an "assumed Partitioner" (I thought I opened a ticket for that a while ago, but I can't find it ... I might open another one).  Also you might want some way to load the index from disk as well, though I suppose rebuilding that isn't tooooooo painful.

# I'm wondering about whether we should add a "bulk multiget".  Eg., say you're IndexedRDD is 1 B entries, and you want to look up and process 100K of them in parallel.  You probably don't want to do a sequential scan ... but you also don't want to call `multiget` which will pull the records onto the driver.  Actually, as I'm writing this, I'm realizing there is probably a better way to do this -- you should just make another RDD out of your 100K elements, and then do an innerjoin.  Does that sound right?  We could add a convenience method for this -- but maybe I'm the only one who wants this so its premature to do anything about it.

again, I think this is a fantastic addition!  I'm looking through the code now, but so far it all seems great.

> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org