Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/12/31 10:11:49 UTC

[jira] [Commented] (SPARK-1061) allow Hadoop RDDs to be read w/ a partitioner

    [ https://issues.apache.org/jira/browse/SPARK-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075854#comment-15075854 ] 

Sean Owen commented on SPARK-1061:
----------------------------------

Is this still live?

> allow Hadoop RDDs to be read w/ a partitioner
> ---------------------------------------------
>
>                 Key: SPARK-1061
>                 URL: https://issues.apache.org/jira/browse/SPARK-1061
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>
> Using partitioners to get narrow dependencies can save a lot of time by avoiding a shuffle.  However, after saving an RDD to HDFS and then reloading it, all partitioner information is lost.  This means that you can never get a narrow dependency when loading data from Hadoop.
> I think we could get around this by:
> 1) having a modified version of HadoopRDD that keeps track of the original part file (or maybe just prevents splits altogether ...)
> 2) adding an "assumePartition(partitioner: Partitioner, verify: Boolean)" function to RDD.  It would create a new RDD with the exact same data that simply claims the given partitioner has already been applied to it.  If verify=true, it could add a mapPartitionsWithIndex to check that each record is in the right partition.  (A sketch of this idea follows below.)
> http://apache-spark-user-list.1001560.n3.nabble.com/setting-partitioners-with-hadoop-rdds-td976.html
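
For the record, here is a minimal sketch of what idea 2) might look like, assuming Spark's RDD developer API (Partition, Partitioner, TaskContext, compute/getPartitions). The class AssumedPartitionedRDD and the wrapper approach are illustrative only, not actual Spark API:

    import scala.reflect.ClassTag
    import org.apache.spark.{HashPartitioner, Partition, Partitioner, TaskContext}
    import org.apache.spark.rdd.RDD

    // Illustrative wrapper: claims an existing pair RDD is already
    // partitioned by `part`, without moving any data (it keeps a
    // one-to-one, i.e. narrow, dependency on the parent).
    class AssumedPartitionedRDD[K, V](
        prev: RDD[(K, V)],
        part: Partitioner,
        verify: Boolean)(implicit ct: ClassTag[(K, V)])
      extends RDD[(K, V)](prev) {

      // Downstream ops (join, cogroup, ...) see this partitioner and
      // can plan narrow dependencies instead of shuffling.
      override val partitioner: Option[Partitioner] = Some(part)

      // Reuse the parent's partitions unchanged.
      override protected def getPartitions: Array[Partition] = prev.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] = {
        val iter = prev.iterator(split, context)
        if (verify) {
          // Check, record by record, that the claimed partitioning holds.
          iter.map { case kv @ (k, _) =>
            require(part.getPartition(k) == split.index,
              s"key $k is in partition ${split.index} but the partitioner " +
              s"maps it to ${part.getPartition(k)}")
            kv
          }
        } else {
          iter
        }
      }
    }

    // Hypothetical usage (paths and numbers made up for illustration):
    // val reloaded = sc.sequenceFile[String, Int]("hdfs://.../data")
    // val assumed  = new AssumedPartitionedRDD(reloaded,
    //                  new HashPartitioner(100), verify = true)
    // assumed.join(other)  // narrow dependency on `assumed` if `other`
    //                      // uses the same HashPartitioner(100)

Note this only helps if the reloaded RDD's input splits line up one-to-one with the original part files, which is exactly what idea 1) is about; the verify flag exists to catch cases where they don't.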



