Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/01/05 12:51:39 UTC

[jira] [Resolved] (SPARK-1061) allow Hadoop RDDs to be read w/ a partitioner

     [ https://issues.apache.org/jira/browse/SPARK-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-1061.
------------------------------
    Resolution: Won't Fix

> allow Hadoop RDDs to be read w/ a partitioner
> ---------------------------------------------
>
>                 Key: SPARK-1061
>                 URL: https://issues.apache.org/jira/browse/SPARK-1061
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>
> Using partitioners to get narrow dependencies can avoid a shuffle entirely, which saves an enormous amount of time.  However, after saving an RDD to HDFS and then reloading it, all partitioner information is lost.  This means that you can never get a narrow dependency when loading data from Hadoop.
> I think we could get around this by:
> 1) having a modified version of HadoopRDD that keeps track of which part file each partition came from (or maybe just prevents splits altogether ...)
> 2) adding an "assumePartition(partitioner: Partitioner, verify: Boolean)" function to RDD.  It would create a new RDD with exactly the same data, which simply pretends that the given partitioner has been applied to it.  If verify = true, it could add a mapPartitionsWithIndex to check that each record is in the right partition.  (A sketch of this idea follows the description below.)
> http://apache-spark-user-list.1001560.n3.nabble.com/setting-partitioners-with-hadoop-rdds-td976.html
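
For readers following the linked thread: below is a minimal sketch of what idea 2 might look like as a custom RDD subclass.  Everything here is illustrative, not authoritative: AssumePartitionedRDD is a hypothetical name, not a Spark API, and since this issue was resolved as Won't Fix, nothing like it ships in Spark.  It assumes the reloaded RDD's partitions line up one-to-one with the originally saved part files.

    import org.apache.spark.{Partition, Partitioner, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical sketch: wrap a parent RDD and claim it is partitioned by
    // `part` without moving any data.  If the claim is wrong and verify is
    // false, downstream operations will silently produce wrong results.
    class AssumePartitionedRDD[K, V](
        prev: RDD[(K, V)],
        part: Partitioner,
        verify: Boolean)
      extends RDD[(K, V)](prev) {

      require(part.numPartitions == prev.partitions.length,
        "partitioner must match the parent's partition count")

      // The whole point: report the assumed partitioner so the scheduler
      // can plan narrow dependencies for joins, cogroups, etc.
      override val partitioner: Option[Partitioner] = Some(part)

      override protected def getPartitions: Array[Partition] = prev.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] = {
        val iter = prev.iterator(split, context)
        if (!verify) iter
        else iter.map { case kv @ (k, _) =>
          // Cheap per-record check that the key really belongs here.
          require(part.getPartition(k) == split.index,
            s"key $k found in partition ${split.index} but hashes to partition ${part.getPartition(k)}")
          kv
        }
      }
    }

    // Usage (illustrative; `reloaded` and the joined RDD are assumptions):
    //   val p = new HashPartitioner(reloaded.partitions.length)
    //   val assumed = new AssumePartitionedRDD(reloaded, p, verify = true)
    //   assumed.join(otherRddPartitionedByP)  // narrow dependency, no shuffle

As the description notes, this only helps if each saved part file is read back as exactly one partition; an input format that splits a part file in two breaks the assumed invariant, which is what idea 1 is meant to prevent.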



