Posted to issues@spark.apache.org by "Patrick Wendell (JIRA)" <ji...@apache.org> on 2014/09/22 05:24:34 UTC

[jira] [Comment Edited] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

    [ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142865#comment-14142865 ] 

Patrick Wendell edited comment on SPARK-3622 at 9/22/14 3:24 AM:
-----------------------------------------------------------------

Do you mind clarifying a bit how Hive would use this (maybe with a code example)?

Let's say you had a transformation that went from a single RDD A to two RDDs, B and C. The normal way to do this, if you want to avoid recomputing A, would be to persist it, then use it to derive both B and C (this will do multiple passes over A, but it won't fully recompute A twice).
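For concreteness, here is a minimal sketch of that pattern (the paths and filter predicates are made up, and it assumes sc is an existing SparkContext, e.g. in spark-shell):

{code:scala}
import org.apache.spark.storage.StorageLevel

// RDD A: persisted so that deriving B and C does not rebuild its lineage.
val a = sc.textFile("/tmp/input").persist(StorageLevel.MEMORY_AND_DISK)

// RDD B and RDD C, both derived from the cached A.
val b = a.filter(_.startsWith("b"))
val c = a.filter(_.startsWith("c"))

// Each action is a separate pass over A's cached partitions,
// but A itself is not recomputed from its source.
b.saveAsTextFile("/tmp/out-b")
c.saveAsTextFile("/tmp/out-c")
{code}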

I think that doing this in the general case is not possible by definition. The user might use B and C at different times, so it's not possible to guarantee that A will be computed only once unless you persist A.


was (Author: pwendell):
Do you mind clarifying a bit how Hive would use this (maybe with a code example)? The normal way to do this, if you want to avoid recomputing A, would be to persist it, then use it to derive both B and C (this will do multiple passes over A, but it won't fully recompute A twice).

I think that doing this in the general case is not possible by definition. Let's say you had a transformation that went from a single RDD A to two RDDs, B and C. The user might use B and C at different times, so it's not possible to guarantee that A will be computed only once unless you persist A.

> Provide a custom transformation that can output multiple RDDs
> -------------------------------------------------------------
>
>                 Key: SPARK-3622
>                 URL: https://issues.apache.org/jira/browse/SPARK-3622
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>
> All existing transformations return at most one RDD, even those that take user-supplied functions, such as mapPartitions(). However, sometimes a user-provided function may need to output multiple RDDs, for instance a filter function that divides the input RDD into several RDDs. While it's possible to get multiple RDDs by transforming the same RDD multiple times, it may be more efficient to do this concurrently in one shot, especially if the user's existing function is already generating different data sets.
> This is the case in Hive on Spark, where Hive's map function and reduce function can output different data sets to be consumed by subsequent stages.
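To make the requested scenario concrete, below is a rough sketch of the workaround the description alludes to, assuming a hypothetical user function that tags each record with the data set it belongs to; the requested transformation would instead produce the per-tag RDDs concurrently in one pass, without persisting and re-filtering.

{code:scala}
import org.apache.spark.storage.StorageLevel

val input = sc.textFile("/tmp/input")

// Hypothetical user function: tag each record with the output it belongs to.
val tagged = input.map { line =>
  val tag = if (line.contains("ERROR")) 0 else 1
  (tag, line)
}.persist(StorageLevel.MEMORY_AND_DISK)

// Today: one derived RDD per tag, each requiring a separate filter pass over the cache.
val errors = tagged.filter(_._1 == 0).map(_._2)
val others = tagged.filter(_._1 == 1).map(_._2)
{code}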


