Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/10/29 15:05:28 UTC

[jira] [Resolved] (SPARK-11004) MapReduce Hive-like join operations for RDDs

     [ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-11004.
-------------------------------
    Resolution: Won't Fix

For the moment I'd like to consider this concluded, but as in all things, it can be reopened to address a specific change.

> MapReduce Hive-like join operations for RDDs
> --------------------------------------------
>
>                 Key: SPARK-11004
>                 URL: https://issues.apache.org/jira/browse/SPARK-11004
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle
>            Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, predictable way, with graceful failure and recovery.  I have applications that can join two tables without error in Hive, but the same tables, when converted into RDDs, cannot be joined in Spark (I am using the same cluster and have experimented with all of the memory configurations, persisting to disk, checkpointing, etc.; the RDDs are simply too big for Spark on my cluster).
> So, Spark is usually able to handle fairly large RDD joins, but it occasionally runs into problems when the tables are just too big (e.g. the notorious 2 GB shuffle-block limit, memory problems, etc.).  There are so many parameters to adjust (number of partitions, number of cores, memory per core, etc.) that it is difficult to guarantee stability on a shared cluster (say, running YARN) alongside other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, and it would let developers mix and match Spark and MapReduce operations in the same program, rather than writing Hive scripts and stringing together separate Spark and MapReduce programs, which incurs very large overhead converting RDDs to Hive tables and back again.
> Although in-memory operations are where most of Spark's speed gains lie, sometimes running disk-only may help with stability!
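
[Editor's note] As a rough illustration of the idea above, here is a minimal sketch of what such a helper could look like if approximated with existing RDD primitives (DISK_ONLY persistence, checkpointing, and an explicit partitioner) rather than an actual Hadoop MapReduce job. The DiskOnlyJoin object, the mapReduceJoin name, and the default partition count are hypothetical and are not part of Spark; an implementation faithful to the request would instead write both RDDs out, run a MapReduce join job, and read the result back as an RDD.

import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: approximates the requested behaviour with existing
// Spark primitives (DISK_ONLY persistence, checkpointing, an explicit
// partitioner) rather than by handing the join off to a real MapReduce job.
object DiskOnlyJoin {
  def mapReduceJoin[K: ClassTag, V: ClassTag, W](
      left: RDD[(K, V)],
      right: RDD[(K, W)],
      numPartitions: Int = 2000): RDD[(K, (V, W))] = {
    // Keep both inputs on disk rather than in executor memory.
    val l = left.persist(StorageLevel.DISK_ONLY)
    val r = right.persist(StorageLevel.DISK_ONLY)

    // Truncate lineage so recovery does not recompute the full upstream job.
    // Assumes sc.setCheckpointDir(...) was called beforehand.
    l.checkpoint()
    r.checkpoint()

    // Many small partitions keep each shuffle block well under the 2 GB limit.
    val joined = l.join(r, new HashPartitioner(numPartitions))
    joined.persist(StorageLevel.DISK_ONLY)
  }
}

// Usage (hypothetical), in place of myRDD1.join(myRDD2):
// val joined = DiskOnlyJoin.mapReduceJoin(myRDD1, myRDD2, numPartitions = 4000)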



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org