You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Matei Zaharia (JIRA)" <ji...@apache.org> on 2014/09/08 20:25:28 UTC

[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

    [ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125868#comment-14125868 ] 

Matei Zaharia commented on SPARK-2688:
--------------------------------------

Just as a note, to launch multiple Spark actions in parallel, you can just launch them from multiple threads. If you want to do it from one thread, there are the experimental "Async" APIs in AsyncRDDActions (e.g. foreachAsync).

However, even in these cases, it's true that rdd2 would be recomputed -- you need to cache it to prevent this.

> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rrdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn) only trigger pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution the whole graph and reuse rdd2, but there doesn't seem to be a way doing so. Tez already realized the importance of this (TEZ-391), so I think Spark should provide this too.
> This is required for Hive to support multi-insert queries. HIVE-7292.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org