Posted to user@spark.apache.org by Wei Tan <wt...@us.ibm.com> on 2014/07/15 21:38:42 UTC
parallel stages?
Hi, I wonder if I do wordcount on two different files, like this:
val file1 = sc.textFile("/...")
val file2 = sc.textFile("/...")
val wc1 = file1.flatMap(...).reduceByKey(_ + _, 1)
val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1)
wc1.saveAsTextFile("titles.out")
wc2.saveAsTextFile("tables.out")
Would the two reduceByKey stages run in parallel given sufficient
capacity?
Best regards,
Wei
---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan
Re: parallel stages?
Posted by Sean Owen <so...@cloudera.com>.
Yes, but what I show can be done in one Spark job.
On Wed, Jul 16, 2014 at 5:01 AM, Wei Tan <wt...@us.ibm.com> wrote:
> Thanks Sean. In Oozie you can use fork-join; however, when using Oozie to
> drive Spark jobs, the jobs will not be able to share RDDs (am I right? I
> think multiple jobs submitted by Oozie will have different contexts).
>
> I wonder if Spark will add more workflow features in the future.
Re: parallel stages?
Posted by Wei Tan <wt...@us.ibm.com>.
Thanks Sean. In Oozie you can use fork-join; however, when using Oozie to
drive Spark jobs, the jobs will not be able to share RDDs (am I right? I
think multiple jobs submitted by Oozie will have different contexts).
I wonder if Spark will add more workflow features in the future.
Best regards,
Wei
---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan
From: Sean Owen <so...@cloudera.com>
To: user@spark.apache.org,
Date: 07/15/2014 04:37 PM
Subject: Re: parallel stages?
The last two lines are what trigger the operations, and they will each
block until the result is computed and saved. So if you execute this
code as-is, no. You could write a Scala program that invokes these two
operations in parallel, like:
Array((wc1, "titles.out"), (wc2, "tables.out")).par.foreach {
  case (wc, path) => wc.saveAsTextFile(path)
}
It worked for me, and I think it's OK to do this if you know that's what
you want.
On Tue, Jul 15, 2014 at 8:38 PM, Wei Tan <wt...@us.ibm.com> wrote:
> Hi, I wonder if I do wordcount on two different files, like this:
>
> val file1 = sc.textFile("/...")
> val file2 = sc.textFile("/...")
>
>
> val wc1 = file1.flatMap(...).reduceByKey(_ + _, 1)
> val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1)
>
> wc1.saveAsTextFile("titles.out")
> wc2.saveAsTextFile("tables.out")
>
> Would the two reduceByKey stages run in parallel given sufficient
> capacity?
>
> Best regards,
> Wei
>
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
Re: parallel stages?
Posted by Sean Owen <so...@cloudera.com>.
The last two lines are what trigger the operations, and they will each
block until the result is computed and saved. So if you execute this
code as-is, no. You could write a Scala program that invokes these two
operations in parallel, like:
Array((wc1, "titles.out"), (wc2, "tables.out")).par.foreach {
  case (wc, path) => wc.saveAsTextFile(path)
}
It worked for me, and I think it's OK to do this if you know that's what
you want.
On Tue, Jul 15, 2014 at 8:38 PM, Wei Tan <wt...@us.ibm.com> wrote:
> Hi, I wonder if I do wordcount on two different files, like this:
>
> val file1 = sc.textFile("/...")
> val file2 = sc.textFile("/...")
>
>
> val wc1 = file1.flatMap(...).reduceByKey(_ + _, 1)
> val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1)
>
> wc1.saveAsTextFile("titles.out")
> wc2.saveAsTextFile("tables.out")
>
> Would the two reduceByKey stages run in parallel given sufficient capacity?
>
> Best regards,
> Wei
>
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
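[Editor's note] For reference, a complete, minimal sketch of the approach discussed above. The object name, input paths, and the whitespace-splitting word count body are hypothetical (the thread elides them); the parallel-save pattern is Sean's. The key point is that both saves are submitted from one SparkContext, from parallel threads via `.par`, so the scheduler can run the two independent stage graphs concurrently if the cluster has capacity — something separately launched Oozie jobs, each with its own context, cannot do.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone app: two independent word counts in one
// Spark application, saved concurrently.
object TwoCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TwoCounts"))

    // Paths and tokenization are placeholders for the elided originals.
    val wc1 = sc.textFile("/path/to/titles")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 1)

    val wc2 = sc.textFile("/path/to/tables")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 1)

    // saveAsTextFile is a blocking action; running the two actions
    // sequentially would serialize the jobs. Submitting them from a
    // parallel collection runs each action on its own thread, so the
    // scheduler can execute both jobs at once.
    Array((wc1, "titles.out"), (wc2, "tables.out")).par.foreach {
      case (wc, path) => wc.saveAsTextFile(path)
    }

    sc.stop()
  }
}
```

Note that nothing else is needed to enable this: a single SparkContext is thread-safe for job submission, and independent jobs submitted from different threads can share cached RDDs within the application.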