Posted to user@spark.apache.org by Wei Tan <wt...@us.ibm.com> on 2014/07/15 21:38:42 UTC

parallel stages?

Hi, I wonder if I do wordcount on two different files, like this:

val file1 = sc.textFile("/...")
val file2 = sc.textFile("/...")


val wc1 = file1.flatMap(...).reduceByKey(_ + _, 1)
val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1)

wc1.saveAsTextFile("titles.out")
wc2.saveAsTextFile("tables.out")

Would the two reduceByKey stages run in parallel given sufficient 
capacity?

Best regards,
Wei


---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

Re: parallel stages?

Posted by Sean Owen <so...@cloudera.com>.
Yes, but what I show can be done in one Spark job.

On Wed, Jul 16, 2014 at 5:01 AM, Wei Tan <wt...@us.ibm.com> wrote:
> Thanks Sean. In Oozie you can use fork-join; however, when using Oozie to
> drive Spark jobs, the jobs will not be able to share RDDs (am I right? I
> think multiple jobs submitted by Oozie will have different contexts).
>
> I wonder if Spark plans to add more workflow features in the future.

Re: parallel stages?

Posted by Wei Tan <wt...@us.ibm.com>.
Thanks Sean. In Oozie you can use fork-join; however, when using Oozie to 
drive Spark jobs, the jobs will not be able to share RDDs (am I right? I 
think multiple jobs submitted by Oozie will have different contexts).

I wonder if Spark plans to add more workflow features in the future.

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Sean Owen <so...@cloudera.com>
To:     user@spark.apache.org, 
Date:   07/15/2014 04:37 PM
Subject:        Re: parallel stages?



The last two lines are what trigger the operations, and they will each
block until the result is computed and saved. So if you execute this
code as-is, no. You could write a Scala program that invokes these two
operations in parallel, like:

Array((wc1, "titles.out"), (wc2, "tables.out")).par.foreach {
  case (wc, path) => wc.saveAsTextFile(path)
}

It worked for me, and I think it's OK to do this if you know you want to.

On Tue, Jul 15, 2014 at 8:38 PM, Wei Tan <wt...@us.ibm.com> wrote:
> Hi, I wonder if I do wordcount on two different files, like this:
>
> val file1 = sc.textFile("/...")
> val file2 = sc.textFile("/...")
>
>
> val wc1 = file1.flatMap(...).reduceByKey(_ + _, 1)
> val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1)
>
> wc1.saveAsTextFile("titles.out")
> wc2.saveAsTextFile("tables.out")
>
> Would the two reduceByKey stages run in parallel given sufficient capacity?
>
> Best regards,
> Wei
>
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan



Re: parallel stages?

Posted by Sean Owen <so...@cloudera.com>.
The last two lines are what trigger the operations, and they will each
block until the result is computed and saved. So if you execute this
code as-is, no. You could write a Scala program that invokes these two
operations in parallel, like:

Array((wc1, "titles.out"), (wc2, "tables.out")).par.foreach {
  case (wc, path) => wc.saveAsTextFile(path)
}

It worked for me, and I think it's OK to do this if you know you want to.
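The same idea can also be sketched with Scala Futures instead of parallel collections. The following is a minimal stand-alone sketch, not Spark code: `save` is a hypothetical stand-in for a blocking action such as `saveAsTextFile`, used only so the example runs without a cluster.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSaves {
  // Hypothetical stand-in for a blocking Spark action like
  // wc.saveAsTextFile(path); in a real job this would block
  // until the whole stage completes.
  def save(path: String): String = { Thread.sleep(100); path }

  def main(args: Array[String]): Unit = {
    // Kick off both "saves" concurrently; each Future runs on the
    // global execution context's thread pool.
    val jobs = Seq("titles.out", "tables.out").map(p => Future(save(p)))
    // Block until both complete, as the sequential version would.
    val done = Await.result(Future.sequence(jobs), 10.seconds)
    println(done.mkString(","))
  }
}
```

Either way, the key point is the same: the actions must be invoked from separate threads so that both jobs are submitted to the scheduler before either one blocks.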

On Tue, Jul 15, 2014 at 8:38 PM, Wei Tan <wt...@us.ibm.com> wrote:
> Hi, I wonder if I do wordcount on two different files, like this:
>
> val file1 = sc.textFile("/...")
> val file2 = sc.textFile("/...")
>
>
> val wc1 = file1.flatMap(...).reduceByKey(_ + _, 1)
> val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1)
>
> wc1.saveAsTextFile("titles.out")
> wc2.saveAsTextFile("tables.out")
>
> Would the two reduceByKey stages run in parallel given sufficient capacity?
>
> Best regards,
> Wei
>
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan