Posted to user@spark.apache.org by satishl <sa...@gmail.com> on 2017/06/27 00:10:25 UTC
Question about Parallel Stages in Spark
For the code below, since rdd1 and rdd2 don't depend on each other, I was
expecting the "first" and "second" printlns to be interwoven. However, the
Spark job runs all "first" statements and then all "second" statements in
serial fashion. I have set spark.scheduler.mode = FAIR. Obviously my
understanding of parallel stages is wrong. What am I missing?
val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = sc.parallelize(1 to 1000000)
for (i <- 1 to 100)
  println("first: " + rdd1.sum())
for (i <- 1 to 100)
  println("second: " + rdd2.sum())
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Parallel-Stages-in-Spark-tp28793.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: Question about Parallel Stages in Spark
Posted by Bryan Jeffrey <br...@gmail.com>.
Hello.
The driver is running the individual operations in series, but each
operation is parallelized internally. If you want the operations themselves
to run in parallel, you need to give the driver a mechanism to submit the
jobs from multiple threads:
import scala.collection.parallel.mutable.ParArray
import org.apache.spark.rdd.RDD

val rdd1 = sc.parallelize(1 to 100000)
val rdd2 = sc.parallelize(1 to 200000)
// .par gives a parallel collection, so foreach runs on multiple threads
val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par
thingsToDo.foreach { case (rdd, index) =>
  for (i <- 1 to 10000)
    println(s"Index ${index} - ${rdd.sum()}")  // or your logger of choice
}
This will run both operations in parallel.
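Another common way to get the same effect is Scala Futures: wrap each
blocking action in a Future so both are in flight before the driver waits
for either result. The sketch below is self-contained plain Scala (the
`job` function stands in for a blocking Spark action like rdd.sum(), and
the names `ParallelJobs`/`runBoth` are just illustrative); in a real Spark
app you would wrap the actual RDD actions the same way.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelJobs {
  // Stand-in for a blocking Spark action such as rdd.sum(): wrapping it
  // in a Future submits it on a pool thread instead of the caller's thread.
  def job(n: Int): Future[Long] = Future {
    (1L to n).sum
  }

  def runBoth(n: Int): Seq[Long] = {
    val first  = job(n)  // submitted immediately, runs in the background
    val second = job(n)  // also in flight; does not wait on `first`
    // Only now does the driver thread block, waiting for both results.
    Await.result(Future.sequence(Seq(first, second)), 1.minute)
  }

  def main(args: Array[String]): Unit =
    println(ParallelJobs.runBoth(1000000))
}
```

With Spark actions inside the Futures, both jobs reach the scheduler
concurrently, which is also the situation where spark.scheduler.mode = FAIR
actually matters: FAIR controls how resources are shared between jobs that
have been submitted concurrently, but it cannot parallelize jobs that a
single driver thread submits one after another.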