Posted to user@spark.apache.org by Rares Vernica <rv...@gmail.com> on 2015/03/06 19:37:03 UTC

takeSample triggers 2 jobs

Hello,

I am using takeSample from the Scala Spark 1.2.1 shell:

scala> sc.textFile("README.md").takeSample(false, 3)


and I notice that two jobs are generated on the Spark Jobs page:

Job Id  Description
1       takeSample at <console>:13
0       takeSample at <console>:13


Any ideas why the two jobs are needed?

Thanks!
Rares

Re: takeSample triggers 2 jobs

Posted by Denny Lee <de...@gmail.com>.
Hi Rares,

If you dig into the descriptions for the two jobs, you will probably see
something like:

Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...

The corresponding Spark code, from the git copy of master, is at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Line 428 refers to
val initialCount = this.count()

and line 447 refers to
var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

So the first job computes the count, which takeSample needs in order to work
out the sampling fraction, and the second job then draws the sample and
collects it.
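
For a rough picture, here is a simplified sketch of that two-action structure
(this is not the actual Spark source; the takeSampleSketch helper and the 1.2x
oversampling factor are just illustrative assumptions):

import scala.reflect.ClassTag
import scala.util.Random

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TakeSampleSketch {

  // Hypothetical helper mirroring the without-replacement path of takeSample.
  def takeSampleSketch[T: ClassTag](rdd: RDD[T], num: Int,
                                    seed: Long = Random.nextLong()): Array[T] = {
    val rand = new Random(seed)

    // Job 0: count the RDD so a sampling fraction can be derived from num.
    val initialCount = rdd.count()
    if (initialCount == 0L || num <= 0) return Array.empty[T]

    // Oversample slightly so one pass usually yields at least num elements
    // (1.2x is an illustrative choice; Spark computes the fraction more carefully).
    val fraction = math.min(1.0, num.toDouble / initialCount * 1.2)

    // Job 1 (and possibly more): sample and collect, retrying if unlucky.
    var samples = rdd.sample(withReplacement = false, fraction, rand.nextInt()).collect()
    while (samples.length < num) {
      samples = rdd.sample(withReplacement = false, fraction, rand.nextInt()).collect()
    }
    rand.shuffle(samples.toSeq).take(num).toArray
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("takeSample-sketch").setMaster("local[*]"))
    val lines = sc.textFile("README.md")
    takeSampleSketch(lines, 3).foreach(println)
    sc.stop()
  }
}

If you already know a fraction you are happy with, calling
sample(false, fraction).collect() directly skips the counting job, at the cost
of not getting exactly num elements back.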

HTH!
Denny



