Posted to user@spark.apache.org by Rares Vernica <rv...@gmail.com> on 2015/03/06 19:37:03 UTC
takeSample triggers 2 jobs
Hello,
I am using takeSample from the Scala Spark 1.2.1 shell:
scala> sc.textFile("README.md").takeSample(false, 3)
and I notice that two jobs are generated on the Spark Jobs page:
Job Id Description
1 takeSample at <console>:13
0 takeSample at <console>:13
Any ideas why the two jobs are needed?
Thanks!
Rares
Re: takeSample triggers 2 jobs
Posted by Denny Lee <de...@gmail.com>.
Hi Rares,
If you dig into the descriptions for the two jobs, it will probably return
something like:
Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...
Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...
The relevant code is in the git master copy of RDD.scala:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
Basically, line 428 refers to
val initialCount = this.count()
and line 447 refers to
var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
So the first job computes the count, which is needed to derive the sampling
fraction used by the second job that actually generates the samples.
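The two-pass shape of takeSample can be illustrated with a small sketch on a
plain Scala collection. This is a simplified approximation of the logic, not
the actual Spark implementation; in particular the oversampling factor of 1.2
and the retry loop are assumptions made for illustration.

```scala
import scala.util.Random

// A minimal sketch of takeSample's two-pass logic on a plain Seq
// (an assumed simplification of RDD.scala around lines 428-447,
// not the real Spark code).
def takeSampleSketch[T](data: Seq[T], num: Int, seed: Long = 42L): Seq[T] = {
  val rand = new Random(seed)
  // "Job 0": count the input; Spark must run a full job for this.count().
  val initialCount = data.size
  if (initialCount == 0 || num == 0) return Seq.empty
  if (num >= initialCount) return rand.shuffle(data)
  // Over-sample slightly so the second pass usually yields enough rows
  // (the 1.2 factor is an assumption, not Spark's exact formula).
  val fraction = math.min(1.0, num.toDouble / initialCount * 1.2)
  // "Job 1": Bernoulli sampling without replacement, then collect.
  var samples = data.filter(_ => rand.nextDouble() < fraction)
  // If the sample came back short, resample; in Spark each retry
  // would trigger yet another job.
  while (samples.size < num) {
    samples = data.filter(_ => rand.nextDouble() < fraction)
  }
  rand.shuffle(samples).take(num)
}
```

The count in the first pass is unavoidable because sample() takes a fraction,
not an absolute number of rows, so takeSample must know the total size first.
It also explains why you can occasionally see more than two jobs: if the
probabilistic sample returns fewer rows than requested, another sampling job
runs.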
HTH!
Denny
On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica <rv...@gmail.com> wrote:
> Hello,
>
> I am using takeSample from the Scala Spark 1.2.1 shell:
>
> scala> sc.textFile("README.md").takeSample(false, 3)
>
>
> and I notice that two jobs are generated on the Spark Jobs page:
>
> Job Id Description
> 1 takeSample at <console>:13
> 0 takeSample at <console>:13
>
>
> Any ideas why the two jobs are needed?
>
> Thanks!
> Rares
>