You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by jatinganhotra <ja...@gmail.com> on 2015/10/18 05:40:43 UTC

Checkpointing RDD calls the job twice?

Hi,

I noticed that when you checkpoint a given RDD, it results in performing the
action twice as I can see 2 jobs being executed in the Spark UI.

Example:
val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line => line.contains("a"))

Scenario #1:
as.count()        // Only 1 job.

But, if I change the above code to below:

Scenario #2:
as.cache()
as.checkpoint()
as.count()

Here, there are 2 jobs being executed as shown in the Spark UI, with
duration 0.9s and 0.4s

Why are there 2 jobs in scenario #2? In Spark source code, the comment for
RDD.checkpoint() says the following - 
"This function must be called before any job has been executed on this RDD.
It is strongly recommended that this RDD is persisted in memory, otherwise
saving it on a file will require recompilation."

In my example above, I am calling cache() before checkpoint(), so RDD will
be persisted in memory. Also, both of the above calls are before the count()
action, so checkpoint() is called before any job execution.

What am I missing here? I have looked further into source code to understand
what could be wrong, but found nothing.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Checkpointing-RDD-calls-the-job-twice-tp14666.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org