Posted to user@spark.apache.org by jatinganhotra <ja...@gmail.com> on 2015/10/18 05:38:37 UTC

Checkpointing calls the job twice?

Hi,

I noticed that checkpointing an RDD appears to run the action twice: the
Spark UI shows 2 jobs being executed.

Example:
val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line => line.contains("a"))

Scenario #1:
as.count()        // Only 1 job.

But if I change the code above to the following:

Scenario #2:
as.cache()
as.checkpoint()
as.count()

Here, the Spark UI shows 2 jobs being executed, with durations of 0.9s and
0.4s.

Why are there 2 jobs in scenario #2? In the Spark source code, the comment
for RDD.checkpoint() says the following:
"This function must be called before any job has been executed on this RDD.
It is strongly recommended that this RDD is persisted in memory, otherwise
saving it on a file will require recomputation."

In my example above, I am calling cache() before checkpoint(), so the RDD
will be persisted in memory. Also, both of these calls come before the
count() action, so checkpoint() is called before any job is executed.
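
For anyone who wants to reproduce scenario #2 outside the shell, here is a
minimal sketch as a standalone program. The object name, the app name, the
local[2] master, the variable name linesWithA (in place of "as") and the
listener-based job counter are my own assumptions for illustration; only the
paths and the cache()/checkpoint()/count() sequence come from the example
above. The counter is read after sc.stop() so that queued listener events
have most likely been delivered by then.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Hypothetical wrapper object; not part of the original example.
object CheckpointJobCount {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally with 2 threads.
    val conf = new SparkConf()
      .setAppName("CheckpointJobCount")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Count every job the scheduler starts, instead of reading the Spark UI.
    val jobsStarted = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        jobsStarted.incrementAndGet()
      }
    })

    // Same setup as in the example above.
    val logFile = "/data/pagecounts"
    sc.setCheckpointDir("/checkpoints")
    val logData = sc.textFile(logFile, 2)
    val linesWithA = logData.filter(line => line.contains("a"))

    // Scenario #2: cache and checkpoint are both requested before the action.
    linesWithA.cache()
    linesWithA.checkpoint()
    val matches = linesWithA.count()

    // Stop the context first, then read the counter.
    sc.stop()
    println(s"lines containing 'a': $matches")
    println(s"jobs started: ${jobsStarted.get()}")
  }
}
```

Commenting out the cache() and checkpoint() lines gives scenario #1, so the
two job counts can be compared directly.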



