Posted to user@spark.apache.org by jatinganhotra <ja...@gmail.com> on 2015/10/18 05:38:37 UTC
Checkpointing calls the job twice?
Hi,
I noticed that checkpointing an RDD seems to run the action twice: the
Spark UI shows 2 jobs being executed.
Example:
val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line => line.contains("a"))
Scenario #1:
as.count() // Only 1 job.
But if I change the code above as follows:
Scenario #2:
as.cache()
as.checkpoint()
as.count()
Here, the Spark UI shows 2 jobs being executed, with durations of 0.9s and
0.4s respectively.
Why are there 2 jobs in scenario #2? In the Spark source code, the comment
for RDD.checkpoint() says the following:
"This function must be called before any job has been executed on this RDD.
It is strongly recommended that this RDD is persisted in memory, otherwise
saving it on a file will require recomputation."
In my example above, I am calling cache() before checkpoint(), so the RDD
should be persisted in memory. Also, both of these calls come before the
count() action, so checkpoint() is called before any job has been executed.
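For anyone who wants to reproduce this without reading the count off the Spark UI, scenario #2 can be wrapped in a small standalone program that counts job starts with a SparkListener. This is only a minimal sketch under a few assumptions that are not part of my setup above: it runs with a local[2] master, uses a small in-memory dataset instead of /data/pagecounts, writes checkpoints to a temporary directory instead of /checkpoints, and the names CheckpointJobCount, run and jobsStarted are made up for illustration.

```scala
import java.nio.file.Files
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

object CheckpointJobCount {

  // Returns (result of count(), number of jobs started, whether the RDD was checkpointed).
  def run(): (Long, Int, Boolean) = {
    // Assumption: local master with 2 threads, not the cluster from the post.
    val conf = new SparkConf().setAppName("checkpoint-job-count").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Count every job the scheduler reports as started.
    val jobsStarted = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        jobsStarted.incrementAndGet()
    })

    // Assumption: a throwaway temp directory stands in for /checkpoints.
    sc.setCheckpointDir(Files.createTempDirectory("checkpoints").toString)

    // Assumption: a tiny in-memory dataset stands in for /data/pagecounts.
    val logData = sc.parallelize(Seq("apple", "banana", "cherry", "kiwi"), 2)
    val as = logData.filter(line => line.contains("a"))

    // Same sequence as scenario #2.
    as.cache()
    as.checkpoint()
    val n = as.count()
    val checkpointed = as.isCheckpointed

    // Listener events are delivered asynchronously; stopping the context
    // lets the listener bus process the events already queued.
    sc.stop()

    (n, jobsStarted.get(), checkpointed)
  }

  def main(args: Array[String]): Unit = {
    val (n, jobs, checkpointed) = run()
    println(s"count = $n, jobs started = $jobs, checkpointed = $checkpointed")
  }
}
```

If the reported number of jobs is 2, that would match what I see in the UI. Removing the as.checkpoint() line and comparing the counter should show whether the extra job comes from checkpointing.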