Posted to user@spark.apache.org by Bernard Jesop <be...@gmail.com> on 2017/07/13 15:35:46 UTC
underlying checkpoint
Hi everyone, I just tried this simple program:

import org.apache.spark.sql.SparkSession

object CheckpointTest extends App {
  val spark = SparkSession
    .builder()
    .appName("Toto")
    .getOrCreate()
  spark.sparkContext.setCheckpointDir(".")
  val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
  df.show()
  df.rdd.checkpoint()
  println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")
}

But the result is still "not checkpointed".
Do you have any idea why? (knowing that the checkpoint file is created)
Best regards,
Bernard JESOP
RE: underlying checkpoint
Posted by "Mendelson, Assaf" <As...@rsa.com>.
Actually, show is an action.
The issue is that unless there is an aggregation, show only evaluates part of the dataframe, not all of it, so the checkpoint is not materialized (similar to what happens with cache()).
You need an action that goes over the entire dataframe, which count() does.
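The partial-evaluation behaviour described here can be illustrated without Spark at all. The following is a plain-Scala sketch (the names `LazyEvalDemo` and `evaluatedBy` are made up for this example): a lazy iterator plays the role of the DataFrame, consuming only a prefix plays the role of show(), and a full traversal plays the role of count().

```scala
object LazyEvalDemo {
  // Returns how many of the four elements were actually evaluated
  // by the given consumer of the lazy iterator.
  def evaluatedBy(consume: Iterator[String] => Unit): Int = {
    var evaluated = 0
    val rows = List("Scala", "Python", "R", "Java").iterator.map { r =>
      evaluated += 1 // side effect: count each element as it is materialized
      r
    }
    consume(rows)
    evaluated
  }

  def main(args: Array[String]): Unit = {
    // Like show(): only a prefix of the data is materialized.
    println(evaluatedBy(_.take(2).foreach(_ => ()))) // prints 2
    // Like count(): every element is materialized.
    println(evaluatedBy(it => it.size))              // prints 4
  }
}
```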
Thanks,
Assaf.
From: Bernard Jesop [mailto:bernard.jesop@gmail.com]
Sent: Thursday, July 13, 2017 6:58 PM
To: Vadim Semenov
Cc: user
Subject: Re: underlying checkpoint
Thank you, one of my mistakes was to think that show() was an action.
2017-07-13 17:52 GMT+02:00 Vadim Semenov <va...@datadoghq.com>:
You need to trigger an action on that rdd to checkpoint it.
```
scala> spark.sparkContext.setCheckpointDir(".")
scala> val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> df.rdd.checkpoint()
scala> df.rdd.isCheckpointed
res2: Boolean = false
scala> df.show()
+------+---+
| _1| _2|
+------+---+
| Scala| 35|
|Python| 30|
| R| 15|
| Java| 20|
+------+---+
scala> df.rdd.isCheckpointed
res4: Boolean = false
scala> df.rdd.count()
res5: Long = 4
scala> df.rdd.isCheckpointed
res6: Boolean = true
```
On Thu, Jul 13, 2017 at 11:35 AM, Bernard Jesop <be...@gmail.com> wrote:
Hi everyone, I just tried this simple program :
import org.apache.spark.sql.SparkSession
object CheckpointTest extends App {
val spark = SparkSession
.builder()
.appName("Toto")
.getOrCreate()
spark.sparkContext.setCheckpointDir(".")
val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
df.show()
df.rdd.checkpoint()
println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")
}
But the result is still "not checkpointed".
Do you have any idea why? (knowing that the checkpoint file is created)
Best regards,
Bernard JESOP
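Putting the two answers together, a corrected version of the program can be sketched as follows (a sketch only, assuming Spark on the classpath; the `.master("local[*]")` setting is an assumption added so the example runs standalone). A count() is run after checkpoint() so the whole RDD is evaluated before isCheckpointed is checked:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointTest extends App {
  val spark = SparkSession
    .builder()
    .appName("Toto")
    .master("local[*]") // assumption: run locally for this example
    .getOrCreate()
  spark.sparkContext.setCheckpointDir(".")

  val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
  val rdd = df.rdd  // keep one reference to the underlying RDD
  rdd.checkpoint()  // only marks the RDD for checkpointing
  rdd.count()       // full action: runs the job and materializes the checkpoint
  println(if (rdd.isCheckpointed) "checkpointed" else "not checkpointed")
  spark.stop()
}
```

This should print "checkpointed", matching the res6 result in Vadim's REPL transcript.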
Re: underlying checkpoint
Posted by Bernard Jesop <be...@gmail.com>.
Thank you, one of my mistakes was to think that show() was an action.
2017-07-13 17:52 GMT+02:00 Vadim Semenov <va...@datadoghq.com>:
> You need to trigger an action on that rdd to checkpoint it.
>
> ```
> scala> spark.sparkContext.setCheckpointDir(".")
>
> scala> val df = spark.createDataFrame(List(("Scala", 35), ("Python",
> 30), ("R", 15), ("Java", 20)))
> df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
>
> scala> df.rdd.checkpoint()
>
> scala> df.rdd.isCheckpointed
> res2: Boolean = false
>
> scala> df.show()
> +------+---+
> | _1| _2|
> +------+---+
> | Scala| 35|
> |Python| 30|
> | R| 15|
> | Java| 20|
> +------+---+
>
>
> scala> df.rdd.isCheckpointed
> res4: Boolean = false
>
> scala> df.rdd.count()
> res5: Long = 4
>
> scala> df.rdd.isCheckpointed
> res6: Boolean = true
> ```
>
> On Thu, Jul 13, 2017 at 11:35 AM, Bernard Jesop <be...@gmail.com>
> wrote:
>
>> Hi everyone, I just tried this simple program:
>>
>> import org.apache.spark.sql.SparkSession
>>
>> object CheckpointTest extends App {
>>   val spark = SparkSession
>>     .builder()
>>     .appName("Toto")
>>     .getOrCreate()
>>   spark.sparkContext.setCheckpointDir(".")
>>   val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
>>   df.show()
>>   df.rdd.checkpoint()
>>   println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")
>> }
>>
>> But the result is still "not checkpointed".
>> Do you have any idea why? (knowing that the checkpoint file is created)
>>
>> Best regards,
>> Bernard JESOP
>
>
Re: underlying checkpoint
Posted by Vadim Semenov <va...@datadoghq.com>.
You need to trigger an action on that rdd to checkpoint it.
```
scala> spark.sparkContext.setCheckpointDir(".")
scala> val df = spark.createDataFrame(List(("Scala", 35), ("Python",
30), ("R", 15), ("Java", 20)))
df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> df.rdd.checkpoint()
scala> df.rdd.isCheckpointed
res2: Boolean = false
scala> df.show()
+------+---+
| _1| _2|
+------+---+
| Scala| 35|
|Python| 30|
| R| 15|
| Java| 20|
+------+---+
scala> df.rdd.isCheckpointed
res4: Boolean = false
scala> df.rdd.count()
res5: Long = 4
scala> df.rdd.isCheckpointed
res6: Boolean = true
```
On Thu, Jul 13, 2017 at 11:35 AM, Bernard Jesop <be...@gmail.com>
wrote:
> Hi everyone, I just tried this simple program:
>
> import org.apache.spark.sql.SparkSession
>
> object CheckpointTest extends App {
>   val spark = SparkSession
>     .builder()
>     .appName("Toto")
>     .getOrCreate()
>   spark.sparkContext.setCheckpointDir(".")
>   val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
>   df.show()
>   df.rdd.checkpoint()
>   println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")
> }
>
> But the result is still "not checkpointed".
> Do you have any idea why? (knowing that the checkpoint file is created)
>
> Best regards,
> Bernard JESOP
>