Posted to user@spark.apache.org by Bernard Jesop <be...@gmail.com> on 2017/07/13 15:35:46 UTC

underlying checkpoint

Hi everyone, I just tried this simple program :

 import org.apache.spark.sql.SparkSession

 object CheckpointTest extends App {

   val spark = SparkSession
     .builder()
     .appName("Toto")
     .getOrCreate()

   spark.sparkContext.setCheckpointDir(".")

   val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))

   df.show()
   df.rdd.checkpoint()
   println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")
 }

But the result is still "not checkpointed".
Do you have any idea why? (knowing that the checkpoint file is created)

Best regards,
Bernard JESOP

RE: underlying checkpoint

Posted by "Mendelson, Assaf" <As...@rsa.com>.
Actually, show is an action.
The issue is that unless you have some aggregations, show only goes over part of the dataframe, not all of it, so the checkpoint won’t be materialized (the same thing happens with cache).
You need an action that goes over the entire dataframe, which count does.
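
A minimal standalone sketch of that fix, assuming Spark 2.x and the same example data (the only change from the original program is triggering a full-scan action such as count() after checkpoint()):

```
import org.apache.spark.sql.SparkSession

object CheckpointTest extends App {
  val spark = SparkSession
    .builder()
    .appName("Toto")
    .getOrCreate()

  spark.sparkContext.setCheckpointDir(".")

  val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))

  df.rdd.checkpoint()
  // count() visits every partition, so the RDD is fully materialized
  // and the checkpoint files are actually written
  df.rdd.count()

  // prints "checkpointed"
  println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")

  spark.stop()
}
```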

Thanks,
              Assaf.

From: Bernard Jesop [mailto:bernard.jesop@gmail.com]
Sent: Thursday, July 13, 2017 6:58 PM
To: Vadim Semenov
Cc: user
Subject: Re: underlying checkpoint

Thank you, one of my mistakes was to think that show() was an action.

2017-07-13 17:52 GMT+02:00 Vadim Semenov <va...@datadoghq.com>:
You need to trigger an action on that rdd to checkpoint it.

```
scala>    spark.sparkContext.setCheckpointDir(".")

scala>    val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> df.rdd.checkpoint()

scala> df.rdd.isCheckpointed
res2: Boolean = false

scala> df.show()
+------+---+
|    _1| _2|
+------+---+
| Scala| 35|
|Python| 30|
|     R| 15|
|  Java| 20|
+------+---+


scala> df.rdd.isCheckpointed
res4: Boolean = false

scala> df.rdd.count()
res5: Long = 4

scala> df.rdd.isCheckpointed
res6: Boolean = true
```

On Thu, Jul 13, 2017 at 11:35 AM, Bernard Jesop <be...@gmail.com> wrote:
Hi everyone, I just tried this simple program :

 import org.apache.spark.sql.SparkSession

 object CheckpointTest extends App {

   val spark = SparkSession
     .builder()
     .appName("Toto")
     .getOrCreate()

   spark.sparkContext.setCheckpointDir(".")

   val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))

   df.show()
   df.rdd.checkpoint()
   println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")
 }

But the result is still "not checkpointed".
Do you have any idea why? (knowing that the checkpoint file is created)
Best regards,
Bernard JESOP



Re: underlying checkpoint

Posted by Bernard Jesop <be...@gmail.com>.
Thank you, one of my mistakes was to think that show() was an action.

2017-07-13 17:52 GMT+02:00 Vadim Semenov <va...@datadoghq.com>:

> You need to trigger an action on that rdd to checkpoint it.
>
> ```
> scala>    spark.sparkContext.setCheckpointDir(".")
>
> scala>    val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
> df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
>
> scala> df.rdd.checkpoint()
>
> scala> df.rdd.isCheckpointed
> res2: Boolean = false
>
> scala> df.show()
> +------+---+
> |    _1| _2|
> +------+---+
> | Scala| 35|
> |Python| 30|
> |     R| 15|
> |  Java| 20|
> +------+---+
>
>
> scala> df.rdd.isCheckpointed
> res4: Boolean = false
>
> scala> df.rdd.count()
> res5: Long = 4
>
> scala> df.rdd.isCheckpointed
> res6: Boolean = true
> ```
>
> On Thu, Jul 13, 2017 at 11:35 AM, Bernard Jesop <be...@gmail.com>
> wrote:
>
>> Hi everyone, I just tried this simple program :
>>
>>  import org.apache.spark.sql.SparkSession
>>
>>  object CheckpointTest extends App {
>>
>>    val spark = SparkSession
>>      .builder()
>>      .appName("Toto")
>>      .getOrCreate()
>>
>>    spark.sparkContext.setCheckpointDir(".")
>>
>>    val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
>>
>>    df.show()
>>    df.rdd.checkpoint()
>>    println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")
>>  }
>>
>> But the result is still "not checkpointed".
>> Do you have any idea why? (knowing that the checkpoint file is created)
>>
>> Best regards,
>> Bernard JESOP
>>
>
>

Re: underlying checkpoint

Posted by Vadim Semenov <va...@datadoghq.com>.
You need to trigger an action on that rdd to checkpoint it.

```
scala>    spark.sparkContext.setCheckpointDir(".")

scala>    val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> df.rdd.checkpoint()

scala> df.rdd.isCheckpointed
res2: Boolean = false

scala> df.show()
+------+---+
|    _1| _2|
+------+---+
| Scala| 35|
|Python| 30|
|     R| 15|
|  Java| 20|
+------+---+


scala> df.rdd.isCheckpointed
res4: Boolean = false

scala> df.rdd.count()
res5: Long = 4

scala> df.rdd.isCheckpointed
res6: Boolean = true
```
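
A related sketch, assuming Spark 2.1 or later, where the DataFrame API has its own checkpoint() method that is eager by default, so no separate action is needed:

```
// assumes Spark 2.1+, where Dataset.checkpoint() exists and defaults to eager = true
spark.sparkContext.setCheckpointDir(".")

val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))

// runs a job immediately and returns a new DataFrame backed by the checkpointed data
val checkpointedDf = df.checkpoint()
checkpointedDf.show()
```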

On Thu, Jul 13, 2017 at 11:35 AM, Bernard Jesop <be...@gmail.com>
wrote:

> Hi everyone, I just tried this simple program :
>
>  import org.apache.spark.sql.SparkSession
>
>  object CheckpointTest extends App {
>
>    val spark = SparkSession
>      .builder()
>      .appName("Toto")
>      .getOrCreate()
>
>    spark.sparkContext.setCheckpointDir(".")
>
>    val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
>
>    df.show()
>    df.rdd.checkpoint()
>    println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")
>  }
>
> But the result is still "not checkpointed".
> Do you have any idea why? (knowing that the checkpoint file is created)
>
> Best regards,
> Bernard JESOP
>