Posted to dev@spark.apache.org by Bernard Jesop <be...@gmail.com> on 2017/10/25 13:51:17 UTC

Dataset API Question

Hello everyone,

I have a question about checkpointing Datasets.

It seems that in 2.1.0 there is a Dataset.checkpoint(); however, unlike RDD,
there is no Dataset.isCheckpointed().

I wonder whether Dataset.checkpoint is syntactic sugar for
Dataset.rdd.checkpoint.
When I do:

Dataset.checkpoint; Dataset.count
Dataset.rdd.isCheckpointed // result: false

However, when I explicitly do:
Dataset.rdd.checkpoint; Dataset.rdd.count
Dataset.rdd.isCheckpointed // result: true

Could someone explain this behavior to me, or provide some references?

Best regards,
Bernard

Re: Dataset API Question

Posted by Reynold Xin <rx...@databricks.com>.
It is a bit more than syntactic sugar, but not much more:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533

BTW this basically writes all the data out and then creates a new
Dataset to load it back in.
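
As a rough analogy for the "write out, then load back in" description, here is a minimal sketch that materializes a Dataset by hand. It is not what Dataset.checkpoint does internally: the Parquet format, the `materialize` helper name and the caller-supplied `path` are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

object ManualMaterializeSketch {
  // Hypothetical helper: writes `ds` to `path` and returns a new Dataset
  // that reads from those files instead of recomputing the original plan.
  def materialize[T: Encoder](spark: SparkSession, ds: Dataset[T], path: String): Dataset[T] = {
    ds.write.mode("overwrite").parquet(path) // write all the data out
    spark.read.parquet(path).as[T]           // create a new Dataset that loads it
  }
}
```

One side effect of doing it by hand is that the caller chooses the output path, so the files are easy to find later.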


On Wed, Oct 25, 2017 at 6:51 AM, Bernard Jesop <be...@gmail.com>
wrote:

> Hello everyone,
>
> I have a question about checkpointing on dataset.
>
> It seems in 2.1.0 that there is a Dataset.checkpoint(), however unlike RDD
> there is no Dataset.isCheckpointed().
>
> I wonder if Dataset.checkpoint is a syntactic sugar for
> Dataset.rdd.checkpoint.
> When I do :
>
> Dataset.checkpoint; Dataset.count
> Dataset.rdd.isCheckpointed // result: false
>
> However, when I explicitly do:
> Dataset.rdd.checkpoint; Dataset.rdd.count
> Dataset.rdd.isCheckpointed // result: true
>
> Could someone explain this behavior to me, or provide some references?
>
> Best regards,
> Bernard
>

Re: Dataset API Question

Posted by Wenchen Fan <cl...@gmail.com>.
It's because of a different API design.

*RDD.checkpoint* returns void, which means it mutates the RDD state, so you
need an *RDD.isCheckpointed* method to check whether this RDD is checkpointed.

*Dataset.checkpoint* returns a new Dataset, which means there is no
isCheckpointed state in Dataset, and thus we don't need a
*Dataset.isCheckpointed* method.
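
A minimal sketch of the two API styles side by side, assuming a local SparkSession and a checkpoint directory that has already been set. The object name, app name and `/tmp/spark-checkpoints` path are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object CheckpointApiSketch {
  // Returns (rdd.isCheckpointed after an action, row count of the checkpointed Dataset).
  // Assumes spark.sparkContext.setCheckpointDir(...) has already been called.
  def run(spark: SparkSession): (Boolean, Long) = {
    import spark.implicits._

    // RDD API: checkpoint() returns Unit and marks this RDD instance,
    // so the checkpointed flag lives on the RDD itself.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
    rdd.checkpoint()
    rdd.count() // an action is needed before the checkpoint is actually written

    // Dataset API: checkpoint() returns a new Dataset; `ds` is left as it was,
    // so later code should use `checkpointed` rather than `ds`.
    val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
    val checkpointed: Dataset[Int] = ds.checkpoint() // eager by default

    (rdd.isCheckpointed, checkpointed.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")               // assumption: local run for illustration
      .appName("checkpoint-api-sketch") // hypothetical name
      .getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path
    println(run(spark))
    spark.stop()
  }
}
```

With this design, whether a Dataset is "checkpointed" is a matter of which reference you hold: the one returned by checkpoint() or the original.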


On Wed, Oct 25, 2017 at 6:39 PM, Bernard Jesop <be...@gmail.com>
wrote:

> Actually, I realized that keeping the info would not be enough, as I also
> need to locate the checkpoint files in order to delete them :/
>
> 2017-10-25 19:07 GMT+02:00 Bernard Jesop <be...@gmail.com>:
>
>> As far as I understand, Dataset.rdd is not the same as InternalRDD.
>> It is just another RDD representation of the same Dataset and is created
>> on demand (lazy val) when Dataset.rdd is called.
>> This totally explains the observed behavior.
>>
>> But how would it be possible to know that a Dataset has been
>> checkpointed?
>> Should I manually keep track of that info?
>>
>> 2017-10-25 15:51 GMT+02:00 Bernard Jesop <be...@gmail.com>:
>>
>>> Hello everyone,
>>>
>>> I have a question about checkpointing on dataset.
>>>
>>> It seems in 2.1.0 that there is a Dataset.checkpoint(), however unlike
>>> RDD there is no Dataset.isCheckpointed().
>>>
>>> I wonder if Dataset.checkpoint is a syntactic sugar for
>>> Dataset.rdd.checkpoint.
>>> When I do :
>>>
>>> Dataset.checkpoint; Dataset.count
>>> Dataset.rdd.isCheckpointed // result: false
>>>
>>> However, when I explicitly do:
>>> Dataset.rdd.checkpoint; Dataset.rdd.count
>>> Dataset.rdd.isCheckpointed // result: true
>>>
>>> Could someone explain this behavior to me, or provide some references?
>>>
>>> Best regards,
>>> Bernard
>>>
>>
>>
>

Re: Dataset API Question

Posted by Bernard Jesop <be...@gmail.com>.
Actually, I realized that keeping the info would not be enough, as I also
need to locate the checkpoint files in order to delete them :/
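
One possible way to locate and remove the files, sketched under the assumption that everything under the current checkpoint directory can be discarded. `deleteCheckpointDir` is a hypothetical helper; it relies on `SparkContext.getCheckpointDir` and the Hadoop `FileSystem` API. Deleting the directory while a checkpointed RDD or Dataset is still in use would break that data, so this only makes sense once they are no longer needed.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object CheckpointCleanupSketch {
  // Hypothetical helper: recursively deletes the directory Spark is currently
  // checkpointing into. Returns true if a directory was found and deleted,
  // false if no checkpoint directory is set or nothing was deleted.
  def deleteCheckpointDir(spark: SparkSession): Boolean = {
    val sc = spark.sparkContext
    sc.getCheckpointDir.exists { dir =>
      val path = new Path(dir)
      val fs = path.getFileSystem(sc.hadoopConfiguration)
      fs.delete(path, true) // true = recursive
    }
  }
}
```

Spark also has a `spark.cleaner.referenceTracking.cleanCheckpoints` setting intended to clean up checkpoint files once their references go out of scope; whether it covers a given use case is worth verifying against the documentation for the Spark version in use.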

2017-10-25 19:07 GMT+02:00 Bernard Jesop <be...@gmail.com>:

> As far as I understand, Dataset.rdd is not the same as InternalRDD.
> It is just another RDD representation of the same Dataset and is created
> on demand (lazy val) when Dataset.rdd is called.
> This totally explains the observed behavior.
>
> But how would it be possible to know that a Dataset has been
> checkpointed?
> Should I manually keep track of that info?
>
> 2017-10-25 15:51 GMT+02:00 Bernard Jesop <be...@gmail.com>:
>
>> Hello everyone,
>>
>> I have a question about checkpointing on dataset.
>>
>> It seems in 2.1.0 that there is a Dataset.checkpoint(), however unlike
>> RDD there is no Dataset.isCheckpointed().
>>
>> I wonder if Dataset.checkpoint is a syntactic sugar for
>> Dataset.rdd.checkpoint.
>> When I do :
>>
>> Dataset.checkpoint; Dataset.count
>> Dataset.rdd.isCheckpointed // result: false
>>
>> However, when I explicitly do:
>> Dataset.rdd.checkpoint; Dataset.rdd.count
>> Dataset.rdd.isCheckpointed // result: true
>>
>> Could someone explain this behavior to me, or provide some references?
>>
>> Best regards,
>> Bernard
>>
>
>

Re: Dataset API Question

Posted by Bernard Jesop <be...@gmail.com>.
As far as I understand, Dataset.rdd is not the same as InternalRDD.
It is just another RDD representation of the same Dataset and is created on
demand (lazy val) when Dataset.rdd is called.
This totally explains the observed behavior.

But how would it be possible to know that a Dataset has been
checkpointed?
Should I manually keep track of that info?

2017-10-25 15:51 GMT+02:00 Bernard Jesop <be...@gmail.com>:

> Hello everyone,
>
> I have a question about checkpointing on dataset.
>
> It seems in 2.1.0 that there is a Dataset.checkpoint(), however unlike RDD
> there is no Dataset.isCheckpointed().
>
> I wonder if Dataset.checkpoint is a syntactic sugar for
> Dataset.rdd.checkpoint.
> When I do :
>
> Dataset.checkpoint; Dataset.count
> Dataset.rdd.isCheckpointed // result: false
>
> However, when I explicitly do:
> Dataset.rdd.checkpoint; Dataset.rdd.count
> Dataset.rdd.isCheckpointed // result: true
>
> Could someone explain this behavior to me, or provide some references?
>
> Best regards,
> Bernard
>
