You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ferdonline <gi...@git.apache.org> on 2017/11/23 19:26:25 UTC

[GitHub] spark pull request #19805: Adding localCheckpoint to Dataframe API

GitHub user ferdonline opened a pull request:

    https://github.com/apache/spark/pull/19805

    Adding localCheckpoint to Dataframe API

    ## What changes were proposed in this pull request?
    
    This change adds local checkpoint support to datasets and respective bind from Python Dataframe API.
    
    If reliability requirements can be lowered to favor performance, as in cases of further quick transformations followed by a reliable save, localCheckpoints() fit very well. 
    Furthermore, at the moment Reliable checkpoints still incur double computation (see #9428)
    In general it makes the API more complete as well.
    
    ## How was this patch tested?
    
    Python land quick use case:
    
    ```python
    In [1]: from time import sleep
    
    In [2]: from pyspark.sql import types as T
    
    In [3]: from pyspark.sql import functions as F
    
    In [4]: def f(x):
        sleep(1)
        return x*2
       ...: 
    
    In [5]: df1 = spark.range(30, numPartitions=6)
    
    In [6]: df2 = df1.select(F.udf(f, T.LongType())("id"))
    
    In [7]: %time _ = df2.collect()
    CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms                           
    Wall time: 12.2 s
    
    In [8]: %time df3 = df2.localCheckpoint()
    CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms                            
    Wall time: 10.3 s
    
    In [9]: %time _ = df3.collect()
    CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
    Wall time: 148 ms
    ```
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ferdonline/spark feature_dataset_localCheckpoint

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19805
    
----
commit abe03ab0e8d6647ccb8949a39c431cd845c23dbb
Author: Fernando Pereira <fe...@epfl.ch>
Date:   2017-11-23T18:49:37Z

    Adding localCheckpoint to Dataframe API

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84706/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84709/testReport)** for PR 19805 at commit [`da34c4a`](https://github.com/apache/spark/commit/da34c4a49c7c9f0736c08707582966ab6608826c).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153739562
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    +   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    I guess we have 2 options here:
    
    - expose `def checkpoint(eager: Boolean, local: Boolean): Dataset[T]` as public, which can be used similar to `localCheckpoint`.
    - make `def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T]` private to be used only from the public APIs.
    
    and I'm afraid the current one is not good anyway.
    
    I'd prefer the second option but I don't have a strong feeling.
    cc @felixcheung @HyukjinKwon 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r154490687
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +537,48 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    --- End diff --
    
    Could you check the test case of `def checkpoint`? At least we need to add a test case.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    restest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    LGTM
    
    Thanks! Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84709/testReport)** for PR 19805 at commit [`da34c4a`](https://github.com/apache/spark/commit/da34c4a49c7c9f0736c08707582966ab6608826c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Existing checkpoint tests applied to localCheckpoint as well, all working well. Please verify.
    I'm getting an unrelated fail in SparkR, did anything change n the build system?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85017/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    @ferdonline Could you file a JIRA issue and add the id to the title like `[SPARK-xxx][PYTHON][SQL] ...`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153746129
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    +   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    Sounds good to me to remove default param values, keeping private. Let me know


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84196 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84196/testReport)** for PR 19805 at commit [`abe03ab`](https://github.com/apache/spark/commit/abe03ab0e8d6647ccb8949a39c431cd845c23dbb).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153693403
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -518,13 +518,12 @@ class Dataset[T] private[sql](
        * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
        * plan may grow exponentially. It will be saved to files inside the checkpoint
        * directory set with `SparkContext#setCheckpointDir`.
    -   *
    --- End diff --
    
    Maybe we should revert this back?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84334/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153693511
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    --- End diff --
    
    We don't need the default value for `eager` here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153694085
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    +   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    We should make this `private`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153080509
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -524,22 +524,41 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(): Dataset[T] = checkpoint(eager = true)
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    --- End diff --
    
    add `@since`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153145177
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -524,22 +524,41 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(): Dataset[T] = checkpoint(eager = true)
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    --- End diff --
    
    I always try to avoid duplication of code, and with docs this takes ~10 lines for nothing - I believe a function with a default parameter is as readable as the function without the parameter. But please let me know if it goes against the code style. Thanks


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84289/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153693534
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
    --- End diff --
    
    ditto.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84334 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84334/testReport)** for PR 19805 at commit [`c743c34`](https://github.com/apache/spark/commit/c743c340e6d979f378daefa661040db87d944234).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #85139 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85139/testReport)** for PR 19805 at commit [`45f4bf5`](https://github.com/apache/spark/commit/45f4bf5c39e49ba065311d976f0149aa3bf67a9e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    cc @andrewor14 since I believe you know this part of spark pretty well, maybe you could help me integrating this.
    Any idea why Jenkins didn't start the testing?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Jenkins, test this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19805


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r157374081
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -540,9 +540,52 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean): Dataset[T] = checkpoint(eager = eager, reliableCheckpoint = true)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = false)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean): Dataset[T] = checkpoint(
    +    eager = eager,
    +    reliableCheckpoint = false
    +  )
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset.
    +   *
    +   * @param eager Whether to checkpoint this dataframe immediately
    +   * @param reliableCheckpoint Whether to create a reliable checkpoint saved to files inside the
    +   *                           checkpoint directory. If false creates a local checkpoint using
    +   *                           the caching subsystem
    +   */
    +  private def checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T] = {
         val internalRdd = queryExecution.toRdd.map(_.copy())
    -    internalRdd.checkpoint()
    +    if (reliableCheckpoint) {
    +      internalRdd.checkpoint()
    +    } else {
    +      internalRdd.localCheckpoint()
    --- End diff --
    
    cc @zsxwing 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    could you update the title to add [PYTHON]


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85027/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84720/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84299/testReport)** for PR 19805 at commit [`54b7f33`](https://github.com/apache/spark/commit/54b7f3367fe368f6ebfa518bc43bf7de36530fc8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84709/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84282/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84299/testReport)** for PR 19805 at commit [`54b7f33`](https://github.com/apache/spark/commit/54b7f3367fe368f6ebfa518bc43bf7de36530fc8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r157361491
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1156,67 +1156,83 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       }
     
       Seq(true, false).foreach { eager =>
    -    def testCheckpointing(testName: String)(f: => Unit): Unit = {
    -      test(s"Dataset.checkpoint() - $testName (eager = $eager)") {
    -        withTempDir { dir =>
    -          val originalCheckpointDir = spark.sparkContext.checkpointDir
    -
    -          try {
    -            spark.sparkContext.setCheckpointDir(dir.getCanonicalPath)
    +    Seq(true, false).foreach { reliable =>
    +      def testCheckpointing(testName: String)(f: => Unit): Unit = {
    +        test(s"Dataset.checkpoint() - $testName (eager = $eager, reliable = $reliable)") {
    +          if (reliable) {
    +            withTempDir { dir =>
    +              val originalCheckpointDir = spark.sparkContext.checkpointDir
    +
    +              try {
    +                spark.sparkContext.setCheckpointDir(dir.getCanonicalPath)
    +                f
    +              } finally {
    +                // Since the original checkpointDir can be None, we need
    +                // to set the variable directly.
    +                spark.sparkContext.checkpointDir = originalCheckpointDir
    +              }
    +            }
    +          }
    +          else {
    --- End diff --
    
    -> `} else {`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: Adding localCheckpoint to Dataframe API

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84282/testReport)** for PR 19805 at commit [`59b5562`](https://github.com/apache/spark/commit/59b5562d1cfe738dc991ef17afad71abd891195d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84289/testReport)** for PR 19805 at commit [`c5f1b2c`](https://github.com/apache/spark/commit/c5f1b2c5e05ac0410448fbeea9409cc4117f85e4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84720/testReport)** for PR 19805 at commit [`7532fc4`](https://github.com/apache/spark/commit/7532fc4362a380cfbfdd89e8e99e9c6749a39824).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: Adding localCheckpoint to Dataframe API

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84706/testReport)** for PR 19805 at commit [`9beb375`](https://github.com/apache/spark/commit/9beb375d15d162519856d32d6bd12e3cf1860d68).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #85017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85017/testReport)** for PR 19805 at commit [`7532fc4`](https://github.com/apache/spark/commit/7532fc4362a380cfbfdd89e8e99e9c6749a39824).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #85017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85017/testReport)** for PR 19805 at commit [`7532fc4`](https://github.com/apache/spark/commit/7532fc4362a380cfbfdd89e8e99e9c6749a39824).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84282/testReport)** for PR 19805 at commit [`59b5562`](https://github.com/apache/spark/commit/59b5562d1cfe738dc991ef17afad71abd891195d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84196/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84196 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84196/testReport)** for PR 19805 at commit [`abe03ab`](https://github.com/apache/spark/commit/abe03ab0e8d6647ccb8949a39c431cd845c23dbb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153147567
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -524,22 +524,41 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(): Dataset[T] = checkpoint(eager = true)
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
     
       /**
        * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
        * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    -   * plan may grow exponentially. It will be saved to files inside the checkpoint
    -   * directory set with `SparkContext#setCheckpointDir`.
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
        *
        * @group basic
        * @since 2.1.0
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    Sounds good. Regarding the naming, I'm not sure adding an underscore is a good choice. Let me know if it should be changed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153080488
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -524,22 +524,41 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(): Dataset[T] = checkpoint(eager = true)
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
     
       /**
        * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
        * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    -   * plan may grow exponentially. It will be saved to files inside the checkpoint
    -   * directory set with `SparkContext#setCheckpointDir`.
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
        *
        * @group basic
        * @since 2.1.0
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    I don't think this is right - this is still a public API and it's not the convention here.
    Change it to private instead


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153802577
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    +   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    Alright, so I remove both default values, otherwise we end up with a colliding signature. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84720/testReport)** for PR 19805 at commit [`7532fc4`](https://github.com/apache/spark/commit/7532fc4362a380cfbfdd89e8e99e9c6749a39824).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #85027 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85027/testReport)** for PR 19805 at commit [`45f4bf5`](https://github.com/apache/spark/commit/45f4bf5c39e49ba065311d976f0149aa3bf67a9e).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84334 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84334/testReport)** for PR 19805 at commit [`c743c34`](https://github.com/apache/spark/commit/c743c340e6d979f378daefa661040db87d944234).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Will review it this weekend.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Thanks for you review. I'm working on the changes


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    cc @zsxwing @liancheng 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153761498
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    +   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    `private checkpoint` makes sense to me too. i think we don't need the underscore though as it's going to have `private` ahead and strictly it might be redundant to have the leading underscore to note it's private is.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #85027 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85027/testReport)** for PR 19805 at commit [`45f4bf5`](https://github.com/apache/spark/commit/45f4bf5c39e49ba065311d976f0149aa3bf67a9e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85139/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SQL] Adding localCheckpoint to Dataset API

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153080504
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -524,22 +524,41 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(): Dataset[T] = checkpoint(eager = true)
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    --- End diff --
    
    I think we should keep this signature  - there's already a ` def checkpoint(eager: Boolean = true)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84706/testReport)** for PR 19805 at commit [`9beb375`](https://github.com/apache/spark/commit/9beb375d15d162519856d32d6bd12e3cf1860d68).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r157361558
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -540,9 +540,52 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean): Dataset[T] = checkpoint(eager = eager, reliableCheckpoint = true)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = false)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean): Dataset[T] = checkpoint(
    +    eager = eager,
    +    reliableCheckpoint = false
    +  )
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset.
    +   *
    +   * @param eager Whether to checkpoint this dataframe immediately
    +   * @param reliableCheckpoint Whether to create a reliable checkpoint saved to files inside the
    +   *                           checkpoint directory. If false creates a local checkpoint using
    +   *                           the caching subsystem
    +   */
    +  private def checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T] = {
         val internalRdd = queryExecution.toRdd.map(_.copy())
    -    internalRdd.checkpoint()
    +    if (reliableCheckpoint) {
    +      internalRdd.checkpoint()
    +    } else {
    +      internalRdd.localCheckpoint()
    --- End diff --
    
    Could you also issue a logWarning message here to indicate the checkpoint is not reliable? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153722443
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    +   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    I changed according to the first review. I also agree to be private so that "there's only one way of doing it" when using the API. But if you have a strong feeling I can surely change


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r154495320
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +537,48 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    --- End diff --
    
    I can try to create a test to localCheckpoint based on the one for checkpoint, but I'm not very familiar with Scala and the Spark scala API, so currently I don't feel at ease to create a meaningful test. Would anybody be up to add one?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153966170
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +537,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean): Dataset[T] = checkpoint(eager = eager, local = false)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean): Dataset[T] = checkpoint(eager = eager, local = true)
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    +   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially.
    +   * By default reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
    +   * is performed instead. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  private[sql] def checkpoint(eager: Boolean, local: Boolean): Dataset[T] = {
    --- End diff --
    
    Hm, can we just do this `private def checkpoint` instead of `private[sql]`? Also, I don't think it needs those `@Experimental`, `@InterfaceStability.Evolving`, `@since 2.3.0` and `@group basic`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r155934026
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +537,48 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    --- End diff --
    
    So we already test checkpoint in DatasetSuite


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #84289 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84289/testReport)** for PR 19805 at commit [`c5f1b2c`](https://github.com/apache/spark/commit/c5f1b2c5e05ac0410448fbeea9409cc4117f85e4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153178787
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -524,22 +524,41 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(): Dataset[T] = checkpoint(eager = true)
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    --- End diff --
    
    No, it's going to break Java API compatibility. Default value does not work with Java.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    **[Test build #85139 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85139/testReport)** for PR 19805 at commit [`45f4bf5`](https://github.com/apache/spark/commit/45f4bf5c39e49ba065311d976f0149aa3bf67a9e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Posted by ferdonline <gi...@git.apache.org>.
Github user ferdonline commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r157371948
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -540,9 +540,52 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean): Dataset[T] = checkpoint(eager = eager, reliableCheckpoint = true)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and despite potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = false)
    +
    +  /**
    +   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
    +   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
    +   * potentially faster they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean): Dataset[T] = checkpoint(
    +    eager = eager,
    +    reliableCheckpoint = false
    +  )
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset.
    +   *
    +   * @param eager Whether to checkpoint this dataframe immediately
    +   * @param reliableCheckpoint Whether to create a reliable checkpoint saved to files inside the
    +   *                           checkpoint directory. If false creates a local checkpoint using
    +   *                           the caching subsystem
    +   */
    +  private def checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T] = {
         val internalRdd = queryExecution.toRdd.map(_.copy())
    -    internalRdd.checkpoint()
    +    if (reliableCheckpoint) {
    +      internalRdd.checkpoint()
    +    } else {
    +      internalRdd.localCheckpoint()
    --- End diff --
    
    Hi. Thanks for the review.
    From the point of view of the user being aware he's doing a local checkpoint we already force him to use localCheckpoint() (the generic checkpoint is private)
    If we should warn users about the potential issues with localCheckpoint() shouldn't we do it in the RDD API, so that users are always warned?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84299/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19805
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org