Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2017/11/29 09:58:41 UTC

[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19805#discussion_r153739562
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -537,9 +536,55 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def checkpoint(eager: Boolean): Dataset[T] = {
    +  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
    +
    +  /**
    +   * Eagerly locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be
    +   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
    +   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
    +   * storage and, while potentially faster, they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
    +
    +  /**
    +   * Locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be used to
    +   * truncate the logical plan of this Dataset, which is especially useful in iterative algorithms
    +   * where the plan may grow exponentially. Local checkpoints are written to executor storage
    +   * and, while potentially faster, they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
    +
    +  /**
    +   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    +   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
    +   * plan may grow exponentially.
    +   * By default, reliable checkpoints are created and saved to files inside the checkpoint
    +   * directory set with `SparkContext#setCheckpointDir`. If `local` is set to true, a local
    +   * checkpoint is performed instead. Local checkpoints are written to executor storage and,
    +   * while potentially faster, they are unreliable and may compromise job completion.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  @Experimental
    +  @InterfaceStability.Evolving
    +  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
    --- End diff --
    
    I guess we have 2 options here:
    
    - expose `def checkpoint(eager: Boolean, local: Boolean): Dataset[T]` as public, which can be used similarly to `localCheckpoint`.
    - make `def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T]` private to be used only from the public APIs.
    
    and I'm afraid the current approach is not good either way.
    
    I'd prefer the second option but I don't have a strong feeling.
    cc @felixcheung @HyukjinKwon 
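    
    For illustration, here is a rough sketch of how the second option could look from the API side. It uses a toy stand-in for `Dataset`, and the names `ToyDataset` and `checkpointInternal` are placeholders rather than actual Spark code:
    
        // Sketch of option 2: the merged eager/local implementation stays private,
        // and only the checkpoint/localCheckpoint overloads are public.
        class ToyDataset[T](val data: Seq[T]) {
          // Public surface: the eager/local combinations are fixed by the overloads.
          def checkpoint(): ToyDataset[T] = checkpoint(eager = true)
          def checkpoint(eager: Boolean): ToyDataset[T] = checkpointInternal(eager, local = false)
          def localCheckpoint(): ToyDataset[T] = localCheckpoint(eager = true)
          def localCheckpoint(eager: Boolean): ToyDataset[T] = checkpointInternal(eager, local = true)
    
          // The merged implementation is private, so `local` never appears in a public signature.
          private def checkpointInternal(eager: Boolean, local: Boolean): ToyDataset[T] = {
            println(s"checkpointing: eager=$eager, local=$local")
            new ToyDataset(data)
          }
        }
    
        object CheckpointDemo extends App {
          val ds = new ToyDataset(Seq(1, 2, 3))
          ds.localCheckpoint()          // prints: checkpointing: eager=true, local=true
          ds.checkpoint(eager = false)  // prints: checkpointing: eager=false, local=false
        }
    
    With this shape, the `local` flag is only an implementation detail, whereas the first option would expose it directly in the public `checkpoint` signature.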


---
