You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2017/01/13 14:41:24 UTC

[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/16576

    [SPARK-19215] Add necessary check for `RDD.checkpoint` to avoid potential mistakes

    ## What changes were proposed in this pull request?
    
    Currently RDD.checkpoint must be called before any job executed on this RDD, otherwise the `doCheckpoint` will never be called. This is a pitfall we should check this and throw exception (or at least log warning ? ) for such case.
    And, if RDD haven't been persisted, doing checkpoint will cause RDD recomputation, because current implementation will run separated job for checkpointing. I think such case it should also print some warning message, remind user to check whether he forgot persist the RDD.
    
    ## How was this patch tested?
    
    Manual.
    
    Test case 1:
    ```
    val rdd = sc.makeRDD(Array(1,2,3),3)
    rdd.count()
    rdd.checkpoint() // here because `rdd.count` executed, checkpoint will never take effect, so that this patch will directly throw exception.
    ```
    
    Test case 2:
    ```
    val rdd = sc.makeRDD(Array(1,2,3),3).map(_ + 10)
    rdd.checkpoint() // because `rdd` do not persisted, so that checkpoint will cause this RDD recomputation, this patch will print warning here.
    rdd.count() // trigger `doCheckpoint`
    ```
    
    Test case 3:
    ```
    val rdd = sc.makeRDD(Array(1,2,3),3).map(_ + 10)
    rdd.persist()
    rdd.checkpoint() // This is correct usage, won't print any warning in this patch.
    rdd.count() // trigger `doCheckpoint`
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark add_check_and_warning_for_rdd_checkpoint

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16576.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16576
    
----
commit 70fbfb07adbaca17831fd736661135f2d7b2b0e0
Author: WeichenXu <we...@outlook.com>
Date:   2017-01-13T14:17:28Z

    add check and warning msg for rdd checkpoint

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96114324
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      throw new SparkException("Because job has been executed on this RDD, checkpoint won't work")
    +    }
    +    if (storageLevel == StorageLevel.NONE) {
    +      logWarning("Call checkpoint on unpersisted RDD, it will cause RDD recomputation" +
    +        " when saving checkpoint files.")
    +    }
    --- End diff --
    
    Moving to doCheckpoint (before `checkpointData.get.checkpoint()`) and making it a logInfo sounds fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    **[Test build #71374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71374/testReport)** for PR 16576 at commit [`4b4fad7`](https://github.com/apache/spark/commit/4b4fad7a46c4a85ad8907118edc6b97b436c935f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96573963
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,9 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      logWarning(s"Because job has been executed on RDD ${id}, checkpoint won't work")
    --- End diff --
    
    reping @zsxwing Would you mind take a look? This is a simple PR but it will bring much help for spark developers to avoid them making some mistake usage...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96114262
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      throw new SparkException("Because job has been executed on this RDD, checkpoint won't work")
    +    }
    --- End diff --
    
    I see what you mean, if I am not wrong, are you trying to deal with something like this ?
    
    '''
    val rdd1 = ...
    val rdd2 = rdd1.filter()
    val rdd3 = rdd2.map()
    
    rdd1.checkpoint() 
    rdd3.count()
    
    rdd2.checkpoint()
    rdd3.count()
    '''
    
    The latter will not cause rdd2 to get checkpointed since the earlier checkpoint already marked doCheckpointCalled = true ?
    
    
    If yes, then it is an interesting case which might not have been considered - but I could be wrong.
    +CC @mateiz which wrote the memorization of checkpoint ...



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71374/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96112227
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      throw new SparkException("Because job has been executed on this RDD, checkpoint won't work")
    +    }
    --- End diff --
    
    All right, how about change to following code:
    ```
    if (doCheckpointCalled) {
          logWarning(s"Because job has been executed on RDD ${id}, checkpoint won't work")
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71394/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 closed the pull request at:

    https://github.com/apache/spark/pull/16576


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96066378
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      throw new SparkException("Because job has been executed on this RDD, checkpoint won't work")
    +    }
    +    if (storageLevel == StorageLevel.NONE) {
    +      logWarning("Call checkpoint on unpersisted RDD, it will cause RDD recomputation" +
    +        " when saving checkpoint files.")
    +    }
    --- End diff --
    
    
    Two issues with this :
    
    a) It is possible for cache/persist to be called after checkpoint() - and behave the same as cache/persist called before checkpoint()
    b) A narrow dependency parent could have been persisted or checkpoint'ed - causing this checkpoint to not be expensive : so the message and log level is a bit too severe.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    **[Test build #71374 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71374/testReport)** for PR 16576 at commit [`4b4fad7`](https://github.com/apache/spark/commit/4b4fad7a46c4a85ad8907118edc6b97b436c935f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    At the end, I think current checkpoint behavior is strange, and our customers is very easy to use it in wrong way. For now the core logic is hard to modify, but at least we should add useful warning, otherwise many spark developers may be confused why sometimes checkpoint fail, OR their application re-compute the RDDs but they don't realize it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96118042
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      throw new SparkException("Because job has been executed on this RDD, checkpoint won't work")
    +    }
    --- End diff --
    
    Year, that is the pitfall in `RDD.checkpoint` and it will make developers confused, so I try to add some check for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    **[Test build #71327 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71327/testReport)** for PR 16576 at commit [`70fbfb0`](https://github.com/apache/spark/commit/70fbfb07adbaca17831fd736661135f2d7b2b0e0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    @mridulm  code updated. thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96130960
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,9 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    --- End diff --
    
    `doCheckpointCalled && !isCheckpointed`
    If it is already checkpoint'ed successfully, then it is fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96131251
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1726,6 +1729,10 @@ abstract class RDD[T: ClassTag](
                 // checkpoint ourselves
                 dependencies.foreach(_.rdd.doCheckpoint())
               }
    +          if (storageLevel == StorageLevel.NONE) {
    +            logInfo(s"do checkpoint on unpersisted RDD ${id}, it will cause RDD recomputation" +
    +              " when saving checkpoint files.")
    --- End diff --
    
    Can you please change the message to something like :
    `s"checkpoint on unpersisted RDD $this will result in recomputation"`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    **[Test build #71394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71394/testReport)** for PR 16576 at commit [`823ac63`](https://github.com/apache/spark/commit/823ac63101b6fd424550ca0f1e2ac4d9d0a0619c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    **[Test build #71327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71327/testReport)** for PR 16576 at commit [`70fbfb0`](https://github.com/apache/spark/commit/70fbfb07adbaca17831fd736661135f2d7b2b0e0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96112271
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      throw new SparkException("Because job has been executed on this RDD, checkpoint won't work")
    +    }
    +    if (storageLevel == StorageLevel.NONE) {
    +      logWarning("Call checkpoint on unpersisted RDD, it will cause RDD recomputation" +
    +        " when saving checkpoint files.")
    +    }
    --- End diff --
    
    en... your advice is reasonable!
    to sovle (a), I can move this warning into `doCheckpoint`, avoid the situation `RDD.persist` called later.
    about problem (b), how about change this warning into `logInfo` ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71327/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/16576
  
    cc @rxin Thanks!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96131096
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,9 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      logWarning(s"Because job has been executed on RDD ${id}, checkpoint won't work")
    --- End diff --
    
    We have to decide if we simply log a message and ignore it (if it is a design choice) or if we need to fix this.
    From git, @mateiz and @zsxwing were the last people to work on it this - would be great to hear from them : this might be a variant of SPARK-6847 which @zsxwing fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16576#discussion_r96064677
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
         // NOTE: we use a global lock here due to complexities downstream with ensuring
         // children RDD partitions point to the correct parent partitions. In the future
         // we should revisit this consideration.
    +    if (doCheckpointCalled) {
    +      throw new SparkException("Because job has been executed on this RDD, checkpoint won't work")
    +    }
    --- End diff --
    
    It is not an error condition to call checkpoint() on an RDD which has already been checkpoint'ed : for example in complex workflow, different modules might call checkpoint as an invariant - and the RDD might already have been persisted/checkpointed/computed.
    Since isCheckpointed returns true only after checkpoint succeeds (and not if checkpoint() was invoked), there is no way for these methods to prevent an exception.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org