Posted to reviews@spark.apache.org by esjewett <gi...@git.apache.org> on 2014/05/06 19:01:24 UTC

[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

GitHub user esjewett opened a pull request:

    https://github.com/apache/spark/pull/668

    Proposal: clarify Scala programming guide on caching ...

    ... with regard to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/esjewett/spark-1 Doc-update

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/668.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #668
    
----
commit 171e670d8846c505ab006e707fd1bad3e531f488
Author: Ethan Jewett <es...@gmail.com>
Date:   2014-05-06T16:59:41Z

    Clarify Scala programming guide on caching ...
    
    ... with regard to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes it to be, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/668#discussion_r12340137
  
    --- Diff: docs/scala-programming-guide.md ---
    @@ -278,10 +278,13 @@ iterative algorithms with Spark and for interactive use from the interpreter.
     You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
     it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
     if any partition of an RDD is lost, it will automatically be recomputed using the transformations
    -that originally created it.
    +that originally created it. Note: in a multi-stage job, Spark saves the map output files from map
    --- End diff --
    
    It's a great idea to have this here. This is a totally non-obvious fact and I think many users would like to know this.
    
    My only thought is, would you mind moving this to the end of the "RDD Persistence" section. Also, at this point in the guide I don't think the concept of stages or jobs has been introduced. So it might be good to have something like:
    
    ```
    Spark sometimes automatically persists intermediate state from RDD operations, even without users calling persist() or cache(). In particular, if a shuffle happens when computing an RDD, Spark will keep the outputs from the map side of the shuffle on disk to avoid re-computing the entire dependency graph if an RDD is re-used. We still recommend users call persist() if they plan to re-use an RDD iteratively.
    ```
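
    The behavior described in that suggested wording can be illustrated with a short Scala sketch (a sketch only: it assumes an existing `SparkContext` named `sc`, and the input path is a placeholder):

    ```scala
    // Assumes a running SparkContext `sc`; "hdfs://..." is a placeholder path.
    val lines  = sc.textFile("hdfs://...")
    val pairs  = lines.map(line => (line.length, 1))
    val counts = pairs.reduceByKey(_ + _)   // introduces a shuffle when computed

    counts.count()   // first action: map-side shuffle output is written to disk
    counts.collect() // re-use: Spark can read the saved map output rather than
                     // recomputing `lines` and `pairs` from scratch

    // For iterative re-use, explicit persistence is still the recommendation:
    counts.persist() // equivalent to counts.cache()
    ```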



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/668#issuecomment-42329651
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/668#issuecomment-42386975
  
    Okay I can merge this, thanks!



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/668



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by esjewett <gi...@git.apache.org>.
Github user esjewett commented on the pull request:

    https://github.com/apache/spark/pull/668#issuecomment-42338116
  
    Just putting it out there: I'm not attached to any of this wording, so change away, or don't accept it. No problem either way. I just thought my question on the user list as to whether the programming guide could be updated was better stated as a pull request ;-)



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/668#discussion_r12338655
  
    --- Diff: docs/scala-programming-guide.md ---
    @@ -278,10 +278,13 @@ iterative algorithms with Spark and for interactive use from the interpreter.
     You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
     it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
    --- End diff --
    
    Oh sorry, never mind - this is explained below, and this case only refers to calling `persist()` without arguments.



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by esjewett <gi...@git.apache.org>.
Github user esjewett commented on the pull request:

    https://github.com/apache/spark/pull/668#issuecomment-42345721
  
    @pwendell I like your wording. Switched to use it, and moved it to the end of the "RDD Persistence" section as requested. I also updated the "RDD Operations" section with a small change so as not to imply that RDDs that aren't persist()ed will always be reprocessed.



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/668#discussion_r12338494
  
    --- Diff: docs/scala-programming-guide.md ---
    @@ -278,10 +278,13 @@ iterative algorithms with Spark and for interactive use from the interpreter.
     You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
     it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
    --- End diff --
    
    Not your change, but I think this should say "will be persisted in memory or on disk on the nodes"
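
    The "in memory or on disk" distinction corresponds to the storage level chosen at persist time. A minimal sketch (assuming an existing `SparkContext` named `sc`; an RDD can only be assigned one storage level, so the alternatives are shown commented out):

    ```scala
    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)

    rdd.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY)
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions to disk when memory is tight
    // rdd.persist(StorageLevel.DISK_ONLY)        // keep partitions on disk only

    rdd.count()  // the first action materializes and stores the partitions
    ```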

