Posted to reviews@spark.apache.org by esjewett <gi...@git.apache.org> on 2014/05/06 19:01:24 UTC

[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

GitHub user esjewett opened a pull request:

    https://github.com/apache/spark/pull/668

    Proposal: clarify Scala programming guide on caching ...

    ... with regard to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/esjewett/spark-1 Doc-update

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/668.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #668
    
----
commit 171e670d8846c505ab006e707fd1bad3e531f488
Author: Ethan Jewett <es...@gmail.com>
Date:   2014-05-06T16:59:41Z

    Clarify Scala programming guide on caching ...
    
    ... with regard to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes it to be, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/668#discussion_r12340137
  
    --- Diff: docs/scala-programming-guide.md ---
    @@ -278,10 +278,13 @@ iterative algorithms with Spark and for interactive use from the interpreter.
     You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
     it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
     if any partition of an RDD is lost, it will automatically be recomputed using the transformations
    -that originally created it.
    +that originally created it. Note: in a multi-stage job, Spark saves the map output files from map
    --- End diff --
    
    It's a great idea to have this here. This is a totally non-obvious fact and I think many users would like to know this.
    
    My only thought is, would you mind moving this to the end of the "RDD Persistence" section. Also, at this point in the guide I don't think the concept of stages or jobs has been introduced. So it might be good to have something like:
    
    ```
    Spark sometimes automatically persists intermediate state from RDD operations, even without users calling persist() or cache(). In particular, if a shuffle happens when computing an RDD, Spark will keep the outputs from the map side of the shuffle on disk to avoid re-computing the entire dependency graph if an RDD is re-used. We still recommend users call persist() if they plan to re-use an RDD iteratively.
    ```
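
    The behavior described in that suggested wording can be illustrated with a short Scala sketch (a sketch only: it assumes an existing `SparkContext` named `sc`, and the input path is a placeholder):

    ```scala
    // Assumes a running SparkContext `sc`; "hdfs://..." is a placeholder path.
    val lines  = sc.textFile("hdfs://...")
    val pairs  = lines.map(line => (line.length, 1))
    val counts = pairs.reduceByKey(_ + _)   // introduces a shuffle when computed

    counts.count()   // first action: map-side shuffle output is written to disk
    counts.collect() // re-use: Spark can read the saved map output rather than
                     // recomputing `lines` and `pairs` from scratch

    // For iterative re-use, explicit persistence is still the recommendation:
    counts.persist() // equivalent to counts.cache()
    ```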



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/668#issuecomment-42329651
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/668#issuecomment-42386975
  
    Okay I can merge this, thanks!



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/668



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by esjewett <gi...@git.apache.org>.
Github user esjewett commented on the pull request:

    https://github.com/apache/spark/pull/668#issuecomment-42338116
  
    Just putting it out there: I'm not attached to any of this wording, so change away, or don't accept it. No problem either way. I just thought my question on the user list as to whether the programming guide could be updated was better stated as a pull request ;-)



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/668#discussion_r12338655
  
    --- Diff: docs/scala-programming-guide.md ---
    @@ -278,10 +278,13 @@ iterative algorithms with Spark and for interactive use from the interpreter.
     You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
     it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
    --- End diff --
    
    Oh sorry, never mind - this is explained below, and this case only refers to calling `persist()` without arguments.



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by esjewett <gi...@git.apache.org>.
Github user esjewett commented on the pull request:

    https://github.com/apache/spark/pull/668#issuecomment-42345721
  
    @pwendell I like your wording. Switched to use it, and moved it to the end of the "RDD Persistence" section as requested. I also updated the "RDD Operations" section with a small change so as not to imply that RDDs that aren't persist()ed will always be reprocessed.



[GitHub] spark pull request: Proposal: clarify Scala programming guide on c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/668#discussion_r12338494
  
    --- Diff: docs/scala-programming-guide.md ---
    @@ -278,10 +278,13 @@ iterative algorithms with Spark and for interactive use from the interpreter.
     You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
     it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
    --- End diff --
    
    Not your change, but I think this should say "will be persisted in memory or on disk on the nodes"
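
    The "in memory or on disk" distinction corresponds to the storage level chosen at persist time. A minimal sketch (assuming an existing `SparkContext` named `sc`; an RDD can only be assigned one storage level, so the alternatives are shown commented out):

    ```scala
    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)

    rdd.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY)
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions to disk when memory is tight
    // rdd.persist(StorageLevel.DISK_ONLY)        // keep partitions on disk only

    rdd.count()  // the first action materializes and stores the partitions
    ```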

