Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2015/04/10 00:23:25 UTC

[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/5442

    [SPARK-6806] [SparkR] [Docs] Fill in SparkR examples in programming guide

    sqlCtx -> sqlContext
    
    cc @shivaram 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark r_docs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5442
    
----
commit 23f751a37651e49fbea25efbcd58ac96115bd16f
Author: Davies Liu <da...@databricks.com>
Date:   2015-04-09T22:21:43Z

    Fill in SparkR examples in programming guide
    
    sqlCtx -> sqlContext

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28188875
  
    --- Diff: docs/quick-start.md ---
    @@ -214,6 +286,24 @@ tens or hundreds of nodes. You can also do this interactively by connecting `bin
     a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).
     
     </div>
    +<div data-lang="r" markdown="1">
    +
    +{% highlight r %}
    +> cache(linesWithSpark)
    +
    +> count(linesWithSpark)
    --- End diff --
    
    I get 19 here?




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91713218
  
      [Test build #30056 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30056/consoleFull) for   PR 5442 at commit [`f61de71`](https://github.com/apache/spark/commit/f61de711581cc2f021eae2b1734463a1d10a67f0).




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28179136
  
    --- Diff: docs/programming-guide.md ---
    @@ -477,8 +541,28 @@ the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main
     for examples of using Cassandra / HBase ```InputFormat``` and ```OutputFormat``` with custom converters.
     
     </div>
    +<div data-lang="r"  markdown="1">
    +
    +SparkR can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html).
    +
    +Text file RDDs can be created using `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
    +
    +{% highlight r %}
    +distFile <- textFile(sc, "data.txt")
    +{% endhighlight %}
    +
    +Once created, `distFile` can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the `map` and `reduce` operations as follows: `reduce(map(distFile, length), function(a, b) {a + b})`.
    +
    +Some notes on reading files with Spark:
    +
    +* If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
    --- End diff --
    
    The EC2 link applies to all the languages, so I'd like to leave it out of this section.
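The guide text quoted above sums line lengths with `reduce(map(distFile, length), function(a, b) {a + b})`. For readers following along without a Spark cluster, the same map/reduce pattern can be sketched in plain Python (no Spark; the data and names here are illustrative only):

```python
# Plain-Python sketch of the SparkR pattern
#   reduce(map(distFile, length), function(a, b) { a + b })
# which adds up the lengths of all lines in a dataset.
from functools import reduce

lines = ["Spark is fast", "and general-purpose"]  # stands in for textFile(sc, "data.txt")
line_lengths = map(len, lines)                    # lazy, like the map transformation
total_length = reduce(lambda a, b: a + b, line_lengths)
print(total_length)  # 13 + 19 = 32
```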




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by cafreeman <gi...@git.apache.org>.
Github user cafreeman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28117327
  
    --- Diff: docs/programming-guide.md ---
    @@ -907,6 +1055,30 @@ We could also use `counts.sortByKey()`, for example, to sort the pairs alphabeti
     
     </div>
     
    +<div data-lang="r" markdown="1">
    +
    +While most Spark operations work on RDDs containing any type of objects, a few special operations are
    +only available on RDDs of key-value pairs.
    +The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements
    +by a key.
    +
    +In R, these operations work on RDDs containing built-in R list such as `list(1, 2)`.
    +Simply create such lists and then call your desired operation.
    +
    +For example, the following code uses the `reduceByKey` operation on key-value pairs to count how
    +many times each line of text occurs in a file:
    +
    +{% highlight r %}
    +lines <- textFile(sc, "data.txt")
    +pairs <- map(lines, function(s) list(s, 1))
    +counts <- reduceByKey(pairs, function(a, b){a + b})
    +{% endhighlight %}
    +
    +We could also use `counts.sortByKey()`, for example, to sort the pairs alphabetically, and finally
    +`counts.collect()` to bring them back to the driver program as a list of objects.
    --- End diff --
    
    Both code snippets should be changed to the R syntax, e.g `collect(counts)`.
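As a local illustration of the key-value flow in the quoted snippet (counting how often each line occurs), here is a plain-Python sketch of the `reduceByKey` semantics; no Spark is involved, and the data is illustrative:

```python
# Plain-Python sketch of:
#   pairs  <- map(lines, function(s) list(s, 1))
#   counts <- reduceByKey(pairs, function(a, b) { a + b })
from collections import defaultdict

lines = ["a", "b", "a", "a"]             # stands in for textFile(sc, "data.txt")
pairs = [(line, 1) for line in lines]    # each line becomes a (key, 1) pair
counts = defaultdict(int)
for key, value in pairs:                 # merge values per key, like reduceByKey
    counts[key] += value
print(dict(counts))  # {'a': 3, 'b': 1}
```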




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91692104
  
      [Test build #660 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/660/consoleFull) for   PR 5442 at commit [`2f10a77`](https://github.com/apache/spark/commit/2f10a77d6f560f9b3cdf195947b9a707dc62ecd0).




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91388762
  
      [Test build #29979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29979/consoleFull) for   PR 5442 at commit [`9c2a062`](https://github.com/apache/spark/commit/9c2a062d6d111eeeed3cb2ffaaabfa53b24b3a63).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28268272
  
    --- Diff: docs/programming-guide.md ---
    @@ -576,6 +660,34 @@ before the `reduce`, which would cause `lineLengths` to be saved in memory after
     
     </div>
     
    +<div data-lang="r" markdown="1">
    +
    +To illustrate RDD basics, consider the simple program below:
    +
    +{% highlight r %}
    +lines <- textFile(sc, "data.txt")
    +lineLengths <- map(lines, length)
    +totalLength <- reduce(lineLengths, "+")
    +{% endhighlight %}
    +
    +The first line defines a base RDD from an external file. This dataset is not loaded in memory or
    +otherwise acted on: `lines` is merely a pointer to the file.
    +The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
    +is *not* immediately computed, due to laziness.
    +Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
    +to run on separate machines, and each machine runs both its part of the map and a local reduction,
    +returning only its answer to the driver program.
    +
    +If we also wanted to use `lineLengths` again later, we could add:
    +
    +{% highlight r %}
    +persist(lineLengths)
    --- End diff --
    
    Add "MEMORY_ONLY" as the default value of `newLevel`, to be consistent with the other APIs.
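The laziness described in the quoted passage (nothing runs until the `reduce` action) can be mimicked locally with Python generators; this is a sketch of the evaluation model only, not of Spark itself:

```python
# Generators are lazy like RDD transformations: nothing is read or
# computed until an action (here, sum) pulls values through the chain.
lines = (line for line in ["one line", "another"])  # like textFile: a pointer, not data
line_lengths = (len(line) for line in lines)        # like map(lines, length): still lazy
total_length = sum(line_lengths)                    # the "action" forces evaluation
print(total_length)  # 8 + 7 = 15
```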




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-96811042
  
    **[Test build #31015 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31015/consoleFull)**     for PR 5442 at commit [`89684ce`](https://github.com/apache/spark/commit/89684ce59cfe4d989c2f36495d21ecb142c9881d)     after a configured wait of `120m`.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28571272
  
    --- Diff: docs/programming-guide.md ---
    @@ -576,6 +660,34 @@ before the `reduce`, which would cause `lineLengths` to be saved in memory after
     
     </div>
     
    +<div data-lang="r" markdown="1">
    +
    +To illustrate RDD basics, consider the simple program below:
    +
    +{% highlight r %}
    +lines <- textFile(sc, "data.txt")
    +lineLengths <- map(lines, length)
    +totalLength <- reduce(lineLengths, "+")
    +{% endhighlight %}
    +
    +The first line defines a base RDD from an external file. This dataset is not loaded in memory or
    +otherwise acted on: `lines` is merely a pointer to the file.
    +The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
    +is *not* immediately computed, due to laziness.
    +Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
    +to run on separate machines, and each machine runs both its part of the map and a local reduction,
    +returning only its answer to the driver program.
    +
    +If we also wanted to use `lineLengths` again later, we could add:
    +
    +{% highlight r %}
    +persist(lineLengths)
    --- End diff --
    
    Added a default value for `newLevel` in `persist`.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91709609
  
      [Test build #660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/660/consoleFull) for   PR 5442 at commit [`2f10a77`](https://github.com/apache/spark/commit/2f10a77d6f560f9b3cdf195947b9a707dc62ecd0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28186907
  
    --- Diff: docs/index.md ---
    @@ -54,6 +54,15 @@ Example applications are also provided in Python. For example,
     
         ./bin/spark-submit examples/src/main/python/pi.py 10
     
    +Spark also provides an experimental R API since 1.4 (only RDD and DtaFrame APIs included).
    --- End diff --
    
    DataFrames spelling typo




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28268448
  
    --- Diff: docs/quick-start.md ---
    @@ -214,6 +286,24 @@ tens or hundreds of nodes. You can also do this interactively by connecting `bin
     a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).
     
     </div>
    +<div data-lang="r" markdown="1">
    +
    +{% highlight r %}
    +> cache(linesWithSpark)
    +
    +> count(linesWithSpark)
    --- End diff --
    
    This number may be outdated (the readme could be updated without touching these examples).




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by cafreeman <gi...@git.apache.org>.
Github user cafreeman commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91396398
  
    Left some comments inline, but most of them seem like minor details left over from translating the `PySpark` docs. Overall I think this is looking really good.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by cafreeman <gi...@git.apache.org>.
Github user cafreeman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28117226
  
    --- Diff: docs/programming-guide.md ---
    @@ -743,6 +855,30 @@ def doStuff(self, rdd):
     
     </div>
     
    +<div data-lang="r"  markdown="1">
    +
    +Spark's API relies heavily on passing functions in the driver program to run on the cluster.
    +There are three recommended ways to do this:
    +
    +* [Anonymous functions](http://adv-r.had.co.nz/Functional-programming.html#anonymous-functions),
    +  for simple functions that can be written as an anonymous function.
    +* Top-level functions in a module.
    +
    +For example, to pass a longer function than can be supported using a `lambda`, consider
    --- End diff --
    
    Is this a carryover from the PySpark docs? Probably better not to use `lambda` in the R version.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28188837
  
    --- Diff: docs/quick-start.md ---
    @@ -171,6 +205,44 @@ Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map
     {% endhighlight %}
     
     </div>
    +<div data-lang="r" markdown="1">
    +
    +{% highlight r %}
    +> reduce(map(textFile, function(line) { length(strsplit(line, " ")[[1]]) }), function(a, b) {max(a, b)})
    +[1] 14
    +{% endhighlight %}
    +
    +This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to `map` and `reduce` are R [anonymous functions](http://adv-r.had.co.nz/Functional-programming.html#anonymous-functions),
    +but we can also pass any top-level R function we want.
    +For example, we'll define a `mymax` function to make this code easier to understand:
    +
    +{% highlight r %}
    +> mymax <- function(a, b) { max(a, b) }
    +> reduce(map(textFile, function(line) { length(strsplit(line, " ")[[1]]) }), mymax)
    +[1] 14
    +{% endhighlight %}
    +
    +One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
    +
    +{% highlight r %}
    +> wordCounts <- reduceByKey(map(flatMap(textFile, function(line) strsplit(line, " ")[[1]]), function(word) list(word, 1)), "+", 2)
    --- End diff --
    
    Can you split this into multiple lines so it becomes more readable? e.g. something like
    ```
    wordCounts <- reduceByKey(
      map(
        flatMap(textFile, function(line) strsplit(line, " ")[[1]]), 
        function(word) list(word, 1)), 
     "+",  2)
    ```
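The word-count flow in the quoted snippet (flatMap to words, map to `(word, 1)` pairs, then `reduceByKey` with `+`) can also be sketched locally in plain Python; the data and names are illustrative, not Spark's API:

```python
# Plain-Python sketch of the MapReduce word-count flow quoted above.
from collections import Counter

lines = ["to be or", "not to be"]                       # stands in for the text file
words = [w for line in lines for w in line.split(" ")]  # flatMap + strsplit
word_counts = Counter()
for word in words:                                      # reduceByKey with "+"
    word_counts[word] += 1
print(word_counts["to"], word_counts["be"])  # 2 2
```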




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91710621
  
      [Test build #30052 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30052/consoleFull) for   PR 5442 at commit [`3ef7cf3`](https://github.com/apache/spark/commit/3ef7cf3f1797e80b82a05cdb0b5015265840a156).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91383127
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29975/
    Test PASSed.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by cafreeman <gi...@git.apache.org>.
Github user cafreeman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28117477
  
    --- Diff: docs/quick-start.md ---
    @@ -214,6 +286,24 @@ tens or hundreds of nodes. You can also do this interactively by connecting `bin
     a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).
     
     </div>
    +<div data-lang="r" markdown="1">
    +
    +{% highlight r %}
    +> cache(linesWithSpark)
    +
    +> count(linesWithSpark)
    +[1] 15
    +
    +> count(linesWithSpark)
    +[1] 15
    --- End diff --
    
    Is `count` supposed to be repeated?




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-92488896
  
      [Test build #30183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30183/consoleFull) for   PR 5442 at commit [`89684ce`](https://github.com/apache/spark/commit/89684ce59cfe4d989c2f36495d21ecb142c9881d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28118060
  
    --- Diff: docs/index.md ---
    @@ -54,6 +54,15 @@ Example applications are also provided in Python. For example,
     
         ./bin/spark-submit examples/src/main/python/pi.py 10
     
    +Spark also provides a R API. To run Spark interactively in a R interpreter, use
    --- End diff --
    
    I think here (or somewhere close by) we should say that SparkR is an experimental component in `<SPARK_VERSION>` and that only the RDD API and DataFrame APIs have been implemented in SparkR. 




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-92463104
  
    @shivaram I've addressed your comments; could you take another pass?




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28189082
  
    --- Diff: docs/programming-guide.md ---
    @@ -576,6 +660,34 @@ before the `reduce`, which would cause `lineLengths` to be saved in memory after
     
     </div>
     
    +<div data-lang="r" markdown="1">
    +
    +To illustrate RDD basics, consider the simple program below:
    +
    +{% highlight r %}
    +lines <- textFile(sc, "data.txt")
    +lineLengths <- map(lines, length)
    +totalLength <- reduce(lineLengths, "+")
    +{% endhighlight %}
    +
    +The first line defines a base RDD from an external file. This dataset is not loaded in memory or
    +otherwise acted on: `lines` is merely a pointer to the file.
    +The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
    +is *not* immediately computed, due to laziness.
    +Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
    +to run on separate machines, and each machine runs both its part of the map and a local reduction,
    +returning only its answer to the driver program.
    +
    +If we also wanted to use `lineLengths` again later, we could add:
    +
    +{% highlight r %}
    +persist(lineLengths)
    --- End diff --
    
    This should either be `persist(lineLengths, "MEMORY_ONLY")` or `cache(lineLengths)`. The current one gives an error.
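
    To make the suggested fix concrete, here is a minimal sketch (assuming the
    SparkR RDD API with `textFile`, `map`, `cache`, and `persist`, an existing
    context `sc`, and "data.txt" as a placeholder path):

    ```r
    lines <- textFile(sc, "data.txt")
    lineLengths <- map(lines, length)

    # A bare persist(lineLengths) errors; either form below works:
    cache(lineLengths)                     # shorthand for the default MEMORY_ONLY level
    # persist(lineLengths, "MEMORY_ONLY")  # same effect, storage level spelled out
    ```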




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-92488912
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30183/
    Test PASSed.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91729503
  
      [Test build #30056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30056/consoleFull) for   PR 5442 at commit [`f61de71`](https://github.com/apache/spark/commit/f61de711581cc2f021eae2b1734463a1d10a67f0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91694199
  
      [Test build #30052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30052/consoleFull) for   PR 5442 at commit [`3ef7cf3`](https://github.com/apache/spark/commit/3ef7cf3f1797e80b82a05cdb0b5015265840a156).




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91729515
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30056/
    Test PASSed.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91371701
  
      [Test build #29975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29975/consoleFull) for   PR 5442 at commit [`23f751a`](https://github.com/apache/spark/commit/23f751a37651e49fbea25efbcd58ac96115bd16f).




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91374376
  
    @cafreeman -- If you get a chance, could you take a look at this too?




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91691308
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30047/
    Test FAILed.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91388767
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29979/
    Test PASSed.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by cafreeman <gi...@git.apache.org>.
Github user cafreeman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28117232
  
    --- Diff: docs/programming-guide.md ---
    @@ -743,6 +855,30 @@ def doStuff(self, rdd):
     
     </div>
     
    +<div data-lang="r"  markdown="1">
    +
    +Spark's API relies heavily on passing functions in the driver program to run on the cluster.
    +There are three recommended ways to do this:
    +
    +* [Anonymous functions](http://adv-r.had.co.nz/Functional-programming.html#anonymous-functions),
    +  for simple functions that can be written as an anonymous function.
    +* Top-level functions in a module.
    +
    +For example, to pass a longer function than can be supported using a `lambda`, consider
    +the code below:
    +
    +{% highlight r %}
    +"""MyScript.py"""
    +myFunc <- funciton(s) {
    --- End diff --
    
    nit: `function` is misspelled.
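
    For reference, a corrected sketch of the quoted snippet with the typo fixed
    (the function body and the `.R` file name are hypothetical, added only to
    make the example complete):

    ```r
    # "MyScript.R" -- corrected version of the quoted snippet (`funciton` -> `function`);
    # the body is a hypothetical word-count-per-line example:
    myFunc <- function(s) {
      words <- strsplit(s, " ")[[1]]
      length(words)
    }
    ```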




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91710627
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30052/
    Test PASSed.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91383115
  
      [Test build #29975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29975/consoleFull) for   PR 5442 at commit [`23f751a`](https://github.com/apache/spark/commit/23f751a37651e49fbea25efbcd58ac96115bd16f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91374238
  
      [Test build #29979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29979/consoleFull) for   PR 5442 at commit [`9c2a062`](https://github.com/apache/spark/commit/9c2a062d6d111eeeed3cb2ffaaabfa53b24b3a63).




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-91725063
  
    I did one more pass, trying out all the code samples, and left a few comments. One last thing: we could probably add a note in the accumulators section that accumulators are not supported in R yet, with a link to https://issues.apache.org/jira/browse/SPARK-6815




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28118142
  
    --- Diff: docs/programming-guide.md ---
    @@ -1582,4 +1771,4 @@ For help on deploying, the [cluster mode overview](cluster-overview.html) descri
     in distributed operation and supported cluster managers.
     
     Finally, full API documentation is available in
    -[Scala](api/scala/#org.apache.spark.package), [Java](api/java/) and [Python](api/python/).
    +[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) [R](api/R/).
    --- End diff --
    
    Minor nit: Comma after python here




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by cafreeman <gi...@git.apache.org>.
Github user cafreeman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28116962
  
    --- Diff: docs/programming-guide.md ---
    @@ -477,8 +541,28 @@ the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main
     for examples of using Cassandra / HBase ```InputFormat``` and ```OutputFormat``` with custom converters.
     
     </div>
    +<div data-lang="r"  markdown="1">
    +
    +SparkR can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html).
    +
    +Text file RDDs can be created using `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
    +
    +{% highlight r %}
    +distFile <- textFile(sc, "data.txt")
    +{% endhighlight %}
    +
    +Once created, `distFile` can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the `map` and `reduce` operations as follows: `reduce(map(distFile, length), function(a, b) {a + b})`.
    +
    +Some notes on reading files with Spark:
    +
    +* If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
    --- End diff --
    
    Might be worth linking to either the EC2 article on the SparkR wiki or the Spark doc about distributing a file across a cluster (the `copy-dir` stuff):
    
    https://github.com/amplab-extras/SparkR-pkg/wiki/SparkR-on-EC2
    
    https://spark.apache.org/docs/latest/ec2-scripts.html
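
    The quoted one-liner expands to the following sketch (assuming the SparkR
    RDD API, an existing context `sc`, and "data.txt" as a placeholder path):

    ```r
    distFile <- textFile(sc, "data.txt")
    # Sum of per-line lengths, written out step by step; equivalent to the
    # inlined reduce(map(distFile, length), function(a, b) {a + b}) in the doc:
    lineLengths <- map(distFile, length)
    totalChars <- reduce(lineLengths, function(a, b) { a + b })
    ```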




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-93897920
  
    @shivaram Should we merge this or wait for the API audit?




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28186786
  
    --- Diff: R/pkg/R/pairRDD.R ---
    @@ -327,7 +327,7 @@ setMethod("reduceByKey",
                   convertEnvsToList(keys, vals)
                 }
                 locallyReduced <- lapplyPartition(x, reduceVals)
    -            shuffled <- partitionBy(locallyReduced, numPartitions)
    +            shuffled <- partitionBy(locallyReduced, as.integer(numPartitions))
    --- End diff --
    
    We should use `numToInt` from utils.R here
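
    For context, `numToInt` is roughly a checked integer coercion. A
    hypothetical sketch of that behavior (not the actual utils.R source):

    ```r
    # Coerce a numeric to integer, warning when the value is not a whole number:
    numToInt <- function(num) {
      if (as.integer(num) != num) {
        warning("Coercing double to integer; fractional part will be dropped")
      }
      as.integer(num)
    }

    numToInt(4)    # integer 4
    numToInt(4.5)  # integer 4, with a warning
    ```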




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28189094
  
    --- Diff: docs/programming-guide.md ---
    @@ -743,6 +855,29 @@ def doStuff(self, rdd):
     
     </div>
     
    +<div data-lang="r"  markdown="1">
    +
    +Spark's API relies heavily on passing functions in the driver program to run on the cluster.
    +There are three recommended ways to do this:
    +
    +* [Anonymous functions](http://adv-r.had.co.nz/Functional-programming.html#anonymous-functions),
    +  for simple functions that can be written as an anonymous function.
    +* Top-level functions in a module.
    +
    +For example, to pass a longer function, consider the code below:
    +
    +{% highlight r %}
    +"""MyScript.py"""
    --- End diff --
    
    MyScript.R ?




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28178692
  
    --- Diff: docs/quick-start.md ---
    @@ -214,6 +286,24 @@ tens or hundreds of nodes. You can also do this interactively by connecting `bin
     a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).
     
     </div>
    +<div data-lang="r" markdown="1">
    +
    +{% highlight r %}
    +> cache(linesWithSpark)
    +
    +> count(linesWithSpark)
    +[1] 15
    +
    +> count(linesWithSpark)
    +[1] 15
    --- End diff --
    
    yes 




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5442#issuecomment-92462452
  
      [Test build #30183 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30183/consoleFull) for   PR 5442 at commit [`89684ce`](https://github.com/apache/spark/commit/89684ce59cfe4d989c2f36495d21ecb142c9881d).




[GitHub] spark pull request: [SPARK-6806] [SparkR] [Docs] Fill in SparkR ex...

Posted by cafreeman <gi...@git.apache.org>.
Github user cafreeman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5442#discussion_r28116656
  
    --- Diff: docs/programming-guide.md ---
    @@ -246,6 +279,23 @@ your notebook before you start to try Spark from the IPython notebook.
     
     </div>
     
    +<div data-lang="r"  markdown="1">
    +
    +In the SparkR shell, a special interpreter-aware SparkContext is already created for you, in the
    +variable called `sc`. Making your own SparkContext will not work. You can set which master the
    +context connects to using the `--master` argument. You can also add dependencies
    +(e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates
    +to the `--packages` argument. Any additional repositories where dependencies might exist (e.g. SonaType)
    +can be passed to the `--repositories` argument. For example, to run `bin/pyspark` on exactly four cores, use:
    --- End diff --
    
    This should refer to SparkR instead of PySpark.

