Posted to reviews@spark.apache.org by actuaryzhang <gi...@git.apache.org> on 2017/03/04 09:03:43 UTC

[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/17161

    [SPARK-19819][SparkR] Use concrete data in SparkR DataFrame examples

    ## What changes were proposed in this pull request?
    Many examples in SparkDataFrame methods use:
    ```
    path <- "path/to/file.json"
    df <- read.json(path)
    ```
    This is not directly runnable. Replace it with concrete numerical examples so that users can execute the examples directly.
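    For example, an equivalent runnable form (a minimal sketch of the proposed style; it assumes a SparkR session has been started) could be:
    ```
    df <- createDataFrame(faithful)   # faithful is a built-in R dataset
    head(df)                          # inspect the result on concrete data
    ```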
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark sparkRDoc2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17161.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17161
    
----
commit 7e50acc4f270400aa00a728a9487ee1d5c1cc4fc
Author: actuaryzhang <ac...@gmail.com>
Date:   2017-03-04T08:45:56Z

    update dataframe doc with examples

commit c4c14adf399bb7b5d272d574113edb7e9ee26d01
Author: actuaryzhang <ac...@gmail.com>
Date:   2017-03-04T08:53:27Z

    update examples

----




[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    **[Test build #73896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73896/testReport)** for PR 17161 at commit [`c4c14ad`](https://github.com/apache/spark/commit/c4c14adf399bb7b5d272d574113edb7e9ee26d01).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    @felixcheung, please let me leave my humble opinion.
    
    In general, I like this change. Maybe not every example needs to be runnable (at least for now), but I believe making them runnable is still an improvement; for example, it allows users to copy and paste an example and test it easily.
    To me, this is similar to how we (or at least I) prefer a JIRA to have a self-contained reproducer. Personally, when I first learn and try new APIs as a user, I just copy and paste the example and check the input/output.
    
    > Firstly, I see this as slightly different from Python, in that in R it is common to have built-in datasets and possibly users are used to having them and having examples using them.
    
    In the case of Python examples, to my knowledge, it is also common to document self-runnable examples (e.g., [pandas](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot) and [numpy](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)). Moreover, they are tested in doctests, so there is no overhead of checking whether they are runnable every time. Maybe let's talk about this in the PR or JIRA (I am sorry for dragging the Python one into this).
    
    > I have done a pass on the changes in this PR and I'm happy with changing from non-existing json file to mtcars. I'm slightly concerned with the few cases of artificial 3-row data (like here) - more on that below, on small datasets.
    
    Maybe we could take out the examples that you are concerned about for now.
    
    
    For the four concerns you described, I do agree in general. If the downsides outweighed the improvements I would disagree, but to me the improvement seems greater than the downsides of the change itself.
    
    >how much work and how much change is it to change all examples (this is only 1 .R out of 20-something files we have, in a total of 300+ methods which is on the high side for R packages)
    
    I guess the first one is about reviewing cost and backporting/conflict concerns. In terms of reviewing cost, I am willing to help verify the examples by running them manually, even if there are many, if the change is big.
    
    > how much churn will it be to keep them up-to-date when we are having changes to API (eg. sparkR.session()); especially since in order to have examples self-contained we tend to add additional calls to manipulate data and thereby increasing the number of references of API calls; which could get stale easily like the example with insertInto
    
    For the second concern, if it makes maintenance harder that would be an important point, but I think we are being conservative about API changes from Spark 2.x, so I guess we would be less likely to need to sweep these. I think it is okay even if they change in the near future, as long as the changes are not too big.
    
    > perhaps more importantly, how practical or useful it would be to use built-in datasets or native R data.frame (mtcars, cars, Titanic, iris, or make up some; that are super small) on a scalable data platform like Spark? perhaps it is better to demonstrate, in examples, how to work with external data sources, multiple file formats etc.?
    
    > and lastly, we still have about a dozen methods that are without example that are being flagged by CRAN checks (but not enough to fail it yet)
    
    For the third and fourth concerns, I think we could improve this by adding more examples and explaining and describing the details. For example, we could leave some comments such as "this is an example, but in practice ...". Maybe we could do this in another PR. Otherwise, we could leave the examples as they are, with some proper comments.
    
    As for other thoughts, I think it would be great if they were at least run in the tests.




[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17161#discussion_r104306095
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -548,10 +537,9 @@ setMethod("registerTempTable",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' df <- read.df(path, "parquet")
    -#' df2 <- read.df(path2, "parquet")
    -#' createOrReplaceTempView(df, "table1")
    -#' insertInto(df2, "table1", overwrite = TRUE)
    +#' df <- limit(createDataFrame(faithful), 5)
    --- End diff --
    
    why limit?




[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17161#discussion_r104306070
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1123,10 +1096,9 @@ setMethod("dim",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    +#' df <- createDataFrame(mtcars)
     #' collected <- collect(df)
    -#' firstName <- collected[[1]]$name
    +#' collected[[1]]
    --- End diff --
    
    right, that seems rather unnecessary. any other idea on how to show it is a data.frame?
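    One option (a minimal sketch; assumes a running SparkR session) would be to print the class of the collected result instead of indexing into it:
    ```
    df <- createDataFrame(mtcars)
    collected <- collect(df)
    class(collected)   # "data.frame": collect() returns a local R data.frame
    ```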




[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73896/
    Test PASSed.




[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17161#discussion_r104306047
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2805,10 +2779,9 @@ setMethod("except",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    -#' write.df(df, "myfile", "parquet", "overwrite")
    -#' saveDF(df, parquetPath2, "parquet", mode = saveMode, mergeSchema = mergeSchema)
    +#' df <- createDataFrame(mtcars)
    +#' write.df(df, tempfile(), "parquet", "overwrite")
    --- End diff --
    
    I think we should avoid having `tempfile()` as the output path in the example, as that might point users in the wrong direction: anything saved to a tempfile will disappear as soon as the R session ends.
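    An explicit path (a minimal sketch; the file name here is made up for illustration) might be less misleading:
    ```
    df <- createDataFrame(mtcars)
    write.df(df, "mtcars.parquet", "parquet", "overwrite")   # persists beyond the R session
    ```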




[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang closed the pull request at:

    https://github.com/apache/spark/pull/17161




[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17161#discussion_r104306085
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -741,12 +724,12 @@ setMethod("coalesce",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    +#' df <- createDataFrame(mtcars)
    +#' newDF <- coalesce(df, 1L)
    --- End diff --
    
    should probably not have coalesce in the example blob for repartition
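    A repartition-only version (a minimal sketch; assumes a running SparkR session) might read:
    ```
    df <- createDataFrame(mtcars)
    newDF <- repartition(df, 2L)             # by number of partitions
    newDF <- repartition(df, col = df$cyl)   # by column
    ```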




[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17161#discussion_r104283930
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -92,8 +92,7 @@ dataFrame <- function(sdf, isCached = FALSE) {
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    +#' df <- createDataFrame(mtcars)
    --- End diff --
    
    Should we define `mtcars` to run this example?




[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17161#discussion_r104284980
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -92,8 +92,7 @@ dataFrame <- function(sdf, isCached = FALSE) {
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    +#' df <- createDataFrame(mtcars)
    --- End diff --
    
    `mtcars` is built in to R, like `iris`
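    Since it ships with base R's datasets package, no setup is needed before the call (a minimal sketch; assumes a running SparkR session):
    ```
    str(mtcars)                     # available in any R session
    df <- createDataFrame(mtcars)   # no prior definition of mtcars required
    ```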




[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    **[Test build #73896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73896/testReport)** for PR 17161 at commit [`c4c14ad`](https://github.com/apache/spark/commit/c4c14adf399bb7b5d272d574113edb7e9ee26d01).




[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    I think most examples in R packages are (supposed to be) runnable. Coming from a user perspective, I find it useful if I can run the examples directly and see what the function does in action. Since we already have the pseudo-code here, wouldn't it be better to change it to real data? 
    Especially for the more complicated cases like `join`, providing self-contained examples will save users much time in constructing their own examples.
    
    Indeed, by making the examples runnable, I have found and fixed several issues with the pseudo examples. For example, the original example in `insertInto` seems to be wrong:
    ``` 
    createOrReplaceTempView(df, "table1")   # This should be saveAsTable
    insertInto(df2, "table1", overwrite = TRUE)
    ```
    This is very hard to find without running real examples. 
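    For reference, the corrected pattern (a minimal sketch; assumes a running SparkR session, and the table name is made up for illustration) would be:
    ```
    df <- createDataFrame(faithful)
    saveAsTable(df, "table1")                   # insertInto targets a saved table, not a temp view
    insertInto(df, "table1", overwrite = TRUE)
    ```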
    
    @srowen @felixcheung 




[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    This rings a bell. I opened a PR, https://github.com/apache/spark/pull/16824, to make the examples in the Python API doc self-contained, but closed it as it did not seem to draw much interest from other committers. I would like to see whether the R committers support this, and then reopen it if they like the idea.




[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17161#discussion_r104306091
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -741,12 +724,12 @@ setMethod("coalesce",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    +#' df <- createDataFrame(mtcars)
    +#' newDF <- coalesce(df, 1L)
     #' newDF <- repartition(df, 2L)
     #' newDF <- repartition(df, numPartitions = 2L)
    -#' newDF <- repartition(df, col = df$"col1", df$"col2")
    -#' newDF <- repartition(df, 3L, col = df$"col1", df$"col2")
    +#' newDF <- repartition(df, col = df[[1]], df[[2]])
    --- End diff --
    
    showing column references with `$name` in the example is important too
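    Perhaps both forms could be shown (a minimal sketch using `mtcars` column names for illustration):
    ```
    newDF <- repartition(df, col = df$mpg, df$cyl)        # reference columns by $name
    newDF <- repartition(df, 3L, col = df[[1]], df[[2]])  # or by position
    ```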




[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    Firstly, I see this as slightly different from Python, in that in R it is common to have built-in datasets and possibly users are used to having them and having examples using them.
    
    And as of now, many of our examples are not meant to be runnable, and they are clearly indicated as such.
    
    I have done a pass on the changes in this PR and I'm happy with changing from non-existing json file to `mtcars`. I'm slightly concerned with the few cases of artificial 3-row data (like [here](https://github.com/apache/spark/pull/17161/files#diff-508641a8bd6c6b59f3e77c80cdcfa6a9R2483)) - more on that below, on small datasets.
    
    That said, I wonder about the verbosity of adding to examples like this, as in the Python discussions, and, since we have more than 300 pages of API doc, changing them all is not a simple task.
    
    But I do agree that not having broken or incorrect examples is very important.
    
    My concerns are:
    - how much work, and how much change, is it to change all examples (this is only 1 .R file out of the 20-something we have, in a total of 300+ methods, which is on the high side for R packages)
    - how much churn will it be to keep them up-to-date when we make changes to the API (eg. `sparkR.session()`); especially since, in order to keep examples self-contained, we tend to add additional calls to manipulate data, thereby increasing the number of references to API calls
    - perhaps more importantly, how practical or useful would it be to use built-in datasets or native R data.frames (`mtcars`, `cars`, `Titanic`, `iris`, or made-up ones; all super small) on a scalable data platform like Spark? perhaps it is better to demonstrate, in examples, how to work with external data sources, multiple file formats, etc.?
    - and lastly, we still have about a dozen methods without examples that are being flagged by CRAN checks (but not enough to fail them yet)
    
    Couple of *random* thoughts (would be interested to see how they look first!):
    - group smaller functions into a single page and share a longer, more concrete example (need to check if it messes up parameter documentation or makes it more confusing! or how it might affect method help discoverability, like with `?predict`) (btw, this is the approach we take for ML methods)
    - reference external example files
    - have examples using datasets that come with Spark (like [this one](https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json))
    - have examples in templates and reuse them
    - keep existing page breakdown but instead of scattering examples around in each, link to a special group of pages (via `@seealso`) with longer, more concrete examples (eg. column manipulation set)
    - make examples run (ie. remove dontrun); this, of course, would require making sure examples are self-contained and correct (this is a bigger effort; it could also extend build time and/or make builds fail more often, as examples would then run as part of the CRAN check) (?!)
    
    I suspect we would likely need a combination or subset of these techniques.
    To me, the high-level priorities would be, in order: i) example correctness; ii) example coverage - we should have some examples for every method; iii) better, richer, self-contained examples in strategic places.
    
    Thoughts?


