You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/02/01 03:41:20 UTC

[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/20464

    [SPARK-23291][SQL][R] R's substr should not reduce starting position by 1 when calling Scala API

    ## What changes were proposed in this pull request?
    
    Seems R's substr API treats Scala substr API as zero based and so subtracts the given starting position by 1. Because SQL's substr also accepts zero-based starting position, this also causes incorrect results when the starting position is greater than 1, but we don't have related tests in R.
    
    ## How was this patch tested?
    
    Modified tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23291

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20464.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20464
    
----
commit a2ffdc14ebfa67656e3598f0a0a0131f18f98aa5
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-02-01T03:10:53Z

    R's substr should not reduce starting position by 1 when calling Scala API.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #88039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88039/testReport)** for PR 20464 at commit [`8c1a8ec`](https://github.com/apache/spark/commit/8c1a8ec46ea28ce17fcaae42aa7b9955cb34bfc8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    merged to master, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    I think @felixcheung has the most context here, so I'd suggest we wait for his comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Also @shivaram 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Because 2.3 is released, ping @felixcheung again


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r165582806
  
    --- Diff: R/pkg/R/column.R ---
    @@ -169,7 +169,7 @@ setMethod("alias",
     #' @note substr since 1.4.0
     setMethod("substr", signature(x = "Column"),
               function(x, start, stop) {
    -            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
    +            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
    --- End diff --
    
    > * why consider other changes as a follow up and not here? [#20464 (comment)](https://github.com/apache/spark/pull/20464#issuecomment-362162014)
    
    Just because I think it is another issue regarding 0/negative indices. I can deal it here if you strongly feel it is better.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #86908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86908/testReport)** for PR 20464 at commit [`a2ffdc1`](https://github.com/apache/spark/commit/a2ffdc14ebfa67656e3598f0a0a0131f18f98aa5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/459/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88039/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
GitHub user viirya reopened a pull request:

    https://github.com/apache/spark/pull/20464

    [SPARK-23291][SQL][R] R's substr should not reduce starting position by 1 when calling Scala API

    ## What changes were proposed in this pull request?
    
    Seems R's substr API treats Scala substr API as zero based and so subtracts the given starting position by 1.
    
    Because Scala's substr API also accepts zero-based starting position (treated as the first element), so the current R's substr test results are correct as they all use 1 as starting positions.
    
    ## How was this patch tested?
    
    Modified tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23291

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20464.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20464
    
----
commit a2ffdc14ebfa67656e3598f0a0a0131f18f98aa5
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-02-01T03:10:53Z

    R's substr should not reduce starting position by 1 when calling Scala API.

commit 95c8a4e48e8f760bb9ca0df844136d19452521d7
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-02-01T09:02:16Z

    Add a note to migration guide of R doc.

commit d994d76d45e474b3e4a31fff8250c30efef6a757
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-02-02T08:30:43Z

    Fix doc.

commit 0ebdf74942e0894bfaf6cbede4c03fd3f5d26411
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-03-06T04:54:48Z

    Improve doc.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86908/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #86985 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86985/testReport)** for PR 20464 at commit [`d994d76`](https://github.com/apache/spark/commit/d994d76d45e474b3e4a31fff8250c30efef6a757).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Sorry, I'm a bit occupied with testing 2.3 RC, will get back to this after.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #86924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86924/testReport)** for PR 20464 at commit [`95c8a4e`](https://github.com/apache/spark/commit/95c8a4e48e8f760bb9ca0df844136d19452521d7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    @felixcheung Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #86924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86924/testReport)** for PR 20464 at commit [`95c8a4e`](https://github.com/apache/spark/commit/95c8a4e48e8f760bb9ca0df844136d19452521d7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1347/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r165570529
  
    --- Diff: docs/sparkr.md ---
    @@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
      - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected.
      - For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`.
      - A warning can be raised if versions of SparkR package and the Spark JVM do not match.
    +
    +## Upgrading to Spark 2.4.0
    +
    + - The first parameter of `substr` method was wrongly subtracted by one, previously. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been corrected.
    --- End diff --
    
    instead of  `The first parameter of ` -> `The ``start`` parameter of...`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86985/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r165584849
  
    --- Diff: docs/sparkr.md ---
    @@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
      - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected.
      - For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`.
      - A warning can be raised if versions of SparkR package and the Spark JVM do not match.
    +
    +## Upgrading to Spark 2.4.0
    +
    + - The first parameter of `substr` method was wrongly subtracted by one, previously. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been corrected.
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #86985 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86985/testReport)** for PR 20464 at commit [`d994d76`](https://github.com/apache/spark/commit/d994d76d45e474b3e4a31fff8250c30efef6a757).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r172033021
  
    --- Diff: R/pkg/R/column.R ---
    @@ -169,7 +169,7 @@ setMethod("alias",
     #' @note substr since 1.4.0
     setMethod("substr", signature(x = "Column"),
               function(x, start, stop) {
    -            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
    +            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
    --- End diff --
    
    I think we should do two things:
    1. add to the func doc that the `start` param should be 0-base and to add to the example with the result
    `collect(select(df, substr(df$a, 0, 5))) # this should give you...`



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Thanks for clarifying @viirya. Is the PR description accurate ? I read it as `..SQL's substr also accepts zero-based starting position` while R uses a 1-based starting position.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/473/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    I was just manually double checking both substr in R and this. It seems correct; however, I think we should add a note in the doc and release note ... 
    
    One followup question is though, would it be difficult to match the behaviour with substr in R when the index is 0 or minus? If i understood https://github.com/apache/spark/pull/20464#issuecomment-362150632 correctly, it sounds better to match it to substr's behaviour in R. Took a quick look/test and seems we can just set `start` to 1 for both cases.
    
    If this followup question is something we are not sure yet, I think we might be okay as is.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Just in case, I am testing with:
    
    ```R
    df <- createDataFrame(list(list(a="abcdef")))
    collect(select(df, substr(df$a, 4, 5)))
    substr("abcdef", 4, 5)
    ```
    
    just in case it helps to check and reproduce.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/525/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #88039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88039/testReport)** for PR 20464 at commit [`8c1a8ec`](https://github.com/apache/spark/commit/8c1a8ec46ea28ce17fcaae42aa7b9955cb34bfc8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #87993 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87993/testReport)** for PR 20464 at commit [`0ebdf74`](https://github.com/apache/spark/commit/0ebdf74942e0894bfaf6cbede4c03fd3f5d26411).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r172406939
  
    --- Diff: R/pkg/R/column.R ---
    @@ -169,7 +169,7 @@ setMethod("alias",
     #' @note substr since 1.4.0
     setMethod("substr", signature(x = "Column"),
               function(x, start, stop) {
    -            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
    +            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
    --- End diff --
    
    I think you mean 1-based,


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r165584627
  
    --- Diff: R/pkg/R/column.R ---
    @@ -169,7 +169,7 @@ setMethod("alias",
     #' @note substr since 1.4.0
     setMethod("substr", signature(x = "Column"),
               function(x, start, stop) {
    -            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
    +            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
    --- End diff --
    
    > is there a way to make the behavior the same before this change for any caller calling substr with common index like 0
    
    Should we keep the behavior when calling substr with 0 as start index?
    
    ```R
    > df <- createDataFrame(list(list(a="abcdef")))
    > collect(select(df, substr(df$a, 0, 5)))
      substring(a, -1, 6)
    1                   f
    > substr("abcdef", 0, 5)
    [1] "abcde"
    ```
    
    I think the previous behavior is pretty unreasonable..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r172750404
  
    --- Diff: docs/sparkr.md ---
    @@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
      - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected.
      - For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`.
      - A warning can be raised if versions of SparkR package and the Spark JVM do not match.
    +
    +## Upgrading to Spark 2.4.0
    +
    + - The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., `substr(df$a, 2, 5)` should be changed to `substr(df$a, 1, 4)`.
    --- End diff --
    
    could you add
    `method is now 1-base, e.g., therefore to get the same result as substr(df$a, 2, 5), it should be changed to substr(df$a, 1, 4)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    ping @felixcheung 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r172033037
  
    --- Diff: docs/sparkr.md ---
    @@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
      - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected.
      - For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`.
      - A warning can be raised if versions of SparkR package and the Spark JVM do not match.
    +
    +## Upgrading to Spark 2.4.0
    +
    + - The `start` parameter of `substr` method was wrongly subtracted by one, previously. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been corrected.
    --- End diff --
    
    2. in the migration guide we should give a concrete example with non-0 start index, eg.
    `substr(df$a, 1, 6)` should be changed to `substr(df$a, 0, 5)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r165263961
  
    --- Diff: R/pkg/R/column.R ---
    @@ -169,7 +169,7 @@ setMethod("alias",
     #' @note substr since 1.4.0
     setMethod("substr", signature(x = "Column"),
               function(x, start, stop) {
    -            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
    +            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
    --- End diff --
    
    I'm a bit concern with changing this. As you can see it's been like this from the very beginning...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    @shivaram Thanks for pointing out it. I made change to the description. Hopefully it is clearer now. Basically I just want to clarify why R's substr tests are correct previously.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1305/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    One more thing to notice is that the two parameters (starting and ending positions) of R's substr API is also unaligned with Scala's substr which takes starting position and substring length.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    @shivaram This fix is to make it correctly 1-based. Previously SparkR substr API substracts starting position by 1, so it becomes zero-based.
    
    This fix matches R's substr in above link as I test:
    ```R
    > substr("Michael", 4, 6)                                                                                                                  
    [1] "hae"
    ```
    
    Before this fix, SparkR's substr returns "cha".


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r165271143
  
    --- Diff: R/pkg/R/column.R ---
    @@ -169,7 +169,7 @@ setMethod("alias",
     #' @note substr since 1.4.0
     setMethod("substr", signature(x = "Column"),
               function(x, start, stop) {
    -            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
    +            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
    --- End diff --
    
    This API behavior should be considered as wrong and performs inconsistently. Because for starting position 1, we get substring from 1st element, but for position 2, we still get the substring from 1. So we will get the following inconsistent results:
    
    ```R
    > collect(select(df, substr(df$a, 1, 5)))
      substring(a, 0, 5)
    1              abcde
    > collect(select(df, substr(df$a, 2, 5)))
      substring(a, 1, 4)
    1               abcd
    ```
    
    For such change, we might need to add a note in the doc as @HyukjinKwon suggested.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r172751184
  
    --- Diff: docs/sparkr.md ---
    @@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
      - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected.
      - For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`.
      - A warning can be raised if versions of SparkR package and the Spark JVM do not match.
    +
    +## Upgrading to Spark 2.4.0
    +
    + - The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., `substr(df$a, 2, 5)` should be changed to `substr(df$a, 1, 4)`.
    --- End diff --
    
    Yes. Added.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    cc @felixcheung @HyukjinKwon 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20464


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r172409837
  
    --- Diff: R/pkg/R/column.R ---
    @@ -169,7 +169,7 @@ setMethod("alias",
     #' @note substr since 1.4.0
     setMethod("substr", signature(x = "Column"),
               function(x, start, stop) {
    -            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
    +            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
    --- End diff --
    
    Added to the func doc.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87993/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86924/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by viirya <gi...@git.apache.org>.
Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/20464


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #86908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86908/testReport)** for PR 20464 at commit [`a2ffdc1`](https://github.com/apache/spark/commit/a2ffdc14ebfa67656e3598f0a0a0131f18f98aa5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    > One followup question is though, would it be difficult to match the behaviour with substr in R when the index is 0 or minus? If i understood #20464 (comment) correctly, it sounds better to match it to substr's behaviour in R. Took a quick look/test and seems we can just set start to 1 for both cases.
    
    If we both consider the indices at starting and ending, setting them to 1 seems not enough. E.g.,
    
    ```R
    > substr("abcdef", -2, -3)
    [1] ""
    > substr("abcdef", 1, 1)
    [1] "a"
    ```
    
    For the cases when only ending is zero/negative, no matter what starting is, the result is empty string.
    
    For the cases when only starting is zero/negative, we can set it to 1.
    
    For the cases they are both zero/negative, the result is empty string.
    
    We can address this in another PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    **[Test build #87993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87993/testReport)** for PR 20464 at commit [`0ebdf74`](https://github.com/apache/spark/commit/0ebdf74942e0894bfaf6cbede4c03fd3f5d26411).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    appveyor tests failed, could you close and reopen this PR to trigger it.
    
    strange, I haven't seen anything like this on appveyor a long time.
    ```
    1. Error: create DataFrame with complex types (@test_sparkSQL.R#535) -----------
    8712org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 114.0 failed 1 times, most recent failure: Lost task 0.0 in stage 114.0 (TID 116, localhost, executor driver): java.net.SocketTimeoutException: Accept timed out
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20464#discussion_r165571511
  
    --- Diff: R/pkg/R/column.R ---
    @@ -169,7 +169,7 @@ setMethod("alias",
     #' @note substr since 1.4.0
     setMethod("substr", signature(x = "Column"),
               function(x, start, stop) {
    -            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
    +            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
    --- End diff --
    
    question:
    - is there a way to make the behavior the same before this change for any caller calling substr with common index like 0
    - why consider other changes as a follow up and not here? 
    https://github.com/apache/spark/pull/20464#issuecomment-362162014



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    One thing to keep in mind is what the user's perception of the API is. If R users are going to use 1-based indexing then this might not be the right fix ? http://stat.ethz.ch/R-manual/R-devel/library/base/html/substr.html is the base R function FWIW


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org