Posted to reviews@spark.apache.org by felixcheung <gi...@git.apache.org> on 2017/01/30 07:50:13 UTC

[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

GitHub user felixcheung opened a pull request:

    https://github.com/apache/spark/pull/16739

    [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column

    ## What changes were proposed in this pull request?
    
    Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column
    
    ## How was this patch tested?
    
    manual, unit tests
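
For the Column variant, `coalesce` follows the usual SQL meaning: for each row it yields the first value that is not null. A minimal plain-Python sketch of that row-wise behavior follows. It is not SparkR code; the function name is hypothetical and `None` stands in for null.

```python
from typing import Optional, Sequence


def coalesce_row(values: Sequence[Optional[object]]) -> Optional[object]:
    """Return the first non-None value in a row, or None if every value is None."""
    return next((v for v in values if v is not None), None)


# Each tuple plays the role of one row across three columns.
rows = [(None, 2, 3), (1, None, 3), (None, None, None)]
print([coalesce_row(r) for r in rows])  # [2, 1, None]
```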


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rcoalesce

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16739.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16739
    
----
commit e5a39f15028e3b34d3c5ebe455ea1fb72cbdc80b
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-01-30T07:42:00Z

    coalesce

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    surely, i think you mean https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L428
    we will need to update this to say `use repartition() if you want shuffling` though, since the shuffle option is only on RDD.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72240/testReport)** for PR 16739 at commit [`3ed835a`](https://github.com/apache/spark/commit/3ed835ad340ea0793f8fbb93a697e09f7eb249d9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16739



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    @felixcheung I was referring to the `   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
       * this may result in your computation taking place on fewer nodes than
       * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
       * you can pass shuffle = true. This will add a shuffle step, but means the
       * current upstream partitions will be executed in parallel (per whatever
       * the current partitioning is).
    ` warning
    
    but documenting that coalesce caps out based on numSlices also sounds important (and potentially confusing).
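
The trade-off in that warning can be illustrated with a small plain-Python model (not Spark code). It assumes, as the quoted comment describes, that without a shuffle the upstream computation runs in only as many tasks as the coalesced output has partitions, while `shuffle = true` keeps the upstream stage at its current partitioning. The function name and parameters are hypothetical.

```python
def upstream_tasks(current_partitions: int, target_partitions: int, shuffle: bool) -> int:
    """Toy model: number of parallel tasks available to the upstream computation."""
    if shuffle:
        # A shuffle step separates the stages, so upstream keeps its own partitioning.
        return current_partitions
    # Narrow dependency: upstream work is folded into the coalesced partitions.
    return min(current_partitions, target_partitions)


# Drastic coalesce to a single partition, without and with a shuffle.
print(upstream_tasks(1000, 1, shuffle=False))  # 1
print(upstream_tasks(1000, 1, shuffle=True))   # 1000
```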



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72232/testReport)** for PR 16739 at commit [`1bd7163`](https://github.com/apache/spark/commit/1bd7163723641bfaa107c9a20974e163eaead0a4).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72925/
    Test FAILed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Actually, I find the current behavior a bit hard to explain. Could someone enlighten me on whether this is intentional and, if we are to document it, how best to do so?
    ```
    df <- as.DataFrame(cars, numPartitions = 5)  # this sets numSlices on the RDD to 5
    expect_equal(getNumPartitions(df), 5)
    expect_equal(getNumPartitions(coalesce(df, 3)), 3)
    expect_equal(getNumPartitions(coalesce(df, 6)), 5)

    df1 <- coalesce(df, 3)
    expect_equal(getNumPartitions(df1), 3)
    expect_equal(getNumPartitions(coalesce(df1, 6)), 5)  # even after a coalesce it can't go beyond 5
    expect_equal(getNumPartitions(coalesce(df1, 4)), 4)
    expect_equal(getNumPartitions(coalesce(df1, 2)), 2)

    df2 <- repartition(df1, 10)
    expect_equal(getNumPartitions(df2), 10)  # right after repartition the number of partitions exceeds the original numSlices
    expect_equal(getNumPartitions(coalesce(df2, 13)), 5)  # but a coalesce after repartition can't go beyond 5
    expect_equal(getNumPartitions(coalesce(df2, 7)), 5)
    expect_equal(getNumPartitions(coalesce(df2, 3)), 3)
    ```
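
The expectations above all fit one pattern: the result equals the requested count capped at the original `numSlices` of 5, regardless of any intermediate `coalesce` or `repartition`. Below is a minimal plain-Python check of that pattern against the numbers in the snippet. It only restates the observations; it is not Spark code, the helper name is hypothetical, and it does not explain why the cap applies.

```python
ORIGINAL_SLICES = 5  # numPartitions passed to as.DataFrame in the snippet above

# (requested partitions, observed partitions) pairs copied from the expectations above.
observations = [
    (3, 3), (6, 5),            # coalesce(df, n)
    (6, 5), (4, 4), (2, 2),    # coalesce(df1, n), where df1 = coalesce(df, 3)
    (13, 5), (7, 5), (3, 3),   # coalesce(df2, n), where df2 = repartition(df1, 10)
]


def capped(requested: int, cap: int = ORIGINAL_SLICES) -> int:
    """Hypothetical helper: requested partition count capped at `cap`."""
    return min(requested, cap)


mismatches = [(n, seen) for n, seen in observations if capped(n) != seen]
print(mismatches)  # an empty list means every observation matches min(requested, 5)
```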



[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16739#discussion_r98604331
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -680,14 +680,45 @@ setMethod("storageLevel",
                 storageLevelToString(callJMethod(x@sdf, "storageLevel"))
               })
     
    +#' Coalesce
    +#'
    +#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
    +#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
    +#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
    --- End diff --
    
    Actually, no, coalesce is set to `min(prev partitions, numPartitions)` according to CoalescedRDD [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L393) so it will be unchanged then.
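
As a quick illustration of that `min(prev partitions, numPartitions)` rule, here is a plain-Python sketch. The function name is hypothetical and this is not the CoalescedRDD code itself.

```python
def coalesced_partition_count(prev_partitions: int, num_partitions: int) -> int:
    """Partition count after a no-shuffle coalesce, per the min() rule quoted above."""
    return min(prev_partitions, num_partitions)


print(coalesced_partition_count(1000, 100))  # 100: shrinking works as requested
print(coalesced_partition_count(5, 6))       # 5: asking for more leaves the count unchanged
```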




[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Build finished. Test PASSed.



[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

Posted by wangmiao1981 <gi...@git.apache.org>.

Github user wangmiao1981 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16739#discussion_r98511320
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -680,14 +680,45 @@ setMethod("storageLevel",
                 storageLevelToString(callJMethod(x@sdf, "storageLevel"))
               })
     
    +#' Coalesce
    +#'
    +#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
    +#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
    +#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
    +#' the current partitions.
    +#'
    +#' @param numPartitions the number of partitions to use.
    +#'
    +#' @family SparkDataFrame functions
    +#' @rdname coalesce
    +#' @name coalesce
    +#' @aliases coalesce,SparkDataFrame-method
    +#' @seealso \link{repartition}
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' sparkR.session()
    +#' path <- "path/to/file.json"
    +#' df <- read.json(path)
    +#' newDF <- coalesce(df, 1L)
    +#'}
    +#' @note coalesce(SparkDataFrame) since 2.1.1
    --- End diff --
    
    2.2.0? Or will this be backported to 2.1.1 too?



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72166/testReport)** for PR 16739 at commit [`938c2ce`](https://github.com/apache/spark/commit/938c2ce27e4e1029a646e25c053baeb304d6f217).



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Agree with @jkbradley on this one. We should avoid adding functions that are completely new in a patch release, given that the gap between minor versions and patch releases isn't that large. As we discussed in the other thread, let's start tagging JIRAs with `backport` and also add a line in the JIRA saying why it's safe/required for backport.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Hi, @felixcheung .
    While backporting, https://github.com/apache/spark/commit/6c35399068f1035fec6d5f909a83a5b1683702e0#diff-3d2a6b9d2b7d84ae179d7ea0f9eca696R1232 seems to break the build of `branch-2.1`.
    The PR about `to_timestamp` is not backported to branch-2.1 yet.
    Could you backport that issue, too?



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72240 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72240/testReport)** for PR 16739 at commit [`3ed835a`](https://github.com/apache/spark/commit/3ed835ad340ea0793f8fbb93a697e09f7eb249d9).



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72147/testReport)** for PR 16739 at commit [`50ab563`](https://github.com/apache/spark/commit/50ab5635c54074a24a03d08ed42fd94fa19e68d3).



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    @gatorsmile thanks for commenting. `coalesce` currently accepts a number even if it is larger than the current number of partitions - I guess we didn't want to throw an exception in that case?
    
    but, since you are here, do you know why we see this behavior
    ```
    df2 <- repartition(df1, 10)
    expect_equal(getNumPartitions(df2), 10)  # right after repartition the number of partitions exceeds the original numSlices
    expect_equal(getNumPartitions(coalesce(df2, 13)), 5)  # but a coalesce after repartition can't go beyond 5
    ```
    
    Shouldn't I be allowed to set the number of partitions to 5 < n < 10, since I just called `repartition(10)`?



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72149/
    Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72929/
    Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72925/testReport)** for PR 16739 at commit [`bf2373f`](https://github.com/apache/spark/commit/bf2373f260a2af4a8841c0b440e86979de9c98e0).



[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

Posted by wangmiao1981 <gi...@git.apache.org>.

Github user wangmiao1981 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16739#discussion_r98513154
  
    --- Diff: R/pkg/R/generics.R ---
    @@ -406,6 +406,13 @@ setGeneric("attach")
     #' @export
     setGeneric("cache", function(x) { standardGeneric("cache") })
     
    +#' @rdname coalesce
    +#' @param x a Column or a SparkDataFrame.
    +#' @param ... additional argument(s). If \code{x} is a Column, addition Columns can be optionally
    --- End diff --
    
    addition Columns -> additional Columns?



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72791/
    Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    merged to master and branch-2.1
    @gatorsmile thanks - please feel free to update or remove unneeded test cases.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72790/
    Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72791/testReport)** for PR 16739 at commit [`a0fe134`](https://github.com/apache/spark/commit/a0fe1344ae1030be98a37ca133ee24a40e8bc65d).



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Thank YOU, always! :)



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72166/
    Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    I've commented elsewhere, but wanted to here just to make more people aware: Let's refrain from backporting new APIs into patch versions unless they are really critical.  We do not do this elsewhere in Spark, and we should not in SparkR.  New APIs and API changes should only happen in minor versions (and ideally changes will only happen in major ones).  It's been discussed elsewhere that SparkR is more experimental than other parts of Spark, but the sooner we start treating it like a stable library, the sooner it will be a stable library.  For most people, there isn't a huge difference between getting a new API in a patch version (every 1-2 months) vs. getting it in a minor version (every 4 months).  Thanks all!



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Thanks @felixcheung - I think these changes look good. 
    
    cc @gatorsmile  / @holdenk for doc changes in SQL, Python



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72240/
    Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16739#discussion_r98598572
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -680,14 +680,45 @@ setMethod("storageLevel",
                 storageLevelToString(callJMethod(x@sdf, "storageLevel"))
               })
     
    +#' Coalesce
    +#'
    +#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
    +#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
    +#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
    --- End diff --
    
    If there are more partitions, then there will be a shuffle, right? Might be useful to add that.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72147/
    Test FAILed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72929/testReport)** for PR 16739 at commit [`bf2373f`](https://github.com/apache/spark/commit/bf2373f260a2af4a8841c0b440e86979de9c98e0).



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72166 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72166/testReport)** for PR 16739 at commit [`938c2ce`](https://github.com/apache/spark/commit/938c2ce27e4e1029a646e25c053baeb304d6f217).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16739#discussion_r98527967
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -680,14 +680,45 @@ setMethod("storageLevel",
                 storageLevelToString(callJMethod(x@sdf, "storageLevel"))
               })
     
    +#' Coalesce
    +#'
    +#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
    +#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
    +#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
    +#' the current partitions.
    +#'
    +#' @param numPartitions the number of partitions to use.
    +#'
    +#' @family SparkDataFrame functions
    +#' @rdname coalesce
    +#' @name coalesce
    +#' @aliases coalesce,SparkDataFrame-method
    +#' @seealso \link{repartition}
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' sparkR.session()
    +#' path <- "path/to/file.json"
    +#' df <- read.json(path)
    +#' newDF <- coalesce(df, 1L)
    +#'}
    +#' @note coalesce(SparkDataFrame) since 2.1.1
    +setMethod("coalesce",
    +          signature(x = "SparkDataFrame"),
    +          function(x, numPartitions) {
    +            stopifnot(is.numeric(numPartitions))
    --- End diff --
    
    It's being coerced into an integer. The reason we don't want to require an integer here is to allow calls like
    ```
    coalesce(df, 3)
    ```
    
    in which `3` is a numeric by default (whereas `3L` is an integer). IMO, forcing the user to call with `3L` is a bit too much.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Let me rewrite the test cases in Scala.
    
    ```Scala
        val df = spark.range(0, 10000, 1, 5)
        assert(df.rdd.getNumPartitions == 5)
        assert(df.coalesce(3).rdd.getNumPartitions == 3)
        assert(df.coalesce(6).rdd.getNumPartitions == 5)
    
        val df1 = df.coalesce(3)
        assert(df1.rdd.getNumPartitions == 3)
        assert(df1.coalesce(6).rdd.getNumPartitions == 5)
        assert(df1.coalesce(4).rdd.getNumPartitions == 4)
        assert(df1.coalesce(2).rdd.getNumPartitions == 2)
    
        val df2 = df.repartition(10)
        assert(df2.rdd.getNumPartitions == 10)
        assert(df2.coalesce(13).rdd.getNumPartitions == 5)
        assert(df2.coalesce(7).rdd.getNumPartitions == 5)
        assert(df2.coalesce(3).rdd.getNumPartitions == 3)
    ```
    
    The question is why the second one (`df2.coalesce(13)`) is `5` instead of `10`. If we run explain, we get the following plan:
    ```
    == Parsed Logical Plan ==
    Repartition 13, false
    +- Repartition 10, true
       +- Range (0, 10000, step=1, splits=Some(5))
    
    == Analyzed Logical Plan ==
    id: bigint
    Repartition 13, false
    +- Repartition 10, true
       +- Range (0, 10000, step=1, splits=Some(5))
    
    == Optimized Logical Plan ==
    Repartition 13, false
    +- Range (0, 10000, step=1, splits=Some(5))
    
    == Physical Plan ==
    Coalesce 13
    +- *Range (0, 10000, step=1, splits=Some(5))
    ```
    
    Ok... `Repartition 10, true` is removed by our Optimizer rule `CollapseRepartition`. It is a bug, I think. Your question is valid. Let me fix it. 
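
The effect of collapsing the two `Repartition` nodes can be reproduced in a small, self-contained model. This is only a sketch: `Plan`, `SourceRange`, `Repartition`, and `CollapseSketch` are hypothetical stand-ins written for illustration, not Spark's Catalyst classes, and the rule is assumed to behave as the optimized plan above suggests (the outer node always replaces the inner one).

```scala
// Simplified stand-ins for logical plan nodes; these are NOT Spark's real classes.
sealed trait Plan
case class SourceRange(splits: Int) extends Plan
case class Repartition(numPartitions: Int, shuffle: Boolean, child: Plan) extends Plan

object CollapseSketch {
  // Partition count a plan would produce in this model: a shuffle yields exactly
  // the requested count, while a no-shuffle coalesce can only keep or reduce
  // the child's count.
  def numPartitions(plan: Plan): Int = plan match {
    case SourceRange(splits)          => splits
    case Repartition(n, true, _)      => n
    case Repartition(n, false, child) => math.min(n, numPartitions(child))
  }

  // Assumed collapsing behaviour: an outer Repartition always replaces an
  // adjacent inner Repartition, regardless of either node's shuffle flag.
  def collapse(plan: Plan): Plan = plan match {
    case Repartition(n, shuffle, Repartition(_, _, grandchild)) =>
      collapse(Repartition(n, shuffle, grandchild))
    case Repartition(n, shuffle, child) =>
      Repartition(n, shuffle, collapse(child))
    case other => other
  }
}
```

With `Repartition(13, shuffle = false, Repartition(10, shuffle = true, SourceRange(5)))`, the uncollapsed plan reports 10 partitions in this model, while the collapsed plan reports 5, which matches the value the test case above asserts.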



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    hmm, not as far as I can see:
    ```
    > df2 <- repartition(df1, 10)
    >   getNumPartitions(df2) # right after repartition the number of partition is greater than the original numSlices
    [1] 10
    > foo <-  coalesce(df2, 13)
    > explain(foo, extended = T)
    == Parsed Logical Plan ==
    Repartition 13, false
    +- Repartition 10, true
       +- Repartition 3, false
          +- LogicalRDD [speed#2, dist#3]
    
    == Analyzed Logical Plan ==
    speed: double, dist: double
    Repartition 13, false
    +- Repartition 10, true
       +- Repartition 3, false
          +- LogicalRDD [speed#2, dist#3]
    
    == Optimized Logical Plan ==
    Repartition 13, false
    +- LogicalRDD [speed#2, dist#3]
    
    == Physical Plan ==
    Coalesce 13
    +- Scan ExistingRDD[speed#2,dist#3]
    ```



[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16739#discussion_r98605173
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -680,14 +680,45 @@ setMethod("storageLevel",
                 storageLevelToString(callJMethod(x@sdf, "storageLevel"))
               })
     
    +#' Coalesce
    +#'
    +#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
    +#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
    +#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
    --- End diff --
    
    Oh well, I guess that's worth mentioning then?



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72790/testReport)** for PR 16739 at commit [`55b99df`](https://github.com/apache/spark/commit/55b99dfefacbe549e3d48278fa391c963ac36ab7).



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    `coalesce` is used to decrease the number of partitions in the RDD, but when you set it to a number larger than the current number of RDD partitions, the result is not predictable. It depends on your RDD's physical distribution.
    
    Thus, I am wondering whether we should allow users to set it to a larger number. Or are some advanced users relying on it?
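
For reference, the assertions in the Scala test case in this thread (for example, `coalesce(6)` on 5 partitions still gives 5) are consistent with a no-shuffle coalesce that only merges existing partitions and never splits them. Below is a minimal sketch of that idea, assuming an even, contiguous grouping. `NarrowCoalesceSketch` and `group` are hypothetical names, and Spark's actual coalescer also takes data locality into account, so real groupings may differ.

```scala
object NarrowCoalesceSketch {
  // Assign parent partition indices to at most `requested` groups without a shuffle.
  // Each group becomes one new partition. Even, contiguous grouping is an assumption
  // made for this sketch only.
  def group(current: Int, requested: Int): Seq[Seq[Int]] = {
    require(current > 0 && requested > 0, "partition counts must be positive")
    // Without a shuffle, existing partitions can be merged but not split,
    // so the result never has more partitions than the input.
    val target = math.min(current, requested)
    (0 until target).map { i =>
      val start = i * current / target
      val end = (i + 1) * current / target
      start until end
    }
  }
}
```

In this model, `group(1000, 100)` yields 100 groups of 10 parent partitions each, and `group(5, 6)` yields 5 single-partition groups, so asking for a larger number leaves the partition count unchanged.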



[GitHub] spark pull request #16739: [SPARK-19399][SPARKR] Add R coalesce API for Data...

Posted by wangmiao1981 <gi...@git.apache.org>.

Github user wangmiao1981 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16739#discussion_r98512167
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -680,14 +680,45 @@ setMethod("storageLevel",
                 storageLevelToString(callJMethod(x@sdf, "storageLevel"))
               })
     
    +#' Coalesce
    +#'
    +#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
    +#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
    +#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
    +#' the current partitions.
    +#'
    +#' @param numPartitions the number of partitions to use.
    +#'
    +#' @family SparkDataFrame functions
    +#' @rdname coalesce
    +#' @name coalesce
    +#' @aliases coalesce,SparkDataFrame-method
    +#' @seealso \link{repartition}
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' sparkR.session()
    +#' path <- "path/to/file.json"
    +#' df <- read.json(path)
    +#' newDF <- coalesce(df, 1L)
    +#'}
    +#' @note coalesce(SparkDataFrame) since 2.1.1
    +setMethod("coalesce",
    +          signature(x = "SparkDataFrame"),
    +          function(x, numPartitions) {
    +            stopifnot(is.numeric(numPartitions))
    --- End diff --
    
    Shall we enforce the input param as Integer? 



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    : ) This might be caused by the optimizer rule `CollapseRepartition`. Can you output the plan with `explain(true)`?



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72232/testReport)** for PR 16739 at commit [`1bd7163`](https://github.com/apache/spark/commit/1bd7163723641bfaa107c9a20974e163eaead0a4).



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72790/testReport)** for PR 16739 at commit [`55b99df`](https://github.com/apache/spark/commit/55b99dfefacbe549e3d48278fa391c963ac36ab7).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72929/testReport)** for PR 16739 at commit [`bf2373f`](https://github.com/apache/spark/commit/bf2373f260a2af4a8841c0b440e86979de9c98e0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72149/testReport)** for PR 16739 at commit [`50ab563`](https://github.com/apache/spark/commit/50ab5635c54074a24a03d08ed42fd94fa19e68d3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    great, looking forward to that.
    I'm going to merge this unless anyone has a concern?



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Jenkins, retest this please



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    @dongjoon-hyun My apologies, and thanks for bringing this to my attention. I had to hand merge and didn't realize the mismatch. Opened a new PR to fix that.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Jenkins, retest this please



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72232/



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    The issue is fixed in https://github.com/apache/spark/pull/16933. If this is merged first, I will fix the test case in this PR. Thanks! :)



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    Yep, see https://github.com/apache/spark/pull/16739#issuecomment-276739220 - only RDD has `coalesce(.. shuffle)`; in Dataset, it's `coalesce` and `repartition`.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72791/testReport)** for PR 16739 at commit [`a0fe134`](https://github.com/apache/spark/commit/a0fe1344ae1030be98a37ca133ee24a40e8bc65d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16739
  
    **[Test build #72149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72149/testReport)** for PR 16739 at commit [`50ab563`](https://github.com/apache/spark/commit/50ab5635c54074a24a03d08ed42fd94fa19e68d3).

