You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by JulienPeloton <gi...@git.apache.org> on 2018/11/13 21:01:58 UTC

[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

GitHub user JulienPeloton opened a pull request:

    https://github.com/apache/spark/pull/23025

    [SPARK-26024][SQL]: Update documentation for repartitionByRange

    Following [SPARK-26024](https://issues.apache.org/jira/browse/SPARK-26024), I noticed the number of elements in each partition after repartitioning using `df.repartitionByRange` can vary for the same setup:
    
    ```scala
    // Shuffle numbers from 0 to 1000, and make a DataFrame
    val df = Random.shuffle(0.to(1000)).toDF("val")
    
    // Repartition it using 3 partitions
    // Sum up number of elements in each partition, and collect it.
    // And do it several times
    for (i <- 0 to 9) {
      var counts = df.repartitionByRange(3, col("val"))
        .mapPartitions{part => Iterator(part.size)}
        .collect()
      println(counts.toList)
    }
    // -> the number of elements in each partition varies
    ```
    
    This is expected as for performance reasons this method uses sampling to estimate the ranges (with default size of 100). Hence, the output may not be consistent, since sampling can return different values. But documentation was not mentioning it at all, leading to misunderstanding.
    
    ## What changes were proposed in this pull request?
    
    Update the documentation (Spark & PySpark) to mention the impact of `spark.sql.execution.rangeExchange.sampleSizePerPartition` on the resulting partitioned DataFrame.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JulienPeloton/spark SPARK-26024

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23025.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23025
    
----
commit b47b6d0eb207021dbde38c35108fec0e39a64332
Author: Julien <pe...@...>
Date:   2018-11-13T20:49:53Z

    Update documentation on repartitionByRange according to SPARK-26024 (Spark)

commit 5a50282959c065f9797ce075239c30edeece4fbe
Author: Julien <pe...@...>
Date:   2018-11-13T20:50:38Z

    Update documentation on repartitionByRange according to SPARK-26024 (PySpark)

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    **[Test build #98987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98987/testReport)** for PR 23025 at commit [`654fed9`](https://github.com/apache/spark/commit/654fed90997140715d2d52578ca6e4f0661d4e69).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98995/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    **[Test build #98991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98991/testReport)** for PR 23025 at commit [`f829dfe`](https://github.com/apache/spark/commit/f829dfe0ce5c4d6be68c1247102d58a99b21ad56).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234071213
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
        * When no explicit sort order is specified, "ascending nulls first" is assumed.
        * Note, the rows are not sorted in each partition of the resulting Dataset.
        *
    +   * [SPARK-26024] Note that due to performance reasons this method uses sampling to
    --- End diff --
    
    We can drop `[SPARK-26024]` here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5131/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by JulienPeloton <gi...@git.apache.org>.
Github user JulienPeloton commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234099956
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
        * When no explicit sort order is specified, "ascending nulls first" is assumed.
        * Note, the rows are not sorted in each partition of the resulting Dataset.
        *
    +   * [SPARK-26024] Note that due to performance reasons this method uses sampling to
    +   * estimate the ranges. Hence, the output may not be consistent, since sampling can return
    +   * different values. The sample size can be controlled by setting the value of the parameter
    +   * {{spark.sql.execution.rangeExchange.sampleSizePerPartition}}.
    --- End diff --
    
    Thanks. Done.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234103194
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
             At least one partition-by expression must be specified.
             When no explicit sort order is specified, "ascending nulls first" is assumed.
     
    +        [SPARK-26024] Note that due to performance reasons this method uses sampling to
    --- End diff --
    
    "[SPARK-26024]" can be removed too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r236970732
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -767,6 +767,14 @@ setMethod("repartition",
     #'                      using \code{spark.sql.shuffle.partitions} as number of partitions.}
     #'}
     #'
    +#' At least one partition-by expression must be specified.
    --- End diff --
    
    761 is significant also, but correct. 
    
    essentially:
    1. first line of the blob is the title (L760)
    2. second text after "empty line" is the description (L762)
    3. third after another "empty line" is the "detail note" which is stashed all the way to the bottom of the doc page
    
    so generally you want "important" part of the description on top and not in the "detail" section because it is easily missed. 
    
    this is the most common pattern in this code base. there's another, where multiple function is doc together as a group, eg. collection sql function (in functions.R). other finer control is possible as well but not used today in this code base.
    
    similarly L829 is good, L831 is a bit fuzzy - I'd personally prefer without L831 to keep the whole text in the description section of the doc.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r236762465
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -767,6 +767,14 @@ setMethod("repartition",
     #'                      using \code{spark.sql.shuffle.partitions} as number of partitions.}
     #'}
     #'
    +#' At least one partition-by expression must be specified.
    --- End diff --
    
    this won't be formatted correctly in R doc due to the fact that "empty line" is significant. L769 should be removed to ensure it is in description


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by JulienPeloton <gi...@git.apache.org>.
Github user JulienPeloton commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234512932
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
             At least one partition-by expression must be specified.
             When no explicit sort order is specified, "ascending nulls first" is assumed.
     
    +        Note that due to performance reasons this method uses sampling to estimate the ranges.
    --- End diff --
    
    Oh right, I missed it! Pushed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    **[Test build #98992 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98992/testReport)** for PR 23025 at commit [`7ca4821`](https://github.com/apache/spark/commit/7ca48214cda312d78c22ad4305d2e490c46535f5).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by JulienPeloton <gi...@git.apache.org>.
Github user JulienPeloton commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Thanks all for the reviews!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by JulienPeloton <gi...@git.apache.org>.
Github user JulienPeloton commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234099934
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
        * When no explicit sort order is specified, "ascending nulls first" is assumed.
        * Note, the rows are not sorted in each partition of the resulting Dataset.
        *
    +   * [SPARK-26024] Note that due to performance reasons this method uses sampling to
    --- End diff --
    
    Thanks. Done.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234103200
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
        * When no explicit sort order is specified, "ascending nulls first" is assumed.
        * Note, the rows are not sorted in each partition of the resulting Dataset.
        *
    +   *
    +   * [SPARK-26024] Note that due to performance reasons this method uses sampling to
    --- End diff --
    
    ditto.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5135/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    **[Test build #98991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98991/testReport)** for PR 23025 at commit [`f829dfe`](https://github.com/apache/spark/commit/f829dfe0ce5c4d6be68c1247102d58a99b21ad56).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    cc @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/23025


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5128/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    **[Test build #98987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98987/testReport)** for PR 23025 at commit [`654fed9`](https://github.com/apache/spark/commit/654fed90997140715d2d52578ca6e4f0661d4e69).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    **[Test build #98995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98995/testReport)** for PR 23025 at commit [`7ca4821`](https://github.com/apache/spark/commit/7ca48214cda312d78c22ad4305d2e490c46535f5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    **[Test build #98992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98992/testReport)** for PR 23025 at commit [`7ca4821`](https://github.com/apache/spark/commit/7ca48214cda312d78c22ad4305d2e490c46535f5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234511357
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
             At least one partition-by expression must be specified.
             When no explicit sort order is specified, "ascending nulls first" is assumed.
     
    +        Note that due to performance reasons this method uses sampling to estimate the ranges.
    --- End diff --
    
    Besides Python, we also have `repartitionByRange` API in R. Can you also update it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r237174135
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -767,6 +767,14 @@ setMethod("repartition",
     #'                      using \code{spark.sql.shuffle.partitions} as number of partitions.}
     #'}
     #'
    +#' At least one partition-by expression must be specified.
    --- End diff --
    
    @felixcheung have a look at https://github.com/apache/spark/pull/23167


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98987/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by JulienPeloton <gi...@git.apache.org>.
Github user JulienPeloton commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234509708
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
        * When no explicit sort order is specified, "ascending nulls first" is assumed.
        * Note, the rows are not sorted in each partition of the resulting Dataset.
        *
    +   *
    +   * Note that due to performance reasons this method uses sampling to estimate the ranges.
    +   * Hence, the output may not be consistent, since sampling can return different values.
    +   * The sample size can be controlled by setting the value of the parameter
    +   * `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
    --- End diff --
    
    @cloud-fan the sentence has been changed according to your suggestion (in both Spark & PySpark).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5132/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by JulienPeloton <gi...@git.apache.org>.
Github user JulienPeloton commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r237225273
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -767,6 +767,14 @@ setMethod("repartition",
     #'                      using \code{spark.sql.shuffle.partitions} as number of partitions.}
     #'}
     #'
    +#' At least one partition-by expression must be specified.
    --- End diff --
    
    @felixcheung Thanks, I did not know about this strict doc formatting rule in R.
    
    @srowen Thanks for taking care of the fix!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98991/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by JulienPeloton <gi...@git.apache.org>.
Github user JulienPeloton commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    @viirya OK all references to SPARK-26024 removed from the doc.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    **[Test build #98995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98995/testReport)** for PR 23025 at commit [`7ca4821`](https://github.com/apache/spark/commit/7ca48214cda312d78c22ad4305d2e490c46535f5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234071565
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
        * When no explicit sort order is specified, "ascending nulls first" is assumed.
        * Note, the rows are not sorted in each partition of the resulting Dataset.
        *
    +   * [SPARK-26024] Note that due to performance reasons this method uses sampling to
    +   * estimate the ranges. Hence, the output may not be consistent, since sampling can return
    +   * different values. The sample size can be controlled by setting the value of the parameter
    +   * {{spark.sql.execution.rangeExchange.sampleSizePerPartition}}.
    --- End diff --
    
    ``` `spark.sql.execution.rangeExchange.sampleSizePerPartition` ```.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23025
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98992/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r234475550
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
        * When no explicit sort order is specified, "ascending nulls first" is assumed.
        * Note, the rows are not sorted in each partition of the resulting Dataset.
        *
    +   *
    +   * Note that due to performance reasons this method uses sampling to estimate the ranges.
    +   * Hence, the output may not be consistent, since sampling can return different values.
    +   * The sample size can be controlled by setting the value of the parameter
    +   * `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
    --- End diff --
    
    It's not a parameter but a config. So I'd like to propose
    ```
    The sample size can be controlled by the config `xxx`
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23025#discussion_r236837158
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -767,6 +767,14 @@ setMethod("repartition",
     #'                      using \code{spark.sql.shuffle.partitions} as number of partitions.}
     #'}
     #'
    +#' At least one partition-by expression must be specified.
    --- End diff --
    
    I see. What about on 761? I see several docs around here with empty lines (829, 831 below). Are those different? These comments are secondary, but I guess they belong in the public docs as much as anything.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org