You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2018/12/06 15:55:41 UTC

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/23249

    [SPARK-26297][SQL] improve the doc of Distribution/Partitioning

    ## What changes were proposed in this pull request?
    
    Some documents of `Distribution/Partitioning` are stale and misleading, this PR fixes them:
    1. `ClusteredDistribution` doesn't have intra-partition requirement
    2. `OrderedDistribution` does not require tuples that share the same value being colocated in the same partition.
    3. `RangePartitioning` can provide a weaker guarantee for a prefix of its `ordering` expressions.
    
    ## How was this patch tested?
    
    comment-only PR.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark doc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23249.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23249
    
----
commit 24ea28abd5a385351703335df33b26838d203fe3
Author: Wenchen Fan <we...@...>
Date:   2018-12-06T15:47:23Z

    improve the doc of Distribution/Partitioning

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by maryannxue <gi...@git.apache.org>.

Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239540987
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -22,13 +22,12 @@ import org.apache.spark.sql.types.{DataType, IntegerType}
     
     /**
      * Specifies how tuples that share common expressions will be distributed when a query is executed
    - * in parallel on many machines.  Distribution can be used to refer to two distinct physical
    - * properties:
    - *  - Inter-node partitioning of data: In this case the distribution describes how tuples are
    - *    partitioned across physical machines in a cluster.  Knowing this property allows some
    - *    operators (e.g., Aggregate) to perform partition local operations instead of global ones.
    - *  - Intra-partition ordering of data: In this case the distribution describes guarantees made
    - *    about how tuples are distributed within a single partition.
    + * in parallel on many machines.
    + *
    + * Distribution here refers to inter-node partitioning of data:
    + *   The distribution describes how tuples are partitioned across physical machines in a cluster.
    + *   Knowing this property allows some operators (e.g., Aggregate) to perform partition local
    + *   operations instead of global ones.
      */
    --- End diff --
    
    Do we also need to mention that there's another related but orthogonal physical property, i.e., the intra-partition ordering and maybe list an example here how operators take advantage of these two physical properties together?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99912/testReport)** for PR 23249 at commit [`adfcec4`](https://github.com/apache/spark/commit/adfcec41adbffbef2e33fb85db5ad48eba5f3d71).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99812/testReport)** for PR 23249 at commit [`04be19e`](https://github.com/apache/spark/commit/04be19e62caa8fd0365b4998e22cdcad846be6b8).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r240026485
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -118,10 +115,12 @@ case class HashClusteredDistribution(
     
     /**
      * Represents data where tuples have been ordered according to the `ordering`
    - * [[Expression Expressions]].  This is a strictly stronger guarantee than
    - * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the
    - * same value for the ordering expressions are contiguous and will never be split across
    - * partitions.
    + * [[Expression Expressions]]. Its requirement is defined as the following:
    + *   - Given any 2 adjacent partitions, all the rows of the second partition must be larger than or
    + *     equal to any row in the first partition, according to the `ordering` expressions.
    --- End diff --
    
    Note that, only sort requires `OrderedDistribution`, and global sort doesn't care if there are equal-rows across partitions.
    
    Here is a definition of the requirement. When designing protocols, it's important to make the requirement as weak as possible, and make guarantees as strong as possible.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239684697
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -22,13 +22,12 @@ import org.apache.spark.sql.types.{DataType, IntegerType}
     
     /**
      * Specifies how tuples that share common expressions will be distributed when a query is executed
    - * in parallel on many machines.  Distribution can be used to refer to two distinct physical
    - * properties:
    - *  - Inter-node partitioning of data: In this case the distribution describes how tuples are
    - *    partitioned across physical machines in a cluster.  Knowing this property allows some
    - *    operators (e.g., Aggregate) to perform partition local operations instead of global ones.
    - *  - Intra-partition ordering of data: In this case the distribution describes guarantees made
    - *    about how tuples are distributed within a single partition.
    + * in parallel on many machines.
    + *
    + * Distribution here refers to inter-node partitioning of data:
    + *   The distribution describes how tuples are partitioned across physical machines in a cluster.
    + *   Knowing this property allows some operators (e.g., Aggregate) to perform partition local
    + *   operations instead of global ones.
      */
    --- End diff --
    
    I intentionally remove everything about intra-partition, as we never leverage it and no partitioning provides this property. Did I miss something?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99814/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r240123260
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -262,6 +261,15 @@ case class RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int)
         super.satisfies0(required) || {
           required match {
             case OrderedDistribution(requiredOrdering) =>
    +          // If `ordering` is a prefix of `requiredOrdering`:
    +          //   - Let's say `ordering` is [a, b] and `requiredOrdering` is [a, b, c]. If a row is
    +          //     larger than another row w.r.t. [a, b], it's also larger w.r.t. [a, b, c]. So
    +          //     `RangePartitioning(a, b)` satisfy `OrderedDistribution(a, b, c)`.
    --- End diff --
    
    nit `satisfy` -> `satisfies`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5852/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by maryannxue <gi...@git.apache.org>.

Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239694008
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -243,10 +248,19 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
      * Represents a partitioning where rows are split across partitions based on some total ordering of
      * the expressions specified in `ordering`.  When data is partitioned in this manner the following
      * two conditions are guaranteed to hold:
    - *  - All row where the expressions in `ordering` evaluate to the same values will be in the same
    --- End diff --
    
    nit: "row" -> "rows", "where... `ordering`" -> "whose `ordering` expressions"


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5810/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99775/testReport)** for PR 23249 at commit [`24ea28a`](https://github.com/apache/spark/commit/24ea28abd5a385351703335df33b26838d203fe3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by maryannxue <gi...@git.apache.org>.

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/spark/pull/23249#discussion_r239689874

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
@@ -22,13 +22,12 @@ import org.apache.spark.sql.types.{DataType, IntegerType}

/**
* Specifies how tuples that share common expressions will be distributed when a query is executed
- * in parallel on many machines. Distribution can be used to refer to two distinct physical
- * properties:
- * - Inter-node partitioning of data: In this case the distribution describes how tuples are
- * partitioned across physical machines in a cluster. Knowing this property allows some
- * operators (e.g., Aggregate) to perform partition local operations instead of global ones.
- * - Intra-partition ordering of data: In this case the distribution describes guarantees made
- * about how tuples are distributed within a single partition.
+ * in parallel on many machines.
+ *
+ * Distribution here refers to inter-node partitioning of data:
+ * The distribution describes how tuples are partitioned across physical machines in a cluster.
+ * Knowing this property allows some operators (e.g., Aggregate) to perform partition local
+ * operations instead of global ones.
*/
--- End diff --

Yes, I understand that partitioning has nothing to do with intra-partition ordering at all. And it was wrong to include intra-partition ordering as part of the distribution properties. But I was thinking mentioning ordering as a side note would probably help ppl understand better how some operators work. Or maybe here's not the best place to put it.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239508437
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -118,10 +116,13 @@ case class HashClusteredDistribution(
     
     /**
      * Represents data where tuples have been ordered according to the `ordering`
    - * [[Expression Expressions]].  This is a strictly stronger guarantee than
    - * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the
    - * same value for the ordering expressions are contiguous and will never be split across
    - * partitions.
    + * [[Expression Expressions]].
    + *
    + * Tuples that share the same values for the ordering expressions must be contiguous within a
    + * partition. They can also across partitions, but these partitions must be contiguous. For example,
    + * if value `v` is the biggest values in partition 3, it can also be in partition 4 as the smallest
    + * value. If all the values in partition 4 are `v`, it can also be in partition 5 as the smallest
    + * value.
      */
     case class OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution {
    --- End diff --
    
    This is only used by sort, and sort doesn't require rows of same value to be colocated in the same partition.
    
    Actually we already use this knowledge to optimize `RangePartitioning.satisfy`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99779 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99779/testReport)** for PR 23249 at commit [`3df1e44`](https://github.com/apache/spark/commit/3df1e446a8f9c9d04912856e617617c1ef7c8373).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99814/testReport)** for PR 23249 at commit [`130bc95`](https://github.com/apache/spark/commit/130bc957b6d8cf1b8b6c1cf77c57eb759fcc6aa6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99775/testReport)** for PR 23249 at commit [`24ea28a`](https://github.com/apache/spark/commit/24ea28abd5a385351703335df33b26838d203fe3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99821/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99821/testReport)** for PR 23249 at commit [`130bc95`](https://github.com/apache/spark/commit/130bc957b6d8cf1b8b6c1cf77c57eb759fcc6aa6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by maryannxue <gi...@git.apache.org>.

Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239693849
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -243,10 +248,19 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
      * Represents a partitioning where rows are split across partitions based on some total ordering of
      * the expressions specified in `ordering`.  When data is partitioned in this manner the following
    --- End diff --
    
    nit: add "," after "this manner".


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99812/testReport)** for PR 23249 at commit [`04be19e`](https://github.com/apache/spark/commit/04be19e62caa8fd0365b4998e22cdcad846be6b8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99814 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99814/testReport)** for PR 23249 at commit [`130bc95`](https://github.com/apache/spark/commit/130bc957b6d8cf1b8b6c1cf77c57eb759fcc6aa6).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99812/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5845/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99821/testReport)** for PR 23249 at commit [`130bc95`](https://github.com/apache/spark/commit/130bc957b6d8cf1b8b6c1cf77c57eb759fcc6aa6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5923/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239754619
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -118,10 +115,12 @@ case class HashClusteredDistribution(
     
     /**
      * Represents data where tuples have been ordered according to the `ordering`
    - * [[Expression Expressions]].  This is a strictly stronger guarantee than
    - * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the
    - * same value for the ordering expressions are contiguous and will never be split across
    - * partitions.
    + * [[Expression Expressions]]. Its requirement is defined as the following:
    + *   - Given any 2 adjacent partitions, all the rows of the second partition must be larger than or
    + *     equal to any row in the first partition, according to the `ordering` expressions.
    --- End diff --
    
    Why here we need this equality? Can we just have all the rows in the second partition must be larger than any row in the first partition?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    **[Test build #99779 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99779/testReport)** for PR 23249 at commit [`3df1e44`](https://github.com/apache/spark/commit/3df1e446a8f9c9d04912856e617617c1ef7c8373).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99779/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239508488
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -118,10 +116,13 @@ case class HashClusteredDistribution(
     
     /**
      * Represents data where tuples have been ordered according to the `ordering`
    - * [[Expression Expressions]].  This is a strictly stronger guarantee than
    - * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the
    - * same value for the ordering expressions are contiguous and will never be split across
    - * partitions.
    + * [[Expression Expressions]].
    + *
    + * Tuples that share the same values for the ordering expressions must be contiguous within a
    + * partition. They can also across partitions, but these partitions must be contiguous. For example,
    + * if value `v` is the biggest values in partition 3, it can also be in partition 4 as the smallest
    + * value. If all the values in partition 4 are `v`, it can also be in partition 5 as the smallest
    + * value.
      */
     case class OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution {
    --- End diff --
    
    This is only used by sort, and sort doesn't require rows of same value to be colocated in the same partition.
    
    Actually we already use this knowledge to optimize `RangePartitioning.satisfy`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99775/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    cc @maryannxue @hvanhovell @gatorsmile @viirya 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5816/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239690226
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -22,13 +22,12 @@ import org.apache.spark.sql.types.{DataType, IntegerType}
     
     /**
      * Specifies how tuples that share common expressions will be distributed when a query is executed
    - * in parallel on many machines.  Distribution can be used to refer to two distinct physical
    - * properties:
    - *  - Inter-node partitioning of data: In this case the distribution describes how tuples are
    - *    partitioned across physical machines in a cluster.  Knowing this property allows some
    - *    operators (e.g., Aggregate) to perform partition local operations instead of global ones.
    - *  - Intra-partition ordering of data: In this case the distribution describes guarantees made
    - *    about how tuples are distributed within a single partition.
    + * in parallel on many machines.
    + *
    + * Distribution here refers to inter-node partitioning of data:
    + *   The distribution describes how tuples are partitioned across physical machines in a cluster.
    + *   Knowing this property allows some operators (e.g., Aggregate) to perform partition local
    + *   operations instead of global ones.
      */
    --- End diff --
    
    for ordering, I think people can look at `OrderedDistribution`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5846/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Parti...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23249
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org