You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2016/06/19 01:11:43 UTC

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy opti...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/13765

    [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

    ## What changes were proposed in this pull request?
    
    This issue adds a new optimizer, `CollapseRepartitionBy`, which is similar with `CollapseRepartition`. Also, this PR adds a testsuite for both `CollapseRepartition` and `CollapseRepartitionBy`.
    
    **Before**
    ```scala
    scala> spark.range(10).repartition(1, $"id").repartition(1, $"id").explain
    == Physical Plan ==
    Exchange hashpartitioning(id#0L, 1)
    +- Exchange hashpartitioning(id#0L, 1)
       +- *Range (0, 10, splits=8)
    ```
    
    **After**
    ```scala
    scala> spark.range(10).repartition(1, $"id").repartition(1, $"id").explain
    == Physical Plan ==
    Exchange hashpartitioning(id#0L, 1)
    +- *Range (0, 10, splits=8)
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins tests (including a new testsuite).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-16052

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13765.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13765
    
----
commit caeb08f9e296dc1da3be9728a188b6cc662b6ab1
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-06-19T01:03:48Z

    [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Definitely, it's my bad. I should spend more time to make a reasonable scenario from the first.
    Thank you, @liancheng , @cloud-fan , @gatorsmile .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68794253
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -537,12 +537,19 @@ object CollapseProject extends Rule[LogicalPlan] {
     }
     
     /**
    - * Combines adjacent [[Repartition]] operators by keeping only the last one.
    + * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] operator combinations
    + * by keeping only the one.
      */
     object CollapseRepartition extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
         case Repartition(numPartitions, shuffle, Repartition(_, _, child)) =>
           Repartition(numPartitions, shuffle, child)
    +    case Repartition(numPartitions, _, r: RepartitionByExpression) =>
    +      r.copy(numPartitions = Some(numPartitions))
    +    case RepartitionByExpression(exprs, Repartition(_, _, child), numPartitions) =>
    +      RepartitionByExpression(exprs, child, numPartitions)
    +    case RepartitionByExpression(exprs, RepartitionByExpression(_, child, _), numPartitions) =>
    +      RepartitionByExpression(exprs, child, numPartitions)
    --- End diff --
    
    Actually, it aims the followings SQLs. It's a kind of query organizing clause in SQL, e.g., ORDER BY, CLUSTER BY, DISTRIBUTE BY, SORT BY, LIMIT. Actually, the risk what you concern is about `exprs` depencency, right?
    **Before**
    ```scala
    scala> sql("(select * from values (1,'a'),(2,'b') T(a,b) distribute by a) distribute by b").explain
    == Physical Plan ==
    Exchange hashpartitioning(b#3, 200)
    +- Exchange hashpartitioning(a#2, 200)
       +- LocalTableScan [a#2, b#3]
    ```
    
    **After**
    ```scala
    scala> sql("(select * from values (1,'a'),(2,'b') T(a,b) distribute by a) distribute by b").explain
    == Physical Plan ==
    Exchange hashpartitioning(b#1, 200)
    +- LocalTableScan [a#0, b#1]
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68790432
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -537,12 +537,19 @@ object CollapseProject extends Rule[LogicalPlan] {
     }
     
     /**
    - * Combines adjacent [[Repartition]] operators by keeping only the last one.
    + * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] operator combinations
    + * by keeping only the one.
      */
     object CollapseRepartition extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
         case Repartition(numPartitions, shuffle, Repartition(_, _, child)) =>
           Repartition(numPartitions, shuffle, child)
    +    case Repartition(numPartitions, _, r: RepartitionByExpression) =>
    +      r.copy(numPartitions = Some(numPartitions))
    +    case RepartitionByExpression(exprs, Repartition(_, _, child), numPartitions) =>
    +      RepartitionByExpression(exprs, child, numPartitions)
    +    case RepartitionByExpression(exprs, RepartitionByExpression(_, child, _), numPartitions) =>
    +      RepartitionByExpression(exprs, child, numPartitions)
    --- End diff --
    
    Hmm. I see. Let me check this again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r69951981
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
    @@ -370,8 +370,11 @@ package object dsl {
             case plan => SubqueryAlias(alias, plan)
           }
     
    -      def distribute(exprs: Expression*): LogicalPlan =
    -        RepartitionByExpression(exprs, logicalPlan)
    +      def repartition(num: Integer): LogicalPlan =
    +        Repartition(num, shuffle = true, logicalPlan)
    +
    +      def distribute(exprs: Expression*)(n: Int = -1): LogicalPlan =
    +        RepartitionByExpression(exprs, logicalPlan, numPartitions = if (n < 0) None else Some(n))
    --- End diff --
    
    Ur, it wasn't possible in this case because `n: Int = -1` has a default value.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Can you show us an example query?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61182/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68745747
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -537,12 +537,19 @@ object CollapseProject extends Rule[LogicalPlan] {
     }
     
     /**
    - * Combines adjacent [[Repartition]] operators by keeping only the last one.
    + * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] operator combinations
    + * by keeping only the one.
      */
     object CollapseRepartition extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
         case Repartition(numPartitions, shuffle, Repartition(_, _, child)) =>
           Repartition(numPartitions, shuffle, child)
    +    case Repartition(numPartitions, _, r: RepartitionByExpression) =>
    +      r.copy(numPartitions = Some(numPartitions))
    +    case RepartitionByExpression(exprs, Repartition(_, _, child), numPartitions) =>
    +      RepartitionByExpression(exprs, child, numPartitions)
    +    case RepartitionByExpression(exprs, RepartitionByExpression(_, child, _), numPartitions) =>
    +      RepartitionByExpression(exprs, child, numPartitions)
    --- End diff --
    
    cc @yhuai to confirm if it's safe to do so.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61834/consoleFull)** for PR 13765 at commit [`71d0e38`](https://github.com/apache/spark/commit/71d0e38410b20a3c2078a1511fe3258200c81531).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68745559
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -537,12 +537,19 @@ object CollapseProject extends Rule[LogicalPlan] {
     }
     
     /**
    - * Combines adjacent [[Repartition]] operators by keeping only the last one.
    + * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] operator combinations
    + * by keeping only the one.
    --- End diff --
    
    can we add more comments to explain the rule about how to combine them?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` op...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68610179
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -547,6 +548,16 @@ object CollapseRepartition extends Rule[LogicalPlan] {
     }
     
     /**
    + * Combines adjacent [[RepartitionByExpression]] operators.
    + */
    +object CollapseRepartitionBy extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +    case RepartitionByExpression(exprs, RepartitionByExpression(_, child, _), numPartitions) =>
    --- End diff --
    
    Thank you for review, @cloud-fan .
    Yep. I will merge them into one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61398/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61663/consoleFull)** for PR 13765 at commit [`e26e956`](https://github.com/apache/spark/commit/e26e956c89593bbae52c2cdc32b788ed7eea29c7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60793/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy opti...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r67610137
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -451,10 +451,10 @@ def repartition(self, numPartitions, *cols):
             +---+-----+
             |age| name|
             +---+-----+
    -        |  5|  Bob|
    -        |  5|  Bob|
             |  2|Alice|
    +        |  5|  Bob|
             |  2|Alice|
    +        |  5|  Bob|
             +---+-----+
    --- End diff --
    
    It's not a problem. This testcase is not robust since it uses the result of hash-partitioning.
    ```
    >>> data.explain()
    16/06/18 21:25:42 DEBUG SparkOptimizer: 
    === Result of Batch Operator Optimizations ===
     RepartitionByExpression [age#0], 7    RepartitionByExpression [age#0], 7
    !+- RepartitionByExpression [age#0]    +- Union
    !   +- Union                              :- LogicalRDD [age#0, name#1]
    !      :- LogicalRDD [age#0, name#1]      +- LogicalRDD [age#0, name#1]
    !      +- LogicalRDD [age#0, name#1]   
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r70033421
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -451,10 +451,10 @@ def repartition(self, numPartitions, *cols):
             +---+-----+
             |age| name|
             +---+-----+
    -        |  5|  Bob|
    -        |  5|  Bob|
             |  2|Alice|
    +        |  5|  Bob|
             |  2|Alice|
    +        |  5|  Bob|
             +---+-----+
    --- End diff --
    
    I'm confused here: why the results differ from each other when the top level re-partitioning condition stays the same?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Rebased to the master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r69951435
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
    @@ -370,8 +370,11 @@ package object dsl {
             case plan => SubqueryAlias(alias, plan)
           }
     
    -      def distribute(exprs: Expression*): LogicalPlan =
    -        RepartitionByExpression(exprs, logicalPlan)
    +      def repartition(num: Integer): LogicalPlan =
    +        Repartition(num, shuffle = true, logicalPlan)
    +
    +      def distribute(exprs: Expression*)(n: Int = -1): LogicalPlan =
    +        RepartitionByExpression(exprs, logicalPlan, numPartitions = if (n < 0) None else Some(n))
    --- End diff --
    
    Thank you for review, @liancheng .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/13765

Hi, @cloud-fan , @liancheng , @gatorsmile .

We can imagine some data sciences environment having many predefined system-wide tables or datasets. But, here, I simplify that into an extremely two line example.
`dsView1` is one of the system-wide predefined table. The number of partition of `dsView` is optimized for the common usage for the performance (or for preventing skewness.)
Data scientist can not modify that Dataset because it's a shared and predefined one.

However, Data scientist want to save that data into a single file. (You can see this kind of pattern in current Spark MLlib code, too.)
```
scala> val dsView1 = spark.range(8).repartition(8)
scala> dsView1.repartition(1).write.json("/tmp/jsonsinglefile")
```
You can see that the obvious repetition of `repartition`s.

How do you think about this virtual scenario?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Hi, All. Here is a better example.
    I'll update the PR description with this.
    Thank you, all.
    
    **Target Scenario**
    ```scala
    scala> val dsView1 = spark.range(8).repartition(8, $"id")
    scala> dsView1.createOrReplaceTempView("dsView1")
    scala> sql("select id from dsView1 distribute by id").explain(true)
    ```
    
    **Before**
    ```scala
    scala> sql("select id from dsView1 distribute by id").explain(true)
    == Parsed Logical Plan ==
    'RepartitionByExpression ['id]
    +- 'Project ['id]
       +- 'UnresolvedRelation `dsView1`
    
    == Analyzed Logical Plan ==
    id: bigint
    RepartitionByExpression [id#0L]
    +- Project [id#0L]
       +- SubqueryAlias dsview1
          +- RepartitionByExpression [id#0L], 8
             +- Range (0, 8, splits=8)
    
    == Optimized Logical Plan ==
    RepartitionByExpression [id#0L]
    +- RepartitionByExpression [id#0L], 8
       +- Range (0, 8, splits=8)
    
    == Physical Plan ==
    Exchange hashpartitioning(id#0L, 200)
    +- Exchange hashpartitioning(id#0L, 8)
       +- *Range (0, 8, splits=8)
    ```
    
    **After**
    ```scala
    scala> sql("select id from dsView1 distribute by id").explain(true)
    == Parsed Logical Plan ==
    'RepartitionByExpression ['id]
    +- 'Project ['id]
       +- 'UnresolvedRelation `dsView1`
    
    == Analyzed Logical Plan ==
    id: bigint
    RepartitionByExpression [id#0L]
    +- Project [id#0L]
       +- SubqueryAlias dsview1
          +- RepartitionByExpression [id#0L], 8
             +- Range (0, 8, splits=8)
    
    == Optimized Logical Plan ==
    RepartitionByExpression [id#0L]
    +- Range (0, 8, splits=8)
    
    == Physical Plan ==
    Exchange hashpartitioning(id#0L, 200)
    +- *Range (0, 8, splits=8)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Thanks for the examples. This makes sense and LGTM now. Merging into master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13765


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61663/consoleFull)** for PR 13765 at commit [`e26e956`](https://github.com/apache/spark/commit/e26e956c89593bbae52c2cdc32b788ed7eea29c7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61564/consoleFull)** for PR 13765 at commit [`c75b542`](https://github.com/apache/spark/commit/c75b5420491814833b582073818853e3e721ea08).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61309/consoleFull)** for PR 13765 at commit [`4bb9047`](https://github.com/apache/spark/commit/4bb9047a8438596b0c74c4a13b2a2d44ba707e1e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` optimizer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #60793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60793/consoleFull)** for PR 13765 at commit [`fd53a5a`](https://github.com/apache/spark/commit/fd53a5af8c5cfb23a2ed53c84c60d30824220026).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r70033085
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
    @@ -370,8 +370,11 @@ package object dsl {
             case plan => SubqueryAlias(alias, plan)
           }
     
    -      def distribute(exprs: Expression*): LogicalPlan =
    -        RepartitionByExpression(exprs, logicalPlan)
    +      def repartition(num: Integer): LogicalPlan =
    +        Repartition(num, shuffle = true, logicalPlan)
    +
    +      def distribute(exprs: Expression*)(n: Int = -1): LogicalPlan =
    +        RepartitionByExpression(exprs, logicalPlan, numPartitions = if (n < 0) None else Some(n))
    --- End diff --
    
    Thank you~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` op...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68577644
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -547,6 +548,16 @@ object CollapseRepartition extends Rule[LogicalPlan] {
     }
     
     /**
    + * Combines adjacent [[RepartitionByExpression]] operators.
    + */
    +object CollapseRepartitionBy extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +    case RepartitionByExpression(exprs, RepartitionByExpression(_, child, _), numPartitions) =>
    --- End diff --
    
    I think we can add more cases:
    ```
    case RepartitionByExpression(exprs, Repartition(_, _, child), numPartitions) =>
      RepartitionByExpression(exprs, child, numPartitions)
    case Repartition(numPartitions, _, r: RepartitionByExpression) =>
      r.copy(numPartitions = numPartitions)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61564 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61564/consoleFull)** for PR 13765 at commit [`c75b542`](https://github.com/apache/spark/commit/c75b5420491814833b582073818853e3e721ea08).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r69930213
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
    @@ -370,8 +370,11 @@ package object dsl {
             case plan => SubqueryAlias(alias, plan)
           }
     
    -      def distribute(exprs: Expression*): LogicalPlan =
    -        RepartitionByExpression(exprs, logicalPlan)
    +      def repartition(num: Integer): LogicalPlan =
    +        Repartition(num, shuffle = true, logicalPlan)
    +
    +      def distribute(exprs: Expression*)(n: Int = -1): LogicalPlan =
    +        RepartitionByExpression(exprs, logicalPlan, numPartitions = if (n < 0) None else Some(n))
    --- End diff --
    
    Seems that adding a `distribute(n: Int, exprs: Expression*)` overloading method is simpler?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Ping @cloud-fan and @yhuai .  :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61309/consoleFull)** for PR 13765 at commit [`4bb9047`](https://github.com/apache/spark/commit/4bb9047a8438596b0c74c4a13b2a2d44ba707e1e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #60793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60793/consoleFull)** for PR 13765 at commit [`fd53a5a`](https://github.com/apache/spark/commit/fd53a5af8c5cfb23a2ed53c84c60d30824220026).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Unfortunately I'm having network issue here and failed to fetch the branch from GitHub :(
    
    @cloud-fan Could you please help merge this one? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60790/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61564/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r70033576
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -451,10 +451,10 @@ def repartition(self, numPartitions, *cols):
             +---+-----+
             |age| name|
             +---+-----+
    -        |  5|  Bob|
    -        |  5|  Bob|
             |  2|Alice|
    +        |  5|  Bob|
             |  2|Alice|
    +        |  5|  Bob|
             +---+-----+
    --- End diff --
    
    Oh, order of elements within a single partition is not guaranteed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61398 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61398/consoleFull)** for PR 13765 at commit [`f1ee771`](https://github.com/apache/spark/commit/f1ee7718a2c125ee1c7b6ad1ef2059751fa5dc6d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61182/consoleFull)** for PR 13765 at commit [`80c6c52`](https://github.com/apache/spark/commit/80c6c52b42fb37b64a3b40cb65d33a15adda0a34).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Hi, @cloud-fan .
    Could you review this optimizer?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r69963613
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
    @@ -370,8 +370,11 @@ package object dsl {
             case plan => SubqueryAlias(alias, plan)
           }
     
    -      def distribute(exprs: Expression*): LogicalPlan =
    -        RepartitionByExpression(exprs, logicalPlan)
    +      def repartition(num: Integer): LogicalPlan =
    +        Repartition(num, shuffle = true, logicalPlan)
    +
    +      def distribute(exprs: Expression*)(n: Int = -1): LogicalPlan =
    +        RepartitionByExpression(exprs, logicalPlan, numPartitions = if (n < 0) None else Some(n))
    --- End diff --
    
    I tried again and got the following compilation error.
    ```
    [error] /Users/dongjoon/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:376: a parameter section with a `*'-parameter is not allowed to have default arguments
    [error]       def distribute(n: Int = -1, exprs: Expression*): LogicalPlan =
    [error]           ^
    [error] one error found
    [error] (catalyst/compile:compileIncremental) Compilation failed
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61182/consoleFull)** for PR 13765 at commit [`80c6c52`](https://github.com/apache/spark/commit/80c6c52b42fb37b64a3b40cb65d33a15adda0a34).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61663/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    By the way, this PR is about improving the existing `CollapseRepartition`, not about introducing new one. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    LGTM, cc @yhuai to take another look


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61834/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61255 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61255/consoleFull)** for PR 13765 at commit [`c3016f3`](https://github.com/apache/spark/commit/c3016f3f301e47148de780be1d6dfe8ae91b9f1c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #60790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60790/consoleFull)** for PR 13765 at commit [`caeb08f`](https://github.com/apache/spark/commit/caeb08f9e296dc1da3be9728a188b6cc662b6ab1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Hi, @cloud-fan .
    Now, this PR can handle all combinations of Repartition and RepartitionBy.
    I updated the description of PR and JIRA, too.
    Thank you so much for making this PR much useful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r69930648
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -537,12 +537,19 @@ object CollapseProject extends Rule[LogicalPlan] {
     }
     
     /**
    - * Combines adjacent [[Repartition]] operators by keeping only the last one.
    + * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] operator combinations
    + * by keeping only the one.
    --- End diff --
    
    > ... by only keeping the top-level one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #60790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60790/consoleFull)** for PR 13765 at commit [`caeb08f`](https://github.com/apache/spark/commit/caeb08f9e296dc1da3be9728a188b6cc662b6ab1).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` op...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68577777
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -547,6 +548,16 @@ object CollapseRepartition extends Rule[LogicalPlan] {
     }
     
     /**
    + * Combines adjacent [[RepartitionByExpression]] operators.
    + */
    +object CollapseRepartitionBy extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +    case RepartitionByExpression(exprs, RepartitionByExpression(_, child, _), numPartitions) =>
    --- End diff --
    
    Then we can consolidate `CollapseRepartition` and `CollapseRepartitionBy` into one rule


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    @gatorsmile Does it make sense to you? I inserted SQL case for you. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Sorry we are very close to 2.0 release (and very busy to get it done) and this is too late to get into 2.0.  How about we revisit it after 2.0 release? thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61834/consoleFull)** for PR 13765 at commit [`71d0e38`](https://github.com/apache/spark/commit/71d0e38410b20a3c2078a1511fe3258200c81531).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Hi, @cloud-fan and @yhuai .
    I added more description about the logic. I hope that makes the intention of this PR more clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Thank you for review, @gatorsmile . You work at night, too! :)
    BTW, as you know, I don't have a real query. May I make some sample queries from my mind? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68788331
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -537,12 +537,19 @@ object CollapseProject extends Rule[LogicalPlan] {
     }
     
     /**
    - * Combines adjacent [[Repartition]] operators by keeping only the last one.
    + * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] operator combinations
    + * by keeping only the one.
      */
     object CollapseRepartition extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
         case Repartition(numPartitions, shuffle, Repartition(_, _, child)) =>
           Repartition(numPartitions, shuffle, child)
    +    case Repartition(numPartitions, _, r: RepartitionByExpression) =>
    +      r.copy(numPartitions = Some(numPartitions))
    +    case RepartitionByExpression(exprs, Repartition(_, _, child), numPartitions) =>
    +      RepartitionByExpression(exprs, child, numPartitions)
    +    case RepartitionByExpression(exprs, RepartitionByExpression(_, child, _), numPartitions) =>
    +      RepartitionByExpression(exprs, child, numPartitions)
    --- End diff --
    
    Yep. @yhuai .
    I think this needs to work in the exactly same way with
    ```
    SELECT a
    FROM (SELECT b
                 FROM t
                 GROUP BY b)
    GROUPY BY a
    ```
    If there is something I missed, please let me know. Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Hi, @yhuai .
    Could you review this PR when you have some time?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r70032947
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
    @@ -370,8 +370,11 @@ package object dsl {
             case plan => SubqueryAlias(alias, plan)
           }
     
    -      def distribute(exprs: Expression*): LogicalPlan =
    -        RepartitionByExpression(exprs, logicalPlan)
    +      def repartition(num: Integer): LogicalPlan =
    +        Repartition(num, shuffle = true, logicalPlan)
    +
    +      def distribute(exprs: Expression*)(n: Int = -1): LogicalPlan =
    +        RepartitionByExpression(exprs, logicalPlan, numPartitions = if (n < 0) None else Some(n))
    --- End diff --
    
    Oh I see. Then let's keep it. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Under what circumstances will a user use 2 or more adjacent re-partitioning operators?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61309/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` optimizer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61255/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` optimizer

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Hi, @cloud-fan .
    Could you review this `CollapseRepartitionBy` optimizer when you have some time?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Rebased.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Hi, @cloud-fan .
    Could you review this optimizer?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy opti...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r67610142
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
    @@ -374,6 +374,9 @@ package object dsl {
             case plan => SubqueryAlias(alias, plan)
           }
     
    +      def repartition(num: Integer): LogicalPlan =
    +        Repartition(num, shuffle = true, logicalPlan)
    +
    --- End diff --
    
    I added this for convenience. It is used in the testsuite of this PR, too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    There are three possibilities.
    
    1. User mistakes. (Rarely)
    2. Intermediate results of optimization. (More frequently.)
    3. `View` (or pre-designed `Dataset`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Hi, @cloud-fan .
    Could you review `CollapseRepartition` optimizer again when you have some time?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy opti...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r67611572
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -451,10 +451,10 @@ def repartition(self, numPartitions, *cols):
             +---+-----+
             |age| name|
             +---+-----+
    -        |  5|  Bob|
    -        |  5|  Bob|
             |  2|Alice|
    +        |  5|  Bob|
             |  2|Alice|
    +        |  5|  Bob|
             +---+-----+
    --- End diff --
    
    Anyway, this testcase gets the first benefit of this optimizer. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add `CollapseRepartitionBy` optimizer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61255/consoleFull)** for PR 13765 at commit [`c3016f3`](https://github.com/apache/spark/commit/c3016f3f301e47148de780be1d6dfe8ae91b9f1c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    @rxin .
    For this `CollapseRepartition` optimizer, may I split into another file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61390/consoleFull)** for PR 13765 at commit [`f1ee771`](https://github.com/apache/spark/commit/f1ee7718a2c125ee1c7b6ad1ef2059751fa5dc6d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13765#discussion_r68787810
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -537,12 +537,19 @@ object CollapseProject extends Rule[LogicalPlan] {
     }
     
     /**
    - * Combines adjacent [[Repartition]] operators by keeping only the last one.
    + * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] operator combinations
    + * by keeping only the one.
    --- End diff --
    
    Sure!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    **[Test build #61398 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61398/consoleFull)** for PR 13765 at commit [`f1ee771`](https://github.com/apache/spark/commit/f1ee7718a2c125ee1c7b6ad1ef2059751fa5dc6d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13765: [SPARK-16052][SQL] Add CollapseRepartitionBy optimizer

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13765
  
    Thank you for comments!
    I see. No problem! :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org