You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/13 17:00:23 UTC

[GitHub] [spark] aokolnychyi opened a new pull request #29089: [SQL][SPARK-32276] Remove redundant sorts before repartition nodes

aokolnychyi opened a new pull request #29089:
URL: https://github.com/apache/spark/pull/29089


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This PR removes redundant sorts before repartition, repartitionByExpression and coalesce. 
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   It looks like our `EliminateSorts` rule can be extended further to remove sorts before repartition nodes. Independently of whether we do a shuffle or not while repartitioning data, each repartition operation may change the ordering and distribution of data. That's why it seems safe to perform the following rewrites:
   - `Repartition -> Sort -> Scan` as `Repartition -> Scan`
   - `Repartition -> Project -> Sort -> Scan` as `Repartition -> Project -> Scan`
   
   The only case that requires a bit more discussion is coalesce. It uses `DefaultPartitionCoalescer` that may preserve the ordering of data if there is no locality info in the parent RDD. At the same time, there is no guarantee that will happen.
   
   There is a test that implicitly relies on the current behavior. E.g., we have an assumption that `df.sortBy(...).coalesce(1)` will produce an ordered partition in `ArrowConvertersSuite`. To preserve the existing behavior, we can apply the new optimization only if repartition requires a shuffle.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   More test cases in `EliminateSortsSuite`.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659083689






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660344690






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658496722


   **[Test build #125858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125858/testReport)** for PR 29089 at commit [`ba6a1bb`](https://github.com/apache/spark/commit/ba6a1bb5a20d26fc85485afe116e44f2c934ef20).
    * This patch **fails PySpark pip packaging tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657836318






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658839477


   Oops, thanks for catching the corner case quickly, @dongjoon-hyun!
   
   My original idea for this PR was based on the fact that a range partitioning followed by a local sort is equivalent to a global sort if expressions are compatible. Then I started to generalize this idea and there was no obvious corner case. While this one is very subtle, I think it makes sense if we think more about it. Repartition nodes change data distribution but may not necessarily change the ordering of data. Partially, this is the reason why we excluded coalesce in the original proposal. Based on the example above, this seems to be true even if we hash partition our data.
   
   I'd explore cases where sort+repartition are next to each other. In that case, we are sure we change both the ordering and distribution and can potentially ignore the sort below. 
   
   For example, we may have this:
   
   ```
   sql("select * from (select * from (select * from t order by b desc) distribute by a) order by b asc")
   ```
   
   ```
   Sort [b#6 ASC NULLS FIRST], true
   +- RepartitionByExpression [a#5], 4
      +- Sort [b#6 DESC NULLS LAST], true
         +- Repartition 2, true
            +- LocalRelation [a#5, b#6]
   ```
   
   Is there a case where we want to keep the first sort before RepartitionByExpression?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352


   @hvanhovell . Thank you for your feedback. The following looks a little wrong to me because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS usage. I knew many production usages like that. I almost forgot that, but it rang my head suddenly during this PR. (Sadly, after I merged this.)
   >  You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   I understand your point of views fully. However, I'm wondering if you can persuade the customers to waste their storage by generating 160x bigger files (the example from SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```
   
   
   .
   For the following, SPARK-32318 added a test coverage at master/3.0/2.4. Are you suggesting that's not enough? If then, we can add more.
   > Finally I do want to point out that there is no mechanism that captures this regression if it pops up again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659489525






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658544475


   To generate small final Parquet/ORC files, we do the above tricks, don't we? This PR may cause a regression on the size of output storage.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659193547


   **[Test build #125931 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125931/testReport)** for PR 29089 at commit [`21a84ad`](https://github.com/apache/spark/commit/21a84adb3561788eea0e98c62129127b5bc9d5d5).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658544475


   To generate small final Parquet/ORC files, we do the above tricks, don't we?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454621594



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)
+  val analyzer = new Analyzer(catalog, conf)
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("Default", FixedPoint(10),
+        FoldablePropagation,
+        LimitPushDown) ::
+      Batch("Eliminate Sorts", Once,
+        EliminateSorts) ::
+      Batch("Collapse Project", Once,
+        CollapseProject) :: Nil
+  }
+
+  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
+  def isOptimized: Boolean = true
+
+  test(s"sortBy") {
+    val plan = testRelation.select('a, 'b).sortBy('a.asc, 'b.desc)
+    val planWithRepartition = repartition(plan)
+    val optimizedPlan = Optimize.execute(analyzer.execute(planWithRepartition))
+    val correctPlan = if (isOptimized) {
+      repartition(testRelation.select('a, 'b))
+    } else {
+      planWithRepartition
+    }
+    comparePlans(optimizedPlan, analyzer.execute(correctPlan))
+  }
+
+  test(s"sortBy with projection") {

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658326302






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659139822


   **[Test build #125931 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125931/testReport)** for PR 29089 at commit [`21a84ad`](https://github.com/apache/spark/commit/21a84adb3561788eea0e98c62129127b5bc9d5d5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658470126






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658333320






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658550248


   Very sorry, guys. Due to the above regression, I'll revert this commit urgently. We can rethink about this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660690504


   **[Test build #126134 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126134/testReport)** for PR 29089 at commit [`2157a71`](https://github.com/apache/spark/commit/2157a71e5f0548071d6f02e6dd3413fc8807daf5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658559706


   No~ It depends on file formats instead of Spark side.
   For example, in the above example, ORC files are small because it supports a special encoding when the data is sorted with a fixed increment. For Parquet files, the result will be different.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r456531088



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -969,11 +969,11 @@ object CombineFilters extends Rule[LogicalPlan] with PredicateHelper {
  * Removes Sort operation. This can happen:

Review comment:
       Shall we revise this line?
   ```scala
   - Removes Sort operation. This can happen:
   + Removes Sort operation if it doesn't affect the final output ordering.
   + Note that the final output ordering changes and affect the file size (SPARK-32318).
   + This optimizer handles the following cases:
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658839477


   Oops, thanks for catching the corner case quickly, @dongjoon-hyun!
   
   My original idea for this PR was based on the fact that a range partitioning followed by a local sort is equivalent to a global sort if expressions are compatible. Then I started to generalize this idea and there was no obvious corner case. While this one is very subtle, I think it makes sense if we think more about it. Repartition nodes change data distribution but may not necessarily change the ordering of data (at least, there may be sorted chunks). Partially, this is the reason why we excluded coalesce in the original proposal. Based on the example above, this seems to be true even if we hash partition our data.
   
   I'd explore cases where sort+repartition are next to each other. In that case, we are sure we change both the ordering and distribution and can potentially ignore the sort below. 
   
   For example, we may have this:
   
   ```
   sql("select * from (select * from (select * from t order by b desc) distribute by a) order by b asc")
   ```
   
   ```
   Sort [b#6 ASC NULLS FIRST], true
   +- RepartitionByExpression [a#5], 4
      +- Sort [b#6 DESC NULLS LAST], true
         +- Repartition 2, true
            +- LocalRelation [a#5, b#6]
   ```
   
   Is there a case where we want to keep the first sort before RepartitionByExpression?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659488828


   **[Test build #125986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125986/testReport)** for PR 29089 at commit [`0545b09`](https://github.com/apache/spark/commit/0545b099ff75aa6bfe921b02dddadc7081538136).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657679330


   **[Test build #125781 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125781/testReport)** for PR 29089 at commit [`d71b9e6`](https://github.com/apache/spark/commit/d71b9e6083d93b1f98a63418de37eaede9626ca9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658884805


   In general, I think we can remove sort if it doesn't affect the final output ordering. The case caught by @dongjoon-hyun is a good example: the final output ordering changes and affect the file size.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658326302






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658828708


   @hvanhovell . I agree with you for the followings.
   > AFAIK nested ordering can be ignored from a relation algebra point of view. 
   > Regarding the shuffles. ...
   
   However, the following is not reasonable. There is nothing wrong in the file formats. They are just consumers and showing a better performance in a sorted input sequence. I guess you assume that this is only a behavior at ORC. But, I'm sure that you can find your customers are relying on this in Parquet, too.
   > If you want sorted runs in ORC then you ought to fix is there, and not rely on some implicit system behavior.
   
   This is not an implicit system behavior in Apache Spark. Apache Spark has been working in the procedural ways as you see in the above. If we start to ignore the valid working pattern in the production, it becomes a huge regression.
   
   In short, `saving` to a file is a totally different and valid story. To optimize the final output files, the above pattern have been used in the production among Apache Spark users for a long time. If some optimizer rule ignores the existing usage, this ends up at a large regression in terms of the cost (for example, S3) obviously.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658841199


   The same question for local sort too.
   
   ```
   sql("select * from (select * from (select * from t order by b desc) distribute by a) sort by b asc")
   ```
   
   ```
   Sort [b#6 ASC NULLS FIRST], false
   +- RepartitionByExpression [a#5], 4
      +- Sort [b#6 DESC NULLS LAST], true
         +- Repartition 2, true
            +- LocalRelation [a#5, b#6]
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454541782



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>
+      r.copy(child = recursiveRemoveSort(r.child))

Review comment:
       Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r453851262



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>
+      r.copy(child = recursiveRemoveSort(r.child))

Review comment:
       Should we also limit to `partitionExpressions` are all deterministic case?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454601951



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)
+  val analyzer = new Analyzer(catalog, conf)
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("Default", FixedPoint(10),
+        FoldablePropagation,
+        LimitPushDown) ::
+      Batch("Eliminate Sorts", Once,
+        EliminateSorts) ::
+      Batch("Collapse Project", Once,
+        CollapseProject) :: Nil
+  }
+
+  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
+  def isOptimized: Boolean = true

Review comment:
       It is set to false in `EliminateSortsBeforeCoalesceSuite` as we excluded coalesce.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660196508






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658347351


   @dongjoon-hyun @viirya @rdblue this PR is ready for another review round.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659083689






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454565800



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
##########
@@ -310,4 +310,109 @@ class EliminateSortsSuite extends PlanTest {
     val correctAnswer = PushDownOptimizer.execute(noOrderByPlan.analyze)
     comparePlans(optimized, correctAnswer)
   }
+
+  testRepartitionOptimization(

Review comment:
       I think this pattern is common for the codebase but I agree having separate suites makes more sense here. Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352


   @hvanhovell . The following is complete wrong because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS usage. I knew many production usages like that. I almost forgot that, but it rang my head suddenly during this PR. (Sadly, after I merged this.)
   >  You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   I understand your point of views fully. However, I'm wondering if you can persuade the customers to waste their storage by generating 160x bigger files (the example from SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```
   
   
   .
   For the following, SPARK-32318 added a test coverage at master/3.0/2.4. Are you suggesting that's not enough?
   > Finally I do want to point out that there is no mechanism that captures this regression if it pops up again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454015855



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>
+      r.copy(child = recursiveRemoveSort(r.child))

Review comment:
       You're right, that does seem to imply that ordering is preserved underneath non-deterministic expressions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660727981


   **[Test build #126134 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126134/testReport)** for PR 29089 at commit [`2157a71`](https://github.com/apache/spark/commit/2157a71e5f0548071d6f02e6dd3413fc8807daf5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658472869






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658397973


   **[Test build #125847 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125847/testReport)** for PR 29089 at commit [`83791b7`](https://github.com/apache/spark/commit/83791b769eb9de9be736c1310dc610c30c8cde5d).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454607941



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)
+  val analyzer = new Analyzer(catalog, conf)
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("Default", FixedPoint(10),
+        FoldablePropagation,
+        LimitPushDown) ::
+      Batch("Eliminate Sorts", Once,
+        EliminateSorts) ::
+      Batch("Collapse Project", Once,
+        CollapseProject) :: Nil
+  }
+
+  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
+  def isOptimized: Boolean = true
+
+  test(s"sortBy") {

Review comment:
       nit. `s"` -> `"`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658876843


   **[Test build #125886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125886/testReport)** for PR 29089 at commit [`ba6a1bb`](https://github.com/apache/spark/commit/ba6a1bb5a20d26fc85485afe116e44f2c934ef20).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454621400



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)
+  val analyzer = new Analyzer(catalog, conf)
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("Default", FixedPoint(10),
+        FoldablePropagation,
+        LimitPushDown) ::
+      Batch("Eliminate Sorts", Once,
+        EliminateSorts) ::
+      Batch("Collapse Project", Once,
+        CollapseProject) :: Nil
+  }
+
+  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
+  def isOptimized: Boolean = true
+
+  test(s"sortBy") {

Review comment:
       Good catch, these are leftovers from the old version of tests.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659683962






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658612372


   Ehh... AFAIK nested ordering can be ignored from a relation algebra point of view. So I am not sure this is a very solid argument. This feels a bit like an example of [hyrum's law](https://www.hyrumslaw.com/). If you want sorted runs in ORC then you ought to fix is there, and not rely on some implicit system behavior.
   
   Regarding the shuffles. If the data is sorted before it goes into the shuffle, then the individual shuffle blocks are sorted. This is also the reason why doing a sort aggregate is not completely terrible (TimSort is good at identifying sorted runs).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658329251


   **[Test build #125848 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125848/testReport)** for PR 29089 at commit [`c58ad12`](https://github.com/apache/spark/commit/c58ad12609b732fe8539277f3dcfb046268c5740).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659193931






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660195822


   **[Test build #126060 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126060/testReport)** for PR 29089 at commit [`2157a71`](https://github.com/apache/spark/commit/2157a71e5f0548071d6f02e6dd3413fc8807daf5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658470126






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659085333


   I've updated the PR to show what I meant. I'll check for additional edge cases in the morning but the change is ready for review. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r456524842



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -969,11 +969,11 @@ object CombineFilters extends Rule[LogicalPlan] with PredicateHelper {
  * Removes Sort operation. This can happen:
  * 1) if the sort order is empty or the sort order does not have any reference
  * 2) if the child is already sorted
- * 3) if there is another Sort operator separated by 0...n Project/Filter operators
- * 4) if the Sort operator is within Join separated by 0...n Project/Filter operators only,
- *    and the Join conditions is deterministic
- * 5) if the Sort operator is within GroupBy separated by 0...n Project/Filter operators only,
- *    and the aggregate function is order irrelevant
+ * 3) if there is another Sort operator separated by 0...n Project/Filter/Repartition operators
+ * 4) if the Sort operator is within Join separated by 0...n Project/Filter/Repartition
+ *    operators only, and the Join conditions is deterministic
+ * 5) if the Sort operator is within GroupBy separated by 0...n Project/Filter/Repartition
+ *    operators only, and the aggregate function is order irrelevant

Review comment:
       This documentation update seems to focus on `case _: Repartition => true` only. Could you revise more to cover `case r: RepartitionByExpression => r.partitionExpressions.forall(_.deterministic)`, please?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658349453


   LGTM. I just had one question in the tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660196508






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661345335


   For the following, I added SPARK-32318 added a test coverage at master/3.0/2.4. Are you suggesting that's not enough?
   > Finally I do want to point out that there is no mechanism that captures this regression if it pops up again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658398544


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658876966


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658332654


   **[Test build #125849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125849/testReport)** for PR 29089 at commit [`0ff3092`](https://github.com/apache/spark/commit/0ff3092914bd51e3ee4eec982981cf3514499ea4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658395393


   **[Test build #125858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125858/testReport)** for PR 29089 at commit [`ba6a1bb`](https://github.com/apache/spark/commit/ba6a1bb5a20d26fc85485afe116e44f2c934ef20).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658857595


   **[Test build #125886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125886/testReport)** for PR 29089 at commit [`ba6a1bb`](https://github.com/apache/spark/commit/ba6a1bb5a20d26fc85485afe116e44f2c934ef20).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660690648






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454617890



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)
+  val analyzer = new Analyzer(catalog, conf)
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("Default", FixedPoint(10),
+        FoldablePropagation,
+        LimitPushDown) ::
+      Batch("Eliminate Sorts", Once,
+        EliminateSorts) ::
+      Batch("Collapse Project", Once,
+        CollapseProject) :: Nil
+  }
+
+  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
+  def isOptimized: Boolean = true

Review comment:
       I see. I missed that suite. Thanks for explaining!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352


   @hvanhovell . The following is complete wrong because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS usage. I knew many production usages like that. I almost forgot that, but it rang my head suddenly during this PR. (Sadly, after I merged this.)
   >  You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   I understand your point of views fully. However, I'm wondering if you can persuade the customers to waste their storage by generating 160x bigger files (the example from SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```
   
   
   .
   For the following, SPARK-32318 added a test coverage at master/3.0/2.4. Are you suggesting that's not enough? If then, we can add more.
   > Finally I do want to point out that there is no mechanism that captures this regression if it pops up again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658831236


   BTW, @aokolnychyi . I merged the corner case test case PR, https://github.com/apache/spark/pull/29118. Could you rebase this PR to the master? Then, we can discuss how to proceed this PR in a narrowed direction.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661348318


   For the following, I'd like to ask your help if you are interested. I believe we want to build the better Apache Spark in the community together.
   > If you generalize the procedural argument then we also should not do things like join reordering or swapping window operators. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658563157


   Yeah, because the different data distribution, physical encoding of data could result in different size, that is what I meant.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660728192






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658828708


   @hvanhovell . I agree with you for the followings.
   > AFAIK nested ordering can be ignored from a relation algebra point of view. 
   > Regarding the shuffles. ...
   
   However, the following is not reasonable. There is nothing wrong in the file formats. They are just consumers and showing a better performance in a sorted input sequence because they are columnar vectorized format. I guess you assume that this is only a behavior at ORC. But, I'm sure that you can find your customers are relying on this in Parquet, too.
   > If you want sorted runs in ORC then you ought to fix is there, and not rely on some implicit system behavior.
   
   This is not an implicit system behavior in Apache Spark. Apache Spark has been working in the procedural ways as you see in the above. If we start to ignore the valid working pattern in the production, it becomes a huge regression.
   
   In short, `saving` to a file is a totally different and valid story. To optimize the final output files, the above pattern have been used in the production among Apache Spark users for a long time. If some optimizer rule ignores the existing usage, this ends up at a large regression in terms of the cost (for example, S3) obviously.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SQL][SPARK-32276] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657679907






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658329251


   **[Test build #125848 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125848/testReport)** for PR 29089 at commit [`c58ad12`](https://github.com/apache/spark/commit/c58ad12609b732fe8539277f3dcfb046268c5740).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658398544






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658559706


   No~ It depends on file formats instead of Spark side.
   For example, in the above example, ORC files are small because it supports a special encoding when the input data is sorted with a fixed increment. For Parquet files, the result will be different.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659488828


   **[Test build #125986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125986/testReport)** for PR 29089 at commit [`0545b09`](https://github.com/apache/spark/commit/0545b099ff75aa6bfe921b02dddadc7081538136).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658876982


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125886/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661345335


   For the following, I added SPARK-32318 added a test coverage at master/3.0/2.4. Are you suggesting that's not enough?
   > Finally I do want to point out that there is no mechanism that captures this regression if it pops up again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352


   @hvanhovell . The following is complete wrong because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS usage. I knew many production usages like that. I almost forgot that, but it rang my head suddenly during this PR. (Sadly, after I merged this.)
   >  You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   I understand your point of views fully. However, I'm wondering if you can persuade the customers to waste their storage by generating 160x bigger files (the example from SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352


   @hvanhovell . The following is complete wrong because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS usage. I knew many production usages like that. I almost forgot that, but it rang my head suddenly during this PR. (Sadly, after I merged this.)
   >  You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   I understand your point of views fully. However, I'm wondering if you can persuade the customers to waste their storage by generating 160x bigger files (SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660698710


   @dongjoon-hyun I am a bit late with my response but here goes :)
   
   > However, the following is not reasonable. There is nothing wrong in the file formats. They are just consumers and showing a better performance in a sorted input sequence because they are columnar vectorized format. I guess you assume that this is only a behavior at ORC. But, I'm sure that you can find your customers are relying on this in Parquet, too.
   
   That is making the argument for explicitly organizing the data before the write right? You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   > This is not an implicit system behavior in Apache Spark. Apache Spark has been working in the procedural ways as you see in the above. If we start to ignore the valid working pattern in the production, it becomes a huge regression.
   > In short, saving to a file is a totally different and valid story. To optimize the final output files, the above pattern have been used in the production among Apache Spark users for a long time. If some optimizer rule ignores the existing usage, this ends up at a large regression in terms of the cost (for example, S3) obviously.
   
   If you generalize the procedural argument then we also should not do things like join reordering or swapping window operators. The whole point of a declarative system like Spark SQL is that you don't care about how the system executes a query, and that it has the freedom move operations around to make execution more optimal.
   
   Have you considered that your regression is someone else's speed-up? Sorting is not free, and if we can avoid it we should. There might a large group of users that are adversely affected by spurious sorts in the queries (e.g. an order by in a view).
   
   Finally I do want to point out that there is no mechanism that captures this regression if it pops up again.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SQL][SPARK-32276] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r453796416



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>

Review comment:
       These two branches can be replaces with one:
   
   ```
   case r: RepartitionOperation =>
     r.withNewChildren(r.children.map(recursiveRemoveSort))
   ```
   
   It will mean all repartition nodes we add in the future will be also taken into account. It seems safe but I want to hear what everybody thinks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659683962






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658396071






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658325499


   **[Test build #125847 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125847/testReport)** for PR 29089 at commit [`83791b7`](https://github.com/apache/spark/commit/83791b769eb9de9be736c1310dc610c30c8cde5d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658563157


   Yeah, because the different data distribution, physical encoding of data could result in different size.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658329765






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658931940


   Yes, my proposal is to optimize cases when we sort the data after the repartition like in the examples I gave above. In those cases, sorts below seem to be redundant. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454024892



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>

Review comment:
       I am +1 on having a repartition node that preserves ordering. In fact, we have such a node internally. `Coalesce` is not really order preserving, though. It has custom logic in `DefaultPartitionCoalescer` that gets applied if the parent RDD has locality info. We would also need to report `outputOrdering` correctly (which is not done in case of `Coalesce` now)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658538140


   Also, cc @gatorsmile and @cloud-fan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657713339


   Thank you for pinging me, @aokolnychyi .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659041791


   It looks reasonable to me to remove a sort before a repartition if we know the data will be sorted later, e.g. @aokolnychyi's examples above.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658333320






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658538432


   Thanks, everyone!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660690648






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658859789






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660344698


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126060/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660343826


   **[Test build #126060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126060/testReport)** for PR 29089 at commit [`2157a71`](https://github.com/apache/spark/commit/2157a71e5f0548071d6f02e6dd3413fc8807daf5).
    * This patch **fails PySpark pip packaging tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659489525






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352


   @hvanhovell . The following is complete wrong because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS usage. I knew many production usages like that. I almost forgot that, but it rang my head suddenly during this PR. (Sadly, after I merged this.)
   >  You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   I understand your point of views fully. However, I'm wondering if you can persuade the customers to waste their storage by generating 160x bigger files (the example from SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```
   
   For the following, I added SPARK-32318 added a test coverage at master/3.0/2.4. Are you suggesting that's not enough?
   > Finally I do want to point out that there is no mechanism that captures this regression if it pops up again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657836318






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r453998376



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
##########
@@ -310,4 +310,109 @@ class EliminateSortsSuite extends PlanTest {
     val correctAnswer = PushDownOptimizer.execute(noOrderByPlan.analyze)
     comparePlans(optimized, correctAnswer)
   }
+
+  testRepartitionOptimization(

Review comment:
       I prefer not to replace `test` because it makes tests difficult to run individually (at least in my IntelliJ environment).
   
   It also tends to increase readability. Here, you're passing a function to `testRepartitionOptimization` that gets passed a function that modifies the logical plan. I think it would be easier to read if these were separate suites, with a suite-level repartition function:
   
   ```scala
   class EliminateSortsInRepartitionSuite extends ... {
     def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
   
     test("remove sortBy") {
       val plan = testRelation.select('a, 'b).sortBy('a.asc, 'b.desc)
       val planWithRepartition = repartition(plan)
       ...
     }
   }
   
   class EliminateSortsInRepartitionByExpressionSuite extends EliminateSortsInRepartitionSuite {
     override def repartition(plan: LogicalPlan): LogicalPlan = plan.distribute('a, 'b)(10)
   }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SQL][SPARK-32276] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657697965


   cc @dongjoon-hyun @dbtsai @cloud-fan @viirya @gengliangwang for feedback


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658472451


   **[Test build #125848 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125848/testReport)** for PR 29089 at commit [`c58ad12`](https://github.com/apache/spark/commit/c58ad12609b732fe8539277f3dcfb046268c5740).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660693225


   cc @cloud-fan and @gatorsmile once more.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658859789






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658830542


   cc @marmbrus and @gatorsmile since they know the existing customers well and are good at protecting them.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658544475


   To generate small final Parquet/ORC files, we do the above tricks, don't we? This may cause a regression on the size of output storage.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658857595


   **[Test build #125886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125886/testReport)** for PR 29089 at commit [`ba6a1bb`](https://github.com/apache/spark/commit/ba6a1bb5a20d26fc85485afe116e44f2c934ef20).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658932378


   @dongjoon-hyun @viirya @hvanhovell @maropu, what do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454621506



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)
+  val analyzer = new Analyzer(catalog, conf)
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("Default", FixedPoint(10),
+        FoldablePropagation,
+        LimitPushDown) ::
+      Batch("Eliminate Sorts", Once,
+        EliminateSorts) ::
+      Batch("Collapse Project", Once,
+        CollapseProject) :: Nil
+  }
+
+  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
+  def isOptimized: Boolean = true
+
+  test(s"sortBy") {

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454025327



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>
+      r.copy(child = recursiveRemoveSort(r.child))

Review comment:
       Agree, let me update this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29089:
URL: https://github.com/apache/spark/pull/29089


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658876982


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125886/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658329765






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SQL][SPARK-32276] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r453798200



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala
##########
@@ -1299,8 +1299,9 @@ class ArrowConvertersSuite extends SharedSparkSession {
          |}
        """.stripMargin
 
-    val a_i = List[Int](1, -1, 2, -2, 2147483647, -2147483648)
-    val b_i = List[Option[Int]](Some(1), None, None, Some(-2), None, Some(-2147483648))
+    // we order values by $"a_i".desc manually as sortBy before coalesce is ignored

Review comment:
       This is the case I mention in the PR description. Here, `DefaultPartitionCoalescer` does preserve the ordering and the test relied on that even though there is no guarantee it will happen. We could apply the new optimization only if the repartition operation requires a shuffle. That way, we will keep the existing behavior. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454542614



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)

Review comment:
       I considered making these protected but it makes lines longer and most of tests use public vars as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658559706


   The most big factor is file formats instead of Spark side.
   For example, in the above example, ORC files are small because it supports a special encoding when the input data is sorted with a fixed increment. For Parquet files, the result will be different.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658332654


   **[Test build #125849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125849/testReport)** for PR 29089 at commit [`0ff3092`](https://github.com/apache/spark/commit/0ff3092914bd51e3ee4eec982981cf3514499ea4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454009047



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>
+      r.copy(child = recursiveRemoveSort(r.child))

Review comment:
       Non-deterministic expressions could have status and row-order sensitive. In `EliminateSorts` we also only remove Sort for Project and Filter where projection list and condition are deterministic.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658543717


   Oops. Sorry, guys. It seems that I missed something during testing. For the following case, we should not remove `Sort`.
   
   **BEFORE THIS PR**
   ```scala
   scala> Seq((1,10),(1,20),(2,30),(2,40)).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
   
   scala> sql("select * from (select * from t order by b desc) distribute by a").show()
   +---+---+
   |  a|  b|
   +---+---+
   |  1| 20|
   |  1| 10|
   |  2| 40|
   |  2| 30|
   +---+---+
   ```
   
   **AFTER THIS PR**
   ```scala
   scala> Seq((1,10),(1,20),(2,30),(2,40)).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
   
   scala> sql("select * from (select * from t order by b desc) distribute by a").show()
   +---+---+
   |  a|  b|
   +---+---+
   |  1| 10|
   |  1| 20|
   |  2| 30|
   |  2| 40|
   +---+---+
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660344690


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r453993686



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>
+      r.copy(child = recursiveRemoveSort(r.child))

Review comment:
       I don't think that is necessary. My understanding is that the requirement for non-deterministic expressions is that they are run only once when finally executed, not that the plan cannot be optimized around them.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658497043


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125858/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658425886


   Those failures seem to belong to the old commit before [this](https://github.com/apache/spark/pull/29089/commits/c58ad12609b732fe8539277f3dcfb046268c5740) change where I removed old tests. Tests for coalesce that failed were no longer valid. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660728192






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658549984


   **AFTER SPARK-32276**
   ```
   scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
   
   scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/SPARK-32276")
   ```
   
   ```
   $ ls -al /tmp/SPARK-32276/
   total 632
   drwxr-xr-x  10 dongjoon  wheel     320 Jul 14 22:08 ./
   drwxrwxrwt  14 root      wheel     448 Jul 14 22:08 ../
   -rw-r--r--   1 dongjoon  wheel       8 Jul 14 22:08 ._SUCCESS.crc
   -rw-r--r--   1 dongjoon  wheel      12 Jul 14 22:08 .part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
   -rw-r--r--   1 dongjoon  wheel    1188 Jul 14 22:08 .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
   -rw-r--r--   1 dongjoon  wheel    1188 Jul 14 22:08 .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
   -rw-r--r--   1 dongjoon  wheel       0 Jul 14 22:08 _SUCCESS
   -rw-r--r--   1 dongjoon  wheel     119 Jul 14 22:08 part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   -rw-r--r--   1 dongjoon  wheel  150735 Jul 14 22:08 part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```
   
   **BEFORE**
   ```
   scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
   
   scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/master")
   ```
   
   ```
   $ ls -al /tmp/master/
   total 56
   drwxr-xr-x  10 dongjoon  wheel  320 Jul 14 22:12 ./
   drwxrwxrwt  15 root      wheel  480 Jul 14 22:12 ../
   -rw-r--r--   1 dongjoon  wheel    8 Jul 14 22:12 ._SUCCESS.crc
   -rw-r--r--   1 dongjoon  wheel   12 Jul 14 22:12 .part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
   -rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
   -rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
   -rw-r--r--   1 dongjoon  wheel    0 Jul 14 22:12 _SUCCESS
   -rw-r--r--   1 dongjoon  wheel  119 Jul 14 22:12 part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   -rw-r--r--   1 dongjoon  wheel  932 Jul 14 22:12 part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658899476


   > To generate small final Parquet/ORC files, we do the above tricks, don't we?
   
   We don't rely on this. Our recommendation to users is to add a global sort to distribute the data, which adds the local sort in the final stage that won't be removed. I can understand people relying on this behavior, though.
   
   For now, I think it makes sense to remove a sort before a repartition if the data will be sorted later, like what I think @aokolnychyi is suggesting. That's really what we will need for tables that require a sort order -- that will be the final sort and we should be able to remove other sorts.
   
   We may also want to choose whether this is a guarantee and document it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454569169



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)
+  val analyzer = new Analyzer(catalog, conf)
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("Default", FixedPoint(10),
+        FoldablePropagation,
+        LimitPushDown) ::
+      Batch("Eliminate Sorts", Once,
+        EliminateSorts) ::
+      Batch("Collapse Project", Once,
+        CollapseProject) :: Nil
+  }
+
+  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
+  def isOptimized: Boolean = true

Review comment:
       What is the purpose of this? It is always set to true, so I don't see why the test cases use it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454565079



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>

Review comment:
       We won't be able to squash all cases into one as we need to check if repartition expressions are deterministic. However, I'd consider extending repartition nodes with order preserving repartition in a follow-up if there is enough support for that.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala
##########
@@ -1299,8 +1299,9 @@ class ArrowConvertersSuite extends SharedSparkSession {
          |}
        """.stripMargin
 
-    val a_i = List[Int](1, -1, 2, -2, 2147483647, -2147483648)
-    val b_i = List[Option[Int]](Some(1), None, None, Some(-2), None, Some(-2147483648))
+    // we order values by $"a_i".desc manually as sortBy before coalesce is ignored

Review comment:
       Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658558946


   Oh, this is interesting. I know removing `Sort` before `Repartition` will result in different data distribution because `Repartition` uses `RoundRobinPartitioning`. Because I think repartition doesn't guarantee shuffled data distribution, so I thought it is okay.
   
   Now seems different data distribution causes difference storage output size. I think it is because to repartition sorted data using `RoundRobinPartitioning` can generate more compact output.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SQL][SPARK-32276] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657679330


   **[Test build #125781 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125781/testReport)** for PR 29089 at commit [`d71b9e6`](https://github.com/apache/spark/commit/d71b9e6083d93b1f98a63418de37eaede9626ca9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r453990069



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala
##########
@@ -1299,8 +1299,9 @@ class ArrowConvertersSuite extends SharedSparkSession {
          |}
        """.stripMargin
 
-    val a_i = List[Int](1, -1, 2, -2, 2147483647, -2147483648)
-    val b_i = List[Option[Int]](Some(1), None, None, Some(-2), None, Some(-2147483648))
+    // we order values by $"a_i".desc manually as sortBy before coalesce is ignored

Review comment:
       I think we should apply the optimization only if the repartition requires a shuffle as you suggest. I know that there are users that depend on this behavior.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658398551


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125847/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658325499


   **[Test build #125847 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125847/testReport)** for PR 29089 at commit [`83791b7`](https://github.com/apache/spark/commit/83791b769eb9de9be736c1310dc610c30c8cde5d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659147381


   Thank you for quick updating, @aokolnychyi . Also, thank you all for your opinions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657835007


   **[Test build #125781 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125781/testReport)** for PR 29089 at commit [`d71b9e6`](https://github.com/apache/spark/commit/d71b9e6083d93b1f98a63418de37eaede9626ca9).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29089:
URL: https://github.com/apache/spark/pull/29089


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454025043



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala
##########
@@ -1299,8 +1299,9 @@ class ArrowConvertersSuite extends SharedSparkSession {
          |}
        """.stripMargin
 
-    val a_i = List[Int](1, -1, 2, -2, 2147483647, -2147483648)
-    val b_i = List[Option[Int]](Some(1), None, None, Some(-2), None, Some(-2147483648))
+    // we order values by $"a_i".desc manually as sortBy before coalesce is ignored

Review comment:
       I lean towards that as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658544475


   To generate small Parquet/ORC files, we do the above tricks, don't we?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SQL][SPARK-32276] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r453798200



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala
##########
@@ -1299,8 +1299,9 @@ class ArrowConvertersSuite extends SharedSparkSession {
          |}
        """.stripMargin
 
-    val a_i = List[Int](1, -1, 2, -2, 2147483647, -2147483648)
-    val b_i = List[Option[Int]](Some(1), None, None, Some(-2), None, Some(-2147483648))
+    // we order values by $"a_i".desc manually as sortBy before coalesce is ignored

Review comment:
       This is the case I mention in the PR description. In this case, `DefaultPartitionCoalescer` does preserve the ordering and the test relied on that even though there is no guarantee it will happen. We could apply the new optimization only if the repartition operation requires a shuffle. That way, we will keep the existing behavior. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SQL][SPARK-32276] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-657679907






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454619436



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>
+      r.copy(child = recursiveRemoveSort(r.child))

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660690504


   **[Test build #126134 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126134/testReport)** for PR 29089 at commit [`2157a71`](https://github.com/apache/spark/commit/2157a71e5f0548071d6f02e6dd3413fc8807daf5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659783828


   I gave it a bit of thought and did not find a case where the updated logic would break. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658469488


   **[Test build #125849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125849/testReport)** for PR 29089 at commit [`0ff3092`](https://github.com/apache/spark/commit/0ff3092914bd51e3ee4eec982981cf3514499ea4).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659683069


   **[Test build #125986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125986/testReport)** for PR 29089 at commit [`0545b09`](https://github.com/apache/spark/commit/0545b099ff75aa6bfe921b02dddadc7081538136).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660195822


   **[Test build #126060 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126060/testReport)** for PR 29089 at commit [`2157a71`](https://github.com/apache/spark/commit/2157a71e5f0548071d6f02e6dd3413fc8807daf5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660690119


   Retest this please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r456537732



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -969,11 +969,11 @@ object CombineFilters extends Rule[LogicalPlan] with PredicateHelper {
  * Removes Sort operation. This can happen:

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658472869






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454610134



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>
+      r.copy(child = recursiveRemoveSort(r.child))

Review comment:
       Could you update the PR description accordingly?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658497039






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658876966


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454609943



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala
##########
@@ -1299,8 +1299,9 @@ class ArrowConvertersSuite extends SharedSparkSession {
          |}
        """.stripMargin
 
-    val a_i = List[Int](1, -1, 2, -2, 2147483647, -2147483648)
-    val b_i = List[Option[Int]](Some(1), None, None, Some(-2), None, Some(-2147483648))
+    // we order values by $"a_i".desc manually as sortBy before coalesce is ignored

Review comment:
       Could you update the PR description accordingly?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658831236


   BTW, @aokolnychyi . I merged the corner case test case. Could you rebase this to the master? Then, we can discuss how to proceed this PR in a narrowed direction.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659193941


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125931/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r454608042



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsBeforeRepartitionSuite.scala
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.{Analyzer, EmptyFunctionRegistry}
+import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateSortsBeforeRepartitionSuite extends PlanTest {
+
+  val catalog = new SessionCatalog(new InMemoryCatalog, EmptyFunctionRegistry, conf)
+  val analyzer = new Analyzer(catalog, conf)
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("Default", FixedPoint(10),
+        FoldablePropagation,
+        LimitPushDown) ::
+      Batch("Eliminate Sorts", Once,
+        EliminateSorts) ::
+      Batch("Collapse Project", Once,
+        CollapseProject) :: Nil
+  }
+
+  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
+  def isOptimized: Boolean = true
+
+  test(s"sortBy") {
+    val plan = testRelation.select('a, 'b).sortBy('a.asc, 'b.desc)
+    val planWithRepartition = repartition(plan)
+    val optimizedPlan = Optimize.execute(analyzer.execute(planWithRepartition))
+    val correctPlan = if (isOptimized) {
+      repartition(testRelation.select('a, 'b))
+    } else {
+      planWithRepartition
+    }
+    comparePlans(optimizedPlan, analyzer.execute(correctPlan))
+  }
+
+  test(s"sortBy with projection") {

Review comment:
       ditto.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658497039


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352


   @hvanhovell . The following is complete wrong because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS usage. I knew many production usages like that. I almost forgot that, but it rang my head suddenly during this PR. (Sadly, after I merged this.)
   >  You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   I understand your point of views fully. However, I'm wondering if you can persuade the customers to waste their storage by generating 160x bigger files (the example from SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```
   
   
   
   For the following, I added SPARK-32318 added a test coverage at master/3.0/2.4. Are you suggesting that's not enough?
   > Finally I do want to point out that there is no mechanism that captures this regression if it pops up again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659193931


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352


   @hvanhovell . The following is complete wrong because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS usage. I knew many production usages like that. I almost forgot that, but it rang my head suddenly during this PR. (Sadly, after I merged this.)
   >  You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can he sure, is when you add these things yourself.
   
   I understand your point of views fully. However, I'm wondering if you can persuade the customer to waste their storage by generating 160x bigger files (SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658396071






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658395393


   **[Test build #125858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125858/testReport)** for PR 29089 at commit [`ba6a1bb`](https://github.com/apache/spark/commit/ba6a1bb5a20d26fc85485afe116e44f2c934ef20).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r453994340



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -981,6 +982,10 @@ object EliminateSorts extends Rule[LogicalPlan] {
       j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
     case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
       g.copy(child = recursiveRemoveSort(originChild))
+    case r: RepartitionByExpression =>

Review comment:
       What about adding `RepartitionOperation.preservesOrder`? Then we could collapse these cases while also excluding `Coalesce` and making this explicit for future repartition operators.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #29089:
URL: https://github.com/apache/spark/pull/29089#discussion_r456537845



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -969,11 +969,11 @@ object CombineFilters extends Rule[LogicalPlan] with PredicateHelper {
  * Removes Sort operation. This can happen:
  * 1) if the sort order is empty or the sort order does not have any reference
  * 2) if the child is already sorted
- * 3) if there is another Sort operator separated by 0...n Project/Filter operators
- * 4) if the Sort operator is within Join separated by 0...n Project/Filter operators only,
- *    and the Join conditions is deterministic
- * 5) if the Sort operator is within GroupBy separated by 0...n Project/Filter operators only,
- *    and the aggregate function is order irrelevant
+ * 3) if there is another Sort operator separated by 0...n Project/Filter/Repartition operators
+ * 4) if the Sort operator is within Join separated by 0...n Project/Filter/Repartition
+ *    operators only, and the Join conditions is deterministic
+ * 5) if the Sort operator is within GroupBy separated by 0...n Project/Filter/Repartition
+ *    operators only, and the aggregate function is order irrelevant

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658558813


   I made a PR to add a test coverage for the above case.
   - https://github.com/apache/spark/pull/29118


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658424153


   Some tests in `EliminateSortsSuite` were failed?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-659139822


   **[Test build #125931 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125931/testReport)** for PR 29089 at commit [`21a84ad`](https://github.com/apache/spark/commit/21a84adb3561788eea0e98c62129127b5bc9d5d5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org