You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "wangyum (via GitHub)" <gi...@apache.org> on 2023/03/17 03:20:26 UTC

[GitHub] [spark] wangyum opened a new pull request, #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

wangyum opened a new pull request, #40462:
URL: https://github.com/apache/spark/pull/40462

   ### What changes were proposed in this pull request?
   
   This PR enhances `CollapseRepartition` to remove repartition if it is the child of `LocalLimit`.
   For example:
   ```sql
   SELECT /*+ REBALANCE */ * FROM t WHERE id > 1 LIMIT 5;
   ```
   
   Before this PR:
   ```
   == Optimized Logical Plan ==
   GlobalLimit 5
   +- LocalLimit 5
      +- RebalancePartitions
         +- Filter (isnotnull(id#0L) AND (id#0L > 1))
            +- Relation spark_catalog.default.t[id#0L] parquet
   ```
   
   After this PR:
   ```
   == Optimized Logical Plan ==
   GlobalLimit 5
   +- LocalLimit 5
      +- Filter (isnotnull(id#0L) AND (id#0L > 1))
         +- Relation spark_catalog.default.t[id#0L] parquet
   ```
   
   Note that we don't remove repartition if it looks like the user might want to get data randomly. For example:
   ```sql
   SELECT /*+ REPARTITION(3) */ * FROM t WHERE id > 1 LIMIT 5;
   SELECT * FROM t WHERE id > 1 DISTRIBUTE BY random() LIMIT 5;
   ```
   
   ### Why are the changes needed?
   
   Reduce shuffle to improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1486144770

   If such queries cannot be optimized, the performance of such queries will be very poor. We use a partition to fetch data from MySQL, and increase its parallelism for downstream computing after fetching the data:
   
   ```sql
   CREATE VIEW full_query_log
   AS
   SELECT h.* FROM query_log_hdfs h
   UNION ALL
   SELECT /*+ REBALANCE */ q.*, DATE(start) FROM query_log_mysql q;
   
   SELECT * FROM full_query_log limit 5;
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on code in PR #40462:
URL: https://github.com/apache/spark/pull/40462#discussion_r1141564486


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1213,6 +1213,12 @@ object CollapseRepartition extends Rule[LogicalPlan] {
     // child.
     case r @ RebalancePartitions(_, child: RebalancePartitions, _) =>
       r.withNewChildren(child.children)
+    // Case 5: When a LocalLimit has a child of Repartition we can remove the Repartition.

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1486832610

   +1 for @cloud-fan 's suggestion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] VindhyaG commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "VindhyaG (via GitHub)" <gi...@apache.org>.
VindhyaG commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1479847214

   > > Can you please explain more on scenarios when rebalancepartitions becomes child of locallimit? i tried SELECT * FROM t WHERE id > 1 LIMIT 5; with spark 2.4.4 version and rebalancepartitions is not part of the plan.
   > 
   > Spark 2.4.4 does not have `RebalancePartitions`. Please see: https://issues.apache.org/jira/browse/SPARK-35786
   Got it.thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1487993457

   Reverted this commit: https://github.com/apache/spark/commit/58facc16536cb2784aace38d6cca77bd055a3053.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1487995895

   Thank you for the decision, @wangyum .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40462:
URL: https://github.com/apache/spark/pull/40462#discussion_r1140007962


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1213,6 +1213,12 @@ object CollapseRepartition extends Rule[LogicalPlan] {
     // child.
     case r @ RebalancePartitions(_, child: RebalancePartitions, _) =>
       r.withNewChildren(child.children)
+    // Case 5: When a LocalLimit has a child of Repartition we can remove the Repartition.

Review Comment:
   why we can remove it? can we put more explanation in the PR description?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1213,6 +1213,12 @@ object CollapseRepartition extends Rule[LogicalPlan] {
     // child.
     case r @ RebalancePartitions(_, child: RebalancePartitions, _) =>
       r.withNewChildren(child.children)
+    // Case 5: When a LocalLimit has a child of Repartition we can remove the Repartition.

Review Comment:
   why we can remove it? can we put more explanation in the PR description and code comments here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1484415289

   The optimizer can remove meanless logical operators as it won't change the query result and it's usually more efficient to remove operators. But for physical APIs like rebalance and repartition, the users explicitly ask to take control of performance tuning and we should trust them. Besides, I don't think this is an always-beneficial optimization, what if the limit is very large? We should let the user make the decision: they can remove the rebalance hint if they believe it makes the perf worse.  What do you think? @wangyum @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1484408565

   rebalance before limit feels meaningless.
   If the user really wants to take n rows from each partition, the SQL should look like this:
   ```sql
   SELECT *
   FROM   (SELECT *,
                  Row_number() OVER(partition BY pmod(Hash(id, 42), 200) ORDER BY id) AS rn
           FROM   tbl)
   WHERE  rn <= 5;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1487867844

   It is meaningless to only optimize this case, because the user can manually remove the hint:
   ```sql
   SELECT /*+ REBALANCE */ * FROM tbl LIMIT 5;
   ```
   
   I'd like to optimize this case, since the user cannot manually remove the hint, because the view may be maintained by others:
   ```sql
   CREATE VIEW full_query_log
   AS
   SELECT h.* FROM query_log_hdfs h
   UNION ALL
   SELECT /*+ REBALANCE */ q.*, DATE(start) FROM query_log_mysql q;
   
   SELECT * FROM full_query_log limit 5;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1484592416

   Could we push the `LocalLimit` through `RebalancePartitions`?
   ```
   == Optimized Logical Plan ==
   GlobalLimit 5
   +- LocalLimit 5
      +- RebalancePartitions
         +- LocalLimit 5
            +- Filter (isnotnull(id#0L) AND (id#0L > 1))
               +- Relation spark_catalog.default.tbl[id#0L] parquet
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1473464387

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1479987729

   Merged to master for Apache Spark 3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1484657163

   I don't think so. Local limit is per partition and rebalance changes partitioning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1486833076

   Could you make a follow-up, @wangyum ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1486163555

   OK now I got the use case. At the time when we add the rebalance hint, we don't know what the final query is. Shall we make this optimization a bit more conservative to match both global and local limit? I still think it's risky to remove rebalance below local limit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1482318942

   I don't quite get the rationale. For `SELECT /*+ REBALANCE */ * FROM t WHERE id > 1 LIMIT 5;`, the user explicitly requires to do a rebalance before limit, why do we remove it? It's a physical hint and we should respect it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1487849388

   How about reverting this PR directly? Otherwise, we need to change `LimitPushDown`, which is not elegant:
   ```scala
   case Limit(exp, u: Union) if u.children.exists(_.isInstanceOf[HasPartitionExpressions]) =>
     val newChildren = u.children.map {
       case r: RebalancePartitions =>
         maybePushLocalLimit(exp, r.child)
       case r: RepartitionByExpression
           if r.partitionExpressions.nonEmpty && r.partitionExpressions.forall(_.deterministic) =>
         maybePushLocalLimit(exp, r.child)
       case o =>
         maybePushLocalLimit(exp, o)
     }
     Limit(exp, u.copy(children = newChildren))
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1487859134

   why do we need to change `LimitPushDown`? We can't push limit through repartition.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit
URL: https://github.com/apache/spark/pull/40462


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1475522025

   > Can you please explain more on scenarios when rebalancepartitions becomes child of locallimit? i tried SELECT * FROM t WHERE id > 1 LIMIT 5; with spark 2.4.4 version and rebalancepartitions is not part of the plan.
   
   Spark 2.4.4  does not have `RebalancePartitions`. Please see: https://issues.apache.org/jira/browse/SPARK-35786


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1484659332

   oh wait, I thought the hint applies as the last operator. `SELECT /*+ REBALANCE */ * FROM t WHERE id > 1 LIMIT 5;` should rebalance be the root node for this query?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1485539305

   `HINT` is a part of `SELECT` clause.
   https://github.com/apache/spark/blob/a3d9e0ae0f95a55766078da5d0bf0f74f3c3cfc3/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L530-L532
   
   That's the reason why `LIMIT` is the root of the query.
   ```
   scala> sql("SELECT /*+ REBALANCE */ * FROM t WHERE id > 1 LIMIT 5").explain(true)
   == Parsed Logical Plan ==
   'GlobalLimit 5
   +- 'LocalLimit 5
      +- 'UnresolvedHint REBALANCE
         +- 'Project [*]
            +- 'Filter ('id > 1)
               +- 'UnresolvedRelation [t], [], false
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] VindhyaG commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

Posted by "VindhyaG (via GitHub)" <gi...@apache.org>.
VindhyaG commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1474925826

   Can you please explain more on scenarios when rebalancepartitions becomes child of locallimit? i tried  
   SELECT  * FROM t WHERE id > 1 LIMIT 5; with spark 2.4.4 version and rebalancepartitions is not part of the plan. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org