Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/14 12:16:52 UTC

[GitHub] [spark] manuzhang opened a new pull request #35514: [SPARK-38027][SQL][DOCS] Add migration guide for bucketed scan behavior change

manuzhang opened a new pull request #35514:
URL: https://github.com/apache/spark/pull/35514


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Add a SQL migration guide entry for the bucketed scan behavior change from Spark 3.0 to 3.1.
   
   
   ### Why are the changes needed?
   Default behavior of bucketed scan is changed in [SPARK-32859](https://issues.apache.org/jira/browse/SPARK-32859).
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Only doc change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807535523



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       This is a performance improvement, which should be listed in the release notes, not the migration guide.
   
   Migration guide is used for changes that can break user queries, or make them return a different result.
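
   For context, the escape hatch mentioned in the quoted note is a session config. A minimal sketch of restoring the pre-3.1 behavior (assumes an active `SparkSession` bound to `spark`, e.g. in spark-shell):

   ```scala
   // Sketch: restore the pre-Spark-3.1 behavior where bucketed scan is
   // always used when bucketing is enabled. Assumes an active SparkSession
   // bound to `spark`.
   spark.conf.set("spark.sql.sources.bucketing.enabled", "true")
   spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
   ```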






[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807608383



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       As commented above, a filter on a non-sorted column can also be much faster with a bucketed scan. Disabling bucketed scan only when there is a filter on sorted columns won't fix all regressions in the `limit` case.






[GitHub] [spark] cloud-fan commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r808912337



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       Again, this only affects perf, so we don't put it in the migration guide. The impact is low because it only affects tables with special data layouts.
   
   I'd recommend including `x` as a bucket column as well, since you assume `x = 'a'` hits only one file.
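
   A sketch of that suggestion (hypothetical table and column names, assuming an active `SparkSession`): bucketing by the filtered column lets an equality predicate on it prune down to a single bucket.

   ```scala
   // Hypothetical sketch: bucket by the column used in the filter (`x`),
   // so a predicate like x = 'a' can map to a single bucket file.
   // Assumes `df` is an existing DataFrame with a column `x`.
   df.write
     .format("parquet")
     .bucketBy(8, "x")
     .sortBy("x")
     .saveAsTable("t_bucketed_by_x")
   ```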






[GitHub] [spark] c21 commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807540227



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       Sure, I think it's doable. Let me change the rule itself to be more restricted 






[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r808388781



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       I'm not arguing that bucketed scan is always faster, but disabling it automatically should not introduce unexpected regressions, or at least there should be documentation about it.






[GitHub] [spark] manuzhang commented on pull request #35514: [SPARK-38027][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on pull request #35514:
URL: https://github.com/apache/spark/pull/35514#issuecomment-1039764372


   cc @c21 @cloud-fan 




[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807589399



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       @c21 Add the bucket column to the projection, or just select all columns as @cloud-fan suggested.
   
   ```
   select i, k
   from t
   where j='a'
   limit 10
   ```






[GitHub] [spark] c21 commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807575981



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       @manuzhang - please see my comment above - https://github.com/apache/spark/pull/35514#discussion_r807540227 . Thanks.






[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r808512761



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       @c21 Thanks. That's fine when the change is transparent to users. In this case, however, it impacts users upgrading to Spark 3.1, and the first place I'd check is the migration guide.






[GitHub] [spark] cloud-fan commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807536335



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       I think this is a performance regression we should fix. It seems better to use a bucketed scan if we have a predicate against the bucket or sort columns. cc @c21 







[GitHub] [spark] c21 commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807561823



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       btw @manuzhang - just to double check: if the query has `LIMIT`, then the rule will not disable bucketed scan, as the limit operator is not among the operators that allow disabling bucketed scan ([code](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/DisableUnnecessaryBucketedScan.scala#L134)). Could you come up with a complete query example for us to check?
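
   For readers checking such an example themselves, a sketch (hypothetical table `t`; assumes an active `SparkSession` bound to `spark`): run `explain` and look for `Bucketed: true` or `Bucketed: false (...)` on the `FileScan` node of the physical plan.

   ```scala
   // Sketch: inspect the physical plan to see whether the scan is bucketed.
   // The FileScan line reports "Bucketed: true" or "Bucketed: false (...)"
   // with the reason. Hypothetical table `t` with columns i, j, k.
   spark.sql("SELECT i, k FROM t WHERE j = 'a' LIMIT 10").explain()
   ```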







[GitHub] [spark] c21 commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807540227



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       ~~Sure, I think it's doable. Let me change the rule itself to be more restricted~~ @cloud-fan - actually, after a deeper look, I tried the same query shape @manuzhang shared in https://github.com/apache/spark/pull/35514#discussion_r806417239; the query below
   
   ```scala
   (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
     .write.format("parquet").bucketBy(8, "i").sortBy("k").saveAsTable("t1")
   sql("SELECT j FROM t1 WHERE k = 1 LIMIT 10")
   ```
   
   does not use a bucketed scan, because the query does not read the bucket column. Verified in the query plan:
   
   ```
   CollectLimit 10
   +- Project [j#20]
      +- Filter (isnotnull(k#21) AND (cast(k#21 as int) = 1))
         +- FileScan parquet default.t1[j#20,k#21] Batched: true, Bucketed: false (bucket column(s) not read), DataFilters: [isnotnull(k#21), (cast(k#21 as int) = 1)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(k)], ReadSchema: struct<j:int,k:string>
   ```
   
   Note the `Bucketed: false (bucket column(s) not read)` in the query plan. After a bit of research, I found that this behavior ([do not use bucketed scan if not reading the bucket column](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L299)) was introduced in https://github.com/apache/spark/pull/27924, so it's not the newly added rule disabling bucketed scan here.







[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807570035



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       This is the complete query. I think it hits the `FileSourceScanExec` case https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/DisableUnnecessaryBucketedScan.scala#L100
   






[GitHub] [spark] c21 commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807609116



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       > filter on non-sorted column can also be much faster with bucketed scan
   
   I don't get this argument, do you mind giving a concrete example with metrics?






[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38027][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r806417239



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
    Yes. In my case table `t` is bucketed on `i` and sorted on `j`. The following SQL takes 6 minutes to run in Spark 3.1.1 but only 30 seconds in Spark 2.3.1.
   
   ```sql
   select k
   from t
   where j='a'
   limit 10
   ```
   
    From the logs I could see it was a bucketed scan in 2.3.1 but a non-bucketed scan in 3.1.1, though I didn't know why. It took me days of debugging to find this config. If it had been documented in the migration guide, I could have figured it out much sooner; I think it would help others as well.
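For anyone hitting the same regression, a minimal sketch of restoring the pre-3.1 behavior from PySpark (assuming an existing `SparkSession` named `spark`; the config names are the ones quoted in the migration note above):

```python
# Sketch: restore the Spark 3.0 bucketed-scan behavior in a PySpark session.
# Assumes `spark` is an already-created SparkSession.
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```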
   






[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r809435918



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
    Okay. In our case, the bucketed table is used by many users, so it's not feasible to change the bucket column for one use case. I think all perf regressions, regardless of data layout, need to be fixed for this optimization.






[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807589399



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
    @c21 Add the bucket column to the projection (sorry for leaving it out in the previous example), or just select all columns as @cloud-fan suggested.
   
   ```
   select i, k
   from t
   where j='a'
   limit 10
   ```






[GitHub] [spark] cloud-fan commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807592894



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       > Even if the predicate is not bucket or sort column, this query could still be much slower without bucketed scan for large bucket file (e.g. 3GB) since more tasks and stages will be launched instead of 1.
   
    Why does bucketed scan help here? Without bucketed scan, Spark can smartly decide the parallelism w.r.t. the data size, while with bucketed scan the number of partitions must equal the number of buckets, which is usually not optimal.
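A back-of-the-envelope sketch of the parallelism difference being described (illustrative only; the split math is simplified and ignores Spark's open-cost heuristics):

```python
import math

# Rough sketch: with bucketed scan the partition count is pinned to the number
# of buckets; without it, Spark splits files by spark.sql.files.maxPartitionBytes.
def partitions_with_bucketed_scan(num_buckets):
    return num_buckets

def partitions_without_bucketed_scan(total_bytes, max_partition_bytes=128 * 1024**2):
    return math.ceil(total_bytes / max_partition_bytes)

# Illustrative figures taken from later in this thread: 10000 buckets of ~3GB each.
num_buckets = 10000
total_bytes = num_buckets * 3 * 1024**3
print(partitions_with_bucketed_scan(num_buckets))     # 10000
print(partitions_without_bucketed_scan(total_bytes))  # 240000
```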






[GitHub] [spark] HyukjinKwon commented on a change in pull request #35514: [SPARK-38027][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r806411581



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       Hm, this was technically a performance improvement which shouldn't cause user-facing behaviour impacts. I think we should focus on user-facing changes here. Does it cause any user-facing change?






[GitHub] [spark] c21 commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r808450997



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
    @manuzhang - regarding documentation, it's already documented in https://spark.apache.org/docs/latest/configuration.html.
   
   <img width="1630" alt="Screen Shot 2022-02-16 at 1 01 28 PM" src="https://user-images.githubusercontent.com/4629931/154356460-1b97d56a-9500-4b0d-9d26-b59da7247cfe.png">
    






[GitHub] [spark] manuzhang edited a comment on pull request #35514: [SPARK-38027][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang edited a comment on pull request #35514:
URL: https://github.com/apache/spark/pull/35514#issuecomment-1039764372


   cc @c21 @cloud-fan @HyukjinKwon 








[GitHub] [spark] c21 commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807604696



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
    If bucket columns are selected, I concur this is an issue caused by the rule. I am working on a fix to not disable bucketed scan when there is a filter on sorted columns. I think the reason bucketed scan is more efficient for a limit query is:
    * with bucketed scan: one bucket file is processed by one task. For a large bucket file, a small limit value, and a filter on a sorted column, the task can return early, skipping the scan of most of the remaining rows in the file.
    * without bucketed scan: one bucket file can be split across multiple tasks if it's large. The filter for some file splits may need to scan all rows in the split without skipping, because the sorted file is divided into multiple splits.
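The two bullets above can be illustrated with a toy scan simulation (hypothetical numbers; the `rows_scanned_*` helpers are made up for illustration, not Spark APIs):

```python
def rows_scanned_whole_file(sorted_rows, predicate, limit):
    """One task scans the whole sorted file top-to-bottom and stops early
    once `limit` matching rows are found (the bucketed-scan case)."""
    hits = scanned = 0
    for row in sorted_rows:
        scanned += 1
        if predicate(row):
            hits += 1
            if hits == limit:
                break
    return scanned

def rows_scanned_per_split(sorted_rows, num_splits):
    """Each split is scanned by an independent task and results are only
    combined afterwards, so no task can stop another one early
    (the non-bucketed-scan worst case: every split fully read)."""
    return len(sorted_rows)  # all splits together read every row

# Toy data: 1,000,000 sorted rows, with the 10 matching rows at the front.
rows = list(range(1_000_000))
matches_first_10 = lambda r: r < 10
print(rows_scanned_whole_file(rows, matches_first_10, 10))  # 10
print(rows_scanned_per_split(rows, num_splits=8))           # 1000000
```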






[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807713022



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
    It's not easy to come up with a simple example, so here are some stats from our production case.
    Table `t` has 10000 buckets, bucketed on column `i` and sorted on column `j`, and each bucket file is around 3GB.
    There are more than 10 rows satisfying `x='a'` in the last part of `part-00000-*.parquet` for the query below.
    
    ```sql
    select i, k
    from t
    where x = 'a'
    limit 10;
    ```
    
    With bucketed scan, all results can be found in `part-00000-*.parquet` in one task.
    Without bucketed scan (`spark.sql.files.maxPartitionBytes=128MB`), 1 task, 4 tasks, 20 tasks, ..., 151875 tasks were launched across 9 stages to find all results.
    IIUC, file splits are sorted by length from large to small, hence the last part of `part-00000-*.parquet` was only scanned in the last stage.
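The incremental task launching described here can be sketched as follows (a simplification: the real heuristic in Spark's `executeTake` also scales with the rows found so far, and the scale-up factor is governed by `spark.sql.limit.scaleUpFactor`):

```python
def stages_to_cover(total_partitions, scale_up_factor=4):
    """Simplified sketch of incremental limit execution: try 1 partition
    first, then multiply the batch size by scale_up_factor each stage,
    until the whole input is covered (worst case: the matching rows sit
    in the very last split)."""
    tried, next_batch, stages = 0, 1, 0
    while tried < total_partitions:
        tried += min(next_batch, total_partitions - tried)
        stages += 1
        next_batch *= scale_up_factor
    return stages

print(stages_to_cover(1))       # 1 stage
print(stages_to_cover(21))      # 3 stages (1 + 4 + 16 partitions)
print(stages_to_cover(240000))  # 10 stages for ~240k splits
```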
   








[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807585678



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
    Even if the predicate is not on a bucket or sort column, this query could still be much slower without bucketed scan for a large bucket file (e.g. 3GB), since many tasks and stages will be launched instead of one.
   






[GitHub] [spark] cloud-fan commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807583075



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       `sql("SELECT * FROM t1 WHERE k = 1 LIMIT 10")` should trigger the issue?






[GitHub] [spark] cloud-fan commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r808002660



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucket join is enabled(`spark.sql.sources.bucketing.enabled=true`), whether to do bucketed scan on input tables is decided automatically based on query plan. Bucketed scan is not used if 1. query does not have operators to utilize bucketing (e.g. join, group-by) or 2. there's an exchange between these operators and table scan. You can restore old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.  

Review comment:
       > With bucketed scan, all results can be found in part-00000-*.parquet in one task.
   
    This seems more like a coincidence; it has nothing to do with your bucket/sort column. Bucketed scan may be good for this particular table this time, but maybe not next time, once the table's data has changed.








[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r807600009



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucketing is enabled (`spark.sql.sources.bucketing.enabled=true`), whether to do a bucketed scan on input tables is decided automatically based on the query plan. A bucketed scan is not used if 1) the query has no operators that utilize bucketing (e.g. join, group-by), or 2) there is an exchange between these operators and the table scan. You can restore the old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.

Review comment:
       I mean in the case of `limit`, where Spark incrementally launches tasks in stages to find rows satisfying the predicate.
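
   The incremental task launching can be sketched with a simplified model of Spark's limit execution loop. This is not the real implementation: it assumes the worst case where no matching rows are found until the very last partition, and it uses only the "grow by `spark.sql.limit.scaleUpFactor` (default 4) when nothing has been found" rule; the actual logic also adapts to rows collected so far. With an assumed total of 230000 file splits, the model happens to reproduce the stage sizes reported above:

   ```python
   # Simplified model of Spark's incremental scan for LIMIT queries.
   # Assumptions: matching rows appear only in the last partition scanned,
   # and each retry multiplies the partitions scanned so far by the
   # default scale-up factor of 4. Not a faithful reimplementation.

   def stages_to_satisfy_limit(total_partitions: int, scale_up_factor: int = 4):
       """Return the number of tasks launched in each stage."""
       stages = []
       scanned = 0
       while scanned < total_partitions:
           if scanned == 0:
               to_try = 1  # first stage scans a single partition
           else:
               to_try = scanned * scale_up_factor  # nothing found yet: grow fast
           to_try = min(to_try, total_partitions - scanned)
           stages.append(to_try)
           scanned += to_try
       return stages

   # 230000 total splits is an assumption chosen to match the reported numbers.
   stages = stages_to_satisfy_limit(total_partitions=230_000)
   print(stages)      # tasks launched per stage: 1, 4, 20, ..., 151875
   print(len(stages)) # 9 stages in this worst case
   ```

   Under these assumptions the model launches 1, 4, 20, 100, ... tasks across 9 stages, which lines up with the production stats quoted earlier in the thread.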






[GitHub] [spark] manuzhang closed pull request #35514: [SPARK-38207][SQL][DOCS] Add migration guide for bucketed scan behavior change

Posted by GitBox <gi...@apache.org>.
manuzhang closed pull request #35514:
URL: https://github.com/apache/spark/pull/35514


   

