Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/15 03:26:48 UTC

[GitHub] [spark] manuzhang commented on a change in pull request #35514: [SPARK-38027][SQL][DOCS] Add migration guide for bucketed scan behavior change

manuzhang commented on a change in pull request #35514:
URL: https://github.com/apache/spark/pull/35514#discussion_r806417239



##########
File path: docs/sql-migration-guide.md
##########
@@ -185,6 +185,8 @@ license: |
     * `ALTER TABLE .. ADD PARTITION` throws `PartitionsAlreadyExistException` if new partition exists already
     * `ALTER TABLE .. DROP PARTITION` throws `NoSuchPartitionsException` for not existing partitions
 
+  - In Spark 3.1, when bucketing is enabled (`spark.sql.sources.bucketing.enabled=true`), whether to perform a bucketed scan on input tables is decided automatically based on the query plan. A bucketed scan is not used if 1. the query does not have operators that utilize bucketing (e.g. join, group-by), or 2. there is an exchange between these operators and the table scan. You can restore the old behavior by setting `spark.sql.sources.bucketing.autoBucketedScan.enabled` to `false`.

Review comment:
      Yes. In my case table `t` is bucketed on `i` and sorted on `j`. The following SQL takes 6 minutes to run in Spark 3.1.1 but only 30 seconds in Spark 2.3.1.
   
   ```sql
   select k
   from t
   where j='a'
   limit 10
   ```
   
    From the log I could see it was a bucketed scan in 2.3.1 but a non-bucketed scan in 3.1.1, and I didn't know why. It took me days of debugging until I found this config. If it had been documented in the migration guide, I could have figured it out much sooner. Hence, I think it might help others as well.
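    For anyone hitting the same slowdown, the workaround is a one-line config change. Shown here as a SQL `SET` statement for illustration; the same config can also be set via `--conf` on spark-submit or `spark.conf.set` in a session:
    
    ```sql
    -- Disable automatic bucketed-scan selection (Spark 3.1+),
    -- restoring the pre-3.1 behavior of always using a bucketed scan
    -- when bucketing is enabled:
    SET spark.sql.sources.bucketing.autoBucketedScan.enabled=false;
    ```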
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org