Posted to reviews@spark.apache.org by pnpranavrao <gi...@git.apache.org> on 2018/05/30 09:28:15 UTC

[GitHub] spark issue #21460: [SPARK-23442][SQL] Increase reading tasks when reading f...

Github user pnpranavrao commented on the issue:

    https://github.com/apache/spark/pull/21460
  
    The reason `createBucketedReadRDD` works the way it currently does (accumulating all files for a given bucket, across input partitions, into the output partition whose `id` equals the `bucketId`) is to skip the shuffle when joining on the bucketed columns. Your suggested change would break this.
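    To make the grouping concrete, here is a minimal, hypothetical Python sketch of the idea (not Spark's actual Scala implementation): every file belonging to bucket *i* lands in read partition *i*, so two tables bucketed the same way end up co-partitioned and a join on the bucket columns needs no shuffle. The function name and the `(bucket_id, path)` input shape are my own simplifications; the real code parses the bucket id out of the file name.

    ```python
    from collections import defaultdict

    def bucketed_read_partitions(files, num_buckets):
        """Group data files into one read partition per bucket id,
        roughly mimicking what createBucketedReadRDD does.

        files: list of (bucket_id, path) pairs -- a stand-in for
        extracting the bucket id from each file's name.
        Returns a list of length num_buckets where element i holds
        all file paths for bucket i (possibly empty)."""
        by_bucket = defaultdict(list)
        for bucket_id, path in files:
            by_bucket[bucket_id].append(path)
        # One partition per bucket id, even if a bucket has no files,
        # so the output partitioning matches the bucket spec exactly.
        return [by_bucket.get(i, []) for i in range(num_buckets)]
    ```

    The key consequence: the number of read tasks is fixed at `num_buckets`, regardless of how many files or how much data each bucket holds, which is exactly the problem for non-join scans of large bucketed tables.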
    
    I opened this [JIRA](https://issues.apache.org/jira/browse/SPARK-23442) for the same need. We need to generate different physical plans for the join and non-join use cases, rather than assuming, as we do now, that a bucketed datasource will always be used in a join on its bucketed columns.
    
    The workaround right now is to turn off bucketing via a SparkConf flag, but I'm working on using both the partitioning and bucketing information to plan queries.
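    For readers looking for the flag: I believe the one meant here is `spark.sql.sources.bucketing.enabled`. A hedged PySpark sketch of disabling it at session creation (with it off, bucketed tables are scanned like plain tables, giving more read tasks but losing the bucket-join optimization):

    ```python
    from pyspark.sql import SparkSession

    # Disable bucketing-aware planning for this session; bucketed
    # tables will then be read like ordinary file-based tables.
    spark = (SparkSession.builder
             .appName("bucketing-disabled")
             .config("spark.sql.sources.bucketing.enabled", "false")
             .getOrCreate())
    ```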


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org