Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/22 12:38:32 UTC

[GitHub] [spark] jackylee-ch commented on pull request #37601: [SPARK-40150][SQL] Merging file partition dynamically

jackylee-ch commented on PR #37601:
URL: https://github.com/apache/spark/pull/37601#issuecomment-1222301529

   > Hi @jackylee-ch AFAIK, we first split the files from `listFiles` into `partitionedFile`s by `maxSplitBytes`, and then merge these `partitionedFile`s into a `Partition`. I'm not sure how this relates to small files; in theory, even with a large number of small files, merging at `maxSplitBytes` should not affect concurrency? (probably)
   
   A small example: we have 7000 files in a table with a total size of 1 TB, and we start an application with 4500 cores, so the proper config would be maxPartitionBytes=240MB and openCostInBytes=4MB. It is hard for users to calculate the proper maxPartitionBytes and openCostInBytes.
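   For context, Spark's existing split sizing (roughly what `FilePartition.maxSplitBytes` does in the Spark source) can be sketched as below. This is an illustrative sketch, not the PR's code: the function name is made up, and the constants are taken from the worked example above.

   ```python
   # Sketch of Spark's split-size calculation (mirrors the logic of
   # FilePartition.maxSplitBytes; names and constants are illustrative).

   def max_split_bytes(total_bytes, num_files, default_max_partition_bytes,
                       open_cost_in_bytes, min_partition_num):
       # Each file is padded with an "open cost" so that many tiny files
       # still spread across enough partitions.
       padded_total = total_bytes + num_files * open_cost_in_bytes
       bytes_per_core = padded_total // min_partition_num
       return min(default_max_partition_bytes,
                  max(open_cost_in_bytes, bytes_per_core))

   MB = 1024 * 1024
   TB = 1024 * MB * MB // 1024  # 2**40 bytes

   # The worked example from the comment: 7000 files, 1 TB total, 4500 cores,
   # maxPartitionBytes=240MB, openCostInBytes=4MB.
   split = max_split_bytes(total_bytes=1 * TB, num_files=7000,
                           default_max_partition_bytes=240 * MB,
                           open_cost_in_bytes=4 * MB,
                           min_partition_num=4500)
   print(split // MB)  # split size in MB, just under the 240 MB cap here
   ```

   The point of the example: only with this hand-tuned 240 MB cap do the splits land near bytes-per-core for 4500 cores; with the 128 MB default, the table would be cut into far more partitions than cores, which is the tuning burden the PR aims to remove.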
   
   With this PR, users can easily get the best performance without calculating these values. And for long-lived applications, especially those using FAIR scheduling mode, it also becomes easy to control concurrency across different kinds of queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

