Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/17 08:54:26 UTC

[GitHub] [spark] weixiuli opened a new pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

weixiuli opened a new pull request #34933:
URL: https://github.com/apache/spark/pull/34933


   
   ### What changes were proposed in this pull request?
   Reduce the number of output partitions in the output stage to avoid producing small files.
   
   ### Why are the changes needed?
   
   The partition size of the final stage with `DataWritingCommand` or `V2TableWriteExec` is taken from `ADVISORY_PARTITION_SIZE_IN_BYTES`, which may be a small value and so may produce small files. That is bad for production.
   
   Sometimes we may raise `ADVISORY_PARTITION_SIZE_IN_BYTES` to avoid this, but that is NOT a good idea: it also changes how other jobs or stages coalesce small shuffle partitions or split skewed shuffle partitions.
   
   So we should introduce a new partition size, used instead of `ADVISORY_PARTITION_SIZE_IN_BYTES`, for the final stage with `DataWritingCommand` or `V2TableWriteExec` to avoid small files.
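
To illustrate why the target size matters, here is a minimal sketch of size-based coalescing. It is a simplified model written for this discussion, not Spark's actual AQE code: the function name `coalesce` and the partition sizes are hypothetical, and it only packs adjacent partitions greedily up to a target size.

```python
from typing import List

MB = 1024 * 1024


def coalesce(partition_sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily pack adjacent partitions; start a new group when the
    next partition would push the current group past target_size."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in partition_sizes:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical input: 200 shuffle partitions of 8 MB each.
sizes = [8 * MB] * 200

# Each coalesced group becomes one task in the final stage, and in this
# simplified model each task writes one output file.
small_target = coalesce(sizes, 64 * MB)
large_target = coalesce(sizes, 256 * MB)
print(len(small_target), len(large_target))
```

In this model a larger target size means fewer final-stage tasks and therefore fewer, larger files, which is why a separate target for the writing stage is attractive.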
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   Added unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] ulysses-you commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

Posted by GitBox <gi...@apache.org>.

ulysses-you commented on pull request #34933:
URL: https://github.com/apache/spark/pull/34933#issuecomment-1007224052


   I see the requirement, but there are some potential issues if we only use a new config for the final stage of a write.
   
   - If the final stage is heavy, making the partition size big will cause a regression, e.g. when the final stage is a join or even a multi-join.
   - The input shuffle size is not equal to the output size. If the plan of the final stage changes the data size, this config is less meaningful.
   - Not all queries contain a shuffle, so the semantics of this config are broken whenever the config is not used.
   - For dynamic partition writing, it's not enough to just increase the partition size. As far as possible, we should cluster rows with the same partition value into a few partitions.
   - This config should also affect the rebalance.
   
   I think it's a good idea to add a `RebalancePartitions` node for all writing commands, as @wangyum is working on in SPARK-31264. Then we can consider adding a special partition size config for the added shuffle that comes from `RebalancePartitions`.
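
The dynamic partition point can be sketched with a small model. This is an illustration under a stated assumption, not Spark code: each write task is assumed to emit one file per distinct partition value it sees, and the names `count_files`, `unclustered`, and `clustered` are hypothetical.

```python
from typing import List, Sequence


def count_files(tasks: Sequence[Sequence[str]]) -> int:
    """Model assumption: a write task emits one file for every
    distinct partition value among its rows."""
    return sum(len(set(task)) for task in tasks)


# Hypothetical data: 10,000 rows spread evenly over 10 partition values.
dates = [f"2021-12-{day:02d}" for day in range(1, 11)]
rows = [dates[i % len(dates)] for i in range(10_000)]

# Case 1: four large tasks, rows assigned without regard to partition value.
# Every task sees all 10 values.
unclustered: List[List[str]] = [rows[k * 2500:(k + 1) * 2500] for k in range(4)]

# Case 2: four tasks, rows clustered so that each partition value
# goes to exactly one task.
clustered: List[List[str]] = [
    [row for row in rows if dates.index(row) % 4 == k] for k in range(4)
]

print(count_files(unclustered), count_files(clustered))
```

Both cases use the same number of tasks and the same data, yet the unclustered layout writes four times as many files in this model. Increasing the partition size alone does not fix that.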



[GitHub] [spark] weixiuli commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

Posted by GitBox <gi...@apache.org>.

weixiuli commented on pull request #34933:
URL: https://github.com/apache/spark/pull/34933#issuecomment-1056068584


   
   > I think it's a good idea to add a `RebalancePartitions` node for all writing command as @wangyum working on [SPARK-31264](https://issues.apache.org/jira/browse/SPARK-31264). And then we can consider adding a special partition size config for the added shuffle which is from `RebalancePartitions`.
   
   I have discussed this with @wangyum offline.
   
   [SPARK-31264](https://issues.apache.org/jira/browse/SPARK-31264) is different from this PR. This PR only uses AQE's coalesce-partitions feature to avoid producing small files, which does not need to introduce new shuffles. However, [SPARK-31264](https://issues.apache.org/jira/browse/SPARK-31264) may introduce new shuffles.
   
   @cloud-fan @ulysses-you @wangyum Can you help me review this PR again? Thanks.
   



[GitHub] [spark] cloud-fan commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #34933:
URL: https://github.com/apache/spark/pull/34933#issuecomment-1056084394


   OK let's make a call here. I think it's not a good idea to add a new config for the final stage to tune the advisory size. The final stage may not be a simple table writing, it can contain other operators to do heavy computing. We should not reduce its parallelism which can slow down the computing or even cause OOM. That's why I think adding an extra shuffle is actually more robust: we do not reduce the parallelism of computing, but only reduce the number of output files.
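
A rough back-of-the-envelope sketch of this trade-off follows. All numbers are hypothetical, the helper `num_tasks` is made up for illustration, and the model assumes the final stage's output is the same size as its shuffle input, which real plans need not satisfy.

```python
import math

MB = 1024 * 1024


def num_tasks(total_bytes: int, target_bytes: int) -> int:
    """Number of partitions when data is split into target-sized chunks."""
    return max(1, math.ceil(total_bytes / target_bytes))


shuffle_input = 64 * 1024 * MB  # hypothetical 64 GB feeding the final stage
join_output = shuffle_input     # assumption: the stage does not change data size
advisory_size = 64 * MB         # hypothetical advisory partition size
write_target = 512 * MB         # hypothetical desired output file size

# Option A: raise the target size for the whole final stage.
# The heavy operators and the write run in the same, fewer tasks.
raise_advisory = {
    "compute_tasks": num_tasks(shuffle_input, write_target),
    "output_files": num_tasks(shuffle_input, write_target),
}

# Option B: keep the advisory size for the heavy operators and add an
# extra shuffle that repartitions their output just before the write.
extra_shuffle = {
    "compute_tasks": num_tasks(shuffle_input, advisory_size),
    "output_files": num_tasks(join_output, write_target),
}

print(raise_advisory, extra_shuffle)
```

Under these assumptions both options write the same number of files, but only the extra shuffle keeps the original parallelism for the heavy computation, at the cost of moving the data once more.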
   
   Let's spend more time on [SPARK-31264](https://issues.apache.org/jira/browse/SPARK-31264) and get it merged soon.



[GitHub] [spark] AmplabJenkins commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34933:
URL: https://github.com/apache/spark/pull/34933#issuecomment-996566048


   Can one of the admins verify this patch?



[GitHub] [spark] weixiuli commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

Posted by GitBox <gi...@apache.org>.

weixiuli commented on pull request #34933:
URL: https://github.com/apache/spark/pull/34933#issuecomment-1008491936


   @ulysses-you Thank you for your review, I got your point. This PR just gives users one more option to avoid producing small files; the original logic is kept by default.
   
   




[GitHub] [spark] cloud-fan commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #34933:
URL: https://github.com/apache/spark/pull/34933#issuecomment-1006343708


   cc @yaooqinn @ulysses-you 

