
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/03/27 02:40:49 UTC

[GitHub] [beam] lukecwik commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3

URL: https://github.com/apache/beam/pull/11037#issuecomment-604785112
 
 
   Sorry about the long delay. **Reshuffle** should produce as many partitions as the runner thinks is optimal; it is effectively a **redistribute** operation.
   
   It looks like the Spark translation of Reshuffle copies the number of partitions from the upstream transform, and in your case this is likely 1, which would mean the reshuffle adds no parallelism.
   Translation: https://github.com/apache/beam/blob/f5a4a5afcd9425c0ddb9ec9c70067a5d5c0bc769/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L681
   Copying partitions:
   https://github.com/apache/beam/blob/f5a4a5afcd9425c0ddb9ec9c70067a5d5c0bc769/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java#L191
   
   @iemejia Shouldn't we be using a much larger value for partitions, e.g. the number of nodes?
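
To make the difference concrete, here is a minimal plain-Java sketch. It is not the Spark runner's code: `choosePartitions` and `redistribute` are hypothetical helpers, and `defaultParallelism` is an assumed stand-in for whatever value the runner would consider optimal (for example, the number of nodes).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Beam or Spark. */
public class ReshufflePartitions {

  /**
   * Picks a partition count for a reshuffle. Instead of copying the upstream
   * count as-is, it never goes below the assumed default parallelism.
   */
  public static int choosePartitions(int upstreamPartitions, int defaultParallelism) {
    return Math.max(upstreamPartitions, defaultParallelism);
  }

  /** Spreads elements round-robin over numPartitions buckets. */
  public static <T> List<List<T>> redistribute(List<T> elements, int numPartitions) {
    List<List<T>> partitions = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      partitions.add(new ArrayList<>());
    }
    for (int i = 0; i < elements.size(); i++) {
      partitions.get(i % numPartitions).add(elements.get(i));
    }
    return partitions;
  }

  public static void main(String[] args) {
    List<Integer> files = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      files.add(i);
    }

    int upstream = 1;            // what the upstream transform produced
    int defaultParallelism = 4;  // assumed value, e.g. number of nodes

    // Copying the upstream count keeps everything in a single partition.
    System.out.println(redistribute(files, upstream).size());

    // Using the larger value spreads the elements over 4 partitions.
    int chosen = choosePartitions(upstream, defaultParallelism);
    System.out.println(redistribute(files, chosen).size());
  }
}
```

With an upstream count of 1, the first call leaves all ten elements in one partition, while the second spreads them over four partitions of sizes 3, 3, 2 and 2.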
   
