Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/03/27 02:40:49 UTC
[GitHub] [beam] lukecwik commented on issue #11037: [BEAM-9434] performance
improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-604785112
Sorry about the long delay. **Reshuffle** should produce as many partitions as the runner thinks is optimal; it is effectively a **redistribute** operation.
It looks like the Spark translation of Reshuffle copies the number of partitions from the upstream transform, and in your case that number is likely 1.
Translation: https://github.com/apache/beam/blob/f5a4a5afcd9425c0ddb9ec9c70067a5d5c0bc769/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L681
Copying partitions:
https://github.com/apache/beam/blob/f5a4a5afcd9425c0ddb9ec9c70067a5d5c0bc769/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java#L191
@iemejia Shouldn't we be using a much larger value for partitions, e.g. the number of nodes?
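To make the concern concrete, here is a minimal, self-contained sketch of the two policies being compared. It is not the Spark runner's code: the class name, method names, and the `parallelismHint` parameter are all hypothetical, and the hint value of 8 is just an assumed stand-in for "the number of nodes".

```java
import java.util.Arrays;

/** Hypothetical illustration only; none of these names exist in Beam or Spark. */
public class ReshufflePartitionSketch {

  /** Policy described above: reuse the upstream partition count unchanged. */
  static int copyUpstream(int upstreamPartitions) {
    return upstreamPartitions;
  }

  /** Alternative policy: never drop below an (assumed) cluster parallelism hint. */
  static int atLeastParallelism(int upstreamPartitions, int parallelismHint) {
    return Math.max(upstreamPartitions, parallelismHint);
  }

  /** Spreads elements round-robin over the partitions and returns each partition's size. */
  static int[] distribute(int numElements, int numPartitions) {
    int[] sizes = new int[numPartitions];
    for (int i = 0; i < numElements; i++) {
      sizes[i % numPartitions]++;
    }
    return sizes;
  }

  public static void main(String[] args) {
    int upstream = 1; // a single upstream partition, as suspected in this case
    int hint = 8;     // assumed value standing in for the number of nodes

    // With the copied count, every element stays in one partition.
    System.out.println(Arrays.toString(distribute(1000, copyUpstream(upstream))));
    // With the hint applied, the same elements are spread across all partitions.
    System.out.println(Arrays.toString(distribute(1000, atLeastParallelism(upstream, hint))));
  }
}
```

With one upstream partition, the first policy leaves all 1000 elements in a single partition, so downstream work cannot be parallelized; the second spreads them evenly. What the right hint would be in the actual runner is exactly the open question above.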