Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 19:17:47 UTC

[GitHub] [beam] kennknowles opened a new issue, #18550: Enable the ability to grow partition count in the underlying Spark RDD

kennknowles opened a new issue, #18550:
URL: https://github.com/apache/beam/issues/18550

   When using a `HadoopInputFormatIO`, the number of splits appears to be controlled by the underlying `InputFormat`, which in turn determines the number of partitions, and therefore the degree of parallelisation, when running on Spark.  It is possible to `Reshuffle` the data to compensate for data skew, but there _appears_ to be no way to grow the number of partitions: `GroupCombineFunctions.reshuffle` seems to be the only place that calls Spark's `repartition`, and it uses the number of partitions from the original RDD.
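   
   A minimal sketch of the behaviour described above, assuming a `JavaRDD` in hand. The `repartition` and `getNumPartitions` calls are Spark's public API, but the helper methods are illustrative only and not Beam's actual translator code:
   
   ```java
   import org.apache.spark.api.java.JavaRDD;
   
   class ReshuffleSketch {
     // Mirrors the behaviour described above: repartitioning with the RDD's
     // existing partition count redistributes records (fixing skew) but can
     // never raise the degree of parallelism.
     static <T> JavaRDD<T> reshuffleAsToday(JavaRDD<T> rdd) {
       return rdd.repartition(rdd.getNumPartitions());
     }
   
     // What this issue asks for: let the caller choose a larger target count.
     static <T> JavaRDD<T> reshuffleWithGrowth(JavaRDD<T> rdd, int targetPartitions) {
       return rdd.repartition(Math.max(rdd.getNumPartitions(), targetPartitions));
     }
   }
   ```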
   
   Scenarios that would benefit from this:
   - Increasing parallelisation for computationally heavy stages
   - ETLs where the input partitions are dictated by the source, while you wish to optimise the partitioning for fast loading into the target sink
   - Zip files (my case), which are read in a single-threaded manner with a custom `HadoopInputFormat` and therefore get a single task for all stages (see the sketch after this list)
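   
   To make the zip scenario concrete, below is a hypothetical sketch of why such a format collapses to one partition: a zip archive cannot be read from an arbitrary offset, so the `InputFormat` must report the file as unsplittable, and the single resulting split becomes a single Spark partition and task. The class name and key/value types are illustrative:
   
   ```java
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.BytesWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.InputSplit;
   import org.apache.hadoop.mapreduce.JobContext;
   import org.apache.hadoop.mapreduce.RecordReader;
   import org.apache.hadoop.mapreduce.TaskAttemptContext;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   
   // Hypothetical zip-backed format: one file -> one split -> one partition.
   public class ZipInputFormat extends FileInputFormat<Text, BytesWritable> {
     @Override
     protected boolean isSplitable(JobContext context, Path file) {
       return false; // zip entries cannot be decoded from a mid-file offset
     }
   
     @Override
     public RecordReader<Text, BytesWritable> createRecordReader(
         InputSplit split, TaskAttemptContext context) {
       // Reader omitted in this sketch; it would iterate the archive's entries.
       throw new UnsupportedOperationException("sketch only");
     }
   }
   ```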
   
   (It would be nice if a user could supply a partitioner too, to help dictate data locality)
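   
   A purely hypothetical sketch of what the requested API could look like from the user's side. `HadoopInputFormatIO.read().withConfiguration(...)` and `Reshuffle.of()` exist in Beam, but `withNumPartitions(...)`, `withPartitioner(...)`, and `MyLocalityAwarePartitioner` are imagined here to show the shape of the feature:
   
   ```java
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
   import org.apache.beam.sdk.transforms.Reshuffle;
   import org.apache.beam.sdk.values.KV;
   import org.apache.beam.sdk.values.PCollection;
   import org.apache.hadoop.conf.Configuration;
   
   public class GrowPartitionsSketch {
     public static void main(String[] args) {
       Pipeline p = Pipeline.create();
       // Assumed to be configured for the custom InputFormat and key/value classes.
       Configuration hadoopConf = new Configuration();
   
       PCollection<KV<String, byte[]>> records =
           p.apply(HadoopInputFormatIO.<String, byte[]>read().withConfiguration(hadoopConf));
   
       records.apply(
           Reshuffle.<String, byte[]>of()
               .withNumPartitions(512)                            // imagined: grow parallelism
               .withPartitioner(new MyLocalityAwarePartitioner())); // imagined: dictate locality
   
       p.run().waitUntilFinish();
     }
   }
   ```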
   
   Imported from Jira [BEAM-3022](https://issues.apache.org/jira/browse/BEAM-3022). Original Jira may contain additional context.
   Reported by: timrobertson100.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org