Posted to commits@beam.apache.org by "Tim Robertson (JIRA)" <ji...@apache.org> on 2017/10/05 18:38:00 UTC
[jira] [Created] (BEAM-3022) Enable the ability to grow partition count in the underlying Spark RDD
Tim Robertson created BEAM-3022:
-----------------------------------
Summary: Enable the ability to grow partition count in the underlying Spark RDD
Key: BEAM-3022
URL: https://issues.apache.org/jira/browse/BEAM-3022
Project: Beam
Issue Type: Improvement
Components: runner-spark
Reporter: Tim Robertson
Assignee: Amit Sela
When using a {{HadoopInputFormatIO}}, the number of splits is controlled by the underlying {{InputFormat}}, which in turn determines the number of partitions and therefore the parallelisation when running on Spark. It is possible to {{Reshuffle}} the data to compensate for data skew, but it _appears_ there is no way to grow the number of partitions. {{GroupCombineFunctions.reshuffle}} seems to be the only place that calls the Spark {{repartition}}, and it uses the number of partitions from the original RDD.
Scenarios that would benefit from this:
# Increasing parallelisation for computationally heavy stages
# ETLs where the input partitions are dictated by the source while you wish to optimise the partitions for fast loading to the target sink
# Zip files (my case), which are read in a single-threaded manner with a custom HadoopInputFormat and therefore end up on a single executor
(It would also be nice if a user could supply a partitioner, to help dictate data locality.)
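For illustration, here is a minimal self-contained sketch of the behaviour described above. All class and method names in it are hypothetical (they are not real Beam or Spark APIs): it only models how a reshuffle that reuses the source partition count can never increase parallelism, while the requested improvement, analogous to Spark's {{RDD.repartition(numPartitions)}}, would let the caller grow it.

```java
// Hypothetical model of the issue (not real Beam/Spark APIs): a reshuffle
// that takes its partition count from the source RDD cannot grow it.
public class PartitionGrowthSketch {

    // Models GroupCombineFunctions.reshuffle today: the partition count
    // is taken from the original RDD, so the same count comes back out.
    static int reshuffle(int sourcePartitions) {
        return sourcePartitions;
    }

    // Models the requested improvement: an explicit target partition count,
    // analogous to Spark's RDD.repartition(numPartitions).
    static int repartition(int sourcePartitions, int targetPartitions) {
        return targetPartitions;
    }

    public static void main(String[] args) {
        // A zip file read by a custom HadoopInputFormat yields one split,
        // so the RDD starts with a single partition (and a single executor).
        int zipSplits = 1;

        System.out.println(reshuffle(zipSplits));        // still 1
        System.out.println(repartition(zipSplits, 64));  // 64: work can now spread
    }
}
```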
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)