Posted to commits@beam.apache.org by "Tim Robertson (JIRA)" <ji...@apache.org> on 2017/10/05 18:38:00 UTC

[jira] [Created] (BEAM-3022) Enable the ability to grow partition count in the underlying Spark RDD

Tim Robertson created BEAM-3022:
-----------------------------------

             Summary: Enable the ability to grow partition count in the underlying Spark RDD
                 Key: BEAM-3022
                 URL: https://issues.apache.org/jira/browse/BEAM-3022
             Project: Beam
          Issue Type: Improvement
          Components: runner-spark
            Reporter: Tim Robertson
            Assignee: Amit Sela


When using a {{HadoopInputFormatIO}}, the number of splits seems to be controlled by the underlying {{InputFormat}}, which in turn determines the number of partitions and therefore the parallelisation when running on Spark.  It is possible to {{Reshuffle}} the data to compensate for data skew, but there _appears_ to be no way to grow the number of partitions.  {{GroupCombineFunctions.reshuffle}} seems to be the only place that calls Spark's {{repartition}}, and it reuses the number of partitions from the original RDD.
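To make this concrete, here is a minimal plain-Spark sketch (a hypothetical simplification, not the actual {{GroupCombineFunctions}} source) of why reusing the input's partition count can never grow parallelism, and what a user-chosen target would allow:

{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class RepartitionSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "repartition-sketch");

    // Simulate a source that yields a single split (e.g. a zip file read by
    // a custom InputFormat): everything lands in one partition.
    JavaRDD<String> input = sc.parallelize(Arrays.asList("a", "b", "c"), 1);

    // Roughly the current behaviour: the target partition count is taken
    // from the input itself, so one partition stays one partition.
    JavaRDD<String> same = input.repartition(input.getNumPartitions());

    // What this issue asks for: a way to pass a larger, user-chosen target.
    JavaRDD<String> widened = input.repartition(8);

    System.out.println(same.getNumPartitions());    // prints 1
    System.out.println(widened.getNumPartitions()); // prints 8
    sc.stop();
  }
}
{code}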

Scenarios that would benefit from this:
# Increasing parallelisation for computationally heavy stages
# ETLs where the input partitioning is dictated by the source, while you want to optimise the partitioning for fast loading into the target sink
# Zip files (my case), which are read in a single-threaded manner by a custom {{HadoopInputFormat}} and therefore end up on a single executor

(It would also be nice if a user could supply a partitioner, to help dictate data locality; see the sketch below.)
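For reference, a minimal sketch of what supplying a partitioner looks like in plain Spark (assuming a keyed dataset, since {{partitionBy}} is only available on pair RDDs; the key type, value type, and partition count here are illustrative):

{code:java}
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class PartitionerSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "partitioner-sketch");

    // A tiny keyed dataset; in practice this would come from the source.
    JavaPairRDD<String, Integer> keyed = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2)), 1);

    // Grow from 1 partition to 8 while controlling element placement.
    JavaPairRDD<String, Integer> placed = keyed.partitionBy(new HashPartitioner(8));

    System.out.println(placed.getNumPartitions()); // prints 8
    sc.stop();
  }
}
{code}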


