You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@beam.apache.org by "Vaclav Plajt (JIRA)" <ji...@apache.org> on 2019/01/02 12:44:00 UTC

[jira] [Created] (BEAM-6332) Avoid unnecessary serialization steps when executing `combine` transformation

Vaclav Plajt created BEAM-6332:
----------------------------------

             Summary: Avoid unnecessary serialization steps when executing `combine` transformation
                 Key: BEAM-6332
                 URL: https://issues.apache.org/jira/browse/BEAM-6332
             Project: Beam
          Issue Type: Improvement
          Components: runner-spark
            Reporter: Vaclav Plajt
            Assignee: Vaclav Plajt


Combine transformation is translated into Spark's RDD API in [GroupCombineFunctions](https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java) `combinePerKey` and `combineGlobally` methods. Both methods use byte arrays as intermediate state of aggregation so they can be transferred over network. That leads to serialization and de-serialization of intermediate aggregation value every time new element is added to aggregation. That is unnecessary and should be avoided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)