You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Vaclav Plajt (JIRA)" <ji...@apache.org> on 2019/01/02 12:44:00 UTC
[jira] [Created] (BEAM-6332) Avoid unnecessary serialization steps
when executing `combine` transformation
Vaclav Plajt created BEAM-6332:
----------------------------------
Summary: Avoid unnecessary serialization steps when executing `combine` transformation
Key: BEAM-6332
URL: https://issues.apache.org/jira/browse/BEAM-6332
Project: Beam
Issue Type: Improvement
Components: runner-spark
Reporter: Vaclav Plajt
Assignee: Vaclav Plajt
Combine transformation is translated into Spark's RDD API in [GroupCombineFunctions](https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java) `combinePerKey` and `combineGlobally` methods. Both methods use byte arrays as intermediate state of aggregation so they can be transferred over network. That leads to serialization and de-serialization of intermediate aggregation value every time new element is added to aggregation. That is unnecessary and should be avoided.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)