Posted to commits@beam.apache.org by "Eugene Kirpichov (JIRA)" <ji...@apache.org> on 2017/12/14 23:42:00 UTC

[jira] [Created] (BEAM-3353) Prohibit stacked GBKs with accumulating mode

Eugene Kirpichov created BEAM-3353:
--------------------------------------

             Summary: Prohibit stacked GBKs with accumulating mode
                 Key: BEAM-3353
                 URL: https://issues.apache.org/jira/browse/BEAM-3353
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-core, sdk-py-core
            Reporter: Eugene Kirpichov
            Assignee: Eugene Kirpichov


The test in https://github.com/apache/beam/pull/4239 demonstrates that stacked GBKs with accumulating mode are unsafe, in the same way that stacked GBKs with merging windows are unsafe.

In particular, consider the pipeline: input -> (gbk onto N keys) -> ungroup -> (gbk onto 1 key) -> ungroup. Suppose the first GBK receives "a" and then "b"; it will emit "a" and then "a","b". The second GBK then receives "a" followed by "a","b", so it will emit "a" and then "a","a","b", which is meaningless. With Combine instead of GBK, this leads to double-counting.
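
The scenario above can be reproduced with a small plain-Python simulation. This is only a sketch and does not use the Beam API: accumulating_gbk is a hypothetical helper that models a single key, assumes the trigger fires once after each input batch, and assumes ungrouping preserves pane boundaries.

```python
def accumulating_gbk(batches):
    """Toy single-key GBK: fires once per input batch, accumulating mode.

    Every emitted pane contains all values seen so far, not only the
    values that arrived since the previous firing.
    """
    seen = []
    panes = []
    for batch in batches:
        seen.extend(batch)
        panes.append(list(seen))
    return panes


# First GBK: "a" arrives and the trigger fires, then "b" arrives and it fires again.
first = accumulating_gbk([["a"], ["b"]])

# Assumption: after ungrouping, each pane of the first GBK becomes one
# input batch of the second GBK.
second = accumulating_gbk(first)

print(first)   # [['a'], ['a', 'b']]
print(second)  # [['a'], ['a', 'a', 'b']]

# Same shape with a sum combiner: the inputs are 1 and then 2, so the true total is 3.
first_sums = [sum(pane) for pane in accumulating_gbk([[1], [2]])]
second_sums = [sum(pane) for pane in accumulating_gbk([[s] for s in first_sums])]

print(first_sums)   # [1, 3]
print(second_sums)  # [1, 4]
```

The last pane of the second stage contains "a" twice, and the stacked sum reports 4 instead of 3, because the first stage's earlier pane is counted again.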

There are cases where accumulation propagated through stacked aggregation can be desirable, but having it propagate by default is definitely the wrong thing to do. Silently changing it to discarding is likely also the wrong thing to do. So, we should reset the windowing strategy after a GBK and force the user to specify accumulating mode explicitly if they want it downstream.
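
One possible shape of such a check, as a rough sketch only: the names Windowing, after_gbk and check_gbk_input are hypothetical and do not correspond to existing SDK classes or methods, and the choice to reset only accumulating mode (leaving discarding untouched) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Windowing:
    # Hypothetical stand-in for a windowing strategy.
    # None means "accumulation mode not specified".
    accumulation_mode: Optional[str]


def after_gbk(windowing: Windowing) -> Windowing:
    """Windowing of a GBK's output: accumulating mode is not propagated."""
    if windowing.accumulation_mode == "ACCUMULATING":
        return Windowing(accumulation_mode=None)
    return windowing


def check_gbk_input(windowing: Windowing) -> None:
    """Reject a GBK whose input has no explicitly specified accumulation mode."""
    if windowing.accumulation_mode is None:
        raise ValueError(
            "Accumulation mode was reset by an upstream GBK; "
            "specify it explicitly before grouping again."
        )


upstream = Windowing("ACCUMULATING")
check_gbk_input(upstream)           # first GBK: accepted
downstream = after_gbk(upstream)    # mode is reset on the output

try:
    check_gbk_input(downstream)     # second GBK: rejected
except ValueError as err:
    print(err)

# The user opts back in explicitly, and the second GBK is accepted.
check_gbk_input(Windowing("ACCUMULATING"))
```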

All pipelines that currently rely on this behavior are computing meaningless results, so rejecting them should not be considered a breaking change. However, we should still find out how many such pipelines exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)