You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "Aljoscha Krettek (JIRA)" <ji...@apache.org> on 2016/05/02 11:14:12 UTC

[jira] [Commented] (BEAM-241) Not easy for runners to get late-data dropping

    [ https://issues.apache.org/jira/browse/BEAM-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266262#comment-15266262 ] 

Aljoscha Krettek commented on BEAM-241:
---------------------------------------

I think the short-term (and maybe also long-term) solution is to do everything via {{DoFnRunners}}. Then we automatically get the correct behavior. What do you think? Should I open an issue for the Flink runner?

I also want to unify how we treat {{DoFn}}s and windowing operations, or do you think that we shouldn't use a {{GroupAlsoByWindowViaWindowSetDoFn}} once the new runner API is in?

> Not easy for runners to get late-data dropping
> ----------------------------------------------
>
>                 Key: BEAM-241
>                 URL: https://issues.apache.org/jira/browse/BEAM-241
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-core
>            Reporter: Mark Shields
>            Assignee: Frances Perry
>
> Quite by accident realized the Flink runner delegates to GroupAlsoByWindowViaWindowSetDoFn for GBK, which in turn delegates to ReduceFnRunner. The latter now assumes no messages will arrive beyond the 'garbage collection' time of their target window(s).
> The Dataflow runner interposes a LateDataDroppingDoFnRunner into the path so as to drop those too-late messages. That's done (I think) using DoFnRunners.createDefault.
> I don't think the Flink runner does that.
> We need a nice runner-friendly way of dealing with the too-late data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)