You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/11/12 01:19:22 UTC

[GitHub] [beam] lostluck commented on issue #23963: [Feature Request][Go SDK]: Enable emitting to windowed PCollection from FinishBundle()

lostluck commented on issue #23963:
URL: https://github.com/apache/beam/issues/23963#issuecomment-1312299776

   This is an unfortunate oversight.
   
   Right now the current SDK behavior is incorrect (a consequence of being heavily tested under Global Window batch, and this behavior added pre robust windowing). Basically what's going wrong is that we don't know a good event time for the subsequently emitted batch. In general, we can't* from a framework perspective, as it's user data specific. 
   
   The work around would be to use emitters of the style func(beam.EventTime,V) or func(beam.EventTime, K, V) in both ProcessElement and FinishBundle while also requesting the existing element's beam.EventTime. This, in combination with storing a reasonable event time for the batch, would be a workaround to the windowing issue. The windowing would also need to be downstream of this DoFn to ensure windows.
   
   ----
   
   So the SDK would need to prevent this bad behavior going forward (failing on strict modes for it, providing a descriptive warning and instructions for fixing it as well). The SDK analysis portion would need to allow a slight mismatch in emitters, so the StartBundle/FinishBundle emitters are required to have an explicit event time, and not have to match the ProcessElement emitter exactly. This forces setting explicit event times, allowing for the correct windowing behavior.
   
   The event time thing is the big sticking point since if the elements don't have an event time, that means any *subsequent* windowing will have issues as well. So to permit any existing broken code to continue to function, it would need to analyze if there's a down stream issue,
   
   Having a robust GroupIntoBatches transform would be able to avoid this (have the GIB transform do the grouping, and handling the event time management), since it would then handle this logic for you. This is already on my "sooner", rather than "later" list. (I can't be more precise than that).
   
   ----
   
   *Technically CombineFn's have the framework make a specific choice for it's equivalent behavior here (what is a Lifted CombineFn, but a DoFn with fancy batching after all?), but we simply use the most default strategy for beam.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org