You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/08/15 08:55:54 UTC

[GitHub] [beam] scwhittle opened a new issue, #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization

scwhittle opened a new issue, #22723:
URL: https://github.com/apache/beam/issues/22723

   ### What happened?
   
   The python sdk BatchElements specializes the global window with _GlobalWindowsBatchingDoFn
   
   This builds up batches and outputs them in two possible places:
   - during element processing if the batch is deemed complete, this uses the current element's timestamp for the output
   - during finish_bundle if there was an unfinished batch, this uses GlobalWindows.windowed_value which uses the minimum timestamp
   
   It seems like this should be made to be consistent by either:
   - always outputtting with maximum timestamp. This matches the behavior of the non-global window case where the window end timestamp is used.
   - keep track of the minimum timestamp of elements in the input batch and use that for the output timestamp
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: sdk-py-core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22723:
URL: https://github.com/apache/beam/issues/22723#issuecomment-1224390181

   over to @TheNeuralBit to take a look or triage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] TheNeuralBit commented on issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on issue #22723:
URL: https://github.com/apache/beam/issues/22723#issuecomment-1224570954

   It might be worth bringing this to the dev list to see if we have, or should establish, a convention for handling timestamps in transforms that produce batches of elements.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] robertwb closed issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization

Posted by GitBox <gi...@apache.org>.
robertwb closed issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization
URL: https://github.com/apache/beam/issues/22723


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] TheNeuralBit commented on issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on issue #22723:
URL: https://github.com/apache/beam/issues/22723#issuecomment-1224493228

   Interesting. I'm not sure which of these is the oversight, perhaps @robertwb can comment as the original author. 
   
   Looking at the code: https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/python/apache_beam/transforms/util.py#L539-L552
   
   In the process case we just yield the batch - likely this was just written ignoring the timestamp, but it turns out that Beam actually attaches the current element's timestamp in the OutputProcessor.
   In the finish_bundle case we wrap in a `GlobalWindows.windowed_value` - likely this was just to satisfy some type check (static or in output processing), but it turns out we end up using `min_timestamp()`.
   
   As @scwhittle noted, the per window version always uses the current window's max timestamp: https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/python/apache_beam/transforms/util.py#L584-L585
   https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/python/apache_beam/transforms/util.py#L602-L603
   
   I took a look at what the `GroupIntoBatches` implementations do here, but AFAICT they're not opinionated either, and rely on the SDK worker to decide what timestamp to use: https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L545
   
   Presumably this is typically the most recent element's timestamp, when it completes a batch. I'm not sure about the case when a timer fires though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org