You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/08/15 08:55:54 UTC
[GitHub] [beam] scwhittle opened a new issue, #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization
scwhittle opened a new issue, #22723:
URL: https://github.com/apache/beam/issues/22723
### What happened?
The python sdk BatchElements specializes the global window with _GlobalWindowsBatchingDoFn
This builds up batches and outputs them in two possible places:
- during element processing if the batch is deemed complete, this uses the current element's timestamp for the output
- during finish_bundle if there was an unfinished batch, this uses GlobalWindows.windowed_value which uses the minimum timestamp
It seems like this should be made to be consistent by either:
- always outputtting with maximum timestamp. This matches the behavior of the non-global window case where the window end timestamp is used.
- keep track of the minimum timestamp of elements in the input batch and use that for the output timestamp
### Issue Priority
Priority: 2
### Issue Component
Component: sdk-py-core
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [beam] tvalentyn commented on issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22723:
URL: https://github.com/apache/beam/issues/22723#issuecomment-1224390181
over to @TheNeuralBit to take a look or triage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [beam] TheNeuralBit commented on issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization
Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on issue #22723:
URL: https://github.com/apache/beam/issues/22723#issuecomment-1224570954
It might be worth bringing this to the dev list to see if we have, or should establish, a convention for handling timestamps in transforms that produce batches of elements.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [beam] robertwb closed issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization
Posted by GitBox <gi...@apache.org>.
robertwb closed issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization
URL: https://github.com/apache/beam/issues/22723
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [beam] TheNeuralBit commented on issue #22723: [Bug]: Python BatchElements uses minimum or arbitrary element timestamps for output timestamps with GlobalWindow specialization
Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on issue #22723:
URL: https://github.com/apache/beam/issues/22723#issuecomment-1224493228
Interesting. I'm not sure which of these is the oversight, perhaps @robertwb can comment as the original author.
Looking at the code: https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/python/apache_beam/transforms/util.py#L539-L552
In the process case we just yield the batch - likely this was just written ignoring the timestamp, but it turns out that Beam actually attaches the current element's timestamp in the OutputProcessor.
In the finish_bundle case we wrap in a `GlobalWindows.windowed_value` - likely this was just to satisfy some type check (static or in output processing), but it turns out we end up using `min_timestamp()`.
As @scwhittle noted, the per window version always uses the current window's max timestamp: https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/python/apache_beam/transforms/util.py#L584-L585
https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/python/apache_beam/transforms/util.py#L602-L603
I took a look at what the `GroupIntoBatches` implementations do here, but AFAICT they're not opinionated either, and rely on the SDK worker to decide what timestamp to use: https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L545
Presumably this is typically the most recent element's timestamp, when it completes a batch. I'm not sure about the case when a timer fires though.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org