Posted to github@beam.apache.org by "Abacn (via GitHub)" <gi...@apache.org> on 2023/05/30 22:40:01 UTC
[GitHub] [beam] Abacn opened a new issue, #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load
Abacn opened a new issue, #26942:
URL: https://github.com/apache/beam/issues/26942
### What happened?
Found when writing rows containing `apache_beam.utils.timestamp.Timestamp` values (obtained from an xlang PTransform, e.g. ReadFromJdbc).
The following pipeline writes the correct timestamps to BigQuery:
```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.utils.timestamp import Timestamp

with beam.Pipeline(options=options) as p:
  rows = p | beam.Create([1, 2, 3]) | beam.Map(
      lambda x: {"f_time": Timestamp.now().to_utc_datetime()})
  _ = rows | WriteToBigQuery(
      table='google.com:clouddfe:yathu_test.testavroload',
      schema={
          'fields': [{
              'name': 'f_time', 'type': 'TIMESTAMP', 'mode': 'REQUIRED'
          }]})
```
I get rows
Row | f_time
-- | --
1 | 2023-05-30 22:07:37.947323 UTC
2 | 2023-05-30 22:07:39.553978 UTC
3 | 2023-05-30 22:07:39.554365 UTC
However, if `temp_file_format=bigquery_tools.FileFormat.AVRO` is set, the timestamps written to BigQuery have a +4h offset:
Row | f_time
-- | --
4 | 2023-05-31 02:26:05.533746 UTC
5 | 2023-05-31 02:26:07.054265 UTC
6 | 2023-05-31 02:26:07.054786 UTC
To get the correct timestamp, I need to use `Timestamp.now().to_utc_datetime().replace(tzinfo=datetime.timezone.utc)`.
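The workaround can be factored into a small helper that runs before the write step. This is a minimal sketch using only the standard library; `as_aware_utc` is a hypothetical name, not a Beam API, and it assumes that any naive value it receives is already a UTC wall-clock time (as is the case for the result of `to_utc_datetime()`).

```python
import datetime


def as_aware_utc(dt):
    """Return `dt` as a time-zone-aware datetime in UTC."""
    if dt.tzinfo is None:
        # Assumption: a naive value is already expressed in UTC,
        # so only the zone label is attached; the fields are unchanged.
        return dt.replace(tzinfo=datetime.timezone.utc)
    # An aware value is converted so that the instant is preserved.
    return dt.astimezone(datetime.timezone.utc)
```

In the pipeline above, it could be applied inside the `beam.Map` step, for example `{"f_time": as_aware_utc(Timestamp.now().to_utc_datetime())}`.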
The offset arises because fastavro treats a `datetime.datetime` object without a time zone as being in the local time zone, whereas the JSON write path has logic to handle the time zone correctly:
https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L149
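The effect of interpreting a naive datetime as local time can be reproduced with the standard library alone. The sketch below does not call fastavro; it relies on `datetime.timestamp()`, which is documented to treat naive values as local time, to show how the same wall-clock fields map to two different instants. The sample value is taken from the first table above.

```python
import datetime

# A naive datetime that is meant to represent UTC wall-clock time.
naive_utc = datetime.datetime(2023, 5, 30, 22, 7, 37, 947323)
# The same wall-clock fields with the zone attached explicitly.
aware_utc = naive_utc.replace(tzinfo=datetime.timezone.utc)

# timestamp() interprets the naive value as local time,
# but uses the attached zone for the aware value.
naive_epoch = naive_utc.timestamp()
aware_epoch = aware_utc.timestamp()

# Local UTC offset in seconds at that wall-clock time (0 on a UTC machine).
local_offset = naive_utc.astimezone().utcoffset().total_seconds()

# The two instants differ by the local UTC offset, up to float rounding.
skew_seconds = aware_epoch - naive_epoch
print(skew_seconds, local_offset)
```

On a machine whose local zone is UTC-4, the naive value lands four hours later than intended, which matches the +4h shift seen in the second table.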
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [ ] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [beam] Abacn commented on issue #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load
Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #26942:
URL: https://github.com/apache/beam/issues/26942#issuecomment-1570823454
> Is all we need here to configure fastavro differently
There are several ways to resolve it, and the underlying issue is more subtle (see below).
Another way is to attach time zone info to the returned datetime here:
https://github.com/apache/beam/blob/bdd29bf45f818ce0c20c111c9787d5051c1b792b/sdks/python/apache_beam/utils/timestamp.py#L163
I am not sure why tzinfo was intentionally removed in the first place. A `datetime.datetime` object without a time zone implicitly means local time (like the object returned by `datetime.datetime.now()`), so the object currently returned is not correct in my view.
> would this apply to any avro usage?
It only applies when writing a datetime object returned from `apache_beam.utils.timestamp.Timestamp.to_utc_datetime`.
[GitHub] [beam] johnjcasey commented on issue #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load
Posted by "johnjcasey (via GitHub)" <gi...@apache.org>.
johnjcasey commented on issue #26942:
URL: https://github.com/apache/beam/issues/26942#issuecomment-1570794905
@Abacn is all we need here to configure fastavro differently? Also, is this specific to BQ, or would this apply to any avro usage?
[GitHub] [beam] Abacn closed issue #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load
Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn closed issue #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load
URL: https://github.com/apache/beam/issues/26942