Posted to github@beam.apache.org by "Abacn (via GitHub)" <gi...@apache.org> on 2023/05/30 22:40:01 UTC

[GitHub] [beam] Abacn opened a new issue, #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load

Abacn opened a new issue, #26942:
URL: https://github.com/apache/beam/issues/26942

   ### What happened?
   
   Found when writing rows containing `apache_beam.utils.timestamp.Timestamp` values (obtained from a cross-language (xlang) PTransform, e.g. ReadFromJdbc).
   
   The following pipeline writes correct timestamp to BigQuery:
   
   ```python
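   import apache_beam as beam
   from apache_beam.io.gcp.bigquery import WriteToBigQuery
   from apache_beam.utils.timestamp import Timestamp

   # `options` is assumed to be defined elsewhere.
   with beam.Pipeline(options=options) as p:
       # to_utc_datetime() returns a datetime without tzinfo.
       rows = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: {"f_time": Timestamp.now().to_utc_datetime()})
       _ = rows | WriteToBigQuery(
           table='google.com:clouddfe:yathu_test.testavroload',
           schema={
               'fields': [{
                   'name': 'f_time', 'type': 'TIMESTAMP', 'mode': 'REQUIRED'
               }]})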
   ```
   
   I get rows
   
   Row | f_time
   -- | -- 
   1 | 2023-05-30 22:07:37.947323 UTC 
   2 | 2023-05-30 22:07:39.553978 UTC 
   3 | 2023-05-30 22:07:39.554365 UTC
   
   
   However, when `temp_file_format=bigquery_tools.FileFormat.AVRO` is set, the timestamps written to BigQuery are offset by +4h:
   
   Row | f_time
   -- | -- 
   4 | 2023-05-31 02:26:05.533746 UTC 
   5 | 2023-05-31 02:26:07.054265 UTC
   6 | 2023-05-31 02:26:07.054786 UTC
   
   To get the correct timestamp, I need to attach the time zone explicitly: `Timestamp.now().to_utc_datetime().replace(tzinfo=datetime.timezone.utc)`
   
   This happens because fastavro treats a `datetime.datetime` object without a time zone as being in the local time zone, whereas the JSON write path has logic to handle the time zone correctly:
   
   https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L149
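
The mismatch described above can be reproduced with the standard library alone. This is a minimal sketch that calls neither fastavro nor Beam; it only assumes, as reported in this issue, that the Avro path converts a naive `datetime` to epoch time using the local zone, which is also what `datetime.timestamp()` does. `ensure_utc` is a hypothetical helper name, not a Beam or fastavro API.

```python
import calendar
from datetime import datetime, timezone


def ensure_utc(dt):
    """Attach UTC to a naive datetime; convert an aware one to UTC.

    Hypothetical helper for illustration. It assumes a naive value
    already holds UTC wall-clock time.
    """
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


naive = datetime(2023, 5, 30, 22, 26, 5)  # UTC wall-clock time, no tzinfo
aware = ensure_utc(naive)

# Reference value: seconds since the epoch, reading the fields as UTC.
expected = calendar.timegm(naive.timetuple())

# A naive datetime is interpreted in the machine's local zone, so this
# skew is nonzero wherever local time differs from UTC on that date.
local_skew_seconds = naive.timestamp() - expected

# An aware datetime converts the same way on every machine.
utc_skew_seconds = aware.timestamp() - expected
print(local_skew_seconds, utc_skew_seconds)
```

On a machine whose local zone is four hours behind UTC, the first printed value would correspond to the +4h shift seen in the table above; the second is zero everywhere.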
   
   
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] Abacn commented on issue #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #26942:
URL: https://github.com/apache/beam/issues/26942#issuecomment-1570823454

   > Is all we need here to configure fastavro differently
   
   There are several ways to resolve it. The underlying issue is more subtle (see below).
   
   Another way is to add time zone info in 
   
   https://github.com/apache/beam/blob/bdd29bf45f818ce0c20c111c9787d5051c1b792b/sdks/python/apache_beam/utils/timestamp.py#L163
   
   I am not sure why tzinfo was intentionally removed in the first place. A datetime.datetime object without a time zone implicitly means local time (like the object returned by datetime.datetime.now()), so the object currently returned is not correct in my view.
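
To illustrate the alternative of keeping the time zone on the returned object, here is a minimal sketch of a microseconds-to-datetime conversion. It is not Beam's implementation: `micros_to_utc_datetime` and its `keep_tz` flag are hypothetical names chosen for this example, and the input is assumed to be microseconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_utc_datetime(micros, keep_tz=True):
    """Convert microseconds since the Unix epoch to a UTC datetime.

    Hypothetical stand-in, not Beam's code. With keep_tz=False the
    result is naive, so downstream code may read it as local time.
    """
    aware = EPOCH_UTC + timedelta(microseconds=micros)
    return aware if keep_tz else aware.replace(tzinfo=None)


micros = 1685485565533746  # 2023-05-30 22:26:05.533746 UTC
aware = micros_to_utc_datetime(micros)
naive = micros_to_utc_datetime(micros, keep_tz=False)
print(aware.isoformat(), naive.isoformat())
```

Both results carry the same wall-clock fields; only the aware one states that those fields are UTC, which removes the ambiguity for any consumer that would otherwise assume local time.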
   
   > would this apply to any avro usage?
   
   It only applies when writing a datetime object returned from apache_beam.utils.timestamp.Timestamp.to_utc_datetime.
   
   
   




[GitHub] [beam] johnjcasey commented on issue #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load

Posted by "johnjcasey (via GitHub)" <gi...@apache.org>.
johnjcasey commented on issue #26942:
URL: https://github.com/apache/beam/issues/26942#issuecomment-1570794905

   @Abacn is all we need here to configure fastavro differently? Also, is this specific to BQ, or would this apply to any avro usage?




[GitHub] [beam] Abacn closed issue #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn closed issue #26942: [Bug]: Timestamp off by timezone offset when using Python BigQuery Avro File load
URL: https://github.com/apache/beam/issues/26942

