You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/02/17 23:35:56 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6879: Python: Liberal Timestamp parsing

Fokko opened a new pull request, #6879:
URL: https://github.com/apache/iceberg/pull/6879

   We don't want to throw an error when the timezone is missing and just assume UTC.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6879: Python: Liberal Timestamp parsing

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1460732865

   > We could add a "session time zone" like SQL and use that when it isn't specified. We can then default the session time zone to UTC
   
   That would work for me. We could set this as a property in the configuration.
   
   > We could make helper methods for creating time literals, like `timestamp(iso8601: str, defaultZone: str = "+00:00") -> int` that make it clear what is happening by having a defined default
   
   I'm less in favor of this one because I think this is mostly around the filter API;
   
   Instead of:
   
   ```python
   tbl.scan(
       row_filter=GreaterThanOrEqual("tpep_pickup_datetime", "2022-01-01T00:00:00+00:00")
   ).to_arrow()
   ```
   
   Have the API As:
   
   ```python
   tbl.scan(
       row_filter="tpep_pickup_datetime >= '2022-01-01T00:00:00+00:00'"
   ).to_arrow()
   ```
   
   Then it is easy to forget the `+00:00` and then a user would need to look up the format which is not very user-friendly. A logical evolution is also:
   
   ```python
   tbl.scan(
       row_filter="tpep_pickup_datetime >= '2022-01-01'")
   ).to_arrow()
   ```
   
   Where `tpep_pickup_datetime` is a datetime. Setting a timezone on a date would also be a bit awkward.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6879: Python: Liberal Timestamp parsing

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1503811480

   @rdblue I've added it to the config. This way you can also do:
   
   ```python
   cat = load_catalog('prod', 'timezone'='+01:00')
   ```
   
   Now we also get warnings:
   ```
   In [1]: from pyiceberg.catalog import load_catalog
      ...: from pyiceberg.expressions import GreaterThanOrEqual, LessThan, And
      ...: 
      ...: cat = load_catalog('prod')
      ...: 
      ...: tbl = cat.load_table(('examples', 'nyc_taxi_yellow'))
      ...: 
      ...: 
      ...: tbl.scan(row_filter=And(GreaterThanOrEqual("pickup_time", "2022-01-01T00:00:00"), GreaterThanOrEqual("dropoff_time", "2022-01-01T00:00:00"))).to_arrow()
      ...: 
   /Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/expressions/visitors.py:1451: UserWarning: Assuming timezone +00:00 for: GreaterThanOrEqual(term=Reference(name='pickup_time'), literal=literal('2022-01-01T00:00:00'))
     warnings.warn(f"Assuming timezone {self.default_timezone} for: {predicate}")
   /Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/expressions/visitors.py:1451: UserWarning: Assuming timezone +00:00 for: GreaterThanOrEqual(term=Reference(name='dropoff_time'), literal=literal('2022-01-01T00:00:00'))
     warnings.warn(f"Assuming timezone {self.default_timezone} for: {predicate}")
   Out[1]: 
   pyarrow.Table
   vendor_id: int32
   pickup_time: timestamp[us, tz=UTC]
   pickup_location_id: int32
   dropoff_time: timestamp[us, tz=UTC]
   dropoff_location_id: int32
   passenger_count: int32
   trip_distance: double
   ratecode_id: int32
   payment_type: int32
   total_amount: double
   fare_amount: double
   tip_amount: double
   tolls_amount: double
   mta_tax: double
   improvement_surcharge: double
   congestion_surcharge: double
   extra_surcharges: double
   store_and_forward_flag: string
   ----
   vendor_id: [[2],[2],...,[2,2,2,2],[2,2]]
   pickup_time: [[2029-05-05 15:37:39.000000],[2026-12-17 07:38:40.000000],...,[2038-02-18 04:41:09.000000,2038-02-18 04:36:24.000000,2038-02-18 05:46:29.000000,2038-02-18 05:55:54.000000],[2088-01-24 08:15:42.000000,2088-01-24 08:25:39.000000]]
   pickup_location_id: [[231],[132],...,[262,263,249,113],[41,24]]
   dropoff_time: [[2029-05-05 19:18:20.000000],[2026-12-17 08:09:21.000000],...,[2038-02-18 08:03:11.000000,2038-02-18 04:39:22.000000,2038-02-18 05:53:20.000000,2038-02-19 05:13:21.000000],[2088-01-24 08:19:46.000000,2088-01-24 15:28:25.000000]]
   dropoff_location_id: [[249],[79],...,[140,262,114,239],[166,162]]
   passenger_count: [[1],[1],...,[1,1,1,1],[1,1]]
   trip_distance: [[1.69],[18.98],...,[0.56,0.45,0.86,5.07],[0.63,4.05]]
   ratecode_id: [[1],[2],...,[1,1,1,1],[1,1]]
   payment_type: [[1],[2],...,[1,2,2,2],[2,2]]
   total_amount: [[12.96],[55.3],...,[9.36,7.8,9.3,21.8],[5.3,15.3]]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko closed pull request #6879: Python: Assume timezone when it is omitted

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko closed pull request #6879: Python: Assume timezone when it is omitted
URL: https://github.com/apache/iceberg/pull/6879


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6879: Python: Liberal Timestamp parsing

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1468991328

   > > We could add a "session time zone" like SQL and use that when it isn't specified. We can then default the session time zone to UTC
   > 
   > That would work for me.
   
   Okay, so what would that look like? Is there precedent in other Python libraries for setting "session" properties?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6879: Python: Liberal Timestamp parsing

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1460709964

   The reason why we don't support this is that it is ambiguous. We don't know how to convert the timestamp without a zone to a specific UTC timestamp. We could assume the value is UTC or we could assume that the value is in the user's current time zone. Since it isn't obvious which one is correct or which one we've chosen, I'm worried that this would create time zone bugs in downstream code.
   
   There are probably alternatives:
   * We could add a "session time zone" like SQL and use that when it isn't specified. We can then default the session time zone to UTC
   * We could make helper methods for creating time literals, like `timestamp(iso8601: str, defaultZone: str = "+00:00") -> int` that make it clear what is happening by having a defined default
   
   What do you think about the alternatives, @Fokko?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6879: Python: Session timezone

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1542685254

   I want to revisit this. I had a chat with @danielcweeks on the interesting subject of time zones. We also discussed how Spark is doing it, and I like that the notion of the session timestamp (same as @rdblue suggested above).
   
   In Spark:
   ```
   spark-sql> create table t1(c1 timestamp);
   Time taken: 0.572 seconds
   
   spark-sql> insert into t1 VALUES(now);
   Time taken: 4.716 seconds
   
   spark-sql> select * from t1;
   2023-05-10 19:51:48.241101
   Time taken: 2.351 seconds, Fetched 1 row(s)
   ```
   
   So it is stored in UTC (as described in the Iceberg spec), but the session is already localized to my current timezone (it is 19:51 on the wall clock as well).
   
   ```
   >>> df = spark.read.table("default.t1");
   >>> df
   DataFrame[c1: timestamp]
   >>> df.show(truncate=False)
   +--------------------------+                                                    
   |c1                        |
   +--------------------------+
   |2023-05-10 19:51:48.241101|
   +--------------------------+
   
   >>> pdf = df.toPandas()
   >>> pdf
                             c1
   0 2023-05-10 19:51:48.241101
   
   >>> >>> pdf.c1
   0   2023-05-10 19:51:48.241101
   Name: c1, dtype: datetime64[ns]
   ```
   
   Spark does not assign any timestamp, which I think is wrong.
   
   With PyIceberg:
   
   ```
   0   2023-05-10 17:51:48.241101+00:00
   Name: c1, dtype: datetime64[ns, UTC]
   ```
   
   Then the question is, should we also localize this information (`tz_convert`the columns to `CEST` for the Dutch).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org