You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/02/17 23:35:56 UTC
[GitHub] [iceberg] Fokko opened a new pull request, #6879: Python: Liberal Timestamp parsing
Fokko opened a new pull request, #6879:
URL: https://github.com/apache/iceberg/pull/6879
We don't want to throw an error when the timezone is missing and just assume UTC.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on pull request #6879: Python: Liberal Timestamp parsing
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1460732865
> We could add a "session time zone" like SQL and use that when it isn't specified. We can then default the session time zone to UTC
That would work for me. We could set this as a property in the configuration.
> We could make helper methods for creating time literals, like `timestamp(iso8601: str, defaultZone: str = "+00:00") -> int` that make it clear what is happening by having a defined default
I'm less in favor of this one because I think this is mostly around the filter API;
Instead of:
```python
tbl.scan(
row_filter=GreaterThanOrEqual("tpep_pickup_datetime", "2022-01-01T00:00:00+00:00")
).to_arrow()
```
Have the API As:
```python
tbl.scan(
row_filter="tpep_pickup_datetime >= '2022-01-01T00:00:00+00:00'"
).to_arrow()
```
Then it is easy to forget the `+00:00` and then a user would need to look up the format which is not very user-friendly. A logical evolution is also:
```python
tbl.scan(
row_filter="tpep_pickup_datetime >= '2022-01-01'")
).to_arrow()
```
Where `tpep_pickup_datetime` is a datetime. Setting a timezone on a date would also be a bit awkward.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on pull request #6879: Python: Liberal Timestamp parsing
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1503811480
@rdblue I've added it to the config. This way you can also do:
```python
cat = load_catalog('prod', 'timezone'='+01:00')
```
Now we also get warnings:
```
In [1]: from pyiceberg.catalog import load_catalog
...: from pyiceberg.expressions import GreaterThanOrEqual, LessThan, And
...:
...: cat = load_catalog('prod')
...:
...: tbl = cat.load_table(('examples', 'nyc_taxi_yellow'))
...:
...:
...: tbl.scan(row_filter=And(GreaterThanOrEqual("pickup_time", "2022-01-01T00:00:00"), GreaterThanOrEqual("dropoff_time", "2022-01-01T00:00:00"))).to_arrow()
...:
/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/expressions/visitors.py:1451: UserWarning: Assuming timezone +00:00 for: GreaterThanOrEqual(term=Reference(name='pickup_time'), literal=literal('2022-01-01T00:00:00'))
warnings.warn(f"Assuming timezone {self.default_timezone} for: {predicate}")
/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/expressions/visitors.py:1451: UserWarning: Assuming timezone +00:00 for: GreaterThanOrEqual(term=Reference(name='dropoff_time'), literal=literal('2022-01-01T00:00:00'))
warnings.warn(f"Assuming timezone {self.default_timezone} for: {predicate}")
Out[1]:
pyarrow.Table
vendor_id: int32
pickup_time: timestamp[us, tz=UTC]
pickup_location_id: int32
dropoff_time: timestamp[us, tz=UTC]
dropoff_location_id: int32
passenger_count: int32
trip_distance: double
ratecode_id: int32
payment_type: int32
total_amount: double
fare_amount: double
tip_amount: double
tolls_amount: double
mta_tax: double
improvement_surcharge: double
congestion_surcharge: double
extra_surcharges: double
store_and_forward_flag: string
----
vendor_id: [[2],[2],...,[2,2,2,2],[2,2]]
pickup_time: [[2029-05-05 15:37:39.000000],[2026-12-17 07:38:40.000000],...,[2038-02-18 04:41:09.000000,2038-02-18 04:36:24.000000,2038-02-18 05:46:29.000000,2038-02-18 05:55:54.000000],[2088-01-24 08:15:42.000000,2088-01-24 08:25:39.000000]]
pickup_location_id: [[231],[132],...,[262,263,249,113],[41,24]]
dropoff_time: [[2029-05-05 19:18:20.000000],[2026-12-17 08:09:21.000000],...,[2038-02-18 08:03:11.000000,2038-02-18 04:39:22.000000,2038-02-18 05:53:20.000000,2038-02-19 05:13:21.000000],[2088-01-24 08:19:46.000000,2088-01-24 15:28:25.000000]]
dropoff_location_id: [[249],[79],...,[140,262,114,239],[166,162]]
passenger_count: [[1],[1],...,[1,1,1,1],[1,1]]
trip_distance: [[1.69],[18.98],...,[0.56,0.45,0.86,5.07],[0.63,4.05]]
ratecode_id: [[1],[2],...,[1,1,1,1],[1,1]]
payment_type: [[1],[2],...,[1,2,2,2],[2,2]]
total_amount: [[12.96],[55.3],...,[9.36,7.8,9.3,21.8],[5.3,15.3]]
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] Fokko closed pull request #6879: Python: Assume timezone when it is omitted
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko closed pull request #6879: Python: Assume timezone when it is omitted
URL: https://github.com/apache/iceberg/pull/6879
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #6879: Python: Liberal Timestamp parsing
Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1468991328
> > We could add a "session time zone" like SQL and use that when it isn't specified. We can then default the session time zone to UTC
>
> That would work for me.
Okay, so what would that look like? Is there precedent in other Python libraries for setting "session" properties?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #6879: Python: Liberal Timestamp parsing
Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1460709964
The reason why we don't support this is that it is ambiguous. We don't know how to convert the timestamp without a zone to a specific UTC timestamp. We could assume the value is UTC or we could assume that the value is in the user's current time zone. Since it isn't obvious which one is correct or which one we've chosen, I'm worried that this would create time zone bugs in downstream code.
There are probably alternatives:
* We could add a "session time zone" like SQL and use that when it isn't specified. We can then default the session time zone to UTC
* We could make helper methods for creating time literals, like `timestamp(iso8601: str, defaultZone: str = "+00:00") -> int` that make it clear what is happening by having a defined default
What do you think about the alternatives, @Fokko?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on pull request #6879: Python: Session timezone
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #6879:
URL: https://github.com/apache/iceberg/pull/6879#issuecomment-1542685254
I want to revisit this. I had a chat with @danielcweeks on the interesting subject of time zones. We also discussed how Spark is doing it, and I like that the notion of the session timestamp (same as @rdblue suggested above).
In Spark:
```
spark-sql> create table t1(c1 timestamp);
Time taken: 0.572 seconds
spark-sql> insert into t1 VALUES(now);
Time taken: 4.716 seconds
spark-sql> select * from t1;
2023-05-10 19:51:48.241101
Time taken: 2.351 seconds, Fetched 1 row(s)
```
So it is stored in UTC (as described in the Iceberg spec), but the session is already localized to my current timezone (it is 19:51 on the wall clock as well).
```
>>> df = spark.read.table("default.t1");
>>> df
DataFrame[c1: timestamp]
>>> df.show(truncate=False)
+--------------------------+
|c1 |
+--------------------------+
|2023-05-10 19:51:48.241101|
+--------------------------+
>>> pdf = df.toPandas()
>>> pdf
c1
0 2023-05-10 19:51:48.241101
>>> >>> pdf.c1
0 2023-05-10 19:51:48.241101
Name: c1, dtype: datetime64[ns]
```
Spark does not assign any timestamp, which I think is wrong.
With PyIceberg:
```
0 2023-05-10 17:51:48.241101+00:00
Name: c1, dtype: datetime64[ns, UTC]
```
Then the question is, should we also localize this information (`tz_convert`the columns to `CEST` for the Dutch).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org