You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/20 06:59:17 UTC

[GitHub] [arrow] emkornfield opened a new pull request #7805: Honor tzinfo when converting from datetime

emkornfield opened a new pull request #7805:
URL: https://github.com/apache/arrow/pull/7805


   Draft PR to enable round trip to TZ info to hopefully solve spark issues.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs edited a comment on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
kszucs edited a comment on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661130910


   > @kszucs Breaking users is a concern, I'll add an environment variable for both this change and the previous one that can keep the old buggy behavior. Just to clarify: was actually ~5PM Budapest time when you ran these test (i.e. this patch looks like it fixes a bug?)
   
   Right, this patch produces the right result.
   
   > 
   > I thought the unit test I added for Pandas captured the intent of the ML example? Let me try to run the example by hand in python to see the results.
   
   It fixes that as well, just doesn't keep the old buggy behavior. I was considering to just apply the spark patch on the current master to keep the old buggy behavior, but there is still the nested issue.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661160765


   > It fixes that as well, just doesn't keep the old buggy behavior. I was considering to just apply the spark patch on the current master to keep the old buggy behavior, but there is still the nested issue.
   
   I've run out of time this morning to work on this PR.  I'll update it tonight with an environment variable flag that can retain the old buggy behavior.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457436813



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
+      if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
+        // From public docs on array construction
+        // "Localized timestamps will currently be returned as UTC "
+        //     representation). "

Review comment:
       If I reuse the nomenclature from the test, I get without this PR (it's 14h21 UTC currently):
   ```python
   >>> now_utc                                                                                                     
   datetime.datetime(2020, 7, 20, 14, 21, 42, 96119, tzinfo=<UTC>)
   >>> arr = pa.array([now_utc])                                                                                   
   >>> arr                                                                                                         
   <pyarrow.lib.TimestampArray object at 0x7f44b0bfcc90>
   [
     2020-07-20 14:21:42.096119
   ]
   >>> arr.type.tz                                                                                                 
   >>> arr.to_pylist()                                                                                             
   [datetime.datetime(2020, 7, 20, 14, 21, 42, 96119)]
   >>> arr.to_pandas()                                                                                             
   0   2020-07-20 14:21:42.096119
   dtype: datetime64[ns]
   ```
   ```python
   >>> now_with_tz                                                                                                           
   datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)
   >>> arr = pa.array([now_with_tz])                                                                                         
   >>> arr.type.tz                                                                                                           
   >>> arr.to_pylist()                                                                                                       
   [datetime.datetime(2020, 7, 20, 10, 21, 42, 96119)]
   >>> arr.to_pylist()[0].tzinfo                                                                                             
   >>> arr.to_pandas()                                                                                                       
   0   2020-07-20 10:21:42.096119
   dtype: datetime64[ns]
   ```
   So without the PR, there's the problem that timestamps lose the timezone information from Python. It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading. At least originally you may be alerted by the absence of a timezone on the type (in Python terms, it's a "naive" timestamp).
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-660856056


   Revision: a5b2a51665ab1383fb371ecd76bb3c20c4bf8726
   
   Submitted crossbow builds: [ursa-labs/crossbow @ actions-433](https://github.com/ursa-labs/crossbow/branches/all?query=actions-433)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.8-spark-master|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-433-github-test-conda-python-3.8-spark-master)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-433-github-test-conda-python-3.8-spark-master)|


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
kszucs commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661169560


   @emkornfield Thanks for working on this! 
   
   In the meantime I'm going to apply the reversion patch and cut RC2 since it is going to take at least 6-8 hours to build and three additional days for voting, so we'll have enough time to sort this issue out and decide to either cut RC3 including this patch or keep RC2.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-662527331


   yes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457173202



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));

Review comment:
       If null is returned, it means Python raised an error (for example the attribute doesn't exist, which is unlikely). You want either to return that error, or to ignore it (using `PyErr_Clear`).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-660857649


   @kou 3.8 should be fine. Thank you!.  I copied the command from the previous PR related to datetimes, I guess CI has changed since then.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kou commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
kou commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-660855376


   @github-actions crossbow submit test-conda-python-3.8-spark-master


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457121774



##########
File path: python/pyarrow/scalar.pxi
##########
@@ -331,7 +331,8 @@ cdef class Date64Scalar(Scalar):
 
     def as_py(self):
         """
-        Return this value as a Python datetime.datetime instance.
+        Return this value as a Pandas Timestamp instance (if available),

Review comment:
       this needs to be reverted.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
kszucs commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661130910


   > @kszucs Breaking users is a concern, I'll add an environment variable for both this change and the previous one that can keep the old buggy behavior. Just to clarify: was actually ~5PM Budapest time when you ran these test (i.e. this patch looks like it fixes a bug?)
   Right, this patch produces the right result.
   > 
   > I thought the unit test I added for Pandas captured the intent of the ML example? Let me try to run the example by hand in python to see the results.
   It fixes that as well, just doesn't keep the old buggy behavior. I was considering to just apply the spark patch on the current master to keep the old buggy behavior, but there is still the nested issue.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-660848831


   https://issues.apache.org/jira/browse/ARROW-9528


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
kszucs commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-662523425


   @emkornfield I think we can close this in favor of #7816 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457422354



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3325,6 +3325,21 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
+def test_nested_with_timestamp_tz_round_trip():
+    ts = pd.Timestamp.now()
+    ts_dt = ts.to_pydatetime()
+    arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York'))

Review comment:
       I don't believe so. Pretty sure my machine uses it. Local time, I'll double check by setting a different tz




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kou commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
kou commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-660856402


   We can use task listed only in https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1930 for `crossbow submit`.
   We don't have `...-3.7-...` task. We only have `...-3.8-...` task for Spark master.
   
   If we need to run with Python 3.7, we can add a task to `dev/tasks/tasks.yml`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661238238


   Just to clarify things, is the main concern with this patch over keeping the previous buggy behavior? Besides that are these changes producing correct results and passing roundtrip tests?
   It does seem a little late in the game to be making these kind of changes, so unless others view this as extremely low risk, I'm in favor of reverting and fixing this later when it's not so rushed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
kszucs commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661106422


   My main concern with this solution is while it resolves the pandas roundtrip, the intermediate array values are different.
   People may "rely" on the previous buggy behavior, and I'm afraid that it'll cause more post release trouble than we expect.
   
   Running the following snippet on three different revisions:
   
   ```py
   import pytz
   from datetime import datetime
   
   import pyarrow as pa
   
   now_at_budapest = datetime.now(pytz.timezone('Europe/Budapest'))
   arr = pa.array([now_at_budapest], type=pa.timestamp('s', tz='Europe/Budapest'))
   
   try:
       pa.show_versions()
   except AttributeError:
       print("Arrow version: {}".format(pa.__version__))
   
   print(arr)
   print(arr.to_pandas())
   ```
   
   ### 0.17.1
   
   ```py
   Arrow version: 0.17.1
   [
       2020-07-20 17:01:11
   ]
   0   2020-07-20 19:01:11+02:00
   dtype: datetime64[ns, Europe/Budapest]
   ```
   
   ### Master
   
   ```py
   pyarrow version info
   --------------------
   Package kind: not indicated
   Arrow C++ library version: 1.0.0-SNAPSHOT
   Arrow C++ compiler: AppleClang 11.0.3.11030032
   Arrow C++ compiler flags:  -Qunused-arguments -fcolor-diagnostics -ggdb -O0
   Arrow C++ git revision: 210d3609f027ef9ed83911c2d1132cb9cbb2dc06
   Arrow C++ git description: apache-arrow-0.17.0-756-g210d3609f
   [
       2020-07-20 17:10:11
   ]
   0   2020-07-20 19:10:11+02:00
   dtype: datetime64[ns, Europe/Budapest]
   ```
   
   ### This patch
   
   ```py
   pyarrow version inf
   --------------------                                                        
   Package kind: not indicated                                                 
   Arrow C++ library version: 1.0.0-SNAPSHOT                                   
   Arrow C++ compiler: AppleClang 11.0.3.11030032                              
   Arrow C++ compiler flags:  -Qunused-arguments -fcolor-diagnostics -ggdb -O0 
   Arrow C++ git revision: a5b2a51665ab1383fb371ecd76bb3c20c4bf8726            
   Arrow C++ git description: apache-arrow-0.17.0-761-ga5b2a5166               
   [                                                                           
     2020-07-20 15:01:12                                                       
   ]                                                                           
   0   2020-07-20 17:01:12+02:00                                               
   dtype: datetime64[ns, Europe/Budapest]                                      
   ```
   
   While the current master works for this example and the [spark patch](https://github.com/apache/arrow/pull/7804) fixes the spark integration test, it breaks the nested roundtrip example discussed in the ML thread.
   
   @emkornfield @BryanCutler thoughts?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield closed pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield closed pull request #7805:
URL: https://github.com/apache/arrow/pull/7805


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457121068



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
+      if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {

Review comment:
       timezone_ should be first here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457421000



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
+      if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
+        // From public docs on array construction
+        // "Localized timestamps will currently be returned as UTC "
+        //     representation). "

Review comment:
       I'm actually not sure what it was intended to mean.  This was my best guess.  Not accounting for timezones info seems like a bug?  Would you prefer I try to capture the first time zone encountered as a string? Or is the preference not to have the logic in this PR in arrow in the first place?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457178228



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3325,6 +3325,21 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
+def test_nested_with_timestamp_tz_round_trip():
+    ts = pd.Timestamp.now()
+    ts_dt = ts.to_pydatetime()
+    arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York'))
+    struct = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])
+
+    result = struct.to_pandas()
+    # N.B. we test with Panaas because the Arrow types are not

Review comment:
       "Pandas" :-)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-660848232


   @github-actions crossbow submit test-conda-python-3.7-spark-master


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457436813



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
+      if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
+        // From public docs on array construction
+        // "Localized timestamps will currently be returned as UTC "
+        //     representation). "

Review comment:
       If I reuse the nomenclature from the test, I get without this PR (it's 14h21 UTC currently):
   ```python
   >>> now_utc                                                                                                     
   datetime.datetime(2020, 7, 20, 14, 21, 42, 96119, tzinfo=<UTC>)
   >>> arr = pa.array([now_utc])                                                                                   
   >>> arr                                                                                                         
   <pyarrow.lib.TimestampArray object at 0x7f44b0bfcc90>
   [
     2020-07-20 14:21:42.096119
   ]
   >>> arr.type.tz                                                                                                 
   >>> arr.to_pylist()                                                                                             
   [datetime.datetime(2020, 7, 20, 14, 21, 42, 96119)]
   >>> arr.to_pandas()                                                                                             
   0   2020-07-20 14:21:42.096119
   dtype: datetime64[ns]
   ```
   ```python
   >>> now_with_tz                                                                                                           
   datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)
   >>> arr = pa.array([now_with_tz])                                                                                         
   >>> arr.type.tz                                                                                                           
   >>> arr.to_pylist()                                                                                                       
   [datetime.datetime(2020, 7, 20, 10, 21, 42, 96119)]
   >>> arr.to_pylist()[0].tzinfo                                                                                             
   >>> arr.to_pandas()                                                                                                       
   0   2020-07-20 10:21:42.096119
   dtype: datetime64[ns]
   ```
   So without the PR, there's a problem for non-UTC timestamps which lose the timezone information. It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading. At least originally you may be alerted by the absence of a timezone on the type (in Python terms, it's a "naive" timestamp).
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661327726


   Sure. I thought the mailing list discussion said to discuss here 🤷 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661127682


   @kszucs Breaking users is a concern, I'll add an environment variable for both this change and the previous one that can keep the old buggy behavior.  Just to clarify: was actually ~5PM Budapest time when you ran these test (i.e. this patch looks like it fixes a bug?)
   
   I thought the unit test I added for Pandas captured the intent of the ML example?  Let me try to run the example by hand in python to see the results.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457176423



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3325,6 +3325,21 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
+def test_nested_with_timestamp_tz_round_trip():
+    ts = pd.Timestamp.now()
+    ts_dt = ts.to_pydatetime()
+    arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York'))

Review comment:
       Does this test presume that `ts` itself was produced in "America/New York" timezone? It's not clear to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457177694



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
+      if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
+        // From public docs on array construction
+        // "Localized timestamps will currently be returned as UTC "
+        //     representation). "

Review comment:
       Does this mean Arrow simply stores an erroneous value? We don't do timezone conversion in Arrow, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7805: Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-660844779


   @github-actions crossbow submit test-conda-python-3.7-spark-master


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs edited a comment on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
kszucs edited a comment on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661106422


   My main concern with this solution is while it resolves the pandas roundtrip, the intermediate array values are different.
   People may "rely" on the previous buggy behavior, and I'm afraid that it'll cause more post release trouble than we expect.
   
   Running the following snippet on three different revisions:
   
   ```py
   import pytz
   from datetime import datetime
   
   import pyarrow as pa
   
   now_at_budapest = datetime.now(pytz.timezone('Europe/Budapest'))
   arr = pa.array([now_at_budapest], type=pa.timestamp('s', tz='Europe/Budapest'))
   
   try:
       pa.show_versions()
   except AttributeError:
       print("Arrow version: {}".format(pa.__version__))
   
   print(arr)
   print(arr.to_pandas())
   ```
   
   ### 0.17.1
   
   ```py
   Arrow version: 0.17.1
   [
       2020-07-20 17:01:11
   ]
   0   2020-07-20 19:01:11+02:00
   dtype: datetime64[ns, Europe/Budapest]
   ```
   
   ### Master
   
   ```py
   pyarrow version info
   --------------------
   Package kind: not indicated
   Arrow C++ library version: 1.0.0-SNAPSHOT
   Arrow C++ compiler: AppleClang 11.0.3.11030032
   Arrow C++ compiler flags:  -Qunused-arguments -fcolor-diagnostics -ggdb -O0
   Arrow C++ git revision: 210d3609f027ef9ed83911c2d1132cb9cbb2dc06
   Arrow C++ git description: apache-arrow-0.17.0-756-g210d3609f
   [
       2020-07-20 17:10:11
   ]
   0   2020-07-20 19:10:11+02:00
   dtype: datetime64[ns, Europe/Budapest]
   ```
   
   ### This patch
   
   ```py
   pyarrow version inf
   --------------------                                                        
   Package kind: not indicated                                                 
   Arrow C++ library version: 1.0.0-SNAPSHOT                                   
   Arrow C++ compiler: AppleClang 11.0.3.11030032                              
   Arrow C++ compiler flags:  -Qunused-arguments -fcolor-diagnostics -ggdb -O0 
   Arrow C++ git revision: a5b2a51665ab1383fb371ecd76bb3c20c4bf8726            
   Arrow C++ git description: apache-arrow-0.17.0-761-ga5b2a5166               
   [                                                                           
     2020-07-20 15:01:12                                                       
   ]                                                                           
   0   2020-07-20 17:01:12+02:00                                               
   dtype: datetime64[ns, Europe/Budapest]                                      
   ```
   
   While the current master works for this example and the [spark patch](https://github.com/apache/arrow/pull/7804) fixes the spark integration test, it breaks the nested roundtrip [example](https://gist.github.com/kszucs/26c58e794d30b7d783bc8484b67d860a) discussed in the ML thread.
   
   @emkornfield @BryanCutler thoughts?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661328369


   too many communication channels.  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457178426



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3325,6 +3325,21 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
+def test_nested_with_timestamp_tz_round_trip():
+    ts = pd.Timestamp.now()
+    ts_dt = ts.to_pydatetime()
+    arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York'))
+    struct = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])
+
+    result = struct.to_pandas()
+    # N.B. we test with Panaas because the Arrow types are not
+    # actually equal.  All Timezone aware times are currently normalized
+    # to "UTC" as the timesetamp type.but since this conversion results

Review comment:
       "timestamp"

##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3325,6 +3325,21 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
+def test_nested_with_timestamp_tz_round_trip():
+    ts = pd.Timestamp.now()
+    ts_dt = ts.to_pydatetime()
+    arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York'))
+    struct = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])
+
+    result = struct.to_pandas()
+    # N.B. we test with Panaas because the Arrow types are not
+    # actually equal.  All Timezone aware times are currently normalized
+    # to "UTC" as the timesetamp type.but since this conversion results
+    # in object dtypes, and tzinfo is preserrved the comparison should

Review comment:
       "preserved"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457510267



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
+      if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
+        // From public docs on array construction
+        // "Localized timestamps will currently be returned as UTC "
+        //     representation). "

Review comment:
       > It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading.
   
   That is not the intent of the PR, right now everything gets corrected to UTC.  As an example:
   This correctly keeps the times logically the same.  I can make the change to try to keep the original timezones in place and changes US/Eastern to the correct time in UTC>
   
   
   ```
   >>> now_with_tz = datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=pytz.timezone('US/Eastern'))
   >>> arr = pa.array([now_with_tz]) 
   >>> arr.type.tz  
   'UTC'
   >>> arr.to_pylist() 
   [datetime.datetime(2020, 7, 20, 15, 17, 42, 96119, tzinfo=<UTC>)]
   >>> arr.to_pylist()[0].tzinfo
   <UTC>
   >>> arr.to_pandas()
   0   2020-07-20 15:17:42.096119+00:00
   dtype: datetime64[ns, UTC]
   ```
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457425287



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -300,6 +301,8 @@ def test_nulls(ty):
 def test_array_from_scalar():
     today = datetime.date.today()
     now = datetime.datetime.now()
+    now_utc = now.replace(tzinfo=pytz.utc)

Review comment:
       Based on my experimentations, you should write:
   ```python
   now_utc = datetime.datetime.now(tz=pytz.utc)
   ```
   
   (simply calling `.replace(tzinfo=pytz.utc)` doesn't adjust the recorded time for the timezone change, so you get the local time tagged with a UTC timezone)
   
   And, yes, this probably doesn't matter for the test's correctness :-)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457510267



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
+      if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
+        // From public docs on array construction
+        // "Localized timestamps will currently be returned as UTC "
+        //     representation). "

Review comment:
       > It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading.
   
   That is not the intent of the PR, right now everything gets converted to UTC but in my environment UTC tzinfo is maintained..  I can't replicate your results above (timezone information for me is propagaged as an example with now_with_tz:
   
   ```
   >>> now_with_tz = datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=pytz.timezone('US/Eastern'))
   >>> arr = pa.array([now_with_tz]) 
   >>> arr.type.tz  
   'UTC'
   >>> arr.to_pylist() 
   [datetime.datetime(2020, 7, 20, 15, 17, 42, 96119, tzinfo=<UTC>)]
   >>> arr.to_pylist()[0].tzinfo
   <UTC>
   >>> arr.to_pandas()
   0   2020-07-20 15:17:42.096119+00:00
   dtype: datetime64[ns, UTC]
   ```
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661327078


   @nealrichardson I think we should discuss this on the mailing list.  The prior patch has been reverted and I'll use this one to have an end-to-end solution which probably won't make it into 1.0
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#issuecomment-661323133


   Correct me if I'm wrong, but IIUC there are doubts about a few things:
   
   1) the correctness of the behavior on master (prior to reverting the initial change)
   2) the correctness of the behavior on this patch
   3) how much people have built expectations around the wrong behavior that is the status quo (0.17.1), in Spark and in general 
   4) how to give people a safe upgrade path that doesn't break production code, even if the changes are more "correct".
   
   This patch may be the right solution, but I fear that we haven't adequately thought through (and tested) all of the implications and upgrade paths. And two of the people with the strongest opinion's about pyarrow's API (@wesm and @pitrou) just left for vacation and have expressed a preference for reverting the initial change for the 1.0 release. At this stage of the 1.0 release, I'd rather pyarrow continue to be wrong in the expected way (i.e. revert and not merge this yet) than be right in an unexpected way and possibly wrong in other unknown ways.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7805: ARROW-9528: [Python] Honor tzinfo when converting from datetime

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457445991



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3325,6 +3325,21 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
+def test_nested_with_timestamp_tz_round_trip():
+    ts = pd.Timestamp.now()

Review comment:
       What timezone does this timestamp have? Is it a naive timestamp? Would be nice explaining it in comments.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org