Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/01 07:59:18 UTC

[GitHub] [arrow] emkornfield opened a new pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

emkornfield opened a new pull request #7604:
URL: https://github.com/apache/arrow/pull/7604


   - Ports string_to_timezone to C++
   - Causes nested timestamp columns within structs to use the
     conversion-to-object path.
   - Copies the timezone on the to_object path.
   
   Open to other suggestions on how to solve this.
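   
   As a rough illustration of the intended effect (a minimal sketch, assuming a recent pyarrow with pandas installed; the exact repr of the output is not guaranteed):
   
   ```
   import pyarrow as pa

   # A struct column whose child is a tz-aware timestamp.
   ts = pa.array([0], type=pa.timestamp('us', tz='UTC'))
   struct = pa.StructArray.from_arrays([ts], ['when'])

   # With this change the nested timestamp should come back as a
   # datetime.datetime that still carries its tzinfo, instead of
   # silently dropping the timezone.
   row = struct.to_pandas()[0]
   print(row['when'])         # expected: 1970-01-01 00:00:00+00:00
   print(row['when'].tzinfo)  # expected: a UTC tzinfo object
   ```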


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449110010



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -642,24 +641,27 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted this hack to
-  // preserve the <= 0.15.1 behavior until a better solution can be devised
+  // datetime.datetime does not support nanoseconds).
+  // We force the object conversion to preserve the value of the timezone.
   PandasOptions modified_options = options;
-  modified_options.ignore_timezone = true;
   modified_options.coerce_temporal_nanoseconds = false;
 
   for (int c = 0; c < data.num_chunks(); c++) {
     auto arr = checked_cast<const StructArray*>(data.chunk(c).get());
     // Convert the struct arrays first
     for (int32_t i = 0; i < num_fields; i++) {
+      // See notes above about conversion.
+      std::shared_ptr<Array> field = arr->field(static_cast<int>(i));
+      modified_options.timestamp_as_object =
+          field->type()->id() == Type::TIMESTAMP &&
+          !checked_cast<const TimestampType&>(*field->type()).timezone().empty();

Review comment:
       Yeah, there's the same issue in R too. In R at least, tz-naive data is localized when it is printed, so it might *look* inconsistent, but the data itself is not modified. I think that part is important--we shouldn't change whether the data is timezone naive when we convert it to/from Arrow. When I fixed that in the R bindings, rather than assuming UTC, all of these things that I thought were problems went away.
   
   (Full disclosure: I have not looked at what this PR is doing, just chiming in on the discussion based on having faced what sounds like similar issues in the R bindings.)
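   
   To make that concrete, a small Python sketch (illustrative only, not code from this PR; assumes pyarrow is installed):
   
   ```
   import datetime
   import pyarrow as pa

   # A tz-naive value should survive the round trip unchanged: no
   # localization to the system time zone in either direction.
   naive = datetime.datetime(2020, 7, 1, 12, 0)
   arr = pa.array([naive])              # inferred as timestamp[us] without a timezone
   assert arr.type == pa.timestamp('us')
   assert arr.to_pylist()[0] == naive   # the same wall-clock value comes back
   ```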







[GitHub] [arrow] BryanCutler commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-656420297


   > @jorisvandenbossche Is there a way to skip specific tests? I thought all of them live in the Spark code?
   
   The script selects all Spark tests that use Arrow. I'm pretty sure all of them are being run and there was just the one failure.
   
   The problem on the Spark side is that it isn't handling nested timestamps correctly either, so fixing that will be a step in the right direction. Users who upgrade will see different results for nested timestamps, but those weren't officially supported anyway. As for the Spark tests, I should be able to fix this for master and branch-3.0; not sure about branch-2.4, but maybe.
   
   +1 on this change for me.





[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652764305


   If the approach is agreeable, it would be nice to get this in before the next release. This potentially breaks List[Timestamp], because that now returns datetimes as well, but I think that is better for consistency.
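   
   As an illustration of the kind of case affected (a sketch assuming pyarrow; the exact output is the behaviour under discussion, not a promise):
   
   ```
   import pyarrow as pa

   # A list<timestamp with tz> column: after this change the elements are
   # expected to come back as datetime.datetime objects with tzinfo set,
   # rather than values that lost their timezone.
   arr = pa.array([[0, 1_000_000]], type=pa.list_(pa.timestamp('us', tz='UTC')))
   print(arr.to_pandas()[0])
   ```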





[GitHub] [arrow] wesm commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-656376374


   ping @BryanCutler, would be good to merge this for the release if possible





[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-655622278


   If we agree we can "break" Spark's tests like this, @emkornfield can you maybe skip the specific test in the Spark integration tests, so that we notice if other tests start failing.





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r450201426



##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -262,6 +302,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
                             PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {

Review comment:
       Looks fine!







[GitHub] [arrow] BryanCutler commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449338456



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3321,9 +3321,12 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
-def test_struct_with_timestamp_tz():
+def test_nested_with_timestamp_tz():
     # ARROW-7723
     ts = pd.Timestamp.now()
    # This is used for verifying timezone conversion; the microseconds
    # are not important
+    ts_dt = ts.to_pydatetime().replace(microsecond=0)

Review comment:
       I'm not too sure about the failure and the message isn't helpful at all. The test does have a struct with 2 timestamps in it. I should be able to at least get a better error message right now, and then I can look into what's going on.







[GitHub] [arrow] emkornfield edited a comment on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield edited a comment on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-653081545


   > > > For timezone naive Timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to system time zone. thoughts?
   > > 
   > > 
   > > I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)
   > 
   > You've probably seen this, but this is what our docs say about timezone-naive data: http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE
   > 
   > (Side note: do our doc URLs have to be so ugly?)
   > 
   > I agree with Joris that at this level, we should not do any localizing of timezone-naive data.
   
   Please see reply: https://github.com/apache/arrow/pull/7604#discussion_r449096333
   
   I haven't looked into the code deeply, but this might be an overall bug, or at least an idiosyncrasy, with our conversion to/from datetime (not accounting for the system timezone).
   
   We can handle this as a separate issue though.





[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-654267493


   @github-actions crossbow submit test-conda-python-3.7-spark-master





[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652956039


   There is one failure: 
   
   ```
   ======================================================================
   FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key
       self.assertTrue(all([r[0] for r in result]))
   AssertionError: False is not true
   
   ----------------------------------------------------------------------
   Ran 21 tests in 141.194s
   
   FAILED (failures=1)
   ```
   
   https://github.com/apache/spark/blob/7fda184f0fc39613fb68e912c189c54b93c638e6/python/pyspark/sql/tests/test_pandas_grouped_map.py#L546-L574





[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652856693


   @github-actions crossbow submit test-conda-python-3.7-spark-master





[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-654377721


   Which I think is expected? (at least if the data in the test are tz-aware) And thus the test can be updated on the Spark side?
   Since we now return datetimes with tzinfo set, instead of losing the timezone.





[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-653226285


   > But if the timestamp is meant for display purposes, I would expect the tz-naive datetime value to reflect the local timezone, since it displays a "calendar"-like object. This isn't a clear-cut case; I'm just stating my expectations.
   
   OK, that's indeed what was causing the confusion ;)
   
   So my view on this: the datetime.datetime object is also a tz-naive object, so we should "just" convert to the exact same tz-naive timestamp value. This is not only for display, but an actual value to work with when converting the Struct type to native python objects (eg dicts of datetime.datetime values). We already do such conversions right now, and we always preserve the actual timestamp value (and never adjust for system timezone).
   
   And for display: datetime.datetime does *not* adjust the displayed value based on the system timezone. Personally I like this behaviour, while you would prefer that it used the system timezone. But once we convert to datetime.datetime, it's their display that we use, so we don't really have control over it, although you may not like that.
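   
   A tiny illustration of that display point (standard library only):
   
   ```
   import datetime

   # str()/repr() of a naive datetime never consults the system time zone:
   # the same wall-clock fields are shown on every machine.
   d = datetime.datetime(2020, 7, 1, 12, 0)
   print(d)  # 2020-07-01 12:00:00, regardless of the machine's TZ setting
   ```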
   





[GitHub] [arrow] emkornfield edited a comment on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield edited a comment on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652765985


   One other concern: for timezone-naive Timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?





[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-653217519


   Moving the discussion at https://github.com/apache/arrow/pull/7604#discussion_r449130523 outside the inline thread here (which makes it easier to find).
   
   It's still not really clear to me what the actual issue at hand is that we are discussing. What was the problem that started the discussion?
   
   >>>  timezone naive datetimes reflect system timezone
   >>
   >> Who does this? 
   >
   > Python's datetime.astimezone assumes this as of 3.6. Since datetime essentially becomes the "display" I would think most people would assume system timezone values.
   
   But that is only for the `astimezone` method, which we don't use? (in this PR you now only use it once in the tests). And the display of `datetime.datetime` is not influenced by system timezone, AFAIK?
   
   In this PR (and also already on master), we are using `fromutc`, which seems correct to me, since internally in arrow tz-aware timestamps are stored as UTC. 
   And for tz-naive we don't need to handle any timezone issue, and can just convert it to a datetime.datetime object, which by default is also tz-naive and can thus be used as is.
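   
   As a sketch of what `fromutc` does with a stored value (pytz, the example zone and the epoch value 0 are assumptions for illustration):
   
   ```
   import datetime
   import pytz

   # Arrow stores tz-aware timestamps as UTC; to build the aware datetime
   # we interpret the stored value as UTC and let the zone's fromutc()
   # compute the local wall clock plus offset.
   stored = datetime.datetime(1970, 1, 1, 0, 0)   # naive, UTC wall clock
   tz = pytz.timezone('Europe/Brussels')
   print(tz.fromutc(stored))                      # 1970-01-01 01:00:00+01:00
   ```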
   
   
   
   
   





[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449373710



##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -262,6 +265,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
                             PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {
+  static const std::regex kFixedOffset("([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$");

Review comment:
       Hmm, I thought I saw the regex header in other parts of the codebase (it turns out these are ifdefed with boost).  But I went ahead and converted this to a function. 
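   
   For reference, a hypothetical Python rendering of what the fixed-offset handling does (the PR implements this in C++ without std::regex; the function name here is an assumption):
   
   ```
   import re
   import pytz

   # Fixed offsets such as "+05:30" become a pytz.FixedOffset; anything
   # else is treated as an IANA name like "America/New_York".
   _FIXED_OFFSET = re.compile(r'^([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$')

   def string_to_tzinfo(tz: str):
       m = _FIXED_OFFSET.match(tz)
       if m:
           sign = -1 if m.group(1) == '-' else 1
           minutes = int(m.group(2)) * 60 + int(m.group(3))
           return pytz.FixedOffset(sign * minutes)
       return pytz.timezone(tz)
   ```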







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449109405



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -642,24 +641,27 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted this hack to
-  // preserve the <= 0.15.1 behavior until a better solution can be devised
+  // datetime.datetime does not support nanoseconds).
+  // We force the object conversion to preserve the value of the timezone.
   PandasOptions modified_options = options;
-  modified_options.ignore_timezone = true;
   modified_options.coerce_temporal_nanoseconds = false;
 
   for (int c = 0; c < data.num_chunks(); c++) {
     auto arr = checked_cast<const StructArray*>(data.chunk(c).get());
     // Convert the struct arrays first
     for (int32_t i = 0; i < num_fields; i++) {
+      // See notes above about conversion.
+      std::shared_ptr<Array> field = arr->field(static_cast<int>(i));
+      modified_options.timestamp_as_object =
+          field->type()->id() == Type::TIMESTAMP &&
+          !checked_cast<const TimestampType&>(*field->type()).timezone().empty();

Review comment:
       I am a bit confused here.
   
   > timezone naive datetimes reflect system timezone
   
   Who does this? Does arrow do this in conversion? Or do you mean that `PyArray_GETITEM` does this? (AFAIK numpy used to do things with system time zone, but that should be solved nowadays)
   
   > Our timestamp -> datetime conversion assumes UTC as far as I can tell (like I said not sure what pandas does here).
   
   Where do we assume this? (and this is about naive or aware timestamps?)







[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r448413419



##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -262,6 +265,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
                             PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {
+  static const std::regex kFixedOffset("([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$");
+  OwnedRef pytz;
+  RETURN_NOT_OK(internal::ImportModule("pytz", &pytz));
+
+  std::cmatch match;
+  if (std::regex_match(tz.c_str(), match, kFixedOffset)) {

Review comment:
       Should avoid c_str here







[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449738770



##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -14,22 +14,62 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+#include "arrow/python/datetime.h"
 
 #include <algorithm>
 #include <chrono>
 #include <iostream>
 
 #include "arrow/python/common.h"
-#include "arrow/python/datetime.h"
+#include "arrow/python/helpers.h"
 #include "arrow/python/platform.h"
 #include "arrow/status.h"
 #include "arrow/type.h"
 #include "arrow/util/logging.h"
+#include "arrow/util/value_parsing.h"
 
 namespace arrow {
 namespace py {
 namespace internal {
 
+namespace {
+
+bool MatchFixedOffset(const std::string& tz, util::string_view* sign,
+                      util::string_view* hour, util::string_view* minute) {
+  if (tz.size() < 5) {

Review comment:
       Added a comment.







[GitHub] [arrow] BryanCutler commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-654366148


   Looks like the Spark test fails with
   ```
      File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 593, in f
       "{} != {}".format(expected_key[i][1], window_range)
   AssertionError: {'start': datetime.datetime(2018, 3, 10, 0, 0), 'end': datetime.datetime(2018, 3, 15, 0, 0)} != {'start': datetime.datetime(2018, 3, 10, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
   ```
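   
   The mismatch is the usual naive-versus-aware comparison (a minimal illustration, unrelated to Spark itself):
   
   ```
   import datetime
   import pytz

   # A tz-naive datetime never compares equal to a tz-aware one, even for
   # the same wall-clock fields, which is why the assertion fails.
   naive = datetime.datetime(2018, 3, 10, 0, 0)
   aware = pytz.timezone('Etc/UTC').localize(naive)
   print(naive == aware)   # False; ordering comparisons would raise TypeError
   ```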





[GitHub] [arrow] github-actions[bot] commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-654294358


   Revision: a550a761781e8a7e1ebe163642a1c6fae98dad52
   
   Submitted crossbow builds: [ursa-labs/crossbow @ actions-380](https://github.com/ursa-labs/crossbow/branches/all?query=actions-380)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.7-spark-master|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-380-github-test-conda-python-3.7-spark-master)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-380-github-test-conda-python-3.7-spark-master)|





[GitHub] [arrow] BryanCutler commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-654462400


   Yeah, it seems possible to fix on the Spark side, but I haven't been able to take a close look at this yet. I'll try to do that as soon as possible and report back.





[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-653720648


   > But I can also push that change to your branch, if that's fine with you.
   
   @jorisvandenbossche  Yes please feel free to push to this branch.





[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-656499067


   > Is there a way to skip specific tests? I thought all of them live in the Spark code?
   
   With pytest you can deselect a single test by name (that's also what we do with the dask integration tests), but pyspark seems to use a custom test runner (https://github.com/apache/spark/blob/master/python/run-tests.py, called from https://github.com/apache/arrow/blob/2596b3419f2bb03ba9ecbdd3d4fbd521232157bc/ci/scripts/integration_spark.sh#L56-L67), so no idea then. It doesn't seem to allow deselecting tests, only selecting them (https://github.com/apache/spark/blob/master/python/run-tests.py#L176-L216)





[GitHub] [arrow] nealrichardson commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-653068566


   > > For timezone naive Timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to system time zone. thoughts?
   > 
   > I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)
   
   You've probably seen this, but this is what our docs say about timezone-naive data: http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE
   
   (Side note: do our doc URLs have to be so ugly?)
   
   I agree with Joris that at this level, we should not do any localizing of timezone-naive data.





[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-653223390


   > And for tz-naive we don't need to handle any timezone issue, and can just convert it to a datetime.datetime object, which by default is also tz-naive and can thus be used as is.
   
   I think this is the potentially confusing part. It is quite possible it is just me. But if the timestamp is meant for display purposes, I would expect the tz-naive datetime value to reflect the local timezone, since it displays a "calendar"-like object. This isn't a clear-cut case; I'm just stating my expectations.
   
   I think this might be a conversation to be had on the ML. To finish up the conversation: the reason for not always using the to_object path is that I don't want to potentially change the functionality of the pandas conversion to datetime.





[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-653228729


   > the reason for not always using the to_object path is that I don't want to potentially change the functionality of the pandas conversion to datetime.
   
   Pandas also never uses the system timezone for conversions or display, so by using the to_object path, we wouldn't change any functionality AFAIK.
   
   Eg right now we have the following behaviour:
   
   ```
   In [49]: arr = pa.array([0], type=pa.timestamp('us'))
       ...: arr2 = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])  
   
   In [50]: arr2   
   Out[50]: 
   <pyarrow.lib.StructArray object at 0x7f10edf7d7c8>
   -- is_valid: all not null
   -- child 0 type: timestamp[us]
     [
       1970-01-01 00:00:00.000000
     ]
   -- child 1 type: timestamp[us]
     [
       1970-01-01 00:00:00.000000
     ]
   
   In [52]: arr2.to_pandas()[0]    
   Out[52]: 
   {'start': datetime.datetime(1970, 1, 1, 0, 0),
    'stop': datetime.datetime(1970, 1, 1, 0, 0)}
   ```
   
   where the tz-naive timestamps are converted to the same tz-naive datetime.datetime in the to_pandas conversion (without any adjustment for system timezone). 
   (and IMO we need to keep this behaviour)





[GitHub] [arrow] github-actions[bot] commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652262316


   https://issues.apache.org/jira/browse/ARROW-9223





[GitHub] [arrow] BryanCutler commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-657286648


   > @BryanCutler if you don't have the bandwidth to make the PR to spark, I can devote a little bit of effort to submitting one.
   
   @emkornfield not sure if I'll have the time to get to it in the next couple of days, so if you're able to take it on that would be great.  I made https://issues.apache.org/jira/browse/SPARK-32285 for the issue - just leave a comment if you start work on it. BTW, the spark integration tests are currently failing due to recent Arrow Java changes and a patch will have to be made to fix that first. I should be able to get that done.





[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449130523



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -642,24 +641,27 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted this hack to
-  // preserve the <= 0.15.1 behavior until a better solution can be devised
+  // datetime.datetime does not support nanoseconds).
+  // We force the object conversion to preserve the value of the timezone.
   PandasOptions modified_options = options;
-  modified_options.ignore_timezone = true;
   modified_options.coerce_temporal_nanoseconds = false;
 
   for (int c = 0; c < data.num_chunks(); c++) {
     auto arr = checked_cast<const StructArray*>(data.chunk(c).get());
     // Convert the struct arrays first
     for (int32_t i = 0; i < num_fields; i++) {
+      // See notes above about conversion.
+      std::shared_ptr<Array> field = arr->field(static_cast<int>(i));
+      modified_options.timestamp_as_object =
+          field->type()->id() == Type::TIMESTAMP &&
+          !checked_cast<const TimestampType&>(*field->type()).timezone().empty();

Review comment:
       > Who does this? Does arrow do this in conversion? Or do you mean that PyArray_GETITEM does this? (AFAIK numpy used to do things with system time zone, but that should be solved nowadays)
   
   
   
   @jorisvandenbossche 
   
   Python's [datetime.astimezone assumes this](https://docs.python.org/3/library/datetime.html#datetime.datetime.astimezone) as of 3.6.  Since datetime essentially becomes the "display" I would think most people would assume system timezone values.
   
   > > Our timestamp -> datetime conversion assumes UTC as far as I can tell (like I said not sure what pandas does here).
   
   > Where do we assume this? (and this is about naive or aware timestamps?)
   
   Core [timestamp conversion doesn't handle timezones](https://github.com/apache/arrow/blob/edd88d7d222598550e4812c94194cbf973b20456/cpp/src/arrow/python/datetime.cc#L246).  For non-naive we use [fromutc](https://github.com/apache/arrow/blob/89cf7bdd6fc0d34c96f1a0e1df0a3505b308188d/python/pyarrow/scalar.pxi#L310) after the fact.
   
   
   







[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449371074



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -642,24 +641,27 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted this hack to
-  // preserve the <= 0.15.1 behavior until a better solution can be devised
+  // datetime.datetime does not support nanoseconds).
+  // We force the object conversion to preserve the value of the timezone.
   PandasOptions modified_options = options;
-  modified_options.ignore_timezone = true;
   modified_options.coerce_temporal_nanoseconds = false;
 
   for (int c = 0; c < data.num_chunks(); c++) {
     auto arr = checked_cast<const StructArray*>(data.chunk(c).get());
     // Convert the struct arrays first
     for (int32_t i = 0; i < num_fields; i++) {
+      // See notes above about conversion.
+      std::shared_ptr<Array> field = arr->field(static_cast<int>(i));
+      modified_options.timestamp_as_object =
+          field->type()->id() == Type::TIMESTAMP &&
+          !checked_cast<const TimestampType&>(*field->type()).timezone().empty();

Review comment:
       forced to always use to_object







[GitHub] [arrow] github-actions[bot] commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652857683


   Revision: d321a8d91da755f1e9b716fe1151f8ebbf01b4d4
   
   Submitted crossbow builds: [ursa-labs/crossbow @ actions-373](https://github.com/ursa-labs/crossbow/branches/all?query=actions-373)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.7-spark-master|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-373-github-test-conda-python-3.7-spark-master)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-373-github-test-conda-python-3.7-spark-master)|





[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-656984423


   > There is no way to deselect specific tests, afaik. You would have to select all the other tests individually from that module, for example
   
   My preference, if it is OK with everyone reviewing this, would be to avoid that and try to get something merged into Spark master quickly (I can own watching the nightly Spark jobs until that is done to verify there are no more failures).
   
   @BryanCutler if you don't have the bandwidth to make the PR to spark, I can devote a little bit of effort to submitting one.
   
   Thoughts?





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r448801572



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -642,24 +641,27 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted this hack to
-  // preserve the <= 0.15.1 behavior until a better solution can be devised
+  // datetime.datetime does not support nanoseconds).
+  // We force the object conversion to preserve the value of the timezone.
   PandasOptions modified_options = options;
-  modified_options.ignore_timezone = true;
   modified_options.coerce_temporal_nanoseconds = false;
 
   for (int c = 0; c < data.num_chunks(); c++) {
     auto arr = checked_cast<const StructArray*>(data.chunk(c).get());
     // Convert the struct arrays first
     for (int32_t i = 0; i < num_fields; i++) {
+      // See notes above about conversion.
+      std::shared_ptr<Array> field = arr->field(static_cast<int>(i));
+      modified_options.timestamp_as_object =
+          field->type()->id() == Type::TIMESTAMP &&
+          !checked_cast<const TimestampType&>(*field->type()).timezone().empty();

Review comment:
       We could maybe also *always* use the object path? (so also when there is no timezone) 
   The PyArray_GETITEM will convert it to a python datetime.datetime object later anyway, if I understand correctly? So then we can maybe already do this conversion ourselves also in the tz-naive case.

##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3321,9 +3321,12 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
-def test_struct_with_timestamp_tz():
+def test_nested_with_timestamp_tz():
     # ARROW-7723
     ts = pd.Timestamp.now()
    # This is used for verifying timezone conversion; the microseconds
    # are not important
+    ts_dt = ts.to_pydatetime().replace(microsecond=0)

Review comment:
       Why is this needed? (or, I suppose this is only needed if the unit is 's' or 'ms' ? If so, I would rather do this only in those cases, and for 'us' unit actually check the microseconds are also correct)







[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449130523



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -642,24 +641,27 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted this hack to
-  // preserve the <= 0.15.1 behavior until a better solution can be devised
+  // datetime.datetime does not support nanoseconds).
+  // We force the object conversion to preserve the value of the timezone.
   PandasOptions modified_options = options;
-  modified_options.ignore_timezone = true;
   modified_options.coerce_temporal_nanoseconds = false;
 
   for (int c = 0; c < data.num_chunks(); c++) {
     auto arr = checked_cast<const StructArray*>(data.chunk(c).get());
     // Convert the struct arrays first
     for (int32_t i = 0; i < num_fields; i++) {
+      // See notes above about conversion.
+      std::shared_ptr<Array> field = arr->field(static_cast<int>(i));
+      modified_options.timestamp_as_object =
+          field->type()->id() == Type::TIMESTAMP &&
+          !checked_cast<const TimestampType&>(*field->type()).timezone().empty();

Review comment:
       > Who does this? Does arrow do this in conversion? Or do you mean that PyArray_GETITEM does this? (AFAIK numpy used to do things with system time zone, but that should be solved nowadays)
   
   Python's [datetime.astimezone assumes this](https://docs.python.org/3/library/datetime.html#datetime.datetime.astimezone) as of 3.6.  Since datetime essentially becomes the "display" I would think most people would assume system timezone values.
   
   > > Our timestamp -> datetime conversion assumes UTC as far as I can tell (like I said not sure what pandas does here).
   
   > Where do we assume this? (and this is about naive or aware timestamps?)
   
   Core [timestamp conversion doesn't handle timezones](https://github.com/apache/arrow/blob/edd88d7d222598550e4812c94194cbf973b20456/cpp/src/arrow/python/datetime.cc#L246).  For non-naive we use [fromutc](https://github.com/apache/arrow/blob/89cf7bdd6fc0d34c96f1a0e1df0a3505b308188d/python/pyarrow/scalar.pxi#L310) after the fact.
   
   
   







[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449373516



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3321,9 +3321,12 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
-def test_struct_with_timestamp_tz():
+def test_nested_with_timestamp_tz():
     # ARROW-7723
     ts = pd.Timestamp.now()
    # This is used for verifying timezone conversion; the microseconds
    # are not important
+    ts_dt = ts.to_pydatetime().replace(microsecond=0)

Review comment:
       Sorry, I realize I responded to the Spark test failure on the wrong thread. @jorisvandenbossche I addressed your original feedback here.







[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r448445229



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -951,8 +951,21 @@ struct ObjectWriterVisitor {
   template <typename Type>
   enable_if_timestamp<Type, Status> Visit(const Type& type) {
     const TimeUnit::type unit = type.unit();
-    auto WrapValue = [unit](typename Type::c_type value, PyObject** out) {
+    PyObject* tzinfo = nullptr;
+    if (!type.timezone().empty()) {
+      RETURN_NOT_OK(internal::StringToTzinfo(type.timezone(), &tzinfo));
+    }
+    auto WrapValue = [unit, tzinfo](typename Type::c_type value, PyObject** out) {
+      OwnedRef tz(tzinfo);
       RETURN_NOT_OK(internal::PyDateTime_from_int(value, unit, out));
+      RETURN_IF_PYERROR();
+      if (tzinfo != nullptr) {
+        PyObject* with_tz = PyObject_CallMethod(*out, "astimezone", "O", tzinfo);

Review comment:
       should always set UTC here first.
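   
   A small sketch of why the order matters (pytz and the example zone are assumptions; only the order of operations is the point):
   
   ```
   import datetime
   import pytz

   # On a naive datetime, astimezone() assumes the *system* time zone
   # (Python >= 3.6), so the stored UTC value would be shifted by whatever
   # the local offset happens to be. Attaching UTC first makes the
   # conversion deterministic.
   stored = datetime.datetime(1970, 1, 1, 0, 0)               # Arrow stores tz-aware values as UTC
   tz = pytz.timezone('America/New_York')

   ambiguous = stored.astimezone(tz)                          # depends on the machine's TZ
   correct = stored.replace(tzinfo=pytz.utc).astimezone(tz)   # 1969-12-31 19:00:00-05:00
   print(ambiguous, correct)
   ```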







[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449096333



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -642,24 +641,27 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted this hack to
-  // preserve the <= 0.15.1 behavior until a better solution can be devised
+  // datetime.datetime does not support nanoseconds).
+  // We force the object conversion to preserve the value of the timezone.
   PandasOptions modified_options = options;
-  modified_options.ignore_timezone = true;
   modified_options.coerce_temporal_nanoseconds = false;
 
   for (int c = 0; c < data.num_chunks(); c++) {
     auto arr = checked_cast<const StructArray*>(data.chunk(c).get());
     // Convert the struct arrays first
     for (int32_t i = 0; i < num_fields; i++) {
+      // See notes above about conversion.
+      std::shared_ptr<Array> field = arr->field(static_cast<int>(i));
+      modified_options.timestamp_as_object =
+          field->type()->id() == Type::TIMESTAMP &&
+          !checked_cast<const TimestampType&>(*field->type()).timezone().empty();

Review comment:
       So for tz-naive timestamps I haven't verified the behavior, but my concern with conversion to datetime is this:
   - timezone-naive datetimes are read as reflecting the system timezone
   - our timestamp -> datetime conversion assumes UTC as far as I can tell (as I said, I'm not sure what pandas does here)
   
   Hence the resulting datetime might be confusing, because it represents the value assuming UTC rather than the local time zone.
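   A small illustration of the potential confusion (hypothetical instant and timezone, assuming the conversion treats the stored value as UTC):
   
   ```
   from datetime import datetime, timezone
   
   # Hypothetical stored instant: 2020-07-01 12:00 UTC.
   stored = datetime(2020, 7, 1, 12, 0, tzinfo=timezone.utc)
   
   # A conversion that assumes UTC hands back a naive 12:00 ...
   naive_from_utc = stored.replace(tzinfo=None)
   
   # ... while a user reading it as local time (say, in New York) would expect 08:00.
   local_wall_clock = stored.astimezone().replace(tzinfo=None)
   ```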
   




----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-656499067


   > Is there a way to skip specific tests? I thought all of them live in Spark code.
   
   With pytest you can deselect a single test by name (that's also what we do with the dask integration tests), but pyspark seems to use a custom test runner (https://github.com/apache/spark/blob/master/python/run-tests.py), so I'm not sure how to do it there. It doesn't seem to allow deselecting tests, only selecting them (https://github.com/apache/spark/blob/master/python/run-tests.py#L176-L216)


----------------------------------------------------------------



[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-656384231


   @jorisvandenbossche Is there a way to skip specific tests? I thought all of them live in Spark code.


----------------------------------------------------------------



[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449373710



##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -262,6 +265,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
                             PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {
+  static const std::regex kFixedOffset("([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$");

Review comment:
       Hmm, I thought I saw the regex header in other parts of the codebase.  But I went ahead and converted this to a function.




----------------------------------------------------------------



[GitHub] [arrow] BryanCutler edited a comment on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
BryanCutler edited a comment on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-654366148


   Looks like the Spark test fails with
   ```
      File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 593, in f
       "{} != {}".format(expected_key[i][1], window_range)
   AssertionError: 
   {'start': datetime.datetime(2018, 3, 10, 0, 0), 'end': datetime.datetime(2018, 3, 15, 0, 0)} != 
   {'start': datetime.datetime(2018, 3, 10, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
   ```
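   The mismatch itself is the usual naive-vs-aware comparison semantics in Python; a minimal reproduction (pytz used purely for illustration, not Spark's code):
   
   ```
   from datetime import datetime
   import pytz
   
   naive = datetime(2018, 3, 10, 0, 0)
   aware = pytz.timezone("Etc/UTC").localize(datetime(2018, 3, 10, 0, 0))
   
   # A tz-naive datetime never compares equal to a tz-aware one, even for the
   # same wall-clock value, so the expected and actual window keys differ.
   assert naive != aware
   ```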


----------------------------------------------------------------



[GitHub] [arrow] nealrichardson commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449123267



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3321,9 +3321,12 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
-def test_struct_with_timestamp_tz():
+def test_nested_with_timestamp_tz():
     # ARROW-7723
     ts = pd.Timestamp.now()
+    # This is used for verifying timezone conversion to micros are not
+    # important
+    ts_dt = ts.to_pydatetime().replace(microsecond=0)

Review comment:
       👋  another observation from the R side which may or may not be relevant here: https://github.com/sparklyr/sparklyr/issues/2439




----------------------------------------------------------------



[GitHub] [arrow] BryanCutler commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-656783691


   There is no way to deselect specific tests, afaik. You would have to select all the other tests individually from that module, for example
   ```
   spark_python_tests=( 
      ...
      "pyspark.sql.tests.test_pandas_grouped_map GroupedMapInPandasTests.test_supported_types"  
      "pyspark.sql.tests.test_pandas_grouped_map GroupedMapInPandasTests.test_array_type_correct"
      etc...
   )
   ``` 


----------------------------------------------------------------



[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-653081545


   > > > For timezone naive Timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to system time zone. thoughts?
   > > 
   > > 
   > > I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)
   > 
   > You've probably seen this, but this is what our docs say about timezone-naive data: http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE
   > 
   > (Side note: do our doc URLs have to be so ugly?)
   > 
   > I agree with Joris that at this level, we should not do any localizing of timezone-naive data.
   
   Please see reply: https://github.com/apache/arrow/pull/7604#discussion_r449096333
   
   I haven't looked at the code deeply, but this might be an overall bug, or at least an idiosyncrasy, in our conversion to/from datetime (not accounting for the system timezone).


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-657057892


   We have lately been running integration tests for both Spark master and Spark's 3.0 branch. But yes, @BryanCutler indicated that a fix for those branches could be possible, so it's certainly fine with me to merge as is without skipping any tests.


----------------------------------------------------------------



[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652259406


   CC @wesm @BryanCutler 
   
   Not sure if doing this correctly will break Spark in unintended ways.


----------------------------------------------------------------



[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449738860



##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -262,6 +302,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
                             PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {

Review comment:
       Added a pointer back to this PR. I think providing the full code is overkill, and for those familiar with C++ it should be fairly straightforward to map back to Python. Let me know if you feel strongly about this.




----------------------------------------------------------------



[GitHub] [arrow] wesm closed pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7604:
URL: https://github.com/apache/arrow/pull/7604


   


----------------------------------------------------------------



[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652765985


   One other concern: for timezone-naive timestamps, I'm not sure whether we should be adjusting the datetime to reflect UTC instead of the system time zone. Thoughts?


----------------------------------------------------------------



[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449119094



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -3321,9 +3321,12 @@ def test_cast_timestamp_unit():
     assert result.equals(expected)
 
 
-def test_struct_with_timestamp_tz():
+def test_nested_with_timestamp_tz():
     # ARROW-7723
     ts = pd.Timestamp.now()
+    # This is used for verifying timezone conversion to micros are not
+    # important
+    ts_dt = ts.to_pydatetime().replace(microsecond=0)

Review comment:
       > There is one failure (which involves datetimes with timezones, but can't directly understand if it also involves structs):
   > 
   > ```
   > ======================================================================
   > FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
   > ----------------------------------------------------------------------
   > Traceback (most recent call last):
   >   File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key
   >     self.assertTrue(all([r[0] for r in result]))
   > AssertionError: False is not true
   > 
   > ----------------------------------------------------------------------
   > Ran 21 tests in 141.194s
   > 
   > FAILED (failures=1)
   > ```
   > 
   > https://github.com/apache/spark/blob/7fda184f0fc39613fb68e912c189c54b93c638e6/python/pyspark/sql/tests/test_pandas_grouped_map.py#L546-L574
   
   Yeah this seems strange.  @BryanCutler do you have a sense of what is going on here?




----------------------------------------------------------------



[GitHub] [arrow] wesm commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449037355



##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -262,6 +265,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
                             PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {
+  static const std::regex kFixedOffset("([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$");

Review comment:
       This won't work for gcc 4.8.x (which we still support)




----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652956039


   There is one failure (which involves datetimes with timezones, but can't directly understand if it also involves structs): 
   
   ```
   ======================================================================
   FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key
       self.assertTrue(all([r[0] for r in result]))
   AssertionError: False is not true
   
   ----------------------------------------------------------------------
   Ran 21 tests in 141.194s
   
   FAILED (failures=1)
   ```
   
   https://github.com/apache/spark/blob/7fda184f0fc39613fb68e912c189c54b93c638e6/python/pyspark/sql/tests/test_pandas_grouped_map.py#L546-L574


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652814189


   > For timezone naive Timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to system time zone. thoughts?
   
   I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)
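   As a quick sanity check of that "left as is" behaviour (assuming current pyarrow semantics for tz-naive data; the value is illustrative):
   
   ```
   import pandas as pd
   import pyarrow as pa
   
   ts = pd.Timestamp("2020-07-01 12:00")   # tz-naive wall-clock value
   arr = pa.array([ts])                    # timestamp type with no timezone attached
   
   # Round-tripping keeps the same wall clock; no system timezone is consulted.
   assert arr.to_pylist()[0] == ts.to_pydatetime()
   ```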


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449479330



##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -262,6 +302,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
                             PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {

Review comment:
       It might go a bit too far (and can easily get outdated), but given that the code below is basically Python written in C++, it would help my understanding to include, as a comment, the short Python snippet it represents, e.g.
   
   ```
   def string_to_tzinfo(tz):
       import pytz
   
       match = ... 
       if match:
           sign, hours, minutes = ...
           return pytz.FixedOffset(sign * (hours * 60 + minutes))
       else:
           return pytz.timezone(tz)
   ```
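   For reference, a slightly fuller (still illustrative) version of that sketch, using the fixed-offset regex from the diff above; this is not the shipped implementation:
   
   ```
   import re
   import pytz
   
   _FIXED_OFFSET = re.compile(r"^([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$")
   
   def string_to_tzinfo(tz):
       match = _FIXED_OFFSET.match(tz)
       if match:
           sign = 1 if match.group(1) == "+" else -1
           hours, minutes = int(match.group(2)), int(match.group(3))
           return pytz.FixedOffset(sign * (hours * 60 + minutes))
       return pytz.timezone(tz)
   ```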

##########
File path: cpp/src/arrow/python/datetime.cc
##########
@@ -14,22 +14,62 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+#include "arrow/python/datetime.h"
 
 #include <algorithm>
 #include <chrono>
 #include <iostream>
 
 #include "arrow/python/common.h"
-#include "arrow/python/datetime.h"
+#include "arrow/python/helpers.h"
 #include "arrow/python/platform.h"
 #include "arrow/status.h"
 #include "arrow/type.h"
 #include "arrow/util/logging.h"
+#include "arrow/util/value_parsing.h"
 
 namespace arrow {
 namespace py {
 namespace internal {
 
+namespace {
+
+bool MatchFixedOffset(const std::string& tz, util::string_view* sign,
+                      util::string_view* hour, util::string_view* minute) {
+  if (tz.size() < 5) {

Review comment:
       Maybe add a comment here that this can be replaced with a simpler regex once we can use std::regex (e.g., pointing to this PR to see the regex)?




----------------------------------------------------------------