
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/06 13:11:17 UTC

[GitHub] [arrow] pitrou opened a new pull request #8361: ARROW-10192: [Python] Always decode inner dictionaries when converting array to Pandas

pitrou opened a new pull request #8361:
URL: https://github.com/apache/arrow/pull/8361


   Fix a crash on conversion of e.g. struct of dictionaries.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] kszucs commented on a change in pull request #8361: ARROW-10192: [Python] Always decode inner dictionaries when converting array to Pandas

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #8361:
URL: https://github.com/apache/arrow/pull/8361#discussion_r500499238



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -641,29 +659,18 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // In ARROW-7723, we found as a result of ARROW-3789 that second
-  // through microsecond resolution tz-aware timestamps were being promoted to
-  // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
-  // array in this function. PyArray_GETITEM returns datetime.datetime for
-  // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds).
-  // We force the object conversion to preserve the value of the timezone.
-  // Nanoseconds are returned integers inside of structs.
-  PandasOptions modified_options = options;
-  modified_options.coerce_temporal_nanoseconds = false;
+  options = MakeInnerOptions(std::move(options));
+  // See notes in MakeInnerOptions about timestamp conversion.
+  // Don't blindly convert because

Review comment:
       Yes, updated.





[GitHub] [arrow] kszucs commented on a change in pull request #8361: ARROW-10192: [Python] Always decode inner dictionaries when converting array to Pandas

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #8361:
URL: https://github.com/apache/arrow/pull/8361#discussion_r500499506



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -2297,6 +2310,24 @@ def test_from_tuples(self):
             df, expected=expected_df, schema=expected_schema,
             expected_schema=expected_schema)
 
+    def test_struct_of_dictionary(self):
+        names = ['ints', 'strs']
+        children = [pa.array([456, 789, 456]).dictionary_encode(),
+                    pa.array(["foo", "foo", None]).dictionary_encode()]
+        arr = pa.StructArray.from_arrays(children, names=names)
+
+        # Expected a Series of {field name: field value} dicts
+        rows_as_tuples = zip(*(child.to_pylist() for child in children))
+        rows_as_dicts = [dict(zip(names, row)) for row in rows_as_tuples]

Review comment:
       Expanded the nested comprehension to make it a bit more readable.
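
       For readers following along, the two-step construction in the test can be sketched without pyarrow. This is only an illustration: the literal lists below stand in for the `child.to_pylist()` results and are an assumption about what the dictionary-encoded children decode to.

```python
names = ['ints', 'strs']

# Stand-ins for child.to_pylist(); assumed decoded values of the two
# dictionary-encoded children used in the test.
children_as_lists = [[456, 789, 456], ["foo", "foo", None]]

# Step 1: transpose the per-field columns into per-row tuples.
rows_as_tuples = zip(*children_as_lists)

# Step 2: pair each row's values with the field names.
rows_as_dicts = [dict(zip(names, row)) for row in rows_as_tuples]

print(rows_as_dicts)
# [{'ints': 456, 'strs': 'foo'}, {'ints': 789, 'strs': 'foo'}, {'ints': 456, 'strs': None}]
```

       Splitting the transpose from the dict construction keeps each line to a single `zip`, which is the readability gain the comment refers to.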





[GitHub] [arrow] kszucs closed pull request #8361: ARROW-10192: [Python] Always decode inner dictionaries when converting array to Pandas

Posted by GitBox <gi...@apache.org>.

kszucs closed pull request #8361:
URL: https://github.com/apache/arrow/pull/8361


   



[GitHub] [arrow] kszucs commented on a change in pull request #8361: ARROW-10192: [Python] Always decode inner dictionaries when converting array to Pandas

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #8361:
URL: https://github.com/apache/arrow/pull/8361#discussion_r500483306



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -641,29 +659,18 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // In ARROW-7723, we found as a result of ARROW-3789 that second
-  // through microsecond resolution tz-aware timestamps were being promoted to
-  // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
-  // array in this function. PyArray_GETITEM returns datetime.datetime for
-  // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds).
-  // We force the object conversion to preserve the value of the timezone.
-  // Nanoseconds are returned integers inside of structs.
-  PandasOptions modified_options = options;
-  modified_options.coerce_temporal_nanoseconds = false;
+  options = MakeInnerOptions(std::move(options));
+  // See notes in MakeInnerOptions about timestamp conversion.
+  // Don't blindly convert because

Review comment:
       Seems like the second comment line is missing: ` // timestamps in lists are handled differently.` 
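
       As background for the removed comment, the limitation it mentions (that `datetime.datetime` does not support nanoseconds) can be seen with the standard library alone. This is a minimal sketch, not how Arrow performs the conversion; `split_nanos` and the sample value are hypothetical names introduced only for illustration.

```python
import datetime

# datetime.datetime stores at most microsecond precision.
assert datetime.datetime.resolution == datetime.timedelta(microseconds=1)


def split_nanos(ns_since_epoch):
    """Split an integer nanosecond timestamp into (UTC datetime, leftover ns).

    Hypothetical helper: the leftover nanoseconds cannot be stored in the
    datetime object, which is why a plain integer keeps the full value.
    """
    micros, leftover_ns = divmod(ns_since_epoch, 1000)
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    return epoch + datetime.timedelta(microseconds=micros), leftover_ns


ns_value = 1_601_989_877_123_456_789  # hypothetical nanosecond timestamp
dt, leftover = split_nanos(ns_value)
print(dt, leftover)
```

       The last three digits of `ns_value` end up in `leftover` because the datetime object has nowhere to hold them.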





[GitHub] [arrow] github-actions[bot] commented on pull request #8361: ARROW-10192: [Python] Always decode inner dictionaries when converting array to Pandas

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8361:
URL: https://github.com/apache/arrow/pull/8361#issuecomment-704266502


   https://issues.apache.org/jira/browse/ARROW-10192



[GitHub] [arrow] pitrou commented on a change in pull request #8361: ARROW-10192: [Python] Always decode inner dictionaries when converting array to Pandas

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #8361:
URL: https://github.com/apache/arrow/pull/8361#discussion_r500496450



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -641,29 +659,18 @@ inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& da
   std::vector<OwnedRef> fields_data(num_fields);
   OwnedRef dict_item;
 
-  // In ARROW-7723, we found as a result of ARROW-3789 that second
-  // through microsecond resolution tz-aware timestamps were being promoted to
-  // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
-  // array in this function. PyArray_GETITEM returns datetime.datetime for
-  // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds).
-  // We force the object conversion to preserve the value of the timezone.
-  // Nanoseconds are returned integers inside of structs.
-  PandasOptions modified_options = options;
-  modified_options.coerce_temporal_nanoseconds = false;
+  options = MakeInnerOptions(std::move(options));
+  // See notes in MakeInnerOptions about timestamp conversion.
+  // Don't blindly convert because

Review comment:
       Ah, right. Do you want to make the change and push?





[GitHub] [arrow] kszucs commented on a change in pull request #8361: ARROW-10192: [Python] Always decode inner dictionaries when converting array to Pandas

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #8361:
URL: https://github.com/apache/arrow/pull/8361#discussion_r500499506



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -2297,6 +2310,24 @@ def test_from_tuples(self):
             df, expected=expected_df, schema=expected_schema,
             expected_schema=expected_schema)
 
+    def test_struct_of_dictionary(self):
+        names = ['ints', 'strs']
+        children = [pa.array([456, 789, 456]).dictionary_encode(),
+                    pa.array(["foo", "foo", None]).dictionary_encode()]
+        arr = pa.StructArray.from_arrays(children, names=names)
+
+        # Expected a Series of {field name: field value} dicts
+        rows_as_tuples = zip(*(child.to_pylist() for child in children))
+        rows_as_dicts = [dict(zip(names, row)) for row in rows_as_tuples]

Review comment:
       Expanded the nested comprehension to make it a bit more readable.



