Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/06 15:02:56 UTC

[GitHub] [spark] moskvax opened a new pull request #28743: [SPARK-31920] Fix pandas conversion using Arrow with __arrow_array__ columns

moskvax opened a new pull request #28743:
URL: https://github.com/apache/spark/pull/28743


   ### What changes were proposed in this pull request?
   1. Cast pandas DataFrame columns to object before passing them to `pa.Schema.from_pandas`, to avoid a type check for `numpy.dtype` that can fail with pyarrow < 0.17.0.
   2. Check whether the column's array implements `__arrow_array__` before passing a mask to `pa.Array.from_pandas`, which raises an exception when a mask is passed for an array that implements `__arrow_array__`.
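
The check in item 2 can be sketched roughly as follows. This is a minimal, stdlib-only illustration rather than the PR's code: `ProtocolArray` and `fake_from_pandas` are hypothetical stand-ins for a pandas extension array and for `pa.Array.from_pandas`, and tuples stand in for Arrow arrays.

```python
class ProtocolArray:
    """Hypothetical stand-in for an extension array that implements __arrow_array__."""

    def __init__(self, values):
        self.values = list(values)

    def __arrow_array__(self, type=None):
        # A real implementation would return a pyarrow.Array; a tuple keeps this sketch stdlib-only.
        return ("protocol", type, self.values)


def fake_from_pandas(values, mask=None, type=None):
    """Hypothetical stand-in for pa.Array.from_pandas that records its arguments."""
    return ("from_pandas", type, list(values), list(mask))


def create_array(values, mask, type=None, from_pandas=fake_from_pandas):
    # Protocol path: the array produces its own Arrow data, so no mask is passed.
    if hasattr(values, "__arrow_array__"):
        return values.__arrow_array__(type=type)
    # Fallback path: plain values are converted with an explicit null mask.
    return from_pandas(values, mask=mask, type=type)
```

The point of the dispatch is that the mask is only ever handed to the fallback converter, so the combination that raises (protocol array plus mask) cannot occur.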
   
   ### Why are the changes needed?
   These changes allow the use of pandas DataFrames containing ExtensionDtype columns backed by arrays that implement `__arrow_array__`. Such columns arise when a DataFrame is constructed with an ExtensionDtype passed as the `dtype` parameter, or when `convert_dtypes` is called on an existing DataFrame.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Users will be able to convert a wider variety of pandas DataFrames into Spark DataFrames using any currently released pyarrow version > 0.15.1. Prior to this fix, the Arrow conversion path would not work with these DataFrames.
   
   ### How was this patch tested?
   Tests were added to cover the cases of converting from pandas DataFrames with `IntegerArray` and `StringArray` backed columns. A typo was also fixed in a recently added test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-661069055


   Can one of the admins verify this patch?




[GitHub] [spark] careyhay commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
careyhay commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-952668366


   Any way this can be revived and pulled?!




[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r438051160



##########
File path: python/pyspark/sql/pandas/serializers.py
##########
@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)
 
         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):
                 # Note: This can be removed once minimum pyarrow version is >= 0.16.1
                 s = s.astype(s.dtypes.categories.dtype)
             try:
-                array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
+                mask = s.isnull()
+                # pass _ndarray_values to avoid potential failed type checks from pandas array types

Review comment:
       This is a workaround for `IntegerArray` in pre-1.0.0 pandas, which did not yet implement `__arrow_array__`. Without the protocol, pyarrow expects a NumPy array and fails otherwise:
   
   ```pycon
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> print(pd.__version__, pa.__version__)
   0.25.0 0.17.1
   >>> s = pd.Series(range(3), dtype=pd.Int64Dtype())
   >>> pa.Array.from_pandas(s)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
     File "pyarrow/types.pxi", line 76, in pyarrow.lib._datatype_to_pep3118
     File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type
     File "pyarrow/error.pxi", line 108, in pyarrow.lib.check_status
   pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
   >>> pa.Array.from_pandas(s, type=pa.int64())
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Input object was not a NumPy array
   >>> pa.Array.from_pandas(s._ndarray_values, type=pa.int64())
   <pyarrow.lib.Int64Array object at 0x7fb88007a980>
   [
     0,
     1,
     2
   ]
   >>>
   ```
   I'll update the comment to mention this.






[GitHub] [spark] SparkQA removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640724799








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641989192








[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r438840313



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)

Review comment:
       > For the second case above, does `pa.Schema.from_pandas` return the correct types? When `pa.infer_type` is given the specified array types, does it throw an error or return a wrong array type?
   
   `pa.infer_type` will throw an error for these arrays.






[GitHub] [spark] SparkQA removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641958702


   **[Test build #123762 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123762/testReport)** for PR 28743 at commit [`07d7f2a`](https://github.com/apache/spark/commit/07d7f2abc9cf54a3a0bf42d0555490e51987727e).




[GitHub] [spark] SparkQA removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640135313


   **[Test build #123593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123593/testReport)** for PR 28743 at commit [`04a15f6`](https://github.com/apache/spark/commit/04a15f6ff4cbf0cc4cdd3c9573af5c08768c3516).




[GitHub] [spark] maropu commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640139579


   Thanks for your work, @moskvax! The failures look valid, so could you fix them first?




[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642703848


   **[Test build #123852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123852/testReport)** for PR 28743 at commit [`01fb6a4`](https://github.com/apache/spark/commit/01fb6a4c25f830ec6a52d1030b2a6790a3dbae4f).




[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437887602



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
+                              for s in (pdf[c] for c in pdf)]
             struct = StructType()
-            for name, field in zip(schema, arrow_schema):
-                struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+            for name, t in zip(schema, inferred_types):
+                struct.add(name, from_arrow_type(t), nullable=True)

Review comment:
       Let's add a comment here to explain it?






[GitHub] [spark] SparkQA removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641721923


   **[Test build #123725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)** for PR 28743 at commit [`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).




[GitHub] [spark] github-actions[bot] commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-718293083


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r438245814



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)

Review comment:
       For the second case above, does `pa.Schema.from_pandas` return the correct types? When `pa.infer_type` is given the specified array types, does it throw an error or return a wrong array type?






[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640136974








[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640074996


   Can one of the admins verify this patch?




[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641955519








[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642732288








[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437858625



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -30,10 +30,14 @@
     pandas_requirement_message, pyarrow_requirement_message
 from pyspark.testing.utils import QuietTest
 from pyspark.util import _exception_message
+from distutils.version import LooseVersion
 
 if have_pandas:
     import pandas as pd
     from pandas.util.testing import assert_frame_equal
+    pandas_version = LooseVersion(pd.__version__)
+else:
+    pandas_version = LooseVersion("0")

Review comment:
       Nice, will update
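
For reference, a version gate like the one in this hunk can also be sketched without `distutils`. This is a hedged illustration, not the change made in the PR: `version_tuple` and `has_arrow_protocol_integer_array` are hypothetical helper names, and the 1.0.0 threshold is an assumption taken from the remark elsewhere in this thread that pre-1.0.0 `IntegerArray` does not implement `__arrow_array__`.

```python
import re


def version_tuple(version):
    """Parse the leading numeric components of a version string into a tuple.

    Pre-release suffixes such as 'rc1' are ignored, and short versions are
    padded with zeros to three components so comparisons behave predictably.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_arrow_protocol_integer_array(pandas_version):
    # Assumed threshold: pandas 1.0.0 (see the lead-in; not confirmed by the hunk itself).
    return version_tuple(pandas_version) >= (1, 0, 0)
```

A test could then be skipped with something like `unittest.skipIf(not has_arrow_protocol_integer_array(pd.__version__), ...)` when pandas is too old.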






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642732288








[GitHub] [spark] SparkQA removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642703848


   **[Test build #123852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123852/testReport)** for PR 28743 at commit [`01fb6a4`](https://github.com/apache/spark/commit/01fb6a4c25f830ec6a52d1030b2a6790a3dbae4f).




[GitHub] [spark] HyukjinKwon commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641702673


   cc @BryanCutler FYI




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640074996


   Can one of the admins verify this patch?




[GitHub] [spark] maropu commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640135230


   ok to test




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640136976


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123593/
   Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641989192








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640135423








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642704516








[GitHub] [spark] HyukjinKwon commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437843397



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
+                              for s in (pdf[c] for c in pdf)]
             struct = StructType()
-            for name, field in zip(schema, arrow_schema):
-                struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+            for name, t in zip(schema, inferred_types):
+                struct.add(name, from_arrow_type(t), nullable=True)

Review comment:
       Why don't we follow nullability anymore?






[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640135423








[GitHub] [spark] Pverheijen commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
Pverheijen commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-861449072


   Can this be pulled? 




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641955519








[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437893088



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)

Review comment:
       So without this change, `pa.Schema.from_pandas` cannot handle pandas extension types and pd.NA values?






[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640075128


   Can one of the admins verify this patch?




[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642731010


   **[Test build #123852 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123852/testReport)** for PR 28743 at commit [`01fb6a4`](https://github.com/apache/spark/commit/01fb6a4c25f830ec6a52d1030b2a6790a3dbae4f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `trait TimestampFormatterHelper extends TimeZoneAwareExpression `
     * `case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger `
     * `case class ContinuousTrigger(intervalMs: Long) extends Trigger `




[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r438051033



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
+                              for s in (pdf[c] for c in pdf)]
             struct = StructType()
-            for name, field in zip(schema, arrow_schema):
-                struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+            for name, t in zip(schema, inferred_types):
+                struct.add(name, from_arrow_type(t), nullable=True)

Review comment:
       Sounds good, will update with a comment.
   
   Alternatively, `any(s.isna())` could be checked if we wanted to actively infer nullability here. However, that would change existing behavior and would be inconsistent with the non-Arrow path, which likewise defaults inferred types to nullable: https://github.com/apache/spark/blob/43063e2db2bf7469f985f1954d8615b95cf5c578/python/pyspark/sql/types.py#L1069
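
The two options discussed above can be contrasted with a minimal sketch. It uses plain Python lists with `None` standing in for pandas missing values, and `infer_nullability` is a hypothetical helper, not part of PySpark or pyarrow. With a real pandas Series the equivalent check would be `s.isna().any()`.

```python
from typing import Dict, List, Optional


def infer_nullability(columns: Dict[str, List[Optional[object]]]) -> Dict[str, bool]:
    """Mark a column nullable only if it actually contains a missing value.

    `None` is an assumed stand-in for pandas missing values in this sketch.
    """
    return {name: any(v is None for v in values) for name, values in columns.items()}


columns = {"a": [1, 2, 3], "b": [1.5, None, 3.0]}

# Option 1: data-dependent nullability (the `any(s.isna())` idea).
inferred = infer_nullability(columns)

# Option 2: every inferred field is nullable (the default the PR keeps).
always_nullable = {name: True for name in columns}
```

The data-dependent variant makes the resulting schema depend on the particular rows passed in, which is one reason to keep the fixed `nullable=True` default.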






[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437872692



##########
File path: python/pyspark/sql/pandas/serializers.py
##########
@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)
 
         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):

Review comment:
       By the way, this change was made as `CategoricalDtype` is only imported into the root pandas namespace after pandas 0.24.0, which was causing `AttributeError` when testing with earlier versions.






[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437858389



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
+                              for s in (pdf[c] for c in pdf)]
             struct = StructType()
-            for name, field in zip(schema, arrow_schema):
-                struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+            for name, t in zip(schema, inferred_types):
+                struct.add(name, from_arrow_type(t), nullable=True)

Review comment:
       `infer_type` returns only a type, not a `field`, so it carries no nullability information. However, the implementation of `Schema.from_pandas` ([link](https://github.com/apache/arrow/blob/b058cf0d1c26ad7984c104bb84322cc7dcc66f00/python/pyarrow/types.pxi#L1328)) does not appear to infer nullability either: it always returns the default `nullable=True`. So this change just follows the existing behaviour of `Schema.from_pandas`.






[GitHub] [spark] github-actions[bot] closed pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #28743:
URL: https://github.com/apache/spark/pull/28743


   




[GitHub] [spark] maropu commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640135250


   cc: @HyukjinKwon @viirya 




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641736976








[GitHub] [spark] moskvax commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642700256


   > Thanks @moskvax , adding support for extension types would be great! I'm not sure using `pa.infer_type` is the way to go though, I think it's better to handle these cases explicitly by getting the `pa.ExtensionType` from `pa.Schema.from_pandas` and then extracting the `storage_type` from there. Would that be possible?
   
   The goal of this PR was to allow conversion for `__arrow_array__`-implementing arrays of `ExtensionDtype` values where the underlying type can be directly converted to primitive Arrow and Spark types, so I wasn't focusing on this case at first, but I've looked into it today following the approach you described. 
   
   The `storage_type` of the `pa.ExtensionType` of `PeriodArray` is `int64`, which can be converted to a Spark column using the `PeriodArray`'s `_ndarray_values`. However, without the `PeriodDtype.freq`, the period information cannot be reconstructed and the result in Spark is an arbitrary-looking sequence of integers: 
   
   ```pycon
   >>> periods = pd.period_range('2020-01-01', freq='M', periods=6)
   >>> pdf = pd.DataFrame({'A': pd.Series(periods)})
   >>> pdf
            A
   0  2020-01
   1  2020-02
   2  2020-03
   3  2020-04
   4  2020-05
   5  2020-06
   >>> pdf.dtypes
   A    period[M]
   dtype: object
   >>> df = spark.createDataFrame(pdf)
   >>> df.show()
   +---+
   |  A|
   +---+
   |600|
   |601|
   |602|
   |603|
   |604|
   |605|
   +---+
   
   >>> df.schema
   StructType(List(StructField(A,LongType,true)))
   ```
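
The integers in the output above are not arbitrary once the frequency is known. The sketch below assumes that a monthly period is stored as the number of months elapsed since 1970-01, which is consistent with the values 600 to 605 shown for 2020-01 to 2020-06. The helper names are hypothetical and no pandas is required.

```python
def month_ordinal(year: int, month: int) -> int:
    """Months elapsed since 1970-01 (assumed storage for monthly periods)."""
    return (year - 1970) * 12 + (month - 1)


def ordinal_to_month(ordinal: int) -> str:
    """Invert month_ordinal; only possible because the frequency is known."""
    years, month0 = divmod(ordinal, 12)
    return f"{1970 + years:04d}-{month0 + 1:02d}"


ordinals = [month_ordinal(2020, m) for m in range(1, 7)]
labels = [ordinal_to_month(o) for o in ordinals]
```

Without the frequency, a consumer of the `LongType` column cannot tell whether 600 counts months, days, or something else, which is the information loss described here.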
   
   `IntervalArray` has an Arrow extension type with a `storage_type` of `StructType(struct<left: timestamp[ns], right: timestamp[ns]>)`. This could be converted to a Spark `StructType` column if `StructType` conversion were supported by the Arrow conversion path, but the `closed` information would still be missing from this schema.
   
   So, in the cases where it is possible to convert using the `storage_type`, I think there should be a warning that the results may be unexpected, because any type metadata needed to interpret the values meaningfully is discarded. Additionally, the round trip back to pandas won't be possible for these types.
   
   As for `pa.Schema.from_pandas`, its main advantage over `pa.infer_type` for Spark conversion is when the array it is processing implements `__arrow_array__` and can therefore immediately and unambiguously return its own Arrow type. I've updated the PR to first try `__arrow_array__` to determine a type and then fall back on `pa.infer_type`. What do you think of this approach?
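
The "protocol first, inference second" order described above can be sketched without pyarrow. The stub classes and `naive_infer` below are hypothetical stand-ins; in the PR's setting the fallback would presumably be `pa.infer_type`, and the object returned by `__arrow_array__` would be a real pyarrow array exposing `.type`.

```python
from typing import Any, Callable


class StubArrowArray:
    """Hypothetical stand-in for a pyarrow array; only `.type` is modelled."""

    def __init__(self, type_name: str) -> None:
        self.type = type_name


class StubExtensionArray:
    """Hypothetical pandas-like array implementing `__arrow_array__`."""

    def __arrow_array__(self, type: Any = None) -> StubArrowArray:
        return StubArrowArray(type or "int64")


def resolve_arrow_type(values: Any, infer_fallback: Callable[[Any], Any]) -> Any:
    """Prefer the type the array reports about itself; otherwise infer it."""
    if hasattr(values, "__arrow_array__"):
        return values.__arrow_array__().type
    return infer_fallback(values)


def naive_infer(values: Any) -> str:
    """Toy fallback used only for this sketch."""
    return "double" if any(isinstance(v, float) for v in values) else "int64"


from_protocol = resolve_arrow_type(StubExtensionArray(), naive_infer)
from_fallback = resolve_arrow_type([1.0, 2.5], naive_infer)
```

Passing the fallback in as a callable keeps the dispatch logic independent of any particular inference routine.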




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640136974


   Merged build finished. Test FAILed.




[GitHub] [spark] BryanCutler commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r438466149



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)

Review comment:
       `pa.Schema.from_pandas` will return a type that is a subclass of `pa.ExtensionType`. That instance defines a `storage_type`, which could then be checked against the types Spark supports. This assumes the pandas extension array implements `__arrow_array__`, which is recommended; see https://arrow.apache.org/docs/python/extending_types.html#controlling-conversion-to-pyarrow-array-with-the-arrow-array-protocol.
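
A rough sketch of the suggestion above, using duck typing so that it runs without pyarrow. `StubExtensionType`, the `SUPPORTED_STORAGE` set, and `to_storage_type` are all hypothetical; only the `storage_type` attribute name is taken from `pa.ExtensionType`. The warning illustrates one way to flag that extension metadata is dropped.

```python
import warnings
from typing import Any, Set

# Hypothetical set of storage types the target system can represent.
SUPPORTED_STORAGE: Set[str] = {"int64", "double", "string"}


class StubExtensionType:
    """Hypothetical stand-in for a `pa.ExtensionType` subclass."""

    def __init__(self, name: str, storage_type: str) -> None:
        self.name = name
        self.storage_type = storage_type


def to_storage_type(arrow_type: Any, supported: Set[str] = SUPPORTED_STORAGE) -> Any:
    """Unwrap an extension type to its storage type, warning about lost metadata."""
    storage = getattr(arrow_type, "storage_type", None)
    if storage is None:
        return arrow_type  # not an extension type; nothing to unwrap
    if storage not in supported:
        raise TypeError(f"unsupported storage type: {storage}")
    warnings.warn(
        f"converting {arrow_type.name} via storage type {storage}; "
        "extension metadata is discarded",
        UserWarning,
    )
    return storage


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unwrapped = to_storage_type(StubExtensionType("period[M]", "int64"))

plain = to_storage_type("double")
```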






[GitHub] [spark] HyukjinKwon commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437844741



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -30,10 +30,14 @@
     pandas_requirement_message, pyarrow_requirement_message
 from pyspark.testing.utils import QuietTest
 from pyspark.util import _exception_message
+from distutils.version import LooseVersion
 
 if have_pandas:
     import pandas as pd
     from pandas.util.testing import assert_frame_equal
+    pandas_version = LooseVersion(pd.__version__)
+else:
+    pandas_version = LooseVersion("0")

Review comment:
       I would do something like:
   
   ```python
   pandas_version = None
   if have_pandas:
       import pandas as pd
       from pandas.util.testing import assert_frame_equal
       pandas_version = LooseVersion(pd.__version__)
   ```
   
   if the linter is not happy with it.
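
The same "default to `None`, fill in when the dependency is present" pattern can be sketched with the standard library only. This is an illustration rather than the code under review: `parse_version` and `optional_version` are hypothetical helpers, and a simple tuple comparison is used in place of `LooseVersion`.

```python
import importlib
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Keep the leading numeric components only, e.g. '1.0.4' -> (1, 0, 4)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def optional_version(module_name: str) -> Optional[Tuple[int, ...]]:
    """Return the parsed version of a module, or None if it is not installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return parse_version(getattr(module, "__version__", "0"))


missing = optional_version("surely_not_an_installed_module_xyz")
needs_newer_pandas = parse_version("0.23.4") < (0, 24)
```

Tests can then be skipped when the version is `None` or below the required minimum, instead of comparing against a placeholder such as `LooseVersion("0")`.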






[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640135313


   **[Test build #123593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123593/testReport)** for PR 28743 at commit [`04a15f6`](https://github.com/apache/spark/commit/04a15f6ff4cbf0cc4cdd3c9573af5c08768c3516).




[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641958702


   **[Test build #123762 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123762/testReport)** for PR 28743 at commit [`07d7f2a`](https://github.com/apache/spark/commit/07d7f2abc9cf54a3a0bf42d0555490e51987727e).




[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640724799








[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON][WIP] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640721541








[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641722278








[GitHub] [spark] moskvax commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641693507


   @HyukjinKwon @viirya Please review when you've got a moment. Thank you.




[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r438051119



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)

Review comment:
       pyarrow < 0.17.0 cannot handle either ([ARROW-8159](https://issues.apache.org/jira/browse/ARROW-8159)). pyarrow 0.17.x works as long as the columns that contain `pd.NA` values are not `object`-dtyped, which is the case by default as of pandas 1.0.4 (cf. pandas-dev/pandas#32931). `pa.infer_type` accepts a mask, so it never tries to infer the type of `pd.NA` values, and that inference is what causes `pa.Schema.from_pandas` to fail here.
   
   `pa.Schema.from_pandas` returns different types from `pa.infer_type` in two cases:
   1. `Categorical` arrays
       * `pa.Schema.from_pandas` returns a `DictionaryType`
       * `pa.infer_type` returns the `value_type` of the `DictionaryType`, which is what is already used to determine the Spark type of the resulting column
   2. `__arrow_array__`-implementing arrays which return a specialised Arrow type (`IntervalArray`, `PeriodArray`)
       * `pa.Schema.from_pandas` returns the type of the array returned by `__arrow_array__`
       * `pa.infer_type` does not check for `__arrow_array__` and thus fails with these arrays; however, these types cannot currently be converted to Spark types anyway
   
   Neither of these cases causes a regression, which is why I propose replacing `pa.Schema.from_pandas` with `pa.infer_type` here.

##########
File path: python/pyspark/sql/pandas/serializers.py
##########
@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)
 
         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):
                 # Note: This can be removed once minimum pyarrow version is >= 0.16.1
                 s = s.astype(s.dtypes.categories.dtype)
             try:
-                array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
+                mask = s.isnull()
+                # pass _ndarray_values to avoid potential failed type checks from pandas array types

Review comment:
       This is a workaround for `IntegerArray` in pre-1.0.0 pandas, which did not yet implement `__arrow_array__`, so pyarrow expects it to be a NumPy array:
   
   ```pycon
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> print(pd.__version__, pa.__version__)
   0.25.0 0.17.1
   >>> s = pd.Series(range(3), dtype=pd.Int64Dtype())
   >>> pa.Array.from_pandas(s)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
     File "pyarrow/types.pxi", line 76, in pyarrow.lib._datatype_to_pep3118
     File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type
     File "pyarrow/error.pxi", line 108, in pyarrow.lib.check_status
   pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
   >>> pa.Array.from_pandas(s, type=pa.int64())
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Input object was not a NumPy array
   >>> pa.Array.from_pandas(s._ndarray_values, type=pa.int64())
   <pyarrow.lib.Int64Array object at 0x7fb88007a980>
   [
     0,
     1,
     2
   ]
   >>>
   ```
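
The ordering used in the `create_array` hunk above (try `__arrow_array__` first, and build a null mask only on the fallback path) can be sketched without pandas or pyarrow. Everything in this block is a hypothetical stand-in: `PlainValues`, `ProtocolValues` and `masked_convert` are illustrative names, not part of PySpark, pandas or pyarrow, and the tuples they return merely mark which path was taken.

```python
class PlainValues:
    """Hypothetical stand-in for a column without __arrow_array__."""

    def __init__(self, values):
        self.values = list(values)


class ProtocolValues(PlainValues):
    """Hypothetical stand-in for an extension array that converts itself."""

    def __arrow_array__(self, type=None):
        # A real implementation would return a pyarrow.Array here.
        return ("via __arrow_array__", self.values, type)


def masked_convert(values, mask, type=None):
    """Stand-in for a generic conversion that takes an explicit null mask."""
    return ("via mask", [None if m else v for v, m in zip(values, mask)], type)


def create_array(column, type=None):
    # Prefer the protocol: the array builds its own Arrow data, so no mask
    # is computed or passed on this path.
    if hasattr(column, "__arrow_array__"):
        return column.__arrow_array__(type=type)
    # Otherwise compute the null mask and use the generic conversion.
    mask = [v is None for v in column.values]
    return masked_convert(column.values, mask, type=type)


print(create_array(ProtocolValues([1, None, 3]), type="int64"))
print(create_array(PlainValues([1, None, 3]), type="int64"))
```

Keeping the mask computation on the fallback branch means the two mechanisms are never combined, which is the situation the PR description says raises an exception.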






[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641721923


   **[Test build #123725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)** for PR 28743 at commit [`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).




[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641988470


   **[Test build #123762 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123762/testReport)** for PR 28743 at commit [`07d7f2a`](https://github.com/apache/spark/commit/07d7f2abc9cf54a3a0bf42d0555490e51987727e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON][WIP] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640721541








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641722278


   Merged build finished. Test PASSed.




[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641736450


   **[Test build #123725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)** for PR 28743 at commit [`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640075128


   Can one of the admins verify this patch?




[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641736976








[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-640136929


   **[Test build #123593 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123593/testReport)** for PR 28743 at commit [`04a15f6`](https://github.com/apache/spark/commit/04a15f6ff4cbf0cc4cdd3c9573af5c08768c3516).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642704516








[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437896443



##########
File path: python/pyspark/sql/pandas/serializers.py
##########
@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)
 
         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):
                 # Note: This can be removed once minimum pyarrow version is >= 0.16.1
                 s = s.astype(s.dtypes.categories.dtype)
             try:
-                array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
+                mask = s.isnull()
+                # pass _ndarray_values to avoid potential failed type checks from pandas array types

Review comment:
       Is there any test case for this?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641722284


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/28349/
   Test PASSed.




[GitHub] [spark] howardcornwell commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Posted by GitBox <gi...@apache.org>.
howardcornwell commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-953038520


   Started hitting this issue today. Can this be reviewed and pulled?

