You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2018/12/13 04:51:55 UTC

[GitHub] HyukjinKwon commented on a change in pull request #23305: [SPARK-26355][PYSPARK] Add a workaround for PyArrow 0.11.

HyukjinKwon commented on a change in pull request #23305: [SPARK-26355][PYSPARK] Add a workaround for PyArrow 0.11.
URL: https://github.com/apache/spark/pull/23305#discussion_r241271368
 
 

 ##########
 File path: python/pyspark/serializers.py
 ##########
 @@ -281,7 +281,10 @@ def create_array(s, t):
             # TODO: see ARROW-2432. Remove when the minimum PyArrow version becomes 0.10.0.
             return pa.Array.from_pandas(s.apply(
                 lambda v: decimal.Decimal('NaN') if v is None else v), mask=mask, type=t)
-        return pa.Array.from_pandas(s, mask=mask, type=t)
+        elif LooseVersion(pa.__version__) < LooseVersion("0.11.0"):
+            # TODO: see ARROW-1949. Remove when the minimum PyArrow version becomes 0.11.0.
+            return pa.Array.from_pandas(s, mask=mask, type=t)
+        return pa.Array.from_pandas(s, mask=mask, type=t, safe=False)
 
 Review comment:
   Hmmmm .. @ueshin, looks `safe=True` (which is the default) guards conversions that potentially causes overflows.
   Do you mind if I ask which behaviour change you faced? I ran simple tests as below, and looks same.
   
   ```python
   >>> pa.__version__
   '0.9.0.post1'
   >>>
   >>> import pyarrow as pa
   >>> import pandas as pd
   >>> s = pd.Series([pd.Timestamp(1)])
   >>> arr = pa.Array.from_pandas(s, type=pa.timestamp('us'))
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
     File "array.pxi", line 177, in pyarrow.lib.array
     File "error.pxi", line 77, in pyarrow.lib.check_status
     File "error.pxi", line 77, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1
   ```
   
   ```python
   >>> pa.__version__
   '0.11.1'
   >>>
   >>> import pyarrow as pa
   >>> import pandas as pd
   >>> s = pd.Series([pd.Timestamp(1)])
   >>> arr = pa.Array.from_pandas(s, type=pa.timestamp('us'))
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 474, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org