Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/09/14 03:42:48 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #42920: [WIP][SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

zhengruifeng opened a new pull request, #42920:
URL: https://github.com/apache/spark/pull/42920

   ### What changes were proposed in this pull request?
   1. In PyArrow 13.0.0, the behavior of `Table#to_pandas` and `ChunkedArray#to_pandas` changed; set `coerce_temporal_nanoseconds=True` to keep the previous nanosecond behavior, per [1](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas) and [2](https://arrow.apache.org/docs/python/generated/pyarrow.ChunkedArray.html#pyarrow.ChunkedArray.to_pandas). See the sketch after the version comparison below.
   
   2. There is another undocumented breaking change in the data type conversion [`TimestampType#to_pandas_dtype`](https://arrow.apache.org/docs/python/generated/pyarrow.TimestampType.html#pyarrow.TimestampType.to_pandas_dtype):
   
   12.0.1:
   ```
   In [1]: import pyarrow as pa
   
   In [2]: pa.timestamp("us", tz=None).to_pandas_dtype()
   Out[2]: dtype('<M8[ns]')
   
   In [3]: pa.timestamp("ns", tz=None).to_pandas_dtype()
   Out[3]: dtype('<M8[ns]')
   
   In [4]: pa.timestamp("us", tz="UTC").to_pandas_dtype()
   Out[4]: datetime64[ns, UTC]
   
   In [5]: pa.timestamp("ns", tz="UTC").to_pandas_dtype()
   Out[5]: datetime64[ns, UTC]
   ```
   
   13.0.0:
   ```
   In [1]: import pyarrow as pa
   
   In [2]: pa.timestamp("us", tz=None).to_pandas_dtype()
   Out[2]: dtype('<M8[us]')
   
   In [3]: pa.timestamp("ns", tz=None).to_pandas_dtype()
   Out[3]: dtype('<M8[ns]')
   
   In [4]: pa.timestamp("us", tz="UTC").to_pandas_dtype()
   Out[4]: datetime64[us, UTC]
   
   In [5]: pa.timestamp("ns", tz="UTC").to_pandas_dtype()
   Out[5]: datetime64[ns, UTC]
   ```
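   
   To keep the pre-13.0.0 behavior, the new keyword can be passed only when the installed PyArrow supports it. A minimal sketch of that approach (not the exact patch; the helper name `arrow_table_to_pandas` is hypothetical):
   
   ```
   from distutils.version import LooseVersion
   
   import pyarrow as pa
   
   def arrow_table_to_pandas(table: pa.Table):
       # `coerce_temporal_nanoseconds` was added in PyArrow 13.0.0; older versions
       # already coerce timestamps to nanosecond resolution by default.
       kwargs = {}
       if LooseVersion(pa.__version__) >= LooseVersion("13.0.0"):
           kwargs["coerce_temporal_nanoseconds"] = True
       return table.to_pandas(**kwargs)
   ```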
   
   
   ### Why are the changes needed?
   Make PySpark compatible with PyArrow 13.0.0
   
   
   ### Does this PR introduce _any_ user-facing change?
   NO
   
   
   ### How was this patch tested?
   CI
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   NO
   




[GitHub] [spark] zhengruifeng commented on pull request #42920: [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42920:
URL: https://github.com/apache/spark/pull/42920#issuecomment-1720329365

   @dongjoon-hyun I don't see a failure in the [docker build](https://github.com/zhengruifeng/spark/actions/runs/6184148073/job/16787288022):
   
   ```
   #21 [15/19] RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
   #21 43.17 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
   #21 DONE 46.2s
   ```
   
   We don't pin the mlflow version; instead we only set a lower bound `mlflow>=2.3.1`. It looks like mlflow 2.7.0 works with PyArrow 13.0.0:
   
   `pyarrow [required: >=4.0.0,<14, installed: 13.0.0]`




[GitHub] [spark] zhengruifeng commented on a diff in pull request #42920: [WIP][SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42920:
URL: https://github.com/apache/spark/pull/42920#discussion_r1325319963


##########
python/pyspark/pandas/typedef/typehints.py:
##########
@@ -293,7 +293,9 @@ def spark_type_to_pandas_dtype(
         ),
     ):
         return np.dtype("object")
-    elif isinstance(spark_type, types.TimestampType):
+    elif isinstance(spark_type, types.DayTimeIntervalType):
+        return np.dtype("timedelta64[ns]")
+    elif isinstance(spark_type, (types.TimestampType, types.TimestampNTZType)):
         return np.dtype("datetime64[ns]")
     else:
         return np.dtype(to_arrow_type(spark_type).to_pandas_dtype())

Review Comment:
   The behavior of `to_arrow_type(spark_type).to_pandas_dtype()` changed. For example, `to_arrow_type(DayTimeIntervalType)` -> `pa.timestamp("us", tz="UTC")` -> `datetime64[us, UTC]` in 13.0.0, but `datetime64[ns, UTC]` in 12.0.1.
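   
   With the explicit branches above, the returned dtypes stay nanosecond-based regardless of the installed PyArrow version. A rough usage sketch (assumes a local PySpark build with this change applied):
   
   ```
   import numpy as np
   from pyspark.sql import types
   from pyspark.pandas.typedef.typehints import spark_type_to_pandas_dtype
   
   # These hold on both PyArrow 12.0.1 and 13.0.0 after the change.
   assert spark_type_to_pandas_dtype(types.TimestampType()) == np.dtype("datetime64[ns]")
   assert spark_type_to_pandas_dtype(types.TimestampNTZType()) == np.dtype("datetime64[ns]")
   assert spark_type_to_pandas_dtype(types.DayTimeIntervalType()) == np.dtype("timedelta64[ns]")
   ```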





[GitHub] [spark] dongjoon-hyun commented on pull request #42920: [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42920:
URL: https://github.com/apache/spark/pull/42920#issuecomment-1720572501

   Merged to master for Apache Spark 4.0.0. Thank you, @zhengruifeng and @HyukjinKwon.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42920: [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #42920:
URL: https://github.com/apache/spark/pull/42920#discussion_r1326623660


##########
dev/infra/Dockerfile:
##########
@@ -85,7 +85,7 @@ RUN Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='ht
 ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
 
 RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
-RUN python3.9 -m pip install numpy 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
+RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'

Review Comment:
   Let's also upgrade the mlflow version.





[GitHub] [spark] dongjoon-hyun commented on pull request #42920: [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42920:
URL: https://github.com/apache/spark/pull/42920#issuecomment-1720412021

   Oh, it's great to have mlflow 2.7.0 on time! Thank you.




[GitHub] [spark] zhengruifeng commented on pull request #42920: [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42920:
URL: https://github.com/apache/spark/pull/42920#issuecomment-1720690499

   thank you @dongjoon-hyun and @HyukjinKwon 




[GitHub] [spark] dongjoon-hyun closed pull request #42920: [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #42920: [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0
URL: https://github.com/apache/spark/pull/42920

