Posted to reviews@spark.apache.org by "xinrong-meng (via GitHub)" <gi...@apache.org> on 2023/05/11 23:08:44 UTC

[GitHub] [spark] xinrong-meng opened a new pull request, #41147: [WIP] Nested non-atomic input type support in Pandas UDF

xinrong-meng opened a new pull request, #41147:
URL: https://github.com/apache/spark/pull/41147

   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   ### How was this patch tested?
   
   




[GitHub] [spark] xinrong-meng commented on pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #41147:
URL: https://github.com/apache/spark/pull/41147#issuecomment-1555302733

   Merged to master, thank you!




[GitHub] [spark] xinrong-meng commented on a diff in pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #41147:
URL: https://github.com/apache/spark/pull/41147#discussion_r1196924307


##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -317,66 +320,6 @@ def arrow_to_pandas(self, arrow_column):
             s = super(ArrowStreamPandasUDFSerializer, self).arrow_to_pandas(arrow_column)
         return s
 
-    # To keep the current UDF behavior.
-    def _create_array(self, series, arrow_type):

Review Comment:
   Inherit `_create_array` from `ArrowStreamPandasSerializer`. After the change, the UDF behavior is consistent with `createDataFrame` and `toPandas` when Arrow is enabled.
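
   A minimal round-trip sketch of what "consistent" means here, using a made-up nested MapType column and UDF (none of this is from the PR itself): with the override removed, map data is expected to reach the UDF as Python dicts, the same shape `df.toPandas()` gives with Arrow enabled.

   ```python
   import pandas as pd
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import pandas_udf

   spark = SparkSession.builder.getOrCreate()

   # Hypothetical nested MapType column: map<string, map<string, int>>.
   df = spark.createDataFrame(
       [({"a": {"x": 1}},), ({"b": {"y": 2}},)],
       "m map<string, map<string, int>>",
   )

   @pandas_udf("map<string, map<string, int>>")
   def roundtrip(s: pd.Series) -> pd.Series:
       # Each element of `s` is expected to be a dict whose values are also
       # dicts, matching df.toPandas() with Arrow enabled.
       return s

   df.select(roundtrip("m").alias("m")).show(truncate=False)
   ```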





[GitHub] [spark] xinrong-meng commented on pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #41147:
URL: https://github.com/apache/spark/pull/41147#issuecomment-1555303500

   Please feel free to leave comments if any; I'll address them in follow-ups.




[GitHub] [spark] xinrong-meng commented on pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #41147:
URL: https://github.com/apache/spark/pull/41147#issuecomment-1553393559

   @ueshin @HyukjinKwon @zhengruifeng would you please review?




[GitHub] [spark] xinrong-meng commented on a diff in pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #41147:
URL: https://github.com/apache/spark/pull/41147#discussion_r1196924307


##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -317,66 +320,6 @@ def arrow_to_pandas(self, arrow_column):
             s = super(ArrowStreamPandasUDFSerializer, self).arrow_to_pandas(arrow_column)
         return s
 
-    # To keep the current UDF behavior.
-    def _create_array(self, series, arrow_type):

Review Comment:
   Inherit `_create_array` from `ArrowStreamPandasSerializer`. After the change, the behavior is consistent with `createDataFrame` from a pandas DataFrame when Arrow is enabled.
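
   For comparison, a small sketch of the `createDataFrame`-from-pandas path mentioned above (the column name and data are invented for illustration): with Arrow enabled, nested maps are supplied as Python dicts, and after this change the UDF serialization path handles the same data the same way.

   ```python
   import pandas as pd
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   # Enable Arrow for createDataFrame/toPandas conversions.
   spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

   pdf = pd.DataFrame({"m": [{"a": {"x": 1}}, {"b": {"y": 2}}]})
   sdf = spark.createDataFrame(pdf, schema="m map<string, map<string, int>>")
   sdf.show(truncate=False)
   ```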





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #41147:
URL: https://github.com/apache/spark/pull/41147#discussion_r1197447942


##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -168,23 +173,21 @@ def __init__(self, timezone, safecheck):
         self._safecheck = safecheck
 
     def arrow_to_pandas(self, arrow_column):
-        from pyspark.sql.pandas.types import (
-            _check_series_localize_timestamps,
-            _convert_map_items_to_dict,
-        )
-        import pyarrow
-
         # If the given column is a date type column, creates a series of datetime.date directly
         # instead of creating datetime64[ns] as intermediate data to avoid overflow caused by
         # datetime64[ns] type handling.
+        # Cast dates to objects instead of datetime64[ns] dtype to avoid overflow.
         s = arrow_column.to_pandas(date_as_object=True)
 
-        if pyarrow.types.is_timestamp(arrow_column.type) and arrow_column.type.tz is not None:
-            return _check_series_localize_timestamps(s, self._timezone)
-        elif pyarrow.types.is_map(arrow_column.type):
-            return _convert_map_items_to_dict(s)
-        else:
-            return s
+        # TODO: cache the converter for reuse

Review Comment:
   Could you file a JIRA issue officially and make this an IDed TODO like `TODO(SPARK-XXX)`?
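
   As background for the branch removed in this hunk (a standalone sketch using only pyarrow and pandas, independent of the PR's new converter): an Arrow map column does not come back from `to_pandas()` as dicts, which is what `_convert_map_items_to_dict` compensated for.

   ```python
   import pyarrow as pa

   # Build an Arrow map array; dicts are accepted as input for map types.
   arr = pa.array(
       [{"a": 1, "b": 2}, {"c": 3}],
       type=pa.map_(pa.string(), pa.int64()),
   )

   # to_pandas() yields a Series whose elements are sequences of
   # (key, value) pairs rather than dicts, so the old branch converted them.
   s = arr.to_pandas()
   dicts = s.apply(lambda kv: None if kv is None else dict(kv))
   print(list(dicts))  # [{'a': 1, 'b': 2}, {'c': 3}]
   ```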





[GitHub] [spark] xinrong-meng commented on a diff in pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #41147:
URL: https://github.com/apache/spark/pull/41147#discussion_r1198097720


##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -168,23 +173,21 @@ def __init__(self, timezone, safecheck):
         self._safecheck = safecheck
 
     def arrow_to_pandas(self, arrow_column):
-        from pyspark.sql.pandas.types import (
-            _check_series_localize_timestamps,
-            _convert_map_items_to_dict,
-        )
-        import pyarrow
-
         # If the given column is a date type column, creates a series of datetime.date directly
         # instead of creating datetime64[ns] as intermediate data to avoid overflow caused by
         # datetime64[ns] type handling.
+        # Cast dates to objects instead of datetime64[ns] dtype to avoid overflow.
         s = arrow_column.to_pandas(date_as_object=True)
 
-        if pyarrow.types.is_timestamp(arrow_column.type) and arrow_column.type.tz is not None:
-            return _check_series_localize_timestamps(s, self._timezone)
-        elif pyarrow.types.is_map(arrow_column.type):
-            return _convert_map_items_to_dict(s)
-        else:
-            return s
+        # TODO: cache the converter for reuse

Review Comment:
   Certainly, done!





[GitHub] [spark] xinrong-meng closed pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng closed pull request #41147: [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF
URL: https://github.com/apache/spark/pull/41147




[GitHub] [spark] xinrong-meng commented on a diff in pull request #41147: [WIP] Standardize nested non-atomic input type support in Pandas UDF

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #41147:
URL: https://github.com/apache/spark/pull/41147#discussion_r1196924307


##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -317,66 +320,6 @@ def arrow_to_pandas(self, arrow_column):
             s = super(ArrowStreamPandasUDFSerializer, self).arrow_to_pandas(arrow_column)
         return s
 
-    # To keep the current UDF behavior.
-    def _create_array(self, series, arrow_type):

Review Comment:
   Inherit `_create_array` from `ArrowStreamPandasSerializer`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org