Posted to reviews@spark.apache.org by "EnricoMi (via GitHub)" <gi...@apache.org> on 2023/02/09 11:02:27 UTC

[GitHub] [spark] EnricoMi opened a new pull request, #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

EnricoMi opened a new pull request, #39952:
URL: https://github.com/apache/spark/pull/39952

   ### What changes were proposed in this pull request?
   Similar to #38223, this improves the error messages raised when the Python function provided to `DataFrame.mapInPandas` returns Pandas DataFrames that do not match the expected schema.
   
   With
   ```Python
   df = spark.range(2).withColumn("v", col("id"))
   ```
   
   **Mismatching column names:**
   ```Python
   df.mapInPandas(lambda it: it, "id long, val long").show()
   # was: KeyError: 'val'
   # now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
   #      Missing: val  Unexpected: v
   ```
   
   **Python function not returning iterator:**
   ```Python
   df.mapInPandas(lambda it: 1, "id long").show()
   # was: TypeError: 'int' object is not iterable
   # now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
   ```
   
   **Python function not returning iterator of pandas.DataFrame:**
   ```Python
   df.mapInPandas(lambda it: [1], "id long").show()
   # was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
   # now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
   ```
   
   **Mismatching types (ValueError and TypeError):**
   ```Python
   df.mapInPandas(lambda it: it, "id int, v string").show()
   # was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
   # now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
   #      The above exception was the direct cause of the following exception:
   #      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).
   
   df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
   # was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
   # now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
   #      The above exception was the direct cause of the following exception:
   #      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
   
   with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
     df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
   # was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
   #      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
   #      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
   # now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
   #      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
   #      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
   ```
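   
   For contrast, a call whose returned DataFrames satisfy the declared schema raises none of these errors (a minimal sketch, not code from this PR):
   ```Python
   # Rename the column so the returned pandas.DataFrames match the declared schema
   # (hypothetical usage example).
   df.mapInPandas(
       lambda it: (pdf.rename(columns={"v": "val"}) for pdf in it),
       "id long, val long",
   ).show()
   ```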
   
   ### Why are the changes needed?
   Existing errors are generic (`KeyError`) or uninformative (`'int' object is not iterable`). The errors should help users spot the mismatching columns by naming them.
   
   The schema of the returned Pandas DataFrames can only be checked while the DataFrame is being processed, so these errors are expensive to hit. They should therefore be expressive.
   
   ### Does this PR introduce _any_ user-facing change?
   This only changes error messages, not behaviour.
   
   ### How was this patch tested?
   Added tests covering all cases of schema mismatch for `DataFrame.mapInPandas`.
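   
   A sketch of what such a test can look like (the assertion pattern is illustrative, not the actual test code; among the test files touched by this PR is `python/pyspark/sql/tests/pandas/test_pandas_map.py`, as seen in the merge log later in this thread):
   ```Python
   # Illustrative only: asserts that the more expressive message surfaces to the caller.
   with self.assertRaisesRegex(Exception, "Missing: val"):
       df.mapInPandas(lambda it: it, "id long, val long").collect()
   ```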




[GitHub] [spark] xinrong-meng commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1639061633

   Merged to master, thanks!




[GitHub] [spark] xinrong-meng commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1635206417

   The last commit seems to fail the tests. Would you fix it?




[GitHub] [spark] EnricoMi commented on a diff in pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on code in PR #39952:
URL: https://github.com/apache/spark/pull/39952#discussion_r1258157963


##########
python/pyspark/worker.py:
##########
@@ -133,65 +134,103 @@ def verify_result_length(result, length):
     )
 
 
-def wrap_batch_iter_udf(f, return_type):
+def wrap_batch_iter_udf(f, return_type, is_arrow_iter=False):
     arrow_return_type = to_arrow_type(return_type)
+    iter_type_label = (
+        "pyarrow.RecordBatch"
+        if is_arrow_iter
+        else ("pandas.DataFrame" if type(return_type) == StructType else "pandas.Series")
+    )
 
-    def verify_result_type(result):
-        if not hasattr(result, "__len__"):
-            pd_type = "Pandas.DataFrame" if type(return_type) == StructType else "Pandas.Series"
+    def verify_result(result):
+        if not isinstance(result, Iterator) and not hasattr(result, "__iter__"):
             raise TypeError(
                 "Return type of the user-defined function should be "
-                "{}, but is {}".format(pd_type, type(result))
+                "iterator of {}, but is {}".format(iter_type_label, type(result))
             )
         return result
 
+    def verify_element(elem):
+        if is_arrow_iter:
+            import pyarrow as pa
+
+            if not isinstance(elem, pa.RecordBatch):
+                raise TypeError(
+                    "Return type of the user-defined function should be "
+                    "iterator of {}, but is iterator of {}".format(iter_type_label, type(elem))
+                )
+        else:
+            import pandas as pd
+
+            if not isinstance(elem, pd.DataFrame if type(return_type) == StructType else pd.Series):
+                raise TypeError(

Review Comment:
   Done





[GitHub] [spark] xinrong-meng commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1622600141

   Thanks @EnricoMi !
   I would suggest creating a separate `def wrap_..` for `PythonEvalType.SQL_MAP_ARROW_ITER_UDF` instead of introducing a new parameter `is_arrow_iter` to `wrap_batch_iter_udf`.
   That maintains logical consistency with the other `wrap_` functions (each function is dedicated to wrapping a specific type of UDF) and promotes a modular design.
   My point is subject to debate.
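   
   A minimal sketch of what such a split might look like, assuming the shared helpers (`to_arrow_type`, `StructType`) used by `python/pyspark/worker.py`; names and structure are illustrative, the actual change is in the commit EnricoMi links later in this thread:
   ```Python
   from pyspark.sql.pandas.types import to_arrow_type
   from pyspark.sql.types import StructType


   def _verify_is_iterable(result, iter_type_label):
       # The user-defined function must return an iterator/iterable, not a single value.
       if not hasattr(result, "__iter__"):
           raise TypeError(
               "Return type of the user-defined function should be "
               "iterator of {}, but is {}".format(iter_type_label, type(result))
           )
       return result


   def wrap_pandas_batch_iter_udf(f, return_type):
       arrow_return_type = to_arrow_type(return_type)
       label = "pandas.DataFrame" if type(return_type) == StructType else "pandas.Series"

       def verify_element(elem):
           import pandas as pd

           expected = pd.DataFrame if type(return_type) == StructType else pd.Series
           if not isinstance(elem, expected):
               raise TypeError(
                   "Return type of the user-defined function should be "
                   "iterator of {}, but is iterator of {}".format(label, type(elem))
               )
           return elem

       return lambda *it: map(
           lambda res: (res, arrow_return_type),
           map(verify_element, _verify_is_iterable(f(*it), label)),
       )


   def wrap_arrow_batch_iter_udf(f, return_type):
       arrow_return_type = to_arrow_type(return_type)

       def verify_element(elem):
           import pyarrow as pa

           if not isinstance(elem, pa.RecordBatch):
               raise TypeError(
                   "Return type of the user-defined function should be "
                   "iterator of pyarrow.RecordBatch, but is iterator of {}".format(type(elem))
               )
           return elem

       return lambda *it: map(
           lambda res: (res, arrow_return_type),
           map(verify_element, _verify_is_iterable(f(*it), "pyarrow.RecordBatch")),
       )
   ```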




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1663422809

   Yes, please!




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1632701214

   Running `dev/connect-gen-protos.sh` showed the same error. Rebasing onto the latest master fixed the issue.




[GitHub] [spark] EnricoMi commented on a diff in pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on code in PR #39952:
URL: https://github.com/apache/spark/pull/39952#discussion_r1247662681


##########
python/pyspark/worker.py:
##########
@@ -133,65 +134,103 @@ def verify_result_length(result, length):
     )
 
 
-def wrap_batch_iter_udf(f, return_type):
+def wrap_batch_iter_udf(f, return_type, is_arrow_iter=False):
     arrow_return_type = to_arrow_type(return_type)
+    iter_type_label = (
+        "pyarrow.RecordBatch"
+        if is_arrow_iter
+        else ("pandas.DataFrame" if type(return_type) == StructType else "pandas.Series")
+    )
 
-    def verify_result_type(result):
-        if not hasattr(result, "__len__"):
-            pd_type = "Pandas.DataFrame" if type(return_type) == StructType else "Pandas.Series"
+    def verify_result(result):
+        if not isinstance(result, Iterator) and not hasattr(result, "__iter__"):
             raise TypeError(
                 "Return type of the user-defined function should be "
-                "{}, but is {}".format(pd_type, type(result))
+                "iterator of {}, but is {}".format(iter_type_label, type(result))
             )
         return result
 
+    def verify_element(elem):
+        if is_arrow_iter:
+            import pyarrow as pa
+
+            if not isinstance(elem, pa.RecordBatch):
+                raise TypeError(
+                    "Return type of the user-defined function should be "
+                    "iterator of {}, but is iterator of {}".format(iter_type_label, type(elem))
+                )
+        else:
+            import pandas as pd
+
+            if not isinstance(elem, pd.DataFrame if type(return_type) == StructType else pd.Series):
+                raise TypeError(

Review Comment:
   Sure, but the whole file uses `TypeError` instead of the new `PySparkTypeError`.
   
   I'd rather fix that in a separate PR and not fix unrelated code in this PR.





[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1635512336

   All green, all done.




[GitHub] [spark] xinrong-meng closed pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng closed pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch
URL: https://github.com/apache/spark/pull/39952




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1628594621

   @xinrong-meng split `wrap_batch_iter_udf` into `wrap_pandas_batch_iter_udf` and `wrap_arrow_batch_iter_udf`: https://github.com/apache/spark/pull/39952/commits/725c3af5a5cc15b0ba8bf3637ab4c0465914ac1f




[GitHub] [spark] itholic commented on a diff in pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #39952:
URL: https://github.com/apache/spark/pull/39952#discussion_r1248113726


##########
python/pyspark/worker.py:
##########
@@ -133,65 +134,103 @@ def verify_result_length(result, length):
     )
 
 
-def wrap_batch_iter_udf(f, return_type):
+def wrap_batch_iter_udf(f, return_type, is_arrow_iter=False):
     arrow_return_type = to_arrow_type(return_type)
+    iter_type_label = (
+        "pyarrow.RecordBatch"
+        if is_arrow_iter
+        else ("pandas.DataFrame" if type(return_type) == StructType else "pandas.Series")
+    )
 
-    def verify_result_type(result):
-        if not hasattr(result, "__len__"):
-            pd_type = "Pandas.DataFrame" if type(return_type) == StructType else "Pandas.Series"
+    def verify_result(result):
+        if not isinstance(result, Iterator) and not hasattr(result, "__iter__"):
             raise TypeError(
                 "Return type of the user-defined function should be "
-                "{}, but is {}".format(pd_type, type(result))
+                "iterator of {}, but is {}".format(iter_type_label, type(result))
             )
         return result
 
+    def verify_element(elem):
+        if is_arrow_iter:
+            import pyarrow as pa
+
+            if not isinstance(elem, pa.RecordBatch):
+                raise TypeError(
+                    "Return type of the user-defined function should be "
+                    "iterator of {}, but is iterator of {}".format(iter_type_label, type(elem))
+                )
+        else:
+            import pandas as pd
+
+            if not isinstance(elem, pd.DataFrame if type(return_type) == StructType else pd.Series):
+                raise TypeError(

Review Comment:
   Of course. You can address only the related changes here.





[GitHub] [spark] xinrong-meng commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1631707757

   Would you try the command "dev/connect-gen-protos.sh"?




[GitHub] [spark] itholic commented on a diff in pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #39952:
URL: https://github.com/apache/spark/pull/39952#discussion_r1246937082


##########
python/pyspark/worker.py:
##########
@@ -133,65 +134,103 @@ def verify_result_length(result, length):
     )
 
 
-def wrap_batch_iter_udf(f, return_type):
+def wrap_batch_iter_udf(f, return_type, is_arrow_iter=False):
     arrow_return_type = to_arrow_type(return_type)
+    iter_type_label = (
+        "pyarrow.RecordBatch"
+        if is_arrow_iter
+        else ("pandas.DataFrame" if type(return_type) == StructType else "pandas.Series")
+    )
 
-    def verify_result_type(result):
-        if not hasattr(result, "__len__"):
-            pd_type = "Pandas.DataFrame" if type(return_type) == StructType else "Pandas.Series"
+    def verify_result(result):
+        if not isinstance(result, Iterator) and not hasattr(result, "__iter__"):
             raise TypeError(
                 "Return type of the user-defined function should be "
-                "{}, but is {}".format(pd_type, type(result))
+                "iterator of {}, but is {}".format(iter_type_label, type(result))
             )
         return result
 
+    def verify_element(elem):
+        if is_arrow_iter:
+            import pyarrow as pa
+
+            if not isinstance(elem, pa.RecordBatch):
+                raise TypeError(
+                    "Return type of the user-defined function should be "
+                    "iterator of {}, but is iterator of {}".format(iter_type_label, type(elem))
+                )
+        else:
+            import pandas as pd
+
+            if not isinstance(elem, pd.DataFrame if type(return_type) == StructType else pd.Series):
+                raise TypeError(

Review Comment:
   Ditto, and in other places as well?
   
   Basically, we should use the PySpark-specific errors instead of Python built-in exceptions.
   
   See https://github.com/apache/spark/blob/master/python/pyspark/errors/__init__.py for more details about the PySpark-specific errors.



##########
python/pyspark/worker.py:
##########
@@ -133,65 +134,103 @@ def verify_result_length(result, length):
     )
 
 
-def wrap_batch_iter_udf(f, return_type):
+def wrap_batch_iter_udf(f, return_type, is_arrow_iter=False):
     arrow_return_type = to_arrow_type(return_type)
+    iter_type_label = (
+        "pyarrow.RecordBatch"
+        if is_arrow_iter
+        else ("pandas.DataFrame" if type(return_type) == StructType else "pandas.Series")
+    )
 
-    def verify_result_type(result):
-        if not hasattr(result, "__len__"):
-            pd_type = "Pandas.DataFrame" if type(return_type) == StructType else "Pandas.Series"
+    def verify_result(result):
+        if not isinstance(result, Iterator) and not hasattr(result, "__iter__"):
             raise TypeError(
                 "Return type of the user-defined function should be "
-                "{}, but is {}".format(pd_type, type(result))
+                "iterator of {}, but is {}".format(iter_type_label, type(result))
             )
         return result
 
+    def verify_element(elem):
+        if is_arrow_iter:
+            import pyarrow as pa
+
+            if not isinstance(elem, pa.RecordBatch):
+                raise TypeError(
+                    "Return type of the user-defined function should be "
+                    "iterator of {}, but is iterator of {}".format(iter_type_label, type(elem))
+                )
+        else:
+            import pandas as pd
+
+            if not isinstance(elem, pd.DataFrame if type(return_type) == StructType else pd.Series):
+                raise TypeError(
+                    "Return type of the user-defined function should be "
+                    "iterator of {}, but is iterator of {}".format(iter_type_label, type(elem))
+                )
+
+            verify_pandas_result(elem, return_type, True, True)
+
+        return elem
+
     return lambda *iterator: map(
-        lambda res: (res, arrow_return_type), map(verify_result_type, f(*iterator))
+        lambda res: (res, arrow_return_type), map(verify_element, verify_result(f(*iterator)))
     )
 
 
-def verify_pandas_result(result, return_type, assign_cols_by_name):
+def verify_pandas_result(result, return_type, assign_cols_by_name, truncate_return_schema):
     import pandas as pd
 
-    if not isinstance(result, pd.DataFrame):
-        raise TypeError(
-            "Return type of the user-defined function should be "
-            "pandas.DataFrame, but is {}".format(type(result))
-        )
-
-    # check the schema of the result only if it is not empty or has columns
-    if not result.empty or len(result.columns) != 0:
-        # if any column name of the result is a string
-        # the column names of the result have to match the return type
-        #   see create_array in pyspark.sql.pandas.serializers.ArrowStreamPandasSerializer
-        field_names = set([field.name for field in return_type.fields])
-        column_names = set(result.columns)
-        if (
-            assign_cols_by_name
-            and any(isinstance(name, str) for name in result.columns)
-            and column_names != field_names
-        ):
-            missing = sorted(list(field_names.difference(column_names)))
-            missing = f" Missing: {', '.join(missing)}." if missing else ""
-
-            extra = sorted(list(column_names.difference(field_names)))
-            extra = f" Unexpected: {', '.join(extra)}." if extra else ""
+    if type(return_type) == StructType:
+        if not isinstance(result, pd.DataFrame):
+            raise TypeError(

Review Comment:
   Can we raise `PySparkTypeError` instead of `TypeError`?
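   
   For reference, a sketch of what raising a PySpark-specific error could look like (the error class name and parameter keys are hypothetical placeholders, not necessarily the ones defined in `pyspark/errors`):
   ```Python
   from pyspark.errors import PySparkTypeError


   def _raise_udf_return_type_error(actual):
       # Hypothetical error class and message parameters - illustrative only.
       raise PySparkTypeError(
           error_class="UDF_RETURN_TYPE",
           message_parameters={
               "expected": "iterator of pandas.DataFrame",
               "actual": str(type(actual)),
           },
       )
   ```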





[GitHub] [spark] xinrong-meng commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1629869617

   The refactoring is neat and clean! Would you fix the CI test failure?




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1477683110

   CC @gatorsmile @xinrong-meng 




[GitHub] [spark] allisonwang-db commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1662979295

   @xinrong-meng @EnricoMi should we also merge this in spark-3.5?




[GitHub] [spark] MaxGekk commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1611095210

   @HyukjinKwon @ueshin @itholic Could you have a look at the PR.




[GitHub] [spark] HyukjinKwon commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1617552918

   @xinrong-meng I think you should take a look at this.




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1630296118

   Not sure how to fix the `Python code generation check`: https://github.com/G-Research/spark/actions/runs/5516480294/jobs/10057925480#step:18:101




[GitHub] [spark] itholic commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1613568675

   Could you rebase this PR onto master? It seems there are some conflicts between master and your branch.
   https://github.com/G-Research/spark/runs/13927060744
   ```
   From https://github.com/G-Research/spark
    * branch                  branch-pyspark-map-in-pandas-schema-mismatch -> FETCH_HEAD
   Auto-merging python/pyspark/pandas/frame.py
   Auto-merging python/pyspark/sql/pandas/serializers.py
   Auto-merging python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py
   Auto-merging python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py
   Auto-merging python/pyspark/sql/tests/pandas/test_pandas_map.py
   CONFLICT (content): Merge conflict in python/pyspark/sql/tests/pandas/test_pandas_map.py
   Auto-merging python/pyspark/sql/tests/test_arrow_map.py
   Squash commit -- not updating HEAD
   Automatic merge failed; fix conflicts and then commit the result.
   Error: Process completed with exit code 1.
   ```




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1424241443

   @HyukjinKwon this is a follow-up to #38223




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1465739961

   CC @cloud-fan @itholic @zhengruifeng 




[GitHub] [spark] HyukjinKwon commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1663228670

   I am fine with merging it to 3.5.




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1663434269

   Merge PR for branch 3.5 in #42316.




[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.
EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1439704892

   @HyukjinKwon @cloud-fan would you say `Dataset.mapInPandas` should get error messages on a par with the improved ones for `Dataset.groupby(...).applyInPandas` in the same Spark release (that would be 3.4.0)?

