
Posted to reviews@spark.apache.org by "EnricoMi (via GitHub)" <gi...@apache.org> on 2023/08/03 07:27:16 UTC

[GitHub] [spark] EnricoMi opened a new pull request, #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch

EnricoMi opened a new pull request, #42316:
URL: https://github.com/apache/spark/pull/42316

   ### What changes were proposed in this pull request?
   This merges #39952 into the 3.5 branch.
   
   Similar to #38223, this improves the error messages raised when a Python function passed to `DataFrame.mapInPandas` returns a pandas DataFrame that does not match the expected schema.
   
   With
   ```Python
   from pyspark.sql.functions import col
   df = spark.range(2).withColumn("v", col("id"))
   ```
   
   **Mismatching column names:**
   ```Python
   df.mapInPandas(lambda it: it, "id long, val long").show()
   # was: KeyError: 'val'
   # now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
   #      Missing: val  Unexpected: v
   ```
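
The "Missing" and "Unexpected" parts of the new message can be derived from two set differences between the returned and the expected column names. The following is a minimal sketch of that idea in plain Python. The helper `check_column_names` is hypothetical and is not the code added by this PR; it only mirrors the message format shown above.

```python
def check_column_names(returned, expected):
    """Raise RuntimeError naming missing and unexpected columns.

    Hypothetical helper for illustration; not PySpark's implementation.
    """
    # Columns the schema asks for but the function did not return.
    missing = sorted(set(expected) - set(returned))
    # Columns the function returned but the schema does not declare.
    unexpected = sorted(set(returned) - set(expected))
    if missing or unexpected:
        details = []
        if missing:
            details.append("Missing: " + ", ".join(missing))
        if unexpected:
            details.append("Unexpected: " + ", ".join(unexpected))
        raise RuntimeError(
            "Column names of the returned pandas.DataFrame do not match "
            "specified schema. " + "  ".join(details)
        )


# Same situation as the example above: schema "id long, val long",
# but the returned frame has columns "id" and "v".
try:
    check_column_names(returned=["id", "v"], expected=["id", "val"])
except RuntimeError as e:
    message = str(e)
    print(message)
```

Naming both sides of the mismatch lets the user see at a glance whether a column was misspelled (one missing, one unexpected) or simply forgotten.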
   
   **Python function not returning iterator:**
   ```Python
   df.mapInPandas(lambda it: 1, "id long").show()
   # was: TypeError: 'int' object is not iterable
   # now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
   ```
   
   **Python function not returning iterator of pandas.DataFrame:**
   ```Python
   df.mapInPandas(lambda it: [1], "id long").show()
   # was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
   # now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
   # sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
   # now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
   ```
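
The two cases above differ in where the check can happen: whether the result is iterable at all can be checked immediately, while the element type can only be checked as each element is consumed. A minimal sketch of such a two-stage check follows. The function `verify_result` and its `expected_type` parameter are assumptions made for illustration (with `dict` standing in for `pandas.DataFrame` so the example needs no third-party packages); this is not the code from the PR.

```python
from collections.abc import Iterable


def verify_result(result, expected_type):
    """Check the container eagerly and each element lazily.

    Hypothetical helper; expected_type stands in for pandas.DataFrame.
    """
    # Stage 1: the returned object must be iterable at all.
    if not isinstance(result, Iterable):
        raise TypeError(
            "Return type of the user-defined function should be iterator of "
            f"{expected_type.__name__}, but is {type(result)}"
        )

    def checked():
        # Stage 2: verify every element while it is being consumed.
        for element in result:
            if not isinstance(element, expected_type):
                raise TypeError(
                    "Return type of the user-defined function should be "
                    f"iterator of {expected_type.__name__}, "
                    f"but is iterator of {type(element)}"
                )
            yield element

    return checked()


# A valid result passes through unchanged.
print(list(verify_result(iter([{"id": 0}]), dict)))
```

Because the second stage is a generator, a bad element is reported only when the consumer reaches it, which matches the lazy, batch-by-batch processing of `mapInPandas`.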
   
   **Mismatching types (ValueError and TypeError):**
   ```Python
   df.mapInPandas(lambda it: it, "id int, v string").show()
   # was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
   # now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
   #      The above exception was the direct cause of the following exception:
   #      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).
   
   df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
   # was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
   # now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
   #      The above exception was the direct cause of the following exception:
   #      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
   
   # `self.sql_conf` is a context-manager helper from PySpark's test utilities
   with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
       df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
   # was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
   #      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
   #      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
   # now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
   #      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled
   #      by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
   ```
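
The "direct cause" lines in the messages above come from Python's exception chaining (`raise ... from ...`): the low-level conversion error is kept as the cause, and a new exception adds the column name. A minimal sketch of this pattern in plain Python is shown below. The function `convert_column` is hypothetical and uses `float` in place of an Arrow conversion; it is not the PR's implementation.

```python
def convert_column(name, values, target=float):
    """Convert values, re-raising failures with the column name attached.

    Hypothetical helper; float() stands in for the Arrow conversion.
    """
    try:
        return [target(v) for v in values]
    except (TypeError, ValueError) as e:
        # Keep the original error as __cause__ and add the column name.
        raise ValueError(
            f"Exception thrown when converting column with name '{name}' "
            f"to {target.__name__}."
        ) from e


try:
    convert_column("v", ["0", "not a number"])
except ValueError as e:
    print(e)                 # the high-level message naming the column
    print(repr(e.__cause__))  # the original low-level error
```

Chaining keeps the original error visible in the traceback, so no diagnostic detail is lost while the user still learns which column failed.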
   
   ### Why are the changes needed?
   The existing errors are either generic (`KeyError`) or uninformative (`'int' object is not iterable`). Error messages should help users spot mismatching columns by naming them.
   
   The schema of the returned pandas DataFrames can only be checked while the DataFrame is being processed, so these errors surface only after the job has started running, which makes them expensive. They should therefore be as informative as possible.
   
   ### Does this PR introduce _any_ user-facing change?
   This only changes error messages, not behaviour.
   
   ### How was this patch tested?
   Tests all cases of schema mismatch for `DataFrame.mapInPandas`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #42316:
URL: https://github.com/apache/spark/pull/42316#issuecomment-1664851259

   Merged to branch-3.5.



[GitHub] [spark] HyukjinKwon closed pull request #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch
URL: https://github.com/apache/spark/pull/42316



[GitHub] [spark] EnricoMi commented on pull request #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on PR #42316:
URL: https://github.com/apache/spark/pull/42316#issuecomment-1663639866

   Thanks!



[GitHub] [spark] EnricoMi commented on pull request #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on PR #42316:
URL: https://github.com/apache/spark/pull/42316#issuecomment-1663435300

   @allisonwang-db @xinrong-meng @HyukjinKwon 

