Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/02/15 03:53:00 UTC
[jira] [Resolved] (SPARK-38183) Show warning when creating pandas-on-Spark session under ANSI mode.
[ https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-38183.
----------------------------------
Fix Version/s: 3.3.0
Resolution: Fixed
Issue resolved by pull request 35488
[https://github.com/apache/spark/pull/35488]
> Show warning when creating pandas-on-Spark session under ANSI mode.
> -------------------------------------------------------------------
>
> Key: SPARK-38183
> URL: https://issues.apache.org/jira/browse/SPARK-38183
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Fix For: 3.3.0
>
>
> Since pandas API on Spark follows the behavior of pandas, not SQL, some unexpected behavior can occur when "spark.sql.ansi.enabled" is true.
> For example,
> * It raises an exception when {{div}}- and {{mod}}-related methods would return null (e.g. {{DataFrame.rmod}})
> {code:java}
> >>> df
> angles degrees
> 0 0 360
> 1 3 180
> 2 4 360
> >>> df.rmod(2)
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 32.0 (TID 165) (172.30.1.44 executor driver): org.apache.spark.SparkArithmeticException: divide by zero. To return NULL instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error.{code}
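The error message points at Spark's {{try_divide}}-style functions, which return null instead of raising. A minimal pure-Python sketch of that distinction (the {{try_rmod}} helper is hypothetical, not a Spark or pandas API):

```python
def try_rmod(divisors, value):
    """Hypothetical sketch of Spark's try_* semantics: return None
    (null) for modulo-by-zero instead of raising, which is what
    pandas-style behavior expects and what ANSI mode forbids."""
    out = []
    for d in divisors:
        try:
            out.append(value % d)
        except ZeroDivisionError:
            out.append(None)  # ANSI mode raises SparkArithmeticException here
    return out

print(try_rmod([0, 3, 4], 2))  # [None, 2, 2]
```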
> * It raises an exception when the DataFrame passed to {{ps.melt}} has columns of different types.
>
> {code:java}
> >>> df
> A B C
> 0 a 1 2
> 1 b 3 4
> 2 c 5 6
> >>> ps.melt(df)
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), struct('B', B), struct('C', C))' due to data type mismatch: input to function array should all be the same type, but it's [struct<variable:string,value:string>, struct<variable:string,value:bigint>, struct<variable:string,value:bigint>]
> To fix the error, you might need to add explicit type casts. If necessary set spark.sql.ansi.enabled to false to bypass this error.;
> 'Project [__index_level_0__#223L, A#224, B#225L, C#226L, __natural_order__#231L, explode(array(struct(variable, A, value, A#224), struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS pairs#269]
> +- Project [__index_level_0__#223L, A#224, B#225L, C#226L, monotonically_increasing_id() AS __natural_order__#231L]
> +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
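The plan above shows why this fails: melt builds an array of {{struct(variable, value)}} pairs, so every value field must share one SQL type. A pure-Python sketch of the shape of the data (this {{melt}} helper is illustrative only, not the pandas-on-Spark implementation):

```python
def melt(rows, columns):
    """Illustrative sketch of melt: emit one (variable, value) pair per
    column per row. Spark collects these pairs into an array of structs,
    so under ANSI mode all value fields must have a single common type."""
    pairs = []
    for row in rows:
        for col, val in zip(columns, row):
            pairs.append((col, val))
    return pairs

pairs = melt([("a", 1, 2), ("b", 3, 4)], ["A", "B", "C"])
value_types = {type(v).__name__ for _, v in pairs}
print(value_types)  # mixed str/int value types -> rejected under ANSI
```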
> * It raises an exception when {{CategoricalIndex.remove_categories}} removes a category that still appears in the index
> {code:java}
> >>> idx
> CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
> >>> idx.remove_categories('b')
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
> org.apache.spark.SparkNoSuchElementException: Key b does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
> ...
> ...{code}
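The failure comes from a strict key lookup: once 'b' is removed from the categories, mapping the existing 'b' values fails under {{spark.sql.ansi.strictIndexOperator}}. A pure-Python sketch of strict vs. lenient lookup (the helper is hypothetical; {{set_categories}} below fails the same way):

```python
def map_categories(values, mapping, strict):
    """Sketch of the strictIndexOperator distinction: a strict lookup
    raises on a missing key (mirroring SparkNoSuchElementException),
    while a lenient lookup yields None, as pandas semantics expect."""
    if strict:
        return [mapping[v] for v in values]   # raises KeyError on 'b'
    return [mapping.get(v) for v in values]   # None for removed categories

mapping = {"a": "a", "c": "c"}  # category 'b' has been removed
print(map_categories(["a", "b", "c"], mapping, strict=False))  # ['a', None, 'c']
```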
> * It raises an exception when {{CategoricalIndex.set_categories}} sets categories that do not cover all values in the index
> {code:java}
> >>> idx.set_categories(['b', 'c'])
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
> org.apache.spark.SparkNoSuchElementException: Key a does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
> ...
> ...{code}
> * It raises an exception when {{ps.to_numeric}} gets a non-numeric value
> {code:java}
> >>> psser
> 0 apple
> 1 1.0
> 2 2
> 3 -3
> dtype: object
> >>> ps.to_numeric(psser)
> 22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 328)
> org.apache.spark.SparkNumberFormatException: invalid input syntax for type numeric: apple. To return NULL instead, use 'try_cast'. If necessary set spark.sql.ansi.enabled to false to bypass this error.
> ...{code}
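As the message notes, Spark's {{try_cast}} returns null for unparseable input instead of raising. A pure-Python sketch of that contrast (the {{try_to_numeric}} helper is hypothetical):

```python
def try_to_numeric(values):
    """Sketch of try_cast semantics: None (null) for values that cannot
    be parsed as numbers; ANSI casting raises SparkNumberFormatException
    instead, which breaks the pandas-style to_numeric contract."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except ValueError:
            out.append(None)  # 'apple' lands here
    return out

print(try_to_numeric(["apple", "1.0", "2", "-3"]))  # [None, 1.0, 2.0, -3.0]
```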
> * It raises an exception when {{strings.StringMethods.split}} or {{strings.StringMethods.rsplit}} with {{expand=True}} produces null columns
> {code:java}
> >>> s
> 0 this is a regular sentence
> 1 https://docs.python.org/3/tutorial/index.html
> 2 None
> dtype: object
> >>> s.str.split(n=4, expand=True)
> 22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 356)
> org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.{code}
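With {{expand=True}}, rows whose split yields fewer parts must be padded with nulls, but under ANSI mode indexing past the end of the split array raises instead. A pure-Python sketch of the expected padding (the helper is illustrative only):

```python
def split_expand(strings, n):
    """Sketch of split(n=..., expand=True): every row gets n + 1
    columns, with missing parts filled by None. ANSI strict indexing
    raises SparkArrayIndexOutOfBoundsException on the short rows."""
    rows = []
    for s in strings:
        parts = [] if s is None else s.split(None, n)
        rows.append(parts + [None] * (n + 1 - len(parts)))
    return rows

rows = split_expand(["this is a regular sentence", None], n=4)
print(rows[0])  # ['this', 'is', 'a', 'regular', 'sentence']
print(rows[1])  # [None, None, None, None, None]
```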
> * It raises an exception when {{astype}} is called with a {{CategoricalDtype}} whose categories do not match the data.
> {code:java}
> >>> psser
> 0 1994-01-31
> 1 1994-02-01
> 2 1994-02-02
> dtype: object
> >>> cat_type
> CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
> >>> psser.astype(cat_type)
> 22/02/14 09:34:56 ERROR Executor: Exception in task 5.0 in stage 90.0 (TID 468)
> org.apache.spark.SparkNoSuchElementException: Key 1994-02-01 does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.{code}
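In pandas, casting to a {{CategoricalDtype}} maps values outside the categories to missing (code -1), while ANSI strict lookup raises on the missing key. A pure-Python sketch of the pandas-style mapping (the helper is hypothetical):

```python
def astype_categorical(values, categories):
    """Sketch of pandas-style categorical casting: map each value to its
    category code, and use -1 (missing) for values not in the categories.
    ANSI strict lookup raises SparkNoSuchElementException instead."""
    codes = {c: i for i, c in enumerate(categories)}
    return [codes.get(v, -1) for v in values]

print(astype_categorical(["1994-01-31", "1994-02-01"], ["a", "b", "c"]))
# [-1, -1] -> no category matches, so all values become missing
```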
> Beyond these examples, any function whose underlying SQL expression behaves differently under ANSI options may fail unexpectedly.
> So we might need to show a proper warning message when creating a pandas-on-Spark session.
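A minimal sketch of the proposed check, using the standard {{warnings}} module with a plain dict standing in for the Spark conf (the function name and message are illustrative, not the merged implementation):

```python
import warnings

def warn_if_ansi_enabled(conf):
    """Sketch: when creating a pandas-on-Spark session, warn if ANSI
    mode is on, since pandas API on Spark follows pandas semantics.
    `conf` is a plain dict here, standing in for the Spark conf."""
    if conf.get("spark.sql.ansi.enabled", "false").lower() == "true":
        warnings.warn(
            "'spark.sql.ansi.enabled' is set to true; pandas API on Spark "
            "follows pandas semantics, so some operations may raise "
            "unexpected exceptions.",
            UserWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_ansi_enabled({"spark.sql.ansi.enabled": "true"})
print(len(caught))  # 1
```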
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org