Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/02/15 03:53:00 UTC
[jira] [Resolved] (SPARK-38183) Show warning when creating pandas-on-Spark session under ANSI mode.
[ https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-38183.
----------------------------------
Fix Version/s: 3.3.0
Resolution: Fixed
Issue resolved by pull request 35488
[https://github.com/apache/spark/pull/35488]
> Show warning when creating pandas-on-Spark session under ANSI mode.
> -------------------------------------------------------------------
>
> Key: SPARK-38183
> URL: https://issues.apache.org/jira/browse/SPARK-38183
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Fix For: 3.3.0
>
>
> Since pandas API on Spark follows the behavior of pandas, not SQL, some unexpected behavior can occur when "spark.sql.ansi.enabled" is true.
> For example,
> * It raises an exception when {{div}}- and {{mod}}-related methods would return null (e.g. {{DataFrame.rmod}})
> {code:java}
> >>> df
> angles degrees
> 0 0 360
> 1 3 180
> 2 4 360
> >>> df.rmod(2)
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 32.0 (TID 165) (172.30.1.44 executor driver): org.apache.spark.SparkArithmeticException: divide by zero. To return NULL instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error.{code}
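The error message points at Spark's {{try_divide}}-style functions, which return null instead of raising. A minimal pure-Python sketch of that distinction (the {{try_rmod}} helper is hypothetical, not a Spark or pandas API):

```python
def try_rmod(divisors, value):
    """Hypothetical sketch of Spark's try_* semantics: return None
    (null) for modulo-by-zero instead of raising, which is what
    pandas-style behavior expects and what ANSI mode forbids."""
    out = []
    for d in divisors:
        try:
            out.append(value % d)
        except ZeroDivisionError:
            out.append(None)  # ANSI mode raises SparkArithmeticException here
    return out

print(try_rmod([0, 3, 4], 2))  # [None, 2, 2]
```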
> * It raises an exception when the DataFrame passed to {{ps.melt}} has columns of different types.
>
> {code:java}
> >>> df
> A B C
> 0 a 1 2
> 1 b 3 4
> 2 c 5 6
> >>> ps.melt(df)
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), struct('B', B), struct('C', C))' due to data type mismatch: input to function array should all be the same type, but it's [struct<variable:string,value:string>, struct<variable:string,value:bigint>, struct<variable:string,value:bigint>]
> To fix the error, you might need to add explicit type casts. If necessary set spark.sql.ansi.enabled to false to bypass this error.;
> 'Project [__index_level_0__#223L, A#224, B#225L, C#226L, __natural_order__#231L, explode(array(struct(variable, A, value, A#224), struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS pairs#269]
> +- Project [__index_level_0__#223L, A#224, B#225L, C#226L, monotonically_increasing_id() AS __natural_order__#231L]
> +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
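The plan above shows why this fails: melt builds an array of {{struct(variable, value)}} pairs, so every value field must share one SQL type. A pure-Python sketch of the shape of the data (this {{melt}} helper is illustrative only, not the pandas-on-Spark implementation):

```python
def melt(rows, columns):
    """Illustrative sketch of melt: emit one (variable, value) pair per
    column per row. Spark collects these pairs into an array of structs,
    so under ANSI mode all value fields must have a single common type."""
    pairs = []
    for row in rows:
        for col, val in zip(columns, row):
            pairs.append((col, val))
    return pairs

pairs = melt([("a", 1, 2), ("b", 3, 4)], ["A", "B", "C"])
value_types = {type(v).__name__ for _, v in pairs}
print(value_types)  # mixed str/int value types -> rejected under ANSI
```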
> * It raises an exception when {{CategoricalIndex.remove_categories}} removes a category that still appears in the index
> {code:java}
> >>> idx
> CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
> >>> idx.remove_categories('b')
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
> org.apache.spark.SparkNoSuchElementException: Key b does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
> ...
> ...{code}
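The failure comes from a strict key lookup: once 'b' is removed from the categories, mapping the existing 'b' values fails under {{spark.sql.ansi.strictIndexOperator}}. A pure-Python sketch of strict vs. lenient lookup (the helper is hypothetical; {{set_categories}} below fails the same way):

```python
def map_categories(values, mapping, strict):
    """Sketch of the strictIndexOperator distinction: a strict lookup
    raises on a missing key (mirroring SparkNoSuchElementException),
    while a lenient lookup yields None, as pandas semantics expect."""
    if strict:
        return [mapping[v] for v in values]   # raises KeyError on 'b'
    return [mapping.get(v) for v in values]   # None for removed categories

mapping = {"a": "a", "c": "c"}  # category 'b' has been removed
print(map_categories(["a", "b", "c"], mapping, strict=False))  # ['a', None, 'c']
```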
> * It raises an exception when {{CategoricalIndex.set_categories}} sets categories that do not cover all values in the index
> {code:java}
> >>> idx.set_categories(['b', 'c'])
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
> org.apache.spark.SparkNoSuchElementException: Key a does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
> ...
> ...{code}
> * It raises an exception when {{ps.to_numeric}} gets a non-numeric value
> {code:java}
> >>> psser
> 0 apple
> 1 1.0
> 2 2
> 3 -3
> dtype: object
> >>> ps.to_numeric(psser)
> 22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 328)
> org.apache.spark.SparkNumberFormatException: invalid input syntax for type numeric: apple. To return NULL instead, use 'try_cast'. If necessary set spark.sql.ansi.enabled to false to bypass this error.
> ...{code}
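As the message notes, Spark's {{try_cast}} returns null for unparseable input instead of raising. A pure-Python sketch of that contrast (the {{try_to_numeric}} helper is hypothetical):

```python
def try_to_numeric(values):
    """Sketch of try_cast semantics: None (null) for values that cannot
    be parsed as numbers; ANSI casting raises SparkNumberFormatException
    instead, which breaks the pandas-style to_numeric contract."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except ValueError:
            out.append(None)  # 'apple' lands here
    return out

print(try_to_numeric(["apple", "1.0", "2", "-3"]))  # [None, 1.0, 2.0, -3.0]
```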
> * It raises an exception when {{strings.StringMethods.split}} or {{strings.StringMethods.rsplit}} with {{expand=True}} produces null columns
> {code:java}
> >>> s
> 0 this is a regular sentence
> 1 https://docs.python.org/3/tutorial/index.html
> 2 None
> dtype: object
> >>> s.str.split(n=4, expand=True)
> 22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 356)
> org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.{code}
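With {{expand=True}}, rows whose split yields fewer parts must be padded with nulls, but under ANSI mode indexing past the end of the split array raises instead. A pure-Python sketch of the expected padding (the helper is illustrative only):

```python
def split_expand(strings, n):
    """Sketch of split(n=..., expand=True): every row gets n + 1
    columns, with missing parts filled by None. ANSI strict indexing
    raises SparkArrayIndexOutOfBoundsException on the short rows."""
    rows = []
    for s in strings:
        parts = [] if s is None else s.split(None, n)
        rows.append(parts + [None] * (n + 1 - len(parts)))
    return rows

rows = split_expand(["this is a regular sentence", None], n=4)
print(rows[0])  # ['this', 'is', 'a', 'regular', 'sentence']
print(rows[1])  # [None, None, None, None, None]
```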
> * It raises an exception when {{astype}} is called with a {{CategoricalDtype}} whose categories do not match the data.
> {code:java}
> >>> psser
> 0 1994-01-31
> 1 1994-02-01
> 2 1994-02-02
> dtype: object
> >>> cat_type
> CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
> >>> psser.astype(cat_type)
> 22/02/14 09:34:56 ERROR Executor: Exception in task 5.0 in stage 90.0 (TID 468)
> org.apache.spark.SparkNoSuchElementException: Key 1994-02-01 does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.{code}
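In pandas, casting to a {{CategoricalDtype}} maps values outside the categories to missing (code -1), while ANSI strict lookup raises on the missing key. A pure-Python sketch of the pandas-style mapping (the helper is hypothetical):

```python
def astype_categorical(values, categories):
    """Sketch of pandas-style categorical casting: map each value to its
    category code, and use -1 (missing) for values not in the categories.
    ANSI strict lookup raises SparkNoSuchElementException instead."""
    codes = {c: i for i, c in enumerate(categories)}
    return [codes.get(v, -1) for v in values]

print(astype_categorical(["1994-01-31", "1994-02-01"], ["a", "b", "c"]))
# [-1, -1] -> no category matches, so all values become missing
```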
> Beyond these examples, any function whose underlying SQL expression behaves differently under ANSI options may fail unexpectedly.
> So we might need to show a proper warning message when creating a pandas-on-Spark session.
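A minimal sketch of the proposed check, using the standard {{warnings}} module with a plain dict standing in for the Spark conf (the function name and message are illustrative, not the merged implementation):

```python
import warnings

def warn_if_ansi_enabled(conf):
    """Sketch: when creating a pandas-on-Spark session, warn if ANSI
    mode is on, since pandas API on Spark follows pandas semantics.
    `conf` is a plain dict here, standing in for the Spark conf."""
    if conf.get("spark.sql.ansi.enabled", "false").lower() == "true":
        warnings.warn(
            "'spark.sql.ansi.enabled' is set to true; pandas API on Spark "
            "follows pandas semantics, so some operations may raise "
            "unexpected exceptions.",
            UserWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_ansi_enabled({"spark.sql.ansi.enabled": "true"})
print(len(caught))  # 1
```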
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org