Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/08/02 02:35:00 UTC
[jira] [Assigned] (SPARK-39942) The input parameter of nsmallest should be validated as Integer
[ https://issues.apache.org/jira/browse/SPARK-39942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-39942:
------------------------------------
Assignee: Apache Spark
> The input parameter of nsmallest should be validated as Integer
> ---------------------------------------------------------------
>
> Key: SPARK-39942
> URL: https://issues.apache.org/jira/browse/SPARK-39942
> Project: Spark
> Issue Type: Bug
> Components: Pandas API on Spark
> Affects Versions: 3.2.2
> Environment: PySpark: Master
> Reporter: bo zhao
> Assignee: Apache Spark
> Priority: Minor
>
> The input parameter of nsmallest should be validated as an integer, but this validation appears to be missing.
> As a result, PySpark raises an AnalysisException from the query analyzer, rather than a clear type error, when a value of an unexpected type is passed. For example:
>
> PySpark:
> {code:java}
> >>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
> >>> df.groupby(['A'])['B'].nsmallest(1)
> A
> 1  0    3
> 2  1    4
> 3  2    5
> 4  3    6
> Name: B, dtype: int64
> >>> df.groupby(['A'])['B'].nsmallest(True)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest
>     sdf.withColumn(temp_rank_column, F.row_number().over(window))
>   File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter
>     jdf = self._jdf.filter(condition._jc)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
>     return_value = get_return_value(
>   File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).;
> 'Filter (__rank__#4995 <= true)
> +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995]
>    +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995]
>       +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST]
>          +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
>             +- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
>                +- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L]
>                   +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false
> {code}
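
The validation requested above could look something like the following sketch. It is not the actual Spark patch: the helper name `validate_n` and the error message are hypothetical, chosen only for illustration. One detail worth noting is that `bool` is a subclass of `int` in Python, so a plain `isinstance(n, int)` check would still accept `True`; the sketch therefore rejects booleans explicitly.

```python
def validate_n(n):
    """Hypothetical helper: accept only plain integers for nsmallest's argument."""
    # bool is a subclass of int in Python, so isinstance(True, int) is True.
    # Reject bool explicitly, then reject every other non-int type.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError(
            "n must be an integer, but got {}".format(type(n).__name__)
        )
    return n
```

Calling such a check at the top of nsmallest would make `nsmallest(True)` fail early with a TypeError raised in Python, before any Spark plan is built. Whether booleans should be rejected or silently treated as 0 and 1 is a design choice for the maintainers.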
--
This message was sent by Atlassian Jira
(v8.20.10#820010)