Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/08/02 02:35:00 UTC

[jira] [Assigned] (SPARK-39942) The input parameter of nsmallest should be validated as Integer

     [ https://issues.apache.org/jira/browse/SPARK-39942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39942:
------------------------------------

    Assignee: Apache Spark

> The input parameter of nsmallest should be validated as Integer
> ---------------------------------------------------------------
>
>                 Key: SPARK-39942
>                 URL: https://issues.apache.org/jira/browse/SPARK-39942
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.2.2
>         Environment: PySpark: Master
>            Reporter: bo zhao
>            Assignee: Apache Spark
>            Priority: Minor
>
> The input parameter of nsmallest should be validated as an integer, but this validation appears to be missing.
> As a result, PySpark raises a confusing AnalysisException instead of a clear TypeError when an unexpected type is passed. For example:
>  
> PySpark:
> {code:java}
> >>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B']) 
> >>> df.groupby(['A'])['B'].nsmallest(1)
>  A    
> 1  0    3 
> 2  1    4 
> 3  2    5 
> 4  3    6 
> Name: B, dtype: int64
> >>> df.groupby(['A'])['B'].nsmallest(True)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest
>     sdf.withColumn(temp_rank_column, F.row_number().over(window))
>   File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter
>     jdf = self._jdf.filter(condition._jc)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
>     return_value = get_return_value(
>   File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).;
> 'Filter (__rank__#4995 <= true)
> +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995]
>    +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995]
>       +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST]
>          +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
>             +- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
>                +- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L]
>                   +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false
> {code}
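> A minimal sketch of the kind of up-front check that could catch this (the helper name {{_validate_n}} is hypothetical, not an existing pandas-on-Spark utility; the actual fix may route through the library's own type-checking helpers). Note that {{bool}} is a subclass of {{int}} in Python, so {{True}} must be rejected explicitly:
> {code:python}
> def _validate_n(n):
>     # Reject bool first: isinstance(True, int) is True in Python,
>     # so a plain isinstance(n, int) check would let True slip through.
>     if isinstance(n, bool) or not isinstance(n, int):
>         raise TypeError(
>             "n must be an integer, got %s" % type(n).__name__
>         )
>     return n
> {code}
> With such a check at the top of nsmallest, {{df.groupby(['A'])['B'].nsmallest(True)}} would fail fast with a TypeError rather than surfacing the internal {{__rank__ <= true}} analysis error shown above.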



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org