Posted to issues@spark.apache.org by "bo zhao (Jira)" <ji...@apache.org> on 2022/08/02 02:33:00 UTC
[jira] [Created] (SPARK-39942) The input parameter of nsmallest should be validated as Integer
bo zhao created SPARK-39942:
-------------------------------
Summary: The input parameter of nsmallest should be validated as Integer
Key: SPARK-39942
URL: https://issues.apache.org/jira/browse/SPARK-39942
Project: Spark
Issue Type: Bug
Components: Pandas API on Spark
Affects Versions: 3.2.2
Environment: PySpark: Master
Reporter: bo zhao
The input parameter of nsmallest should be validated as an integer, but this validation appears to be missing.
As a result, PySpark raises an AnalysisException when an unexpected type, such as a boolean, is passed. For example:
PySpark:
{code:java}
>>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
>>> df.groupby(['A'])['B'].nsmallest(1)
A
1  0    3
2  1    4
3  2    5
4  3    6
Name: B, dtype: int64
>>> df.groupby(['A'])['B'].nsmallest(True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest
    sdf.withColumn(temp_rank_column, F.row_number().over(window))
  File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).;
'Filter (__rank__#4995 <= true)
+- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995]
   +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995]
      +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST]
         +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
            +- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
               +- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L]
                  +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false
{code}
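One possible shape for the missing check is sketched below. This is only an illustration, not the fix that was applied in Spark: the helper name validate_n is hypothetical, and the choice to reject booleans explicitly is an assumption based on the failing call above. Because bool is a subclass of int in Python, a plain isinstance(n, int) check would still let True through, so the sketch tests for bool first.

```python
import numbers


def validate_n(n):
    """Hypothetical helper: return n if it is a true integer, else raise TypeError.

    bool is rejected explicitly because isinstance(True, int) is True in Python.
    """
    if isinstance(n, bool) or not isinstance(n, numbers.Integral):
        raise TypeError(
            "n must be an integer, but got %s" % type(n).__name__
        )
    return int(n)
```

Calling such a helper at the start of nsmallest would turn the AnalysisException shown above into an early TypeError raised on the Python side, before any Spark plan is built.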