Posted to issues@spark.apache.org by "bo zhao (Jira)" <ji...@apache.org> on 2022/08/02 02:33:00 UTC
[jira] [Created] (SPARK-39942) The input parameter of nsmallest should be validated as Integer

bo zhao created SPARK-39942:
-------------------------------

             Summary: The input parameter of nsmallest should be validated as Integer
                 Key: SPARK-39942
                 URL: https://issues.apache.org/jira/browse/SPARK-39942
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.2.2
         Environment: PySpark: Master
            Reporter: bo zhao


The input parameter of nsmallest should be validated as an integer, but this validation currently appears to be missing.

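As a result, PySpark raises an AnalysisException from the query analyzer, rather than a clear type error, when the argument has an unexpected type such as a boolean. For example:
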
PySpark:
{code:python}
>>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B']) 
>>> df.groupby(['A'])['B'].nsmallest(1)
A
1  0    3
2  1    4
3  2    5
4  3    6
Name: B, dtype: int64
>>> df.groupby(['A'])['B'].nsmallest(True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest
    sdf.withColumn(temp_rank_column, F.row_number().over(window))
  File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).;
'Filter (__rank__#4995 <= true)
+- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995]
   +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995]
      +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST]
         +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
            +- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
               +- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L]
                  +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false


{code}
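
One possible shape for the missing check is sketched below. This is an illustration only, not the actual patch: the helper name validate_n and the error message are hypothetical, and the argument is assumed to be called n as in the pandas API. Note that bool is a subclass of int in Python, so a plain isinstance(n, int) test would still accept True; the sketch therefore rejects bool explicitly.

```python
def validate_n(n):
    """Hypothetical helper: return n if it is a plain integer, else raise TypeError."""
    # bool is a subclass of int, so it has to be rejected before the int check.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError("n must be an integer, got %s" % type(n).__name__)
    return n


# Example: a valid integer passes through, a boolean is rejected early.
validate_n(1)
try:
    validate_n(True)
except TypeError as exc:
    print(exc)
```

With a check like this at the top of nsmallest, the call nsmallest(True) would fail immediately with a TypeError instead of reaching the Spark analyzer.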


