Posted to issues@spark.apache.org by "Marcelo Rossini Castro (Jira)" <ji...@apache.org> on 2022/08/14 22:06:00 UTC

[jira] [Comment Edited] (SPARK-40063) pyspark.pandas .apply() changing rows ordering

    [ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579451#comment-17579451 ] 

Marcelo Rossini Castro edited comment on SPARK-40063 at 8/14/22 10:05 PM:
--------------------------------------------------------------------------

I can't share the real data because it contains sensitive information, but assume the DataFrame df looks like this:
||Col_1||Col_2||col_to_apply_function||
|1|Name1|10|
|2|Name2|15|
|3|Name3|20|
|4|Name4|25|

After applying the function, the results end up in the wrong rows, like this (correctly aligned, the values would be 100, 225, 400, 625):
||Col_1||Col_2||col_to_apply_function||
|1|Name1|400|
|2|Name2|625|
|3|Name3|225|
|4|Name4|100|

This does not happen with plain pandas, only with pyspark.pandas.
However, plain pandas is not practical for a DataFrame with millions of rows.
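
For reference, here is a minimal sketch of the expected, row-aligned result, written in plain pandas (where the reordering is reported not to occur). The data is the illustrative table above, not the real dataset, and the names original, expected and rows_in_order are only for illustration. The same check could presumably be adapted to pyspark.pandas to confirm the reordering, though that is not shown here.

```python
import pandas as pd


def example_func(value):
    # Same operation as in the issue description: square the value.
    return value ** 2


# Illustrative data copied from the table in this comment (not the real dataset).
df = pd.DataFrame({
    "Col_1": [1, 2, 3, 4],
    "Col_2": ["Name1", "Name2", "Name3", "Name4"],
    "col_to_apply_function": [10, 15, 20, 25],
})

# Keep the original values so the result can be checked row by row.
original = df["col_to_apply_function"].copy()

# Row-wise apply, overwriting the same column, as in the issue description.
df["col_to_apply_function"] = df.apply(
    lambda row: example_func(row["col_to_apply_function"]), axis=1
)

# Row-aligned expectation: each row holds the square of its own original value.
expected = original ** 2
rows_in_order = df["col_to_apply_function"].tolist() == expected.tolist()

print(df)
print("rows in order:", rows_in_order)
```

With plain pandas the column comes back as 100, 225, 400, 625, in the same row order as the input.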



> pyspark.pandas .apply() changing rows ordering
> ----------------------------------------------
>
>                 Key: SPARK-40063
>                 URL: https://issues.apache.org/jira/browse/SPARK-40063
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.3.0
>         Environment: Databricks Runtime 11.1
>            Reporter: Marcelo Rossini Castro
>            Priority: Minor
>              Labels: Pandas, PySpark
>
> When apply is used to apply a function to a DataFrame column, the resulting values come back in a different row order.
> For example, with a command like this:
> {code:python}
> def example_func(df_col):
>     return df_col ** 2
>
> df['col_to_apply_function'] = df.apply(lambda row: example_func(row['col_to_apply_function']), axis=1)
> {code}
> A workaround is to assign the results to a new column instead of the same one, but if the old column is dropped, the same error is produced.
> Setting one of the columns as the index did not help either.


