Posted to issues@spark.apache.org by "ss (Jira)" <ji...@apache.org> on 2022/02/04 16:38:00 UTC

[jira] [Updated] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1

     [ https://issues.apache.org/jira/browse/SPARK-38109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ss updated SPARK-38109:
-----------------------
    Description: 
The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the case of the column names must exactly match the schema or the replacements will not take place. In earlier versions (3.1.2 was tested) the column names are matched case-insensitively.

Minimal example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
    [['wrong', 'wrong']],
    schema=['case_matched', 'case_unmatched'],
)
# Only the second subset entry differs in case from the schema.
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
```
In pyspark 3.2 (tested 3.2.0 and 3.2.1 via pip on Windows, and 3.2.0 on Databricks) the result is:

|case_matched|case_unmatched|
|-|-|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on Windows and on Databricks) the result is:

|case_matched|case_unmatched|
|-|-|
|right|right|

I believe the expected behaviour is the one shown in pyspark 3.1, since in all other situations column names are accepted in a case-insensitive manner.
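
As a possible workaround until this is resolved (a minimal sketch, not part of the report or of PySpark itself; the helper name `replace_case_insensitive` is made up here), the `subset` names can be normalised against the exact names in `df.columns` before calling `replace()`:

```python
# Hypothetical workaround sketch (not a PySpark API): map the requested
# subset names onto the exactly-cased names present in the schema.
def replace_case_insensitive(df, to_replace, subset):
    by_lower = {c.lower(): c for c in df.columns}
    exact = [by_lower.get(c.lower(), c) for c in subset]
    return df.replace(to_replace, subset=exact)

# With the example above, both columns should be replaced on 3.2 as well.
df2 = replace_case_insensitive(df, replace_dict, ['case_matched', 'Case_Unmatched'])
```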


> pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38109
>                 URL: https://issues.apache.org/jira/browse/SPARK-38109
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0, 3.2.1
>            Reporter: ss
>            Priority: Minor
>


