Posted to issues@spark.apache.org by "Christos Mantas (JIRA)" <ji...@apache.org> on 2018/01/11 10:07:00 UTC

[jira] [Updated] (SPARK-23041) Inconsistent `drop`ing of columns in dataframes

     [ https://issues.apache.org/jira/browse/SPARK-23041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christos Mantas updated SPARK-23041:
------------------------------------
    Description: 
There is a known bug, [SPARK-13493|https://issues.apache.org/jira/browse/SPARK-13493], when reading JSON files whose keys differ only in case. If, e.g., a file contains both "test" and "TEST", Catalyst complains on some occasions (e.g. when writing to a Parquet file or creating an RDD from the DataFrame) with an error like this:

org.apache.spark.sql.AnalysisException: Reference 'TEST' is ambiguous, could be: TEST#55L, TEST#57L.;

This issue is not about that error itself, but about a peculiar side effect related to it: in cases like the above, dropping the offending columns has no effect. It is easy to reproduce with the following PySpark snippet:

{code:python}
import pyspark
from pyspark.sql import SparkSession

sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)

# Write a two-record JSON file whose keys differ only in case
fname = '/tmp/test.json'
with open(fname, "w") as text_file:
    text_file.write("{\"test\":1, \"cool\": 3}\n{\"TEST\": 2, \"cool\": 4}")

df = spark.read.json(fname)

# Drop both conflicting columns; only 'cool' should remain
df_d = df.drop("test").drop("TEST")
print(df_d.schema.names)

# Any action that triggers analysis still raises the AnalysisException
df_d.rdd
{code}

This prints ['cool'], but evaluating df_d.rdd still raises the exception above, meaning the analyzer tries to resolve columns that have already been dropped. The same happens when, e.g., you try to save the DataFrame to a Parquet file.
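
Below is a minimal sketch of a possible workaround (not a fix for the reported inconsistency), assuming the same /tmp/test.json file as above: enabling Spark's spark.sql.caseSensitive setting makes "test" and "TEST" resolve as distinct columns, so the ambiguity should not arise in the first place. The output path is hypothetical.

{code:python}
import pyspark
from pyspark.sql import SparkSession

sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)

# Assumption: with case-sensitive resolution, 'test' and 'TEST' are
# distinct columns, so neither reference is ambiguous.
spark.conf.set("spark.sql.caseSensitive", "true")

df = spark.read.json('/tmp/test.json')
df_d = df.drop("test").drop("TEST")
print(df_d.schema.names)                # ['cool']
df_d.rdd                                # analysis now succeeds
df_d.write.parquet('/tmp/out.parquet')  # hypothetical path
{code}

This only sidesteps the ambiguity, of course; with the default case-insensitive resolution, the dropped columns should arguably not be considered by the analyzer at all.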



> Inconsistent `drop`ing of columns in dataframes
> -----------------------------------------------
>
>                 Key: SPARK-23041
>                 URL: https://issues.apache.org/jira/browse/SPARK-23041
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Christos Mantas
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h


