Posted to issues@spark.apache.org by "Marco Gaido (JIRA)" <ji...@apache.org> on 2018/02/08 11:58:00 UTC

[jira] [Commented] (SPARK-23041) Inconsistent `drop`ing of columns in dataframes

    [ https://issues.apache.org/jira/browse/SPARK-23041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356850#comment-16356850 ] 

Marco Gaido commented on SPARK-23041:
-------------------------------------

Yes, I am also unable to reproduce this problem on the master branch.

> Inconsistent `drop`ing of columns in dataframes
> -----------------------------------------------
>
>                 Key: SPARK-23041
>                 URL: https://issues.apache.org/jira/browse/SPARK-23041
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Christos Mantas
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> There is a known bug [SPARK-13493](https://issues.apache.org/jira/browse/SPARK-13493) when reading JSON files whose field names differ only in case. If, e.g., the file contains both "test" and "TEST", Catalyst will complain on some occasions (e.g. when writing to a Parquet file or creating an RDD from the DataFrame) with an error like this:
> org.apache.spark.sql.AnalysisException: Reference 'TEST' is ambiguous, could be: TEST#55L, TEST#57L.;
> This bug is not about that error, but about a very peculiar side effect related to it.
> In short, in cases like the above, dropping the offending columns has no effect.
> It is very easy to replicate. Here is a PySpark snippet illustrating it:
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession
> sc = pyspark.SparkContext('local[*]')
> spark = SparkSession(sc)
> fname = '/tmp/test.json'
> with open(fname, "w") as text_file:
>     text_file.write("{\"test\":1, \"cool\": 3}\n{\"TEST\": 2, \"cool\": 4}")
> df = spark.read.json(fname)
> df_d = df.drop("test").drop("TEST")
> print(df_d.schema.names)
> df_d.rdd
> {code}
> This will print ['cool'], but it will also raise the aforementioned exception, meaning the analyzer still tries to resolve columns that have actually been dropped.
> The same happens when, e.g., you try to save the DataFrame to a Parquet file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org