Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/01/14 03:36:00 UTC
[jira] [Commented] (SPARK-23041) Inconsistent `drop`ing of columns in dataframes
[ https://issues.apache.org/jira/browse/SPARK-23041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325434#comment-16325434 ]
Hyukjin Kwon commented on SPARK-23041:
--------------------------------------
It looks like this reproducer now throws the error below, due to SPARK-20460. Just double-checked:
{code}
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `test`;
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:410)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:345)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:291)
... 48 elided
{code}
I think this now throws the exception up front and fails fast. Failing before allowing further operations makes sense, since those operations would otherwise hit a bug that is hard to debug, like this one. Case sensitivity is a separate issue, filed as SPARK-13493.
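To illustrate the fail-fast behavior, here is a minimal Python sketch of the kind of duplicate-name check that {{SchemaUtils.checkColumnNameDuplication}} performs (the function name and error text below are illustrative, not Spark's actual Scala code): under the default case-insensitive resolution, "test" and "TEST" normalize to the same key, so the read is rejected before a broken DataFrame can be created.

{code:python}
def check_column_name_duplication(names, case_sensitive=False):
    """Raise early if two column names collide under the active resolution mode."""
    seen = set()
    dupes = set()
    for name in names:
        # Case-insensitive resolution (Spark's default) compares lowercased names.
        key = name if case_sensitive else name.lower()
        if key in seen:
            dupes.add(name)
        seen.add(key)
    if dupes:
        raise ValueError("Found duplicate column(s) in the data schema: "
                         + ", ".join("`%s`" % d for d in sorted(dupes)))

check_column_name_duplication(["test", "cool"])          # fine
check_column_name_duplication(["test", "TEST", "cool"])  # raises ValueError
{code}

With {{case_sensitive=True}} the same pair of names would pass the check, which is why the case-sensitivity question is tracked separately.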
> Inconsistent `drop`ing of columns in dataframes
> -----------------------------------------------
>
> Key: SPARK-23041
> URL: https://issues.apache.org/jira/browse/SPARK-23041
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Christos Mantas
> Priority: Minor
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> There is a known bug, [SPARK-13493](https://issues.apache.org/jira/browse/SPARK-13493), when reading JSON files whose keys differ only in case. If, e.g., the file contains both "test" and "TEST", Catalyst will complain on some occasions (e.g. when writing to a Parquet file or creating an RDD from the DataFrame) with an error like this:
> org.apache.spark.sql.AnalysisException: Reference 'TEST' is ambiguous, could be: TEST#55L, TEST#57L.;
> This bug report is not about that error, but about a very peculiar side effect related to it.
> In short, in cases like the above, dropping the offending columns has no effect.
> It is easy to replicate. Here is a PySpark snippet illustrating it:
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession
> sc = pyspark.SparkContext('local[*]')
> spark = SparkSession(sc)
> fname = '/tmp/test.json'
> with open(fname, "w") as text_file:
>     text_file.write("{\"test\":1, \"cool\": 3}\n{\"TEST\": 2, \"cool\": 4}")
> df = spark.read.json(fname)
> df_d = df.drop("test").drop("TEST")
> print(df_d.schema.names)
> df_d.rdd
> {code}
> This prints ['cool'], but also raises the aforementioned exception, meaning that Spark still tries to resolve columns that have already been dropped.
> The same happens when, e.g., you try to save the DataFrame to a Parquet file.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org