Posted to issues@spark.apache.org by "Xuzhou Qin (Jira)" <ji...@apache.org> on 2020/06/11 12:00:00 UTC

[jira] [Created] (SPARK-31968) write.partitionBy() cannot handle duplicated column

Xuzhou Qin created SPARK-31968:
----------------------------------

             Summary: write.partitionBy() cannot handle duplicated column
                 Key: SPARK-31968
                 URL: https://issues.apache.org/jira/browse/SPARK-31968
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.5
            Reporter: Xuzhou Qin


I recently noticed that if there are duplicated elements in the argument of write.partitionBy(), the same partition subdirectory will be created multiple times.

For example: 
{code:java}
import org.apache.spark.sql.{DataFrame, SaveMode}
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L),
  (2, "p1", "c2", 2L),
  (3, "p3", "c3", 3L),
  (3, "p2", "c3", 3L),
  (3, "p3", "c3", 3L)
).toDF("col1", "col2", "col3", "col4")

df.write
  .partitionBy("col1", "col1")  // we have "col1" twice
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
The above code will produce an output directory with this structure:
{code:java}
output_dir
  |
  |--col1=1
  |    |--col1=1
  |
  |--col1=2
  |    |--col1=2
  |
  |--col1=3
       |--col1=3{code}
And we won't be able to read the output back:
{code:java}
spark.read.csv("output_dir").show()
// Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `col1`;{code}
 

I am not sure whether partitioning a dataframe twice by the same column makes sense in real-world applications, but it will cause schema inference problems in tools like the AWS Glue crawler.

Should Spark deduplicate the partition columns itself, or maybe throw an exception when duplicated columns are detected?
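In the meantime, a user-side workaround is to deduplicate the column list before passing it to partitionBy(). A minimal sketch, assuming the same df and output path as above (partitionCols/dedupedCols are just illustrative names, not Spark API):

{code:java}
// Workaround sketch (not Spark behaviour): drop duplicated partition columns
// before handing them to partitionBy(), so each partition level appears only once.
val partitionCols = Seq("col1", "col1")    // possibly duplicated input
val dedupedCols = partitionCols.distinct   // Seq("col1")

df.write
  .partitionBy(dedupedCols: _*)            // expand the deduplicated list as varargs
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
With a single col1 level, spark.read.csv("output_dir") can then read the result back without the duplicate-column error.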

 

 


