Posted to issues@spark.apache.org by "Franklyn Dsouza (JIRA)" <ji...@apache.org> on 2016/03/24 17:08:25 UTC

[jira] [Created] (SPARK-14117) write.partitionBy retains partitioning column when outputting Parquet

Franklyn Dsouza created SPARK-14117:
---------------------------------------

             Summary: write.partitionBy retains partitioning column when outputting Parquet
                 Key: SPARK-14117
                 URL: https://issues.apache.org/jira/browse/SPARK-14117
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Franklyn Dsouza
            Priority: Minor


When writing a DataFrame as Parquet using partitionBy on the writer to generate multiple output folders, the resulting Parquet files still contain the partitioning column as a regular data column.

Here's a simple example:

from pyspark.sql import Row

# build a small DataFrame with a partitioning column (Spark 1.6: sqlContext)
df = sqlContext.createDataFrame([
  Row(a="folder 1 message 1", folder="folder1"),
  Row(a="folder 1 message 2", folder="folder1"),
  Row(a="folder 1 message 3", folder="folder1"),
  Row(a="folder 2 message 1", folder="folder2"),
  Row(a="folder 2 message 2", folder="folder2"),
  Row(a="folder 2 message 3", folder="folder2"),
])

df.write.partitionBy('folder').parquet('output')

produces the following output :-

+------------------+-------+
|                 a| folder|
+------------------+-------+
|folder 2 message 1|folder2|
+------------------+-------+
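
(For context, the table above is presumably what reading the partitioned output back shows; a minimal sketch of that read-back, assuming a Spark 1.6 sqlContext in the same shell:)

# read the partitioned Parquet output back and display it
sqlContext.read.parquet('output').show()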

whereas df.write.partitionBy('folder').json('output')

produces :-


{"a":"folder 2 message 1"}

without the partitioning column.
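
(One way to check what is physically in the output is to read the raw part files under a single partition directory; a sketch, assuming the usual folder=<value> directory layout that partitionBy produces, with an illustrative glob for the part file names:)

# print the first raw JSON line written under the folder2 partition directory
print(sc.textFile('output/folder=folder2/part-*').first())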

I'm assuming this is a bug because of the inconsistent behaviour between the two output formats.


