Posted to issues@spark.apache.org by "Franklyn Dsouza (JIRA)" <ji...@apache.org> on 2016/03/24 17:08:25 UTC
[jira] [Created] (SPARK-14117) write.partitionBy retains partitioning column when outputting Parquet
Franklyn Dsouza created SPARK-14117:
---------------------------------------
Summary: write.partitionBy retains partitioning column when outputting Parquet
Key: SPARK-14117
URL: https://issues.apache.org/jira/browse/SPARK-14117
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.1
Reporter: Franklyn Dsouza
Priority: Minor
When writing a DataFrame as Parquet using partitionBy on the writer to generate multiple output folders, the resulting Parquet data files still contain the partitioning column.
Here's a simple example:
from pyspark.sql import Row

df = sqlContext.createDataFrame([
    Row(a="folder 1 message 1", folder="folder1"),
    Row(a="folder 1 message 2", folder="folder1"),
    Row(a="folder 1 message 3", folder="folder1"),
    Row(a="folder 2 message 1", folder="folder2"),
    Row(a="folder 2 message 2", folder="folder2"),
    Row(a="folder 2 message 3", folder="folder2"),
])
df.write.partitionBy('folder').parquet('output')
Reading back one of the resulting Parquet files shows the partitioning column in the data:
+------------------+-------+
|                 a| folder|
+------------------+-------+
|folder 2 message 1|folder2|
+------------------+-------+
whereas df.write.partitionBy('folder').json('output') produces:
{"a":"folder 2 message 1"}
without the partitioning column.
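For context (my understanding of the partitioned-output layout, not a reading of Spark's source): both writers use Hive-style key=value partition directories, so the partition column is fully recoverable from the file path at read time and does not need to be stored inside the data files. A minimal pure-Python sketch of that encoding, with hypothetical helper names:

```python
import os

def partition_path(base, row, partition_col):
    """Build a Hive-style output path: the partition column moves into
    the directory name and is dropped from the row's data."""
    data = dict(row)
    value = data.pop(partition_col)
    return os.path.join(base, "%s=%s" % (partition_col, value)), data

def recover_partition(file_path):
    """Recover the partition column from the key=value directory name."""
    key, _, value = os.path.basename(os.path.dirname(file_path)).partition("=")
    return {key: value}

row = {"a": "folder 2 message 1", "folder": "folder2"}
dirname, data = partition_path("output", row, "folder")
# dirname is "output/folder=folder2" (POSIX path separator assumed)
# data is {"a": "folder 2 message 1"} -- no partition column stored
recovered = recover_partition(os.path.join(dirname, "part-00000.parquet"))
# recovered is {"folder": "folder2"}, rebuilt from the path alone
```

Under that layout, the JSON writer's behaviour (dropping the column from the files) is the expected one, and the Parquet writer storing it redundantly is the anomaly.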
I'm assuming this is a bug, given the inconsistent behaviour between the two formats.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org