Posted to issues@spark.apache.org by "Michel Lemay (JIRA)" <ji...@apache.org> on 2017/06/08 12:27:18 UTC
[jira] [Updated] (SPARK-21021) Reading partitioned parquet does not respect specified schema column order
[ https://issues.apache.org/jira/browse/SPARK-21021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michel Lemay updated SPARK-21021:
---------------------------------
Description:
When reading back a partitioned parquet folder with a user-specified schema, the column order of that schema is not respected: the partition columns are appended after the data columns instead of appearing where the schema places them.
Consider the following example:
{code:scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class Event(f1: String, f2: String, f3: String)

val df = Seq(Event("v1", "v2", "v3")).toDF
df.write.partitionBy("f1", "f2").parquet("out")

val schema: StructType = StructType(
  StructField("f1", StringType, true) ::
  StructField("f2", StringType, true) ::
  StructField("f3", StringType, true) :: Nil)

val dfRead = spark.read.schema(schema).parquet("out")
dfRead.show
+---+---+---+
| f3| f1| f2|
+---+---+---+
| v3| v1| v2|
+---+---+---+

dfRead.columns
Array[String] = Array(f3, f1, f2)

schema.fields
Array(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
{code}
This makes it hard to produce a consistent schema when reading from multiple sources.
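A possible workaround (a sketch, not part of the original report) is to reselect the columns in the schema's declared order after reading; `dfRead` and `schema` refer to the example above:

{code:scala}
import org.apache.spark.sql.functions.col

// Reorder the columns of the DataFrame to match the user-specified
// schema, regardless of where Spark placed the partition columns.
val dfOrdered = dfRead.select(schema.fieldNames.map(col): _*)
// dfOrdered.columns is now Array(f1, f2, f3)
{code}

This only fixes the column order of the result; it does not change the read path's behavior of appending partition columns.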
> Reading partitioned parquet does not respect specified schema column order
> --------------------------------------------------------------------------
>
> Key: SPARK-21021
> URL: https://issues.apache.org/jira/browse/SPARK-21021
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Michel Lemay
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)