Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/09/03 12:08:46 UTC

[jira] [Assigned] (SPARK-10428) Struct fields read from parquet are mis-aligned

     [ https://issues.apache.org/jira/browse/SPARK-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10428:
------------------------------------

    Assignee: Apache Spark

> Struct fields read from parquet are mis-aligned
> -----------------------------------------------
>
>                 Key: SPARK-10428
>                 URL: https://issues.apache.org/jira/browse/SPARK-10428
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Apache Spark
>            Priority: Critical
>
> {code}
> val df1 = sqlContext
>   .range(1)
>   .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
>   .coalesce(1)
> val df2 = sqlContext
>   .range(1, 2)
>   .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) AS s")
>   .coalesce(1)
> df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
> df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
> {code}
> {code}
> sqlContext.read.option("mergeSchema", "true").parquet("/home/yin/sc_11_minimal/").selectExpr("s.a", "s.b", "s.c", "s.d", "p").show
> +---+---+----+----+---+
> |  a|  b|   c|   d|  p|
> +---+---+----+----+---+
> |  0|  3|null|null|  1|
> |  1|  2|   3|   4|  2|
> +---+---+----+----+---+
> {code}
> The problem appears to be at https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204: we pad the row when the global schema has more struct fields than the local Parquet file's schema. However, when we read a field from Parquet, we still use the file's local schema, so the value of {{d}} is put into the wrong slot. In the output above, the p=1 row should read a=0, b=null, c=null, d=3, but the value of {{d}} shows up under {{b}}.
> I tried master, and this issue appears to be resolved by https://github.com/apache/spark/pull/8509. We need to decide whether to backport that to branch-1.5.
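
The misalignment described above can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the code from CatalystRowConverter or from the pull request: the object name, the two methods, and the use of Option in place of nullable columns are all hypothetical and exist only for illustration. It assumes the merged (global) struct schema is (a, b, c, d) and the p=1 file's local schema is (a, d), as in the reproduction.

```scala
object StructPaddingSketch {
  // Hypothetical stand-ins for the two schemas in the reproduction.
  val globalFields: Seq[String] = Seq("a", "b", "c", "d") // merged schema
  val localFields: Seq[String] = Seq("a", "d")            // schema of the p=1 file

  // Misaligned variant: each value is stored at its ordinal in the file-local
  // schema, and the remaining slots of the global-width row stay empty.
  def readByLocalOrdinal(values: Seq[Long]): List[Option[Long]] = {
    val row = Array.fill[Option[Long]](globalFields.length)(None)
    values.zipWithIndex.foreach { case (v, localOrdinal) =>
      row(localOrdinal) = Some(v)
    }
    row.toList
  }

  // Aligned variant: each local field is looked up by name in the global
  // schema, and its value is stored at the global ordinal.
  def readByGlobalOrdinal(values: Seq[Long]): List[Option[Long]] = {
    val row = Array.fill[Option[Long]](globalFields.length)(None)
    localFields.zip(values).foreach { case (name, v) =>
      row(globalFields.indexOf(name)) = Some(v)
    }
    row.toList
  }

  def main(args: Array[String]): Unit = {
    // The p=1 file holds a = 0 and d = 3.
    println(readByLocalOrdinal(Seq(0L, 3L)))  // d's value lands in b's slot
    println(readByGlobalOrdinal(Seq(0L, 3L))) // d's value lands in d's slot
  }
}
```

With the values from the p=1 file, the first method reproduces the row shown in the output above (a=0, b=3, c and d empty), while the second leaves b and c empty and places 3 under d. Matching by field name is only one way to align the slots and may differ from what the actual fix does.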



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
