Posted to user@spark.apache.org by Sam Goodwin <sa...@gmail.com> on 2016/11/05 00:11:40 UTC
Upgrading to Spark 2.0.1 broke array in parquet DataFrame
I have a table with a few columns, some of which are arrays. Since
upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always null
when reading in a DataFrame.
When writing the Parquet files, the schema of the column is specified as
StructField("packageIds",ArrayType(StringType))
The schema of the column in the Hive Metastore is
packageIds array<string>
The schema used in the writer exactly matches the schema in the Metastore
in all ways (field order, casing, types, etc.).
The query is a simple "select *"
spark.sql("select * from tablename limit 1").collect() // null columns in Row
How can I begin debugging this issue? Notable things I've already
investigated:
- The files were written using Spark 1.6
- The DataFrame reads correctly in Spark 1.5 and 1.6
- I've inspected the Parquet files using parquet-tools and can see the
data
- Another table written in exactly the same way doesn't have the issue
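
For what it's worth, here are a couple of spark-shell checks that might help
narrow this down. This is only a sketch, assuming a live SparkSession named
spark; the path "/path/to/tablename" is a placeholder for wherever the table's
files live:

```scala
// Read the Parquet files directly, bypassing the Hive Metastore: if the
// arrays come back non-null here, the problem is in the Metastore
// conversion path rather than in the Parquet reader itself.
val direct = spark.read.parquet("/path/to/tablename")
direct.select("packageIds").show(1, truncate = false)

// Compare the schema Spark infers from the files with the one the
// Metastore reports for the table.
direct.printSchema()
spark.table("tablename").printSchema()

// Spark 1.x wrote arrays in Parquet's legacy 2-level list layout.
// If that turns out to be the mismatch, this setting makes Spark 2.x
// write the legacy layout again for newly written files:
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// You can also force Spark to read the table through Hive's serde
// instead of its native Parquet reader, to see whether the two paths
// disagree:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("select * from tablename limit 1").collect()
```

If the direct read works but the Metastore-backed read doesn't, that at least
points at the table's stored schema or the conversion path rather than the
files themselves.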
Re: Upgrading to Spark 2.0.1 broke array in parquet DataFrame
Posted by Michael Armbrust <mi...@databricks.com>.
If you can reproduce the issue with Spark 2.0.2 I'd suggest opening a JIRA.