Posted to user@spark.apache.org by Sam Goodwin <sa...@gmail.com> on 2016/11/05 00:11:40 UTC

Upgrading to Spark 2.0.1 broke array in parquet DataFrame

I have a table with a few columns, some of which are arrays. Since
upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always null
when read into a DataFrame.

When writing the Parquet files, the schema of the column is specified as

StructField("packageIds", ArrayType(StringType))
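
For context, the write (done in Spark 1.6) looks roughly like this; the
path, the id column, and the sample data are placeholders, only
packageIds is real:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Placeholder schema: only packageIds matches the real table
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("packageIds", ArrayType(StringType))
))

// Placeholder data
val rdd = sc.parallelize(Seq(Row("a", Seq("pkg1", "pkg2"))))

// Spark 1.6 API, since the files were written with 1.6
sqlContext.createDataFrame(rdd, schema)
  .write.parquet("/warehouse/tablename")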

The schema of the column in the Hive Metastore is

packageIds array<string>
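
so the full Metastore definition is along these lines (table name,
location, and the other column are placeholders):

CREATE EXTERNAL TABLE tablename (
  id string,
  packageIds array<string>
)
STORED AS PARQUET
LOCATION '/warehouse/tablename';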

The schema used in the writer exactly matches the schema in the Metastore
in all ways (order, casing, types, etc.).

The query is a simple "select *"

spark.sql("select * from tablename limit 1").collect() // null columns in Row

How can I begin debugging this issue? Notable things I've already
investigated:

   - The files were written using Spark 1.6.
   - The DataFrame reads correctly in Spark 1.5 and 1.6.
   - I've inspected the Parquet files using parquet-tools (see the
   commands after this list) and can see the data.
   - I also have another table, written in exactly the same way, that
   doesn't have the issue.
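
The parquet-tools inspection was along these lines (file path is a
placeholder):

# Print the schema stored in the file footer
parquet-tools schema /warehouse/tablename/part-00000.parquet

# Dump the first record to confirm the array data is present
parquet-tools head -n 1 /warehouse/tablename/part-00000.parquet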

Re: Upgrading to Spark 2.0.1 broke array in parquet DataFrame

Posted by Michael Armbrust <mi...@databricks.com>.
If you can reproduce the issue with Spark 2.0.2, I'd suggest opening a JIRA.
