Posted to dev@spark.apache.org by "Long, Andrew" <lo...@amazon.com.INVALID> on 2018/08/22 17:16:08 UTC

Spark data quality bug when reading parquet files from hive metastore

Hello Friends,

I’ve encountered a bug where Spark silently corrupts data when reading from a Parquet-backed Hive table whose table schema does not match the file schema. I’d like to take a shot at adding some extra validation to the code to handle this corner case, and I was wondering if anyone had suggestions for where to start looking in the Spark code.
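
For illustration only, the kind of validation described above could be sketched roughly as follows. This is not Spark code: the dictionary-based schema representation and the names `find_schema_mismatches` and `validate_or_fail` are hypothetical, and a real check would have to work against Spark's own schema types and decide how to treat case sensitivity and compatible type widening. The point is simply to fail loudly instead of returning wrong values.

```python
from typing import Dict, List

# Hypothetical, simplified schema representation: column name -> type name.
Schema = Dict[str, str]


def find_schema_mismatches(table_schema: Schema, file_schema: Schema) -> List[str]:
    """Describe every column whose table definition disagrees with the file."""
    problems = []
    for name, table_type in table_schema.items():
        if name not in file_schema:
            # Assumption: a column missing from the file is reported, not null-filled.
            problems.append(f"column '{name}' is missing from the file schema")
        elif file_schema[name] != table_type:
            problems.append(
                f"column '{name}' is {table_type} in the table "
                f"but {file_schema[name]} in the file"
            )
    return problems


def validate_or_fail(table_schema: Schema, file_schema: Schema) -> None:
    """Raise instead of silently reading data under the wrong schema."""
    problems = find_schema_mismatches(table_schema, file_schema)
    if problems:
        raise ValueError("schema mismatch: " + "; ".join(problems))


if __name__ == "__main__":
    table = {"id": "bigint", "price": "double"}
    parquet_file = {"id": "bigint", "price": "string"}
    print(find_schema_mismatches(table, parquet_file))
```

Comparing by exact name and exact type string is the strictest possible rule; it is chosen here only because it keeps the sketch short.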

Cheers Andrew

Re: Spark data quality bug when reading parquet files from hive metastore

Posted by "Long, Andrew" <lo...@amazon.com.INVALID>.
Thanks Fokko,

I will definitely take a look at this.

Cheers Andrew

From: "Driesprong, Fokko" <fo...@driesprong.frl>
Date: Friday, August 24, 2018 at 2:39 AM
To: "reubensawyer@hotmail.com" <re...@hotmail.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>
Subject: Re: Spark data quality bug when reading parquet files from hive metastore

Hi Andrew,

This blog post gives an idea of how the schema is resolved:
https://blog.godatadriven.com/multiformat-spark-partition
There is some optimisation going on when reading Parquet using Spark. Hope this helps.

Cheers, Fokko


On Wed, 22 Aug 2018 at 23:59, t4 <re...@hotmail.com> wrote:
https://issues.apache.org/jira/browse/SPARK-23576 ?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark data quality bug when reading parquet files from hive metastore

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Hi Andrew,

This blog post gives an idea of how the schema is resolved:
https://blog.godatadriven.com/multiformat-spark-partition
There is some optimisation going on when reading Parquet using Spark. Hope this helps.

Cheers, Fokko


On Wed, 22 Aug 2018 at 23:59, t4 <re...@hotmail.com> wrote:

> https://issues.apache.org/jira/browse/SPARK-23576 ?

Re: Spark data quality bug when reading parquet files from hive metastore

Posted by t4 <re...@hotmail.com>.
https://issues.apache.org/jira/browse/SPARK-23576 ?


