Posted to dev@spark.apache.org by "Long, Andrew" <lo...@amazon.com.INVALID> on 2018/08/22 17:16:08 UTC
Spark data quality bug when reading parquet files from hive metastore
Hello Friends,
I’ve encountered a bug where Spark silently corrupts data when reading from a Parquet-backed Hive table whose table schema does not match the file schema. I’d like to try adding some extra validation to the code to handle this corner case, and I was wondering if anyone had suggestions for where to start looking in the Spark code.
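To illustrate the kind of check described above, here is a minimal sketch in plain Python. It is not Spark code and does not reflect how Spark resolves schemas internally; the dictionary representation of a schema (column name mapped to a type name) and the function names `find_schema_mismatches` and `validate_or_fail` are assumptions made purely for illustration.

```python
from typing import Dict, List


def find_schema_mismatches(
    table_schema: Dict[str, str], file_schema: Dict[str, str]
) -> List[str]:
    """Describe every difference between a table schema and a file schema.

    Both arguments map column name -> type name. This representation is
    hypothetical and only stands in for the real metastore/Parquet schemas.
    """
    problems = []
    # Columns the table declares: they must exist in the file with the same type.
    for column, table_type in table_schema.items():
        if column not in file_schema:
            problems.append(
                f"column '{column}' is in the table schema but not in the file"
            )
        elif file_schema[column] != table_type:
            problems.append(
                f"column '{column}': table says {table_type}, "
                f"file says {file_schema[column]}"
            )
    # Columns that only the file knows about.
    for column in file_schema:
        if column not in table_schema:
            problems.append(
                f"column '{column}' is in the file but not in the table schema"
            )
    return problems


def validate_or_fail(table_schema: Dict[str, str], file_schema: Dict[str, str]) -> None:
    """Raise instead of reading silently when the two schemas disagree."""
    problems = find_schema_mismatches(table_schema, file_schema)
    if problems:
        raise ValueError("schema mismatch: " + "; ".join(problems))
```

The point of the sketch is the failure mode: a mismatch raises an error listing each offending column rather than letting the read proceed with wrong values. Whether an exact type match is the right rule, or whether some widening conversions should be allowed, is a design question the sketch does not try to answer.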
Cheers Andrew
Re: Spark data quality bug when reading parquet files from hive metastore
Posted by "Long, Andrew" <lo...@amazon.com.INVALID>.
Thanks Fokko,
I will definitely take a look at this.
Cheers Andrew
From: "Driesprong, Fokko" <fo...@driesprong.frl>
Date: Friday, August 24, 2018 at 2:39 AM
To: "reubensawyer@hotmail.com" <re...@hotmail.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>
Subject: Re: Spark data quality bug when reading parquet files from hive metastore
Hi Andrew,
This blog gives an idea of how the schema is resolved: https://blog.godatadriven.com/multiformat-spark-partition There is some optimisation going on when reading Parquet using Spark. Hope this helps.
Cheers, Fokko
On Wed, Aug 22, 2018 at 23:59, t4 <re...@hotmail.com> wrote:
https://issues.apache.org/jira/browse/SPARK-23576 ?
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
Re: Spark data quality bug when reading parquet files from hive metastore
Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Hi Andrew,
This blog gives an idea of how the schema is resolved:
https://blog.godatadriven.com/multiformat-spark-partition There is some
optimisation going on when reading Parquet using Spark. Hope this helps.
Cheers, Fokko
On Wed, Aug 22, 2018 at 23:59, t4 <re...@hotmail.com> wrote:
> https://issues.apache.org/jira/browse/SPARK-23576 ?
Re: Spark data quality bug when reading parquet files from hive metastore
Posted by t4 <re...@hotmail.com>.
https://issues.apache.org/jira/browse/SPARK-23576 ?