Posted to user@hive.apache.org by Marcin Tustin <mt...@handybook.com> on 2016/01/22 19:31:35 UTC

Data corruption/loss in Hive

Hi All,

I'm seeing some data loss/corruption in Hive. This isn't HDFS-level
corruption - HDFS reports that the files and blocks are healthy.
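
For what it's worth, that check is just an fsck over the table's warehouse
directory, which comes back clean; the path here is only illustrative:

    hdfs fsck /apps/hive/warehouse/mydb.db/events -files -blocks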

I'm using managed ORC tables. Normally we write once an hour to each table,
with occasional concatenations through Hive. We perform the writes using
Spark 1.3.1 (via the Spark SQL interface), running either locally or over
YARN.
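
For reference, the hourly write is just an insert issued through the Spark
SQL interface, roughly along these lines (table, column and partition names
are changed for the example):

    INSERT INTO TABLE events PARTITION (dt='2016-01-22', hr='18')
    SELECT id, user_id, event_type, payload
    FROM events_staging
    WHERE dt='2016-01-22' AND hr='18';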

Occasionally we will run many insertion jobs against a table, generally
when backfilling data.

The data loss seems to happen more often when we are running frequent
concatenations and multiple insertion jobs at the same time.
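
The concatenations themselves are the standard ALTER TABLE ... CONCATENATE
run through Hive, one partition at a time, something like this (the
partition spec is just an example):

    ALTER TABLE events PARTITION (dt='2016-01-22', hr='18') CONCATENATE;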

The problem goes away when we drop the table and re-ingest. The problem also
appears to be localised to specific ORC files within the table - if we
delete the affected files (detectable by trying to orcdump each file), the
rest are just fine.
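
By "orcdump" I mean the ORC file dump that ships with Hive; the affected
files are the ones where a dump like the following fails (path is
illustrative):

    hive --orcfiledump /apps/hive/warehouse/mydb.db/events/dt=2016-01-22/hr=18/000000_0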

Has anyone seen this? Any suggestions for avoiding this or chasing down a
root cause?

Thanks,
Marcin
