Posted to user@hive.apache.org by Travis Crawford <tr...@gmail.com> on 2012/08/09 17:42:46 UTC

Skipping bad records

Hey hive gurus -

I recently had trouble getting Hive to process a partition containing bad
records, and am curious how others deal with this issue. From searching
around, I learned that Hive relies on the bad-record skipping built into
MapReduce rather than handling bad records itself.

The partition I processed was roughly 87GB, with around 600 million records.

The job eventually completed (with 350 task failures) with these settings:

-- Enable MapReduce's bad-record skipping mode
set mapred.skip.mode.enabled=true;
-- Allow many attempts so skip mode has room to narrow in on bad records
set mapred.map.max.attempts=100;
set mapred.reduce.max.attempts=100;
-- Acceptable number of records skipped around each bad record
set mapred.skip.map.max.skip.records=30000;
-- Start skipping after the first failed attempt
set mapred.skip.attempts.to.start.skipping=1;

I believe this means roughly 350 bad records (about 0.00006% of the 600
million) caused the job to initially fail?

The code throwing the exception has a TODO about discussing record
deserialization errors:
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java#L508

Has there been a discussion about natively handling bad records? As a
comparison, Elephant-Bird handles some percentage of bad records without
causing task failures:
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/LzoRecordReader.java
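The Elephant-Bird approach can be sketched roughly like this (the class and
parameter names below are illustrative, not Elephant-Bird's actual API): the
reader counts deserialization failures and only fails the task when the
observed error rate crosses a configured threshold, instead of dying on the
first corrupt record.

```java
// Hypothetical sketch of threshold-based bad-record tolerance.
// Names and defaults are illustrative, not Elephant-Bird's real API.
public class ErrorTolerantReader {
    private final double maxErrorRate;        // e.g. 0.0001 tolerates 0.01% bad records
    private final long minRecordsBeforeCheck; // avoid failing on the very first record
    private long recordsSeen = 0;
    private long errorsSeen = 0;

    public ErrorTolerantReader(double maxErrorRate, long minRecordsBeforeCheck) {
        this.maxErrorRate = maxErrorRate;
        this.minRecordsBeforeCheck = minRecordsBeforeCheck;
    }

    /** Call once per record; returns the decoded value, or null for a tolerated bad record. */
    public String readRecord(byte[] raw) {
        recordsSeen++;
        try {
            return decode(raw); // deserialization may throw on corrupt input
        } catch (RuntimeException e) {
            errorsSeen++;
            double rate = (double) errorsSeen / recordsSeen;
            if (recordsSeen >= minRecordsBeforeCheck && rate > maxErrorRate) {
                // Too many bad records: fail the task rather than silently drop data
                throw new RuntimeException(
                        "error rate " + rate + " exceeds threshold " + maxErrorRate, e);
            }
            return null; // skip this record and keep the task alive
        }
    }

    // Stand-in for real deserialization; treats empty input as corrupt.
    private String decode(byte[] raw) {
        if (raw == null || raw.length == 0) {
            throw new RuntimeException("corrupt record");
        }
        return new String(raw, java.nio.charset.StandardCharsets.UTF_8);
    }

    public long errors() { return errorsSeen; }
}
```

Compared with MR skip mode, this trades retry cycles for a per-task counter:
a handful of corrupt records never triggers task re-execution, but a
systematically broken input still fails loudly.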

Thanks!
Travis