Posted to mapreduce-issues@hadoop.apache.org by "Dobromir Montauk (JIRA)" <ji...@apache.org> on 2014/12/08 05:28:13 UTC

[jira] [Commented] (MAPREDUCE-15) SequenceFile RecordReader should skip bad records

    [ https://issues.apache.org/jira/browse/MAPREDUCE-15?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237425#comment-14237425 ] 

Dobromir Montauk commented on MAPREDUCE-15:
-------------------------------------------

Curious what the current state is; can the RecordReader skip bad records? This seems like the best default behavior in a complex distributed environment, where bad records are a non-trivial occurrence...
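
(For reference, the old mapred API does expose a generic skipping mode at the task level; a minimal sketch of enabling it is below, assuming a JobConf-based job with a placeholder driver class. Note this skips records around a crashing map() call after repeated task failures, which is a different layer than having the RecordReader itself skip corrupt SequenceFile blocks, as this issue asks for.)

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkippingJobSetup {
      // Placeholder driver class; substitute your own job class.
      public static JobConf configure(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);
        // Enter skipping mode after the second failed attempt of a task.
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // Tolerate up to 100 bad input records around each map failure.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 100L);
        return conf;
      }
    }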

> SequenceFile RecordReader should skip bad records
> -------------------------------------------------
>
>                 Key: MAPREDUCE-15
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-15
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Joydeep Sen Sarma
>
> Currently a bad record in a SequenceFile causes the entire job to fail. The best workaround is to skip the errant file manually (by looking at which map task failed). This is a sucky option because it's manual and because one should be able to skip a single SequenceFile block instead of the entire file.
> While we don't see this often (and I don't know why this corruption happened), here's an example stack:
> Status : FAILED java.lang.NegativeArraySizeException
> 	at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:96)
> 	at org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:75)
> 	at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:130)
> 	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1640)
> 	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1712)
> 	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
> 	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:176)
> Ideally the RecordReader should just skip the entire chunk if it hits an unrecoverable error while reading.
> This was the consensus in HADOOP-153 as well (that data corruption should be handled by RecordReaders), and HADOOP-3144 did something similar for TextInputFormat.
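
(A minimal sketch of the resync idea described above, against the old mapred SequenceFile.Reader API: catch the read error and seek to the next sync marker. This is a hypothetical wrapper written for illustration, not code from any Hadoop release.)

    import java.io.IOException;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;

    public class SkippingSequenceFileReader {
      private final SequenceFile.Reader in;
      private final long end;  // end of this input split

      public SkippingSequenceFileReader(SequenceFile.Reader in, long end) {
        this.in = in;
        this.end = end;
      }

      /** Read the next record; on corruption, resync and keep going. */
      public boolean next(Writable key, Writable value) throws IOException {
        while (in.getPosition() < end) {
          try {
            if (!in.next(key)) {
              return false;                      // clean end of file
            }
            in.getCurrentValue(value);
            return true;
          } catch (IOException | RuntimeException e) {
            // e.g. the NegativeArraySizeException above, thrown by a corrupt
            // length field: skip past the bad block to the next sync mark.
            in.sync(in.getPosition());
          }
        }
        return false;
      }
    }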


