Posted to dev@pig.apache.org by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org> on 2012/03/27 07:26:04 UTC

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad record

    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239193#comment-13239193 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

Russell,

In Elephant-bird there is a key, elephantbird.mapred.input.bad.record.threshold, for exactly this sort of thing. For whatever reason I felt like taking this on, so find attached a patch that adds the functionality you want (note that it includes PIG-2551, which is more or less good to go; it's in here only because that patch brings in a Counter helper).

The default functionality does not change: on an error, it will still die. However, there are now two keys that can be set:
pig.piggybank.storage.avro.bad.record.threshold
pig.piggybank.storage.avro.bad.record.min

The former sets the acceptable ratio of bad records to total records. The latter sets the minimum number of errors that must be seen before the job is allowed to error out.
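
To make that concrete, here is a tiny sketch (illustrative only, not code from the patch; the class name is made up) of the two keys being set and read back on a Hadoop Configuration. With the defaults of 0, the very first bad record already puts the error ratio over the threshold, which is how the current die-on-error behavior is preserved:

    import org.apache.hadoop.conf.Configuration;

    // Illustrative sketch only -- not the attached patch.
    public class BadRecordSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Tolerate up to 1% bad records...
            conf.setFloat("pig.piggybank.storage.avro.bad.record.threshold", 0.01f);
            // ...but only allow the job to fail after at least 100 errors.
            conf.setLong("pig.piggybank.storage.avro.bad.record.min", 100L);

            // Defaults of 0 mean "die on the first error", i.e. today's behavior.
            float threshold = conf.getFloat("pig.piggybank.storage.avro.bad.record.threshold", 0f);
            long min = conf.getLong("pig.piggybank.storage.avro.bad.record.min", 0L);
            System.out.println("threshold=" + threshold + ", min=" + min);
        }
    }

The same values should also be settable with -D on the pig command line or with 'set' in a script, since they just end up in the job configuration.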

Here is where you come in:

Currently, the only place I catch and count errors is around "reader.next()". Are there any other cases where errors (at least, errors indicating a bad row) can be thrown? And on an error, what do you want to happen: skip the row, or return null? It seems to make sense to me to skip the record (also, the number of records processed and the number of errors thrown are now tracked in Hadoop counters).
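
For illustration, here is a stripped-down model of the accounting I have in mind (again, not the patch itself; the class, field, and method names are made up, and the real thing lives around reader.next() in the record reader and reports through the PIG-2551 counter helper):

    // Illustrative model of the skip-vs-die logic: bad rows are skipped and counted,
    // and the job only dies once at least 'min' errors have been seen AND the
    // bad/total ratio exceeds 'threshold'.
    public class BadRecordTracker {
        private final double threshold; // pig.piggybank.storage.avro.bad.record.threshold
        private final long min;         // pig.piggybank.storage.avro.bad.record.min
        private long recordsSeen = 0;   // would also be reported as a Hadoop counter
        private long badRecords = 0;    // likewise

        public BadRecordTracker(double threshold, long min) {
            this.threshold = threshold;
            this.min = min;
        }

        /** Call once per record pulled from reader.next(), good or bad. */
        public void recordSeen() {
            recordsSeen++;
        }

        /** Call when reader.next() throws. Returns true if the job should give up. */
        public boolean recordError() {
            badRecords++;
            if (badRecords < min) {
                return false; // not enough errors yet to trip the threshold
            }
            return ((double) badRecords / recordsSeen) > threshold;
        }

        public static void main(String[] args) {
            // Tolerate up to 1% bad records, but never fail on fewer than 100 errors.
            BadRecordTracker tracker = new BadRecordTracker(0.01, 100);
            for (int i = 1; i <= 100000; i++) {
                tracker.recordSeen();
                boolean bad = (i % 500 == 0); // simulate a 0.2% error rate
                if (bad && tracker.recordError()) {
                    throw new RuntimeException("too many bad records at record " + i);
                }
            }
            System.out.println("done; bad rows were skipped and the threshold never tripped");
        }
    }

In the loader itself, "skip" would just mean looping around to the next record instead of handing the bad one (or a null) back to Pig.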

Secondly, someone needs to write tests. The patch currently passes the existing tests, but only because the default threshold and min are 0. I don't know what does and doesn't count as a bad Avro file, though. Hopefully the fact that I did the implementation work will motivate someone to add tests ;)
                
> AvroStorage crashes on LOADING a single bad record
> --------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>            Priority: Blocker
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira