Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2013/01/03 04:50:12 UTC

[jira] [Commented] (PIG-3059) Global configurable minimum 'bad record' thresholds

    [ https://issues.apache.org/jira/browse/PIG-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542691#comment-13542691 ] 

Dmitriy V. Ryaboy commented on PIG-3059:
----------------------------------------

I agree with the principle that inspired this patch, but the solution seems to fall short of ideal.

Dealing in splits is misleading and hard to reason about:

* Good records read from a split that contains a bad record still get processed, so it's not the case that a "bad split" is skipped as a unit, and the setting doesn't really control how many bad splits get ignored.
* A single bad record stops the whole *rest* of the split from being processed, whether or not your loader could recover. That's unnecessary data loss.
* Most users of Pig have no idea what a split is (and shouldn't need to).
* Pig combines splits -- but this patch deals with pre-combination splits. Especially when combining small but unequal files, splits can differ wildly from one another: one may contain 100 records while another contains 100,000.

All of this means that no matter what the user sets these values to, they have no real idea what error threshold they're actually telling Pig to tolerate.

I think the Elephant-Bird way of dealing with errors -- a minimum count of *record* errors plus a percentage of total *records* read -- is quite robust and easy to explain. If Avro can't recover from a bad record within a split, it can do whatever is appropriate for Avro: estimate how many records it's dropping and throw that many exceptions, just pretend that this one error is all that was left in the split, or maybe fix the format so that it can recover properly (ok, that was a troll comment :)).
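To make that concrete, here's a minimal sketch of that kind of record-level accounting (class, method, and default names below are illustrative only, not Elephant-Bird's actual API):

    import java.io.IOException;

    // Illustrative sketch -- not Elephant-Bird's actual classes.
    public class RecordErrorTracker {
        private final long minErrors;      // e.g. 100: never fail below this many errors
        private final double maxErrorRate; // e.g. 0.01: fail once >1% of records are bad
        private long recordsRead = 0;
        private long errors = 0;

        public RecordErrorTracker(long minErrors, double maxErrorRate) {
            this.minErrors = minErrors;
            this.maxErrorRate = maxErrorRate;
        }

        // Call once per record successfully read.
        public void recordRead() {
            recordsRead++;
        }

        // Call once per bad record; fails the job only when BOTH thresholds are exceeded.
        public void recordError(Exception cause) throws IOException {
            errors++;
            recordsRead++; // a bad record still counts toward the total seen
            if (errors > minErrors && (double) errors / recordsRead > maxErrorRate) {
                throw new IOException(
                    "Too many bad records: " + errors + " of " + recordsRead, cause);
            }
        }
    }

The point is that both knobs are expressed in terms of records, which users can actually reason about, rather than splits, which they can't.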

                
> Global configurable minimum 'bad record' thresholds
> ---------------------------------------------------
>
>                 Key: PIG-3059
>                 URL: https://issues.apache.org/jira/browse/PIG-3059
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Russell Jurney
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: avro_test_files-2.tar.gz, PIG-3059-2.patch, PIG-3059.patch
>
>
> See PIG-2614. 
> Pig dies when one record in a LOAD of a billion records fails to parse. This is almost certainly not the desired behavior. Elephant-Bird and some other storage UDFs have minimum thresholds, in terms of percent and count, that must be exceeded before a job will fail outright.
> We need these limits to be configurable for Pig, globally. I've come to realize what a major problem Pig's crashing on bad records is for new Pig users. I believe this feature can greatly improve Pig.
> An example of a config would look like:
> pig.storage.bad.record.threshold=0.01
> pig.storage.bad.record.min=100
> A thorough discussion of this issue is available here: http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
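For illustration, the proposed keys could be supplied through an ordinary properties file, much like other Pig settings (key names come from the description above; their exact semantics would be defined by the patch):

    # conf/pig.properties (or any properties file passed to Pig)
    # Illustrative semantics: fail the job only once more than 1% of
    # records are bad AND at least 100 bad records have been seen.
    pig.storage.bad.record.threshold=0.01
    pig.storage.bad.record.min=100

Presumably the same values could also be set per script (e.g. with Pig's SET command) once the feature lands.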

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira