Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2011/01/21 03:22:29 UTC
[Pig Wiki] Update of "PigErrorHandlingInScripts" by JulienLeDem

The "PigErrorHandlingInScripts" page has been changed by JulienLeDem.
http://wiki.apache.org/pig/PigErrorHandlingInScripts

--------------------------------------------------

New page:
= Error handling in Pig scripts =

Currently, when a UDF throws an exception, Pig fails and stops processing. We want to extend this behavior to give users finer-grained control over error handling.

Depending on the use case, users would like to have several options:
 * Stop the execution and report an error
 * Ignore tuples that cause exceptions and log warnings
 * Ignore tuples that cause exceptions and redirect them to an error relation (to enable statistics, debugging, ...)
 * Write their own error handler

The proposal is to add an ONERROR keyword ("on error", following the existing naming conventions: http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords).
We are still looking for a good keyword (ON_ERROR vs ONERROR vs ...).
The proposed syntax is:
<relation> = <Pig statement ...> ONERROR [<handler>] [SPLIT INTO <relation> ...]

Usage:

 * The default behavior is to die on error and can be overridden as follows:
DEFAULT ONERROR <error handler>;

 * Built in error handlers:
Ignore() => ignores errors by dropping records that cause exceptions
Fail() => fails the script on error (default)
FailOnThreshold(threshold) => fails if the fraction of input records that cause errors is above threshold

 * The error handler interface is defined as follows:
handle() is called on the slave each time a UDF throws an exception.
collectResult() is called on the client side after the relation is computed, to decide what to do next.
For example, the collectResult() of FailOnThreshold would use counters to throw an exception if (#errors / #input) > threshold.

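import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public interface ErrorHandler<T> {

// called on the slave when a UDF throws;
// input is not the input of the UDF, it's the tuple from the relation
T handle(IOException ioe, EvalFunc<?> evalFunc, Tuple input) throws IOException;

Schema outputSchema(Schema input);

// called afterwards on the client side
void collectResult() throws IOException;

}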
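
As a rough illustration of how handle() and collectResult() could split the work for FailOnThreshold, here is a self-contained sketch. It is an assumption, not part of the proposal: the class name FailOnThresholdSketch, the recordInput() method and the in-memory counters are hypothetical stand-ins (a real handler would implement ErrorHandler and rely on Hadoop counters), and Pig's Tuple and EvalFunc are replaced by plain Object and String so that the code compiles without Pig on the classpath.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of FailOnThreshold; not the proposed implementation. */
final class FailOnThresholdSketch {
    private final double threshold;
    // Stand-ins for the job counters mentioned in the proposal.
    private final AtomicLong inputs = new AtomicLong();
    private final AtomicLong errors = new AtomicLong();

    FailOnThresholdSketch(double threshold) {
        this.threshold = threshold;
    }

    /** Hypothetical hook: counts every tuple read from the relation. */
    void recordInput() {
        inputs.incrementAndGet();
    }

    /** Task side: count the failure and drop the record by returning null. */
    Object handle(IOException ioe, String udfName, Object input) {
        errors.incrementAndGet();
        return null;
    }

    /** Client side: fail if (#errors / #input) > threshold. */
    void collectResult() throws IOException {
        long in = inputs.get();
        long err = errors.get();
        if (in > 0 && (double) err / in > threshold) {
            throw new IOException(String.format(
                "%d of %d records failed, above threshold %s", err, in, threshold));
        }
    }
}
```

With a threshold of 0.01, one failure in 200 inputs passes collectResult(), while three failures in 200 inputs make it throw. Keeping the decision in collectResult() means individual tasks never need to know the global error rate.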

 * SPLIT is optional

Example:
DEFAULT ONERROR Ignore();
...

DESCRIBE A;
A: {name: chararray, age: int, gpa: float}

-- fail if more than 1% of the records cause errors
B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR FailOnThreshold(0.01) ;

-- custom handler that counts errors and logs on the client side
C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors() ;

-- B2_ERRORS cannot really contain the input to the UDF, as it would have a different schema depending on which UDF failed;
-- it contains the input tuple from the relation instead
DESCRIBE B2_ERRORS;
B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray, error:(class: chararray, message: chararray, stacktrace: chararray) }

-- example of filtering on the udf
C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO C2_FOO_ERRORS IF udf == 'Foo', C2_BAR_ERRORS IF udf == 'Bar';

-- uses handler and SPLIT
A3 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR HandleItMyWay() SPLIT INTO A3_ERRORS;
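
To make the shape of an error relation more concrete, the following sketch shows how one error record with the fields input, udf and error (class, message, stacktrace) could be assembled from a caught exception. This is only an assumption about how such records might be built: the class name ErrorRecordSketch is hypothetical, and a plain List<Object> stands in for Pig's Tuple so that the code compiles without Pig.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.List;

/** Hypothetical holder mirroring the error relation schema shown above. */
final class ErrorRecordSketch {
    final List<Object> input;   // the tuple from the relation, not the UDF arguments
    final String udf;           // name of the UDF that failed
    final String errorClass;    // error.class
    final String message;       // error.message
    final String stacktrace;    // error.stacktrace

    ErrorRecordSketch(List<Object> input, String udf, Throwable t) {
        this.input = input;
        this.udf = udf;
        this.errorClass = t.getClass().getName();
        this.message = t.getMessage();
        // Render the stack trace into a string so it fits a chararray field.
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        this.stacktrace = sw.toString();
    }
}
```

Because the record always carries the whole input tuple and the UDF name, its schema stays the same whichever UDF failed, which is what allows a condition such as udf == 'Foo' to route records to different error relations.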