Posted to dev@pig.apache.org by Julien Le Dem <le...@yahoo-inc.com> on 2011/01/18 19:27:09 UTC

Re: Exception Handling in Pig Scripts

That would be nice.
Also letting the error handler output the result to a relation would be useful.
(To let the script output application error metrics)
For example it could (optionally) use the keyword INTO just like the SPLIT operator.

FOO = LOAD ...;
A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;

ErrorHandler would look a little more like EvalFunc:

public interface ErrorHandler<T> {

  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input)
      throws IOException;

  public Schema outputSchema(Schema input);

}

There could be a built-in handler to output the skipped record (input: tuple, funcname:chararray, errorMessage:chararray)

A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
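
For concreteness, a rough sketch of what such a built-in handler could look like, assuming the interface above (the class name and the schema plumbing are illustrative, not an existing Pig API):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Illustrative built-in: emits (input, funcname, errorMessage) for every
// record the wrapped UDF failed on.
public class OutputSkippedRecord implements ErrorHandler<Tuple> {

  private final TupleFactory tupleFactory = TupleFactory.getInstance();

  public Tuple handle(IOException ioe, EvalFunc evalFunc, Tuple input)
      throws IOException {
    Tuple t = tupleFactory.newTuple(3);
    t.set(0, input);
    t.set(1, evalFunc.getClass().getName());
    t.set(2, ioe.getMessage());
    return t;
  }

  public Schema outputSchema(Schema input) {
    // (input: tuple, funcname: chararray, errorMessage: chararray)
    try {
      Schema s = new Schema();
      s.add(new Schema.FieldSchema("input", input, DataType.TUPLE));
      s.add(new Schema.FieldSchema("funcname", DataType.CHARARRAY));
      s.add(new Schema.FieldSchema("errorMessage", DataType.CHARARRAY));
      return s;
    } catch (FrontendException e) {
      throw new RuntimeException(e);
    }
  }
}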

Julien

On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

I was thinking about this..

We add an optional ON_ERROR clause to operators, which allows a user to
specify error handling. The error handler would be a udf that would
implement an interface along these lines:

public interface ErrorHandler {

  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
      throws IOException;

}

I think it makes sense not to make this a static method, so that users can
keep required state and, for example, have the handler throw its own
IOException if it has been invoked too many times.
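
For illustration, a minimal sketch of such a stateful handler against this interface (the class name and the error limit are made up):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Illustrative only: swallow failures (dropping the record) until a fixed
// limit is hit, then rethrow to fail the task.
public class GiveUpAfterN implements ErrorHandler {

  private final int maxErrors;
  private int errors = 0;

  public GiveUpAfterN(int maxErrors) {
    this.maxErrors = maxErrors;
  }

  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
      throws IOException {
    errors++;
    if (errors > maxErrors) {
      throw new IOException(evalFunc.getClass().getName() + " failed "
          + errors + " times, giving up", ioe);
    }
  }
}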

D


On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <sm...@yahoo-inc.com>wrote:

> Thanks for the clarification Ashutosh.
>
> Implementing this in the user realm is tricky as Dmitriy states.
> Sensitivity to error thresholds will require support from the system. We can
> probably provide a taxonomy of records (good, bad, incomplete, etc.) to let
> users classify each record. The system can then track counts of each record
> type to facilitate the computation of thresholds. The last part is to allow
> users to specify thresholds and appropriate actions (interrupt, exit,
> continue, etc.). A possible mechanism to realize this is the
> ErrorHandlingUDF described by Dmitriy.
>
> Santhosh
>
> -----Original Message-----
> From: Ashutosh Chauhan [mailto:hashutosh@apache.org]
> Sent: Friday, January 14, 2011 7:35 PM
> To: user@pig.apache.org
> Subject: Re: Exception Handling in Pig Scripts
>
> Santhosh,
>
> The way you are proposing it, it will kill the Pig script. I think what the
> user wants is to ignore a few "bad records", process the rest, and get
> results. The problem here is how to let the user tell Pig the definition of a
> "bad record", and how to let them specify the percentage of bad records at
> which Pig should fail the script.
>
> Ashutosh
>
> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <sm...@yahoo-inc.com>
> wrote:
> > Sorry about the late response.
> >
> > Hadoop n00b is proposing a language extension for error handling, similar
> > to the mechanisms in other well-known languages like C++, Java, etc.
> >
> > For now, can't the error semantics be handled by the UDF? For exceptional
> > scenarios you could throw an ExecException with the right details. The
> > physical operator that handles the execution of UDFs traps it for you and
> > propagates the error back to the client. You can take a look at any of the
> > builtin UDFs to see how Pig handles it internally.
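
A minimal sketch of that route, assuming a made-up date-difference UDF (EvalFunc and ExecException are real Pig classes; everything else is illustrative):

import java.io.IOException;
import java.text.SimpleDateFormat;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: seconds between two yyyy-MM-dd dates, failing with
// details instead of an opaque overflow error.
public class SecondsBetween extends EvalFunc<Long> {

  public Long exec(Tuple input) throws IOException {
    if (input == null || input.size() != 2) {
      return null;  // usual Pig convention for unusable input
    }
    try {
      SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
      f.setLenient(false);
      long a = f.parse((String) input.get(0)).getTime();
      long b = f.parse((String) input.get(1)).getTime();
      return (b - a) / 1000L;
    } catch (Exception e) {
      // surface the offending record in the error itself
      throw new ExecException("SecondsBetween: bad input " + input + ": " + e);
    }
  }
}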
> >
> > Santhosh
> >
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
> > Sent: Tuesday, January 11, 2011 10:41 AM
> > To: user@pig.apache.org
> > Subject: Re: Exception Handling in Pig Scripts
> >
> > Right now error handling is controlled by the UDFs themselves, and there
> > is no way to direct it externally.
> > You can make an ErrorHandlingUDF that would take a UDF spec, invoke it,
> > trap errors, and then do the specified error handling behavior... that's a
> > bit ugly, though.
> >
> > There is a problem with trapping general exceptions, of course, in that if
> > they happen 0.000001% of the time you can probably just ignore them, but if
> > they happen in half your dataset, you want the job to tell you something is
> > wrong. So this stuff gets non-trivial. If anyone wants to propose a design
> > to solve this general problem, I think that would be a welcome addition.
> >
> > D
> >
> > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <ne...@gmail.com>
> wrote:
> >
> >> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
> >> date format, but when I try to get the seconds between this and
> >> another date, say 2011-01-01, I get an error that the value is too
> >> large to fit into an int, and the process stops. Do we have something
> >> like ifError(x-y, null, x-y)? Or would I have to implement this as a
> >> UDF?
> >>
> >> Thanks
> >>
> >> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dv...@gmail.com>
> >> wrote:
> >>
> >> > Create a UDF that verifies the format, and go through a filtering
> >> > step first.
> >> > If you would like to save the malformed records so you can look
> >> > at them later, you can use the SPLIT operator to route the good
> >> > records to your regular workflow, and the bad records someplace on
> >> > HDFS.
> >> >
> >> > -D
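
A sketch of that pattern, assuming a hypothetical IsValidDate UDF that returns 1 for parseable dates and 0 otherwise (an int rather than a boolean, so it can be compared in a SPLIT condition on older Pig); DateDiff stands in for whatever Piggybank date UDF is in use:

REGISTER mydateudfs.jar;
raw = LOAD 'events' AS (d1: chararray, d2: chararray);

-- route records by validity; keep the bad ones for later inspection
SPLIT raw INTO good IF (IsValidDate(d1) == 1 AND IsValidDate(d2) == 1),
               bad IF (IsValidDate(d1) == 0 OR IsValidDate(d2) == 0);
STORE bad INTO 'bad_records';

-- the regular workflow continues on the clean records only
diffs = FOREACH good GENERATE DateDiff(d1, d2);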
> >> >
> >> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <ne...@gmail.com>
> wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > I have a Pig script that uses Piggybank to calculate date differences.
> >> > > Sometimes, when I get a weird date or wrong format in the input, the
> >> > > script throws an error and aborts.
> >> > >
> >> > > Is there a way I could trap these errors and move on without stopping
> >> > > the execution?
> >> > >
> >> > > Thanks
> >> > >
> >> > > PS: I'm using CDH2 with Pig 0.5
> >> > >
> >> >
> >>
> >
>


Re: Exception Handling in Pig Scripts

Posted by Milind Bhandarkar <mb...@linkedin.com>.
That's a nice approach. It fits my Unix-solved-everything-but-needs-syntactic-sugar world-view :-) (e.g. if we had a 1| and 2| syntax, this would be:

0<./FOO "1| ./bar > A" "2| ./MyHandler > B" 

:-)

- milind


---
Milind Bhandarkar
mbhandarkar@linkedin.com




RE: Exception Handling in Pig Scripts

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Thanks, Julien. I also added a couple of questions to the wiki.

Olga

-----Original Message-----
From: Julien Le Dem [mailto:ledemj@yahoo-inc.com] 
Sent: Thursday, January 20, 2011 6:36 PM
To: dev@pig.apache.org
Subject: Re: Exception Handling in Pig Scripts

I've summed up the thread here:
http://wiki.apache.org/pig/PigErrorHandlingInScripts
I'm sure it's biased toward its author's opinion, let me know what you think.
Julien

On 1/20/11 3:19 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:

Sure :)

-----Original Message-----
From: Julien Le Dem [mailto:ledemj@yahoo-inc.com]
Sent: Thursday, January 20, 2011 1:49 PM
To: dev@pig.apache.org
Subject: Re: Exception Handling in Pig Scripts

I see there is a PigErrorHandling page; what about calling it PigErrorHandlingInScripts?
Julien

On 1/20/11 12:53 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:

Hi guys,

Could you put a quick wiki with your proposal together? I think it would make it much easier than following the email discussion.

Thanks,

Olga

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
Sent: Thursday, January 20, 2011 11:52 AM
To: dev@pig.apache.org
Subject: Re: Exception Handling in Pig Scripts

Right, what I am saying is that the tasks would not fail because we'd catch
the errors.

Thanks for the lmyit link.. learn something new every day.

On Thu, Jan 20, 2011 at 11:31 AM, Julien Le Dem <le...@yahoo-inc.com>wrote:

> Doesn't Hadoop discard the increments to counters done by failed tasks? (I
> would expect that, but I don't know)
> Also using counters we should make sure we don't mix up multiple relations
> being combined by the optimizer.
>
> P.S.: Regarding rror, I don't see why you would want two of these:
> http://lmyit.com/rror
> :P
>
>
> On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>
> I think this is coming together! I like the idea of a client-side handler
> method that allows us to look at all errors in aggregate and make
> decisions based on proportions. How can we guard against catching the wrong
> mistakes -- say, letting a mapper that's running on a bad node and failing
> all its local disk writes finish "successfully", even though properly the
> task just needs to be rerun on a different node, and normally MR would just
> take care of it?
> Let's put this on a wiki for wider feedback.
>
> P.S. What's a "rror" and why do we only want one of them?
>
> On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <le...@yahoo-inc.com>
> wrote:
>
> > Some more thoughts.
> >
> > * Looking at the existing keywords:
> > http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
> > It seems ONERROR would be better than ON_ERROR for consistency. There is an
> > existing ONSCHEMA but no underscore-based keyword.
> >
> > * The default behavior should be to die on error; it can be overridden as
> > follows:
> > DEFAULT ONERROR <error handler>;
> >
> > * Built in error handlers:
> > Ignore() => ignores errors by dropping records that cause exceptions
> > Fail() => fails the script on error. (default)
> > FailOnThreshold(threshold) => fails if the number of errors is above the threshold
> >
> > * The error handler interface needs a method called on client side after
> > the relation is computed to decide what to do next.
> > Typically FailOnThreshold will throw an exception if
> > (#errors / #input) > threshold, using counters.
> >
> > public interface ErrorHandler<T> {
> >
> > // input is not the input of the UDF, it's the tuple from the relation
> > T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> > IOException;
> >
> > Schema outputSchema(Schema input);
> >
> > // called afterwards on the client side
> > void collectResult() throws IOException;
> >
> > }
> >
> > * SPLIT is optional
> >
> > example:
> > DEFAULT ONERROR Ignore();
> > ...
> >
> > DESCRIBE A;
> > A: {name: chararray, age: int, gpa: float}
> >
> > -- fail if more than 1% errors
> > B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
> > FailOnThreshold(0.01) ;
> >
> > -- need to make sure the twitter infrastructure can handle the load
> > C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;
> >
> > -- custom handler that counts errors and logs on the client side
> > D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors();
> >
> > -- uses default handler and SPLIT
> > B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> > B2_ERRORS;
> >
> > -- B2_ERRORS cannot really contain the input to the UDF, as it would have a
> > different schema depending on which UDF failed
> > DESCRIBE B2_ERRORS;
> > B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray,
> > error: (class: chararray, message: chararray, stacktrace: chararray)}
> >
> > -- example of filtering on the udf
> > C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> > C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';
> >
> > Julien
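
To make the client-side part concrete, a sketch of how FailOnThreshold could satisfy this contract. The ErrorCounters object is invented here; it stands in for whatever counter access Pig would give handlers:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Hypothetical built-in: drops failing records, then fails the script on the
// client if the aggregated error ratio exceeds the threshold.
public class FailOnThreshold implements ErrorHandler<Tuple> {

  private final double threshold;
  private final ErrorCounters counters;  // invented: counter access supplied by Pig

  public FailOnThreshold(double threshold, ErrorCounters counters) {
    this.threshold = threshold;
    this.counters = counters;
  }

  // Count the failure and drop the record.
  public Tuple handle(IOException ioe, EvalFunc evalFunc, Tuple input)
      throws IOException {
    counters.incrErrors();
    return null;
  }

  // Nothing is added to the relation on the error path.
  public Schema outputSchema(Schema input) {
    return null;
  }

  // Client side, after the relation is computed and counters are aggregated.
  public void collectResult() throws IOException {
    double ratio = (double) counters.errors() / counters.input();
    if (ratio > threshold) {
      throw new IOException("error ratio " + ratio + " exceeds " + threshold);
    }
  }
}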
> >
> > On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
> >
> > We should think more about the interface.
> > For example, "Tuple input" argument -- is that the tuple that was passed
> to
> > the udf, or the whole tuple that was being processed? I can see wanting
> > both.
> > Also the Handler should probably have init and finish methods, in case some
> > accumulation is happening or state needs to get set up...
> >
> > Not sure about "splitting" into a table. Maybe more like:
> >
> > A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
> > A_ERRORS;
> >
> > "use" and "into" are optional syntactic sugar.
> >
> > This allows us to do any combination of:
> > - die
> > - put original record into a table
> > - process the error using a custom handler (which can increment counters,
> > write to dbs, send tweets... definitely send tweets...)
> >
> > D
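
Folding those suggestions into the earlier proposal, the interface might grow into something like this (still only a sketch):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Sketch only: init/finish added, and both tuples passed to handle().
public interface ErrorHandler<T> {

  // set up any state before records start flowing
  void init();

  // udfInput: the tuple passed to the UDF;
  // processedTuple: the whole tuple the operator was working on
  T handle(IOException ioe, EvalFunc evalFunc, Tuple udfInput,
      Tuple processedTuple) throws IOException;

  Schema outputSchema(Schema input);

  // flush or report any accumulated state
  void finish() throws IOException;
}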



Re: Exception Handling in Pig Scripts

Posted by Julien Le Dem <le...@yahoo-inc.com>.
I've summed up the thread here:
http://wiki.apache.org/pig/PigErrorHandlingInScripts
I'm sure it's biased toward its author's opinion, let me know what you think.
Julien




RE: Exception Handling in Pig Scripts

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Sure :)



Re: Exception Handling in Pig Scripts

Posted by Julien Le Dem <le...@yahoo-inc.com>.
My opinion is that the Pig feature would not use it.
What we're discussing is more granular and prevents the task from failing. It also allows actual handling of the error in a simple way (in Pig).
As Map-Reduce is mainly executing Java code, you can do the same thing by adding a try-catch statement and using a MultipleOutputFormat to send bad records under a different name.
In Pig, the UDF cannot have multiple outputs, so we need to add a mechanism to easily handle exceptions separately.

Julien
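
A sketch of that raw Map-Reduce pattern, here using the MultipleOutputs helper from org.apache.hadoop.mapred.lib rather than subclassing MultipleOutputFormat directly; the mapper body and the "bad" output name are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class TryCatchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf conf) {
    // "bad" must be declared at job setup with MultipleOutputs.addNamedOutput
    mos = new MultipleOutputs(conf);
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    try {
      out.collect(new Text("ok"), process(value));  // normal path
    } catch (Exception e) {
      // bad record: send it to the side output instead of failing the task
      mos.getCollector("bad", reporter).collect(value, new Text(e.toString()));
    }
  }

  public void close() throws IOException {
    mos.close();
  }

  // stand-in for the real per-record logic, assumed to throw on bad input
  private Text process(Text value) {
    return new Text(value.toString().trim());
  }
}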

On 1/20/11 1:30 PM, "Ashutosh Chauhan" <ha...@apache.org> wrote:

If it's not already been discussed, how does this interact with
Hadoop's feature of skipping bad records:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/SkipBadRecords.html

Ashutosh
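
For reference, skip mode is configured through static setters on that class; it is coarser than what is proposed here, since it only skips records whose processing actually crashes the task, and only after repeated attempts. A rough sketch (the numbers are arbitrary):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipModeSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // begin skipping only after two failed attempts of the same task
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // tolerate up to 100 skipped records per map task
    SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
    // ... configure and submit the job as usual
  }
}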
On Thu, Jan 20, 2011 at 12:53, Olga Natkovich <ol...@yahoo-inc.com> wrote:
> Hi guys,
>
> Could you put a quick wiki with your proposal together? I think it would make it much easier then following email discussion.
>
> Thanks,
>
> Olga
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
> Sent: Thursday, January 20, 2011 11:52 AM
> To: dev@pig.apache.org
> Subject: Re: Exception Handling in Pig Scripts
>
> Right, what I am saying is that the tasks would not fail because we'd catch
> the errors.
>
> Thanks for the lmyit link.. learn something new every day.
>
> On Thu, Jan 20, 2011 at 11:31 AM, Julien Le Dem <le...@yahoo-inc.com>wrote:
>
>> Doesn't Hadoop discard the increments to counters done by failed tasks? (I
>> would expect that, but I don't know)
>> Also using counters we should make sure we don't mix up multiple relations
>> being combined by the optimizer.
>>
>> P.S.: Regarding rror, I don't see why you would want two of these:
>> http://lmyit.com/rror
>> :P
>>
>>
>> On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>>
>> I think this is coming together! I like the idea of a client-side handler
>> method that allows us to look at all errors in aggregate and make a
>> decisions based on proportions. How can we guard against catching the wrong
>> mistakes -- say, letting a mapper that's running on a bad node and fails
>> all
>> local disk writes finish "successfully" even though properly, the task just
>> needs to be rerun on a different mapper and normally MR would just take
>> care
>> of it?
>> Let's put this on a wiki for wider feedback.
>>
>> P.S. What's a "rror" and why do we only want one of them?
>>
>> On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <le...@yahoo-inc.com>
>> wrote:
>>
>> > Some more thoughts.
>> >
>> > * Looking at the existing keywords:
>> > http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
>> > It seems ONERROR would be better than ON_ERROR for consistency. There is
>> an
>> > existing ONSCHEMA but no _ based keyword.
>> >
>> > * The default behavior should be to die on error and can be overridden as
>> > follows:
>> > DEFAULT ONERROR <error handler>;
>> >
>> > * Built in error handlers:
>> > Ignore() => ignores errors by dropping records that cause exceptions
>> > Fail() => fails the script on error. (default)
>> > FailOnThreshold(threshold) => fails if number of errors above threshold
>> >
>> > * The error handler interface needs a method called on client side after
>> > the relation is computed to decide what to do next.
>> > Typically FailOnThreshold will throw an exception if
>> > (#errors/#input)>threshold using counters.
>> > public interface ErrorHandler<T> {
>> >
>> > // input is not the input of the UDF, it's the tuple from the relation
>> > T handle(IOExcetion ioe, EvalFunc evalFunc, Tuple input) throws
>> >  IOException;
>> >
>> > Schema outputSchema(Schema input);
>> >
>> > // called afterwards on the client side
>> > void collectResult() throws IOException;
>> >
>> > }
>> >
>> > * SPLIT is optional
>> >
>> > example:
>> > DEFAULT ONERROR Ignore();
>> > ...
>> >
>> > DESCRIBE A;
>> > A: {name: chararray, age: int, gpa: float}
>> >
>> > -- fail it more than 1% errors
>> > B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
>> > FailOnThreshold(0.01) ;
>> >
>> > -- need to make sure the twitter infrastructure can handle the load
>> > C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;
>> >
>> > -- custom handler that counts errors and logs on the client side
>> > D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors()
>> ;
>> >
>> > -- uses default handler and SPLIT
>> > B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
>> > B2_ERRORS;
>> >
>> > -- B2_ERRORS can not really contain the input to the UDF as it would have
>> a
>> > different schema depending on what UDF failed
>> > DESCRIBE B_ERRORS;
>> > B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf:
>> chararray,
>> > error:(class: chararray, message: chararray, stacktrace: chararray) }
>> >
>> > -- example of filtering on the udf
>> > C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
>> > C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';
>> >
>> > Julien
>> >
>> > On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>> >
>> > We should think more about the interface.
>> > For example, "Tuple input" argument -- is that the tuple that was passed
>> to
>> > the udf, or the whole tuple that was being processed? I can see wanting
>> > both.
>> > Also the Handler should probably have init and finish methods in case
>> some
>> > accumulation is happening, or state needs to get set up...
>> >
>> > not sure about "splitting" into a table. Maybe more like
>> >
>> > A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
>> > A_ERRORS;
>> >
>> > "use" and "into" are optional syntactic sugar.
>> >
>> > This allows us to do any combination of:
>> > - die
>> > - put original record into a table
>> > - process the error using a custom handler (which can increment counters,
>> > write to dbs, send tweets... definitely send tweets...)
>> >
>> > D
>> >
>> > On Tue, Jan 18, 2011 at 10:27 AM, Julien Le Dem <ledemj@yahoo-inc.com
>> > >wrote:
>> >
>> > > That would be nice.
>> > > Also letting the error handler output the result to a relation would be
>> > > useful.
>> > > (To let the script output application error metrics)
>> > > For example it could (optionally) use the keyword INTO just like the
>> > SPLIT
>> > > operator.
>> > >
>> > > FOO = LOAD ...;
>> > > A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
>> > >
>> > > ErrorHandler would look a little more like EvalFunc:
>> > >
>> > > public interface ErrorHandler<T> {
>> > >
>> > >  public T handle(IOExcetion ioe, EvalFunc evalFunc, Tuple input) throws
>> > > IOException;
>> > >
>> > > public Schema outputSchema(Schema input);
>> > >
>> > > }
>> > >
>> > > There could be a built-in handler to output the skipped record (input:
>> > > tuple, funcname:chararray, errorMessage:chararray)
>> > >
>> > > A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
>> > >
>> > > Julien
>> > >
>> > > On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>> > >
>> > > I was thinking about this..
>> > >
>> > > We add an optional ON_ERROR clause to operators, which allows a user to
>> > > specify error handling. The error handler would be a udf that would
>> > > implement an interface along these lines:
>> > >
>> > > public interface ErrorHandler {
>> > >
>> > >  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>> > throws
>> > > IOException;
>> > >
>> > > }
>> > >
>> > > I think this makes sense not to make a static method so that users could
>> > > keep required state, and for example have the handler throw its own
>> > > IOException if it's been invoked too many times.
>> > >
>> > > D
>> > >
>> > >
>> > > On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <
>> sms@yahoo-inc.com
>> > > >wrote:
>> > >
>> > > > Thanks for the clarification Ashutosh.
>> > > >
>> > > > Implementing this in the user realm is tricky as Dmitriy states.
>> > > > Sensitivity to error thresholds will require support from the system.
>> > > > We can probably provide a taxonomy of records (good, bad, incomplete,
>> > > > etc.) to let users classify each record. The system can then track
>> > > > counts of each record type to facilitate the computation of thresholds.
>> > > > The last part is to allow users to specify thresholds and appropriate
>> > > > actions (interrupt, exit, continue, etc.). A possible mechanism to
>> > > > realize this is the ErrorHandlingUDF described by Dmitriy.
>> > > >
>> > > > Santhosh
>> > > >
>> > > > -----Original Message-----
>> > > > From: Ashutosh Chauhan [mailto:hashutosh@apache.org]
>> > > > Sent: Friday, January 14, 2011 7:35 PM
>> > > > To: user@pig.apache.org
>> > > > Subject: Re: Exception Handling in Pig Scripts
>> > > >
>> > > > Santhosh,
>> > > >
>> > > > The way you are proposing, it will kill the pig script. I think what
>> > > > the user wants is to ignore a few "bad records" and to process the rest
>> > > > and get results. The problem here is how to let the user tell Pig the
>> > > > definition of a "bad record" and how to let them specify a threshold
>> > > > for the % of bad records at which Pig should fail the script.
>> > > >
>> > > > Ashutosh
>> > > >
>> > > > On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <
>> sms@yahoo-inc.com>
>> > > > wrote:
>> > > > > Sorry about the late response.
>> > > > >
>> > > > > Hadoop n00b is proposing a language extension for error handling,
>> > > > > similar to the mechanisms in other well known languages like C++,
>> > > > > Java, etc.
>> > > > >
>> > > > > For now, can't the error semantics be handled by the UDF? For
>> > > > > exceptional scenarios you could throw an ExecException with the right
>> > > > > details. The physical operator that handles the execution of UDFs
>> > > > > traps it for you and propagates the error back to the client. You can
>> > > > > take a look at any of the builtin UDFs to see how Pig handles it
>> > > > > internally.
>> > > > >
>> > > > > Santhosh
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
>> > > > > Sent: Tuesday, January 11, 2011 10:41 AM
>> > > > > To: user@pig.apache.org
>> > > > > Subject: Re: Exception Handling in Pig Scripts
>> > > > >
>> > > > > Right now error handling is controlled by the UDFs themselves, and
>> > > > > there is no way to direct it externally.
>> > > > > You can make an ErrorHandlingUDF that would take a udf spec, invoke
>> > > > > it, trap errors, and then do the specified error handling behavior...
>> > > > > that's a bit ugly though.
>> > > > >
>> > > > > There is a problem with trapping general exceptions of course, in
>> > > > > that if they happen 0.000001% of the time you can probably just
>> > > > > ignore them, but if they happen in half your dataset, you want the
>> > > > > job to tell you something is wrong. So this stuff gets non-trivial.
>> > > > > If anyone wants to propose a design to solve this general problem, I
>> > > > > think that would be a welcome addition.
>> > > > >
>> > > > > D
>> > > > >
>> > > > > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <ne...@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > >> Thanks, I sometimes get a date like 0001-01-01. This would be a
>> > > > >> valid date format, but when I try to get the seconds between this
>> > > > >> and another date, say 2011-01-01, I get an error that the value is
>> > > > >> too large to fit into an int and the process stops. Do we have
>> > > > >> something like ifError(x-y, null, x-y)? Or would I have to implement
>> > > > >> this as a UDF?
>> > > > >>
>> > > > >> Thanks
>> > > > >>
>> > > > >> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <
>> > dvryaboy@gmail.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Create a UDF that verifies the format, and go through a filtering
>> > > > >> > step first.
>> > > > >> > If you would like to save the malformed records so you can look
>> > > > >> > at them later, you can use the SPLIT operator to route the good
>> > > > >> > records to your regular workflow, and the bad records some place
>> > > > >> > on HDFS.
>> > > > >> >
>> > > > >> > -D
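
A minimal sketch of the kind of validating UDF suggested here -- the class
name and the date pattern are illustrative assumptions, not something that
ships with Piggybank:

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Returns true only for dates that strictly match the expected pattern,
// so malformed records can be routed away before the date arithmetic runs.
public class IsValidDate extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setLenient(false); // reject things like month 13
        try {
            fmt.parse((String) input.get(0));
            return true;
        } catch (ParseException e) {
            return false; // malformed: send this record to the bad branch
        }
    }
}

It would then be used as described, e.g. SPLIT raw INTO good IF
IsValidDate(d), bad IF NOT IsValidDate(d); with the bad relation stored off
to the side for later inspection.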
>> > > > >> >
>> > > > >> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <
>> new2hive@gmail.com>
>> > > > wrote:
>> > > > >> >
>> > > > >> > > Hello,
>> > > > >> > >
>> > > > >> > > I have a pig script that uses piggy bank to calculate date
>> > > > >> > > differences. Sometimes, when I get a weird date or wrong format
>> > > > >> > > in the input, the script throws an error and aborts.
>> > > > >> > >
>> > > > >> > > Is there a way I could trap these errors and move on without
>> > > > >> > > stopping the execution?
>> > > > >> > >
>> > > > >> > > Thanks
>> > > > >> > >
>> > > > >> > > PS: I'm using CDH2 with Pig 0.5
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >
>> > > >
>> > >
>> > >
>> >
>> >
>>
>>
>
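
To make the ONERROR proposal above concrete, here is a minimal sketch of what
the built-in FailOnThreshold handler could look like against the proposed
ErrorHandler<T> interface. The counter plumbing is an assumption (marked in
the comments); nothing like this exists in Pig today.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// The interface as proposed in this thread.
interface ErrorHandler<T> {
    T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
    Schema outputSchema(Schema input);
    void collectResult() throws IOException; // called afterwards, client side
}

// Hypothetical built-in handler: drops bad records on the task side, fails
// the whole script on the client side if the error rate ends up too high.
class FailOnThreshold implements ErrorHandler<Void> {
    private final double threshold;
    private long errors;  // stand-ins: these would really be Hadoop counters
    private long records; // aggregated across all tasks of the job

    FailOnThreshold(double threshold) {
        this.threshold = threshold;
    }

    public Void handle(IOException ioe, EvalFunc evalFunc, Tuple input) {
        errors++;    // would increment a job counter here
        return null; // nothing is emitted for the failed record
    }

    public Schema outputSchema(Schema input) {
        return null; // no error relation is produced
    }

    public void collectResult() throws IOException {
        // errors/records would be read back from the job's aggregated counters
        if (records > 0 && (double) errors / records > threshold) {
            throw new IOException("error rate " + errors + "/" + records
                    + " exceeded threshold " + threshold);
        }
    }
}

With something like this, B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name)
ONERROR FailOnThreshold(0.01); drops the occasional bad record but still
aborts the script once more than 1% of the input has failed.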


Re: Exception Handling in Pig Scripts

Posted by Ashutosh Chauhan <ha...@apache.org>.
If it hasn't already been discussed, how does this interact with
Hadoop's feature of skipping bad records?
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/SkipBadRecords.html

Ashutosh
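
For reference, Hadoop's skipping feature is configured on the job rather than
expressed in the script; roughly like this with the old mapred API (see the
SkipBadRecords javadoc linked above -- the values here are arbitrary):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingConfig {
    public static JobConf withSkipping(JobConf conf) {
        // enter skipping mode only after two attempts of a task have failed
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // tolerate up to 100 bad input records around the failure point
        SkipBadRecords.setMapperMaxSkipRecords(conf, 100L);
        // keep the skipped records on HDFS so they can be inspected later
        SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/skipped"));
        return conf;
    }
}

The difference, as far as I can tell, is granularity: skipping only kicks in
after whole task attempts die, and works by re-running the task and narrowing
the bad range, so it survives even a JVM crash, while ONERROR traps
exceptions per record inside a healthy task.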

Re: Exception Handling in Pig Scripts

Posted by Julien Le Dem <le...@yahoo-inc.com>.
I see there is a PigErrorHandling; what about calling it PigErrorHandlingInScripts?
Julien



RE: Exception Handling in Pig Scripts

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Hi guys,

Could you put a quick wiki with your proposal together? I think it would make it much easier than following the email discussion.

Thanks,

Olga


Re: Exception Handling in Pig Scripts

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Right, what I am saying is that the tasks would not fail because we'd catch
the errors.

Thanks for the lmyit link.. learn something new every day.
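
In other words the exception would be trapped right where the UDF is invoked,
along these lines (a sketch only, reusing the ErrorHandler interface proposed
earlier -- this is not Pig's actual operator code, and the error routing is
made up):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// The task attempt never sees the exception, so MR-level retry stays
// reserved for genuine task failures (lost disks, dead nodes, ...).
public class TrappingInvoker {
    public static Object invoke(EvalFunc<?> udf, Tuple input,
                                ErrorHandler<?> handler) throws IOException {
        try {
            return udf.exec(input);
        } catch (IOException ioe) {
            // hand the record to the ONERROR handler: count it, log it,
            // or build an error tuple for the X_ERRORS relation
            Object errorTuple = handler.handle(ioe, udf, input);
            // errorTuple would be routed to the error relation, if one was named
            return null; // the main output simply drops this record
        }
    }
}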


Re: Exception Handling in Pig Scripts

Posted by Julien Le Dem <le...@yahoo-inc.com>.
Doesn't Hadoop discard the increments to counters done by failed tasks? (I would expect that, but I don't know)
Also using counters we should make sure we don't mix up multiple relations being combined by the optimizer.

P.S.: Regarding rror, I don't see why you would want two of these:
http://lmyit.com/rror
:P
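
Whichever way Hadoop treats counters from failed tasks, the client-side
collectResult() would presumably read the finished job's aggregated counters,
something like this (old mapred API; the group and counter names are invented
for illustration):

import java.io.IOException;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.RunningJob;

public class ThresholdCheck {
    // hypothetical client-side check, run once the job has completed
    public static void check(RunningJob job, double threshold) throws IOException {
        Counters counters = job.getCounters();
        long errors = counters.getGroup("ONERROR").getCounter("errors");
        long records = counters.getGroup("ONERROR").getCounter("records");
        if (records > 0 && (double) errors / records > threshold) {
            throw new IOException("error rate above threshold " + threshold);
        }
    }
}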


On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

I think this is coming together! I like the idea of a client-side handler
method that allows us to look at all errors in aggregate and make
decisions based on proportions. How can we guard against catching the wrong
mistakes -- say, letting a mapper that's running on a bad node and fails all
local disk writes finish "successfully" even though properly, the task just
needs to be rerun on a different mapper and normally MR would just take care
of it?
Let's put this on a wiki for wider feedback.

P.S. What's a "rror" and why do we only want one of them?

On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <le...@yahoo-inc.com> wrote:

> Some more thoughts.
>
> * Looking at the existing keywords:
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
> It seems ONERROR would be better than ON_ERROR for consistency. There is an
> existing ONSCHEMA but no _ based keyword.
>
> * The default behavior should be to die on error and can be overridden as
> follows:
> DEFAULT ONERROR <error handler>;
>
> * Built in error handlers:
> Ignore() => ignores errors by dropping records that cause exceptions
> Fail() => fails the script on error. (default)
> FailOnThreshold(threshold) => fails if number of errors above threshold
>
> * The error handler interface needs a method called on client side after
> the relation is computed to decide what to do next.
> Typically FailOnThreshold will throw an exception if
> (#errors/#input)>threshold using counters.
> public interface ErrorHandler<T> {
>
> // input is not the input of the UDF, it's the tuple from the relation
> T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
>  IOException;
>
> Schema outputSchema(Schema input);
>
> // called afterwards on the client side
> void collectResult() throws IOException;
>
> }
>
> * SPLIT is optional
>
> example:
> DEFAULT ONERROR Ignore();
> ...
>
> DESCRIBE A;
> A: {name: chararray, age: int, gpa: float}
>
> -- fail if more than 1% errors
> B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
> FailOnThreshold(0.01);
>
> -- need to make sure the twitter infrastructure can handle the load
> C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet();
>
> -- custom handler that counts errors and logs on the client side
> D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors();
>
> -- uses default handler and SPLIT
> B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> B2_ERRORS;
>
> -- B2_ERRORS can not really contain the input to the UDF as it would have a
> different schema depending on what UDF failed
> DESCRIBE B_ERRORS;
> B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray,
> error:(class: chararray, message: chararray, stacktrace: chararray) }
>
> -- example of filtering on the udf
> C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';
>
> Julien
>
> On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>
> We should think more about the interface.
> For example, "Tuple input" argument -- is that the tuple that was passed to
> the udf, or the whole tuple that was being processed? I can see wanting
> both.
> Also the Handler should probably have init and finish methods in case some
> accumulation is happening, or state needs to get set up...
>
> not sure about "splitting" into a table. Maybe more like
>
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
> A_ERRORS;
>
> "use" and "into" are optional syntactic sugar.
>
> This allows us to do any combination of:
> - die
> - put original record into a table
> - process the error using a custom handler (which can increment counters,
> write to dbs, send tweets... definitely send tweets...)
>
> D
>
> On Tue, Jan 18, 2011 at 10:27 AM, Julien Le Dem <ledemj@yahoo-inc.com
> >wrote:
>
> > That would be nice.
> > Also letting the error handler output the result to a relation would be
> > useful.
> > (To let the script output application error metrics)
> > For example it could (optionally) use the keyword INTO just like the
> SPLIT
> > operator.
> >
> > FOO = LOAD ...;
> > A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
> >
> > ErrorHandler would look a little more like EvalFunc:
> >
> > public interface ErrorHandler<T> {
> >
> >  public T handle(IOExcetion ioe, EvalFunc evalFunc, Tuple input) throws
> > IOException;
> >
> > public Schema outputSchema(Schema input);
> >
> > }
> >
> > There could be a built-in handler to output the skipped record (input:
> > tuple, funcname:chararray, errorMessage:chararray)
> >
> > A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
> >
> > Julien
> >
> > On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
> >
> > I was thinking about this..
> >
> > We add an optional ON_ERROR clause to operators, which allows a user to
> > specify error handling. The error handler would be a udf that would
> > implement an interface along these lines:
> >
> > public interface ErrorHandler {
> >
> >  public void handle(IOExcetion ioe, EvalFunc evalFunc, Tuple input)
> throws
> > IOException;
> >
> > }
> >
> > I think this makes sense not to make a static method so that users could
> > keep required state, and for example have the handler throw its own
> > IOException of it's been invoked too many times.
> >
> > D
> >
> >
> > On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <sms@yahoo-inc.com
> > >wrote:
> >
> > > Thanks for the clarification Ashutosh.
> > >
> > > Implementing this in the user realm is tricky as Dmitriy states.
> > > Sensitivity to error thresholds will require support from the system.
> We
> > can
> > > probably provide a taxonomy of records (good, bad, incomplete, etc.) to
> > let
> > > users classify each record. The system can then track counts of each
> > record
> > > type to facilitate the computation of thresholds. The last part is to
> > allow
> > > users to specify thresholds and appropriate actions (interrupt, exit,
> > > continue, etc.). A possible mechanism to realize this is the
> > > ErrorHandlingUDF described by Dmitriy.
> > >
> > > Santhosh
> > >
> > > -----Original Message-----
> > > From: Ashutosh Chauhan [mailto:hashutosh@apache.org]
> > > Sent: Friday, January 14, 2011 7:35 PM
> > > To: user@pig.apache.org
> > > Subject: Re: Exception Handling in Pig Scripts
> > >
> > > Santhosh,
> > >
> > > The way you are proposing, it will kill the pig script. I think what
> user
> > > wants is to ignore few "bad records" and to process the rest and get
> > > results. Problem here is how to let user tell Pig the definition of
> "bad
> > > record" and how to let him specify threshold for % of bad records at
> > which
> > > Pig should fail the script.
> > >
> > > Ashutosh
> > >
> > > On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <sm...@yahoo-inc.com>
> > > wrote:
> > > > Sorry about the late response.
> > > >
> > > > Hadoop n00b is proposing a language extension for error handling,
> > similar
> > > to the mechanisms in other well known languages like C++, Java, etc.
> > > >
> > > > For now, can't the error semantics be handled by the UDF? For
> > exceptional
> > > scenarios you could throw an ExecException with the right details. The
> > > physical operator that handles the execution of UDF's traps it for you
> > and
> > > propagates the error back to the client. You can take a look at any of
> > the
> > > builtin UDFs to see how Pig handles it internally.
> > > >
> > > > Santhosh
> > > >
> > > > -----Original Message-----
> > > > From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
> > > > Sent: Tuesday, January 11, 2011 10:41 AM
> > > > To: user@pig.apache.org
> > > > Subject: Re: Exception Handling in Pig Scripts
> > > >
> > > > Right now error handling is controlled by the UDFs themselves, and
> > there
> > > is no way to direct it externally.
> > > > You can make an ErrorHandlingUDF that would take a udf spec, invoke
> it,
> > > trap errors, and then do the specified error handling behavior.. that's
> a
> > > bit ugly though.
> > > >
> > > > There is a problem with trapping general exceptions of course, in
> that
> > if
> > > they happen 0.000001% of the time you can probably just ignore them,
> but
> > if
> > > they happen in half your dataset, you want the job to tell you
> something
> > is
> > > wrong. So this stuff gets non-trivial. If anyone wants to propose a
> > design
> > > to solve this general problem, I think that would be a welcome
> addition.
> > > >
> > > > D
> > > >
> > > > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <ne...@gmail.com>
> > > wrote:
> > > >
> > > >> Thanks, I sometimes get a date like 0001-01-01. This would be a
> valid
> > > >> date format, but when I try to get the seconds between this and
> > > >> another date, say 2011-01-01, I get an error that the value is too
> > > >> large to be fit into int and the process stops. Do we have something
> > > >> like ifError(x-y, null,x-y)? Or would I have to implement this as an
> > > >> UDF?
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <
> dvryaboy@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Create a UDF that verifies the format, and go through a filtering
> > > >> > step first.
> > > >> > If you would like to save the malformated records so you can look
> > > >> > at them later, you can use the SPLIT operator to route the good
> > > >> > records to your regular workflow, and the bad records some place
> on
> > > HDFS.
> > > >> >
> > > >> > -D
> > > >> >
> > > >> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <ne...@gmail.com>
> > > wrote:
> > > >> >
> > > >> > > Hello,
> > > >> > >
> > > >> > > I have a pig script that uses piggy bank to calculate date
> > > differences.
> > > >> > > Sometimes, when I get a wierd date or wrong format in the input,
> > > >> > > the
> > > >> > script
> > > >> > > throws and error and aborts.
> > > >> > >
> > > >> > > Is there a way I could trap these errors and move on without
> > > >> > > stopping
> > > >> the
> > > >> > > execution?
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > > PS: I'm using CDH2 with Pig 0.5
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
> >
>
>


Re: Exception Handling in Pig Scripts

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I think this is coming together! I like the idea of a client-side handler
method that allows us to look at all errors in aggregate and make
decisions based on proportions. How can we guard against catching the wrong
mistakes -- say, letting a mapper that's running on a bad node and failing all
its local disk writes finish "successfully", when really the task just needs
to be rerun on a different node and MR would normally take care of it?
Let's put this on a wiki for wider feedback.

P.S. What's a "rror" and why do we only want one of them?


Re: Exception Handling in Pig Scripts

Posted by Julien Le Dem <le...@yahoo-inc.com>.
Some more thoughts.

* Looking at the existing keywords:
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
It seems ONERROR would be better than ON_ERROR for consistency. There is an existing ONSCHEMA but no underscore-based keyword.

* The default behavior should be to die on error; it can be overridden as follows:
DEFAULT ONERROR <error handler>;

* Built-in error handlers:
Ignore() => ignores errors by dropping the records that cause exceptions
Fail() => fails the script on error (the default)
FailOnThreshold(threshold) => fails if the proportion of errors is above the threshold

* The error handler interface needs a method called on the client side after the relation is computed, to decide what to do next.
Typically, FailOnThreshold will use counters to throw an exception if (#errors/#input) > threshold.
public interface ErrorHandler<T> {

  // input is not the input of the UDF; it's the tuple from the relation
  T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

  Schema outputSchema(Schema input);

  // called afterwards on the client side
  void collectResult() throws IOException;
}
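
To make the contract concrete, here is a minimal sketch of what FailOnThreshold could look like against this interface. It is purely illustrative: the totals are plain fields here, whereas a real implementation would use Hadoop counters so that collectResult(), running on the client, sees values aggregated across all tasks.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class FailOnThreshold implements ErrorHandler<Tuple> {

  private final double threshold;

  // Illustration only: in practice these would be Hadoop counters,
  // incremented in the tasks and read back on the client.
  private long errors = 0;
  private long inputs = 0;

  public FailOnThreshold(double threshold) {
    this.threshold = threshold;
  }

  public Tuple handle(IOException ioe, EvalFunc evalFunc, Tuple input)
      throws IOException {
    errors++;
    return null; // drop the failing record; the fail/pass decision is deferred
  }

  public Schema outputSchema(Schema input) {
    return input; // failing records are dropped, so the schema passes through
  }

  // Called on the client once the relation has been computed.
  public void collectResult() throws IOException {
    // 'inputs' would be read from the job's input-record counter.
    if (inputs > 0 && (double) errors / inputs > threshold) {
      throw new IOException("error rate " + errors + "/" + inputs
          + " exceeded threshold " + threshold);
    }
  }
}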

* SPLIT is optional

example:
DEFAULT ONERROR Ignore();
...

DESCRIBE A;
A: {name: chararray, age: int, gpa: float}

-- fail if more than 1% of records error
B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR FailOnThreshold(0.01) ;

-- need to make sure the twitter infrastructure can handle the load
C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;

-- custom handler that counts errors and logs on the client side
D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors() ;

-- uses default handler and SPLIT
B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO B2_ERRORS;

-- B2_ERRORS cannot really contain the input to the UDF, as it would have a different schema depending on which UDF failed
DESCRIBE B2_ERRORS;
B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray, error:(class: chararray, message: chararray, stacktrace: chararray) }

-- example of filtering on the udf
C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';

Julien



Re: Exception Handling in Pig Scripts

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
We should think more about the interface.
For example, "Tuple input" argument -- is that the tuple that was passed to
the udf, or the whole tuple that was being processed? I can see wanting
both.
Also the Handler should probably have init and finish methods in case some
accumulation is happening, or state needs to get set up...
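
As a rough sketch (the names init and finish are placeholders, not an agreed API):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public interface ErrorHandler {

  // called once per task before any errors are handled, to set up state
  void init() throws IOException;

  void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
      throws IOException;

  // called once per task at the end, e.g. to flush accumulated state
  void finish() throws IOException;
}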

not sure about "splitting" into a table. Maybe more like

A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
A_ERRORS;

"use" and "into" are optional syntactic sugar.

This allows us to do any combination of:
- die
- put original record into a table
- process the error using a custom handler (which can increment counters,
write to dbs, send tweets... definitely send tweets...)

D


Re: Exception Handling in Pig Scripts

Posted by Julien Le Dem <le...@yahoo-inc.com>.
In some cases you just don't care and want to skip a couple of bad records. For example, you're writing ad hoc scripts to extract some stats.
In other cases you have a production system based on Pig and you want clear metrics on the ignored data (without adding extra filtering and complexity to your algorithm).

The idea is to be able to handle both.
What about this in the case you describe?
FOREACH FOO GENERATE Bar(*) ON_ERROR SkipMaxHandler(5);
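
A minimal sketch of such a handler, against the void-returning interface suggested earlier (entirely illustrative):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class SkipMaxHandler implements ErrorHandler {

  private final int maxSkips;
  private int skipped = 0;

  public SkipMaxHandler(int maxSkips) {
    this.maxSkips = maxSkips;
  }

  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
      throws IOException {
    if (++skipped > maxSkips) {
      throw new IOException("more than " + maxSkips + " bad records", ioe);
    }
    // otherwise drop the record silently and keep going
  }
}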

And I would throw in as well:
DEFAULT ON_ERROR SPLIT MyHandler INTO ERRORS;
(It would need to append to the relation. ERRORS = UNION ERRORS, NEW_ERRORS ?)
Julien


Re: Exception Handling in Pig Scripts

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
If we're talking about a couple of bad records, can we directly use the skip-record feature in MapReduce?
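
For reference, that feature is configured through org.apache.hadoop.mapred.SkipBadRecords, roughly like this (old mapred API; the values are illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipBadRecordsExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // begin skipping once the same task has failed twice
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // tolerate up to 100 unprocessable records per map task
    SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
  }
}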

Koji


On 1/18/11 10:27 AM, "Julien Le Dem" <le...@yahoo-inc.com> wrote:

That would be nice.
Also letting the error handler output the result to a relation would be useful.
(To let the script output application error metrics)
For example it could (optionally) use the keyword INTO just like the SPLIT operator.

FOO = LOAD ...;
A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;

ErrorHandler would look a little more like EvalFunc:

public interface ErrorHandler<T> {

  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

public Schema outputSchema(Schema input);

}

There could be a built-in handler to output the skipped record (input: tuple, funcname:chararray, errorMessage:chararray)

A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;

Julien

On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

I was thinking about this..

We add an optional ON_ERROR clause to operators, which allows a user to
specify error handling. The error handler would be a udf that would
implement an interface along these lines:

public interface ErrorHandler {

  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

}

I think it makes sense not to make this a static method, so that users could
keep required state and, for example, have the handler throw its own
IOException if it's been invoked too many times.

D


On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <sm...@yahoo-inc.com>wrote:

> Thanks for the clarification Ashutosh.
>
> Implementing this in the user realm is tricky as Dmitriy states.
> Sensitivity to error thresholds will require support from the system. We can
> probably provide a taxonomy of records (good, bad, incomplete, etc.) to let
> users classify each record. The system can then track counts of each record
> type to facilitate the computation of thresholds. The last part is to allow
> users to specify thresholds and appropriate actions (interrupt, exit,
> continue, etc.). A possible mechanism to realize this is the
> ErrorHandlingUDF described by Dmitriy.
>
> Santhosh
>
> -----Original Message-----
> From: Ashutosh Chauhan [mailto:hashutosh@apache.org]
> Sent: Friday, January 14, 2011 7:35 PM
> To: user@pig.apache.org
> Subject: Re: Exception Handling in Pig Scripts
>
> Santhosh,
>
> The way you are proposing, it will kill the pig script. I think what the user
> wants is to ignore a few "bad records" and to process the rest and get
> results. The problem here is how to let the user tell Pig the definition of a "bad
> record" and how to let him specify the threshold of bad records at which
> Pig should fail the script.
>
> Ashutosh
>
> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <sm...@yahoo-inc.com>
> wrote:
> > Sorry about the late response.
> >
> > Hadoop n00b is proposing a language extension for error handling, similar
> to the mechanisms in other well known languages like C++, Java, etc.
> >
> > For now, can't the error semantics be handled by the UDF? For exceptional
> scenarios you could throw an ExecException with the right details. The
> physical operator that handles the execution of UDF's traps it for you and
> propagates the error back to the client. You can take a look at any of the
> builtin UDFs to see how Pig handles it internally.
> >
> > Santhosh
> >
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
> > Sent: Tuesday, January 11, 2011 10:41 AM
> > To: user@pig.apache.org
> > Subject: Re: Exception Handling in Pig Scripts
> >
> > Right now error handling is controlled by the UDFs themselves, and there
> is no way to direct it externally.
> > You can make an ErrorHandlingUDF that would take a udf spec, invoke it,
> trap errors, and then do the specified error handling behavior.. that's a
> bit ugly though.
> >
> > There is a problem with trapping general exceptions of course, in that if
> they happen 0.000001% of the time you can probably just ignore them, but if
> they happen in half your dataset, you want the job to tell you something is
> wrong. So this stuff gets non-trivial. If anyone wants to propose a design
> to solve this general problem, I think that would be a welcome addition.
> >
> > D
> >
> > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <ne...@gmail.com>
> wrote:
> >
> >> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
> >> date format, but when I try to get the seconds between this and
> >> another date, say 2011-01-01, I get an error that the value is too
> >> large to fit into an int and the process stops. Do we have something
> >> like ifError(x-y, null, x-y)? Or would I have to implement this as a
> >> UDF?
> >>
> >> Thanks
> >>
> >> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dv...@gmail.com>
> >> wrote:
> >>
> >> > Create a UDF that verifies the format, and go through a filtering
> >> > step first.
> >> > If you would like to save the malformed records so you can look
> >> > at them later, you can use the SPLIT operator to route the good
> >> > records to your regular workflow, and the bad records some place on
> HDFS.
> >> >
> >> > -D
> >> >
> >> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <ne...@gmail.com>
> wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > I have a pig script that uses piggy bank to calculate date
> differences.
> >> > > Sometimes, when I get a weird date or a wrong format in the input,
> >> > > the script throws an error and aborts.
> >> > >
> >> > > Is there a way I could trap these errors and move on without
> >> > > stopping the execution?
> >> > >
> >> > > Thanks
> >> > >
> >> > > PS: I'm using CDH2 with Pig 0.5
> >> > >
> >> >
> >>
> >
>