Posted to dev@pig.apache.org by Siddhi Mehta <sm...@gmail.com> on 2015/10/14 23:20:34 UTC

Re: Threshold for errors in STORE

Sending to the Pig developers group.

On Wed, Oct 14, 2015 at 2:17 PM, Siddhi Mehta <sm...@gmail.com> wrote:

> Hello Everyone,
>
> Just wanted to follow up on my earlier post and see if there are any
> thoughts on it.
> I was planning to take a stab at implementing it.
>
> The approach I was planning to use is:
> 1. Make the storer that wants error handling capability implement an
> interface (ErrorHandlingStoreFunc).
> 2. Using this interface, the storer can define the thresholds for
> errors. Each store func can determine what its threshold should be. For
> example, HBaseStorage can have a different threshold from ParquetStorage.
> 3. Whenever the storer gets created in
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc()
> we intercept the call and give it a wrappedStoreFunc.
> 4. Every putNext() call now gets delegated to the actual storer via the
> wrapper, and we can listen for errors on putNext() and either tolerate
> the error if it is within the threshold or rethrow it from there.
> 5. The client can get information about the threshold value from the
> counters to know if any data was dropped.
>
> Thoughts?
>
> Thanks,
> Siddhi
>
>
> On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <sm...@gmail.com>
> wrote:
>
>> Hey Guys,
>>
>> Currently a Pig job fails when a single record out of billions
>> fails on STORE.
>> This is not always desirable behavior when you are dealing with millions
>> of records and only a few fail.
>> In certain use cases it's desirable to know how many such errors occurred
>> and have an accounting of them.
>> Is there a configurable limit that we can set for Pig so that we can
>> allow a threshold for bad records on STORE, along the lines of the JIRA
>> for LOAD, PIG-3059 <https://issues.apache.org/jira/browse/PIG-3059>?
>>
>> Thanks,
>> Siddhi
>>
>
>
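
The five steps proposed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Pig's actual API: RecordStore is a hypothetical stand-in for StoreFunc, ErrorHandlingStore stands in for the proposed ErrorHandlingStoreFunc, and ThresholdStoreWrapper, wrapIfSupported(), and maxErrors() are made-up names. It assumes the threshold is an absolute error count (the thread does not say whether it would be a count or a ratio) and uses plain fields where a real implementation would presumably use Hadoop counters.

```java
import java.io.IOException;

/** Hypothetical stand-in for Pig's StoreFunc; only putNext() is modeled. */
interface RecordStore<T> {
    void putNext(T record) throws IOException;
}

/** Hypothetical opt-in interface (steps 1 and 2): each storer declares its own threshold. */
interface ErrorHandlingStore<T> extends RecordStore<T> {
    /** Assumed semantics: number of failed records tolerated before the job fails. */
    long maxErrors();
}

/** Wrapper (steps 3 to 5): delegates putNext() and counts failures. */
final class ThresholdStoreWrapper<T> implements RecordStore<T> {
    private final ErrorHandlingStore<T> delegate;
    private long records; // plain fields here; stand-ins for job counters
    private long errors;

    ThresholdStoreWrapper(ErrorHandlingStore<T> delegate) {
        this.delegate = delegate;
    }

    /** Step 3: wrap only storers that opted in; others are returned unchanged. */
    static <T> RecordStore<T> wrapIfSupported(RecordStore<T> store) {
        if (store instanceof ErrorHandlingStore) {
            return new ThresholdStoreWrapper<T>((ErrorHandlingStore<T>) store);
        }
        return store;
    }

    /** Step 4: tolerate a failure while within the threshold, otherwise rethrow. */
    @Override
    public void putNext(T record) throws IOException {
        records++;
        try {
            delegate.putNext(record);
        } catch (IOException e) {
            errors++;
            if (errors > delegate.maxErrors()) {
                throw e; // threshold exceeded: fail as before
            }
            // within threshold: the record is dropped and the job continues
        }
    }

    /** Step 5: values a client could inspect to see whether data was dropped. */
    long recordCount() { return records; }
    long errorCount() { return errors; }
}

final class ThresholdDemo {
    public static void main(String[] args) throws IOException {
        // Toy storer that rejects empty strings and tolerates one bad record.
        ErrorHandlingStore<String> flaky = new ErrorHandlingStore<String>() {
            @Override public void putNext(String r) throws IOException {
                if (r.isEmpty()) throw new IOException("bad record");
            }
            @Override public long maxErrors() { return 1; }
        };
        ThresholdStoreWrapper<String> store = new ThresholdStoreWrapper<String>(flaky);
        store.putNext("a");
        store.putNext(""); // fails, but is within the threshold
        store.putNext("b");
        System.out.println("records=" + store.recordCount()
                + " dropped=" + store.errorCount()); // prints records=3 dropped=1
    }
}
```

Only IOException is caught in this sketch; whether runtime exceptions from a storer should also count toward the threshold is a design question the thread leaves open.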

Re: Threshold for errors in STORE

Posted by Prashant Kommireddi <pr...@gmail.com>.
The proposed approach sounds good. If there are no objections, could you
please go ahead and file a JIRA? I can take a look once you have a patch
available.
