Posted to dev@pig.apache.org by "Prashant Kommireddi (JIRA)" <ji...@apache.org> on 2015/10/20 08:38:27 UTC

[jira] [Commented] (PIG-4704) Customizable Error Handling for Storers in Pig

    [ https://issues.apache.org/jira/browse/PIG-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964650#comment-14964650 ] 

Prashant Kommireddi commented on PIG-4704:
------------------------------------------

Thanks [~siddhimehta] for the contribution. A few comments:

* Can you please add Apache license headers to all files? A few seem to be missing them.
* How does a user query for counters that are reported by the ErrorHandler? Based on "storer_Error_Handler" I believe.
* {code}
private Counter getRecordCountCounter(String storeSignature) {
    PigStatusReporter reporter = PigStatusReporter.getInstance();
    Counter counter = reporter.getCounter(STORER_ERROR_COUNT_GROUP,
            getCounterNameForStore(STORER_RECORD_COUNT, storeSignature));
    return counter;
}
{code}

We could probably change this method to take another argument (String counterName) and have "incAndGetErrorCount" reuse it. Right now 

* You would have to remove @author tags *smile*
* The method *handle* in *OutputErrorHandler* could be made non-final. Classes extending it might want to do something specific to their needs.
* We probably don't want to log every error case, since a badly behaved job could then fill up the disks on cluster nodes. This call could be removed from the *handle* method in *OutputErrorHandler*: {code} Log.debug("Handling error " + cause); {code}
* The logic in *OutputErrorHandler* appears to be custom to a single use-case, which is handling errors based on BOTH minErrors and errorThreshold. Do you think we should keep this logic in a base abstract class, or move it to an impl instead?
* In *WrappedErrorHandlingFunc* you are using the instance var "errorHandler" in one place and the method getErrorHandler in another. Can we make it consistent by using either the var or the method in both places?
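
To make the suggestions about *handle* and the threshold logic concrete, here is a minimal sketch of how the split could look. It is not the code from the patch: the class names SketchErrorHandler and ThresholdErrorHandler, the recordProcessed hook, and the exact rule (fail only when the error count exceeds minErrors and the error rate exceeds errorThreshold) are all assumptions made for illustration, and it uses no Pig classes so that it compiles on its own.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical base class (not the patch's OutputErrorHandler): it only
// defines the hooks and leaves the failure policy to subclasses.
abstract class SketchErrorHandler {
    // Non-final, so a subclass can override it for its own needs.
    public void handle(Exception cause) {
        onError(cause);
    }

    // Called once for every record the storer attempts to write.
    public abstract void recordProcessed();

    // Policy hook: decide whether to tolerate the error or fail.
    protected abstract void onError(Exception cause);
}

// Hypothetical implementation that owns the minErrors + errorThreshold policy.
class ThresholdErrorHandler extends SketchErrorHandler {
    private final long minErrors;        // error count that is always tolerated
    private final double errorThreshold; // maximum tolerated errors / records
    private final AtomicLong records = new AtomicLong();
    private final AtomicLong errors = new AtomicLong();

    ThresholdErrorHandler(long minErrors, double errorThreshold) {
        this.minErrors = minErrors;
        this.errorThreshold = errorThreshold;
    }

    @Override
    public void recordProcessed() {
        records.incrementAndGet();
    }

    public long getErrorCount() {
        return errors.get();
    }

    @Override
    protected void onError(Exception cause) {
        long errorCount = errors.incrementAndGet();
        long recordCount = Math.max(records.get(), 1L); // avoid division by zero
        double errorRate = (double) errorCount / recordCount;
        // Assumed semantics: fail only when BOTH limits are exceeded.
        if (errorCount > minErrors && errorRate > errorThreshold) {
            throw new IllegalStateException("Error threshold exceeded: "
                    + errorCount + " errors in " + recordCount + " records", cause);
        }
    }
}
```

With this shape the base class stays free of any particular policy, and a storer that wants different rules (for example, an absolute error cap only) would supply another subclass instead of changing the base. Note that nothing is logged per error here; the counts are the only record kept.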

  


> Customizable Error Handling for Storers in Pig 
> -----------------------------------------------
>
>                 Key: PIG-4704
>                 URL: https://issues.apache.org/jira/browse/PIG-4704
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Siddhi Mehta
>            Assignee: Siddhi Mehta
>         Attachments: PIG-4704.patch
>
>
> On Thu, Oct 15, 2015 at 4:06 AM, Saggi Neumann <sa...@xplenty.com> wrote:
> You may also check these for ideas. It would be good to have them
> implemented:
> https://wiki.apache.org/pig/PigErrorHandlingInScripts
> https://issues.apache.org/jira/browse/PIG-2620
> --
> Saggi Neumann
> Co-founder and CTO, Xplenty
> M: +972-544-546102
> On Thu, Oct 15, 2015 at 12:17 AM, Siddhi Mehta <sm...@gmail.com> wrote:
> > Hello Everyone,
> >
> > Just wanted to follow up on my earlier post and see if there are any
> > thoughts around the same.
> > I was planning to take a stab to implement the same.
> >
> > The approach I was planning to use for the same is
> > 1. Make the storer that wants error handling capability implement an
> > interface(ErrorHandlingStoreFunc).
> > 2. Using this interface, the storer can define the thresholds for
> > errors. Each store func can determine what the threshold should be. For
> > example, HbaseStorage can have a different threshold from ParquetStorage.
> > 3. Whenever the storer gets created in
> >
> > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc()
> > we intercept the call and give it a wrappedStoreFunc
> > 4. Every putNext() call now gets delegated to the actual storer via the
> > delegate, and we can listen for errors on putNext() and either allow
> > the error if it is within the threshold or rethrow it from there.
> > 5. The client can get information about the threshold value from  the
> > counters to know if there was any data dropped.
> >
> > Thoughts?
> >
> > Thanks,
> > Siddhi
> >
> >
> > On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <sm...@gmail.com>
> > wrote:
> >
> > > Hey Guys,
> > >
> > > Currently a Pig job fails when one record out of billions of records
> > > fails on STORE.
> > > This is not always desirable behavior when you are dealing with millions
> > > of records and only a few fail.
> > > In certain use cases it's desirable to know how many such errors occurred
> > > and have an accounting for the same.
> > > Is there a configurable limit that we can set for Pig so that we can
> > > allow a threshold for bad records on STORE, along the lines of the JIRA
> > > for LOAD PIG-3059 <https://issues.apache.org/jira/browse/PIG-3059>
> > >
> > > Thanks,
> > > Siddhi
> > >
> >


