Posted to common-user@hadoop.apache.org by Maheshwaran Janarthanan <as...@hotmail.com> on 2011/08/09 19:28:29 UTC

Skipping Bad Records in M/R Job

Hi,

I have written a MapReduce job that uses third-party libraries to process previously unseen data, and errors in some records cause the whole job to fail.

I came across the 'Skipping Bad Records' feature in Hadoop MapReduce. Can anyone send me a code snippet that enables this feature by setting properties on JobConf?

Thanks,
Ashwin!
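[For reference, a minimal sketch of what enabling the feature could look like on the old mapred API, assuming a 0.20.x-era Hadoop where org.apache.hadoop.mapred.SkipBadRecords is available; the thresholds and output path below are purely illustrative. Owen's reply later in the thread notes the feature was experimental.]

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipBadRecordsSetup {
  // Configure the job so that, after repeated task failures, the framework
  // narrows down and skips the offending input records.
  public static void enableSkipping(JobConf conf) {
    // Enter skipping mode after 2 failed attempts of the same task.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // Allow up to 100 map input records to be skipped around a failure.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 100L);
    // Allow up to 10 reduce key groups to be skipped around a failure.
    SkipBadRecords.setReducerMaxSkipGroups(conf, 10L);
    // Record skipped records here for later inspection (path is illustrative).
    SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/skipped-records"));
  }
}
```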



> Date: Sun, 7 Aug 2011 01:11:29 +0530
> From: jagaran_das@yahoo.co.in
> Subject: Help on DFSClient
> To: common-user@hadoop.apache.org; user@pig.apache.org
> 
> I am keeping a stream open and writing through it from a multithreaded application.
> The application runs on a different box, and I am connecting to the NN remotely.
> 
> I was using FileSystem and got this error; now I am trying DFSClient and getting the same error.
> 
> When I run it via a simple standalone class it does not throw any error, but when I put it in my application it throws the error below.
> Please help me with this.
> 
> Regards,
> JD 
> 
>       
>  public String toString() {
>       String s = getClass().getSimpleName();
>       if (LOG.isTraceEnabled()) {
>         return s + "@" + DFSClient.this + ": "
>                + StringUtils.stringifyException(new Throwable("for testing"));
>       }
>       return s;
>     }
> 
> My Stack Trace :::
> 
>       
> 06Aug2011 12:29:24,345 DEBUG [listenerContainer-1] (DFSClient.java:1115) - Wait for lease checker to terminate
> 06Aug2011 12:29:24,346 DEBUG [LeaseChecker@DFSClient[clientName=DFSClient_280246853, ugi=jagarandas]: java.lang.Throwable: for testing
> at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:1181)
> at org.apache.hadoop.util.Daemon.<init>(Daemon.java:38)
> at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.put(DFSClient.java:1094)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:547)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:513)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:497)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:442)
> at com.apple.ireporter.common.persistence.ConnectionManager.createConnection(ConnectionManager.java:74)
> at com.apple.ireporter.common.persistence.HDPPersistor.writeToHDP(HDPPersistor.java:95)
> at com.apple.ireporter.datatransformer.translator.HDFSTranslator.persistData(HDFSTranslator.java:41)
> at com.apple.ireporter.datatransformer.adapter.TranslatorAdapter.processData(TranslatorAdapter.java:61)
> at com.apple.ireporter.datatransformer.DefaultMessageListener.persistValidatedData(DefaultMessageListener.java:276)
> at com.apple.ireporter.datatransformer.DefaultMessageListener.onMessage(DefaultMessageListener.java:93)
> at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:506)
> at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:463)
> at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:435)
> at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:322)
> at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:260)
> at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:944)
> at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:868)
> at java.lang.Thread.run(Thread.java:680)

RE: Skipping Bad Records in M/R Job

Posted by Aaron Baff <Aa...@telescope.tv>.
I'm curious, what error could be thrown that can't be handled via try/catch by catching Exception or Throwable?

--Aaron
-----Original Message-----
From: Maheshwaran Janarthanan [mailto:ashwinwaran@hotmail.com]
Sent: Tuesday, August 09, 2011 10:41 AM
To: HADOOP USERGROUP
Subject: RE: Skipping Bad Records in M/R Job


Aaron,

I am doing some HTML parsing and special content extraction that throws errors which can't be handled by the usual exception-handling mechanism!

Thanks,
Mahesh

> From: Aaron.Baff@telescope.tv
> To: common-user@hadoop.apache.org
> Date: Tue, 9 Aug 2011 10:38:37 -0700
> Subject: RE: Skipping Bad Records in M/R Job
>
> If the 3rd party library is used as part of your Map() function, you could just catch the appropriate Exceptions, and simply not emit that record and return from the Map() normally.
>
> --Aaron
> -----Original Message-----
> From: Maheshwaran Janarthanan [mailto:ashwinwaran@hotmail.com]
> Sent: Tuesday, August 09, 2011 10:28 AM
> To: HADOOP USERGROUP
> Subject: Skipping Bad Records in M/R Job
>
>
> Hi,
>
> I have written a Map reduce job which uses third party libraries to process unseen data which makes job fail because of errors in records.
>
> I realized 'Skipping Bad Records' feature in Hadoop Map/Reduce. Can Anyone send me the code snippet which enables this feature by setting properties on JobConf
>
> Thanks,
> Ashwin!

RE: Skipping Bad Records in M/R Job

Posted by Maheshwaran Janarthanan <as...@hotmail.com>.
Aaron,

I am doing some HTML parsing and special content extraction that throws errors which can't be handled by the usual exception-handling mechanism!

Thanks,
Mahesh

> From: Aaron.Baff@telescope.tv
> To: common-user@hadoop.apache.org
> Date: Tue, 9 Aug 2011 10:38:37 -0700
> Subject: RE: Skipping Bad Records in M/R Job
> 
> If the 3rd party library is used as part of your Map() function, you could just catch the appropriate Exceptions, and simply not emit that record and return from the Map() normally.
> 
> --Aaron
> -----Original Message-----
> From: Maheshwaran Janarthanan [mailto:ashwinwaran@hotmail.com]
> Sent: Tuesday, August 09, 2011 10:28 AM
> To: HADOOP USERGROUP
> Subject: Skipping Bad Records in M/R Job
> 
> 
> Hi,
> 
> I have written a Map reduce job which uses third party libraries to process unseen data which makes job fail because of errors in records.
> 
> I realized 'Skipping Bad Records' feature in Hadoop Map/Reduce. Can Anyone send me the code snippet which enables this feature by setting properties on JobConf
> 
> Thanks,
> Ashwin!

RE: Skipping Bad Records in M/R Job

Posted by Aaron Baff <Aa...@telescope.tv>.
If the 3rd party library is used as part of your Map() function, you could just catch the appropriate Exceptions, and simply not emit that record and return from the Map() normally.

--Aaron
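[Aaron's catch-and-skip approach can be sketched independently of the Hadoop types. Here parseRecord is a stand-in for the third-party call, an assumption for illustration; in a real mapper the skipped count would go to a job counter rather than stdout.]

```java
import java.util.ArrayList;
import java.util.List;

public class CatchAndSkip {
  // Stand-in for the third-party parser; throws on malformed records.
  static int parseRecord(String record) {
    return Integer.parseInt(record.trim());
  }

  // Process all records; a bad record is counted and skipped instead of
  // failing the whole batch, mirroring catch-and-return-normally in map().
  public static List<Integer> processAll(List<String> records) {
    List<Integer> parsed = new ArrayList<>();
    int skipped = 0;
    for (String record : records) {
      try {
        parsed.add(parseRecord(record));
      } catch (Exception e) {
        skipped++;  // in a mapper: increment a counter and continue
      }
    }
    System.out.println("parsed=" + parsed.size() + " skipped=" + skipped);
    return parsed;
  }

  public static void main(String[] args) {
    processAll(List.of("1", "2", "oops", "4"));  // prints parsed=3 skipped=1
  }
}
```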
-----Original Message-----
From: Maheshwaran Janarthanan [mailto:ashwinwaran@hotmail.com]
Sent: Tuesday, August 09, 2011 10:28 AM
To: HADOOP USERGROUP
Subject: Skipping Bad Records in M/R Job


Hi,

I have written a Map reduce job which uses third party libraries to process unseen data which makes job fail because of errors in records.

I realized 'Skipping Bad Records' feature in Hadoop Map/Reduce. Can Anyone send me the code snippet which enables this feature by setting properties on JobConf

Thanks,
Ashwin!


RE: Skipping Bad Records in M/R Job

Posted by Maheshwaran Janarthanan <as...@hotmail.com>.
Thank you, Owen and Aaron,

I will spawn a subprocess and handle the third-party errors myself!

-- Ashwin!

> Date: Tue, 9 Aug 2011 18:11:40 +0000
> Subject: Re: Skipping Bad Records in M/R Job
> From: owen@hortonworks.com
> To: common-user@hadoop.apache.org
> 
> On Tue, Aug 9, 2011 at 5:28 PM, Maheshwaran Janarthanan <
> ashwinwaran@hotmail.com> wrote:
> 
> >
> > Hi,
> >
> > I have written a Map reduce job which uses third party libraries to process
> > unseen data which makes job fail because of errors in records.
> >
> > I realized 'Skipping Bad Records' feature in Hadoop Map/Reduce. Can Anyone
> > send me the code snippet which enables this feature by setting properties on
> > JobConf
> >
> 
> I wouldn't recommend using the bad record skipping, since it was always
> experimental and I don't think it has been well maintained.
> 
> If your 3rd party library crashes the JVM, I'd suggest using a subprocess to
> call it and handle the errors yourself.
> 
> -- Owen

Re: Skipping Bad Records in M/R Job

Posted by Owen O'Malley <ow...@hortonworks.com>.
On Tue, Aug 9, 2011 at 5:28 PM, Maheshwaran Janarthanan <
ashwinwaran@hotmail.com> wrote:

>
> Hi,
>
> I have written a Map reduce job which uses third party libraries to process
> unseen data which makes job fail because of errors in records.
>
> I realized 'Skipping Bad Records' feature in Hadoop Map/Reduce. Can Anyone
> send me the code snippet which enables this feature by setting properties on
> JobConf
>

I wouldn't recommend using the bad record skipping, since it was always
experimental and I don't think it has been well maintained.

If your 3rd party library crashes the JVM, I'd suggest using a subprocess to
call it and handle the errors yourself.

-- Owen
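
[Owen's subprocess suggestion can be sketched with plain java.lang.ProcessBuilder: run the fragile library in a child process per record or batch, so even a JVM-killing crash only costs that child. The "echo" command below is a placeholder for the real parser binary, and the sketch assumes a Unix-like environment.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SubprocessRunner {
  // Run a command; a crash in the child shows up as a nonzero exit code
  // (or an exception) here instead of taking down the parent task JVM.
  public static String runOnRecord(String... command) throws Exception {
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.redirectErrorStream(true);  // fold stderr into stdout
    Process child = pb.start();
    StringBuilder output = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(child.getInputStream()))) {
      String line;
      while ((line = reader.readLine()) != null) {
        output.append(line);
      }
    }
    if (child.waitFor() != 0) {
      return null;  // treat the record as bad and skip it
    }
    return output.toString();
  }

  public static void main(String[] args) throws Exception {
    // "echo" stands in for the real third-party parser executable.
    System.out.println(runOnRecord("echo", "parsed-ok"));  // prints parsed-ok
  }
}
```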