Posted to common-user@hadoop.apache.org by ed <ha...@gmail.com> on 2010/10/20 00:44:52 UTC

How to stop a mapper within a map-reduce job when you detect bad input

Hello,

I have a simple map-reduce job that reads in zipped files and converts them
to lzo compression.  Some of the files are not properly zipped, which results
in Hadoop throwing a "java.io.EOFException: Unexpected end of input stream"
error and causes the job to fail.  Is there a way to catch this exception
and tell Hadoop to just ignore the file and move on?  I think the exception
is being thrown by the class reading in the gzip file and not by my mapper
class.  Is this correct?  Is there a way to handle this type of error
gracefully?

Thank you!

~Ed

Re: How to stop a mapper within a map-reduce job when you detect bad input

Posted by ed <ha...@gmail.com>.
So the overridden run() method was a red herring.  The real problem appears
to be that I use MultipleOutputs (the new mapreduce API version) for my
reducer output.  I posted a different thread since it's not really related
to the original question here.  For everyone who was curious, it turns out
overriding the run() method and catching the EOFException works beautifully
for processing files that might be corrupt or have errors. Thanks!

~Ed
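
One detail worth noting in the run() override quoted below: because
cleanup(context) sits inside the try block, it is skipped whenever the
EOFException is caught, so anything opened in setup() -- a MultipleOutputs
instance, for example -- is never closed there.  Moving cleanup() into a
finally block guarantees it always runs.  A sketch of that variant, not the
exact fix from the other thread (logError and mFileName are the helper and
field from the snippet quoted below):

@Override
public void run(Context context) throws IOException, InterruptedException {
     try {
          setup(context);
          while (context.nextKeyValue()) {
                 map(context.getCurrentKey(), context.getCurrentValue(),
                     context);
          }
     } catch (EOFException e) {
          // Truncated gzip member: record it and fall through to cleanup().
          logError(context, "EOFException: Corrupt gzip file " + mFileName);
     } finally {
          // cleanup() always runs, so resources opened in setup() are
          // closed and output is flushed even when the input was corrupt.
          cleanup(context);
     }
}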

On Thu, Oct 21, 2010 at 2:07 PM, ed <ha...@gmail.com> wrote:

> I overrode the run() method in the mapper with a version (below) that
> catches the EOFException.  The mapper and reducer now complete, but the
> output lzo file from the reducer throws an "Unexpected End of File" error
> when decompressed, indicating something did not clean up properly.  I
> can't think of why this could be happening, as the map() method should only
> be called on input that was properly decompressed (anything that can't be
> decompressed will throw an exception that is being caught).  The reducer
> then should not even know that the mapper hit an EOFException in the input
> gzip file, and yet the output lzo file still has the unexpected-end-of-file
> problem (I'm using the kevinweil lzo libraries).  Is there some call that
> needs to be made that will close out the mapper and ensure that the lzo
> output from the reducer is formatted properly?  Thank you!
>
> @Override
> public void run(Context context) throws IOException, InterruptedException {
>      try {
>           setup(context);
>           while (context.nextKeyValue()) {
>                  map(context.getCurrentKey(), context.getCurrentValue(),
>                      context);
>            }
>            cleanup(context);
>       } catch (EOFException e) {
>            // Corrupt gzip input: log it and let the task end normally.
>            logError(context, "EOFException: Corrupt gzip file " + mFileName);
>       }
> }
>
>
> On Thu, Oct 21, 2010 at 1:29 PM, ed <ha...@gmail.com> wrote:
>
>> Thanks Tom! Didn't see your post before posting =)
>>
>>
>> On Thu, Oct 21, 2010 at 1:28 PM, ed <ha...@gmail.com> wrote:
>>
>>> Sorry to keep spamming this thread.  It looks like the correct way to
>>> implement MapRunnable using the new mapreduce classes (instead of the
>>> deprecated mapred) is to override the run() method of the mapper class.
>>> This is actually nice and convenient since everyone should already be using
>>> the Mapper class (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT,
>>> VALUEOUT>) for their mappers.
>>>
>>> ~Ed
>>>
>>>
>>> On Thu, Oct 21, 2010 at 12:14 PM, ed <ha...@gmail.com> wrote:
>>>
>>>> Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
>>>> before) and it doesn't look like MapRunner is deprecated so I'll try
>>>> catching the error there and will report back if it's a good solution.
>>>> Thanks!
>>>>
>>>> ~Ed
>>>>
>>>>
>>>> On Thu, Oct 21, 2010 at 11:23 AM, ed <ha...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> The MapRunner class looks promising.  I noticed it is in the
>>>>> deprecated mapred package but I didn't see an equivalent class in the
>>>>> mapreduce package.  Is this going to be ported to mapreduce or is it no
>>>>> longer being supported?  Thanks!
>>>>>
>>>>> ~Ed
>>>>>
>>>>>
>>>>> On Thu, Oct 21, 2010 at 6:36 AM, Harsh J <qw...@gmail.com> wrote:
>>>>>
>>>>>> If it occurs eventually as your record reader reads it, then you may
>>>>>> use a MapRunner class instead of a Mapper IFace/Subclass. This way,
>>>>>> you may try/catch over the record reader itself, and call your map
>>>>>> function only on valid next()s. I think this ought to work.
>>>>>>
>>>>>> You can set it via JobConf.setMapRunnerClass(...).
>>>>>>
>>>>>> Ref: MapRunner API @
>>>>>>
>>>>>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html
>>>>>>
>>>>>> On Wed, Oct 20, 2010 at 4:14 AM, ed <ha...@gmail.com> wrote:
>>>>>> > Hello,
>>>>>> >
>>>>>> > I have a simple map-reduce job that reads in zipped files and
>>>>>> > converts them to lzo compression.  Some of the files are not
>>>>>> > properly zipped, which results in Hadoop throwing a
>>>>>> > "java.io.EOFException: Unexpected end of input stream" error and
>>>>>> > causes the job to fail.  Is there a way to catch this exception
>>>>>> > and tell Hadoop to just ignore the file and move on?  I think the
>>>>>> > exception is being thrown by the class reading in the gzip file and
>>>>>> > not by my mapper class.  Is this correct?  Is there a way to handle
>>>>>> > this type of error gracefully?
>>>>>> >
>>>>>> > Thank you!
>>>>>> >
>>>>>> > ~Ed
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Harsh J
>>>>>> www.harshj.com
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to stop a mapper within a map-reduce job when you detect bad input

Posted by ed <ha...@gmail.com>.
I overrode the run() method in the mapper with a version (below) that
catches the EOFException.  The mapper and reducer now complete, but the
output lzo file from the reducer throws an "Unexpected End of File" error
when decompressed, indicating something did not clean up properly.  I
can't think of why this could be happening, as the map() method should only
be called on input that was properly decompressed (anything that can't be
decompressed will throw an exception that is being caught).  The reducer
then should not even know that the mapper hit an EOFException in the input
gzip file, and yet the output lzo file still has the unexpected-end-of-file
problem (I'm using the kevinweil lzo libraries).  Is there some call that
needs to be made that will close out the mapper and ensure that the lzo
output from the reducer is formatted properly?  Thank you!

@Override
public void run(Context context) throws IOException, InterruptedException {
     try {
          setup(context);
          while (context.nextKeyValue()) {
                 map(context.getCurrentKey(), context.getCurrentValue(),
                     context);
          }
          cleanup(context);
     } catch (EOFException e) {
          // Corrupt gzip input: log it and let the task end normally.
          logError(context, "EOFException: Corrupt gzip file " + mFileName);
     }
}


On Thu, Oct 21, 2010 at 1:29 PM, ed <ha...@gmail.com> wrote:

> Thanks Tom! Didn't see your post before posting =)
>
>
> On Thu, Oct 21, 2010 at 1:28 PM, ed <ha...@gmail.com> wrote:
>
>> Sorry to keep spamming this thread.  It looks like the correct way to
>> implement MapRunnable using the new mapreduce classes (instead of the
>> deprecated mapred) is to override the run() method of the mapper class.
>> This is actually nice and convenient since everyone should already be using
>> the Mapper class (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT,
>> VALUEOUT>) for their mappers.
>>
>> ~Ed
>>
>>
>> On Thu, Oct 21, 2010 at 12:14 PM, ed <ha...@gmail.com> wrote:
>>
>>> Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
>>> before) and it doesn't look like MapRunner is deprecated so I'll try
>>> catching the error there and will report back if it's a good solution.
>>> Thanks!
>>>
>>> ~Ed
>>>
>>>
>>> On Thu, Oct 21, 2010 at 11:23 AM, ed <ha...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> The MapRunner class looks promising.  I noticed it is in the
>>>> deprecated mapred package but I didn't see an equivalent class in the
>>>> mapreduce package.  Is this going to be ported to mapreduce or is it no
>>>> longer being supported?  Thanks!
>>>>
>>>> ~Ed
>>>>
>>>>
>>>> On Thu, Oct 21, 2010 at 6:36 AM, Harsh J <qw...@gmail.com> wrote:
>>>>
>>>>> If it occurs eventually as your record reader reads it, then you may
>>>>> use a MapRunner class instead of a Mapper IFace/Subclass. This way,
>>>>> you may try/catch over the record reader itself, and call your map
>>>>> function only on valid next()s. I think this ought to work.
>>>>>
>>>>> You can set it via JobConf.setMapRunnerClass(...).
>>>>>
>>>>> Ref: MapRunner API @
>>>>>
>>>>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html
>>>>>
>>>>> On Wed, Oct 20, 2010 at 4:14 AM, ed <ha...@gmail.com> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > I have a simple map-reduce job that reads in zipped files and
>>>>> > converts them to lzo compression.  Some of the files are not
>>>>> > properly zipped, which results in Hadoop throwing a
>>>>> > "java.io.EOFException: Unexpected end of input stream" error and
>>>>> > causes the job to fail.  Is there a way to catch this exception
>>>>> > and tell Hadoop to just ignore the file and move on?  I think the
>>>>> > exception is being thrown by the class reading in the gzip file and
>>>>> > not by my mapper class.  Is this correct?  Is there a way to handle
>>>>> > this type of error gracefully?
>>>>> >
>>>>> > Thank you!
>>>>> >
>>>>> > ~Ed
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Harsh J
>>>>> www.harshj.com
>>>>>
>>>>
>>>>
>>>
>>
>

Re: How to stop a mapper within a map-reduce job when you detect bad input

Posted by ed <ha...@gmail.com>.
Thanks Tom! Didn't see your post before posting =)

On Thu, Oct 21, 2010 at 1:28 PM, ed <ha...@gmail.com> wrote:

> Sorry to keep spamming this thread.  It looks like the correct way to
> implement MapRunnable using the new mapreduce classes (instead of the
> deprecated mapred) is to override the run() method of the mapper class.
> This is actually nice and convenient since everyone should already be using
> the Mapper class (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT,
> VALUEOUT>) for their mappers.
>
> ~Ed
>
>
> On Thu, Oct 21, 2010 at 12:14 PM, ed <ha...@gmail.com> wrote:
>
>> Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
>> before) and it doesn't look like MapRunner is deprecated so I'll try
>> catching the error there and will report back if it's a good solution.
>> Thanks!
>>
>> ~Ed
>>
>>
>> On Thu, Oct 21, 2010 at 11:23 AM, ed <ha...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> The MapRunner class looks promising.  I noticed it is in the deprecated
>>> mapred package but I didn't see an equivalent class in the mapreduce
>>> package.  Is this going to be ported to mapreduce or is it no longer being
>>> supported?  Thanks!
>>>
>>> ~Ed
>>>
>>>
>>> On Thu, Oct 21, 2010 at 6:36 AM, Harsh J <qw...@gmail.com> wrote:
>>>
>>>> If it occurs eventually as your record reader reads it, then you may
>>>> use a MapRunner class instead of a Mapper IFace/Subclass. This way,
>>>> you may try/catch over the record reader itself, and call your map
>>>> function only on valid next()s. I think this ought to work.
>>>>
>>>> You can set it via JobConf.setMapRunnerClass(...).
>>>>
>>>> Ref: MapRunner API @
>>>>
>>>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html
>>>>
>>>> On Wed, Oct 20, 2010 at 4:14 AM, ed <ha...@gmail.com> wrote:
>>>> > Hello,
>>>> >
>>>> > I have a simple map-reduce job that reads in zipped files and
>>>> > converts them to lzo compression.  Some of the files are not properly
>>>> > zipped, which results in Hadoop throwing a "java.io.EOFException:
>>>> > Unexpected end of input stream" error and causes the job to fail.  Is
>>>> > there a way to catch this exception and tell Hadoop to just ignore the
>>>> > file and move on?  I think the exception is being thrown by the class
>>>> > reading in the gzip file and not by my mapper class.  Is this correct?
>>>> > Is there a way to handle this type of error gracefully?
>>>> >
>>>> > Thank you!
>>>> >
>>>> > ~Ed
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>> www.harshj.com
>>>>
>>>
>>>
>>
>

Re: How to stop a mapper within a map-reduce job when you detect bad input

Posted by ed <ha...@gmail.com>.
Sorry to keep spamming this thread.  It looks like the correct way to
implement MapRunnable using the new mapreduce classes (instead of the
deprecated mapred) is to override the run() method of the mapper class.
This is actually nice and convenient since everyone should already be using
the Mapper class (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT,
VALUEOUT>) for their mappers.

~Ed

On Thu, Oct 21, 2010 at 12:14 PM, ed <ha...@gmail.com> wrote:

> Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
> before) and it doesn't look like MapRunner is deprecated so I'll try
> catching the error there and will report back if it's a good solution.
> Thanks!
>
> ~Ed
>
>
> On Thu, Oct 21, 2010 at 11:23 AM, ed <ha...@gmail.com> wrote:
>
>> Hello,
>>
>> The MapRunner class looks promising.  I noticed it is in the deprecated
>> mapred package but I didn't see an equivalent class in the mapreduce
>> package.  Is this going to be ported to mapreduce or is it no longer being
>> supported?  Thanks!
>>
>> ~Ed
>>
>>
>> On Thu, Oct 21, 2010 at 6:36 AM, Harsh J <qw...@gmail.com> wrote:
>>
>>> If it occurs eventually as your record reader reads it, then you may
>>> use a MapRunner class instead of a Mapper IFace/Subclass. This way,
>>> you may try/catch over the record reader itself, and call your map
>>> function only on valid next()s. I think this ought to work.
>>>
>>> You can set it via JobConf.setMapRunnerClass(...).
>>>
>>> Ref: MapRunner API @
>>>
>>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html
>>>
>>> On Wed, Oct 20, 2010 at 4:14 AM, ed <ha...@gmail.com> wrote:
>>> > Hello,
>>> >
>>> > I have a simple map-reduce job that reads in zipped files and
>>> > converts them to lzo compression.  Some of the files are not properly
>>> > zipped, which results in Hadoop throwing a "java.io.EOFException:
>>> > Unexpected end of input stream" error and causes the job to fail.  Is
>>> > there a way to catch this exception and tell Hadoop to just ignore the
>>> > file and move on?  I think the exception is being thrown by the class
>>> > reading in the gzip file and not by my mapper class.  Is this correct?
>>> > Is there a way to handle this type of error gracefully?
>>> >
>>> > Thank you!
>>> >
>>> > ~Ed
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>> www.harshj.com
>>>
>>
>>
>

Re: How to stop a mapper within a map-reduce job when you detect bad input

Posted by ed <ha...@gmail.com>.
Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
before) and it doesn't look like MapRunner is deprecated so I'll try
catching the error there and will report back if it's a good solution.
Thanks!

~Ed

On Thu, Oct 21, 2010 at 11:23 AM, ed <ha...@gmail.com> wrote:

> Hello,
>
> The MapRunner class looks promising.  I noticed it is in the deprecated
> mapred package but I didn't see an equivalent class in the mapreduce
> package.  Is this going to be ported to mapreduce or is it no longer being
> supported?  Thanks!
>
> ~Ed
>
>
> On Thu, Oct 21, 2010 at 6:36 AM, Harsh J <qw...@gmail.com> wrote:
>
>> If it occurs eventually as your record reader reads it, then you may
>> use a MapRunner class instead of a Mapper IFace/Subclass. This way,
>> you may try/catch over the record reader itself, and call your map
>> function only on valid next()s. I think this ought to work.
>>
>> You can set it via JobConf.setMapRunnerClass(...).
>>
>> Ref: MapRunner API @
>>
>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html
>>
>> On Wed, Oct 20, 2010 at 4:14 AM, ed <ha...@gmail.com> wrote:
>> > Hello,
>> >
>> > I have a simple map-reduce job that reads in zipped files and converts
>> > them to lzo compression.  Some of the files are not properly zipped,
>> > which results in Hadoop throwing a "java.io.EOFException: Unexpected end
>> > of input stream" error and causes the job to fail.  Is there a way to
>> > catch this exception and tell Hadoop to just ignore the file and move
>> > on?  I think the exception is being thrown by the class reading in the
>> > gzip file and not by my mapper class.  Is this correct?  Is there a way
>> > to handle this type of error gracefully?
>> >
>> > Thank you!
>> >
>> > ~Ed
>> >
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>
>
>

Re: How to stop a mapper within a map-reduce job when you detect bad input

Posted by Tom White <to...@cloudera.com>.
On Thu, Oct 21, 2010 at 8:23 AM, ed <ha...@gmail.com> wrote:
> Hello,
>
> The MapRunner class looks promising.  I noticed it is in the deprecated
> mapred package but I didn't see an equivalent class in the mapreduce
> package.  Is this going to be ported to mapreduce or is it no longer being
> supported?  Thanks!

The equivalent functionality is in org.apache.hadoop.mapreduce.Mapper#run.

Cheers
Tom
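
For reference, the default run() in the new API (paraphrased from the
Mapper source in 0.20/0.21) is just the plain read-loop, which is why
overriding it gives you the same hook that MapRunner provided:

public void run(Context context) throws IOException, InterruptedException {
     setup(context);
     while (context.nextKeyValue()) {
          // One map() call per successfully read record.
          map(context.getCurrentKey(), context.getCurrentValue(), context);
     }
     cleanup(context);
}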

>
> ~Ed
>
> On Thu, Oct 21, 2010 at 6:36 AM, Harsh J <qw...@gmail.com> wrote:
>
>> If it occurs eventually as your record reader reads it, then you may
>> use a MapRunner class instead of a Mapper IFace/Subclass. This way,
>> you may try/catch over the record reader itself, and call your map
>> function only on valid next()s. I think this ought to work.
>>
>> You can set it via JobConf.setMapRunnerClass(...).
>>
>> Ref: MapRunner API @
>>
>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html
>>
>> On Wed, Oct 20, 2010 at 4:14 AM, ed <ha...@gmail.com> wrote:
>> > Hello,
>> >
>> > I have a simple map-reduce job that reads in zipped files and converts
>> > them to lzo compression.  Some of the files are not properly zipped,
>> > which results in Hadoop throwing a "java.io.EOFException: Unexpected end
>> > of input stream" error and causes the job to fail.  Is there a way to
>> > catch this exception and tell Hadoop to just ignore the file and move
>> > on?  I think the exception is being thrown by the class reading in the
>> > gzip file and not by my mapper class.  Is this correct?  Is there a way
>> > to handle this type of error gracefully?
>> >
>> > Thank you!
>> >
>> > ~Ed
>> >
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>
>

Re: How to stop a mapper within a map-reduce job when you detect bad input

Posted by ed <ha...@gmail.com>.
Hello,

The MapRunner class looks promising.  I noticed it is in the deprecated
mapred package but I didn't see an equivalent class in the mapreduce
package.  Is this going to be ported to mapreduce or is it no longer being
supported?  Thanks!

~Ed

On Thu, Oct 21, 2010 at 6:36 AM, Harsh J <qw...@gmail.com> wrote:

> If it occurs eventually as your record reader reads it, then you may
> use a MapRunner class instead of a Mapper IFace/Subclass. This way,
> you may try/catch over the record reader itself, and call your map
> function only on valid next()s. I think this ought to work.
>
> You can set it via JobConf.setMapRunnerClass(...).
>
> Ref: MapRunner API @
>
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html
>
> On Wed, Oct 20, 2010 at 4:14 AM, ed <ha...@gmail.com> wrote:
> > Hello,
> >
> > I have a simple map-reduce job that reads in zipped files and converts
> > them to lzo compression.  Some of the files are not properly zipped,
> > which results in Hadoop throwing a "java.io.EOFException: Unexpected end
> > of input stream" error and causes the job to fail.  Is there a way to
> > catch this exception and tell Hadoop to just ignore the file and move
> > on?  I think the exception is being thrown by the class reading in the
> > gzip file and not by my mapper class.  Is this correct?  Is there a way
> > to handle this type of error gracefully?
> >
> > Thank you!
> >
> > ~Ed
> >
>
>
>
> --
> Harsh J
> www.harshj.com
>

Re: How to stop a mapper within a map-reduce job when you detect bad input

Posted by Harsh J <qw...@gmail.com>.
If it occurs eventually as your record reader reads it, then you may
use a MapRunner class instead of a Mapper IFace/Subclass. This way,
you may try/catch over the record reader itself, and call your map
function only on valid next()s. I think this ought to work.

You can set it via JobConf.setMapRunnerClass(...).

Ref: MapRunner API @
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html
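
A minimal sketch of that idea against the old mapred API (the class name
and status message are illustrative); it wraps the stock MapRunner loop
rather than re-implementing it, so a truncated gzip stream surfacing as an
EOFException from next() skips the rest of the split instead of failing the
task:

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.mapred.MapRunner;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class EofSkippingMapRunner<K1, V1, K2, V2>
        extends MapRunner<K1, V1, K2, V2> {
    @Override
    public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                    Reporter reporter) throws IOException {
        try {
            // Delegate to the normal next()/map() loop.
            super.run(input, output, reporter);
        } catch (EOFException e) {
            // Corrupt input: note it and return instead of failing the task.
            reporter.setStatus("Skipped corrupt input: " + e.getMessage());
        }
    }
}

// Wired up with: conf.setMapRunnerClass(EofSkippingMapRunner.class);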

On Wed, Oct 20, 2010 at 4:14 AM, ed <ha...@gmail.com> wrote:
> Hello,
>
> I have a simple map-reduce job that reads in zipped files and converts them
> to lzo compression.  Some of the files are not properly zipped, which results
> in Hadoop throwing a "java.io.EOFException: Unexpected end of input stream"
> error and causes the job to fail.  Is there a way to catch this exception
> and tell Hadoop to just ignore the file and move on?  I think the exception
> is being thrown by the class reading in the gzip file and not by my mapper
> class.  Is this correct?  Is there a way to handle this type of error
> gracefully?
>
> Thank you!
>
> ~Ed
>



-- 
Harsh J
www.harshj.com