Posted to mapreduce-user@hadoop.apache.org by bmdevelopment <bm...@gmail.com> on 2010/07/05 09:11:30 UTC

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Hello,
I have still had no luck with this over the past week,
and I even get the exact same problem on a completely different 5 node cluster.
Is it worth opening a new issue in JIRA for this?
Thanks


On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <bm...@gmail.com> wrote:
> Hello,
> Thanks so much for the reply.
> See inline.
>
> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
>> Hi,
>>
>>> I've been getting the following error when trying to run a very simple
>>> MapReduce job.
>>> Map finishes without problem, but the error occurs as soon as it enters
>>> the Reduce phase.
>>>
>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>
>>> I am running a 5 node cluster and I believe I have all my settings correct:
>>>
>>> * ulimit -n 32768
>>> * DNS/RDNS configured properly
>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>
>>> The program is very simple - just counts a unique string in a log file.
>>> See here: http://pastebin.com/5uRG3SFL
>>>
>>> When I run, the job fails and I get the following output.
>>> http://pastebin.com/AhW6StEb
>>>
>>> However, it runs fine when I do *not* use substring() on the value (see
>>> the map function in the code above).
>>>
>>> This runs fine and completes successfully:
>>>            String str = val.toString();
>>>
>>> This causes error and fails:
>>>            String str = val.toString().substring(0,10);
>>>
>>> Please let me know if you need any further information.
>>> It would be greatly appreciated if anyone could shed some light on this problem.
>>
>> It catches attention that changing the code to use a substring is
>> causing a difference. Assuming it is consistent and not a red herring,
>
> Yes, this has been consistent over the last week. I was running 0.20.1
> first and then
> upgraded to 0.20.2, but the results have been exactly the same.
>
>> can you look at the counters for the two jobs using the JobTracker web
>> UI - things like map records, bytes etc and see if there is a
>> noticeable difference ?
>
> Ok, so here is the first job using write.set(value.toString()); having
> *no* errors:
> http://pastebin.com/xvy0iGwL
>
> And here is the second job using
> write.set(value.toString().substring(0, 10)); that fails:
> http://pastebin.com/uGw6yNqv
>
> And here is yet another where I used a longer, and therefore unique, string
> via write.set(value.toString().substring(0, 20)); this makes every line
> unique, similar to the first job.
> It still fails.
> http://pastebin.com/GdQ1rp8i
>
>>Also, are the two programs being run against
>> the exact same input data ?
>
> Yes, exactly the same input: a single csv file with 23K lines.
> Using a shorter string leads to more identical keys and therefore more
> combining/reducing, but going
> by the above it seems to fail whether the substring/key is entirely
> unique (23000 combine output records) or
> mostly repeated (9 combine output records).
>
>>
>> Also, since the cluster size is small, you could also look at the
>> tasktracker logs on the machines where the maps have run to see if
>> there are any failures when the reduce attempts start failing.
>
> Here is the TT log from the last failed job. I do not see anything
> besides the shuffle failure, but there
> may be something I am overlooking or simply do not understand.
> http://pastebin.com/DKFTyGXg
>
> Thanks again!
>
>>
>> Thanks
>> Hemanth
>>
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Ted Yu <yu...@gmail.com>.
Did you check the tasktracker log and the log from your reducer to see if
anything was wrong?
Please also capture jstack output so that we can help you diagnose.

On Friday, July 9, 2010, bmdevelopment <bm...@gmail.com> wrote:
> Hi, I updated to the version here:
> http://github.com/kevinweil/hadoop-lzo
>
> However, when I use lzop for intermediate compression I
> am still having trouble - the reduce phase now freezes at 99% and
> eventually fails.
> No immediate problem, because I can use the default codec.
> But may be of concern to someone else.
>
> Thanks
>
> On Fri, Jul 9, 2010 at 1:54 PM, Ted Yu <yu...@gmail.com> wrote:
>> I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
>> mention this potential issue so that other people can avoid such problem.
>> Feel free to add more onto it.
>>
>> On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <bm...@gmail.com>
>> wrote:
>>>
>>> Thanks everyone.
>>>
>>> Yes, using the Google Code version referenced on the wiki:
>>> http://wiki.apache.org/hadoop/UsingLzoCompression
>>>
>>> I will try the latest version and see if that fixes the problem.
>>> http://github.com/kevinweil/hadoop-lzo
>>>
>>> Thanks
>>>
>>> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:
>>> >>
>>> >> Todd fixed a bug where LZO header or block header data may fall on read
>>> >> boundary:
>>> >>
>>> >>
>>> >> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>>> >>
>>> >>
>>> >> I am wondering if that is related to the issue you saw.
>>> >
>>> > I don't think this bug would show up in intermediate output compression,
>>> > but
>>> > it's certainly possible. There have been a number of bugs fixed in LZO
>>> > over
>>> > on github - are you using the github version or the one from Google Code
>>> > which is out of date? Either mine or Kevin's repo on github should be a
>>> > good
>>> > version (I think we called the newest 0.3.4)
>>> > -Todd
>>> >
>>> >>
>>> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment
>>> >> <bm...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> A little more on this.
>>> >>>
>>> >>> So, I've narrowed down the problem to using Lzop compression
>>> >>> (com.hadoop.compression.lzo.LzopCodec)
>>> >>> for mapred.map.output.compression.codec.
>>> >>>
>>> >>> <property>
>>> >>>    <name>mapred.map.output.compression.codec</name>
>>> >>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>>> >>> </property>
>>> >>>
>>> >>> If I do the above, I will get the Shuffle Error.
>>> >>> If I use DefaultCodec for mapred.map.output.compression.codec.
>>> >>> there is no problem.
>>> >>>
>>> >>> Is this a known issue? Or is this a bug?
>>> >>> Doesn't seem like it should be the expected behavior.
>>> >>>
>>> >>> I would be glad to contribute any further info on this if necessary.
>>> >>> Please let me know.
>>> >>>
>>> >>> Thanks
>>> >>>
>>> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment
>>> >>> <bm...@gmail.com>
>>> >>> wrote:
>>> >>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>>> >>> >
>>> >>> > I agree that it must be a configuration problem and so today I was
>>> >>> > able
>>> >>> > to start from scratch and did a fresh install of 0.20.2 on the 5
>>> >>> > node
>>> >>> > cluster.
>>> >>> >
>>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Hi, I updated to the version here:
http://github.com/kevinweil/hadoop-lzo

However, when I use lzop for intermediate compression I
am still having trouble: the reduce phase now hangs at 99% and
eventually fails.
It is not an immediate problem, because I can use the default codec instead,
but it may be of concern to someone else.
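
(For reference, the per-job fallback is just a codec swap. A minimal sketch using the
old mapred API on 0.20.x - the class name below is a placeholder, not from this thread:)

    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class CodecFallback {
        // Keeps intermediate (map output) compression on, but uses the default
        // codec instead of lzop for now.
        public static JobConf withDefaultMapOutputCodec(JobConf conf) {
            conf.setCompressMapOutput(true);
            conf.setMapOutputCompressorClass(DefaultCodec.class);
            return conf;
        }
    }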

Thanks

On Fri, Jul 9, 2010 at 1:54 PM, Ted Yu <yu...@gmail.com> wrote:
> I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
> mention this potential issue so that other people can avoid such problem.
> Feel free to add more onto it.
>
> On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <bm...@gmail.com>
> wrote:
>>
>> Thanks everyone.
>>
>> Yes, using the Google Code version referenced on the wiki:
>> http://wiki.apache.org/hadoop/UsingLzoCompression
>>
>> I will try the latest version and see if that fixes the problem.
>> http://github.com/kevinweil/hadoop-lzo
>>
>> Thanks
>>
>> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <to...@cloudera.com> wrote:
>> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:
>> >>
>> >> Todd fixed a bug where LZO header or block header data may fall on read
>> >> boundary:
>> >>
>> >>
>> >> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>> >>
>> >>
>> >> I am wondering if that is related to the issue you saw.
>> >
>> > I don't think this bug would show up in intermediate output compression,
>> > but
>> > it's certainly possible. There have been a number of bugs fixed in LZO
>> > over
>> > on github - are you using the github version or the one from Google Code
>> > which is out of date? Either mine or Kevin's repo on github should be a
>> > good
>> > version (I think we called the newest 0.3.4)
>> > -Todd
>> >
>> >>
>> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment
>> >> <bm...@gmail.com>
>> >> wrote:
>> >>>
>> >>> A little more on this.
>> >>>
>> >>> So, I've narrowed down the problem to using Lzop compression
>> >>> (com.hadoop.compression.lzo.LzopCodec)
>> >>> for mapred.map.output.compression.codec.
>> >>>
>> >>> <property>
>> >>>    <name>mapred.map.output.compression.codec</name>
>> >>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>> >>> </property>
>> >>>
>> >>> If I do the above, I will get the Shuffle Error.
>> >>> If I use DefaultCodec for mapred.map.output.compression.codec.
>> >>> there is no problem.
>> >>>
>> >>> Is this a known issue? Or is this a bug?
>> >>> Doesn't seem like it should be the expected behavior.
>> >>>
>> >>> I would be glad to contribute any further info on this if necessary.
>> >>> Please let me know.
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment
>> >>> <bm...@gmail.com>
>> >>> wrote:
>> >>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>> >>> >
>> >>> > I agree that it must be a configuration problem and so today I was
>> >>> > able
>> >>> > to start from scratch and did a fresh install of 0.20.2 on the 5
>> >>> > node
>> >>> > cluster.
>> >>> >
>> >>> > I've now noticed that the error occurs when compression is enabled.
>> >>> > I've run the basic wordcount example as so:
>> >>> > http://pastebin.com/wvDMZZT0
>> >>> > and get the Shuffle Error.
>> >>> >
>> >>> > TT logs show this error:
>> >>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
>> >>> > Invalid
>> >>> > header checksum: 225702cc (expected 0x2325)
>> >>> > Full logs:
>> >>> > http://pastebin.com/fVGjcGsW
>> >>> >
>> >>> > My mapred-site.xml:
>> >>> > http://pastebin.com/mQgMrKQw
>> >>> >
>> >>> > If I remove the compression config settings, the wordcount works
>> >>> > fine
>> >>> > - no more Shuffle Error.
>> >>> > So, I have something wrong with my compression settings I imagine.
>> >>> > I'll continue looking into this to see what else I can find out.
>> >>> >
>> >>> > Thanks a million.
>> >>> >
>> >>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala
>> >>> > <yh...@gmail.com>
>> >>> > wrote:
>> >>> >> Hi,
>> >>> >>
>> >>> >> Sorry, I couldn't take a close look at the logs until now.
>> >>> >> Unfortunately, I could not see any huge difference between the
>> >>> >> success
>> >>> >> and failure case. Can you please check if things like basic
>> >>> >> hostname -
>> >>> >> ip address mapping are in place (if you have static resolution of
>> >>> >> hostnames set up) ? A web search is giving this as the most likely
>> >>> >> cause users have faced regarding this problem. Also do the disks
>> >>> >> have
>> >>> >> enough size ? Also, it would be great if you can upload your hadoop
>> >>> >> configuration information.
>> >>> >>
>> >>> >> I do think it is very likely that configuration is the actual
>> >>> >> problem
>> >>> >> because it works in one case anyway.
>> >>> >>
>> >>> >> Thanks
>> >>> >> Hemanth
>> >>> >>
>> >>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
>> >>> >> <bm...@gmail.com> wrote:
>> >>> >>> Hello,
>> >>> >>> I still have had no luck with this over the past week.
>> >>> >>> And even get the same exact problem on a completely different 5
>> >>> >>> node
>> >>> >>> cluster.
>> >>> >>> Is it worth opening an new issue in jira for this?
>> >>> >>> Thanks
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
>> >>> >>> <bm...@gmail.com> wrote:
>> >>> >>>> Hello,
>> >>> >>>> Thanks so much for the reply.
>> >>> >>>> See inline.
>> >>> >>>>
>> >>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
>> >>> >>>> <yh...@gmail.com> wrote:
>> >>> >>>>> Hi,
>> >>> >>>>>
>> >>> >>>>>> I've been getting the following error when trying to run a very
>> >>> >>>>>> simple
>> >>> >>>>>> MapReduce job.
>> >>> >>>>>> Map finishes without problem, but error occurs as soon as it
>> >>> >>>>>> enters
>> >>> >>>>>> Reduce phase.
>> >>> >>>>>>
>> >>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> >>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> >>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> >>> >>>>>>
>> >>> >>>>>> I am running a 5 node cluster and I believe I have all my
>> >>> >>>>>> settings
>> >>> >>>>>> correct:
>> >>> >>>>>>
>> >>> >>>>>> * ulimit -n 32768
>> >>> >>>>>> * DNS/RDNS configured properly
>> >>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> >>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> >>> >>>>>>
>> >>> >>>>>> The program is very simple - just counts a unique string in a
>> >>> >>>>>> log
>> >>> >>>>>> file.
>> >>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>> >>> >>>>>>
>> >>> >>>>>> When I run, the job fails and I get the following output.
>> >>> >>>>>> http://pastebin.com/AhW6StEb
>> >>> >>>>>>
>> >>> >>>>>> However, runs fine when I do *not* use substring() on the value
>> >>> >>>>>> (see
>> >>> >>>>>> map function in code above).
>> >>> >>>>>>
>> >>> >>>>>> This runs fine and completes successfully:
>> >>> >>>>>>            String str = val.toString();
>> >>> >>>>>>
>> >>> >>>>>> This causes error and fails:
>> >>> >>>>>>            String str = val.toString().substring(0,10);
>> >>> >>>>>>
>> >>> >>>>>> Please let me know if you need any further information.
>> >>> >>>>>> It would be greatly appreciated if anyone could shed some light
>> >>> >>>>>> on
>> >>> >>>>>> this problem.
>> >>> >>>>>
>> >>> >>>>> It catches attention that changing the code to use a substring
>> >>> >>>>> is
>> >>> >>>>> causing a difference. Assuming it is consistent and not a red
>> >>> >>>>> herring,
>> >>> >>>>
>> >>> >>>> Yes, this has been consistent over the last week. I was running
>> >>> >>>> 0.20.1
>> >>> >>>> first and then
>> >>> >>>> upgrade to 0.20.2 but results have been exactly the same.
>> >>> >>>>
>> >>> >>>>> can you look at the counters for the two jobs using the
>> >>> >>>>> JobTracker
>> >>> >>>>> web
>> >>> >>>>> UI - things like map records, bytes etc and see if there is a
>> >>> >>>>> noticeable difference ?
>> >>> >>>>
>> >>> >>>> Ok, so here is the first job using write.set(value.toString());
>> >>> >>>> having
>> >>> >>>> *no* errors:
>> >>> >>>> http://pastebin.com/xvy0iGwL
>> >>> >>>>
>> >>> >>>> And here is the second job using
>> >>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>> >>> >>>> http://pastebin.com/uGw6yNqv
>> >>> >>>>
>> >>> >>>> And here is even another where I used a longer, and therefore
>> >>> >>>> unique
>> >>> >>>> string,
>> >>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
>> >>> >>>> line
>> >>> >>>> unique, similar to first job.
>> >>> >>>> Still fails.
>> >>> >>>> http://pastebin.com/GdQ1rp8i
>> >>> >>>>
>> >>> >>>>>Also, are the two programs being run against
>> >>> >>>>> the exact same input data ?
>> >>> >>>>
>> >>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>> >>> >>>> Using a shorter string leads to more like keys and therefore more
>> >>> >>>> combining/reducing, but going
>> >>> >>>> by the above it seems to fail whether the substring/key is
>> >>> >>>> entirely
>> >>> >>>> unique (23000 combine output records) or
>> >>> >>>> mostly the same (9 combine output records).
>> >>> >>>>
>> >>> >>>>>
>> >>> >>>>> Also, since the cluster size is small, you could also look at
>> >>> >>>>> the
>> >>> >>>>> tasktracker logs on the machines where the maps have run to see
>> >>> >>>>> if
>> >>> >>>>> there are any failures when the reduce attempts start failing.
>> >>> >>>>
>> >>> >>>> Here is the TT log from the last failed job. I do not see
>> >>> >>>> anything
>> >>> >>>> besides the shuffle failure, but there
>> >>> >>>> may be something I am overlooking or simply do not understand.
>> >>> >>>> http://pastebin.com/DKFTyGXg
>> >>> >>>>
>> >>> >>>> Thanks again!
>> >>> >>>>
>> >>> >>>>>
>> >>> >>>>> Thanks
>> >>> >>>>> Hemanth
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Ted Yu <yu...@gmail.com>.
I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
mention this potential issue so that other people can avoid the same problem.
Feel free to add more to it.

On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <bm...@gmail.com> wrote:

> Thanks everyone.
>
> Yes, using the Google Code version referenced on the wiki:
> http://wiki.apache.org/hadoop/UsingLzoCompression
>
> I will try the latest version and see if that fixes the problem.
> http://github.com/kevinweil/hadoop-lzo
>
> Thanks
>
> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <to...@cloudera.com> wrote:
> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:
> >>
> >> Todd fixed a bug where LZO header or block header data may fall on read
> >> boundary:
> >>
> >>
> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
> >>
> >>
> >> I am wondering if that is related to the issue you saw.
> >
> > I don't think this bug would show up in intermediate output compression,
> but
> > it's certainly possible. There have been a number of bugs fixed in LZO
> over
> > on github - are you using the github version or the one from Google Code
> > which is out of date? Either mine or Kevin's repo on github should be a
> good
> > version (I think we called the newest 0.3.4)
> > -Todd
> >
> >>
> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bmdevelopment@gmail.com
> >
> >> wrote:
> >>>
> >>> A little more on this.
> >>>
> >>> So, I've narrowed down the problem to using Lzop compression
> >>> (com.hadoop.compression.lzo.LzopCodec)
> >>> for mapred.map.output.compression.codec.
> >>>
> >>> <property>
> >>>    <name>mapred.map.output.compression.codec</name>
> >>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
> >>> </property>
> >>>
> >>> If I do the above, I will get the Shuffle Error.
> >>> If I use DefaultCodec for mapred.map.output.compression.codec.
> >>> there is no problem.
> >>>
> >>> Is this a known issue? Or is this a bug?
> >>> Doesn't seem like it should be the expected behavior.
> >>>
> >>> I would be glad to contribute any further info on this if necessary.
> >>> Please let me know.
> >>>
> >>> Thanks
> >>>
> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bmdevelopment@gmail.com
> >
> >>> wrote:
> >>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
> >>> >
> >>> > I agree that it must be a configuration problem and so today I was
> able
> >>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
> >>> > cluster.
> >>> >
> >>> > I've now noticed that the error occurs when compression is enabled.
> >>> > I've run the basic wordcount example as so:
> >>> > http://pastebin.com/wvDMZZT0
> >>> > and get the Shuffle Error.
> >>> >
> >>> > TT logs show this error:
> >>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
> Invalid
> >>> > header checksum: 225702cc (expected 0x2325)
> >>> > Full logs:
> >>> > http://pastebin.com/fVGjcGsW
> >>> >
> >>> > My mapred-site.xml:
> >>> > http://pastebin.com/mQgMrKQw
> >>> >
> >>> > If I remove the compression config settings, the wordcount works fine
> >>> > - no more Shuffle Error.
> >>> > So, I have something wrong with my compression settings I imagine.
> >>> > I'll continue looking into this to see what else I can find out.
> >>> >
> >>> > Thanks a million.
> >>> >
> >>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yhemanth@gmail.com
> >
> >>> > wrote:
> >>> >> Hi,
> >>> >>
> >>> >> Sorry, I couldn't take a close look at the logs until now.
> >>> >> Unfortunately, I could not see any huge difference between the
> success
> >>> >> and failure case. Can you please check if things like basic hostname
> -
> >>> >> ip address mapping are in place (if you have static resolution of
> >>> >> hostnames set up) ? A web search is giving this as the most likely
> >>> >> cause users have faced regarding this problem. Also do the disks
> have
> >>> >> enough size ? Also, it would be great if you can upload your hadoop
> >>> >> configuration information.
> >>> >>
> >>> >> I do think it is very likely that configuration is the actual
> problem
> >>> >> because it works in one case anyway.
> >>> >>
> >>> >> Thanks
> >>> >> Hemanth
> >>> >>
> >>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
> >>> >> <bm...@gmail.com> wrote:
> >>> >>> Hello,
> >>> >>> I still have had no luck with this over the past week.
> >>> >>> And even get the same exact problem on a completely different 5
> node
> >>> >>> cluster.
> >>> >>> Is it worth opening an new issue in jira for this?
> >>> >>> Thanks
> >>> >>>
> >>> >>>
> >>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
> >>> >>> <bm...@gmail.com> wrote:
> >>> >>>> Hello,
> >>> >>>> Thanks so much for the reply.
> >>> >>>> See inline.
> >>> >>>>
> >>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
> >>> >>>> <yh...@gmail.com> wrote:
> >>> >>>>> Hi,
> >>> >>>>>
> >>> >>>>>> I've been getting the following error when trying to run a very
> >>> >>>>>> simple
> >>> >>>>>> MapReduce job.
> >>> >>>>>> Map finishes without problem, but error occurs as soon as it
> >>> >>>>>> enters
> >>> >>>>>> Reduce phase.
> >>> >>>>>>
> >>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>> >>>>>>
> >>> >>>>>> I am running a 5 node cluster and I believe I have all my
> settings
> >>> >>>>>> correct:
> >>> >>>>>>
> >>> >>>>>> * ulimit -n 32768
> >>> >>>>>> * DNS/RDNS configured properly
> >>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>> >>>>>>
> >>> >>>>>> The program is very simple - just counts a unique string in a
> log
> >>> >>>>>> file.
> >>> >>>>>> See here: http://pastebin.com/5uRG3SFL
> >>> >>>>>>
> >>> >>>>>> When I run, the job fails and I get the following output.
> >>> >>>>>> http://pastebin.com/AhW6StEb
> >>> >>>>>>
> >>> >>>>>> However, runs fine when I do *not* use substring() on the value
> >>> >>>>>> (see
> >>> >>>>>> map function in code above).
> >>> >>>>>>
> >>> >>>>>> This runs fine and completes successfully:
> >>> >>>>>>            String str = val.toString();
> >>> >>>>>>
> >>> >>>>>> This causes error and fails:
> >>> >>>>>>            String str = val.toString().substring(0,10);
> >>> >>>>>>
> >>> >>>>>> Please let me know if you need any further information.
> >>> >>>>>> It would be greatly appreciated if anyone could shed some light
> on
> >>> >>>>>> this problem.
> >>> >>>>>
> >>> >>>>> It catches attention that changing the code to use a substring is
> >>> >>>>> causing a difference. Assuming it is consistent and not a red
> >>> >>>>> herring,
> >>> >>>>
> >>> >>>> Yes, this has been consistent over the last week. I was running
> >>> >>>> 0.20.1
> >>> >>>> first and then
> >>> >>>> upgrade to 0.20.2 but results have been exactly the same.
> >>> >>>>
> >>> >>>>> can you look at the counters for the two jobs using the
> JobTracker
> >>> >>>>> web
> >>> >>>>> UI - things like map records, bytes etc and see if there is a
> >>> >>>>> noticeable difference ?
> >>> >>>>
> >>> >>>> Ok, so here is the first job using write.set(value.toString());
> >>> >>>> having
> >>> >>>> *no* errors:
> >>> >>>> http://pastebin.com/xvy0iGwL
> >>> >>>>
> >>> >>>> And here is the second job using
> >>> >>>> write.set(value.toString().substring(0, 10)); that fails:
> >>> >>>> http://pastebin.com/uGw6yNqv
> >>> >>>>
> >>> >>>> And here is even another where I used a longer, and therefore
> unique
> >>> >>>> string,
> >>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
> >>> >>>> line
> >>> >>>> unique, similar to first job.
> >>> >>>> Still fails.
> >>> >>>> http://pastebin.com/GdQ1rp8i
> >>> >>>>
> >>> >>>>>Also, are the two programs being run against
> >>> >>>>> the exact same input data ?
> >>> >>>>
> >>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
> >>> >>>> Using a shorter string leads to more like keys and therefore more
> >>> >>>> combining/reducing, but going
> >>> >>>> by the above it seems to fail whether the substring/key is
> entirely
> >>> >>>> unique (23000 combine output records) or
> >>> >>>> mostly the same (9 combine output records).
> >>> >>>>
> >>> >>>>>
> >>> >>>>> Also, since the cluster size is small, you could also look at the
> >>> >>>>> tasktracker logs on the machines where the maps have run to see
> if
> >>> >>>>> there are any failures when the reduce attempts start failing.
> >>> >>>>
> >>> >>>> Here is the TT log from the last failed job. I do not see anything
> >>> >>>> besides the shuffle failure, but there
> >>> >>>> may be something I am overlooking or simply do not understand.
> >>> >>>> http://pastebin.com/DKFTyGXg
> >>> >>>>
> >>> >>>> Thanks again!
> >>> >>>>
> >>> >>>>>
> >>> >>>>> Thanks
> >>> >>>>> Hemanth
> >>> >>>>>
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Thanks everyone.

Yes, I am using the Google Code version referenced on the wiki:
http://wiki.apache.org/hadoop/UsingLzoCompression

I will try the latest version and see if that fixes the problem.
http://github.com/kevinweil/hadoop-lzo
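
(As a quick sanity check - my own suggestion, not something from this thread - it can
help to print which hadoop-lzo jar the codec class is actually loaded from, to make
sure the old Google Code build is no longer on the classpath:)

    import com.hadoop.compression.lzo.LzopCodec;

    public class WhichLzoJar {
        public static void main(String[] args) {
            // Prints the location of the jar that LzopCodec was loaded from.
            System.out.println(LzopCodec.class.getProtectionDomain()
                    .getCodeSource().getLocation());
        }
    }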

Thanks

On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <to...@cloudera.com> wrote:
> On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:
>>
>> Todd fixed a bug where LZO header or block header data may fall on read
>> boundary:
>>
>> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>>
>>
>> I am wondering if that is related to the issue you saw.
>
> I don't think this bug would show up in intermediate output compression, but
> it's certainly possible. There have been a number of bugs fixed in LZO over
> on github - are you using the github version or the one from Google Code
> which is out of date? Either mine or Kevin's repo on github should be a good
> version (I think we called the newest 0.3.4)
> -Todd
>
>>
>> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bm...@gmail.com>
>> wrote:
>>>
>>> A little more on this.
>>>
>>> So, I've narrowed down the problem to using Lzop compression
>>> (com.hadoop.compression.lzo.LzopCodec)
>>> for mapred.map.output.compression.codec.
>>>
>>> <property>
>>>    <name>mapred.map.output.compression.codec</name>
>>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>>> </property>
>>>
>>> If I do the above, I will get the Shuffle Error.
>>> If I use DefaultCodec for mapred.map.output.compression.codec.
>>> there is no problem.
>>>
>>> Is this a known issue? Or is this a bug?
>>> Doesn't seem like it should be the expected behavior.
>>>
>>> I would be glad to contribute any further info on this if necessary.
>>> Please let me know.
>>>
>>> Thanks
>>>
>>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bm...@gmail.com>
>>> wrote:
>>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>>> >
>>> > I agree that it must be a configuration problem and so today I was able
>>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
>>> > cluster.
>>> >
>>> > I've now noticed that the error occurs when compression is enabled.
>>> > I've run the basic wordcount example as so:
>>> > http://pastebin.com/wvDMZZT0
>>> > and get the Shuffle Error.
>>> >
>>> > TT logs show this error:
>>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
>>> > header checksum: 225702cc (expected 0x2325)
>>> > Full logs:
>>> > http://pastebin.com/fVGjcGsW
>>> >
>>> > My mapred-site.xml:
>>> > http://pastebin.com/mQgMrKQw
>>> >
>>> > If I remove the compression config settings, the wordcount works fine
>>> > - no more Shuffle Error.
>>> > So, I have something wrong with my compression settings I imagine.
>>> > I'll continue looking into this to see what else I can find out.
>>> >
>>> > Thanks a million.
>>> >
>>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com>
>>> > wrote:
>>> >> Hi,
>>> >>
>>> >> Sorry, I couldn't take a close look at the logs until now.
>>> >> Unfortunately, I could not see any huge difference between the success
>>> >> and failure case. Can you please check if things like basic hostname -
>>> >> ip address mapping are in place (if you have static resolution of
>>> >> hostnames set up) ? A web search is giving this as the most likely
>>> >> cause users have faced regarding this problem. Also do the disks have
>>> >> enough size ? Also, it would be great if you can upload your hadoop
>>> >> configuration information.
>>> >>
>>> >> I do think it is very likely that configuration is the actual problem
>>> >> because it works in one case anyway.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
>>> >> <bm...@gmail.com> wrote:
>>> >>> Hello,
>>> >>> I still have had no luck with this over the past week.
>>> >>> And even get the same exact problem on a completely different 5 node
>>> >>> cluster.
>>> >>> Is it worth opening an new issue in jira for this?
>>> >>> Thanks
>>> >>>
>>> >>>
>>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
>>> >>> <bm...@gmail.com> wrote:
>>> >>>> Hello,
>>> >>>> Thanks so much for the reply.
>>> >>>> See inline.
>>> >>>>
>>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
>>> >>>> <yh...@gmail.com> wrote:
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>>> I've been getting the following error when trying to run a very
>>> >>>>>> simple
>>> >>>>>> MapReduce job.
>>> >>>>>> Map finishes without problem, but error occurs as soon as it
>>> >>>>>> enters
>>> >>>>>> Reduce phase.
>>> >>>>>>
>>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> >>>>>>
>>> >>>>>> I am running a 5 node cluster and I believe I have all my settings
>>> >>>>>> correct:
>>> >>>>>>
>>> >>>>>> * ulimit -n 32768
>>> >>>>>> * DNS/RDNS configured properly
>>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>> >>>>>>
>>> >>>>>> The program is very simple - just counts a unique string in a log
>>> >>>>>> file.
>>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>>> >>>>>>
>>> >>>>>> When I run, the job fails and I get the following output.
>>> >>>>>> http://pastebin.com/AhW6StEb
>>> >>>>>>
>>> >>>>>> However, runs fine when I do *not* use substring() on the value
>>> >>>>>> (see
>>> >>>>>> map function in code above).
>>> >>>>>>
>>> >>>>>> This runs fine and completes successfully:
>>> >>>>>>            String str = val.toString();
>>> >>>>>>
>>> >>>>>> This causes error and fails:
>>> >>>>>>            String str = val.toString().substring(0,10);
>>> >>>>>>
>>> >>>>>> Please let me know if you need any further information.
>>> >>>>>> It would be greatly appreciated if anyone could shed some light on
>>> >>>>>> this problem.
>>> >>>>>
>>> >>>>> It catches attention that changing the code to use a substring is
>>> >>>>> causing a difference. Assuming it is consistent and not a red
>>> >>>>> herring,
>>> >>>>
>>> >>>> Yes, this has been consistent over the last week. I was running
>>> >>>> 0.20.1
>>> >>>> first and then
>>> >>>> upgrade to 0.20.2 but results have been exactly the same.
>>> >>>>
>>> >>>>> can you look at the counters for the two jobs using the JobTracker
>>> >>>>> web
>>> >>>>> UI - things like map records, bytes etc and see if there is a
>>> >>>>> noticeable difference ?
>>> >>>>
>>> >>>> Ok, so here is the first job using write.set(value.toString());
>>> >>>> having
>>> >>>> *no* errors:
>>> >>>> http://pastebin.com/xvy0iGwL
>>> >>>>
>>> >>>> And here is the second job using
>>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>>> >>>> http://pastebin.com/uGw6yNqv
>>> >>>>
>>> >>>> And here is even another where I used a longer, and therefore unique
>>> >>>> string,
>>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
>>> >>>> line
>>> >>>> unique, similar to first job.
>>> >>>> Still fails.
>>> >>>> http://pastebin.com/GdQ1rp8i
>>> >>>>
>>> >>>>>Also, are the two programs being run against
>>> >>>>> the exact same input data ?
>>> >>>>
>>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>>> >>>> Using a shorter string leads to more like keys and therefore more
>>> >>>> combining/reducing, but going
>>> >>>> by the above it seems to fail whether the substring/key is entirely
>>> >>>> unique (23000 combine output records) or
>>> >>>> mostly the same (9 combine output records).
>>> >>>>
>>> >>>>>
>>> >>>>> Also, since the cluster size is small, you could also look at the
>>> >>>>> tasktracker logs on the machines where the maps have run to see if
>>> >>>>> there are any failures when the reduce attempts start failing.
>>> >>>>
>>> >>>> Here is the TT log from the last failed job. I do not see anything
>>> >>>> besides the shuffle failure, but there
>>> >>>> may be something I am overlooking or simply do not understand.
>>> >>>> http://pastebin.com/DKFTyGXg
>>> >>>>
>>> >>>> Thanks again!
>>> >>>>
>>> >>>>>
>>> >>>>> Thanks
>>> >>>>> Hemanth
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:

> Todd fixed a bug where LZO header or block header data may fall on read
> boundary:
>
> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>
> I am wondering if that is related to the issue you saw.

I don't think this bug would show up in intermediate output compression, but
it's certainly possible. There have been a number of bugs fixed in LZO over
on github - are you using the github version or the one from Google Code
which is out of date? Either mine or Kevin's repo on github should be a good
version (I think we called the newest 0.3.4)

-Todd


>
> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bm...@gmail.com> wrote:
>
>> A little more on this.
>>
>> So, I've narrowed down the problem to using Lzop compression
>> (com.hadoop.compression.lzo.LzopCodec)
>> for mapred.map.output.compression.codec.
>>
>> <property>
>>    <name>mapred.map.output.compression.codec</name>
>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>> </property>
>>
>> If I do the above, I will get the Shuffle Error.
>> If I use DefaultCodec for mapred.map.output.compression.codec.
>> there is no problem.
>>
>> Is this a known issue? Or is this a bug?
>> Doesn't seem like it should be the expected behavior.
>>
>> I would be glad to contribute any further info on this if necessary.
>> Please let me know.
>>
>> Thanks
>>
>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bm...@gmail.com>
>> wrote:
>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>> >
>> > I agree that it must be a configuration problem and so today I was able
>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
>> cluster.
>> >
>> > I've now noticed that the error occurs when compression is enabled.
>> > I've run the basic wordcount example as so:
>> > http://pastebin.com/wvDMZZT0
>> > and get the Shuffle Error.
>> >
>> > TT logs show this error:
>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
>> > header checksum: 225702cc (expected 0x2325)
>> > Full logs:
>> > http://pastebin.com/fVGjcGsW
>> >
>> > My mapred-site.xml:
>> > http://pastebin.com/mQgMrKQw
>> >
>> > If I remove the compression config settings, the wordcount works fine
>> > - no more Shuffle Error.
>> > So, I have something wrong with my compression settings I imagine.
>> > I'll continue looking into this to see what else I can find out.
>> >
>> > Thanks a million.
>> >
>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com>
>> wrote:
>> >> Hi,
>> >>
>> >> Sorry, I couldn't take a close look at the logs until now.
>> >> Unfortunately, I could not see any huge difference between the success
>> >> and failure case. Can you please check if things like basic hostname -
>> >> ip address mapping are in place (if you have static resolution of
>> >> hostnames set up) ? A web search is giving this as the most likely
>> >> cause users have faced regarding this problem. Also do the disks have
>> >> enough size ? Also, it would be great if you can upload your hadoop
>> >> configuration information.
>> >>
>> >> I do think it is very likely that configuration is the actual problem
>> >> because it works in one case anyway.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <
>> bmdevelopment@gmail.com> wrote:
>> >>> Hello,
>> >>> I still have had no luck with this over the past week.
>> >>> And even get the same exact problem on a completely different 5 node
>> cluster.
>> >>> Is it worth opening an new issue in jira for this?
>> >>> Thanks
>> >>>
>> >>>
>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <
>> bmdevelopment@gmail.com> wrote:
>> >>>> Hello,
>> >>>> Thanks so much for the reply.
>> >>>> See inline.
>> >>>>
>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <
>> yhemanth@gmail.com> wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>>> I've been getting the following error when trying to run a very
>> simple
>> >>>>>> MapReduce job.
>> >>>>>> Map finishes without problem, but error occurs as soon as it enters
>> >>>>>> Reduce phase.
>> >>>>>>
>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> >>>>>>
>> >>>>>> I am running a 5 node cluster and I believe I have all my settings
>> correct:
>> >>>>>>
>> >>>>>> * ulimit -n 32768
>> >>>>>> * DNS/RDNS configured properly
>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> >>>>>>
>> >>>>>> The program is very simple - just counts a unique string in a log
>> file.
>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>> >>>>>>
>> >>>>>> When I run, the job fails and I get the following output.
>> >>>>>> http://pastebin.com/AhW6StEb
>> >>>>>>
>> >>>>>> However, runs fine when I do *not* use substring() on the value
>> (see
>> >>>>>> map function in code above).
>> >>>>>>
>> >>>>>> This runs fine and completes successfully:
>> >>>>>>            String str = val.toString();
>> >>>>>>
>> >>>>>> This causes error and fails:
>> >>>>>>            String str = val.toString().substring(0,10);
>> >>>>>>
>> >>>>>> Please let me know if you need any further information.
>> >>>>>> It would be greatly appreciated if anyone could shed some light on
>> this problem.
>> >>>>>
>> >>>>> It catches attention that changing the code to use a substring is
>> >>>>> causing a difference. Assuming it is consistent and not a red
>> herring,
>> >>>>
>> >>>> Yes, this has been consistent over the last week. I was running
>> 0.20.1
>> >>>> first and then
>> >>>> upgrade to 0.20.2 but results have been exactly the same.
>> >>>>
>> >>>>> can you look at the counters for the two jobs using the JobTracker
>> web
>> >>>>> UI - things like map records, bytes etc and see if there is a
>> >>>>> noticeable difference ?
>> >>>>
>> >>>> Ok, so here is the first job using write.set(value.toString());
>> having
>> >>>> *no* errors:
>> >>>> http://pastebin.com/xvy0iGwL
>> >>>>
>> >>>> And here is the second job using
>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>> >>>> http://pastebin.com/uGw6yNqv
>> >>>>
>> >>>> And here is even another where I used a longer, and therefore unique
>> string,
>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
>> line
>> >>>> unique, similar to first job.
>> >>>> Still fails.
>> >>>> http://pastebin.com/GdQ1rp8i
>> >>>>
>> >>>>>Also, are the two programs being run against
>> >>>>> the exact same input data ?
>> >>>>
>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>> >>>> Using a shorter string leads to more like keys and therefore more
>> >>>> combining/reducing, but going
>> >>>> by the above it seems to fail whether the substring/key is entirely
>> >>>> unique (23000 combine output records) or
>> >>>> mostly the same (9 combine output records).
>> >>>>
>> >>>>>
>> >>>>> Also, since the cluster size is small, you could also look at the
>> >>>>> tasktracker logs on the machines where the maps have run to see if
>> >>>>> there are any failures when the reduce attempts start failing.
>> >>>>
>> >>>> Here is the TT log from the last failed job. I do not see anything
>> >>>> besides the shuffle failure, but there
>> >>>> may be something I am overlooking or simply do not understand.
>> >>>> http://pastebin.com/DKFTyGXg
>> >>>>
>> >>>> Thanks again!
>> >>>>
>> >>>>>
>> >>>>> Thanks
>> >>>>> Hemanth
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Ted Yu <yu...@gmail.com>.
Todd fixed a bug where LZO header or block header data may fall on a read
boundary:
http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58

I am wondering if that is related to the issue you saw.
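
(For anyone curious about the class of bug: it is the usual pitfall of assuming that a
single read() fills a fixed-size header. A minimal illustration in plain Java - this is
not the actual hadoop-lzo code, just a sketch of the pattern:)

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    class HeaderReadExample {
        static final int HEADER_LEN = 4;  // illustrative size only

        // Broken pattern: a single read() may return fewer than HEADER_LEN bytes
        // when the header straddles a buffer or block boundary, so the tail of
        // buf is garbage and any checksum in it looks corrupt.
        static void readHeaderBroken(InputStream in, byte[] buf) throws IOException {
            in.read(buf, 0, HEADER_LEN);
        }

        // Safe pattern: readFully loops until the whole header has arrived.
        static void readHeaderSafe(InputStream in, byte[] buf) throws IOException {
            new DataInputStream(in).readFully(buf, 0, HEADER_LEN);
        }
    }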

On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bm...@gmail.com> wrote:

> A little more on this.
>
> So, I've narrowed down the problem to using Lzop compression
> (com.hadoop.compression.lzo.LzopCodec)
> for mapred.map.output.compression.codec.
>
> <property>
>    <name>mapred.map.output.compression.codec</name>
>    <value>com.hadoop.compression.lzo.LzopCodec</value>
> </property>
>
> If I do the above, I will get the Shuffle Error.
> If I use DefaultCodec for mapred.map.output.compression.codec.
> there is no problem.
>
> Is this a known issue? Or is this a bug?
> Doesn't seem like it should be the expected behavior.
>
> I would be glad to contribute any further info on this if necessary.
> Please let me know.
>
> Thanks
>
> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bm...@gmail.com>
> wrote:
> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
> >
> > I agree that it must be a configuration problem and so today I was able
> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
> cluster.
> >
> > I've now noticed that the error occurs when compression is enabled.
> > I've run the basic wordcount example as so:
> > http://pastebin.com/wvDMZZT0
> > and get the Shuffle Error.
> >
> > TT logs show this error:
> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
> > header checksum: 225702cc (expected 0x2325)
> > Full logs:
> > http://pastebin.com/fVGjcGsW
> >
> > My mapred-site.xml:
> > http://pastebin.com/mQgMrKQw
> >
> > If I remove the compression config settings, the wordcount works fine
> > - no more Shuffle Error.
> > So, I have something wrong with my compression settings I imagine.
> > I'll continue looking into this to see what else I can find out.
> >
> > Thanks a million.
> >
> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com>
> wrote:
> >> Hi,
> >>
> >> Sorry, I couldn't take a close look at the logs until now.
> >> Unfortunately, I could not see any huge difference between the success
> >> and failure case. Can you please check if things like basic hostname -
> >> ip address mapping are in place (if you have static resolution of
> >> hostnames set up) ? A web search is giving this as the most likely
> >> cause users have faced regarding this problem. Also do the disks have
> >> enough size ? Also, it would be great if you can upload your hadoop
> >> configuration information.
> >>
> >> I do think it is very likely that configuration is the actual problem
> >> because it works in one case anyway.
> >>
> >> Thanks
> >> Hemanth
> >>
> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bm...@gmail.com>
> wrote:
> >>> Hello,
> >>> I still have had no luck with this over the past week.
> >>> And even get the same exact problem on a completely different 5 node
> cluster.
> >>> Is it worth opening an new issue in jira for this?
> >>> Thanks
> >>>
> >>>
> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <
> bmdevelopment@gmail.com> wrote:
> >>>> Hello,
> >>>> Thanks so much for the reply.
> >>>> See inline.
> >>>>
> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <
> yhemanth@gmail.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>>> I've been getting the following error when trying to run a very
> simple
> >>>>>> MapReduce job.
> >>>>>> Map finishes without problem, but error occurs as soon as it enters
> >>>>>> Reduce phase.
> >>>>>>
> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>>>>>
> >>>>>> I am running a 5 node cluster and I believe I have all my settings
> correct:
> >>>>>>
> >>>>>> * ulimit -n 32768
> >>>>>> * DNS/RDNS configured properly
> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>>>>>
> >>>>>> The program is very simple - just counts a unique string in a log
> file.
> >>>>>> See here: http://pastebin.com/5uRG3SFL
> >>>>>>
> >>>>>> When I run, the job fails and I get the following output.
> >>>>>> http://pastebin.com/AhW6StEb
> >>>>>>
> >>>>>> However, runs fine when I do *not* use substring() on the value (see
> >>>>>> map function in code above).
> >>>>>>
> >>>>>> This runs fine and completes successfully:
> >>>>>>            String str = val.toString();
> >>>>>>
> >>>>>> This causes error and fails:
> >>>>>>            String str = val.toString().substring(0,10);
> >>>>>>
> >>>>>> Please let me know if you need any further information.
> >>>>>> It would be greatly appreciated if anyone could shed some light on
> this problem.
> >>>>>
> >>>>> It catches attention that changing the code to use a substring is
> >>>>> causing a difference. Assuming it is consistent and not a red
> herring,
> >>>>
> >>>> Yes, this has been consistent over the last week. I was running 0.20.1
> >>>> first and then
> >>>> upgrade to 0.20.2 but results have been exactly the same.
> >>>>
> >>>>> can you look at the counters for the two jobs using the JobTracker
> web
> >>>>> UI - things like map records, bytes etc and see if there is a
> >>>>> noticeable difference ?
> >>>>
> >>>> Ok, so here is the first job using write.set(value.toString()); having
> >>>> *no* errors:
> >>>> http://pastebin.com/xvy0iGwL
> >>>>
> >>>> And here is the second job using
> >>>> write.set(value.toString().substring(0, 10)); that fails:
> >>>> http://pastebin.com/uGw6yNqv
> >>>>
> >>>> And here is even another where I used a longer, and therefore unique
> string,
> >>>> by write.set(value.toString().substring(0, 20)); This makes every line
> >>>> unique, similar to first job.
> >>>> Still fails.
> >>>> http://pastebin.com/GdQ1rp8i
> >>>>
> >>>>>Also, are the two programs being run against
> >>>>> the exact same input data ?
> >>>>
> >>>> Yes, exactly the same input: a single csv file with 23K lines.
> >>>> Using a shorter string leads to more like keys and therefore more
> >>>> combining/reducing, but going
> >>>> by the above it seems to fail whether the substring/key is entirely
> >>>> unique (23000 combine output records) or
> >>>> mostly the same (9 combine output records).
> >>>>
> >>>>>
> >>>>> Also, since the cluster size is small, you could also look at the
> >>>>> tasktracker logs on the machines where the maps have run to see if
> >>>>> there are any failures when the reduce attempts start failing.
> >>>>
> >>>> Here is the TT log from the last failed job. I do not see anything
> >>>> besides the shuffle failure, but there
> >>>> may be something I am overlooking or simply do not understand.
> >>>> http://pastebin.com/DKFTyGXg
> >>>>
> >>>> Thanks again!
> >>>>
> >>>>>
> >>>>> Thanks
> >>>>> Hemanth
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
A little more on this.

So, I've narrowed down the problem to using Lzop compression
(com.hadoop.compression.lzo.LzopCodec)
for mapred.map.output.compression.codec.

<property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

If I do the above, I get the Shuffle Error.
If I use DefaultCodec for mapred.map.output.compression.codec,
there is no problem.

Is this a known issue, or is it a bug?
It doesn't seem like this should be the expected behavior.

I would be glad to contribute any further info on this if necessary.
Please let me know.
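
(One possibility worth checking - this is an assumption, not something confirmed in this
thread: LzopCodec writes the full lzop file format, headers and all, whereas
com.hadoop.compression.lzo.LzoCodec is the raw codec usually pointed at for intermediate
map output. A minimal sketch of that configuration with the old mapred API:)

    import com.hadoop.compression.lzo.LzoCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class LzoMapOutputConfig {
        // Configures map output compression with the raw LzoCodec (an assumed
        // alternative to LzopCodec), leaving the job output settings untouched.
        public static JobConf configure(JobConf conf) {
            conf.setCompressMapOutput(true);
            conf.setMapOutputCompressorClass(LzoCodec.class);
            return conf;
        }
    }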

Thanks

On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bm...@gmail.com> wrote:
> Hi, No problems. Thanks so much for your time. Greatly appreciated.
>
> I agree that it must be a configuration problem and so today I was able
> to start from scratch and did a fresh install of 0.20.2 on the 5 node cluster.
>
> I've now noticed that the error occurs when compression is enabled.
> I've run the basic wordcount example as so:
> http://pastebin.com/wvDMZZT0
> and get the Shuffle Error.
>
> TT logs show this error:
> WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
> header checksum: 225702cc (expected 0x2325)
> Full logs:
> http://pastebin.com/fVGjcGsW
>
> My mapred-site.xml:
> http://pastebin.com/mQgMrKQw
>
> If I remove the compression config settings, the wordcount works fine
> - no more Shuffle Error.
> So, I have something wrong with my compression settings I imagine.
> I'll continue looking into this to see what else I can find out.
>
> Thanks a million.
>
> On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com> wrote:
>> Hi,
>>
>> Sorry, I couldn't take a close look at the logs until now.
>> Unfortunately, I could not see any huge difference between the success
>> and failure case. Can you please check if things like basic hostname -
>> ip address mapping are in place (if you have static resolution of
>> hostnames set up) ? A web search is giving this as the most likely
>> cause users have faced regarding this problem. Also do the disks have
>> enough size ? Also, it would be great if you can upload your hadoop
>> configuration information.
>>
>> I do think it is very likely that configuration is the actual problem
>> because it works in one case anyway.
>>
>> Thanks
>> Hemanth
>>
>> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bm...@gmail.com> wrote:
>>> Hello,
>>> I still have had no luck with this over the past week.
>>> And even get the same exact problem on a completely different 5 node cluster.
>>> Is it worth opening an new issue in jira for this?
>>> Thanks
>>>
>>>
>>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <bm...@gmail.com> wrote:
>>>> Hello,
>>>> Thanks so much for the reply.
>>>> See inline.
>>>>
>>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>>> I've been getting the following error when trying to run a very simple
>>>>>> MapReduce job.
>>>>>> Map finishes without problem, but error occurs as soon as it enters
>>>>>> Reduce phase.
>>>>>>
>>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>>>
>>>>>> I am running a 5 node cluster and I believe I have all my settings correct:
>>>>>>
>>>>>> * ulimit -n 32768
>>>>>> * DNS/RDNS configured properly
>>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>>>
>>>>>> The program is very simple - just counts a unique string in a log file.
>>>>>> See here: http://pastebin.com/5uRG3SFL
>>>>>>
>>>>>> When I run, the job fails and I get the following output.
>>>>>> http://pastebin.com/AhW6StEb
>>>>>>
>>>>>> However, runs fine when I do *not* use substring() on the value (see
>>>>>> map function in code above).
>>>>>>
>>>>>> This runs fine and completes successfully:
>>>>>>            String str = val.toString();
>>>>>>
>>>>>> This causes error and fails:
>>>>>>            String str = val.toString().substring(0,10);
>>>>>>
>>>>>> Please let me know if you need any further information.
>>>>>> It would be greatly appreciated if anyone could shed some light on this problem.
>>>>>
>>>>> It catches attention that changing the code to use a substring is
>>>>> causing a difference. Assuming it is consistent and not a red herring,
>>>>
>>>> Yes, this has been consistent over the last week. I was running 0.20.1
>>>> first and then
>>>> upgrade to 0.20.2 but results have been exactly the same.
>>>>
>>>>> can you look at the counters for the two jobs using the JobTracker web
>>>>> UI - things like map records, bytes etc and see if there is a
>>>>> noticeable difference ?
>>>>
>>>> Ok, so here is the first job using write.set(value.toString()); having
>>>> *no* errors:
>>>> http://pastebin.com/xvy0iGwL
>>>>
>>>> And here is the second job using
>>>> write.set(value.toString().substring(0, 10)); that fails:
>>>> http://pastebin.com/uGw6yNqv
>>>>
>>>> And here is even another where I used a longer, and therefore unique string,
>>>> by write.set(value.toString().substring(0, 20)); This makes every line
>>>> unique, similar to first job.
>>>> Still fails.
>>>> http://pastebin.com/GdQ1rp8i
>>>>
>>>>>Also, are the two programs being run against
>>>>> the exact same input data ?
>>>>
>>>> Yes, exactly the same input: a single csv file with 23K lines.
>>>> Using a shorter string leads to more like keys and therefore more
>>>> combining/reducing, but going
>>>> by the above it seems to fail whether the substring/key is entirely
>>>> unique (23000 combine output records) or
>>>> mostly the same (9 combine output records).
>>>>
>>>>>
>>>>> Also, since the cluster size is small, you could also look at the
>>>>> tasktracker logs on the machines where the maps have run to see if
>>>>> there are any failures when the reduce attempts start failing.
>>>>
>>>> Here is the TT log from the last failed job. I do not see anything
>>>> besides the shuffle failure, but there
>>>> may be something I am overlooking or simply do not understand.
>>>> http://pastebin.com/DKFTyGXg
>>>>
>>>> Thanks again!
>>>>
>>>>>
>>>>> Thanks
>>>>> Hemanth
>>>>>
>>>>
>>>
>>
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Hi, No problems. Thanks so much for your time. Greatly appreciated.

I agree that it must be a configuration problem, so today I was able
to start from scratch and do a fresh install of 0.20.2 on the 5 node cluster.

I've now noticed that the error occurs when compression is enabled.
I've run the basic wordcount example like so:
http://pastebin.com/wvDMZZT0
and I get the Shuffle Error.

TT logs show this error:
WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
header checksum: 225702cc (expected 0x2325)
Full logs:
http://pastebin.com/fVGjcGsW

My mapred-site.xml:
http://pastebin.com/mQgMrKQw

If I remove the compression config settings, the wordcount works fine
- no more Shuffle Error.
So I imagine I have something wrong with my compression settings.
I'll continue looking into this to see what else I can find out.
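
As a next step I may try toggling just the map-output (shuffle)
compression per job instead of cluster-wide. Something like the sketch
below is what I have in mind (only a rough sketch using the 0.20-style
property names; the class names are my own placeholders, and the
mapper/reducer are just the usual wordcount logic):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedWordCount {

  public static class TokenMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress only the map output (the shuffle path).
    // Flip this to false for a known-good baseline run.
    conf.setBoolean("mapred.compress.map.output", true);
    // Use the built-in DefaultCodec (zlib) instead of the codec from
    // mapred-site.xml, to see whether the codec choice is what matters.
    conf.setClass("mapred.map.output.compression.codec",
                  DefaultCodec.class, CompressionCodec.class);

    Job job = new Job(conf, "compressed wordcount");
    job.setJarByClass(CompressedWordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If the job fails with the codec from mapred-site.xml but passes with the
built-in DefaultCodec, that would point at the codec or its native
libraries rather than the shuffle itself.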

Thanks a million.

On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com> wrote:
> Hi,
>
> Sorry, I couldn't take a close look at the logs until now.
> Unfortunately, I could not see any significant difference between the
> success and failure cases. Can you please check that basic hostname-to-IP
> mappings are in place (if you have static resolution of hostnames set
> up)? A web search suggests this is the most common cause of this
> problem. Also, do the disks have enough free space? And it would be
> great if you could upload your Hadoop configuration.
>
> I do think it is very likely that configuration is the actual problem
> because it works in one case anyway.
>
> Thanks
> Hemanth
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

Sorry, I couldn't take a close look at the logs until now.
Unfortunately, I could not see any significant difference between the
success and failure cases. Can you please check that basic hostname-to-IP
mappings are in place (if you have static resolution of hostnames set
up)? A web search suggests this is the most common cause of this
problem. Also, do the disks have enough free space? And it would be
great if you could upload your Hadoop configuration.
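
If it helps, a tiny standalone check along these lines (just a sketch;
the hostnames passed on the command line are whatever is in your
conf/slaves file) would show whether forward and reverse lookups agree
on each node:

import java.net.InetAddress;

public class DnsCheck {
  public static void main(String[] args) throws Exception {
    // Pass the slave hostnames as command-line arguments.
    for (String host : args) {
      InetAddress addr = InetAddress.getByName(host);   // forward lookup
      String reverse = addr.getCanonicalHostName();     // reverse lookup
      System.out.println(host + " -> " + addr.getHostAddress()
          + " -> " + reverse);
    }
  }
}

If a hostname does not resolve back to itself on every node, that would
be consistent with the shuffle fetch failures you are seeing.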

I do think it is very likely that configuration is the actual problem
because it works in one case anyway.

Thanks
Hemanth

On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bm...@gmail.com> wrote:
> Hello,
> I still have had no luck with this over the past week.
> And I even get the exact same problem on a completely different 5-node cluster.
> Is it worth opening a new issue in jira for this?
> Thanks