Posted to mapreduce-user@hadoop.apache.org by bmdevelopment <bm...@gmail.com> on 2010/06/24 21:29:40 UTC

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Hello,

I've been getting the following error when trying to run a very simple
MapReduce job.
Map finishes without problem, but the error occurs as soon as the job
enters the Reduce phase.

10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
attempt_201006241812_0001_r_000000_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

I am running a 5 node cluster and I believe I have all my settings correct:

* ulimit -n 32768
* DNS/RDNS configured properly
* hdfs-site.xml : http://pastebin.com/xuZ17bPM
* mapred-site.xml : http://pastebin.com/JraVQZcW

The program is very simple - just counts a unique string in a log file.
See here: http://pastebin.com/5uRG3SFL

When I run it, the job fails and I get the following output.
http://pastebin.com/AhW6StEb

However, it runs fine when I do *not* use substring() on the value (see
the map function in the code above).

This runs fine and completes successfully:
            String str = val.toString();

This causes error and fails:
            String str = val.toString().substring(0,10);
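One thing worth ruling out on the substring line itself: substring(0, 10) throws StringIndexOutOfBoundsException on any line shorter than 10 characters. That may not be the cause here, since the maps complete, but a guarded version is cheap. This is a sketch only; the helper name is made up:

```java
// Sketch: substring(0, 10) throws StringIndexOutOfBoundsException
// on lines shorter than 10 characters, so bound the length first.
public class SafeSubstring {
    // Return at most the first n characters of s.
    static String firstN(String s, int n) {
        return s.length() < n ? s : s.substring(0, n);
    }

    public static void main(String[] args) {
        System.out.println(firstN("short", 10));            // "short"
        System.out.println(firstN("abcdefghijklmnop", 10)); // "abcdefghij"
    }
}
```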

Please let me know if you need any further information.
It would be greatly appreciated if anyone could shed some light on this problem.
Thanks

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Deepak Diwakar <dd...@gmail.com>.
Thanks Krishna and Chen.

Yes, the problem was in /etc/hosts. Each node had a unique
identifier (necromancer, rocker, etc.), which was the only difference in
/etc/hosts among the nodes. Once I put the same identifier on all of them, it worked.
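For anyone hitting the same thing, the fix amounts to keeping /etc/hosts identical on every node, along these lines (the addresses and hostnames below are made-up examples):

```
# identical on every node in the cluster
127.0.0.1    localhost
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
```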


Thanks & regards
- Deepak Diwakar,




On 28 July 2010 03:09, C.V.Krishnakumar <cv...@me.com> wrote:

> Hi Deepak,
>
> Maybe I did not make my mail clear. I had tried the instructions in the
> blog you mentioned. They are  working for me.
> Did you change the /etc/hosts file at any point of time?
>
> Regards,
> Krishna
>
> On Jul 27, 2010, at 2:30 PM, C.V.Krishnakumar wrote:
>
> > Hi Deepak,
> >
> > YOu could refer this too :
> http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results
> > I tried those instructions and it is working for me.
> > Regards,
> > Krishna
> > On Jul 27, 2010, at 12:31 PM, Deepak Diwakar wrote:
> >
> >> Hey friends,
> >>
> >> I got stuck on setting up hdfs cluster and getting this error while
> running
> >> simple wordcount example(I did that 2 yrs back not had any problem).
> >>
> >> Currently testing over hadoop-0.20.1 with 2 nodes. instruction followed
> from
> >> (
> >>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> >> ).
> >>
> >> I checked the firewall settings and /etc/hosts there is no issue there.
> >> Also master and slave are accessible both ways.
> >>
> >> Also the input size very low ~ 3 MB  and hence there shouldn't be no
> issue
> >> because ulimit(its btw of 4096).
> >>
> >> Would be really thankful  if  anyone can guide me to resolve this.
> >>
> >> Thanks & regards,
> >> - Deepak Diwakar,
> >>
> >>
> >>
> >>
> >> On 28 June 2010 18:39, bmdevelopment <bm...@gmail.com> wrote:
> >>
> >>> Hi, Sorry for the cross-post. But just trying to see if anyone else
> >>> has had this issue before.
> >>> Thanks
> >>>
> >>>
> >>> ---------- Forwarded message ----------
> >>> From: bmdevelopment <bm...@gmail.com>
> >>> Date: Fri, Jun 25, 2010 at 10:56 AM
> >>> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
> >>> bailing-out.
> >>> To: mapreduce-user@hadoop.apache.org
> >>>
> >>>
> >>> Hello,
> >>> Thanks so much for the reply.
> >>> See inline.
> >>>
> >>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yhemanth@gmail.com
> >
> >>> wrote:
> >>>> Hi,
> >>>>
> >>>>> I've been getting the following error when trying to run a very
> simple
> >>>>> MapReduce job.
> >>>>> Map finishes without problem, but error occurs as soon as it enters
> >>>>> Reduce phase.
> >>>>>
> >>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>>>>
> >>>>> I am running a 5 node cluster and I believe I have all my settings
> >>> correct:
> >>>>>
> >>>>> * ulimit -n 32768
> >>>>> * DNS/RDNS configured properly
> >>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>>>>
> >>>>> The program is very simple - just counts a unique string in a log
> file.
> >>>>> See here: http://pastebin.com/5uRG3SFL
> >>>>>
> >>>>> When I run, the job fails and I get the following output.
> >>>>> http://pastebin.com/AhW6StEb
> >>>>>
> >>>>> However, runs fine when I do *not* use substring() on the value (see
> >>>>> map function in code above).
> >>>>>
> >>>>> This runs fine and completes successfully:
> >>>>>          String str = val.toString();
> >>>>>
> >>>>> This causes error and fails:
> >>>>>          String str = val.toString().substring(0,10);
> >>>>>
> >>>>> Please let me know if you need any further information.
> >>>>> It would be greatly appreciated if anyone could shed some light on
> this
> >>> problem.
> >>>>
> >>>> It catches attention that changing the code to use a substring is
> >>>> causing a difference. Assuming it is consistent and not a red herring,
> >>>
> >>> Yes, this has been consistent over the last week. I was running 0.20.1
> >>> first and then
> >>> upgrade to 0.20.2 but results have been exactly the same.
> >>>
> >>>> can you look at the counters for the two jobs using the JobTracker web
> >>>> UI - things like map records, bytes etc and see if there is a
> >>>> noticeable difference ?
> >>>
> >>> Ok, so here is the first job using write.set(value.toString()); having
> >>> *no* errors:
> >>> http://pastebin.com/xvy0iGwL
> >>>
> >>> And here is the second job using
> >>> write.set(value.toString().substring(0, 10)); that fails:
> >>> http://pastebin.com/uGw6yNqv
> >>>
> >>> And here is even another where I used a longer, and therefore unique
> >>> string,
> >>> by write.set(value.toString().substring(0, 20)); This makes every line
> >>> unique, similar to first job.
> >>> Still fails.
> >>> http://pastebin.com/GdQ1rp8i
> >>>
> >>>> Also, are the two programs being run against
> >>>> the exact same input data ?
> >>>
> >>> Yes, exactly the same input: a single csv file with 23K lines.
> >>> Using a shorter string leads to more like keys and therefore more
> >>> combining/reducing, but going
> >>> by the above it seems to fail whether the substring/key is entirely
> >>> unique (23000 combine output records) or
> >>> mostly the same (9 combine output records).
> >>>
> >>>>
> >>>> Also, since the cluster size is small, you could also look at the
> >>>> tasktracker logs on the machines where the maps have run to see if
> >>>> there are any failures when the reduce attempts start failing.
> >>>
> >>> Here is the TT log from the last failed job. I do not see anything
> >>> besides the shuffle failure, but there
> >>> may be something I am overlooking or simply do not understand.
> >>> http://pastebin.com/DKFTyGXg
> >>>
> >>> Thanks again!
> >>>
> >>>>
> >>>> Thanks
> >>>> Hemanth
> >>>>
> >>>
> >
>
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by "C.V.Krishnakumar" <cv...@me.com>.
Hi Deepak,

Maybe I did not make my mail clear. I had tried the instructions in the
blog you mentioned. They are working for me.
Did you change the /etc/hosts file at any point?

Regards,
Krishna

On Jul 27, 2010, at 2:30 PM, C.V.Krishnakumar wrote:

> Hi Deepak,
> 
> YOu could refer this too : http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results 
> I tried those instructions and it is working for me. 
> Regards,
> Krishna
> On Jul 27, 2010, at 12:31 PM, Deepak Diwakar wrote:
> 
>> Hey friends,
>> 
>> I got stuck on setting up hdfs cluster and getting this error while running
>> simple wordcount example(I did that 2 yrs back not had any problem).
>> 
>> Currently testing over hadoop-0.20.1 with 2 nodes. instruction followed from
>> (
>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
>> ).
>> 
>> I checked the firewall settings and /etc/hosts there is no issue there.
>> Also master and slave are accessible both ways.
>> 
>> Also the input size very low ~ 3 MB  and hence there shouldn't be no issue
>> because ulimit(its btw of 4096).
>> 
>> Would be really thankful  if  anyone can guide me to resolve this.
>> 
>> Thanks & regards,
>> - Deepak Diwakar,
>> 
>> 
>> 
>> 
>> On 28 June 2010 18:39, bmdevelopment <bm...@gmail.com> wrote:
>> 
>>> Hi, Sorry for the cross-post. But just trying to see if anyone else
>>> has had this issue before.
>>> Thanks
>>> 
>>> 
>>> ---------- Forwarded message ----------
>>> From: bmdevelopment <bm...@gmail.com>
>>> Date: Fri, Jun 25, 2010 at 10:56 AM
>>> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
>>> bailing-out.
>>> To: mapreduce-user@hadoop.apache.org
>>> 
>>> 
>>> Hello,
>>> Thanks so much for the reply.
>>> See inline.
>>> 
>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com>
>>> wrote:
>>>> Hi,
>>>> 
>>>>> I've been getting the following error when trying to run a very simple
>>>>> MapReduce job.
>>>>> Map finishes without problem, but error occurs as soon as it enters
>>>>> Reduce phase.
>>>>> 
>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>> 
>>>>> I am running a 5 node cluster and I believe I have all my settings
>>> correct:
>>>>> 
>>>>> * ulimit -n 32768
>>>>> * DNS/RDNS configured properly
>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>> 
>>>>> The program is very simple - just counts a unique string in a log file.
>>>>> See here: http://pastebin.com/5uRG3SFL
>>>>> 
>>>>> When I run, the job fails and I get the following output.
>>>>> http://pastebin.com/AhW6StEb
>>>>> 
>>>>> However, runs fine when I do *not* use substring() on the value (see
>>>>> map function in code above).
>>>>> 
>>>>> This runs fine and completes successfully:
>>>>>          String str = val.toString();
>>>>> 
>>>>> This causes error and fails:
>>>>>          String str = val.toString().substring(0,10);
>>>>> 
>>>>> Please let me know if you need any further information.
>>>>> It would be greatly appreciated if anyone could shed some light on this
>>> problem.
>>>> 
>>>> It catches attention that changing the code to use a substring is
>>>> causing a difference. Assuming it is consistent and not a red herring,
>>> 
>>> Yes, this has been consistent over the last week. I was running 0.20.1
>>> first and then
>>> upgrade to 0.20.2 but results have been exactly the same.
>>> 
>>>> can you look at the counters for the two jobs using the JobTracker web
>>>> UI - things like map records, bytes etc and see if there is a
>>>> noticeable difference ?
>>> 
>>> Ok, so here is the first job using write.set(value.toString()); having
>>> *no* errors:
>>> http://pastebin.com/xvy0iGwL
>>> 
>>> And here is the second job using
>>> write.set(value.toString().substring(0, 10)); that fails:
>>> http://pastebin.com/uGw6yNqv
>>> 
>>> And here is even another where I used a longer, and therefore unique
>>> string,
>>> by write.set(value.toString().substring(0, 20)); This makes every line
>>> unique, similar to first job.
>>> Still fails.
>>> http://pastebin.com/GdQ1rp8i
>>> 
>>>> Also, are the two programs being run against
>>>> the exact same input data ?
>>> 
>>> Yes, exactly the same input: a single csv file with 23K lines.
>>> Using a shorter string leads to more like keys and therefore more
>>> combining/reducing, but going
>>> by the above it seems to fail whether the substring/key is entirely
>>> unique (23000 combine output records) or
>>> mostly the same (9 combine output records).
>>> 
>>>> 
>>>> Also, since the cluster size is small, you could also look at the
>>>> tasktracker logs on the machines where the maps have run to see if
>>>> there are any failures when the reduce attempts start failing.
>>> 
>>> Here is the TT log from the last failed job. I do not see anything
>>> besides the shuffle failure, but there
>>> may be something I am overlooking or simply do not understand.
>>> http://pastebin.com/DKFTyGXg
>>> 
>>> Thanks again!
>>> 
>>>> 
>>>> Thanks
>>>> Hemanth
>>>> 
>>> 
> 


Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by "C.V.Krishnakumar" <cv...@me.com>.
Hi Deepak,

You could refer to this too: http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results
I tried those instructions and they are working for me.
Regards,
Krishna
On Jul 27, 2010, at 12:31 PM, Deepak Diwakar wrote:

> Hey friends,
> 
> I got stuck on setting up hdfs cluster and getting this error while running
> simple wordcount example(I did that 2 yrs back not had any problem).
> 
> Currently testing over hadoop-0.20.1 with 2 nodes. instruction followed from
> (
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> ).
> 
> I checked the firewall settings and /etc/hosts there is no issue there.
> Also master and slave are accessible both ways.
> 
> Also the input size very low ~ 3 MB  and hence there shouldn't be no issue
> because ulimit(its btw of 4096).
> 
> Would be really thankful  if  anyone can guide me to resolve this.
> 
> Thanks & regards,
> - Deepak Diwakar,
> 
> 
> 
> 
> On 28 June 2010 18:39, bmdevelopment <bm...@gmail.com> wrote:
> 
>> Hi, Sorry for the cross-post. But just trying to see if anyone else
>> has had this issue before.
>> Thanks
>> 
>> 
>> ---------- Forwarded message ----------
>> From: bmdevelopment <bm...@gmail.com>
>> Date: Fri, Jun 25, 2010 at 10:56 AM
>> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
>> bailing-out.
>> To: mapreduce-user@hadoop.apache.org
>> 
>> 
>> Hello,
>> Thanks so much for the reply.
>> See inline.
>> 
>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com>
>> wrote:
>>> Hi,
>>> 
>>>> I've been getting the following error when trying to run a very simple
>>>> MapReduce job.
>>>> Map finishes without problem, but error occurs as soon as it enters
>>>> Reduce phase.
>>>> 
>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>> 
>>>> I am running a 5 node cluster and I believe I have all my settings
>> correct:
>>>> 
>>>> * ulimit -n 32768
>>>> * DNS/RDNS configured properly
>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>> 
>>>> The program is very simple - just counts a unique string in a log file.
>>>> See here: http://pastebin.com/5uRG3SFL
>>>> 
>>>> When I run, the job fails and I get the following output.
>>>> http://pastebin.com/AhW6StEb
>>>> 
>>>> However, runs fine when I do *not* use substring() on the value (see
>>>> map function in code above).
>>>> 
>>>> This runs fine and completes successfully:
>>>>           String str = val.toString();
>>>> 
>>>> This causes error and fails:
>>>>           String str = val.toString().substring(0,10);
>>>> 
>>>> Please let me know if you need any further information.
>>>> It would be greatly appreciated if anyone could shed some light on this
>> problem.
>>> 
>>> It catches attention that changing the code to use a substring is
>>> causing a difference. Assuming it is consistent and not a red herring,
>> 
>> Yes, this has been consistent over the last week. I was running 0.20.1
>> first and then
>> upgrade to 0.20.2 but results have been exactly the same.
>> 
>>> can you look at the counters for the two jobs using the JobTracker web
>>> UI - things like map records, bytes etc and see if there is a
>>> noticeable difference ?
>> 
>> Ok, so here is the first job using write.set(value.toString()); having
>> *no* errors:
>> http://pastebin.com/xvy0iGwL
>> 
>> And here is the second job using
>> write.set(value.toString().substring(0, 10)); that fails:
>> http://pastebin.com/uGw6yNqv
>> 
>> And here is even another where I used a longer, and therefore unique
>> string,
>> by write.set(value.toString().substring(0, 20)); This makes every line
>> unique, similar to first job.
>> Still fails.
>> http://pastebin.com/GdQ1rp8i
>> 
>>> Also, are the two programs being run against
>>> the exact same input data ?
>> 
>> Yes, exactly the same input: a single csv file with 23K lines.
>> Using a shorter string leads to more like keys and therefore more
>> combining/reducing, but going
>> by the above it seems to fail whether the substring/key is entirely
>> unique (23000 combine output records) or
>> mostly the same (9 combine output records).
>> 
>>> 
>>> Also, since the cluster size is small, you could also look at the
>>> tasktracker logs on the machines where the maps have run to see if
>>> there are any failures when the reduce attempts start failing.
>> 
>> Here is the TT log from the last failed job. I do not see anything
>> besides the shuffle failure, but there
>> may be something I am overlooking or simply do not understand.
>> http://pastebin.com/DKFTyGXg
>> 
>> Thanks again!
>> 
>>> 
>>> Thanks
>>> Hemanth
>>> 
>> 


Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by He Chen <ai...@gmail.com>.
Hey Deepak Diwakar

Try keeping the /etc/hosts file the same across all of your cluster nodes,
and see whether the problem disappears.

On Tue, Jul 27, 2010 at 2:31 PM, Deepak Diwakar <dd...@gmail.com> wrote:

> Hey friends,
>
> I got stuck on setting up hdfs cluster and getting this error while running
> simple wordcount example(I did that 2 yrs back not had any problem).
>
> Currently testing over hadoop-0.20.1 with 2 nodes. instruction followed
> from
> (
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> ).
>
>  I checked the firewall settings and /etc/hosts there is no issue there.
> Also master and slave are accessible both ways.
>
> Also the input size very low ~ 3 MB  and hence there shouldn't be no issue
> because ulimit(its btw of 4096).
>
> Would be really thankful  if  anyone can guide me to resolve this.
>
> Thanks & regards,
> - Deepak Diwakar,
>
>
>
>
> On 28 June 2010 18:39, bmdevelopment <bm...@gmail.com> wrote:
>
> > Hi, Sorry for the cross-post. But just trying to see if anyone else
> > has had this issue before.
> > Thanks
> >
> >
> > ---------- Forwarded message ----------
> > From: bmdevelopment <bm...@gmail.com>
> > Date: Fri, Jun 25, 2010 at 10:56 AM
> > Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
> > bailing-out.
> > To: mapreduce-user@hadoop.apache.org
> >
> >
> > Hello,
> > Thanks so much for the reply.
> > See inline.
> >
> > On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com>
> > wrote:
> > > Hi,
> > >
> > >> I've been getting the following error when trying to run a very simple
> > >> MapReduce job.
> > >> Map finishes without problem, but error occurs as soon as it enters
> > >> Reduce phase.
> > >>
> > >> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> > >> attempt_201006241812_0001_r_000000_0, Status : FAILED
> > >> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> > >>
> > >> I am running a 5 node cluster and I believe I have all my settings
> > correct:
> > >>
> > >> * ulimit -n 32768
> > >> * DNS/RDNS configured properly
> > >> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> > >> * mapred-site.xml : http://pastebin.com/JraVQZcW
> > >>
> > >> The program is very simple - just counts a unique string in a log
> file.
> > >> See here: http://pastebin.com/5uRG3SFL
> > >>
> > >> When I run, the job fails and I get the following output.
> > >> http://pastebin.com/AhW6StEb
> > >>
> > >> However, runs fine when I do *not* use substring() on the value (see
> > >> map function in code above).
> > >>
> > >> This runs fine and completes successfully:
> > >>            String str = val.toString();
> > >>
> > >> This causes error and fails:
> > >>            String str = val.toString().substring(0,10);
> > >>
> > >> Please let me know if you need any further information.
> > >> It would be greatly appreciated if anyone could shed some light on
> this
> > problem.
> > >
> > > It catches attention that changing the code to use a substring is
> > > causing a difference. Assuming it is consistent and not a red herring,
> >
> > Yes, this has been consistent over the last week. I was running 0.20.1
> > first and then
> > upgrade to 0.20.2 but results have been exactly the same.
> >
> > > can you look at the counters for the two jobs using the JobTracker web
> > > UI - things like map records, bytes etc and see if there is a
> > > noticeable difference ?
> >
> > Ok, so here is the first job using write.set(value.toString()); having
> > *no* errors:
> > http://pastebin.com/xvy0iGwL
> >
> > And here is the second job using
> > write.set(value.toString().substring(0, 10)); that fails:
> > http://pastebin.com/uGw6yNqv
> >
> > And here is even another where I used a longer, and therefore unique
> > string,
> > by write.set(value.toString().substring(0, 20)); This makes every line
> > unique, similar to first job.
> > Still fails.
> > http://pastebin.com/GdQ1rp8i
> >
> > >Also, are the two programs being run against
> > > the exact same input data ?
> >
> > Yes, exactly the same input: a single csv file with 23K lines.
> > Using a shorter string leads to more like keys and therefore more
> > combining/reducing, but going
> > by the above it seems to fail whether the substring/key is entirely
> > unique (23000 combine output records) or
> > mostly the same (9 combine output records).
> >
> > >
> > > Also, since the cluster size is small, you could also look at the
> > > tasktracker logs on the machines where the maps have run to see if
> > > there are any failures when the reduce attempts start failing.
> >
> > Here is the TT log from the last failed job. I do not see anything
> > besides the shuffle failure, but there
> > may be something I am overlooking or simply do not understand.
> > http://pastebin.com/DKFTyGXg
> >
> > Thanks again!
> >
> > >
> > > Thanks
> > > Hemanth
> > >
> >
>



-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Research Assistant of Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Deepak Diwakar <dd...@gmail.com>.
Hey friends,

I got stuck setting up an HDFS cluster and am getting this error while running
the simple wordcount example (I did this 2 yrs back and had no problem).

Currently testing on hadoop-0.20.1 with 2 nodes, following the instructions from
(
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
).

I checked the firewall settings and /etc/hosts; there is no issue there.
Also, master and slave are accessible both ways.

Also, the input size is very low (~3 MB), so there shouldn't be any issue
with ulimit (which is, by the way, 4096).

Would be really thankful  if  anyone can guide me to resolve this.

Thanks & regards,
- Deepak Diwakar,




On 28 June 2010 18:39, bmdevelopment <bm...@gmail.com> wrote:

> Hi, Sorry for the cross-post. But just trying to see if anyone else
> has had this issue before.
> Thanks
>
>
> ---------- Forwarded message ----------
> From: bmdevelopment <bm...@gmail.com>
> Date: Fri, Jun 25, 2010 at 10:56 AM
> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
> bailing-out.
> To: mapreduce-user@hadoop.apache.org
>
>
> Hello,
> Thanks so much for the reply.
> See inline.
>
> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com>
> wrote:
> > Hi,
> >
> >> I've been getting the following error when trying to run a very simple
> >> MapReduce job.
> >> Map finishes without problem, but error occurs as soon as it enters
> >> Reduce phase.
> >>
> >> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>
> >> I am running a 5 node cluster and I believe I have all my settings
> correct:
> >>
> >> * ulimit -n 32768
> >> * DNS/RDNS configured properly
> >> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>
> >> The program is very simple - just counts a unique string in a log file.
> >> See here: http://pastebin.com/5uRG3SFL
> >>
> >> When I run, the job fails and I get the following output.
> >> http://pastebin.com/AhW6StEb
> >>
> >> However, runs fine when I do *not* use substring() on the value (see
> >> map function in code above).
> >>
> >> This runs fine and completes successfully:
> >>            String str = val.toString();
> >>
> >> This causes error and fails:
> >>            String str = val.toString().substring(0,10);
> >>
> >> Please let me know if you need any further information.
> >> It would be greatly appreciated if anyone could shed some light on this
> problem.
> >
> > It catches attention that changing the code to use a substring is
> > causing a difference. Assuming it is consistent and not a red herring,
>
> Yes, this has been consistent over the last week. I was running 0.20.1
> first and then
> upgrade to 0.20.2 but results have been exactly the same.
>
> > can you look at the counters for the two jobs using the JobTracker web
> > UI - things like map records, bytes etc and see if there is a
> > noticeable difference ?
>
> Ok, so here is the first job using write.set(value.toString()); having
> *no* errors:
> http://pastebin.com/xvy0iGwL
>
> And here is the second job using
> write.set(value.toString().substring(0, 10)); that fails:
> http://pastebin.com/uGw6yNqv
>
> And here is even another where I used a longer, and therefore unique
> string,
> by write.set(value.toString().substring(0, 20)); This makes every line
> unique, similar to first job.
> Still fails.
> http://pastebin.com/GdQ1rp8i
>
> >Also, are the two programs being run against
> > the exact same input data ?
>
> Yes, exactly the same input: a single csv file with 23K lines.
> Using a shorter string leads to more like keys and therefore more
> combining/reducing, but going
> by the above it seems to fail whether the substring/key is entirely
> unique (23000 combine output records) or
> mostly the same (9 combine output records).
>
> >
> > Also, since the cluster size is small, you could also look at the
> > tasktracker logs on the machines where the maps have run to see if
> > there are any failures when the reduce attempts start failing.
>
> Here is the TT log from the last failed job. I do not see anything
> besides the shuffle failure, but there
> may be something I am overlooking or simply do not understand.
> http://pastebin.com/DKFTyGXg
>
> Thanks again!
>
> >
> > Thanks
> > Hemanth
> >
>

Fwd: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Hi, sorry for the cross-post, but I am just trying to see if anyone else
has had this issue before.
Thanks


---------- Forwarded message ----------
From: bmdevelopment <bm...@gmail.com>
Date: Fri, Jun 25, 2010 at 10:56 AM
Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
To: mapreduce-user@hadoop.apache.org


Hello,
Thanks so much for the reply.
See inline.

On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
> Hi,
>
>> I've been getting the following error when trying to run a very simple
>> MapReduce job.
>> Map finishes without problem, but error occurs as soon as it enters
>> Reduce phase.
>>
>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>
>> I am running a 5 node cluster and I believe I have all my settings correct:
>>
>> * ulimit -n 32768
>> * DNS/RDNS configured properly
>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>
>> The program is very simple - just counts a unique string in a log file.
>> See here: http://pastebin.com/5uRG3SFL
>>
>> When I run, the job fails and I get the following output.
>> http://pastebin.com/AhW6StEb
>>
>> However, runs fine when I do *not* use substring() on the value (see
>> map function in code above).
>>
>> This runs fine and completes successfully:
>>            String str = val.toString();
>>
>> This causes error and fails:
>>            String str = val.toString().substring(0,10);
>>
>> Please let me know if you need any further information.
>> It would be greatly appreciated if anyone could shed some light on this problem.
>
> It catches attention that changing the code to use a substring is
> causing a difference. Assuming it is consistent and not a red herring,

Yes, this has been consistent over the last week. I was running 0.20.1
first and then upgraded to 0.20.2, but the results have been exactly
the same.

> can you look at the counters for the two jobs using the JobTracker web
> UI - things like map records, bytes etc and see if there is a
> noticeable difference ?

Ok, so here is the first job using write.set(value.toString()); having
*no* errors:
http://pastebin.com/xvy0iGwL

And here is the second job using
write.set(value.toString().substring(0, 10)); that fails:
http://pastebin.com/uGw6yNqv

And here is yet another where I used a longer, and therefore unique, string
via write.set(value.toString().substring(0, 20)); this makes every line
unique, similar to the first job.
It still fails.
http://pastebin.com/GdQ1rp8i

>Also, are the two programs being run against
> the exact same input data ?

Yes, exactly the same input: a single csv file with 23K lines.
Using a shorter string leads to more identical keys and therefore more
combining/reducing, but going by the above it seems to fail whether the
substring/key is entirely unique (23,000 combine output records) or
mostly the same (9 combine output records).

>
> Also, since the cluster size is small, you could also look at the
> tasktracker logs on the machines where the maps have run to see if
> there are any failures when the reduce attempts start failing.

Here is the TT log from the last failed job. I do not see anything
besides the shuffle failure, but there
may be something I am overlooking or simply do not understand.
http://pastebin.com/DKFTyGXg

Thanks again!

>
> Thanks
> Hemanth
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Ted Yu <yu...@gmail.com>.
Did you check the task tracker log and the log from your reducer to see if
anything was wrong?
Please also capture jstack output so that we can help you diagnose.

On Friday, July 9, 2010, bmdevelopment <bm...@gmail.com> wrote:
> Hi, I updated to the version here:
> http://github.com/kevinweil/hadoop-lzo
>
> However, when I use lzop for intermediate compression I
> am still having trouble: the reduce phase now freezes at 99% and
> eventually fails.
> It is no immediate problem for me, because I can use the default codec,
> but it may be of concern to someone else.
>
> Thanks
>
> On Fri, Jul 9, 2010 at 1:54 PM, Ted Yu <yu...@gmail.com> wrote:
>> I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
>> mention this potential issue so that other people can avoid such problem.
>> Feel free to add more onto it.
>>
>> On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <bm...@gmail.com>
>> wrote:
>>>
>>> Thanks everyone.
>>>
>>> Yes, using the Google Code version referenced on the wiki:
>>> http://wiki.apache.org/hadoop/UsingLzoCompression
>>>
>>> I will try the latest version and see if that fixes the problem.
>>> http://github.com/kevinweil/hadoop-lzo
>>>
>>> Thanks
>>>
>>> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:
>>> >>
>>> >> Todd fixed a bug where LZO header or block header data may fall on read
>>> >> boundary:
>>> >>
>>> >>
>>> >> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>>> >>
>>> >>
>>> >> I am wondering if that is related to the issue you saw.
>>> >
>>> > I don't think this bug would show up in intermediate output compression,
>>> > but
>>> > it's certainly possible. There have been a number of bugs fixed in LZO
>>> > over
>>> > on github - are you using the github version or the one from Google Code
>>> > which is out of date? Either mine or Kevin's repo on github should be a
>>> > good
>>> > version (I think we called the newest 0.3.4)
>>> > -Todd
>>> >
>>> >>
>>> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment
>>> >> <bm...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> A little more on this.
>>> >>>
>>> >>> So, I've narrowed down the problem to using Lzop compression
>>> >>> (com.hadoop.compression.lzo.LzopCodec)
>>> >>> for mapred.map.output.compression.codec.
>>> >>>
>>> >>> <property>
>>> >>>    <name>mapred.map.output.compression.codec</name>
>>> >>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>>> >>> </property>
>>> >>>
>>> >>> If I do the above, I will get the Shuffle Error.
>>> >>> If I use DefaultCodec for mapred.map.output.compression.codec.
>>> >>> there is no problem.
>>> >>>
>>> >>> Is this a known issue? Or is this a bug?
>>> >>> Doesn't seem like it should be the expected behavior.
>>> >>>
>>> >>> I would be glad to contribute any further info on this if necessary.
>>> >>> Please let me know.
>>> >>>
>>> >>> Thanks
>>> >>>
>>> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment
>>> >>> <bm...@gmail.com>
>>> >>> wrote:
>>> >>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>>> >>> >
>>> >>> > I agree that it must be a configuration problem and so today I was
>>> >>> > able
>>> >>> > to start from scratch and did a fresh install of 0.20.2 on the 5
>>> >>> > node
>>> >>> > cluster.
>>> >>> >
>>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Hi, I updated to the version here:
http://github.com/kevinweil/hadoop-lzo

However, when I use lzop for intermediate compression I
am still having trouble - the reduce phase now freezes at 99% and
eventually fails.
Not an immediate problem for me, since I can use the default codec,
but it may be of concern to someone else.

Thanks

On Fri, Jul 9, 2010 at 1:54 PM, Ted Yu <yu...@gmail.com> wrote:
> I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
> mention this potential issue so that other people can avoid such problem.
> Feel free to add more onto it.
>
> On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <bm...@gmail.com>
> wrote:
>>
>> Thanks everyone.
>>
>> Yes, using the Google Code version referenced on the wiki:
>> http://wiki.apache.org/hadoop/UsingLzoCompression
>>
>> I will try the latest version and see if that fixes the problem.
>> http://github.com/kevinweil/hadoop-lzo
>>
>> Thanks
>>
>> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <to...@cloudera.com> wrote:
>> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:
>> >>
>> >> Todd fixed a bug where LZO header or block header data may fall on read
>> >> boundary:
>> >>
>> >>
>> >> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>> >>
>> >>
>> >> I am wondering if that is related to the issue you saw.
>> >
>> > I don't think this bug would show up in intermediate output compression,
>> > but
>> > it's certainly possible. There have been a number of bugs fixed in LZO
>> > over
>> > on github - are you using the github version or the one from Google Code
>> > which is out of date? Either mine or Kevin's repo on github should be a
>> > good
>> > version (I think we called the newest 0.3.4)
>> > -Todd
>> >
>> >>
>> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment
>> >> <bm...@gmail.com>
>> >> wrote:
>> >>>
>> >>> A little more on this.
>> >>>
>> >>> So, I've narrowed down the problem to using Lzop compression
>> >>> (com.hadoop.compression.lzo.LzopCodec)
>> >>> for mapred.map.output.compression.codec.
>> >>>
>> >>> <property>
>> >>>    <name>mapred.map.output.compression.codec</name>
>> >>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>> >>> </property>
>> >>>
>> >>> If I do the above, I will get the Shuffle Error.
>> >>> If I use DefaultCodec for mapred.map.output.compression.codec.
>> >>> there is no problem.
>> >>>
>> >>> Is this a known issue? Or is this a bug?
>> >>> Doesn't seem like it should be the expected behavior.
>> >>>
>> >>> I would be glad to contribute any further info on this if necessary.
>> >>> Please let me know.
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment
>> >>> <bm...@gmail.com>
>> >>> wrote:
>> >>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>> >>> >
>> >>> > I agree that it must be a configuration problem and so today I was
>> >>> > able
>> >>> > to start from scratch and did a fresh install of 0.20.2 on the 5
>> >>> > node
>> >>> > cluster.
>> >>> >
>> >>> > I've now noticed that the error occurs when compression is enabled.
>> >>> > I've run the basic wordcount example as so:
>> >>> > http://pastebin.com/wvDMZZT0
>> >>> > and get the Shuffle Error.
>> >>> >
>> >>> > TT logs show this error:
>> >>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
>> >>> > Invalid
>> >>> > header checksum: 225702cc (expected 0x2325)
>> >>> > Full logs:
>> >>> > http://pastebin.com/fVGjcGsW
>> >>> >
>> >>> > My mapred-site.xml:
>> >>> > http://pastebin.com/mQgMrKQw
>> >>> >
>> >>> > If I remove the compression config settings, the wordcount works
>> >>> > fine
>> >>> > - no more Shuffle Error.
>> >>> > So, I have something wrong with my compression settings I imagine.
>> >>> > I'll continue looking into this to see what else I can find out.
>> >>> >
>> >>> > Thanks a million.
>> >>> >
>> >>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala
>> >>> > <yh...@gmail.com>
>> >>> > wrote:
>> >>> >> Hi,
>> >>> >>
>> >>> >> Sorry, I couldn't take a close look at the logs until now.
>> >>> >> Unfortunately, I could not see any huge difference between the
>> >>> >> success
>> >>> >> and failure case. Can you please check if things like basic
>> >>> >> hostname -
>> >>> >> ip address mapping are in place (if you have static resolution of
>> >>> >> hostnames set up) ? A web search is giving this as the most likely
>> >>> >> cause users have faced regarding this problem. Also do the disks
>> >>> >> have
>> >>> >> enough size ? Also, it would be great if you can upload your hadoop
>> >>> >> configuration information.
>> >>> >>
>> >>> >> I do think it is very likely that configuration is the actual
>> >>> >> problem
>> >>> >> because it works in one case anyway.
>> >>> >>
>> >>> >> Thanks
>> >>> >> Hemanth
>> >>> >>
>> >>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
>> >>> >> <bm...@gmail.com> wrote:
>> >>> >>> Hello,
>> >>> >>> I still have had no luck with this over the past week.
>> >>> >>> And even get the same exact problem on a completely different 5
>> >>> >>> node
>> >>> >>> cluster.
>> >>> >>> Is it worth opening an new issue in jira for this?
>> >>> >>> Thanks
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
>> >>> >>> <bm...@gmail.com> wrote:
>> >>> >>>> Hello,
>> >>> >>>> Thanks so much for the reply.
>> >>> >>>> See inline.
>> >>> >>>>
>> >>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
>> >>> >>>> <yh...@gmail.com> wrote:
>> >>> >>>>> Hi,
>> >>> >>>>>
>> >>> >>>>>> I've been getting the following error when trying to run a very
>> >>> >>>>>> simple
>> >>> >>>>>> MapReduce job.
>> >>> >>>>>> Map finishes without problem, but error occurs as soon as it
>> >>> >>>>>> enters
>> >>> >>>>>> Reduce phase.
>> >>> >>>>>>
>> >>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> >>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> >>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> >>> >>>>>>
>> >>> >>>>>> I am running a 5 node cluster and I believe I have all my
>> >>> >>>>>> settings
>> >>> >>>>>> correct:
>> >>> >>>>>>
>> >>> >>>>>> * ulimit -n 32768
>> >>> >>>>>> * DNS/RDNS configured properly
>> >>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> >>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> >>> >>>>>>
>> >>> >>>>>> The program is very simple - just counts a unique string in a
>> >>> >>>>>> log
>> >>> >>>>>> file.
>> >>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>> >>> >>>>>>
>> >>> >>>>>> When I run, the job fails and I get the following output.
>> >>> >>>>>> http://pastebin.com/AhW6StEb
>> >>> >>>>>>
>> >>> >>>>>> However, runs fine when I do *not* use substring() on the value
>> >>> >>>>>> (see
>> >>> >>>>>> map function in code above).
>> >>> >>>>>>
>> >>> >>>>>> This runs fine and completes successfully:
>> >>> >>>>>>            String str = val.toString();
>> >>> >>>>>>
>> >>> >>>>>> This causes error and fails:
>> >>> >>>>>>            String str = val.toString().substring(0,10);
>> >>> >>>>>>
>> >>> >>>>>> Please let me know if you need any further information.
>> >>> >>>>>> It would be greatly appreciated if anyone could shed some light
>> >>> >>>>>> on
>> >>> >>>>>> this problem.
>> >>> >>>>>
>> >>> >>>>> It catches attention that changing the code to use a substring
>> >>> >>>>> is
>> >>> >>>>> causing a difference. Assuming it is consistent and not a red
>> >>> >>>>> herring,
>> >>> >>>>
>> >>> >>>> Yes, this has been consistent over the last week. I was running
>> >>> >>>> 0.20.1
>> >>> >>>> first and then
>> >>> >>>> upgrade to 0.20.2 but results have been exactly the same.
>> >>> >>>>
>> >>> >>>>> can you look at the counters for the two jobs using the
>> >>> >>>>> JobTracker
>> >>> >>>>> web
>> >>> >>>>> UI - things like map records, bytes etc and see if there is a
>> >>> >>>>> noticeable difference ?
>> >>> >>>>
>> >>> >>>> Ok, so here is the first job using write.set(value.toString());
>> >>> >>>> having
>> >>> >>>> *no* errors:
>> >>> >>>> http://pastebin.com/xvy0iGwL
>> >>> >>>>
>> >>> >>>> And here is the second job using
>> >>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>> >>> >>>> http://pastebin.com/uGw6yNqv
>> >>> >>>>
>> >>> >>>> And here is even another where I used a longer, and therefore
>> >>> >>>> unique
>> >>> >>>> string,
>> >>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
>> >>> >>>> line
>> >>> >>>> unique, similar to first job.
>> >>> >>>> Still fails.
>> >>> >>>> http://pastebin.com/GdQ1rp8i
>> >>> >>>>
>> >>> >>>>>Also, are the two programs being run against
>> >>> >>>>> the exact same input data ?
>> >>> >>>>
>> >>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>> >>> >>>> Using a shorter string leads to more like keys and therefore more
>> >>> >>>> combining/reducing, but going
>> >>> >>>> by the above it seems to fail whether the substring/key is
>> >>> >>>> entirely
>> >>> >>>> unique (23000 combine output records) or
>> >>> >>>> mostly the same (9 combine output records).
>> >>> >>>>
>> >>> >>>>>
>> >>> >>>>> Also, since the cluster size is small, you could also look at
>> >>> >>>>> the
>> >>> >>>>> tasktracker logs on the machines where the maps have run to see
>> >>> >>>>> if
>> >>> >>>>> there are any failures when the reduce attempts start failing.
>> >>> >>>>
>> >>> >>>> Here is the TT log from the last failed job. I do not see
>> >>> >>>> anything
>> >>> >>>> besides the shuffle failure, but there
>> >>> >>>> may be something I am overlooking or simply do not understand.
>> >>> >>>> http://pastebin.com/DKFTyGXg
>> >>> >>>>
>> >>> >>>> Thanks again!
>> >>> >>>>
>> >>> >>>>>
>> >>> >>>>> Thanks
>> >>> >>>>> Hemanth
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Ted Yu <yu...@gmail.com>.
I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
mention this potential issue so that other people can avoid this problem.
Feel free to add more onto it.

On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <bm...@gmail.com>wrote:

> Thanks everyone.
>
> Yes, using the Google Code version referenced on the wiki:
> http://wiki.apache.org/hadoop/UsingLzoCompression
>
> I will try the latest version and see if that fixes the problem.
> http://github.com/kevinweil/hadoop-lzo
>
> Thanks
>
> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <to...@cloudera.com> wrote:
> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:
> >>
> >> Todd fixed a bug where LZO header or block header data may fall on read
> >> boundary:
> >>
> >>
> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
> >>
> >>
> >> I am wondering if that is related to the issue you saw.
> >
> > I don't think this bug would show up in intermediate output compression,
> but
> > it's certainly possible. There have been a number of bugs fixed in LZO
> over
> > on github - are you using the github version or the one from Google Code
> > which is out of date? Either mine or Kevin's repo on github should be a
> good
> > version (I think we called the newest 0.3.4)
> > -Todd
> >
> >>
> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bmdevelopment@gmail.com
> >
> >> wrote:
> >>>
> >>> A little more on this.
> >>>
> >>> So, I've narrowed down the problem to using Lzop compression
> >>> (com.hadoop.compression.lzo.LzopCodec)
> >>> for mapred.map.output.compression.codec.
> >>>
> >>> <property>
> >>>    <name>mapred.map.output.compression.codec</name>
> >>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
> >>> </property>
> >>>
> >>> If I do the above, I will get the Shuffle Error.
> >>> If I use DefaultCodec for mapred.map.output.compression.codec.
> >>> there is no problem.
> >>>
> >>> Is this a known issue? Or is this a bug?
> >>> Doesn't seem like it should be the expected behavior.
> >>>
> >>> I would be glad to contribute any further info on this if necessary.
> >>> Please let me know.
> >>>
> >>> Thanks
> >>>
> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bmdevelopment@gmail.com
> >
> >>> wrote:
> >>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
> >>> >
> >>> > I agree that it must be a configuration problem and so today I was
> able
> >>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
> >>> > cluster.
> >>> >
> >>> > I've now noticed that the error occurs when compression is enabled.
> >>> > I've run the basic wordcount example as so:
> >>> > http://pastebin.com/wvDMZZT0
> >>> > and get the Shuffle Error.
> >>> >
> >>> > TT logs show this error:
> >>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
> Invalid
> >>> > header checksum: 225702cc (expected 0x2325)
> >>> > Full logs:
> >>> > http://pastebin.com/fVGjcGsW
> >>> >
> >>> > My mapred-site.xml:
> >>> > http://pastebin.com/mQgMrKQw
> >>> >
> >>> > If I remove the compression config settings, the wordcount works fine
> >>> > - no more Shuffle Error.
> >>> > So, I have something wrong with my compression settings I imagine.
> >>> > I'll continue looking into this to see what else I can find out.
> >>> >
> >>> > Thanks a million.
> >>> >
> >>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yhemanth@gmail.com
> >
> >>> > wrote:
> >>> >> Hi,
> >>> >>
> >>> >> Sorry, I couldn't take a close look at the logs until now.
> >>> >> Unfortunately, I could not see any huge difference between the
> success
> >>> >> and failure case. Can you please check if things like basic hostname
> -
> >>> >> ip address mapping are in place (if you have static resolution of
> >>> >> hostnames set up) ? A web search is giving this as the most likely
> >>> >> cause users have faced regarding this problem. Also do the disks
> have
> >>> >> enough size ? Also, it would be great if you can upload your hadoop
> >>> >> configuration information.
> >>> >>
> >>> >> I do think it is very likely that configuration is the actual
> problem
> >>> >> because it works in one case anyway.
> >>> >>
> >>> >> Thanks
> >>> >> Hemanth
> >>> >>
> >>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
> >>> >> <bm...@gmail.com> wrote:
> >>> >>> Hello,
> >>> >>> I still have had no luck with this over the past week.
> >>> >>> And even get the same exact problem on a completely different 5
> node
> >>> >>> cluster.
> >>> >>> Is it worth opening an new issue in jira for this?
> >>> >>> Thanks
> >>> >>>
> >>> >>>
> >>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
> >>> >>> <bm...@gmail.com> wrote:
> >>> >>>> Hello,
> >>> >>>> Thanks so much for the reply.
> >>> >>>> See inline.
> >>> >>>>
> >>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
> >>> >>>> <yh...@gmail.com> wrote:
> >>> >>>>> Hi,
> >>> >>>>>
> >>> >>>>>> I've been getting the following error when trying to run a very
> >>> >>>>>> simple
> >>> >>>>>> MapReduce job.
> >>> >>>>>> Map finishes without problem, but error occurs as soon as it
> >>> >>>>>> enters
> >>> >>>>>> Reduce phase.
> >>> >>>>>>
> >>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>> >>>>>>
> >>> >>>>>> I am running a 5 node cluster and I believe I have all my
> settings
> >>> >>>>>> correct:
> >>> >>>>>>
> >>> >>>>>> * ulimit -n 32768
> >>> >>>>>> * DNS/RDNS configured properly
> >>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>> >>>>>>
> >>> >>>>>> The program is very simple - just counts a unique string in a
> log
> >>> >>>>>> file.
> >>> >>>>>> See here: http://pastebin.com/5uRG3SFL
> >>> >>>>>>
> >>> >>>>>> When I run, the job fails and I get the following output.
> >>> >>>>>> http://pastebin.com/AhW6StEb
> >>> >>>>>>
> >>> >>>>>> However, runs fine when I do *not* use substring() on the value
> >>> >>>>>> (see
> >>> >>>>>> map function in code above).
> >>> >>>>>>
> >>> >>>>>> This runs fine and completes successfully:
> >>> >>>>>>            String str = val.toString();
> >>> >>>>>>
> >>> >>>>>> This causes error and fails:
> >>> >>>>>>            String str = val.toString().substring(0,10);
> >>> >>>>>>
> >>> >>>>>> Please let me know if you need any further information.
> >>> >>>>>> It would be greatly appreciated if anyone could shed some light
> on
> >>> >>>>>> this problem.
> >>> >>>>>
> >>> >>>>> It catches attention that changing the code to use a substring is
> >>> >>>>> causing a difference. Assuming it is consistent and not a red
> >>> >>>>> herring,
> >>> >>>>
> >>> >>>> Yes, this has been consistent over the last week. I was running
> >>> >>>> 0.20.1
> >>> >>>> first and then
> >>> >>>> upgrade to 0.20.2 but results have been exactly the same.
> >>> >>>>
> >>> >>>>> can you look at the counters for the two jobs using the
> JobTracker
> >>> >>>>> web
> >>> >>>>> UI - things like map records, bytes etc and see if there is a
> >>> >>>>> noticeable difference ?
> >>> >>>>
> >>> >>>> Ok, so here is the first job using write.set(value.toString());
> >>> >>>> having
> >>> >>>> *no* errors:
> >>> >>>> http://pastebin.com/xvy0iGwL
> >>> >>>>
> >>> >>>> And here is the second job using
> >>> >>>> write.set(value.toString().substring(0, 10)); that fails:
> >>> >>>> http://pastebin.com/uGw6yNqv
> >>> >>>>
> >>> >>>> And here is even another where I used a longer, and therefore
> unique
> >>> >>>> string,
> >>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
> >>> >>>> line
> >>> >>>> unique, similar to first job.
> >>> >>>> Still fails.
> >>> >>>> http://pastebin.com/GdQ1rp8i
> >>> >>>>
> >>> >>>>>Also, are the two programs being run against
> >>> >>>>> the exact same input data ?
> >>> >>>>
> >>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
> >>> >>>> Using a shorter string leads to more like keys and therefore more
> >>> >>>> combining/reducing, but going
> >>> >>>> by the above it seems to fail whether the substring/key is
> entirely
> >>> >>>> unique (23000 combine output records) or
> >>> >>>> mostly the same (9 combine output records).
> >>> >>>>
> >>> >>>>>
> >>> >>>>> Also, since the cluster size is small, you could also look at the
> >>> >>>>> tasktracker logs on the machines where the maps have run to see
> if
> >>> >>>>> there are any failures when the reduce attempts start failing.
> >>> >>>>
> >>> >>>> Here is the TT log from the last failed job. I do not see anything
> >>> >>>> besides the shuffle failure, but there
> >>> >>>> may be something I am overlooking or simply do not understand.
> >>> >>>> http://pastebin.com/DKFTyGXg
> >>> >>>>
> >>> >>>> Thanks again!
> >>> >>>>
> >>> >>>>>
> >>> >>>>> Thanks
> >>> >>>>> Hemanth
> >>> >>>>>
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Thanks everyone.

Yes, using the Google Code version referenced on the wiki:
http://wiki.apache.org/hadoop/UsingLzoCompression

I will try the latest version and see if that fixes the problem.
http://github.com/kevinweil/hadoop-lzo

Thanks

On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <to...@cloudera.com> wrote:
> On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:
>>
>> Todd fixed a bug where LZO header or block header data may fall on read
>> boundary:
>>
>> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>>
>>
>> I am wondering if that is related to the issue you saw.
>
> I don't think this bug would show up in intermediate output compression, but
> it's certainly possible. There have been a number of bugs fixed in LZO over
> on github - are you using the github version or the one from Google Code
> which is out of date? Either mine or Kevin's repo on github should be a good
> version (I think we called the newest 0.3.4)
> -Todd
>
>>
>> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bm...@gmail.com>
>> wrote:
>>>
>>> A little more on this.
>>>
>>> So, I've narrowed down the problem to using Lzop compression
>>> (com.hadoop.compression.lzo.LzopCodec)
>>> for mapred.map.output.compression.codec.
>>>
>>> <property>
>>>    <name>mapred.map.output.compression.codec</name>
>>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>>> </property>
>>>
>>> If I do the above, I will get the Shuffle Error.
>>> If I use DefaultCodec for mapred.map.output.compression.codec.
>>> there is no problem.
>>>
>>> Is this a known issue? Or is this a bug?
>>> Doesn't seem like it should be the expected behavior.
>>>
>>> I would be glad to contribute any further info on this if necessary.
>>> Please let me know.
>>>
>>> Thanks
>>>
>>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bm...@gmail.com>
>>> wrote:
>>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>>> >
>>> > I agree that it must be a configuration problem and so today I was able
>>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
>>> > cluster.
>>> >
>>> > I've now noticed that the error occurs when compression is enabled.
>>> > I've run the basic wordcount example as so:
>>> > http://pastebin.com/wvDMZZT0
>>> > and get the Shuffle Error.
>>> >
>>> > TT logs show this error:
>>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
>>> > header checksum: 225702cc (expected 0x2325)
>>> > Full logs:
>>> > http://pastebin.com/fVGjcGsW
>>> >
>>> > My mapred-site.xml:
>>> > http://pastebin.com/mQgMrKQw
>>> >
>>> > If I remove the compression config settings, the wordcount works fine
>>> > - no more Shuffle Error.
>>> > So, I have something wrong with my compression settings I imagine.
>>> > I'll continue looking into this to see what else I can find out.
>>> >
>>> > Thanks a million.
>>> >
>>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com>
>>> > wrote:
>>> >> Hi,
>>> >>
>>> >> Sorry, I couldn't take a close look at the logs until now.
>>> >> Unfortunately, I could not see any huge difference between the success
>>> >> and failure case. Can you please check if things like basic hostname -
>>> >> ip address mapping are in place (if you have static resolution of
>>> >> hostnames set up) ? A web search is giving this as the most likely
>>> >> cause users have faced regarding this problem. Also do the disks have
>>> >> enough size ? Also, it would be great if you can upload your hadoop
>>> >> configuration information.
>>> >>
>>> >> I do think it is very likely that configuration is the actual problem
>>> >> because it works in one case anyway.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
>>> >> <bm...@gmail.com> wrote:
>>> >>> Hello,
>>> >>> I still have had no luck with this over the past week.
>>> >>> And even get the same exact problem on a completely different 5 node
>>> >>> cluster.
>>> >>> Is it worth opening an new issue in jira for this?
>>> >>> Thanks
>>> >>>
>>> >>>
>>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
>>> >>> <bm...@gmail.com> wrote:
>>> >>>> Hello,
>>> >>>> Thanks so much for the reply.
>>> >>>> See inline.
>>> >>>>
>>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
>>> >>>> <yh...@gmail.com> wrote:
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>>> I've been getting the following error when trying to run a very
>>> >>>>>> simple
>>> >>>>>> MapReduce job.
>>> >>>>>> Map finishes without problem, but error occurs as soon as it
>>> >>>>>> enters
>>> >>>>>> Reduce phase.
>>> >>>>>>
>>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> >>>>>>
>>> >>>>>> I am running a 5 node cluster and I believe I have all my settings
>>> >>>>>> correct:
>>> >>>>>>
>>> >>>>>> * ulimit -n 32768
>>> >>>>>> * DNS/RDNS configured properly
>>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>> >>>>>>
>>> >>>>>> The program is very simple - just counts a unique string in a log
>>> >>>>>> file.
>>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>>> >>>>>>
>>> >>>>>> When I run, the job fails and I get the following output.
>>> >>>>>> http://pastebin.com/AhW6StEb
>>> >>>>>>
>>> >>>>>> However, runs fine when I do *not* use substring() on the value
>>> >>>>>> (see
>>> >>>>>> map function in code above).
>>> >>>>>>
>>> >>>>>> This runs fine and completes successfully:
>>> >>>>>>            String str = val.toString();
>>> >>>>>>
>>> >>>>>> This causes error and fails:
>>> >>>>>>            String str = val.toString().substring(0,10);
>>> >>>>>>
>>> >>>>>> Please let me know if you need any further information.
>>> >>>>>> It would be greatly appreciated if anyone could shed some light on
>>> >>>>>> this problem.
>>> >>>>>
>>> >>>>> It catches attention that changing the code to use a substring is
>>> >>>>> causing a difference. Assuming it is consistent and not a red
>>> >>>>> herring,
>>> >>>>
>>> >>>> Yes, this has been consistent over the last week. I was running
>>> >>>> 0.20.1
>>> >>>> first and then
>>> >>>> upgrade to 0.20.2 but results have been exactly the same.
>>> >>>>
>>> >>>>> can you look at the counters for the two jobs using the JobTracker
>>> >>>>> web
>>> >>>>> UI - things like map records, bytes etc and see if there is a
>>> >>>>> noticeable difference ?
>>> >>>>
>>> >>>> Ok, so here is the first job using write.set(value.toString());
>>> >>>> having
>>> >>>> *no* errors:
>>> >>>> http://pastebin.com/xvy0iGwL
>>> >>>>
>>> >>>> And here is the second job using
>>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>>> >>>> http://pastebin.com/uGw6yNqv
>>> >>>>
>>> >>>> And here is even another where I used a longer, and therefore unique
>>> >>>> string,
>>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
>>> >>>> line
>>> >>>> unique, similar to first job.
>>> >>>> Still fails.
>>> >>>> http://pastebin.com/GdQ1rp8i
>>> >>>>
>>> >>>>>Also, are the two programs being run against
>>> >>>>> the exact same input data ?
>>> >>>>
>>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>>> >>>> Using a shorter string leads to more like keys and therefore more
>>> >>>> combining/reducing, but going
>>> >>>> by the above it seems to fail whether the substring/key is entirely
>>> >>>> unique (23000 combine output records) or
>>> >>>> mostly the same (9 combine output records).
>>> >>>>
>>> >>>>>
>>> >>>>> Also, since the cluster size is small, you could also look at the
>>> >>>>> tasktracker logs on the machines where the maps have run to see if
>>> >>>>> there are any failures when the reduce attempts start failing.
>>> >>>>
>>> >>>> Here is the TT log from the last failed job. I do not see anything
>>> >>>> besides the shuffle failure, but there
>>> >>>> may be something I am overlooking or simply do not understand.
>>> >>>> http://pastebin.com/DKFTyGXg
>>> >>>>
>>> >>>> Thanks again!
>>> >>>>
>>> >>>>>
>>> >>>>> Thanks
>>> >>>>> Hemanth
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yu...@gmail.com> wrote:

> Todd fixed a bug where LZO header or block header data may fall on read
> boundary:
>
> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>

> I am wondering if that is related to the issue you saw.

I don't think this bug would show up in intermediate output compression,
but it's certainly possible. There have been a number of bugs fixed in LZO
over on github - are you using the github version or the one from Google
Code which is out of date? Either mine or Kevin's repo on github should be
a good version (I think we called the newest 0.3.4)

-Todd
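The read-boundary bug mentioned in this thread is a general pitfall. As a hedged illustration (this is not the actual hadoop-lzo fix, just the underlying technique): a fixed-size header that may straddle a read boundary must be accumulated with a read loop before parsing, because a single read() may legally return fewer bytes than requested.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class HeaderRead {
    // Loop until the buffer is full: one read() call may stop at an
    // internal boundary and return fewer bytes than requested.
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) throw new EOFException("stream ended inside header");
            off += n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x23, 0x25, 1, 2, 3};
        byte[] header = new byte[2];
        readFully(new ByteArrayInputStream(data), header);
        // Parse a 2-byte big-endian magic only after the full header is buffered
        // (0x2325 echoes the "expected 0x2325" checksum from the TT logs above).
        int magic = ((header[0] & 0xff) << 8) | (header[1] & 0xff);
        System.out.println(Integer.toHexString(magic));
    }
}
```

Parsing only after readFully returns is what prevents a header that falls across a read boundary from being misinterpreted as corrupt.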


>
> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bm...@gmail.com>wrote:
>
>> A little more on this.
>>
>> So, I've narrowed down the problem to using Lzop compression
>> (com.hadoop.compression.lzo.LzopCodec)
>> for mapred.map.output.compression.codec.
>>
>> <property>
>>    <name>mapred.map.output.compression.codec</name>
>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>> </property>
>>
>> If I do the above, I will get the Shuffle Error.
>> If I use DefaultCodec for mapred.map.output.compression.codec.
>> there is no problem.
>>
>> Is this a known issue? Or is this a bug?
>> Doesn't seem like it should be the expected behavior.
>>
>> I would be glad to contribute any further info on this if necessary.
>> Please let me know.
>>
>> Thanks
>>
>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bm...@gmail.com>
>> wrote:
>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>> >
>> > I agree that it must be a configuration problem and so today I was able
>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
>> cluster.
>> >
>> > I've now noticed that the error occurs when compression is enabled.
>> > I've run the basic wordcount example as so:
>> > http://pastebin.com/wvDMZZT0
>> > and get the Shuffle Error.
>> >
>> > TT logs show this error:
>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
>> > header checksum: 225702cc (expected 0x2325)
>> > Full logs:
>> > http://pastebin.com/fVGjcGsW
>> >
>> > My mapred-site.xml:
>> > http://pastebin.com/mQgMrKQw
>> >
>> > If I remove the compression config settings, the wordcount works fine
>> > - no more Shuffle Error.
>> > So, I have something wrong with my compression settings I imagine.
>> > I'll continue looking into this to see what else I can find out.
>> >
>> > Thanks a million.
>> >
>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com>
>> wrote:
>> >> Hi,
>> >>
>> >> Sorry, I couldn't take a close look at the logs until now.
>> >> Unfortunately, I could not see any huge difference between the success
>> >> and failure case. Can you please check if things like basic hostname -
>> >> ip address mapping are in place (if you have static resolution of
>> >> hostnames set up) ? A web search is giving this as the most likely
>> >> cause users have faced regarding this problem. Also do the disks have
>> >> enough size ? Also, it would be great if you can upload your hadoop
>> >> configuration information.
>> >>
>> >> I do think it is very likely that configuration is the actual problem
>> >> because it works in one case anyway.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <
>> bmdevelopment@gmail.com> wrote:
>> >>> Hello,
>> >>> I still have had no luck with this over the past week.
>> >>> And even get the same exact problem on a completely different 5 node
>> cluster.
>> >>> Is it worth opening an new issue in jira for this?
>> >>> Thanks
>> >>>
>> >>>
>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <
>> bmdevelopment@gmail.com> wrote:
>> >>>> Hello,
>> >>>> Thanks so much for the reply.
>> >>>> See inline.
>> >>>>
>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <
>> yhemanth@gmail.com> wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>>> I've been getting the following error when trying to run a very
>> simple
>> >>>>>> MapReduce job.
>> >>>>>> Map finishes without problem, but error occurs as soon as it enters
>> >>>>>> Reduce phase.
>> >>>>>>
>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> >>>>>>
>> >>>>>> I am running a 5 node cluster and I believe I have all my settings
>> correct:
>> >>>>>>
>> >>>>>> * ulimit -n 32768
>> >>>>>> * DNS/RDNS configured properly
>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> >>>>>>
>> >>>>>> The program is very simple - just counts a unique string in a log
>> file.
>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>> >>>>>>
>> >>>>>> When I run, the job fails and I get the following output.
>> >>>>>> http://pastebin.com/AhW6StEb
>> >>>>>>
>> >>>>>> However, runs fine when I do *not* use substring() on the value
>> (see
>> >>>>>> map function in code above).
>> >>>>>>
>> >>>>>> This runs fine and completes successfully:
>> >>>>>>            String str = val.toString();
>> >>>>>>
>> >>>>>> This causes error and fails:
>> >>>>>>            String str = val.toString().substring(0,10);
>> >>>>>>
>> >>>>>> Please let me know if you need any further information.
>> >>>>>> It would be greatly appreciated if anyone could shed some light on
>> this problem.
>> >>>>>
>> >>>>> It catches attention that changing the code to use a substring is
>> >>>>> causing a difference. Assuming it is consistent and not a red
>> herring,
>> >>>>
>> >>>> Yes, this has been consistent over the last week. I was running
>> 0.20.1
>> >>>> first and then
>> >>>> upgrade to 0.20.2 but results have been exactly the same.
>> >>>>
>> >>>>> can you look at the counters for the two jobs using the JobTracker
>> web
>> >>>>> UI - things like map records, bytes etc and see if there is a
>> >>>>> noticeable difference ?
>> >>>>
>> >>>> Ok, so here is the first job using write.set(value.toString());
>> having
>> >>>> *no* errors:
>> >>>> http://pastebin.com/xvy0iGwL
>> >>>>
>> >>>> And here is the second job using
>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>> >>>> http://pastebin.com/uGw6yNqv
>> >>>>
>> >>>> And here is even another where I used a longer, and therefore unique
>> string,
>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
>> line
>> >>>> unique, similar to first job.
>> >>>> Still fails.
>> >>>> http://pastebin.com/GdQ1rp8i
>> >>>>
>> >>>>>Also, are the two programs being run against
>> >>>>> the exact same input data ?
>> >>>>
>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>> >>>> Using a shorter string leads to more like keys and therefore more
>> >>>> combining/reducing, but going
>> >>>> by the above it seems to fail whether the substring/key is entirely
>> >>>> unique (23000 combine output records) or
>> >>>> mostly the same (9 combine output records).
>> >>>>
>> >>>>>
>> >>>>> Also, since the cluster size is small, you could also look at the
>> >>>>> tasktracker logs on the machines where the maps have run to see if
>> >>>>> there are any failures when the reduce attempts start failing.
>> >>>>
>> >>>> Here is the TT log from the last failed job. I do not see anything
>> >>>> besides the shuffle failure, but there
>> >>>> may be something I am overlooking or simply do not understand.
>> >>>> http://pastebin.com/DKFTyGXg
>> >>>>
>> >>>> Thanks again!
>> >>>>
>> >>>>>
>> >>>>> Thanks
>> >>>>> Hemanth
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Ted Yu <yu...@gmail.com>.
Todd fixed a bug where LZO header or block header data may fall on read
boundary:
http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58

I am wondering if that is related to the issue you saw.

On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bm...@gmail.com> wrote:

> A little more on this.
>
> So, I've narrowed down the problem to using Lzop compression
> (com.hadoop.compression.lzo.LzopCodec)
> for mapred.map.output.compression.codec.
>
> <property>
>    <name>mapred.map.output.compression.codec</name>
>    <value>com.hadoop.compression.lzo.LzopCodec</value>
> </property>
>
> If I do the above, I will get the Shuffle Error.
> If I use DefaultCodec for mapred.map.output.compression.codec.
> there is no problem.
>
> Is this a known issue? Or is this a bug?
> Doesn't seem like it should be the expected behavior.
>
> I would be glad to contribute any further info on this if necessary.
> Please let me know.
>
> Thanks
>
> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bm...@gmail.com>
> wrote:
> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
> >
> > I agree that it must be a configuration problem and so today I was able
> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
> cluster.
> >
> > I've now noticed that the error occurs when compression is enabled.
> > I've run the basic wordcount example as so:
> > http://pastebin.com/wvDMZZT0
> > and get the Shuffle Error.
> >
> > TT logs show this error:
> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
> > header checksum: 225702cc (expected 0x2325)
> > Full logs:
> > http://pastebin.com/fVGjcGsW
> >
> > My mapred-site.xml:
> > http://pastebin.com/mQgMrKQw
> >
> > If I remove the compression config settings, the wordcount works fine
> > - no more Shuffle Error.
> > So, I have something wrong with my compression settings I imagine.
> > I'll continue looking into this to see what else I can find out.
> >
> > Thanks a million.
> >
> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com>
> wrote:
> >> Hi,
> >>
> >> Sorry, I couldn't take a close look at the logs until now.
> >> Unfortunately, I could not see any huge difference between the success
> >> and failure case. Can you please check if things like basic hostname -
> >> ip address mapping are in place (if you have static resolution of
> >> hostnames set up) ? A web search is giving this as the most likely
> >> cause users have faced regarding this problem. Also do the disks have
> >> enough size ? Also, it would be great if you can upload your hadoop
> >> configuration information.
> >>
> >> I do think it is very likely that configuration is the actual problem
> >> because it works in one case anyway.
> >>
> >> Thanks
> >> Hemanth
> >>
> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bm...@gmail.com>
> wrote:
> >>> Hello,
> >>> I still have had no luck with this over the past week.
> >>> And even get the same exact problem on a completely different 5 node
> cluster.
> >>> Is it worth opening an new issue in jira for this?
> >>> Thanks
> >>>
> >>>
> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <
> bmdevelopment@gmail.com> wrote:
> >>>> Hello,
> >>>> Thanks so much for the reply.
> >>>> See inline.
> >>>>
> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <
> yhemanth@gmail.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>>> I've been getting the following error when trying to run a very
> simple
> >>>>>> MapReduce job.
> >>>>>> Map finishes without problem, but error occurs as soon as it enters
> >>>>>> Reduce phase.
> >>>>>>
> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>>>>>
> >>>>>> I am running a 5 node cluster and I believe I have all my settings
> correct:
> >>>>>>
> >>>>>> * ulimit -n 32768
> >>>>>> * DNS/RDNS configured properly
> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>>>>>
> >>>>>> The program is very simple - just counts a unique string in a log
> file.
> >>>>>> See here: http://pastebin.com/5uRG3SFL
> >>>>>>
> >>>>>> When I run, the job fails and I get the following output.
> >>>>>> http://pastebin.com/AhW6StEb
> >>>>>>
> >>>>>> However, runs fine when I do *not* use substring() on the value (see
> >>>>>> map function in code above).
> >>>>>>
> >>>>>> This runs fine and completes successfully:
> >>>>>>            String str = val.toString();
> >>>>>>
> >>>>>> This causes error and fails:
> >>>>>>            String str = val.toString().substring(0,10);
> >>>>>>
> >>>>>> Please let me know if you need any further information.
> >>>>>> It would be greatly appreciated if anyone could shed some light on
> this problem.
> >>>>>
> >>>>> It catches attention that changing the code to use a substring is
> >>>>> causing a difference. Assuming it is consistent and not a red
> herring,
> >>>>
> >>>> Yes, this has been consistent over the last week. I was running 0.20.1
> >>>> first and then
> >>>> upgrade to 0.20.2 but results have been exactly the same.
> >>>>
> >>>>> can you look at the counters for the two jobs using the JobTracker
> web
> >>>>> UI - things like map records, bytes etc and see if there is a
> >>>>> noticeable difference ?
> >>>>
> >>>> Ok, so here is the first job using write.set(value.toString()); having
> >>>> *no* errors:
> >>>> http://pastebin.com/xvy0iGwL
> >>>>
> >>>> And here is the second job using
> >>>> write.set(value.toString().substring(0, 10)); that fails:
> >>>> http://pastebin.com/uGw6yNqv
> >>>>
> >>>> And here is even another where I used a longer, and therefore unique
> string,
> >>>> by write.set(value.toString().substring(0, 20)); This makes every line
> >>>> unique, similar to first job.
> >>>> Still fails.
> >>>> http://pastebin.com/GdQ1rp8i
> >>>>
> >>>>>Also, are the two programs being run against
> >>>>> the exact same input data ?
> >>>>
> >>>> Yes, exactly the same input: a single csv file with 23K lines.
> >>>> Using a shorter string leads to more like keys and therefore more
> >>>> combining/reducing, but going
> >>>> by the above it seems to fail whether the substring/key is entirely
> >>>> unique (23000 combine output records) or
> >>>> mostly the same (9 combine output records).
> >>>>
> >>>>>
> >>>>> Also, since the cluster size is small, you could also look at the
> >>>>> tasktracker logs on the machines where the maps have run to see if
> >>>>> there are any failures when the reduce attempts start failing.
> >>>>
> >>>> Here is the TT log from the last failed job. I do not see anything
> >>>> besides the shuffle failure, but there
> >>>> may be something I am overlooking or simply do not understand.
> >>>> http://pastebin.com/DKFTyGXg
> >>>>
> >>>> Thanks again!
> >>>>
> >>>>>
> >>>>> Thanks
> >>>>> Hemanth
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
A little more on this.

So, I've narrowed down the problem to using Lzop compression
(com.hadoop.compression.lzo.LzopCodec)
for mapred.map.output.compression.codec.

<property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

If I do the above, I get the Shuffle Error.
If I use DefaultCodec for mapred.map.output.compression.codec,
there is no problem.

Is this a known issue, or is it a bug?
It doesn't seem like this should be the expected behavior.

I would be glad to contribute any further info on this if necessary.
Please let me know.

Thanks
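
If it helps, my (possibly wrong) understanding is that LzopCodec writes the
lzop *file* format - with its own headers and block checksums - while the
shuffle expects a plain compressed stream, so perhaps the raw-stream LzoCodec
is what this property wants? Something like the following (just a sketch of
what I mean, assuming the hadoop-lzo classes are installed on the cluster):

```xml
<!-- Sketch: compress intermediate map output with the raw LZO stream codec.
     LzopCodec targets .lzo files (it adds lzop headers); LzoCodec emits a
     plain compressed stream. -->
<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
</property>
<property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```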

On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bm...@gmail.com> wrote:
> Hi, No problems. Thanks so much for your time. Greatly appreciated.
>
> I agree that it must be a configuration problem and so today I was able
> to start from scratch and did a fresh install of 0.20.2 on the 5 node cluster.
>
> I've now noticed that the error occurs when compression is enabled.
> I've run the basic wordcount example as so:
> http://pastebin.com/wvDMZZT0
> and get the Shuffle Error.
>
> TT logs show this error:
> WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
> header checksum: 225702cc (expected 0x2325)
> Full logs:
> http://pastebin.com/fVGjcGsW
>
> My mapred-site.xml:
> http://pastebin.com/mQgMrKQw
>
> If I remove the compression config settings, the wordcount works fine
> - no more Shuffle Error.
> So, I have something wrong with my compression settings I imagine.
> I'll continue looking into this to see what else I can find out.
>
> Thanks a million.
>
> On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com> wrote:
>> Hi,
>>
>> Sorry, I couldn't take a close look at the logs until now.
>> Unfortunately, I could not see any huge difference between the success
>> and failure case. Can you please check if things like basic hostname -
>> ip address mapping are in place (if you have static resolution of
>> hostnames set up) ? A web search is giving this as the most likely
>> cause users have faced regarding this problem. Also do the disks have
>> enough size ? Also, it would be great if you can upload your hadoop
>> configuration information.
>>
>> I do think it is very likely that configuration is the actual problem
>> because it works in one case anyway.
>>
>> Thanks
>> Hemanth
>>
>> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bm...@gmail.com> wrote:
>>> Hello,
>>> I still have had no luck with this over the past week.
>>> And even get the same exact problem on a completely different 5 node cluster.
>>> Is it worth opening an new issue in jira for this?
>>> Thanks
>>>
>>>
>>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <bm...@gmail.com> wrote:
>>>> Hello,
>>>> Thanks so much for the reply.
>>>> See inline.
>>>>
>>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>>> I've been getting the following error when trying to run a very simple
>>>>>> MapReduce job.
>>>>>> Map finishes without problem, but error occurs as soon as it enters
>>>>>> Reduce phase.
>>>>>>
>>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>>>
>>>>>> I am running a 5 node cluster and I believe I have all my settings correct:
>>>>>>
>>>>>> * ulimit -n 32768
>>>>>> * DNS/RDNS configured properly
>>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>>>
>>>>>> The program is very simple - just counts a unique string in a log file.
>>>>>> See here: http://pastebin.com/5uRG3SFL
>>>>>>
>>>>>> When I run, the job fails and I get the following output.
>>>>>> http://pastebin.com/AhW6StEb
>>>>>>
>>>>>> However, runs fine when I do *not* use substring() on the value (see
>>>>>> map function in code above).
>>>>>>
>>>>>> This runs fine and completes successfully:
>>>>>>            String str = val.toString();
>>>>>>
>>>>>> This causes error and fails:
>>>>>>            String str = val.toString().substring(0,10);
>>>>>>
>>>>>> Please let me know if you need any further information.
>>>>>> It would be greatly appreciated if anyone could shed some light on this problem.
>>>>>
>>>>> It catches attention that changing the code to use a substring is
>>>>> causing a difference. Assuming it is consistent and not a red herring,
>>>>
>>>> Yes, this has been consistent over the last week. I was running 0.20.1
>>>> first and then
>>>> upgrade to 0.20.2 but results have been exactly the same.
>>>>
>>>>> can you look at the counters for the two jobs using the JobTracker web
>>>>> UI - things like map records, bytes etc and see if there is a
>>>>> noticeable difference ?
>>>>
>>>> Ok, so here is the first job using write.set(value.toString()); having
>>>> *no* errors:
>>>> http://pastebin.com/xvy0iGwL
>>>>
>>>> And here is the second job using
>>>> write.set(value.toString().substring(0, 10)); that fails:
>>>> http://pastebin.com/uGw6yNqv
>>>>
>>>> And here is even another where I used a longer, and therefore unique string,
>>>> by write.set(value.toString().substring(0, 20)); This makes every line
>>>> unique, similar to first job.
>>>> Still fails.
>>>> http://pastebin.com/GdQ1rp8i
>>>>
>>>>>Also, are the two programs being run against
>>>>> the exact same input data ?
>>>>
>>>> Yes, exactly the same input: a single csv file with 23K lines.
>>>> Using a shorter string leads to more like keys and therefore more
>>>> combining/reducing, but going
>>>> by the above it seems to fail whether the substring/key is entirely
>>>> unique (23000 combine output records) or
>>>> mostly the same (9 combine output records).
>>>>
>>>>>
>>>>> Also, since the cluster size is small, you could also look at the
>>>>> tasktracker logs on the machines where the maps have run to see if
>>>>> there are any failures when the reduce attempts start failing.
>>>>
>>>> Here is the TT log from the last failed job. I do not see anything
>>>> besides the shuffle failure, but there
>>>> may be something I am overlooking or simply do not understand.
>>>> http://pastebin.com/DKFTyGXg
>>>>
>>>> Thanks again!
>>>>
>>>>>
>>>>> Thanks
>>>>> Hemanth
>>>>>
>>>>
>>>
>>
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Hi, No problems. Thanks so much for your time. Greatly appreciated.

I agree that it must be a configuration problem and so today I was able
to start from scratch and did a fresh install of 0.20.2 on the 5 node cluster.

I've now noticed that the error occurs when compression is enabled.
I've run the basic wordcount example like so:
http://pastebin.com/wvDMZZT0
and get the Shuffle Error.

TT logs show this error:
WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
header checksum: 225702cc (expected 0x2325)
Full logs:
http://pastebin.com/fVGjcGsW

My mapred-site.xml:
http://pastebin.com/mQgMrKQw

If I remove the compression config settings, the wordcount works fine
- no more Shuffle Error.
So I imagine I have something wrong with my compression settings.
I'll continue looking into this to see what else I can find out.

Thanks a million.

On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yh...@gmail.com> wrote:
> Hi,
>
> Sorry, I couldn't take a close look at the logs until now.
> Unfortunately, I could not see any huge difference between the success
> and failure case. Can you please check if things like basic hostname -
> ip address mapping are in place (if you have static resolution of
> hostnames set up) ? A web search is giving this as the most likely
> cause users have faced regarding this problem. Also do the disks have
> enough size ? Also, it would be great if you can upload your hadoop
> configuration information.
>
> I do think it is very likely that configuration is the actual problem
> because it works in one case anyway.
>
> Thanks
> Hemanth
>
> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bm...@gmail.com> wrote:
>> Hello,
>> I still have had no luck with this over the past week.
>> And even get the same exact problem on a completely different 5 node cluster.
>> Is it worth opening an new issue in jira for this?
>> Thanks
>>
>>
>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <bm...@gmail.com> wrote:
>>> Hello,
>>> Thanks so much for the reply.
>>> See inline.
>>>
>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>>> I've been getting the following error when trying to run a very simple
>>>>> MapReduce job.
>>>>> Map finishes without problem, but error occurs as soon as it enters
>>>>> Reduce phase.
>>>>>
>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>>
>>>>> I am running a 5 node cluster and I believe I have all my settings correct:
>>>>>
>>>>> * ulimit -n 32768
>>>>> * DNS/RDNS configured properly
>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>>
>>>>> The program is very simple - just counts a unique string in a log file.
>>>>> See here: http://pastebin.com/5uRG3SFL
>>>>>
>>>>> When I run, the job fails and I get the following output.
>>>>> http://pastebin.com/AhW6StEb
>>>>>
>>>>> However, runs fine when I do *not* use substring() on the value (see
>>>>> map function in code above).
>>>>>
>>>>> This runs fine and completes successfully:
>>>>>            String str = val.toString();
>>>>>
>>>>> This causes error and fails:
>>>>>            String str = val.toString().substring(0,10);
>>>>>
>>>>> Please let me know if you need any further information.
>>>>> It would be greatly appreciated if anyone could shed some light on this problem.
>>>>
>>>> It catches attention that changing the code to use a substring is
>>>> causing a difference. Assuming it is consistent and not a red herring,
>>>
>>> Yes, this has been consistent over the last week. I was running 0.20.1
>>> first and then
>>> upgrade to 0.20.2 but results have been exactly the same.
>>>
>>>> can you look at the counters for the two jobs using the JobTracker web
>>>> UI - things like map records, bytes etc and see if there is a
>>>> noticeable difference ?
>>>
>>> Ok, so here is the first job using write.set(value.toString()); having
>>> *no* errors:
>>> http://pastebin.com/xvy0iGwL
>>>
>>> And here is the second job using
>>> write.set(value.toString().substring(0, 10)); that fails:
>>> http://pastebin.com/uGw6yNqv
>>>
>>> And here is even another where I used a longer, and therefore unique string,
>>> by write.set(value.toString().substring(0, 20)); This makes every line
>>> unique, similar to first job.
>>> Still fails.
>>> http://pastebin.com/GdQ1rp8i
>>>
>>>>Also, are the two programs being run against
>>>> the exact same input data ?
>>>
>>> Yes, exactly the same input: a single csv file with 23K lines.
>>> Using a shorter string leads to more like keys and therefore more
>>> combining/reducing, but going
>>> by the above it seems to fail whether the substring/key is entirely
>>> unique (23000 combine output records) or
>>> mostly the same (9 combine output records).
>>>
>>>>
>>>> Also, since the cluster size is small, you could also look at the
>>>> tasktracker logs on the machines where the maps have run to see if
>>>> there are any failures when the reduce attempts start failing.
>>>
>>> Here is the TT log from the last failed job. I do not see anything
>>> besides the shuffle failure, but there
>>> may be something I am overlooking or simply do not understand.
>>> http://pastebin.com/DKFTyGXg
>>>
>>> Thanks again!
>>>
>>>>
>>>> Thanks
>>>> Hemanth
>>>>
>>>
>>
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

Sorry, I couldn't take a close look at the logs until now.
Unfortunately, I could not see any huge difference between the success
and failure cases. Can you please check whether basic hostname-to-IP-address
mappings are in place (if you have static resolution of hostnames set up)?
A web search suggests this is the most likely cause users have faced for
this problem. Also, do the disks have enough space? Finally, it would be
great if you could upload your Hadoop configuration.
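
For example, a minimal static setup is an identical /etc/hosts on every node,
with each node's hostname mapped to its real address rather than to 127.0.0.1
(the names and addresses below are placeholders only):

```
# /etc/hosts - identical on all nodes (example addresses/names)
127.0.0.1      localhost
192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2
192.168.1.13   slave3
192.168.1.14   slave4
```

Every hostname should resolve to the same address from every node, and the
reverse lookup of that address should return the same hostname.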

I do think it is very likely that configuration is the actual problem
because it works in one case anyway.

Thanks
Hemanth

On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bm...@gmail.com> wrote:
> Hello,
> I still have had no luck with this over the past week.
> And even get the same exact problem on a completely different 5 node cluster.
> Is it worth opening an new issue in jira for this?
> Thanks
>
>
> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <bm...@gmail.com> wrote:
>> Hello,
>> Thanks so much for the reply.
>> See inline.
>>
>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
>>> Hi,
>>>
>>>> I've been getting the following error when trying to run a very simple
>>>> MapReduce job.
>>>> Map finishes without problem, but error occurs as soon as it enters
>>>> Reduce phase.
>>>>
>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>
>>>> I am running a 5 node cluster and I believe I have all my settings correct:
>>>>
>>>> * ulimit -n 32768
>>>> * DNS/RDNS configured properly
>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>
>>>> The program is very simple - just counts a unique string in a log file.
>>>> See here: http://pastebin.com/5uRG3SFL
>>>>
>>>> When I run, the job fails and I get the following output.
>>>> http://pastebin.com/AhW6StEb
>>>>
>>>> However, runs fine when I do *not* use substring() on the value (see
>>>> map function in code above).
>>>>
>>>> This runs fine and completes successfully:
>>>>            String str = val.toString();
>>>>
>>>> This causes error and fails:
>>>>            String str = val.toString().substring(0,10);
>>>>
>>>> Please let me know if you need any further information.
>>>> It would be greatly appreciated if anyone could shed some light on this problem.
>>>
>>> It catches attention that changing the code to use a substring is
>>> causing a difference. Assuming it is consistent and not a red herring,
>>
>> Yes, this has been consistent over the last week. I was running 0.20.1
>> first and then
>> upgrade to 0.20.2 but results have been exactly the same.
>>
>>> can you look at the counters for the two jobs using the JobTracker web
>>> UI - things like map records, bytes etc and see if there is a
>>> noticeable difference ?
>>
>> Ok, so here is the first job using write.set(value.toString()); having
>> *no* errors:
>> http://pastebin.com/xvy0iGwL
>>
>> And here is the second job using
>> write.set(value.toString().substring(0, 10)); that fails:
>> http://pastebin.com/uGw6yNqv
>>
>> And here is even another where I used a longer, and therefore unique string,
>> by write.set(value.toString().substring(0, 20)); This makes every line
>> unique, similar to first job.
>> Still fails.
>> http://pastebin.com/GdQ1rp8i
>>
>>>Also, are the two programs being run against
>>> the exact same input data ?
>>
>> Yes, exactly the same input: a single csv file with 23K lines.
>> Using a shorter string leads to more like keys and therefore more
>> combining/reducing, but going
>> by the above it seems to fail whether the substring/key is entirely
>> unique (23000 combine output records) or
>> mostly the same (9 combine output records).
>>
>>>
>>> Also, since the cluster size is small, you could also look at the
>>> tasktracker logs on the machines where the maps have run to see if
>>> there are any failures when the reduce attempts start failing.
>>
>> Here is the TT log from the last failed job. I do not see anything
>> besides the shuffle failure, but there
>> may be something I am overlooking or simply do not understand.
>> http://pastebin.com/DKFTyGXg
>>
>> Thanks again!
>>
>>>
>>> Thanks
>>> Hemanth
>>>
>>
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Hello,
I still have had no luck with this over the past week,
and I even get the exact same problem on a completely different 5-node cluster.
Is it worth opening a new issue in JIRA for this?
Thanks


On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <bm...@gmail.com> wrote:
> Hello,
> Thanks so much for the reply.
> See inline.
>
> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
>> Hi,
>>
>>> I've been getting the following error when trying to run a very simple
>>> MapReduce job.
>>> Map finishes without problem, but error occurs as soon as it enters
>>> Reduce phase.
>>>
Fwd: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by bmdevelopment <bm...@gmail.com>.
Hi, sorry for the cross-post, but I'm just trying to see if anyone else
has had this issue before.
Thanks


---------- Forwarded message ----------
From: bmdevelopment <bm...@gmail.com>
Date: Fri, Jun 25, 2010 at 10:56 AM
Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
To: mapreduce-user@hadoop.apache.org


Hello,
Thanks so much for the reply.
See inline.

On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
> Hi,
>
>> I've been getting the following error when trying to run a very simple
>> MapReduce job.
>> Map finishes without problem, but error occurs as soon as it enters
>> Reduce phase.
>>
>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>
>> I am running a 5 node cluster and I believe I have all my settings correct:
>>
>> * ulimit -n 32768
>> * DNS/RDNS configured properly
>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>
>> The program is very simple - just counts a unique string in a log file.
>> See here: http://pastebin.com/5uRG3SFL
>>
>> When I run, the job fails and I get the following output.
>> http://pastebin.com/AhW6StEb
>>
>> However, runs fine when I do *not* use substring() on the value (see
>> map function in code above).
>>
>> This runs fine and completes successfully:
>>            String str = val.toString();
>>
>> This causes error and fails:
>>            String str = val.toString().substring(0,10);
>>
>> Please let me know if you need any further information.
>> It would be greatly appreciated if anyone could shed some light on this problem.
>
> It catches attention that changing the code to use a substring is
> causing a difference. Assuming it is consistent and not a red herring,

Yes, this has been consistent over the last week. I was running 0.20.1
first and then upgraded to 0.20.2, but the results have been exactly
the same.

> can you look at the counters for the two jobs using the JobTracker web
> UI - things like map records, bytes etc and see if there is a
> noticeable difference ?

OK, so here is the first job, which uses write.set(value.toString());
and has *no* errors:
http://pastebin.com/xvy0iGwL

And here is the second job using
write.set(value.toString().substring(0, 10)); that fails:
http://pastebin.com/uGw6yNqv

And here is yet another, where I used a longer and therefore unique
string via write.set(value.toString().substring(0, 20)); this makes
every line unique, similar to the first job. It still fails.
http://pastebin.com/GdQ1rp8i
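For what it's worth, substring(0, 10) throws
StringIndexOutOfBoundsException on any line shorter than 10 characters,
so a defensive truncation is worth ruling out. A sketch of the guard
(SafePrefix is just an illustrative name, not part of the actual job):

```java
// Illustrative guard only: String.substring(0, n) throws
// StringIndexOutOfBoundsException when the string is shorter than n,
// so clamp the cut point to the line length before truncating.
public class SafePrefix {
    static String prefix(String s, int n) {
        return s.length() <= n ? s : s.substring(0, n);
    }

    public static void main(String[] args) {
        System.out.println(prefix("short", 10));              // unchanged
        System.out.println(prefix("a much longer line", 10)); // truncated
    }
}
```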

>Also, are the two programs being run against
> the exact same input data ?

Yes, exactly the same input: a single CSV file with 23K lines.
Using a shorter substring leads to more duplicate keys and therefore
more combining/reducing, but going by the above it seems to fail
whether the substring key is entirely unique (23,000 combine output
records) or mostly the same (9 combine output records).

>
> Also, since the cluster size is small, you could also look at the
> tasktracker logs on the machines where the maps have run to see if
> there are any failures when the reduce attempts start failing.

Here is the TT log from the last failed job. I do not see anything
besides the shuffle failure, but there
may be something I am overlooking or simply do not understand.
http://pastebin.com/DKFTyGXg

Thanks again!

>
> Thanks
> Hemanth
>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

> I've been getting the following error when trying to run a very simple
> MapReduce job.
> Map finishes without problem, but error occurs as soon as it enters
> Reduce phase.
>
> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> attempt_201006241812_0001_r_000000_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>
> I am running a 5 node cluster and I believe I have all my settings correct:
>
> * ulimit -n 32768
> * DNS/RDNS configured properly
> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> * mapred-site.xml : http://pastebin.com/JraVQZcW
>
> The program is very simple - just counts a unique string in a log file.
> See here: http://pastebin.com/5uRG3SFL
>
> When I run, the job fails and I get the following output.
> http://pastebin.com/AhW6StEb
>
> However, runs fine when I do *not* use substring() on the value (see
> map function in code above).
>
> This runs fine and completes successfully:
>            String str = val.toString();
>
> This causes error and fails:
>            String str = val.toString().substring(0,10);
>
> Please let me know if you need any further information.
> It would be greatly appreciated if anyone could shed some light on this problem.

It is striking that changing the code to use a substring makes a
difference. Assuming it is consistent and not a red herring, can you
look at the counters for the two jobs in the JobTracker web UI - things
like map records, bytes, etc. - and see if there is a noticeable
difference? Also, are the two programs being run against the exact same
input data?

Since the cluster is small, you could also look at the tasktracker
logs on the machines where the maps ran to see if there are any
failures when the reduce attempts start failing.
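For context, "Exceeded MAX_FAILED_UNIQUE_FETCHES" roughly means the
reducer gave up after failing to copy too many *distinct* map outputs,
which is why it often points at connectivity or hostname-resolution
problems between nodes rather than at the map code itself. A simplified
sketch of that bookkeeping (illustrative only, not the actual Hadoop
source):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified illustration (not Hadoop's real code): the reducer tracks
// the *distinct* map attempts whose output it has failed to fetch and
// bails out once that count reaches a threshold. Repeated failures
// against the same map attempt do not increase the count.
public class FetchFailureTracker {
    private final Set<String> failedUniqueFetches = new HashSet<>();
    private final int maxFailedUniqueFetches;

    public FetchFailureTracker(int maxFailedUniqueFetches) {
        this.maxFailedUniqueFetches = maxFailedUniqueFetches;
    }

    /** Record one failed copy; returns true when the reducer should bail out. */
    public boolean recordFailure(String mapAttemptId) {
        failedUniqueFetches.add(mapAttemptId);
        return failedUniqueFetches.size() >= maxFailedUniqueFetches;
    }
}
```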

Thanks
Hemanth