Posted to common-user@hadoop.apache.org by Deepak Diwakar <dd...@gmail.com> on 2010/07/27 21:31:28 UTC

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Hey friends,

I got stuck while setting up an HDFS cluster and am getting this error when running
the simple wordcount example (I did the same thing two years back without any problem).

I am currently testing on hadoop-0.20.1 with 2 nodes, following the instructions from:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29

I checked the firewall settings and /etc/hosts; there is no issue there. The
master and the slave are also reachable from each other in both directions.

The input size is also very low (~3 MB), so there should not be any issue with
the ulimit (which is, by the way, 4096).
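
For reference, a rough sketch of how that limit is usually checked and raised on
Ubuntu (the "hadoop" user name below is only an assumed example):

    $ ulimit -n
    4096

    # /etc/security/limits.conf
    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768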

I would be really thankful if anyone can guide me in resolving this.

Thanks & regards,
- Deepak Diwakar,




On 28 June 2010 18:39, bmdevelopment <bm...@gmail.com> wrote:

> Hi, sorry for the cross-post, but I am just trying to see if anyone else
> has had this issue before.
> Thanks
>
>
> ---------- Forwarded message ----------
> From: bmdevelopment <bm...@gmail.com>
> Date: Fri, Jun 25, 2010 at 10:56 AM
> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
> bailing-out.
> To: mapreduce-user@hadoop.apache.org
>
>
> Hello,
> Thanks so much for the reply.
> See inline.
>
> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yh...@gmail.com>
> wrote:
> > Hi,
> >
> >> I've been getting the following error when trying to run a very simple
> >> MapReduce job.
> >> The map finishes without problem, but the error occurs as soon as it
> >> enters the reduce phase.
> >>
> >> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>
> >> I am running a 5 node cluster and I believe I have all my settings
> >> correct:
> >>
> >> * ulimit -n 32768
> >> * DNS/RDNS configured properly
> >> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>
> >> The program is very simple - just counts a unique string in a log file.
> >> See here: http://pastebin.com/5uRG3SFL
> >>
> >> When I run, the job fails and I get the following output.
> >> http://pastebin.com/AhW6StEb
> >>
> >> However, it runs fine when I do *not* use substring() on the value (see
> >> the map function in the code above).
> >>
> >> This runs fine and completes successfully:
> >>            String str = val.toString();
> >>
> >> This causes error and fails:
> >>            String str = val.toString().substring(0,10);
> >>
> >> Please let me know if you need any further information.
> >> It would be greatly appreciated if anyone could shed some light on this
> >> problem.
> >
> > It is striking that changing the code to use a substring is causing a
> > difference. Assuming it is consistent and not a red herring,
>
> Yes, this has been consistent over the last week. I was running 0.20.1 first
> and then upgraded to 0.20.2, but the results have been exactly the same.
>
> > can you look at the counters for the two jobs using the JobTracker web
> > UI - things like map records, bytes etc and see if there is a
> > noticeable difference ?
>
> OK, so here is the first job, which uses write.set(value.toString()); and has
> *no* errors:
> http://pastebin.com/xvy0iGwL
>
> And here is the second job, which uses
> write.set(value.toString().substring(0, 10)); and fails:
> http://pastebin.com/uGw6yNqv
>
> And here is yet another job, where I used a longer, and therefore unique,
> string via write.set(value.toString().substring(0, 20)); this makes every
> line unique, similar to the first job. It still fails:
> http://pastebin.com/GdQ1rp8i
>
> > Also, are the two programs being run against
> > the exact same input data?
>
> Yes, exactly the same input: a single CSV file with 23K lines.
> Using a shorter string leads to more identical keys and therefore more
> combining/reducing, but going by the above it seems to fail whether the
> substring/key is entirely unique (23000 combine output records) or mostly
> the same (9 combine output records).
>
> >
> > Also, since the cluster size is small, you could look at the
> > tasktracker logs on the machines where the maps ran to see if
> > there are any failures when the reduce attempts start failing.
>
> Here is the TT log from the last failed job. I do not see anything
> besides the shuffle failure, but there
> may be something I am overlooking or simply do not understand.
> http://pastebin.com/DKFTyGXg
>
> Thanks again!
>
> >
> > Thanks
> > Hemanth
> >
>
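
An aside on the substring() calls discussed above: String.substring(0, 10) throws
a StringIndexOutOfBoundsException for any line shorter than 10 characters, so the
call is worth guarding regardless of the shuffle problem (which, as the follow-ups
below show, came from /etc/hosts). A rough sketch only, written against the old
org.apache.hadoop.mapred API and with made-up names; it is not the code behind
the pastebin links:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper: counts the first (up to) 10 characters of each line.
    public class PrefixCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final Text write = new Text();
      private final IntWritable one = new IntWritable(1);

      public void map(LongWritable key, Text val,
                      OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        String str = val.toString();
        // Clamp the end index so short or empty lines cannot throw
        // StringIndexOutOfBoundsException.
        write.set(str.substring(0, Math.min(10, str.length())));
        out.collect(write, one);
      }
    }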

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by Deepak Diwakar <dd...@gmail.com>.
Thanks, Krishna and Chen.

Yes, the problem was in /etc/hosts. Each node had its own unique identifier
(necromancer, rocker, etc.), which was the only difference in /etc/hosts amongst
the nodes. Once I put the same identifiers in place on all the nodes, it worked.
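
For anyone who hits the same thing: the reduce tasks fetch map output over HTTP
from the tasktrackers, addressed by the hostnames the nodes advertise, so if the
nodes resolve those names inconsistently the fetches fail and the shuffle bails
out with MAX_FAILED_UNIQUE_FETCHES. Roughly, every node should carry the same
entries (the hostnames and addresses below are made up for illustration); in
particular, an entry mapping a node's own hostname to 127.0.0.1/127.0.1.1 is a
classic cause of this error:

    127.0.0.1     localhost
    192.168.0.1   master
    192.168.0.2   slave1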


Thanks & regards
- Deepak Diwakar,




Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by "C.V.Krishnakumar" <cv...@me.com>.
Hi Deepak,

Maybe I did not make my mail clear. I had tried the instructions in the blog you mentioned, and they are working for me.
Did you change the /etc/hosts file at any point in time?

Regards,
Krishna



Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by "C.V.Krishnakumar" <cv...@me.com>.
Hi Deepak,

You could refer to this too: http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results
I tried those instructions and they are working for me.
Regards,
Krishna


Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Posted by He Chen <ai...@gmail.com>.
Hey Deepak Diwakar

Try to keep the /etc/hosts file the same across all of your cluster nodes and
see whether the problem disappears.
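
One quick way to confirm the files really are identical is to diff each slave's
copy against the master's (the hostnames below are assumed for illustration):

    for h in slave1 slave2; do ssh $h cat /etc/hosts | diff - /etc/hosts; done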




-- 
Best Wishes! (顺送商祺!)

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Research Assistant of Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588