Posted to user@hadoop.apache.org by Terance Dias <te...@gmail.com> on 2014/04/19 14:32:20 UTC

Shuffle Error after enabling Kerberos authentication

Hi,

I'm using Apache hadoop-2.1.0-beta. I was able to set up a basic multi-node
cluster and run MapReduce jobs. But when I enable Kerberos authentication,
the reduce task fails with the following error:

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:121)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:311)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:243)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)

I did a search and found that people have generally seen this error when
their network configuration is incorrect, so the nodes are not able to
communicate with each other to shuffle the data. I don't think that
is the problem in my case, because everything works fine if Kerberos
authentication is disabled. Any idea what the problem could be?

Thanks,
Terance.
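
One way to rule out the network explanation with Kerberos enabled is to confirm that every node can still open a TCP connection to the shuffle service on every other node. The following is a minimal sketch, not a definitive diagnostic: it assumes the shuffle service listens on the port set by mapreduce.shuffle.port (13562 is assumed here as the default), and the host names in the usage note are hypothetical. A successful connection only shows the port is reachable; it says nothing about whether the fetch itself is authorized.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_shuffle_ports(hosts, port=13562, timeout=3.0):
    """Map each host to whether its (assumed) shuffle port is reachable."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Run from each worker, for example check_shuffle_ports(["worker1.example.com", "worker2.example.com"]), and compare the results with and without security enabled.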

Re: Shuffle Error after enabling Kerberos authentication

Posted by Jay Vyas <ja...@gmail.com>.
(bump) This is a good question.

I'm new to Kerberos as well, and have been wondering how to prevent
scenarios such as this from happening.

My thought is that since Kerberos, IIRC, requires a ticket for each
client/service pair working together, maybe there is a chance that, if
*any* two nodes in a cluster haven't been initialized with the right tickets
to talk to each other, an error can happen during shuffle-sort because
so much distributed copying is going on?

In any case, I'd love to know of any good smoke tests for a large
Kerberized Hadoop cluster that don't require running a MapReduce job.
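
As a starting point for such a smoke test, here is a minimal sketch, under stated assumptions, that checks two things on the node where it runs: that a valid ticket cache exists (klist -s, which on MIT Kerberos exits non-zero when the cache is missing or expired) and that an authenticated HDFS call succeeds (hdfs dfs -ls /). The choice of these two commands and the run_smoke_checks helper are illustrative assumptions, not an established Hadoop test suite, and passing them does not exercise the shuffle path itself.

```python
import subprocess

# Commands assumed to be on PATH; each is expected to exit 0 on success.
SMOKE_COMMANDS = [
    ["klist", "-s"],              # valid, non-expired ticket cache present?
    ["hdfs", "dfs", "-ls", "/"],  # authenticated call to the NameNode works?
]


def run_smoke_checks(commands=SMOKE_COMMANDS, runner=subprocess.run):
    """Run each command and map its text form to True (exit 0) or False."""
    results = {}
    for cmd in commands:
        try:
            completed = runner(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            results[" ".join(cmd)] = completed.returncode == 0
        except FileNotFoundError:
            # The tool is not installed on this node.
            results[" ".join(cmd)] = False
    return results
```

Running run_smoke_checks() on every node, for instance over ssh as the service user, gives a quick per-node pass/fail table before any job is submitted.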



-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Shuffle Error after enabling Kerberos authentication

Posted by Mike <mi...@unitedrmr.com>.
Unsubscribe

