Posted to common-user@hadoop.apache.org by "Kane, David" <Da...@sra.com> on 2010/03/17 15:46:22 UTC

Re: Slave data node failing to connect?

Folks,

Does anyone know if this earlier post ever reached a resolution?  I am trying to work through the same tutorial and have hit the same issue.  None of the candidate problems Jason suggested seems to apply in my case (details below).  I'm looking for suggestions on how to get this working, or other avenues for debugging.  I am using hadoop-0.20.2 on Red Hat Enterprise Linux Server release 5.4.

Thanks!

Sincerely,
David Kane

None of my logs are reporting any errors.

Candidate Issue: Either your master namenode/jobtrackers are not actually starting:
     JPS Shows the following on the Master Node:
          32559 DataNode
          398 TaskTracker
          32749 JobTracker <----- Master Job Tracker started
          32414 NameNode <----- Master NameNode started
          32668 SecondaryNameNode
          439 Jps

     BTW, the master does seem to be starting up the processes correctly on the slave.  JPS there reports:
         4048 DataNode
         4179 Jps
         4108 TaskTracker

Candidate Issue: they [master namenode/jobtrackers] are  not listening on those particular ports
     On my master, my namenode log shows:
          2010-03-17 09:08:11,711 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
          2010-03-17 09:08:11,712 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54310: starting
          ....
         2010-03-17 09:08:11,752 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310: starting

    On my master, my jobtracker log shows:
          2010-03-17 09:09:31,036 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 54311
          2010-03-17 09:09:31,036 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
          2010-03-17 09:09:31,308 INFO org.apache.hadoop.mapred.JobTracker: Cleaning up the system directory
          2010-03-17 09:09:31,369 INFO org.apache.hadoop.mapred.CompletedJobStatusStore: Completed job store is inactive
          2010-03-17 09:09:31,500 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
          2010-03-17 09:09:31,501 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54311: starting
          2010-03-17 09:09:31,507 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 54311: starting
          2010-03-17 09:09:31,509 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54311: starting
          2010-03-17 09:09:31,511 INFO org.apache.hadoop.mapred.JobTracker: Starting RUNNING
          .....
          2010-03-17 09:09:31,523 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311: starting

     On my slave, my datanode log shows:
          2010-03-17 09:25:56,217 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: mdadqsgdac1.mdanderson.edu/10.111.85.15:54310. Already tried 0 time(s).
          2010-03-17 09:25:57,231 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: mdadqsgdac1.mdanderson.edu/10.111.85.15:54310. Already tried 1 time(s).
          ...
          2010-03-17 09:26:05,364 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: mdadqsgdac1.mdanderson.edu/10.111.85.15:54310. Already tried 9 time(s).
          2010-03-17 09:26:05,365 INFO org.apache.hadoop.ipc.RPC: Server at mdadqsgdac1.mdanderson.edu/10.111.85.15:54310 not available yet, Zzzzz...

     On my slave, my tasktracker log shows:
          2010-03-17 09:26:00,850 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: mdadqsgdac1.mdanderson.edu/10.111.85.15:54311. Already tried 0 time(s).
          2010-03-17 09:26:01,869 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: mdadqsgdac1.mdanderson.edu/10.111.85.15:54311. Already tried 1 time(s).
          ...
          2010-03-17 09:26:10,002 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: mdadqsgdac1.mdanderson.edu/10.111.85.15:54311. Already tried 9 time(s).
          2010-03-17 09:26:10,003 INFO org.apache.hadoop.ipc.RPC: Server at mdadqsgdac1.mdanderson.edu/10.111.85.15:54311 not available yet, Zzzzz...

The domain names and IP addresses that the slave is using do appear to match the ones that the master's logs report it is using.  While the jobtracker has an explicit Starting RUNNING message and the namenode does not, the messages on the slave side are the same.

Candidate Issue: There is a networking issue

     What sort of issue would cause this problem?  I don't seem to have any issues getting from one machine to the other.  I can ssh in both directions.  I can traceroute from slave to master:
         -sh-3.2$ traceroute mdadqsgdac1.mdanderson.edu
              traceroute to mdadqsgdac1.mdanderson.edu (10.111.85.15), 30 hops max, 40 byte packets
             1  mdadqsgdac1.mdanderson.edu (10.111.85.15)  0.080 ms  0.089 ms  0.084 ms
     and I can traceroute from master to slave:
          -sh-3.2$ traceroute mdadqsgdac2.mdanderson.edu
          traceroute to mdadqsgdac2.mdanderson.edu (10.111.85.16), 30 hops max, 40 byte packets
          1  mdadqsgdac2.mdanderson.edu (10.111.85.16)  0.142 ms  0.144 ms  0.136 ms
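
Note that ssh and traceroute only show that the two hosts can reach each other in general; they do not show that the specific Hadoop ports accept connections from the other machine. A minimal sketch of a direct check, to be run on the slave, is below. The function name can_connect and the 3-second timeout are illustrative choices, not part of Hadoop; the host name and ports in the usage comment are the ones from the logs above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage on the slave (hypothetical invocation):
#   for port in (54310, 54311):
#       print(port, can_connect("mdadqsgdac1.mdanderson.edu", port))
```

If this returns False from the slave but True when run on the master itself, the daemon is probably listening on an address that the slave cannot reach.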




-------Jason Venner <ja...@gmail.com> wrote on Thu, 05 Nov 2009 11:15:54 GMT----------------------------------------------------------------

Either your master namenode/jobtrackers are not actually starting, or they
are  not listening on those particular ports or there is a networking issue.

On Tue, Nov 3, 2009 at 4:23 AM, Neil Blue <Ne...@biowisdom.com> wrote:

> Hello
>
> I am trying to start up my first twin node hadoop cluster. I have followed
> this guide:
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> ,and got two machines running as single node instances and then moved on to
> connect them into a multi-node cluster.
>
> I have two ubuntu instances running in virtual box with a bridged network
> adapter.
>
> I have configured the xml files slaves and master to point to the correct
> machines, along with the ssh key.
>
> When I start up the services I get all these starting on the master:
>
> JobTracker
> DataNode
> SecondaryNameNode
> TaskTracker
> NameNode
>
> The web interface shows the system is up and running with one node.
>
> On the slave these are running:
> TaskTracker
> DataNode
>
> The output logs on the slave show:
>
> hadoop-hadoop-datanode-slave.log
> 2009-11-03 11:15:52,055 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:4310. Already tried 9 time(s).
> 2009-11-03 11:15:52,057 INFO org.apache.hadoop.ipc.RPC: Server at
> master/172.18.11.95:4310 not available yet, Zzzzz...
> 2009-11-03 11:15:54,063 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:4310. Already tried 0 time(s).
> 2009-11-03 11:15:55,064 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:4310. Already tried 1 time(s).
> 2009-11-03 11:15:56,068 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:4310. Already tried 2 time(s).
> 2009-11-03 11:15:57,073 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:4310. Already tried 3 time(s).
>
> hadoop-hadoop-tasktracker-slave.log
> 2009-11-03 11:18:01,002 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:9001. Already tried 9 time(s).
> 2009-11-03 11:18:01,004 INFO org.apache.hadoop.ipc.RPC: Server at
> master/172.18.11.95:9001 not available yet, Zzzzz...
> 2009-11-03 11:18:03,007 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:9001. Already tried 0 time(s).
> 2009-11-03 11:18:04,009 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:9001. Already tried 1 time(s).
> 2009-11-03 11:18:05,011 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: master/172.18.11.95:9001. Already tried 2 time(s).
>
> Tcpdump shows that the packets are being sent between the machines, and ssh
> works, so there do not seem to be any network problems. Also, on the slave,
> the remote http://master:50070/dfshealth.jsp page is visible.
>
> I have also tried changing the port numbers used by the master, but no
> luck.
>
> Any suggestions please.
>
> Thanks
> Neil
>
> *********************************************


RE: Slave data node failing to connect?

Posted by "Kane, David" <Da...@sra.com>.
Folks,

As it turns out, it was a networking problem.  The solution was similar to what is described here:

http://wiki.apache.org/hadoop/Hbase/Troubleshooting

However, I needed to make a similar adjustment to the master's /etc/hosts file as well.
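
The linked page is not reproduced here, so the following is only a sketch under an assumption: that the problem is the common one where a machine's own host name appears on a loopback line in /etc/hosts, so the daemons bind to 127.0.0.1 and other nodes cannot connect. The helper below scans hosts-file text for that pattern. The function name loopback_names and the sample host names are hypothetical, not taken from my actual files.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return {name: address} for non-localhost names mapped to a loopback address."""
    found = {}
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # Not a plain IP address; skip the line.
        if not ip.is_loopback:
            continue
        for name in names:
            # Names such as localhost or ip6-loopback are expected here.
            if "localhost" in name or "loopback" in name:
                continue
            found[name] = address
    return found


# Hypothetical hosts-file content illustrating the suspected misconfiguration.
sample = """
127.0.0.1   localhost.localdomain localhost master.example.com master
10.0.0.15   master.example.com master
"""
print(loopback_names(sample))
```

To check a real machine, pass in the contents of /etc/hosts, for example loopback_names(open("/etc/hosts").read()). Any name it reports that is also used in the Hadoop configuration is worth moving to the line with the machine's real network address.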

Sincerely,
David Kane
