You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Amar Kamat (JIRA)" <ji...@apache.org> on 2008/08/25 20:23:44 UTC

[jira] Created: (HADOOP-4016) TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed

TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed
-----------------------------------------------------------------------------------------------

                 Key: HADOOP-4016
                 URL: https://issues.apache.org/jira/browse/HADOOP-4016
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
            Reporter: Amar Kamat


I tried the following 
1) Started a hadoop cluster.
2) Killed the JT
3) Selected a new node for starting JT. 
4) Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping. Checked if the tracker node correctly resolves the hostname to the new ip.
5) Start the JT on the new node
The tasktracker fails to connect to the new jobtracker. It seems that the hostname resolution remains stale and is never updated.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4016) TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646529#action_12646529 ] 

Steve Loughran commented on HADOOP-4016:
----------------------------------------

Dhruba, I think you've got same-IP and same hostname mixed up. Same IP is easy, only the routers care

Same hostname, different IP means you have to wait for DNS entries to propagate around. Some places it works, some places it doesnt. 

Changing the configuration works on top of changing the hostname, as it doesn't prevent you doing that, but you can try other things like bringing up a new JT on a different port on the same box, new machine, etc. But you do need a way to push out configuration changes to all the task trackers, which classic XML-driven configurations don't have. I can do it, but my code to do that is from subclassing JobConf and using SmartFrog. HADOOP-3582 is a todo list of better configuration options; you'd need a way to push out the change.

1. the extra cost of another lookup of the configuration file is minimal if you are doing retries anyway.
2. we should set the dns TTL anyway too, but check with Allen what settings he likes before bringing the network down. 
3. configuration file changes are easier to test, so we can see that the file gets checked.

> TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4016
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> I tried the following 
> 1) Started a hadoop cluster.
> 2) Killed the JT
> 3) Selected a new node for starting JT. 
> 4) Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping. Checked if the tracker node correctly resolves the hostname to the new ip.
> 5) Start the JT on the new node
> The tasktracker fails to connect to the new jobtracker. It seems that the hostname resolution remains stale and is never updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4016) TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646237#action_12646237 ] 

Steve Loughran commented on HADOOP-4016:
----------------------------------------

Assuming Amar changed the DNS entry for the job tracker, then it won't be enough

- the JVM caches hostnames forever unless you tell it otherwise

"Otherwise" means setting the JVM security properties
networkaddress.cache.ttl and  networkaddress.cache.ttl 

http://java.sun.com/javase/6/docs/api/java/net/InetAddress.html

- That's regardless of any caching done in process. The task tracker reads in "mapred.job.tracker" from the configuration on startup only.

To do failover of job tracker you'd need to change the JVM to not cache the addresses forever -which will have other consequences, good and bad, and then change TaskTracker to try and redo the nslookup when the job tracker heartbeat's failed. 

This will be a fun test to automate. You could do it in-VM by starting a second job tracker on a different port of localhost and then stop the original tracker, check that the tasktracker failed its hearbeat, reread the config and picked up the new (host,port) setting. This would not test DNS caching, but would show the tasktracker was rereading its configuration. DNS Caching tests are hard outside of a VMWare/Xen cluster. 



> TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4016
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> I tried the following 
> 1) Started a hadoop cluster.
> 2) Killed the JT
> 3) Selected a new node for starting JT. 
> 4) Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping. Checked if the tracker node correctly resolves the hostname to the new ip.
> 5) Start the JT on the new node
> The tasktracker fails to connect to the new jobtracker. It seems that the hostname resolution remains stale and is never updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4016) TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645626#action_12645626 ] 

dhruba borthakur commented on HADOOP-4016:
------------------------------------------

>Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping

hi amar, can u pl explain what precisely u changed on the tasktracker machines?

> TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4016
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> I tried the following 
> 1) Started a hadoop cluster.
> 2) Killed the JT
> 3) Selected a new node for starting JT. 
> 4) Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping. Checked if the tracker node correctly resolves the hostname to the new ip.
> 5) Start the JT on the new node
> The tasktracker fails to connect to the new jobtracker. It seems that the hostname resolution remains stale and is never updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4016) TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646318#action_12646318 ] 

dhruba borthakur commented on HADOOP-4016:
------------------------------------------

The way I understand this one:

Option1: When a JT machine fails, another machine can be brought up with the same IP address as the original JT. The config files do not have to change, The TaskTrackers have to start talking to the new machine using the old IP address.

Option2: Another way would be to start the a new instance of the JT on a different machine that has a different IP address. The administrator can change the entries in the hadoop config file. The TaskTrackers should be intelligent enough to re-read the config file and then start talking to the new JT.

I guess Amar is trying out Option 1. 

Option 2 has great advantagtes when the Hadoop administrator is not the same as the DNS administrator. In my case, I can deploy Hadoop config files instantly, where changing a DNS entry in a corporate-wide setting and waiting for the DNS change to propagate takes a long long time.

> TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4016
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> I tried the following 
> 1) Started a hadoop cluster.
> 2) Killed the JT
> 3) Selected a new node for starting JT. 
> 4) Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping. Checked if the tracker node correctly resolves the hostname to the new ip.
> 5) Start the JT on the new node
> The tasktracker fails to connect to the new jobtracker. It seems that the hostname resolution remains stale and is never updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4016) TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646458#action_12646458 ] 

Amar Kamat commented on HADOOP-4016:
------------------------------------

Dhruba is right. I was trying to do a quick fix to make the 'recovery' feature complete. I think option 2 has other advantages too. Consider a case where after running the cluster for long we realize that we need to change the slot ratio cluster wide. One way this can be done is to re-deploy new config files with right/improved parameters and let the task-trackers update themselves. 

> TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4016
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> I tried the following 
> 1) Started a hadoop cluster.
> 2) Killed the JT
> 3) Selected a new node for starting JT. 
> 4) Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping. Checked if the tracker node correctly resolves the hostname to the new ip.
> 5) Start the JT on the new node
> The tasktracker fails to connect to the new jobtracker. It seems that the hostname resolution remains stale and is never updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4016) TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646272#action_12646272 ] 

Amar Kamat commented on HADOOP-4016:
------------------------------------

@Dhruba I changed the entry in _/etc/hosts_ to point to the ip address of the new JobTracker. The idea was to check if task-tracker ever tries to re-resolve the jobtracker's hostname.
@Steve I think we should force a hostname->ip-address resolution after few retries. So that a simple DNS entry change and a jobtracker restart would be enough to recover from a failure. Either we could make the JVM do it or we can manually do it by keeping tracker of how many times we reconnect. I think doing it manually makes sense as we can control when the resolution happens.

> TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4016
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> I tried the following 
> 1) Started a hadoop cluster.
> 2) Killed the JT
> 3) Selected a new node for starting JT. 
> 4) Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping. Checked if the tracker node correctly resolves the hostname to the new ip.
> 5) Start the JT on the new node
> The tasktracker fails to connect to the new jobtracker. It seems that the hostname resolution remains stale and is never updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4016) TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646287#action_12646287 ] 

Steve Loughran commented on HADOOP-4016:
----------------------------------------

-to get DNS changes picked up then, we need two things
  1. JVM settings to limit DNS caching in the TaskTracker to something everyone is happy with. 
  2. renew hostname lookup after failures

I'd still push for a complete lookup of the mapred.job.tracker string. It is easier to write tests for this, and on systems that have used alternate configuration back ends (I know, still on my todo list), they get to pick these values up from the CM infrastructure. That would let people roll out new JobTracker locations without playing with DNS, just edit the tracker URL to point to the new host

> TaskTrackers never (re)connect back to the JobTracker if the JobTracker node/machine is changed
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4016
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> I tried the following 
> 1) Started a hadoop cluster.
> 2) Killed the JT
> 3) Selected a new node for starting JT. 
> 4) Changed the entry on the tasktracker to reflect the new (old) hostname to (new) ip mapping. Checked if the tracker node correctly resolves the hostname to the new ip.
> 5) Start the JT on the new node
> The tasktracker fails to connect to the new jobtracker. It seems that the hostname resolution remains stale and is never updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.