Posted to hdfs-user@hadoop.apache.org by Michał Czerwiński <mi...@qubitproducts.com> on 2013/03/11 18:06:03 UTC

Namenode caches hostname resolution and tries to bind to an old IP address.

I had to change the instance type of a namenode running in EC2, which
requires a host shutdown/startup, and that involves an IP change unless you
are using elastic IP addresses.
We are using FQDNs to access our Hadoop cluster, so after the host came back
up I updated this particular DNS entry to point to the new IP.
A TTL of 60 seconds allowed the change to propagate very quickly, and I
double-checked that the FQDN resolves correctly.
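
For completeness, here is a minimal sketch of the same check done from inside
a JVM rather than with dig/host, since that is closer to how the namenode
process resolves the name (x.y.z.com and the class name are just placeholders):

import java.net.InetAddress;
import java.security.Security;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // The JVM keeps its own positive-lookup cache on top of the OS resolver;
        // the networkaddress.cache.ttl security property (seconds, -1 = cache forever)
        // controls how long successful lookups are kept. Unset means the JVM default.
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
        for (InetAddress a : InetAddress.getAllByName("x.y.z.com")) {
            System.out.println("x.y.z.com resolves to " + a.getHostAddress());
        }
    }
}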
I was very surprised to see the namenode trying to bind() to
x.y.z.com/1.1.1.1:54310 (where 1.1.1.1 is the old IP address).
There was no obvious place or way Hadoop could be picking up this old IP
instead of the new one; the only traces of the old IP I found were in the
namenode's HDFS metadata:

name/current/fsimage
name/previous.checkpoint/fsimage

First I tried to restore the data from the secondarynamenode, but it didn't help.
I also kicked the datanodes, and they picked up the change immediately and
tried to establish connections to the new address.

I could also see some errors related to the tasktrackers; the jobtracker had
a problem submitting job history logs to their HDFS location:
2013-03-11 13:25:09,500 ERROR mapred.JobHistory
(JobHistory.java:logSubmitted(1598)) - Failed creating job history log file
for job job_201303091517_9689
java.net.ConnectException: Call to x.y.z.com/1.1.1.1:54310 failed on
connection exception: java.net.ConnectException: Connection refused

I worked around this problem by specifying the new IP in fs.default.name in
core-site.xml on the namenode's server and on the tasktracker machines.
I no longer see any traces of the old IP in the fsimage files.
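
As a sanity check after editing the config, something like this minimal
sketch (assuming the Hadoop 1.x jars and the cluster's conf directory are on
the classpath; the class name is made up) prints which fs.default.name the
daemons on a machine will actually read:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowDefaultFs {
    public static void main(String[] args) {
        // new Configuration() picks up core-site.xml from the classpath,
        // so this prints the value the daemons on this machine will use.
        Configuration conf = new Configuration();
        System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
        System.out.println("FileSystem default = " + FileSystem.getDefaultUri(conf));
    }
}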

Some tasktrackers still had jobs running that used the actual FQDN instead
of the IP (causing jobtracker failures when submitting history logs). I
worked around this by attaching the old IP as an additional address on an
interface and creating some DNAT rules, so the jobtracker was able to submit
those logs.
After those jobs cleared out I didn't see any more traffic to the old IP.

I tried to replicate this issue by running a namenode under a third-party
FQDN, shutting the EC2 instance down, starting it up with a new IP, and
updating the FQDN, but I was unable to hit the problem again.

This looks very similar to https://issues.apache.org/jira/browse/HDFS-34

Any ideas why the namenode is caching those DNS entries in such a way, and
how to clear this out?

Thanks.