Posted to dev@ambari.apache.org by Jonathan Hurley <jh...@hortonworks.com> on 2015/04/14 16:09:14 UTC

Review Request 33166: Ambari Agent holding socket open on 50070 prevents NN from starting

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33166/
-----------------------------------------------------------

Review request for Ambari, Alejandro Fernandez and Sumit Mohanty.


Bugs: AMBARI-10464
    https://issues.apache.org/jira/browse/AMBARI-10464


Repository: ambari


Description
-------

The Ambari Agent process appears to be holding port 50070 open, which causes the NameNode (NN) to fail to start until the Ambari Agent is restarted. A netstat -anp reveals that the agent process has this port open:
```
[root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
tcp 0 0 192.168.1.141:50070 192.168.1.141:50070 ESTABLISHED 1630/python2.6
```

After digging further into sockets and Linux, I think it's entirely possible for the agent to be assigned a source port that matches the destination port; anything in the ephemeral port range is up for grabs. Essentially, the NN is down, and when the agent checks it via a socket connection to 50070, the source (client) side of the connection binds to 50070 because that port is free and within the range specified by /proc/sys/net/ipv4/ip_local_port_range.
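
For illustration only (not part of the patch), here is a minimal Python sketch of the self-connect scenario: repeatedly connecting to a closed port inside the ephemeral range can eventually get a source port equal to the destination port, at which point the connect succeeds against the socket's own endpoint instead of failing. The host, port, and attempt count here are made up for the example.
```
# Hypothetical demo of TCP self-connect; not Ambari code. Point it at a
# CLOSED port that lies inside /proc/sys/net/ipv4/ip_local_port_range.
import socket

def try_self_connect(host="127.0.0.1", port=50070, attempts=50000):
    for _ in range(attempts):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect((host, port))
            # A closed port normally fails with ECONNREFUSED; success
            # means the kernel assigned source port == destination port
            # and the socket is now connected to itself.
            print("self-connected: %s -> %s" % (s.getsockname(), s.getpeername()))
            return s
        except socket.error:
            s.close()
    return None
```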

The client essentially connects to itself. The WEB alerts set a connection timeout of 10 seconds, so after 10 seconds the connection is released automatically. The METRIC alerts, however, open their sockets through a slightly different mechanism and don't specify a socket timeout. For a METRIC alert, when the source and destination ports are the same, the socket will connect and hold that connection for as long as socket._GLOBAL_DEFAULT_TIMEOUT allows, which could be a very long time.

I believe we need to change the METRIC alerts to pass a timeout value to the socket (between 5 and 10 seconds, just like the WEB alerts).
Since the Hadoop components use ephemeral ports that the OS treats as fair game for any client, collisions like this can still happen. The proposed fix makes the agent release the socket after the timeout, so the agent no longer needs to be restarted once the problem occurs; but it's still possible for the agent to bind to that port when making its check.
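
As a hedged sketch of the proposed direction (the actual change is in metric_alert.py and the service alert scripts; this approximation assumes the alerts fetch metrics over HTTP with urllib2):
```
# Sketch of the fix, not the actual Ambari diff. The key change is the
# explicit timeout: without it, urlopen falls back to
# socket._GLOBAL_DEFAULT_TIMEOUT, which may block indefinitely on a
# stuck (e.g. self-connected) socket.
import urllib2

CONNECTION_TIMEOUT = 5.0  # seconds; the review suggests 5-10, like the WEB alerts

def fetch_metrics(url):
    response = urllib2.urlopen(url, timeout=CONNECTION_TIMEOUT)
    try:
        return response.read()
    finally:
        response.close()
```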


Diffs
-----

  ambari-agent/src/main/python/ambari_agent/alerts/metric_alert.py 8b5f15d 
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py 032310d 
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_ha_namenode_health.py 058b7b2 
  ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_webhcat_server.py fb6c4c2 
  ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py 8c72f4c 
  ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanagers_summary.py b297b0c 
  ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/HDFS/package/files/alert_checkpoint_time.py 032310d 
  ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/HDFS/package/files/alert_ha_namenode_health.py 058b7b2 
  ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/WEBHCAT/package/files/alert_webhcat_server.py fb6c4c2 
  ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/YARN/package/files/alert_nodemanager_health.py 8c72f4c 

Diff: https://reviews.apache.org/r/33166/diff/


Testing
-------

I was able to force the alerts to use a specific client port (under Python 2.7); I chose 50070 since that's the port in this issue. I then verified that, once bound, the metric alerts did not release the port until the agent was restarted. After the fixes were applied, the agent was still able to bind to 50070, but it released the port after the specified timeout.
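
For reference, a sketch of how a client port can be forced for this kind of test (the source_address argument requires Python 2.7, presumably why the test ran there; the helper name is mine):
```
# Hypothetical test helper, not part of the patch: pin the client side
# of the connection to a chosen source port (e.g. 50070) so the
# collision can be reproduced on demand.
import socket

def connect_from_port(dest_host, dest_port, source_port, timeout=10):
    # source_address (added in Python 2.7) binds the client socket to
    # the given (host, port) before connecting.
    return socket.create_connection((dest_host, dest_port),
                                    timeout=timeout,
                                    source_address=("0.0.0.0", source_port))
```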


Thanks,

Jonathan Hurley


Re: Review Request 33166: Ambari Agent holding socket open on 50070 prevents NN from starting

Posted by Jonathan Hurley <jh...@hortonworks.com>.

> On April 14, 2015, 1:22 p.m., Alejandro Fernandez wrote:
> > Where do we specify what ports ambari-agent is allowed to listen on? Trying to make Hadoop and ambari-agent use disjoint sets is one option, although difficult to enforce, but this is something else we could do to avoid port collisions.

I think you misunderstand the problem here. It's not about the port the Ambari Agent is listening on; it's about the port the OS assigns to the client (source) side of the socket when the agent makes an outbound connection. The Ambari Agent is already bound and running at this point.
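
A tiny illustration of the distinction (hypothetical target host; any reachable server works):
```
# The application never chooses this port; the OS assigns it from the
# ephemeral range at connect() time.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("example.org", 80))
print("OS-assigned client (source) port: %d" % s.getsockname()[1])
s.close()
```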

Also, see https://issues.apache.org/jira/browse/AMBARI-10473

Seems like some Hadoop components are catching on and no longer using the ephemeral ports.


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33166/#review80059
-----------------------------------------------------------




Re: Review Request 33166: Ambari Agent holding socket open on 50070 prevents NN from starting

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33166/#review80059
-----------------------------------------------------------


Where do we specify what ports ambari-agent is allowed to listen on? Trying to make Hadoop and ambari-agent use disjoint sets is one option, although difficult to enforce, but this is something else we could do to avoid port collisions.

- Alejandro Fernandez




Re: Review Request 33166: Ambari Agent holding socket open on 50070 prevents NN from starting

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33166/#review80067
-----------------------------------------------------------

Ship it!

- Alejandro Fernandez

