Posted to dev@ambari.apache.org by "David Miller (JIRA)" <ji...@apache.org> on 2016/01/07 21:03:40 UTC

[jira] [Created] (AMBARI-14580) ams collector clients likely to create self simultaneous open tcp sockets

David Miller created AMBARI-14580:
-------------------------------------

             Summary: ams collector clients likely to create self simultaneous open tcp sockets
                 Key: AMBARI-14580
                 URL: https://issues.apache.org/jira/browse/AMBARI-14580
             Project: Ambari
          Issue Type: Bug
          Components: ambari-metrics
    Affects Versions: 2.1.0
         Environment: IBM BigInsights 4.1
            Reporter: David Miller


Multiple clients connect to the ambari metrics timeline metrics service.

The timeline.metrics.service.webapp.address property in the Advanced ams-site configuration section sets the collector's listening port, 6188 by default.

Many of these clients run on the same host as the collector. If the ambari metrics collector is not listening on this port (such as when it is stopped), a client's connection attempt can turn into a self simultaneous open TCP connection: the OS happens to pick 6188 as the client's ephemeral source port, so the socket ends up connected to itself. See http://stackoverflow.com/questions/5139808/tcp-simultaneous-open-and-self-connect-prevention for a discussion of this condition.
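
As an illustration only, a client could detect this condition by comparing the two endpoints of its own socket right after connecting. The sketch below is not taken from the AMS code base; the helper names is_self_connected and connect_checked, and the 5 second timeout, are hypothetical.

```python
import socket


def is_self_connected(sock: socket.socket) -> bool:
    """Return True when the socket's local and remote endpoints are identical."""
    return sock.getsockname() == sock.getpeername()


def connect_checked(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Connect with a timeout and refuse to keep a socket that is connected to itself.

    Hypothetical helper: a self-connect means nothing was listening on the
    port, so the socket is closed to release the port instead of holding it.
    """
    sock = socket.create_connection((host, port), timeout=timeout)
    if is_self_connected(sock):
        sock.close()
        raise ConnectionError(
            "self-connect on %s:%d; no listener is present" % (host, port)
        )
    return sock
```

Closing the socket as soon as the self-connect is detected frees the port, so a collector that is started later can still bind to it.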

Once this condition is triggered, the ams collector cannot start because the port is now held by the client that tried to connect to it.

Any client which connects to itself expecting to connect to the ams collector appears to hold this connection forever.

We have seen this condition happen twice by accident, and we can reproduce it. While it is possible for any connection whose remote and local addresses are the same, it appears especially likely with connections to the ams collector, probably because the collector usually runs on the same machine as many other services that try to connect to it.

To reproduce the problem:
1. Stop the ambari metrics collector.
2. Wait an unspecified amount of time (hours or days), then check netstat for self simultaneous open connections, i.e. connections whose local and remote host:port are identical, like the one below:
   tcp        0      0 10.93.132.110:6188          10.93.132.110:6188          ESTABLISHED -
3. Attempt to start the ambari metrics collector. It will fail with the error line:
   Caused by: java.net.BindException: Port in use: 0.0.0.0:6188
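
The netstat check in step 2 could be automated roughly as follows. This is a minimal sketch that assumes the Linux net-tools column layout shown above (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name find_self_connections is hypothetical.

```python
from typing import List


def find_self_connections(netstat_output: str) -> List[str]:
    """Return the host:port of every ESTABLISHED TCP line whose two endpoints match."""
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP connection line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if local == remote and state == "ESTABLISHED":
            hits.append(local)
    return hits
```

The input could be captured from a command such as `netstat -tn`; any address the function returns on port 6188 would block the collector from binding.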

Possible Solutions:
* Change collector clients to time out connections when no response or an unexpected response is received (the connected-to-self scenario)
* Enable SO_REUSEADDR to possibly decrease the chance of selecting the same local port as the remote port
* Recommend that users reconfigure their OS's ephemeral port range to exclude the collector listener port
* Increase the reconnect wait time when connecting to the collector
* Others?
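
For the ephemeral port range suggestion, a quick check of whether the collector port can be handed out as a client source port might look like the sketch below. It assumes Linux, where the range is exposed in /proc/sys/net/ipv4/ip_local_port_range as two integers, and it treats both bounds as inclusive; the function names are hypothetical.

```python
from typing import Tuple

# Linux-specific location of the ephemeral (local) port range.
PORT_RANGE_PATH = "/proc/sys/net/ipv4/ip_local_port_range"


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse the 'low high' pair as stored in ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    return low, high


def port_in_ephemeral_range(port: int, low: int, high: int) -> bool:
    """Return True when the port could be chosen as an ephemeral source port."""
    return low <= port <= high


def collector_port_at_risk(port: int = 6188, path: str = PORT_RANGE_PATH) -> bool:
    """Read the current range and report whether the collector port falls inside it."""
    with open(path) as handle:
        low, high = parse_port_range(handle.read())
    return port_in_ephemeral_range(port, low, high)
```

If collector_port_at_risk() returns True on a collector host, either the range or the collector port would need to change for this suggestion to apply.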




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)