Posted to dev@ambari.apache.org by "Hari Sekhon (JIRA)" <ji...@apache.org> on 2015/04/27 12:02:38 UTC

[jira] [Created] (AMBARI-10757) MapReduce History Server curl call gets stuck, agent restarts fail on "Address already in use" error

Hari Sekhon created AMBARI-10757:
------------------------------------

             Summary: MapReduce History Server curl call gets stuck, agent restarts fail on "Address already in use" error
                 Key: AMBARI-10757
                 URL: https://issues.apache.org/jira/browse/AMBARI-10757
             Project: Ambari
          Issue Type: Bug
          Components: ambari-agent
    Affects Versions: 2.0.0
         Environment: HDP 2.2
            Reporter: Hari Sekhon
            Priority: Minor


The curl call to the MapReduce History Server gets stuck, which appears to block the ambari-agent (the typical symptom being no health check report for 3 minutes in the Ambari UI). Restarting ambari-agent then gives the usual "Address already in use" error:
{code}# ps -ef|grep ambari-agent
root     17616 14155  0 10:27 pts/11   00:00:00 curl --negotiate -u   -b /var/lib/ambari-agent/data/tmp/cookies/3a05acb6-5d0c-4b6a-9304-91af19ae4efa -c /var/lib/ambari-agent/data/tmp/cookies/3a05acb6-5d0c-4b6a-9304-91af19ae4efa -sL -w %{http_code} http://host:19888 --connect-timeout 10 -o /dev/null
root     17677 12202  0 10:28 pts/11   00:00:00 grep ambari-agent
# date
Mon Apr 27 10:28:21 BST 2015
...
# date
Mon Apr 27 10:29:11 BST 2015
# ps -ef|grep ambari-agent
root     17616 14155  0 10:27 pts/11   00:00:00 curl --negotiate -u   -b /var/lib/ambari-agent/data/tmp/cookies/3a05acb6-5d0c-4b6a-9304-91af19ae4efa -c /var/lib/ambari-agent/data/tmp/cookies/3a05acb6-5d0c-4b6a-9304-91af19ae4efa -sL -w %{http_code} http://host:19888 --connect-timeout 10 -o /dev/null
{code}
Although a 10-second timeout is passed to curl itself
{code}... --connect-timeout 10 ...{code}
the man page says this applies only to connection initiation. If the connection somehow hangs after it has been established, I believe this option does not help, and that must be what is happening in this case.
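
To illustrate the distinction, here is a minimal sketch of how such a curl command line could be built with an overall time limit in addition to the connect timeout. curl's --max-time option bounds the whole transfer, whereas --connect-timeout bounds only the connection phase. The function name build_status_curl_cmd, the 30-second default, and the "-u :" argument (blanked out in the ps output above) are assumptions for illustration, not the agent's actual code.

```python
def build_status_curl_cmd(url, cookie_file, connect_timeout=10, max_time=30):
    """Return a curl argv list with a connect timeout and a total time limit.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return [
        "curl", "--negotiate",
        "-u", ":",  # assumption: the usual companion to --negotiate
        "-b", cookie_file, "-c", cookie_file,
        "-sL", "-w", "%{http_code}", url,
        # Limits only the time allowed to establish the connection.
        "--connect-timeout", str(connect_timeout),
        # Limits the total time of the whole operation, including a
        # transfer that stalls after the connection is established.
        "--max-time", str(max_time),
        "-o", "/dev/null",
    ]


if __name__ == "__main__":
    # Placeholder cookie path, for demonstration only.
    print(" ".join(build_status_curl_cmd("http://host:19888", "/tmp/example-cookie")))
```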

After killing the curl call, a stuck 'df' command was still holding the port, as described in AMBARI-8768. Killing that finally freed the port and allowed the Ambari agent restart to succeed and heartbeat back to the Ambari server.

This is related to AMBARI-8768 in that it is basically the same type of problem: commands are run without hard timeouts in the code. It is also a similar type of problem to AMBARI-10495 and AMBARI-9197, all of which come down to Ambari not applying generic timeouts.

There should be a general change made to Ambari to time out all arbitrary commands and actions after a reasonably long period, configurable at the time of each command call in the code.
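
As a rough sketch of what such a per-call hard timeout could look like, the following uses the standard library's subprocess.run with its timeout argument, which kills the child process when the limit is exceeded. It assumes Python 3.5 or later; the function name run_with_timeout and its return convention are hypothetical and not taken from the Ambari code base.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd (an argv list) with a hard time limit.

    Returns (returncode, output_bytes), or (None, None) if the command
    was killed for exceeding timeout_seconds. Hypothetical helper.
    """
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            timeout=timeout_seconds,   # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None, None
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # Each call site chooses its own limit, e.g. 60 seconds for 'df'.
    print(run_with_timeout(["df", "-h"], 60))
```

One caveat: a process blocked in uninterruptible I/O may not exit promptly even after being killed, so a wrapper like this bounds the common case rather than every possible hang.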

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)