
Posted to user@ambari.apache.org by Jonathan Hurley <jh...@hortonworks.com> on 2015/05/05 17:35:55 UTC

Re: Restarting nodes to avoid HTTP 403

Can you provide some more information on your environment, such as:

1) Version of Ambari
2) Whether the environment is kerberized
3) Are you running the Ambari agent as root, or as another user?
4) Any information from the ambari-agent.log file that might seem to indicate a problem
5) You said that restarting the agents resolves the issue. Does it continue to happen after restarting? If so, how long before new warnings start to show up?

I’m guessing you have a kerberized environment running Ambari 2.0. Ambari will use curl in this case to attempt to make a connection to the web endpoints. It uses the keytabs and principals defined on the alert definition. For NameNode, as an example, it would use:

hdfs-site/dfs.web.authentication.kerberos.keytab
hdfs-site/dfs.web.authentication.kerberos.principal

You’ll want to verify that these properties are correctly set and that the keytab file is accessible to the agent user.

It could also be a cache problem, as the agents cache the Kerberos credentials in the agent’s temp directory. Knowing how long it takes before the alerts start producing warnings would help in determining whether it’s a caching issue.

On May 5, 2015, at 10:49 AM, Chanel Loïc <lo...@worldline.com> wrote:

Hi,

I am currently having some issues with my cluster nodes. According to the Ambari web UI, I have 11 alerts caused by all nodes (except the monitoring one) returning an HTTP 403 response for services such as NameNode web UI, DataNode web UI, or NodeManager Health.

The first odd thing is that the supposedly forbidden ports are in fact reachable (via cURL, for example) and report that the cluster is totally fine.
The second is that the 11 alerts magically disappear when I restart the agent on the affected hosts.

Does anyone know where these errors might come from?

Thanks,

Loïc

________________________________

This e-mail and the documents attached are confidential and intended solely for the addressee; it may also be privileged. If you receive this e-mail in error, please notify the sender immediately and destroy it. As its integrity cannot be secured on the Internet, the Worldline liability cannot be triggered for the message content. Although the sender endeavours to maintain a computer virus-free network, the sender does not warrant that this transmission is virus-free and will not be liable for any damages resulting from any virus transmitted.

RE: Restarting nodes to avoid HTTP 403

Posted by Chanel Loïc <lo...@worldline.com>.

Hi Jonathan,

I executed your code, and it results in a 200.
I may have a lead on where that proxy issue comes from. I'll dig into it further and keep you posted.

Thanks,

Loïc

From: Jonathan Hurley [mailto:jhurley@hortonworks.com]
Sent: Thursday, May 7, 2015 18:53
To: user@ambari.apache.org
Subject: Re: Restarting nodes to avoid HTTP 403

All of this really seems to point to some kind of firewall/proxy issue on the agent hosts. From your agent named "vm-03cfbc97-f027-46fe-8e65-cb8c54edf377.frida.priv.atos.fr", could you try the following Python code:

>>> import urllib2
>>> response = urllib2.urlopen("http://vm-03cfbc97-f027-46fe-8e65-cb8c54edf377.frida.priv.atos.fr:8042/ws/v1/node/info", timeout=10.0)
>>> print(response.code)

I'm really curious if executing that from your agent host results in a 200 or a 403.
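For readers on a host where only Python 3 is installed, the same check can be sketched with urllib.request (urllib2 was split up in Python 3). The helper below is an illustration, not part of Ambari: fetch_status and its use_env_proxies parameter are names made up for this sketch. It returns the status code even when the server answers with an error such as 403, and it can optionally ignore any proxy settings picked up from the environment so that the two results can be compared.

```python
import sys
import urllib.error
import urllib.request


def fetch_status(url, use_env_proxies=True, timeout=10.0):
    """Return the HTTP status code for url, including error codes such as 403."""
    if use_env_proxies:
        # Default behaviour: proxies are taken from the process environment.
        proxy_handler = urllib.request.ProxyHandler()
    else:
        # An empty mapping disables proxying for this opener only.
        proxy_handler = urllib.request.ProxyHandler({})
    opener = urllib.request.build_opener(proxy_handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we compare.
        return err.code


if __name__ == "__main__" and len(sys.argv) > 1:
    target = sys.argv[1]
    print("with environment proxies:", fetch_status(target, True))
    print("with proxies disabled:   ", fetch_status(target, False))
```

Passing the NodeManager URL from the snippet above as the only argument prints both results. A 403 in the first line and a 200 in the second would be consistent with the proxy theory, though it would not prove it.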

On May 7, 2015, at 11:45 AM, Chanel Loïc <lo...@worldline.com> wrote:

Hi Jonathan,

Here are the answers to your questions:

1) Which operating system are you running the agents on?
--> CentOS6

2) Is there a Linux proxy setup, such as "export http_proxy=foo" - curl doesn't respect this proxy setting but Python's urllib2 does, which has caused some issues before
--> There is a proxy, but there is no such setup ("echo $http_proxy" returns nothing, and I do not remember configuring Ambari to access the Internet through the proxy)

3) Which stack are you deploying?
--> I am deploying HDP-2.2.4.2-2
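One caveat with "echo $http_proxy": it only shows the interactive shell's environment, which may differ from the environment the agent process was started with. On Linux, a process's startup environment can be read from /proc/<pid>/environ (usually as root). The sketch below is only an illustration under that assumption; proxy_vars and agent_proxy_vars are hypothetical helper names, and the agent's PID has to be looked up separately.

```python
from pathlib import Path


def proxy_vars(environ_blob):
    """Extract proxy-related variables from a /proc/<pid>/environ blob."""
    found = {}
    # Entries are NUL-separated NAME=value pairs.
    for entry in environ_blob.split(b"\0"):
        name, sep, value = entry.partition(b"=")
        if sep and name.lower().endswith(b"_proxy"):
            found[name.decode()] = value.decode(errors="replace")
    return found


def agent_proxy_vars(pid):
    """Read the startup environment of a running process (Linux only)."""
    return proxy_vars(Path("/proc/%d/environ" % pid).read_bytes())
```

An empty result for the agent's PID would suggest that no proxy variable was inherited at startup; a non-empty one on a freshly installed agent that disappears after a restart would be worth reporting back to the list.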

Have a nice weekend,

Loïc

From: Jonathan Hurley [mailto:jhurley@hortonworks.com]
Sent: Thursday, May 7, 2015 17:34
To: user@ambari.apache.org
Subject: Re: Restarting nodes to avoid HTTP 403

The logs indicate that the alerts are running correctly and are simply hitting a 403. Normally, you might encounter this kind of problem from Python when making web connections without specifying a known user agent header. Python's default header sometimes causes issues since it's not standard. However, that problem would continue to happen after you restarted the agents. The fact that a simple agent restart completely fixes the issue is baffling.

I've certainly never seen this type of behavior before. I'd like to know a few more details on your environment:
1) Which operating system are you running the agents on?
2) Is there a Linux proxy setup, such as "export http_proxy=foo" - curl doesn't respect this proxy setting but Python's urllib2 does, which has caused some issues before
3) Which stack are you deploying?
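To see what Python itself would do with the proxy settings in question 2, the standard library can be asked directly. The following is a rough Python 3 sketch (the agent in this thread runs Python 2, where the equivalent functions live in urllib); describe_proxy_use is a name invented for this example.

```python
import urllib.request
from urllib.parse import urlsplit


def describe_proxy_use(url):
    """Report which proxy, if any, urllib would pick for url in this process."""
    # On Linux this reads the *_proxy variables of the current process.
    proxies = urllib.request.getproxies()
    parts = urlsplit(url)
    proxy = proxies.get(parts.scheme)
    if proxy is None:
        return "direct (no %s proxy configured)" % parts.scheme
    # no_proxy entries exempt matching hosts from proxying.
    if urllib.request.proxy_bypass(parts.hostname):
        return "direct (host matches no_proxy)"
    return "via " + proxy
```

The answer reflects only the environment of the process that runs it, so running it from a login shell does not necessarily describe what a long-running daemon sees.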

On May 7, 2015, at 4:05 AM, Chanel Loïc <lo...@worldline.com> wrote:

Hi Jonathan,

Attached is the output I get from the API URL you gave me.
As far as the log file is concerned, here is an extract from the one on the NameNode:

WARNING 2015-05-07 10:01:47,221 base_alert.py:365 - [Alert][namenode_directory_status] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
WARNING 2015-05-07 10:01:47,222 base_alert.py:365 - [Alert][datanode_health_summary] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
INFO 2015-05-07 10:01:47,228 scheduler.py:509 - Running job "e89e2c29-1f2c-4bf5-a37c-7a6c5b43433a (trigger: interval[0:01:00], next run at: 2015-05-07 10:02:47.211937)" (scheduled at 2015-05-07 10:01:47.211937)
INFO 2015-05-07 10:01:47,229 scheduler.py:509 - Running job "ad0e0fd3-d5f3-471f-955e-ec10f53cdd5b (trigger: interval[0:01:00], next run at: 2015-05-07 10:01:47.212773)" (scheduled at 2015-05-07 10:01:47.212773)
WARNING 2015-05-07 10:01:47,230 base_alert.py:365 - [Alert][namenode_webui] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
INFO 2015-05-07 10:01:47,230 scheduler.py:527 - Job "71988556-19e3-4871-92e0-e3c0a838df13 (trigger: interval[0:01:00], next run at: 2015-05-07 10:02:47.206667)" executed successfully
INFO 2015-05-07 10:01:47,238 scheduler.py:509 - Running job "2de4dc85-8993-4c68-9915-db25c4313d6e (trigger: interval[0:01:00], next run at: 2015-05-07 10:02:47.213529)" (scheduled at 2015-05-07 10:01:47.213529)
WARNING 2015-05-07 10:01:47,240 base_alert.py:140 - [Alert][datanode_health_summary] Unable to execute alert. HTTP Error 403: Forbidden
INFO 2015-05-07 10:01:47,242 scheduler.py:527 - Job "5bcf7b5d-73e7-4d59-bc0a-f4772d4e3166 (trigger: interval[0:01:00], next run at: 2015-05-07 10:02:47.208176)" executed successfully
INFO 2015-05-07 10:01:47,241 scheduler.py:509 - Running job "ff1fc102-bdec-4858-baf5-8b60de4488e4 (trigger: interval[0:01:00], next run at: 2015-05-07 10:01:47.214826)" (scheduled at 2015-05-07 10:01:47.214826)
WARNING 2015-05-07 10:01:47,244 base_alert.py:365 - [Alert][yarn_resourcemanager_webui] HA nameservice value is present but there are no aliases for {{yarn-site/yarn.resourcemanager.ha.rm-ids}}
INFO 2015-05-07 10:01:47,240 scheduler.py:527 - Job "72e5031e-4c2a-4236-b09e-6a749100bc9a (trigger: interval[0:01:00], next run at: 2015-05-07 10:02:47.203349)" executed successfully
INFO 2015-05-07 10:01:47,250 scheduler.py:509 - Running job "a083e3c2-d4b2-430b-9b64-b60c783a06a5 (trigger: interval[0:01:00], next run at: 2015-05-07 10:01:47.218706)" (scheduled at 2015-05-07 10:01:47.218706)
INFO 2015-05-07 10:01:47,250 scheduler.py:509 - Running job "7e9470cb-043b-48a8-990a-0050f2c63311 (trigger: interval[0:01:00], next run at: 2015-05-07 10:02:47.217519)" (scheduled at 2015-05-07 10:01:47.217519)
WARNING 2015-05-07 10:01:47,253 base_alert.py:140 - [Alert][namenode_directory_status] Unable to execute alert. HTTP Error 403: Forbidden

So yes, I can see output in the logs corresponding to the alerts I get in the Ambari web app.
Please let me know if you need any additional information about my problem.
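For longer log files, the "[Alert]" entries can be picked out with a small script. The sketch below assumes the line layout seen in the extract above (level, timestamp, source location, then "[Alert][name] message"); the function names and the regular expression are illustrative and have not been checked against other Ambari versions.

```python
import re

# Layout assumed from the extract above:
# LEVEL YYYY-MM-DD HH:MM:SS,mmm file.py:NNN - [Alert][name] message
ALERT_LINE = re.compile(
    r"^(?P<level>[A-Z]+) "
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\S+ - \[Alert\]\[(?P<name>[^\]]+)\] (?P<message>.*)$"
)


def alert_messages(lines):
    """Yield (level, time, name, message) for every [Alert] log line."""
    for line in lines:
        match = ALERT_LINE.match(line.rstrip("\n"))
        if match:
            yield match.group("level", "time", "name", "message")


def failing_alerts(lines, needle="HTTP Error 403"):
    """Return the sorted names of alerts whose message contains needle."""
    return sorted(
        {name for _, _, name, message in alert_messages(lines) if needle in message}
    )
```

Called on the lines of /var/log/ambari-agent/ambari-agent.log, failing_alerts lists which alert definitions are hitting the 403, which makes it easier to compare hosts.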

Thanks,

Loïc

From: Jonathan Hurley [mailto:jhurley@hortonworks.com]
Sent: Wednesday, May 6, 2015 16:24
To: user@ambari.apache.org
Subject: Re: Restarting nodes to avoid HTTP 403

OK, so I think I have a clear picture of how you get to this situation. I'd still like to know a few things:

1) When you have the warnings present in the web client, can you try the alerts URL I posted below to see the actual alerts coming back from the API? I'm mostly interested in whether any come back in the first place, and what the most recent timestamp was (indicating they are actually running and still reporting a warning status).

2) When the alerts are present in the web client, do you see any output in the agent log file that I mentioned for alerts that start with [Alert]?

On May 6, 2015, at 5:14 AM, Chanel Loïc <lo...@worldline.com> wrote:

I encounter the warnings and critical alerts when deploying the cluster. I install Ambari server 2.0 on one VM and Ambari agent 2.0 on 4 other VMs. Then I give the Ambari server a blueprint exported from a working cluster, so that the new cluster is instantiated with a quick-start configuration.

Then, when logging into the Ambari web application to ensure everything is running properly, I have these alerts concerning the HTTP 403 errors returned by all host VMs but the one which only handles Ambari metrics.

As I am not sure my explanations are quite understandable, do not hesitate to tell me if something remains unclear.
Thanks,

From: Jonathan Hurley [mailto:jhurley@hortonworks.com]
Sent: Tuesday, May 5, 2015 20:38
To: user@ambari.apache.org
Subject: Re: Restarting nodes to avoid HTTP 403

If restarting the agents fixes everything, can you explain when you first encounter the warnings? Is this only after a cluster deployment? You can also check to see if this is some sort of web client issue by issuing the following GET:

http://server/api/v1/clusters/<cluster>/alerts?fields=*&Alert/state.in(CRITICAL,WARNING)

This will show you alerts which are actually being returned from the agents in a warning or critical state.
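The same GET can be scripted. This is a minimal Python 3 sketch, assuming HTTP basic authentication and a JSON response whose "items" each hold an "Alert" object with "state", "definition_name" and "latest_timestamp" fields; those field names, like the helper names alerts_url, fetch_alerts and summarize, are assumptions that should be checked against the real response.

```python
import base64
import json
import urllib.request


def alerts_url(server, cluster):
    """Build the alerts query shown above for a given server and cluster."""
    return ("http://%s/api/v1/clusters/%s/alerts"
            "?fields=*&Alert/state.in(CRITICAL,WARNING)" % (server, cluster))


def fetch_alerts(server, cluster, user, password, timeout=10.0):
    """Issue the GET with basic authentication and decode the JSON body."""
    request = urllib.request.Request(alerts_url(server, cluster))
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Yield (state, definition_name, latest_timestamp); field names assumed."""
    for item in payload.get("items", []):
        alert = item.get("Alert", {})
        yield (alert.get("state"),
               alert.get("definition_name"),
               alert.get("latest_timestamp"))
```

Comparing the latest timestamps across two runs a minute or so apart shows whether the agents are still actively reporting the warning state or whether the web client is displaying stale data.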

For reference, you can also look in /var/log/ambari-agent/ambari-agent.log to see if you see any alert issues. Most important messages are prefixed with "[Alert]".

On May 5, 2015, at 11:43 AM, Chanel Loïc <lo...@worldline.com> wrote:

Indeed, I did not give much information about my problem, sorry about that. Here are the answers:

1) Version of Ambari
--> I'm using Ambari 2.0

2) Whether the environment is kerberized
--> No, it's not. Kerberos security is not enabled on this cluster.

3) Are you running the Ambari agent as root, or another user
--> I am running it as root

4) Any information from the ambari-agent.log file that might seem to indicate a problem
--> Here is another odd thing I did not mention: I could not find any log entries referring to the problem.

5) You said that restarting the agents resolves the issue. Does it continue to happen after restarting? If so, how long before new warnings start to show up
--> Restarting the agent totally resolves the problem. It does not happen anymore, and everything runs quite normally.


RE: Restarting nodes to avoid HTTP 403

Posted by Chanel Loïc <lo...@worldline.com>.

Hi Jonathan,

What you describe is actually pretty much what I did.
"tcpdump tcp dst port 8042" should capture the requests from the agent to port 8042.
So, in my opinion, the requests are not even made by the agent. What would that indicate?

Regards,

Loïc

From: Jonathan Hurley [mailto:jhurley@hortonworks.com]
Sent: Monday, May 18, 2015 21:19
To: user@ambari.apache.org
Subject: Re: Restarting nodes to avoid HTTP 403

That's kind of what I expected. I believe that the web request isn't even making it to the host; that there's something in your environment redirecting your request and returning a 403. I'd instead do a packet capture the other way, from the agent, showing outbound requests on 8042.
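Alongside a packet capture, the 403 response itself may hint at who produced it. The Python 3 sketch below (describe_response is a made-up helper name) keeps the headers of error responses instead of discarding them. It uses urlopen's default behaviour on purpose, so that any proxy configured for the process is still applied.

```python
import urllib.error
import urllib.request


def describe_response(url, timeout=10.0):
    """Return (status, headers) for url, keeping headers of error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as err:
        # The error object carries the headers of the 4xx/5xx response.
        return err.code, dict(err.headers.items())
```

If the headers of the 403 (for instance Server or Via) do not look like those returned by a successful request to the same endpoint, that would support the idea that something other than the Hadoop service is answering.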

On May 18, 2015, at 11:44 AM, Chanel Loïc <lo...@worldline.com> wrote:

Hi Jonathan,

First of all, sorry for my (very) late answer.
I confirm that restarting the agents before installing the cluster fixes the issue. What makes things slightly more complicated is that there is no network trace.
The command "sudo tcpdump -i lo -l -s0 -w - tcp dst port 8042 | strings &", executed on the host returning the 403, outputs absolutely nothing.

Does that help you?
If you want any additional information, feel free to ask.

Regards,

Loïc

From: Jonathan Hurley [mailto:jhurley@hortonworks.com]
Sent: Wednesday, May 13, 2015 23:39
To: user@ambari.apache.org
Subject: Re: Restarting nodes to avoid HTTP 403

Well, I'm really baffled by this. Restarting the agents before installing a cluster fixes the issue as well? So it seems like after agents are installed they are not able to make connections from python without getting a 403 forbidden until they are restarted at least once. Is it possible to get a network trace of the agents when they encounter the 403 forbidden? That way we can see the communication path between the agent and a particular endpoint, like NameNode WebUI.

On May 13, 2015, at 8:31 AM, Chanel Loïc <lo...@worldline.com> wrote:

Hi Jonathan,

Some additional information: the restart does not necessarily have to happen AFTER the cluster has been deployed using a blueprint. Restarting the ambari-agent before using a blueprint to deploy the cluster produces a perfectly clean cluster, without the 11 warnings/errors I mentioned in my previous emails.

Hope this will help understand where the problem comes from.
Have a nice day,

Loïc

From: Jonathan Hurley [mailto:jhurley@hortonworks.com]
Sent: Thursday, May 7, 2015 18:53
To: user@ambari.apache.org
Subject: Re: Restarting nodes to avoid HTTP 403