Posted to user@hadoop.apache.org by Chris Nauroth <cn...@apache.org> on 2023/01/03 19:17:14 UTC

Re: stale_status_of_NM_from_standby_RM

You can only run "yarn rmadmin -refreshNodes" against the active
ResourceManager instance. In an HA deployment, a standby instance would
return a "not active" error if it received this call, and then the client
would fail over to the other instance and retry.

The ResourceManagers do not synchronize the state of include/exclude files.
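For reference, a typical decommission flow in an HA cluster looks roughly
like the following sketch. The exclude file path and host names below are
illustrative placeholders; check the actual path configured for
yarn.resourcemanager.nodes.exclude-path in your yarn-site.xml.

```shell
# 1. On the active ResourceManager host, add the NodeManager's host name to
#    the exclude file referenced by yarn.resourcemanager.nodes.exclude-path.
#    (Path and host name below are examples, not real values.)
echo "worker-3.example.internal" >> /etc/hadoop/conf/yarn.exclude

# 2. Ask the ResourceManager to re-read the include/exclude files. If the
#    client first contacts a standby, it fails over to the active and retries.
yarn rmadmin -refreshNodes

# 3. Verify the node's state via the REST API; -L follows the 307 redirect
#    a standby returns, so the response comes from the active RM.
curl -L 'http://rm.example.internal:8088/ws/v1/cluster/nodes/worker-3.example.internal:8026'
```

These commands require a running cluster, so treat this as an administrative
sketch rather than something to paste verbatim.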

Chris Nauroth


On Wed, Dec 28, 2022 at 11:08 PM Dong Ye <ye...@gmail.com> wrote:

> Hi, Chris:
>
>         Thank you very much! Yes, I am also concerned with the
> decommissioning of a NodeManager in a ResourceManager High Availability
> scenario.
>
> To decommission a node manager, can I add its address to a standby RM's
> exclude.xml and run "yarn rmadmin -refreshNodes"? Or can I only do that on
> the active RM? Do the RMs sync the exclude/include XML file?
>
> Thanks.
> Have a nice holiday.
>
>
> On Tue, Dec 27, 2022 at 11:44 AM Chris Nauroth <cn...@apache.org>
> wrote:
>
>> Every NodeManager registers and heartbeats to the active ResourceManager
>> instance, which acts as the source of truth for cluster node status. If the
>> active ResourceManager terminates, then another becomes active, and every
>> NodeManager will start a new connection to register and heartbeat with that
>> new active ResourceManager.
>>
>> As such, a standby ResourceManager cannot satisfy requests for node
>> status and instead will redirect to the current active:
>>
>> curl -i '
>> http://cnauroth-ha-m-2:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
>> '
>> HTTP/1.1 307 Temporary Redirect
>> Date: Tue, 27 Dec 2022 19:28:38 GMT
>> Cache-Control: no-cache
>> Expires: Tue, 27 Dec 2022 19:28:38 GMT
>> Date: Tue, 27 Dec 2022 19:28:38 GMT
>> Pragma: no-cache
>> Content-Type: text/plain;charset=utf-8
>> X-Content-Type-Options: nosniff
>> X-XSS-Protection: 1; mode=block
>> X-Frame-Options: SAMEORIGIN
>> Location:
>> http://cnauroth-ha-m-1.us-central1-c.c.hadoop-cloud-dev.google.com.internal.:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
>> Content-Length: 136
>>
>> If it looked like you were able to query a standby, then perhaps you were
>> using a browser or some other client that automatically follows redirects
>> (e.g. curl -L)?
>>
>> The data really would have come from the active though, so you can trust
>> that it's not stale. The only thing you might have to consider is that
>> after a failover, it might take a while before every NodeManager registers
>> with the new ResourceManager.
>>
>> Separately, if you're concerned about divergence of the node include/exclude
>> files, you can configure them to be stored on a shared file system (e.g.
>> your preferred cloud object store) so that all ResourceManager instances
>> read the same copy.
>>
>> Chris Nauroth
>>
>>
>> On Sat, Dec 24, 2022 at 6:27 PM Dong Ye <ye...@gmail.com> wrote:
>>
>>> Hi, All:
>>>
>>>     I have some questions about the state of the node manager. If I use
>>> the REST API
>>>
>>>    - http://rm-http-address:port/ws/v1/cluster/nodes/{nodeid}
>>>
>>> to get node manager state from a standby RM,
>>> 1) is it possible that the state could be stale?
>>> 2) If so, how long until the node manager state is updated?
>>> 3) Is it possible that the NM state returned by a standby RM differs
>>> significantly from that returned by the active RM? Say, one returns
>>> RUNNING while the other returns DECOMMISSIONED because the local
>>> exclude.xml files have diverged?
>>>
>>> Thanks.
>>> Have a good holiday.
>>>
>>