Posted to dev@ambari.apache.org by "Greg Hill (JIRA)" <ji...@apache.org> on 2015/03/03 18:45:04 UTC

[jira] [Updated] (AMBARI-9902) Decommission DATANODE silently fails if in maintenance mode

     [ https://issues.apache.org/jira/browse/AMBARI-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Hill updated AMBARI-9902:
------------------------------
    Description: 
If you set maintenance mode on multiple hosts and then attempt to decommission the DATANODE on those hosts, the request reports success, but it does not actually decommission any nodes in HDFS.  This can lead to data loss, since the customer might assume it is safe to remove those hosts from the pool.

The request looks like:
{noformat}
         "RequestInfo": {
                "command": "DECOMMISSION",
                "context": "Decommission DataNode”),
                "parameters": {"slave_type": “DATANODE", "excluded_hosts": “slave-3.local,slave-1.local"},
                "operation_level": {
                    “level”: “CLUSTER”,
                    “cluster_name”: cluster_name
                },
            },
            "Requests/resource_filters": [{
                "service_name": “HDFS",
                "component_name": “NAMENODE",
            }],
{noformat}
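
For reference, a minimal sketch of how such a request can be submitted against the Ambari REST API with Python's requests library; the server address, cluster name, and credentials below are placeholders, not values from this report:

{noformat}
# Sketch: submit the DECOMMISSION custom command via the Ambari REST API.
# Server address, cluster name, and credentials are placeholders.
import json
import requests

AMBARI = "http://ambari-server.local:8080"   # placeholder
CLUSTER = "testcluster"                      # placeholder
AUTH = ("admin", "admin")                    # placeholder credentials

body = {
    "RequestInfo": {
        "command": "DECOMMISSION",
        "context": "Decommission DataNode",
        "parameters": {
            "slave_type": "DATANODE",
            "excluded_hosts": "slave-3.local,slave-1.local",
        },
        "operation_level": {
            "level": "CLUSTER",
            "cluster_name": CLUSTER,
        },
    },
    "Requests/resource_filters": [
        {"service_name": "HDFS", "component_name": "NAMENODE"},
    ],
}

resp = requests.post(
    "%s/api/v1/clusters/%s/requests" % (AMBARI, CLUSTER),
    data=json.dumps(body),
    auth=AUTH,
    headers={"X-Requested-By": "ambari"},  # Ambari rejects POSTs without this header
)
print(resp.status_code, resp.text)
{noformat}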

The task output makes it look like it worked:

{noformat}
File['/etc/hadoop/conf/dfs.exclude'] {'owner': 'hdfs', 'content': Template('exclude_hosts_list.j2'), 'group': 'hadoop'}
Execute[''] {'user': 'hdfs'}
ExecuteHadoop['dfsadmin -refreshNodes'] {'bin_dir': '/usr/hdp/current/hadoop-client/bin', 'conf_dir': '/etc/hadoop/conf', 'kinit_override': True, 'user': 'hdfs'}
Execute['hadoop --config /etc/hadoop/conf dfsadmin -refreshNodes'] {'logoutput': False, 'path': ['/usr/hdp/current/hadoop-client/bin'], 'tries': 1, 'user': 'hdfs', 'try_sleep': 0}
{noformat}

But it did not actually write any content to the file.  If it had, this line would have appeared in the output:

{noformat}
Writing File['/etc/hadoop/conf/dfs.exclude'] because contents don't match
{noformat}
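
A quick way to confirm this on the NameNode host is to compare the exclude file against the intended host list (a sketch; the path is taken from the task output above, so adjust it if your conf dir differs):

{noformat}
# Sketch: verify whether the intended hosts actually landed in dfs.exclude.
# Path matches the task output above; adjust for your layout if needed.
EXCLUDE_FILE = "/etc/hadoop/conf/dfs.exclude"
expected = {"slave-3.local", "slave-1.local"}

with open(EXCLUDE_FILE) as f:
    actual = {line.strip() for line in f if line.strip()}

missing = expected - actual
if missing:
    print("NOT excluded (bug reproduced): %s" % ", ".join(sorted(missing)))
else:
    print("exclude file contains the expected hosts")
{noformat}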

The command JSON file for the task has the correct host list as a parameter:

{noformat}
"commandParams": {
        "service_package_folder": "HDP/2.0.6/services/HDFS/package",
        "update_exclude_file_only": "false",
        "script": "scripts/namenode.py",
        "hooks_folder": "HDP/2.0.6/hooks",
        "excluded_hosts": "slave-3.local,slave-1.local",
        "command_timeout": "600",
        "slave_type": "DATANODE",
        "script_type": "PYTHON"
    },
{noformat}

So something outside of the command JSON is filtering the list.

If maintenance mode is not set, everything works as expected.  I don't believe there is a legitimate reason to disallow decommissioning nodes in maintenance mode, since set maintenance, decommission, then remove seems to be the expected course of action for dealing with a problematic host.
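
Purely as an illustration of the suspected behavior (this is hypothetical code, not Ambari's actual implementation): if the excluded-hosts list is filtered against hosts in maintenance mode somewhere between the command JSON and the template rendering, the list comes out empty, the template renders no new content, and the File resource has nothing to write, which matches the symptoms above:

{noformat}
# Hypothetical sketch of the suspected filtering; NOT Ambari source code.
def filter_excluded_hosts(excluded_hosts, maintenance_hosts):
    # Drop any excluded host that is currently in maintenance mode.
    return [h for h in excluded_hosts if h not in maintenance_hosts]

excluded = ["slave-3.local", "slave-1.local"]
maintenance = {"slave-3.local", "slave-1.local"}   # both hosts were in maintenance

print(filter_excluded_hosts(excluded, maintenance))
# [] -> dfs.exclude is rendered with no hosts, nothing is written,
# no DataNode is actually decommissioned, yet the task still "succeeds".
{noformat}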


> Decommission DATANODE silently fails if in maintenance mode
> -----------------------------------------------------------
>
>                 Key: AMBARI-9902
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9902
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>            Reporter: Greg Hill
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)