Posted to dev@ambari.apache.org by Jonathan Hurley <jh...@hortonworks.com> on 2015/06/02 17:36:24 UTC

Review Request 34947: Datanode Shutdown Retries During Upgrade Are Too Long

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34947/
-----------------------------------------------------------

Review request for Ambari, Alejandro Fernandez and Nate Cole.


Bugs: AMBARI-11624
    https://issues.apache.org/jira/browse/AMBARI-11624


Repository: ambari


Description
-------

See HDFS-8510.

During upgrade from HDP 2.2 to HDP 2.3, even with 4 DataNodes in the cluster, HBase still goes down during the core slaves portion. This is because the DataNode upgrade takes too long. The default value of {{ipc.client.connect.retry.interval}} in the HDP stack is greater than the 30-second period after which the DataNode is marked as dead.
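
To confirm the effective values on a given host, one way (assuming the client configuration is on the default classpath) is {{hdfs getconf}}:

{code}
hdfs getconf -confKey ipc.client.connect.max.retries
hdfs getconf -confKey ipc.client.connect.retry.interval
{code}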

Notice that after the shutdown command, it takes 52 seconds (13:13:40 to 13:14:32 below) for {{dfsadmin}} to report that the DataNode is down:

{noformat}
2015-05-29 13:13:27,222 - hadoop-hdfs-datanode is currently at version 2.2.7.0-2808
2015-05-29 13:13:27,306 - Execute['hdfs dfsadmin -shutdownDatanode 0.0.0.0:8010 upgrade'] {'tries': 1, 'user': 'hdfs'}
2015-05-29 13:13:29,003 - Execute['hdfs dfsadmin -getDatanodeInfo 0.0.0.0:8010'] {'tries': 1, 'user': 'hdfs'}
2015-05-29 13:13:30,648 - DataNode has not shutdown.
2015-05-29 13:13:40,655 - Execute['hdfs dfsadmin -getDatanodeInfo 0.0.0.0:8010'] {'tries': 1, 'user': 'hdfs'}
2015-05-29 13:14:32,280 - DataNode has successfully shutdown for upgrade.
2015-05-29 13:14:32,327 - Execute['hdp-select set hadoop-hdfs-datanode 2.3.0.0-2162'] {}
...
2015-05-29 13:14:32,835 - Execute['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ;  /usr/hdp/2.3.0.0-2162/hadoop/sbin/hadoop-daemon.sh --config /usr/hdp/2.3.0.0-2162/hadoop/conf start datanode''] {'environment': {'HADOOP_LIBEXEC_DIR': '/usr/hdp/2.3.0.0-2162/hadoop/libexec'}, 'not_if': 'ls /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid` >/dev/null 2>&1'}
2015-05-29 13:14:36,954 - Executing DataNode Rolling Upgrade post-restart
2015-05-29 13:14:36,957 - Checking that the DataNode has rejoined the cluster after upgrade...
...
2015-05-29 13:14:40,281 - DataNode jhurley-hdp22-ru-5.c.pramod-thangali.internal reports that it has rejoined the cluster.
{noformat}

As DataNodes are upgraded, we should temporarily override the default retry timeout values:

{code}
hdfs dfsadmin -D ipc.client.connect.max.retries=5 -D ipc.client.connect.retry.interval=1000 -getDatanodeInfo 0.0.0.0:8010
{code}
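
With those overrides, each {{-getDatanodeInfo}} probe gives up after at most roughly 5 retries × 1000 ms, i.e. about 5 seconds, comfortably under the 30-second dead-node window. As a minimal sketch of the polling logic only (plain Python with {{subprocess}} instead of Ambari's {{Execute}} resource; the attempt counts and sleeps here are illustrative, not the values in the patch):

{code}
import subprocess
import time

DATANODE_IPC = "0.0.0.0:8010"

# Probe command with capped IPC retries so each attempt fails fast.
CHECK_CMD = [
    "hdfs", "dfsadmin",
    "-D", "ipc.client.connect.max.retries=5",
    "-D", "ipc.client.connect.retry.interval=1000",
    "-getDatanodeInfo", DATANODE_IPC,
]

def wait_for_datanode_shutdown(attempts=12, pause_secs=10):
    """Poll until -getDatanodeInfo fails, meaning the DataNode is down."""
    for _ in range(attempts):
        # Non-zero exit: the DataNode no longer answers on its IPC port.
        if subprocess.call(CHECK_CMD) != 0:
            return True
        time.sleep(pause_secs)
    return False

if not wait_for_datanode_shutdown():
    raise RuntimeError("DataNode did not shut down for upgrade in time")
{code}

Capping the connect retries bounds shutdown detection by the polling loop itself rather than by the IPC client's internal retry budget.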


Diffs
-----

  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/datanode_upgrade.py 529ca4438 
  ambari-server/src/test/python/stacks/2.0.6/HDFS/test_datanode.py a310bf4 

Diff: https://reviews.apache.org/r/34947/diff/


Testing
-------

----------------------------------------------------------------------
Total run:750
Total errors:0
Total failures:0
OK


Thanks,

Jonathan Hurley


Re: Review Request 34947: Datanode Shutdown Retries During Upgrade Are Too Long

Posted by Nate Cole <nc...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34947/#review86256
-----------------------------------------------------------

Ship it!


Ship It!

- Nate Cole




Re: Review Request 34947: Datanode Shutdown Retries During Upgrade Are Too Long

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34947/#review86254
-----------------------------------------------------------

Ship it!


Ship It!

- Alejandro Fernandez

