Posted to issues@hbase.apache.org by "Tak Lon (Stephen) Wu (JIRA)" <ji...@apache.org> on 2019/02/26 18:55:00 UTC

[jira] [Comment Edited] (HBASE-21666) Break up the TestExportSnapshot UTs; they can timeout

    [ https://issues.apache.org/jira/browse/HBASE-21666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778462#comment-16778462 ] 

Tak Lon (Stephen) Wu edited comment on HBASE-21666 at 2/26/19 6:54 PM:
-----------------------------------------------------------------------

I have done the investigation below, and I found that the hanging/slowness is related to the test node's network setup and local disk issues. I'd like to propose failing fast, instead of timing out at 780+ seconds, where possible.

First of all, the test methods in {{TestExportSnapshot}} contain two phases of operations: operations in the Mini HBase Cluster and operations in the Mini MR Cluster. We are only snapshotting 50 rows into a test table, so the data is very small.
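
For context, here is a minimal sketch of the shape of these tests. This is an illustration only, not the exact test code: the table/snapshot names are invented, and it assumes the {{HBaseTestingUtility}} helpers ({{loadNumericRows}}, {{getDataTestDirOnTestFS}}) and the {{ExportSnapshot}} long options behave as I remember them.
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.snapshot.ExportSnapshot;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.util.ToolRunner;

public class ExportSnapshotTestShape {
  public static void main(String[] args) throws Exception {
    HBaseTestingUtility util = new HBaseTestingUtility();
    TableName tableName = TableName.valueOf("testtb");
    byte[] family = Bytes.toBytes("f");
    String snapshotName = "snaptb";

    // Phase 1: Mini HBase Cluster -- create a tiny table (~50 rows) and snapshot it.
    util.startMiniCluster();
    Table table = util.createTable(tableName, family);
    util.loadNumericRows(table, family, 0, 50);
    util.getAdmin().snapshot(snapshotName, tableName);

    // Phase 2: Mini MR Cluster -- export the snapshot with the ExportSnapshot MR job.
    util.startMiniMapReduceCluster();
    Path copyTo = util.getDataTestDirOnTestFS("export-" + snapshotName);
    int ret = ToolRunner.run(util.getConfiguration(), new ExportSnapshot(),
        new String[] { "--snapshot", snapshotName, "--copy-to", copyTo.toString() });
    if (ret != 0) {
      throw new IllegalStateException("ExportSnapshot exited with " + ret);
    }

    util.shutdownMiniMapReduceCluster();
    util.shutdownMiniCluster();
  }
}
{code}
Both phases are cheap with only 50 rows; the long runtimes come from the environment problems described next, not from the data volume.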

So, the timeout issue is related to the following:
 1. the build node has an "incorrect" network interface setup, such that
      a. it hangs HDFS file operations, e.g.
{quote}2019-02-25 22:28:36,099 ERROR [ClientFinalizer-shutdown-hook] hdfs.DFSClient(949): Failed to close inode 16420
 java.io.EOFException: End of File Exception between local host is: "f45c89a57f29.ant.amazon.com/192.168.1.15"; destination host is: "localhost":54524; : java.io.EOFException; For more details see: [http://wiki.apache.org/hadoop/EOFException]
{quote}
    b. the server (region server or HMaster) cannot be connected to, or regions cannot be assigned and keep retrying until the timeout, e.g.
{quote}2019-02-26 09:27:54,754 DEBUG [RpcServer.default.FPBQ.Fifo.handler=4,queue=0,port=57922] client.RpcRetryingCallerImpl(132): Call exception, tries=10, retries=19, started=96205 ms ago, cancelled=false, msg=Call to f45c89a57f29-2.local/10.63.166.57:57926 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: f45c89a57f29-2.local/10.63.166.57:57926, details=row 'testtb-testExportFileSystemStateWithSkipTmp' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=f45c89a57f29-2.local,57926,1551201763075, seqNum=-1, see [https://s.apache.org/timeout], exception=org.apache.hadoop.hbase.ipc.FailedServerException: Call to f45c89a57f29-2.local/10.63.166.57:57926 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: f45c89a57f29-2.local/10.63.166.57:57926`
{quote}
2. the build node has an out-of-disk-space issue such that the node manager is not in a healthy state, e.g. I saw {{1/1 local-dirs are bad: /yarn/nm; 1/1 log-dirs are bad: /yarn/container-logs}} in the node manager UI, even though we have set {{yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage}} to 99%

Of the above cases, case 1) is a node setup issue (e.g. in {{/etc/hosts}}) that can be fixed by the infra admin or by the contributor who is running the unit test on their own laptop/machine, so we don't need to fix it in code.

For case 2), I'm thinking of setting a new value for {{yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb}}, 128 MB (which should be enough for log-dirs and local-dirs), so that we fail fast when {{[TestExportSnapshot#setUpBeforeClass|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/snapshot/TestExportSnapshot.java#L100-L104]}} starts the mini MR cluster, instead of timing out after 780+ seconds.

In fact, if the build node does not have any of the connection or disk issues above, the average time to run all (7) tests within {{TestExportSnapshot}} is about 280 seconds, and IMO splitting some of the test methods into separate classes won't speed that up, because the tests within each class are executed sequentially. (Are we running test classes in parallel, especially for {{TestExportSnapshot}}, which is labeled as {{LargeTests}}? When I tested with {{mvn test -PrunAllTests -Dtest=TestExportSnapshot}}, I didn't see methods running concurrently, even though I found {{surefire.secondPartForkCount=5}} for {{runAllTests}}. If anyone can confirm that classes do run in parallel, we could also separate each method in {{TestExportSnapshot}} into its own class, as sketched below.)
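
If parallel forks are confirmed, the split could look roughly like the sketch below. This is hypothetical: the class name is invented for illustration, and the shared cluster setup is assumed to move into a base class with no {{@Test}} methods of its own.
{code:java}
import org.apache.hadoop.hbase.HBaseClassTestRule;
import org.apache.hadoop.hbase.testclassification.LargeTests;
import org.junit.ClassRule;
import org.junit.Test;
import org.junit.experimental.categories.Category;

// Hypothetical split: one original test method per class, so surefire can
// schedule the classes across its forked JVMs (forkCount parallelizes across
// classes, not across methods within one class).
@Category({ LargeTests.class })
public class TestExportFileSystemStateWithSkipTmpOnly {

  @ClassRule
  public static final HBaseClassTestRule CLASS_RULE =
      HBaseClassTestRule.forClass(TestExportFileSystemStateWithSkipTmpOnly.class);

  // The @BeforeClass/@AfterClass mini cluster setup and teardown would be
  // shared through a common base class (elided here).

  @Test
  public void testExportFileSystemStateWithSkipTmp() throws Exception {
    // body moved unchanged from TestExportSnapshot
  }
}
{code}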

So, if we think the YARN node manager's disk space issue should make tests fail fast, the proposed code change in {{HBaseTestingUtility#startMiniMapReduceCluster}} would be as below.

Any comments?
{code:java}
@@ -2736,6 +2736,8 @@ public class HBaseTestingUtility extends HBaseZKTestingUtility {
     conf.setIfUnset(
         "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage",
         "99.0");
+    // Make sure we have enough disk space for log-dirs and local-dirs
+    conf.setIfUnset("yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb", "128");
     startMiniMapReduceCluster(2);
     return mrCluster;
   }

{code}
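
As a complementary, purely illustrative idea (not part of the patch above): the test setup could also run an explicit preflight check and abort with a clear message before the mini MR cluster even starts. The helper below is hypothetical and does not exist in {{HBaseTestingUtility}}:
{code:java}
import java.io.File;

// Hypothetical fail-fast helper: abort before starting the mini MR cluster if
// the local disk has less free space than YARN's min-free-space threshold.
public final class DiskSpacePreflight {
  private DiskSpacePreflight() {
  }

  public static void assertFreeSpaceMb(File dir, long minFreeMb) {
    long freeMb = dir.getUsableSpace() / (1024L * 1024L);
    if (freeMb < minFreeMb) {
      throw new IllegalStateException("Only " + freeMb + " MB free under " + dir
          + ", but YARN log-dirs/local-dirs need at least " + minFreeMb
          + " MB; failing fast instead of waiting 780+ seconds for the timeout.");
    }
  }
}
{code}
A call such as {{DiskSpacePreflight.assertFreeSpaceMb(new File(System.getProperty("java.io.tmpdir")), 128)}} at the top of {{setUpBeforeClass}} would surface the disk problem immediately.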



> Break up the TestExportSnapshot UTs; they can timeout
> -----------------------------------------------------
>
>                 Key: HBASE-21666
>                 URL: https://issues.apache.org/jira/browse/HBASE-21666
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: stack
>            Assignee: Tak Lon (Stephen) Wu
>            Priority: Major
>              Labels: beginner
>
> These timed out for [~Apache9] when he ran with the -PrunAllTests. Suggests breaking them up into smaller tests so less likely they'll timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)