Posted to issues@hbase.apache.org by "Allen Wittenauer (JIRA)" <ji...@apache.org> on 2018/01/31 18:31:00 UTC

[jira] [Comment Edited] (HBASE-19902) Current Jenkins Madness: OOME, can't start minihbasecluster, etc.

    [ https://issues.apache.org/jira/browse/HBASE-19902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347336#comment-16347336 ] 

Allen Wittenauer edited comment on HBASE-19902 at 1/31/18 6:30 PM:
-------------------------------------------------------------------

bq. proclimit of 20k instead of 5k (looks like machines allow for 60k fds).

The proclimit option sets ulimit -u, the maximum number of processes allowed.  There is no correlation with fds.  [Yetus does not set that ulimit value.]

The process limit is exceedingly tricky.  There is the actual value set by ulimit -u and friends.  Then there are cgroup settings enforced by systemd.  The cgroup limit (set by the UserTasksMax in systemd settings) is the ultimate authority.  It also counts across the entire node, not by process group or session or any of the other normal boundaries.  The default limit ends up being a bit over 12k on the build nodes.
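
To illustrate how the two layers interact, here is a rough Python sketch (not anything Yetus or Jenkins actually runs). It assumes the systemd limit shows up in a cgroup pids.max file whose content is either a number or the literal "max"; where that file lives depends on the cgroup version and distro, so no path is hard-coded. The function name and the 12288 sample value are made up for illustration.

```python
import resource


def effective_task_limit(rlimit_nproc, pids_max_text):
    """Return the smaller of the ulimit -u value and the cgroup pids.max value.

    rlimit_nproc: soft RLIMIT_NPROC (resource.RLIM_INFINITY means unlimited).
    pids_max_text: raw content of a pids.max file, e.g. "12288\n" or "max\n".
    """
    limits = []
    if rlimit_nproc != resource.RLIM_INFINITY:
        limits.append(rlimit_nproc)
    text = pids_max_text.strip()
    if text != "max":
        limits.append(int(text))
    # None means neither layer imposes a limit.
    return min(limits) if limits else None


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    # Hypothetical sample value; read the real pids.max on the node instead.
    print(effective_task_limit(soft, "12288\n"))
```

With a proclimit of 20k and a cgroup limit around 12k, the sketch returns the cgroup value, which is why raising ulimit -u alone would not help.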

To make matters worse, Java native threads (on Linux, at least) count against this limit.  Running 'ps -L -u jenkins -o lwp' and counting the lines of output will give an approximate idea of how many threads and processes are in play at any given time.  [The number reported by Yetus when in Docker mode is the same kind of count, but it only covers what is running inside the container.]
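
For anyone who wants to script that check, a minimal Python sketch follows. It only wraps the ps invocation quoted above and counts the numeric thread ids in its output; the helper names are hypothetical, and the "jenkins" default user is simply the one from the command above.

```python
import subprocess


def count_lwps(ps_output):
    """Count thread (LWP) ids in the output of 'ps -L -u <user> -o lwp'."""
    lines = [line.strip() for line in ps_output.splitlines()]
    # Skip blank lines and the "LWP" column header; keep only numeric ids.
    return sum(1 for line in lines if line.isdigit())


def count_user_threads(user="jenkins"):
    """Run ps for the given user and count the threads it lists."""
    result = subprocess.run(
        ["ps", "-L", "-u", user, "-o", "lwp"],
        capture_output=True, text=True, check=False,
    )
    return count_lwps(result.stdout)
```

Calling count_user_threads() on a build node gives a point-in-time snapshot only; the count moves constantly while tests are running.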

In the end, this means that all threads/processes consumed by BOTH executors and the jenkins slave process must be less than ~13k. 


> Current Jenkins Madness: OOME, can't start minihbasecluster, etc.
> -----------------------------------------------------------------
>
>                 Key: HBASE-19902
>                 URL: https://issues.apache.org/jira/browse/HBASE-19902
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>         Attachments: HBASE-19902.temporary-2.001.patch
>
>
> Trying to figure out what is going on w/ the jenkins build....
> Changed the hadoopqa config to output long process listing rather than just 'java'... 
> I can't get loadavg... tried dumping /proc...
>  /tmp/jenkins6485196190911961762.sh: line 48: /loadavg: Permission denied
> Looking at https://builds.apache.org/job/PreCommit-HBASE-Build/11273/console, I see 7 java processes running on H2. Extra args on ps may help show whether they are zombies of ours.
> Test run was fine, then it got into the second part of hbase-server and soon after started failing..
> https://builds.apache.org/job/PreCommit-HBASE-Build/11273/artifact/patchprocess/patch-unit-hbase-server.txt
> Looking at first test failure... this is where main thread is, trying to get thread info:
> {code}
> Thread 23 (Time-limited test):
>   State: RUNNABLE
>   Blocked count: 118
>   Waited count: 58
>   Stack:
>     sun.management.ThreadImpl.getThreadInfo1(Native Method)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:178)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:139)
>     org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:168)
>     sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     java.lang.reflect.Method.invoke(Method.java:498)
>     org.apache.hadoop.hbase.util.Threads$PrintThreadInfoLazyHolder$1.printThreadInfo(Threads.java:294)
>     org.apache.hadoop.hbase.util.Threads.printThreadInfo(Threads.java:341)
>     org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:191)
>     org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:391)
>     org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:262)
>     org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:119)
>     org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1025)
>     org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:971)
>     org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:842)
>     org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:824)
>     org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:806)
>     org.apache.hadoop.hbase.AcidGuaranteesTestBase.setUpBeforeClass(AcidGuaranteesTestBase.java:61)
> {code}
> Master is not coming up....
> {code}
> 2018-01-31 02:22:31,474 ERROR [Time-limited test] hbase.MiniHBaseCluster(267): Error starting cluster
> java.lang.RuntimeException: Master not active after 30000ms
> 	at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:192)
> 	at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:391)
> 	at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:262)
> 	at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:119)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1025)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:971)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:842)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:824)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:806)
> 	at org.apache.hadoop.hbase.AcidGuaranteesTestBase.setUpBeforeClass(AcidGuaranteesTestBase.java:61)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
> 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> Next test starts but doesn't complete.
> Running findHangingTests it finds 24 hung and 151 that have not timed out....
> Trying a few things:
> Set yetus version for hadoopqa temporarily back to 0.6.0 and started this build:
> https://builds.apache.org/job/PreCommit-HBASE-Build/11281/console
> ... and this one:
> https://builds.apache.org/job/PreCommit-HBASE-Build/11282/console


