Posted to user@hbase.apache.org by Srinidhi Muppalla <sr...@trulia.com> on 2019/04/18 00:06:03 UTC
HBase Scan Thread Stuck after Upgrade to 1.4.9
Hello,
We recently upgraded our staging cluster, running on S3-backed EMR, from HBase 1.3.0 to HBase 1.4.9. After doing so and running a test load, we noticed a sudden spike in CPU utilization on one of the nodes in the cluster (jumping from 20% to 60% all at once). After looking at the CPU time of the threads and taking a thread dump of the application, we noticed that there were exactly two threads, each with enormous CPU usage, running since the start of the spike. They both appear to be stuck in the CellComparator compare method. We turned off all requests to the cluster, but the CPU usage and those threads remained in the same state. I’ve pasted some data and stack traces below and also attached full thread dumps. We haven’t had this issue when running HBase 1.3.
What top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7198 hbase 20 0 7944m 2.6g 40m S 213.0 17.9 135:39.08 java
4755 root 20 0 4329m 262m 23m S 0.7 1.7 0:30.52 java
6630 yarn 20 0 4014m 430m 36m S 0.3 2.9 0:48.25 java
1 root 20 0 19680 2708 2324 S 0.0 0.0 0:01.37 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
7 root 20 0 0 0 0 S 0.0 0.0 0:01.47 ksoftirqd/0
8 root 20 0 0 0 0 I 0.0 0.0 0:02.74 rcu_sched
9 root 20 0 0 0 0 I 0.0 0.0 0:00.00 rcu_bh
10 root RT 0 0 0 0 S 0.0 0.0 0:00.04 migration/0
11 root RT 0 0 0 0 S 0.0 0.0 0:00.01 watchdog/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
14 root RT 0 0 0 0 S 0.0 0.0 0:00.01 watchdog/1
15 root RT 0 0 0 0 S 0.0 0.0 0:00.03 migration/1
16 root 20 0 0 0 0 S 0.0 0.0 0:00.03 ksoftirqd/1
18 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/1:0H
A breakdown of the threads:
sudo top -n 1 -H -p 7198
top - 23:25:50 up 1:50, 2 users, load average: 2.11, 2.28, 2.27
Tasks: 232 total, 2 running, 230 sleeping, 0 stopped, 0 zombie
Cpu(s): 36.5%us, 1.0%sy, 0.0%ni, 62.0%id, 0.1%wa, 0.0%hi, 0.2%si, 0.2%st
Mem: 15394140k total, 10347008k used, 5047132k free, 60708k buffers
Swap: 0k total, 0k used, 0k free, 5358004k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7402 hbase 20 0 7944m 2.7g 40m R 92.8 18.7 65:12.87 java
7399 hbase 20 0 7944m 2.7g 40m R 90.8 18.7 65:17.01 java
7257 hbase 20 0 7944m 2.7g 40m S 7.9 18.7 4:01.49 java
7258 hbase 20 0 7944m 2.7g 40m S 7.9 18.7 4:01.30 java
7259 hbase 20 0 7944m 2.7g 40m S 7.9 18.7 4:01.46 java
7260 hbase 20 0 7944m 2.7g 40m S 7.9 18.7 4:01.69 java
10443 hbase 20 0 7944m 2.7g 40m S 2.0 18.7 0:14.43 java
7198 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:00.01 java
7256 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:06.54 java
7261 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:13.84 java
7262 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:41.45 java
7263 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:00.48 java
7264 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:00.41 java
7265 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:00.00 java
7266 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:00.00 java
7267 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:33.95 java
7268 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:32.26 java
7269 hbase 20 0 7944m 2.7g 40m S 0.0 18.7 0:08.90 java
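For reference, the per-thread PIDs shown by top -H map to the nid/nativeId values in the thread dump by hex conversion; this is how I matched the two hot threads to their stack traces (the jstack invocation is a sketch, since the exact JDK tooling path on EMR may differ):

```shell
# Convert a hot thread's PID from `top -H` to the hex nid used in thread dumps.
# 7402 is one of the two busy threads above; 7198 is the JVM pid.
printf 'nid=0x%x\n' 7402
# Then locate that thread in a dump of the region server JVM, e.g.:
# jstack -l 7198 | grep -A 25 'nid=0x1cea'
```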
Stacktrace for PID 7402:
RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=16020
priority:5 - threadId:0x00007f0ec5ef7000 - nativeId:0x1cea - nativeId (decimal):7402 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.CellComparator.compareColumns(CellComparator.java:158)
at org.apache.hadoop.hbase.CellComparator.compareWithoutRow(CellComparator.java:224)
at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:66)
at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
- locked <0x00000006964deaf0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Locked ownable synchronizers:
- None
Stacktrace for PID 7399:
priority:5 - threadId:0x00007f0ec5ef1000 - nativeId:0x1ce7 - nativeId (decimal):7399 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.CellComparator.compareRows(CellComparator.java:187)
at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:63)
at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
- locked <0x00000006ae08a770> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Locked ownable synchronizers:
- None
Does anyone know what is going on? Is there possibly an infinite loop or some other mechanism causing these threads to get stuck?
Thanks,
Srinidhi
Re: HBase Scan Thread Stuck after Upgrade to 1.4.9
Posted by Toshihiro Suzuki <br...@gmail.com>.
Maybe you were doing scans with very heavy filters that take a long time. According to the thread dumps, it looks like the 2 scans were not stuck: they were RUNNABLE, so I think the scans were just taking a long time.
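For example, a filter-heavy scan of the kind I mean, with several filters OR'd together so that getNextCellHint has to compare candidate cells from each filter, would look something like this from the HBase shell (the table name and prefixes here are made up for illustration):

```shell
# Build an OR filter list in the HBase shell filter language. Each added
# ColumnPrefixFilter gives getNextCellHint another candidate hint to compare.
FILTER="ColumnPrefixFilter('a') OR ColumnPrefixFilter('b') OR ColumnPrefixFilter('c')"
echo "scan 'test_table', { FILTER => \"${FILTER}\" }"
# Pipe the printed command into `hbase shell` on a live cluster to run it.
```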
On Thu, Apr 18, 2019 at 9:06 AM Srinidhi Muppalla <sr...@trulia.com>
wrote:
> [quoted text of the original message trimmed]