You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@phoenix.apache.org by Chris Tarnas <cf...@biotiquesystems.com> on 2014/05/12 23:28:23 UTC

Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

We have hit a problem with Phoenix and regionservers CPU usage spiking up to use all available CPU and becoming unresponsive.

After HDP 2.1 was released we setup a 4 compute node cluster (with 3 VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand linked in the jar files to the hadoop lib. Everything was going well and we were able to load in ~30k records into several tables. What happened was after about 3-4 days of being up the regionservers became unresponsive and started to use most of the available CPU (12 core boxes). Nothing terribly informative was in the logs (initially we saw some flush messages that seemed excessive, but that was not all of the time and we changed back to the standard HBase WAL codec). We are able to kill the unresponsive regionservers and then restart them, the cluster will be fine for a day or so but will start to lock up again. 

We've dropped the entire HBase and zookeper information and started from scratch, but that has not helped.

James Taylor suggested I send this off here. I've attached a jstack report of a locked up regionserver in hopes that someone can shed some light. 

thanks,
-chris

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Christopher Tarnas <cf...@biotiquesystems.com>.

Hi Jeffrey,

The performance.py from HDP 2.1 does not work out of the box - it is not
looking in the right place for the phoenix jar files (even with
PHOENIX_LIB_DIR set). I downloaded the tarball release from apache and that
one works. After running the script and the counts as you suggested I saw
no increase in CPU usage or stuck RpcServer.handlers. I ran a few scans as
well just for good measure. I used the HDP sqlline.py to do the queries.

Re-running the count on one of our tables I did see the problem - but can
add one more bit of info: the only affected regionserver is the
regionserver that hosts the last region of the scanned table, and I cannot
perfectly reproduce it. It seems to happen about 80% of the time. Also, if
there is a stuck RpcServer.handlers it also prevents the regionserver from
exiting, it needs to be killed with -9.

I have one regionserver running with a few stuck handlers, any commands
that  you would like me to run on it?

thank you,
-chris


On Wed, May 14, 2014 at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com>wrote:

>
> Hey Chris,
>
> I used performance.py tool which created a table with 50K rows in one
> table, run the following query from sqlline.py and everything seems fine
> without seeing CPU running hot.
>
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 50000      |
> +------------+
> 1 row selected (0.166 seconds)
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 50000      |
> +------------+
> 1 row selected (0.167 seconds)
>
> Is there anyway could you run profiler to see where the CPU goes?
>
>
>
> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>
> >Ahh, yes. Here is a pastebin for it:
> >
> >http://pastebin.com/w6mtabag
> >
> >thanks again,
> >-chris
> >
> >On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com> wrote:
> >
> >> Hi Chris,
> >>
> >> Attachments are filtered out by the mail server. Can you pastebin it
> >>some
> >> place?
> >>
> >> Thanks,
> >> Nick
> >>
> >>
> >> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
> >><cf...@biotiquesystems.com>wrote:
> >>
> >>> Hello,
> >>>
> >>> We set the HBase RegionServer Handler to 10 (it appears to have been
> >>>set
> >>> to 60 by Ambari during install process). Now we have narrowed down what
> >>> causes the CPU to increase and have some detailed logs:
> >>>
> >>> If we connect using sqlline.py and execute a select that selects one
> >>>row
> >>> using the primary_key, no increate in CPU is observed and the number
> >>>of RPC
> >>> threads in a RUNNABLE state remains the same.
> >>>
> >>> If we execute a select that scans the table such as "select count(*)
> >>>from
> >>> TABLE" or where the "where" clause only limits on non-primary key
> >>> attributes, then the number of RUNNABLE RpcServer.handler threads
> >>>increases
> >>> and the CPU utilization of the regionserver increases by ~105%.
> >>>
> >>> Disconnecting the client does not have an effect and the
> >>>RpcServer.handler
> >>> thread is left RUNNABLE and the CPU stays at the higher usage.
> >>>
> >>> Checking the Web Console for the Regionserver just shows 10
> >>> RpcServer.reader tasks, all in a WAITING state, no other monitored
> >>>tasks
> >>> are happening. The regionserver has a Max Heap of 10G and a Used heap
> >>>of
> >>> 445.2M.
> >>>
> >>> I've attached the regionserver log with IPC debug logging turned on
> >>>right
> >>> when one of the Phoenix statements is executed (this statement actually
> >>> used up the last available handler).
> >>>
> >>> thanks,
> >>> -chris
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com>
> >>>wrote:
> >>>
> >>>>
> >>>> From the stack, it seems you increase the default rpc handler number
> >>>>to
> >>>> about 60. All handlers are serving Get request(You can search
> >>>>
> >>>
> >>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
> >>>a:2
> >>>> 841).
> >>>>
> >>>> You can check why there are so many get requests by adding some log
> >>>>info
> >>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
> >>>> handlers per region server, it will mitigate your current issue.
> >>>>
> >>>>
> >>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >>>>
> >>>>> We have hit a problem with Phoenix and regionservers CPU usage
> >>>>>spiking
> >>> up
> >>>>> to use all available CPU and becoming unresponsive.
> >>>>>
> >>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
> >>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
> >>>>>Ambari
> >>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
> >>>>> linked in the jar files to the hadoop lib. Everything was going well
> >>>>>and
> >>>>> we were able to load in ~30k records into several tables. What
> >>>>>happened
> >>>>> was after about 3-4 days of being up the regionservers became
> >>>>> unresponsive and started to use most of the available CPU (12 core
> >>>>> boxes). Nothing terribly informative was in the logs (initially we
> >>>>>saw
> >>>>> some flush messages that seemed excessive, but that was not all of
> >>>>>the
> >>>>> time and we changed back to the standard HBase WAL codec). We are
> >>>>>able
> >>> to
> >>>>> kill the unresponsive regionservers and then restart them, the
> >>>>>cluster
> >>>>> will be fine for a day or so but will start to lock up again.
> >>>>>
> >>>>> We've dropped the entire HBase and zookeper information and started
> >>>>>from
> >>>>> scratch, but that has not helped.
> >>>>>
> >>>>> James Taylor suggested I send this off here. I've attached a jstack
> >>>>> report of a locked up regionserver in hopes that someone can shed
> >>>>>some
> >>>>> light.
> >>>>>
> >>>>> thanks,
> >>>>> -chris
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> CONFIDENTIALITY NOTICE
> >>>> NOTICE: This message is intended for the use of the individual or
> >>>>entity
> >>> to
> >>>> which it is addressed and may contain information that is
> >>>>confidential,
> >>>> privileged and exempt from disclosure under applicable law. If the
> >>>>reader
> >>>> of this message is not the intended recipient, you are hereby notified
> >>> that
> >>>> any printing, copying, dissemination, distribution, disclosure or
> >>>> forwarding of this communication is strictly prohibited. If you have
> >>>> received this communication in error, please contact the sender
> >>> immediately
> >>>> and delete it from your system. Thank You.
> >>>
> >>>
> >>>
>
>
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

Hi Jeffrey,

The performance.py from HDP 2.1 does not work out of the box - it is not looking in the right place for the phoenix jar files (even with PHOENIX_LIB_DIR set). I downloaded the tarball release from apache and that one works. After running the script and the counts as you suggested I saw no increase in CPU usage or stuck RpcServer.handlers. I ran a few scans as well just for good measure. I used the HDP sqlline.py to do the queries.

Re-running the count on one of our tables I did see the problem - but can add one more bit of info: the only affected regionserver is the regionserver that hosts the last region of the scanned table, and I cannot perfectly reproduce it. It seems to happen about 80% of the time. Also, if there is a stuck RpcServer.handlers it also prevents the regionserver from exiting, it needs to be killed with -9. 

I have one regionserver running with a few stuck handlers, any commands that  you would like me to run on it?

thank you,
-chris

On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:

> 
> Hey Chris,
> 
> I used performance.py tool which created a table with 50K rows in one
> table, run the following query from sqlline.py and everything seems fine
> without seeing CPU running hot.
> 
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 50000      |
> +------------+
> 1 row selected (0.166 seconds)
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 50000      |
> +------------+
> 1 row selected (0.167 seconds)
> 
> Is there anyway could you run profiler to see where the CPU goes?
> 
> 
> 
> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> 
>> Ahh, yes. Here is a pastebin for it:
>> 
>> http://pastebin.com/w6mtabag
>> 
>> thanks again,
>> -chris
>> 
>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>> 
>>> Hi Chris,
>>> 
>>> Attachments are filtered out by the mail server. Can you pastebin it
>>> some
>>> place?
>>> 
>>> Thanks,
>>> Nick
>>> 
>>> 
>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>> <cf...@biotiquesystems.com>wrote:
>>> 
>>>> Hello,
>>>> 
>>>> We set the HBase RegionServer Handler to 10 (it appears to have been
>>>> set
>>>> to 60 by Ambari during install process). Now we have narrowed down what
>>>> causes the CPU to increase and have some detailed logs:
>>>> 
>>>> If we connect using sqlline.py and execute a select that selects one
>>>> row
>>>> using the primary_key, no increate in CPU is observed and the number
>>>> of RPC
>>>> threads in a RUNNABLE state remains the same.
>>>> 
>>>> If we execute a select that scans the table such as "select count(*)
>>>> from
>>>> TABLE" or where the "where" clause only limits on non-primary key
>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>> increases
>>>> and the CPU utilization of the regionserver increases by ~105%.
>>>> 
>>>> Disconnecting the client does not have an effect and the
>>>> RpcServer.handler
>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>> 
>>>> Checking the Web Console for the Regionserver just shows 10
>>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
>>>> tasks
>>>> are happening. The regionserver has a Max Heap of 10G and a Used heap
>>>> of
>>>> 445.2M.
>>>> 
>>>> I've attached the regionserver log with IPC debug logging turned on
>>>> right
>>>> when one of the Phoenix statements is executed (this statement actually
>>>> used up the last available handler).
>>>> 
>>>> thanks,
>>>> -chris
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>> wrote:
>>>> 
>>>>> 
>>>>> From the stack, it seems you increase the default rpc handler number
>>>>> to
>>>>> about 60. All handlers are serving Get request(You can search
>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>> a:2
>>>>> 841).
>>>>> 
>>>>> You can check why there are so many get requests by adding some log
>>>>> info
>>>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
>>>>> handlers per region server, it will mitigate your current issue.
>>>>> 
>>>>> 
>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>>> 
>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
>>>>>> spiking
>>>> up
>>>>>> to use all available CPU and becoming unresponsive.
>>>>>> 
>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>> Ambari
>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>>>>>> linked in the jar files to the hadoop lib. Everything was going well
>>>>>> and
>>>>>> we were able to load in ~30k records into several tables. What
>>>>>> happened
>>>>>> was after about 3-4 days of being up the regionservers became
>>>>>> unresponsive and started to use most of the available CPU (12 core
>>>>>> boxes). Nothing terribly informative was in the logs (initially we
>>>>>> saw
>>>>>> some flush messages that seemed excessive, but that was not all of
>>>>>> the
>>>>>> time and we changed back to the standard HBase WAL codec). We are
>>>>>> able
>>>> to
>>>>>> kill the unresponsive regionservers and then restart them, the
>>>>>> cluster
>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>> 
>>>>>> We've dropped the entire HBase and zookeper information and started
>>>>>> from
>>>>>> scratch, but that has not helped.
>>>>>> 
>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>> report of a locked up regionserver in hopes that someone can shed
>>>>>> some
>>>>>> light.
>>>>>> 
>>>>>> thanks,
>>>>>> -chris
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> CONFIDENTIALITY NOTICE
>>>>> NOTICE: This message is intended for the use of the individual or
>>>>> entity
>>>> to
>>>>> which it is addressed and may contain information that is
>>>>> confidential,
>>>>> privileged and exempt from disclosure under applicable law. If the
>>>>> reader
>>>>> of this message is not the intended recipient, you are hereby notified
>>>> that
>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>> forwarding of this communication is strictly prohibited. If you have
>>>>> received this communication in error, please contact the sender
>>>> immediately
>>>>> and delete it from your system. Thank You.
>>>> 
>>>> 
>>>> 
> 
> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Jeffrey Zhong <jz...@hortonworks.com>.

I attached a patch on Phoenix-992 which you may can try out as well.

On 5/21/14 10:46 AM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:

>We will try out the two work-arounds today. I've added the TRACE logging,
>here is the log on an affected regionserver when running a problem
>query/table combination, below it is the new stuck thread stack trace:
>
>-------------------------------------
>
>2014-05-21 12:44:39,280 DEBUG [RpcServer.listener,port=60020]
>ipc.RpcServer: RpcServer.listener,port=60020: connection from
>10.56.200.108:39956; # active connections: 1
>2014-05-21 12:44:39,280 DEBUG [RpcServer.reader=7,port=60020]
>ipc.RpcServer: Authorized user_info { effective_user: "tarnasch" }
>service_name: "ClientService" cell_block_codec_class:
>"org.apache.hadoop.hbase.codec.KeyValueCodec"
>2014-05-21 12:44:39,281 TRACE [RpcServer.reader=7,port=60020]
>ipc.RpcServer: RequestHeader call_id: 22 method_name: "Scan"
>request_param: true totalRequestSize: 229 bytes
>2014-05-21 12:44:39,282 DEBUG [RpcServer.handler=6,port=60020]
>ipc.RpcServer: callId: 22 service: ClientService methodName: Scan size:
>229 connection: 10.56.200.108:39956 executing as tarnasch
>2014-05-21 12:44:39,284 TRACE [RpcServer.reader=7,port=60020]
>ipc.RpcServer: RequestHeader call_id: 28 method_name: "Get"
>request_param: true totalRequestSize: 94 bytes
>2014-05-21 12:44:39,284 INFO  [RpcServer.handler=6,port=60020]
>coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
>coprocessor scan 
>{"timeRange":[0,1400694279223],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00
>\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"ca
>cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":
>1,"filter":"FirstKeyOnlyFilter","caching":-1}
>2014-05-21 12:44:39,285 DEBUG [RpcServer.handler=7,port=60020]
>ipc.RpcServer: callId: 28 service: ClientService methodName: Get size: 94
>connection: 10.56.200.108:39956 executing as tarnasch
>2014-05-21 12:44:39,706 INFO  [RpcServer.handler=6,port=60020]
>coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
>rows for ungrouped coprocessor scan
>{"timeRange":[0,1400694279223],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00
>\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"ca
>cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":
>1,"filter":"FirstKeyOnlyFilter","caching":-1}
>2014-05-21 12:44:39,707 TRACE [RpcServer.handler=6,port=60020]
>ipc.RpcServer: callId: 22 service: ClientService methodName: Scan size:
>229 connection: 10.56.200.108:39956 param: region { type: REGION_NAME
>value: 
>"ACRE.GE,\003\000\000\000\000,1399489118008.78e15ac392815f1a8f2b5a53b80a08
>43." } scan { column { family: "0" } attribute { name: "_ondemand_"
>value: "\001" } attribute { name: "_Aggs" value:
>"\001\001\030\001\003\005\200\000\000\001\002\003" } attribute { name:
>"_UngroupedAgg" value: "\001" } start_row: "\003\000\000\000\000" filter
>{ name: "org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter"
>serialized_filter: "" } time_range { from: 0 to: 1400694279223 }
>max_versions: 1 cache_blocks: true } number_of_rows: 0 close_scanner:
>false connection: 10.56.200.108:39956, response scanner_id:
>5393104135575718994 more_results: true ttl: 60000 queueTime: 0
>processingTime: 424
>2014-05-21 12:44:39,707 DEBUG [RpcServer.handler=6,port=60020]
>ipc.RpcServer: RpcServer.responder: callId: 22 wrote 24 bytes.
>2014-05-21 12:44:39,707 TRACE [RpcServer.reader=7,port=60020]
>ipc.RpcServer: RequestHeader call_id: 59 method_name: "Scan"
>request_param: true totalRequestSize: 28 bytes
>2014-05-21 12:44:39,708 DEBUG [RpcServer.handler=5,port=60020]
>ipc.RpcServer: callId: 59 service: ClientService methodName: Scan size:
>28 connection: 10.56.200.108:39956 executing as tarnasch
>2014-05-21 12:44:39,708 TRACE [RpcServer.handler=5,port=60020]
>ipc.RpcServer: callId: 59 service: ClientService methodName: Scan size:
>28 connection: 10.56.200.108:39956 param: scanner_id: 5393104135575718994
>number_of_rows: 100 close_scanner: false next_call_seq: 0 connection:
>10.56.200.108:39956, response cells_per_result: 1 scanner_id:
>5393104135575718994 more_results: true ttl: 60000 queueTime: 0
>processingTime: 0
>2014-05-21 12:44:39,709 DEBUG [RpcServer.handler=5,port=60020]
>ipc.RpcServer: RpcServer.responder: callId: 59 wrote 65 bytes.
>2014-05-21 12:44:39,710 TRACE [RpcServer.reader=7,port=60020]
>ipc.RpcServer: RequestHeader call_id: 60 method_name: "Scan"
>request_param: true totalRequestSize: 26 bytes
>2014-05-21 12:44:39,710 DEBUG [RpcServer.handler=2,port=60020]
>ipc.RpcServer: callId: 60 service: ClientService methodName: Scan size:
>26 connection: 10.56.200.108:39956 executing as tarnasch
>2014-05-21 12:44:39,710 TRACE [RpcServer.handler=2,port=60020]
>ipc.RpcServer: callId: 60 service: ClientService methodName: Scan size:
>26 connection: 10.56.200.108:39956 param: scanner_id: 5393104135575718994
>number_of_rows: 0 close_scanner: true connection: 10.56.200.108:39956,
>response scanner_id: 5393104135575718994 more_results: false queueTime: 0
>processingTime: 0
>2014-05-21 12:44:39,710 DEBUG [RpcServer.handler=2,port=60020]
>ipc.RpcServer: RpcServer.responder: callId: 60 wrote 20 bytes.
>
>-------------------------------------
>
>The new corresponding stack thread trace is:
>
>"RpcServer.handler=7,port=60020" daemon prio=10 tid=0x00007ff7b06fc800
>nid=0x7864 runnable [0x00007ff78e028000]
>   java.lang.Thread.State: RUNNABLE
>	at 
>org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(Fast
>DiffDeltaEncoder.java:540)
>	at 
>org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncod
>edSeeker.next(BufferedDataBlockEncoder.java:261)
>	at 
>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFile
>ReaderV2.java:1063)
>	at 
>org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.
>java:1776)
>	at 
>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HSt
>ore.java:1722)
>	at 
>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.jav
>a:1655)
>	at 
>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.j
>ava:1826)
>	at 
>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:
>2841)
>	at 
>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.ca
>llBlockingMethod(ClientProtos.java:28857)
>	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>	at 
>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSched
>uler.java:160)
>	at 
>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSchedul
>er.java:38)
>	at 
>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.ja
>va:110)
>	at java.lang.Thread.run(Thread.java:744)
>
>
>thanks
>
>On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>
>> 
>> 
>> Hey Chris,
>> 
>> Sorry for the trace option. The right option should be below. Before you
>> were able to turn on DEBUG, you just need to change it to TRACE.
>> 
>>    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
>> 
>> 
>> 
>> Do you see that the two read requests burn one of your CPU to 100%? From
>> the call stacks you posted today and before, it seems there might be an
>> infinite loop inside the function Htable.getRowOrBefore() which is
>> deprecated and supposed to only work for meta region.(brought by Enis).
>> Fortunately there is only one place in phoenix call this function, so
>> could you try the following things?
>> 
>> set following configuration in client side hbase-site.xml to disable
>>stats
>> update 
>>   phoenix.query.statsUpdateFrequency = 2147483647
>>   phoenix.query.maxStatsAge = 0
>> 
>> After the configuration changes, could you verify if the CPU is still
>> spiking for the same query?
>> 
>> You can also do the following:
>> 
>> Alter your table's ColumnFamily data block encoding to NONE as
>> DATA_BLOCK_ENCODING => 'NONE' and do a major compaction after that
>>through
>> Hbase client.
>> 
>> 
>> We suspect that the combination of your data, FAST_DIFF data block
>> encoding & the deprecated Htable.getRowOrBefore() cause the issue.
>> 
>> 
>> 
>> On 5/20/14 3:03 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>> 
>>> Ho Jeffrey.
>>> 
>>> Thank you for taking a look. I added
>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the
>>>Custom
>>> log4j.properties in Ambari and did a rolling restart of the
>>>regionservers
>>> (and verified it was pushed out the nodes in
>>> /etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on
>>>the
>>> table. Looking at the logs I don't see any TRACE messages. The only two
>>> log messages that happened when issuing the select are:
>>> 
>>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
>>> coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
>>> coprocessor scan
>>> 
>>>{"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
>>>00
>>> 
>>>\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
>>>ca
>>> 
>>>cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
>>>":
>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
>>> coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
>>> rows for ungrouped coprocessor scan
>>> 
>>>{"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
>>>00
>>> 
>>>\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
>>>ca
>>> 
>>>cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
>>>":
>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>> 
>>> The metrics for the regionserver show only 2 total read requests since
>>>it
>>> started and 0 per second now. Those two requests are for the table I
>>>did
>>> the count on. The regionserver's UI says it has no active RPC calls.
>>>It
>>> currently has one stuck handler from running that command:
>>> 
>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
>>> nid=0x2582 runnable [0x00007fbd5ba6c000]
>>>  java.lang.Thread.State: RUNNABLE
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
>>>:2
>>> 888)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
>>>:2
>>> 880)
>>> 	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>>> 	at java.util.TreeMap.getEntry(TreeMap.java:340)
>>> 	at java.util.TreeMap.get(TreeMap.java:273)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDelete
>>>d(
>>> GetClosestRowBeforeTracker.java:128)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandi
>>>da
>>> te(GetClosestRowBeforeTracker.java:107)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(G
>>>et
>>> ClosestRowBeforeTracker.java:202)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
>>>e.
>>> java:1772)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>St
>>> ore.java:1722)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>av
>>> a:1655)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>.j
>>> ava:1826)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>a:
>>> 2841)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>ca
>>> llBlockingMethod(ClientProtos.java:28857)
>>> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>ed
>>> uler.java:160)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>ul
>>> er.java:38)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>ja
>>> va:110)
>>> 	at java.lang.Thread.run(Thread.java:744)
>>> 
>>> It also has one listener running:
>>> 
>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
>>> nid=0x2579 runnable [0x00007fbd5c376000]
>>>  java.lang.Thread.State: RUNNABLE
>>> 	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>> 	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>> 	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>> 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>> 	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>>> 	- locked <0x000000059ae289a0> (a
>>>java.util.Collections$UnmodifiableSet)
>>> 	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>> 	at 
>>>org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>>> 
>>> If there is anything else you need, please let me know.
>>> 
>>> thanks again,
>>> -chris
>>> 
>>> 
>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>wrote:
>>> 
>>>> 
>>>> I rechecked your RPC log and found that you didn't turn on the RPC
>>>>trace
>>>> level. You can set the following so that we can see detail request &
>>>> response. 
>>>> 
>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>>> 
>>>> If you didn't see time out exception in your region server logs, it
>>>> means
>>>> that there are something keeping issuing lots of Get requests. Have
>>>>you
>>>> used secondary index for this table? If that's the case, please
>>>> disable/drop it to see if that helps.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>> 
>>>>> I will see how possible a new, all bare metal cluster is, but we
>>>>>have 5
>>>>> other similarly configured clusters with no troubles. We only have an
>>>>> issue when Phoenix does a full scan of some tables that a handler
>>>>>gets
>>>>> stuck in 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>> after it has returned all of its data.
>>>>> 
>>>>> thank you,
>>>>> 
>>>>> -chris
>>>>> 
>>>>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com>
>>>>>wrote:
>>>>> 
>>>>>> i'm just guessing but may be region servers are stuck because they
>>>>>> can't
>>>>>> communicate with zookeeper or hmaster, or may be datanodes can't
>>>>>>talk
>>>>>> to
>>>>>> namenode because of some vmware networking quirk and some internal
>>>>>> thread
>>>>>> is hanging, who knows,
>>>>>> it would take 30 min to setup up bare metal cluster vs days
>>>>>>debugging
>>>>>> this
>>>>>> hybrid setup
>>>>>> also I'd suggest to install directly from hbase/hadoop websites  to
>>>>>> eliminate additional variables
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>> 
>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>>>> datanodes/regionservers with the stuck processes all bare metal.
>>>>>>> 
>>>>>>> -chris
>>>>>>> 
>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>>> 
>>>>>>>> can you try running hbase/phoenix on physical nodes instead of
>>>>>>>>using
>>>>>>>> virtual machines
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
>>>>>>>> <cft@biotiquesystems.com
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Sorry to follow up on my own message, but I was wondering if
>>>>>>>>>anyone
>>>>>>>>> had
>>>>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>>>>> right
>>>>>>>>> after a select * on the exact same table will.
>>>>>>>>> 
>>>>>>>>> If we export the table and then re-import it into a new table,
>>>>>>>>>the
>>>>>>>>> new
>>>>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>>>>> upsert..select
>>>>>>>>> to do a copy. It seems something happens to the last region to
>>>>>>>>> cause
>>>>>>> this,
>>>>>>>>> but it is not directly data dependent. Moving the region to
>>>>>>>>>another
>>>>>>>>> regionserver doesn't have any effect - just moves where the
>>>>>>>>>problem
>>>>>>>>> happens. Major compactions get hung up by the running threads as
>>>>>>>>> they
>>>>>>>>> probably have a lock.
>>>>>>>>> 
>>>>>>>>> I've run the hfile tool on the final region and nothing seems
>>>>>>>>>awry.
>>>>>>>>> 
>>>>>>>>> Figuring this out will allow this project to continue, as of now
>>>>>>>>>it
>>>>>>>>> is
>>>>>>>>> hung up on this issue.
>>>>>>>>> 
>>>>>>>>> thank you,
>>>>>>>>> -chris
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas
>>>>>>>>><cf...@biotiquesystems.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I did some "poor mans" profiling with multiple jstack and came
>>>>>>>>>>up
>>>>>>>>>> with
>>>>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStore
>>>>>>>>>Fi
>>>>>>>>> le
>>>>>>>>> .
>>>>>>>>> That is the last deepest method in all of the traces, either
>>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
>>>>>>>>> for a
>>>>>>>>> thread (which has been running for the last couple hours):
>>>>>>>>>> 
>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNex
>>>>>>>t(
>>>>>>> Fa
>>>>>>> stDiffDeltaEncoder.java:540)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$Buffere
>>>>>>>dE
>>>>>>> nc
>>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next
>>>>>>>(H
>>>>>>> Fi
>>>>>>> leReaderV2.java:1063)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(H
>>>>>>>St
>>>>>>> or
>>>>>>> e.java:1776)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>le
>>>>>>> (H
>>>>>>> Store.java:1722)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
>>>>>>>re
>>>>>>> .j
>>>>>>> ava:1655)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
>>>>>>>gi
>>>>>>> on
>>>>>>> .java:1826)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>.j
>>>>>>> av
>>>>>>> a:2841)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
>>>>>>>e$
>>>>>>> 2.
>>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>   at 
>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>   at 
>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
>>>>>>>cS
>>>>>>> ch
>>>>>>> eduler.java:160)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
>>>>>>>ch
>>>>>>> ed
>>>>>>> uler.java:38)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
>>>>>>>le
>>>>>>> r.
>>>>>>> java:110)
>>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>> 
>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:
>>>>>>>19
>>>>>>> 44
>>>>>>> )
>>>>>>>>>>   at
>>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.ro
>>>>>>>ot
>>>>>>> Bl
>>>>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.lo
>>>>>>>ad
>>>>>>> Da
>>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.se
>>>>>>>ek
>>>>>>> To
>>>>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.see
>>>>>>>kB
>>>>>>> ef
>>>>>>> ore(HFileReaderV2.java:548)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>le
>>>>>>> (H
>>>>>>> Store.java:1712)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
>>>>>>>re
>>>>>>> .j
>>>>>>> ava:1655)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
>>>>>>>gi
>>>>>>> on
>>>>>>> .java:1826)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>.j
>>>>>>> av
>>>>>>> a:2841)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
>>>>>>>e$
>>>>>>> 2.
>>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>   at 
>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>   at 
>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
>>>>>>>cS
>>>>>>> ch
>>>>>>> eduler.java:160)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
>>>>>>>ch
>>>>>>> ed
>>>>>>> uler.java:38)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
>>>>>>>le
>>>>>>> r.
>>>>>>> java:110)
>>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong
>>>>>>>>>> <jz...@hortonworks.com>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hey Chris,
>>>>>>>>>>> 
>>>>>>>>>>> I used performance.py tool which created a table with 50K rows
>>>>>>>>>>>in
>>>>>>>>>>> one
>>>>>>>>>>> table, run the following query from sqlline.py and everything
>>>>>>>>>>> seems
>>>>>>> fine
>>>>>>>>>>> without seeing CPU running hot.
>>>>>>>>>>> 
>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
>>>>>>>>>>>from
>>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | 50000      |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
>>>>>>>>>>>from
>>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | 50000      |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>>> 
>>>>>>>>>>> Is there anyway could you run profiler to see where the CPU
>>>>>>>>>>>goes?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>>> 
>>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>>> 
>>>>>>>>>>>> thanks again,
>>>>>>>>>>>> -chris
>>>>>>>>>>>> 
>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>>>>> pastebin it
>>>>>>>>>>>>> some
>>>>>>>>>>>>> place?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to
>>>>>>>>>>>>>> have
>>>>>>> been
>>>>>>>>>>>>>> set
>>>>>>>>>>>>>> to 60 by Ambari during install process). Now we have
>>>>>>>>>>>>>>narrowed
>>>>>>>>>>>>>> down
>>>>>>>>> what
>>>>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
>>>>>>>>>>>>>> selects
>>>>>>> one
>>>>>>>>>>>>>> row
>>>>>>>>>>>>>> using the primary_key, no increate in CPU is observed and
>>>>>>>>>>>>>>the
>>>>>>> number
>>>>>>>>>>>>>> of RPC
>>>>>>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If we execute a select that scans the table such as "select
>>>>>>> count(*)
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>> TABLE" or where the "where" clause only limits on
>>>>>>>>>>>>>>non-primary
>>>>>>>>>>>>>> key
>>>>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
>>>>>>>>>>>>>> threads
>>>>>>>>>>>>>> increases
>>>>>>>>>>>>>> and the CPU utilization of the regionserver increases by
>>>>>>>>>>>>>> ~105%.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>>>>>> RpcServer.handler
>>>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher
>>>>>>>>>>>>>>usage.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>>>>> monitored
>>>>>>>>>>>>>> tasks
>>>>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a
>>>>>>>>>>>>>> Used
>>>>>>> heap
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>> 445.2M.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>>>>> turned on
>>>>>>>>>>>>>> right
>>>>>>>>>>>>>> when one of the Phoenix statements is executed (this
>>>>>>>>>>>>>>statement
>>>>>>>>> actually
>>>>>>>>>>>>>> used up the last available handler).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>>>>>> <jzhong@hortonworks.com
>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> From the stack, it seems you increase the default rpc
>>>>>>>>>>>>>>>handler
>>>>>>> number
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> about 60. All handlers are serving Get request(You can
>>>>>>>>>>>>>>>search
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServ
>>>>>>>>>er
>>>>>>>>> .j
>>>>>>>>> av
>>>>>>>>>>>>>> a:2
>>>>>>>>>>>>>>> 841).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> You can check why there are so many get requests by adding
>>>>>>>>>>>>>>> some
>>>>>>> log
>>>>>>>>>>>>>>> info
>>>>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the
>>>>>>>>>>>>>>>number
>>>>>>>>>>>>>>> of
>>>>>>> rpc
>>>>>>>>>>>>>>> handlers per region server, it will mitigate your current
>>>>>>>>>>>>>>> issue.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas"
>>>>>>>>>>>>>>><cf...@biotiquesystems.com>
>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU
>>>>>>>>>>>>>>>> usage
>>>>>>>>>>>>>>>> spiking
>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node
>>>>>>>>>>>>>>>>cluster
>>>>>>> (with
>>>>>>>>> 3
>>>>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>>>>> plain
>>>>>>>>>>>>>>>> Ambari
>>>>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM
>>>>>>>>>>>>>>>>release
>>>>>>>>>>>>>>>> and
>>>>>>>>> hand
>>>>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
>>>>>>>>>>>>>>>> going
>>>>>>>>> well
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> we were able to load in ~30k records into several tables.
>>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>>> happened
>>>>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers
>>>>>>>>>>>>>>>> became
>>>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
>>>>>>>>>>>>>>>> (12
>>>>>>> core
>>>>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>>>>> (initially
>>>>>>> we
>>>>>>>>>>>>>>>> saw
>>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was
>>>>>>>>>>>>>>>>not
>>>>>>>>>>>>>>>> all
>>>>>>> of
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec).
>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>> able
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them,
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up
>>>>>>>>>>>>>>>> again.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information
>>>>>>>>>>>>>>>>and
>>>>>>> started
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've
>>>>>>>>>>>>>>>>attached a
>>>>>>> jstack
>>>>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone
>>>>>>>>>>>>>>>>can
>>>>>>>>>>>>>>>> shed
>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>> light.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>>>>>> NOTICE: This message is intended for the use of the
>>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>> entity
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>>>>>>> confidential,
>>>>>>>>>>>>>>> privileged and exempt from disclosure under applicable law.
>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> reader
>>>>>>>>>>>>>>> of this message is not the intended recipient, you are
>>>>>>>>>>>>>>>hereby
>>>>>>>>> notified
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> any printing, copying, dissemination, distribution,
>>>>>>>>>>>>>>> disclosure
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If
>>>>>>>>>>>>>>> you
>>>>>>> have
>>>>>>>>>>>>>>> received this communication in error, please contact the
>>>>>>>>>>>>>>> sender
>>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>> NOTICE: This message is intended for the use of the individual
>>>>>>>>>>>or
>>>>>>>>> entity to
>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>> confidential,
>>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
>>>>>>>>>>> the
>>>>>>>>> reader
>>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>>>>> notified
>>>>>>>>> that
>>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure
>>>>>>>>>>>or
>>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>>>>>>> have
>>>>>>>>>>> received this communication in error, please contact the sender
>>>>>>>>> immediately
>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> CONFIDENTIALITY NOTICE
>>>> NOTICE: This message is intended for the use of the individual or
>>>> entity to 
>>>> which it is addressed and may contain information that is 
>>>>confidential,
>>>> privileged and exempt from disclosure under applicable law. If the
>>>> reader 
>>>> of this message is not the intended recipient, you are hereby notified
>>>> that 
>>>> any printing, copying, dissemination, distribution, disclosure or
>>>> forwarding of this communication is strictly prohibited. If you have
>>>> received this communication in error, please contact the sender
>>>> immediately 
>>>> and delete it from your system. Thank You.
>>> 
>> 
>> 
>> 
>> -- 
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or 
>>entity to 
>> which it is addressed and may contain information that is confidential, 
>> privileged and exempt from disclosure under applicable law. If the 
>>reader 
>> of this message is not the intended recipient, you are hereby notified 
>>that 
>> any printing, copying, dissemination, distribution, disclosure or 
>> forwarding of this communication is strictly prohibited. If you have 
>> received this communication in error, please contact the sender 
>>immediately 
>> and delete it from your system. Thank You.
>



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

We will try out the two work-arounds today. I've added the TRACE logging, here is the log on an affected regionserver when running a problem query/table combination, below it is the new stuck thread stack trace:

-------------------------------------

2014-05-21 12:44:39,280 DEBUG [RpcServer.listener,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: connection from 10.56.200.108:39956; # active connections: 1
2014-05-21 12:44:39,280 DEBUG [RpcServer.reader=7,port=60020] ipc.RpcServer: Authorized user_info { effective_user: "tarnasch" } service_name: "ClientService" cell_block_codec_class: "org.apache.hadoop.hbase.codec.KeyValueCodec"
2014-05-21 12:44:39,281 TRACE [RpcServer.reader=7,port=60020] ipc.RpcServer: RequestHeader call_id: 22 method_name: "Scan" request_param: true totalRequestSize: 229 bytes
2014-05-21 12:44:39,282 DEBUG [RpcServer.handler=6,port=60020] ipc.RpcServer: callId: 22 service: ClientService methodName: Scan size: 229 connection: 10.56.200.108:39956 executing as tarnasch
2014-05-21 12:44:39,284 TRACE [RpcServer.reader=7,port=60020] ipc.RpcServer: RequestHeader call_id: 28 method_name: "Get" request_param: true totalRequestSize: 94 bytes
2014-05-21 12:44:39,284 INFO  [RpcServer.handler=6,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400694279223],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-21 12:44:39,285 DEBUG [RpcServer.handler=7,port=60020] ipc.RpcServer: callId: 28 service: ClientService methodName: Get size: 94 connection: 10.56.200.108:39956 executing as tarnasch
2014-05-21 12:44:39,706 INFO  [RpcServer.handler=6,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400694279223],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-21 12:44:39,707 TRACE [RpcServer.handler=6,port=60020] ipc.RpcServer: callId: 22 service: ClientService methodName: Scan size: 229 connection: 10.56.200.108:39956 param: region { type: REGION_NAME value: "ACRE.GE,\003\000\000\000\000,1399489118008.78e15ac392815f1a8f2b5a53b80a0843." } scan { column { family: "0" } attribute { name: "_ondemand_" value: "\001" } attribute { name: "_Aggs" value: "\001\001\030\001\003\005\200\000\000\001\002\003" } attribute { name: "_UngroupedAgg" value: "\001" } start_row: "\003\000\000\000\000" filter { name: "org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter" serialized_filter: "" } time_range { from: 0 to: 1400694279223 } max_versions: 1 cache_blocks: true } number_of_rows: 0 close_scanner: false connection: 10.56.200.108:39956, response scanner_id: 5393104135575718994 more_results: true ttl: 60000 queueTime: 0 processingTime: 424
2014-05-21 12:44:39,707 DEBUG [RpcServer.handler=6,port=60020] ipc.RpcServer: RpcServer.responder: callId: 22 wrote 24 bytes.
2014-05-21 12:44:39,707 TRACE [RpcServer.reader=7,port=60020] ipc.RpcServer: RequestHeader call_id: 59 method_name: "Scan" request_param: true totalRequestSize: 28 bytes
2014-05-21 12:44:39,708 DEBUG [RpcServer.handler=5,port=60020] ipc.RpcServer: callId: 59 service: ClientService methodName: Scan size: 28 connection: 10.56.200.108:39956 executing as tarnasch
2014-05-21 12:44:39,708 TRACE [RpcServer.handler=5,port=60020] ipc.RpcServer: callId: 59 service: ClientService methodName: Scan size: 28 connection: 10.56.200.108:39956 param: scanner_id: 5393104135575718994 number_of_rows: 100 close_scanner: false next_call_seq: 0 connection: 10.56.200.108:39956, response cells_per_result: 1 scanner_id: 5393104135575718994 more_results: true ttl: 60000 queueTime: 0 processingTime: 0
2014-05-21 12:44:39,709 DEBUG [RpcServer.handler=5,port=60020] ipc.RpcServer: RpcServer.responder: callId: 59 wrote 65 bytes.
2014-05-21 12:44:39,710 TRACE [RpcServer.reader=7,port=60020] ipc.RpcServer: RequestHeader call_id: 60 method_name: "Scan" request_param: true totalRequestSize: 26 bytes
2014-05-21 12:44:39,710 DEBUG [RpcServer.handler=2,port=60020] ipc.RpcServer: callId: 60 service: ClientService methodName: Scan size: 26 connection: 10.56.200.108:39956 executing as tarnasch
2014-05-21 12:44:39,710 TRACE [RpcServer.handler=2,port=60020] ipc.RpcServer: callId: 60 service: ClientService methodName: Scan size: 26 connection: 10.56.200.108:39956 param: scanner_id: 5393104135575718994 number_of_rows: 0 close_scanner: true connection: 10.56.200.108:39956, response scanner_id: 5393104135575718994 more_results: false queueTime: 0 processingTime: 0
2014-05-21 12:44:39,710 DEBUG [RpcServer.handler=2,port=60020] ipc.RpcServer: RpcServer.responder: callId: 60 wrote 20 bytes.

-------------------------------------

The new corresponding stack thread trace is:

"RpcServer.handler=7,port=60020" daemon prio=10 tid=0x00007ff7b06fc800 nid=0x7864 runnable [0x00007ff78e028000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
	at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
	at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
	at java.lang.Thread.run(Thread.java:744)


thanks

On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:

> 
> 
> Hey Chris,
> 
> Sorry for the trace option. The right option should be below. Before you
> were able to turn on DEBUG, you just need to change it to TRACE.
> 
>    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
> 
> 
> 
> Do you see that the two read requests burn one of your CPU to 100%? From
> the call stacks you posted today and before, it seems there might be an
> infinite loop inside the function Htable.getRowOrBefore() which is
> deprecated and supposed to only work for meta region.(brought by Enis).
> Fortunately there is only one place in phoenix call this function, so
> could you try the following things?
> 
> set following configuration in client side hbase-site.xml to disable stats
> update 
>   phoenix.query.statsUpdateFrequency = 2147483647
>   phoenix.query.maxStatsAge = 0
> 
> After the configuration changes, could you verify if the CPU is still
> spiking for the same query?
> 
> You can also do the following:
> 
> Alter your table's ColumnFamily data block encoding to NONE as
> DATA_BLOCK_ENCODING => 'NONE' and do a major compaction after that through
> Hbase client.
> 
> 
> We suspect that the combination of your data, FAST_DIFF data block
> encoding & the deprecated Htable.getRowOrBefore() cause the issue.
> 
> 
> 
> On 5/20/14 3:03 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> 
>> Ho Jeffrey.
>> 
>> Thank you for taking a look. I added
>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom
>> log4j.properties in Ambari and did a rolling restart of the regionservers
>> (and verified it was pushed out the nodes in
>> /etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on the
>> table. Looking at the logs I don't see any TRACE messages. The only two
>> log messages that happened when issuing the select are:
>> 
>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
>> coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
>> coprocessor scan 
>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00
>> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"ca
>> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":
>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
>> coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
>> rows for ungrouped coprocessor scan
>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00
>> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"ca
>> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":
>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>> 
>> The metrics for the regionserver show only 2 total read requests since it
>> started and 0 per second now. Those two requests are for the table I did
>> the count on. The regionserver's UI says it has no active RPC calls.  It
>> currently has one stuck handler from running that command:
>> 
>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
>> nid=0x2582 runnable [0x00007fbd5ba6c000]
>>  java.lang.Thread.State: RUNNABLE
>> 	at 
>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2
>> 888)
>> 	at 
>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2
>> 880)
>> 	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>> 	at java.util.TreeMap.getEntry(TreeMap.java:340)
>> 	at java.util.TreeMap.get(TreeMap.java:273)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(
>> GetClosestRowBeforeTracker.java:128)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandida
>> te(GetClosestRowBeforeTracker.java:107)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(Get
>> ClosestRowBeforeTracker.java:202)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.
>> java:1772)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HSt
>> ore.java:1722)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.jav
>> a:1655)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.j
>> ava:1826)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:
>> 2841)
>> 	at 
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.ca
>> llBlockingMethod(ClientProtos.java:28857)
>> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSched
>> uler.java:160)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSchedul
>> er.java:38)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.ja
>> va:110)
>> 	at java.lang.Thread.run(Thread.java:744)
>> 
>> It also has one listener running:
>> 
>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
>> nid=0x2579 runnable [0x00007fbd5c376000]
>>  java.lang.Thread.State: RUNNABLE
>> 	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>> 	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>> 	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>> 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>> 	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>> 	- locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>> 	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>> 	at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>> 
>> If there is anything else you need, please let me know.
>> 
>> thanks again,
>> -chris
>> 
>> 
>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>> 
>>> 
>>> I rechecked your RPC log and found that you didn't turn on the RPC trace
>>> level. You can set the following so that we can see detail request &
>>> response. 
>>> 
>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>> 
>>> If you didn't see time out exception in your region server logs, it
>>> means
>>> that there are something keeping issuing lots of Get requests. Have you
>>> used secondary index for this table? If that's the case, please
>>> disable/drop it to see if that helps.
>>> 
>>> 
>>> 
>>> 
>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>> 
>>>> I will see how possible a new, all bare metal cluster is, but we have 5
>>>> other similarly configured clusters with no troubles. We only have an
>>>> issue when Phoenix does a full scan of some tables that a handler gets
>>>> stuck in 
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>> after it has returned all of its data.
>>>> 
>>>> thank you,
>>>> 
>>>> -chris
>>>> 
>>>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com> wrote:
>>>> 
>>>>> i'm just guessing but may be region servers are stuck because they
>>>>> can't
>>>>> communicate with zookeeper or hmaster, or may be datanodes can't talk
>>>>> to
>>>>> namenode because of some vmware networking quirk and some internal
>>>>> thread
>>>>> is hanging, who knows,
>>>>> it would take 30 min to setup up bare metal cluster vs days debugging
>>>>> this
>>>>> hybrid setup
>>>>> also I'd suggest to install directly from hbase/hadoop websites  to
>>>>> eliminate additional variables
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>>>> <cf...@biotiquesystems.com>wrote:
>>>>> 
>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>>> datanodes/regionservers with the stuck processes all bare metal.
>>>>>> 
>>>>>> -chris
>>>>>> 
>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>> 
>>>>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>>>>> virtual machines
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
>>>>>>> <cft@biotiquesystems.com
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Sorry to follow up on my own message, but I was wondering if anyone
>>>>>>>> had
>>>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>>>> right
>>>>>>>> after a select * on the exact same table will.
>>>>>>>> 
>>>>>>>> If we export the table and then re-import it into a new table, the
>>>>>>>> new
>>>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>>>> upsert..select
>>>>>>>> to do a copy. It seems something happens to the last region to
>>>>>>>> cause
>>>>>> this,
>>>>>>>> but it is not directly data dependent. Moving the region to another
>>>>>>>> regionserver doesn't have any effect - just moves where the problem
>>>>>>>> happens. Major compactions get hung up by the running threads as
>>>>>>>> they
>>>>>>>> probably have a lock.
>>>>>>>> 
>>>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>>>> 
>>>>>>>> Figuring this out will allow this project to continue, as of now it
>>>>>>>> is
>>>>>>>> hung up on this issue.
>>>>>>>> 
>>>>>>>> thank you,
>>>>>>>> -chris
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I did some "poor mans" profiling with multiple jstack and came up
>>>>>>>>> with
>>>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>> le
>>>>>>>> .
>>>>>>>> That is the last deepest method in all of the traces, either
>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
>>>>>>>> for a
>>>>>>>> thread (which has been running for the last couple hours):
>>>>>>>>> 
>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(
>>>>>> Fa
>>>>>> stDiffDeltaEncoder.java:540)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedE
>>>>>> nc
>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(H
>>>>>> Fi
>>>>>> leReaderV2.java:1063)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HSt
>>>>>> or
>>>>>> e.java:1776)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> (H
>>>>>> Store.java:1722)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore
>>>>>> .j
>>>>>> ava:1655)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegi
>>>>>> on
>>>>>> .java:1826)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>> av
>>>>>> a:2841)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$
>>>>>> 2.
>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcS
>>>>>> ch
>>>>>> eduler.java:160)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSch
>>>>>> ed
>>>>>> uler.java:38)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedule
>>>>>> r.
>>>>>> java:110)
>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>> 
>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:19
>>>>>> 44
>>>>>> )
>>>>>>>>>   at
>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.root
>>>>>> Bl
>>>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.load
>>>>>> Da
>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seek
>>>>>> To
>>>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekB
>>>>>> ef
>>>>>> ore(HFileReaderV2.java:548)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> (H
>>>>>> Store.java:1712)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore
>>>>>> .j
>>>>>> ava:1655)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegi
>>>>>> on
>>>>>> .java:1826)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>> av
>>>>>> a:2841)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$
>>>>>> 2.
>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcS
>>>>>> ch
>>>>>> eduler.java:160)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSch
>>>>>> ed
>>>>>> uler.java:38)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedule
>>>>>> r.
>>>>>> java:110)
>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong
>>>>>>>>> <jz...@hortonworks.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hey Chris,
>>>>>>>>>> 
>>>>>>>>>> I used performance.py tool which created a table with 50K rows in
>>>>>>>>>> one
>>>>>>>>>> table, run the following query from sqlline.py and everything
>>>>>>>>>> seems
>>>>>> fine
>>>>>>>>>> without seeing CPU running hot.
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>> +------------+
>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>> +------------+
>>>>>>>>>> | 50000      |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>> +------------+
>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>> +------------+
>>>>>>>>>> | 50000      |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>> 
>>>>>>>>>> Is there anyway could you run profiler to see where the CPU goes?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>> 
>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>> 
>>>>>>>>>>> thanks again,
>>>>>>>>>>> -chris
>>>>>>>>>>> 
>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>> 
>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>>>> pastebin it
>>>>>>>>>>>> some
>>>>>>>>>>>> place?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Nick
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to
>>>>>>>>>>>>> have
>>>>>> been
>>>>>>>>>>>>> set
>>>>>>>>>>>>> to 60 by Ambari during install process). Now we have narrowed
>>>>>>>>>>>>> down
>>>>>>>> what
>>>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
>>>>>>>>>>>>> selects
>>>>>> one
>>>>>>>>>>>>> row
>>>>>>>>>>>>> using the primary_key, no increate in CPU is observed and the
>>>>>> number
>>>>>>>>>>>>> of RPC
>>>>>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If we execute a select that scans the table such as "select
>>>>>> count(*)
>>>>>>>>>>>>> from
>>>>>>>>>>>>> TABLE" or where the "where" clause only limits on non-primary
>>>>>>>>>>>>> key
>>>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
>>>>>>>>>>>>> threads
>>>>>>>>>>>>> increases
>>>>>>>>>>>>> and the CPU utilization of the regionserver increases by
>>>>>>>>>>>>> ~105%.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>>>>> RpcServer.handler
>>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>>>> monitored
>>>>>>>>>>>>> tasks
>>>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a
>>>>>>>>>>>>> Used
>>>>>> heap
>>>>>>>>>>>>> of
>>>>>>>>>>>>> 445.2M.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>>>> turned on
>>>>>>>>>>>>> right
>>>>>>>>>>>>> when one of the Phoenix statements is executed (this statement
>>>>>>>> actually
>>>>>>>>>>>>> used up the last available handler).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>>>>> <jzhong@hortonworks.com
>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> From the stack, it seems you increase the default rpc handler
>>>>>> number
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> about 60. All handlers are serving Get request(You can search
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>> .j
>>>>>>>> av
>>>>>>>>>>>>> a:2
>>>>>>>>>>>>>> 841).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You can check why there are so many get requests by adding
>>>>>>>>>>>>>> some
>>>>>> log
>>>>>>>>>>>>>> info
>>>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the number
>>>>>>>>>>>>>> of
>>>>>> rpc
>>>>>>>>>>>>>> handlers per region server, it will mitigate your current
>>>>>>>>>>>>>> issue.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU
>>>>>>>>>>>>>>> usage
>>>>>>>>>>>>>>> spiking
>>>>>>>>>>>>> up
>>>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster
>>>>>> (with
>>>>>>>> 3
>>>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>>>> plain
>>>>>>>>>>>>>>> Ambari
>>>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release
>>>>>>>>>>>>>>> and
>>>>>>>> hand
>>>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
>>>>>>>>>>>>>>> going
>>>>>>>> well
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> we were able to load in ~30k records into several tables.
>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>> happened
>>>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers
>>>>>>>>>>>>>>> became
>>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
>>>>>>>>>>>>>>> (12
>>>>>> core
>>>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>>>> (initially
>>>>>> we
>>>>>>>>>>>>>>> saw
>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was not
>>>>>>>>>>>>>>> all
>>>>>> of
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec).
>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>> able
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them,
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up
>>>>>>>>>>>>>>> again.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information and
>>>>>> started
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>>>> jstack
>>>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone can
>>>>>>>>>>>>>>> shed
>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>> light.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>>>>> NOTICE: This message is intended for the use of the
>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> entity
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>>>>>> confidential,
>>>>>>>>>>>>>> privileged and exempt from disclosure under applicable law.
>>>>>>>>>>>>>> If
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> reader
>>>>>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>> notified
>>>>>>>>>>>>> that
>>>>>>>>>>>>>> any printing, copying, dissemination, distribution,
>>>>>>>>>>>>>> disclosure
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If
>>>>>>>>>>>>>> you
>>>>>> have
>>>>>>>>>>>>>> received this communication in error, please contact the
>>>>>>>>>>>>>> sender
>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>>>> entity to
>>>>>>>>>> which it is addressed and may contain information that is
>>>>>> confidential,
>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
>>>>>>>>>> the
>>>>>>>> reader
>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>>>> notified
>>>>>>>> that
>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>>>>>> have
>>>>>>>>>> received this communication in error, please contact the sender
>>>>>>>> immediately
>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or
>>> entity to 
>>> which it is addressed and may contain information that is confidential,
>>> privileged and exempt from disclosure under applicable law. If the
>>> reader 
>>> of this message is not the intended recipient, you are hereby notified
>>> that 
>>> any printing, copying, dissemination, distribution, disclosure or
>>> forwarding of this communication is strictly prohibited. If you have
>>> received this communication in error, please contact the sender
>>> immediately 
>>> and delete it from your system. Thank You.
>> 
> 
> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

I am trying to re-produce this issue on our sandbox cluster. No matter how I try to insert the same data (random, sorted, reverse sorted, bulk, with upserrts, etc) I don't see this problem. I'll be copying the actual hfiles of the problematic table over next to see if that can do it.

-chris


On May 21, 2014, at 3:58 PM, Enis Soztutar <en...@hortonworks.com> wrote:

> Hey Chris, 
> 
> Thanks for bearing with us. It seems that the FAST_DIFF encoding + your data schema + use of getRowOrBefore() caused the issue. That is why Jeffrey was asking for the data. It is fine. getRowOrBefore() will go away anyway (https://issues.apache.org/jira/browse/HBASE-11230). Jeffrey's patch changes Phoenix to use a "reverse scan" which should not have an issue with FAST_DIFF. It would be awesome if you can test it with your data so that we can see whether it is a problem with reverse lookups in FAST_DIFF encoding, or a specific problem with getRowOrBefore()
> 
> Thanks for reporting. 
> Enis
> 
> 
> On Wed, May 21, 2014 at 3:44 PM, Chris Tarnas <cf...@biotiquesystems.com> wrote:
> I'll start on building a patched version then trying it out (Now that it is working I'll have to wrangle the cluster away from the users). I cannot send the data, it private data.
> 
> thanks,
> -chris
> 
> 
> On May 21, 2014, at 3:04 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
> 
> >
> > Thanks for reporting the status back. Mind to try the patch on phoenix-992
> > or attach the data which can repro the issue so that we can verify current
> > fix. Thanks.
> >
> > On 5/21/14 2:06 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >
> >>
> >> Altering the table to not use a DATA_BLOCK_ENCODING fixed the problem.
> >> Just to verify that it was the fix, I altered it back to FAST_DIFF and
> >> the problem re-appeared. Thank you!
> >>
> >> Adding the changes to the client hbase-site.xml did not help the issue,
> >> but it may be I was not sourcing it correctly in sqlline. I made sure it
> >> was in the classpath being passed to java via -cp when launching it.
> >>
> >> thanks again,
> >> -chris
> >>
> >> On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
> >>
> >>>
> >>>
> >>> Hey Chris,
> >>>
> >>> Sorry for the trace option. The right option should be below. Before you
> >>> were able to turn on DEBUG, you just need to change it to TRACE.
> >>>
> >>>   log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
> >>>
> >>>
> >>>
> >>> Do you see that the two read requests burn one of your CPU to 100%? From
> >>> the call stacks you posted today and before, it seems there might be an
> >>> infinite loop inside the function Htable.getRowOrBefore() which is
> >>> deprecated and supposed to only work for meta region.(brought by Enis).
> >>> Fortunately there is only one place in phoenix call this function, so
> >>> could you try the following things?
> >>>
> >>> set following configuration in client side hbase-site.xml to disable
> >>> stats
> >>> update
> >>>  phoenix.query.statsUpdateFrequency = 2147483647
> >>>  phoenix.query.maxStatsAge = 0
> >>>
> >>> After the configuration changes, could you verify if the CPU is still
> >>> spiking for the same query?
> >>>
> >>> You can also do the following:
> >>>
> >>> Alter your table's ColumnFamily data block encoding to NONE as
> >>> DATA_BLOCK_ENCODING => 'NONE' and do a major compaction after that
> >>> through
> >>> Hbase client.
> >>>
> >>>
> >>> We suspect that the combination of your data, FAST_DIFF data block
> >>> encoding & the deprecated Htable.getRowOrBefore() cause the issue.
> >>>
> >>>
> >>>
> >>> On 5/20/14 3:03 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >>>
> >>>> Ho Jeffrey.
> >>>>
> >>>> Thank you for taking a look. I added
> >>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the
> >>>> Custom
> >>>> log4j.properties in Ambari and did a rolling restart of the
> >>>> regionservers
> >>>> (and verified it was pushed out the nodes in
> >>>> /etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on
> >>>> the
> >>>> table. Looking at the logs I don't see any TRACE messages. The only two
> >>>> log messages that happened when issuing the select are:
> >>>>
> >>>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
> >>>> coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
> >>>> coprocessor scan
> >>>>
> >>>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
> >>>> 00
> >>>>
> >>>> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
> >>>> ca
> >>>>
> >>>> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
> >>>> ":
> >>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
> >>>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
> >>>> coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
> >>>> rows for ungrouped coprocessor scan
> >>>>
> >>>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
> >>>> 00
> >>>>
> >>>> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
> >>>> ca
> >>>>
> >>>> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
> >>>> ":
> >>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
> >>>>
> >>>> The metrics for the regionserver show only 2 total read requests since
> >>>> it
> >>>> started and 0 per second now. Those two requests are for the table I
> >>>> did
> >>>> the count on. The regionserver's UI says it has no active RPC calls.
> >>>> It
> >>>> currently has one stuck handler from running that command:
> >>>>
> >>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
> >>>> nid=0x2582 runnable [0x00007fbd5ba6c000]
> >>>> java.lang.Thread.State: RUNNABLE
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
> >>>> :2
> >>>> 888)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
> >>>> :2
> >>>> 880)
> >>>>    at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
> >>>>    at java.util.TreeMap.getEntry(TreeMap.java:340)
> >>>>    at java.util.TreeMap.get(TreeMap.java:273)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDelete
> >>>> d(
> >>>> GetClosestRowBeforeTracker.java:128)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandi
> >>>> da
> >>>> te(GetClosestRowBeforeTracker.java:107)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(G
> >>>> et
> >>>> ClosestRowBeforeTracker.java:202)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
> >>>> e.
> >>>> java:1772)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
> >>>> St
> >>>> ore.java:1722)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
> >>>> av
> >>>> a:1655)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
> >>>> .j
> >>>> ava:1826)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
> >>>> a:
> >>>> 2841)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
> >>>> ca
> >>>> llBlockingMethod(ClientProtos.java:28857)
> >>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
> >>>> ed
> >>>> uler.java:160)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
> >>>> ul
> >>>> er.java:38)
> >>>>    at
> >>>>
> >>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
> >>>> ja
> >>>> va:110)
> >>>>    at java.lang.Thread.run(Thread.java:744)
> >>>>
> >>>> It also has one listener running:
> >>>>
> >>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
> >>>> nid=0x2579 runnable [0x00007fbd5c376000]
> >>>> java.lang.Thread.State: RUNNABLE
> >>>>    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>>    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> >>>>    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
> >>>>    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
> >>>>    - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
> >>>>    - locked <0x000000059ae289a0> (a
> >>>> java.util.Collections$UnmodifiableSet)
> >>>>    - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
> >>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
> >>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
> >>>>    at
> >>>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
> >>>>
> >>>> If there is anything else you need, please let me know.
> >>>>
> >>>> thanks again,
> >>>> -chris
> >>>>
> >>>>
> >>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>> I rechecked your RPC log and found that you didn't turn on the RPC
> >>>>> trace
> >>>>> level. You can set the following so that we can see detail request &
> >>>>> response.
> >>>>>
> >>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
> >>>>>
> >>>>> If you didn't see time out exception in your region server logs, it
> >>>>> means
> >>>>> that there are something keeping issuing lots of Get requests. Have
> >>>>> you
> >>>>> used secondary index for this table? If that's the case, please
> >>>>> disable/drop it to see if that helps.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >>>>>
> >>>>>> I will see how possible a new, all bare metal cluster is, but we
> >>>>>> have 5
> >>>>>> other similarly configured clusters with no troubles. We only have an
> >>>>>> issue when Phoenix does a full scan of some tables that a handler
> >>>>>> gets
> >>>>>> stuck in
> >>>>>>
> >>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
> >>>>>> after it has returned all of its data.
> >>>>>>
> >>>>>> thank you,
> >>>>>>
> >>>>>> -chris
> >>>>>>
> >>>>>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> i'm just guessing but may be region servers are stuck because they
> >>>>>>> can't
> >>>>>>> communicate with zookeeper or hmaster, or may be datanodes can't
> >>>>>>> talk
> >>>>>>> to
> >>>>>>> namenode because of some vmware networking quirk and some internal
> >>>>>>> thread
> >>>>>>> is hanging, who knows,
> >>>>>>> it would take 30 min to setup up bare metal cluster vs days
> >>>>>>> debugging
> >>>>>>> this
> >>>>>>> hybrid setup
> >>>>>>> also I'd suggest to install directly from hbase/hadoop websites  to
> >>>>>>> eliminate additional variables
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
> >>>>>>> <cf...@biotiquesystems.com>wrote:
> >>>>>>>
> >>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
> >>>>>>>> datanodes/regionservers with the stuck processes all bare metal.
> >>>>>>>>
> >>>>>>>> -chris
> >>>>>>>>
> >>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>> "with 3 VMWare "master" nodes"
> >>>>>>>>>
> >>>>>>>>> can you try running hbase/phoenix on physical nodes instead of
> >>>>>>>>> using
> >>>>>>>>> virtual machines
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
> >>>>>>>>> <cft@biotiquesystems.com
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Sorry to follow up on my own message, but I was wondering if
> >>>>>>>>>> anyone
> >>>>>>>>>> had
> >>>>>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
> >>>>>>>>>> right
> >>>>>>>>>> after a select * on the exact same table will.
> >>>>>>>>>>
> >>>>>>>>>> If we export the table and then re-import it into a new table,
> >>>>>>>>>> the
> >>>>>>>>>> new
> >>>>>>>>>> table doesn't exhibit these symptoms, same as if we use an
> >>>>>>>> upsert..select
> >>>>>>>>>> to do a copy. It seems something happens to the last region to
> >>>>>>>>>> cause
> >>>>>>>> this,
> >>>>>>>>>> but it is not directly data dependent. Moving the region to
> >>>>>>>>>> another
> >>>>>>>>>> regionserver doesn't have any effect - just moves where the
> >>>>>>>>>> problem
> >>>>>>>>>> happens. Major compactions get hung up by the running threads as
> >>>>>>>>>> they
> >>>>>>>>>> probably have a lock.
> >>>>>>>>>>
> >>>>>>>>>> I've run the hfile tool on the final region and nothing seems
> >>>>>>>>>> awry.
> >>>>>>>>>>
> >>>>>>>>>> Figuring this out will allow this project to continue, as of now
> >>>>>>>>>> it
> >>>>>>>>>> is
> >>>>>>>>>> hung up on this issue.
> >>>>>>>>>>
> >>>>>>>>>> thank you,
> >>>>>>>>>> -chris
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas
> >>>>>>>>>> <cf...@biotiquesystems.com>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I did some "poor mans" profiling with multiple jstack and came
> >>>>>>>>>>> up
> >>>>>>>>>>> with
> >>>>>>>>>> where the RpcServer.handler appears to be stuck:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStore
> >>>>>>>>>> Fi
> >>>>>>>>>> le
> >>>>>>>>>> .
> >>>>>>>>>> That is the last deepest method in all of the traces, either
> >>>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
> >>>>>>>>>> for a
> >>>>>>>>>> thread (which has been running for the last couple hours):
> >>>>>>>>>>>
> >>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
> >>>>>>>>>>> tid=0x0000000000cdb800
> >>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>>>>>>>>> java.lang.Thread.State: RUNNABLE
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNex
> >>>>>>>> t(
> >>>>>>>> Fa
> >>>>>>>> stDiffDeltaEncoder.java:540)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$Buffere
> >>>>>>>> dE
> >>>>>>>> nc
> >>>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next
> >>>>>>>> (H
> >>>>>>>> Fi
> >>>>>>>> leReaderV2.java:1063)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(H
> >>>>>>>> St
> >>>>>>>> or
> >>>>>>>> e.java:1776)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
> >>>>>>>> le
> >>>>>>>> (H
> >>>>>>>> Store.java:1722)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
> >>>>>>>> re
> >>>>>>>> .j
> >>>>>>>> ava:1655)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
> >>>>>>>> gi
> >>>>>>>> on
> >>>>>>>> .java:1826)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
> >>>>>>>> .j
> >>>>>>>> av
> >>>>>>>> a:2841)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
> >>>>>>>> e$
> >>>>>>>> 2.
> >>>>>>>> callBlockingMethod(ClientProtos.java:28857)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
> >>>>>>>> cS
> >>>>>>>> ch
> >>>>>>>> eduler.java:160)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
> >>>>>>>> ch
> >>>>>>>> ed
> >>>>>>>> uler.java:38)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
> >>>>>>>> le
> >>>>>>>> r.
> >>>>>>>> java:110)
> >>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
> >>>>>>>>>>>
> >>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
> >>>>>>>>>>> tid=0x0000000000cdb800
> >>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>>>>>>>>> java.lang.Thread.State: RUNNABLE
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:
> >>>>>>>> 19
> >>>>>>>> 44
> >>>>>>>> )
> >>>>>>>>>>>  at
> >>>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.ro
> >>>>>>>> ot
> >>>>>>>> Bl
> >>>>>>>> ockContainingKey(HFileBlockIndex.java:392)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.lo
> >>>>>>>> ad
> >>>>>>>> Da
> >>>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.se
> >>>>>>>> ek
> >>>>>>>> To
> >>>>>>>> DataBlock(HFileBlockIndex.java:179)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.see
> >>>>>>>> kB
> >>>>>>>> ef
> >>>>>>>> ore(HFileReaderV2.java:548)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
> >>>>>>>> le
> >>>>>>>> (H
> >>>>>>>> Store.java:1712)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
> >>>>>>>> re
> >>>>>>>> .j
> >>>>>>>> ava:1655)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
> >>>>>>>> gi
> >>>>>>>> on
> >>>>>>>> .java:1826)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
> >>>>>>>> .j
> >>>>>>>> av
> >>>>>>>> a:2841)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
> >>>>>>>> e$
> >>>>>>>> 2.
> >>>>>>>> callBlockingMethod(ClientProtos.java:28857)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
> >>>>>>>> cS
> >>>>>>>> ch
> >>>>>>>> eduler.java:160)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
> >>>>>>>> ch
> >>>>>>>> ed
> >>>>>>>> uler.java:38)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
> >>>>>>>> le
> >>>>>>>> r.
> >>>>>>>> java:110)
> >>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong
> >>>>>>>>>>> <jz...@hortonworks.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hey Chris,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I used performance.py tool which created a table with 50K rows
> >>>>>>>>>>>> in
> >>>>>>>>>>>> one
> >>>>>>>>>>>> table, run the following query from sqlline.py and everything
> >>>>>>>>>>>> seems
> >>>>>>>> fine
> >>>>>>>>>>>> without seeing CPU running hot.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
> >>>>>>>>>>>> from
> >>>>>>>>>>>> PERFORMANCE_50000;
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> |  COUNT(1)  |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> | 50000      |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> 1 row selected (0.166 seconds)
> >>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
> >>>>>>>>>>>> from
> >>>>>>>>>>>> PERFORMANCE_50000;
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> |  COUNT(1)  |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> | 50000      |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> 1 row selected (0.167 seconds)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is there anyway could you run profiler to see where the CPU
> >>>>>>>>>>>> goes?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> http://pastebin.com/w6mtabag
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> thanks again,
> >>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Chris,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
> >>>>>>>>>>>>>> pastebin it
> >>>>>>>>>>>>>> some
> >>>>>>>>>>>>>> place?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Nick
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
> >>>>>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to
> >>>>>>>>>>>>>>> have
> >>>>>>>> been
> >>>>>>>>>>>>>>> set
> >>>>>>>>>>>>>>> to 60 by Ambari during install process). Now we have
> >>>>>>>>>>>>>>> narrowed
> >>>>>>>>>>>>>>> down
> >>>>>>>>>> what
> >>>>>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
> >>>>>>>>>>>>>>> selects
> >>>>>>>> one
> >>>>>>>>>>>>>>> row
> >>>>>>>>>>>>>>> using the primary_key, no increate in CPU is observed and
> >>>>>>>>>>>>>>> the
> >>>>>>>> number
> >>>>>>>>>>>>>>> of RPC
> >>>>>>>>>>>>>>> threads in a RUNNABLE state remains the same.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we execute a select that scans the table such as "select
> >>>>>>>> count(*)
> >>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>> TABLE" or where the "where" clause only limits on
> >>>>>>>>>>>>>>> non-primary
> >>>>>>>>>>>>>>> key
> >>>>>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
> >>>>>>>>>>>>>>> threads
> >>>>>>>>>>>>>>> increases
> >>>>>>>>>>>>>>> and the CPU utilization of the regionserver increases by
> >>>>>>>>>>>>>>> ~105%.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Disconnecting the client does not have an effect and the
> >>>>>>>>>>>>>>> RpcServer.handler
> >>>>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher
> >>>>>>>>>>>>>>> usage.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
> >>>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
> >>>>>>>>>>>>>>> monitored
> >>>>>>>>>>>>>>> tasks
> >>>>>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a
> >>>>>>>>>>>>>>> Used
> >>>>>>>> heap
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>> 445.2M.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
> >>>>>>>>>>>>>>> turned on
> >>>>>>>>>>>>>>> right
> >>>>>>>>>>>>>>> when one of the Phoenix statements is executed (this
> >>>>>>>>>>>>>>> statement
> >>>>>>>>>> actually
> >>>>>>>>>>>>>>> used up the last available handler).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> thanks,
> >>>>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
> >>>>>>>>>>>>>>> <jzhong@hortonworks.com
> >>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> From the stack, it seems you increase the default rpc
> >>>>>>>>>>>>>>>> handler
> >>>>>>>> number
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> about 60. All handlers are serving Get request(You can
> >>>>>>>>>>>>>>>> search
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServ
> >>>>>>>>>> er
> >>>>>>>>>> .j
> >>>>>>>>>> av
> >>>>>>>>>>>>>>> a:2
> >>>>>>>>>>>>>>>> 841).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> You can check why there are so many get requests by adding
> >>>>>>>>>>>>>>>> some
> >>>>>>>> log
> >>>>>>>>>>>>>>>> info
> >>>>>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the
> >>>>>>>>>>>>>>>> number
> >>>>>>>>>>>>>>>> of
> >>>>>>>> rpc
> >>>>>>>>>>>>>>>> handlers per region server, it will mitigate your current
> >>>>>>>>>>>>>>>> issue.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas"
> >>>>>>>>>>>>>>>> <cf...@biotiquesystems.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU
> >>>>>>>>>>>>>>>>> usage
> >>>>>>>>>>>>>>>>> spiking
> >>>>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node
> >>>>>>>>>>>>>>>>> cluster
> >>>>>>>> (with
> >>>>>>>>>> 3
> >>>>>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
> >>>>>>>>>>>>>>>>> plain
> >>>>>>>>>>>>>>>>> Ambari
> >>>>>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM
> >>>>>>>>>>>>>>>>> release
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>> hand
> >>>>>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
> >>>>>>>>>>>>>>>>> going
> >>>>>>>>>> well
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> we were able to load in ~30k records into several tables.
> >>>>>>>>>>>>>>>>> What
> >>>>>>>>>>>>>>>>> happened
> >>>>>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers
> >>>>>>>>>>>>>>>>> became
> >>>>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
> >>>>>>>>>>>>>>>>> (12
> >>>>>>>> core
> >>>>>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
> >>>>>>>>>>>>>>>>> (initially
> >>>>>>>> we
> >>>>>>>>>>>>>>>>> saw
> >>>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was
> >>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>> all
> >>>>>>>> of
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec).
> >>>>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>> able
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them,
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cluster
> >>>>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up
> >>>>>>>>>>>>>>>>> again.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information
> >>>>>>>>>>>>>>>>> and
> >>>>>>>> started
> >>>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>> scratch, but that has not helped.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've
> >>>>>>>>>>>>>>>>> attached a
> >>>>>>>> jstack
> >>>>>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> shed
> >>>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>> light.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> thanks,
> >>>>>>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> CONFIDENTIALITY NOTICE
> >>>>>>>>>>>>>>>> NOTICE: This message is intended for the use of the
> >>>>>>>>>>>>>>>> individual
> >>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> entity
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> which it is addressed and may contain information that is
> >>>>>>>>>>>>>>>> confidential,
> >>>>>>>>>>>>>>>> privileged and exempt from disclosure under applicable law.
> >>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> reader
> >>>>>>>>>>>>>>>> of this message is not the intended recipient, you are
> >>>>>>>>>>>>>>>> hereby
> >>>>>>>>>> notified
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> any printing, copying, dissemination, distribution,
> >>>>>>>>>>>>>>>> disclosure
> >>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If
> >>>>>>>>>>>>>>>> you
> >>>>>>>> have
> >>>>>>>>>>>>>>>> received this communication in error, please contact the
> >>>>>>>>>>>>>>>> sender
> >>>>>>>>>>>>>>> immediately
> >>>>>>>>>>>>>>>> and delete it from your system. Thank You.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> CONFIDENTIALITY NOTICE
> >>>>>>>>>>>> NOTICE: This message is intended for the use of the individual
> >>>>>>>>>>>> or
> >>>>>>>>>> entity to
> >>>>>>>>>>>> which it is addressed and may contain information that is
> >>>>>>>> confidential,
> >>>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
> >>>>>>>>>>>> the
> >>>>>>>>>> reader
> >>>>>>>>>>>> of this message is not the intended recipient, you are hereby
> >>>>>>>>>>>> notified
> >>>>>>>>>> that
> >>>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure
> >>>>>>>>>>>> or
> >>>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
> >>>>>>>>>>>> have
> >>>>>>>>>>>> received this communication in error, please contact the sender
> >>>>>>>>>> immediately
> >>>>>>>>>>>> and delete it from your system. Thank You.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> CONFIDENTIALITY NOTICE
> >>>>> NOTICE: This message is intended for the use of the individual or
> >>>>> entity to
> >>>>> which it is addressed and may contain information that is
> >>>>> confidential,
> >>>>> privileged and exempt from disclosure under applicable law. If the
> >>>>> reader
> >>>>> of this message is not the intended recipient, you are hereby notified
> >>>>> that
> >>>>> any printing, copying, dissemination, distribution, disclosure or
> >>>>> forwarding of this communication is strictly prohibited. If you have
> >>>>> received this communication in error, please contact the sender
> >>>>> immediately
> >>>>> and delete it from your system. Thank You.
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> CONFIDENTIALITY NOTICE
> >>> NOTICE: This message is intended for the use of the individual or
> >>> entity to
> >>> which it is addressed and may contain information that is confidential,
> >>> privileged and exempt from disclosure under applicable law. If the
> >>> reader
> >>> of this message is not the intended recipient, you are hereby notified
> >>> that
> >>> any printing, copying, dissemination, distribution, disclosure or
> >>> forwarding of this communication is strictly prohibited. If you have
> >>> received this communication in error, please contact the sender
> >>> immediately
> >>> and delete it from your system. Thank You.
> >>
> >
> >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender immediately
> > and delete it from your system. Thank You.
> 
> 
> 
> 
> -- 
> 
> 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Enis Soztutar <en...@hortonworks.com>.

Hey Chris,

Thanks for bearing with us. It seems that the FAST_DIFF encoding + your
data schema + use of getRowOrBefore() caused the issue. That is why Jeffrey
was asking for the data. It is fine. getRowOrBefore() will go away anyway (
https://issues.apache.org/jira/browse/HBASE-11230). Jeffrey's patch changes
Phoenix to use a "reverse scan" which should not have an issue with
FAST_DIFF. It would be awesome if you can test it with your data so that we
can see whether it is a problem with reverse lookups in FAST_DIFF encoding,
or a specific problem with getRowOrBefore()

Thanks for reporting.
Enis


On Wed, May 21, 2014 at 3:44 PM, Chris Tarnas <cf...@biotiquesystems.com>wrote:

> I'll start on building a patched version then trying it out (Now that it
> is working I'll have to wrangle the cluster away from the users). I cannot
> send the data, it private data.
>
> thanks,
> -chris
>
>
> On May 21, 2014, at 3:04 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>
> >
> > Thanks for reporting the status back. Mind to try the patch on
> phoenix-992
> > or attach the data which can repro the issue so that we can verify
> current
> > fix. Thanks.
> >
> > On 5/21/14 2:06 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >
> >>
> >> Altering the table to not use a DATA_BLOCK_ENCODING fixed the problem.
> >> Just to verify that it was the fix, I altered it back to FAST_DIFF and
> >> the problem re-appeared. Thank you!
> >>
> >> Adding the changes to the client hbase-site.xml did not help the issue,
> >> but it may be I was not sourcing it correctly in sqlline. I made sure it
> >> was in the classpath being passed to java via -cp when launching it.
> >>
> >> thanks again,
> >> -chris
> >>
> >> On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jz...@hortonworks.com>
> wrote:
> >>
> >>>
> >>>
> >>> Hey Chris,
> >>>
> >>> Sorry for the trace option. The right option should be below. Before
> you
> >>> were able to turn on DEBUG, you just need to change it to TRACE.
> >>>
> >>>   log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
> >>>
> >>>
> >>>
> >>> Do you see that the two read requests burn one of your CPU to 100%?
> From
> >>> the call stacks you posted today and before, it seems there might be an
> >>> infinite loop inside the function Htable.getRowOrBefore() which is
> >>> deprecated and supposed to only work for meta region.(brought by Enis).
> >>> Fortunately there is only one place in phoenix call this function, so
> >>> could you try the following things?
> >>>
> >>> set following configuration in client side hbase-site.xml to disable
> >>> stats
> >>> update
> >>>  phoenix.query.statsUpdateFrequency = 2147483647
> >>>  phoenix.query.maxStatsAge = 0
> >>>
> >>> After the configuration changes, could you verify if the CPU is still
> >>> spiking for the same query?
> >>>
> >>> You can also do the following:
> >>>
> >>> Alter your table's ColumnFamily data block encoding to NONE as
> >>> DATA_BLOCK_ENCODING => 'NONE' and do a major compaction after that
> >>> through
> >>> Hbase client.
> >>>
> >>>
> >>> We suspect that the combination of your data, FAST_DIFF data block
> >>> encoding & the deprecated Htable.getRowOrBefore() cause the issue.
> >>>
> >>>
> >>>
> >>> On 5/20/14 3:03 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >>>
> >>>> Ho Jeffrey.
> >>>>
> >>>> Thank you for taking a look. I added
> >>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the
> >>>> Custom
> >>>> log4j.properties in Ambari and did a rolling restart of the
> >>>> regionservers
> >>>> (and verified it was pushed out the nodes in
> >>>> /etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on
> >>>> the
> >>>> table. Looking at the logs I don't see any TRACE messages. The only
> two
> >>>> log messages that happened when issuing the select are:
> >>>>
> >>>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
> >>>> coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
> >>>> coprocessor scan
> >>>>
> >>>>
> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
> >>>> 00
> >>>>
> >>>>
> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
> >>>> ca
> >>>>
> >>>>
> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
> >>>> ":
> >>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
> >>>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
> >>>> coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
> >>>> rows for ungrouped coprocessor scan
> >>>>
> >>>>
> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
> >>>> 00
> >>>>
> >>>>
> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
> >>>> ca
> >>>>
> >>>>
> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
> >>>> ":
> >>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
> >>>>
> >>>> The metrics for the regionserver show only 2 total read requests since
> >>>> it
> >>>> started and 0 per second now. Those two requests are for the table I
> >>>> did
> >>>> the count on. The regionserver's UI says it has no active RPC calls.
> >>>> It
> >>>> currently has one stuck handler from running that command:
> >>>>
> >>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
> >>>> nid=0x2582 runnable [0x00007fbd5ba6c000]
> >>>> java.lang.Thread.State: RUNNABLE
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
> >>>> :2
> >>>> 888)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
> >>>> :2
> >>>> 880)
> >>>>    at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
> >>>>    at java.util.TreeMap.getEntry(TreeMap.java:340)
> >>>>    at java.util.TreeMap.get(TreeMap.java:273)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDelete
> >>>> d(
> >>>> GetClosestRowBeforeTracker.java:128)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandi
> >>>> da
> >>>> te(GetClosestRowBeforeTracker.java:107)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(G
> >>>> et
> >>>> ClosestRowBeforeTracker.java:202)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
> >>>> e.
> >>>> java:1772)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
> >>>> St
> >>>> ore.java:1722)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
> >>>> av
> >>>> a:1655)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
> >>>> .j
> >>>> ava:1826)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
> >>>> a:
> >>>> 2841)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
> >>>> ca
> >>>> llBlockingMethod(ClientProtos.java:28857)
> >>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
> >>>> ed
> >>>> uler.java:160)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
> >>>> ul
> >>>> er.java:38)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
> >>>> ja
> >>>> va:110)
> >>>>    at java.lang.Thread.run(Thread.java:744)
> >>>>
> >>>> It also has one listener running:
> >>>>
> >>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
> >>>> nid=0x2579 runnable [0x00007fbd5c376000]
> >>>> java.lang.Thread.State: RUNNABLE
> >>>>    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>>    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> >>>>    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
> >>>>    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
> >>>>    - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
> >>>>    - locked <0x000000059ae289a0> (a
> >>>> java.util.Collections$UnmodifiableSet)
> >>>>    - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
> >>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
> >>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
> >>>>    at
> >>>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
> >>>>
> >>>> If there is anything else you need, please let me know.
> >>>>
> >>>> thanks again,
> >>>> -chris
> >>>>
> >>>>
> >>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>> I rechecked your RPC log and found that you didn't turn on the RPC
> >>>>> trace
> >>>>> level. You can set the following so that we can see detail request &
> >>>>> response.
> >>>>>
> >>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
> >>>>>
> >>>>> If you didn't see time out exception in your region server logs, it
> >>>>> means
> >>>>> that there are something keeping issuing lots of Get requests. Have
> >>>>> you
> >>>>> used secondary index for this table? If that's the case, please
> >>>>> disable/drop it to see if that helps.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >>>>>
> >>>>>> I will see how possible a new, all bare metal cluster is, but we
> >>>>>> have 5
> >>>>>> other similarly configured clusters with no troubles. We only have
> an
> >>>>>> issue when Phoenix does a full scan of some tables that a handler
> >>>>>> gets
> >>>>>> stuck in
> >>>>>>
> >>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
> >>>>>> after it has returned all of its data.
> >>>>>>
> >>>>>> thank you,
> >>>>>>
> >>>>>> -chris
> >>>>>>
> >>>>>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> i'm just guessing but may be region servers are stuck because they
> >>>>>>> can't
> >>>>>>> communicate with zookeeper or hmaster, or may be datanodes can't
> >>>>>>> talk
> >>>>>>> to
> >>>>>>> namenode because of some vmware networking quirk and some internal
> >>>>>>> thread
> >>>>>>> is hanging, who knows,
> >>>>>>> it would take 30 min to setup up bare metal cluster vs days
> >>>>>>> debugging
> >>>>>>> this
> >>>>>>> hybrid setup
> >>>>>>> also I'd suggest to install directly from hbase/hadoop websites  to
> >>>>>>> eliminate additional variables
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
> >>>>>>> <cf...@biotiquesystems.com>wrote:
> >>>>>>>
> >>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
> >>>>>>>> datanodes/regionservers with the stuck processes all bare metal.
> >>>>>>>>
> >>>>>>>> -chris
> >>>>>>>>
> >>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>> "with 3 VMWare "master" nodes"
> >>>>>>>>>
> >>>>>>>>> can you try running hbase/phoenix on physical nodes instead of
> >>>>>>>>> using
> >>>>>>>>> virtual machines
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
> >>>>>>>>> <cft@biotiquesystems.com
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Sorry to follow up on my own message, but I was wondering if
> >>>>>>>>>> anyone
> >>>>>>>>>> had
> >>>>>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom,
> but
> >>>>>>>>>> right
> >>>>>>>>>> after a select * on the exact same table will.
> >>>>>>>>>>
> >>>>>>>>>> If we export the table and then re-import it into a new table,
> >>>>>>>>>> the
> >>>>>>>>>> new
> >>>>>>>>>> table doesn't exhibit these symptoms, same as if we use an
> >>>>>>>> upsert..select
> >>>>>>>>>> to do a copy. It seems something happens to the last region to
> >>>>>>>>>> cause
> >>>>>>>> this,
> >>>>>>>>>> but it is not directly data dependent. Moving the region to
> >>>>>>>>>> another
> >>>>>>>>>> regionserver doesn't have any effect - just moves where the
> >>>>>>>>>> problem
> >>>>>>>>>> happens. Major compactions get hung up by the running threads as
> >>>>>>>>>> they
> >>>>>>>>>> probably have a lock.
> >>>>>>>>>>
> >>>>>>>>>> I've run the hfile tool on the final region and nothing seems
> >>>>>>>>>> awry.
> >>>>>>>>>>
> >>>>>>>>>> Figuring this out will allow this project to continue, as of now
> >>>>>>>>>> it
> >>>>>>>>>> is
> >>>>>>>>>> hung up on this issue.
> >>>>>>>>>>
> >>>>>>>>>> thank you,
> >>>>>>>>>> -chris
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas
> >>>>>>>>>> <cf...@biotiquesystems.com>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I did some "poor mans" profiling with multiple jstack and came
> >>>>>>>>>>> up
> >>>>>>>>>>> with
> >>>>>>>>>> where the RpcServer.handler appears to be stuck:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStore
> >>>>>>>>>> Fi
> >>>>>>>>>> le
> >>>>>>>>>> .
> >>>>>>>>>> That is the last deepest method in all of the traces, either
> >>>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example
> traces
> >>>>>>>>>> for a
> >>>>>>>>>> thread (which has been running for the last couple hours):
> >>>>>>>>>>>
> >>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
> >>>>>>>>>>> tid=0x0000000000cdb800
> >>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>>>>>>>>> java.lang.Thread.State: RUNNABLE
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNex
> >>>>>>>> t(
> >>>>>>>> Fa
> >>>>>>>> stDiffDeltaEncoder.java:540)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$Buffere
> >>>>>>>> dE
> >>>>>>>> nc
> >>>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next
> >>>>>>>> (H
> >>>>>>>> Fi
> >>>>>>>> leReaderV2.java:1063)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(H
> >>>>>>>> St
> >>>>>>>> or
> >>>>>>>> e.java:1776)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
> >>>>>>>> le
> >>>>>>>> (H
> >>>>>>>> Store.java:1722)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
> >>>>>>>> re
> >>>>>>>> .j
> >>>>>>>> ava:1655)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
> >>>>>>>> gi
> >>>>>>>> on
> >>>>>>>> .java:1826)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
> >>>>>>>> .j
> >>>>>>>> av
> >>>>>>>> a:2841)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
> >>>>>>>> e$
> >>>>>>>> 2.
> >>>>>>>> callBlockingMethod(ClientProtos.java:28857)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
> >>>>>>>> cS
> >>>>>>>> ch
> >>>>>>>> eduler.java:160)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
> >>>>>>>> ch
> >>>>>>>> ed
> >>>>>>>> uler.java:38)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
> >>>>>>>> le
> >>>>>>>> r.
> >>>>>>>> java:110)
> >>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
> >>>>>>>>>>>
> >>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
> >>>>>>>>>>> tid=0x0000000000cdb800
> >>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>>>>>>>>> java.lang.Thread.State: RUNNABLE
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:
> >>>>>>>> 19
> >>>>>>>> 44
> >>>>>>>> )
> >>>>>>>>>>>  at
> >>>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.ro
> >>>>>>>> ot
> >>>>>>>> Bl
> >>>>>>>> ockContainingKey(HFileBlockIndex.java:392)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.lo
> >>>>>>>> ad
> >>>>>>>> Da
> >>>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.se
> >>>>>>>> ek
> >>>>>>>> To
> >>>>>>>> DataBlock(HFileBlockIndex.java:179)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.see
> >>>>>>>> kB
> >>>>>>>> ef
> >>>>>>>> ore(HFileReaderV2.java:548)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
> >>>>>>>> le
> >>>>>>>> (H
> >>>>>>>> Store.java:1712)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
> >>>>>>>> re
> >>>>>>>> .j
> >>>>>>>> ava:1655)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
> >>>>>>>> gi
> >>>>>>>> on
> >>>>>>>> .java:1826)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
> >>>>>>>> .j
> >>>>>>>> av
> >>>>>>>> a:2841)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
> >>>>>>>> e$
> >>>>>>>> 2.
> >>>>>>>> callBlockingMethod(ClientProtos.java:28857)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
> >>>>>>>> cS
> >>>>>>>> ch
> >>>>>>>> eduler.java:160)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
> >>>>>>>> ch
> >>>>>>>> ed
> >>>>>>>> uler.java:38)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
> >>>>>>>> le
> >>>>>>>> r.
> >>>>>>>> java:110)
> >>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong
> >>>>>>>>>>> <jz...@hortonworks.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hey Chris,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I used performance.py tool which created a table with 50K rows
> >>>>>>>>>>>> in
> >>>>>>>>>>>> one
> >>>>>>>>>>>> table, run the following query from sqlline.py and everything
> >>>>>>>>>>>> seems
> >>>>>>>> fine
> >>>>>>>>>>>> without seeing CPU running hot.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
> >>>>>>>>>>>> from
> >>>>>>>>>>>> PERFORMANCE_50000;
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> |  COUNT(1)  |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> | 50000      |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> 1 row selected (0.166 seconds)
> >>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
> >>>>>>>>>>>> from
> >>>>>>>>>>>> PERFORMANCE_50000;
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> |  COUNT(1)  |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> | 50000      |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> 1 row selected (0.167 seconds)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is there anyway could you run profiler to see where the CPU
> >>>>>>>>>>>> goes?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> http://pastebin.com/w6mtabag
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> thanks again,
> >>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <
> ndimiduk@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Chris,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
> >>>>>>>>>>>>>> pastebin it
> >>>>>>>>>>>>>> some
> >>>>>>>>>>>>>> place?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Nick
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
> >>>>>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to
> >>>>>>>>>>>>>>> have
> >>>>>>>> been
> >>>>>>>>>>>>>>> set
> >>>>>>>>>>>>>>> to 60 by Ambari during install process). Now we have
> >>>>>>>>>>>>>>> narrowed
> >>>>>>>>>>>>>>> down
> >>>>>>>>>> what
> >>>>>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
> >>>>>>>>>>>>>>> selects
> >>>>>>>> one
> >>>>>>>>>>>>>>> row
> >>>>>>>>>>>>>>> using the primary_key, no increate in CPU is observed and
> >>>>>>>>>>>>>>> the
> >>>>>>>> number
> >>>>>>>>>>>>>>> of RPC
> >>>>>>>>>>>>>>> threads in a RUNNABLE state remains the same.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we execute a select that scans the table such as "select
> >>>>>>>> count(*)
> >>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>> TABLE" or where the "where" clause only limits on
> >>>>>>>>>>>>>>> non-primary
> >>>>>>>>>>>>>>> key
> >>>>>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
> >>>>>>>>>>>>>>> threads
> >>>>>>>>>>>>>>> increases
> >>>>>>>>>>>>>>> and the CPU utilization of the regionserver increases by
> >>>>>>>>>>>>>>> ~105%.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Disconnecting the client does not have an effect and the
> >>>>>>>>>>>>>>> RpcServer.handler
> >>>>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher
> >>>>>>>>>>>>>>> usage.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
> >>>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
> >>>>>>>>>>>>>>> monitored
> >>>>>>>>>>>>>>> tasks
> >>>>>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a
> >>>>>>>>>>>>>>> Used
> >>>>>>>> heap
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>> 445.2M.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
> >>>>>>>>>>>>>>> turned on
> >>>>>>>>>>>>>>> right
> >>>>>>>>>>>>>>> when one of the Phoenix statements is executed (this
> >>>>>>>>>>>>>>> statement
> >>>>>>>>>> actually
> >>>>>>>>>>>>>>> used up the last available handler).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> thanks,
> >>>>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
> >>>>>>>>>>>>>>> <jzhong@hortonworks.com
> >>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> From the stack, it seems you increase the default rpc
> >>>>>>>>>>>>>>>> handler
> >>>>>>>> number
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> about 60. All handlers are serving Get request(You can
> >>>>>>>>>>>>>>>> search
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServ
> >>>>>>>>>> er
> >>>>>>>>>> .j
> >>>>>>>>>> av
> >>>>>>>>>>>>>>> a:2
> >>>>>>>>>>>>>>>> 841).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> You can check why there are so many get requests by adding
> >>>>>>>>>>>>>>>> some
> >>>>>>>> log
> >>>>>>>>>>>>>>>> info
> >>>>>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the
> >>>>>>>>>>>>>>>> number
> >>>>>>>>>>>>>>>> of
> >>>>>>>> rpc
> >>>>>>>>>>>>>>>> handlers per region server, it will mitigate your current
> >>>>>>>>>>>>>>>> issue.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas"
> >>>>>>>>>>>>>>>> <cf...@biotiquesystems.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU
> >>>>>>>>>>>>>>>>> usage
> >>>>>>>>>>>>>>>>> spiking
> >>>>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node
> >>>>>>>>>>>>>>>>> cluster
> >>>>>>>> (with
> >>>>>>>>>> 3
> >>>>>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
> >>>>>>>>>>>>>>>>> plain
> >>>>>>>>>>>>>>>>> Ambari
> >>>>>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM
> >>>>>>>>>>>>>>>>> release
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>> hand
> >>>>>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
> >>>>>>>>>>>>>>>>> going
> >>>>>>>>>> well
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> we were able to load in ~30k records into several tables.
> >>>>>>>>>>>>>>>>> What
> >>>>>>>>>>>>>>>>> happened
> >>>>>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers
> >>>>>>>>>>>>>>>>> became
> >>>>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
> >>>>>>>>>>>>>>>>> (12
> >>>>>>>> core
> >>>>>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
> >>>>>>>>>>>>>>>>> (initially
> >>>>>>>> we
> >>>>>>>>>>>>>>>>> saw
> >>>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was
> >>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>> all
> >>>>>>>> of
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> time and we changed back to the standard HBase WAL
> codec).
> >>>>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>> able
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart
> them,
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cluster
> >>>>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up
> >>>>>>>>>>>>>>>>> again.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information
> >>>>>>>>>>>>>>>>> and
> >>>>>>>> started
> >>>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>> scratch, but that has not helped.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've
> >>>>>>>>>>>>>>>>> attached a
> >>>>>>>> jstack
> >>>>>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone
> >>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> shed
> >>>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>> light.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> thanks,
> >>>>>>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> CONFIDENTIALITY NOTICE
> >>>>>>>>>>>>>>>> NOTICE: This message is intended for the use of the
> >>>>>>>>>>>>>>>> individual
> >>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> entity
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> which it is addressed and may contain information that is
> >>>>>>>>>>>>>>>> confidential,
> >>>>>>>>>>>>>>>> privileged and exempt from disclosure under applicable
> law.
> >>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> reader
> >>>>>>>>>>>>>>>> of this message is not the intended recipient, you are
> >>>>>>>>>>>>>>>> hereby
> >>>>>>>>>> notified
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> any printing, copying, dissemination, distribution,
> >>>>>>>>>>>>>>>> disclosure
> >>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> forwarding of this communication is strictly prohibited.
> If
> >>>>>>>>>>>>>>>> you
> >>>>>>>> have
> >>>>>>>>>>>>>>>> received this communication in error, please contact the
> >>>>>>>>>>>>>>>> sender
> >>>>>>>>>>>>>>> immediately
> >>>>>>>>>>>>>>>> and delete it from your system. Thank You.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> CONFIDENTIALITY NOTICE
> >>>>>>>>>>>> NOTICE: This message is intended for the use of the individual
> >>>>>>>>>>>> or
> >>>>>>>>>> entity to
> >>>>>>>>>>>> which it is addressed and may contain information that is
> >>>>>>>> confidential,
> >>>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
> >>>>>>>>>>>> the
> >>>>>>>>>> reader
> >>>>>>>>>>>> of this message is not the intended recipient, you are hereby
> >>>>>>>>>>>> notified
> >>>>>>>>>> that
> >>>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure
> >>>>>>>>>>>> or
> >>>>>>>>>>>> forwarding of this communication is strictly prohibited. If
> you
> >>>>>>>>>>>> have
> >>>>>>>>>>>> received this communication in error, please contact the
> sender
> >>>>>>>>>> immediately
> >>>>>>>>>>>> and delete it from your system. Thank You.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> CONFIDENTIALITY NOTICE
> >>>>> NOTICE: This message is intended for the use of the individual or
> >>>>> entity to
> >>>>> which it is addressed and may contain information that is
> >>>>> confidential,
> >>>>> privileged and exempt from disclosure under applicable law. If the
> >>>>> reader
> >>>>> of this message is not the intended recipient, you are hereby
> notified
> >>>>> that
> >>>>> any printing, copying, dissemination, distribution, disclosure or
> >>>>> forwarding of this communication is strictly prohibited. If you have
> >>>>> received this communication in error, please contact the sender
> >>>>> immediately
> >>>>> and delete it from your system. Thank You.
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> CONFIDENTIALITY NOTICE
> >>> NOTICE: This message is intended for the use of the individual or
> >>> entity to
> >>> which it is addressed and may contain information that is confidential,
> >>> privileged and exempt from disclosure under applicable law. If the
> >>> reader
> >>> of this message is not the intended recipient, you are hereby notified
> >>> that
> >>> any printing, copying, dissemination, distribution, disclosure or
> >>> forwarding of this communication is strictly prohibited. If you have
> >>> received this communication in error, please contact the sender
> >>> immediately
> >>> and delete it from your system. Thank You.
> >>
> >
> >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
>
>


-- 
<http://hortonworks.com/download>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

I'll start on building a patched version then trying it out (Now that it is working I'll have to wrangle the cluster away from the users). I cannot send the data, it private data.

thanks,
-chris


On May 21, 2014, at 3:04 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:

> 
> Thanks for reporting the status back. Mind to try the patch on phoenix-992
> or attach the data which can repro the issue so that we can verify current
> fix. Thanks.
> 
> On 5/21/14 2:06 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> 
>> 
>> Altering the table to not use a DATA_BLOCK_ENCODING fixed the problem.
>> Just to verify that it was the fix, I altered it back to FAST_DIFF and
>> the problem re-appeared. Thank you!
>> 
>> Adding the changes to the client hbase-site.xml did not help the issue,
>> but it may be I was not sourcing it correctly in sqlline. I made sure it
>> was in the classpath being passed to java via -cp when launching it.
>> 
>> thanks again,
>> -chris
>> 
>> On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>> 
>>> 
>>> 
>>> Hey Chris,
>>> 
>>> Sorry for the trace option. The right option should be below. Before you
>>> were able to turn on DEBUG, you just need to change it to TRACE.
>>> 
>>>   log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
>>> 
>>> 
>>> 
>>> Do you see that the two read requests burn one of your CPU to 100%? From
>>> the call stacks you posted today and before, it seems there might be an
>>> infinite loop inside the function Htable.getRowOrBefore() which is
>>> deprecated and supposed to only work for meta region.(brought by Enis).
>>> Fortunately there is only one place in phoenix call this function, so
>>> could you try the following things?
>>> 
>>> set following configuration in client side hbase-site.xml to disable
>>> stats
>>> update 
>>>  phoenix.query.statsUpdateFrequency = 2147483647
>>>  phoenix.query.maxStatsAge = 0
>>> 
>>> After the configuration changes, could you verify if the CPU is still
>>> spiking for the same query?
>>> 
>>> You can also do the following:
>>> 
>>> Alter your table's ColumnFamily data block encoding to NONE as
>>> DATA_BLOCK_ENCODING => 'NONE' and do a major compaction after that
>>> through
>>> Hbase client.
>>> 
>>> 
>>> We suspect that the combination of your data, FAST_DIFF data block
>>> encoding & the deprecated Htable.getRowOrBefore() cause the issue.
>>> 
>>> 
>>> 
>>> On 5/20/14 3:03 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>> 
>>>> Ho Jeffrey.
>>>> 
>>>> Thank you for taking a look. I added
>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the
>>>> Custom
>>>> log4j.properties in Ambari and did a rolling restart of the
>>>> regionservers
>>>> (and verified it was pushed out the nodes in
>>>> /etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on
>>>> the
>>>> table. Looking at the logs I don't see any TRACE messages. The only two
>>>> log messages that happened when issuing the select are:
>>>> 
>>>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
>>>> coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
>>>> coprocessor scan
>>>> 
>>>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
>>>> 00
>>>> 
>>>> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
>>>> ca
>>>> 
>>>> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
>>>> ":
>>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
>>>> coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
>>>> rows for ungrouped coprocessor scan
>>>> 
>>>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
>>>> 00
>>>> 
>>>> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
>>>> ca
>>>> 
>>>> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
>>>> ":
>>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>>> 
>>>> The metrics for the regionserver show only 2 total read requests since
>>>> it
>>>> started and 0 per second now. Those two requests are for the table I
>>>> did
>>>> the count on. The regionserver's UI says it has no active RPC calls.
>>>> It
>>>> currently has one stuck handler from running that command:
>>>> 
>>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
>>>> nid=0x2582 runnable [0x00007fbd5ba6c000]
>>>> java.lang.Thread.State: RUNNABLE
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
>>>> :2
>>>> 888)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
>>>> :2
>>>> 880)
>>>> 	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>>>> 	at java.util.TreeMap.getEntry(TreeMap.java:340)
>>>> 	at java.util.TreeMap.get(TreeMap.java:273)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDelete
>>>> d(
>>>> GetClosestRowBeforeTracker.java:128)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandi
>>>> da
>>>> te(GetClosestRowBeforeTracker.java:107)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(G
>>>> et
>>>> ClosestRowBeforeTracker.java:202)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
>>>> e.
>>>> java:1772)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>> St
>>>> ore.java:1722)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>> av
>>>> a:1655)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>> .j
>>>> ava:1826)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>> a:
>>>> 2841)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>> ca
>>>> llBlockingMethod(ClientProtos.java:28857)
>>>> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>> ed
>>>> uler.java:160)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>> ul
>>>> er.java:38)
>>>> 	at 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>> ja
>>>> va:110)
>>>> 	at java.lang.Thread.run(Thread.java:744)
>>>> 
>>>> It also has one listener running:
>>>> 
>>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
>>>> nid=0x2579 runnable [0x00007fbd5c376000]
>>>> java.lang.Thread.State: RUNNABLE
>>>> 	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>> 	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>> 	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>>> 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>>> 	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>>>> 	- locked <0x000000059ae289a0> (a
>>>> java.util.Collections$UnmodifiableSet)
>>>> 	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>>>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>>> 	at 
>>>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>>>> 
>>>> If there is anything else you need, please let me know.
>>>> 
>>>> thanks again,
>>>> -chris
>>>> 
>>>> 
>>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>> wrote:
>>>> 
>>>>> 
>>>>> I rechecked your RPC log and found that you didn't turn on the RPC
>>>>> trace
>>>>> level. You can set the following so that we can see detail request &
>>>>> response. 
>>>>> 
>>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>>>> 
>>>>> If you didn't see time out exception in your region server logs, it
>>>>> means
>>>>> that there are something keeping issuing lots of Get requests. Have
>>>>> you
>>>>> used secondary index for this table? If that's the case, please
>>>>> disable/drop it to see if that helps.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>>> 
>>>>>> I will see how possible a new, all bare metal cluster is, but we
>>>>>> have 5
>>>>>> other similarly configured clusters with no troubles. We only have an
>>>>>> issue when Phoenix does a full scan of some tables that a handler
>>>>>> gets
>>>>>> stuck in 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> after it has returned all of its data.
>>>>>> 
>>>>>> thank you,
>>>>>> 
>>>>>> -chris
>>>>>> 
>>>>>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> i'm just guessing but may be region servers are stuck because they
>>>>>>> can't
>>>>>>> communicate with zookeeper or hmaster, or may be datanodes can't
>>>>>>> talk
>>>>>>> to
>>>>>>> namenode because of some vmware networking quirk and some internal
>>>>>>> thread
>>>>>>> is hanging, who knows,
>>>>>>> it would take 30 min to setup up bare metal cluster vs days
>>>>>>> debugging
>>>>>>> this
>>>>>>> hybrid setup
>>>>>>> also I'd suggest to install directly from hbase/hadoop websites  to
>>>>>>> eliminate additional variables
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>> 
>>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>>>>> datanodes/regionservers with the stuck processes all bare metal.
>>>>>>>> 
>>>>>>>> -chris
>>>>>>>> 
>>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>>>> 
>>>>>>>>> can you try running hbase/phoenix on physical nodes instead of
>>>>>>>>> using
>>>>>>>>> virtual machines
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
>>>>>>>>> <cft@biotiquesystems.com
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Sorry to follow up on my own message, but I was wondering if
>>>>>>>>>> anyone
>>>>>>>>>> had
>>>>>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>>>>>> right
>>>>>>>>>> after a select * on the exact same table will.
>>>>>>>>>> 
>>>>>>>>>> If we export the table and then re-import it into a new table,
>>>>>>>>>> the
>>>>>>>>>> new
>>>>>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>>>>>> upsert..select
>>>>>>>>>> to do a copy. It seems something happens to the last region to
>>>>>>>>>> cause
>>>>>>>> this,
>>>>>>>>>> but it is not directly data dependent. Moving the region to
>>>>>>>>>> another
>>>>>>>>>> regionserver doesn't have any effect - just moves where the
>>>>>>>>>> problem
>>>>>>>>>> happens. Major compactions get hung up by the running threads as
>>>>>>>>>> they
>>>>>>>>>> probably have a lock.
>>>>>>>>>> 
>>>>>>>>>> I've run the hfile tool on the final region and nothing seems
>>>>>>>>>> awry.
>>>>>>>>>> 
>>>>>>>>>> Figuring this out will allow this project to continue, as of now
>>>>>>>>>> it
>>>>>>>>>> is
>>>>>>>>>> hung up on this issue.
>>>>>>>>>> 
>>>>>>>>>> thank you,
>>>>>>>>>> -chris
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas
>>>>>>>>>> <cf...@biotiquesystems.com>
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I did some "poor mans" profiling with multiple jstack and came
>>>>>>>>>>> up
>>>>>>>>>>> with
>>>>>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStore
>>>>>>>>>> Fi
>>>>>>>>>> le
>>>>>>>>>> .
>>>>>>>>>> That is the last deepest method in all of the traces, either
>>>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
>>>>>>>>>> for a
>>>>>>>>>> thread (which has been running for the last couple hours):
>>>>>>>>>>> 
>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNex
>>>>>>>> t(
>>>>>>>> Fa
>>>>>>>> stDiffDeltaEncoder.java:540)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$Buffere
>>>>>>>> dE
>>>>>>>> nc
>>>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next
>>>>>>>> (H
>>>>>>>> Fi
>>>>>>>> leReaderV2.java:1063)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(H
>>>>>>>> St
>>>>>>>> or
>>>>>>>> e.java:1776)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>> le
>>>>>>>> (H
>>>>>>>> Store.java:1722)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
>>>>>>>> re
>>>>>>>> .j
>>>>>>>> ava:1655)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
>>>>>>>> gi
>>>>>>>> on
>>>>>>>> .java:1826)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>> .j
>>>>>>>> av
>>>>>>>> a:2841)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
>>>>>>>> e$
>>>>>>>> 2.
>>>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>  at 
>>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>  at 
>>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
>>>>>>>> cS
>>>>>>>> ch
>>>>>>>> eduler.java:160)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
>>>>>>>> ch
>>>>>>>> ed
>>>>>>>> uler.java:38)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
>>>>>>>> le
>>>>>>>> r.
>>>>>>>> java:110)
>>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>> 
>>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:
>>>>>>>> 19
>>>>>>>> 44
>>>>>>>> )
>>>>>>>>>>>  at
>>>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.ro
>>>>>>>> ot
>>>>>>>> Bl
>>>>>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.lo
>>>>>>>> ad
>>>>>>>> Da
>>>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.se
>>>>>>>> ek
>>>>>>>> To
>>>>>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.see
>>>>>>>> kB
>>>>>>>> ef
>>>>>>>> ore(HFileReaderV2.java:548)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>> le
>>>>>>>> (H
>>>>>>>> Store.java:1712)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
>>>>>>>> re
>>>>>>>> .j
>>>>>>>> ava:1655)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
>>>>>>>> gi
>>>>>>>> on
>>>>>>>> .java:1826)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>> .j
>>>>>>>> av
>>>>>>>> a:2841)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
>>>>>>>> e$
>>>>>>>> 2.
>>>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>>  at 
>>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>>  at 
>>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
>>>>>>>> cS
>>>>>>>> ch
>>>>>>>> eduler.java:160)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
>>>>>>>> ch
>>>>>>>> ed
>>>>>>>> uler.java:38)
>>>>>>>>>>>  at
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
>>>>>>>> le
>>>>>>>> r.
>>>>>>>> java:110)
>>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong
>>>>>>>>>>> <jz...@hortonworks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Hey Chris,
>>>>>>>>>>>> 
>>>>>>>>>>>> I used performance.py tool which created a table with 50K rows
>>>>>>>>>>>> in
>>>>>>>>>>>> one
>>>>>>>>>>>> table, run the following query from sqlline.py and everything
>>>>>>>>>>>> seems
>>>>>>>> fine
>>>>>>>>>>>> without seeing CPU running hot.
>>>>>>>>>>>> 
>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
>>>>>>>>>>>> from
>>>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
>>>>>>>>>>>> from
>>>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | 50000      |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there anyway could you run profiler to see where the CPU
>>>>>>>>>>>> goes?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks again,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>>>>>> pastebin it
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> place?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to
>>>>>>>>>>>>>>> have
>>>>>>>> been
>>>>>>>>>>>>>>> set
>>>>>>>>>>>>>>> to 60 by Ambari during install process). Now we have
>>>>>>>>>>>>>>> narrowed
>>>>>>>>>>>>>>> down
>>>>>>>>>> what
>>>>>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
>>>>>>>>>>>>>>> selects
>>>>>>>> one
>>>>>>>>>>>>>>> row
>>>>>>>>>>>>>>> using the primary_key, no increate in CPU is observed and
>>>>>>>>>>>>>>> the
>>>>>>>> number
>>>>>>>>>>>>>>> of RPC
>>>>>>>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If we execute a select that scans the table such as "select
>>>>>>>> count(*)
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> TABLE" or where the "where" clause only limits on
>>>>>>>>>>>>>>> non-primary
>>>>>>>>>>>>>>> key
>>>>>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
>>>>>>>>>>>>>>> threads
>>>>>>>>>>>>>>> increases
>>>>>>>>>>>>>>> and the CPU utilization of the regionserver increases by
>>>>>>>>>>>>>>> ~105%.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>>>>>>> RpcServer.handler
>>>>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher
>>>>>>>>>>>>>>> usage.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>>>>>> monitored
>>>>>>>>>>>>>>> tasks
>>>>>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a
>>>>>>>>>>>>>>> Used
>>>>>>>> heap
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> 445.2M.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>>>>>> turned on
>>>>>>>>>>>>>>> right
>>>>>>>>>>>>>>> when one of the Phoenix statements is executed (this
>>>>>>>>>>>>>>> statement
>>>>>>>>>> actually
>>>>>>>>>>>>>>> used up the last available handler).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>>>>>>> <jzhong@hortonworks.com
>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> From the stack, it seems you increase the default rpc
>>>>>>>>>>>>>>>> handler
>>>>>>>> number
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> about 60. All handlers are serving Get request(You can
>>>>>>>>>>>>>>>> search
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServ
>>>>>>>>>> er
>>>>>>>>>> .j
>>>>>>>>>> av
>>>>>>>>>>>>>>> a:2
>>>>>>>>>>>>>>>> 841).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> You can check why there are so many get requests by adding
>>>>>>>>>>>>>>>> some
>>>>>>>> log
>>>>>>>>>>>>>>>> info
>>>>>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the
>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>> of
>>>>>>>> rpc
>>>>>>>>>>>>>>>> handlers per region server, it will mitigate your current
>>>>>>>>>>>>>>>> issue.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas"
>>>>>>>>>>>>>>>> <cf...@biotiquesystems.com>
>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU
>>>>>>>>>>>>>>>>> usage
>>>>>>>>>>>>>>>>> spiking
>>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node
>>>>>>>>>>>>>>>>> cluster
>>>>>>>> (with
>>>>>>>>>> 3
>>>>>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>>>>>> plain
>>>>>>>>>>>>>>>>> Ambari
>>>>>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM
>>>>>>>>>>>>>>>>> release
>>>>>>>>>>>>>>>>> and
>>>>>>>>>> hand
>>>>>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
>>>>>>>>>>>>>>>>> going
>>>>>>>>>> well
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> we were able to load in ~30k records into several tables.
>>>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>>>> happened
>>>>>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers
>>>>>>>>>>>>>>>>> became
>>>>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
>>>>>>>>>>>>>>>>> (12
>>>>>>>> core
>>>>>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>>>>>> (initially
>>>>>>>> we
>>>>>>>>>>>>>>>>> saw
>>>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> all
>>>>>>>> of
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec).
>>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> able
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them,
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up
>>>>>>>>>>>>>>>>> again.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information
>>>>>>>>>>>>>>>>> and
>>>>>>>> started
>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've
>>>>>>>>>>>>>>>>> attached a
>>>>>>>> jstack
>>>>>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> shed
>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>> light.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>>>>>>> NOTICE: This message is intended for the use of the
>>>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>> entity
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>>>>>>>> confidential,
>>>>>>>>>>>>>>>> privileged and exempt from disclosure under applicable law.
>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> reader
>>>>>>>>>>>>>>>> of this message is not the intended recipient, you are
>>>>>>>>>>>>>>>> hereby
>>>>>>>>>> notified
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> any printing, copying, dissemination, distribution,
>>>>>>>>>>>>>>>> disclosure
>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If
>>>>>>>>>>>>>>>> you
>>>>>>>> have
>>>>>>>>>>>>>>>> received this communication in error, please contact the
>>>>>>>>>>>>>>>> sender
>>>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>>> NOTICE: This message is intended for the use of the individual
>>>>>>>>>>>> or
>>>>>>>>>> entity to
>>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>> confidential,
>>>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
>>>>>>>>>>>> the
>>>>>>>>>> reader
>>>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>>>>>> notified
>>>>>>>>>> that
>>>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure
>>>>>>>>>>>> or
>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>>>>>>>> have
>>>>>>>>>>>> received this communication in error, please contact the sender
>>>>>>>>>> immediately
>>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> CONFIDENTIALITY NOTICE
>>>>> NOTICE: This message is intended for the use of the individual or
>>>>> entity to 
>>>>> which it is addressed and may contain information that is
>>>>> confidential,
>>>>> privileged and exempt from disclosure under applicable law. If the
>>>>> reader 
>>>>> of this message is not the intended recipient, you are hereby notified
>>>>> that 
>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>> forwarding of this communication is strictly prohibited. If you have
>>>>> received this communication in error, please contact the sender
>>>>> immediately 
>>>>> and delete it from your system. Thank You.
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or
>>> entity to 
>>> which it is addressed and may contain information that is confidential,
>>> privileged and exempt from disclosure under applicable law. If the
>>> reader 
>>> of this message is not the intended recipient, you are hereby notified
>>> that 
>>> any printing, copying, dissemination, distribution, disclosure or
>>> forwarding of this communication is strictly prohibited. If you have
>>> received this communication in error, please contact the sender
>>> immediately 
>>> and delete it from your system. Thank You.
>> 
> 
> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Jeffrey Zhong <jz...@hortonworks.com>.

Thanks for reporting the status back. Mind to try the patch on phoenix-992
or attach the data which can repro the issue so that we can verify current
fix. Thanks.

On 5/21/14 2:06 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:

>
>Altering the table to not use a DATA_BLOCK_ENCODING fixed the problem.
>Just to verify that it was the fix, I altered it back to FAST_DIFF and
>the problem re-appeared. Thank you!
>
>Adding the changes to the client hbase-site.xml did not help the issue,
>but it may be I was not sourcing it correctly in sqlline. I made sure it
>was in the classpath being passed to java via -cp when launching it.
>
>thanks again,
>-chris
>
>On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>
>> 
>> 
>> Hey Chris,
>> 
>> Sorry for the trace option. The right option should be below. Before you
>> were able to turn on DEBUG, you just need to change it to TRACE.
>> 
>>    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
>> 
>> 
>> 
>> Do you see that the two read requests burn one of your CPU to 100%? From
>> the call stacks you posted today and before, it seems there might be an
>> infinite loop inside the function Htable.getRowOrBefore() which is
>> deprecated and supposed to only work for meta region.(brought by Enis).
>> Fortunately there is only one place in phoenix call this function, so
>> could you try the following things?
>> 
>> set following configuration in client side hbase-site.xml to disable
>>stats
>> update 
>>   phoenix.query.statsUpdateFrequency = 2147483647
>>   phoenix.query.maxStatsAge = 0
>> 
>> After the configuration changes, could you verify if the CPU is still
>> spiking for the same query?
>> 
>> You can also do the following:
>> 
>> Alter your table's ColumnFamily data block encoding to NONE as
>> DATA_BLOCK_ENCODING => 'NONE' and do a major compaction after that
>>through
>> Hbase client.
>> 
>> 
>> We suspect that the combination of your data, FAST_DIFF data block
>> encoding & the deprecated Htable.getRowOrBefore() cause the issue.
>> 
>> 
>> 
>> On 5/20/14 3:03 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>> 
>>> Ho Jeffrey.
>>> 
>>> Thank you for taking a look. I added
>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the
>>>Custom
>>> log4j.properties in Ambari and did a rolling restart of the
>>>regionservers
>>> (and verified it was pushed out the nodes in
>>> /etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on
>>>the
>>> table. Looking at the logs I don't see any TRACE messages. The only two
>>> log messages that happened when issuing the select are:
>>> 
>>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
>>> coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
>>> coprocessor scan
>>> 
>>>{"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
>>>00
>>> 
>>>\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
>>>ca
>>> 
>>>cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
>>>":
>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
>>> coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
>>> rows for ungrouped coprocessor scan
>>> 
>>>{"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x
>>>00
>>> 
>>>\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"
>>>ca
>>> 
>>>cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions
>>>":
>>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>>> 
>>> The metrics for the regionserver show only 2 total read requests since
>>>it
>>> started and 0 per second now. Those two requests are for the table I
>>>did
>>> the count on. The regionserver's UI says it has no active RPC calls.
>>>It
>>> currently has one stuck handler from running that command:
>>> 
>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
>>> nid=0x2582 runnable [0x00007fbd5ba6c000]
>>>  java.lang.Thread.State: RUNNABLE
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
>>>:2
>>> 888)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
>>>:2
>>> 880)
>>> 	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>>> 	at java.util.TreeMap.getEntry(TreeMap.java:340)
>>> 	at java.util.TreeMap.get(TreeMap.java:273)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDelete
>>>d(
>>> GetClosestRowBeforeTracker.java:128)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandi
>>>da
>>> te(GetClosestRowBeforeTracker.java:107)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(G
>>>et
>>> ClosestRowBeforeTracker.java:202)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
>>>e.
>>> java:1772)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>St
>>> ore.java:1722)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>av
>>> a:1655)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>.j
>>> ava:1826)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>a:
>>> 2841)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>ca
>>> llBlockingMethod(ClientProtos.java:28857)
>>> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>ed
>>> uler.java:160)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>ul
>>> er.java:38)
>>> 	at 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>ja
>>> va:110)
>>> 	at java.lang.Thread.run(Thread.java:744)
>>> 
>>> It also has one listener running:
>>> 
>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
>>> nid=0x2579 runnable [0x00007fbd5c376000]
>>>  java.lang.Thread.State: RUNNABLE
>>> 	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>> 	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>> 	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>> 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>> 	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>>> 	- locked <0x000000059ae289a0> (a
>>>java.util.Collections$UnmodifiableSet)
>>> 	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>>> 	at 
>>>org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>>> 
>>> If there is anything else you need, please let me know.
>>> 
>>> thanks again,
>>> -chris
>>> 
>>> 
>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>wrote:
>>> 
>>>> 
>>>> I rechecked your RPC log and found that you didn't turn on the RPC
>>>>trace
>>>> level. You can set the following so that we can see detail request &
>>>> response. 
>>>> 
>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>>> 
>>>> If you didn't see time out exception in your region server logs, it
>>>> means
>>>> that there are something keeping issuing lots of Get requests. Have
>>>>you
>>>> used secondary index for this table? If that's the case, please
>>>> disable/drop it to see if that helps.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>> 
>>>>> I will see how possible a new, all bare metal cluster is, but we
>>>>>have 5
>>>>> other similarly configured clusters with no troubles. We only have an
>>>>> issue when Phoenix does a full scan of some tables that a handler
>>>>>gets
>>>>> stuck in 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>> after it has returned all of its data.
>>>>> 
>>>>> thank you,
>>>>> 
>>>>> -chris
>>>>> 
>>>>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com>
>>>>>wrote:
>>>>> 
>>>>>> i'm just guessing but may be region servers are stuck because they
>>>>>> can't
>>>>>> communicate with zookeeper or hmaster, or may be datanodes can't
>>>>>>talk
>>>>>> to
>>>>>> namenode because of some vmware networking quirk and some internal
>>>>>> thread
>>>>>> is hanging, who knows,
>>>>>> it would take 30 min to setup up bare metal cluster vs days
>>>>>>debugging
>>>>>> this
>>>>>> hybrid setup
>>>>>> also I'd suggest to install directly from hbase/hadoop websites  to
>>>>>> eliminate additional variables
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>> 
>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>>>> datanodes/regionservers with the stuck processes all bare metal.
>>>>>>> 
>>>>>>> -chris
>>>>>>> 
>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>>> 
>>>>>>>> can you try running hbase/phoenix on physical nodes instead of
>>>>>>>>using
>>>>>>>> virtual machines
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
>>>>>>>> <cft@biotiquesystems.com
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Sorry to follow up on my own message, but I was wondering if
>>>>>>>>>anyone
>>>>>>>>> had
>>>>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>>>>> right
>>>>>>>>> after a select * on the exact same table will.
>>>>>>>>> 
>>>>>>>>> If we export the table and then re-import it into a new table,
>>>>>>>>>the
>>>>>>>>> new
>>>>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>>>>> upsert..select
>>>>>>>>> to do a copy. It seems something happens to the last region to
>>>>>>>>> cause
>>>>>>> this,
>>>>>>>>> but it is not directly data dependent. Moving the region to
>>>>>>>>>another
>>>>>>>>> regionserver doesn't have any effect - just moves where the
>>>>>>>>>problem
>>>>>>>>> happens. Major compactions get hung up by the running threads as
>>>>>>>>> they
>>>>>>>>> probably have a lock.
>>>>>>>>> 
>>>>>>>>> I've run the hfile tool on the final region and nothing seems
>>>>>>>>>awry.
>>>>>>>>> 
>>>>>>>>> Figuring this out will allow this project to continue, as of now
>>>>>>>>>it
>>>>>>>>> is
>>>>>>>>> hung up on this issue.
>>>>>>>>> 
>>>>>>>>> thank you,
>>>>>>>>> -chris
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas
>>>>>>>>><cf...@biotiquesystems.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I did some "poor mans" profiling with multiple jstack and came
>>>>>>>>>>up
>>>>>>>>>> with
>>>>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStore
>>>>>>>>>Fi
>>>>>>>>> le
>>>>>>>>> .
>>>>>>>>> That is the last deepest method in all of the traces, either
>>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
>>>>>>>>> for a
>>>>>>>>> thread (which has been running for the last couple hours):
>>>>>>>>>> 
>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNex
>>>>>>>t(
>>>>>>> Fa
>>>>>>> stDiffDeltaEncoder.java:540)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$Buffere
>>>>>>>dE
>>>>>>> nc
>>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next
>>>>>>>(H
>>>>>>> Fi
>>>>>>> leReaderV2.java:1063)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(H
>>>>>>>St
>>>>>>> or
>>>>>>> e.java:1776)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>le
>>>>>>> (H
>>>>>>> Store.java:1722)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
>>>>>>>re
>>>>>>> .j
>>>>>>> ava:1655)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
>>>>>>>gi
>>>>>>> on
>>>>>>> .java:1826)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>.j
>>>>>>> av
>>>>>>> a:2841)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
>>>>>>>e$
>>>>>>> 2.
>>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>   at 
>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>   at 
>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
>>>>>>>cS
>>>>>>> ch
>>>>>>> eduler.java:160)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
>>>>>>>ch
>>>>>>> ed
>>>>>>> uler.java:38)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
>>>>>>>le
>>>>>>> r.
>>>>>>> java:110)
>>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>> 
>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:
>>>>>>>19
>>>>>>> 44
>>>>>>> )
>>>>>>>>>>   at
>>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.ro
>>>>>>>ot
>>>>>>> Bl
>>>>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.lo
>>>>>>>ad
>>>>>>> Da
>>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.se
>>>>>>>ek
>>>>>>> To
>>>>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.see
>>>>>>>kB
>>>>>>> ef
>>>>>>> ore(HFileReaderV2.java:548)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>le
>>>>>>> (H
>>>>>>> Store.java:1712)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
>>>>>>>re
>>>>>>> .j
>>>>>>> ava:1655)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
>>>>>>>gi
>>>>>>> on
>>>>>>> .java:1826)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>.j
>>>>>>> av
>>>>>>> a:2841)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
>>>>>>>e$
>>>>>>> 2.
>>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>>   at 
>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>>   at 
>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
>>>>>>>cS
>>>>>>> ch
>>>>>>> eduler.java:160)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
>>>>>>>ch
>>>>>>> ed
>>>>>>> uler.java:38)
>>>>>>>>>>   at
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
>>>>>>>le
>>>>>>> r.
>>>>>>> java:110)
>>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong
>>>>>>>>>> <jz...@hortonworks.com>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hey Chris,
>>>>>>>>>>> 
>>>>>>>>>>> I used performance.py tool which created a table with 50K rows
>>>>>>>>>>>in
>>>>>>>>>>> one
>>>>>>>>>>> table, run the following query from sqlline.py and everything
>>>>>>>>>>> seems
>>>>>>> fine
>>>>>>>>>>> without seeing CPU running hot.
>>>>>>>>>>> 
>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
>>>>>>>>>>>from
>>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | 50000      |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
>>>>>>>>>>>from
>>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | 50000      |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>>> 
>>>>>>>>>>> Is there anyway could you run profiler to see where the CPU
>>>>>>>>>>>goes?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>>> 
>>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>>> 
>>>>>>>>>>>> thanks again,
>>>>>>>>>>>> -chris
>>>>>>>>>>>> 
>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>>>>> pastebin it
>>>>>>>>>>>>> some
>>>>>>>>>>>>> place?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to
>>>>>>>>>>>>>> have
>>>>>>> been
>>>>>>>>>>>>>> set
>>>>>>>>>>>>>> to 60 by Ambari during install process). Now we have
>>>>>>>>>>>>>>narrowed
>>>>>>>>>>>>>> down
>>>>>>>>> what
>>>>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
>>>>>>>>>>>>>> selects
>>>>>>> one
>>>>>>>>>>>>>> row
>>>>>>>>>>>>>> using the primary_key, no increate in CPU is observed and
>>>>>>>>>>>>>>the
>>>>>>> number
>>>>>>>>>>>>>> of RPC
>>>>>>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If we execute a select that scans the table such as "select
>>>>>>> count(*)
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>> TABLE" or where the "where" clause only limits on
>>>>>>>>>>>>>>non-primary
>>>>>>>>>>>>>> key
>>>>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
>>>>>>>>>>>>>> threads
>>>>>>>>>>>>>> increases
>>>>>>>>>>>>>> and the CPU utilization of the regionserver increases by
>>>>>>>>>>>>>> ~105%.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>>>>>> RpcServer.handler
>>>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher
>>>>>>>>>>>>>>usage.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>>>>> monitored
>>>>>>>>>>>>>> tasks
>>>>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a
>>>>>>>>>>>>>> Used
>>>>>>> heap
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>> 445.2M.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>>>>> turned on
>>>>>>>>>>>>>> right
>>>>>>>>>>>>>> when one of the Phoenix statements is executed (this
>>>>>>>>>>>>>>statement
>>>>>>>>> actually
>>>>>>>>>>>>>> used up the last available handler).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>>>>>> <jzhong@hortonworks.com
>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> From the stack, it seems you increase the default rpc
>>>>>>>>>>>>>>>handler
>>>>>>> number
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> about 60. All handlers are serving Get request(You can
>>>>>>>>>>>>>>>search
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServ
>>>>>>>>>er
>>>>>>>>> .j
>>>>>>>>> av
>>>>>>>>>>>>>> a:2
>>>>>>>>>>>>>>> 841).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> You can check why there are so many get requests by adding
>>>>>>>>>>>>>>> some
>>>>>>> log
>>>>>>>>>>>>>>> info
>>>>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the
>>>>>>>>>>>>>>>number
>>>>>>>>>>>>>>> of
>>>>>>> rpc
>>>>>>>>>>>>>>> handlers per region server, it will mitigate your current
>>>>>>>>>>>>>>> issue.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas"
>>>>>>>>>>>>>>><cf...@biotiquesystems.com>
>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU
>>>>>>>>>>>>>>>> usage
>>>>>>>>>>>>>>>> spiking
>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node
>>>>>>>>>>>>>>>>cluster
>>>>>>> (with
>>>>>>>>> 3
>>>>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>>>>> plain
>>>>>>>>>>>>>>>> Ambari
>>>>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM
>>>>>>>>>>>>>>>>release
>>>>>>>>>>>>>>>> and
>>>>>>>>> hand
>>>>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
>>>>>>>>>>>>>>>> going
>>>>>>>>> well
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> we were able to load in ~30k records into several tables.
>>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>>> happened
>>>>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers
>>>>>>>>>>>>>>>> became
>>>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
>>>>>>>>>>>>>>>> (12
>>>>>>> core
>>>>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>>>>> (initially
>>>>>>> we
>>>>>>>>>>>>>>>> saw
>>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was
>>>>>>>>>>>>>>>>not
>>>>>>>>>>>>>>>> all
>>>>>>> of
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec).
>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>> able
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them,
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up
>>>>>>>>>>>>>>>> again.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information
>>>>>>>>>>>>>>>>and
>>>>>>> started
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've
>>>>>>>>>>>>>>>>attached a
>>>>>>> jstack
>>>>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone
>>>>>>>>>>>>>>>>can
>>>>>>>>>>>>>>>> shed
>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>> light.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>>>>>> NOTICE: This message is intended for the use of the
>>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>> entity
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>>>>>>> confidential,
>>>>>>>>>>>>>>> privileged and exempt from disclosure under applicable law.
>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> reader
>>>>>>>>>>>>>>> of this message is not the intended recipient, you are
>>>>>>>>>>>>>>>hereby
>>>>>>>>> notified
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> any printing, copying, dissemination, distribution,
>>>>>>>>>>>>>>> disclosure
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If
>>>>>>>>>>>>>>> you
>>>>>>> have
>>>>>>>>>>>>>>> received this communication in error, please contact the
>>>>>>>>>>>>>>> sender
>>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>> NOTICE: This message is intended for the use of the individual
>>>>>>>>>>>or
>>>>>>>>> entity to
>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>> confidential,
>>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
>>>>>>>>>>> the
>>>>>>>>> reader
>>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>>>>> notified
>>>>>>>>> that
>>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure
>>>>>>>>>>>or
>>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>>>>>>> have
>>>>>>>>>>> received this communication in error, please contact the sender
>>>>>>>>> immediately
>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> CONFIDENTIALITY NOTICE
>>>> NOTICE: This message is intended for the use of the individual or
>>>> entity to 
>>>> which it is addressed and may contain information that is
>>>>confidential,
>>>> privileged and exempt from disclosure under applicable law. If the
>>>> reader 
>>>> of this message is not the intended recipient, you are hereby notified
>>>> that 
>>>> any printing, copying, dissemination, distribution, disclosure or
>>>> forwarding of this communication is strictly prohibited. If you have
>>>> received this communication in error, please contact the sender
>>>> immediately 
>>>> and delete it from your system. Thank You.
>>> 
>> 
>> 
>> 
>> -- 
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or
>>entity to 
>> which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the
>>reader 
>> of this message is not the intended recipient, you are hereby notified
>>that 
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender
>>immediately 
>> and delete it from your system. Thank You.
>



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

Altering the table to not use a DATA_BLOCK_ENCODING fixed the problem. Just to verify that it was the fix, I altered it back to FAST_DIFF and the problem re-appeared. Thank you!

Adding the changes to the client hbase-site.xml did not help the issue, but it may be I was not sourcing it correctly in sqlline. I made sure it was in the classpath being passed to java via -cp when launching it. 

thanks again,
-chris

On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:

> 
> 
> Hey Chris,
> 
> Sorry for the trace option. The right option should be below. Before you
> were able to turn on DEBUG, you just need to change it to TRACE.
> 
>    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
> 
> 
> 
> Do you see that the two read requests burn one of your CPU to 100%? From
> the call stacks you posted today and before, it seems there might be an
> infinite loop inside the function Htable.getRowOrBefore() which is
> deprecated and supposed to only work for meta region.(brought by Enis).
> Fortunately there is only one place in phoenix call this function, so
> could you try the following things?
> 
> set following configuration in client side hbase-site.xml to disable stats
> update 
>   phoenix.query.statsUpdateFrequency = 2147483647
>   phoenix.query.maxStatsAge = 0
> 
> After the configuration changes, could you verify if the CPU is still
> spiking for the same query?
> 
> You can also do the following:
> 
> Alter your table's ColumnFamily data block encoding to NONE as
> DATA_BLOCK_ENCODING => 'NONE' and do a major compaction after that through
> Hbase client.
> 
> 
> We suspect that the combination of your data, FAST_DIFF data block
> encoding & the deprecated Htable.getRowOrBefore() cause the issue.
> 
> 
> 
> On 5/20/14 3:03 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> 
>> Ho Jeffrey.
>> 
>> Thank you for taking a look. I added
>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom
>> log4j.properties in Ambari and did a rolling restart of the regionservers
>> (and verified it was pushed out the nodes in
>> /etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on the
>> table. Looking at the logs I don't see any TRACE messages. The only two
>> log messages that happened when issuing the select are:
>> 
>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
>> coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
>> coprocessor scan 
>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00
>> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"ca
>> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":
>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
>> coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
>> rows for ungrouped coprocessor scan
>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00
>> \\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"ca
>> cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":
>> 1,"filter":"FirstKeyOnlyFilter","caching":-1}
>> 
>> The metrics for the regionserver show only 2 total read requests since it
>> started and 0 per second now. Those two requests are for the table I did
>> the count on. The regionserver's UI says it has no active RPC calls.  It
>> currently has one stuck handler from running that command:
>> 
>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
>> nid=0x2582 runnable [0x00007fbd5ba6c000]
>>  java.lang.Thread.State: RUNNABLE
>> 	at 
>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2
>> 888)
>> 	at 
>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2
>> 880)
>> 	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>> 	at java.util.TreeMap.getEntry(TreeMap.java:340)
>> 	at java.util.TreeMap.get(TreeMap.java:273)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(
>> GetClosestRowBeforeTracker.java:128)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandida
>> te(GetClosestRowBeforeTracker.java:107)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(Get
>> ClosestRowBeforeTracker.java:202)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.
>> java:1772)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HSt
>> ore.java:1722)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.jav
>> a:1655)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.j
>> ava:1826)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:
>> 2841)
>> 	at 
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.ca
>> llBlockingMethod(ClientProtos.java:28857)
>> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSched
>> uler.java:160)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSchedul
>> er.java:38)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.ja
>> va:110)
>> 	at java.lang.Thread.run(Thread.java:744)
>> 
>> It also has one listener running:
>> 
>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
>> nid=0x2579 runnable [0x00007fbd5c376000]
>>  java.lang.Thread.State: RUNNABLE
>> 	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>> 	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>> 	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>> 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>> 	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>> 	- locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>> 	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>> 	at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>> 
>> If there is anything else you need, please let me know.
>> 
>> thanks again,
>> -chris
>> 
>> 
>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>> 
>>> 
>>> I rechecked your RPC log and found that you didn't turn on the RPC trace
>>> level. You can set the following so that we can see detail request &
>>> response. 
>>> 
>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>> 
>>> If you didn't see time out exception in your region server logs, it
>>> means
>>> that there are something keeping issuing lots of Get requests. Have you
>>> used secondary index for this table? If that's the case, please
>>> disable/drop it to see if that helps.
>>> 
>>> 
>>> 
>>> 
>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>> 
>>>> I will see how possible a new, all bare metal cluster is, but we have 5
>>>> other similarly configured clusters with no troubles. We only have an
>>>> issue when Phoenix does a full scan of some tables that a handler gets
>>>> stuck in 
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>> after it has returned all of its data.
>>>> 
>>>> thank you,
>>>> 
>>>> -chris
>>>> 
>>>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com> wrote:
>>>> 
>>>>> i'm just guessing but may be region servers are stuck because they
>>>>> can't
>>>>> communicate with zookeeper or hmaster, or may be datanodes can't talk
>>>>> to
>>>>> namenode because of some vmware networking quirk and some internal
>>>>> thread
>>>>> is hanging, who knows,
>>>>> it would take 30 min to setup up bare metal cluster vs days debugging
>>>>> this
>>>>> hybrid setup
>>>>> also I'd suggest to install directly from hbase/hadoop websites  to
>>>>> eliminate additional variables
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>>>> <cf...@biotiquesystems.com>wrote:
>>>>> 
>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>>> datanodes/regionservers with the stuck processes all bare metal.
>>>>>> 
>>>>>> -chris
>>>>>> 
>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>> 
>>>>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>>>>> virtual machines
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
>>>>>>> <cft@biotiquesystems.com
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Sorry to follow up on my own message, but I was wondering if anyone
>>>>>>>> had
>>>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>>>> right
>>>>>>>> after a select * on the exact same table will.
>>>>>>>> 
>>>>>>>> If we export the table and then re-import it into a new table, the
>>>>>>>> new
>>>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>>>> upsert..select
>>>>>>>> to do a copy. It seems something happens to the last region to
>>>>>>>> cause
>>>>>> this,
>>>>>>>> but it is not directly data dependent. Moving the region to another
>>>>>>>> regionserver doesn't have any effect - just moves where the problem
>>>>>>>> happens. Major compactions get hung up by the running threads as
>>>>>>>> they
>>>>>>>> probably have a lock.
>>>>>>>> 
>>>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>>>> 
>>>>>>>> Figuring this out will allow this project to continue, as of now it
>>>>>>>> is
>>>>>>>> hung up on this issue.
>>>>>>>> 
>>>>>>>> thank you,
>>>>>>>> -chris
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I did some "poor mans" profiling with multiple jstack and came up
>>>>>>>>> with
>>>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>> le
>>>>>>>> .
>>>>>>>> That is the last deepest method in all of the traces, either
>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
>>>>>>>> for a
>>>>>>>> thread (which has been running for the last couple hours):
>>>>>>>>> 
>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(
>>>>>> Fa
>>>>>> stDiffDeltaEncoder.java:540)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedE
>>>>>> nc
>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(H
>>>>>> Fi
>>>>>> leReaderV2.java:1063)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HSt
>>>>>> or
>>>>>> e.java:1776)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> (H
>>>>>> Store.java:1722)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore
>>>>>> .j
>>>>>> ava:1655)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegi
>>>>>> on
>>>>>> .java:1826)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>> av
>>>>>> a:2841)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$
>>>>>> 2.
>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcS
>>>>>> ch
>>>>>> eduler.java:160)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSch
>>>>>> ed
>>>>>> uler.java:38)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedule
>>>>>> r.
>>>>>> java:110)
>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>> 
>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:19
>>>>>> 44
>>>>>> )
>>>>>>>>>   at
>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.root
>>>>>> Bl
>>>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.load
>>>>>> Da
>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seek
>>>>>> To
>>>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekB
>>>>>> ef
>>>>>> ore(HFileReaderV2.java:548)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> (H
>>>>>> Store.java:1712)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore
>>>>>> .j
>>>>>> ava:1655)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegi
>>>>>> on
>>>>>> .java:1826)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>> av
>>>>>> a:2841)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$
>>>>>> 2.
>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcS
>>>>>> ch
>>>>>> eduler.java:160)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSch
>>>>>> ed
>>>>>> uler.java:38)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedule
>>>>>> r.
>>>>>> java:110)
>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong
>>>>>>>>> <jz...@hortonworks.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hey Chris,
>>>>>>>>>> 
>>>>>>>>>> I used performance.py tool which created a table with 50K rows in
>>>>>>>>>> one
>>>>>>>>>> table, run the following query from sqlline.py and everything
>>>>>>>>>> seems
>>>>>> fine
>>>>>>>>>> without seeing CPU running hot.
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>> +------------+
>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>> +------------+
>>>>>>>>>> | 50000      |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>>> +------------+
>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>> +------------+
>>>>>>>>>> | 50000      |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>> 
>>>>>>>>>> Is there anyway could you run profiler to see where the CPU goes?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>> 
>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>> 
>>>>>>>>>>> thanks again,
>>>>>>>>>>> -chris
>>>>>>>>>>> 
>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>> 
>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>>>> pastebin it
>>>>>>>>>>>> some
>>>>>>>>>>>> place?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Nick
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to
>>>>>>>>>>>>> have
>>>>>> been
>>>>>>>>>>>>> set
>>>>>>>>>>>>> to 60 by Ambari during install process). Now we have narrowed
>>>>>>>>>>>>> down
>>>>>>>> what
>>>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
>>>>>>>>>>>>> selects
>>>>>> one
>>>>>>>>>>>>> row
>>>>>>>>>>>>> using the primary_key, no increate in CPU is observed and the
>>>>>> number
>>>>>>>>>>>>> of RPC
>>>>>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If we execute a select that scans the table such as "select
>>>>>> count(*)
>>>>>>>>>>>>> from
>>>>>>>>>>>>> TABLE" or where the "where" clause only limits on non-primary
>>>>>>>>>>>>> key
>>>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
>>>>>>>>>>>>> threads
>>>>>>>>>>>>> increases
>>>>>>>>>>>>> and the CPU utilization of the regionserver increases by
>>>>>>>>>>>>> ~105%.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>>>>> RpcServer.handler
>>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>>>> monitored
>>>>>>>>>>>>> tasks
>>>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a
>>>>>>>>>>>>> Used
>>>>>> heap
>>>>>>>>>>>>> of
>>>>>>>>>>>>> 445.2M.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>>>> turned on
>>>>>>>>>>>>> right
>>>>>>>>>>>>> when one of the Phoenix statements is executed (this statement
>>>>>>>> actually
>>>>>>>>>>>>> used up the last available handler).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>>>>> <jzhong@hortonworks.com
>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> From the stack, it seems you increase the default rpc handler
>>>>>> number
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> about 60. All handlers are serving Get request(You can search
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>> .j
>>>>>>>> av
>>>>>>>>>>>>> a:2
>>>>>>>>>>>>>> 841).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You can check why there are so many get requests by adding
>>>>>>>>>>>>>> some
>>>>>> log
>>>>>>>>>>>>>> info
>>>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the number
>>>>>>>>>>>>>> of
>>>>>> rpc
>>>>>>>>>>>>>> handlers per region server, it will mitigate your current
>>>>>>>>>>>>>> issue.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU
>>>>>>>>>>>>>>> usage
>>>>>>>>>>>>>>> spiking
>>>>>>>>>>>>> up
>>>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster
>>>>>> (with
>>>>>>>> 3
>>>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>>>> plain
>>>>>>>>>>>>>>> Ambari
>>>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release
>>>>>>>>>>>>>>> and
>>>>>>>> hand
>>>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
>>>>>>>>>>>>>>> going
>>>>>>>> well
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> we were able to load in ~30k records into several tables.
>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>> happened
>>>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers
>>>>>>>>>>>>>>> became
>>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
>>>>>>>>>>>>>>> (12
>>>>>> core
>>>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>>>> (initially
>>>>>> we
>>>>>>>>>>>>>>> saw
>>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was not
>>>>>>>>>>>>>>> all
>>>>>> of
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec).
>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>> able
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them,
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up
>>>>>>>>>>>>>>> again.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information and
>>>>>> started
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>>>> jstack
>>>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone can
>>>>>>>>>>>>>>> shed
>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>> light.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>>>>> NOTICE: This message is intended for the use of the
>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> entity
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>>>>>> confidential,
>>>>>>>>>>>>>> privileged and exempt from disclosure under applicable law.
>>>>>>>>>>>>>> If
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> reader
>>>>>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>> notified
>>>>>>>>>>>>> that
>>>>>>>>>>>>>> any printing, copying, dissemination, distribution,
>>>>>>>>>>>>>> disclosure
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If
>>>>>>>>>>>>>> you
>>>>>> have
>>>>>>>>>>>>>> received this communication in error, please contact the
>>>>>>>>>>>>>> sender
>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>>>> entity to
>>>>>>>>>> which it is addressed and may contain information that is
>>>>>> confidential,
>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
>>>>>>>>>> the
>>>>>>>> reader
>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>>>> notified
>>>>>>>> that
>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>>>>>> have
>>>>>>>>>> received this communication in error, please contact the sender
>>>>>>>> immediately
>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or
>>> entity to 
>>> which it is addressed and may contain information that is confidential,
>>> privileged and exempt from disclosure under applicable law. If the
>>> reader 
>>> of this message is not the intended recipient, you are hereby notified
>>> that 
>>> any printing, copying, dissemination, distribution, disclosure or
>>> forwarding of this communication is strictly prohibited. If you have
>>> received this communication in error, please contact the sender
>>> immediately 
>>> and delete it from your system. Thank You.
>> 
> 
> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Jeffrey Zhong <jz...@hortonworks.com>.


Hey Chris,

Sorry for the trace option. The right option should be below. Before you
were able to turn on DEBUG, you just need to change it to TRACE.

    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE



Do you see that the two read requests burn one of your CPU to 100%? From
the call stacks you posted today and before, it seems there might be an
infinite loop inside the function Htable.getRowOrBefore() which is
deprecated and supposed to only work for meta region.(brought by Enis).
Fortunately there is only one place in phoenix call this function, so
could you try the following things?

set following configuration in client side hbase-site.xml to disable stats
update 
   phoenix.query.statsUpdateFrequency = 2147483647
   phoenix.query.maxStatsAge = 0

After the configuration changes, could you verify if the CPU is still
spiking for the same query?

You can also do the following:

Alter your table's ColumnFamily data block encoding to NONE as
DATA_BLOCK_ENCODING => 'NONE' and do a major compaction after that through
Hbase client.


We suspect that the combination of your data, FAST_DIFF data block
encoding & the deprecated Htable.getRowOrBefore() cause the issue.



On 5/20/14 3:03 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:

>Ho Jeffrey.
>
>Thank you for taking a look. I added
>log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom
>log4j.properties in Ambari and did a rolling restart of the regionservers
>(and verified it was pushed out the nodes in
>/etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on the
>table. Looking at the logs I don't see any TRACE messages. The only two
>log messages that happened when issuing the select are:
>
>2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
>coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
>coprocessor scan 
>{"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00
>\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"ca
>cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":
>1,"filter":"FirstKeyOnlyFilter","caching":-1}
>2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
>coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
>rows for ungrouped coprocessor scan
>{"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00
>\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"ca
>cheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":
>1,"filter":"FirstKeyOnlyFilter","caching":-1}
>
>The metrics for the regionserver show only 2 total read requests since it
>started and 0 per second now. Those two requests are for the table I did
>the count on. The regionserver's UI says it has no active RPC calls.  It
>currently has one stuck handler from running that command:
>
>"RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
>nid=0x2582 runnable [0x00007fbd5ba6c000]
>   java.lang.Thread.State: RUNNABLE
>	at 
>org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2
>888)
>	at 
>org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2
>880)
>	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>	at java.util.TreeMap.getEntry(TreeMap.java:340)
>	at java.util.TreeMap.get(TreeMap.java:273)
>	at 
>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(
>GetClosestRowBeforeTracker.java:128)
>	at 
>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandida
>te(GetClosestRowBeforeTracker.java:107)
>	at 
>org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(Get
>ClosestRowBeforeTracker.java:202)
>	at 
>org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.
>java:1772)
>	at 
>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HSt
>ore.java:1722)
>	at 
>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.jav
>a:1655)
>	at 
>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.j
>ava:1826)
>	at 
>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:
>2841)
>	at 
>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.ca
>llBlockingMethod(ClientProtos.java:28857)
>	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>	at 
>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSched
>uler.java:160)
>	at 
>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSchedul
>er.java:38)
>	at 
>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.ja
>va:110)
>	at java.lang.Thread.run(Thread.java:744)
>
>It also has one listener running:
>
>"RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
>nid=0x2579 runnable [0x00007fbd5c376000]
>   java.lang.Thread.State: RUNNABLE
>	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>	- locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>	at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>
>If there is anything else you need, please let me know.
>
>thanks again,
>-chris
>
>
>On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>
>> 
>> I rechecked your RPC log and found that you didn't turn on the RPC trace
>> level. You can set the following so that we can see detail request &
>> response. 
>> 
>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>> 
>> If you didn't see time out exception in your region server logs, it
>>means
>> that there are something keeping issuing lots of Get requests. Have you
>> used secondary index for this table? If that's the case, please
>> disable/drop it to see if that helps.
>> 
>> 
>> 
>> 
>> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>> 
>>> I will see how possible a new, all bare metal cluster is, but we have 5
>>> other similarly configured clusters with no troubles. We only have an
>>> issue when Phoenix does a full scan of some tables that a handler gets
>>> stuck in 
>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>> after it has returned all of its data.
>>> 
>>> thank you,
>>> 
>>> -chris
>>> 
>>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com> wrote:
>>> 
>>>> i'm just guessing but may be region servers are stuck because they
>>>>can't
>>>> communicate with zookeeper or hmaster, or may be datanodes can't talk
>>>>to
>>>> namenode because of some vmware networking quirk and some internal
>>>> thread
>>>> is hanging, who knows,
>>>> it would take 30 min to setup up bare metal cluster vs days debugging
>>>> this
>>>> hybrid setup
>>>> also I'd suggest to install directly from hbase/hadoop websites  to
>>>> eliminate additional variables
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>>> <cf...@biotiquesystems.com>wrote:
>>>> 
>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>> datanodes/regionservers with the stuck processes all bare metal.
>>>>> 
>>>>> -chris
>>>>> 
>>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com>
>>>>>wrote:
>>>>> 
>>>>>>> "with 3 VMWare "master" nodes"
>>>>>> 
>>>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>>>> virtual machines
>>>>>> 
>>>>>> 
>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
>>>>>><cft@biotiquesystems.com
>>>>>> wrote:
>>>>>> 
>>>>>>> Sorry to follow up on my own message, but I was wondering if anyone
>>>>>>> had
>>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>>> right
>>>>>>> after a select * on the exact same table will.
>>>>>>> 
>>>>>>> If we export the table and then re-import it into a new table, the
>>>>>>> new
>>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>>> upsert..select
>>>>>>> to do a copy. It seems something happens to the last region to
>>>>>>>cause
>>>>> this,
>>>>>>> but it is not directly data dependent. Moving the region to another
>>>>>>> regionserver doesn't have any effect - just moves where the problem
>>>>>>> happens. Major compactions get hung up by the running threads as
>>>>>>>they
>>>>>>> probably have a lock.
>>>>>>> 
>>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>>> 
>>>>>>> Figuring this out will allow this project to continue, as of now it
>>>>>>> is
>>>>>>> hung up on this issue.
>>>>>>> 
>>>>>>> thank you,
>>>>>>> -chris
>>>>>>> 
>>>>>>> 
>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> I did some "poor mans" profiling with multiple jstack and came up
>>>>>>>> with
>>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
>>>>>>>le
>>>>>>> .
>>>>>>> That is the last deepest method in all of the traces, either
>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
>>>>>>> for a
>>>>>>> thread (which has been running for the last couple hours):
>>>>>>>> 
>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>> tid=0x0000000000cdb800
>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(
>>>>>Fa
>>>>> stDiffDeltaEncoder.java:540)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedE
>>>>>nc
>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(H
>>>>>Fi
>>>>> leReaderV2.java:1063)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HSt
>>>>>or
>>>>> e.java:1776)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>(H
>>>>> Store.java:1722)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore
>>>>>.j
>>>>> ava:1655)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegi
>>>>>on
>>>>> .java:1826)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>av
>>>>> a:2841)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$
>>>>>2.
>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>    at 
>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>    at 
>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcS
>>>>>ch
>>>>> eduler.java:160)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSch
>>>>>ed
>>>>> uler.java:38)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedule
>>>>>r.
>>>>> java:110)
>>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>>> 
>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>> tid=0x0000000000cdb800
>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:19
>>>>>44
>>>>> )
>>>>>>>>    at
>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.root
>>>>>Bl
>>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.load
>>>>>Da
>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seek
>>>>>To
>>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekB
>>>>>ef
>>>>> ore(HFileReaderV2.java:548)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>(H
>>>>> Store.java:1712)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore
>>>>>.j
>>>>> ava:1655)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegi
>>>>>on
>>>>> .java:1826)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>av
>>>>> a:2841)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$
>>>>>2.
>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>    at 
>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>    at 
>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcS
>>>>>ch
>>>>> eduler.java:160)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSch
>>>>>ed
>>>>> uler.java:38)
>>>>>>>>    at
>>>>>>> 
>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedule
>>>>>r.
>>>>> java:110)
>>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong
>>>>>>>><jz...@hortonworks.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hey Chris,
>>>>>>>>> 
>>>>>>>>> I used performance.py tool which created a table with 50K rows in
>>>>>>>>> one
>>>>>>>>> table, run the following query from sqlline.py and everything
>>>>>>>>>seems
>>>>> fine
>>>>>>>>> without seeing CPU running hot.
>>>>>>>>> 
>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>> +------------+
>>>>>>>>> |  COUNT(1)  |
>>>>>>>>> +------------+
>>>>>>>>> | 50000      |
>>>>>>>>> +------------+
>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>>> PERFORMANCE_50000;
>>>>>>>>> +------------+
>>>>>>>>> |  COUNT(1)  |
>>>>>>>>> +------------+
>>>>>>>>> | 50000      |
>>>>>>>>> +------------+
>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>> 
>>>>>>>>> Is there anyway could you run profiler to see where the CPU goes?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>>>>>>wrote:
>>>>>>>>> 
>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>> 
>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>> 
>>>>>>>>>> thanks again,
>>>>>>>>>> -chris
>>>>>>>>>> 
>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Chris,
>>>>>>>>>>> 
>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>>> pastebin it
>>>>>>>>>>> some
>>>>>>>>>>> place?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> 
>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to
>>>>>>>>>>>>have
>>>>> been
>>>>>>>>>>>> set
>>>>>>>>>>>> to 60 by Ambari during install process). Now we have narrowed
>>>>>>>>>>>> down
>>>>>>> what
>>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>>>>> 
>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
>>>>>>>>>>>>selects
>>>>> one
>>>>>>>>>>>> row
>>>>>>>>>>>> using the primary_key, no increate in CPU is observed and the
>>>>> number
>>>>>>>>>>>> of RPC
>>>>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>>>>> 
>>>>>>>>>>>> If we execute a select that scans the table such as "select
>>>>> count(*)
>>>>>>>>>>>> from
>>>>>>>>>>>> TABLE" or where the "where" clause only limits on non-primary
>>>>>>>>>>>> key
>>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
>>>>>>>>>>>> threads
>>>>>>>>>>>> increases
>>>>>>>>>>>> and the CPU utilization of the regionserver increases by
>>>>>>>>>>>>~105%.
>>>>>>>>>>>> 
>>>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>>>> RpcServer.handler
>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>>>>> 
>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>>> monitored
>>>>>>>>>>>> tasks
>>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a
>>>>>>>>>>>>Used
>>>>> heap
>>>>>>>>>>>> of
>>>>>>>>>>>> 445.2M.
>>>>>>>>>>>> 
>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>>> turned on
>>>>>>>>>>>> right
>>>>>>>>>>>> when one of the Phoenix statements is executed (this statement
>>>>>>> actually
>>>>>>>>>>>> used up the last available handler).
>>>>>>>>>>>> 
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> -chris
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>>>> <jzhong@hortonworks.com
>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> From the stack, it seems you increase the default rpc handler
>>>>> number
>>>>>>>>>>>>> to
>>>>>>>>>>>>> about 60. All handlers are serving Get request(You can search
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
>>>>>>>.j
>>>>>>> av
>>>>>>>>>>>> a:2
>>>>>>>>>>>>> 841).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You can check why there are so many get requests by adding
>>>>>>>>>>>>>some
>>>>> log
>>>>>>>>>>>>> info
>>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the number
>>>>>>>>>>>>> of
>>>>> rpc
>>>>>>>>>>>>> handlers per region server, it will mitigate your current
>>>>>>>>>>>>> issue.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU
>>>>>>>>>>>>>>usage
>>>>>>>>>>>>>> spiking
>>>>>>>>>>>> up
>>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster
>>>>> (with
>>>>>>> 3
>>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>>> plain
>>>>>>>>>>>>>> Ambari
>>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release
>>>>>>>>>>>>>> and
>>>>>>> hand
>>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
>>>>>>>>>>>>>> going
>>>>>>> well
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> we were able to load in ~30k records into several tables.
>>>>>>>>>>>>>>What
>>>>>>>>>>>>>> happened
>>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers
>>>>>>>>>>>>>>became
>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
>>>>>>>>>>>>>>(12
>>>>> core
>>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>>> (initially
>>>>> we
>>>>>>>>>>>>>> saw
>>>>>>>>>>>>>> some flush messages that seemed excessive, but that was not
>>>>>>>>>>>>>> all
>>>>> of
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec).
>>>>>>>>>>>>>>We
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>> able
>>>>>>>>>>>> to
>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them,
>>>>>>>>>>>>>>the
>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up
>>>>>>>>>>>>>>again.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information and
>>>>> started
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>>> jstack
>>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone can
>>>>>>>>>>>>>> shed
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> light.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>>>> NOTICE: This message is intended for the use of the
>>>>>>>>>>>>>individual
>>>>>>>>>>>>> or
>>>>>>>>>>>>> entity
>>>>>>>>>>>> to
>>>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>>>>> confidential,
>>>>>>>>>>>>> privileged and exempt from disclosure under applicable law.
>>>>>>>>>>>>>If
>>>>>>>>>>>>> the
>>>>>>>>>>>>> reader
>>>>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>> notified
>>>>>>>>>>>> that
>>>>>>>>>>>>> any printing, copying, dissemination, distribution,
>>>>>>>>>>>>>disclosure
>>>>>>>>>>>>> or
>>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If
>>>>>>>>>>>>>you
>>>>> have
>>>>>>>>>>>>> received this communication in error, please contact the
>>>>>>>>>>>>>sender
>>>>>>>>>>>> immediately
>>>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>>> entity to
>>>>>>>>> which it is addressed and may contain information that is
>>>>> confidential,
>>>>>>>>> privileged and exempt from disclosure under applicable law. If
>>>>>>>>>the
>>>>>>> reader
>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>>> notified
>>>>>>> that
>>>>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>>>>> have
>>>>>>>>> received this communication in error, please contact the sender
>>>>>>> immediately
>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or
>>entity to 
>> which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the
>>reader 
>> of this message is not the intended recipient, you are hereby notified
>>that 
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender
>>immediately 
>> and delete it from your system. Thank You.
>



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

Ho Jeffrey.

Thank you for taking a look. I added log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom log4j.properties in Ambari and did a rolling restart of the regionservers (and verified it was pushed out the nodes in /etc/hbase/conf/log4j.properties ) then issued a "select count(*)" on the table. Looking at the logs I don't see any TRACE messages. The only two log messages that happened when issuing the select are:

2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}

The metrics for the regionserver show only 2 total read requests since it started and 0 per second now. Those two requests are for the table I did the count on. The regionserver's UI says it has no active RPC calls.  It currently has one stuck handler from running that command:

"RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable [0x00007fbd5ba6c000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
	at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
	at java.util.TreeMap.getEntry(TreeMap.java:340)
	at java.util.TreeMap.get(TreeMap.java:273)
	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
	at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
	at java.lang.Thread.run(Thread.java:744)

It also has one listener running:

"RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 nid=0x2579 runnable [0x00007fbd5c376000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
	- locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
	at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)

If there is anything else you need, please let me know.

thanks again,
-chris


On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:

> 
> I rechecked your RPC log and found that you didn't turn on the RPC trace
> level. You can set the following so that we can see detail request &
> response. 
> 
> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
> 
> If you didn't see time out exception in your region server logs, it means
> that there are something keeping issuing lots of Get requests. Have you
> used secondary index for this table? If that's the case, please
> disable/drop it to see if that helps.
> 
> 
> 
> 
> On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> 
>> I will see how possible a new, all bare metal cluster is, but we have 5
>> other similarly configured clusters with no troubles. We only have an
>> issue when Phoenix does a full scan of some tables that a handler gets
>> stuck in 
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>> after it has returned all of its data.
>> 
>> thank you,
>> 
>> -chris
>> 
>> On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com> wrote:
>> 
>>> i'm just guessing but may be region servers are stuck because they can't
>>> communicate with zookeeper or hmaster, or may be datanodes can't talk to
>>> namenode because of some vmware networking quirk and some internal
>>> thread
>>> is hanging, who knows,
>>> it would take 30 min to setup up bare metal cluster vs days debugging
>>> this
>>> hybrid setup
>>> also I'd suggest to install directly from hbase/hadoop websites  to
>>> eliminate additional variables
>>> 
>>> 
>>> 
>>> 
>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>> <cf...@biotiquesystems.com>wrote:
>>> 
>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>> datanodes/regionservers with the stuck processes all bare metal.
>>>> 
>>>> -chris
>>>> 
>>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com> wrote:
>>>> 
>>>>>> "with 3 VMWare "master" nodes"
>>>>> 
>>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>>> virtual machines
>>>>> 
>>>>> 
>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cft@biotiquesystems.com
>>>>> wrote:
>>>>> 
>>>>>> Sorry to follow up on my own message, but I was wondering if anyone
>>>>>> had
>>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>> right
>>>>>> after a select * on the exact same table will.
>>>>>> 
>>>>>> If we export the table and then re-import it into a new table, the
>>>>>> new
>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>> upsert..select
>>>>>> to do a copy. It seems something happens to the last region to cause
>>>> this,
>>>>>> but it is not directly data dependent. Moving the region to another
>>>>>> regionserver doesn't have any effect - just moves where the problem
>>>>>> happens. Major compactions get hung up by the running threads as they
>>>>>> probably have a lock.
>>>>>> 
>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>> 
>>>>>> Figuring this out will allow this project to continue, as of now it
>>>>>> is
>>>>>> hung up on this issue.
>>>>>> 
>>>>>> thank you,
>>>>>> -chris
>>>>>> 
>>>>>> 
>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com>
>>>> wrote:
>>>>>> 
>>>>>>> I did some "poor mans" profiling with multiple jstack and came up
>>>>>>> with
>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> .
>>>>>> That is the last deepest method in all of the traces, either
>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
>>>>>> for a
>>>>>> thread (which has been running for the last couple hours):
>>>>>>> 
>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>> tid=0x0000000000cdb800
>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(Fa
>>>> stDiffDeltaEncoder.java:540)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEnc
>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFi
>>>> leReaderV2.java:1063)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
>>>> e.java:1776)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>> Store.java:1722)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>> ava:1655)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>> .java:1826)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>> a:2841)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>    at 
>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>    at 
>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>> eduler.java:160)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>> uler.java:38)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>> java:110)
>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>> 
>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>> tid=0x0000000000cdb800
>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944
>>>> )
>>>>>>>    at
>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBl
>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDa
>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekTo
>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBef
>>>> ore(HFileReaderV2.java:548)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>> Store.java:1712)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>> ava:1655)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>> .java:1826)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>> a:2841)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>    at 
>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>    at 
>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>> eduler.java:160)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>> uler.java:38)
>>>>>>>    at
>>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>> java:110)
>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>> 
>>>>>>> 
>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> Hey Chris,
>>>>>>>> 
>>>>>>>> I used performance.py tool which created a table with 50K rows in
>>>>>>>> one
>>>>>>>> table, run the following query from sqlline.py and everything seems
>>>> fine
>>>>>>>> without seeing CPU running hot.
>>>>>>>> 
>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>> PERFORMANCE_50000;
>>>>>>>> +------------+
>>>>>>>> |  COUNT(1)  |
>>>>>>>> +------------+
>>>>>>>> | 50000      |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>>> PERFORMANCE_50000;
>>>>>>>> +------------+
>>>>>>>> |  COUNT(1)  |
>>>>>>>> +------------+
>>>>>>>> | 50000      |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>> 
>>>>>>>> Is there anyway could you run profiler to see where the CPU goes?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>>>>>> 
>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>> 
>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>> 
>>>>>>>>> thanks again,
>>>>>>>>> -chris
>>>>>>>>> 
>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Chris,
>>>>>>>>>> 
>>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>> pastebin it
>>>>>>>>>> some
>>>>>>>>>> place?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello,
>>>>>>>>>>> 
>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to have
>>>> been
>>>>>>>>>>> set
>>>>>>>>>>> to 60 by Ambari during install process). Now we have narrowed
>>>>>>>>>>> down
>>>>>> what
>>>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>>>> 
>>>>>>>>>>> If we connect using sqlline.py and execute a select that selects
>>>> one
>>>>>>>>>>> row
>>>>>>>>>>> using the primary_key, no increate in CPU is observed and the
>>>> number
>>>>>>>>>>> of RPC
>>>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>>>> 
>>>>>>>>>>> If we execute a select that scans the table such as "select
>>>> count(*)
>>>>>>>>>>> from
>>>>>>>>>>> TABLE" or where the "where" clause only limits on non-primary
>>>>>>>>>>> key
>>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
>>>>>>>>>>> threads
>>>>>>>>>>> increases
>>>>>>>>>>> and the CPU utilization of the regionserver increases by ~105%.
>>>>>>>>>>> 
>>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>>> RpcServer.handler
>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>>>> 
>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>> monitored
>>>>>>>>>>> tasks
>>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a Used
>>>> heap
>>>>>>>>>>> of
>>>>>>>>>>> 445.2M.
>>>>>>>>>>> 
>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>> turned on
>>>>>>>>>>> right
>>>>>>>>>>> when one of the Phoenix statements is executed (this statement
>>>>>> actually
>>>>>>>>>>> used up the last available handler).
>>>>>>>>>>> 
>>>>>>>>>>> thanks,
>>>>>>>>>>> -chris
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>>> <jzhong@hortonworks.com
>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> From the stack, it seems you increase the default rpc handler
>>>> number
>>>>>>>>>>>> to
>>>>>>>>>>>> about 60. All handlers are serving Get request(You can search
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>> av
>>>>>>>>>>> a:2
>>>>>>>>>>>> 841).
>>>>>>>>>>>> 
>>>>>>>>>>>> You can check why there are so many get requests by adding some
>>>> log
>>>>>>>>>>>> info
>>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the number
>>>>>>>>>>>> of
>>>> rpc
>>>>>>>>>>>> handlers per region server, it will mitigate your current
>>>>>>>>>>>> issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
>>>>>>>>>>>>> spiking
>>>>>>>>>>> up
>>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster
>>>> (with
>>>>>> 3
>>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>> plain
>>>>>>>>>>>>> Ambari
>>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release
>>>>>>>>>>>>> and
>>>>>> hand
>>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
>>>>>>>>>>>>> going
>>>>>> well
>>>>>>>>>>>>> and
>>>>>>>>>>>>> we were able to load in ~30k records into several tables. What
>>>>>>>>>>>>> happened
>>>>>>>>>>>>> was after about 3-4 days of being up the regionservers became
>>>>>>>>>>>>> unresponsive and started to use most of the available CPU (12
>>>> core
>>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>> (initially
>>>> we
>>>>>>>>>>>>> saw
>>>>>>>>>>>>> some flush messages that seemed excessive, but that was not
>>>>>>>>>>>>> all
>>>> of
>>>>>>>>>>>>> the
>>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec). We
>>>>>>>>>>>>> are
>>>>>>>>>>>>> able
>>>>>>>>>>> to
>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them, the
>>>>>>>>>>>>> cluster
>>>>>>>>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We've dropped the entire HBase and zookeper information and
>>>> started
>>>>>>>>>>>>> from
>>>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>> jstack
>>>>>>>>>>>>> report of a locked up regionserver in hopes that someone can
>>>>>>>>>>>>> shed
>>>>>>>>>>>>> some
>>>>>>>>>>>>> light.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>>> NOTICE: This message is intended for the use of the individual
>>>>>>>>>>>> or
>>>>>>>>>>>> entity
>>>>>>>>>>> to
>>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>>>> confidential,
>>>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
>>>>>>>>>>>> the
>>>>>>>>>>>> reader
>>>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>> notified
>>>>>>>>>>> that
>>>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure
>>>>>>>>>>>> or
>>>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>> have
>>>>>>>>>>>> received this communication in error, please contact the sender
>>>>>>>>>>> immediately
>>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>> entity to
>>>>>>>> which it is addressed and may contain information that is
>>>> confidential,
>>>>>>>> privileged and exempt from disclosure under applicable law. If the
>>>>>> reader
>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>> notified
>>>>>> that
>>>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>>>> have
>>>>>>>> received this communication in error, please contact the sender
>>>>>> immediately
>>>>>>>> and delete it from your system. Thank You.
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
> 
> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Jeffrey Zhong <jz...@hortonworks.com>.

I rechecked your RPC log and found that you didn't turn on the RPC trace
level. You can set the following so that we can see detail request &
response. 

log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE

If you didn't see time out exception in your region server logs, it means
that there are something keeping issuing lots of Get requests. Have you
used secondary index for this table? If that's the case, please
disable/drop it to see if that helps.




On 5/20/14 1:21 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:

>I will see how possible a new, all bare metal cluster is, but we have 5
>other similarly configured clusters with no troubles. We only have an
>issue when Phoenix does a full scan of some tables that a handler gets
>stuck in 
>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>after it has returned all of its data.
>
>thank you,
>
>-chris
>
>On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com> wrote:
>
>> i'm just guessing but may be region servers are stuck because they can't
>> communicate with zookeeper or hmaster, or may be datanodes can't talk to
>> namenode because of some vmware networking quirk and some internal
>>thread
>> is hanging, who knows,
>> it would take 30 min to setup up bare metal cluster vs days debugging
>>this
>> hybrid setup
>> also I'd suggest to install directly from hbase/hadoop websites  to
>> eliminate additional variables
>> 
>> 
>> 
>> 
>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>><cf...@biotiquesystems.com>wrote:
>> 
>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>> datanodes/regionservers with the stuck processes all bare metal.
>>> 
>>> -chris
>>> 
>>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com> wrote:
>>> 
>>>>> "with 3 VMWare "master" nodes"
>>>> 
>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>> virtual machines
>>>> 
>>>> 
>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cft@biotiquesystems.com
>>>> wrote:
>>>> 
>>>>> Sorry to follow up on my own message, but I was wondering if anyone
>>>>>had
>>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>right
>>>>> after a select * on the exact same table will.
>>>>> 
>>>>> If we export the table and then re-import it into a new table, the
>>>>>new
>>>>> table doesn't exhibit these symptoms, same as if we use an
>>> upsert..select
>>>>> to do a copy. It seems something happens to the last region to cause
>>> this,
>>>>> but it is not directly data dependent. Moving the region to another
>>>>> regionserver doesn't have any effect - just moves where the problem
>>>>> happens. Major compactions get hung up by the running threads as they
>>>>> probably have a lock.
>>>>> 
>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>> 
>>>>> Figuring this out will allow this project to continue, as of now it
>>>>>is
>>>>> hung up on this issue.
>>>>> 
>>>>> thank you,
>>>>> -chris
>>>>> 
>>>>> 
>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com>
>>> wrote:
>>>>> 
>>>>>> I did some "poor mans" profiling with multiple jstack and came up
>>>>>>with
>>>>> where the RpcServer.handler appears to be stuck:
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>.
>>>>> That is the last deepest method in all of the traces, either
>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
>>>>>for a
>>>>> thread (which has been running for the last couple hours):
>>>>>> 
>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>tid=0x0000000000cdb800
>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(Fa
>>>stDiffDeltaEncoder.java:540)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEnc
>>>odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFi
>>>leReaderV2.java:1063)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
>>>e.java:1776)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>Store.java:1722)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>ava:1655)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>.java:1826)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>a:2841)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>callBlockingMethod(ClientProtos.java:28857)
>>>>>>     at 
>>>>>>org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>     at 
>>>>>>org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>eduler.java:160)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>uler.java:38)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>java:110)
>>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>> 
>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>tid=0x0000000000cdb800
>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944
>>>)
>>>>>>     at
>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBl
>>>ockContainingKey(HFileBlockIndex.java:392)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDa
>>>taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekTo
>>>DataBlock(HFileBlockIndex.java:179)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBef
>>>ore(HFileReaderV2.java:548)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
>>>Store.java:1712)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
>>>ava:1655)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
>>>.java:1826)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>a:2841)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>>>callBlockingMethod(ClientProtos.java:28857)
>>>>>>     at 
>>>>>>org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>     at 
>>>>>>org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
>>>eduler.java:160)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
>>>uler.java:38)
>>>>>>     at
>>>>> 
>>> 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
>>>java:110)
>>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>> 
>>>>>> 
>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>>> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hey Chris,
>>>>>>> 
>>>>>>> I used performance.py tool which created a table with 50K rows in
>>>>>>>one
>>>>>>> table, run the following query from sqlline.py and everything seems
>>> fine
>>>>>>> without seeing CPU running hot.
>>>>>>> 
>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>> PERFORMANCE_50000;
>>>>>>> +------------+
>>>>>>> |  COUNT(1)  |
>>>>>>> +------------+
>>>>>>> | 50000      |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.166 seconds)
>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>>> PERFORMANCE_50000;
>>>>>>> +------------+
>>>>>>> |  COUNT(1)  |
>>>>>>> +------------+
>>>>>>> | 50000      |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.167 seconds)
>>>>>>> 
>>>>>>> Is there anyway could you run profiler to see where the CPU goes?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>>>>> 
>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>> 
>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>> 
>>>>>>>> thanks again,
>>>>>>>> -chris
>>>>>>>> 
>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Chris,
>>>>>>>>> 
>>>>>>>>> Attachments are filtered out by the mail server. Can you
>>>>>>>>>pastebin it
>>>>>>>>> some
>>>>>>>>> place?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Nick
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>>> 
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to have
>>> been
>>>>>>>>>> set
>>>>>>>>>> to 60 by Ambari during install process). Now we have narrowed
>>>>>>>>>>down
>>>>> what
>>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>>> 
>>>>>>>>>> If we connect using sqlline.py and execute a select that selects
>>> one
>>>>>>>>>> row
>>>>>>>>>> using the primary_key, no increate in CPU is observed and the
>>> number
>>>>>>>>>> of RPC
>>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>>> 
>>>>>>>>>> If we execute a select that scans the table such as "select
>>> count(*)
>>>>>>>>>> from
>>>>>>>>>> TABLE" or where the "where" clause only limits on non-primary
>>>>>>>>>>key
>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler
>>>>>>>>>>threads
>>>>>>>>>> increases
>>>>>>>>>> and the CPU utilization of the regionserver increases by ~105%.
>>>>>>>>>> 
>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>> RpcServer.handler
>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>>> 
>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other
>>>>>>>>>>monitored
>>>>>>>>>> tasks
>>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a Used
>>> heap
>>>>>>>>>> of
>>>>>>>>>> 445.2M.
>>>>>>>>>> 
>>>>>>>>>> I've attached the regionserver log with IPC debug logging
>>>>>>>>>>turned on
>>>>>>>>>> right
>>>>>>>>>> when one of the Phoenix statements is executed (this statement
>>>>> actually
>>>>>>>>>> used up the last available handler).
>>>>>>>>>> 
>>>>>>>>>> thanks,
>>>>>>>>>> -chris
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
>>>>>>>>>><jzhong@hortonworks.com
>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> From the stack, it seems you increase the default rpc handler
>>> number
>>>>>>>>>>> to
>>>>>>>>>>> about 60. All handlers are serving Get request(You can search
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>> 
>>>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>av
>>>>>>>>>> a:2
>>>>>>>>>>> 841).
>>>>>>>>>>> 
>>>>>>>>>>> You can check why there are so many get requests by adding some
>>> log
>>>>>>>>>>> info
>>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the number
>>>>>>>>>>>of
>>> rpc
>>>>>>>>>>> handlers per region server, it will mitigate your current
>>>>>>>>>>>issue.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
>>>>>>>>>>>> spiking
>>>>>>>>>> up
>>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>>> 
>>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster
>>> (with
>>>>> 3
>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>plain
>>>>>>>>>>>> Ambari
>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release
>>>>>>>>>>>>and
>>>>> hand
>>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was
>>>>>>>>>>>>going
>>>>> well
>>>>>>>>>>>> and
>>>>>>>>>>>> we were able to load in ~30k records into several tables. What
>>>>>>>>>>>> happened
>>>>>>>>>>>> was after about 3-4 days of being up the regionservers became
>>>>>>>>>>>> unresponsive and started to use most of the available CPU (12
>>> core
>>>>>>>>>>>> boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>(initially
>>> we
>>>>>>>>>>>> saw
>>>>>>>>>>>> some flush messages that seemed excessive, but that was not
>>>>>>>>>>>>all
>>> of
>>>>>>>>>>>> the
>>>>>>>>>>>> time and we changed back to the standard HBase WAL codec). We
>>>>>>>>>>>>are
>>>>>>>>>>>> able
>>>>>>>>>> to
>>>>>>>>>>>> kill the unresponsive regionservers and then restart them, the
>>>>>>>>>>>> cluster
>>>>>>>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>>>>>>>> 
>>>>>>>>>>>> We've dropped the entire HBase and zookeper information and
>>> started
>>>>>>>>>>>> from
>>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>>> 
>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>> jstack
>>>>>>>>>>>> report of a locked up regionserver in hopes that someone can
>>>>>>>>>>>>shed
>>>>>>>>>>>> some
>>>>>>>>>>>> light.
>>>>>>>>>>>> 
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> -chris
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>>> NOTICE: This message is intended for the use of the individual
>>>>>>>>>>>or
>>>>>>>>>>> entity
>>>>>>>>>> to
>>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>>> confidential,
>>>>>>>>>>> privileged and exempt from disclosure under applicable law. If
>>>>>>>>>>>the
>>>>>>>>>>> reader
>>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>>> notified
>>>>>>>>>> that
>>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure
>>>>>>>>>>>or
>>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>>> have
>>>>>>>>>>> received this communication in error, please contact the sender
>>>>>>>>>> immediately
>>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>> entity to
>>>>>>> which it is addressed and may contain information that is
>>> confidential,
>>>>>>> privileged and exempt from disclosure under applicable law. If the
>>>>> reader
>>>>>>> of this message is not the intended recipient, you are hereby
>>>>>>>notified
>>>>> that
>>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>>> forwarding of this communication is strictly prohibited. If you
>>>>>>>have
>>>>>>> received this communication in error, please contact the sender
>>>>> immediately
>>>>>>> and delete it from your system. Thank You.
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

I will see how possible a new, all bare metal cluster is, but we have 5 other similarly configured clusters with no troubles. We only have an issue when Phoenix does a full scan of some tables that a handler gets stuck in org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile after it has returned all of its data.

thank you,

-chris

On May 20, 2014, at 12:34 PM, alex kamil <al...@gmail.com> wrote:

> i'm just guessing but may be region servers are stuck because they can't
> communicate with zookeeper or hmaster, or may be datanodes can't talk to
> namenode because of some vmware networking quirk and some internal thread
> is hanging, who knows,
> it would take 30 min to setup up bare metal cluster vs days debugging this
> hybrid setup
> also I'd suggest to install directly from hbase/hadoop websites  to
> eliminate additional variables
> 
> 
> 
> 
> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <cf...@biotiquesystems.com>wrote:
> 
>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>> datanodes/regionservers with the stuck processes all bare metal.
>> 
>> -chris
>> 
>> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com> wrote:
>> 
>>>> "with 3 VMWare "master" nodes"
>>> 
>>> can you try running hbase/phoenix on physical nodes instead of using
>>> virtual machines
>>> 
>>> 
>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cft@biotiquesystems.com
>>> wrote:
>>> 
>>>> Sorry to follow up on my own message, but I was wondering if anyone had
>>>> any ideas? Normal non-phoenix scans don't cause this symptom, but right
>>>> after a select * on the exact same table will.
>>>> 
>>>> If we export the table and then re-import it into a new table, the new
>>>> table doesn't exhibit these symptoms, same as if we use an
>> upsert..select
>>>> to do a copy. It seems something happens to the last region to cause
>> this,
>>>> but it is not directly data dependent. Moving the region to another
>>>> regionserver doesn't have any effect - just moves where the problem
>>>> happens. Major compactions get hung up by the running threads as they
>>>> probably have a lock.
>>>> 
>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>> 
>>>> Figuring this out will allow this project to continue, as of now it is
>>>> hung up on this issue.
>>>> 
>>>> thank you,
>>>> -chris
>>>> 
>>>> 
>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com>
>> wrote:
>>>> 
>>>>> I did some "poor mans" profiling with multiple jstack and came up with
>>>> where the RpcServer.handler appears to be stuck:
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>> That is the last deepest method in all of the traces, either
>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces for a
>>>> thread (which has been running for the last couple hours):
>>>>> 
>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>> java.lang.Thread.State: RUNNABLE
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>> 
>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>> java.lang.Thread.State: RUNNABLE
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>     at
>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>     at
>>>> 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>> 
>>>>> 
>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> Hey Chris,
>>>>>> 
>>>>>> I used performance.py tool which created a table with 50K rows in one
>>>>>> table, run the following query from sqlline.py and everything seems
>> fine
>>>>>> without seeing CPU running hot.
>>>>>> 
>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>> PERFORMANCE_50000;
>>>>>> +------------+
>>>>>> |  COUNT(1)  |
>>>>>> +------------+
>>>>>> | 50000      |
>>>>>> +------------+
>>>>>> 1 row selected (0.166 seconds)
>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>>>> PERFORMANCE_50000;
>>>>>> +------------+
>>>>>> |  COUNT(1)  |
>>>>>> +------------+
>>>>>> | 50000      |
>>>>>> +------------+
>>>>>> 1 row selected (0.167 seconds)
>>>>>> 
>>>>>> Is there anyway could you run profiler to see where the CPU goes?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>>>> 
>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>> 
>>>>>>> http://pastebin.com/w6mtabag
>>>>>>> 
>>>>>>> thanks again,
>>>>>>> -chris
>>>>>>> 
>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>> Hi Chris,
>>>>>>>> 
>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>>>>> some
>>>>>>>> place?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>>>> 
>>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to have
>> been
>>>>>>>>> set
>>>>>>>>> to 60 by Ambari during install process). Now we have narrowed down
>>>> what
>>>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>>>> 
>>>>>>>>> If we connect using sqlline.py and execute a select that selects
>> one
>>>>>>>>> row
>>>>>>>>> using the primary_key, no increate in CPU is observed and the
>> number
>>>>>>>>> of RPC
>>>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>>>> 
>>>>>>>>> If we execute a select that scans the table such as "select
>> count(*)
>>>>>>>>> from
>>>>>>>>> TABLE" or where the "where" clause only limits on non-primary key
>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>>>>>>> increases
>>>>>>>>> and the CPU utilization of the regionserver increases by ~105%.
>>>>>>>>> 
>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>> RpcServer.handler
>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>>>> 
>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
>>>>>>>>> tasks
>>>>>>>>> are happening. The regionserver has a Max Heap of 10G and a Used
>> heap
>>>>>>>>> of
>>>>>>>>> 445.2M.
>>>>>>>>> 
>>>>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>>>>> right
>>>>>>>>> when one of the Phoenix statements is executed (this statement
>>>> actually
>>>>>>>>> used up the last available handler).
>>>>>>>>> 
>>>>>>>>> thanks,
>>>>>>>>> -chris
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzhong@hortonworks.com
>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> From the stack, it seems you increase the default rpc handler
>> number
>>>>>>>>>> to
>>>>>>>>>> about 60. All handlers are serving Get request(You can search
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>>>>>>> a:2
>>>>>>>>>> 841).
>>>>>>>>>> 
>>>>>>>>>> You can check why there are so many get requests by adding some
>> log
>>>>>>>>>> info
>>>>>>>>>> or enable hbase rpc trace. I guess if you decrease the number of
>> rpc
>>>>>>>>>> handlers per region server, it will mitigate your current issue.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
>>>>>>>>>>> spiking
>>>>>>>>> up
>>>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>>>> 
>>>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster
>> (with
>>>> 3
>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>>>>>>> Ambari
>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and
>>>> hand
>>>>>>>>>>> linked in the jar files to the hadoop lib. Everything was going
>>>> well
>>>>>>>>>>> and
>>>>>>>>>>> we were able to load in ~30k records into several tables. What
>>>>>>>>>>> happened
>>>>>>>>>>> was after about 3-4 days of being up the regionservers became
>>>>>>>>>>> unresponsive and started to use most of the available CPU (12
>> core
>>>>>>>>>>> boxes). Nothing terribly informative was in the logs (initially
>> we
>>>>>>>>>>> saw
>>>>>>>>>>> some flush messages that seemed excessive, but that was not all
>> of
>>>>>>>>>>> the
>>>>>>>>>>> time and we changed back to the standard HBase WAL codec). We are
>>>>>>>>>>> able
>>>>>>>>> to
>>>>>>>>>>> kill the unresponsive regionservers and then restart them, the
>>>>>>>>>>> cluster
>>>>>>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>>>>>>> 
>>>>>>>>>>> We've dropped the entire HBase and zookeper information and
>> started
>>>>>>>>>>> from
>>>>>>>>>>> scratch, but that has not helped.
>>>>>>>>>>> 
>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>> jstack
>>>>>>>>>>> report of a locked up regionserver in hopes that someone can shed
>>>>>>>>>>> some
>>>>>>>>>>> light.
>>>>>>>>>>> 
>>>>>>>>>>> thanks,
>>>>>>>>>>> -chris
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>>>>>> entity
>>>>>>>>> to
>>>>>>>>>> which it is addressed and may contain information that is
>>>>>>>>>> confidential,
>>>>>>>>>> privileged and exempt from disclosure under applicable law. If the
>>>>>>>>>> reader
>>>>>>>>>> of this message is not the intended recipient, you are hereby
>>>> notified
>>>>>>>>> that
>>>>>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>>>>>> forwarding of this communication is strictly prohibited. If you
>> have
>>>>>>>>>> received this communication in error, please contact the sender
>>>>>>>>> immediately
>>>>>>>>>> and delete it from your system. Thank You.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> CONFIDENTIALITY NOTICE
>>>>>> NOTICE: This message is intended for the use of the individual or
>>>> entity to
>>>>>> which it is addressed and may contain information that is
>> confidential,
>>>>>> privileged and exempt from disclosure under applicable law. If the
>>>> reader
>>>>>> of this message is not the intended recipient, you are hereby notified
>>>> that
>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>> forwarding of this communication is strictly prohibited. If you have
>>>>>> received this communication in error, please contact the sender
>>>> immediately
>>>>>> and delete it from your system. Thank You.
>>>>> 
>>>> 
>>>> 
>> 
>>

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by alex kamil <al...@gmail.com>.

i'm just guessing but may be region servers are stuck because they can't
communicate with zookeeper or hmaster, or may be datanodes can't talk to
namenode because of some vmware networking quirk and some internal thread
is hanging, who knows,
it would take 30 min to setup up bare metal cluster vs days debugging this
hybrid setup
also I'd suggest to install directly from hbase/hadoop websites  to
eliminate additional variables




On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <cf...@biotiquesystems.com>wrote:

> Only the master nodes (HMaster, ZK and NN) are VMs. The
> datanodes/regionservers with the stuck processes all bare metal.
>
> -chris
>
> On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com> wrote:
>
> >> "with 3 VMWare "master" nodes"
> >
> > can you try running hbase/phoenix on physical nodes instead of using
> > virtual machines
> >
> >
> > On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cft@biotiquesystems.com
> >wrote:
> >
> >> Sorry to follow up on my own message, but I was wondering if anyone had
> >> any ideas? Normal non-phoenix scans don't cause this symptom, but right
> >> after a select * on the exact same table will.
> >>
> >> If we export the table and then re-import it into a new table, the new
> >> table doesn't exhibit these symptoms, same as if we use an
> upsert..select
> >> to do a copy. It seems something happens to the last region to cause
> this,
> >> but it is not directly data dependent. Moving the region to another
> >> regionserver doesn't have any effect - just moves where the problem
> >> happens. Major compactions get hung up by the running threads as they
> >> probably have a lock.
> >>
> >> I've run the hfile tool on the final region and nothing seems awry.
> >>
> >> Figuring this out will allow this project to continue, as of now it is
> >> hung up on this issue.
> >>
> >> thank you,
> >> -chris
> >>
> >>
> >> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com>
> wrote:
> >>
> >>> I did some "poor mans" profiling with multiple jstack and came up with
> >> where the RpcServer.handler appears to be stuck:
> >> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
> >> That is the last deepest method in all of the traces, either
> >> HStore.java:1712 or HStore.java:1722. Here are two example traces for a
> >> thread (which has been running for the last couple hours):
> >>>
> >>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
> >> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>  java.lang.Thread.State: RUNNABLE
> >>>      at
> >>
> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> >>>      at
> >>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> >>>      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >>>      at java.lang.Thread.run(Thread.java:744)
> >>>
> >>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
> >> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>  java.lang.Thread.State: RUNNABLE
> >>>      at
> >>
> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
> >>>      at
> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
> >>>      at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> >>>      at
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> >>>      at
> >>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> >>>      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >>>      at
> >>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >>>      at java.lang.Thread.run(Thread.java:744)
> >>>
> >>>
> >>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com>
> >> wrote:
> >>>
> >>>>
> >>>> Hey Chris,
> >>>>
> >>>> I used performance.py tool which created a table with 50K rows in one
> >>>> table, run the following query from sqlline.py and everything seems
> fine
> >>>> without seeing CPU running hot.
> >>>>
> >>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> >>>> PERFORMANCE_50000;
> >>>> +------------+
> >>>> |  COUNT(1)  |
> >>>> +------------+
> >>>> | 50000      |
> >>>> +------------+
> >>>> 1 row selected (0.166 seconds)
> >>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> >>>> PERFORMANCE_50000;
> >>>> +------------+
> >>>> |  COUNT(1)  |
> >>>> +------------+
> >>>> | 50000      |
> >>>> +------------+
> >>>> 1 row selected (0.167 seconds)
> >>>>
> >>>> Is there anyway could you run profiler to see where the CPU goes?
> >>>>
> >>>>
> >>>>
> >>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >>>>
> >>>>> Ahh, yes. Here is a pastebin for it:
> >>>>>
> >>>>> http://pastebin.com/w6mtabag
> >>>>>
> >>>>> thanks again,
> >>>>> -chris
> >>>>>
> >>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Hi Chris,
> >>>>>>
> >>>>>> Attachments are filtered out by the mail server. Can you pastebin it
> >>>>>> some
> >>>>>> place?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Nick
> >>>>>>
> >>>>>>
> >>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
> >>>>>> <cf...@biotiquesystems.com>wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> We set the HBase RegionServer Handler to 10 (it appears to have
> been
> >>>>>>> set
> >>>>>>> to 60 by Ambari during install process). Now we have narrowed down
> >> what
> >>>>>>> causes the CPU to increase and have some detailed logs:
> >>>>>>>
> >>>>>>> If we connect using sqlline.py and execute a select that selects
> one
> >>>>>>> row
> >>>>>>> using the primary_key, no increate in CPU is observed and the
> number
> >>>>>>> of RPC
> >>>>>>> threads in a RUNNABLE state remains the same.
> >>>>>>>
> >>>>>>> If we execute a select that scans the table such as "select
> count(*)
> >>>>>>> from
> >>>>>>> TABLE" or where the "where" clause only limits on non-primary key
> >>>>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
> >>>>>>> increases
> >>>>>>> and the CPU utilization of the regionserver increases by ~105%.
> >>>>>>>
> >>>>>>> Disconnecting the client does not have an effect and the
> >>>>>>> RpcServer.handler
> >>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
> >>>>>>>
> >>>>>>> Checking the Web Console for the Regionserver just shows 10
> >>>>>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
> >>>>>>> tasks
> >>>>>>> are happening. The regionserver has a Max Heap of 10G and a Used
> heap
> >>>>>>> of
> >>>>>>> 445.2M.
> >>>>>>>
> >>>>>>> I've attached the regionserver log with IPC debug logging turned on
> >>>>>>> right
> >>>>>>> when one of the Phoenix statements is executed (this statement
> >> actually
> >>>>>>> used up the last available handler).
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>> -chris
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzhong@hortonworks.com
> >
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> From the stack, it seems you increase the default rpc handler
> number
> >>>>>>>> to
> >>>>>>>> about 60. All handlers are serving Get request(You can search
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
> >>>>>>> a:2
> >>>>>>>> 841).
> >>>>>>>>
> >>>>>>>> You can check why there are so many get requests by adding some
> log
> >>>>>>>> info
> >>>>>>>> or enable hbase rpc trace. I guess if you decrease the number of
> rpc
> >>>>>>>> handlers per region server, it will mitigate your current issue.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com>
> wrote:
> >>>>>>>>
> >>>>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
> >>>>>>>>> spiking
> >>>>>>> up
> >>>>>>>>> to use all available CPU and becoming unresponsive.
> >>>>>>>>>
> >>>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster
> (with
> >> 3
> >>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
> >>>>>>>>> Ambari
> >>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and
> >> hand
> >>>>>>>>> linked in the jar files to the hadoop lib. Everything was going
> >> well
> >>>>>>>>> and
> >>>>>>>>> we were able to load in ~30k records into several tables. What
> >>>>>>>>> happened
> >>>>>>>>> was after about 3-4 days of being up the regionservers became
> >>>>>>>>> unresponsive and started to use most of the available CPU (12
> core
> >>>>>>>>> boxes). Nothing terribly informative was in the logs (initially
> we
> >>>>>>>>> saw
> >>>>>>>>> some flush messages that seemed excessive, but that was not all
> of
> >>>>>>>>> the
> >>>>>>>>> time and we changed back to the standard HBase WAL codec). We are
> >>>>>>>>> able
> >>>>>>> to
> >>>>>>>>> kill the unresponsive regionservers and then restart them, the
> >>>>>>>>> cluster
> >>>>>>>>> will be fine for a day or so but will start to lock up again.
> >>>>>>>>>
> >>>>>>>>> We've dropped the entire HBase and zookeper information and
> started
> >>>>>>>>> from
> >>>>>>>>> scratch, but that has not helped.
> >>>>>>>>>
> >>>>>>>>> James Taylor suggested I send this off here. I've attached a
> jstack
> >>>>>>>>> report of a locked up regionserver in hopes that someone can shed
> >>>>>>>>> some
> >>>>>>>>> light.
> >>>>>>>>>
> >>>>>>>>> thanks,
> >>>>>>>>> -chris
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> CONFIDENTIALITY NOTICE
> >>>>>>>> NOTICE: This message is intended for the use of the individual or
> >>>>>>>> entity
> >>>>>>> to
> >>>>>>>> which it is addressed and may contain information that is
> >>>>>>>> confidential,
> >>>>>>>> privileged and exempt from disclosure under applicable law. If the
> >>>>>>>> reader
> >>>>>>>> of this message is not the intended recipient, you are hereby
> >> notified
> >>>>>>> that
> >>>>>>>> any printing, copying, dissemination, distribution, disclosure or
> >>>>>>>> forwarding of this communication is strictly prohibited. If you
> have
> >>>>>>>> received this communication in error, please contact the sender
> >>>>>>> immediately
> >>>>>>>> and delete it from your system. Thank You.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> CONFIDENTIALITY NOTICE
> >>>> NOTICE: This message is intended for the use of the individual or
> >> entity to
> >>>> which it is addressed and may contain information that is
> confidential,
> >>>> privileged and exempt from disclosure under applicable law. If the
> >> reader
> >>>> of this message is not the intended recipient, you are hereby notified
> >> that
> >>>> any printing, copying, dissemination, distribution, disclosure or
> >>>> forwarding of this communication is strictly prohibited. If you have
> >>>> received this communication in error, please contact the sender
> >> immediately
> >>>> and delete it from your system. Thank You.
> >>>
> >>
> >>
>
>

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

Only the master nodes (HMaster, ZK and NN) are VMs. The datanodes/regionservers with the stuck processes all bare metal. 

-chris

On May 20, 2014, at 11:52 AM, alex kamil <al...@gmail.com> wrote:

>> "with 3 VMWare "master" nodes"
> 
> can you try running hbase/phoenix on physical nodes instead of using
> virtual machines
> 
> 
> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cf...@biotiquesystems.com>wrote:
> 
>> Sorry to follow up on my own message, but I was wondering if anyone had
>> any ideas? Normal non-phoenix scans don't cause this symptom, but right
>> after a select * on the exact same table will.
>> 
>> If we export the table and then re-import it into a new table, the new
>> table doesn't exhibit these symptoms, same as if we use an upsert..select
>> to do a copy. It seems something happens to the last region to cause this,
>> but it is not directly data dependent. Moving the region to another
>> regionserver doesn't have any effect - just moves where the problem
>> happens. Major compactions get hung up by the running threads as they
>> probably have a lock.
>> 
>> I've run the hfile tool on the final region and nothing seems awry.
>> 
>> Figuring this out will allow this project to continue, as of now it is
>> hung up on this issue.
>> 
>> thank you,
>> -chris
>> 
>> 
>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com> wrote:
>> 
>>> I did some "poor mans" profiling with multiple jstack and came up with
>> where the RpcServer.handler appears to be stuck:
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>> That is the last deepest method in all of the traces, either
>> HStore.java:1712 or HStore.java:1722. Here are two example traces for a
>> thread (which has been running for the last couple hours):
>>> 
>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>  java.lang.Thread.State: RUNNABLE
>>>      at
>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>      at
>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>      at
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>      at
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>      at
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>      at
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>      at
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>      at java.lang.Thread.run(Thread.java:744)
>>> 
>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>  java.lang.Thread.State: RUNNABLE
>>>      at
>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>      at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>      at
>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>      at
>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>      at
>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>      at
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>      at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>      at
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>      at
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>      at
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>      at
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>      at java.lang.Thread.run(Thread.java:744)
>>> 
>>> 
>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com>
>> wrote:
>>> 
>>>> 
>>>> Hey Chris,
>>>> 
>>>> I used performance.py tool which created a table with 50K rows in one
>>>> table, run the following query from sqlline.py and everything seems fine
>>>> without seeing CPU running hot.
>>>> 
>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>> PERFORMANCE_50000;
>>>> +------------+
>>>> |  COUNT(1)  |
>>>> +------------+
>>>> | 50000      |
>>>> +------------+
>>>> 1 row selected (0.166 seconds)
>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>>>> PERFORMANCE_50000;
>>>> +------------+
>>>> |  COUNT(1)  |
>>>> +------------+
>>>> | 50000      |
>>>> +------------+
>>>> 1 row selected (0.167 seconds)
>>>> 
>>>> Is there anyway could you run profiler to see where the CPU goes?
>>>> 
>>>> 
>>>> 
>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>> 
>>>>> Ahh, yes. Here is a pastebin for it:
>>>>> 
>>>>> http://pastebin.com/w6mtabag
>>>>> 
>>>>> thanks again,
>>>>> -chris
>>>>> 
>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>>>>> 
>>>>>> Hi Chris,
>>>>>> 
>>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>>> some
>>>>>> place?
>>>>>> 
>>>>>> Thanks,
>>>>>> Nick
>>>>>> 
>>>>>> 
>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>> <cf...@biotiquesystems.com>wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to have been
>>>>>>> set
>>>>>>> to 60 by Ambari during install process). Now we have narrowed down
>> what
>>>>>>> causes the CPU to increase and have some detailed logs:
>>>>>>> 
>>>>>>> If we connect using sqlline.py and execute a select that selects one
>>>>>>> row
>>>>>>> using the primary_key, no increate in CPU is observed and the number
>>>>>>> of RPC
>>>>>>> threads in a RUNNABLE state remains the same.
>>>>>>> 
>>>>>>> If we execute a select that scans the table such as "select count(*)
>>>>>>> from
>>>>>>> TABLE" or where the "where" clause only limits on non-primary key
>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>>>>> increases
>>>>>>> and the CPU utilization of the regionserver increases by ~105%.
>>>>>>> 
>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>> RpcServer.handler
>>>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>>>> 
>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
>>>>>>> tasks
>>>>>>> are happening. The regionserver has a Max Heap of 10G and a Used heap
>>>>>>> of
>>>>>>> 445.2M.
>>>>>>> 
>>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>>> right
>>>>>>> when one of the Phoenix statements is executed (this statement
>> actually
>>>>>>> used up the last available handler).
>>>>>>> 
>>>>>>> thanks,
>>>>>>> -chris
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> From the stack, it seems you increase the default rpc handler number
>>>>>>>> to
>>>>>>>> about 60. All handlers are serving Get request(You can search
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>>>>> a:2
>>>>>>>> 841).
>>>>>>>> 
>>>>>>>> You can check why there are so many get requests by adding some log
>>>>>>>> info
>>>>>>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
>>>>>>>> handlers per region server, it will mitigate your current issue.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>>>>>> 
>>>>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
>>>>>>>>> spiking
>>>>>>> up
>>>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>>>> 
>>>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with
>> 3
>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>>>>> Ambari
>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and
>> hand
>>>>>>>>> linked in the jar files to the hadoop lib. Everything was going
>> well
>>>>>>>>> and
>>>>>>>>> we were able to load in ~30k records into several tables. What
>>>>>>>>> happened
>>>>>>>>> was after about 3-4 days of being up the regionservers became
>>>>>>>>> unresponsive and started to use most of the available CPU (12 core
>>>>>>>>> boxes). Nothing terribly informative was in the logs (initially we
>>>>>>>>> saw
>>>>>>>>> some flush messages that seemed excessive, but that was not all of
>>>>>>>>> the
>>>>>>>>> time and we changed back to the standard HBase WAL codec). We are
>>>>>>>>> able
>>>>>>> to
>>>>>>>>> kill the unresponsive regionservers and then restart them, the
>>>>>>>>> cluster
>>>>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>>>>> 
>>>>>>>>> We've dropped the entire HBase and zookeper information and started
>>>>>>>>> from
>>>>>>>>> scratch, but that has not helped.
>>>>>>>>> 
>>>>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>>>>> report of a locked up regionserver in hopes that someone can shed
>>>>>>>>> some
>>>>>>>>> light.
>>>>>>>>> 
>>>>>>>>> thanks,
>>>>>>>>> -chris
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>>>> entity
>>>>>>> to
>>>>>>>> which it is addressed and may contain information that is
>>>>>>>> confidential,
>>>>>>>> privileged and exempt from disclosure under applicable law. If the
>>>>>>>> reader
>>>>>>>> of this message is not the intended recipient, you are hereby
>> notified
>>>>>>> that
>>>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>>>> forwarding of this communication is strictly prohibited. If you have
>>>>>>>> received this communication in error, please contact the sender
>>>>>>> immediately
>>>>>>>> and delete it from your system. Thank You.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> CONFIDENTIALITY NOTICE
>>>> NOTICE: This message is intended for the use of the individual or
>> entity to
>>>> which it is addressed and may contain information that is confidential,
>>>> privileged and exempt from disclosure under applicable law. If the
>> reader
>>>> of this message is not the intended recipient, you are hereby notified
>> that
>>>> any printing, copying, dissemination, distribution, disclosure or
>>>> forwarding of this communication is strictly prohibited. If you have
>>>> received this communication in error, please contact the sender
>> immediately
>>>> and delete it from your system. Thank You.
>>> 
>> 
>>

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by alex kamil <al...@gmail.com>.

>"with 3 VMWare "master" nodes"

can you try running hbase/phoenix on physical nodes instead of using
virtual machines


On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cf...@biotiquesystems.com>wrote:

> Sorry to follow up on my own message, but I was wondering if anyone had
> any ideas? Normal non-phoenix scans don't cause this symptom, but right
> after a select * on the exact same table will.
>
> If we export the table and then re-import it into a new table, the new
> table doesn't exhibit these symptoms, same as if we use an upsert..select
> to do a copy. It seems something happens to the last region to cause this,
> but it is not directly data dependent. Moving the region to another
> regionserver doesn't have any effect - just moves where the problem
> happens. Major compactions get hung up by the running threads as they
> probably have a lock.
>
> I've run the hfile tool on the final region and nothing seems awry.
>
> Figuring this out will allow this project to continue, as of now it is
> hung up on this issue.
>
> thank you,
> -chris
>
>
> On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com> wrote:
>
> > I did some "poor mans" profiling with multiple jstack and came up with
> where the RpcServer.handler appears to be stuck:
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
> That is the last deepest method in all of the traces, either
> HStore.java:1712 or HStore.java:1722. Here are two example traces for a
> thread (which has been running for the last couple hours):
> >
> > "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
> nid=0x727b runnable [0x00007f4b49e9e000]
> >   java.lang.Thread.State: RUNNABLE
> >       at
> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
> >       at
> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
> >       at
> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
> >       at
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
> >       at
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> >       at
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> >       at
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> >       at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> >       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >       at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >       at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >       at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >       at java.lang.Thread.run(Thread.java:744)
> >
> > "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800
> nid=0x727b runnable [0x00007f4b49e9e000]
> >   java.lang.Thread.State: RUNNABLE
> >       at
> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
> >       at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
> >       at
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
> >       at
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> >       at
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> >       at
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> >       at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> >       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >       at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >       at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >       at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >       at java.lang.Thread.run(Thread.java:744)
> >
> >
> > On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com>
> wrote:
> >
> >>
> >> Hey Chris,
> >>
> >> I used performance.py tool which created a table with 50K rows in one
> >> table, run the following query from sqlline.py and everything seems fine
> >> without seeing CPU running hot.
> >>
> >> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> >> PERFORMANCE_50000;
> >> +------------+
> >> |  COUNT(1)  |
> >> +------------+
> >> | 50000      |
> >> +------------+
> >> 1 row selected (0.166 seconds)
> >> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> >> PERFORMANCE_50000;
> >> +------------+
> >> |  COUNT(1)  |
> >> +------------+
> >> | 50000      |
> >> +------------+
> >> 1 row selected (0.167 seconds)
> >>
> >> Is there anyway could you run profiler to see where the CPU goes?
> >>
> >>
> >>
> >> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >>
> >>> Ahh, yes. Here is a pastebin for it:
> >>>
> >>> http://pastebin.com/w6mtabag
> >>>
> >>> thanks again,
> >>> -chris
> >>>
> >>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com> wrote:
> >>>
> >>>> Hi Chris,
> >>>>
> >>>> Attachments are filtered out by the mail server. Can you pastebin it
> >>>> some
> >>>> place?
> >>>>
> >>>> Thanks,
> >>>> Nick
> >>>>
> >>>>
> >>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
> >>>> <cf...@biotiquesystems.com>wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> We set the HBase RegionServer Handler to 10 (it appears to have been
> >>>>> set
> >>>>> to 60 by Ambari during install process). Now we have narrowed down
> what
> >>>>> causes the CPU to increase and have some detailed logs:
> >>>>>
> >>>>> If we connect using sqlline.py and execute a select that selects one
> >>>>> row
> >>>>> using the primary_key, no increate in CPU is observed and the number
> >>>>> of RPC
> >>>>> threads in a RUNNABLE state remains the same.
> >>>>>
> >>>>> If we execute a select that scans the table such as "select count(*)
> >>>>> from
> >>>>> TABLE" or where the "where" clause only limits on non-primary key
> >>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
> >>>>> increases
> >>>>> and the CPU utilization of the regionserver increases by ~105%.
> >>>>>
> >>>>> Disconnecting the client does not have an effect and the
> >>>>> RpcServer.handler
> >>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
> >>>>>
> >>>>> Checking the Web Console for the Regionserver just shows 10
> >>>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
> >>>>> tasks
> >>>>> are happening. The regionserver has a Max Heap of 10G and a Used heap
> >>>>> of
> >>>>> 445.2M.
> >>>>>
> >>>>> I've attached the regionserver log with IPC debug logging turned on
> >>>>> right
> >>>>> when one of the Phoenix statements is executed (this statement
> actually
> >>>>> used up the last available handler).
> >>>>>
> >>>>> thanks,
> >>>>> -chris
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com>
> >>>>> wrote:
> >>>>>
> >>>>>>
> >>>>>> From the stack, it seems you increase the default rpc handler number
> >>>>>> to
> >>>>>> about 60. All handlers are serving Get request(You can search
> >>>>>>
> >>>>>
> >>>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
> >>>>> a:2
> >>>>>> 841).
> >>>>>>
> >>>>>> You can check why there are so many get requests by adding some log
> >>>>>> info
> >>>>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
> >>>>>> handlers per region server, it will mitigate your current issue.
> >>>>>>
> >>>>>>
> >>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >>>>>>
> >>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
> >>>>>>> spiking
> >>>>> up
> >>>>>>> to use all available CPU and becoming unresponsive.
> >>>>>>>
> >>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with
> 3
> >>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
> >>>>>>> Ambari
> >>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and
> hand
> >>>>>>> linked in the jar files to the hadoop lib. Everything was going
> well
> >>>>>>> and
> >>>>>>> we were able to load in ~30k records into several tables. What
> >>>>>>> happened
> >>>>>>> was after about 3-4 days of being up the regionservers became
> >>>>>>> unresponsive and started to use most of the available CPU (12 core
> >>>>>>> boxes). Nothing terribly informative was in the logs (initially we
> >>>>>>> saw
> >>>>>>> some flush messages that seemed excessive, but that was not all of
> >>>>>>> the
> >>>>>>> time and we changed back to the standard HBase WAL codec). We are
> >>>>>>> able
> >>>>> to
> >>>>>>> kill the unresponsive regionservers and then restart them, the
> >>>>>>> cluster
> >>>>>>> will be fine for a day or so but will start to lock up again.
> >>>>>>>
> >>>>>>> We've dropped the entire HBase and zookeper information and started
> >>>>>>> from
> >>>>>>> scratch, but that has not helped.
> >>>>>>>
> >>>>>>> James Taylor suggested I send this off here. I've attached a jstack
> >>>>>>> report of a locked up regionserver in hopes that someone can shed
> >>>>>>> some
> >>>>>>> light.
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>> -chris
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> CONFIDENTIALITY NOTICE
> >>>>>> NOTICE: This message is intended for the use of the individual or
> >>>>>> entity
> >>>>> to
> >>>>>> which it is addressed and may contain information that is
> >>>>>> confidential,
> >>>>>> privileged and exempt from disclosure under applicable law. If the
> >>>>>> reader
> >>>>>> of this message is not the intended recipient, you are hereby
> notified
> >>>>> that
> >>>>>> any printing, copying, dissemination, distribution, disclosure or
> >>>>>> forwarding of this communication is strictly prohibited. If you have
> >>>>>> received this communication in error, please contact the sender
> >>>>> immediately
> >>>>>> and delete it from your system. Thank You.
> >>>>>
> >>>>>
> >>>>>
> >>
> >>
> >>
> >> --
> >> CONFIDENTIALITY NOTICE
> >> NOTICE: This message is intended for the use of the individual or
> entity to
> >> which it is addressed and may contain information that is confidential,
> >> privileged and exempt from disclosure under applicable law. If the
> reader
> >> of this message is not the intended recipient, you are hereby notified
> that
> >> any printing, copying, dissemination, distribution, disclosure or
> >> forwarding of this communication is strictly prohibited. If you have
> >> received this communication in error, please contact the sender
> immediately
> >> and delete it from your system. Thank You.
> >
>
>

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

Sorry to follow up on my own message, but I was wondering if anyone had any ideas? Normal non-phoenix scans don't cause this symptom, but right after a select * on the exact same table will. 

If we export the table and then re-import it into a new table, the new table doesn't exhibit these symptoms, same as if we use an upsert..select to do a copy. It seems something happens to the last region to cause this, but it is not directly data dependent. Moving the region to another regionserver doesn't have any effect - just moves where the problem happens. Major compactions get hung up by the running threads as they probably have a lock. 

I've run the hfile tool on the final region and nothing seems awry.

Figuring this out will allow this project to continue, as of now it is hung up on this issue.

thank you,
-chris


On May 15, 2014, at 8:47 PM, Chris Tarnas <cf...@biotiquesystems.com> wrote:

> I did some "poor mans" profiling with multiple jstack and came up with where the RpcServer.handler appears to be stuck: org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the last deepest method in all of the traces, either HStore.java:1712 or HStore.java:1722. Here are two example traces for a thread (which has been running for the last couple hours):
> 
> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>   java.lang.Thread.State: RUNNABLE
> 	at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
> 	at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
> 	at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
> 	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
> 	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> 	at java.lang.Thread.run(Thread.java:744)
> 
> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>   java.lang.Thread.State: RUNNABLE
> 	at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
> 	at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
> 	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
> 	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> 	at java.lang.Thread.run(Thread.java:744)
> 
> 
> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
> 
>> 
>> Hey Chris,
>> 
>> I used performance.py tool which created a table with 50K rows in one
>> table, run the following query from sqlline.py and everything seems fine
>> without seeing CPU running hot.
>> 
>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>> PERFORMANCE_50000;
>> +------------+
>> |  COUNT(1)  |
>> +------------+
>> | 50000      |
>> +------------+
>> 1 row selected (0.166 seconds)
>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>> PERFORMANCE_50000;
>> +------------+
>> |  COUNT(1)  |
>> +------------+
>> | 50000      |
>> +------------+
>> 1 row selected (0.167 seconds)
>> 
>> Is there anyway could you run profiler to see where the CPU goes?
>> 
>> 
>> 
>> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>> 
>>> Ahh, yes. Here is a pastebin for it:
>>> 
>>> http://pastebin.com/w6mtabag
>>> 
>>> thanks again,
>>> -chris
>>> 
>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>>> 
>>>> Hi Chris,
>>>> 
>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>> some
>>>> place?
>>>> 
>>>> Thanks,
>>>> Nick
>>>> 
>>>> 
>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>> <cf...@biotiquesystems.com>wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> We set the HBase RegionServer Handler to 10 (it appears to have been
>>>>> set
>>>>> to 60 by Ambari during install process). Now we have narrowed down what
>>>>> causes the CPU to increase and have some detailed logs:
>>>>> 
>>>>> If we connect using sqlline.py and execute a select that selects one
>>>>> row
>>>>> using the primary_key, no increate in CPU is observed and the number
>>>>> of RPC
>>>>> threads in a RUNNABLE state remains the same.
>>>>> 
>>>>> If we execute a select that scans the table such as "select count(*)
>>>>> from
>>>>> TABLE" or where the "where" clause only limits on non-primary key
>>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>>> increases
>>>>> and the CPU utilization of the regionserver increases by ~105%.
>>>>> 
>>>>> Disconnecting the client does not have an effect and the
>>>>> RpcServer.handler
>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>> 
>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
>>>>> tasks
>>>>> are happening. The regionserver has a Max Heap of 10G and a Used heap
>>>>> of
>>>>> 445.2M.
>>>>> 
>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>> right
>>>>> when one of the Phoenix statements is executed (this statement actually
>>>>> used up the last available handler).
>>>>> 
>>>>> thanks,
>>>>> -chris
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> From the stack, it seems you increase the default rpc handler number
>>>>>> to
>>>>>> about 60. All handlers are serving Get request(You can search
>>>>>> 
>>>>> 
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>>> a:2
>>>>>> 841).
>>>>>> 
>>>>>> You can check why there are so many get requests by adding some log
>>>>>> info
>>>>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
>>>>>> handlers per region server, it will mitigate your current issue.
>>>>>> 
>>>>>> 
>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>>>> 
>>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
>>>>>>> spiking
>>>>> up
>>>>>>> to use all available CPU and becoming unresponsive.
>>>>>>> 
>>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>>> Ambari
>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>>>>>>> linked in the jar files to the hadoop lib. Everything was going well
>>>>>>> and
>>>>>>> we were able to load in ~30k records into several tables. What
>>>>>>> happened
>>>>>>> was after about 3-4 days of being up the regionservers became
>>>>>>> unresponsive and started to use most of the available CPU (12 core
>>>>>>> boxes). Nothing terribly informative was in the logs (initially we
>>>>>>> saw
>>>>>>> some flush messages that seemed excessive, but that was not all of
>>>>>>> the
>>>>>>> time and we changed back to the standard HBase WAL codec). We are
>>>>>>> able
>>>>> to
>>>>>>> kill the unresponsive regionservers and then restart them, the
>>>>>>> cluster
>>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>>> 
>>>>>>> We've dropped the entire HBase and zookeper information and started
>>>>>>> from
>>>>>>> scratch, but that has not helped.
>>>>>>> 
>>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>>> report of a locked up regionserver in hopes that someone can shed
>>>>>>> some
>>>>>>> light.
>>>>>>> 
>>>>>>> thanks,
>>>>>>> -chris
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> CONFIDENTIALITY NOTICE
>>>>>> NOTICE: This message is intended for the use of the individual or
>>>>>> entity
>>>>> to
>>>>>> which it is addressed and may contain information that is
>>>>>> confidential,
>>>>>> privileged and exempt from disclosure under applicable law. If the
>>>>>> reader
>>>>>> of this message is not the intended recipient, you are hereby notified
>>>>> that
>>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>>> forwarding of this communication is strictly prohibited. If you have
>>>>>> received this communication in error, please contact the sender
>>>>> immediately
>>>>>> and delete it from your system. Thank You.
>>>>> 
>>>>> 
>>>>> 
>> 
>> 
>> 
>> -- 
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity to 
>> which it is addressed and may contain information that is confidential, 
>> privileged and exempt from disclosure under applicable law. If the reader 
>> of this message is not the intended recipient, you are hereby notified that 
>> any printing, copying, dissemination, distribution, disclosure or 
>> forwarding of this communication is strictly prohibited. If you have 
>> received this communication in error, please contact the sender immediately 
>> and delete it from your system. Thank You.
>

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

I did some "poor mans" profiling with multiple jstack and came up with where the RpcServer.handler appears to be stuck: org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the last deepest method in all of the traces, either HStore.java:1712 or HStore.java:1722. Here are two example traces for a thread (which has been running for the last couple hours):

"RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
	at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
	at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
	at java.lang.Thread.run(Thread.java:744)

"RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
	at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
	at java.lang.Thread.run(Thread.java:744)


On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:

> 
> Hey Chris,
> 
> I used performance.py tool which created a table with 50K rows in one
> table, run the following query from sqlline.py and everything seems fine
> without seeing CPU running hot.
> 
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 50000      |
> +------------+
> 1 row selected (0.166 seconds)
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
> PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 50000      |
> +------------+
> 1 row selected (0.167 seconds)
> 
> Is there anyway could you run profiler to see where the CPU goes?
> 
> 
> 
> On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> 
>> Ahh, yes. Here is a pastebin for it:
>> 
>> http://pastebin.com/w6mtabag
>> 
>> thanks again,
>> -chris
>> 
>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>> 
>>> Hi Chris,
>>> 
>>> Attachments are filtered out by the mail server. Can you pastebin it
>>> some
>>> place?
>>> 
>>> Thanks,
>>> Nick
>>> 
>>> 
>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>> <cf...@biotiquesystems.com>wrote:
>>> 
>>>> Hello,
>>>> 
>>>> We set the HBase RegionServer Handler to 10 (it appears to have been
>>>> set
>>>> to 60 by Ambari during install process). Now we have narrowed down what
>>>> causes the CPU to increase and have some detailed logs:
>>>> 
>>>> If we connect using sqlline.py and execute a select that selects one
>>>> row
>>>> using the primary_key, no increate in CPU is observed and the number
>>>> of RPC
>>>> threads in a RUNNABLE state remains the same.
>>>> 
>>>> If we execute a select that scans the table such as "select count(*)
>>>> from
>>>> TABLE" or where the "where" clause only limits on non-primary key
>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>> increases
>>>> and the CPU utilization of the regionserver increases by ~105%.
>>>> 
>>>> Disconnecting the client does not have an effect and the
>>>> RpcServer.handler
>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>> 
>>>> Checking the Web Console for the Regionserver just shows 10
>>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
>>>> tasks
>>>> are happening. The regionserver has a Max Heap of 10G and a Used heap
>>>> of
>>>> 445.2M.
>>>> 
>>>> I've attached the regionserver log with IPC debug logging turned on
>>>> right
>>>> when one of the Phoenix statements is executed (this statement actually
>>>> used up the last available handler).
>>>> 
>>>> thanks,
>>>> -chris
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>> wrote:
>>>> 
>>>>> 
>>>>> From the stack, it seems you increase the default rpc handler number
>>>>> to
>>>>> about 60. All handlers are serving Get request(You can search
>>>>> 
>>>> 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>> a:2
>>>>> 841).
>>>>> 
>>>>> You can check why there are so many get requests by adding some log
>>>>> info
>>>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
>>>>> handlers per region server, it will mitigate your current issue.
>>>>> 
>>>>> 
>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>>> 
>>>>>> We have hit a problem with Phoenix and regionservers CPU usage
>>>>>> spiking
>>>> up
>>>>>> to use all available CPU and becoming unresponsive.
>>>>>> 
>>>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>> Ambari
>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>>>>>> linked in the jar files to the hadoop lib. Everything was going well
>>>>>> and
>>>>>> we were able to load in ~30k records into several tables. What
>>>>>> happened
>>>>>> was after about 3-4 days of being up the regionservers became
>>>>>> unresponsive and started to use most of the available CPU (12 core
>>>>>> boxes). Nothing terribly informative was in the logs (initially we
>>>>>> saw
>>>>>> some flush messages that seemed excessive, but that was not all of
>>>>>> the
>>>>>> time and we changed back to the standard HBase WAL codec). We are
>>>>>> able
>>>> to
>>>>>> kill the unresponsive regionservers and then restart them, the
>>>>>> cluster
>>>>>> will be fine for a day or so but will start to lock up again.
>>>>>> 
>>>>>> We've dropped the entire HBase and zookeper information and started
>>>>>> from
>>>>>> scratch, but that has not helped.
>>>>>> 
>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>> report of a locked up regionserver in hopes that someone can shed
>>>>>> some
>>>>>> light.
>>>>>> 
>>>>>> thanks,
>>>>>> -chris
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> CONFIDENTIALITY NOTICE
>>>>> NOTICE: This message is intended for the use of the individual or
>>>>> entity
>>>> to
>>>>> which it is addressed and may contain information that is
>>>>> confidential,
>>>>> privileged and exempt from disclosure under applicable law. If the
>>>>> reader
>>>>> of this message is not the intended recipient, you are hereby notified
>>>> that
>>>>> any printing, copying, dissemination, distribution, disclosure or
>>>>> forwarding of this communication is strictly prohibited. If you have
>>>>> received this communication in error, please contact the sender
>>>> immediately
>>>>> and delete it from your system. Thank You.
>>>> 
>>>> 
>>>> 
> 
> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Jeffrey Zhong <jz...@hortonworks.com>.

Hey Chris,

I used performance.py tool which created a table with 50K rows in one
table, run the following query from sqlline.py and everything seems fine
without seeing CPU running hot.

0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
PERFORMANCE_50000;
+------------+
|  COUNT(1)  |
+------------+
| 50000      |
+------------+
1 row selected (0.166 seconds)
0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
PERFORMANCE_50000;
+------------+
|  COUNT(1)  |
+------------+
| 50000      |
+------------+
1 row selected (0.167 seconds)

Is there anyway could you run profiler to see where the CPU goes?



On 5/13/14 6:40 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:

>Ahh, yes. Here is a pastebin for it:
>
>http://pastebin.com/w6mtabag
>
>thanks again,
>-chris
>
>On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>
>> Hi Chris,
>> 
>> Attachments are filtered out by the mail server. Can you pastebin it
>>some
>> place?
>> 
>> Thanks,
>> Nick
>> 
>> 
>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>><cf...@biotiquesystems.com>wrote:
>> 
>>> Hello,
>>> 
>>> We set the HBase RegionServer Handler to 10 (it appears to have been
>>>set
>>> to 60 by Ambari during install process). Now we have narrowed down what
>>> causes the CPU to increase and have some detailed logs:
>>> 
>>> If we connect using sqlline.py and execute a select that selects one
>>>row
>>> using the primary_key, no increate in CPU is observed and the number
>>>of RPC
>>> threads in a RUNNABLE state remains the same.
>>> 
>>> If we execute a select that scans the table such as "select count(*)
>>>from
>>> TABLE" or where the "where" clause only limits on non-primary key
>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>increases
>>> and the CPU utilization of the regionserver increases by ~105%.
>>> 
>>> Disconnecting the client does not have an effect and the
>>>RpcServer.handler
>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>> 
>>> Checking the Web Console for the Regionserver just shows 10
>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
>>>tasks
>>> are happening. The regionserver has a Max Heap of 10G and a Used heap
>>>of
>>> 445.2M.
>>> 
>>> I've attached the regionserver log with IPC debug logging turned on
>>>right
>>> when one of the Phoenix statements is executed (this statement actually
>>> used up the last available handler).
>>> 
>>> thanks,
>>> -chris
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com>
>>>wrote:
>>> 
>>>> 
>>>> From the stack, it seems you increase the default rpc handler number
>>>>to
>>>> about 60. All handlers are serving Get request(You can search
>>>> 
>>> 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
>>>a:2
>>>> 841).
>>>> 
>>>> You can check why there are so many get requests by adding some log
>>>>info
>>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
>>>> handlers per region server, it will mitigate your current issue.
>>>> 
>>>> 
>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>> 
>>>>> We have hit a problem with Phoenix and regionservers CPU usage
>>>>>spiking
>>> up
>>>>> to use all available CPU and becoming unresponsive.
>>>>> 
>>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain
>>>>>Ambari
>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>>>>> linked in the jar files to the hadoop lib. Everything was going well
>>>>>and
>>>>> we were able to load in ~30k records into several tables. What
>>>>>happened
>>>>> was after about 3-4 days of being up the regionservers became
>>>>> unresponsive and started to use most of the available CPU (12 core
>>>>> boxes). Nothing terribly informative was in the logs (initially we
>>>>>saw
>>>>> some flush messages that seemed excessive, but that was not all of
>>>>>the
>>>>> time and we changed back to the standard HBase WAL codec). We are
>>>>>able
>>> to
>>>>> kill the unresponsive regionservers and then restart them, the
>>>>>cluster
>>>>> will be fine for a day or so but will start to lock up again.
>>>>> 
>>>>> We've dropped the entire HBase and zookeper information and started
>>>>>from
>>>>> scratch, but that has not helped.
>>>>> 
>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>> report of a locked up regionserver in hopes that someone can shed
>>>>>some
>>>>> light.
>>>>> 
>>>>> thanks,
>>>>> -chris
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> CONFIDENTIALITY NOTICE
>>>> NOTICE: This message is intended for the use of the individual or
>>>>entity
>>> to
>>>> which it is addressed and may contain information that is
>>>>confidential,
>>>> privileged and exempt from disclosure under applicable law. If the
>>>>reader
>>>> of this message is not the intended recipient, you are hereby notified
>>> that
>>>> any printing, copying, dissemination, distribution, disclosure or
>>>> forwarding of this communication is strictly prohibited. If you have
>>>> received this communication in error, please contact the sender
>>> immediately
>>>> and delete it from your system. Thank You.
>>> 
>>> 
>>> 



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

Ahh, yes. Here is a pastebin for it:

http://pastebin.com/w6mtabag

thanks again,
-chris

On May 13, 2014, at 7:47 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> Hi Chris,
> 
> Attachments are filtered out by the mail server. Can you pastebin it some
> place?
> 
> Thanks,
> Nick
> 
> 
> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <cf...@biotiquesystems.com>wrote:
> 
>> Hello,
>> 
>> We set the HBase RegionServer Handler to 10 (it appears to have been set
>> to 60 by Ambari during install process). Now we have narrowed down what
>> causes the CPU to increase and have some detailed logs:
>> 
>> If we connect using sqlline.py and execute a select that selects one row
>> using the primary_key, no increate in CPU is observed and the number of RPC
>> threads in a RUNNABLE state remains the same.
>> 
>> If we execute a select that scans the table such as "select count(*) from
>> TABLE" or where the "where" clause only limits on non-primary key
>> attributes, then the number of RUNNABLE RpcServer.handler threads increases
>> and the CPU utilization of the regionserver increases by ~105%.
>> 
>> Disconnecting the client does not have an effect and the RpcServer.handler
>> thread is left RUNNABLE and the CPU stays at the higher usage.
>> 
>> Checking the Web Console for the Regionserver just shows 10
>> RpcServer.reader tasks, all in a WAITING state, no other monitored tasks
>> are happening. The regionserver has a Max Heap of 10G and a Used heap of
>> 445.2M.
>> 
>> I've attached the regionserver log with IPC debug logging turned on right
>> when one of the Phoenix statements is executed (this statement actually
>> used up the last available handler).
>> 
>> thanks,
>> -chris
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>> 
>>> 
>>> From the stack, it seems you increase the default rpc handler number to
>>> about 60. All handlers are serving Get request(You can search
>>> 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2
>>> 841).
>>> 
>>> You can check why there are so many get requests by adding some log info
>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
>>> handlers per region server, it will mitigate your current issue.
>>> 
>>> 
>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>> 
>>>> We have hit a problem with Phoenix and regionservers CPU usage spiking
>> up
>>>> to use all available CPU and becoming unresponsive.
>>>> 
>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari
>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>>>> linked in the jar files to the hadoop lib. Everything was going well and
>>>> we were able to load in ~30k records into several tables. What happened
>>>> was after about 3-4 days of being up the regionservers became
>>>> unresponsive and started to use most of the available CPU (12 core
>>>> boxes). Nothing terribly informative was in the logs (initially we saw
>>>> some flush messages that seemed excessive, but that was not all of the
>>>> time and we changed back to the standard HBase WAL codec). We are able
>> to
>>>> kill the unresponsive regionservers and then restart them, the cluster
>>>> will be fine for a day or so but will start to lock up again.
>>>> 
>>>> We've dropped the entire HBase and zookeper information and started from
>>>> scratch, but that has not helped.
>>>> 
>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>> report of a locked up regionserver in hopes that someone can shed some
>>>> light.
>>>> 
>>>> thanks,
>>>> -chris
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or entity
>> to
>>> which it is addressed and may contain information that is confidential,
>>> privileged and exempt from disclosure under applicable law. If the reader
>>> of this message is not the intended recipient, you are hereby notified
>> that
>>> any printing, copying, dissemination, distribution, disclosure or
>>> forwarding of this communication is strictly prohibited. If you have
>>> received this communication in error, please contact the sender
>> immediately
>>> and delete it from your system. Thank You.
>> 
>> 
>>

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Nick Dimiduk <nd...@gmail.com>.

Hi Chris,

Attachments are filtered out by the mail server. Can you pastebin it some
place?

Thanks,
Nick


On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <cf...@biotiquesystems.com>wrote:

> Hello,
>
> We set the HBase RegionServer Handler to 10 (it appears to have been set
> to 60 by Ambari during install process). Now we have narrowed down what
> causes the CPU to increase and have some detailed logs:
>
> If we connect using sqlline.py and execute a select that selects one row
> using the primary_key, no increate in CPU is observed and the number of RPC
> threads in a RUNNABLE state remains the same.
>
> If we execute a select that scans the table such as "select count(*) from
> TABLE" or where the "where" clause only limits on non-primary key
> attributes, then the number of RUNNABLE RpcServer.handler threads increases
> and the CPU utilization of the regionserver increases by ~105%.
>
> Disconnecting the client does not have an effect and the RpcServer.handler
> thread is left RUNNABLE and the CPU stays at the higher usage.
>
> Checking the Web Console for the Regionserver just shows 10
> RpcServer.reader tasks, all in a WAITING state, no other monitored tasks
> are happening. The regionserver has a Max Heap of 10G and a Used heap of
> 445.2M.
>
> I've attached the regionserver log with IPC debug logging turned on right
> when one of the Phoenix statements is executed (this statement actually
> used up the last available handler).
>
> thanks,
> -chris
>
>
>
>
>
>
>
> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>
> >
> > From the stack, it seems you increase the default rpc handler number to
> > about 60. All handlers are serving Get request(You can search
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2
> > 841).
> >
> > You can check why there are so many get requests by adding some log info
> > or enable hbase rpc trace. I guess if you decrease the number of rpc
> > handlers per region server, it will mitigate your current issue.
> >
> >
> > On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
> >
> >> We have hit a problem with Phoenix and regionservers CPU usage spiking
> up
> >> to use all available CPU and becoming unresponsive.
> >>
> >> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
> >> VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari
> >> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
> >> linked in the jar files to the hadoop lib. Everything was going well and
> >> we were able to load in ~30k records into several tables. What happened
> >> was after about 3-4 days of being up the regionservers became
> >> unresponsive and started to use most of the available CPU (12 core
> >> boxes). Nothing terribly informative was in the logs (initially we saw
> >> some flush messages that seemed excessive, but that was not all of the
> >> time and we changed back to the standard HBase WAL codec). We are able
> to
> >> kill the unresponsive regionservers and then restart them, the cluster
> >> will be fine for a day or so but will start to lock up again.
> >>
> >> We've dropped the entire HBase and zookeper information and started from
> >> scratch, but that has not helped.
> >>
> >> James Taylor suggested I send this off here. I've attached a jstack
> >> report of a locked up regionserver in hopes that someone can shed some
> >> light.
> >>
> >> thanks,
> >> -chris
> >>
> >>
> >
> >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
>
>
>

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

Hello,

We set the HBase RegionServer Handler to 10 (it appears to have been set to 60 by Ambari during install process). Now we have narrowed down what causes the CPU to increase and have some detailed logs:

If we connect using sqlline.py and execute a select that selects one row using the primary_key, no increate in CPU is observed and the number of RPC threads in a RUNNABLE state remains the same. 

If we execute a select that scans the table such as "select count(*) from TABLE" or where the "where" clause only limits on non-primary key attributes, then the number of RUNNABLE RpcServer.handler threads increases and the CPU utilization of the regionserver increases by ~105%. 

Disconnecting the client does not have an effect and the RpcServer.handler thread is left RUNNABLE and the CPU stays at the higher usage. 

Checking the Web Console for the Regionserver just shows 10 RpcServer.reader tasks, all in a WAITING state, no other monitored tasks are happening. The regionserver has a Max Heap of 10G and a Used heap of 445.2M.

I've attached the regionserver log with IPC debug logging turned on right when one of the Phoenix statements is executed (this statement actually used up the last available handler).

thanks,
-chris

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Chris Tarnas <cf...@biotiquesystems.com>.

Thanks Devaraj,

We have waited a couple hours. We are waiting for the next event to get more details. Should not be long. 

Memory so far has not been a problem, we allocate 10GB to each regionserver and usage tends to peak around 1GB form 350mb when idle. The region load is quite small, only 11 small ~3mb regions per server.

The servers themselves are decent, new 12 core/12 spindle boxes with 128GB of RAM running CentOS 6.5.

-chris

On May 12, 2014, at 6:40 PM, Devaraj Das <dd...@hortonworks.com> wrote:

> How much time do you wait for the RegionServers to come back? Seems
> many handler are busy processing GETs and DELETEs. I don't think that
> 60 handlers is high if you have decent memory in the regionserver (how
> much are they running with, could they be GC'ing a lot leading to
> unresponsiveness?).
> 
> On Mon, May 12, 2014 at 5:08 PM, Christopher Tarnas
> <cf...@biotiquesystems.com> wrote:
>> Hi Jeffery,
>> 
>> Thank you, I don't believe we changed the number of handlers from the default but we'll double check. What preceded the most recent event (not for the earlier stacktrace we just sent) was the developers issuing some "delete *" statements  for several tables.
>> 
>> -chris
>> 
>>> On May 12, 2014, at 3:32 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>>> 
>>> 
>>> From the stack, it seems you increase the default rpc handler number to
>>> about 60. All handlers are serving Get request(You can search
>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2
>>> 841).
>>> 
>>> You can check why there are so many get requests by adding some log info
>>> or enable hbase rpc trace. I guess if you decrease the number of rpc
>>> handlers per region server, it will mitigate your current issue.
>>> 
>>> 
>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>> 
>>>> We have hit a problem with Phoenix and regionservers CPU usage spiking up
>>>> to use all available CPU and becoming unresponsive.
>>>> 
>>>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari
>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>>>> linked in the jar files to the hadoop lib. Everything was going well and
>>>> we were able to load in ~30k records into several tables. What happened
>>>> was after about 3-4 days of being up the regionservers became
>>>> unresponsive and started to use most of the available CPU (12 core
>>>> boxes). Nothing terribly informative was in the logs (initially we saw
>>>> some flush messages that seemed excessive, but that was not all of the
>>>> time and we changed back to the standard HBase WAL codec). We are able to
>>>> kill the unresponsive regionservers and then restart them, the cluster
>>>> will be fine for a day or so but will start to lock up again.
>>>> 
>>>> We've dropped the entire HBase and zookeper information and started from
>>>> scratch, but that has not helped.
>>>> 
>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>> report of a locked up regionserver in hopes that someone can shed some
>>>> light.
>>>> 
>>>> thanks,
>>>> -chris
>>> 
>>> 
>>> 
>>> --
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or entity to
>>> which it is addressed and may contain information that is confidential,
>>> privileged and exempt from disclosure under applicable law. If the reader
>>> of this message is not the intended recipient, you are hereby notified that
>>> any printing, copying, dissemination, distribution, disclosure or
>>> forwarding of this communication is strictly prohibited. If you have
>>> received this communication in error, please contact the sender immediately
>>> and delete it from your system. Thank You.
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Devaraj Das <dd...@hortonworks.com>.

How much time do you wait for the RegionServers to come back? Seems
many handler are busy processing GETs and DELETEs. I don't think that
60 handlers is high if you have decent memory in the regionserver (how
much are they running with, could they be GC'ing a lot leading to
unresponsiveness?).

On Mon, May 12, 2014 at 5:08 PM, Christopher Tarnas
<cf...@biotiquesystems.com> wrote:
> Hi Jeffery,
>
> Thank you, I don't believe we changed the number of handlers from the default but we'll double check. What preceded the most recent event (not for the earlier stacktrace we just sent) was the developers issuing some "delete *" statements  for several tables.
>
> -chris
>
>> On May 12, 2014, at 3:32 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
>>
>>
>> From the stack, it seems you increase the default rpc handler number to
>> about 60. All handlers are serving Get request(You can search
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2
>> 841).
>>
>> You can check why there are so many get requests by adding some log info
>> or enable hbase rpc trace. I guess if you decrease the number of rpc
>> handlers per region server, it will mitigate your current issue.
>>
>>
>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>>>
>>> We have hit a problem with Phoenix and regionservers CPU usage spiking up
>>> to use all available CPU and becoming unresponsive.
>>>
>>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari
>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>>> linked in the jar files to the hadoop lib. Everything was going well and
>>> we were able to load in ~30k records into several tables. What happened
>>> was after about 3-4 days of being up the regionservers became
>>> unresponsive and started to use most of the available CPU (12 core
>>> boxes). Nothing terribly informative was in the logs (initially we saw
>>> some flush messages that seemed excessive, but that was not all of the
>>> time and we changed back to the standard HBase WAL codec). We are able to
>>> kill the unresponsive regionservers and then restart them, the cluster
>>> will be fine for a day or so but will start to lock up again.
>>>
>>> We've dropped the entire HBase and zookeper information and started from
>>> scratch, but that has not helped.
>>>
>>> James Taylor suggested I send this off here. I've attached a jstack
>>> report of a locked up regionserver in hopes that someone can shed some
>>> light.
>>>
>>> thanks,
>>> -chris
>>
>>
>>
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity to
>> which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender immediately
>> and delete it from your system. Thank You.

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Christopher Tarnas <cf...@biotiquesystems.com>.

Hi Jeffery,

Thank you, I don't believe we changed the number of handlers from the default but we'll double check. What preceded the most recent event (not for the earlier stacktrace we just sent) was the developers issuing some "delete *" statements  for several tables.  

-chris

> On May 12, 2014, at 3:32 PM, Jeffrey Zhong <jz...@hortonworks.com> wrote:
> 
> 
> From the stack, it seems you increase the default rpc handler number to
> about 60. All handlers are serving Get request(You can search
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2
> 841). 
> 
> You can check why there are so many get requests by adding some log info
> or enable hbase rpc trace. I guess if you decrease the number of rpc
> handlers per region server, it will mitigate your current issue.
> 
> 
>> On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:
>> 
>> We have hit a problem with Phoenix and regionservers CPU usage spiking up
>> to use all available CPU and becoming unresponsive.
>> 
>> After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>> VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari
>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>> linked in the jar files to the hadoop lib. Everything was going well and
>> we were able to load in ~30k records into several tables. What happened
>> was after about 3-4 days of being up the regionservers became
>> unresponsive and started to use most of the available CPU (12 core
>> boxes). Nothing terribly informative was in the logs (initially we saw
>> some flush messages that seemed excessive, but that was not all of the
>> time and we changed back to the standard HBase WAL codec). We are able to
>> kill the unresponsive regionservers and then restart them, the cluster
>> will be fine for a day or so but will start to lock up again.
>> 
>> We've dropped the entire HBase and zookeper information and started from
>> scratch, but that has not helped.
>> 
>> James Taylor suggested I send this off here. I've attached a jstack
>> report of a locked up regionserver in hopes that someone can shed some
>> light. 
>> 
>> thanks,
>> -chris
> 
> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.

Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1

Posted by Jeffrey Zhong <jz...@hortonworks.com>.

>From the stack, it seems you increase the default rpc handler number to
about 60. All handlers are serving Get request(You can search
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2
841). 

You can check why there are so many get requests by adding some log info
or enable hbase rpc trace. I guess if you decrease the number of rpc
handlers per region server, it will mitigate your current issue.


On 5/12/14 2:28 PM, "Chris Tarnas" <cf...@biotiquesystems.com> wrote:

>We have hit a problem with Phoenix and regionservers CPU usage spiking up
>to use all available CPU and becoming unresponsive.
>
>After HDP 2.1 was released we setup a 4 compute node cluster (with 3
>VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari
>1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and hand
>linked in the jar files to the hadoop lib. Everything was going well and
>we were able to load in ~30k records into several tables. What happened
>was after about 3-4 days of being up the regionservers became
>unresponsive and started to use most of the available CPU (12 core
>boxes). Nothing terribly informative was in the logs (initially we saw
>some flush messages that seemed excessive, but that was not all of the
>time and we changed back to the standard HBase WAL codec). We are able to
>kill the unresponsive regionservers and then restart them, the cluster
>will be fine for a day or so but will start to lock up again.
>
>We've dropped the entire HBase and zookeper information and started from
>scratch, but that has not helped.
>
>James Taylor suggested I send this off here. I've attached a jstack
>report of a locked up regionserver in hopes that someone can shed some
>light. 
>
>thanks,
>-chris
>
>



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.