Posted to user@hbase.apache.org by "Srikanth P. Shreenivas" <Sr...@mindtree.com> on 2011/08/18 16:37:36 UTC

Query regarding HTable.get and timeouts

Hi,

We are experiencing an issue in our HBase cluster wherein some of the gets are timing out with:

java.io.IOException: Giving up trying to get region server: thread is interrupted.
                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
                at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)


When we look at the logs of the master, ZooKeeper and the region servers, there is nothing that indicates anything abnormal.

I tried looking up the functions below, but at this point I could not make much out of them.
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java  - getRegionServerWithRetries starts at line 1233
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java  - HTable.get starts at line 611.


Could you please suggest the scenarios in which all retries can get exhausted, resulting in thread interruption?

We have seen this issue in two of our HBase clusters, where the load is quite low: about 20 reads per minute.  We run 1 ZooKeeper node and 4 region servers in fully-distributed mode (Hadoop).  We are using CDH3.
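
For context, a minimal sketch of the kind of client-side read involved, with the two retry-related settings spelled out (the table name, row key and tuned values are illustrative, not taken from this thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Client-side retry knobs; the values shown are illustrative.
            conf.setInt("hbase.client.retries.number", 10);  // attempts before giving up
            conf.setLong("hbase.client.pause", 1000);        // base sleep (ms) between retries

            HTable table = new HTable(conf, "my_table");     // hypothetical table name
            try {
                Get get = new Get(Bytes.toBytes("row-key")); // hypothetical row key
                // This is the call that blocks, retries internally, and eventually
                // throws the IOException quoted above if the thread is interrupted.
                Result result = table.get(get);
                System.out.println("Row empty? " + result.isEmpty());
            } finally {
                table.close();
            }
        }
    }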

Thanks,
Srikanth


RE: Query regarding HTable.get and timeouts

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
Hi Ram,

Thanks for your help.  We seem to have resolved the issue today by making application changes (please refer to my other reply in this thread).

Regards,
Srikanth

-----Original Message-----
From: Ramkrishna S Vasudevan [mailto:ramakrishnas@huawei.com] 
Sent: Monday, August 22, 2011 11:56 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Hi Srikanth

I went through your logs; there is not much info present in them.
Could you send the logs that show what happened when the master and RS
started, and also to whom the ROOT and META tables got assigned?

Regards
Ram



-----Original Message-----
From: Srikanth P. Shreenivas [mailto:Srikanth_Shreenivas@mindtree.com] 
Sent: Saturday, August 20, 2011 6:27 PM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Further in this investigation, we enabled the debug logs on the client side.

We are observing that the client is trying to locate the root region and is
continuously failing to do so.  The logs are filled with entries like this:

2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null

The client keeps retrying until the retries are exhausted.


Complete logs are available here: https://gist.github.com/1159064  including
logs of the master, zookeeper and region servers.


If you could please look at the logs and provide some input on this issue,
it would be really helpful.
We are really not sure why the client is failing to get the root region from
the server.  Any guidance will be greatly appreciated.


Thanks a lot,
Srikanth


-----Original Message-----
From: Srikanth P. Shreenivas 
Sent: Saturday, August 20, 2011 1:57 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

I did some tests today.  In our QA setup, we don't see any issues.  I ran
more than 100,000 operations in our QA setup in 1 hour with all HBase
reads/writes working as expected.

However, in our production setup, I regularly see the issue wherein the client
thread gets interrupted because the HTable.get() call does not return.
It is possible that it is taking more than 10 minutes due to
https://issues.apache.org/jira/browse/HBASE-2121.

However, I am not able to figure out what is causing this.  The logs of the
HBase as well as Hadoop servers seem to be quite normal.  The cluster on which
we are seeing this issue has no writes happening, and I see the issue after
about 10 operations.

One strange thing I noticed is that the zookeeper logs were truncated and
had entries only for the last 15-20 minutes instead of the complete day.  This
was around 8 PM, so the logs had not rolled over.

I had earlier asked a query about the same issue
(http://www.mail-archive.com/user@hbase.apache.org/msg09904.html).  The
changes we made of moving to CDH3 and using -XX:+UseMembar fixed the issues in
our QA setup.  However, the same changes in production do not seem to have a
similar effect.

If you can provide any clues on how we should go about investigating this
issue, that would be a real help.

Regards,
Srikanth.


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:39 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

My apologies, I may not be reading the code right.

You are right, it is the GridGain timeout that is causing line 1255 to
execute.
However, the question is what would make an HTable.get() take close to 10
minutes and so induce a timeout in the GridGain task.

The value of numRetries at line 1236 should be 10 (the default), and if we go
with the default value of HConstants.RETRY_BACKOFF, then the sleep time added
across all retries should be only about 61 seconds, not close to 600 seconds
as is the case in our code.
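
As a rough sanity check on that arithmetic, the total sleep can be summed as the
pause multiplied by a per-attempt backoff multiplier; the multipliers below are
illustrative placeholders, not the actual HConstants.RETRY_BACKOFF contents of
any particular release:

    public class BackoffSum {
        public static void main(String[] args) {
            long pauseMs = 1000;  // default hbase.client.pause
            // Illustrative multipliers only; check HConstants.RETRY_BACKOFF in your version.
            int[] backoff = {1, 1, 1, 2, 2, 4, 4, 8, 16, 32};
            long totalMs = 0;
            for (int attempt = 0; attempt < 10; attempt++) {
                int idx = Math.min(attempt, backoff.length - 1);
                totalMs += pauseMs * backoff[idx];
                System.out.println("after attempt " + attempt + ": ~" + (totalMs / 1000) + "s slept in total");
            }
            // With placeholder numbers like these, the total stays on the order of a
            // minute, nowhere near the 10-minute delay observed in the thread.
        }
    }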


Regards,
Srikanth


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:21 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Please note that the line numbers I am referencing are from this file:
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:19 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Hi Stack,

Thanks a lot for your reply.  It's always a comforting feeling to see such an
active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread
pool.  In this case, we can imagine GridGain as something that hands off
work to various worker threads and waits asynchronously for it to complete.  I
have a 10-minute timeout after which GridGain considers the work timed
out.

What we are observing is that our tasks are timing out at the 10-minute
boundary, and the delay seems to be caused by the part of the work that is
doing HTable.get.

My suspicion is that line 1255 in HConnectionManager.java is calling
Thread.currentThread().interrupt(), due to which the GridGain thread
stops doing what it was meant to do and never responds to the master node,
resulting in a timeout on the master.

For line 1255 to execute, we have to assume that all retries
were exhausted.
Hence my query: what would cause an HTable.get() to get into a situation
wherein
HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets
to line 1255?
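
If that suspicion is right, the interrupt flag set inside the client would still
be visible to the surrounding worker.  A small sketch of surfacing it explicitly
around the get (the wrapper and the logging are illustrative, not part of the
HBase API):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;

    public class InterruptAwareGet {
        // Wraps HTable.get so a container-initiated interrupt is surfaced explicitly.
        static Result safeGet(HTable table, Get get) throws IOException {
            try {
                return table.get(get);
            } catch (IOException e) {
                if (Thread.currentThread().isInterrupted()) {
                    // The client restored the interrupt flag before throwing; recording
                    // that fact distinguishes this failure from a server-side problem.
                    System.err.println("get failed while thread was interrupted: " + e.getMessage());
                }
                throw e;
            }
        }
    }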


Regards,
Srikanth

________________________________________
From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack
[stack@duboce.net]
Sent: Friday, August 19, 2011 12:03 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form and could the
container be doing the interrupting?   I've not come across
client-side thread interrupts before.
St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Hi,
>
> We are experiencing an issue in our HBase Cluster wherein some of the gets
are timing outs at:
>
> java.io.IOException: Giving up trying to get region server: thread is
interrupted.
>                at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.
getRegionServerWithRetries(HConnectionManager.java:1016)
>                at
org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>
> When we look at the logs of master, zookeeper and region servers, there is
nothing that indicates anything abnormal.
>
> I tried looking up below functions, but at this point could not make much
out of it.
>
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/h
base/client/HConnectionManager.java  - getRegionServerWithRetries  starts at
Line 1233
>
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/h
base/client/HTable.java  Htable.get starts at Line 611.
>
>
> If you can please suggest what are the scenarios in which all retries can
get exhausted resulting in thread interruption.
>
> We have seen this issue in two of our HBase Clusters, where load is quite
less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers
in fully-distributed mode (Hadoop).  We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>


Re: Hbase version

Posted by Jean-Daniel Cryans <jd...@apache.org>.
> Is 0.90.4 currently the most stable/bug fixed version of hbase?

Yes.

> What is the best hdfs version for it?  0.20.x, x=?...  Opinions?

Hasn't changed, see the top of this page:
http://hbase.apache.org/book/hadoop.html

J-D

Re: Hbase version

Posted by Stack <st...@duboce.net>.
Yes, 0.90.4 is our current stable offering.  As for the version of Hadoop, the
versions that fit the bill are detailed in our doc here:
http://hbase.apache.org/book.html#hadoop
St.Ack

On Sun, Aug 21, 2011 at 2:07 PM, Greg Bledsoe <gr...@personal.com> wrote:
> Hi
>
> Is 0.90.4 currently the most stable/bug fixed version of hbase? What is
> the best hdfs version for it?  0.20.x, x=?...  Opinions?
>
> Greg
>
>

RE: Query regarding HTable.get and timeouts

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
Yes, DC1AuthDFSC1D3 hosts the root region.  It is also region server 3.    DC1AuthDFSC1D1, DC1AuthDFSC1D2, DC1AuthDFSC1D3 and DC1AuthDFSC1D4 are 4 region servers in our cluster.

******************************************

I checked with the Data Centre team; they confirmed that there is no firewall in the network where the HBase servers and client applications are running.

******************************************

Regarding the client and server running different versions: they are running the same versions.  If there were a version mismatch, I guess we would be seeing the issue for all reads.  Here we see the issue only for a few reads; one in 10-15 reads fails this way.  We do use the same hbase, zookeeper and hadoop jars as found in the HBase distribution.

Strangely enough, I saw the entries below for the first time today, and this has occurred only once so far.  10.3.48.61 is the IP address where our client app is running.
2011-08-22 11:46:55,905 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7625 got version 6 expected version 3
2011-08-22 11:46:57,542 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7626 got version 6 expected version 3
2011-08-22 11:46:58,483 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7627 got version 6 expected version 3
2011-08-22 11:46:59,335 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7628 got version 6 expected version 3
2011-08-22 11:47:00,164 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7629 got version 6 expected version 3
2011-08-22 11:47:00,972 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7630 got version 6 expected version 3
2011-08-22 11:47:01,768 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7631 got version 6 expected version 3
2011-08-22 11:47:02,648 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7632 got version 6 expected version 3

******************************************

I enabled the debug logging level for all classes today.  Here is the exception associated with the "null" messages.

*** Do you think that some thread in the client is calling interrupt(), resulting in the "java.nio.channels.ClosedByInterruptException" below? ***


2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null
2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient]  - Connecting to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient]  - closing ipc connection to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020: null
java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy41.getClosestRowBefore(Unknown Source)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:719)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:558)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:687)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
        at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
        at in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findEntities(HBaseHandler.java:271)
        at in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findObject(HBaseHandler.java:156)
        at in.gov.uidai.platform.impl.persistence.provider.AbstractPersistenceProvider.findObject(AbstractPersistenceProvider.java:116)
        at in.gov.uidai.platform.impl.persistence.PersistenceManagerProvider.findObject(PersistenceManagerProvider.java:270)
        at in.gov.uidai.authcommon.dao.impl.hbase.ResidentDetailsDAOImpl.findResidentDetailEntity(ResidentDetailsDAOImpl.java:69)
        at in.gov.uidai.authcommon.dao.impl.hbase.ResidentDetailsDAOImpl.findResidentDetails(ResidentDetailsDAOImpl.java:48)
        at in.gov.uidai.authcommon.core.impl.steps.ResidentDetailsReader.findResident(ResidentDetailsReader.java:176)
        at in.gov.uidai.authcommon.core.impl.steps.ResidentDetailsReader.doPerform(ResidentDetailsReader.java:63)
        at in.gov.uidai.authcommon.core.ProcessingStep.perform(ProcessingStep.java:36)
        at in.gov.uidai.authcommon.core.impl.Authenticator.performAndReturnContext(Authenticator.java:40)
        at in.gov.uidai.authserver.grid.AuthenticationGridJob.execute(AuthenticationGridJob.java:27)
        at org.gridgain.grid.kernal.processors.job.GridJobWorker.body(GridJobWorker.java:406)
        at org.gridgain.grid.util.runnable.GridRunnable$1.run(GridRunnable.java:142)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient]  - IPC Client (47) connection to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020 from an unknown user: closed
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=1 of 10 failed; retrying after sleep of 1000 because: null
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
...
...
...
And the above pattern keeps repeating.
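
For what it's worth, that is exactly what interrupting a thread blocked in an
interruptible NIO connect produces.  A self-contained illustration (the address,
port and timings here are arbitrary):

    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;

    public class InterruptDuringConnect {
        public static void main(String[] args) throws Exception {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    SocketChannel ch = null;
                    try {
                        ch = SocketChannel.open();
                        // A connect to an unroutable address blocks long enough to be interrupted.
                        ch.connect(new InetSocketAddress("10.255.255.1", 6020));
                    } catch (Exception e) {
                        // Prints: java.nio.channels.ClosedByInterruptException
                        System.out.println("connect failed with: " + e);
                    } finally {
                        try { if (ch != null) ch.close(); } catch (Exception ignored) { }
                    }
                }
            });
            worker.start();
            Thread.sleep(500);   // give the connect time to start blocking
            worker.interrupt();  // same effect as a container timing out the job
            worker.join();
        }
    }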

******************************************



Regards,
Srikanth


-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Monday, August 22, 2011 2:32 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Yeah that null message isn't really helpful :)

So one thing that might be helpful would be to know who DC1AuthDFSC1D3
is, since you identified the logs as "Region server n".

Then look at the master's web UI and see where -ROOT- is assigned. Is
it also DC1AuthDFSC1D3?

If so, then I would proceed by checking if there's a firewall in
between the client and the cluster; I would also make sure that the
client is running the same version as the server.

J-D

On Sat, Aug 20, 2011 at 5:56 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Further in this investigation, we enabled the debug logs on client side.
>
> We are observing that client is trying to root region, and is continuously failing to do so.  The logs are filled with entries like this:
>
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000
> because: null
>
> Client keeps retrying and retries get exhausted.
>
>
> Complete logs are available here: https://gist.github.com/1159064  including logs of master, zookeeper and region servers.
>
>
> If you can please look at the logs and provide some inputs on this issue, then it will be really helpful.
> We are really not sure why client is failing to get root regions from the server.  Any guidance will be greatly appreciated.
>
>
> Thanks a lot,
> Srikanth


RE: Query regarding HTable.get and timeouts

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
Hi,

Going with the assumption that our client threads may be getting interrupted and that it may not be an HBase issue, we rebuilt our client application without GridGain.  Earlier our code was being executed by GridGain's thread pool, but now we made the app run in plain Tomcat.

I am very glad to say that we are no longer seeing any "java.nio.channels.ClosedByInterruptException".
My client app is working just great, performing its HBase reads/writes as expected.
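
Another option, if GridGain has to stay in the picture, would be to keep the
HBase call off the interruptible worker thread entirely by pushing it onto a
small dedicated executor and bounding the wait.  A sketch under that assumption
(the pool size, timeout handling and naming are illustrative):

    import java.io.IOException;
    import java.util.concurrent.*;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;

    public class IsolatedHBaseReader {
        // Dedicated pool: a container interrupt on the grid worker no longer reaches
        // the thread that owns the HBase socket.
        private final ExecutorService pool = Executors.newFixedThreadPool(4);

        public Result get(final HTable table, final Get get, long timeoutSeconds)
                throws IOException, InterruptedException, TimeoutException {
            Future<Result> future = pool.submit(new Callable<Result>() {
                public Result call() throws IOException {
                    return table.get(get);
                }
            });
            try {
                // If the calling thread is interrupted here, only the wait is aborted;
                // the HBase connection thread is left alone.
                return future.get(timeoutSeconds, TimeUnit.SECONDS);
            } catch (ExecutionException e) {
                throw new IOException("hbase get failed", e.getCause());
            }
        }
    }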


Thanks a lot for all the help.

Regards,
Srikanth
PS: I must say that the HBase community is really great.  I really appreciate all the inputs and suggestions.


-----Original Message-----
From: Srikanth P. Shreenivas
Sent: Monday, August 22, 2011 3:43 PM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Yes, DC1AuthDFSC1D3 hosts the root region.  It is also region server 3.    DC1AuthDFSC1D1, DC1AuthDFSC1D2, DC1AuthDFSC1D3 and DC1AuthDFSC1D4 are 4 region servers in our cluster.

******************************************

I checked with the Data Centre team; they confirmed that there is no firewall in the network where the HBase servers and client applications are running.

******************************************

Regarding the client and server running different versions: they are running the same versions.  If there were a version mismatch, I guess we would be seeing the issue for all reads.  Here we see the issue only for a few reads; one in 10-15 reads fails this way.  We do use the same hbase, zookeeper and hadoop jars as found in the HBase distribution.

Strangely enough, I saw the entries below for the first time today, and this has occurred only once so far.  10.3.48.61 is the IP address where our client app is running.
2011-08-22 11:46:55,905 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7625 got version 6 expected version 3
2011-08-22 11:46:57,542 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7626 got version 6 expected version 3
2011-08-22 11:46:58,483 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7627 got version 6 expected version 3
2011-08-22 11:46:59,335 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7628 got version 6 expected version 3
2011-08-22 11:47:00,164 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7629 got version 6 expected version 3
2011-08-22 11:47:00,972 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7630 got version 6 expected version 3
2011-08-22 11:47:01,768 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7631 got version 6 expected version 3
2011-08-22 11:47:02,648 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7632 got version 6 expected version 3

******************************************

I enabled the debug logging level for all classes today.  Here is the exception associated with the "null" messages.

*** Do you think that some thread in the client is calling interrupt(), resulting in the "java.nio.channels.ClosedByInterruptException" below? ***


2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null
2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient]  - Connecting to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient]  - closing ipc connection to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020: null
java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy41.getClosestRowBefore(Unknown Source)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:719)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:558)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:687)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
        at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
        at in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findEntities(HBaseHandler.java:271)
        at in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findObject(HBaseHandler.java:156)
        at in.gov.uidai.platform.impl.persistence.provider.AbstractPersistenceProvider.findObject(AbstractPersistenceProvider.java:116)
        at in.gov.uidai.platform.impl.persistence.PersistenceManagerProvider.findObject(PersistenceManagerProvider.java:270)
        at in.gov.uidai.authcommon.dao.impl.hbase.ResidentDetailsDAOImpl.findResidentDetailEntity(ResidentDetailsDAOImpl.java:69)
        at in.gov.uidai.authcommon.dao.impl.hbase.ResidentDetailsDAOImpl.findResidentDetails(ResidentDetailsDAOImpl.java:48)
        at in.gov.uidai.authcommon.core.impl.steps.ResidentDetailsReader.findResident(ResidentDetailsReader.java:176)
        at in.gov.uidai.authcommon.core.impl.steps.ResidentDetailsReader.doPerform(ResidentDetailsReader.java:63)
        at in.gov.uidai.authcommon.core.ProcessingStep.perform(ProcessingStep.java:36)
        at in.gov.uidai.authcommon.core.impl.Authenticator.performAndReturnContext(Authenticator.java:40)
        at in.gov.uidai.authserver.grid.AuthenticationGridJob.execute(AuthenticationGridJob.java:27)
        at org.gridgain.grid.kernal.processors.job.GridJobWorker.body(GridJobWorker.java:406)
        at org.gridgain.grid.util.runnable.GridRunnable$1.run(GridRunnable.java:142)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient]  - IPC Client (47) connection to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020 from an unknown user: closed
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=1 of 10 failed; retrying after sleep of 1000 because: null
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
...
...
...
And the above pattern keeps repeating.

******************************************



Regards,
Srikanth


-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Monday, August 22, 2011 2:32 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Yeah that null message isn't really helpful :)

So one thing that might be helpful would be to know who DC1AuthDFSC1D3
is, since you identified the logs as "Region server n".

Then look at the master's web UI and see where -ROOT- is assigned. Is
it also DC1AuthDFSC1D3?

If so, then I would proceed by checking if there's a firewall in
between the client and the cluster; I would also make sure that the
client is running the same version as the server.

J-D

On Sat, Aug 20, 2011 at 5:56 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Further in this investigation, we enabled the debug logs on client side.
>
> We are observing that client is trying to root region, and is continuously failing to do so.  The logs are filled with entries like this:
>
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000
> because: null
>
> Client keeps retrying and retries get exhausted.
>
>
> Complete logs are available here: https://gist.github.com/1159064  including logs of master, zookeeper and region servers.
>
>
> If you can please look at the logs and provide some inputs on this issue, then it will be really helpful.
> We are really not sure why client is failing to get root regions from the server.  Any guidance will be greatly appreciated.
>
>
> Thanks a lot,
> Srikanth


Hbase version

Posted by Greg Bledsoe <gr...@personal.com>.
Hi

Is 0.90.4 currently the most stable/bug fixed version of hbase? What is
the best hdfs version for it?  0.20.x, x=?...  Opinions?

Greg  


Re: Query regarding HTable.get and timeouts

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Yeah that null message isn't really helpful :)

So one thing that might be helpful would be to know who DC1AuthDFSC1D3
is, since you identified the logs as "Region server n".

Then look at the master's web UI and see where -ROOT- is assigned. Is
it also DC1AuthDFSC1D3?

If so, then I would proceed by checking if there's a firewall in
between the client and the cluster; I would also make sure that the
client is running the same version as the server.

J-D

On Sat, Aug 20, 2011 at 5:56 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Further in this investigation, we enabled the debug logs on client side.
>
> We are observing that client is trying to root region, and is continuously failing to do so.  The logs are filled with entries like this:
>
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000
> because: null
>
> Client keeps retrying and retries get exhausted.
>
>
> Complete logs are available here: https://gist.github.com/1159064  including logs of master, zookeeper and region servers.
>
>
> If you can please look at the logs and provide some inputs on this issue, then it will be really helpful.
> We are really not sure why client is failing to get root regions from the server.  Any guidance will be greatly appreciated.
>
>
> Thanks a lot,
> Srikanth

RE: Query regarding HTable.get and timeouts

Posted by Ramkrishna S Vasudevan <ra...@huawei.com>.
Hi Srikanth

I went through your logs; there is not much info present in them.
Could you send the logs that show what happened when the master and RS
started, and also to whom the ROOT and META tables got assigned?

Regards
Ram



-----Original Message-----
From: Srikanth P. Shreenivas [mailto:Srikanth_Shreenivas@mindtree.com] 
Sent: Saturday, August 20, 2011 6:27 PM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Further in this investigation, we enabled the debug logs on the client side.

We are observing that the client is trying to locate the root region and is
continuously failing to do so.  The logs are filled with entries like this:

2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null

The client keeps retrying until the retries are exhausted.


Complete logs are available here: https://gist.github.com/1159064  including
logs of the master, zookeeper and region servers.


If you could please look at the logs and provide some input on this issue,
it would be really helpful.
We are really not sure why the client is failing to get the root region from
the server.  Any guidance will be greatly appreciated.


Thanks a lot,
Srikanth


-----Original Message-----
From: Srikanth P. Shreenivas 
Sent: Saturday, August 20, 2011 1:57 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

I did some tests today.  In our QA setup, we don't see any issues.  I ran
more than 100,000 operations in our QA setup in 1 hour with all HBase
reads/writes working as expected.

However, in our production setup, I regularly see the issue wherein the client
thread gets interrupted because the HTable.get() call does not return.
It is possible that it is taking more than 10 minutes due to
https://issues.apache.org/jira/browse/HBASE-2121.

However, I am not able to figure out what is causing this.  The logs of the
HBase as well as Hadoop servers seem to be quite normal.  The cluster on which
we are seeing this issue has no writes happening, and I see the issue after
about 10 operations.

One strange thing I noticed is that the zookeeper logs were truncated and
had entries only for the last 15-20 minutes instead of the complete day.  This
was around 8 PM, so the logs had not rolled over.

I had earlier asked a query about the same issue
(http://www.mail-archive.com/user@hbase.apache.org/msg09904.html).  The
changes we made of moving to CDH3 and using -XX:+UseMembar fixed the issues in
our QA setup.  However, the same changes in production do not seem to have a
similar effect.

If you can provide any clues on how we should go about investigating this
issue, that would be a real help.

Regards,
Srikanth.


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:39 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

My apologies, I may not be reading the code right.

You are right, it is the GridGain timeout that is causing line 1255 to
execute.
However, the question is what would make an HTable.get() take close to 10
minutes and so induce a timeout in the GridGain task.

The value of numRetries at line 1236 should be 10 (the default), and if we go
with the default value of HConstants.RETRY_BACKOFF, then the sleep time added
across all retries should be only about 61 seconds, not close to 600 seconds
as is the case in our code.


Regards,
Srikanth


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:21 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Please note that the line numbers I am referencing are from this file:
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:19 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Hi Stack,

Thanks a lot for your reply.  It's always a comforting feeling to see such an
active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread
pool.  In this case, we can imagine GridGain as something that hands off
work to various worker threads and waits asynchronously for it to complete.  I
have a 10-minute timeout after which GridGain considers the work timed
out.

What we are observing is that our tasks are timing out at the 10-minute
boundary, and the delay seems to be caused by the part of the work that is
doing HTable.get.

My suspicion is that line 1255 in HConnectionManager.java is calling
Thread.currentThread().interrupt(), due to which the GridGain thread
stops doing what it was meant to do and never responds to the master node,
resulting in a timeout on the master.

For line 1255 to execute, we have to assume that all retries
were exhausted.
Hence my query: what would cause an HTable.get() to get into a situation
wherein
HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets
to line 1255?


Regards,
Srikanth

________________________________________
From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack
[stack@duboce.net]
Sent: Friday, August 19, 2011 12:03 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form and could the
container be doing the interrupting?   I've not come across
client-side thread interrupts before.
St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Hi,
>
> We are experiencing an issue in our HBase Cluster wherein some of the gets
are timing outs at:
>
> java.io.IOException: Giving up trying to get region server: thread is
interrupted.
>                at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.
getRegionServerWithRetries(HConnectionManager.java:1016)
>                at
org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>
> When we look at the logs of master, zookeeper and region servers, there is
nothing that indicates anything abnormal.
>
> I tried looking up below functions, but at this point could not make much
out of it.
>
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/h
base/client/HConnectionManager.java  - getRegionServerWithRetries  starts at
Line 1233
>
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/h
base/client/HTable.java  Htable.get starts at Line 611.
>
>
> If you can please suggest what are the scenarios in which all retries can
get exhausted resulting in thread interruption.
>
> We have seen this issue in two of our HBase Clusters, where load is quite
less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers
in fully-distributed mode (Hadoop).  We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>


RE: Query regarding HTable.get and timeouts

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
Further in this investigation, we enabled the debug logs on the client side.

We are observing that the client is trying to locate the root region and is continuously failing to do so.  The logs are filled with entries like this:

2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation]  - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 
because: null

The client keeps retrying until the retries are exhausted.


Complete logs are available here: https://gist.github.com/1159064  including logs of the master, zookeeper and region servers.


If you could please look at the logs and provide some input on this issue, it would be really helpful.
We are really not sure why the client is failing to get the root region from the server.  Any guidance will be greatly appreciated.


Thanks a lot,
Srikanth


-----Original Message-----
From: Srikanth P. Shreenivas 
Sent: Saturday, August 20, 2011 1:57 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

I did some tests today.  In our QA setup, we don't see any issues.  I ran more than 100,000 operations in our QA setup in 1 hour with all HBase reads/writes working as expected.

However, in our production setup, I regularly see the issue wherein the client thread gets interrupted because the HTable.get() call does not return.
It is possible that it is taking more than 10 minutes due to https://issues.apache.org/jira/browse/HBASE-2121.

However, I am not able to figure out what is causing this.  The logs of the HBase as well as Hadoop servers seem to be quite normal.  The cluster on which we are seeing this issue has no writes happening, and I see the issue after about 10 operations.

One strange thing I noticed is that the zookeeper logs were truncated and had entries only for the last 15-20 minutes instead of the complete day.  This was around 8 PM, so the logs had not rolled over.

I had earlier asked a query about the same issue (http://www.mail-archive.com/user@hbase.apache.org/msg09904.html).  The changes we made of moving to CDH3 and using -XX:+UseMembar fixed the issues in our QA setup.  However, the same changes in production do not seem to have a similar effect.

If you can provide any clues on how we should go about investigating this issue, that would be a real help.

Regards,
Srikanth.


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:39 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

My apologies, I may not be reading the code right.

You are right, it is the GridGain timeout that is causing line 1255 to execute.
However, the question is what would make an HTable.get() take close to 10 minutes and so induce a timeout in the GridGain task.

The value of numRetries at line 1236 should be 10 (the default), and if we go with the default value of HConstants.RETRY_BACKOFF, then the sleep time added across all retries should be only about 61 seconds, not close to 600 seconds as is the case in our code.


Regards,
Srikanth


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:21 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Please note that the line numbers I am referencing are from this file: https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:19 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Hi Stack,

Thanks a lot for your reply.  It's always a comforting feeling to see such an active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread pool.  In this case, we can imagine GridGain as something that hands off work to various worker threads and waits asynchronously for it to complete.  I have a 10-minute timeout after which GridGain considers the work timed out.

What we are observing is that our tasks are timing out at the 10-minute boundary, and the delay seems to be caused by the part of the work that is doing HTable.get.

My suspicion is that line 1255 in HConnectionManager.java is calling Thread.currentThread().interrupt(), due to which the GridGain thread stops doing what it was meant to do and never responds to the master node, resulting in a timeout on the master.

For line 1255 to execute, we have to assume that all retries were exhausted.
Hence my query: what would cause an HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255?


Regards,
Srikanth

________________________________________
From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack [stack@duboce.net]
Sent: Friday, August 19, 2011 12:03 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form and could the
container be doing the interrupting?   I've not come across
client-side thread interrupts before.
St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Hi,
>
> We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at:
>
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>                at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>
> When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal.
>
> I tried looking up below functions, but at this point could not make much out of it.
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java  - getRegionServerWithRetries  starts at Line 1233
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java  Htable.get starts at Line 611.
>
>
> If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption.
>
> We have seen this issue in two of our HBase Clusters, where load is quite less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop).  We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>

RE: Query regarding HTable.get and timeouts

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
I did some tests today.  In our QA setup, we don't see any issues.  I ran more than 100,000 operations in our QA setup in 1 hour with all HBase reads/writes working as expected.

However, in our production setup, I regularly see the issue wherein the client thread gets interrupted because the HTable.get() call does not return.
It is possible that it is taking more than 10 minutes due to https://issues.apache.org/jira/browse/HBASE-2121.

However, I am not able to figure out what is causing this.  The logs of the HBase as well as Hadoop servers seem to be quite normal.  The cluster on which we are seeing this issue has no writes happening, and I see the issue after about 10 operations.

One strange thing I noticed is that the zookeeper logs were truncated and had entries only for the last 15-20 minutes instead of the complete day.  This was around 8 PM, so the logs had not rolled over.

I had earlier asked a query about the same issue (http://www.mail-archive.com/user@hbase.apache.org/msg09904.html).  The changes we made of moving to CDH3 and using -XX:+UseMembar fixed the issues in our QA setup.  However, the same changes in production do not seem to have a similar effect.

If you can provide any clues on how we should go about investigating this issue, that would be a real help.

Regards,
Srikanth.


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:39 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

My apologies, I may not be reading the code right.

You are right, it is the GridGain timeout that is causing line 1255 to execute.
However, the question is what would make an HTable.get() take close to 10 minutes and so induce a timeout in the GridGain task.

The value of numRetries at line 1236 should be 10 (the default), and if we go with the default value of HConstants.RETRY_BACKOFF, then the sleep time added across all retries should be only about 61 seconds, not close to 600 seconds as is the case in our code.


Regards,
Srikanth


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:21 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Please note that the line numbers I am referencing are from this file: https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:19 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Hi Stack,

Thanks a lot for your reply.  It's always a comforting feeling to see such an active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread pool.  In this case, we can imagine GridGain as something that hands off work to various worker threads and waits asynchronously for it to complete.  I have a 10-minute timeout after which GridGain considers the work timed out.

What we are observing is that our tasks are timing out at the 10-minute boundary, and the delay seems to be caused by the part of the work that is doing HTable.get.

My suspicion is that line 1255 in HConnectionManager.java is calling Thread.currentThread().interrupt(), due to which the GridGain thread stops doing what it was meant to do and never responds to the master node, resulting in a timeout on the master.

For line 1255 to execute, we have to assume that all retries were exhausted.
Hence my query: what would cause an HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255?


Regards,
Srikanth

________________________________________
From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack [stack@duboce.net]
Sent: Friday, August 19, 2011 12:03 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form and could the
container be doing the interrupting?   I've not come across
client-side thread interrupts before.
St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Hi,
>
> We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at:
>
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>                at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>
> When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal.
>
> I tried looking up below functions, but at this point could not make much out of it.
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java  - getRegionServerWithRetries  starts at Line 1233
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java  Htable.get starts at Line 611.
>
>
> If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption.
>
> We have seen this issue in two of our HBase Clusters, where load is quite less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop).  We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>

Re: Query regarding HTable.get and timeouts

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
It follows exponential backoff.  Each pause is longer than the last one, and
they all add up to close to 600.
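
As a side note, if the total time a get() may block matters more than the retry
count, the two client-side knobs that bound it can be tightened; the values
below are purely illustrative, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientRetryBudget {
        public static Configuration create() {
            Configuration conf = HBaseConfiguration.create();
            // Fewer attempts and a shorter base pause shrink the worst-case time a
            // get() can spend retrying before it fails back to the caller.
            conf.setInt("hbase.client.retries.number", 5);  // illustrative
            conf.setLong("hbase.client.pause", 500);        // ms, illustrative
            return conf;
        }
    }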

On Thu, Aug 18, 2011 at 12:09 PM, Srikanth P. Shreenivas <
Srikanth_Shreenivas@mindtree.com> wrote:

> My apologies, I may not be reading the code right.
>
> You are right, it is GridGain timeout that is making the line 1255 to
> execute.
> However, the question is what would make a HTable.get() to take close to 10
> minutes to induce a timeout in GridGain task.
>
> The value of numRetries at line 1236 should be 10 (default) and if we go
> with default value of HConstants.RETRY_BACKOFF, then, sleep time added with
> all retries will be only 61 seconds, and not close to 600 seconds as the
> case in our code is.
>
>
> Regards,
> Srikanth
>
>
> ________________________________________
> From: Srikanth P. Shreenivas
> Sent: Friday, August 19, 2011 12:21 AM
> To: user@hbase.apache.org
> Subject: RE: Query regarding HTable.get and timeouts
>
> Please note that line numbers I am referencing are from the file :
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
>
>
> ________________________________________
> From: Srikanth P. Shreenivas
> Sent: Friday, August 19, 2011 12:19 AM
> To: user@hbase.apache.org
> Subject: RE: Query regarding HTable.get and timeouts
>
> Hi Stack,
>
> Thanks a lot for your reply.  It's always a comforting feeling to see such
> an active community, and especially your prompt replies to queries.
>
> Yes, I am running it as a GridGain task, so it runs in GridGain's thread
> pool.  In this case, we can imagine GridGain as something that hands off
> work to various worker threads and waits asynchronously for it to complete.
> I have a 10-minute timeout after which GridGain considers the work timed
> out.
>
> What we are observing is that our tasks are timing out at the 10-minute
> boundary, and the delay seems to be caused by the part of the work that is
> doing HTable.get.
>
> My suspicion is that line 1255 in HConnectionManager.java is calling
> Thread.currentThread().interrupt(), due to which the GridGain thread stops
> doing what it was meant to do and never responds to the master node,
> resulting in a timeout on the master.
>
> In order for line 1255 to execute, we would have to assume that all retries
> were exhausted.
> Hence my query: what would cause an HTable.get() to get into a situation
> wherein
> HConnectionManager$HConnectionImplementation.getRegionServerWithRetries
> reaches line 1255?
>
>
> Regards,
> Srikanth
>
> ________________________________________
> From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack [
> stack@duboce.net]
> Sent: Friday, August 19, 2011 12:03 AM
> To: user@hbase.apache.org
> Subject: Re: Query regarding HTable.get and timeouts
>
> Is your client running inside a container of some form and could the
> container be doing the interrupting?   I've not come across
> client-side thread interrupts before.
> St.Ack
>
> On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
> <Sr...@mindtree.com> wrote:
> > Hi,
> >
> > We are experiencing an issue in our HBase Cluster wherein some of the
> gets are timing outs at:
> >
> > java.io.IOException: Giving up trying to get region server: thread is
> interrupted.
> >                at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
> >                at
> org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
> >
> >
> > When we look at the logs of master, zookeeper and region servers, there
> is nothing that indicates anything abnormal.
> >
> > I tried looking up below functions, but at this point could not make much
> out of it.
> >
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries  starts at Line 1233
> >
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java Htable.get starts at Line 611.
> >
> >
> > If you can please suggest what are the scenarios in which all retries can
> get exhausted resulting in thread interruption.
> >
> > We have seen this issue in two of our HBase Clusters, where load is quite
> less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers
> in fully-distributed mode (Hadoop).  We are using CDH3.
> >
> > Thanks,
> > Srikanth
> >
> > ________________________________
> >
> > http://www.mindtree.com/email/disclaimer.html
> >
>

RE: Query regarding HTable.get and timeouts

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
My apologies, I may not be reading the code right.

You are right, it is the GridGain timeout that is causing line 1255 to execute.
However, the question is what would make an HTable.get() take close to 10 minutes and induce a timeout in the GridGain task.

The value of numRetries at line 1236 should be 10 (the default), and if we go with the default value of HConstants.RETRY_BACKOFF, the sleep time summed across all retries should be only about 61 seconds, not close to 600 seconds as is the case in our code.
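
For what it's worth, a minimal sketch of how the client-side retrying could be bounded so a stuck get fails fast instead of consuming the whole task timeout window; the table name and the concrete numbers are assumptions for illustration, and only the property names (hbase.client.retries.number, hbase.client.pause) are the standard client settings:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class FastFailGet {
  // Bound how long a get can spend retrying so the caller fails fast.
  public static Result boundedGet(byte[] row) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.retries.number", 3);  // fewer attempts (illustrative value)
    conf.setInt("hbase.client.pause", 500);         // shorter base pause in ms (illustrative value)
    HTable table = new HTable(conf, "my_table");    // hypothetical table name
    try {
      return table.get(new Get(row));
    } finally {
      table.close();
    }
  }
}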


Regards,
Srikanth


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:21 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Please note that the line numbers I am referencing are from the file: https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:19 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Hi Stack,

Thanks a lot for your reply.  It's always a comforting feeling to see such an active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread pool.  In this case, we can imagine GridGain as something that hands off work to various worker threads and waits asynchronously for it to complete.  I have a 10-minute timeout after which GridGain considers the work timed out.

What we are observing is that our tasks are timing out at the 10-minute boundary, and the delay seems to be caused by the part of the work that is doing HTable.get.

My suspicion is that line 1255 in HConnectionManager.java is calling Thread.currentThread().interrupt(), due to which the GridGain thread stops doing what it was meant to do and never responds to the master node, resulting in a timeout on the master.

In order for line 1255 to execute, we would have to assume that all retries were exhausted.
Hence my query: what would cause an HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries reaches line 1255?


Regards,
Srikanth

________________________________________
From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack [stack@duboce.net]
Sent: Friday, August 19, 2011 12:03 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form and could the
container be doing the interrupting?   I've not come across
client-side thread interrupts before.
St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Hi,
>
> We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at:
>
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>                at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>
> When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal.
>
> I tried looking up below functions, but at this point could not make much out of it.
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java  - getRegionServerWithRetries  starts at Line 1233
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java  Htable.get starts at Line 611.
>
>
> If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption.
>
> We have seen this issue in two of our HBase Clusters, where load is quite less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop).  We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>

RE: Query regarding HTable.get and timeouts

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
Please note that the line numbers I am referencing are from the file: https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java


________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:19 AM
To: user@hbase.apache.org
Subject: RE: Query regarding HTable.get and timeouts

Hi Stack,

Thanks a lot for your reply.  It's always a comforting feeling to see such an active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread pool.  In this case, we can imagine GridGain as something that hands off work to various worker threads and waits asynchronously for it to complete.  I have a 10-minute timeout after which GridGain considers the work timed out.

What we are observing is that our tasks are timing out at the 10-minute boundary, and the delay seems to be caused by the part of the work that is doing HTable.get.

My suspicion is that line 1255 in HConnectionManager.java is calling Thread.currentThread().interrupt(), due to which the GridGain thread stops doing what it was meant to do and never responds to the master node, resulting in a timeout on the master.

In order for line 1255 to execute, we would have to assume that all retries were exhausted.
Hence my query: what would cause an HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries reaches line 1255?


Regards,
Srikanth

________________________________________
From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack [stack@duboce.net]
Sent: Friday, August 19, 2011 12:03 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form and could the
container be doing the interrupting?   I've not come across
client-side thread interrupts before.
St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Hi,
>
> We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at:
>
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>                at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>
> When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal.
>
> I tried looking up below functions, but at this point could not make much out of it.
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java  - getRegionServerWithRetries  starts at Line 1233
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java  Htable.get starts at Line 611.
>
>
> If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption.
>
> We have seen this issue in two of our HBase Clusters, where load is quite less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop).  We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>

RE: Query regarding HTable.get and timeouts

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
Hi Stack,

Thanks a lot for your reply.  It's always a comforting feeling to see such an active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread pool.  In this case, we can imagine GridGain as something that hands off work to various worker threads and waits asynchronously for it to complete.  I have a 10-minute timeout after which GridGain considers the work timed out.

What we are observing is that our tasks are timing out at the 10-minute boundary, and the delay seems to be caused by the part of the work that is doing HTable.get.

My suspicion is that line 1255 in HConnectionManager.java is calling Thread.currentThread().interrupt(), due to which the GridGain thread stops doing what it was meant to do and never responds to the master node, resulting in a timeout on the master.

In order for line 1255 to execute, we would have to assume that all retries were exhausted.
Hence my query: what would cause an HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries reaches line 1255?
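
To make the suspicion concrete, a minimal sketch of how a cancel-on-timeout from the surrounding framework can interrupt a thread that is still inside HTable.get and surface as the "Giving up trying to get region server: thread is interrupted" IOException; the plain ExecutorService below is only a stand-in for GridGain's thread pool, and the table name and row key are assumptions:

import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class InterruptedGetSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor(); // stand-in for the GridGain pool
    Future<Result> work = pool.submit(new Callable<Result>() {
      public Result call() throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "my_table"); // hypothetical table
        try {
          return table.get(new Get(Bytes.toBytes("row-1")));  // may sit in client retries
        } finally {
          table.close();
        }
      }
    });
    try {
      work.get(10, TimeUnit.MINUTES);   // stand-in for the 10-minute task timeout
    } catch (TimeoutException e) {
      // Cancelling with mayInterruptIfRunning=true sets the worker thread's
      // interrupt flag; a get that is still retrying then gives up with
      // "Giving up trying to get region server: thread is interrupted".
      work.cancel(true);
    } finally {
      pool.shutdown();
    }
  }
}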


Regards,
Srikanth

________________________________________
From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack [stack@duboce.net]
Sent: Friday, August 19, 2011 12:03 AM
To: user@hbase.apache.org
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form and could the
container be doing the interrupting?   I've not come across
client-side thread interrupts before.
St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Hi,
>
> We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at:
>
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>                at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>
> When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal.
>
> I tried looking up below functions, but at this point could not make much out of it.
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java  - getRegionServerWithRetries  starts at Line 1233
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java  Htable.get starts at Line 611.
>
>
> If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption.
>
> We have seen this issue in two of our HBase Clusters, where load is quite less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop).  We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>

Re: Query regarding HTable.get and timeouts

Posted by Stack <st...@duboce.net>.
Is your client running inside a container of some form and could the
container be doing the interrupting?   I've not come across
client-side thread interrupts before.
St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
<Sr...@mindtree.com> wrote:
> Hi,
>
> We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at:
>
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>                at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>
> When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal.
>
> I tried looking up below functions, but at this point could not make much out of it.
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java  - getRegionServerWithRetries  starts at Line 1233
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java  Htable.get starts at Line 611.
>
>
> If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption.
>
> We have seen this issue in two of our HBase Clusters, where load is quite less.  We have 20 reads per minute,  we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop).  We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>