Posted to user@hbase.apache.org by Viral Bajaria <vi...@gmail.com> on 2013/04/28 22:30:51 UTC

max regionserver handler count

Hi,

I have been experimenting with the regionserver handler count. What I
noticed is that the cluster comes up fine up to a certain point, around
7500 handlers per regionserver. Above that, the system refuses to start up:
the regionservers just keep spinning, and the ROOT region keeps on bouncing
around different states but never stabilizes.

So the first question: what is the maximum that folks on the list have used
for this setting? If anyone has gone above 10,000, did you need any special
tuning?

Secondly, the setting is per regionserver (as the name suggests) and not
per region, right?
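
For reference, the setting I am changing is hbase.regionserver.handler.count
in hbase-site.xml on each regionserver, along the lines of:

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>7500</value>
</property>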

Following are my versions:
HBase 0.94.5
Hadoop 1.0.4
Ubuntu 12.04

Let me know if you need any more information from my side.

Thanks,
Viral

Re: max regionserver handler count

Posted by Viral Bajaria <vi...@gmail.com>.
I looked closely into the async API and there is no way to batch GETs
across rows to reduce the number of RPC calls and thus the number of
handlers in use. I will play around with the handlers again tomorrow and
see if I can find anything interesting.

On Tue, Apr 30, 2013 at 12:03 AM, Anoop John <an...@gmail.com> wrote:

> If you can make use of the batch API, i.e. get(List<Get>), you can reduce
> the number of handlers in use (and the number of RPC calls as well). One
> batch will use one handler.
> >I am using asynchbase which does not have the notion of batch gets
> I have not checked with asynchbase. Just mentioning it as a pointer.
>
> -Anoop-
>

Re: max regionserver handler count

Posted by Anoop John <an...@gmail.com>.
If you can make use of the batch API, i.e. get(List<Get>), you can reduce
the number of handlers in use (and the number of RPC calls as well). One
batch will use one handler.
>I am using asynchbase which does not have the notion of batch gets
I have not checked with asynchbase. Just mentioning it as a pointer.
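
A rough sketch of what I mean with the standard (blocking) client; the
table name, column family and row keys below are just placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");
byte[][] rowKeys = { Bytes.toBytes("row1"), Bytes.toBytes("row2") };  // placeholder keys
List<Get> gets = new ArrayList<Get>();
for (byte[] rowKey : rowKeys) {
  Get get = new Get(rowKey);
  get.addFamily(Bytes.toBytes("f"));   // restrict to one column family
  gets.add(get);
}
Result[] results = table.get(gets);    // one batched call instead of one RPC per Get
table.close();

The client groups the Gets by region server, so the number of RPCs (and
handlers occupied) is roughly one per regionserver touched rather than one
per row.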

-Anoop-

On Tue, Apr 30, 2013 at 12:12 PM, Viral Bajaria <vi...@gmail.com> wrote:

> I am using asynchbase which does not have the notion of batch gets. It
> allows you to batch at a rowkey level in a single get request.
>
> -Viral
>
> On Mon, Apr 29, 2013 at 11:29 PM, Anoop John <an...@gmail.com>
> wrote:
>
> > You are making use of batch Gets?  get(List)
> >
> > -Anoop-
> >
>

Re: max regionserver handler count

Posted by Viral Bajaria <vi...@gmail.com>.
I am using asynchbase which does not have the notion of batch gets. It
allows you to batch at a rowkey level in a single get request.
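
To illustrate (table, family and column names here are just placeholders):
a single GetRequest can ask for several columns of one row, but each row
still needs its own RPC:

import java.util.ArrayList;
import com.stumbleupon.async.Deferred;
import org.hbase.async.GetRequest;
import org.hbase.async.HBaseClient;
import org.hbase.async.KeyValue;

HBaseClient client = new HBaseClient("zk-quorum-host:2181");
GetRequest request = new GetRequest("mytable", "row-key-1");
request.family("f");
request.qualifiers(new byte[][] { "c1".getBytes(), "c2".getBytes() });  // several columns, one row
Deferred<ArrayList<KeyValue>> result = client.get(request);             // still one RPC per row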

-Viral

On Mon, Apr 29, 2013 at 11:29 PM, Anoop John <an...@gmail.com> wrote:

> You are making use of batch Gets?  get(List)
>
> -Anoop-
>

Re: max regionserver handler count

Posted by Anoop John <an...@gmail.com>.
You are making use of batch Gets?  get(List)

-Anoop-

On Tue, Apr 30, 2013 at 11:40 AM, Viral Bajaria <vi...@gmail.com> wrote:

> Thanks for getting back, Ted. I totally understand other priorities and
> will wait for some feedback. I am adding some more info to this post to
> allow better diagnosing of performance.
>
> I hit my region servers with a lot of GET requests (~20K per second per
> regionserver) using asynchbase in my test environment, the storage pattern
> is very similar to OpenTSDB though with a lot more columns. Each row is
> around 45-50 bytes long. The regionservers have a lot of RAM available to
> them (48 out of 60 GB) and they are not sharing resources with anyone else,
> so memory is not under pressure. The total # of rows in the table is around
> 100M and growing (there is a put process too)
>
> GETs take over 15s for 16K rows, and I don't see any operationTooSlow logs
> in the regionserver logs either. PUTs take around 1s for 16K rows (deferred
> log flush is enabled though).
>
> I looked at the RPC stats and it seems the RPC threads were always doing
> something and I assumed my requests were waiting on handlers and so thought
> of experimenting by increasing number of handlers. But as mentioned in my
> thread, going above 10K kills my regionserver.
>
> Thanks,
> Viral
>
> On Mon, Apr 29, 2013 at 9:43 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Viral:
> > I am currently dealing with some high priority bugs so I didn't have time
> > to look deeper into your case.
> >
> > My feeling is that raising max regionserver handler count shouldn't be
> the
> > key to boosting performance.
> >
> > Cheers
> >
>

Re: max regionserver handler count

Posted by Viral Bajaria <vi...@gmail.com>.
Thanks for getting back, Ted. I totally understand the other priorities and
will wait for feedback. I am adding some more information to this post to
help diagnose the performance issue.

I hit my regionservers with a lot of GET requests (~20K per second per
regionserver) using asynchbase in my test environment. The storage pattern
is very similar to OpenTSDB, though with a lot more columns. Each row is
around 45-50 bytes long. The regionservers have a lot of RAM available to
them (48 out of 60 GB) and they are not sharing resources with anything
else, so memory is not under pressure. The total number of rows in the
table is around 100M and growing (there is a put process running as well).

GETs take over 15s for 16K rows, and I don't see any operationTooSlow logs
in the regionserver logs either. PUTs take around 1s for 16K rows (deferred
log flush is enabled though).
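
By deferred log flush I mean the tables have the deferred WAL flush flag
set; roughly something like this sketch, with placeholder table and family
names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("mytable");
desc.addFamily(new HColumnDescriptor("f"));
desc.setDeferredLogFlush(true);   // WAL edits are synced periodically instead of on every put
admin.createTable(desc);
admin.close();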

I looked at the RPC stats and it seemed the RPC threads were always busy,
so I assumed my requests were waiting on handlers and decided to experiment
with increasing the number of handlers. But as mentioned earlier in this
thread, going above 10K kills my regionservers.

Thanks,
Viral

On Mon, Apr 29, 2013 at 9:43 PM, Ted Yu <yu...@gmail.com> wrote:

> Viral:
> I am currently dealing with some high priority bugs so I didn't have time
> to look deeper into your case.
>
> My feeling is that raising max regionserver handler count shouldn't be the
> key to boosting performance.
>
> Cheers
>

Re: max regionserver handler count

Posted by Ted Yu <yu...@gmail.com>.
Viral:
I am currently dealing with some high priority bugs so I didn't have time
to look deeper into your case.

My feeling is that raising max regionserver handler count shouldn't be the
key to boosting performance.

Cheers

On Mon, Apr 29, 2013 at 8:48 PM, Viral Bajaria <vi...@gmail.com> wrote:

> Still unclear how and where the same session id is being re-used. Any
> thoughts ?
>
> The ROOT region was just bouncing around state's since the RS would be
> marked as dead whenever these timeouts occur and thus the ROOT region will
> be moved to a different server. Once the ROOT gets assigned to a RS which
> had less handlers (< 15K), it stopped bouncing around. I am surprised
> bumping up handlers and having 0 traffic on the cluster can cause this
> issue.
>
> -Viral
>
> On Mon, Apr 29, 2013 at 1:23 PM, Viral Bajaria <viral.bajaria@gmail.com
> >wrote:
>
> > On Mon, Apr 29, 2013 at 2:49 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> After each zookeeper reconnect, I saw same session Id (0x703e...)
> >>
> >> What version of zookeeper are you using ? Can you search zookeeper log
> >> for this session Id to see what happened ?
> >>
> >> Thanks
> >>
> >
> > Zookeeper version is 3.4.5, following are the logs from 2 zookeeper
> > servers. The first one was the first time ever the regionserver connected
> > to ZK and after that all of them are retries. I doubt the issue is on the
> > ZK side since I have 3 other services running which run fine with the
> same
> > quorum and have no issues.
> >
> > I feel I am hitting some other bug with HBase when the number of handlers
> > is increased by a lot. Anyone seen a high number of handlers in any
> > production deployment out there ?
> >
> > Thanks,
> > Viral
> >
> > ===server with first session===
> > 2013-04-29 07:40:55,574 [myid:1072011376] - INFO
> >  [CommitProcessor:1072011376:ZooKeeperServer@595] - Established session
> > 0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
> > 10.155.234.3:53814
> > EndOfStreamException: Unable to read additional data from client
> sessionid
> > 0x703e48a8cfd81be6, likely client has closed socket
> > 2013-04-29 07:43:47,737 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /10.155.234.3:53814 which had sessionid 0x703e48a8cfd81be6
> > 2013-04-29 07:46:29,679 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> > session 0x703e48a8cfd81be6 at /10.155.234.3:53831
> > 2013-04-29 07:46:29,679 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client:
> > 0x703e48a8cfd81be6
> > 2013-04-29 07:46:29,680 [myid:1072011376] - INFO
> >  [QuorumPeer[myid=1072011376]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@595]
> -
> > Established session 0x703e48a8cfd81be6 with negotiated timeout 40000 for
> > client /10.155.234.3:53831
> > EndOfStreamException: Unable to read additional data from client
> sessionid
> > 0x703e48a8cfd81be6, likely client has closed socket
> > 2013-04-29 07:49:10,401 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /10.155.234.3:53831 which had sessionid 0x703e48a8cfd81be6
> > 2013-04-29 07:55:53,441 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> > session 0x703e48a8cfd81be6 at /10.155.234.3:53854
> > 2013-04-29 07:55:53,441 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client:
> > 0x703e48a8cfd81be6
> > 2013-04-29 07:55:53,442 [myid:1072011376] - INFO
> >  [QuorumPeer[myid=1072011376]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@595]
> -
> > Established session 0x703e48a8cfd81be6 with negotiated timeout 40000 for
> > client /10.155.234.3:53854
> > EndOfStreamException: Unable to read additional data from client
> sessionid
> > 0x703e48a8cfd81be6, likely client has closed socket
> > 2013-04-29 07:58:33,947 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /10.155.234.3:53854 which had sessionid 0x703e48a8cfd81be6
> >
> > ===server with subsequent sessions===
> > 2013-04-29 07:44:28,733 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> > session 0x703e48a8cfd81be6 at /10.155.234.3:51353
> > 2013-04-29 07:44:28,734 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established session
> > 0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
> > 10.155.234.3:51353
> > EndOfStreamException: Unable to read additional data from client
> sessionid
> > 0x703e48a8cfd81be6, likely client has closed socket
> > 2013-04-29 07:46:29,206 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /10.155.234.3:51353 which had sessionid 0x703e48a8cfd81be6
> > 2013-04-29 07:53:52,553 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> > session 0x703e48a8cfd81be6 at /10.155.234.3:51376
> > 2013-04-29 07:53:52,553 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established session
> > 0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
> > 10.155.234.3:51376
> > EndOfStreamException: Unable to read additional data from client
> sessionid
> > 0x703e48a8cfd81be6, likely client has closed socket
> > 2013-04-29 07:55:53,049 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /10.155.234.3:51376 which had sessionid 0x703e48a8cfd81be6
> > 2013-04-29 08:02:36,000 [myid:54242244232] - INFO
> >  [SessionTracker:ZooKeeperServer@325] - Expiring session
> > 0x703e48a8cfd81be6, timeout of 40000ms exceeded
> > 2013-04-29 08:02:36,001 [myid:54242244232] - INFO
> >  [ProcessThread(sid:54242244232 cport:-1)::PrepRequestProcessor@476] -
> > Processed session termination for sessionid: 0x703e48a8cfd81be6
> > 2013-04-29 08:02:36,283 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> > session 0x703e48a8cfd81be6 at /10.155.234.3:51398
> > 2013-04-29 08:02:36,284 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@588] - Invalid session
> > 0x703e48a8cfd81be6 for client /10.155.234.3:51398, probably expired
> > 2013-04-29 08:02:36,284 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /10.155.234.3:51398 which had sessionid 0x703e48a8cfd81be6
> >
> >
>

Re: max regionserver handler count

Posted by Viral Bajaria <vi...@gmail.com>.
It is still unclear to me how and where the same session id is being
re-used. Any thoughts?

The ROOT region was just bouncing around states because the RS would be
marked as dead whenever these timeouts occurred, and so the ROOT region
would be moved to a different server. Once ROOT got assigned to an RS with
fewer handlers (< 15K), it stopped bouncing around. I am surprised that
bumping up the handlers, with zero traffic on the cluster, can cause this
issue.

-Viral

On Mon, Apr 29, 2013 at 1:23 PM, Viral Bajaria <vi...@gmail.com> wrote:

> On Mon, Apr 29, 2013 at 2:49 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> After each zookeeper reconnect, I saw same session Id (0x703e...)
>>
>> What version of zookeeper are you using ? Can you search zookeeper log
>> for this session Id to see what happened ?
>>
>> Thanks
>>
>
> Zookeeper version is 3.4.5, following are the logs from 2 zookeeper
> servers. The first one was the first time ever the regionserver connected
> to ZK and after that all of them are retries. I doubt the issue is on the
> ZK side since I have 3 other services running which run fine with the same
> quorum and have no issues.
>
> I feel I am hitting some other bug with HBase when the number of handlers
> is increased by a lot. Anyone seen a high number of handlers in any
> production deployment out there ?
>
> Thanks,
> Viral
>
> ===server with first session===
> 2013-04-29 07:40:55,574 [myid:1072011376] - INFO
>  [CommitProcessor:1072011376:ZooKeeperServer@595] - Established session
> 0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
> 10.155.234.3:53814
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x703e48a8cfd81be6, likely client has closed socket
> 2013-04-29 07:43:47,737 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /10.155.234.3:53814 which had sessionid 0x703e48a8cfd81be6
> 2013-04-29 07:46:29,679 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> session 0x703e48a8cfd81be6 at /10.155.234.3:53831
> 2013-04-29 07:46:29,679 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client:
> 0x703e48a8cfd81be6
> 2013-04-29 07:46:29,680 [myid:1072011376] - INFO
>  [QuorumPeer[myid=1072011376]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@595] -
> Established session 0x703e48a8cfd81be6 with negotiated timeout 40000 for
> client /10.155.234.3:53831
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x703e48a8cfd81be6, likely client has closed socket
> 2013-04-29 07:49:10,401 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /10.155.234.3:53831 which had sessionid 0x703e48a8cfd81be6
> 2013-04-29 07:55:53,441 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> session 0x703e48a8cfd81be6 at /10.155.234.3:53854
> 2013-04-29 07:55:53,441 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client:
> 0x703e48a8cfd81be6
> 2013-04-29 07:55:53,442 [myid:1072011376] - INFO
>  [QuorumPeer[myid=1072011376]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@595] -
> Established session 0x703e48a8cfd81be6 with negotiated timeout 40000 for
> client /10.155.234.3:53854
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x703e48a8cfd81be6, likely client has closed socket
> 2013-04-29 07:58:33,947 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /10.155.234.3:53854 which had sessionid 0x703e48a8cfd81be6
>
> ===server with subsequent sessions===
> 2013-04-29 07:44:28,733 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> session 0x703e48a8cfd81be6 at /10.155.234.3:51353
> 2013-04-29 07:44:28,734 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established session
> 0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
> 10.155.234.3:51353
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x703e48a8cfd81be6, likely client has closed socket
> 2013-04-29 07:46:29,206 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /10.155.234.3:51353 which had sessionid 0x703e48a8cfd81be6
> 2013-04-29 07:53:52,553 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> session 0x703e48a8cfd81be6 at /10.155.234.3:51376
> 2013-04-29 07:53:52,553 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established session
> 0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
> 10.155.234.3:51376
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x703e48a8cfd81be6, likely client has closed socket
> 2013-04-29 07:55:53,049 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /10.155.234.3:51376 which had sessionid 0x703e48a8cfd81be6
> 2013-04-29 08:02:36,000 [myid:54242244232] - INFO
>  [SessionTracker:ZooKeeperServer@325] - Expiring session
> 0x703e48a8cfd81be6, timeout of 40000ms exceeded
> 2013-04-29 08:02:36,001 [myid:54242244232] - INFO
>  [ProcessThread(sid:54242244232 cport:-1)::PrepRequestProcessor@476] -
> Processed session termination for sessionid: 0x703e48a8cfd81be6
> 2013-04-29 08:02:36,283 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
> session 0x703e48a8cfd81be6 at /10.155.234.3:51398
> 2013-04-29 08:02:36,284 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@588] - Invalid session
> 0x703e48a8cfd81be6 for client /10.155.234.3:51398, probably expired
> 2013-04-29 08:02:36,284 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /10.155.234.3:51398 which had sessionid 0x703e48a8cfd81be6
>
>

Re: max regionserver handler count

Posted by Viral Bajaria <vi...@gmail.com>.
On Mon, Apr 29, 2013 at 2:49 AM, Ted Yu <yu...@gmail.com> wrote:

> After each zookeeper reconnect, I saw same session Id (0x703e...)
>
> What version of zookeeper are you using ? Can you search zookeeper log for
> this session Id to see what happened ?
>
> Thanks
>

The Zookeeper version is 3.4.5; below are the logs from 2 of the zookeeper
servers. The first one covers the very first time the regionserver
connected to ZK, and everything after that is a retry. I doubt the issue is
on the ZK side, since I have 3 other services running fine against the same
quorum with no issues.

I feel I am hitting some other bug in HBase when the number of handlers is
increased by a lot. Has anyone seen such a high number of handlers in any
production deployment out there?

Thanks,
Viral

===server with first session===
2013-04-29 07:40:55,574 [myid:1072011376] - INFO
 [CommitProcessor:1072011376:ZooKeeperServer@595] - Established session
0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
10.155.234.3:53814
EndOfStreamException: Unable to read additional data from client sessionid
0x703e48a8cfd81be6, likely client has closed socket
2013-04-29 07:43:47,737 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /10.155.234.3:53814 which had sessionid 0x703e48a8cfd81be6
2013-04-29 07:46:29,679 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
session 0x703e48a8cfd81be6 at /10.155.234.3:53831
2013-04-29 07:46:29,679 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x703e48a8cfd81be6
2013-04-29 07:46:29,680 [myid:1072011376] - INFO
 [QuorumPeer[myid=1072011376]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@595] -
Established session 0x703e48a8cfd81be6 with negotiated timeout 40000 for
client /10.155.234.3:53831
EndOfStreamException: Unable to read additional data from client sessionid
0x703e48a8cfd81be6, likely client has closed socket
2013-04-29 07:49:10,401 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /10.155.234.3:53831 which had sessionid 0x703e48a8cfd81be6
2013-04-29 07:55:53,441 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
session 0x703e48a8cfd81be6 at /10.155.234.3:53854
2013-04-29 07:55:53,441 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x703e48a8cfd81be6
2013-04-29 07:55:53,442 [myid:1072011376] - INFO
 [QuorumPeer[myid=1072011376]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@595] -
Established session 0x703e48a8cfd81be6 with negotiated timeout 40000 for
client /10.155.234.3:53854
EndOfStreamException: Unable to read additional data from client sessionid
0x703e48a8cfd81be6, likely client has closed socket
2013-04-29 07:58:33,947 [myid:1072011376] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /10.155.234.3:53854 which had sessionid 0x703e48a8cfd81be6

===server with subsequent sessions===
2013-04-29 07:44:28,733 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
session 0x703e48a8cfd81be6 at /10.155.234.3:51353
2013-04-29 07:44:28,734 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established session
0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
10.155.234.3:51353
EndOfStreamException: Unable to read additional data from client sessionid
0x703e48a8cfd81be6, likely client has closed socket
2013-04-29 07:46:29,206 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /10.155.234.3:51353 which had sessionid 0x703e48a8cfd81be6
2013-04-29 07:53:52,553 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
session 0x703e48a8cfd81be6 at /10.155.234.3:51376
2013-04-29 07:53:52,553 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established session
0x703e48a8cfd81be6 with negotiated timeout 40000 for client /
10.155.234.3:51376
EndOfStreamException: Unable to read additional data from client sessionid
0x703e48a8cfd81be6, likely client has closed socket
2013-04-29 07:55:53,049 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /10.155.234.3:51376 which had sessionid 0x703e48a8cfd81be6
2013-04-29 08:02:36,000 [myid:54242244232] - INFO
 [SessionTracker:ZooKeeperServer@325] - Expiring session
0x703e48a8cfd81be6, timeout of 40000ms exceeded
2013-04-29 08:02:36,001 [myid:54242244232] - INFO
 [ProcessThread(sid:54242244232 cport:-1)::PrepRequestProcessor@476] -
Processed session termination for sessionid: 0x703e48a8cfd81be6
2013-04-29 08:02:36,283 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew
session 0x703e48a8cfd81be6 at /10.155.234.3:51398
2013-04-29 08:02:36,284 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@588] - Invalid session
0x703e48a8cfd81be6 for client /10.155.234.3:51398, probably expired
2013-04-29 08:02:36,284 [myid:54242244232] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /10.155.234.3:51398 which had sessionid 0x703e48a8cfd81be6
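
One thing I notice in these logs: the negotiated session timeout is always
40000 ms. If I understand the ZooKeeper side correctly, the server caps
sessions at maxSessionTimeout, which defaults to 20 * tickTime (40 s with
the default tickTime of 2000 ms), regardless of any larger
zookeeper.session.timeout that HBase asks for. A sketch of the two settings
involved (the values shown are only illustrative):

# zoo.cfg on the ZooKeeper servers
tickTime=2000
maxSessionTimeout=40000    # defaults to 20 * tickTime when not set

<!-- hbase-site.xml on the regionservers: the timeout HBase requests -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>40000</value>
</property>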

Re: max regionserver handler count

Posted by Ted Yu <yu...@gmail.com>.
After each zookeeper reconnect, I saw the same session Id (0x703e...).

What version of zookeeper are you using? Can you search the zookeeper logs
for this session Id to see what happened?

Thanks

On Apr 29, 2013, at 2:36 AM, Viral Bajaria <vi...@gmail.com> wrote:

> On Mon, Apr 29, 2013 at 2:25 AM, Ted Yu <yu...@gmail.com> wrote:
> 
>> I noticed the 8 occurrences of 0x703e... following region server name in
>> the abort message.
>> I wonder why the repetition ?
>> 
>> Cheers
> 
> Oh good observation. I just stepped through the logs again and saw that the
> client timeout occurred 8 times before that exception. Maybe that explains
> the repetitive occurrence, but that is definitely weird ? Logs pasted below.
> 
> -Viral
> 
> ===logs===
> 2013-04-29 07:41:03,227 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as
> regionserver1,60020,1367221255417, RPC listening on regionserver1/
> 10.155.234.3:60020, sessionid=0x703e48a8cfd81be6
> 2013-04-29 07:41:03,228 INFO
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> regionserver1,60020,1367221255417 starting
> 2013-04-29 07:41:03,230 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Registered RegionServer
> MXBean
> 2013-04-29 07:41:05,967 INFO org.apache.hadoop.hbase.util.ChecksumType:
> Checksum can use java.util.zip.CRC32
> 2013-04-29 07:43:47,739 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 27868ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 07:43:48,776 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.165.33.180:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 07:43:48,777 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.165.33.180:2181, initiating session
> 2013-04-29 07:43:48,782 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server 10.165.33.180:2181, sessionid =
> 0x703e48a8cfd81be6, negotiated timeout = 40000
> 2013-04-29 07:44:28,337 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 39427ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 07:44:28,735 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.152.152.84:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 07:44:28,736 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.152.152.84:2181, initiating session
> 2013-04-29 07:44:28,738 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server 10.152.152.84:2181, sessionid =
> 0x703e48a8cfd81be6, negotiated timeout = 40000
> 2013-04-29 07:46:29,080 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
> free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
> cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
> evicted=0, evictedPerRun=NaN
> 2013-04-29 07:46:29,174 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 40354ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 07:46:29,682 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.147.128.158:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 07:46:29,682 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.147.128.158:2181, initiating session
> 2013-04-29 07:46:29,684 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server 10.147.128.158:2181, sessionid =
> 0x703e48a8cfd81be6, negotiated timeout = 40000
> 2013-04-29 07:49:10,404 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 39876ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 07:49:11,503 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.165.33.180:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 07:49:11,504 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.165.33.180:2181, initiating session
> 2013-04-29 07:49:11,506 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server 10.165.33.180:2181, sessionid =
> 0x703e48a8cfd81be6, negotiated timeout = 40000
> 2013-04-29 07:51:11,462 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
> free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
> cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
> evicted=0, evictedPerRun=NaN
> 2013-04-29 07:53:52,158 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 40195ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 07:53:52,555 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.152.152.84:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 07:53:52,556 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.152.152.84:2181, initiating session
> 2013-04-29 07:53:52,558 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server 10.152.152.84:2181, sessionid =
> 0x703e48a8cfd81be6, negotiated timeout = 40000
> 2013-04-29 07:55:53,052 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 26955ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 07:55:53,444 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.147.128.158:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 07:55:53,444 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.147.128.158:2181, initiating session
> 2013-04-29 07:55:53,446 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server 10.147.128.158:2181, sessionid =
> 0x703e48a8cfd81be6, negotiated timeout = 40000
> 2013-04-29 07:55:55,512 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
> free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
> cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
> evicted=0, evictedPerRun=NaN
> 2013-04-29 07:58:33,950 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 39972ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 07:58:35,096 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.165.33.180:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 07:58:35,097 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.165.33.180:2181, initiating session
> 2013-04-29 07:58:35,098 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server 10.165.33.180:2181, sessionid =
> 0x703e48a8cfd81be6, negotiated timeout = 40000
> 2013-04-29 08:00:55,512 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
> free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
> cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
> evicted=0, evictedPerRun=NaN
> 2013-04-29 08:02:35,674 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 40592ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 08:02:36,286 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.152.152.84:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 08:02:36,287 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.152.152.84:2181, initiating session
> 2013-04-29 08:02:36,288 INFO org.apache.zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x703e48a8cfd81be6 has expired,
> closing socket connection
> 2013-04-29 08:03:16,287 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> regionserver1,60020,1367221255417:
> regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
> regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:389)
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:286)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> 2013-04-29 08:03:16,288 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> regionserver1,60020,1367221255417: Unhandled exception:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing regionserver1,60020,1367221255417 as dead server
> org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing regionserver1,60020,1367221255417 as dead server
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>        at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>        at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:880)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:748)
>        at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing regionserver1,60020,1367221255417 as dead server
>        at
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:254)
>        at
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:172)
>        at
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:984)
>        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> 
>        at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1021)
>        at
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
>        at $Proxy8.regionServerReport(Unknown Source)

Re: max regionserver handler count

Posted by Viral Bajaria <vi...@gmail.com>.
On Mon, Apr 29, 2013 at 2:25 AM, Ted Yu <yu...@gmail.com> wrote:

> I noticed the 8 occurrences of 0x703e... following region server name in
> the abort message.
> I wonder why the repetition ?
>
> Cheers
>

Oh, good observation. I just stepped through the logs again and saw that
the client session timeout occurred 8 times before that exception. Maybe
that explains the repetition, but it is definitely weird. Logs are pasted
below.

-Viral

===logs===
2013-04-29 07:41:03,227 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as
regionserver1,60020,1367221255417, RPC listening on regionserver1/
10.155.234.3:60020, sessionid=0x703e48a8cfd81be6
2013-04-29 07:41:03,228 INFO
org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
regionserver1,60020,1367221255417 starting
2013-04-29 07:41:03,230 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Registered RegionServer
MXBean
2013-04-29 07:41:05,967 INFO org.apache.hadoop.hbase.util.ChecksumType:
Checksum can use java.util.zip.CRC32
2013-04-29 07:43:47,739 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 27868ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 07:43:48,776 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.165.33.180:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 07:43:48,777 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.165.33.180:2181, initiating session
2013-04-29 07:43:48,782 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server 10.165.33.180:2181, sessionid =
0x703e48a8cfd81be6, negotiated timeout = 40000
2013-04-29 07:44:28,337 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 39427ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 07:44:28,735 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.152.152.84:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 07:44:28,736 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.152.152.84:2181, initiating session
2013-04-29 07:44:28,738 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server 10.152.152.84:2181, sessionid =
0x703e48a8cfd81be6, negotiated timeout = 40000
2013-04-29 07:46:29,080 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
evicted=0, evictedPerRun=NaN
2013-04-29 07:46:29,174 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 40354ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 07:46:29,682 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.147.128.158:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 07:46:29,682 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.147.128.158:2181, initiating session
2013-04-29 07:46:29,684 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server 10.147.128.158:2181, sessionid =
0x703e48a8cfd81be6, negotiated timeout = 40000
2013-04-29 07:49:10,404 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 39876ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 07:49:11,503 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.165.33.180:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 07:49:11,504 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.165.33.180:2181, initiating session
2013-04-29 07:49:11,506 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server 10.165.33.180:2181, sessionid =
0x703e48a8cfd81be6, negotiated timeout = 40000
2013-04-29 07:51:11,462 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
evicted=0, evictedPerRun=NaN
2013-04-29 07:53:52,158 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 40195ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 07:53:52,555 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.152.152.84:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 07:53:52,556 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.152.152.84:2181, initiating session
2013-04-29 07:53:52,558 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server 10.152.152.84:2181, sessionid =
0x703e48a8cfd81be6, negotiated timeout = 40000
2013-04-29 07:55:53,052 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 26955ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 07:55:53,444 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.147.128.158:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 07:55:53,444 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.147.128.158:2181, initiating session
2013-04-29 07:55:53,446 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server 10.147.128.158:2181, sessionid =
0x703e48a8cfd81be6, negotiated timeout = 40000
2013-04-29 07:55:55,512 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
evicted=0, evictedPerRun=NaN
2013-04-29 07:58:33,950 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 39972ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 07:58:35,096 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.165.33.180:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 07:58:35,097 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.165.33.180:2181, initiating session
2013-04-29 07:58:35,098 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server 10.165.33.180:2181, sessionid =
0x703e48a8cfd81be6, negotiated timeout = 40000
2013-04-29 08:00:55,512 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
evicted=0, evictedPerRun=NaN
2013-04-29 08:02:35,674 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 40592ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 08:02:36,286 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.152.152.84:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 08:02:36,287 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.152.152.84:2181, initiating session
2013-04-29 08:02:36,288 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x703e48a8cfd81be6 has expired,
closing socket connection
2013-04-29 08:03:16,287 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
regionserver1,60020,1367221255417:
regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
        at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:389)
        at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:286)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-04-29 08:03:16,288 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
regionserver1,60020,1367221255417: Unhandled exception:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing regionserver1,60020,1367221255417 as dead server
org.apache.hadoop.hbase.YouAreDeadException:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing regionserver1,60020,1367221255417 as dead server
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
        at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:880)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:748)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing regionserver1,60020,1367221255417 as dead server
        at
org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:254)
        at
org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:172)
        at
org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:984)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)

        at
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1021)
        at
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
        at $Proxy8.regionServerReport(Unknown Source)

Re: max regionserver handler count

Posted by Ted Yu <yu...@gmail.com>.
I noticed the 8 occurrences of 0x703e... following the region server name
in the abort message. I wonder why the repetition?

Cheers  

On Apr 29, 2013, at 2:17 AM, Viral Bajaria <vi...@gmail.com> wrote:

> On Sun, Apr 28, 2013 at 7:37 PM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
> 
>> So you mean that when the handler count is more than 5k this happens when
>> it is lesser this does not.  Have you repeated this behaviour?
> 
>> What i doubt is when you say bouncing around different states i feel may be
>> the ROOT assignment was a problem and something messed up there.
>> If the reason was due to handler count then that needs different analysis.
>> 
>> I think that if you can repeat the experiment and get the same behaviour,
>> you can share the logs so that we can ascertain the exact problem.
> 
> Yeah, I have repeated the behavior. It seems the issue is caused by some
> weird pauses in the regionserver whenever I bump up the handler count
> (logs are below). I doubt the issue is GC: a GC pause should not take that
> long right at startup, even with a 48GB heap, and there are no active
> clients either.
>
> I can safely say this is due to bumping up the handler count: I started 3
> regionservers with 5000 handlers and 3 with 15000 handlers. The ones with
> 15000 spun up all their IPC handlers and then errored out as shown in the
> logs below. This is just the log around the error; before the error there
> were a few more timeouts.
>
> I checked the zookeeper servers (I have a 3-node cluster) and they did not
> GC around the same time, nor were they under any kind of load.
> 
> Thanks,
> Viral
> 
> Region Server Logs
> 2013-04-29 08:00:55,512 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
> free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
> cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
> evicted=0, evictedPerRun=NaN
> 2013-04-29 08:02:35,674 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 40592ms for sessionid
> 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 08:02:36,286 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server 10.152.152.84:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 08:02:36,287 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to 10.152.152.84:2181, initiating session
> 2013-04-29 08:02:36,288 INFO org.apache.zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x703e48a8cfd81be6 has expired,
> closing socket connection
> 2013-04-29 08:03:16,287 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> <hostname>,60020,1367221255417:
> regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
> regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:389)
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:286)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> 2013-04-29 08:03:16,288 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> <hostname>,60020,1367221255417: Unhandled exception:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing <hostname>,60020,1367221255417 as dead server
> org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing <hostname>,60020,1367221255417 as dead server
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>        at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>        at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:880)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:748)
>        at java.lang.Thread.run(Thread.java:662)

Re: max regionserver handler count

Posted by Viral Bajaria <vi...@gmail.com>.
On Sun, Apr 28, 2013 at 7:37 PM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> So you mean that when the handler count is more than 5k this happens when
> it is lesser this does not.  Have you repeated this behaviour?
>

> What i doubt is when you say bouncing around different states i feel may be
> the ROOT assignment was a problem and something messed up there.
> If the reason was due to handler count then that needs different analysis.
>
> I think that if you can repeat the experiment and get the same behaviour,
> you can share the logs so that we can ascertain the exact problem.
>

Yeah, I have repeated the behavior. It seems the issue is caused by some
weird pauses in the regionserver whenever I bump up the handler count (logs
are below). I doubt the issue is GC: a GC pause should not take that long
right at startup, even with a 48GB heap, and there are no active clients
either.

I can safely say this is due to bumping up the handler count: I started 3
regionservers with 5000 handlers and 3 with 15000 handlers. The ones with
15000 spun up all their IPC handlers and then errored out as shown in the
logs below. This is just the log around the error; before the error there
were a few more timeouts.

I checked the zookeeper servers (I have a 3-node cluster) and they did not
GC around the same time, nor were they under any kind of load.

Thanks,
Viral

Region Server Logs
2013-04-29 08:00:55,512 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB,
free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0,
cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0,
evicted=0, evictedPerRun=NaN
2013-04-29 08:02:35,674 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 40592ms for sessionid
0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 08:02:36,286 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server 10.152.152.84:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-04-29 08:02:36,287 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to 10.152.152.84:2181, initiating session
2013-04-29 08:02:36,288 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x703e48a8cfd81be6 has expired,
closing socket connection
2013-04-29 08:03:16,287 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
<hostname>,60020,1367221255417:
regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
        at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:389)
        at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:286)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-04-29 08:03:16,288 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
<hostname>,60020,1367221255417: Unhandled exception:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing <hostname>,60020,1367221255417 as dead server
org.apache.hadoop.hbase.YouAreDeadException:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing <hostname>,60020,1367221255417 as dead server
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
        at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:880)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:748)
        at java.lang.Thread.run(Thread.java:662)

Re: max regionserver handler count

Posted by ramkrishna vasudevan <ra...@gmail.com>.
>The ROOT region keeps on bouncing around different states but never
stabilizes.
So you mean that this happens when the handler count is more than 5k but
not when it is lower.  Have you repeated this behaviour?

What I suspect, when you say it keeps bouncing around different states, is
that the ROOT assignment was the problem and something got messed up there.
If the reason really is the handler count, that needs a different analysis.

I think that if you can repeat the experiment and get the same behaviour,
you can share the logs so that we can ascertain the exact problem.

Regards
Ram


On Mon, Apr 29, 2013 at 7:09 AM, Viral Bajaria <vi...@gmail.com>wrote:

> This is an experimental system right now and I will bump up the # of
> servers in production. But here are the specs:
>
> 6 regionservers (64GB RAM, 48GB allocated to HBase heap, some more
> allocated to Datanode and other processes)
> 50-55 regions per server
> workload: 25K gets / second, 25K puts / second (puts is not consistently
> that high but I am quoting the highest number)
> handler count: 2.5K per regionserver
>
> The data size is pretty compact (think TSDB style) and it should fit in
> memory (in the test environment). Yet I see long pauses when doing GETs. I
> feel those pauses happen when all the regionserver handlers are busy
> servicing RPC requests, which would make sense.
>
> I can experiment with scaling out the cluster, but before doing that I want
> to bump the region handler count and see how far I can stretch it. But it
> seems I can't go beyond 5K right now.
>
> Thanks,
> Viral
>
>
> On Sun, Apr 28, 2013 at 3:19 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > bq. the setting is per regionserver (as the name suggests) and not per
> > region right ?
> >
> > That is correct.
> >
> > Can you give us more information about your cluster size, workload, etc ?
> >
> > Thanks
> >
> > On Mon, Apr 29, 2013 at 4:30 AM, Viral Bajaria <viral.bajaria@gmail.com
> > >wrote:
> >
> > > Hi,
> > >
> > > I have been trying to play around with the regionserver handler count.
> > What
> > > I noticed was, the cluster comes up fine up to a certain point, ~7500
> > > regionserver handler counts. But above that the system refuses to start
> > up.
> > > They keep on spinning for a certain point. The ROOT region keeps on
> > > bouncing around different states but never stabilizes.
> > >
> > > So the first question, what's the max that folks on the list have gone
> > with
> > > this settings ? If anyone has gone above 10,000 have you done any
> special
> > > settings ?
> > >
> > > Secondly, the setting is per regionserver (as the name suggests) and
> not
> > > per region right ?
> > >
> > > Following are my versions:
> > > HBase 0.94.5
> > > Hadoop 1.0.4
> > > Ubuntu 12.04
> > >
> > > Let me know if you need any more information from my side.
> > >
> > > Thanks,
> > > Viral
> > >
> >
>

Re: max regionserver handler count

Posted by Viral Bajaria <vi...@gmail.com>.
This is an experimental system right now and I will bump up the # of
servers in production. But here are the specs:

6 regionservers (64GB RAM, 48GB allocated to HBase heap, some more
allocated to Datanode and other processes)
50-55 regions per server
workload: 25K gets / second, 25K puts / second (puts is not consistently
that high but I am quoting the highest number)
handler count: 2.5K per regionserver

The data size is pretty compact (think TSDB style) and it should fit in
memory (in the test environment). Yet I see long pauses when doing GETs. I
feel those pauses happen when all the regionserver handlers are busy
servicing RPC requests, which would make sense.
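
Before scaling out I also want to see whether batching the reads helps,
since one batched round trip ties up a single handler per regionserver
instead of one handler per Get. A minimal sketch with the stock HTable
client (table, family and row names below are made up purely for
illustration):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchGetSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");    // placeholder table name
    try {
      List<Get> gets = new ArrayList<Get>();
      for (int i = 0; i < 1000; i++) {
        Get get = new Get(Bytes.toBytes("row-" + i));
        get.addFamily(Bytes.toBytes("d"));         // placeholder family name
        gets.add(get);
      }
      // The client groups these by regionserver, so this becomes one multi
      // RPC per regionserver involved rather than one RPC per row.
      Result[] results = table.get(gets);
      System.out.println("fetched " + results.length + " rows");
    } finally {
      table.close();
    }
  }
}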

I can experiment with scaling out the cluster, but before doing that I want
to bump the region handler count and see how far I can stretch it. But it
seems I can't go beyond 5K right now.

Thanks,
Viral


On Sun, Apr 28, 2013 at 3:19 PM, Ted Yu <yu...@gmail.com> wrote:

> bq. the setting is per regionserver (as the name suggests) and not per
> region right ?
>
> That is correct.
>
> Can you give us more information about your cluster size, workload, etc ?
>
> Thanks
>
> On Mon, Apr 29, 2013 at 4:30 AM, Viral Bajaria <viral.bajaria@gmail.com
> >wrote:
>
> > Hi,
> >
> > I have been trying to play around with the regionserver handler count.
> What
> > I noticed was, the cluster comes up fine up to a certain point, ~7500
> > regionserver handler counts. But above that the system refuses to start
> up.
> > They keep on spinning for a certain point. The ROOT region keeps on
> > bouncing around different states but never stabilizes.
> >
> > So the first question, what's the max that folks on the list have gone
> with
> > this settings ? If anyone has gone above 10,000 have you done any special
> > settings ?
> >
> > Secondly, the setting is per regionserver (as the name suggests) and not
> > per region right ?
> >
> > Following are my versions:
> > HBase 0.94.5
> > Hadoop 1.0.4
> > Ubuntu 12.04
> >
> > Let me know if you need any more information from my side.
> >
> > Thanks,
> > Viral
> >
>

Re: max regionserver handler count

Posted by Ted Yu <yu...@gmail.com>.
bq. the setting is per regionserver (as the name suggests) and not per
region right ?

That is correct.

Can you give us more information about your cluster size, workload, etc ?

Thanks

On Mon, Apr 29, 2013 at 4:30 AM, Viral Bajaria <vi...@gmail.com>wrote:

> Hi,
>
> I have been trying to play around with the regionserver handler count. What
> I noticed was, the cluster comes up fine up to a certain point, ~7500
> regionserver handler counts. But above that the system refuses to start up.
> They keep on spinning for a certain point. The ROOT region keeps on
> bouncing around different states but never stabilizes.
>
> So the first question, what's the max that folks on the list have gone with
> this settings ? If anyone has gone above 10,000 have you done any special
> settings ?
>
> Secondly, the setting is per regionserver (as the name suggests) and not
> per region right ?
>
> Following are my versions:
> HBase 0.94.5
> Hadoop 1.0.4
> Ubuntu 12.04
>
> Let me know if you need any more information from my side.
>
> Thanks,
> Viral
>