Posted to user@hbase.apache.org by Jean-Marc Spaggiari <je...@spaggiari.org> on 2014/04/29 16:37:32 UTC

ZK issue on one single RS.

Hi,

I face a "strange" issue today.

I stopped my cluster to merge some regions. When I tried to restart it, one
server got stuck.

If I try to stop the cluster, this server ignores the request and stays stuck.

Then if I restart, the cluster comes back without this server, which says
that it's already started.

The server seems to be stuck with ZK. http://pastebin.com/6UcYEKN2

I have killed the RS and tried to restart it: same issue. I'm able to
access the ZK servers from the RS. I have restarted the 3 ZK servers too,
same result. The RS tries to connect to all of them but always fails.

Any idea why ZK might fail that way?

Thanks,

JM

Re: ZK issue on one single RS.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
After some time the node finally exited by itself (logs below) and I have
been able to restart it correctly... The only thing I found in the ZK logs
was this:
2014-04-29 11:28:00,026 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@209] - Too many connections from /
192.168.23.2 - max is 60


This warning appeared all over the place. I don't know why it did that for
this specific server. All the sessions expired at 11:36, and when I
restarted this node at 18:00 it went well. I have not been able to find any
other clue; there is no other exception in the RS logs.
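
For reference, if the 60-connection cap itself ever turns out to be the
culprit, something like this would show how many connections the suspect
host is holding and raise the limit (just a sketch; the value 300 is
arbitrary, adjust to your setup):

# On the ZK server, count open connections from the RS the warning complains about
netstat -an | grep ':2181' | grep '192.168.23.2' | wc -l

# zoo.cfg on each ZooKeeper server -- raise the per-client cap (default is 60), then restart ZK
maxClientCnxns=300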

JM



2014-04-29 11:28:07,112 WARN org.apache.zookeeper.ClientCnxn: Session 0x0
for server latitude.google.com/192.168.23.4:2181, unexpected error, closing
socket connection and attempting reconnect
java.io.IOException: Connexion ré-initialisée par le correspondant
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
    at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2014-04-29 11:28:08,920 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server node3.google.com/192.168.23.7:2181. Will not
attempt to authenticate using SASL (unknown error)
2014-04-29 11:28:08,921 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to node3.google.com/192.168.23.7:2181, initiating
session
2014-04-29 11:28:08,921 WARN org.apache.zookeeper.ClientCnxn: Session 0x0
for server node3.google.com/192.168.23.7:2181, unexpected error, closing
socket connection and attempting reconnect
java.io.IOException: Connexion ré-initialisée par le correspondant
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
    at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2014-04-29 11:28:09,413 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server cube.google.com/192.168.23.1:2181. Will not
attempt to authenticate using SASL (unknown error)
2014-04-29 11:28:09,414 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to cube.google.com/192.168.23.1:2181, initiating
session
2014-04-29 11:28:09,416 WARN org.apache.zookeeper.ClientCnxn: Session 0x0
for server cube.google.com/192.168.23.1:2181, unexpected error, closing
socket connection and attempting reconnect
java.io.IOException: Connexion ré-initialisée par le correspondant
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
    at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2014-04-29 11:28:09,606 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting;
hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-9,5,main]
2014-04-29 11:28:09,606 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
2014-04-29 11:28:09,606 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 0 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 0 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
Server handler 0 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 3 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 4 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 1 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 1 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 2 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 9 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 8 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 5 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
IPC Server Responder
2014-04-29 11:28:09,609 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
IPC Server Responder
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
IPC Server listener on 60020
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 5 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 7 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 2 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 6 on 60020: exiting
2014-04-29 11:28:09,607 INFO
org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to
stop the worker thread
2014-04-29 11:28:09,609 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
Server handler 1 on 60020: exiting
2014-04-29 11:28:09,607 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
Server handler 2 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 4 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 8 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 3 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 7 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 9 on 60020: exiting
2014-04-29 11:28:09,608 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC
Server handler 6 on 60020: exiting
2014-04-29 11:28:09,614 INFO
org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
interrupted while waiting for task, exiting: java.lang.InterruptedException
2014-04-29 11:28:09,614 INFO
org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
node8.google.com,60020,1398781907511 exiting
2014-04-29 11:28:09,622 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:60030
2014-04-29 11:28:09,623 INFO
org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager:
Stopping RegionServerSnapshotManager gracefully.
2014-04-29 11:28:09,623 INFO
org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2014-04-29 11:28:09,623 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver60020.cacheFlusher exiting
2014-04-29 11:28:09,624 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
node8.google.com,60020,1398781907511
2014-04-29 11:28:09,624 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker:
regionserver60020.compactionChecker exiting
2014-04-29 11:28:09,624 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x0
2014-04-29 11:28:10,372 INFO org.apache.zookeeper.ZooKeeper: Session: 0x0
closed
2014-04-29 11:28:10,372 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2014-04-29 11:28:10,373 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to
close
2014-04-29 11:28:10,394 INFO org.apache.hadoop.hbase.util.FSUtils:
FileSystem doesn't support getDefaultReplication
2014-04-29 11:28:10,394 INFO org.apache.hadoop.hbase.util.FSUtils:
FileSystem doesn't support getDefaultBlockSize
2014-04-29 11:28:10,490 INFO
org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family Bloom filter
type for
hdfs://node3:9000/hbase/-ROOT-/70236052/.tmp/976056adabd74ab09942b9ff09ba7422:
CompoundBloomFilterWriter
2014-04-29 11:28:10,575 INFO
org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO
DeleteFamily was added to HFile
(hdfs://node3:9000/hbase/-ROOT-/70236052/.tmp/976056adabd74ab09942b9ff09ba7422)
2014-04-29 11:28:10,575 INFO org.apache.hadoop.hbase.regionserver.Store:
Flushed , sequenceid=879964757, memsize=384,0, into tmp file
hdfs://node3:9000/hbase/-ROOT-/70236052/.tmp/976056adabd74ab09942b9ff09ba7422
2014-04-29 11:28:10,676 INFO org.apache.hadoop.hbase.regionserver.Store:
Added
hdfs://node3:9000/hbase/-ROOT-/70236052/info/976056adabd74ab09942b9ff09ba7422,
entries=2, sequenceid=879964757, filesize=866,0
2014-04-29 11:28:10,678 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Finished memstore flush of ~384,0/384, currentsize=0,0/0 for region
-ROOT-,,0.70236052 in 305ms, sequenceid=879964757, compaction
requested=false
2014-04-29 11:28:10,684 INFO org.apache.hadoop.hbase.regionserver.Store:
Closed info
2014-04-29 11:28:10,684 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed -ROOT-,,0.70236052
2014-04-29 11:28:10,773 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
node8.google.com,60020,1398781907511; all regions closed.
2014-04-29 11:28:10,774 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
regionserver60020.logSyncer exiting
2014-04-29 11:28:10,921 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020 closing leases
2014-04-29 11:28:10,921 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020 closed leases
2014-04-29 11:28:19,310 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer$PeriodicMemstoreFlusher:
regionserver60020.periodicFlusher exiting
2014-04-29 11:28:19,315 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020.leaseChecker closing leases
2014-04-29 11:28:19,315 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020.leaseChecker closed leases
2014-04-29 11:28:19,517 INFO org.apache.zookeeper.ZooKeeper: Session:
0x345ade606c40005 closed
2014-04-29 11:28:19,517 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2014-04-29 11:28:19,517 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
node8.google.com,60020,1398781907511; zookeeper connection closed.
2014-04-29 11:28:19,517 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
exiting
2014-04-29 11:28:19,517 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown
hook thread.
2014-04-29 11:28:19,518 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.



2014-04-29 18:08 GMT-04:00 Mikhail Antonov <ol...@gmail.com>:

>    1. 2014-04-29 10:32:28,876 INFO
>
>  org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>    This client just lost it's session with ZooKeeper, will automatically
>    reconnect when needed.
>
> What about ZK metrics, maybe it's timing out or something (I mean echo
> srvr | nc host port)?
>
> -Mikhail
>
>
> 2014-04-29 14:52 GMT-07:00 Stack <st...@duboce.net>:
>
> > Anything in the zk log when the RS connects JMS?
> >
> >
> > On Tue, Apr 29, 2014 at 7:37 AM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > Hi,
> > >
> > > I face a "strange" issue today.
> > >
> > > I stopped my cluster to merge some regions. When I tried to restart it,
> > > one server got stuck.
> > >
> > > If I try to stop the cluster, this server ignores the request and stays
> > > stuck.
> > >
> > > Then if I restart, the cluster comes back without this server, which
> > > says that it's already started.
> > >
> > > The server seems to be stuck with ZK. http://pastebin.com/6UcYEKN2
> > >
> > > I have killed the RS and tried to restart it: same issue. I'm able to
> > > access the ZK servers from the RS. I have restarted the 3 ZK servers
> > > too, same result. The RS tries to connect to all of them but always
> > > fails.
> > >
> > > Any idea why ZK might fail that way?
> > >
> > > Thanks,
> > >
> > > JM
> > >
> >
>
>
>
> --
> Thanks,
> Michael Antonov
>

Re: ZK issue on one single RS.

Posted by Mikhail Antonov <ol...@gmail.com>.
   1. 2014-04-29 10:32:28,876 INFO
   org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
   This client just lost it's session with ZooKeeper, will automatically
   reconnect when needed.

What about ZK metrics, maybe it's timing out or something (I mean echo
srvr | nc host port)?
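
For example, something like this against each ensemble member (hostnames
here are just the ones from the RS log, adjust as needed):

for h in node3.google.com latitude.google.com cube.google.com; do
  echo "== $h =="
  echo srvr | nc "$h" 2181
done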

-Mikhail


2014-04-29 14:52 GMT-07:00 Stack <st...@duboce.net>:

> Anything in the zk log when the RS connects JMS?
>
>
> On Tue, Apr 29, 2014 at 7:37 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi,
> >
> > I face a "strange" issue today.
> >
> > I stopped my cluster to merge some regions. When I tried to restart it,
> > one server got stuck.
> >
> > If I try to stop the cluster, this server ignores the request and stays
> > stuck.
> >
> > Then if I restart, the cluster comes back without this server, which says
> > that it's already started.
> >
> > The server seems to be stuck with ZK. http://pastebin.com/6UcYEKN2
> >
> > I have killed the RS and tried to restart it: same issue. I'm able to
> > access the ZK servers from the RS. I have restarted the 3 ZK servers too,
> > same result. The RS tries to connect to all of them but always fails.
> >
> > Any idea why ZK might fail that way?
> >
> > Thanks,
> >
> > JM
> >
>



-- 
Thanks,
Michael Antonov

Re: ZK issue on one single RS.

Posted by Stack <st...@duboce.net>.
Anything in the zk log when the RS connects JMS?


On Tue, Apr 29, 2014 at 7:37 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi,
>
> I face a "strange" issue today.
>
> I stopped my cluster to merge some regions. When I tried to restart it, one
> server got stuck.
>
> If I try to stop the cluster, this server ignores the request and stays
> stuck.
>
> Then if I restart, the cluster comes back without this server, which says
> that it's already started.
>
> The server seems to be stuck with ZK. http://pastebin.com/6UcYEKN2
>
> I have killed the RS and tried to restart it: same issue. I'm able to
> access the ZK servers from the RS. I have restarted the 3 ZK servers too,
> same result. The RS tries to connect to all of them but always fails.
>
> Any idea why ZK might fail that way?
>
> Thanks,
>
> JM
>

Re: ZK issue on one single RS.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
HBase 0.94.19. I have the RC1 jars but I guess they are the same as the
release.
ZK 3.4.3

And for the translation, I would say something like "Connection reset by
peer"? ;)

Seems that ZK closes the connection for this specific RS. I don't really
get why...

JM


2014-04-29 10:43 GMT-04:00 Ted Yu <yu...@gmail.com>:

> Can you give us versions of hbase and zookeeper ?
>
> bq. Connexion ré-initialisée par le correspondant
>
> Google translate says:
> Connection re-initialized by the corresponding
>
> Is the above correct ?
>
>
> On Tue, Apr 29, 2014 at 7:37 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi,
> >
> > I face a "strange" issue today.
> >
> > I stopped my cluster to merge some regions. When I tried to restart it,
> > one server got stuck.
> >
> > If I try to stop the cluster, this server ignores the request and stays
> > stuck.
> >
> > Then if I restart, the cluster comes back without this server, which says
> > that it's already started.
> >
> > The server seems to be stuck with ZK. http://pastebin.com/6UcYEKN2
> >
> > I have killed the RS and tried to restart it: same issue. I'm able to
> > access the ZK servers from the RS. I have restarted the 3 ZK servers too,
> > same result. The RS tries to connect to all of them but always fails.
> >
> > Any idea why ZK might fail that way?
> >
> > Thanks,
> >
> > JM
> >
>

Re: ZK issue on one single RS.

Posted by Ted Yu <yu...@gmail.com>.
Can you give us versions of hbase and zookeeper ?

bq. Connexion ré-initialisée par le correspondant

Google translate says:
Connection re-initialized by the corresponding

Is the above correct ?


On Tue, Apr 29, 2014 at 7:37 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi,
>
> I face a "strange" issue today.
>
> I stopped my cluster to merge some regions. When I tried to restart it, one
> server got stuck.
>
> If I try to stop the cluster, this server ignores the request and stays
> stuck.
>
> Then if I restart, the cluster comes back without this server, which says
> that it's already started.
>
> The server seems to be stuck with ZK. http://pastebin.com/6UcYEKN2
>
> I have killed the RS and tried to restart it: same issue. I'm able to
> access the ZK servers from the RS. I have restarted the 3 ZK servers too,
> same result. The RS tries to connect to all of them but always fails.
>
> Any idea why ZK might fail that way?
>
> Thanks,
>
> JM
>