Posted to user@hbase.apache.org by Vidhyashankar Venkataraman <vi...@yahoo-inc.com> on 2010/08/04 20:27:23 UTC

Put MR job.. Regionservers crashes..

I am trying to run an update (inserts and modify operations) experiment on an existing 100 TB database..

The job is a map-only job that doesn't use TableOutputFormat (since the mappers cannot possibly know the splits anyway, I can't optimize for locality here unless I customize it on my own.. I will be doing that shortly..)..
  I have a program that generates Puts on existing rows and new inserts within a specified range.. I use Hadoop streaming to run the program as individual tasks on specified ranges..
 (A side question: is it better to use TableOutputFormat with reduce tasks? The cost of the sort-and-shuffle operations seems to be the same in both cases...

 In my case, the shuffle will happen anyway.. Sorts will happen in each region server.. The maps are at coarser shard ranges: 6 maps per node as against 170 regions per node..)

Existing setup:
180 nodes, 6 map tasks per node, 100 TB existing db, 10 TB of updates.. Logs disabled.. 4 GB regions.. 12 MB write buffer in each client.. Single column family and a single 28 KB value per row: no compression in these experiments (will soon be running experiments with compressed data)..
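For reference, the 12 MB client write buffer corresponds to a setting like the following in hbase-site.xml (a sketch; hbase.client.write.buffer is in bytes, and the same value can also be set per table via HTable.setWriteBufferSize()):

```xml
<property>
  <name>hbase.client.write.buffer</name>
  <!-- 12 MB = 12 * 1024 * 1024 bytes; the 0.20-era default is 2 MB -->
  <value>12582912</value>
</property>
```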

Results: around 130 region servers crashed.. All of them had logs of the following form. The first set of entries isn't the root cause of the crashes.. They seem to be put errors caused by a possible socket timeout.. Is that right?
The second set of logs below shows the point where it crashed.. It looks like a ZK timeout.. Were the RSs too busy to ping the ZK quorum?


Thank you in advance :)
Vidhya


The logs:


2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region DocData,0000000149473989,1280881832403/571301273 because: regionserver/63.250.207.82:60020.cacheFlusher
2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region DocData,0000000148937993,1280881832423/196751683 because: regionserver/63.250.207.82:60020.cacheFlusher
2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region DocData,0000000147751972,1280881832465/1982268997 because: regionserver/63.250.207.82:60020.cacheFlusher
2010-08-04 08:05:42,900 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@661cc0d6, [Lorg.apache.hadoop.hbase.client.Put;@69b855f9) from 63.250.207.23:37600: output error
2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@460ca540, [Lorg.apache.hadoop.hbase.client.Put;@683530ad) from 98.136.191.232:59388: output error
2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@631e978e, [Lorg.apache.hadoop.hbase.client.Put;@6d701e78) from 63.250.207.113:44020: output error
2010-08-04 08:05:42,913 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@685060b9, [Lorg.apache.hadoop.hbase.client.Put;@61f39aaa) from 98.136.191.122:40316: output error
2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 87 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 86 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)

2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 85 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)



THE SECOND PART OF LOGS WHERE CRASHES OCCURRED...



2010-08-04 09:30:29,512 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:549)
        at java.lang.Thread.run(Thread.java:619)
2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a206703ca0616 to sun.nio.ch.SelectionKeyImpl@3d2292cd
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a206703ca0014 to sun.nio.ch.SelectionKeyImpl@36f0b7f8
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-08-04 09:30:29,685 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
2010-08-04 09:30:29,695 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Disconnected, type: None, path: null
2010-08-04 09:30:29,897 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server b3122695.yst.yahoo.net/98.136.191.163:2181
2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/63.250.207.82:40417 remote=b3122695.yst.yahoo.net/98.136.191.163:2181]
2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2010-08-04 09:30:29,901 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Expired, type: None, path: null
2010-08-04 09:30:29,901 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a206703ca0616 to sun.nio.ch.SelectionKeyImpl@7935c2a8
java.io.IOException: Session Expired
        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ZooKeeper: Closing session: 0x32a206703ca0616
2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ClientCnxn: Closing ClientCnxn for session: 0x32a206703ca0616
2010-08-04 09:30:29,948 INFO org.apache.zookeeper.ClientCnxn: Disconnecting ClientCnxn for session: 0x32a206703ca0616
2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ZooKeeper: Session: 0x32a206703ca0616 closed
2010-08-04 09:30:29,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper
2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2010-08-04 09:30:30,288 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server b3122695.yst.yahoo.net/98.136.191.163:2181
2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/63.250.207.82:40418 remote=b3122695.yst.yahoo.net/98.136.191.163:2181]
2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2010-08-04 09:30:30,291 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a206703ca0014 to sun.nio.ch.SelectionKeyImpl@3dc37e55
java.io.IOException: Session Expired
        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-08-04 09:30:30,291 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Expired, type: None, path: null
2010-08-04 09:30:30,291 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
2010-08-04 09:30:30,292 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=162, stores=162, storefiles=280, storefileIndexSize=58, memstoreSize=1724, compactionQueueSize=63, usedHeap=1877, maxHeap=7999, blockCacheSize=20643688, blockCacheFree=2495761816, blockCacheCount=0, blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0



Re: Put MR job.. Regionservers crashes..

Posted by Stack <st...@duboce.net>.
On Wed, Aug 4, 2010 at 12:06 PM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
> As a side question, what are all the log levels available in HBase for diagnosis? Is the default level the maximum?

Yes.

> How do I enable client-side logging?
>

Client will use the settings it finds in the log4j.properties that
comes first on its CLASSPATH.  Are you seeing any?  If the client is
running inside your MR task, then it'll be under the userlogs dir for
your hadoop job logging.

St.Ack

Re: Put MR job.. Regionservers crashes..

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
>> What about speculative execution? Or RPC timeouts (do you log that)?
Speculative execution is set to false..
As a side question, what are all the log levels available in HBase for diagnosis? Is the default level the maximum? How do I enable client-side logging?

>> Please! On a cluster of that size it's almost mandatory :)
Yup.. I had been meaning to set it up..  My bad..

Vidhya


Re: Put MR job.. Regionservers crashes..

Posted by Jean-Daniel Cryans <jd...@apache.org>.
> Wow. I get tons of them in the logs.. And there aren't that many clients that got killed as reported by the MR job.. Is that the only case when these errors are reported?

What about speculative execution? Or RPC timeouts (do you log that)?

> Ok good, so one of the two happened then.. I will try figuring out what happened.. ZK servers are not collocated in my setup.. They are a set of 5 dedicated nodes (nothing else running)..

Then I'm betting the farm it's the nodes that have a resource problem.

> No, I was referring to using TOF on a regular java-API MR job.. But I guessed using TOF will be similar to what I am currently doing..

Unless you have something weird to do with the HTable, using the TOF
is good practice on the map output.
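
If you do go the java-API route, wiring TOF into a map-only job looks roughly like this (a sketch against the 0.20-era org.apache.hadoop.hbase.mapreduce API; the table name "DocData" comes from the logs, while the column family and the mapper body are purely illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class PutJob {
  // Map-only: each map turns an input line into a Put; TOF writes it
  // through an HTable using the configured client write buffer.
  static class PutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      byte[] row = Bytes.toBytes(line.toString());
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("v"), row); // illustrative
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(HBaseConfiguration.create(), "puts");
    job.setJarByClass(PutJob.class);
    job.setMapperClass(PutMapper.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "DocData");
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setNumReduceTasks(0);  // map-only: no sort-and-shuffle at all
    job.waitForCompletion(true);
  }
}
```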

>
> I will use ganglia to monitor the stats..

Please! On a cluster of that size it's almost mandatory :)

J-D

Re: Put MR job.. Regionservers crashes..

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
>> The first part of the log is usually a client that died while it was
>> doing a request.
Wow. I get tons of them in the logs.. And there aren't that many clients that got killed as reported by the MR job.. Is that the only case when these errors are reported?

>> The second part of the log is a session expiration. The log fragment
>> is too small to tell if it was the region server that paused or the ZK
>> ensemble that was unreachable during that time... are the zk servers
>> collocated with region servers?
Ok good, so one of the two happened then.. I will try figuring out what happened.. ZK servers are not collocated in my setup.. They are a set of 5 dedicated nodes (nothing else running)..

>> About your side question, TOF can be used as the output of a map and
>> it will be just like using the HTable, but I'm not sure how's that
>> going to play with your streaming job.
No, I was referring to using TOF on a regular java-API MR job.. But I guessed using TOF will be similar to what I am currently doing..

I will use ganglia to monitor the stats..

Thank you for the prompt response
I will update soon with more results..
Vidhya


Re: Put MR job.. Regionservers crashes..

Posted by Jean-Daniel Cryans <jd...@apache.org>.
The first part of the log is usually a client that died while it was
doing a request.

The second part of the log is a session expiration. The log fragment
is too small to tell if it was the region server that paused or the ZK
ensemble that was unreachable during that time... are the zk servers
collocated with region servers?

About your side question, TOF can be used as the output of a map and
it will be just like using the HTable, but I'm not sure how's that
going to play with your streaming job.

About "Were the RS's too busy to ping the ZK quorum?", apart from
looking at the log and seeing a pause or not, only you can tell if you
are overcommitting your resources. Set up Ganglia, look at the
metrics, see if the CPUs are too busy, and make sure there's 0 KB in
swap (or just disable it altogether).
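
On the ZK side, the expirations in those logs fire at the session timeout the region server negotiates with the ensemble; raising it can buy headroom while the resource problem is chased down (a sketch for hbase-site.xml; the value is in milliseconds, and the ZK servers' maxSessionTimeout (20 * tickTime by default) caps what is actually granted):

```xml
<property>
  <name>zookeeper.session.timeout</name>
  <!-- 120 s instead of the 60 s default; the ensemble's tickTime /
       maxSessionTimeout may need raising to actually grant this -->
  <value>120000</value>
</property>
```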

J-D

On Wed, Aug 4, 2010 at 11:27 AM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
> I am trying to run an update (inserts and modify operations) experiment on a 100 TB  existing database..
>
> The job is a map-only job that doesn't use TableOutputFormat (since mappers cannot possibly know the split anyways, so I cant optimize on locality here of course unless I customize it on my own.. I will be shortly doing that..)..
>  I have a program that generates Puts on existing rows and new inserts within a specified range.. I use hadoop streaming to run the program as individual tasks on specified ranges..
>  (A side question is if it is better to use TableOutputFormat with reduce jobs: the cost of sort-and-shuffle operations seem to be the same in both the cases...
>
>  In my case, shuffle will happen anyways.. Sorts will happen in each region server.. The maps are at coarser shard ranges: 6 maps per node as against 170 regions per node.. )
>
> Existing setup:
> 180 nodes, 6 map tasks per node, 100 TB existing db, 10 TB of updates.. Logs disabled.. 4 gig regions.. 12 MB write buffer size in each client.. Single column family and single value of 28 KB for each row: no compression in the expts (will soon be doing expts with compressed data)..
>
> Results: Around 130 region servers crashed.. All of them had logs of the following form: The first set of data isnt the root for crashes.. They seem to be put errors because of a possible socket timeout.. Is that right?
> The second set of logs below show the point where it crashed.. It looks like a zk timeout.. Were the RS's too busy to ping the ZK quorum?
>
>
> Thank you in advance :)
> Vidhya
>
>
> The logs:
>
>
> 2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region DocData,0000000149473989,1280881832403/571301273 because: regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region DocData,0000000148937993,1280881832423/196751683 because: regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region DocData,0000000147751972,1280881832465/1982268997 because: regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,900 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@661cc0d6, [Lorg.apache.hadoop.hbase.client.Put;@69b855f9) from 63.250.207.23:37600: output error
> 2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@460ca540, [Lorg.apache.hadoop.hbase.client.Put;@683530ad) from 98.136.191.232:59388: output error
> 2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@631e978e, [Lorg.apache.hadoop.hbase.client.Put;@6d701e78) from 63.250.207.113:44020: output error
> 2010-08-04 08:05:42,913 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@685060b9, [Lorg.apache.hadoop.hbase.client.Put;@61f39aaa) from 98.136.191.122:40316: output error
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 87 on 60020 caught: java.nio.channels.ClosedChannelException        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 86 on 60020 caught: java.nio.channels.ClosedChannelException        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 85 on 60020 caught: java.nio.channels.ClosedChannelException
>        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>
>
>
> THE SECOND PART OF LOGS WHERE CRASHES OCCURRED...
>
>
>
> 2010-08-04 09:30:29,512 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:549)
>        at java.lang.Thread.run(Thread.java:619)
> 2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a206703ca0616 to sun.nio.ch.SelectionKeyImpl@3d2292cd
> java.io.IOException: TIMED OUT
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a206703ca0014 to sun.nio.ch.SelectionKeyImpl@36f0b7f8
> java.io.IOException: TIMED OUT
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-08-04 09:30:29,685 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> 2010-08-04 09:30:29,695 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Disconnected, type: None, path: null
> 2010-08-04 09:30:29,897 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server b3122695.yst.yahoo.net/98.136.191.163:2181
> 2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/63.250.207.82:40417 remote=b3122695.yst.yahoo.net/98.136.191.163:2181]
> 2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> 2010-08-04 09:30:29,901 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Expired, type: None, path: null
> 2010-08-04 09:30:29,901 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a206703ca0616 to sun.nio.ch.SelectionKeyImpl@7935c2a8
> java.io.IOException: Session Expired
>        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ZooKeeper: Closing session: 0x32a206703ca0616
> 2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ClientCnxn: Closing ClientCnxn for session: 0x32a206703ca0616
> 2010-08-04 09:30:29,948 INFO org.apache.zookeeper.ClientCnxn: Disconnecting ClientCnxn for session: 0x32a206703ca0616
> 2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ZooKeeper: Session: 0x32a206703ca0616 closed
> 2010-08-04 09:30:29,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper
> 2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2010-08-04 09:30:30,288 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server b3122695.yst.yahoo.net/98.136.191.163:2181
> 2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/63.250.207.82:40418 remote=b3122695.yst.yahoo.net/98.136.191.163:2181]
> 2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> 2010-08-04 09:30:30,291 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a206703ca0014 to sun.nio.ch.SelectionKeyImpl@3dc37e55
> java.io.IOException: Session Expired
>        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2010-08-04 09:30:30,291 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Expired, type: None, path: null
> 2010-08-04 09:30:30,291 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
> 2010-08-04 09:30:30,292 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=162, stores=162, storefiles=280, storefileIndexSize=58, memstoreSize=1724, compactionQueueSize=63, usedHeap=1877, maxHeap=7999, blockCacheSize=20643688, blockCacheFree=2495761816, blockCacheCount=0, blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
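
If the region servers really were too busy (or stuck in a long GC pause) to heartbeat ZooKeeper, one mitigation I am considering is raising the session timeout in hbase-site.xml. A sketch of what I have in mind (the 120000 ms value is a guess on my part, not a tested recommendation, and the effective timeout is also bounded by the ZK server's tickTime settings):

```xml
<!-- hbase-site.xml: raise the ZooKeeper session timeout so that a long
     stop-the-world GC pause does not expire the region server's session.
     The value below (in milliseconds) is illustrative, not tuned. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
```

Does that sound like a reasonable knob to turn here, or is it just masking the underlying write-load problem?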
>