Posted to user@hbase.apache.org by Li Li <fa...@gmail.com> on 2014/04/22 04:39:34 UTC

is my hbase cluster overloaded?

I have a small HBase cluster with 1 namenode, 1 secondary namenode, and 4 datanodes.
The HBase master runs on the same machine as the namenode, and the 4 HBase
region servers run on the datanode machines.
I found the average load is about 10,000 requests per second, and the cluster
crashed. I found the cause was that one datanode failed.

Each datanode has about 4 CPU cores and 10GB of memory.
Is my cluster overloaded?

Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
How much MaxNewSize does my configuration need?
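
(For reference only: a hedged sketch of what a larger new generation, as
suggested in the reply quoted below, might look like in hbase-env.sh for an
8GB region server heap. The 2g size and the GC log path are illustrative
assumptions, not values given anywhere in this thread:)

export HBASE_REGIONSERVER_OPTS="-Xms8000m -Xmx8000m \
    -XX:NewSize=2g -XX:MaxNewSize=2g \
    -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/var/log/hbase/gc-regionserver.log"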

On Tue, Apr 22, 2014 at 3:02 PM, Azuryy Yu <az...@gmail.com> wrote:
> Do you still have the same issue?
>
> and:
> -Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m
>
> the Eden size is too small.
>
>
>
> On Tue, Apr 22, 2014 at 2:55 PM, Li Li <fa...@gmail.com> wrote:
>
>> <property>
>>   <name>dfs.datanode.handler.count</name>
>>   <value>100</value>
>>   <description>The number of server threads for the datanode.</description>
>> </property>
>>
>>
>> 1. namenode/master  192.168.10.48
>> http://pastebin.com/7M0zzAAc
>>
>> $free -m (these are the values right after I restarted Hadoop and HBase,
>> not the values at the time of the crash)
>>              total       used       free     shared    buffers     cached
>> Mem:         15951       3819      12131          0        509       1990
>> -/+ buffers/cache:       1319      14631
>> Swap:         8191          0       8191
>>
>> 2. datanode/region 192.168.10.45
>> http://pastebin.com/FiAw1yju
>>
>> $free -m
>>              total       used       free     shared    buffers     cached
>> Mem:         15951       3627      12324          0       1516        641
>> -/+ buffers/cache:       1469      14482
>> Swap:         8191          8       8183
>>
>> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <az...@gmail.com> wrote:
>> > One big possible issue is a high rate of concurrent requests to HDFS or
>> > HBase: all datanode handlers become busy, more requests queue up, and they
>> > eventually time out. You can try increasing dfs.datanode.handler.count and
>> > dfs.namenode.handler.count in hdfs-site.xml, then restart HDFS.
>> >
>> > Also, what JVM options do the datanodes, namenode, and region servers use?
>> > If they are all at the defaults, that can also cause this issue.
>> >
>> >
>> >
>> >
>> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fa...@gmail.com> wrote:
>> >
>> >> my cluster setup: all 6 machines are virtual machines. Each machine has
>> >> 4 CPU cores (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz) and 16GB memory
>> >> 192.168.10.48 namenode/jobtracker
>> >> 192.168.10.47 secondary namenode
>> >> 192.168.10.45 datanode/tasktracker
>> >> 192.168.10.46 datanode/tasktracker
>> >> 192.168.10.49 datanode/tasktracker
>> >> 192.168.10.50 datanode/tasktracker
>> >>
>> >> hdfs logs around 20:33
>> >> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
>> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (this is the
>> >> datanode I found crashed first)
>> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
>> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
>> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
>> >>
>> >> hbase logs around 20:33
>> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
>> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
>> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
>> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
>> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p (very strange:
>> >> no log entries at 20:33, but there are entries at 20:32 and 20:34)
>> >>
>> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
>> >> > Can you post more of the data node log, around 20:33 ?
>> >> >
>> >> > Cheers
>> >> >
>> >> >
>> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
>> >> >
>> >> >> hadoop 1.0
>> >> >> hbase 0.94.11
>> >> >>
>> >> >> datanode log from 192.168.10.45. Why did it shut itself down?
>> >> >>
>> >> >> 2014-04-21 20:33:59,309 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-7969006819959471805_202154 received exception java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
>> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.45:50010, storageID=DS-1676697306-192.168.10.45-50010-1392029190949, infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
>> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>> >> >>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>> >> >>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>> >> >>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>> >> >>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>> >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>> >> >>         at java.lang.Thread.run(Thread.java:722)
>> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.45:50010, storageID=DS-1676697306-192.168.10.45-50010-1392029190949, infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout left.
>> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>> >> >>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> >> >>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>> >> >>         at java.lang.Thread.run(Thread.java:722)
>> >> >> 2014-04-21 20:34:00,291 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 0
>> >> >> 2014-04-21 20:34:00,404 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: Shutting down all async disk service threads...
>> >> >> 2014-04-21 20:34:00,405 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All async disk service threads have been shut down.
>> >> >> 2014-04-21 20:34:00,413 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
>> >> >> 2014-04-21 20:34:00,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
>> >> >> /************************************************************
>> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
>> >> >> ************************************************************/
>> >> >>
>> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
>> >> >> > bq. one datanode failed
>> >> >> >
>> >> >> > Was the crash due to out of memory error ?
>> >> >> > Can you post the tail of data node log on pastebin ?
>> >> >> >
>> >> >> > Giving us versions of hadoop and hbase would be helpful.
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com> wrote:
>> >> >> >
>> >> >> >> I have a small HBase cluster with 1 namenode, 1 secondary namenode,
>> >> >> >> and 4 datanodes.
>> >> >> >> The HBase master runs on the same machine as the namenode, and the 4
>> >> >> >> HBase region servers run on the datanode machines.
>> >> >> >> I found the average load is about 10,000 requests per second, and the
>> >> >> >> cluster crashed. I found the cause was that one datanode failed.
>> >> >> >>
>> >> >> >> Each datanode has about 4 CPU cores and 10GB of memory.
>> >> >> >> Is my cluster overloaded?
>> >> >> >>
>> >> >>
>> >>
>>
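
(For reference, a hedged sketch of the namenode-side counterpart of the
handler-count property quoted above. dfs.namenode.handler.count is a real HDFS
property, but the value 64 below is only an illustrative starting point, not a
tuned recommendation:)

<property>
  <name>dfs.namenode.handler.count</name>
  <value>64</value>
  <description>Number of RPC handler threads on the namenode (illustrative value).</description>
</property>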

Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
I found an exception in the regionserver log:
2014-04-22 16:20:11,490 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
updated, and report to master.
Parent=vc2.url_db,\xBF\xC3\x04c\xF7\xD2\xE0\xF0\xD4]\xCA\x83/x\xAAZ,1398152426132.788da9fbd09aff035ff4fc53bc7e6e5b.,
new regions: vc2.url_db,\xBF\xC3\x04c\xF7\xD2\xE0\xF0\xD4]\xCA\x83/x\xAAZ,1398154810909.35678897b245b940ae8bac793f0fc361.,
vc2.url_db,\xDF\xD72\xA0\xA0\x11\x18\x99\x12b\xF2\xBE\x0A@#\xBD,1398154810909.2e73a884aec6c7f981a1340edd6550a5..
Split took 0sec
2014-04-22 16:20:11,957 WARN
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cached an already
cached block: c7da3491c9eb4a30a9d66a6067b6a4cb_131187
cb:c7da3491c9eb4a30a9d66a6067b6a4cb_131187. This is harmless and can
happen in rare cases (see HBASE-8547)
2014-04-22 16:20:14,785 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
7842243909358917272 lease expired on region
vc2.url_db,\xBF\xC3\x04c\xF7\xD2\xE0\xF0\xD4]\xCA\x83/x\xAAZ,1398152426132.788da9fbd09aff035ff4fc53bc7e6e5b.
2014-04-22 16:20:14,785 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Closing scanner
for vc2.url_db,\xBF\xC3\x04c\xF7\xD2\xE0\xF0\xD4]\xCA\x83/x\xAAZ,1398152426132.788da9fbd09aff035ff4fc53bc7e6e5b.
org.apache.hadoop.hbase.NotServingRegionException: Region is not
online: vc2.url_db,\xBF\xC3\x04c\xF7\xD2\xE0\xF0\xD4]\xCA\x83/x\xAAZ,1398152426132.788da9fbd09aff035ff4fc53bc7e6e5b.
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3511)
        at org.apache.hadoop.hbase.regionserver.HRegionServer$ScannerListener.leaseExpired(HRegionServer.java:2742)
        at org.apache.hadoop.hbase.regionserver.Leases.run(Leases.java:100)
        at java.lang.Thread.run(Thread.java:722)
2014-04-22 16:20:22,452 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on
vc2.url_db,\x7F\xE8\xDB\xACq\xC0R\x109\x96\xF1\x08\xA5\xD3X\x1D,1398152426132.8e596f645f035ba44991e77cd73e1b01.
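
The scanner lease expiring on the parent region right after the split looks
consistent with the parent going offline as part of the split, so the
NotServingRegionException on lease expiry is probably a side effect rather than
the root cause. If slow scans are also hitting lease timeouts, one knob to look
at (assuming 0.94 still uses this property name; worth verifying) is the region
server lease period in hbase-site.xml, e.g.:

<property>
  <name>hbase.regionserver.lease.period</name>
  <value>120000</value>
  <description>Scanner lease timeout in milliseconds; 120000 (2 minutes) is only an illustrative value.</description>
</property>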



Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
jmap -heap of app-hbase-1: why is OldSize so small?

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 8388608000 (8000.0MB)
   NewSize          = 536870912 (512.0MB)
   MaxNewSize       = 536870912 (512.0MB)
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 85983232 (82.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 483196928 (460.8125MB)
   used     = 263917680 (251.69151306152344MB)
   free     = 219279248 (209.12098693847656MB)
   54.6190724126458% used
Eden Space:
   capacity = 429522944 (409.625MB)
   used     = 242509200 (231.27479553222656MB)
   free     = 187013744 (178.35020446777344MB)
   56.460127075307064% used
From Space:
   capacity = 53673984 (51.1875MB)
   used     = 21408480 (20.416717529296875MB)
   free     = 32265504 (30.770782470703125MB)
   39.88613925137363% used
To Space:
   capacity = 53673984 (51.1875MB)
   used     = 0 (0.0MB)
   free     = 53673984 (51.1875MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 2857693184 (2725.30859375MB)
   used     = 1738653912 (1658.1095809936523MB)
   free     = 1119039272 (1067.1990127563477MB)
   60.841168034923655% used
Perm Generation:
   capacity = 40685568 (38.80078125MB)
   used     = 24327624 (23.20063018798828MB)
   free     = 16357944 (15.600151062011719MB)
   59.79423465342797% used
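
(If I read this right, OldSize here is only the initial tenured-generation
size: with -Xmx set but no -Xms, the JVM starts with a small committed heap and
lets the CMS old generation grow on demand, so the ~2.7GB "concurrent
mark-sweep generation" capacity above is what is currently committed, not a
ceiling. Setting -Xms equal to -Xmx, e.g. -Xms8000m -Xmx8000m, would commit the
full heap up front. A hedged way to watch the generations over time, where
12345 stands in for the region server pid:)

jstat -gcutil 12345 5s   # illustrative pid; prints survivor/Eden/old/perm utilization every 5 seconds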



Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
Region Servers
ServerName Start time Load
app-hbase-1,60020,1398141516916 Tue Apr 22 12:38:36 CST 2014
requestsPerSecond=448799, numberOfOnlineRegions=8, usedHeapMB=1241,
maxHeapMB=7948
app-hbase-2,60020,1398141516914 Tue Apr 22 12:38:36 CST 2014
requestsPerSecond=0, numberOfOnlineRegions=3, usedHeapMB=829,
maxHeapMB=7948
app-hbase-4,60020,1398141525533 Tue Apr 22 12:38:45 CST 2014
requestsPerSecond=217349, numberOfOnlineRegions=6, usedHeapMB=1356,
maxHeapMB=7948
app-hbase-5,60020,1398141524870 Tue Apr 22 12:38:44 CST 2014
requestsPerSecond=844, numberOfOnlineRegions=3, usedHeapMB=285,
maxHeapMB=7948
Total: servers: 4 requestsPerSecond=666992, numberOfOnlineRegions=20

app-hbase-2 and app-hbase-5 have only 3 regions each. How can I balance the regions across the servers?
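
(A hedged sketch of how this is often done from the hbase shell; these commands
should exist in 0.94, but the output and any region or server names would of
course come from the cluster itself:)

hbase shell
balance_switch true   # make sure the automatic balancer is enabled
balancer              # ask the master to run a balancing pass now
status 'simple'       # re-check region counts per region server

For a specific region, the shell's 'move' command with an encoded region name
and a target server can relocate it by hand.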


Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
Current HBase statistics:

Region Servers
ServerName Start time Load
app-hbase-1,60020,1398141516916 Tue Apr 22 12:38:36 CST 2014
requestsPerSecond=6100, numberOfOnlineRegions=7, usedHeapMB=1201,
maxHeapMB=7948
app-hbase-2,60020,1398141516914 Tue Apr 22 12:38:36 CST 2014
requestsPerSecond=1770, numberOfOnlineRegions=4, usedHeapMB=224,
maxHeapMB=7948
app-hbase-4,60020,1398141525533 Tue Apr 22 12:38:45 CST 2014
requestsPerSecond=3445, numberOfOnlineRegions=5, usedHeapMB=798,
maxHeapMB=7948
app-hbase-5,60020,1398141524870 Tue Apr 22 12:38:44 CST 2014
requestsPerSecond=57, numberOfOnlineRegions=2, usedHeapMB=328,
maxHeapMB=7948
Total: servers: 4 requestsPerSecond=11372, numberOfOnlineRegions=18


Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
I have now restarted the server and it is running. The load will
probably become high again in about an hour.

On Tue, Apr 22, 2014 at 3:02 PM, Azuryy Yu <az...@gmail.com> wrote:
> Do you still have the same issue?
>
> and:
> -Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m
>
> the Eden size is too small.
>
>
>
> On Tue, Apr 22, 2014 at 2:55 PM, Li Li <fa...@gmail.com> wrote:
>
>> <property>
>>   <name>dfs.datanode.handler.count</name>
>>   <value>100</value>
>>   <description>The number of server threads for the datanode.</description>
>> </property>
>>
>>
>> 1. namenode/master  192.168.10.48
>> http://pastebin.com/7M0zzAAc
>>
>> $free -m (this is value when I restart the hadoop and hbase now, not
>> the value when it crashed)
>>              total       used       free     shared    buffers     cached
>> Mem:         15951       3819      12131          0        509       1990
>> -/+ buffers/cache:       1319      14631
>> Swap:         8191          0       8191
>>
>> 2. datanode/region 192.168.10.45
>> http://pastebin.com/FiAw1yju
>>
>> $free -m
>>              total       used       free     shared    buffers     cached
>> Mem:         15951       3627      12324          0       1516        641
>> -/+ buffers/cache:       1469      14482
>> Swap:         8191          8       8183
>>
>> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <az...@gmail.com> wrote:
>> > one big possible issue is that you have a high concurrent request on HDFS
>> > or HBASE, then all Data nodes handlers are all busy, then more requests
>> are
>> > pending, then timeout, so you can try to increase
>> > dfs.datanode.handler.count and dfs.namenode.handler.count in the
>> > hdfs-site.xml, then restart the HDFS.
>> >
>> > another, do you have datanode, namenode, region servers JVM options? if
>> > they are all by default, then there is also have this issue.
>> >
>> >
>> >
>> >
>> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fa...@gmail.com> wrote:
>> >
>> >> my cluster setup: both 6 machines are virtual machine. each machine:
>> >> 4CPU Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 16GB memory
>> >> 192.168.10.48 namenode/jobtracker
>> >> 192.168.10.47 secondary namenode
>> >> 192.168.10.45 datanode/tasktracker
>> >> 192.168.10.46 datanode/tasktracker
>> >> 192.168.10.49 datanode/tasktracker
>> >> 192.168.10.50 datanode/tasktracker
>> >>
>> >> hdfs logs around 20:33
>> >> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
>> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
>> >> datanode crash first)
>> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
>> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
>> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
>> >>
>> >> hbase logs around 20:33
>> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
>> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
>> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
>> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
>> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p(very strange,
>> >> not log at 20:33, but have log at 20:32 and 20:34)
>> >>
>> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
>> >> > Can you post more of the data node log, around 20:33 ?
>> >> >
>> >> > Cheers
>> >> >
>> >> >
>> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
>> >> >
>> >> >> hadoop 1.0
>> >> >> hbase 0.94.11
>> >> >>
>> >> >> datanode log from 192.168.10.45. why it shut down itself?
>> >> >>
>> >> >> 2014-04-21 20:33:59,309 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>> >> >> blk_-7969006819959471805_202154 received exception
>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
>> >> >> left.
>> >> >> 2014-04-21 20:33:59,310 ERROR
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> >> DatanodeRegistration(192.168.10.45:50010,
>> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
>> >> >> left.
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>> >> >>         at
>> >> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>> >> >>         at
>> >> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>> >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>> >> >>         at java.lang.Thread.run(Thread.java:722)
>> >> >> 2014-04-21 20:33:59,310 ERROR
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> >> DatanodeRegistration(192.168.10.45:50010,
>> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> >> channel java.nio.channels.SocketChannel[closed]. 466924 millis
>> timeout
>> >> >> left.
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>> >> >>         at java.lang.Thread.run(Thread.java:722)
>> >> >> 2014-04-21 20:34:00,291 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
>> >> >> threadgroup to exit, active threads is 0
>> >> >> 2014-04-21 20:34:00,404 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
>> >> >> Shutting down all async disk service threads...
>> >> >> 2014-04-21 20:34:00,405 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
>> >> >> async disk service threads have been shut down.
>> >> >> 2014-04-21 20:34:00,413 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
>> >> >> 2014-04-21 20:34:00,424 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
>> >> >> /************************************************************
>> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
>> >> >> ************************************************************/
>> >> >>
>> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com>
>> wrote:
>> >> >> > bq. one datanode failed
>> >> >> >
>> >> >> > Was the crash due to out of memory error ?
>> >> >> > Can you post the tail of data node log on pastebin ?
>> >> >> >
>> >> >> > Giving us versions of hadoop and hbase would be helpful.
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com>
>> wrote:
>> >> >> >
>> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary
>> namenode, 4
>> >> >> >> datanode.
>> >> >> >> and the hbase master is on the same machine with namenode, 4 hbase
>> >> >> >> slave on datanode machine.
>> >> >> >> I found average requests per seconds is about 10,000. and the
>> >> clusters
>> >> >> >> crashed. and I found the reason is one datanode failed.
>> >> >> >>
>> >> >> >> the datanode configuration is about 4 cpu core and 10GB memory
>> >> >> >> is my cluster overloaded?
>> >> >> >>
>> >> >>
>> >>
>>

Re: is my hbase cluster overloaded?

Posted by lars hofhansl <la...@apache.org>.
Have you observed that yourself with a recent HBase (0.94 or later, which allocates heap for the memstore in 2 MB chunks)?
Of course it's possible that there is a scenario that benefits from a larger Eden size; I just haven't seen it.


-- Lars
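
The 2 MB chunks mentioned here are the MemStore-Local Allocation Buffers (MSLAB). A minimal sketch of the relevant hbase-site.xml properties, assuming the 0.94-era property names and defaults (verify against your release):

<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
  <description>Allocate memstore cell data from fixed-size chunks to limit old-gen fragmentation.</description>
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value>
  <description>Chunk size in bytes (2 MB); assumed default, check your release.</description>
</property>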



________________________________
 From: Azuryy Yu <az...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org> 
Sent: Wednesday, April 23, 2014 7:11 PM
Subject: Re: is my hbase cluster overloaded?
 


Lars,

I cannot agree with you. I don't think memstore use the old gen, instead, for a heavy write cluster, memstore is keeping for a short time then flushed to the disk, so if you specified  small Eden, there will be lots of promotions, and easy to cause FullGC.

so for the case in this thread, it's a heavy write cluster, so Eden should be large and MTT is also little large to avoid promotion.




On Wed, Apr 23, 2014 at 12:42 PM, lars hofhansl <la...@apache.org> wrote:

What makes you say this?
>
>HBase has a lot of very short lived garbage (like KeyValue objects that do not outlive an RPC request) and a lot of long lived data in the memstore and the block cache. We want to avoid accumulating the short lived garbage and at the same time leave most heap for memstores and blockcache.
>
>A small eden size of 512mb or even less makes sense to me.
>
>-- Lars
>
>
>
>
>----- Original Message -----
>From: Azuryy Yu <az...@gmail.com>
>To: user@hbase.apache.org
>Cc:
>Sent: Tuesday, April 22, 2014 12:02 AM
>Subject: Re: is my hbase cluster overloaded?
>
>Do you still have the same issue?
>
>and:
>-Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m
>
>the Eden size is too small.
>
>
>
>
>On Tue, Apr 22, 2014 at 2:55 PM, Li Li <fa...@gmail.com> wrote:
>
>> <property>
>>   <name>dfs.datanode.handler.count</name>
>>   <value>100</value>
>>   <description>The number of server threads for the datanode.</description>
>> </property>
>>
>>
>> 1. namenode/master  192.168.10.48
>> http://pastebin.com/7M0zzAAc
>>
>> $free -m (this is value when I restart the hadoop and hbase now, not
>> the value when it crashed)
>>              total       used       free     shared    buffers     cached
>> Mem:         15951       3819      12131          0        509       1990
>> -/+ buffers/cache:       1319      14631
>> Swap:         8191          0       8191
>>
>> 2. datanode/region 192.168.10.45
>> http://pastebin.com/FiAw1yju
>>
>> $free -m
>>              total       used       free     shared    buffers     cached
>> Mem:         15951       3627      12324          0       1516        641
>> -/+ buffers/cache:       1469      14482
>> Swap:         8191          8       8183
>>
>> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <az...@gmail.com> wrote:
>> > one big possible issue is that you have a high concurrent request on HDFS
>> > or HBASE, then all Data nodes handlers are all busy, then more requests
>> are
>> > pending, then timeout, so you can try to increase
>> > dfs.datanode.handler.count and dfs.namenode.handler.count in the
>> > hdfs-site.xml, then restart the HDFS.
>> >
>> > another, do you have datanode, namenode, region servers JVM options? if
>> > they are all by default, then there is also have this issue.
>> >
>> >
>> >
>> >
>> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fa...@gmail.com> wrote:
>> >
>> >> my cluster setup: both 6 machines are virtual machine. each machine:
>> >> 4CPU Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 16GB memory
>> >> 192.168.10.48 namenode/jobtracker
>> >> 192.168.10.47 secondary namenode
>> >> 192.168.10.45 datanode/tasktracker
>> >> 192.168.10.46 datanode/tasktracker
>> >> 192.168.10.49 datanode/tasktracker
>> >> 192.168.10.50 datanode/tasktracker
>> >>
>> >> hdfs logs around 20:33
>> >> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
>> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
>> >> datanode crash first)
>> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
>> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
>> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
>> >>
>> >> hbase logs around 20:33
>> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
>> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
>> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
>> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
>> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p(very strange,
>> >> not log at 20:33, but have log at 20:32 and 20:34)
>> >>
>> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
>> >> > Can you post more of the data node log, around 20:33 ?
>> >> >
>> >> > Cheers
>> >> >
>> >> >
>> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
>> >> >
>> >> >> hadoop 1.0
>> >> >> hbase 0.94.11
>> >> >>
>> >> >> datanode log from 192.168.10.45. why it shut down itself?
>> >> >>
>> >> >> 2014-04-21 20:33:59,309 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>> >> >> blk_-7969006819959471805_202154 received exception
>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
>> >> >> left.
>> >> >> 2014-04-21 20:33:59,310 ERROR
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> >> DatanodeRegistration(192.168.10.45:50010,
>> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
>> >> >> left.
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>> >> >>         at
>> >> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>> >> >>         at
>> >> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>> >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>> >> >>         at java.lang.Thread.run(Thread.java:722)
>> >> >> 2014-04-21 20:33:59,310 ERROR
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> >> DatanodeRegistration(192.168.10.45:50010,
>> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> >> channel java.nio.channels.SocketChannel[closed]. 466924 millis
>> timeout
>> >> >> left.
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>> >> >>         at java.lang.Thread.run(Thread.java:722)
>> >> >> 2014-04-21 20:34:00,291 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
>> >> >> threadgroup to exit, active threads is 0
>> >> >> 2014-04-21 20:34:00,404 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
>> >> >> Shutting down all async disk service threads...
>> >> >> 2014-04-21 20:34:00,405 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
>> >> >> async disk service threads have been shut down.
>> >> >> 2014-04-21 20:34:00,413 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
>> >> >> 2014-04-21 20:34:00,424 INFO
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
>> >> >> /************************************************************
>> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
>> >> >> ************************************************************/
>> >> >>
>> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com>
>> wrote:
>> >> >> > bq. one datanode failed
>> >> >> >
>> >> >> > Was the crash due to out of memory error ?
>> >> >> > Can you post the tail of data node log on pastebin ?
>> >> >> >
>> >> >> > Giving us versions of hadoop and hbase would be helpful.
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com>
>> wrote:
>> >> >> >
>> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary
>> namenode, 4
>> >> >> >> datanode.
>> >> >> >> and the hbase master is on the same machine with namenode, 4 hbase
>> >> >> >> slave on datanode machine.
>> >> >> >> I found average requests per seconds is about 10,000. and the
>> >> clusters
>> >> >> >> crashed. and I found the reason is one datanode failed.
>> >> >> >>
>> >> >> >> the datanode configuration is about 4 cpu core and 10GB memory
>> >> >> >> is my cluster overloaded?
>> >> >> >>
>> >> >>
>> >>
>>
>
>

Re: is my hbase cluster overloaded?

Posted by Azuryy Yu <az...@gmail.com>.
Lars,

I cannot agree with you. I don't think the memstore needs to use the old gen;
on a write-heavy cluster the memstore data is kept for only a short time and then
flushed to disk, so if you specify a small Eden there will be lots of
promotions, which easily leads to full GCs.

So for the case in this thread, which is a write-heavy cluster, Eden should
be large and the tenuring threshold (MTT) should also be a little larger to avoid promotion.
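
As an illustration of that tuning direction only (generic HotSpot flags; the exact sizes depend on the heap and the write volume, and these values are not taken from this cluster):

-Xmx8000m -Xmn2048m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 \
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70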



On Wed, Apr 23, 2014 at 12:42 PM, lars hofhansl <la...@apache.org> wrote:

> What makes you say this?
>
> HBase has a lot of very short lived garbage (like KeyValue objects that do
> not outlive an RPC request) and a lot of long lived data in the memstore
> and the block cache. We want to avoid accumulating the short lived garbage
> and at the same time leave most heap for memstores and blockcache.
>
> A small eden size of 512mb or even less makes sense to me.
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Azuryy Yu <az...@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Tuesday, April 22, 2014 12:02 AM
> Subject: Re: is my hbase cluster overloaded?
>
> Do you still have the same issue?
>
> and:
> -Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m
>
> the Eden size is too small.
>
>
>
>
> On Tue, Apr 22, 2014 at 2:55 PM, Li Li <fa...@gmail.com> wrote:
>
> > <property>
> >   <name>dfs.datanode.handler.count</name>
> >   <value>100</value>
> >   <description>The number of server threads for the
> datanode.</description>
> > </property>
> >
> >
> > 1. namenode/master  192.168.10.48
> > http://pastebin.com/7M0zzAAc
> >
> > $free -m (this is value when I restart the hadoop and hbase now, not
> > the value when it crashed)
> >              total       used       free     shared    buffers     cached
> > Mem:         15951       3819      12131          0        509       1990
> > -/+ buffers/cache:       1319      14631
> > Swap:         8191          0       8191
> >
> > 2. datanode/region 192.168.10.45
> > http://pastebin.com/FiAw1yju
> >
> > $free -m
> >              total       used       free     shared    buffers     cached
> > Mem:         15951       3627      12324          0       1516        641
> > -/+ buffers/cache:       1469      14482
> > Swap:         8191          8       8183
> >
> > On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <az...@gmail.com> wrote:
> > > one big possible issue is that you have a high concurrent request on
> HDFS
> > > or HBASE, then all Data nodes handlers are all busy, then more requests
> > are
> > > pending, then timeout, so you can try to increase
> > > dfs.datanode.handler.count and dfs.namenode.handler.count in the
> > > hdfs-site.xml, then restart the HDFS.
> > >
> > > another, do you have datanode, namenode, region servers JVM options? if
> > > they are all by default, then there is also have this issue.
> > >
> > >
> > >
> > >
> > > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fa...@gmail.com> wrote:
> > >
> > >> my cluster setup: both 6 machines are virtual machine. each machine:
> > >> 4CPU Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 16GB memory
> > >> 192.168.10.48 namenode/jobtracker
> > >> 192.168.10.47 secondary namenode
> > >> 192.168.10.45 datanode/tasktracker
> > >> 192.168.10.46 datanode/tasktracker
> > >> 192.168.10.49 datanode/tasktracker
> > >> 192.168.10.50 datanode/tasktracker
> > >>
> > >> hdfs logs around 20:33
> > >> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
> > >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
> > >> datanode crash first)
> > >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
> > >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
> > >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
> > >>
> > >> hbase logs around 20:33
> > >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
> > >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
> > >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
> > >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
> > >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p(very strange,
> > >> not log at 20:33, but have log at 20:32 and 20:34)
> > >>
> > >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
> > >> > Can you post more of the data node log, around 20:33 ?
> > >> >
> > >> > Cheers
> > >> >
> > >> >
> > >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
> > >> >
> > >> >> hadoop 1.0
> > >> >> hbase 0.94.11
> > >> >>
> > >> >> datanode log from 192.168.10.45. why it shut down itself?
> > >> >>
> > >> >> 2014-04-21 20:33:59,309 INFO
> > >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > >> >> blk_-7969006819959471805_202154 received exception
> > >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> > >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> > >> >> left.
> > >> >> 2014-04-21 20:33:59,310 ERROR
> > >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >> >> DatanodeRegistration(192.168.10.45:50010,
> > >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> > >> >> infoPort=50075, ipcPort=50020):DataXceiver
> > >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> > >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> > >> >> left.
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> > >> >>         at
> > >> >>
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> > >> >>         at
> > >> >>
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> > >> >>         at
> > >> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> > >> >>         at
> > >> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> > >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> > >> >>         at java.lang.Thread.run(Thread.java:722)
> > >> >> 2014-04-21 20:33:59,310 ERROR
> > >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >> >> DatanodeRegistration(192.168.10.45:50010,
> > >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> > >> >> infoPort=50075, ipcPort=50020):DataXceiver
> > >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> > >> >> channel java.nio.channels.SocketChannel[closed]. 466924 millis
> > timeout
> > >> >> left.
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> > >> >>         at java.lang.Thread.run(Thread.java:722)
> > >> >> 2014-04-21 20:34:00,291 INFO
> > >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
> > >> >> threadgroup to exit, active threads is 0
> > >> >> 2014-04-21 20:34:00,404 INFO
> > >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> > >> >> Shutting down all async disk service threads...
> > >> >> 2014-04-21 20:34:00,405 INFO
> > >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> All
> > >> >> async disk service threads have been shut down.
> > >> >> 2014-04-21 20:34:00,413 INFO
> > >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
> > >> >> 2014-04-21 20:34:00,424 INFO
> > >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> > >> >> /************************************************************
> > >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
> > >> >> ************************************************************/
> > >> >>
> > >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com>
> > wrote:
> > >> >> > bq. one datanode failed
> > >> >> >
> > >> >> > Was the crash due to out of memory error ?
> > >> >> > Can you post the tail of data node log on pastebin ?
> > >> >> >
> > >> >> > Giving us versions of hadoop and hbase would be helpful.
> > >> >> >
> > >> >> >
> > >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com>
> > wrote:
> > >> >> >
> > >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary
> > namenode, 4
> > >> >> >> datanode.
> > >> >> >> and the hbase master is on the same machine with namenode, 4
> hbase
> > >> >> >> slave on datanode machine.
> > >> >> >> I found average requests per seconds is about 10,000. and the
> > >> clusters
> > >> >> >> crashed. and I found the reason is one datanode failed.
> > >> >> >>
> > >> >> >> the datanode configuration is about 4 cpu core and 10GB memory
> > >> >> >> is my cluster overloaded?
> > >> >> >>
> > >> >>
> > >>
> >
>
>

Re: is my hbase cluster overloaded?

Posted by lars hofhansl <la...@apache.org>.
What makes you say this?

HBase has a lot of very short-lived garbage (like KeyValue objects that do not outlive an RPC request) and a lot of long-lived data in the memstore and the block cache. We want to avoid accumulating the short-lived garbage and at the same time leave most of the heap for the memstores and the block cache.

A small Eden size of 512 MB or even less makes sense to me.

-- Lars
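
For context on how much of the heap those long-lived structures can claim, the relevant hbase-site.xml knobs (0.94-era names and defaults, as far as I recall -- please double-check) are:

<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
  <description>Fraction of the heap that all memstores together may use (assumed default).</description>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.25</value>
  <description>Fraction of the heap used for the block cache (assumed default).</description>
</property>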



----- Original Message -----
From: Azuryy Yu <az...@gmail.com>
To: user@hbase.apache.org
Cc: 
Sent: Tuesday, April 22, 2014 12:02 AM
Subject: Re: is my hbase cluster overloaded?

Do you still have the same issue?

and:
-Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m

the Eden size is too small.




On Tue, Apr 22, 2014 at 2:55 PM, Li Li <fa...@gmail.com> wrote:

> <property>
>   <name>dfs.datanode.handler.count</name>
>   <value>100</value>
>   <description>The number of server threads for the datanode.</description>
> </property>
>
>
> 1. namenode/master  192.168.10.48
> http://pastebin.com/7M0zzAAc
>
> $free -m (this is value when I restart the hadoop and hbase now, not
> the value when it crashed)
>              total       used       free     shared    buffers     cached
> Mem:         15951       3819      12131          0        509       1990
> -/+ buffers/cache:       1319      14631
> Swap:         8191          0       8191
>
> 2. datanode/region 192.168.10.45
> http://pastebin.com/FiAw1yju
>
> $free -m
>              total       used       free     shared    buffers     cached
> Mem:         15951       3627      12324          0       1516        641
> -/+ buffers/cache:       1469      14482
> Swap:         8191          8       8183
>
> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <az...@gmail.com> wrote:
> > one big possible issue is that you have a high concurrent request on HDFS
> > or HBASE, then all Data nodes handlers are all busy, then more requests
> are
> > pending, then timeout, so you can try to increase
> > dfs.datanode.handler.count and dfs.namenode.handler.count in the
> > hdfs-site.xml, then restart the HDFS.
> >
> > another, do you have datanode, namenode, region servers JVM options? if
> > they are all by default, then there is also have this issue.
> >
> >
> >
> >
> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fa...@gmail.com> wrote:
> >
> >> my cluster setup: both 6 machines are virtual machine. each machine:
> >> 4CPU Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 16GB memory
> >> 192.168.10.48 namenode/jobtracker
> >> 192.168.10.47 secondary namenode
> >> 192.168.10.45 datanode/tasktracker
> >> 192.168.10.46 datanode/tasktracker
> >> 192.168.10.49 datanode/tasktracker
> >> 192.168.10.50 datanode/tasktracker
> >>
> >> hdfs logs around 20:33
> >> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
> >> datanode crash first)
> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
> >>
> >> hbase logs around 20:33
> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p(very strange,
> >> not log at 20:33, but have log at 20:32 and 20:34)
> >>
> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
> >> > Can you post more of the data node log, around 20:33 ?
> >> >
> >> > Cheers
> >> >
> >> >
> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
> >> >
> >> >> hadoop 1.0
> >> >> hbase 0.94.11
> >> >>
> >> >> datanode log from 192.168.10.45. why it shut down itself?
> >> >>
> >> >> 2014-04-21 20:33:59,309 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> >> >> blk_-7969006819959471805_202154 received exception
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> >> >> left.
> >> >> 2014-04-21 20:33:59,310 ERROR
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> DatanodeRegistration(192.168.10.45:50010,
> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> >> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> >> >> left.
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >> >>         at
> >> >>
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >> >>         at
> >> >>
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >> >>         at
> >> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> >> >>         at
> >> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >> >>         at java.lang.Thread.run(Thread.java:722)
> >> >> 2014-04-21 20:33:59,310 ERROR
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> DatanodeRegistration(192.168.10.45:50010,
> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> >> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> >> channel java.nio.channels.SocketChannel[closed]. 466924 millis
> timeout
> >> >> left.
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >> >>         at java.lang.Thread.run(Thread.java:722)
> >> >> 2014-04-21 20:34:00,291 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
> >> >> threadgroup to exit, active threads is 0
> >> >> 2014-04-21 20:34:00,404 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> >> >> Shutting down all async disk service threads...
> >> >> 2014-04-21 20:34:00,405 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
> >> >> async disk service threads have been shut down.
> >> >> 2014-04-21 20:34:00,413 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
> >> >> 2014-04-21 20:34:00,424 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> >> >> /************************************************************
> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
> >> >> ************************************************************/
> >> >>
> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com>
> wrote:
> >> >> > bq. one datanode failed
> >> >> >
> >> >> > Was the crash due to out of memory error ?
> >> >> > Can you post the tail of data node log on pastebin ?
> >> >> >
> >> >> > Giving us versions of hadoop and hbase would be helpful.
> >> >> >
> >> >> >
> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com>
> wrote:
> >> >> >
> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary
> namenode, 4
> >> >> >> datanode.
> >> >> >> and the hbase master is on the same machine with namenode, 4 hbase
> >> >> >> slave on datanode machine.
> >> >> >> I found average requests per seconds is about 10,000. and the
> >> clusters
> >> >> >> crashed. and I found the reason is one datanode failed.
> >> >> >>
> >> >> >> the datanode configuration is about 4 cpu core and 10GB memory
> >> >> >> is my cluster overloaded?
> >> >> >>
> >> >>
> >>
>


Re: is my hbase cluster overloaded?

Posted by Azuryy Yu <az...@gmail.com>.
Do you still have the same issue?

and:
-Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m

the Eden size is too small.
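
One way to check whether the young generation is actually under pressure (a generic JDK tool, nothing specific to this cluster) is to watch GC counters on the region server process, for example:

jstat -gcutil <regionserver-pid> 5000
# E is Eden occupancy in percent; a fast-growing YGC count together with a rising FGC count suggests heavy promotion.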



On Tue, Apr 22, 2014 at 2:55 PM, Li Li <fa...@gmail.com> wrote:

> <property>
>   <name>dfs.datanode.handler.count</name>
>   <value>100</value>
>   <description>The number of server threads for the datanode.</description>
> </property>
>
>
> 1. namenode/master  192.168.10.48
> http://pastebin.com/7M0zzAAc
>
> $free -m (this is value when I restart the hadoop and hbase now, not
> the value when it crashed)
>              total       used       free     shared    buffers     cached
> Mem:         15951       3819      12131          0        509       1990
> -/+ buffers/cache:       1319      14631
> Swap:         8191          0       8191
>
> 2. datanode/region 192.168.10.45
> http://pastebin.com/FiAw1yju
>
> $free -m
>              total       used       free     shared    buffers     cached
> Mem:         15951       3627      12324          0       1516        641
> -/+ buffers/cache:       1469      14482
> Swap:         8191          8       8183
>
> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <az...@gmail.com> wrote:
> > one big possible issue is that you have a high concurrent request on HDFS
> > or HBASE, then all Data nodes handlers are all busy, then more requests
> are
> > pending, then timeout, so you can try to increase
> > dfs.datanode.handler.count and dfs.namenode.handler.count in the
> > hdfs-site.xml, then restart the HDFS.
> >
> > another, do you have datanode, namenode, region servers JVM options? if
> > they are all by default, then there is also have this issue.
> >
> >
> >
> >
> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fa...@gmail.com> wrote:
> >
> >> my cluster setup: both 6 machines are virtual machine. each machine:
> >> 4CPU Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 16GB memory
> >> 192.168.10.48 namenode/jobtracker
> >> 192.168.10.47 secondary namenode
> >> 192.168.10.45 datanode/tasktracker
> >> 192.168.10.46 datanode/tasktracker
> >> 192.168.10.49 datanode/tasktracker
> >> 192.168.10.50 datanode/tasktracker
> >>
> >> hdfs logs around 20:33
> >> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
> >> datanode crash first)
> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
> >>
> >> hbase logs around 20:33
> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p(very strange,
> >> not log at 20:33, but have log at 20:32 and 20:34)
> >>
> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
> >> > Can you post more of the data node log, around 20:33 ?
> >> >
> >> > Cheers
> >> >
> >> >
> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
> >> >
> >> >> hadoop 1.0
> >> >> hbase 0.94.11
> >> >>
> >> >> datanode log from 192.168.10.45. why it shut down itself?
> >> >>
> >> >> 2014-04-21 20:33:59,309 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> >> >> blk_-7969006819959471805_202154 received exception
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> >> >> left.
> >> >> 2014-04-21 20:33:59,310 ERROR
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> DatanodeRegistration(192.168.10.45:50010,
> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> >> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> >> >> left.
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >> >>         at
> >> >>
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >> >>         at
> >> >>
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >> >>         at
> >> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> >> >>         at
> >> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >> >>         at java.lang.Thread.run(Thread.java:722)
> >> >> 2014-04-21 20:33:59,310 ERROR
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> DatanodeRegistration(192.168.10.45:50010,
> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> >> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> >> channel java.nio.channels.SocketChannel[closed]. 466924 millis
> timeout
> >> >> left.
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >> >>         at java.lang.Thread.run(Thread.java:722)
> >> >> 2014-04-21 20:34:00,291 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
> >> >> threadgroup to exit, active threads is 0
> >> >> 2014-04-21 20:34:00,404 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> >> >> Shutting down all async disk service threads...
> >> >> 2014-04-21 20:34:00,405 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
> >> >> async disk service threads have been shut down.
> >> >> 2014-04-21 20:34:00,413 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
> >> >> 2014-04-21 20:34:00,424 INFO
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> >> >> /************************************************************
> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
> >> >> ************************************************************/
> >> >>
> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com>
> wrote:
> >> >> > bq. one datanode failed
> >> >> >
> >> >> > Was the crash due to out of memory error ?
> >> >> > Can you post the tail of data node log on pastebin ?
> >> >> >
> >> >> > Giving us versions of hadoop and hbase would be helpful.
> >> >> >
> >> >> >
> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com>
> wrote:
> >> >> >
> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary
> namenode, 4
> >> >> >> datanode.
> >> >> >> and the hbase master is on the same machine with namenode, 4 hbase
> >> >> >> slave on datanode machine.
> >> >> >> I found average requests per seconds is about 10,000. and the
> >> clusters
> >> >> >> crashed. and I found the reason is one datanode failed.
> >> >> >>
> >> >> >> the datanode configuration is about 4 cpu core and 10GB memory
> >> >> >> is my cluster overloaded?
> >> >> >>
> >> >>
> >>
>

Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
<property>
  <name>dfs.datanode.handler.count</name>
  <value>100</value>
  <description>The number of server threads for the datanode.</description>
</property>


1. namenode/master  192.168.10.48
http://pastebin.com/7M0zzAAc

$free -m (these are the values after restarting hadoop and hbase just now, not
the values from when it crashed)
             total       used       free     shared    buffers     cached
Mem:         15951       3819      12131          0        509       1990
-/+ buffers/cache:       1319      14631
Swap:         8191          0       8191

2. datanode/region 192.168.10.45
http://pastebin.com/FiAw1yju

$free -m
             total       used       free     shared    buffers     cached
Mem:         15951       3627      12324          0       1516        641
-/+ buffers/cache:       1469      14482
Swap:         8191          8       8183

On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <az...@gmail.com> wrote:
> one big possible issue is that you have a high concurrent request on HDFS
> or HBASE, then all Data nodes handlers are all busy, then more requests are
> pending, then timeout, so you can try to increase
> dfs.datanode.handler.count and dfs.namenode.handler.count in the
> hdfs-site.xml, then restart the HDFS.
>
> another, do you have datanode, namenode, region servers JVM options? if
> they are all by default, then there is also have this issue.
>
>
>
>
> On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fa...@gmail.com> wrote:
>
>> my cluster setup: both 6 machines are virtual machine. each machine:
>> 4CPU Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 16GB memory
>> 192.168.10.48 namenode/jobtracker
>> 192.168.10.47 secondary namenode
>> 192.168.10.45 datanode/tasktracker
>> 192.168.10.46 datanode/tasktracker
>> 192.168.10.49 datanode/tasktracker
>> 192.168.10.50 datanode/tasktracker
>>
>> hdfs logs around 20:33
>> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
>> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
>> datanode crash first)
>> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
>> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
>> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
>>
>> hbase logs around 20:33
>> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
>> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
>> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
>> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
>> 192.168.10.50 region log http://pastebin.com/bD6h5T6p(very strange,
>> not log at 20:33, but have log at 20:32 and 20:34)
>>
>> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
>> > Can you post more of the data node log, around 20:33 ?
>> >
>> > Cheers
>> >
>> >
>> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
>> >
>> >> hadoop 1.0
>> >> hbase 0.94.11
>> >>
>> >> datanode log from 192.168.10.45. why it shut down itself?
>> >>
>> >> 2014-04-21 20:33:59,309 INFO
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>> >> blk_-7969006819959471805_202154 received exception
>> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
>> >> left.
>> >> 2014-04-21 20:33:59,310 ERROR
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> DatanodeRegistration(192.168.10.45:50010,
>> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
>> >> left.
>> >>         at
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>> >>         at
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>> >>         at
>> >> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>> >>         at
>> >> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>> >>         at
>> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>> >>         at
>> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>> >>         at java.lang.Thread.run(Thread.java:722)
>> >> 2014-04-21 20:33:59,310 ERROR
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> DatanodeRegistration(192.168.10.45:50010,
>> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> java.io.InterruptedIOException: Interruped while waiting for IO on
>> >> channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout
>> >> left.
>> >>         at
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>> >>         at
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>> >>         at
>> >>
>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> >>         at
>> >>
>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>> >>         at
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>> >>         at java.lang.Thread.run(Thread.java:722)
>> >> 2014-04-21 20:34:00,291 INFO
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
>> >> threadgroup to exit, active threads is 0
>> >> 2014-04-21 20:34:00,404 INFO
>> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
>> >> Shutting down all async disk service threads...
>> >> 2014-04-21 20:34:00,405 INFO
>> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
>> >> async disk service threads have been shut down.
>> >> 2014-04-21 20:34:00,413 INFO
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
>> >> 2014-04-21 20:34:00,424 INFO
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
>> >> /************************************************************
>> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
>> >> ************************************************************/
>> >>
>> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
>> >> > bq. one datanode failed
>> >> >
>> >> > Was the crash due to out of memory error ?
>> >> > Can you post the tail of data node log on pastebin ?
>> >> >
>> >> > Giving us versions of hadoop and hbase would be helpful.
>> >> >
>> >> >
>> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com> wrote:
>> >> >
>> >> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode, 4
>> >> >> datanode.
>> >> >> and the hbase master is on the same machine with namenode, 4 hbase
>> >> >> slave on datanode machine.
>> >> >> I found average requests per seconds is about 10,000. and the
>> clusters
>> >> >> crashed. and I found the reason is one datanode failed.
>> >> >>
>> >> >> the datanode configuration is about 4 cpu core and 10GB memory
>> >> >> is my cluster overloaded?
>> >> >>
>> >>
>>

Re: is my hbase cluster overloaded?

Posted by Azuryy Yu <az...@gmail.com>.
One big possible issue is highly concurrent requests against HDFS or HBase:
all datanode handler threads become busy, further requests queue up behind
them, and eventually they time out. You can try increasing
dfs.datanode.handler.count and dfs.namenode.handler.count in hdfs-site.xml
and then restarting HDFS; a sketch of that change follows below.
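
As a rough sketch of that hdfs-site.xml change, the two properties would
look something like this; the values 10 and 30 are illustrative assumptions,
not tuned recommendations for this cluster, and both the namenode and the
datanodes need a restart to pick them up:

<property>
  <name>dfs.datanode.handler.count</name>
  <value>10</value>
  <description>Number of server threads for the datanode.</description>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>30</value>
  <description>Number of server threads for the namenode.</description>
</property>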

Also, have you set explicit JVM options for the datanodes, the namenode, and
the region servers? If they are all left at the defaults, that alone can
cause this kind of issue; an illustrative example follows below.
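
For illustration only: region server JVM options are typically set through
HBASE_REGIONSERVER_OPTS in hbase-env.sh, and the HDFS daemons have the
analogous HADOOP_NAMENODE_OPTS and HADOOP_DATANODE_OPTS in hadoop-env.sh.
The heap size, GC flags, and log path below are placeholder assumptions that
would have to be sized to the actual machines:

# hbase-env.sh (illustrative values, not a recommendation)
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms8g -Xmx8g \
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"

A larger young generation (for example -XX:NewSize=1g -XX:MaxNewSize=1g) is
often suggested for write-heavy region servers, but the right value depends
on the workload.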




On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fa...@gmail.com> wrote:

> my cluster setup: both 6 machines are virtual machine. each machine:
> 4CPU Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 16GB memory
> 192.168.10.48 namenode/jobtracker
> 192.168.10.47 secondary namenode
> 192.168.10.45 datanode/tasktracker
> 192.168.10.46 datanode/tasktracker
> 192.168.10.49 datanode/tasktracker
> 192.168.10.50 datanode/tasktracker
>
> hdfs logs around 20:33
> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
> datanode crash first)
> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
>
> hbase logs around 20:33
> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
> 192.168.10.50 region log http://pastebin.com/bD6h5T6p(very strange,
> not log at 20:33, but have log at 20:32 and 20:34)
>
> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
> > Can you post more of the data node log, around 20:33 ?
> >
> > Cheers
> >
> >
> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
> >
> >> hadoop 1.0
> >> hbase 0.94.11
> >>
> >> datanode log from 192.168.10.45. why it shut down itself?
> >>
> >> 2014-04-21 20:33:59,309 INFO
> >> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> >> blk_-7969006819959471805_202154 received exception
> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> >> left.
> >> 2014-04-21 20:33:59,310 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(192.168.10.45:50010,
> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> >> left.
> >>         at
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >>         at
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >>         at
> >> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >>         at
> >> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >>         at
> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> >>         at
> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >>         at java.lang.Thread.run(Thread.java:722)
> >> 2014-04-21 20:33:59,310 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(192.168.10.45:50010,
> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.io.InterruptedIOException: Interruped while waiting for IO on
> >> channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout
> >> left.
> >>         at
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >>         at
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
> >>         at
> >>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >>         at
> >>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >>         at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >>         at java.lang.Thread.run(Thread.java:722)
> >> 2014-04-21 20:34:00,291 INFO
> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
> >> threadgroup to exit, active threads is 0
> >> 2014-04-21 20:34:00,404 INFO
> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> >> Shutting down all async disk service threads...
> >> 2014-04-21 20:34:00,405 INFO
> >> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
> >> async disk service threads have been shut down.
> >> 2014-04-21 20:34:00,413 INFO
> >> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
> >> 2014-04-21 20:34:00,424 INFO
> >> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> >> /************************************************************
> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
> >> ************************************************************/
> >>
> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
> >> > bq. one datanode failed
> >> >
> >> > Was the crash due to out of memory error ?
> >> > Can you post the tail of data node log on pastebin ?
> >> >
> >> > Giving us versions of hadoop and hbase would be helpful.
> >> >
> >> >
> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com> wrote:
> >> >
> >> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode, 4
> >> >> datanode.
> >> >> and the hbase master is on the same machine with namenode, 4 hbase
> >> >> slave on datanode machine.
> >> >> I found average requests per seconds is about 10,000. and the
> clusters
> >> >> crashed. and I found the reason is one datanode failed.
> >> >>
> >> >> the datanode configuration is about 4 cpu core and 10GB memory
> >> >> is my cluster overloaded?
> >> >>
> >>
>

Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
My cluster setup: all 6 machines are virtual machines. Each machine has
4 CPU cores (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz) and 16GB memory.
192.168.10.48 namenode/jobtracker
192.168.10.47 secondary namenode
192.168.10.45 datanode/tasktracker
192.168.10.46 datanode/tasktracker
192.168.10.49 datanode/tasktracker
192.168.10.50 datanode/tasktracker

hdfs logs around 20:33
192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
datanode crash first)
192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
192.168.10.50 datanode log http://pastebin.com/V7C6tjpB

hbase logs around 20:33
192.168.10.48 master log http://pastebin.com/2ZfeYA1p
192.168.10.45 region log http://pastebin.com/idCF2a7Y
192.168.10.46 region log http://pastebin.com/WEh4dA0f
192.168.10.49 region log http://pastebin.com/cGtpbTLz
192.168.10.50 region log http://pastebin.com/bD6h5T6p (strangely, there is
no log entry at 20:33, but there are entries at 20:32 and 20:34)

On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:
> Can you post more of the data node log, around 20:33 ?
>
> Cheers
>
>
> On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
>
>> hadoop 1.0
>> hbase 0.94.11
>>
>> datanode log from 192.168.10.45. why it shut down itself?
>>
>> 2014-04-21 20:33:59,309 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>> blk_-7969006819959471805_202154 received exception
>> java.io.InterruptedIOException: Interruped while waiting for IO on
>> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
>> left.
>> 2014-04-21 20:33:59,310 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(192.168.10.45:50010,
>> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>> infoPort=50075, ipcPort=50020):DataXceiver
>> java.io.InterruptedIOException: Interruped while waiting for IO on
>> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
>> left.
>>         at
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>         at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>         at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>         at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>         at java.io.DataInputStream.read(DataInputStream.java:149)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>>         at java.lang.Thread.run(Thread.java:722)
>> 2014-04-21 20:33:59,310 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(192.168.10.45:50010,
>> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>> infoPort=50075, ipcPort=50020):DataXceiver
>> java.io.InterruptedIOException: Interruped while waiting for IO on
>> channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout
>> left.
>>         at
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>         at
>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>>         at
>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>         at
>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>>         at java.lang.Thread.run(Thread.java:722)
>> 2014-04-21 20:34:00,291 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
>> threadgroup to exit, active threads is 0
>> 2014-04-21 20:34:00,404 INFO
>> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
>> Shutting down all async disk service threads...
>> 2014-04-21 20:34:00,405 INFO
>> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
>> async disk service threads have been shut down.
>> 2014-04-21 20:34:00,413 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
>> 2014-04-21 20:34:00,424 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
>> /************************************************************
>> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
>> ************************************************************/
>>
>> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
>> > bq. one datanode failed
>> >
>> > Was the crash due to out of memory error ?
>> > Can you post the tail of data node log on pastebin ?
>> >
>> > Giving us versions of hadoop and hbase would be helpful.
>> >
>> >
>> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com> wrote:
>> >
>> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode, 4
>> >> datanode.
>> >> and the hbase master is on the same machine with namenode, 4 hbase
>> >> slave on datanode machine.
>> >> I found average requests per seconds is about 10,000. and the clusters
>> >> crashed. and I found the reason is one datanode failed.
>> >>
>> >> the datanode configuration is about 4 cpu core and 10GB memory
>> >> is my cluster overloaded?
>> >>
>>

Re: is my hbase cluster overloaded?

Posted by Azuryy Yu <az...@gmail.com>.
Can you post the whole log on pastebin?


On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yu...@gmail.com> wrote:

> Can you post more of the data node log, around 20:33 ?
>
> Cheers
>
>
> On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:
>
> > hadoop 1.0
> > hbase 0.94.11
> >
> > datanode log from 192.168.10.45. why it shut down itself?
> >
> > 2014-04-21 20:33:59,309 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > blk_-7969006819959471805_202154 received exception
> > java.io.InterruptedIOException: Interruped while waiting for IO on
> > channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> > left.
> > 2014-04-21 20:33:59,310 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(192.168.10.45:50010,
> > storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> > infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.InterruptedIOException: Interruped while waiting for IO on
> > channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> > left.
> >         at
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >         at
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >         at
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >         at
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >         at
> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >         at java.io.DataInputStream.read(DataInputStream.java:149)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >         at java.lang.Thread.run(Thread.java:722)
> > 2014-04-21 20:33:59,310 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(192.168.10.45:50010,
> > storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> > infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.InterruptedIOException: Interruped while waiting for IO on
> > channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout
> > left.
> >         at
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >         at
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
> >         at
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >         at
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >         at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >         at java.lang.Thread.run(Thread.java:722)
> > 2014-04-21 20:34:00,291 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
> > threadgroup to exit, active threads is 0
> > 2014-04-21 20:34:00,404 INFO
> > org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> > Shutting down all async disk service threads...
> > 2014-04-21 20:34:00,405 INFO
> > org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
> > async disk service threads have been shut down.
> > 2014-04-21 20:34:00,413 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
> > 2014-04-21 20:34:00,424 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> > /************************************************************
> > SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
> > ************************************************************/
> >
> > On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
> > > bq. one datanode failed
> > >
> > > Was the crash due to out of memory error ?
> > > Can you post the tail of data node log on pastebin ?
> > >
> > > Giving us versions of hadoop and hbase would be helpful.
> > >
> > >
> > > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com> wrote:
> > >
> > >> I have a small hbase cluster with 1 namenode, 1 secondary namenode, 4
> > >> datanode.
> > >> and the hbase master is on the same machine with namenode, 4 hbase
> > >> slave on datanode machine.
> > >> I found average requests per seconds is about 10,000. and the clusters
> > >> crashed. and I found the reason is one datanode failed.
> > >>
> > >> the datanode configuration is about 4 cpu core and 10GB memory
> > >> is my cluster overloaded?
> > >>
> >
>

Re: is my hbase cluster overloaded?

Posted by Ted Yu <yu...@gmail.com>.
Can you post more of the datanode log from around 20:33?

Cheers


On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fa...@gmail.com> wrote:

> hadoop 1.0
> hbase 0.94.11
>
> datanode log from 192.168.10.45. why it shut down itself?
>
> 2014-04-21 20:33:59,309 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_-7969006819959471805_202154 received exception
> java.io.InterruptedIOException: Interruped while waiting for IO on
> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> left.
> 2014-04-21 20:33:59,310 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(192.168.10.45:50010,
> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.InterruptedIOException: Interruped while waiting for IO on
> channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
> left.
>         at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>         at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>         at java.lang.Thread.run(Thread.java:722)
> 2014-04-21 20:33:59,310 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(192.168.10.45:50010,
> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.InterruptedIOException: Interruped while waiting for IO on
> channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout
> left.
>         at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>         at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>         at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>         at java.lang.Thread.run(Thread.java:722)
> 2014-04-21 20:34:00,291 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
> threadgroup to exit, active threads is 0
> 2014-04-21 20:34:00,404 INFO
> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> Shutting down all async disk service threads...
> 2014-04-21 20:34:00,405 INFO
> org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
> async disk service threads have been shut down.
> 2014-04-21 20:34:00,413 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
> 2014-04-21 20:34:00,424 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
> ************************************************************/
>
> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
> > bq. one datanode failed
> >
> > Was the crash due to out of memory error ?
> > Can you post the tail of data node log on pastebin ?
> >
> > Giving us versions of hadoop and hbase would be helpful.
> >
> >
> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com> wrote:
> >
> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode, 4
> >> datanode.
> >> and the hbase master is on the same machine with namenode, 4 hbase
> >> slave on datanode machine.
> >> I found average requests per seconds is about 10,000. and the clusters
> >> crashed. and I found the reason is one datanode failed.
> >>
> >> the datanode configuration is about 4 cpu core and 10GB memory
> >> is my cluster overloaded?
> >>
>

Re: is my hbase cluster overloaded?

Posted by Li Li <fa...@gmail.com>.
hadoop 1.0
hbase 0.94.11

Datanode log from 192.168.10.45. Why did it shut itself down?

2014-04-21 20:33:59,309 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-7969006819959471805_202154 received exception
java.io.InterruptedIOException: Interruped while waiting for IO on
channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
left.
2014-04-21 20:33:59,310 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.10.45:50010,
storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
infoPort=50075, ipcPort=50020):DataXceiver
java.io.InterruptedIOException: Interruped while waiting for IO on
channel java.nio.channels.SocketChannel[closed]. 0 millis timeout
left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
        at java.lang.Thread.run(Thread.java:722)
2014-04-21 20:33:59,310 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.10.45:50010,
storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
infoPort=50075, ipcPort=50020):DataXceiver
java.io.InterruptedIOException: Interruped while waiting for IO on
channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout
left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
        at java.lang.Thread.run(Thread.java:722)
2014-04-21 20:34:00,291 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
threadgroup to exit, active threads is 0
2014-04-21 20:34:00,404 INFO
org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
Shutting down all async disk service threads...
2014-04-21 20:34:00,405 INFO
org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All
async disk service threads have been shut down.
2014-04-21 20:34:00,413 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2014-04-21 20:34:00,424 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
************************************************************/

On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
> bq. one datanode failed
>
> Was the crash due to out of memory error ?
> Can you post the tail of data node log on pastebin ?
>
> Giving us versions of hadoop and hbase would be helpful.
>
>
> On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com> wrote:
>
>> I have a small hbase cluster with 1 namenode, 1 secondary namenode, 4
>> datanode.
>> and the hbase master is on the same machine with namenode, 4 hbase
>> slave on datanode machine.
>> I found average requests per seconds is about 10,000. and the clusters
>> crashed. and I found the reason is one datanode failed.
>>
>> the datanode configuration is about 4 cpu core and 10GB memory
>> is my cluster overloaded?
>>

Re: is my hbase cluster overloaded?

Posted by Ted Yu <yu...@gmail.com>.
bq. one datanode failed

Was the crash due to an out-of-memory error?
Can you post the tail of the datanode log on pastebin?

Giving us the versions of Hadoop and HBase would be helpful.


On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fa...@gmail.com> wrote:

> I have a small hbase cluster with 1 namenode, 1 secondary namenode, 4
> datanode.
> and the hbase master is on the same machine with namenode, 4 hbase
> slave on datanode machine.
> I found average requests per seconds is about 10,000. and the clusters
> crashed. and I found the reason is one datanode failed.
>
> the datanode configuration is about 4 cpu core and 10GB memory
> is my cluster overloaded?
>