Posted to dev@hbase.apache.org by 谢良 <xi...@xiaomi.com> on 2014/01/16 09:24:13 UTC

Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)

Just curious, what's your Hadoop version, Vladimir?
At least on Hadoop 2.0+, the default ReplaceDatanode policy is expected to pick another DN to set up the pipeline, so if you have only 1 DN broken, writes should still go to 3 nodes successfully, and then HBase's "hbase.regionserver.hlog.tolerable.lowreplication" check will not kick in :)
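
For reference, here is a rough sketch of the settings in question (the key names should match Hadoop 2.x and HBase 0.94, but double-check your release; the values shown are illustrative defaults, not a recommendation):

  <!-- hdfs-site.xml (client side), Hadoop 2.0+ only: replace a failed datanode in the write pipeline -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>DEFAULT</value>
  </property>

  <!-- hbase-site.xml: minimum WAL pipeline replicas before the region server requests a log roll -->
  <property>
    <name>hbase.regionserver.hlog.tolerable.lowreplication</name>
    <value>3</value>
  </property>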

Thanks,
________________________________________
From: Vladimir Rodionov [vladrodionov@gmail.com]
Sent: January 16, 2014 14:45
To: dev@hbase.apache.org
Cc: lars hofhansl
Subject: Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)

This is what I found in an RS log:
2014-01-16 01:22:18,256 ResponseProcessor for block
blk_5619307008368309102_2603 WARN  [DFSClient] DFSOutputStream
ResponseProcessor exception  for block
blk_5619307008368309102_2603java.io.IOException: Bad response 1 for block
blk_5619307008368309102_2603 from datanode 10.38.106.234:50010
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2977)

2014-01-16 01:22:18,258 DataStreamer for file
/hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626
WARN  [DFSClient] Error Recovery for block blk_5619307008368309102_2603 bad
datanode[2] 10.38.106.234:50010
2014-01-16 01:22:18,258 DataStreamer for file
/hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626
WARN  [DFSClient] Error Recovery for block blk_5619307008368309102_2603 in
pipeline 10.10.25.199:50010, 10.40.249.135:50010, 10.38.106.234:50010: bad
datanode 10.38.106.234:50010
2014-01-16 01:22:22,800 IPC Server handler 10 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,806 IPC Server handler 2 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,808 IPC Server handler 28 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,808 IPC Server handler 13 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,808 IPC Server handler 27 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,811 IPC Server handler 22 on 60020 WARN  [HLog] Too
many consecutive RollWriter requests, it's a sign of the total number of
live datanodes is lower than the tolerable replicas.
2014-01-16 01:22:22,911 IPC Server handler 8 on 60020 INFO  [HLog]
LowReplication-Roller was enabled.
2014-01-16 01:22:22,930 regionserver60020.cacheFlusher INFO  [HRegion]
Finished memstore flush of ~128.3m/134538640, currentsize=3.0m/3113200 for
region usertable,,1389844429593.d4843a72f02a7396244930162fbecd06. in
68096ms, sequenceid=108753, compaction requested=false
2014-01-16 01:22:22,930 regionserver60020.logRoller INFO  [FSUtils]
FileSystem doesn't support getDefaultReplication
2014-01-16 01:22:22,930 regionserver60020.logRoller INFO  [FSUtils]
FileSystem doesn't support getDefaultBlockSize
2014-01-16 01:22:23,027 regionserver60020.logRoller INFO  [HLog] Roll
/hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626,
entries=1012, filesize=140440002.  for
/hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853342930
2014-01-16 01:22:23,194 IPC Server handler 23 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68410,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@51ff528e),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274560,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,401 IPC Server handler 13 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68813,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4e136610),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274586,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,609 IPC Server handler 1 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":69002,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@51390a8),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274604,"queuetimems":1,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,629 IPC Server handler 20 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68991,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@5f125a0f),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274635,"queuetimems":1,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,656 IPC Server handler 27 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68835,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@2dd6bf8c),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274818,"queuetimems":1,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,657 IPC Server handler 19 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68982,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6db997d6),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274673,"queuetimems":1,"class":"HRegionServer","responsesize":0,"method":"multi"}


There are 10 DNs and all of them are pretty much alive. Replication factor
is 2 (dfs.replication in hdfs-site.xml).
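
For completeness, that is this single property in hdfs-site.xml (just restating the value above):

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>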



On Wed, Jan 15, 2014 at 9:55 PM, 谢良 <xi...@xiaomi.com> wrote:

> It would be better if you could provide some thread dumps while the stalls
> happened.
>
> Thanks,
> Liang
> ________________________________________
> From: Vladimir Rodionov [vladrodionov@gmail.com]
> Sent: January 16, 2014 13:49
> To: dev@hbase.apache.org; lars hofhansl
> Subject: Re: HBase 0.94.15: writes stalls periodically even under moderate
> steady load (AWS EC2)
>
> It's not IO, CPU or Network - it's HBase. Stalls repeat periodically. Any
> particular message in a log file I should look for?
>
>
> On Wed, Jan 15, 2014 at 9:17 PM, lars hofhansl <la...@apache.org> wrote:
>
> > So where's the bottleneck? You say it's not IO, nor is it CPU, I presume.
> > Network? Are the writers blocked because there are too many storefiles?
> > (in which case you maxed out your storage IO)
> > Are you hotspotting a region server?
> >
> > From the stacktrace it looks like ycsb is doing single puts, each
> > incurring an RPC. You're testing AWS' network :)
> >
> >
> > I write 10-20k (small) rows per second in bulk on a single box for
> testing
> > all the time.
> > With 3-way replication a 5-node cluster is pretty puny. Each box will get
> > 60% of each write on average (3 replicas over 5 nodes), just to state the obvious.
> >
> > As I said, if it's slow, I'd love to see where the bottleneck is, so that
> > we can fix it, if it is something we can fix in HBase.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Vladimir Rodionov <vl...@gmail.com>
> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > Sent: Wednesday, January 15, 2014 5:32 PM
> > Subject: Re: HBase 0.94.15: writes stalls periodically even under
> moderate
> > steady load (AWS EC2)
> >
> >
> > Yes, I am using ephemeral (local) storage. I found that iostat is idle most
> > of the time at 3K load, with periodic bursts up to 10% iowait. 3-4K is
> > probably the maximum this skinny cluster can sustain w/o additional
> > configuration tweaking. I will try more powerful instances, of course, but
> > the beauty of m1.xlarge is the ~$0.05 spot-market price. A 5-node cluster
> > (+1) is ~$7 per day. Good for experiments, but definitely not for real
> > testing.
> >
> > -Vladimir Rodionov
> >
> >
> >
> > On Wed, Jan 15, 2014 at 3:27 PM, Andrew Purtell <ap...@apache.org>
> > wrote:
> >
> > > Also I assume your HDFS is provisioned on locally attached disk, aka
> > > instance store, and not EBS?
> > >
> > >
> > > On Wed, Jan 15, 2014 at 3:26 PM, Andrew Purtell <ap...@apache.org>
> > > wrote:
> > >
> > > > m1.xlarge is a poorly provisioned instance type, with low PPS at the
> > > > network layer. Can you try a type advertised to have "high" I/O
> > > > performance?
> > > >
> > > >
> > > > On Wed, Jan 15, 2014 at 12:33 PM, Vladimir Rodionov <
> > > > vrodionov@carrieriq.com> wrote:
> > > >
> > > >> This is something that definitely needs to be solved/fixed/resolved.
> > > >>
> > > >> I am running the YCSB benchmark on AWS EC2 on a small HBase cluster:
> > > >>
> > > >> 5 (m1.xlarge) as RS
> > > >> 1 (m1.xlarge) hbase-master, zookeeper
> > > >>
> > > >> Whirr 0.8.2 (with many hacks) is used to provision HBase.
> > > >>
> > > >> I am running 1 ycsb client (100% insert ops) throttled at 5K ops:
> > > >>
> > > >> ./bin/ycsb load hbase -P workloads/load20m -p columnfamily=family -s
> > > >> -threads 10 -target 5000
> > > >>
> > > >> OUTPUT:
> > > >>
> > > >> 1120 sec: 5602339 operations; 4999.7 current ops/sec; [INSERT
> > > >> AverageLatency(us)=225.53]
> > > >>  1130 sec: 5652117 operations; 4969.35 current ops/sec; [INSERT
> > > >> AverageLatency(us)=203.31]
> > > >>  1140 sec: 5665210 operations; 1309.04 current ops/sec; [INSERT
> > > >> AverageLatency(us)=17.13]
> > > >>  1150 sec: 5665210 operations; 0 current ops/sec;
> > > >>  1160 sec: 5665210 operations; 0 current ops/sec;
> > > >>  1170 sec: 5665210 operations; 0 current ops/sec;
> > > >>  1180 sec: 5665210 operations; 0 current ops/sec;
> > > >>  1190 sec: 5665210 operations; 0 current ops/sec;
> > > >> 2014-01-15 15:19:34,139 Thread-2 WARN
> > > >>  [HConnectionManager$HConnectionImplementation] Failed all from
> > > >>
> > >
> >
> region=usertable,user6039,1389811852201.40518862106856d23b883e5d543d0b89.,
> > > >> hostname=ip-10-45-174-120.ec2.internal, port=60020
> > > >> java.util.concurrent.ExecutionException:
> > > java.net.SocketTimeoutException:
> > > >> Call to ip-10-45-174-120.ec2.internal/10.45.174.120:60020 failed on
> > > >> socket timeout exception: java.net.SocketTimeoutException: 60000
> > millis
> > > >> timeout while waiting for channel to be ready for read. ch :
> > > >> java.nio.channels.SocketChannel[connected local=/
> 10.180.211.173:42466
> > > remote=ip-10-45-174-120.ec2.internal/
> > > >> 10.45.174.120:60020]
> > > >>         at
> > > >> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
> > > >>         at java.util.concurrent.FutureTask.get(FutureTask.java:111)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1708)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1560)
> > > >>         at
> > > >> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:994)
> > > >>         at
> > org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:850)
> > > >>         at
> org.apache.hadoop.hbase.client.HTable.put(HTable.java:826)
> > > >>         at
> com.yahoo.ycsb.db.HBaseClient.update(HBaseClient.java:328)
> > > >>         at
> com.yahoo.ycsb.db.HBaseClient.insert(HBaseClient.java:357)
> > > >>         at com.yahoo.ycsb.DBWrapper.insert(DBWrapper.java:148)
> > > >>         at
> > > >>
> com.yahoo.ycsb.workloads.CoreWorkload.doInsert(CoreWorkload.java:461)
> > > >>         at com.yahoo.ycsb.ClientThread.run(Client.java:269)
> > > >> Caused by: java.net.SocketTimeoutException: Call to
> > > >> ip-10-45-174-120.ec2.internal/10.45.174.120:60020 failed on socket
> > > >> timeout exception: java.net.SocketTimeoutException: 60000 millis
> > timeout
> > > >> while waiting for channel to be ready for read. ch :
> > > >> java.nio.channels.SocketChannel[connected local=/
> 10.180.211.173:42466
> > > remote=ip-10-45-174-120.ec2.internal/
> > > >> 10.45.174.120:60020]
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1043)
> > > >>         at
> > > >> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1016)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:87)
> > > >>         at com.sun.proxy.$Proxy5.multi(Unknown Source)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1537)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1535)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:229)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1544)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1532)
> > > >>         at
> > > >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> > > >>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> > > >>         at
> > > >>
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> > > >>         at
> > > >>
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > >>         at java.lang.Thread.run(Thread.java:701)
> > > >>
> > > >>
> > > >> SKIPPED A LOT
> > > >>
> > > >>
> > > >>  1200 sec: 5674180 operations; 896.82 current ops/sec; [INSERT
> > > >> AverageLatency(us)=7506.37]
> > > >>  1210 sec: 6022326 operations; 34811.12 current ops/sec; [INSERT
> > > >> AverageLatency(us)=1998.26]
> > > >>  1220 sec: 6102627 operations; 8018.07 current ops/sec; [INSERT
> > > >> AverageLatency(us)=395.11]
> > > >>  1230 sec: 6152632 operations; 5000 current ops/sec; [INSERT
> > > >> AverageLatency(us)=182.53]
> > > >>  1240 sec: 6202641 operations; 4999.9 current ops/sec; [INSERT
> > > >> AverageLatency(us)=201.76]
> > > >>  1250 sec: 6252642 operations; 4999.6 current ops/sec; [INSERT
> > > >> AverageLatency(us)=190.46]
> > > >>  1260 sec: 6302653 operations; 5000.1 current ops/sec; [INSERT
> > > >> AverageLatency(us)=212.31]
> > > >>  1270 sec: 6352660 operations; 5000.2 current ops/sec; [INSERT
> > > >> AverageLatency(us)=217.77]
> > > >>  1280 sec: 6402731 operations; 5000.1 current ops/sec; [INSERT
> > > >> AverageLatency(us)=195.83]
> > > >>  1290 sec: 6452740 operations; 4999.9 current ops/sec; [INSERT
> > > >> AverageLatency(us)=232.43]
> > > >>  1300 sec: 6502743 operations; 4999.8 current ops/sec; [INSERT
> > > >> AverageLatency(us)=290.52]
> > > >>  1310 sec: 6552755 operations; 5000.2 current ops/sec; [INSERT
> > > >> AverageLatency(us)=259.49]
> > > >>
> > > >>
> > > >> As you can see, there is a ~60 sec total write stall on the cluster,
> > > >> which I suppose correlates 100% with (minor) compactions starting.
> > > >>
> > > >> MAX_FILESIZE = 5GB
> > > >> ## Regions of 'usertable' - 50
> > > >>
> > > >> I would appreciate any advice on how to get rid of these stalls. 5K per
> > > >> sec is quite a moderate load even for 5 lousy AWS servers. Or is it not?
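
(Side note on the "too many storefiles" question earlier in the thread: these are the hbase-site.xml knobs that usually control this kind of write blocking in 0.94. The names should be correct for 0.94, but the values below are only the defaults as I recall them, not a tuning recommendation.)

  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>7</value>  <!-- updates to a region block once one of its stores has this many files -->
  </property>
  <property>
    <name>hbase.hstore.blockingWaitTime</name>
    <value>90000</value>  <!-- max time (ms) updates stay blocked waiting for compactions -->
  </property>
  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>2</value>  <!-- block updates when a memstore reaches multiplier * flush size -->
  </property>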
> > > >>
> > > >> Best regards,
> > > >> Vladimir Rodionov
> > > >> Principal Platform Engineer
> > > >> Carrier IQ, www.carrieriq.com
> > > >> e-mail: vrodionov@carrieriq.com
> > > >>
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein
> > > > (via Tom White)
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > (via Tom White)
> > >
> >
>

Re: Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)

Posted by lars hofhansl <la...@apache.org>.
One other question: did you give the RS the default 1000 MB heap?
For real work you have to increase that. Maybe try 4000 or 8000 MB on those boxes.

(In any case that will not solve any problem with unreachable or unavailable data nodes)


-- Lars



----- Original Message -----
From: Vladimir Rodionov <vl...@gmail.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Cc: lars hofhansl <la...@apache.org>
Sent: Thursday, January 16, 2014 6:46 AM
Subject: Re: Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)

1.0.4 (default).


On Thu, Jan 16, 2014 at 12:24 AM, 谢良 <xi...@xiaomi.com> wrote:

> Just curious, what's your Hadoop version, Vladimir?
> At least on Hadoop 2.0+, the default ReplaceDatanode policy is expected to
> pick another DN to set up the pipeline, so if you have only 1 DN broken,
> writes should still go to 3 nodes successfully, and then HBase's
> "hbase.regionserver.hlog.tolerable.lowreplication" check will not kick in :)
>
> Thanks,

Re: Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)

Posted by Vladimir Rodionov <vl...@gmail.com>.
1.0.4 (default).


On Thu, Jan 16, 2014 at 12:24 AM, 谢良 <xi...@xiaomi.com> wrote:

> Just curious, what's your Hadoop version, Vladimir?
> At least on Hadoop 2.0+, the default ReplaceDatanode policy is expected to
> pick another DN to set up the pipeline, so if you have only 1 DN broken,
> writes should still go to 3 nodes successfully, and then HBase's
> "hbase.regionserver.hlog.tolerable.lowreplication" check will not kick in :)
>
> Thanks,