Posted to users@rocketmq.apache.org by Jason Joo <hb...@163.com> on 2018/06/15 01:39:39 UTC

About batch messages causing many slow logs

Hello,

We have a cluster with a 2m-2s architecture. TPS is between 1,000 and 6,000, typically about 200 million messages in 24 hours.
The commit log is about 200 GB and can be fully cached in memory.

Our servers (one RocketMQ broker each, 4 servers in this cluster):
6-core Xeon E5-2620 x 2 (24 threads total)
256 GB DDR3 1600 MHz
240 GB Intel SSD 530 (SSDSC2BW240H6) x 6 (RAID 1)

To "IMPROVE" producing performance(mainly reduce network packages) we start to use batch sending feature, and we begain to receive so much "slow log" after that(batch is 50~70, 100~1000B per message body).

After some tuning (osPageCacheBusyTimeOutMills, waitTimeMillsInSendQueue to 500 ms, etc.) we only increased the "time cost" printed in the slow log.
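
For clarity, that tuning corresponds to broker.conf entries roughly like the fragment below (illustrative only; I am not sure these are the right values, and the rest of our broker config is unchanged):

# broker.conf fragment (illustrative)
waitTimeMillsInSendQueue=500
osPageCacheBusyTimeOutMills=500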

The pagecache RT in the broker log before using batch sending:

[PAGECACHERT] TotalPut 50650, PutMessageDistributeTime [<=0ms]:44250 [0~10ms]:5865 [10~50ms]:514 [50~100ms]:10 [100~200ms]:11 [200~500ms]:0 [500ms~1s]:0 [1~2s]:0 [2~3s]:0 [3~4s]:0 [4~5s]:0 [5~10s]:0 [10s~]:0 

after:

[PAGECACHERT] TotalPut 1970, PutMessageDistributeTime [<=0ms]:1428 [0~10ms]:459 [10~50ms]:4 [50~100ms]:2 [100~200ms]:5 [200~500ms]:10 [500ms~1s]:10 [1~2s]:25 [2~3s]:24 [3~4s]:3 [4~5s]:0 [5~10s]:0 [10s~]:0



After reading CommitLog I found that the whole batch of messages is appended inside a single lock. So it makes sense that the average execution time of a PUT would increase, but we also receive some "in lock" slow log entries, which trigger flow control for a while:

2018-06-14 21:02:42 WARN SendMessageThread_102 - [NOTIFYME]putMessages in lock cost time(ms)=1373, bodyLength=49380 AppendMessageResult=AppendMessageResult{status=PUT_OK, wroteOffset=68238809866237, wroteBytes=54420, msgId='0A0A63CF00002A9F00003E1016059FFD,0A0A63CF00002A9F00003E101605A37D,0A0A63CF00002A9F00003E101605A6DE,0A0A63CF00002A9F00003E101605A8C3,0A0A63CF00002A9F00003E101605ABA3,0A0A63CF00002A9F00003E101605AE81,0A0A63CF00002A9F00003E101605B066,0A0A63CF00002A9F00003E101605B24B,0A0A63CF00002A9F00003E101605B4CC,0A0A63CF00002A9F00003E101605B7F5,0A0A63CF00002A9F00003E101605BB78,0A0A63CF00002A9F00003E101605BE70,0A0A63CF00002A9F00003E101605C19F,0A0A63CF00002A9F00003E101605C4BF,0A0A63CF00002A9F00003E101605C78B,0A0A63CF00002A9F00003E101605CAE4,0A0A63CF00002A9F00003E101605CDC4,0A0A63CF00002A9F00003E101605D206,0A0A63CF00002A9F00003E101605D581,0A0A63CF00002A9F00003E101605D83B,0A0A63CF00002A9F00003E101605DB5B,0A0A63CF00002A9F00003E101605DED0,0A0A63CF00002A9F00003E101605E1AA,0A0A63CF00002A9F00003E101605E54D,0A0A63CF00002A9F00003E101605E8C7,0A0A63CF00002A9F00003E101605EB92,0A0A63CF00002A9F00003E101605EF0C,0A0A63CF00002A9F00003E101605F0D8,0A0A63CF00002A9F00003E101605F346,0A0A63CF00002A9F00003E101605F5FB,0A0A63CF00002A9F00003E101605F8D6,0A0A63CF00002A9F00003E101605FC51,0A0A63CF00002A9F00003E101605FFD1,0A0A63CF00002A9F00003E101606019D,0A0A63CF00002A9F00003E1016060509,0A0A63CF00002A9F00003E1016060816,0A0A63CF00002A9F00003E1016060B7A,0A0A63CF00002A9F00003E1016060E61,0A0A63CF00002A9F00003E10160611DF,0A0A63CF00002A9F00003E1016061558,0A0A63CF00002A9F00003E10160618A6,0A0A63CF00002A9F00003E1016061B68,0A0A63CF00002A9F00003E1016061E6B,0A0A63CF00002A9F00003E10160621E5,0A0A63CF00002A9F00003E10160624A2,0A0A63CF00002A9F00003E101606281D,0A0A63CF00002A9F00003E1016062AE3,0A0A63CF00002A9F00003E1016062E60,0A0A63CF00002A9F00003E101606313E,0A0A63CF00002A9F00003E101606343A,0A0A63CF00002A9F00003E10160637B5,0A0A63CF00002A9F00003E1016063AE3,0A0A63CF00002A9F00003E1016063E67,0A0A63CF00002A9F00003E1016064218,0A0A63CF00002A9F00003E10160644E0,0A0A63CF00002A9F00003E101606487A,0A0A63CF00002A9F00003E1016064BF4,0A0A63CF00002A9F00003E1016064E61,0A0A63CF00002A9F00003E101606512E,0A0A63CF00002A9F00003E1016065489,0A0A63CF00002A9F00003E101606574F,0A0A63CF00002A9F00003E1016065AE3,0A0A63CF00002A9F00003E1016065CC8,0A0A63CF00002A9F00003E1016065FED,0A0A63CF00002A9F00003E10160662E2,0A0A63CF00002A9F00003E10160664C7,0A0A63CF00002A9F00003E10160667F6,0A0A63CF00002A9F00003E1016066B81,0A0A63CF00002A9F00003E1016066E54,0A0A63CF00002A9F00003E1016067132', storeTimestamp=1528981361218, logicsOffset=5784993480, pagecacheRT=1373, msgNum=70}
2018-06-14 21:02:42 WARN SendMessageThread_102 - not in lock eclipse time(ms)=1374, bodyLength=49380
2018-06-14 21:02:42 WARN SendMessageThread_79 - not in lock eclipse time(ms)=688, bodyLength=274
2018-06-14 21:02:42 WARN SendMessageThread_84 - not in lock eclipse time(ms)=562, bodyLength=48242
2018-06-14 21:02:42 WARN SendMessageThread_107 - not in lock eclipse time(ms)=839, bodyLength=274
2018-06-14 21:02:42 WARN SendMessageThread_97 - not in lock eclipse time(ms)=1199, bodyLength=49469
2018-06-14 21:02:42 WARN SendMessageThread_111 - not in lock eclipse time(ms)=1307, bodyLength=47329
2018-06-14 21:02:42 WARN SendMessageThread_91 - not in lock eclipse time(ms)=786, bodyLength=275
2018-06-14 21:02:42 WARN SendMessageThread_67 - not in lock eclipse time(ms)=807, bodyLength=575
2018-06-14 21:02:42 WARN SendMessageThread_65 - not in lock eclipse time(ms)=1326, bodyLength=47294
2018-06-14 21:02:43 INFO FlushRealTimeService - Flush data to disk costs 2271 ms
2018-06-14 21:02:43 WARN SendMessageThread_116 - not in lock eclipse time(ms)=1542, bodyLength=50039
2018-06-14 21:02:43 WARN SendMessageThread_119 - not in lock eclipse time(ms)=1202, bodyLength=50024
2018-06-14 21:02:43 WARN SendMessageThread_55 - not in lock eclipse time(ms)=1436, bodyLength=907
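
What I mean by "one lock" is roughly the shape below, paraphrased from my reading of CommitLog.putMessages(); it is a simplified sketch, not the exact source:

// Simplified, paraphrased sketch of CommitLog.putMessages() for a MessageExtBatch.
public PutMessageResult putMessages(final MessageExtBatch messageExtBatch) {
    // ... property checks and encoding of the whole batch happen before the lock ...

    AppendMessageResult result;
    putMessageLock.lock(); // one lock is held for the ENTIRE batch
    try {
        MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile();
        // all 50~70 messages are appended while the lock is held, so every other
        // SendMessageThread waits for the whole batch; the elapsed time here is the
        // "putMessages in lock cost time" reported in the slow log above
        result = mappedFile.appendMessages(messageExtBatch, this.appendMessageCallback);
    } finally {
        putMessageLock.unlock();
    }

    // flush and HA replication are handled after the lock is released
    return new PutMessageResult(PutMessageStatus.PUT_OK, result);
}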

So the questions are:
1. Is batch sending recommended?
2. Since it may be "slower" than simple sending, should I increase the in-lock timeout?
3. How can we solve the performance problem (many REJECTREQUEST errors are generated on the producers)? Should we add a new Master to the cluster?



best regards,

Jason


Re: About batch messages causing many slow logs

Posted by dongeforever <do...@apache.org>.
BTW, I am now developing a new end-to-end batch feature in which the batch will not
be decoded anywhere on the way from producer to consumer, which should give higher
throughput and lower latency.
The storage layer is ready, but the network layer is still to be finished.
The code will be shared on GitHub some day when I get some spare time.



Re: About batch messages causing many slow logs

Posted by dongeforever <do...@apache.org>.
Hi,

Here is my understanding; I hope it is useful to you:

1. Batch sending is recommended for data collection, such as website activity
tracking or log aggregation, which requires high throughput and can tolerate
some latency. In contrast, simple sending is better suited to message delivery
in latency-sensitive distributed systems, such as trading or financial
scenarios.

2. If REJECTREQUEST happens, it usually means the broker is carrying a heavy
load. It is suggested to expand the cluster; just as you say, add a new
Master to the cluster.


dongeforever
Best Regards


Re: About batch messages causing many slow logs

Posted by Von Gosling <vo...@apache.org>.
Hi,

Batch messages are designed for the streaming case, in which you can get high throughput.

Best Regards,

