Posted to jira@kafka.apache.org by "Haruki Okada (Jira)" <ji...@apache.org> on 2021/07/13 02:06:00 UTC

[jira] [Comment Edited] (KAFKA-9648) kafka server should resize backlog when create serversocket

    [ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379515#comment-17379515 ] 

Haruki Okada edited comment on KAFKA-9648 at 7/13/21, 2:05 AM:
---------------------------------------------------------------

Hi.

We operate a Kafka cluster in our company with 130+ brokers, 1.3M+ total socket server connections, and 25K+ partitions.
 We recently faced an issue very similar to https://issues.apache.org/jira/browse/KAFKA-9211 (i.e. producers mysteriously slowing down and sending small TCP packets), and we figured out the cause.
 We hope our findings help.
h2. Environment:
 * Kafka broker version: 2.4.1 (but we suppose the version doesn't matter)
 * Kafka broker OS: CentOS7 (kernel 3.10.X)

h2. Phenomenon:
 * Restart a broker process, then execute preferred leader election after the broker becomes in-sync
 * Some producers' node-request-latency to the broker becomes insanely higher than usual
 ** However, no such high produce response latency is observed in broker-side metrics
 * As a result, producer batches can't be sent out at a sufficient pace, which causes batch expiration

h2. Analysis:
 * We observed that the TCP SYN cookies metric increased at the time of the incident, with the following dmesg message:
 ** 
{code:java}
TCP: request_sock_TCP: Possible SYN flooding on port 22991. Sending cookies.  Check SNMP counters.{code}

 * So we also suspected the phenomenon was due to the `wscale` drop (as described in this issue), but we doubted it because:
 ** Even with TCP SYN cookies, `wscale` should still be available as long as TCP timestamps are enabled. (refs: [https://blog.cloudflare.com/syn-packet-handling-in-the-wild/])
 ** In our environment, TCP timestamps are enabled.
 ** Also, generally, `wscale` (window scaling factor) is used for extending the window beyond 65535 (the max window size in the TCP spec) in large round-trip-time environments such as the internet
 *** Our typical produce request size is much smaller than that
 *** So it's hard to imagine that a `wscale` drop alone causes such significant request-latency degradation
 * After several attempts to reproduce the issue, we found that the receiver's (i.e. the broker's, in this context) `wscale` was inconsistent between producer and broker at the time of the incident.
 ** receiver's `wscale` advertised from broker -> producer in the SYN+ACK packet: 1
 *** 
{code:java}
17:32:05.161213 IP BROKER.HOST.XXXX > CLIENT.YYYY: Flags [S.], seq 29292755, ack 17469019, win 28960, options [mss 1460,sackOK,TS val 25589601 ecr 9055245,nop,wscale 1], length 0{code}

 *** (seq numbers are substituted with random values)
 ** actual receiver's `wscale` on the broker after the connection is established: 7
 *** 
{code:java}
[user@BROKER ~]$ /sbin/ss -e -i -t | less
...
ESTAB      0      0      BROKER.XXXX                CLIENT.YYYY                 timer:(keepalive,21min,0) uid:503 ino:15143860 sk:ffff9ba25dc4f440 <->         ts sack cubic wscale:7,7 rto:201 rtt:0.179/0.006 ato:140 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:97054 bytes_received:18950537 segs_out:15094 segs_in:13941 send 647.2Mbps lastsnd:217 lastrcv:17 lastack:217 pacing_rate 1288.0Mbps rcv_rtt:1.875 rcv_space:29200{code}

 *** `wscale:7,7` means that the broker's receive window scaling factor is 7
 * This inconsistency explains the phenomenon as follows:
 ** Premise: when `wscale` is enabled, the effective TCP window size is calculated as `window_size * 2^wscale`
 ** When the broker calculates the advertised window size, it right-shifts it by its own `wscale` (== 7)
 *** [https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_output.c#L290]
 ** On the other hand, the producer scales the advertised window size up by 2^1, the `wscale` conveyed in the SYN+ACK
 ** As a result, the effective window size became 2^6 = 64 times smaller than intended (a worked example follows this list)
 ** The producer then splits TCP packets into much smaller sizes than usual (possibly under the MSS)
 *** TCP ACKs are delayed because the conditions for sending an immediate ACK are not met ([https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_input.c#L4760])
 * The last remaining question was why such a `wscale` inconsistency happened
 ** Reading through the kernel source code, we found an issue in the logic that calculates `wscale` when a TCP connection is established through SYN cookies
 *** It's fixed in this commit: [https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417]
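
To make the arithmetic above concrete, here is a minimal Java sketch of the scaling mismatch. This is only a sketch: the 64 KiB window is an illustrative value, while the two `wscale` values (7 and 1) come from the observations above.
{code:java}
// Sketch of the wscale mismatch: the broker scales its window down by its real
// wscale (7), but the producer scales it back up by the wscale it learned from
// the SYN+ACK (1), so the effective window shrinks by 2^6 = 64.
public class WscaleMismatch {
    public static void main(String[] args) {
        int brokerWindowBytes = 64 * 1024; // window the broker intends to advertise (illustrative)
        int brokerWscale = 7;              // actual scaling factor on the broker (ss: wscale:7,7)
        int advertisedWscale = 1;          // scaling factor conveyed in the SYN+ACK (tcpdump: wscale 1)

        // The broker right-shifts by its own wscale before writing the 16-bit window field.
        int windowField = brokerWindowBytes >> brokerWscale;        // 512

        // The producer left-shifts by the wscale it was told about.
        int windowSeenByProducer = windowField << advertisedWscale; // 1024

        System.out.printf("broker intends %d bytes, producer sees %d bytes (%dx smaller)%n",
                brokerWindowBytes, windowSeenByProducer, brokerWindowBytes / windowSeenByProducer);
    }
}{code}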

h2. Solution:
 * 1. Upgrade the Linux kernel to at least v5.10 (where the patch is committed)
 * 2. Disable SYN cookies
 ** i.e. set the `net.ipv4.tcp_syncookies` kernel parameter to 0
 ** With SYN cookies disabled, some SYNs are dropped in the same situation, but typically this is not a serious problem thanks to SYN retries; clients should reconnect soon.
 * 3. Adjust the backlog size (as this ticket suggests; see the sketch after this list)
 ** Even though disabling SYN cookies works thanks to SYN retries, it still adds some delay to TCP connection establishment when a reconnect happens.
 ** So generally it's preferable to adjust the backlog size as well, to avoid SYN drops in the first place.
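
For illustration, here is a minimal Java sketch of what passing a configurable backlog to the listener socket could look like. The hard-coded 512 stands in for the proposed broker config (whose name is not decided yet), and port 9092 is just an example.
{code:java}
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class ListenBacklogSketch {
    public static void main(String[] args) throws Exception {
        // Would come from the new broker config instead of being hard-coded.
        int backlog = 512;

        ServerSocketChannel serverChannel = ServerSocketChannel.open();
        // ServerSocket.bind(SocketAddress, int) requests a larger accept backlog than the
        // JDK default of 50; the kernel still caps it at net.core.somaxconn.
        serverChannel.socket().bind(new InetSocketAddress("0.0.0.0", 9092), backlog);

        System.out.println("Listening with requested backlog " + backlog);
        serverChannel.close();
    }
}{code}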

 

I think we should add a new config to adjust the backlog size, since the preferred backlog size depends on the environment and it's hard to decide on a globally appropriate fixed value.

If you don't mind, I'm ready to submit a patch that adds such a config.

What do you think?

 

Thanks,



> kafka server should resize backlog when create serversocket
> -----------------------------------------------------------
>
>                 Key: KAFKA-9648
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9648
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.10.0.1
>            Reporter: li xiangyuan
>            Priority: Minor
>
> I described a mysterious problem in https://issues.apache.org/jira/browse/KAFKA-9211. In that issue I found that the kafka server will trigger TCP congestion control under some conditions. Finally we found the root cause.
> When the kafka server restarts for any reason and preferred replica leader election is then executed, lots of replica leaderships are given back to it, triggering a cluster metadata update. Then all clients establish connections to this server. At that moment many TCP connection requests are waiting in the TCP SYN queue, and then in the accept queue. 
> kafka creates its serversocket in SocketServer.scala 
>  
> {code:java}
> serverChannel.socket.bind(socketAddress);{code}
> This method has a second parameter, "backlog"; min(backlog, tcp_max_syn_backlog) will decide the queue length. Because kafka doesn't set it, it is the default value of 50.
> If this queue is full and tcp_syncookies = 0, new connection requests will be rejected. If tcp_syncookies = 1, it triggers the TCP syncookie mechanism. This mechanism lets linux handle more TCP SYN requests, but it loses many TCP options, including "wscale", the one that allows a TCP connection to send many more bytes per TCP packet. Because syncookies were triggered, wscale is lost, and this TCP connection will handle network traffic very slowly, forever, until the connection is closed and another TCP connection is established.
> So after a preferred replica election is executed, lots of new TCP connections are established without wscale set, and much of the network traffic to this server becomes very slow.
> I'm not sure whether newer linux versions have resolved this problem, but kafka should also set the backlog to a larger value. We have now changed it to 512 and everything seems OK.
>  


