You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Jun Rao (JIRA)" <ji...@apache.org> on 2015/03/08 23:45:38 UTC

[jira] [Commented] (KAFKA-2010) unit tests sometimes are slow during shutdown

    [ https://issues.apache.org/jira/browse/KAFKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352321#comment-14352321 ] 

Jun Rao commented on KAFKA-2010:
--------------------------------

The long delay always happens during the shutdown of a multi-node Kafka cluster with the controlled shutdown disabled. The typical pattern is the following. 

1. Broker X, which is the controller, initiates the shutdown process.
2. The socket server of broker X is shut down.

After step 2, two things are happening in parallel.
A. Another broker is still trying to fetch data from broker X since the leaders on broker X haven't been moved. The replica fetcher thread keeps trying to connect to broker X, but keeps failing. Initially, the failing happens immediately. However, after a few seconds, the connecting will block for the socket connection timeout (configured to 1.5 secs in the unit test) and then fail.
B. In broker X,
B1. The shutdown of the controller is initiated. One of the steps is to shut down the RequestSendThread that's responsible for sending requests from the controller to broker X itself.
B2. The RequestSendThread is in the middle of sending a controller request to broker X. Since the socket server of broker X is already shut down. The sending should fail immediately. However, what happens is that the RequestSendThread blocks on connecting to broker X for the socket connect timeout, which has the default value of 30 secs.

Because of this, the shutdown of broker X is delayed by at least 30 secs. This seems to happen more likely after kafka-1971. Before kafka-1971, we de-register a broker's ZK registration before shutting down the socket server in step 2 above. After kafka-1971, broker de-registers itself from ZK in the last step in the shutdown process (so it happens after shutting down the socket server).

The following is the output from a tcpdump. As you can see, starting from 14:29:24.279037, the connecting from both the replica fetcher thread and the RequestSendThread to broker X's port (58693) starts to fail. Initially, a TCP RESET is sent immediately after a TCP SYN (which matches the TCP spec). However, starting from 14:29:25.142164, no RESET is sent after SYN. Therefore the connecting failure is only detected after the timeout.

It's not very clear to me why RESET is not sent after some time. Anyone has a good explanation of this behavior? A few more details.
(1) This happens on my mac. There is no firewall configured.
(2) The unit test open around 20 consecutive ports.
(3) The reconnect in the replica fetch thread has no backoff (it probably should). So, it will try to reconnect immediately after the previous connecting fails.
(4) The reconnect in the RequestSendThread has a 300ms backoff.
(5) There seems to be a gap between 14:29:24.477485 and 14:29:25.142164 in the tcpdump output. It's not clear to me why since in that time window, both the replica fetch thread and
RequestSendThread are still connecting and failing.

tcpdump output:
14:29:24.279037 IP localhost.58973 > localhost.58693: Flags [S], seq 3455869244, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706836 ecr 0,sackOK,eol], length 0
14:29:24.361327 IP localhost.58974 > localhost.58693: Flags [S], seq 3041388581, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107706906 ecr 0,sackOK,eol], length 0
14:29:24.361376 IP localhost.58693 > localhost.58974: Flags [R.], seq 0, ack 3041388582, win 0, length 0
14:29:24.405285 IP localhost.58977 > localhost.58693: Flags [S], seq 1164560537, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107706945 ecr 0,sackOK,eol], length 0
14:29:24.405384 IP localhost.58693 > localhost.58977: Flags [R.], seq 0, ack 1164560538, win 0, length 0
14:29:24.439743 IP localhost.58973 > localhost.58693: Flags [S], seq 3455869244, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706973 ecr 0,sackOK,eol], length 0
14:29:24.439844 IP localhost.58693 > localhost.58973: Flags [R.], seq 0, ack 3455869245, win 0, length 0
14:29:24.442677 IP localhost.58978 > localhost.58693: Flags [S], seq 1119137787, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706975 ecr 0,sackOK,eol], length 0
14:29:24.442693 IP localhost.58693 > localhost.58978: Flags [R.], seq 0, ack 1119137788, win 0, length 0
14:29:24.443898 IP localhost.58979 > localhost.58693: Flags [S], seq 1843362092, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706977 ecr 0,sackOK,eol], length 0
14:29:24.443912 IP localhost.58693 > localhost.58979: Flags [R.], seq 0, ack 1843362093, win 0, length 0
14:29:24.445085 IP localhost.58980 > localhost.58693: Flags [S], seq 2336267591, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706978 ecr 0,sackOK,eol], length 0
14:29:24.445100 IP localhost.58693 > localhost.58980: Flags [R.], seq 0, ack 2336267592, win 0, length 0
14:29:24.446303 IP localhost.58981 > localhost.58693: Flags [S], seq 3297569415, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706979 ecr 0,sackOK,eol], length 0
14:29:24.446321 IP localhost.58693 > localhost.58981: Flags [R.], seq 0, ack 3297569416, win 0, length 0
14:29:24.447565 IP localhost.58982 > localhost.58693: Flags [S], seq 2742157513, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706980 ecr 0,sackOK,eol], length 0
14:29:24.447575 IP localhost.58693 > localhost.58982: Flags [R.], seq 0, ack 2742157514, win 0, length 0
14:29:24.448901 IP localhost.58983 > localhost.58693: Flags [S], seq 3569996507, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706981 ecr 0,sackOK,eol], length 0
14:29:24.448913 IP localhost.58693 > localhost.58983: Flags [R.], seq 0, ack 3569996508, win 0, length 0
14:29:24.450233 IP localhost.58984 > localhost.58693: Flags [S], seq 2189096629, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706983 ecr 0,sackOK,eol], length 0
14:29:24.450254 IP localhost.58693 > localhost.58984: Flags [R.], seq 0, ack 2189096630, win 0, length 0
14:29:24.451602 IP localhost.58985 > localhost.58693: Flags [S], seq 3668623372, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706984 ecr 0,sackOK,eol], length 0
14:29:24.451616 IP localhost.58693 > localhost.58985: Flags [R.], seq 0, ack 3668623373, win 0, length 0
14:29:24.453369 IP localhost.58986 > localhost.58693: Flags [S], seq 3841327697, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706986 ecr 0,sackOK,eol], length 0
14:29:24.453415 IP localhost.58693 > localhost.58986: Flags [R.], seq 0, ack 3841327698, win 0, length 0
14:29:24.455264 IP localhost.58987 > localhost.58693: Flags [S], seq 1397446064, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706987 ecr 0,sackOK,eol], length 0
14:29:24.455285 IP localhost.58693 > localhost.58987: Flags [R.], seq 0, ack 1397446065, win 0, length 0
14:29:24.456635 IP localhost.58988 > localhost.58693: Flags [S], seq 658704010, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706989 ecr 0,sackOK,eol], length 0
14:29:24.456655 IP localhost.58693 > localhost.58988: Flags [R.], seq 0, ack 658704011, win 0, length 0
14:29:24.457805 IP localhost.58989 > localhost.58693: Flags [S], seq 108288592, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706990 ecr 0,sackOK,eol], length 0
14:29:24.457821 IP localhost.58693 > localhost.58989: Flags [R.], seq 0, ack 108288593, win 0, length 0
14:29:24.458936 IP localhost.58990 > localhost.58693: Flags [S], seq 834439671, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706991 ecr 0,sackOK,eol], length 0
14:29:24.458976 IP localhost.58693 > localhost.58990: Flags [R.], seq 0, ack 834439672, win 0, length 0
14:29:24.460094 IP localhost.58991 > localhost.58693: Flags [S], seq 3341628230, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706992 ecr 0,sackOK,eol], length 0
14:29:24.460108 IP localhost.58693 > localhost.58991: Flags [R.], seq 0, ack 3341628231, win 0, length 0
14:29:24.461227 IP localhost.58992 > localhost.58693: Flags [S], seq 1072784276, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706993 ecr 0,sackOK,eol], length 0
14:29:24.461242 IP localhost.58693 > localhost.58992: Flags [R.], seq 0, ack 1072784277, win 0, length 0
14:29:24.463193 IP localhost.58993 > localhost.58693: Flags [S], seq 3613406588, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706994 ecr 0,sackOK,eol], length 0
14:29:24.463213 IP localhost.58693 > localhost.58993: Flags [R.], seq 0, ack 3613406589, win 0, length 0
14:29:24.464639 IP localhost.58994 > localhost.58693: Flags [S], seq 2569113338, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706995 ecr 0,sackOK,eol], length 0
14:29:24.464658 IP localhost.58693 > localhost.58994: Flags [R.], seq 0, ack 2569113339, win 0, length 0
14:29:24.465911 IP localhost.58995 > localhost.58693: Flags [S], seq 897053363, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706996 ecr 0,sackOK,eol], length 0
14:29:24.465924 IP localhost.58693 > localhost.58995: Flags [R.], seq 0, ack 897053364, win 0, length 0
14:29:24.467197 IP localhost.58996 > localhost.58693: Flags [S], seq 1556778022, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706997 ecr 0,sackOK,eol], length 0
14:29:24.467210 IP localhost.58693 > localhost.58996: Flags [R.], seq 0, ack 1556778023, win 0, length 0
14:29:24.468455 IP localhost.58997 > localhost.58693: Flags [S], seq 522890491, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706998 ecr 0,sackOK,eol], length 0
14:29:24.468469 IP localhost.58693 > localhost.58997: Flags [R.], seq 0, ack 522890492, win 0, length 0
14:29:24.469650 IP localhost.58998 > localhost.58693: Flags [S], seq 3272035680, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706999 ecr 0,sackOK,eol], length 0
14:29:24.469661 IP localhost.58693 > localhost.58998: Flags [R.], seq 0, ack 3272035681, win 0, length 0
14:29:24.470846 IP localhost.58999 > localhost.58693: Flags [S], seq 1810034270, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107707000 ecr 0,sackOK,eol], length 0
14:29:24.470861 IP localhost.58693 > localhost.58999: Flags [R.], seq 0, ack 1810034271, win 0, length 0
14:29:24.472034 IP localhost.59000 > localhost.58693: Flags [S], seq 2410840076, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107707001 ecr 0,sackOK,eol], length 0
14:29:24.472046 IP localhost.58693 > localhost.59000: Flags [R.], seq 0, ack 2410840077, win 0, length 0
14:29:24.473404 IP localhost.59001 > localhost.58693: Flags [S], seq 1786606508, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107707002 ecr 0,sackOK,eol], length 0
14:29:24.473468 IP localhost.58693 > localhost.59001: Flags [R.], seq 0, ack 1786606509, win 0, length 0
14:29:24.476083 IP localhost.59002 > localhost.58693: Flags [S], seq 2237271129, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107707004 ecr 0,sackOK,eol], length 0
14:29:24.476098 IP localhost.58693 > localhost.59002: Flags [R.], seq 0, ack 2237271130, win 0, length 0
14:29:24.477475 IP localhost.59003 > localhost.58693: Flags [S], seq 1523915848, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107707006 ecr 0,sackOK,eol], length 0
14:29:24.477485 IP localhost.58693 > localhost.59003: Flags [R.], seq 0, ack 1523915849, win 0, length 0
14:29:25.142164 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107707572 ecr 0,sackOK,eol], length 0
14:29:25.142176 IP localhost.59223 > localhost.58693: Flags [S], seq 4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107707572 ecr 0,sackOK,eol], length 0
14:29:25.342565 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107707748 ecr 0,sackOK,eol], length 0
14:29:25.342578 IP localhost.59223 > localhost.58693: Flags [S], seq 4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107707748 ecr 0,sackOK,eol], length 0
14:29:25.443273 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107707848 ecr 0,sackOK,eol], length 0
14:29:25.443319 IP localhost.59223 > localhost.58693: Flags [S], seq 4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107707848 ecr 0,sackOK,eol], length 0
14:29:25.644115 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107708047 ecr 0,sackOK,eol], length 0
14:29:25.644126 IP localhost.59223 > localhost.58693: Flags [S], seq 4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107708047 ecr 0,sackOK,eol], length 0
14:29:25.745269 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107708148 ecr 0,sackOK,eol], length 0
14:29:25.846262 IP localhost.59223 > localhost.58693: Flags [S], seq 4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107708248 ecr 0,sackOK,eol], length 0
14:29:25.946931 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107708348 ecr 0,sackOK,eol], length 0
14:29:26.249474 IP localhost.59223 > localhost.58693: Flags [S], seq 4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107708650 ecr 0,sackOK,eol], length 0
14:29:26.350322 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107708750 ecr 0,sackOK,eol], length 0
14:29:27.255626 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107709606 ecr 0,sackOK,eol], length 0
14:29:28.967861 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 1107711260 ecr 0,sackOK,eol], length 0
14:29:32.293864 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,sackOK,eol], length 0
14:29:38.938614 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,sackOK,eol], length 0
14:29:45.481879 IP localhost.59224 > localhost.58693: Flags [S], seq 2276013461, win 65535, options [mss 16344,sackOK,eol], length 0


> unit tests sometimes are slow during shutdown
> ---------------------------------------------
>
>                 Key: KAFKA-2010
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2010
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jun Rao
>            Assignee: Jun Rao
>             Fix For: 0.8.3
>
>
> Our unit tests in trunk seem to be slower than before. The slowness seems to be due to a handful of tests.
> For example, if you run the following test,  sometimes it can take more than 40
> secs, while normally it takes less than 10 secs.
> ./gradlew -i cleanTest core:test --tests kafka.admin.AddPartitionsTest.testIncrementPartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)