Posted to reviews@kudu.apache.org by "Will Berkeley (Code Review)" <ge...@cloudera.org> on 2019/03/15 00:34:12 UTC

[kudu-CR] Increase timeout in tls socket-test

Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12761


Change subject: Increase timeout in tls_socket-test
......................................................................

Increase timeout in tls_socket-test

Very rarely (~3/2000 times in TSAN with 8 stress threads),
tls_socket-test will fail with a log like the following:

I0314 19:20:54.118880   236 tls_socket-test.cc:109] server: negotiation complete
I0314 19:20:54.119151   223 tls_socket-test.cc:109] client: negotiation complete
I0314 19:21:04.127199   236 tls_socket-test.cc:165] server echoing 33406976 bytes
/data/6/wdberkeley/kudu/src/kudu/security/tls_socket-test.cc:234: Failure
Failed
Bad status: Network error: BlockingRecv error: failed to read from TLS socket (remote: unknown): Connection reset by peer (error 104)

It seems the following is happening:

1. The client and the echo server connect successfully.
2. The client sends its payload of 32MiB (33554432 bytes) in
   BlockingWrite.
3. The server, looping in BlockingRecv to receive the payload, fails to
   read the whole payload before timing out, through some combination of
   resource saturation, unfavorable scheduling, and EINTR returns from
   recv (a minimal sketch of such a receive loop follows this list).
   Notice the 10-second gap between the second and third log messages
   (the timeout is 10s) and that the number of bytes echoed (33406976)
   is less than 32MiB.
4. The server terminates the connection because of the timeout, but this
   does not result in a failure on its side because the server was
   stopped by the client.
5. The client fails when it first tries to BlockingRecv from the
   closed connection, instead of on the second BlockingRecv as the test
   intends.

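As a rough illustration of step 3, here is a minimal sketch of a
deadline-bounded receive loop using plain POSIX calls; it is not the
test's actual BlockingRecv helper, and the function name and deadline
handling are illustrative assumptions:

#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <chrono>
#include <sys/types.h>
#include <sys/socket.h>

// Receive up to 'len' bytes into 'buf', retrying on short reads and EINTR,
// until the buffer is full or the deadline passes. In real code each recv()
// call would itself be bounded (e.g. via SO_RCVTIMEO or poll()) so a single
// call cannot block past the deadline; that detail is omitted here.
size_t BlockingRecvUntil(int fd, uint8_t* buf, size_t len,
                         std::chrono::steady_clock::time_point deadline) {
  size_t total = 0;
  while (total < len && std::chrono::steady_clock::now() < deadline) {
    ssize_t n = recv(fd, buf + total, len - total, 0);
    if (n > 0) {
      total += static_cast<size_t>(n);  // partial read; keep looping
    } else if (n < 0 && errno == EINTR) {
      continue;                         // interrupted by a signal; retry
    } else {
      break;                            // EOF or hard error
    }
  }
  // Under heavy load (TSAN + stress threads), the loop can exit here with
  // total < len: the caller then echoes only the bytes it managed to read.
  return total;
}

In the failing run, the analogous loop exited after the 10s deadline
having received 33406976 bytes, slightly less than the full
33554432-byte payload, which is what the "server echoing 33406976
bytes" log line reflects.
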
This seems like a test-only issue: the timeout on the server side is
reasonable behavior. Since the failure is so rare, tripling the timeout
should hopefully make it stop, or at least make it much rarer. With a
10s timeout, 2000 runs on TSAN, and 8 stress threads, I saw 2-4
failures. With a 30s timeout, I see 0.
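
The diff itself is not reproduced in this message; in spirit it is a
one-line constant bump of the following kind, where the kEchoTimeout
name is an assumption rather than the test's literal identifier:

// Hypothetical: the actual constant name in tls_socket-test.cc may differ.
const kudu::MonoDelta kEchoTimeout = kudu::MonoDelta::FromSeconds(30);  // previously 10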

Change-Id: Ibc615ea8f03a74f38b2bd6f3b4c140b3e435d4f3
---
M src/kudu/security/tls_socket-test.cc
1 file changed, 1 insertion(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/12761/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12761
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ibc615ea8f03a74f38b2bd6f3b4c140b3e435d4f3
Gerrit-Change-Number: 12761
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>

[kudu-CR] Increase timeout in tls socket-test

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12761 )

Change subject: Increase timeout in tls_socket-test
......................................................................


Patch Set 1: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12761/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12761/1//COMMIT_MSG@10
PS1, Line 10: an
nit: a?



-- 
To view, visit http://gerrit.cloudera.org:8080/12761
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc615ea8f03a74f38b2bd6f3b4c140b3e435d4f3
Gerrit-Change-Number: 12761
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 15 Mar 2019 00:40:09 +0000
Gerrit-HasComments: Yes

[kudu-CR] Increase timeout in tls socket-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12761 )

Change subject: Increase timeout in tls_socket-test
......................................................................


Patch Set 1: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12761
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc615ea8f03a74f38b2bd6f3b4c140b3e435d4f3
Gerrit-Change-Number: 12761
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 15 Mar 2019 03:24:34 +0000
Gerrit-HasComments: No

[kudu-CR] Increase timeout in tls socket-test

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12761 )

Change subject: Increase timeout in tls_socket-test
......................................................................

Increase timeout in tls_socket-test

Very rarely (~3/2000 times in TSAN with 8 stress threads),
tls_socket-test will fail with a log like the following:

I0314 19:20:54.118880   236 tls_socket-test.cc:109] server: negotiation complete
I0314 19:20:54.119151   223 tls_socket-test.cc:109] client: negotiation complete
I0314 19:21:04.127199   236 tls_socket-test.cc:165] server echoing 33406976 bytes
/data/6/wdberkeley/kudu/src/kudu/security/tls_socket-test.cc:234: Failure
Failed
Bad status: Network error: BlockingRecv error: failed to read from TLS socket (remote: unknown): Connection reset by peer (error 104)

It seems the following is happening:

1. The client and the echo server connect successfully.
2. The client sends its payload of 32MiB (33554432 bytes) in
   BlockingWrite.
3. The server, looping in BlockingRecv to receive the payload, fails to
   read the whole payload before timing out, through some combination of
   resource saturation, unfavorable scheduling, and EINTR returns from
   recv. Notice the 10-second gap between the second and third log
   messages (the timeout is 10s) and that the number of bytes echoed
   (33406976) is less than 32MiB.
4. The server terminates the connection because of the timeout, but this
   does not result in a failure on its side because the server was
   stopped by the client.
5. The client fails when it first tries to BlockingRecv from the
   closed connection, instead of on the second BlockingRecv as the test
   intends.

This seems like a test-only issue: the timeout on the server side is
reasonable behavior. Since the failure is so rare, tripling the timeout
should hopefully make it stop, or at least make it much rarer. With a
10s timeout, 2000 runs on TSAN, and 8 stress threads, I saw 2-4
failures. With a 30s timeout, I see 0.

Change-Id: Ibc615ea8f03a74f38b2bd6f3b4c140b3e435d4f3
Reviewed-on: http://gerrit.cloudera.org:8080/12761
Reviewed-by: Alexey Serbin <as...@cloudera.com>
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>
---
M src/kudu/security/tls_socket-test.cc
1 file changed, 1 insertion(+), 1 deletion(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Kudu Jenkins: Verified
  Adar Dembo: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/12761
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ibc615ea8f03a74f38b2bd6f3b4c140b3e435d4f3
Gerrit-Change-Number: 12761
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>