You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2017/08/02 00:36:22 UTC

[kudu-CR] raft consensus-itest: another fix for asynchronous SIGSTOP

Hello David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/7561

to review the following change.

Change subject: raft_consensus-itest: another fix for asynchronous SIGSTOP
......................................................................

raft_consensus-itest: another fix for asynchronous SIGSTOP

Another x1000 loop of raft_consensus-itest yielded the following failure in
TestElectPendingVoter:

  /data/1/adar/kudu/src/kudu/integration-tests/raft_consensus-itest.cc:2035: Failure
  Value of: s.IsTimedOut()
    Actual: false
  Expected: true
  Expected AddServer() to time out. Result: OK

The fix is similar to the fix in commit 27435da.

Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
---
M src/kudu/integration-tests/raft_consensus-itest.cc
1 file changed, 8 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/7561/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] subprocess: even more robust fix for asynchronous signals

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: subprocess: even more robust fix for asynchronous signals
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7561/6/src/kudu/util/subprocess-test.cc
File src/kudu/util/subprocess-test.cc:

Line 302:   ASSERT_EVENTUALLY([&]{
why do these need EVENTUALLY? isn't the point of the changes you made to Kill that, as soon as kill returns, you're guaranteed it's in the right state? similarly when Start() returns it should definitely be running


-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] raft consensus-itest: robust fix for asynchronous kill

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7561

to look at the new patch set (#3).

Change subject: raft_consensus-itest: robust fix for asynchronous kill
......................................................................

raft_consensus-itest: robust fix for asynchronous kill

Another x1000 loop of raft_consensus-itest yielded the following failure in
TestElectPendingVoter:

  /data/1/adar/kudu/src/kudu/integration-tests/raft_consensus-itest.cc:2035: Failure
  Value of: s.IsTimedOut()
    Actual: false
  Expected: true
  Expected AddServer() to time out. Result: OK

The fix from commit 27435da doesn't work well for AddServer because it's not
an idempotent operation. Instead, let's take a more robust tack and
repeatedly ping the paused server until it stops responding.

I rolled this out to other places in raft_consensus-itest where we expect an
operation to time out shortly after sending a signal via kill(2). It may
even be necessary after Subprocess::Shutdown; even though that includes a
waitpid() call, I saw one test failure where RemoveServer did not time out
after the followers were both shut down.

Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
---
M src/kudu/integration-tests/raft_consensus-itest.cc
1 file changed, 34 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/7561/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] raft consensus-itest: robust fix for asynchronous kill

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: raft_consensus-itest: robust fix for asynchronous kill
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7561/5//COMMIT_MSG
Commit Message:

Line 20: repeatedly ping the paused server until it stops responding.
> Yeah, I remember this from an earlier discussion we had. Was hoping you'd f
ha, your memory is apparently better than mine then, because I thought it was a fresh idea ;)


-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] raft consensus-itest: robust fix for asynchronous kill

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7561

to look at the new patch set (#2).

Change subject: raft_consensus-itest: robust fix for asynchronous kill
......................................................................

raft_consensus-itest: robust fix for asynchronous kill

Another x1000 loop of raft_consensus-itest yielded the following failure in
TestElectPendingVoter:

  /data/1/adar/kudu/src/kudu/integration-tests/raft_consensus-itest.cc:2035: Failure
  Value of: s.IsTimedOut()
    Actual: false
  Expected: true
  Expected AddServer() to time out. Result: OK

The fix from commit 27435da doesn't work well for AddServer because it's not
an idempotent operation. Instead, let's take a more robust tack and
repeatedly ping the paused server until it stops responding.

I rolled this out to other places in raft_consensus-itest where we expect an
operation to time out shortly after sending a signal via kill(2). It may
even be necessary after Subprocess::Shutdown; even though that includes a
waitpid() call, I saw one test failure where RemoveServer did not time out
after the followers were both shut down.

Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
---
M src/kudu/integration-tests/raft_consensus-itest.cc
1 file changed, 34 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/7561/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] raft consensus-itest: robust fix for asynchronous kill

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: raft_consensus-itest: robust fix for asynchronous kill
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7561/5//COMMIT_MSG
Commit Message:

Line 20: repeatedly ping the paused server until it stops responding.
one idea for an even more general solution that addresses this problem everywhere instead of at each call site:

what if ExternalDaemon::Pause() polled /proc/<pid>/status waiting for the process state to be 'T' (stopped)? it's not osx-compatible but I don't think we are really worried about 0.1% flakes on osx.


-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] subprocess: even more robust fix for asynchronous signals

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: subprocess: even more robust fix for asynchronous signals
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7561/6/src/kudu/util/subprocess-test.cc
File src/kudu/util/subprocess-test.cc:

Line 302:   ASSERT_EVENTUALLY([&]{
> why do these need EVENTUALLY? isn't the point of the changes you made to Ki
Oops. An earlier iteration of the patch used kill(2) directly, and so ASSERT_EVENTUALLY was relevant. I'll remove it.


-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] subprocess: even more robust fix for asynchronous signals

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7561

to look at the new patch set (#7).

Change subject: subprocess: even more robust fix for asynchronous signals
......................................................................

subprocess: even more robust fix for asynchronous signals

Another x1000 loop of raft_consensus-itest yielded the following failure in
TestElectPendingVoter:

  /data/1/adar/kudu/src/kudu/integration-tests/raft_consensus-itest.cc:2035: Failure
  Value of: s.IsTimedOut()
    Actual: false
  Expected: true
  Expected AddServer() to time out. Result: OK

The fix from commit 27435da doesn't work well for AddServer because it's not
an idempotent operation. Instead, let's take a more robust tack and add some
waiting behavior directly into Subprocess, using the /proc/<pid>/stat
process state as a guide.

Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/util/subprocess-test.cc
M src/kudu/util/subprocess.cc
M src/kudu/util/subprocess.h
4 files changed, 113 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/7561/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] subprocess: even more robust fix for asynchronous signals

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7561

to look at the new patch set (#6).

Change subject: subprocess: even more robust fix for asynchronous signals
......................................................................

subprocess: even more robust fix for asynchronous signals

Another x1000 loop of raft_consensus-itest yielded the following failure in
TestElectPendingVoter:

  /data/1/adar/kudu/src/kudu/integration-tests/raft_consensus-itest.cc:2035: Failure
  Value of: s.IsTimedOut()
    Actual: false
  Expected: true
  Expected AddServer() to time out. Result: OK

The fix from commit 27435da doesn't work well for AddServer because it's not
an idempotent operation. Instead, let's take a more robust tack and add some
waiting behavior directly into Subprocess, using the /proc/<pid>/stat
process state as a guide.

Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
---
M src/kudu/util/subprocess-test.cc
M src/kudu/util/subprocess.cc
M src/kudu/util/subprocess.h
3 files changed, 113 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/7561/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] subprocess: even more robust fix for asynchronous signals

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged.

Change subject: subprocess: even more robust fix for asynchronous signals
......................................................................


subprocess: even more robust fix for asynchronous signals

Another x1000 loop of raft_consensus-itest yielded the following failure in
TestElectPendingVoter:

  /data/1/adar/kudu/src/kudu/integration-tests/raft_consensus-itest.cc:2035: Failure
  Value of: s.IsTimedOut()
    Actual: false
  Expected: true
  Expected AddServer() to time out. Result: OK

The fix from commit 27435da doesn't work well for AddServer because it's not
an idempotent operation. Instead, let's take a more robust tack and add some
waiting behavior directly into Subprocess, using the /proc/<pid>/stat
process state as a guide.

Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Reviewed-on: http://gerrit.cloudera.org:8080/7561
Reviewed-by: Todd Lipcon <to...@apache.org>
Tested-by: Kudu Jenkins
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/util/subprocess-test.cc
M src/kudu/util/subprocess.cc
M src/kudu/util/subprocess.h
4 files changed, 113 insertions(+), 11 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] subprocess: even more robust fix for asynchronous signals

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: subprocess: even more robust fix for asynchronous signals
......................................................................


Patch Set 7: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] raft consensus-itest: another fix for asynchronous SIGSTOP

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: raft_consensus-itest: another fix for asynchronous SIGSTOP
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7561/1/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2040:   ASSERT_EVENTUALLY([&]{
On second thought, I'm not so sure about this.

I spotted a similar looking failure with the fix. Perhaps if the first AddServer succeeds, subsequent AddServers will return InvalidArgument and not  TimedOut, even if the majority of tservers are gone?

/data/1/adar/kudu/src/kudu/integration-tests/raft_consensus-itest.cc:2044: Failure
Value of: s.IsTimedOut()
  Actual: false
Expected: true
Expected AddServer() to time out. Result: Invalid argument: Server with UUID 3bfe5d6825be4d648e7d179427e7ce42 is already a member of the config. RaftConfig: opid
_index: 3 OBSOLETE_local: false peers { permanent_uuid: "700da541e0524dff947f3321d0f62f45" member_type: VOTER last_known_addr { host: "127.56.176.4" port: 57412 
} } peers { permanent_uuid: "89bc08641e5d4f32a466f276b1960697" member_type: VOTER last_known_addr { host: "127.56.176.1" port: 54832 } } peers { permanent_uuid: 
"bd8c3b95dd71464483296209a1571bca" member_type: VOTER last_known_addr { host: "127.56.176.3" port: 44271 } } peers { permanent_uuid: "15ba5d1429854802bccddd57f83
5dd0e" member_type: VOTER last_known_addr { host: "127.56.176.2" port: 45364 } } peers { permanent_uuid: "3bfe5d6825be4d648e7d179427e7ce42" member_type: VOTER la
st_known_addr { host: "127.56.176.5" port: 35065 } }
/data/1/adar/kudu/src/kudu/util/test_util.cc:273: Failure
Failed
Timed out waiting for assertion to pass.


-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] raft consensus-itest: robust fix for asynchronous kill

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: raft_consensus-itest: robust fix for asynchronous kill
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7561/5//COMMIT_MSG
Commit Message:

Line 20: repeatedly ping the paused server until it stops responding.
> one idea for an even more general solution that addresses this problem ever
Yeah, I remember this from an earlier discussion we had. Was hoping you'd forgotten. :)

But I'll bite the bullet and try it out.


-- 
To view, visit http://gerrit.cloudera.org:8080/7561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99d400e971d6f9b22cc7b4483db94a98ec306e10
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes