You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/08/25 00:16:53 UTC

[kudu-CR] Fix flaky ts recovery-itest TestChangeMaxCellSize

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7820

Change subject: Fix flaky ts_recovery-itest TestChangeMaxCellSize
......................................................................

Fix flaky ts_recovery-itest TestChangeMaxCellSize

In TSTabletManager::CreateReportedTabletPB(), each replica's state is
reported, and if this state is FAILED, it additionally reports the error
state that caused it to fail. However, replica state is transient, and
can, on rare occasion, change between these two retrievals of replica
state.

This led to a DCHECK failure in ts_recovery-itest when parsing a
report, as reports with errors are expected to be FAILED. The failure
was consistent with the following sequence of events:
  1. The tablet is opened, hits an error, and starts shutting down
  2. The report notes the replica state (STOPPING)
  3. The tablet finishes shutting down and switches its state (FAILED)
  4. The report, seeing the replica is FAILED, notes the error that
     caused the shut down
  5. The catalog manager goes through the report, and notes the error
     tacked onto the report and the fact that the tablet is STOPPING
     instead of FAILED. This fails the DCHECK, as logged below:

BlockManagerType/TsRecoveryITest.TestChangeMaxCellSize/0: catalog_manager.cc:2489] Check failed: report.state() == tablet::FAILED (3 vs. 2)

This sequence may be more frequent since a9d17c0, which enforces the
tablet stop (i.e. go into the STOPPING state and shut down the replica's
internals) before going into the FAILED state, instead of going directly
to the FAILED state.

This was validated by injecting a pause in the CreateReportedTabletPB
between the two calls and hitting the DCHECK failure. This is fixed by
reporting the error regardless of state and removing this DCHECK.

Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
---
M src/kudu/master/catalog_manager.cc
M src/kudu/tserver/ts_tablet_manager.cc
2 files changed, 2 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/20/7820/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7820
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] Fix flaky ts recovery-itest TestChangeMaxCellSize

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged.

Change subject: Fix flaky ts_recovery-itest TestChangeMaxCellSize
......................................................................


Fix flaky ts_recovery-itest TestChangeMaxCellSize

In TSTabletManager::CreateReportedTabletPB(), each replica's state is
reported, and if this state is FAILED, it additionally reports the error
state that caused it to fail. However, replica state is transient, and
can, on rare occasion, change between these two retrievals of replica
state.

This led to a DCHECK failure in ts_recovery-itest when parsing a
report, as reports with errors are expected to be FAILED. The failure
was consistent with the following sequence of events:
  1. The tablet is opened, hits an error, and starts shutting down
  2. The report notes the replica state (STOPPING)
  3. The tablet finishes shutting down and switches its state (FAILED)
  4. The report, seeing the replica is FAILED, notes the error that
     caused the shut down
  5. The catalog manager goes through the report, and notes the error
     tacked onto the report and the fact that the tablet is STOPPING
     instead of FAILED. This fails the DCHECK, as logged below:

BlockManagerType/TsRecoveryITest.TestChangeMaxCellSize/0: catalog_manager.cc:2489] Check failed: report.state() == tablet::FAILED (3 vs. 2)

This sequence may be more frequent since a9d17c0, which enforces the
tablet stops (i.e. go into the STOPPING state and shut down the
replica's internals) before going into the FAILED state, instead of
going directly to the FAILED state.

Validated by injecting a pause in the CreateReportedTabletPB between the
two calls and hitting the DCHECK failure. This is fixed by reporting the
error regardless of state and removing this DCHECK.

Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Reviewed-on: http://gerrit.cloudera.org:8080/7820
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/master/catalog_manager.cc
M src/kudu/tserver/ts_tablet_manager.cc
2 files changed, 4 insertions(+), 5 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7820
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Fix flaky ts recovery-itest TestChangeMaxCellSize

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: Fix flaky ts_recovery-itest TestChangeMaxCellSize
......................................................................


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7820
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Fix flaky ts recovery-itest TestChangeMaxCellSize

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: Fix flaky ts_recovery-itest TestChangeMaxCellSize
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7820/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1014:   if (!replica->error().ok()) {
> This has an ABA problem, potentially. How about:
Hrm, sure. Setting error _should_ be permanent (in that it shouldn't go back to OK), but it's a reasonable safeguard against further state-transience changes.


-- 
To view, visit http://gerrit.cloudera.org:8080/7820
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Fix flaky ts recovery-itest TestChangeMaxCellSize

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: Fix flaky ts_recovery-itest TestChangeMaxCellSize
......................................................................


Patch Set 1:

(1 comment)

Yeah, that DCHECK is not useful.

http://gerrit.cloudera.org:8080/#/c/7820/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1014:   if (!replica->error().ok()) {
This has an ABA problem, potentially. How about:

  const Status& error = replica->error();
  if (!error.ok()) {
    StatusToPB(reported_tablet->mutable_error(), error_status);
  }


-- 
To view, visit http://gerrit.cloudera.org:8080/7820
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Fix flaky ts recovery-itest TestChangeMaxCellSize

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7820

to look at the new patch set (#2).

Change subject: Fix flaky ts_recovery-itest TestChangeMaxCellSize
......................................................................

Fix flaky ts_recovery-itest TestChangeMaxCellSize

In TSTabletManager::CreateReportedTabletPB(), each replica's state is
reported, and if this state is FAILED, it additionally reports the error
state that caused it to fail. However, replica state is transient, and
can, on rare occasion, change between these two retrievals of replica
state.

This led to a DCHECK failure in ts_recovery-itest when parsing a
report, as reports with errors are expected to be FAILED. The failure
was consistent with the following sequence of events:
  1. The tablet is opened, hits an error, and starts shutting down
  2. The report notes the replica state (STOPPING)
  3. The tablet finishes shutting down and switches its state (FAILED)
  4. The report, seeing the replica is FAILED, notes the error that
     caused the shut down
  5. The catalog manager goes through the report, and notes the error
     tacked onto the report and the fact that the tablet is STOPPING
     instead of FAILED. This fails the DCHECK, as logged below:

BlockManagerType/TsRecoveryITest.TestChangeMaxCellSize/0: catalog_manager.cc:2489] Check failed: report.state() == tablet::FAILED (3 vs. 2)

This sequence may be more frequent since a9d17c0, which enforces the
tablet stops (i.e. go into the STOPPING state and shut down the
replica's internals) before going into the FAILED state, instead of
going directly to the FAILED state.

Validated by injecting a pause in the CreateReportedTabletPB between the
two calls and hitting the DCHECK failure. This is fixed by reporting the
error regardless of state and removing this DCHECK.

Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
---
M src/kudu/master/catalog_manager.cc
M src/kudu/tserver/ts_tablet_manager.cc
2 files changed, 4 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/20/7820/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7820
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>