Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2018/11/21 23:55:53 UTC

[kudu-CR] deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete

Hello Mike Percy, Adar Dembo,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/11979

to review the following change.


Change subject: deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete
......................................................................

deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete

The test runs a write workload and waits for a certain number of WAL
segments to show up on disk. Recently, the test has become particularly
flaky (failing ~10% of the time over the last couple of days, according
to our test tracking server), though I haven't been able to determine
the cause of this new flakiness.

Regardless, the test has been at least a little flaky for much longer
than the last couple of days, and the cause seems to be that in TSAN, we
might not always hit the expected number of WAL segments in the allotted
amount of time.

Upon inspecting a flamegraph of the test, it seems like a decent
percentage of cycles are spent compressing the WALs, so I've removed the
log compression codec for the test.

Without this fix, the test failed 100/100 times with 4 stress threads in
TSAN mode. With it, it passed 1000/1000.

Change-Id: Ic19d33a5e43aaae21c1cb6273a09a09b1b91f92c
---
M src/kudu/integration-tests/ts_recovery-itest.cc
1 file changed, 12 insertions(+), 12 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/79/11979/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11979
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic19d33a5e43aaae21c1cb6273a09a09b1b91f92c
Gerrit-Change-Number: 11979
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11979 )

Change subject: deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete
......................................................................


Patch Set 1:

> Patch Set 1:
> 
> > Patch Set 1: Code-Review+2
> > 
> > It does seem kind of unusual that 60s isn't enough to populate three 1MB log segments given a write workload of 32KB per row. Did you repro it locally by any chance? I wonder what a periodic 'ls -l' of the WAL directory would show while the test is running.
> 
> Not quite, but with some extra logging, I could see that slowly but surely, the writes (measured in bytes inserted by the TestWorkload) _were_ taking place over the span of the minute.

To clarify, the extra logging came from runs on dist-test; I wasn't able to repro it locally. I did run a flamegraph locally, though, to see if anything stood out as unreasonably slow for the purposes of the test, and compression was the big one.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic19d33a5e43aaae21c1cb6273a09a09b1b91f92c
Gerrit-Change-Number: 11979
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Thu, 22 Nov 2018 01:44:26 +0000
Gerrit-HasComments: No

[kudu-CR] deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11979 )

Change subject: deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete
......................................................................


Patch Set 1: Code-Review+2

It does seem kind of unusual that 60s isn't enough to populate three 1MB log segments given a write workload of 32KB per row. Did you repro it locally by any chance? I wonder what a periodic 'ls -l' of the WAL directory would show while the test is running.
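
To put numbers on the observation above, here is a back-of-the-envelope sketch using only the figures quoted in the comment (three 1MB segments, 32KB per row, 60s). The function names are made up for the example, and the calculation ignores per-entry WAL overhead and any effect compression has on the bytes actually written, so treat the result as a rough lower bound on the row count.

```cpp
#include <cstdint>

// Rows needed to produce `segments` segments of `segment_bytes` each when
// every row contributes `row_bytes`, rounded up. (Hypothetical helper.)
constexpr std::int64_t RowsNeeded(std::int64_t segments,
                                  std::int64_t segment_bytes,
                                  std::int64_t row_bytes) {
  return (segments * segment_bytes + row_bytes - 1) / row_bytes;
}

// Average insert rate required to write `rows` rows within `seconds`.
constexpr double RowsPerSecond(std::int64_t rows, double seconds) {
  return static_cast<double>(rows) / seconds;
}

// 3 segments * 1 MiB / 32 KiB per row = 96 rows.
static_assert(RowsNeeded(3, 1024 * 1024, 32 * 1024) == 96, "expected 96 rows");
```

Under these assumptions the workload needs only about 96 rows in 60s, or roughly 1.6 rows per second, which is why a timeout here looks surprising.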



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic19d33a5e43aaae21c1cb6273a09a09b1b91f92c
Gerrit-Change-Number: 11979
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Thu, 22 Nov 2018 00:43:54 +0000
Gerrit-HasComments: No

[kudu-CR] deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11979 )

Change subject: deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete
......................................................................

deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete

The test runs a write workload and waits for a certain number of WAL
segments to show up on disk. Recently, the test has become particularly
flaky (failing ~10% of the time over the last couple of days, according
to our test tracking server), though I haven't been able to determine
the cause of this new flakiness.

Regardless, the test has been at least a little flaky for much longer
than the last couple of days, and the cause seems to be that in TSAN, we
might not always hit the expected number of WAL segments in the allotted
amount of time.

Upon inspecting a flamegraph of the test, it seems like a decent
percentage of cycles are spent compressing the WALs, so I've removed the
log compression codec for the test.

Without this fix, the test failed 100/100 times with 4 stress threads in
TSAN mode. With it, it passed 1000/1000.

Change-Id: Ic19d33a5e43aaae21c1cb6273a09a09b1b91f92c
Reviewed-on: http://gerrit.cloudera.org:8080/11979
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>
---
M src/kudu/integration-tests/ts_recovery-itest.cc
1 file changed, 12 insertions(+), 12 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Adar Dembo: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic19d33a5e43aaae21c1cb6273a09a09b1b91f92c
Gerrit-Change-Number: 11979
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11979 )

Change subject: deflake TsRecoveryITest.TestTabletRecoveryAfterSegmentDelete
......................................................................


Patch Set 1:

> Patch Set 1: Code-Review+2
> 
> It does seem kind of unusual that 60s isn't enough to populate three 1MB log segments given a write workload of 32KB per row. Did you repro it locally by any chance? I wonder what a periodic 'ls -l' of the WAL directory would show while the test is running.

Not quite, but with some extra logging, I could see that slowly but surely, the writes (measured in bytes inserted by the TestWorkload) _were_ taking place over the span of the minute.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic19d33a5e43aaae21c1cb6273a09a09b1b91f92c
Gerrit-Change-Number: 11979
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Thu, 22 Nov 2018 01:42:51 +0000
Gerrit-HasComments: No