You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2016/02/24 21:11:44 UTC

[kudu-CR] WIP: KUDU-1341 testing

Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/2300

to look at the new patch set (#2).

Change subject: WIP: KUDU-1341 testing
......................................................................

WIP: KUDU-1341 testing

This adds a new test which reproduces the following bootstrap error:
- a row is written and flushed to a DRS
- the row is updated and deleted, but the DMS is not flushed
- the row is inserted again into MRS and flushed again
- the TS restarts

When bootstrap begins, we have the following tablet state:

- DRS 1: contains the row (original inserted version)
- DRS 2: contains the row (second inserted version)

and the following log:

  REPLICATE 1.1: INSERT orig version
  COMMIT 1.1: dms id = 1
    - bootstrap correctly decides that this already flushed

  REPLICATE 1.2: UPDATE
  COMMIT 1.2: drs_id = 1, dms_id = 1
    - bootstrap correctly decides that the update is not flushed
    - it applies the update to the tablet
      - the tablet MutateRow code checks the interval tree and finds
        that this row could be in either DRS 1 or DRS 2
      - it tries applying the update to them in whatever order the
        RowSetTree returns the DiskRowSets
      - if it happens to try DRS 2 first, then we incorrectly apply
        the update to the new version of the row instead of the old

We haven't seen this bug in most of our testing because the RowSetTree
returns rows in a deterministic order. We only test the
insert-update-delete-insert sequence in a few test cases, and it appears
that those test cases either:
1) do not restart the tablet server, and thus don't see bootstrap issues
2) happen to result in RowSetTrees that give back the DRS in the lucky
   order that doesn't trigger the bug.

This patch changes the tablet code so that, in debug mode, the results
returned by RowSetTree are shuffled before being used. This makes the
bug more likely to reproduce. It adds a specific test case which reproduces
the bug reliably in debug builds with the shuffling enabled.

Additionally, this patch moves the "fuzz test" from tablet_random-access-test
into a new 'fuzz-itest', and changes it to write to a single-TS minicluster
rather than directly to a tablet. The test disables the maintenance manager
and manually manages background operations, so that it has the same capabilities
as before, but the fact that it's writing to the "full stack" allows us to
add a new fuzz test action: restarting the tablet server. Additionally, the
improved test now supports generating writes to multiple rows, since some
errors don't show up if you have single-row rowsets.

With the improved fuzz test, I was able to generate sequences that reliably
reproduce the crash described in KUDU-1341 as well as several other bugs:
incorrect results on reads, other CHECK failures in compactions, etc.
The fuzz tester should give us good confidence in the bug fix.

Change-Id: I6017ef67ae236021f7e6bd19d21b89310b8e6894
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/fuzz-itest.cc
M src/kudu/tablet/deltafile.cc
M src/kudu/tablet/deltafile.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_random_access-test.cc
M src/kudu/tserver/tablet_server-test.cc
7 files changed, 642 insertions(+), 331 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/00/2300/2
-- 
To view, visit http://gerrit.cloudera.org:8080/2300
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6017ef67ae236021f7e6bd19d21b89310b8e6894
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins