You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Will Berkeley (Code Review)" <ge...@cloudera.org> on 2019/05/07 16:58:22 UTC

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13264


Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................

KUDU-2807 Crash when flush or compaction overlaps with another compaction

Commit d3684a7b2add8f06b7189adb9ce9222b8ae1eff5 introduced a metric for
average rowset height. Computing this requires examining the rowsets in
the rowset tree and briefly taking each one's `compact_flush_lock_`.
However, any time a thread takes the `compact_flush_lock_` of a rowset,
it must hold the `compact_select_lock_` of the tablet that rowset
belongs to. This was not happening in two of the three places where the
average height is computed:

1. When opening the tablet.
2. When updating the rowset tree during a flush or compaction.

The first case is benign (as far as I know). The second case could cause
a crash like

F0429 07:26:56.918041 34043 tablet.cc:2268] Check failed: lock.owns_lock() RowSet(24130) unable to lock compact_flush_lock

MM ops enforced the invariant above by try-locking the
`compact_flush_lock_` and checking that they obtained the lock, while
holding the `compact_select_lock_`. So, if a MM op try-locked a rowset
at the same time as another MM op was holding its `compact_flush_lock_`,
the above crash would result.

This patch fixes the crash by ensuring that the `compact_select_lock_`
is held whenever `ComputeCdfAndCheckOrdered`, which computes the average
rowset height, is called. I also made a small modification to the scope
of a `component_lock_` to avoid having to define a lock order for
`component_lock_` and `compact_select_lock_`.

Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
---
M src/kudu/tablet/rowset_info.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
3 files changed, 23 insertions(+), 16 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/13264/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/13264/2/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/13264/2/src/kudu/tablet/tablet.cc@293
PS2, Line 293:   if (metrics_) {
Why not use the new helper function here? Seems like you could if you did it after components_ was set up.


http://gerrit.cloudera.org:8080/#/c/13264/2/src/kudu/tablet/tablet.cc@1624
PS2, Line 1624:     // Taking component_lock_ in write mode ensures that no new transactions
              :     // can StartApplying() (or snapshot components_) during this block.
              :     std::lock_guard<rw_spinlock> lock(component_lock_);
Could we just call AtomicSwapRowSets here? Or do we need to hold component_lock_ for L1633-1634 too?


http://gerrit.cloudera.org:8080/#/c/13264/2/src/kudu/tablet/tablet.cc@1729
PS2, Line 1729:     UpdateAverageRowsetHeight();
Why not just always call this in AtomicSwapRowSetsUnlocked? You might need one exception: swapping in the duplicating rowset. But you can achieve that via parameterization. Seems easier than keeping the two decoupled (i.e. what if we added a new call to AtomicSwapRowSets in the future but forgot to also call UpdateAverageRowsetHeight there?).



-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Wed, 08 May 2019 21:15:24 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.cc@1058
PS3, Line 1058:   UpdateAverageRowsetHeight();
Restricting this call to the locked version of the function doesn't seem like a particularly robust way to avoid calling it when swapping in the duplicating rowset. It also violates POLA in that people usually expect Foo and FooUnlocked to only differ in locking behavior.

This is tough though, because ideally:
1. UpdateAverageRowsetHeight is called with every atomic rowset swap.
2. It may be called with component_lock_ held or not.
3. It'll take component_lock_ if necessary to get the list of components.
4. But then it drops the lock while recomputing the rowset height (at that point only compact_select_lock_ is needed).



-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Wed, 08 May 2019 22:52:38 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 1:

(1 comment)

> Would be great to test this. I am surprised mt-tablet-test didn't
 > trigger the crash; could you take a look?

I think we are testing for it as the mt-tablet-test has threads flushing and compacting and GCing undos. Presumably it's just really unlikely because the compact_flush_lock_ is held so briefly for each rowset when recomputing the average height. What's interesting is that the reporter reports running into it on multiple servers consistently. Maybe the most efficient thing to do it investigate that report to understand why they are seeing this, if this patch fixes them, and if we can write a more targeted test based off what we learn from them.

http://gerrit.cloudera.org:8080/#/c/13264/1/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/13264/1/src/kudu/tablet/tablet.cc@1753
PS1, Line 1753: Status Tablet::HandleEmptyCompactionOrFlush(const RowSetVector& rowsets,
> Shouldn't we recompute the rowset height in the event of an empty compact/f
My bad. Missed this.



-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Wed, 08 May 2019 19:50:25 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13264

to look at the new patch set (#2).

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................

KUDU-2807 Crash when flush or compaction overlaps with another compaction

Commit d3684a7b2add8f06b7189adb9ce9222b8ae1eff5 introduced a metric for
average rowset height. Computing this requires examining the rowsets in
the rowset tree and briefly taking each one's `compact_flush_lock_`.
However, any time a thread takes the `compact_flush_lock_` of a rowset,
it must hold the `compact_select_lock_` of the tablet that rowset
belongs to. This was not happening in two of the three places where the
average height is computed:

1. When opening the tablet.
2. When updating the rowset tree during a flush or compaction.

The first case is benign (as far as I know). The second case could cause
a crash like

F0429 07:26:56.918041 34043 tablet.cc:2268] Check failed: lock.owns_lock() RowSet(24130) unable to lock compact_flush_lock

MM ops enforced the invariant above by try-locking the
`compact_flush_lock_` and checking that they obtained the lock, while
holding the `compact_select_lock_`. So, if a MM op try-locked a rowset
at the same time as another MM op was holding its `compact_flush_lock_`,
the above crash would result.

This patch fixes the crash by ensuring that the `compact_select_lock_`
is held whenever `ComputeCdfAndCheckOrdered`, which computes the average
rowset height, is called. I also made a small modification to the scope
of a `component_lock_` to avoid having to define a lock order for
`component_lock_` and `compact_select_lock_`.

Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
---
M src/kudu/tablet/rowset_info.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
3 files changed, 33 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/13264/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 2:

(1 comment)

Ah I see I forgot to post my comments on PS2. Anyway they are sort of superseded by my comment on this one so I'm deleting them.

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.cc@1058
PS3, Line 1058: 
> Restricting this call to the locked version of the function doesn't seem li
To be honest, this is gotten to the point where I regret adding this metric.
1. It caused a bug because the relationship between the locking is tricky and surprising.
2. It's not clear how useful this metric is, it just seemed nice.
3. Now it seems like just to add in a snipper of code to update the metric it's necessary to do some refactoring of tricky code, or tolerate risk of the metric not being updated.
Maybe we just shouldn't have this metric at all?



-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 09 May 2019 16:48:24 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13264

to look at the new patch set (#4).

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................

KUDU-2807 Crash when flush or compaction overlaps with another compaction

Commit d3684a7b2add8f06b7189adb9ce9222b8ae1eff5 introduced a metric for
average rowset height. Computing this requires examining the rowsets in
the rowset tree and briefly taking each one's `compact_flush_lock_`.
However, any time a thread takes the `compact_flush_lock_` of a rowset,
it must hold the `compact_select_lock_` of the tablet that rowset
belongs to. This was not happening in two of the three places where the
average height is computed:

1. When opening the tablet.
2. When updating the rowset tree during a flush or compaction.

The first case is benign (as far as I know). The second case could cause
a crash like

F0429 07:26:56.918041 34043 tablet.cc:2268] Check failed: lock.owns_lock() RowSet(24130) unable to lock compact_flush_lock

MM ops enforced the invariant above by try-locking the
`compact_flush_lock_` and checking that they obtained the lock, while
holding the `compact_select_lock_`. So, if a MM op try-locked a rowset
at the same time as another MM op was holding its `compact_flush_lock_`,
the above crash would result.

This patch fixes the crash by ensuring that the `compact_select_lock_`
is held whenever `ComputeCdfAndCheckOrdered`, which computes the average
rowset height, is called. I also made a small modification to the scope
of a `component_lock_` to avoid having to define a lock order for
`component_lock_` and `compact_select_lock_`.

Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
---
M src/kudu/tablet/rowset_info.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
3 files changed, 36 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/13264/4
-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.cc@1058
PS3, Line 1058:   UpdateAverageRowsetHeight();
> To be honest, this is gotten to the point where I regret adding this metric
I think it's a useful metric. When debugging performance issues we've generally wanted to know how poorly compacted the tablets are, and this gives us that.

I'd say let's go back to decoupling UpdateAverageRowsetHeight from AtomicSwapRowSets{Unlocked}. Sure it's a little fragile in that we might forget to call it if we sprout new calls to swap the rowsets, but that seems pretty rare, and this way we get the locking behavior we want.



-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 09 May 2019 19:01:23 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................

KUDU-2807 Crash when flush or compaction overlaps with another compaction

Commit d3684a7b2add8f06b7189adb9ce9222b8ae1eff5 introduced a metric for
average rowset height. Computing this requires examining the rowsets in
the rowset tree and briefly taking each one's `compact_flush_lock_`.
However, any time a thread takes the `compact_flush_lock_` of a rowset,
it must hold the `compact_select_lock_` of the tablet that rowset
belongs to. This was not happening in two of the three places where the
average height is computed:

1. When opening the tablet.
2. When updating the rowset tree during a flush or compaction.

The first case is benign (as far as I know). The second case could cause
a crash like

F0429 07:26:56.918041 34043 tablet.cc:2268] Check failed: lock.owns_lock() RowSet(24130) unable to lock compact_flush_lock

MM ops enforced the invariant above by try-locking the
`compact_flush_lock_` and checking that they obtained the lock, while
holding the `compact_select_lock_`. So, if a MM op try-locked a rowset
at the same time as another MM op was holding its `compact_flush_lock_`,
the above crash would result.

This patch fixes the crash by ensuring that the `compact_select_lock_`
is held whenever `ComputeCdfAndCheckOrdered`, which computes the average
rowset height, is called. I also made a small modification to the scope
of a `component_lock_` to avoid having to define a lock order for
`component_lock_` and `compact_select_lock_`.

Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Reviewed-on: http://gerrit.cloudera.org:8080/13264
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Will Berkeley <wd...@gmail.com>
---
M src/kudu/tablet/rowset_info.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
3 files changed, 36 insertions(+), 28 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Will Berkeley: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 5
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 4:

> BTW looks like TestTimeouts is still a bit flaky. :/

:(


-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 18:24:56 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 4: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 18:26:40 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.h
File src/kudu/tablet/tablet.h:

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.h@616
PS3, Line 616:   void AtomicSwapRowSets(const RowSetVector &to_remove,
> warning: function 'kudu::tablet::Tablet::AtomicSwapRowSets' has a definitio
Done


http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/13264/3/src/kudu/tablet/tablet.cc@1058
PS3, Line 1058:   UpdateAverageRowsetHeight();
> I think it's a useful metric. When debugging performance issues we've gener
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 16:57:19 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has removed a vote on this change.

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 1:

(1 comment)

Would be great to test this. I am surprised mt-tablet-test didn't trigger the crash; could you take a look?

http://gerrit.cloudera.org:8080/#/c/13264/1/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/13264/1/src/kudu/tablet/tablet.cc@1753
PS1, Line 1753: Status Tablet::HandleEmptyCompactionOrFlush(const RowSetVector& rowsets,
Shouldn't we recompute the rowset height in the event of an empty compact/flush?



-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 07 May 2019 17:34:09 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 1:

I uploaded Manuel's crash log to KUDU-2807. Could you take a look at it and see whether it really seems like this race?


-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 08 May 2019 16:41:24 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13264

to look at the new patch set (#3).

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................

KUDU-2807 Crash when flush or compaction overlaps with another compaction

Commit d3684a7b2add8f06b7189adb9ce9222b8ae1eff5 introduced a metric for
average rowset height. Computing this requires examining the rowsets in
the rowset tree and briefly taking each one's `compact_flush_lock_`.
However, any time a thread takes the `compact_flush_lock_` of a rowset,
it must hold the `compact_select_lock_` of the tablet that rowset
belongs to. This was not happening in two of the three places where the
average height is computed:

1. When opening the tablet.
2. When updating the rowset tree during a flush or compaction.

The first case is benign (as far as I know). The second case could cause
a crash like

F0429 07:26:56.918041 34043 tablet.cc:2268] Check failed: lock.owns_lock() RowSet(24130) unable to lock compact_flush_lock

MM ops enforced the invariant above by try-locking the
`compact_flush_lock_` and checking that they obtained the lock, while
holding the `compact_select_lock_`. So, if a MM op try-locked a rowset
at the same time as another MM op was holding its `compact_flush_lock_`,
the above crash would result.

This patch fixes the crash by ensuring that the `compact_select_lock_`
is held whenever `ComputeCdfAndCheckOrdered`, which computes the average
rowset height, is called. I also made a small modification to the scope
of a `component_lock_` to avoid having to define a lock order for
`component_lock_` and `compact_select_lock_`.

Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
---
M src/kudu/tablet/rowset_info.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
3 files changed, 36 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/13264/3
-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-2807 Crash when flush or compaction overlaps with another compaction

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13264 )

Change subject: KUDU-2807 Crash when flush or compaction overlaps with another compaction
......................................................................


Patch Set 4: Code-Review+2

BTW looks like TestTimeouts is still a bit flaky. :/


-- 
To view, visit http://gerrit.cloudera.org:8080/13264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic255f0466aa2c158fa32e8e38428eddfcf901b99
Gerrit-Change-Number: 13264
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 18:16:57 +0000
Gerrit-HasComments: No