Posted to issues@kudu.apache.org by "Adar Dembo (JIRA)" <ji...@apache.org> on 2019/05/08 16:40:00 UTC

[jira] [Commented] (KUDU-2807) Possible crash when flush or compaction overlaps with another compaction

    [ https://issues.apache.org/jira/browse/KUDU-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835745#comment-16835745 ] 

Adar Dembo commented on KUDU-2807:
----------------------------------

I've attached Manuel's log. I think it shows a tablet with many rowsets, a high number of MM threads, and an even distribution of deltas across those rowsets. As such, the tablet is absolutely blasted with UndoDeltaBlockGC ops, especially near the site of the crash. It's conceivable that one of these ops overlapped with a flush or compaction and triggered the race.

> Possible crash when flush or compaction overlaps with another compaction
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2807
>                 URL: https://issues.apache.org/jira/browse/KUDU-2807
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.9.0
>            Reporter: Adar Dembo
>            Assignee: Will Berkeley
>            Priority: Blocker
>         Attachments: kudu-tserver.INFO.gz
>
>
> Manuel Sopena reported a crash like this in Slack:
> {noformat}
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0429 07:26:56.918041 34043 tablet.cc:2268] Check failed: lock.owns_lock() RowSet(24130) unable to lock compact_flush_lock
> {noformat}
> It's hard to say exactly what's going on without more logging, but after looking at the code in more detail, I think the culprit is [this commit|https://github.com/apache/kudu/commit/d3684a7b2add8f06b7189adb9ce9222b8ae1eff5], new in Kudu 1.9.0. To understand why it's problematic, we first need to understand the locking invariant in play:
> # A thread must acquire the tablet's compact_select_lock_ in order to select rowsets to compact.
> # Because of #1, it's safe to assume that, if a thread successfully acquired a rowset's compact_flush_lock_ in the act of selecting it for compaction, it can release and reacquire the lock without contention. More precisely, it can release the compact_flush_lock_, then try-lock it, and the try-lock is guaranteed to succeed. All compacting MM ops use a CHECK to enforce this invariant.
> With that in mind, here's the problem: at the time that the call to {{RowSetInfo::ComputeCdfAndCollectOrdered}} is made from {{Tablet::AtomicSwapRowSetsUnlocked}}, the tablet's compact_select_lock_ is not held. {{ComputeCdfAndCollectOrdered}} calls {{RowSet::IsAvailableForCompaction}}, which try-locks the per-rowset compact_flush_lock_. As a result, it's possible for a racing MM operation to also call {{IsAvailableForCompaction}}, successfully try-lock the compact_flush_lock_, release it, try-lock it again (as per the invariant above), fail, and crash in the aforementioned CHECK.
> I don't think this can result in corruption, as we crash rather than allowing the MM op to proceed. But it's a bad race and a bad crash, so we should fix it, possibly producing a 1.9.1 release in the process.
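
The interleaving described above can be replayed deterministically in a small model. This is only a sketch under stated assumptions, not Kudu's code: TryLock and MmOpReacquires are hypothetical names, the lock is a bare atomic flag standing in for compact_flush_lock_, and the two "threads" are sequenced by hand in a single function so the outcome does not depend on scheduling. The boolean parameter models whether the rowset-swap path is serialized behind compact_select_lock_.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a rowset's compact_flush_lock_: a minimal
// try-lock built on an atomic flag, so it never fails spuriously.
class TryLock {
 public:
  bool try_lock() {
    bool expected = false;
    return locked_.compare_exchange_strong(expected, true);
  }
  void unlock() { locked_.store(false); }

 private:
  std::atomic<bool> locked_{false};
};

// Replays one fixed interleaving and reports whether the MM op's
// release-then-try-lock step succeeds. Assumption: if the swap path had to
// hold compact_select_lock_, its probe could not run while the MM op is
// selecting rowsets, so the probe is simply skipped in that case.
bool MmOpReacquires(bool swap_path_holds_select_lock) {
  TryLock compact_flush_lock;

  // MM op (holding compact_select_lock_): lock the rowset while selecting
  // it for compaction, then release the lock.
  bool selected = compact_flush_lock.try_lock();
  assert(selected);
  (void)selected;  // Silence unused-variable warnings under NDEBUG.
  compact_flush_lock.unlock();

  // Swap path: the availability probe try-locks the same per-rowset lock
  // and is still holding it when the MM op resumes.
  bool probe_holds_lock = false;
  if (!swap_path_holds_select_lock) {
    probe_holds_lock = compact_flush_lock.try_lock();
  }

  // MM op: try-lock again. In Kudu a failure here trips the CHECK.
  bool reacquired = compact_flush_lock.try_lock();

  if (probe_holds_lock) compact_flush_lock.unlock();
  return reacquired;
}
```

In this model, MmOpReacquires(true) returns true and MmOpReacquires(false) returns false. The second case corresponds to the failed lock.owns_lock() CHECK in the log line quoted above.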



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)