Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/03/01 20:32:18 UTC
[jira] [Updated] (KUDU-1131) Crash in compaction due to overlapping flush/undo snapshots

     [ https://issues.apache.org/jira/browse/KUDU-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated KUDU-1131:
------------------------------
    Attachment: alter_table-randomized-test.txt.gz

Here's a log where we hit it. Interestingly, a particular write was stuck in flight for over a minute and a half; every flush in that window reports the same phase 1 snapshot:

{code}
I0301 01:02:38.998479  9226 tablet.cc:1165] Flush: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 5967146793304829952 or (T in {5967146793304829952})}]
I0301 01:02:53.615841 10177 tablet.cc:1165] Flush: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 5967146793304829952 or (T in {5967146793304829952})}]
I0301 01:03:13.208626 11415 tablet.cc:1165] Flush: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 5967146793304829952 or (T in {5967146793304829952})}]
I0301 01:03:26.035773 12706 tablet.cc:1165] Flush: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 5967146793304829952 or (T in {5967146793304829952})}]
I0301 01:03:38.797785 13965 tablet.cc:1165] Flush: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 5967146793304829952 or (T in {5967146793304829952})}]
I0301 01:03:56.999855 16288 tablet.cc:1165] Flush: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 5967146793304829952 or (T in {5967146793304829952})}]
I0301 01:04:16.210322 22134 tablet.cc:1165] Flush: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 5967146793304829952 or (T in {5967146793304829952})}]
{code}
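
As a quick sanity check on the "minute and a half" figure, the gap between the first and last flush lines in the excerpt above can be computed directly. This is only a sketch: it assumes the glog-style HH:MM:SS.ffffff timestamps all fall on the same day, and the helper name glog_time is made up for illustration.

```python
from datetime import datetime


def glog_time(stamp: str) -> datetime:
    """Parse the time-of-day field of a glog line (hypothetical helper)."""
    return datetime.strptime(stamp, "%H:%M:%S.%f")


# Timestamps copied from the first and last "Flush: entering phase 1" lines.
first_flush = glog_time("01:02:38.998479")
last_flush = glog_time("01:04:16.210322")

# The snapshot printed by both lines is identical, so it did not advance
# for at least this long.
stuck_seconds = (last_flush - first_flush).total_seconds()
print(f"snapshot unchanged for at least {stuck_seconds:.1f}s")
```

That comes to roughly 97 seconds, which is consistent with the write being in flight for over a minute and a half.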

> Crash in compaction due to overlapping flush/undo snapshots
> -----------------------------------------------------------
>
>                 Key: KUDU-1131
>                 URL: https://issues.apache.org/jira/browse/KUDU-1131
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: Private Beta
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>              Labels: crash
>         Attachments: alter_table-randomized-test.txt.gz
>
>
> Binglin is triggering a crash reasonably regularly under load:
> - a tablet is flushed with a snapshot that has at least one txn in flight, while a txn with a later timestamp has already committed, e.g.:
> -- txns 1 and 3 committed, txn 2 in flight. This gives a flush snapshot of txn <= 1 or txn == 3.
> - as of KUDU-987, we don't wait for all in-flight transactions to commit during flush (necessary since the txn might be in flight for a while)
> - because txn 3 was committed, the UNDO delta has a ts range of [1, 3]
> - we then select the newly-flushed rowset for compaction, and txn 2 is _still_ not committed
> -- at this point, we hit a CHECK failure because we see an UNDO file which can't be fully ignored by a compaction (its time range overlaps with uncommitted ranges in the current snapshot)
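
The scenario in the description can be reproduced with a toy model. This is a minimal sketch, not Kudu's actual code: the Snapshot class, its field names, and undo_can_be_ignored are all hypothetical, and timestamps are small integers so the gap can be scanned directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Toy MVCC snapshot (hypothetical, not Kudu's MvccSnapshot class).

    Every timestamp below all_committed_before is committed, plus any
    later timestamps listed in committed_later.
    """
    all_committed_before: int
    committed_later: frozenset = frozenset()

    def is_committed(self, ts: int) -> bool:
        return ts < self.all_committed_before or ts in self.committed_later

    def may_have_uncommitted_at_or_before(self, ts: int) -> bool:
        # Only timestamps >= all_committed_before can still be in flight.
        return any(not self.is_committed(t)
                   for t in range(self.all_committed_before, ts + 1))


def undo_can_be_ignored(snap: Snapshot, undo_max_ts: int) -> bool:
    """An UNDO delta is safe to skip only if nothing at or before its
    newest timestamp is still uncommitted in the snapshot."""
    return not snap.may_have_uncommitted_at_or_before(undo_max_ts)


# txns 1 and 3 committed, txn 2 in flight: "txn <= 1 or txn == 3".
flush_snap = Snapshot(all_committed_before=2, committed_later=frozenset({3}))
undo_range = (1, 3)  # UNDO delta covers [1, 3] because txn 3 committed

# Compaction picks the rowset while txn 2 is still in flight.
overlaps = not undo_can_be_ignored(flush_snap, undo_range[1])

# Once txn 2 commits, the same UNDO range is entirely in the past.
later_snap = Snapshot(all_committed_before=4)
safe_later = undo_can_be_ignored(later_snap, undo_range[1])
```

In this model, overlaps is True for the snapshot from the example (txn 2 sits inside [1, 3] and is uncommitted), which corresponds to the condition the CHECK trips on. safe_later is True once txn 2 has committed.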


