Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/05/17 04:54:12 UTC

[jira] [Resolved] (KUDU-1131) Crash in compaction due to overlapping flush/undo snapshots

     [ https://issues.apache.org/jira/browse/KUDU-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved KUDU-1131.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 0.9.0

Fixed in commit 3512b1ac68f92ebe1e4e632d9c9dd11396edf1b3.

> Crash in compaction due to overlapping flush/undo snapshots
> -----------------------------------------------------------
>
>                 Key: KUDU-1131
>                 URL: https://issues.apache.org/jira/browse/KUDU-1131
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: Private Beta
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>              Labels: crash
>             Fix For: 0.9.0
>
>         Attachments: alter_table-randomized-test.txt.gz
>
>
> Binglin is triggering a crash fairly regularly under load. The sequence is:
> - A tablet is flushed with a snapshot that has at least one txn in flight while a txn with a later timestamp has already committed, e.g.:
> -- txns 1 and 3 committed, txn 2 in flight. This gives a flush snapshot of "txn <= 1 or txn == 3".
> - As of KUDU-987, we don't wait for all in-flight transactions to commit during flush (necessary, since a txn might be in flight for a while).
> - Because txn 3 was committed, the UNDO delta has a timestamp range of [1, 3].
> - We then select the newly flushed rowset for compaction while txn 2 is _still_ not committed.
> -- At this point we hit a CHECK failure, because we see an UNDO file that can't be fully ignored by the compaction (its time range overlaps with uncommitted ranges in the current snapshot).
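
The scenario above can be reproduced with a toy model. This is only a sketch: the Snapshot class, its fields, and overlaps_uncommitted are hypothetical names made up for illustration, not Kudu's actual MVCC code, and timestamps are assumed to be small integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Toy MVCC snapshot (hypothetical, not Kudu's real class).

    Every timestamp below all_committed_before is committed, plus the
    individually listed timestamps in committed_after.
    """
    all_committed_before: int
    committed_after: frozenset = frozenset()

    def is_committed(self, ts: int) -> bool:
        return ts < self.all_committed_before or ts in self.committed_after


def overlaps_uncommitted(snap: Snapshot, min_ts: int, max_ts: int) -> bool:
    """True if any timestamp in [min_ts, max_ts] is uncommitted in snap."""
    return any(not snap.is_committed(ts) for ts in range(min_ts, max_ts + 1))


# Txns 1 and 3 committed, txn 2 in flight: "txn <= 1 or txn == 3".
at_flush = Snapshot(all_committed_before=2, committed_after=frozenset({3}))

# The same tablet once txn 2 has finally committed.
after_commit = Snapshot(all_committed_before=4)

# The UNDO delta written by the flush covers timestamps [1, 3].
print(overlaps_uncommitted(at_flush, 1, 3))      # True: txn 2 falls inside the range
print(overlaps_uncommitted(after_commit, 1, 3))  # False: the range is fully committed
```

If the compaction takes its snapshot while txn 2 is still in flight, it sees the same state as at_flush, so the [1, 3] UNDO range overlaps an uncommitted timestamp. That is the condition the CHECK in the description trips on.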



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)