Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2017/04/12 19:55:41 UTC

[jira] [Commented] (KUDU-1968) Aborted tablet copies delete live blocks

    [ https://issues.apache.org/jira/browse/KUDU-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966490#comment-15966490 ] 

Todd Lipcon commented on KUDU-1968:
-----------------------------------

I'm able to repro with the following sequence:

{code}
rm -Rf /tmp/m /tmp/ts-{1,2,3}
ninja -C build/release kudu-tserver kudu-master kudu

build/latest/bin/kudu-master -fs_wal_dir /tmp/m &
build/latest/bin/kudu-tserver -fs_wal_dir /tmp/ts-1 -rpc_bind_addresses=0.0.0.0:7001 -webserver_port=8001 -flush_threshold_secs=10 -unlock-experimental-flags &
build/latest/bin/kudu-tserver -fs_wal_dir /tmp/ts-2 -rpc_bind_addresses=0.0.0.0:7002 -webserver_port=8002 -flush_threshold_secs=10 -unlock-experimental-flags -unlock-unsafe-flags -fault-crash-on-handle-tc-fetch-data=0.2 &

sleep 5 # wait for all servers to start

build/latest/bin/kudu test loadgen localhost -keep_auto_table -num_rows_per_thread=1000000

sleep 20 # wait for flush

tablet=$(ls -1 /tmp/ts-2/tablet-meta/* | head -1 | xargs basename)
build/latest/bin/kudu remote_replica copy $tablet localhost:7002 localhost:7001
build/latest/bin/kudu fs check -fs_wal_dir /tmp/ts-1/
{code}

We should revert the patch in trunk and branch-1.3 and release 1.3.1 ASAP.

> Aborted tablet copies delete live blocks
> ----------------------------------------
>
>                 Key: KUDU-1968
>                 URL: https://issues.apache.org/jira/browse/KUDU-1968
>             Project: Kudu
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.3.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>
> 72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) caused a serious regression in the case of a failed tablet copy. As of that patch, the following sequence happens:
> - we fetch the remote tablet's metadata and set our local metadata to match it (including the remote block IDs)
> - as we download blocks, we replace remote block IDs with local block IDs
> - if we fail in the middle, we call DeleteTablet
> -- since the metadata still contains some remote block IDs, the DeleteTablet call deletes local blocks based on remote block IDs. These block IDs are likely to belong to other live tablets locally!
> This can cause serious data loss and tends to cascade around a cluster, since later attempts to copy a tablet with missing blocks will be aborted as well.
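
The sequence described above can be illustrated with a small toy model. This is only a sketch and is not Kudu code: BlockStore, copy_tablet, and the integer block IDs are hypothetical names invented for illustration, and it assumes that remote block IDs happen to collide with IDs already in use by a live local tablet.

```python
# Toy model of the failure sequence; NOT Kudu code. All names are hypothetical.

class BlockStore:
    """A local block store mapping local block ID -> owning tablet name."""

    def __init__(self):
        self.blocks = {}
        self.next_id = 1

    def create(self, owner):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = owner
        return block_id

    def delete(self, block_id):
        # Deletes whatever lives at this ID, regardless of which tablet owns it.
        self.blocks.pop(block_id, None)


def copy_tablet(store, remote_block_ids, fail_after):
    """Simulate a tablet copy that aborts once `fail_after` blocks are downloaded."""
    # Step 1: local metadata starts out as a copy of the remote metadata,
    # so it holds *remote* block IDs.
    metadata = list(remote_block_ids)
    for i in range(len(metadata)):
        if i == fail_after:
            # Step 3: the copy fails midway and the tablet is deleted. Every ID
            # in the metadata is deleted locally, including remote IDs that
            # were never replaced.
            for block_id in metadata:
                store.delete(block_id)
            return False
        # Step 2: each downloaded block replaces a remote ID with a local ID.
        metadata[i] = store.create("copied-tablet")
    return True


store = BlockStore()
# A live local tablet owns local block IDs 1, 2 and 3.
live_blocks = [store.create("live-tablet") for _ in range(3)]
# The remote tablet's block IDs are assumed to collide with those local IDs.
ok = copy_tablet(store, remote_block_ids=[1, 2, 3], fail_after=1)
lost = [b for b in live_blocks if b not in store.blocks]
print(lost)  # [2, 3] in this toy setup: blocks of the live tablet were deleted
```

In this model only the first block was downloaded before the failure, so the two unreplaced remote IDs are interpreted as local IDs and the live tablet loses those blocks.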


