Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/07/21 03:24:20 UTC

[jira] [Commented] (KUDU-1538) "Orphaned" block deletion can delete live blocks in use by other tablets

    [ https://issues.apache.org/jira/browse/KUDU-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387075#comment-15387075 ] 

Todd Lipcon commented on KUDU-1538:
-----------------------------------

A couple thoughts here:

- the above stuff is trying hard to avoid block leaks in the case of crashing just after a metadata flush, but we already have the opposite leak in the case of a crash just before a metadata flush (the in-progress blocks being written as the compaction output are "committed" in the block manager but not referenced anywhere). So, despite our best efforts, we _still_ have to worry about a more thorough (e.g. mark-and-sweep-style) "garbage collector" for blocks (KUDU-829). Maybe we should just throw away this best effort and accept that our current offering is 'data leaky' and come up with a better holistic solution?
- the fact that we use randomized block IDs instead of sequential block IDs makes reuse much more plausible. With sequentially-allocated IDs, we'd have to "wrap around" our extremely large space to make this an issue, which is _way_ less likely. (I actually had a patch back in 2014 to do this, with some other benefits, but it was only for the FBM)
- maybe we need to "reserve" those block IDs in the block manager until they're actually fully removed from the metadata? worried that this could be quite complex, though.
- maybe a more 'WAL-like' way of doing the roll-forward, tied to specific revisions of the TabletMetadata, is the way to go?
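
To make the "reserve" idea above a bit more concrete, here is a rough Python sketch. Everything in it is hypothetical (the class name, the method names, the tiny ID space); it is not how Kudu's block manager is written. It only illustrates that a deleted ID can stay unavailable to the random allocator until the metadata that lists it as orphaned has let go of it.

```python
import random


class ReservingBlockManager:
    """Toy block manager (hypothetical, not Kudu's API).

    Deleted block IDs stay 'reserved' until the caller confirms that no
    tablet metadata lists them as orphaned any more.
    """

    def __init__(self, id_space, seed=None):
        self.id_space = id_space          # number of possible block IDs
        self.rng = random.Random(seed)    # randomized allocation, as in the issue
        self.live = set()                 # IDs that currently hold data
        self.reserved = set()             # deleted, but still named in metadata

    def allocate(self):
        # Refuse to spin forever if every ID is live or reserved.
        if len(self.live | self.reserved) >= self.id_space:
            raise RuntimeError("no free block IDs")
        while True:
            candidate = self.rng.randrange(self.id_space)
            # The extra 'reserved' check is the whole point of the sketch.
            if candidate not in self.live and candidate not in self.reserved:
                self.live.add(candidate)
                return candidate

    def delete_but_reserve(self, block_id):
        # Data is gone, but the ID may still appear in an orphan list.
        self.live.discard(block_id)
        self.reserved.add(block_id)

    def release(self, block_id):
        # Call once the orphan entry has been flushed out of the metadata.
        self.reserved.discard(block_id)


# Tiny ID space so that reuse would be forced without the reservation.
bm = ReservingBlockManager(id_space=2, seed=0)
first = bm.allocate()
second = bm.allocate()
bm.delete_but_reserve(first)

try:
    bm.allocate()
    reuse_blocked = False
except RuntimeError:
    reuse_blocked = True      # 'first' cannot be handed out yet

bm.release(first)
reused = bm.allocate()        # now the ID is genuinely free again
```

The complexity worry still applies: in a real system the reserved set would itself have to survive restarts, which this sketch ignores.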


> "Orphaned" block deletion can delete live blocks in use by other tablets
> ------------------------------------------------------------------------
>
>                 Key: KUDU-1538
>                 URL: https://issues.apache.org/jira/browse/KUDU-1538
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs, tablet
>    Affects Versions: 0.9.1
>            Reporter: Todd Lipcon
>            Priority: Blocker
>
> Currently, we allocate block IDs using a random number generator, ensuring that the blocks we allocate are not already in use. Of course that doesn't preclude a block which was previously used and then deleted from having its ID reused.
> This interacts quite poorly with the "orphaned block" processing we have in tablet metadata. As a refresher, the "orphaned block" thing is used as follows:
> - during a compaction, we have the output blocks (newly written data) and the input blocks (data which has been compacted and is no longer relevant)
> - when the compaction finishes, we write a new TabletMetadata which swaps in the new blocks and removes the old blocks
> -- after that, we delete the old (input) blocks. Of course we can't delete the old blocks until after we've flushed the metadata, or else if we crashed before flushing the metadata we'd have lost track of the new block IDs.
> -- so, we defer the deletion of the input blocks until after the metadata has been flushed
> - this leaves open the opposite hole: if we defer the deletion of the old blocks, and we crash just _after_ flushing metadata, we would leak those old blocks and their disk space, which is no good either.
> -- so, when we flush metadata, we include the 'old blocks' in an 'orphan_blocks' array. On loading of metadata, we try to 'roll forward' the deletion to prevent the above-mentioned leak from being permanent.
> The "roll forward" behavior mentioned above is what seems to be eating blocks. We can now have the following bad interleaving:
> - a compaction in tablet A succeeds and lists block ID "X" as orphaned
> - a different tablet B re-uses block ID "X"
> - we restart the TS, or trigger a remote bootstrap (which also "cleans up" orphan blocks)
> -- it deletes block "X" from underneath tablet "B"
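
The interleaving described above can be replayed with a small toy model. This is only a sketch: the classes and method names are made up for illustration and are not Kudu's real code. It also assumes that the orphan list stays in tablet A's persisted metadata after the deferred deletion has already run, which is what lets the stale entry survive until the next load.

```python
class ToyBlockManager:
    """Hypothetical stand-in for the block manager: a set of live IDs."""

    def __init__(self):
        self.live = set()

    def create(self, block_id):
        # Only IDs in use *right now* are rejected; deleted IDs can come back.
        if block_id in self.live:
            raise ValueError("block ID already in use")
        self.live.add(block_id)

    def delete(self, block_id):
        self.live.discard(block_id)


class ToyTabletMetadata:
    """Hypothetical persisted tablet state: referenced blocks plus orphans."""

    def __init__(self, name):
        self.name = name
        self.blocks = set()
        self.orphan_blocks = set()

    def finish_compaction(self, bm, inputs, outputs):
        # "Flush": swap in the outputs and record the inputs as orphaned.
        self.blocks = (self.blocks - set(inputs)) | set(outputs)
        self.orphan_blocks |= set(inputs)
        # Deferred deletion runs after the flush. The orphan list is assumed
        # to remain in the flushed metadata afterwards.
        for block_id in inputs:
            bm.delete(block_id)

    def load(self, bm):
        # "Roll forward": delete whatever the metadata says is orphaned,
        # without checking whether another tablet now owns that ID.
        for block_id in self.orphan_blocks:
            bm.delete(block_id)
        self.orphan_blocks.clear()


bm = ToyBlockManager()
tablet_a = ToyTabletMetadata("A")
tablet_b = ToyTabletMetadata("B")

# Tablet A owns block "X", then compacts it into a new block "Z".
bm.create("X")
tablet_a.blocks.add("X")
bm.create("Z")
tablet_a.finish_compaction(bm, inputs={"X"}, outputs={"Z"})

# Tablet B happens to be handed the now-free ID "X".
bm.create("X")
tablet_b.blocks.add("X")

# Restart: every tablet reloads its metadata and rolls forward.
tablet_a.load(bm)
tablet_b.load(bm)

# Blocks that tablet B references but which no longer exist.
missing = tablet_b.blocks - bm.live
```

After the simulated restart, `missing` contains "X": tablet B still references the block, but tablet A's roll-forward has deleted it.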



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)