Posted to issues@kudu.apache.org by "David Alves (JIRA)" <ji...@apache.org> on 2017/10/24 22:44:00 UTC
[jira] [Comment Edited] (KUDU-2195) Enforce durability happened before relationships on multiple disks

    [ https://issues.apache.org/jira/browse/KUDU-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217832#comment-16217832 ] 

David Alves edited comment on KUDU-2195 at 10/24/17 10:43 PM:
--------------------------------------------------------------

After discussing this a bit with Adar elsewhere, we think a well-placed fsync would solve the problem in both of these cases.
- For the cmeta->wal case we can add an explicit fsync before the temp file rename on cmeta flush. We're already fsyncing implicitly on ext4 anyway, so this would give us the same semantics on other filesystems like xfs. Because the fsync happens before the rename, the kernel is free to skip the second, implied fsync on rename on ext4.
- For the wal->tablet_meta case we could add an explicit fsync to WaitUntilAllFlushed() that is executed regardless of the value of the log durability flag.
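
The cmeta->wal fix can be sketched roughly as below. This is a minimal illustration using plain POSIX calls, not Kudu's actual cmeta flush code: DurableReplace() and SyncPath() are hypothetical names, and the final fsync of the parent directory is an addition of this sketch (the usual way to make the rename itself durable), not something the comment above proposes.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper: open a path and fsync it (used here for the directory).
static bool SyncPath(const std::string& path, int open_flags) {
  int fd = ::open(path.c_str(), open_flags);
  if (fd < 0) return false;
  bool ok = (::fsync(fd) == 0);
  ::close(fd);
  return ok;
}

// Hypothetical sketch: replace dir/name with `data` so that the new contents
// are on disk before the rename makes them visible.
bool DurableReplace(const std::string& dir, const std::string& name,
                    const std::string& data) {
  const std::string tmp = dir + "/" + name + ".tmp";
  const std::string dst = dir + "/" + name;

  int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;

  size_t off = 0;
  while (off < data.size()) {
    ssize_t n = ::write(fd, data.data() + off, data.size() - off);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted, retry the write
      ::close(fd);
      return false;
    }
    off += static_cast<size_t>(n);
  }

  // The explicit fsync before the rename: does not rely on the filesystem
  // (e.g. ext4) flushing the temp file implicitly as part of rename().
  bool ok = (::fsync(fd) == 0);
  ok = (::close(fd) == 0) && ok;
  if (!ok) return false;

  if (std::rename(tmp.c_str(), dst.c_str()) != 0) return false;

  // Assumption of this sketch: also fsync the parent directory so the
  // rename itself survives a crash.
  return SyncPath(dir, O_RDONLY | O_DIRECTORY);
}
```

A caller would use something like DurableReplace(data_dir, "cmeta", serialized_bytes) and only move on to the wal write once it returns true.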
 


was (Author: dralves):
After discussing this a bit with Adar elsewhere we think for both these particular cases a well placed fsync would solve the problem.
- For the cmeta->wal case we can add an extra fsync before the temp file rename on cmeta flush. We're already fsyncing implicitly on ext4 anyway so that would give us the same semantics on other fs's like xfd. By doing it before the rename the kernel is free to skip the second implied fsync on rename, on ext4 filesystems.
- For the wal->tablet_meta case we could add an explicit fsync to WaitUntilAllFlushed() that would be executed independently of the value of the log durability flag.
 

> Enforce durability happened before relationships on multiple disks
> ------------------------------------------------------------------
>
>                 Key: KUDU-2195
>                 URL: https://issues.apache.org/jira/browse/KUDU-2195
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tablet
>            Reporter: David Alves
>
> When using weaker durability semantics (e.g. when log_force_fsync is off) we should still enforce certain happened-before relationships, which are currently not enforced when the wal and data are on different disks.
> The two cases that come to mind where this is relevant are:
> 1) cmeta (c) -> wal (w): We flush cmeta before flushing the wal (for instance on term change) with the intention that either {}, {c} or {c, w} were made durable.
> 2) wal (w) -> tablet meta (t): We flush the wal before tablet metadata to make sure that all commit messages that refer to on-disk row sets (and deltas) are on disk before the row sets they point to, i.e. with the intention that either {}, {w} or {w, t} were made durable.
> With strong durability semantics these are always made durable in the right order. With weaker semantics that is not the case, though. If the same disk is used for both the wal and data then the invariants are still preserved, as buffers will be flushed in the right order. But if different disks are used for the wal and data (and because cmeta is stored with the data) that is not always the case.
> Case 1) is actually safe on ext4, because we perform an fsync (indirectly: rename() implies fsync on ext4) when flushing cmeta. But it is not safe on xfs.
> Case 2) is not safe on either filesystem.
> --- Possible solutions --
> For 1): Either store cmeta with the wal, or always fsync cmeta.
> For 2): Either store tablet meta with the wal, or always fsync the wal before flushing tablet meta.
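
The second option for 2) ("always fsync the wal before flushing tablet meta") could look roughly like the following. This is a toy sketch under stated assumptions: ToyWal and FlushTabletMeta() are hypothetical stand-ins rather than Kudu's classes, the force_fsync_on_append flag only imitates the role of log_force_fsync, and WaitUntilAllFlushed() merely borrows the name used in the discussion. The point it illustrates is that the barrier fsyncs the wal regardless of the flag, so {t} alone can never become durable.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Toy stand-in for a write-ahead log; illustrative only.
class ToyWal {
 public:
  ToyWal(const std::string& path, bool force_fsync_on_append)
      : fd_(::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644)),
        force_fsync_on_append_(force_fsync_on_append) {}
  ~ToyWal() { if (fd_ >= 0) ::close(fd_); }
  ToyWal(const ToyWal&) = delete;
  ToyWal& operator=(const ToyWal&) = delete;

  bool Append(const std::string& entry) {
    if (fd_ < 0) return false;
    ssize_t n = ::write(fd_, entry.data(), entry.size());
    if (n != static_cast<ssize_t>(entry.size())) return false;
    dirty_ = true;
    // With weak durability the append returns without an fsync.
    return force_fsync_on_append_ ? Sync() : true;
  }

  // Barrier: fsyncs any unsynced appends, independently of the flag.
  bool WaitUntilAllFlushed() {
    if (fd_ < 0) return false;
    return dirty_ ? Sync() : true;
  }

  int sync_count() const { return sync_count_; }

 private:
  bool Sync() {
    if (::fsync(fd_) != 0) return false;
    dirty_ = false;
    ++sync_count_;
    return true;
  }

  int fd_;
  bool force_fsync_on_append_;
  bool dirty_ = false;
  int sync_count_ = 0;
};

// Hypothetical tablet metadata flush: w is made durable before t is written.
bool FlushTabletMeta(ToyWal* wal, const std::string& meta_path,
                     const std::string& meta) {
  if (!wal->WaitUntilAllFlushed()) return false;  // wal first

  int fd = ::open(meta_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;
  bool ok = ::write(fd, meta.data(), meta.size()) ==
            static_cast<ssize_t>(meta.size());
  ok = ok && (::fsync(fd) == 0);  // sketch choice: sync the metadata too
  ok = (::close(fd) == 0) && ok;
  return ok;
}
```

Because the barrier skips the fsync when nothing has been appended since the last sync, the extra cost with strong durability enabled should be small.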


