Posted to dev@flume.apache.org by "Kevin Conaway (JIRA)" <ji...@apache.org> on 2016/07/13 17:16:20 UTC

[jira] [Commented] (FLUME-2922) HDFSSequenceFile Should Sync Writer

    [ https://issues.apache.org/jira/browse/FLUME-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375372#comment-15375372 ] 

Kevin Conaway commented on FLUME-2922:
--------------------------------------

[~hshreedharan] or [~jarcec], is someone able to review this?

> HDFSSequenceFile Should Sync Writer
> -----------------------------------
>
>                 Key: FLUME-2922
>                 URL: https://issues.apache.org/jira/browse/FLUME-2922
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.6.0
>            Reporter: Kevin Conaway
>            Priority: Critical
>         Attachments: FLUME-2922.patch
>
>
> There is a possibility of losing data with the current HDFS sequence file writer.
> Internally, the `SequenceFile.Writer` buffers data and only periodically syncs it to the underlying output stream.  The exact mechanism depends on whether compression is used, but in both cases the key/values are appended to an internal buffer and only flushed to disk once the buffer reaches a certain size.
> It is therefore quite possible for Flume to lose messages if the agent crashes or is stopped before the internal buffer has been flushed to disk.
> The correct action is to force the writer to sync its internal buffers to the underlying `FSDataOutputStream` before calling hflush/sync.
> Additionally, I believe we should be calling hsync instead of hflush.  It's my understanding that writes with hsync are more durable, which I believe is the semantics we want here (see the sketch below).
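
A minimal sketch of the shape of the proposed change (illustrative only; the attached FLUME-2922.patch is the authoritative fix). It assumes the sink keeps its `SequenceFile.Writer` in a `writer` field and the underlying `FSDataOutputStream` in an `outStream` field, as Flume's HDFSSequenceFile does:

    @Override
    public void sync() throws IOException {
      // Force any records the SequenceFile.Writer is still buffering
      // (e.g. a partially filled compression block) down to the
      // underlying FSDataOutputStream, and write a sync marker.
      writer.sync();
      // Then flush the stream itself. hsync() asks each datanode to
      // persist the data to disk; hflush() only guarantees the data
      // has reached the datanodes' memory.
      outStream.hsync();
    }

Note that hsync() forces a disk write on every datanode in the pipeline, so the extra durability comes at a throughput cost relative to hflush().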
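
To see the buffering first-hand, here is a hypothetical standalone snippet (the class name, path, and record contents are made up; it assumes a Hadoop 2.x client and writes to the local filesystem):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileBufferDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path path = new Path("/tmp/seqfile-buffer-demo.seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(LongWritable.class),
            SequenceFile.Writer.valueClass(Text.class),
            SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        for (long i = 0; i < 1000; i++) {
          writer.append(new LongWritable(i), new Text("event-" + i));
        }
        // With BLOCK compression the appended records are still sitting in
        // the writer's internal buffers; the on-disk file is essentially empty.
        System.out.println("before sync: " + fs.getFileStatus(path).getLen());
        writer.sync();    // flush buffered records (plus a sync marker) to the stream
        writer.hflush();  // then flush the stream's own buffers
        System.out.println("after sync:  " + fs.getFileStatus(path).getLen());
        writer.close();
      }
    }

An agent killed before the sync()/hflush() pair would lose everything still sitting in those buffers, which is exactly the window described above.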



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)