Posted to dev@flume.apache.org by "Inder SIngh (Commented) (JIRA)" <ji...@apache.org> on 2012/03/22 10:34:22 UTC

[jira] [Commented] (FLUME-1045) Proposal to support disk based spooling

    [ https://issues.apache.org/jira/browse/FLUME-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235479#comment-13235479 ] 

Inder SIngh commented on FLUME-1045:
------------------------------------

Proposed Solution
------------------

Sink triggered spooling
----------------------------
When a sink goes down (or, in a failover policy setup, when all sinks go down), data is spooled from the channel to local disk. As soon as a commit from the channel to one of the sinks succeeds, a de-spool from local disk back to the channel is triggered.

Proposed Implementation
---------------------------

1. SpooledFailoverSinkProcessor, which extends FailoverSinkProcessor. It triggers spool() when the sinks go down and despool() when a sink comes back up.
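
A rough sketch of how the spool()/despool() triggers could hang off a failover processor. This is not Flume code: FailoverProcessorStandIn, SpooledFailoverSketch and Outcome are hypothetical stand-ins, sinks are modelled as simple predicates, and an in-memory queue stands in for the local disk.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical result of one processing attempt.
enum Outcome { DELIVERED, EMPTY, ALL_SINKS_FAILED }

// Hypothetical stand-in for a failover processor; NOT Flume's FailoverSinkProcessor.
class FailoverProcessorStandIn {
    protected final List<Predicate<String>> sinks; // each sink returns true on a successful commit
    protected final Deque<String> channel;

    FailoverProcessorStandIn(List<Predicate<String>> sinks, Deque<String> channel) {
        this.sinks = sinks;
        this.channel = channel;
    }

    // Offer the head event to the sinks in priority order.
    Outcome process() {
        String event = channel.peekFirst();
        if (event == null) {
            return Outcome.EMPTY;
        }
        for (Predicate<String> sink : sinks) {
            if (sink.test(event)) {
                channel.pollFirst(); // remove only after a sink accepted it
                return Outcome.DELIVERED;
            }
        }
        return Outcome.ALL_SINKS_FAILED;
    }
}

// Sketch of the proposed processor: spool when every sink fails, de-spool after a commit.
class SpooledFailoverSketch extends FailoverProcessorStandIn {
    final Deque<String> spooled = new ArrayDeque<>(); // assumption: stands in for local disk

    SpooledFailoverSketch(List<Predicate<String>> sinks, Deque<String> channel) {
        super(sinks, channel);
    }

    @Override
    Outcome process() {
        Outcome outcome = super.process();
        if (outcome == Outcome.ALL_SINKS_FAILED) {
            spool();
        } else if (outcome == Outcome.DELIVERED) {
            despool();
        }
        return outcome;
    }

    // Drain the channel to the spool so upstream stages are not blocked.
    void spool() {
        while (!channel.isEmpty()) {
            spooled.addLast(channel.pollFirst());
        }
    }

    // Move spooled events back into the channel for normal delivery.
    void despool() {
        while (!spooled.isEmpty()) {
            channel.addLast(spooled.pollFirst());
        }
    }

    public static void main(String[] args) {
        boolean[] sinkUp = {false};
        List<String> delivered = new ArrayList<>();
        Predicate<String> sink = e -> sinkUp[0] && delivered.add(e);
        Deque<String> channel = new ArrayDeque<>(Arrays.asList("e1", "e2"));
        SpooledFailoverSketch p = new SpooledFailoverSketch(Arrays.asList(sink), channel);

        System.out.println(p.process() + " spooled=" + p.spooled);
        sinkUp[0] = true;   // the sink comes back up
        channel.add("e3");  // a new event arrives and is committed
        System.out.println(p.process() + " channel=" + channel);
    }
}
```

Note that in this sketch a de-spool only happens after a fresh event has been committed, mirroring the "successful commit" trigger described above. A real implementation would also need a way to probe a recovered sink when the channel is empty.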

Some more design choices & assumptions
----------------------------------------
1. Persist Avro-serialized objects on local disk, which preserves both data and headers.
2. Use channel-based transaction semantics while spooling to avoid any data loss.
3. The spool location is configurable for each SinkGroup via "spool-dir". Events are spooled in batches whose size is set by "spool-batch-size". Spool files are rolled over once they reach the size set by "spoolfile-size".
4. Validate the configuration to prevent overlapping spool locations across SinkGroups.
5. De-spool one file at a time, which avoids the complexity of persisting offsets in the first cut.
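
To make points 3 and 5 concrete, here is a minimal sketch of batch spooling with size-based rollover and whole-file de-spooling. It makes several assumptions: the class SpoolDirSketch and its method names are hypothetical, events are plain strings written one per line (the proposal itself calls for Avro serialization), and the file is deleted right after it is read, whereas a real implementation would presumably delete it only once the channel transaction commits.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper; the constructor arguments mirror "spool-dir" and "spoolfile-size".
class SpoolDirSketch {
    private final Path spoolDir;
    private final long maxFileBytes;
    private int fileSeq = 0;
    private Path current; // file currently being appended to, or null

    SpoolDirSketch(Path spoolDir, long maxFileBytes) throws IOException {
        this.spoolDir = Files.createDirectories(spoolDir);
        this.maxFileBytes = maxFileBytes;
    }

    // Append one batch (the caller caps it at "spool-batch-size" events),
    // then roll over if the file has reached the size limit.
    void spoolBatch(List<String> batch) throws IOException {
        if (current == null) {
            // Zero-padded names keep lexicographic order equal to creation order.
            current = spoolDir.resolve(String.format("spool-%06d.log", fileSeq++));
        }
        Files.write(current, batch, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        if (Files.size(current) >= maxFileBytes) {
            current = null; // the next batch starts a new file
        }
    }

    // Read back the oldest spool file in full and delete it.
    // Working on whole files means no read offset has to be persisted.
    List<String> despoolOneFile() throws IOException {
        Path oldest;
        try (Stream<Path> files = Files.list(spoolDir)) {
            oldest = files.sorted().findFirst().orElse(null);
        }
        if (oldest == null) {
            return new ArrayList<>();
        }
        List<String> events = Files.readAllLines(oldest, StandardCharsets.UTF_8);
        Files.delete(oldest);
        if (oldest.equals(current)) {
            current = null;
        }
        return events;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spool-demo");
        SpoolDirSketch spool = new SpoolDirSketch(dir, 10);
        spool.spoolBatch(Arrays.asList("aaaa", "bbbb")); // reaches the limit, so the file rolls
        spool.spoolBatch(Arrays.asList("cc"));           // lands in a second file
        System.out.println(spool.despoolOneFile());
        System.out.println(spool.despoolOneFile());
    }
}
```

Because events are stored one per line here, an event containing a newline would be split on read; a serialized container format such as the Avro files proposed in point 1 would not have that limitation.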

                
> Proposal to support disk based spooling
> ---------------------------------------
>
>                 Key: FLUME-1045
>                 URL: https://issues.apache.org/jira/browse/FLUME-1045
>             Project: Flume
>          Issue Type: New Feature
>    Affects Versions: v1.0.0
>            Reporter: Inder SIngh
>            Priority: Minor
>              Labels: patch
>
> 1. Problem Description 
> A sink that is unavailable at any stage in the pipeline backs off and retries after a while. Channels associated with such sinks start buffering data, with the caveat that, with a memory channel, this can cause a domino effect across the entire pipeline. There can be legitimate downtime, e.g. an HDFS sink being down for NameNode maintenance or Hadoop upgrades.
> 2. Why not use a durable channel (JDBC, FileChannel)?
> We want high throughput and want to support sink downtime as a first-class use case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira