Posted to dev@flume.apache.org by "Inder Singh (Commented) (JIRA)" <ji...@apache.org> on 2012/03/22 10:34:22 UTC
[jira] [Commented] (FLUME-1045) Proposal to support disk based spooling
[ https://issues.apache.org/jira/browse/FLUME-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235479#comment-13235479 ]
Inder Singh commented on FLUME-1045:
------------------------------------
Proposed Solution
------------------
Sink triggered spooling
----------------------------
When a sink goes down (or all sinks go down in a failover-policy setup), spooling of data from the channel to local disk is triggered. As soon as a commit from the channel to one of the sinks succeeds again, a de-spool from local disk back into the channel is triggered.
Proposed Implementation
---------------------------
1. SpooledFailoverSinkProcessor – extending FailoverSinkProcessor. Capabilities include triggering spool() and despool() when the sinks go down and come back up, respectively.
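A rough sketch of how such a processor could hook spool()/despool() into the failover path. The types below (Status, Sink, FailoverProcessorSketch) are simplified stand-ins so the example is self-contained, not the real org.apache.flume API, and an in-memory deque stands in for the Avro spool files on local disk:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simplified stand-ins for Flume types; not the real org.apache.flume API.
enum Status { READY, BACKOFF }

interface Sink {
    Status deliver(String event);
}

// Stand-in for FailoverSinkProcessor: try sinks in priority order.
class FailoverProcessorSketch {
    protected final List<Sink> sinks;
    FailoverProcessorSketch(List<Sink> sinks) { this.sinks = sinks; }

    Status process(String event) {
        for (Sink s : sinks) {
            if (s.deliver(event) == Status.READY) return Status.READY;
        }
        return Status.BACKOFF;   // every sink in the group failed
    }
}

// The proposed subclass: spool when all sinks are down, de-spool on recovery.
class SpooledFailoverProcessorSketch extends FailoverProcessorSketch {
    private final Deque<String> spool = new ArrayDeque<>();

    SpooledFailoverProcessorSketch(List<Sink> sinks) { super(sinks); }

    @Override
    Status process(String event) {
        if (super.process(event) == Status.BACKOFF) {
            spool.addLast(event);          // spool(): would append to an Avro file on disk
            return Status.BACKOFF;
        }
        // despool(): a sink accepted the event, so replay spooled events
        // until the spool is empty or delivery starts failing again.
        while (!spool.isEmpty()) {
            if (super.process(spool.peekFirst()) != Status.READY) break;
            spool.pollFirst();
        }
        return Status.READY;
    }

    int spoolDepth() { return spool.size(); }
}
```

In the real implementation the spool would be written and replayed inside channel transactions (point 2 below the design-choices heading) rather than held in memory.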
Some more design choices & assumptions
----------------------------------------
1. Persist Avro-serialized events on local disk, which preserves both data and headers.
2. Use channel-based transaction semantics while spooling to avoid any data loss.
3. The spool location is configurable for each SinkGroup via "spool-dir". Events will be spooled in batches controlled by "spool-batch-size". Spool files will be rolled over after they reach a size controlled by "spoolfile-size".
4. Validation to avoid misconfiguration of overlapping spool locations across SinkGroups.
5. De-spooling happens one file at a time to avoid the complexity of persisting offsets in the first cut.
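Using the key names above, the configuration might look like the following in a Flume properties file. The agent/group/sink names and the processor type string are hypothetical; the exact syntax would depend on the final patch:

```properties
# Hypothetical configuration for the proposed spooling sink group.
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = hdfsSink1 hdfsSink2
agent.sinkgroups.g1.processor.type = spooled_failover
agent.sinkgroups.g1.processor.spool-dir = /var/flume/spool/g1
agent.sinkgroups.g1.processor.spool-batch-size = 100
agent.sinkgroups.g1.processor.spoolfile-size = 67108864
```

Per point 4, each SinkGroup would need a distinct spool-dir; the validation step rejects overlapping locations.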
> Proposal to support disk based spooling
> ---------------------------------------
>
> Key: FLUME-1045
> URL: https://issues.apache.org/jira/browse/FLUME-1045
> Project: Flume
> Issue Type: New Feature
> Affects Versions: v1.0.0
> Reporter: Inder Singh
> Priority: Minor
> Labels: patch
>
> 1. Problem Description
> A sink being unavailable at any stage in the pipeline causes it to back off and retry after a while. Channels associated with such sinks start buffering data, with the caveat that a memory channel filling up can have a domino effect on the entire pipeline. There can be legitimate down times, e.g. an HDFS sink being down for name node maintenance or Hadoop upgrades.
> 2. Why not use a durable channel (JDBC, FileChannel)?
> We want high throughput and to support sink down times as a first-class use case.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira