Posted to dev@flume.apache.org by "Inder SIngh (JIRA)" <ji...@apache.org> on 2012/07/12 14:29:34 UTC

[jira] [Commented] (FLUME-1045) Proposal to support disk based spooling

    [ https://issues.apache.org/jira/browse/FLUME-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412737#comment-13412737 ] 

Inder SIngh commented on FLUME-1045:
------------------------------------

Hello guys,

I want to reopen this discussion on a totally different note.
I wanted to configure Flume to achieve a scenario like this:

1. AVROSOURCE ----> MEMORY CHANNEL ----> FAILOVERSINKPROCESSOR ----> HDFSSINK (primary)
                                                   |
                                                   +----> AVROSINK ----> FILECHANNEL ----> HDFSSINK (failover)

A detailed diagram explaining this can be found at https://docs.google.com/drawings/d/1qiCASG7YE35G9TtDjVVE_ontHeDlaYmt0MZkGS4yQGw/edit?pli=1

The configuration I used to start Flume is as follows:

### -- CHANNELS ---
# Define a memory channel called mainchannel on agent1
agent1.channels.mainchannel.type = memory
# Define a file channel called spoolchannel for disk-based spooling
agent1.channels.spoolchannel.type = file

## --- SOURCES ----
agent1.sources.seq-source.type = seq
agent1.sources.seq-source.channels = mainchannel

#backup source to run filechannel for spooling
agent1.sources.avro-source2.type = avro
agent1.sources.avro-source2.bind = 0.0.0.0
agent1.sources.avro-source2.port = 41419
agent1.sources.avro-source2.channels = spoolchannel

## ---- SINKS -----
# Sink group: primary HDFS sink, failing over to the avro sink
# that spools to the file channel

agent1.sinkgroups.group1.sinks = hdfs-sink avro-spool-sink
agent1.sinkgroups.group1.processor.type = failover
# The failover processor activates the sink with the HIGHER priority value
# first, so hdfs-sink must get the larger number to be the primary
agent1.sinkgroups.group1.processor.priority.hdfs-sink = 10
agent1.sinkgroups.group1.processor.priority.avro-spool-sink = 5

agent1.sinkgroups.group2.sinks = hdfs-spool-sink
agent1.sinkgroups.group2.processor.type = default

agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mainchannel
agent1.sinks.hdfs-sink.hdfs.path = hdfs://localhost
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.filePrefix = flume-inder3-data/

# agent1's backup sink is an avro sink, which reads from mainchannel
# when hdfs-sink is down and forwards to avro-source2, which feeds
# spoolchannel and eventually hdfs-spool-sink
agent1.sinks.avro-spool-sink.type = avro
# Connect to the local avro-source2 listener
# (0.0.0.0 is a bind address, not a connect address)
agent1.sinks.avro-spool-sink.hostname = localhost
agent1.sinks.avro-spool-sink.port = 41419
agent1.sinks.avro-spool-sink.batch-size = 100
agent1.sinks.avro-spool-sink.channel = mainchannel

# Sink to despool events from the file channel back to HDFS
agent1.sinks.hdfs-spool-sink.type = hdfs
agent1.sinks.hdfs-spool-sink.channel = spoolchannel
agent1.sinks.hdfs-spool-sink.hdfs.path = hdfs://localhost
agent1.sinks.hdfs-spool-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-spool-sink.hdfs.filePrefix = flume-inder3-data/

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.sources = seq-source avro-source2
agent1.sinkgroups = group1 group2
agent1.sinks = hdfs-spool-sink avro-spool-sink hdfs-sink
agent1.channels = mainchannel spoolchannel
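
For reference, I start the agent with console logging enabled, which usually surfaces configuration-validation errors that never make it into the log file (the conf directory and file name below are placeholders for my local setup):

```shell
# Start agent1 with DEBUG logging to the console so configuration
# errors are visible immediately.
# Paths are placeholders -- adjust to your installation.
bin/flume-ng agent \
  --conf conf \
  --conf-file conf/agent1.properties \
  --name agent1 \
  -Dflume.root.logger=DEBUG,console
```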


After starting the agent, my sequence generator source doesn't start and nothing appears in the log.
Sample log: http://pastebin.com/9CGhcTqt

I also took a thread dump (http://pastebin.com/tub26YrX); all the threads are essentially idle.

Could someone review the config to see if anything is wrong there?
Flume doesn't seem to be starting!
                
> Proposal to support disk based spooling
> ---------------------------------------
>
>                 Key: FLUME-1045
>                 URL: https://issues.apache.org/jira/browse/FLUME-1045
>             Project: Flume
>          Issue Type: New Feature
>    Affects Versions: v1.0.0
>            Reporter: Inder SIngh
>            Priority: Minor
>              Labels: patch
>         Attachments: FLUME-1045-1.patch, FLUME-1045-2.patch
>
>
> 1. Problem Description 
> A sink being unavailable at any stage in the pipeline causes it to back off and retry after a while. Channels associated with such sinks start buffering data, with the caveat that a memory channel filling up can have a domino effect on the entire pipeline. There can be legitimate down times, e.g. the HDFS sink being unavailable during name node maintenance or Hadoop upgrades. 
> 2. Why not use a durable channel (JDBC, FileChannel)?
> We want high throughput and support for sink down times as a first-class use case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira