Posted to user@flume.apache.org by Paul Chavez <pc...@verticalsearchworks.com> on 2013/03/29 21:20:21 UTC

LoadBalancing Sink Processor question

I am curious about the observed behavior of a set of agents configured with a Load Balancing sink processor.

I have 4 'tier1' agents receiving events directly from app servers that feed into 2 'tier2' agents that write to HDFS. They are connected up via Avro Sink/Sources and a Load Balancing Sink Processor.

Both 'tier2' agents write to the same directory, and I have observed that they occasionally step on each other: one of the tier2 agents 'loses' and gets hung up on a file lease exception. I'm not concerned with that at the moment, as I know it's not best practice and this is more of a pilot architecture.
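
For what it's worth, a common way to keep two HDFS sinks from racing for the same filename in a shared directory is to give each agent its own file prefix. A minimal sketch, assuming a sink named hdfs1 on each tier2 agent (the agent and sink names below are hypothetical, not taken from the actual config):

# tier2 agent on hadoop3 (hypothetical names)
tier2a.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
tier2a.sinks.hdfs1.hdfs.filePrefix = events-hadoop3

# tier2 agent on hadoop4
tier2b.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
tier2b.sinks.hdfs1.hdfs.filePrefix = events-hadoop4

With distinct prefixes the two writers create non-overlapping filenames, which should avoid the lease exception entirely.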

My concern is that once a tier2 agent gets stuck, it eventually fills its channel and stops accepting put requests through its Avro source. At that point my *expectation* is that the upstream tier1 agents will continue to round-robin across the tier2 nodes, with every other 'put' request failing. Assuming the remaining tier2 node can handle the throughput (which it can), I would not expect the tier1 agents to ever fill their channels.

In actuality, the tier1 agents slowly fill their channels and eventually start refusing put attempts from the application servers. It seems that once a given batch has been allocated to the bad sink, it never gets released to be processed by the other, working sink.

Is this the way it should work, i.e. is it a defect or by design? I will probably switch to a failover processor, since a single HDFS writer can keep up with my data, but I don't think this is working as intended.
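
For context, a sketch of the tier1 sink group under discussion, assuming an agent named tier1 with two Avro sinks k1 and k2 (all names are hypothetical, not taken from the actual config). The load_balance processor documented for Flume 1.3 has a backoff option that temporarily blacklists a failing sink rather than handing it every other batch; the failover processor mentioned above is shown commented out as the alternative:

tier1.sinkgroups = g1
tier1.sinkgroups.g1.sinks = k1 k2
tier1.sinkgroups.g1.processor.type = load_balance
tier1.sinkgroups.g1.processor.selector = round_robin
tier1.sinkgroups.g1.processor.backoff = true

# Failover alternative: always use k1, fall back to k2 only while k1 is down
# tier1.sinkgroups.g1.processor.type = failover
# tier1.sinkgroups.g1.processor.priority.k1 = 10
# tier1.sinkgroups.g1.processor.priority.k2 = 5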

thanks,
Paul


RE: LoadBalancing Sink Processor question

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
Hi Jeff,

I've attached 3 logs, the config, and 3 pictures. This is Flume 1.3 as bundled with CDH 4.2.

The three images are snapshots of Cacti graphs that poll the /metrics page of each agent every 5 minutes. They illustrate the behavior of the agents during the failure in terms of the number of events in the channel at the time of polling. In the images, hadoop3 and hadoop4 are 'tier2' and hadoop5 is 'tier1'.

The log files contain snippets from the time of failure for both hadoop3 and hadoop4. As hadoop3 was the failing node, its log is the excerpt 'flume_lease_exception_loser.log', and hadoop4's is 'flume_lease_exception_winner.log'. It's clear in the logs that they attempt to create the same filename at the same time; curiously, both log an exception, but hadoop4 does not fail. I also attached a (much larger) snippet of logs from hadoop5, a 'tier1' agent that filled its channel during the same timeframe; it spans the point when the channel filled. I attached it to see if you notice anything about why the channel filled, because as far as I can tell there are no exceptions indicating it was no longer accepting events from the app servers.
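
For anyone wanting to reproduce the graphs: the channel counts come from Flume's JSON monitoring endpoint. A minimal sketch, assuming the agents are started with HTTP monitoring enabled on port 34545 and a channel named ch1 (the port, file/agent/channel names, and the use of jq are assumptions, not details from this thread):

# Start an agent with the built-in JSON metrics server
flume-ng agent --conf conf --conf-file tier1.conf --name agent1 \
    -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

# Poll it; ChannelSize is the value the Cacti graphs track
curl -s http://hadoop5:34545/metrics | jq '.["CHANNEL.ch1"].ChannelSize'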

Thanks,
Paul Chavez



Re: LoadBalancing Sink Processor question

Posted by Jeff Lord <jl...@cloudera.com>.
Hi Paul,

Would you kindly attach the logs from both tier2 collectors where you observe the sinks occasionally stepping on each other? Can you please also attach your Flume config and note the version of flume-ng?

Best,

Jeff



RE: LoadBalancing Sink Processor question

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
JR,

It looks like you have your source and sink IP configurations flipped.

agent1.sinks.avroSink.hostname = 0.0.0.0
agent2.sources.avroSource2.bind = node2 ip

The sink should always have a 'real' IP: it isn't binding to a port, so its hostname is the address of a remote node. The source is probably what you wanted to bind to 0.0.0.0, like you did with the agent1 source.
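
A minimal corrected sketch of the two hops, with 10.0.0.12 standing in as a placeholder for node2's real address (note also that a source takes the plural 'channels' property, while a sink takes the singular 'channel'):

# agent1 (node1): the Avro sink points at the remote node, not 0.0.0.0
# (10.0.0.12 is a placeholder for node2's reachable address)
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = 10.0.0.12
agent1.sinks.avroSink.port = 41415
agent1.sinks.avroSink.channel = ch1

# agent2 (node2): the Avro source binds locally; 0.0.0.0 listens on all interfaces
agent2.sources.avroSource2.type = avro
agent2.sources.avroSource2.bind = 0.0.0.0
agent2.sources.avroSource2.port = 41415
agent2.sources.avroSource2.channels = ch2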

Hope that helps,
Paul



Re: LoadBalancing Sink Processor question

Posted by JR <ma...@gmail.com>.
Hi Paul,

   I apologize that I am not giving you a solution, but I in turn have a question about your Avro sink to tier2 Avro source.

   Could you please share your conf file? I have configured the sink and source as follows, but I still get an RPC connection failure.

If you have had success, could you please tell me how you got yours to work?

What are the commands / shell scripts you used to connect tier1 --> tier2 --> HDFS?

Thanks!


Avro source ---> mem channel ---> Avro sink ---> (next node) Avro source ---> mem channel ---> HDFS sink

# agent1 on node1
agent1.sources = avroSource
agent1.channels = ch1
agent1.sinks = avroSink

# agent2 on node2
agent2.sources = avroSource2
agent2.channels = ch2
agent2.sinks = hdfsSink

# first source - avro
agent1.sources.avroSource.type = avro
agent1.sources.avroSource.bind = 0.0.0.0
agent1.sources.avroSource.port = 41414
agent1.sources.avroSource.channels = ch1

# first sink - avro
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = 0.0.0.0
agent1.sinks.avroSink.port = 41415
agent1.sinks.avroSink.channel = ch1

# second source - avro
agent2.sources.avroSource2.type = avro
agent2.sources.avroSource2.bind = node2 ip
agent2.sources.avroSource2.port = 41415
agent2.sources.avroSource2.channel = ch2

# second sink - hdfs
agent2.sinks.hdfsSink.type = hdfs
agent2.sinks.hdfsSink.channel = ch2
agent2.sinks.hdfsSink.hdfs.writeFormat = Text
agent2.sinks.hdfsSink.hdfs.filePrefix = testing
agent2.sinks.hdfsSink.hdfs.path = hdfs://node2:9000/flume/

# channels
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent2.channels.ch2.type = memory
agent2.channels.ch2.capacity = 1000


I am getting errors with the ports. Could someone please check whether I have connected the sink on node1 to the source on node2 properly?

13/03/24 04:32:16 INFO source.AvroSource: Starting Avro source avroSource: { bindAddress: 0.0.0.0, port: 41414 }...
13/03/24 04:32:16 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: SINK, name: avroSink, registered successfully.
13/03/24 04:32:16 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: avroSink started
13/03/24 04:32:16 INFO sink.AvroSink: Avro sink avroSink: Building RpcClient with hostname: 0.0.0.0, port: 41415
13/03/24 04:32:16 WARN sink.AvroSink: Unable to create avro client using hostname: 0.0.0.0, port: 41415
org.apache.flume.FlumeException: NettyAvroRpcClient { host: 0.0.0.0, port: 41415 }: RPC connection error
        at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:117)
        at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:93)
        at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:507)
        at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
        at org.apache.flume.sink.AvroSink.createConnection(AvroSink.java:182)
        at org.apache.flume.sink.AvroSink.start(AvroSink.java:242)
        at org.apache.flume.sink.DefaultSinkProcessor.start(DefaultSinkProcessor.java:46)
        at org.apache.flume.SinkRunner.start(SinkRunner.java:79)
        at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:236)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:452)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:328)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:161)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:109)
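
The RPC connection error above generally just means the sink could not reach a listening Avro source at the configured hostname and port. A sketch of how the two agents are typically launched and how the hop can be sanity-checked (the config file names, conf directory, and test file path are assumptions; <node2-address> is a placeholder):

# Start the downstream agent first so its Avro source is listening
flume-ng agent --conf conf --conf-file agent2.conf --name agent2 -Dflume.root.logger=INFO,console

# Then start the upstream agent whose Avro sink connects to it
flume-ng agent --conf conf --conf-file agent1.conf --name agent1 -Dflume.root.logger=INFO,console

# Send a test file straight at the downstream Avro source to verify the port is reachable
flume-ng avro-client -H <node2-address> -p 41415 -F /tmp/test-events.txt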


