Posted to user@flume.apache.org by Chris Neal <cw...@gmail.com> on 2013/04/16 20:38:00 UTC

Re: flume.EventDeliveryException: Failed to send events

I'm seeing the same thing :)

Mine is all on a local LAN though, so the fact that an RPC call doesn't
reply in 10000ms or 20000ms is quite odd.  My configuration is for the most
part the same as Denis': a two-tiered system, with ExecSources running
tail -F on log files feeding an AvroSink, which sends to an AvroSource that
loads into HDFS on the back tier.
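
In other words, each pipeline looks roughly like this (a minimal sketch with
placeholder agent and component names, not my actual config):

# tier 1: tail the application log and forward over Avro
tier1.sources = tail-src
tier1.channels = ch1
tier1.sinks = avro-out
tier1.sources.tail-src.type = exec
tier1.sources.tail-src.command = tail -F /var/log/myapp/myapp.log
tier1.sources.tail-src.channels = ch1
tier1.channels.ch1.type = file
tier1.sinks.avro-out.type = avro
tier1.sinks.avro-out.hostname = collector.example.com
tier1.sinks.avro-out.port = 10000
tier1.sinks.avro-out.channel = ch1

# tier 2: receive Avro and load into HDFS
tier2.sources = avro-in
tier2.channels = ch1
tier2.sinks = hdfs-out
tier2.sources.avro-in.type = avro
tier2.sources.avro-in.bind = 0.0.0.0
tier2.sources.avro-in.port = 10000
tier2.sources.avro-in.channels = ch1
tier2.channels.ch1.type = file
tier2.sinks.hdfs-out.type = hdfs
tier2.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/events
tier2.sinks.hdfs-out.channel = ch1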

I, too, see this on the AvroSink.

Either (A):
[2013-04-15 23:57:14.827]
[org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
[SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
 (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
event. Attempting next sink if available.
org.apache.flume.EventDeliveryException: Failed to send events
        at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
        at
org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
        at
org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
        at
org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
        at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
        ... 3 more
Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
host: hadoopjt01.pegs.com, port: 10000 }: Handshake timed out after 20000ms
        at
org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
        at
org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
        ... 4 more
Caused by: java.util.concurrent.TimeoutException
        at
java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
        at java.util.concurrent.FutureTask.get(FutureTask.java:91)
        at
org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
        ... 5 more

or (B):
[2013-04-15 19:49:01.135]
[org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
[SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
 (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
event. Attempting next sink if available.
org.apache.flume.EventDeliveryException: Failed to send events
        at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
        at
org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
        at
org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
        at
org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
        at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
        ... 3 more
Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
host: hadoopjt01.pegs.com, port: 10000 }: RPC request timed out
        at
org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
        at
org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
        at
org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
        ... 4 more
Caused by: java.util.concurrent.TimeoutException
        at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
        at
org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
        ... 6 more

The only thing I see on the AvroSource tier is the disconnect/reconnect
happening:

[2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
[pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] DISCONNECTED
[2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
[pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] UNBOUND
[2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
[pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] CLOSED
[2013-04-15 19:49:00.993] [org.apache.avro.ipc.NettyServer] [ INFO]
[pool-11-thread-10] []  (NettyServer.java:channelClosed:209) Connection to /
138.113.127.4:34481 disconnected.
[2013-04-15 19:49:03.331] [org.apache.avro.ipc.NettyServer] [ INFO]
[pool-10-thread-1] []  (NettyServer.java:handleUpstream:171) [id:
0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] OPEN
[2013-04-15 19:49:03.332] [org.apache.avro.ipc.NettyServer] [ INFO]
[pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] BOUND: /
138.113.127.72:10000
[2013-04-15 19:49:03.333] [org.apache.avro.ipc.NettyServer] [ INFO]
[pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] CONNECTED: /
138.113.127.4:62442

Is there some way to determine exactly where this bottleneck is?  The
config options for AvroSource/AvroSink are fairly limited, so there is not
much tuning to do.  Here is what I have:

# avro-hadoopjt01-sink properties
udprodae01.sinks.avro-hadoopjt01-sink.type = avro
udprodae01.sinks.avro-hadoopjt01-sink.hostname = hadoopjt01.pegs.com
udprodae01.sinks.avro-hadoopjt01-sink.port = 10000
udprodae01.sinks.avro-hadoopjt01-sink.batch-size = 100

# avro-hadoopjt01-source properties
hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com
hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64

I can try increasing the AvroSink timeout values, but they seem quite
adequate at the defaults.  Maybe more threads on the AvroSource?
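
For reference, these are the only sink-side timeout knobs I could raise (a
sketch only - 40000 is just an example value, not something I've settled on):

udprodae01.sinks.avro-hadoopjt01-sink.connect-timeout = 40000
udprodae01.sinks.avro-hadoopjt01-sink.request-timeout = 40000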

Any advice would be much appreciated!
Chris


On Wed, Feb 6, 2013 at 12:10 PM, Denis Lowe <de...@gmail.com> wrote:

> I noticed that the logs in the destination server were reporting dropped
> connections from the upstream server:
>
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] DISCONNECTED
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] UNBOUND
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] CLOSED
> NettyServer$NettyServerAvroHandler.channelClosed:209)  - Connection to
> /SOURCE_HOST:43599 disconnected.
>
> The other thing I observed is that these errors only ever occur from our
> EU servers (connecting to our centralized downstream collectors in US West).
> We are running Flume in Amazon EC2.
>
> I can see in the log that the connection is restored quite quickly.
>
> I gather that the network latency between Europe and US West is causing
> the connection between the 2 servers to 'appear' lost, thus resulting in
> the above errors?
>
> Are there any recommended config settings to compensate for this?
>
> On 6 February 2013 00:21, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>>  Is there anything unusual in the logs for the destination (AvroSource)
>> server?
>>
>> Since the error is happening in the AvroSink, no data is getting lost.
>> The failed data will get rolled back, its removal from the local channel is
>> cancelled, and the sink will attempt to resend it.
>>
>>
>> On 02/06/2013 03:23 PM, Denis Lowe wrote:
>>
>> We are running Flume-NG 1.3.1 and have periodically noticed the following
>> ERROR occurring (a few times daily):
>>
>>  We are using the File Channel connecting to 2 downstream collector
>> agents in 'round_robin' mode, using avro source/sinks.
>>
>>  We are using the config described below to deliver 5 different log
>> types (to 5 different ports downstream) and have observed the below
>> error occurring randomly across all the ports.
>>
>>  We tried doubling the connect-timeout to 40000 (from the default of
>> 20000) with no success.
>> The agent appears to recover and keep on processing data.
>>
>>  My question is:
>> Has this data been lost, or will Flume eventually retry until a
>> successful delivery has been made?
>> Are there any other config changes I can make to prevent or reduce
>> this from occurring in the future?
>>
>>  05 Feb 2013 23:12:21,650 ERROR
>> [SinkRunner-PollingRunner-DefaultSinkProcessor]
>> (org.apache.flume.SinkRunner$PollingRunner.run:160)  - Unable to deliver
>> event. Exception follows.
>> org.apache.flume.EventDeliveryException: Failed to send events
>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
>>         at
>> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>>         at
>> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>         at java.lang.Thread.run(Thread.java:662)
>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
>> host: collector1, port: 4003 }: Failed to send batch
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
>>         ... 3 more
>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
>> host: collector2, port: 4003 }: RPC request timed out
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>         ... 4 more
>> Caused by: java.util.concurrent.TimeoutException
>>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>>         ... 6 more
>>
>>  Below is a snapshot of the current config:
>>
>>  agent.sources.eventdata.command = tail -qn +0 -F
>> /var/log/event-logs/live/eventdata.log
>>  agent.sources.eventdata.channels = eventdata-avro-channel
>> agent.sources.eventdata.batchSize = 10
>> agent.sources.eventdata.restart = true
>>
>>  ## event source interceptor
>> agent.sources.eventdata.interceptors.host-interceptor.type =
>> org.apache.flume.interceptor.HostInterceptor$Builder
>> agent.sources.eventdata.interceptors.host-interceptor.hostHeader =
>> source-host
>> agent.sources.eventdata.interceptors.host-interceptor.useIP = false
>>
>>  ## eventdata channel
>> agent.channels.eventdata-avro-channel.type = file
>> agent.channels.eventdata-avro-channel.checkpointDir =
>> /mnt/flume-ng/checkpoint/eventdata-avro-channel
>> agent.channels.eventdata-avro-channel.dataDirs =
>> /mnt/flume-ng/data1/eventdata-avro-channel,/mnt/flume-ng/data2/eventdata-avro-channel
>> agent.channels.eventdata-avro-channel.maxFileSize = 210000000
>> agent.channels.eventdata-avro-channel.capacity = 2000000
>> agent.channels.eventdata-avro-channel.checkpointInterval = 300000
>> agent.channels.eventdata-avro-channel.transactionCapacity = 10000
>> agent.channels.eventdata-avro-channel.keep-alive = 20
>> agent.channels.eventdata-avro-channel.write-timeout = 20
>>
>>  ## 2 x downstream click sinks for load balancing and failover
>>  agent.sinks.eventdata-avro-sink-1.type = avro
>> agent.sinks.eventdata-avro-sink-1.channel = eventdata-avro-channel
>> agent.sinks.eventdata-avro-sink-1.hostname = collector1
>> agent.sinks.eventdata-avro-sink-1.port = 4003
>> agent.sinks.eventdata-avro-sink-1.batch-size = 100
>> agent.sinks.eventdata-avro-sink-1.connect-timeout = 40000
>>
>>  agent.sinks.eventdata-avro-sink-2.type = avro
>> agent.sinks.eventdata-avro-sink-2.channel = eventdata-avro-channel
>> agent.sinks.eventdata-avro-sink-2.hostname = collector2
>> agent.sinks.eventdata-avro-sink-2.port = 4003
>> agent.sinks.eventdata-avro-sink-2.batch-size = 100
>> agent.sinks.eventdata-avro-sink-2.connect-timeout = 40000
>>
>>  ## load balance config
>> agent.sinkgroups = eventdata-avro-sink-group
>> agent.sinkgroups.eventdata-avro-sink-group.sinks = eventdata-avro-sink-1
>> eventdata-avro-sink-2
>> agent.sinkgroups.eventdata-avro-sink-group.processor.type = load_balance
>> agent.sinkgroups.eventdata-avro-sink-group.processor.selector =
>> round_robin
>>
>>  Denis.
>>
>>
>>
>

Re: flume.EventDeliveryException: Failed to send events

Posted by Chris Neal <cw...@gmail.com>.
Yeah, I can script something up.  Maybe increasing the timeouts was all it
needed and you'll never see another reply to this thread... ;)  But, if
not, I'll post the thread dump next.
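
Something along these lines would probably do it (a rough sketch only; the
log path and the process match are guesses for my boxes, and I'd run one copy
on each agent):

#!/bin/sh
# Rough sketch: tail the agent log and grab a jstack whenever the delivery
# error shows up.  LOG and the pgrep pattern are assumptions for our setup.
LOG=/var/log/flume-ng/flume.log
PATTERN="Unable to deliver event. Exception follows."
# (On my tier I'd probably also match the "Sink failed to consume event" WARN.)
tail -F "$LOG" | while read -r line; do
  case "$line" in
    *"$PATTERN"*)
      PID=$(pgrep -f org.apache.flume.node.Application | head -1)
      [ -n "$PID" ] && jstack "$PID" > "/tmp/flume-jstack-$(date +%s).txt"
      ;;
  esac
done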

Thanks again for all the suggestions.
Chris


On Tue, Apr 16, 2013 at 2:57 PM, Brock Noland <br...@cloudera.com> wrote:

> Any chance of getting a thread dump (or a few) on both the source and
> destination agent during an incident? :) It'd be a little work, but a
> script could look for "Unable to deliver event. Exception follows." and
> do the thread dumps.
>
>
> On Tue, Apr 16, 2013 at 2:49 PM, Chris Neal <cw...@gmail.com> wrote:
>
>> Thanks for all the input guys. :)
>>
>> @Hari:
>> The FileChannel at the AvroSource is on a SAN disk, so I don't think it
>> is the bottleneck in my case.  It is the same disk for both checkpoint and
>> data.  My queue depth remains relatively stable around 2000, which doesn't
>> bother me because of my batch size.
>>
>> On the AvroSink side, I have 126 ExecSources spread across 12 JVMs at 1GB
>> heap each.  Each VM has 4 AvroSinks across 2 separate servers, load
>> balanced and round robined (2 connections to each).  Is that a small enough
>> number of connections to remove the thread parameter on the AvroSource?
>>
>> @Brock
>> Each of my VMs on the downstream agents has a 4GB heap.  I watch the
>> MBeans pretty closely via jconsole, and they sit around 3GB used.
>>
>> Thanks again for all the help!
>>
>>
>> On Tue, Apr 16, 2013 at 2:26 PM, Brock Noland <br...@cloudera.com> wrote:
>>
>>> Another possibility is that the downstream agent is near capacity from a
>>> memory perspective. What is your heap size for these agents?
>>>
>>>
>>> On Tue, Apr 16, 2013 at 2:19 PM, Hari Shreedharan <
>>> hshreedharan@cloudera.com> wrote:
>>>
>>>>  One possibility is that you are hitting the file channel disk too hard
>>>> - do you have just one disk for checkpoint and data? It might be getting
>>>> slow because of this? Also you should probably just remove the thread limit
>>>> on AvroSource. It usually does not cause too much havoc unless you have a
>>>> massive number of connections causing too many threads.
>>>>
>>>> --
>>>> Hari Shreedharan
>>>>
>>>> On Tuesday, April 16, 2013 at 12:07 PM, Chris Neal wrote:
>>>>
>>>> Thanks Hari.
>>>>
>>>> I increased both the connect and request timeouts to 40000ms, and I'm
>>>> testing that now.  I am talking on a LAN though, which is part of the
>>>> reason I'm concerned.  Seems like it might not actually be a network issue,
>>>> but perhaps an overloaded AvroSource on the back end?
>>>>
>>>>
>>>> On Tue, Apr 16, 2013 at 1:52 PM, Hari Shreedharan <
>>>> hshreedharan@cloudera.com> wrote:
>>>>
>>>>  Looks like you are hitting Avro IPC timeouts - you should probably
>>>> increase them, especially if you are talking over a WAN.
>>>>
>>>> --
>>>> Hari Shreedharan
>>>>

Re: flume.EventDeliveryException: Failed to send events

Posted by Brock Noland <br...@cloudera.com>.
Any chance of getting a thread dump (or a few) on both the source and
destination agent during an incident? :) It'd be a little work, but a
script could look for "Unable to deliver event. Exception follows." and do
the thread dumps.




-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org

Re: flume.EventDeliveryException: Failed to send events

Posted by Chris Neal <cw...@gmail.com>.
Thanks for all the input guys. :)

@Hari:
The FileChannel at the AvroSource is on a SAN disk, so I don't think it is
the bottleneck in my case.  It is the same disk for both checkpoint and
data.  My queue depth remains relatively stable around 2000, which doesn't
bother me because of my batch size.

On the AvroSink side, I have 126 ExecSources spread across 12 JVMs at 1GB
heap each.  Each VM has 4 AvroSinks across 2 separate servers, load
balanced and round robined (2 connections to each).  Is that a small enough
number of connections to remove the thread parameter on the AvroSource?

@Brock
Each of my VMs on the downstream agents has a 4GB heap.  I watch the MBeans
pretty closely via jconsole, and they sit around 3GB used.
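
(For reference, the downstream agents just run with the stock JMX options so
jconsole can attach remotely - roughly this in flume-env.sh, where the heap
size and port are illustrative:)

JAVA_OPTS="-Xms4g -Xmx4g \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=5445 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"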

Thanks again for all the help!


On Tue, Apr 16, 2013 at 2:26 PM, Brock Noland <br...@cloudera.com> wrote:

> Another possibility is that the downstream agent is near capacity from a
> memory perspective. What is your heap size for these agents?
>
>
> On Tue, Apr 16, 2013 at 2:19 PM, Hari Shreedharan <
> hshreedharan@cloudera.com> wrote:
>
>>  One possibility is that you are hitting the file channel disk too hard -
>> do you have just one disk for checkpoint and data? It might be getting slow
>> because of this? Also you should probably just remove the thread limit on
>> AvroSource. It usually does not cause too much havoc unless you have a
>> massive number of connections causing too many threads.
>>
>> --
>> Hari Shreedharan
>>
>> On Tuesday, April 16, 2013 at 12:07 PM, Chris Neal wrote:
>>
>> Thanks Hari.
>>
>> I increased both the connect and request timeouts to 40000ms, and I'm
>> testing that now.  I am talking on a LAN though, which is part of the
>> reason I'm concerned.  Seems like it might not actually be a network issue,
>> but perhaps an overloaded AvroSource on the back end?
>>
>>
>> On Tue, Apr 16, 2013 at 1:52 PM, Hari Shreedharan <
>> hshreedharan@cloudera.com> wrote:
>>
>>  Looks like you are hitting Avro IPC timeouts - you should probably
>> increase it, especially if you are talking over WAN.
>>
>> --
>> Hari Shreedharan
>>
>> On Tuesday, April 16, 2013 at 11:38 AM, Chris Neal wrote:
>>
>> I'm seeing the same thing :)
>>
>> Mine is all on a local LAN though, so the fact that an RPC call doesn't
>> reply in 10000ms or 20000ms is quite odd.  My configuration is for the most
>> part the same as Denis' configuration.  Two tiered system, ExecSources
>> running tail -F on log files to an AvroSink, to an AvroSource, loading into
>> HDFS on the back tier.
>>
>> I, too, see this on the AvroSink
>>
>> Either (A):
>> [2013-04-15 23:57:14.827]
>> [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
>> [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
>>  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
>> event. Attempting next sink if available.
>> org.apache.flume.EventDeliveryException: Failed to send events
>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>>         at
>> org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>>         at
>> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>         at java.lang.Thread.run(Thread.java:619)
>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
>> host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>>         ... 3 more
>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
>> host: hadoopjt01.pegs.com, port: 10000 }: Handshake timed out after
>> 20000ms
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>         ... 4 more
>> Caused by: java.util.concurrent.TimeoutException
>>         at
>> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
>>         at java.util.concurrent.FutureTask.get(FutureTask.java:91)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
>>         ... 5 more
>>
>> or (B):
>> [2013-04-15 19:49:01.135]
>> [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
>> [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
>>  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
>> event. Attempting next sink if available.
>> org.apache.flume.EventDeliveryException: Failed to send events
>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>>         at
>> org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>>         at
>> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>         at java.lang.Thread.run(Thread.java:619)
>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
>> host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>>         ... 3 more
>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
>> host: hadoopjt01.pegs.com, port: 10000 }: RPC request timed out
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>         ... 4 more
>> Caused by: java.util.concurrent.TimeoutException
>>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>>         ... 6 more
>>
>> The only thing I see on the AvroSource tier is the disconnect/reconnect
>> happening:
>>
>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
>> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
>> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] DISCONNECTED
>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
>> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
>> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] UNBOUND
>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
>> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
>> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] CLOSED
>> [2013-04-15 19:49:00.993] [org.apache.avro.ipc.NettyServer] [ INFO]
>> [pool-11-thread-10] []  (NettyServer.java:channelClosed:209) Connection to /
>> 138.113.127.4:34481 disconnected.
>> [2013-04-15 19:49:03.331] [org.apache.avro.ipc.NettyServer] [ INFO]
>> [pool-10-thread-1] []  (NettyServer.java:handleUpstream:171) [id:
>> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] OPEN
>> [2013-04-15 19:49:03.332] [org.apache.avro.ipc.NettyServer] [ INFO]
>> [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
>> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] BOUND: /
>> 138.113.127.72:10000
>> [2013-04-15 19:49:03.333] [org.apache.avro.ipc.NettyServer] [ INFO]
>> [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
>> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] CONNECTED: /
>> 138.113.127.4:62442
>>
>> Is there some way to determine exactly where this bottleneck is?  The
>> config options for AvroSource/Sink are quite short, so there is not much
>> tuning to do.  Here is what I have:
>>
>> # avro-hadoopjt01-sink properties
>> udprodae01.sinks.avro-hadoopjt01-sink.type = avro
>> udprodae01.sinks.avro-hadoopjt01-sink.hostname = hadoopjt01.pegs.com
>> udprodae01.sinks.avro-hadoopjt01-sink.port = 10000
>> udprodae01.sinks.avro-hadoopjt01-sink.batch-size = 100
>>
>> # avro-hadoopjt01-source properties
>> hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
>> hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com
>> hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
>> hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64
>>
>> I can try increasing the AvroSink timeout values, but they seem quite
>> adequate at the defaults.  Maybe more threads on the AvroSource?
>>
>> Any advice would be much appreciated!
>> Chris
>>
>>
>> On Wed, Feb 6, 2013 at 12:10 PM, Denis Lowe <de...@gmail.com> wrote:
>>
>> I noticed that the logs in the destination server were reporting dropped
>> connections from the upstream server:
>>
>> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id:
>> 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] DISCONNECTED
>> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id:
>> 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] UNBOUND
>> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id:
>> 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] CLOSED
>> NettyServer$NettyServerAvroHandler.channelClosed:209)  - Connection to
>> /SOURCE_HOST:43599 disconnected.
>>
>> The other thing I observed is that these errors only ever occur from our
>> EU servers (connecting to our centralized downstream collectors in US West)
>> - We are running Flume in Amazon EC2
>>
>> I can see in the log that the connection is restored quite quickly.
>>
>> I gather that the network latency between Europe and US West is causing
>> the connection between the 2 servers to 'appear' lost, thus resulting in
>> the above errors?
>>
>> Are there any recommended config settings to compensate for this?
>>
>> On 6 February 2013 00:21, Juhani Connolly <
>> juhani_connolly@cyberagent.co.jp> wrote:
>>
>>  Is there anything unusual in the logs for the destination(avroSource)
>> server
>>
>> Since the error is happening in the AvroSink, no data is getting lost.
>> The failed data will get rolled back, removal from the local channel is
>> cancelled, and it will attempt to resend it.
>>
>>
>> On 02/06/2013 03:23 PM, Denis Lowe wrote:
>>
>> We are running Flume-NG 1.3.1 and have noticed periodically the following
>> ERROR occurring (a few times daily):
>>
>>  We are using the File Channel connecting to 2 downstream collector
>> agents in 'round_robin' mode, using avro source/sinks.
>>
>>  We are using the config described below to deliver 5 different log
>> types (to 5 different ports downstream) and have observed the below
>> error occurring randomly across all the ports.
>>
>>  We tried doubling the connect-timeout to 40000 (from the default of
>> 20000) with no success.
>> The agent appears to recover and keep on processing data.
>>
>>  My question is:
>> Has this data been lost? or will flume eventually retry until a
>> successfull delivery has been made?
>> Are there any other config changes I can make to prevent/reduce
>> this occurring in the future?
>>
>>  05 Feb 2013 23:12:21,650 ERROR
>> [SinkRunner-PollingRunner-DefaultSinkProcessor]
>> (org.apache.flume.SinkRunner$PollingRunner.run:160)  - Unable to deliver
>> event. Exception follows.
>> org.apache.flume.EventDeliveryException: Failed to send events
>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
>>         at
>> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>>         at
>> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>         at java.lang.Thread.run(Thread.java:662)
>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
>> host: collector1, port: 4003 }: Failed to send batch
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
>>         ... 3 more
>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
>> host: collector2, port: 4003 }: RPC request timed out
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>         ... 4 more
>> Caused by: java.util.concurrent.TimeoutException
>>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>>         at
>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>>         ... 6 more
>>
>>  Below is a snapshot of the current config:
>>
>>  agent.sources.eventdata.command = tail -qn +0 -F
>> /var/log/event-logs/live/eventdata.log
>>  agent.sources.eventdata.channels = eventdata-avro-channel
>> agent.sources.eventdata.batchSize = 10
>> agent.sources.eventdata.restart = true
>>
>>  ## event source interceptor
>> agent.sources.eventdata.interceptors.host-interceptor.type =
>> org.apache.flume.interceptor.HostInterceptor$Builder
>> agent.sources.eventdata.interceptors.host-interceptor.hostHeader =
>> source-host
>> agent.sources.eventdata.interceptors.host-interceptor.useIP = false
>>
>>  ## eventdata channel
>> agent.channels.eventdata-avro-channel.type = file
>> agent.channels.eventdata-avro-channel.checkpointDir =
>> /mnt/flume-ng/checkpoint/eventdata-avro-channel
>> agent.channels.eventdata-avro-channel.dataDirs =
>> /mnt/flume-ng/data1/eventdata-avro-channel,/mnt/flume-ng/data2/eventdata-avro-channel
>> agent.channels.eventdata-avro-channel.maxFileSize = 210000000
>> agent.channels.eventdata-avro-channel.capacity = 2000000
>> agent.channels.eventdata-avro-channel.checkpointInterval = 300000
>> agent.channels.eventdata-avro-channel.transactionCapacity = 10000
>> agent.channels.eventdata-avro-channel.keep-alive = 20
>> agent.channels.eventdata-avro-channel.write-timeout = 20
>>
>>  ## 2 x downstream click sinks for load balancing and failover
>>  agent.sinks.eventdata-avro-sink-1.type = avro
>> agent.sinks.eventdata-avro-sink-1.channel = eventdata-avro-channel
>> agent.sinks.eventdata-avro-sink-1.hostname = collector1
>> agent.sinks.eventdata-avro-sink-1.port = 4003
>> agent.sinks.eventdata-avro-sink-1.batch-size = 100
>> agent.sinks.eventdata-avro-sink-1.connect-timeout = 40000
>>
>>  agent.sinks.eventdata-avro-sink-2.type = avro
>> agent.sinks.eventdata-avro-sink-2.channel = eventdata-avro-channel
>> agent.sinks.eventdata-avro-sink-2.hostname = collector2
>> agent.sinks.eventdata-avro-sink-2.port = 4003
>> agent.sinks.eventdata-avro-sink-2.batch-size = 100
>> agent.sinks.eventdata-avro-sink-2.connect-timeout = 40000
>>
>>  ## load balance config
>> agent.sinkgroups = eventdata-avro-sink-group
>> agent.sinkgroups.eventdata-avro-sink-group.sinks = eventdata-avro-sink-1
>> eventdata-avro-sink-2
>> agent.sinkgroups.eventdata-avro-sink-group.processor.type = load_balance
>> agent.sinkgroups.eventdata-avro-sink-group.processor.selector =
>> round_robin
>>
>>  Denis.
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>

Re: flume.EventDeliveryException: Failed to send events

Posted by Brock Noland <br...@cloudera.com>.
Another possibility is that the downstream agent is near capacity from a
memory perspective. What is your heap size for these agents?
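
For reference, a minimal sketch of where that heap size is normally set,
assuming the agents are started with the stock bin/flume-ng script and a
conf/flume-env.sh (the -Xms/-Xmx values below are illustrative placeholders,
not recommendations):

# conf/flume-env.sh -- sourced by bin/flume-ng at startup
# heap sizes here are hypothetical; size them for your channel capacity and batch sizes
export JAVA_OPTS="-Xms1g -Xmx2g"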


On Tue, Apr 16, 2013 at 2:19 PM, Hari Shreedharan <hshreedharan@cloudera.com
> wrote:

>  One possibility is that you are hitting the file channel disk too hard -
> do you have just one disk for checkpoint and data? It might be getting slow
> because of this? Also you should probably just remove the thread limit on
> AvroSource. It usually does not cause too much havoc unless you have a
> massive number of connections causing too many threads.
>
> --
> Hari Shreedharan
>
> On Tuesday, April 16, 2013 at 12:07 PM, Chris Neal wrote:
>
> Thanks Hari.
>
> I increased both the connect and request timeouts to 40000ms, and I'm
> testing that now.  I am talking on a LAN though, which is part of the
> reason I'm concerned.  Seems like it might not actually be a network issue,
> but perhaps an overloaded AvroSource on the back end?
>
>
> On Tue, Apr 16, 2013 at 1:52 PM, Hari Shreedharan <
> hshreedharan@cloudera.com> wrote:
>
>  Looks like you are hitting Avro IPC timeouts - you should probably
> increase it, especially if you are talking over WAN.
>
> --
> Hari Shreedharan
>
> On Tuesday, April 16, 2013 at 11:38 AM, Chris Neal wrote:
>
> I'm seeing the same thing :)
>
> Mine is all on a local LAN though, so the fact that an RPC call doesn't
> reply in 10000ms or 20000ms is quite odd.  My configuration is for the most
> part the same as Denis' configuration.  Two tiered system, ExecSources
> running tail -F on log files to an AvroSink, to an AvroSource, loading into
> HDFS on the back tier.
>
> I, too, see this on the AvroSink
>
> Either (A):
> [2013-04-15 23:57:14.827]
> [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
> [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
>  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
> event. Attempting next sink if available.
> org.apache.flume.EventDeliveryException: Failed to send events
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>         at
> org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>         at
> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>         ... 3 more
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: hadoopjt01.pegs.com, port: 10000 }: Handshake timed out after
> 20000ms
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException
>         at
> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:91)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
>         ... 5 more
>
> or (B):
> [2013-04-15 19:49:01.135]
> [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
> [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
>  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
> event. Attempting next sink if available.
> org.apache.flume.EventDeliveryException: Failed to send events
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>         at
> org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>         at
> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>         ... 3 more
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: hadoopjt01.pegs.com, port: 10000 }: RPC request timed out
>         at
> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>         at
> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>         ... 6 more
>
> The only thing I see on the AvroSource tier is the disconnect/reconnect
> happening:
>
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] DISCONNECTED
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] UNBOUND
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] CLOSED
> [2013-04-15 19:49:00.993] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-10] []  (NettyServer.java:channelClosed:209) Connection to /
> 138.113.127.4:34481 disconnected.
> [2013-04-15 19:49:03.331] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-10-thread-1] []  (NettyServer.java:handleUpstream:171) [id:
> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] OPEN
> [2013-04-15 19:49:03.332] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] BOUND: /
> 138.113.127.72:10000
> [2013-04-15 19:49:03.333] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] CONNECTED: /
> 138.113.127.4:62442
>
> Is there some way to determine exactly where this bottleneck is?  The
> config options for AvroSource/Sink are quite short, so there is not much
> tuning to do.  Here is what I have:
>
> # avro-hadoopjt01-sink properties
> udprodae01.sinks.avro-hadoopjt01-sink.type = avro
> udprodae01.sinks.avro-hadoopjt01-sink.hostname = hadoopjt01.pegs.com
> udprodae01.sinks.avro-hadoopjt01-sink.port = 10000
> udprodae01.sinks.avro-hadoopjt01-sink.batch-size = 100
>
> # avro-hadoopjt01-source properties
> hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
> hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com
> hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
> hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64
>
> I can try increasing the AvroSink timeout values, but they seem quite
> adequate at the defaults.  Maybe more threads on the AvroSource?
>
> Any advice would be much appreciated!
> Chris
>
>
> On Wed, Feb 6, 2013 at 12:10 PM, Denis Lowe <de...@gmail.com> wrote:
>
> I noticed that the logs in the destination server were reporting dropped
> connections from the upstream server:
>
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] DISCONNECTED
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] UNBOUND
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] CLOSED
> NettyServer$NettyServerAvroHandler.channelClosed:209)  - Connection to
> /SOURCE_HOST:43599 disconnected.
>
> The other thing I observed is that these errors only ever occur from our
> EU servers (connecting to our centralized downstream collectors in US West)
> - We are running Flume in Amazon EC2
>
> I can see in the log that the connection is restored quite quickly.
>
> I gather that the network latency between Europe and US West is causing
> the connection between the 2 servers to 'appear' lost, thus resulting in
> the above errors?
>
> Are there any recommended config settings to compensate for this?
>
> On 6 February 2013 00:21, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>  Is there anything unusual in the logs for the destination (AvroSource)
> server?
>
> Since the error is happening in the AvroSink, no data is getting lost. The
> failed data will get rolled back, removal from the local channel is
> cancelled, and it will attempt to resend it.
>
>
> On 02/06/2013 03:23 PM, Denis Lowe wrote:
>
> We are running Flume-NG 1.3.1 and have noticed periodically the following
> ERROR occurring (a few times daily):
>
>  We are using the File Channel connecting to 2 downstream collector
> agents in 'round_robin' mode, using avro source/sinks.
>
>  We are using the config described below to deliver 5 different log types
> (to 5 different ports downstream) and have observed the below
> error occurring randomly across all the ports.
>
>  We tried doubling the connect-timeout to 40000 (from the default of
> 20000) with no success.
> The agent appears to recover and keep on processing data.
>
>  My question is:
> Has this data been lost? Or will Flume eventually retry until a
> successful delivery has been made?
> Are there any other config changes I can make to prevent/reduce
> this occurring in the future?
>
>  05 Feb 2013 23:12:21,650 ERROR
> [SinkRunner-PollingRunner-DefaultSinkProcessor]
> (org.apache.flume.SinkRunner$PollingRunner.run:160)  - Unable to deliver
> event. Exception follows.
> org.apache.flume.EventDeliveryException: Failed to send events
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
>         at
> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>         at
> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: collector1, port: 4003 }: Failed to send batch
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
>         ... 3 more
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: collector2, port: 4003 }: RPC request timed out
>         at
> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>         at
> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>         ... 6 more
>
>  Below is a snapshot of the current config:
>
>  agent.sources.eventdata.command = tail -qn +0 -F
> /var/log/event-logs/live/eventdata.log
>  agent.sources.eventdata.channels = eventdata-avro-channel
> agent.sources.eventdata.batchSize = 10
> agent.sources.eventdata.restart = true
>
>  ## event source interceptor
> agent.sources.eventdata.interceptors.host-interceptor.type =
> org.apache.flume.interceptor.HostInterceptor$Builder
> agent.sources.eventdata.interceptors.host-interceptor.hostHeader =
> source-host
> agent.sources.eventdata.interceptors.host-interceptor.useIP = false
>
>  ## eventdata channel
> agent.channels.eventdata-avro-channel.type = file
> agent.channels.eventdata-avro-channel.checkpointDir =
> /mnt/flume-ng/checkpoint/eventdata-avro-channel
> agent.channels.eventdata-avro-channel.dataDirs =
> /mnt/flume-ng/data1/eventdata-avro-channel,/mnt/flume-ng/data2/eventdata-avro-channel
> agent.channels.eventdata-avro-channel.maxFileSize = 210000000
> agent.channels.eventdata-avro-channel.capacity = 2000000
> agent.channels.eventdata-avro-channel.checkpointInterval = 300000
> agent.channels.eventdata-avro-channel.transactionCapacity = 10000
> agent.channels.eventdata-avro-channel.keep-alive = 20
> agent.channels.eventdata-avro-channel.write-timeout = 20
>
>  ## 2 x downstream click sinks for load balancing and failover
>  agent.sinks.eventdata-avro-sink-1.type = avro
> agent.sinks.eventdata-avro-sink-1.channel = eventdata-avro-channel
> agent.sinks.eventdata-avro-sink-1.hostname = collector1
> agent.sinks.eventdata-avro-sink-1.port = 4003
> agent.sinks.eventdata-avro-sink-1.batch-size = 100
> agent.sinks.eventdata-avro-sink-1.connect-timeout = 40000
>
>  agent.sinks.eventdata-avro-sink-2.type = avro
> agent.sinks.eventdata-avro-sink-2.channel = eventdata-avro-channel
> agent.sinks.eventdata-avro-sink-2.hostname = collector2
> agent.sinks.eventdata-avro-sink-2.port = 4003
> agent.sinks.eventdata-avro-sink-2.batch-size = 100
> agent.sinks.eventdata-avro-sink-2.connect-timeout = 40000
>
>  ## load balance config
> agent.sinkgroups = eventdata-avro-sink-group
> agent.sinkgroups.eventdata-avro-sink-group.sinks = eventdata-avro-sink-1
> eventdata-avro-sink-2
> agent.sinkgroups.eventdata-avro-sink-group.processor.type = load_balance
> agent.sinkgroups.eventdata-avro-sink-group.processor.selector = round_robin
>
>  Denis.
>
>
>
>
>
>
>
>


-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org

Re: flume.EventDeliveryException: Failed to send events

Posted by Hari Shreedharan <hs...@cloudera.com>.
One possibility is that you are hitting the file channel disk too hard - do you have just one disk for checkpoint and data? It might be getting slow because of this? Also you should probably just remove the thread limit on AvroSource. It usually does not cause too much havoc unless you have a massive number of connections causing too many threads. 
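
A minimal sketch of the layout described above, assuming the collector host
has two separate physical disks mounted at /disk1 and /disk2 (hypothetical
mount points) and a file channel named avro-channel (the channel name is also
an assumption); the explicit threads property is simply left out so the
AvroSource is no longer capped at 64 workers:

# file channel: checkpoint and data directories on different spindles (paths are assumptions)
hadoopjt01-1.channels.avro-channel.type = file
hadoopjt01-1.channels.avro-channel.checkpointDir = /disk1/flume-ng/checkpoint/avro-channel
hadoopjt01-1.channels.avro-channel.dataDirs = /disk2/flume-ng/data/avro-channel

# avro source: no "threads" line, per the suggestion to drop the explicit thread limit
hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com
hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000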

-- 
Hari Shreedharan


On Tuesday, April 16, 2013 at 12:07 PM, Chris Neal wrote:

> Thanks Hari.
> 
> I increased both the connect and request timeouts to 40000ms, and I'm testing that now.  I am talking on a LAN though, which is part of the reason I'm concerned.  Seems like it might not actually be a network issue, but perhaps an overloaded AvroSource on the back end? 
> 
> 
> On Tue, Apr 16, 2013 at 1:52 PM, Hari Shreedharan <hshreedharan@cloudera.com (mailto:hshreedharan@cloudera.com)> wrote:
> > Looks like you are hitting Avro IPC timeouts - you should probably increase it, especially if you are talking over WAN.
> > 
> > -- 
> > Hari Shreedharan
> > 
> > 
> > On Tuesday, April 16, 2013 at 11:38 AM, Chris Neal wrote:
> > 
> > > I'm seeing the same thing :)
> > > 
> > > Mine is all on a local LAN though, so the fact that an RPC call doesn't reply in 10000ms or 20000ms is quite odd.  My configuration is for the most part the same as Denis' configuration.  Two tiered system, ExecSources running tail -F on log files to an AvroSink, to an AvroSource, loading into HDFS on the back tier. 
> > > 
> > > I, too, see this on the AvroSink
> > > 
> > > Either (A):
> > > [2013-04-15 23:57:14.827] [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN] [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume event. Attempting next sink if available.
> > > org.apache.flume.EventDeliveryException: Failed to send events
> > >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
> > >         at org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
> > >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> > >         at java.lang.Thread.run(Thread.java:619)
> > > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com (http://hadoopjt01.pegs.com), port: 10000 }: Failed to send batch
> > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
> > >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
> > >         ... 3 more
> > > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com (http://hadoopjt01.pegs.com), port: 10000 }: Handshake timed out after 20000ms
> > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
> > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
> > >         ... 4 more
> > > Caused by: java.util.concurrent.TimeoutException
> > >         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
> > >         at java.util.concurrent.FutureTask.get(FutureTask.java:91)
> > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
> > >         ... 5 more
> > > 
> > > or (B):
> > > [2013-04-15 19:49:01.135] [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN] [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume event. Attempting next sink if available.
> > > org.apache.flume.EventDeliveryException: Failed to send events
> > >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
> > >         at org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
> > >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> > >         at java.lang.Thread.run(Thread.java:619)
> > > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com (http://hadoopjt01.pegs.com), port: 10000 }: Failed to send batch
> > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
> > >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
> > >         ... 3 more
> > > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com (http://hadoopjt01.pegs.com), port: 10000 }: RPC request timed out
> > >         at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
> > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
> > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
> > >         ... 4 more
> > > Caused by: java.util.concurrent.TimeoutException
> > >         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
> > >         at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
> > >         ... 6 more
> > > 
> > > The only thing I see on the AvroSource tier is the disconnect/reconnect happening:
> > > 
> > > [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 (http://138.113.127.4:34481) :> /138.113.127.72:10000 (http://138.113.127.72:10000)] DISCONNECTED 
> > > [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 (http://138.113.127.4:34481) :> /138.113.127.72:10000 (http://138.113.127.72:10000)] UNBOUND
> > > [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 (http://138.113.127.4:34481) :> /138.113.127.72:10000 (http://138.113.127.72:10000)] CLOSED
> > > [2013-04-15 19:49:00.993] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] []  (NettyServer.java:channelClosed:209) Connection to /138.113.127.4:34481 (http://138.113.127.4:34481) disconnected.
> > > [2013-04-15 19:49:03.331] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-10-thread-1] []  (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 (http://138.113.127.4:62442) => /138.113.127.72:10000 (http://138.113.127.72:10000)] OPEN
> > > [2013-04-15 19:49:03.332] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 (http://138.113.127.4:62442) => /138.113.127.72:10000 (http://138.113.127.72:10000)] BOUND: /138.113.127.72:10000 (http://138.113.127.72:10000)
> > > [2013-04-15 19:49:03.333] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 (http://138.113.127.4:62442) => /138.113.127.72:10000 (http://138.113.127.72:10000)] CONNECTED: /138.113.127.4:62442 (http://138.113.127.4:62442)
> > > 
> > > Is there some way to determine exactly where this bottleneck is?  The config options for AvroSource/Sink are quite short, so there is not much tuning to do.  Here is what I have:
> > > 
> > > # avro-hadoopjt01-sink properties
> > > udprodae01.sinks.avro-hadoopjt01-sink.type = avro
> > > udprodae01.sinks.avro-hadoopjt01-sink.hostname = hadoopjt01.pegs.com (http://hadoopjt01.pegs.com)
> > > udprodae01.sinks.avro-hadoopjt01-sink.port = 10000
> > > udprodae01.sinks.avro-hadoopjt01-sink.batch-size = 100
> > > 
> > > # avro-hadoopjt01-source properties
> > > hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
> > > hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com (http://hadoopjt01.pegs.com)
> > > hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
> > > hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64
> > > 
> > > 
> > > I can try increasing the AvroSink timeout values, but they seem quite adequate at the defaults.  Maybe more threads on the AvroSource?
> > > 
> > > Any advice would be much appreciated! 
> > > Chris
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > On Wed, Feb 6, 2013 at 12:10 PM, Denis Lowe <denis.lowe@gmail.com (mailto:denis.lowe@gmail.com)> wrote:
> > > > I noticed that the logs in the destination server were reporting dropped connections from the upstream server:
> > > > 
> > > > NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] DISCONNECTED 
> > > > NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] UNBOUND
> > > > NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] CLOSED
> > > > NettyServer$NettyServerAvroHandler.channelClosed:209)  - Connection to /SOURCE_HOST:43599 disconnected.
> > > > 
> > > > The other thing I observed is that these errors only ever occur from our EU servers (connecting to our centralized downstream collectors in US West) - We are running Flume in Amazon EC2 
> > > > 
> > > > I can see in the log that the connection is restored quite quickly.
> > > > 
> > > > I gather that the network latency between Europe and US West is causing the connection between the 2 servers to 'appear' lost, thus resulting in the above errors? 
> > > > 
> > > > Are there any recommended config settings to compensate for this?
> > > > 
> > > > On 6 February 2013 00:21, Juhani Connolly <juhani_connolly@cyberagent.co.jp (mailto:juhani_connolly@cyberagent.co.jp)> wrote:
> > > > > Is there anything unusual in the logs for the destination (AvroSource) server?
> > > > > 
> > > > > Since the error is happening in the AvroSink, no data is getting lost. The failed data will get rolled back, removal from the local channel is cancelled, and it will attempt to resend it.
> > > > > 
> > > > > 
> > > > > On 02/06/2013 03:23 PM, Denis Lowe wrote:
> > > > > > We are running Flume-NG 1.3.1 and have noticed periodically the following ERROR occurring (a few times daily): 
> > > > > > 
> > > > > > We are using the File Channel connecting to 2 downstream collector agents in 'round_robin' mode, using avro source/sinks. 
> > > > > > 
> > > > > > We are using the config described below to deliver 5 different log types (to 5 different ports downstream) and have observed the below error occurring randomly across all the ports. 
> > > > > > 
> > > > > > We tried doubling the connect-timeout to 40000 (from the default of 20000) with no success. 
> > > > > > The agent appears to recover and keep on processing data.
> > > > > > 
> > > > > > 
> > > > > > My question is: 
> > > > > > Has this data been lost? Or will Flume eventually retry until a successful delivery has been made?
> > > > > > Are there any other config changes I can make to prevent/reduce this occurring in the future?
> > > > > > 
> > > > > > 
> > > > > > 05 Feb 2013 23:12:21,650 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:160)  - Unable to deliver event. Exception follows. 
> > > > > > org.apache.flume.EventDeliveryException: Failed to send events
> > > > > >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
> > > > > >         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
> > > > > >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> > > > > >         at java.lang.Thread.run(Thread.java:662)
> > > > > > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: collector1, port: 4003 }: Failed to send batch
> > > > > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
> > > > > >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
> > > > > >         ... 3 more
> > > > > > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: collector2, port: 4003 }: RPC request timed out
> > > > > >         at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
> > > > > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
> > > > > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
> > > > > >         ... 4 more
> > > > > > Caused by: java.util.concurrent.TimeoutException
> > > > > >         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
> > > > > >         at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
> > > > > >         ... 6 more
> > > > > > 
> > > > > > 
> > > > > > Below is a snapshot of the current config:
> > > > > > 
> > > > > > agent.sources.eventdata.command = tail -qn +0 -F /var/log/event-logs/live/eventdata.log 
> > > > > > agent.sources.eventdata.channels = eventdata-avro-channel
> > > > > > agent.sources.eventdata.batchSize = 10
> > > > > > agent.sources.eventdata.restart = true
> > > > > > 
> > > > > > ## event source interceptor 
> > > > > > agent.sources.eventdata.interceptors.host-interceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
> > > > > > agent.sources.eventdata.interceptors.host-interceptor.hostHeader = source-host
> > > > > > agent.sources.eventdata.interceptors.host-interceptor.useIP = false
> > > > > > 
> > > > > > ## eventdata channel 
> > > > > > agent.channels.eventdata-avro-channel.type = file
> > > > > > agent.channels.eventdata-avro-channel.checkpointDir = /mnt/flume-ng/checkpoint/eventdata-avro-channel
> > > > > > agent.channels.eventdata-avro-channel.dataDirs = /mnt/flume-ng/data1/eventdata-avro-channel,/mnt/flume-ng/data2/eventdata-avro-channel
> > > > > > agent.channels.eventdata-avro-channel.maxFileSize = 210000000
> > > > > > agent.channels.eventdata-avro-channel.capacity = 2000000
> > > > > > agent.channels.eventdata-avro-channel.checkpointInterval = 300000
> > > > > > agent.channels.eventdata-avro-channel.transactionCapacity = 10000
> > > > > > agent.channels.eventdata-avro-channel.keep-alive = 20
> > > > > > agent.channels.eventdata-avro-channel.write-timeout = 20
> > > > > > 
> > > > > > ## 2 x downstream click sinks for load balancing and failover 
> > > > > > agent.sinks.eventdata-avro-sink-1.type = avro
> > > > > > agent.sinks.eventdata-avro-sink-1.channel = eventdata-avro-channel
> > > > > > agent.sinks.eventdata-avro-sink-1.hostname = collector1
> > > > > > agent.sinks.eventdata-avro-sink-1.port = 4003
> > > > > > agent.sinks.eventdata-avro-sink-1.batch-size = 100
> > > > > > agent.sinks.eventdata-avro-sink-1.connect-timeout = 40000
> > > > > > 
> > > > > > agent.sinks.eventdata-avro-sink-2.type = avro 
> > > > > > agent.sinks.eventdata-avro-sink-2.channel = eventdata-avro-channel
> > > > > > agent.sinks.eventdata-avro-sink-2.hostname = collector2
> > > > > > agent.sinks.eventdata-avro-sink-2.port = 4003
> > > > > > agent.sinks.eventdata-avro-sink-2.batch-size = 100
> > > > > > agent.sinks.eventdata-avro-sink-2.connect-timeout = 40000
> > > > > > 
> > > > > > ## load balance config 
> > > > > > agent.sinkgroups = eventdata-avro-sink-group
> > > > > > agent.sinkgroups.eventdata-avro-sink-group.sinks = eventdata-avro-sink-1 eventdata-avro-sink-2
> > > > > > agent.sinkgroups.eventdata-avro-sink-group.processor.type = load_balance
> > > > > > agent.sinkgroups.eventdata-avro-sink-group.processor.selector = round_robin
> > > > > > 
> > > > > > 
> > > > > > Denis. 
> > > > 
> > > 
> > 
> 


Re: flume.EventDeliveryException: Failed to send events

Posted by Chris Neal <cw...@gmail.com>.
Thanks Hari.

I increased both the connect and request timeouts to 40000ms, and I'm
testing that now.  I am talking on a LAN though, which is part of the
reason I'm concerned.  Seems like it might not actually be a network issue,
but perhaps an overloaded AvroSource on the back end?
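
One way to check for that (a sketch only, assuming a Flume build that ships
the HTTP metrics server) is to restart both tiers with JSON monitoring
enabled and compare channel and sink counters on each side; a collector
channel that stays near full would point at file-channel or HDFS
back-pressure rather than the network:

bin/flume-ng agent -n hadoopjt01-1 -c conf -f conf/flume.conf \
  -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
# metrics are then served as JSON, e.g.
#   curl http://hadoopjt01.pegs.com:34545/metrics
# (the config file name and the port number are arbitrary examples)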


On Tue, Apr 16, 2013 at 1:52 PM, Hari Shreedharan <hshreedharan@cloudera.com
> wrote:

> Looks like you are hitting Avro IPC timeouts - you should probably
> increase it, especially if you are talking over WAN.
>
> --
> Hari Shreedharan
>
> On Tuesday, April 16, 2013 at 11:38 AM, Chris Neal wrote:
>
> I'm seeing the same thing :)
>
> Mine is all on a local LAN though, so the fact that an RPC call doesn't
> reply in 10000ms or 20000ms is quite odd.  My configuration is for the most
> part the same as Denis' configuration.  Two tiered system, ExecSources
> running tail -F on log files to an AvroSink, to an AvroSource, loading into
> HDFS on the back tier.
>
> I, too, see this on the AvroSink
>
> Either (A):
> [2013-04-15 23:57:14.827]
> [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
> [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
>  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
> event. Attempting next sink if available.
> org.apache.flume.EventDeliveryException: Failed to send events
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>         at
> org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>         at
> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>         ... 3 more
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: hadoopjt01.pegs.com, port: 10000 }: Handshake timed out after
> 20000ms
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException
>         at
> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:91)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
>         ... 5 more
>
> or (B):
> [2013-04-15 19:49:01.135]
> [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
> [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
>  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
> event. Attempting next sink if available.
> org.apache.flume.EventDeliveryException: Failed to send events
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>         at
> org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>         at
> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>         ... 3 more
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: hadoopjt01.pegs.com, port: 10000 }: RPC request timed out
>         at
> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>         at
> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>         ... 6 more
>
> The only thing I see on the AvroSource tier is the disconnect/reconnect
> happening:
>
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] DISCONNECTED
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] UNBOUND
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] CLOSED
> [2013-04-15 19:49:00.993] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-10] []  (NettyServer.java:channelClosed:209) Connection to /
> 138.113.127.4:34481 disconnected.
> [2013-04-15 19:49:03.331] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-10-thread-1] []  (NettyServer.java:handleUpstream:171) [id:
> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] OPEN
> [2013-04-15 19:49:03.332] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] BOUND: /
> 138.113.127.72:10000
> [2013-04-15 19:49:03.333] [org.apache.avro.ipc.NettyServer] [ INFO]
> [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] CONNECTED: /
> 138.113.127.4:62442
>
> Is there some way to determine exactly where this bottleneck is?  The
> config options for AvroSource/Sink are quite short, so there is not much
> tuning to do.  Here is what I have:
>
> # avro-hadoopjt01-sink properties
> udprodae01.sinks.avro-hadoopjt01-sink.type = avro
> udprodae01.sinks.avro-hadoopjt01-sink.hostname = hadoopjt01.pegs.com
> udprodae01.sinks.avro-hadoopjt01-sink.port = 10000
> udprodae01.sinks.avro-hadoopjt01-sink.batch-size = 100
>
> # avro-hadoopjt01-source properties
> hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
> hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com
> hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
> hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64
>
> I can try increasing the AvroSink timeout values, but they seem quite
> adequate at the defaults.  Maybe more threads on the AvroSource?
>
> Any advice would be much appreciated!
> Chris
>
>
> On Wed, Feb 6, 2013 at 12:10 PM, Denis Lowe <de...@gmail.com> wrote:
>
> I noticed that the logs in the destination server were reporting dropped
> connections from the upstream server:
>
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] DISCONNECTED
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] UNBOUND
> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76,
> /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] CLOSED
> NettyServer$NettyServerAvroHandler.channelClosed:209)  - Connection to
> /SOURCE_HOST:43599 disconnected.
>
> The other thing I observed is that these errors only ever occur from our
> EU servers (connecting to our centralized downstream collectors in US West)
> - We are running Flume in Amazon EC2
>
> I can see in the log that the connection is restored quite quickly.
>
> I gather that the network latency between Europe and US West is causing
> the connection between the 2 servers to 'appear' lost, thus resulting in
> the above errors?
>
> Are there any recommended config settings to compensate for this?
>
> On 6 February 2013 00:21, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>  Is there anything unusual in the logs for the destination (AvroSource)
> server?
>
> Since the error is happening in the AvroSink, no data is getting lost. The
> failed data will get rolled back, removal from the local channel is
> cancelled, and it will attempt to resend it.
>
>
> On 02/06/2013 03:23 PM, Denis Lowe wrote:
>
> We are running Flume-NG 1.3.1 and have noticed periodically the following
> ERROR occurring (a few times daily):
>
>  We are using the File Channel connecting to 2 downstream collector
> agents in 'round_robin' mode, using avro source/sinks.
>
>  We are using the config described below to deliver 5 different log types
> (to 5 different ports downstream) and have observed the below
> error occurring randomly across all the ports.
>
>  We tried doubling the connect-timeout to 40000 (from the default of
> 20000) with no success.
> The agent appears to recover and keep on processing data.
>
>  My question is:
> Has this data been lost? Or will Flume eventually retry until a
> successful delivery has been made?
> Are there any other config changes I can make to prevent/reduce
> this occurring in the future?
>
>  05 Feb 2013 23:12:21,650 ERROR
> [SinkRunner-PollingRunner-DefaultSinkProcessor]
> (org.apache.flume.SinkRunner$PollingRunner.run:160)  - Unable to deliver
> event. Exception follows.
> org.apache.flume.EventDeliveryException: Failed to send events
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
>         at
> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>         at
> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: collector1, port: 4003 }: Failed to send batch
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
>         ... 3 more
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient {
> host: collector2, port: 4003 }: RPC request timed out
>         at
> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>         at
> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>         at
> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>         ... 6 more
>
>  Below is a snapshot of the current config:
>
>  agent.sources.eventdata.command = tail -qn +0 -F
> /var/log/event-logs/live/eventdata.log
>  agent.sources.eventdata.channels = eventdata-avro-channel
> agent.sources.eventdata.batchSize = 10
> agent.sources.eventdata.restart = true
>
>  ## event source interceptor
> agent.sources.eventdata.interceptors.host-interceptor.type =
> org.apache.flume.interceptor.HostInterceptor$Builder
> agent.sources.eventdata.interceptors.host-interceptor.hostHeader =
> source-host
> agent.sources.eventdata.interceptors.host-interceptor.useIP = false
>
>  ## eventdata channel
> agent.channels.eventdata-avro-channel.type = file
> agent.channels.eventdata-avro-channel.checkpointDir =
> /mnt/flume-ng/checkpoint/eventdata-avro-channel
> agent.channels.eventdata-avro-channel.dataDirs =
> /mnt/flume-ng/data1/eventdata-avro-channel,/mnt/flume-ng/data2/eventdata-avro-channel
> agent.channels.eventdata-avro-channel.maxFileSize = 210000000
> agent.channels.eventdata-avro-channel.capacity = 2000000
> agent.channels.eventdata-avro-channel.checkpointInterval = 300000
> agent.channels.eventdata-avro-channel.transactionCapacity = 10000
> agent.channels.eventdata-avro-channel.keep-alive = 20
> agent.channels.eventdata-avro-channel.write-timeout = 20
>
>  ## 2 x downstream click sinks for load balancing and failover
>  agent.sinks.eventdata-avro-sink-1.type = avro
> agent.sinks.eventdata-avro-sink-1.channel = eventdata-avro-channel
> agent.sinks.eventdata-avro-sink-1.hostname = collector1
> agent.sinks.eventdata-avro-sink-1.port = 4003
> agent.sinks.eventdata-avro-sink-1.batch-size = 100
> agent.sinks.eventdata-avro-sink-1.connect-timeout = 40000
>
>  agent.sinks.eventdata-avro-sink-2.type = avro
> agent.sinks.eventdata-avro-sink-2.channel = eventdata-avro-channel
> agent.sinks.eventdata-avro-sink-2.hostname = collector2
> agent.sinks.eventdata-avro-sink-2.port = 4003
> agent.sinks.eventdata-avro-sink-2.batch-size = 100
> agent.sinks.eventdata-avro-sink-2.connect-timeout = 40000
>
>  ## load balance config
> agent.sinkgroups = eventdata-avro-sink-group
> agent.sinkgroups.eventdata-avro-sink-group.sinks = eventdata-avro-sink-1
> eventdata-avro-sink-2
> agent.sinkgroups.eventdata-avro-sink-group.processor.type = load_balance
> agent.sinkgroups.eventdata-avro-sink-group.processor.selector = round_robin
>
>  Denis.
>
>
>
>
>
>

Re: flume.EventDeliveryException: Failed to send events

Posted by Hari Shreedharan <hs...@cloudera.com>.
Looks like you are hitting Avro IPC timeouts - you should probably increase it, especially if you are talking over WAN.
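
Applied to the sinks shown earlier in the thread, that would look roughly like the lines below (values are in milliseconds; 40000 simply mirrors what Denis and Chris already tried, and the sink name is taken from Chris's config):

udprodae01.sinks.avro-hadoopjt01-sink.connect-timeout = 40000
udprodae01.sinks.avro-hadoopjt01-sink.request-timeout = 40000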

-- 
Hari Shreedharan


On Tuesday, April 16, 2013 at 11:38 AM, Chris Neal wrote:

> I'm seeing the same thing :)
> 
> Mine is all on a local LAN though, so the fact that an RPC call doesn't reply in 10000ms or 20000ms is quite odd.  My configuration is for the most part the same as Denis' configuration.  Two tiered system, ExecSources running tail -F on log files to an AvroSink, to an AvroSource, loading into HDFS on the back tier. 
> 
> I, too, see this on the AvroSink
> 
> Either (A):
> [2013-04-15 23:57:14.827] [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN] [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume event. Attempting next sink if available.
> org.apache.flume.EventDeliveryException: Failed to send events
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>         at org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com (http://hadoopjt01.pegs.com), port: 10000 }: Failed to send batch
>         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>         ... 3 more
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com (http://hadoopjt01.pegs.com), port: 10000 }: Handshake timed out after 20000ms
>         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
>         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException
>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:91)
>         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
>         ... 5 more
> 
> or (B):
> [2013-04-15 19:49:01.135] [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN] [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume event. Attempting next sink if available.
> org.apache.flume.EventDeliveryException: Failed to send events
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>         at org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com (http://hadoopjt01.pegs.com), port: 10000 }: Failed to send batch
>         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>         ... 3 more
> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com (http://hadoopjt01.pegs.com), port: 10000 }: RPC request timed out
>         at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>         at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>         ... 6 more
> 
> The only thing I see on the AvroSource tier is the disconnect/reconnect happening:
> 
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 (http://138.113.127.4:34481) :> /138.113.127.72:10000 (http://138.113.127.72:10000)] DISCONNECTED 
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 (http://138.113.127.4:34481) :> /138.113.127.72:10000 (http://138.113.127.72:10000)] UNBOUND
> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 (http://138.113.127.4:34481) :> /138.113.127.72:10000 (http://138.113.127.72:10000)] CLOSED
> [2013-04-15 19:49:00.993] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] []  (NettyServer.java:channelClosed:209) Connection to /138.113.127.4:34481 (http://138.113.127.4:34481) disconnected.
> [2013-04-15 19:49:03.331] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-10-thread-1] []  (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 (http://138.113.127.4:62442) => /138.113.127.72:10000 (http://138.113.127.72:10000)] OPEN
> [2013-04-15 19:49:03.332] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 (http://138.113.127.4:62442) => /138.113.127.72:10000 (http://138.113.127.72:10000)] BOUND: /138.113.127.72:10000 (http://138.113.127.72:10000)
> [2013-04-15 19:49:03.333] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 (http://138.113.127.4:62442) => /138.113.127.72:10000 (http://138.113.127.72:10000)] CONNECTED: /138.113.127.4:62442 (http://138.113.127.4:62442)
> 
> Is there some way to determine exactly where this bottleneck is?  The config options for AvroSource/Sink are quite short, so there is not much tuning to do.  Here is what I have:
> 
> # avro-hadoopjt01-sink properties
> udprodae01.sinks.avro-hadoopjt01-sink.type = avro
> udprodae01.sinks.avro-hadoopjt01-sink.hostname = hadoopjt01.pegs.com (http://hadoopjt01.pegs.com)
> udprodae01.sinks.avro-hadoopjt01-sink.port = 10000
> udprodae01.sinks.avro-hadoopjt01-sink.batch-size = 100
> 
> # avro-hadoopjt01-source properties
> hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
> hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com (http://hadoopjt01.pegs.com)
> hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
> hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64
> 
> 
> I can try increasing the AvroSink timeout values, but they seem quite adequate at the defaults.  Maybe more threads on the AvroSource?
> 
> Any advice would be much appreciated! 
> Chris
> 
> 
> 
> 
> 
> 
> 
> 
> On Wed, Feb 6, 2013 at 12:10 PM, Denis Lowe <denis.lowe@gmail.com (mailto:denis.lowe@gmail.com)> wrote:
> > I noticed that the logs in the destination server were reporting dropped connections from the upstream server:
> > 
> > NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] DISCONNECTED 
> > NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] UNBOUND
> > NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] CLOSED
> > NettyServer$NettyServerAvroHandler.channelClosed:209)  - Connection to /SOURCE_HOST:43599 disconnected.
> > 
> > The other thing I observed is that these errors only ever occur from our EU servers (connecting to our centralized downstream collectors in US West) - We are running Flume in Amazon EC2 
> > 
> > I can see in the log that the connection is restored quite quickly.
> > 
> > I gather that the network latency between Europe and US West is causing the connection between the 2 servers to 'appear' lost, thus resulting in the above errors? 
> > 
> > Are there any recommended config settings to compensate for this?
> > 
> > On 6 February 2013 00:21, Juhani Connolly <juhani_connolly@cyberagent.co.jp (mailto:juhani_connolly@cyberagent.co.jp)> wrote:
> > > Is there anything unusual in the logs for the destination (AvroSource) server?
> > > 
> > > Since the error is happening in the AvroSink, no data is getting lost. The failed data will get rolled back, removal from the local channel is cancelled, and it will attempt to resend it.
> > > 
> > > 
> > > On 02/06/2013 03:23 PM, Denis Lowe wrote:
> > > > We are running Flume-NG 1.3.1 and have noticed periodically the following ERROR occurring (a few times daily): 
> > > > 
> > > > We are using the File Channel connecting to 2 downstream collector agents in 'round_robin' mode, using avro source/sinks. 
> > > > 
> > > > We are using the config described below to deliver 5 different log types (to 5 different ports downstream) and have observed the below error occurring randomly across all the ports. 
> > > > 
> > > > We tried doubling the connect-timeout to 40000 (from the default of 20000) with no success. 
> > > > The agent appears to recover and keep on processing data.
> > > > 
> > > > 
> > > > My question is: 
> > > > Has this data been lost? Or will Flume eventually retry until a successful delivery has been made?
> > > > Are there any other config changes I can make to prevent/reduce this occurring in the future?
> > > > 
> > > > 
> > > > 05 Feb 2013 23:12:21,650 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:160)  - Unable to deliver event. Exception follows. 
> > > > org.apache.flume.EventDeliveryException: Failed to send events
> > > >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
> > > >         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
> > > >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> > > >         at java.lang.Thread.run(Thread.java:662)
> > > > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: collector1, port: 4003 }: Failed to send batch
> > > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
> > > >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
> > > >         ... 3 more
> > > > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: collector2, port: 4003 }: RPC request timed out
> > > >         at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
> > > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
> > > >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
> > > >         ... 4 more
> > > > Caused by: java.util.concurrent.TimeoutException
> > > >         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
> > > >         at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
> > > >         ... 6 more
> > > > 
> > > > 
> > > > Below is a snapshot of the current config: 
> > > > 
> > > > agent.sources.eventdata.command = tail -qn +0 -F /var/log/event-logs/live/eventdata.log 
> > > > agent.sources.eventdata.channels = eventdata-avro-channel
> > > > agent.sources.eventdata.batchSize = 10
> > > > agent.sources.eventdata.restart = true
> > > > 
> > > > ## event source interceptor 
> > > > agent.sources.eventdata.interceptors.host-interceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
> > > > agent.sources.eventdata.interceptors.host-interceptor.hostHeader = source-host
> > > > agent.sources.eventdata.interceptors.host-interceptor.useIP = false
> > > > 
> > > > ## eventdata channel 
> > > > agent.channels.eventdata-avro-channel.type = file
> > > > agent.channels.eventdata-avro-channel.checkpointDir = /mnt/flume-ng/checkpoint/eventdata-avro-channel
> > > > agent.channels.eventdata-avro-channel.dataDirs = /mnt/flume-ng/data1/eventdata-avro-channel,/mnt/flume-ng/data2/eventdata-avro-channel
> > > > agent.channels.eventdata-avro-channel.maxFileSize = 210000000
> > > > agent.channels.eventdata-avro-channel.capacity = 2000000
> > > > agent.channels.eventdata-avro-channel.checkpointInterval = 300000
> > > > agent.channels.eventdata-avro-channel.transactionCapacity = 10000
> > > > agent.channels.eventdata-avro-channel.keep-alive = 20
> > > > agent.channels.eventdata-avro-channel.write-timeout = 20
> > > > 
> > > > ## 2 x downstream click sinks for load balancing and failover 
> > > > agent.sinks.eventdata-avro-sink-1.type = avro
> > > > agent.sinks.eventdata-avro-sink-1.channel = eventdata-avro-channel
> > > > agent.sinks.eventdata-avro-sink-1.hostname = collector1
> > > > agent.sinks.eventdata-avro-sink-1.port = 4003
> > > > agent.sinks.eventdata-avro-sink-1.batch-size = 100
> > > > agent.sinks.eventdata-avro-sink-1.connect-timeout = 40000
> > > > 
> > > > agent.sinks.eventdata-avro-sink-2.type = avro 
> > > > agent.sinks.eventdata-avro-sink-2.channel = eventdata-avro-channel
> > > > agent.sinks.eventdata-avro-sink-2.hostname = collector2
> > > > agent.sinks.eventdata-avro-sink-2.port = 4003
> > > > agent.sinks.eventdata-avro-sink-2.batch-size = 100
> > > > agent.sinks.eventdata-avro-sink-2.connect-timeout = 40000
> > > > 
> > > > ## load balance config 
> > > > agent.sinkgroups = eventdata-avro-sink-group
> > > > agent.sinkgroups.eventdata-avro-sink-group.sinks = eventdata-avro-sink-1 eventdata-avro-sink-2
> > > > agent.sinkgroups.eventdata-avro-sink-group.processor.type = load_balance
> > > > agent.sinkgroups.eventdata-avro-sink-group.processor.selector = round_robin
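> > > > 
> > > > One thing I have not tried yet is letting the processor back off a failing sink; as I read the docs that would look something like this (illustrative values):
> > > > 
> > > > ## (not currently set) exponential backoff for failed sinks
> > > > agent.sinkgroups.eventdata-avro-sink-group.processor.backoff = true
> > > > agent.sinkgroups.eventdata-avro-sink-group.processor.selector.maxTimeOut = 30000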
> > > > 
> > > > 
> > > > Denis. 
> > 
>