Posted to user@flume.apache.org by Eran Kutner <er...@gigya.com> on 2011/10/23 18:07:29 UTC

Collector stops working

Hi,
I'm having a problem where flume collectors occasionally stop working under
heavy load.
I'm writing something like 1500-2000 events per second to my collectors, and
occasionally they will just stop working. Nothing is written to the log; the
only indication that this is happening is that I see 0 messages being
delivered on the flume stats web page, and events start piling
up in the agents. Restarting the service solves the problem for a while
(anything from a few minutes to a few days).
An interesting thing to note is that this seems to be load related. It used
to happen a lot more but then I split the collector into three virtual nodes
and balanced the traffic on them and now it happens a lot less. Also, while
one virtual collector stops working the others, on the same machine,
continue to work fine.

My collector configuration looks like this: collectorSource(54001) |
collector(600000) {
escapedFormatDfs("hdfs://hadoop1-m1:8020/raw-events/%Y-%m-%d/",
"events-%{rolltag}-f01-c1.snappy", seqfile("SnappyCodec")) };

I'm using a 0.9.5 build from a few weeks ago.

Any ideas what can be causing it?

-eran

Re: Collector stops working

Posted by Mingjie Lai <mj...@gmail.com>.
Cameron.

FLUME-808 addresses a different race condition; I'm not sure it helps
fix the RollSink issue. However, I'm glad you tried it.

Can you post your fix to RollSink to FLUME-798 (this one, right?)?

Thanks,
Mingjie


On 10/27/2011 12:15 PM, Cameron Gandevia wrote:
> Hey
>
> We were having problems with our collectors dying (We always had errors
> in the logs). We recently applied the patch
> https://issues.apache.org/jira/browse/Flume-808 and modified the
> RollSink TriggerThread to not Interrupt the append job when acquiring
> its lock. Our collectors have now been up for a few days without problems.

Re: Collector stops working

Posted by Cameron Gandevia <cg...@gmail.com>.
Hey

We were having problems with our collectors dying (we always had errors in
the logs). We recently applied the patch
https://issues.apache.org/jira/browse/Flume-808 and modified the RollSink
TriggerThread to not interrupt the append job when acquiring its lock. Our
collectors have now been up for a few days without problems.
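For illustration, the kind of change described above can be sketched as a trigger thread that waits on the append lock instead of interrupting an in-flight append. This is a simplified stand-in with hypothetical names, not the actual Flume patch:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: roll and append contend on one lock, and the
// rotation trigger simply blocks until the lock is free rather than
// interrupting the appender. Names are illustrative, not Flume's code.
public class RollSketch {
    private final ReentrantLock appendLock = new ReentrantLock();
    private int appended = 0;
    private int rolls = 0;

    public void append(String event) {
        appendLock.lock();          // may block briefly during a roll
        try {
            appended++;             // stand-in for writing to the DFS sink
        } finally {
            appendLock.unlock();
        }
    }

    // Called by the trigger thread when the roll interval elapses.
    public void rotate() {
        appendLock.lock();          // wait for any in-flight append; no interrupt()
        try {
            rolls++;                // stand-in for closing/reopening the output file
        } finally {
            appendLock.unlock();
        }
    }

    public static void main(String[] args) {
        RollSketch s = new RollSketch();
        s.append("e1");
        s.rotate();
        s.append("e2");
        System.out.println(s.appended + " appends, " + s.rolls + " rolls");
    }
}
```

The trade-off of this approach is that a roll can be delayed by a slow append, but no append is ever aborted mid-write.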



-- 
Thanks

Cameron Gandevia

Re: Collector stops working

Posted by Eran Kutner <er...@gigya.com>.
Just grepped a few days of logs and I don't see this error. It seems to be
correlated with higher load on the HDFS servers (like when map/reduce jobs
are running).
When it's happening, the agents fail to connect to the collectors, but I
don't see any errors in the collectors' logs. They just hang, while other
virtual collectors on the same server continue to work.

-eran




Re: Collector stops working

Posted by Eric Sammer <es...@cloudera.com>.
It's almost certainly the issue Mingjie mentioned. There's a race
condition in the rolling code that has plagued a few people. I'm heads down
on NG, but I think someone (probably Mingjie :)) was working on this.




Re: Collector stops working

Posted by Mingjie Lai <mj...@gmail.com>.
Quite a few people have mentioned on the list recently that the combination of
RollSink + escapedCustomDfs causes issues. You may see logs like these:

2011-10-17 17:30:07,190 [logicalNode collector0_log_dir-19] INFO 
com.cloudera.flume.core.connector.DirectDriver - Connector logicalNode 
collector0_log_dir-19 exited with error: Blocked append interrupted by 
rotation event
java.lang.InterruptedException: Blocked append interrupted by rotation event
         at 
com.cloudera.flume.handlers.rolling.RollSink.append(RollSink.java:209)


 > 1500-2000 events per second

It's not really a huge amount of data. Flume is expected to be able to 
handle it.
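For what it's worth, an interrupt delivered by a rolling trigger thread to a blocked appender surfaces exactly like the InterruptedException in the trace above. A minimal standalone demonstration of the pattern (illustrative only, not the Flume source):

```java
import java.util.concurrent.CountDownLatch;

// Illustrative only: a "trigger" (the main thread) interrupts a blocked
// "appender" thread, producing the same kind of InterruptedException seen
// in the stack trace above.
public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch started = new CountDownLatch(1);
        Thread appender = new Thread(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // stand-in for a blocked append
            } catch (InterruptedException e) {
                System.out.println("Blocked append interrupted by rotation event");
            }
        });
        appender.start();
        started.await();
        Thread.sleep(100);    // let the appender reach its blocking call
        appender.interrupt(); // stand-in for the rotation trigger thread
        appender.join();
    }
}
```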

Not sure anyone is looking at it. Sorry.

Mingjie


Re: Collector stops working

Posted by "Alexander C.H. Lorenz" <wg...@googlemail.com>.
Hi Eran,

What does the flume-master say? It could be a broken connection to the
master; the master stores the configs and running state in a ZooKeeper
store. I've also noticed in the past that when the namenode has trouble, the
flume collectors stop without a message. In a virtual environment, check the
hypervisor; it could also be a blocking state from an underlying network
stack (this mostly happens on ESX).

regards,
 Alex



-- 
Alexander Lorenz
http://mapredit.blogspot.com