Posted to user@flume.apache.org by Joe Crobak <jo...@gmail.com> on 2011/08/10 16:46:33 UTC

buffered sink decorator practices

We've written a simple sink decorator to do in-memory aggregations.
Currently, we're using a roll sink to cause the aggregator decorator to be
closed/reopened every 60 seconds.  Based upon the info in [1], by default the
close() operation has 30 seconds to complete.  We're seeing this fail in
some cases due to other bottlenecks. I'm hesitant to just up the timeout,
though, since long GCs or other events could cause the problem regardless of
the timeout.

With all this in mind, I have two questions.
1) RollSink and BatchingDecorator seem to share a lot of similar logic to
run a background thread to flush events periodically. There seems to be a
lot of subtlety in these implementations to avoid deadlocks.  Is either of
these suitable for subclassing? (I guess BatchingDecorator is closer to what
I'm looking for)... has anyone ever done this before?
2) It's possible for our sink decorator to generate more events than it
receives, so I am afraid it could fall behind -- are there dangers in
using a threadpool to call append() from a decorator to forward events to
the collector?
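
One way to picture the thread pool idea in question 2 is the sketch below. It
is only an illustration under stated assumptions: SimpleSink and
PooledForwarder are hypothetical names, not Flume classes, and the downstream
sink is assumed to be safe to call from several threads. The bounded queue
with CallerRunsPolicy makes the producing thread do the work itself once the
queue is full, so a decorator that generates more events than it receives
slows down instead of buffering without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a downstream sink; NOT the Flume EventSink API.
interface SimpleSink {
    void append(String event) throws Exception;
}

// Hypothetical helper: forwards events to the downstream sink on a fixed pool.
final class PooledForwarder {
    private final SimpleSink downstream;
    private final ThreadPoolExecutor pool;
    // Remembers the first asynchronous failure so the caller can check it later.
    private final AtomicReference<Exception> firstFailure = new AtomicReference<>();

    PooledForwarder(SimpleSink downstream, int threads, int queueCapacity) {
        this.downstream = downstream;
        // Bounded queue: when it is full, CallerRunsPolicy runs the task on the
        // submitting thread, which throttles the producer (backpressure).
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Hands one event to the pool; delivery order is NOT guaranteed.
    void forward(String event) {
        pool.execute(() -> {
            try {
                downstream.append(event);
            } catch (Exception e) {
                firstFailure.compareAndSet(null, e);
            }
        });
    }

    // Stops accepting work and waits for queued events to drain.
    // Returns true if everything drained within the timeout.
    boolean close(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns the first failure seen by a worker, or null if there was none.
    Exception firstFailure() {
        return firstFailure.get();
    }
}
```

The dangers this makes visible: events may reach the downstream sink out of
order, failures surface asynchronously rather than from the append() call that
caused them, and close() has to drain the pool before the downstream sink is
closed.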

Thanks,
Joe

[1]
http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_buffered_sink_and_decorator_semantics

Re: buffered sink decorator practices

Posted by Mingjie Lai <mj...@gmail.com>.
Joe.

 > Can you elaborate?

Sorry. Let me correct myself: calling super.append() in a separate
thread should be fine. I was a little confused by your 2nd question
("It's possible for our sink decorator to generate more events than it
receives"), so I made a wrong statement.

 > are there dangers in using a threadpool to call append() from a
 > decorator to forward events to the collector

I'll be working on a new HBase sink decorator which uses multiple threads
to improve the HBase sink write throughput. Your use case is quite
similar. Looking forward to hearing more from you.

Thanks,
Mingjie


On 08/16/2011 01:05 PM, Joe Crobak wrote:
> On Tue, Aug 16, 2011 at 2:53 PM, Mingjie Lai <mjlai09@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Joe.
>
>     I have a similar case that I need to accumulate some data at a
>     decorator and write to a sink in a batch.
>
>     1) is a reasonable choice. But I don't think deriving from
>     BatchingDecorator be a good idea. For my case, there is little in
>     common from the class. Why not borrowing the idea and implementing
>     yours version?
>
> The biggest piece of code that I was looking to use was the
> TimeoutThread.  I ended up just using that as is.
>
>     For 2), I don't like the idea to call append() in a separate thread.
>     (if I understand your solution correctly.)
>
> Can you elaborate? Is there a technical reason why not? AFAICT, this is
> what RollSink does (although it does block other threads while it's
> closing/reopening).  In our case, we're batching up a lot of data over a
> minute, so I'd prefer not to make a single call of append() block to
> forward all the data along.  As I mentioned, I'd be interested in doing
> this with multiple threads, even.
>
> Thanks,
> Joe

Re: buffered sink decorator practices

Posted by Joe Crobak <jo...@gmail.com>.
On Tue, Aug 16, 2011 at 2:53 PM, Mingjie Lai <mj...@gmail.com> wrote:

> Joe.
>
> I have a similar case that I need to accumulate some data at a decorator
> and write to a sink in a batch.
>
> 1) is a reasonable choice. But I don't think deriving from
> BatchingDecorator be a good idea. For my case, there is little in common
> from the class. Why not borrowing the idea and implementing yours version?
>
The biggest piece of code that I was looking to use was the TimeoutThread.
I ended up just using that as is.


> For 2), I don't like the idea to call append() in a separate thread. (if I
> understand your solution correctly.)
>
Can you elaborate? Is there a technical reason why not? AFAICT, this is
what RollSink does (although it does block other threads while it's
closing/reopening).  In our case, we're batching up a lot of data over a
minute, so I'd prefer not to make a single call of append() block to forward
all the data along.  As I mentioned, I'd be interested in doing this with
multiple threads, even.

Thanks,
Joe

Re: buffered sink decorator practices

Posted by Mingjie Lai <mj...@gmail.com>.
Joe.

I have a similar case where I need to accumulate some data at a decorator
and write it to a sink in a batch.

1) is a reasonable choice. But I don't think deriving from
BatchingDecorator would be a good idea. In my case, there is little in
common with the class. Why not borrow the idea and implement your own version?

For 2), I don't like the idea of calling append() in a separate thread (if
I understand your solution correctly).

Thanks,
Mingjie

On 08/16/2011 07:46 AM, Joe Crobak wrote:
> Since I've spent some time working on this, I thought I'd share my
> findings and reiterate a question below.
>
> On Wed, Aug 10, 2011 at 10:46 AM, Joe Crobak <joecrow@gmail.com
> <ma...@gmail.com>> wrote:
>
>     We've written a simple sink decorator to do in-memory aggregations.
>       Currently, we're using a roll sink to cause the aggregator
>     decorator to be closed/reopened every 60 seconds.  Based upon the
>     info in [1], by default the close() operation has 30 seconds to
>     complete.  We're seeing this fail in some cases due to other
>     bottlenecks. I'm hesitant to just up the timeout, though, since long
>     GCs or other events could cause the problem regardless of the timeout.
>
>     With all this in mind, I have two questions.
>     1) RollSink and BatchingDecorator seem to share a lot of similar
>     logic to run a background thread to flush events periodically. There
>     seems to be a lot of subtlety in these implementations to avoid
>     deadlocks.  Are either of these suitable for subclassing? (I guess
>     BatchingDecorator is closer to what I'm looking for)... has anyone
>     ever done this before?
>
>
> I tried to subclass BatchingDecorator, but it didn't quite work.  I need
> access to BatchingDecorator's super-classes' append() method.  I suspect
> it might be useful to expose an abstract class with the core-logic of
> time-based and count-based "batching" -- or am I the only one with this
> problem? If others are interested, I could start a patch.
>
>     2) It's possible for our sink decorator to generate more events than
>     it receives, so I am afraid it could become behind -- are there
>     dangers in using a threadpool to call append() from a decorator to
>     forward events to the collector?
>
>
> I'm still wondering if a decorator might call through to its sink with a
> background threadpool.  Any thoughts about whether this is a
> good/bad/terrible idea?
>
> Thanks,
> Joe
>

Re: buffered sink decorator practices

Posted by Joe Crobak <jo...@gmail.com>.
Since I've spent some time working on this, I thought I'd share my findings
and reiterate a question below.

On Wed, Aug 10, 2011 at 10:46 AM, Joe Crobak <jo...@gmail.com> wrote:

> We've written a simple sink decorator to do in-memory aggregations.
>  Currently, we're using a roll sink to cause the aggregator decorator to be
> closed/reopened every 60 seconds.  Based upon the info in [1], by default the
> close() operation has 30 seconds to complete.  We're seeing this fail in
> some cases due to other bottlenecks. I'm hesitant to just up the timeout,
> though, since long GCs or other events could cause the problem regardless of
> the timeout.
>
> With all this in mind, I have two questions.
> 1) RollSink and BatchingDecorator seem to share a lot of similar logic to
> run a background thread to flush events periodically. There seems to be a
> lot of subtlety in these implementations to avoid deadlocks.  Are either of
> these suitable for subclassing? (I guess BatchingDecorator is closer to what
> I'm looking for)... has anyone ever done this before?
>

I tried to subclass BatchingDecorator, but it didn't quite work.  I need
access to the append() method of BatchingDecorator's superclass.  I suspect it
might be useful to expose an abstract class with the core logic of
time-based and count-based "batching" -- or am I the only one with this
problem? If others are interested, I could start a patch.
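
A rough sketch of what such an abstract class might look like, assuming
nothing about Flume's own classes: AbstractBatcher, emit(), and the
constructor parameters are hypothetical names chosen for illustration. A
subclass would implement emit() to forward a finished batch; a batch is cut
either when maxCount events have accumulated or when the timer fires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical base class for count-based and time-based batching; not part of Flume.
abstract class AbstractBatcher<E> {
    private final int maxCount;
    private final long maxAgeMillis;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private List<E> buffer = new ArrayList<>();

    AbstractBatcher(int maxCount, long maxAgeMillis) {
        this.maxCount = maxCount;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Subclasses decide what to do with a finished batch. It can be called
    // from the appending thread or from the timer thread.
    protected abstract void emit(List<E> batch);

    // Starts the periodic flush. Kept out of the constructor so the timer
    // never sees a partially constructed object. If emit() throws on the
    // timer thread, later timed flushes are cancelled by the executor.
    void open() {
        timer.scheduleWithFixedDelay(
                this::flush, maxAgeMillis, maxAgeMillis, TimeUnit.MILLISECONDS);
    }

    // Buffers one event and emits a batch once maxCount is reached.
    void append(E event) {
        List<E> full = null;
        synchronized (this) {
            buffer.add(event);
            if (buffer.size() >= maxCount) {
                full = swap();
            }
        }
        // Emit outside the lock so a slow downstream does not block the timer.
        if (full != null) {
            emit(full);
        }
    }

    // Emits whatever is buffered, if anything.
    void flush() {
        List<E> batch;
        synchronized (this) {
            if (buffer.isEmpty()) {
                return;
            }
            batch = swap();
        }
        emit(batch);
    }

    // Stops the timer, waits briefly for a running flush, then emits the rest.
    void close() {
        timer.shutdown();
        try {
            timer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        flush();
    }

    // Caller must hold the lock on this object.
    private List<E> swap() {
        List<E> out = buffer;
        buffer = new ArrayList<>();
        return out;
    }
}
```

The lock is held only while the buffer is swapped, never while emit() runs.
That avoids blocking appenders behind a slow downstream, at the cost that two
batches can be emitted concurrently, so emit() has to tolerate that.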


> 2) It's possible for our sink decorator to generate more events than it
> receives, so I am afraid it could become behind -- are there dangers in
> using a threadpool to call append() from a decorator to forward events to
> the collector?
>
>
I'm still wondering if a decorator might call through to its sink with a
background threadpool.  Any thoughts about whether this is a
good/bad/terrible idea?

Thanks,
Joe