Posted to user@flume.apache.org by Connolly Juhani <ju...@cyberagent.co.jp> on 2012/01/13 02:17:07 UTC

Flume NG reliability and failover mechanisms

Hi,

Coming into the new year we've been trying out Flume NG, and have run into
some questions. I've tried to pick up what I could from the javadoc and
source, but pardon me if some of these are obvious.

1) The post at http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/
describes the reliability model, but what happens if we lose a node?
1.1) Presumably the data stored in its channel is gone?
1.2) If we restart the node and the channel is a persistent one (file or
JDBC based), will it then happily start feeding data into the sink?

2) Is there some way to deliver data along multiple paths but make sure
it only gets persisted to a sink once, to avoid losing data to a dying
node?
2.1) Will there be anything equivalent to the E2E mode of OG?
2.2) Anything else planned further down the road? I didn't see much at
https://cwiki.apache.org/confluence/display/FLUME/Features+and+Use+Cases
but that page doesn't look very up to date.

3) Using the HDFS sink, we're getting tons of really small files. I
suspect this is related to append: having a poke around the source, it
turns out that append is only used if hdfs.append.support is set to
true. The hdfs-default.xml name for this variable is dfs.support.append.
Is this intentional? Should we be adding hdfs.append.support manually to
our config, or is there something else going on here (regarding all the
tiny files)?

Any help with these issues would be greatly appreciated.

Re: Flume NG reliability and failover mechanisms

Posted by Connolly Juhani <ju...@cyberagent.co.jp>.
Thanks for the reply.

So I've had more time to go over the code and get a better understanding
of things, as well as running a few gigs a day through the memory
channel to a collection layer. We're looking to use Flume throughout our
infrastructure (a few hundred GB a day compressed, I believe), and are
trying to see if we can work with NG from the get-go to avoid having to
switch everything over and rewrite custom stuff later.

Rest inline, along with a couple of larger issues that we've run into.

> Hi Connolly,
> 
> Thanks for taking time to evaluate Flume NG. Please see my comments inline
> below:
> 
> On Thu, Jan 12, 2012 at 5:17 PM, Connolly Juhani <
> juhani_connolly@cyberagent.co.jp> wrote:
> 
> > Hi,
> >
> > Coming into the new year we've been trying out Flume NG, and have run into
> > some questions. I've tried to pick up what I could from the javadoc and
> > source, but pardon me if some of these are obvious.
> >
> > 1) The post at
> > http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/
> > describes the reliability model, but what happens if we lose a node?
> > 1.1) Presumably the data stored in its channel is gone?
> >
> 
> It depends upon the kind of channel you have. If you use a memory channel,
> the data will be gone. If you use the file channel, the data will be
> available. If you use the JDBC channel, it is guaranteed to be available.
> 
> 
> > 1.2) If we restart the node and the channel is a persistent one (file or
> > JDBC based), will it then happily start feeding data into the sink?
> >
> 
> Correct.
> 

Ok, that all seems clear enough now
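
For anyone else reading along, this is roughly how the channel choice
shows up in an agent's properties file. A sketch only: the
durable-channel property names (checkpointDir, dataDirs) are ones we
pulled from the current source, so treat them as assumptions rather
than documented config.

agent.sources = src1
agent.channels = ch1
agent.sinks = snk1

# memory channel: fast, but events are lost if the node dies
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 10000

# durable alternatives (events survive a restart):
# agent.channels.ch1.type = jdbc
# agent.channels.ch1.type = file
# agent.channels.ch1.checkpointDir = /var/flume/checkpoint
# agent.channels.ch1.dataDirs = /var/flume/data

agent.sources.src1.channels = ch1
agent.sinks.snk1.channel = ch1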

> 
> >
> > 2) Is there some way to deliver data along multiple paths but make sure
> > it only gets persisted to a sink once, to avoid losing data to a dying
> > node?
> >
> 
> We have talked about fail-over sink implementations. Although we don't have
> them implemented yet, we do intend to provide these facilities.
> 

We're going to need something, and I was thinking of trying to put
together a failover channel (working like the fanout, except that
instead of copying messages to all channels, it tries one at a time).
Would this be better implemented as a sink? If we go ahead with it,
I'd be happy to share the code back; is this something that would be
welcome in NG?
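
To make the "try one at a time" idea concrete, the core of it would be
something like the sketch below. Sink here is a stand-in interface made
up for illustration, not the actual NG org.apache.flume.Sink contract.

import java.util.List;

// Hypothetical failover selector: tries sinks in priority order and
// falls through to the next one only when the current one fails.
interface Sink {
    boolean process() throws Exception; // true = delivered a batch
}

class FailoverSelector {
    private final List<Sink> orderedSinks; // highest priority first

    FailoverSelector(List<Sink> orderedSinks) {
        this.orderedSinks = orderedSinks;
    }

    // Attempt one delivery; on total failure the events simply stay
    // in the channel for a later retry.
    boolean deliverOnce() {
        for (Sink sink : orderedSinks) {
            try {
                return sink.process(); // stop at the first sink that works
            } catch (Exception e) {
                // this sink looks down; fall through to the next one
            }
        }
        return false; // every path failed
    }
}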

> 
> > 2.1) Will there be anything equivalent to the E2E mode of OG?
> >
> 
> If you mean end-to-end reliable delivery guarantee, Flume NG already
> provides that. You can get this by configuring your flow with reliable
> channels (JDBC).
> 
> 
> > 2.2) Anything else planned further down the road? I didn't see much at
> > https://cwiki.apache.org/confluence/display/FLUME/Features+and+Use+Cases
> > but that page doesn't look very up to date.
> >
> 
> Most of the discussion has now moved to JIRA and the dev list. Features
> such as channel multiplexing from the same source, a compatible source
> implementation for hybrid installations of the previous version of Flume
> and NG together, and event prioritization have been discussed, among many
> others. As and when resources permit, we will be addressing these going
> forward.
> 
> 

I've been looking around the JIRA a fair bit, but I guess I'd better
start reading over the dev logs.

> >
> > 3) Using the HDFS sink, we're getting tons of really small files. I
> > suspect this is related to append: having a poke around the source, it
> > turns out that append is only used if hdfs.append.support is set to
> > true. The hdfs-default.xml name for this variable is
> > dfs.support.append. Is this intentional? Should we be adding
> > hdfs.append.support manually to our config, or is there something else
> > going on here (regarding all the tiny files)?
> >
> 
> (Leaving this for Prasad who did the implementation of HDFS sink)
> 

I think our problem may have to do with our bucketing settings (or, more
specifically, the lack of them), though the config names still don't
match, so I'm not entirely sure what's going on.
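
For what it's worth, these are the roll settings we're experimenting
with to get fewer, larger files. The hdfs.roll* names below are taken
from the sink source we were reading, so treat them as assumptions
rather than documented config:

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
# roll a new file only when one of these thresholds is hit
agent.sinks.hdfsSink.hdfs.rollInterval = 600
# ~128 MB per file, in bytes
agent.sinks.hdfsSink.hdfs.rollSize = 134217728
# 0 disables count-based rolling
agent.sinks.hdfsSink.hdfs.rollCount = 0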

> 
> >
> > Any help with these issues would be greatly appreciated.
> >
> 
> Thanks,
> Arvind

We have a couple of major sticking points though:
- The lack of a working file channel is pretty major. I've been
following the issue, and am curious how high up the priority order
this rather important component is (understanding that it is quite
awkward due to the need to write a solid WAL implementation).

- We're trying to work with JDBC instead of the file-based channel, but
the throughput seems minuscule (mind you, this is with the embedded
Derby); it can't even keep up with top -b through an exec source. What
kind of setup would be recommended for the JDBC channel, and what kind
of throughput could we expect?
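
If it helps frame the numbers, the sort of probe below is what I have
in mind when I talk about throughput. Channel is a stand-in interface
made up for illustration, not the real NG org.apache.flume.Channel:

// Crude measure of how many put/take round trips per second a channel
// implementation sustains. All names here are hypothetical.
interface Channel {
    void put(byte[] event);
    byte[] take();
}

class ThroughputProbe {
    static void run(Channel ch, int events) {
        byte[] payload = new byte[200]; // roughly log-line sized
        long start = System.nanoTime();
        for (int i = 0; i < events; i++) {
            ch.put(payload);
            ch.take();
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d events in %.2fs (%.0f events/s)%n",
                events, secs, events / secs);
    }
}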

Thanks,
Juhani
