Posted to dev@flume.apache.org by Eric Sammer <es...@cloudera.com> on 2011/07/26 07:41:29 UTC

Common feature requests

Flumers:

In what I call "my real job" of working with Cloudera customers I hear
common feature requests. For most (all, I think) of these there are
JIRAs. Normally we Cloudera folk talk about these internally but with
Flume now being an ASF project, I think it makes sense to shout them
out into the aether and bounce them around. I happen to be on site
with a customer who expressed interest in the items below (sorry, I
can't disclose who) and they're incredibly common.

* Robust multimaster. Many of the larger enterprises don't want to
touch something with even a hint of a SPOF.
* Transport and at-rest encryption. We've talked a bunch about in-flight
encryption but the contents of the WAL came up (a good point).
Supporting both Avro and Thrift RPC makes this literally twice as
hard.
* Autochains. Folks want redundant, N-way active collectors without
having to hand-configure failover chains. They want to say "hey all
you agents, get data over there." The more they can talk about classes
of Flume node, the happier they seem to be (i.e. agents vs. collectors
rather than 10.x.x.1 vs. 10.x.x.2).
* Tight data source integration. In this case, the discussion
mentioned C++ Avro or Thrift clients and a logback appender (similar
to the log4j appender). There's less of a focus on tail-style sources.
* Even more insight into performance, failures, potential failures,
backlog, etc. REST goes a long way here but SNMP and / or JMX probably
also make sense. Maybe a good first Flume project or GSoC project?
Jon already did some of the JMX stuff, I think.
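
One way to picture the autochain request above, as a minimal Java sketch.
It assumes something (for example the master) can already supply the list
of nodes in the "collector" class. The AutoChain class and the
rotate-by-hash policy are hypothetical illustrations, not an existing
Flume mechanism.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: derives a per-agent failover chain from a class of
// collectors, so nobody has to write the chain out by hand.
final class AutoChain {

    // Returns every collector exactly once. The starting point depends on
    // the agent name, so different agents tend to get different primaries
    // while still being able to fail over to all the other collectors.
    static List<String> failoverChain(String agentName, List<String> collectors) {
        List<String> chain = new ArrayList<>(collectors);
        if (chain.isEmpty()) {
            return chain;
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        int start = Math.floorMod(agentName.hashCode(), chain.size());
        // A negative distance rotates left: the element at 'start' moves to index 0.
        Collections.rotate(chain, -start);
        return chain;
    }
}
```

For example, AutoChain.failoverChain("agent-17", collectors) returns all of
the collectors, starting from whichever one the agent's name hashes to.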

I don't think these are anything new or unexpected. We've already been
bouncing around ideas for the new master / heartbeat stuff around ZK
that would address at least two of them. The encryption has been a
request almost since Flume day 1.

'tis all. Thanks.
-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com

Re: Common feature requests

Posted by Ralph Goers <ra...@dslextreme.com>.
I have a pretty good idea who your customer is and, if I'm correct, I'm not sure why you can't mention them.

I agree with most of these. The logback appender is a bit problematic: 1) Logback hides exceptions in appenders by default, so an appender will really want to use a different base class so that applications can be made aware of an error in the unlikely case one occurs; 2) the Log4j appender is a bit simplistic. When logging a format such as the SLF4J EventLogger (with EventData), the event data needs to be added to the Flume Event. A configurable set of the MDC data also needs to be added. Also, the raw audit record should be saved as the "body" so that the original event can always be retrieved without any possible data loss.
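
The mapping described above could look roughly like the following Java sketch. It deliberately avoids the real Flume and Logback APIs: SketchEvent, EventMapper, and the "mdc." attribute prefix are hypothetical names chosen for illustration, and the inputs are plain maps standing in for EventData and the MDC.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a Flume event: string attributes plus an opaque body.
final class SketchEvent {
    final Map<String, String> attributes;
    final byte[] body;

    SketchEvent(Map<String, String> attributes, byte[] body) {
        this.attributes = attributes;
        this.body = body;
    }
}

// Hypothetical mapper an appender could call for each logging event.
final class EventMapper {

    static SketchEvent toEvent(String rawRecord,
                               Map<String, String> eventData,
                               Map<String, String> mdc,
                               Set<String> mdcKeysToCopy) {
        Map<String, String> attributes = new LinkedHashMap<>();
        // The structured event data is carried as event attributes.
        attributes.putAll(eventData);
        // Only the configured subset of the MDC is copied; keys that are
        // absent from the MDC are skipped.
        for (String key : mdcKeysToCopy) {
            String value = mdc.get(key);
            if (value != null) {
                attributes.put("mdc." + key, value);
            }
        }
        // The raw record becomes the body unchanged, so the original event
        // can always be recovered from it.
        return new SketchEvent(attributes, rawRecord.getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping the raw record as the body means the attributes are a convenience for routing and querying, not the only copy of the data.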

With regard to encryption, in the case of Cassandra as the back end, any application reading the encrypted data will have to know how to decrypt it, since Cassandra doesn't support encryption/decryption yet. If Flume is handling the encryption then the key will have to be shared between Flume and some other application.
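
To make the key-sharing point concrete, here is a minimal Java sketch. It is an illustration under assumptions, not a proposed Flume design: it picks AES-GCM from the standard javax.crypto API, and the SharedKeyCrypto class and the IV-prefixed byte layout are hypothetical. Whatever reads the stored bytes back needs the same SecretKey the writer used.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper shared by the writer (e.g. a sink) and by any reader.
final class SharedKeyCrypto {
    private static final int IV_BYTES = 12;   // IV length commonly used with GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    // Generates a fresh 128-bit AES key; how it is distributed is out of scope.
    static SecretKey newKey() throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }

    // Writer side: returns a random IV followed by ciphertext and tag.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }

    // Reader side: needs the same key, and reads the IV from the stored prefix.
    static byte[] decrypt(SecretKey key, byte[] ivAndCiphertext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, ivAndCiphertext, 0, IV_BYTES));
        return cipher.doFinal(ivAndCiphertext, IV_BYTES, ivAndCiphertext.length - IV_BYTES);
    }
}
```

The store in between only ever sees the opaque bytes, which is why the key has to live in both Flume and the reading application.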

Ralph

Re: Common feature requests

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Nick,

Yes, I meant SNMP.  Thanks for the correction.

Thanks,
Jon.

On Thu, Jul 28, 2011 at 8:18 AM, NerdyNick <ne...@gmail.com> wrote:

> By SMTP did you mean SNMP? Also I think the Ops prefer this because
> Nagios & Cacti and other monitors have native support for those
> protocols. Not sure about JMX though.



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Common feature requests

Posted by NerdyNick <ne...@gmail.com>.
By SMTP did you mean SNMP? Also I think the Ops prefer this because
Nagios & Cacti and other monitors have native support for those
protocols. Not sure about JMX though.

On Thu, Jul 28, 2011 at 8:34 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> I want to reiterate that each of these could be a separate thread of
> discussion and a great place for folks to contribute.
>
> To the list, I'd like to add one more:
>
> * Simple node-side (no separate master interaction) configuration mechanism.
>  A simple approach would be to have a ./conf/nodes.d directory, with the name
> of each file being part of the node name and the contents being a source and
> a sink spec in an extensible format so other attributes can be added.  This
> would get automatically sent from the node to the master via the shell
> interface.  The timestamp would be the mod time of the file the
> configuration is contained in (so the master could override if the node restarts).
>
>
> Of the ones Eric suggested, I think robust multimaster and autochains are for a
> future version/branch (0.10/1.0?) but the others and the node-side config
> seem like they could go on a 0.9.x branch.
>
> I haven't done JMX but some folks have been telling me that SMTP and JMX are
> preferred by ops types.
>
> Jon.
-- 
Nick Verbeck - NerdyNick
----------------------------------------------------
NerdyNick.com
Coloco.ubuntu-rocks.org

Re: Common feature requests

Posted by Eric Sammer <es...@cloudera.com>.
To clarify, the reason I called these features out is because they're
not "new." We're supposed to have them already (encryption is
debatable). I don't think we should add any new features < 1.0 or 0.10
(whatever we want to call it). I think we should say there's no dessert
(brand new features) until we've eaten our veggies (stability,
production level versions of features we've promised).

-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com

Re: Common feature requests

Posted by Jonathan Hsieh <jo...@cloudera.com>.
I want to reiterate that each of these could be a separate thread of
discussion and a great place for folks to contribute.

To the list, I'd like to add one more:

* Simple node-side (no separate master interaction) configuration mechanism.
 A simple approach would be to have a ./conf/nodes.d directory, with the name
of each file being part of the node name and the contents being a source and
a sink spec in an extensible format so other attributes can be added.  This
would get automatically sent from the node to the master via the shell
interface.  The timestamp would be the mod time of the file the
configuration is contained in (so the master could override if the node restarts).
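
A minimal Java sketch of the nodes.d idea above. Everything concrete in it
is an assumption made for illustration: the ".conf" suffix, the "source"
and "sink" keys, and the NodeSpec and NodesDirScanner classes are
hypothetical, not an existing Flume format or API. It uses
java.util.Properties as the "extensible format" so extra attributes are
simply carried along. Sending the result to the master is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical holder for one node's configuration.
final class NodeSpec {
    final String nodeName;    // derived from the file name
    final String source;      // value of the assumed "source" key
    final String sink;        // value of the assumed "sink" key
    final long timestamp;     // file modification time in milliseconds
    final Properties extras;  // every key in the file, including unknown ones

    NodeSpec(String nodeName, String source, String sink, long timestamp, Properties extras) {
        this.nodeName = nodeName;
        this.source = source;
        this.sink = sink;
        this.timestamp = timestamp;
        this.extras = extras;
    }
}

final class NodesDirScanner {
    private static final String SUFFIX = ".conf";

    // Reads every *.conf file in the directory and turns it into a NodeSpec.
    static List<NodeSpec> scan(Path nodesDir) throws IOException {
        List<NodeSpec> specs = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(nodesDir, "*" + SUFFIX)) {
            for (Path file : files) {
                Properties props = new Properties();
                try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    props.load(in);
                }
                String fileName = file.getFileName().toString();
                String nodeName = fileName.substring(0, fileName.length() - SUFFIX.length());
                // The file's mod time doubles as the configuration timestamp.
                long modTime = Files.getLastModifiedTime(file).toMillis();
                specs.add(new NodeSpec(nodeName, props.getProperty("source"),
                        props.getProperty("sink"), modTime, props));
            }
        }
        return specs;
    }
}
```

With this layout, a file named web01.conf containing "source = ..." and
"sink = ..." lines would describe the node "web01", and any other keys in
the file stay available through extras.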


Of the ones Eric suggested, I think robust multimaster and autochains are for a
future version/branch (0.10/1.0?) but the others and the node-side config
seem like they could go on a 0.9.x branch.

I haven't done JMX but some folks have been telling me that SMTP and JMX are
preferred by ops types.

Jon.

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com