Posted to dev@flume.apache.org by Israel Ekpo <is...@aicer.org> on 2013/09/08 05:47:25 UTC

Re: New Features Proposed for Apache Flume

Thank you, everyone, for your very constructive feedback. It was very
helpful.

To provide some background, most of these suggestions have been inspired by
features I have found in Logstash [3].

I am going to spend more time understanding how the CDK morphline commands
[4] work because I think they will really help with the transformation
utilities needed in the FileSource.

Regarding the GrokInterceptor, I was not aware of the existence of
MorphlineInterceptor. It already does what I was proposing with
GrokInterceptor. So we are cool from that end.
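For reference, wiring the MorphlineInterceptor into an agent would look
roughly like this (a sketch based on the Flume user guide; the agent and
source names and the morphline file path are placeholders):

```properties
# Attach a MorphlineInterceptor to source s1 of agent "agent"
agent.sources.s1.interceptors = i1
agent.sources.s1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
# Path to the morphline configuration file and the id of the morphline to run
agent.sources.s1.interceptors.i1.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent.sources.s1.interceptors.i1.morphlineId = morphline1
```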

In simple standalone tests, the commons-io class that I am planning to use
for the FileSource handles file rotations well, but I have not tested
renames or removals yet.
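The class in question is the commons-io Tailer. A minimal standalone sketch
of how I have been exercising it (the file path is just an example, and in
the real FileSource the handle() callback would build a Flume Event rather
than print):

```java
import java.io.File;
import org.apache.commons.io.input.Tailer;
import org.apache.commons.io.input.TailerListenerAdapter;

public class TailSketch {

    public static void main(String[] args) {
        TailerListenerAdapter listener = new TailerListenerAdapter() {
            @Override
            public void handle(String line) {
                // In the proposed FileSource this is where a Flume Event
                // would be built and handed to the channel processor.
                System.out.println("new line: " + line);
            }

            @Override
            public void fileRotated() {
                // Tailer calls this when it detects the file was rotated;
                // reading continues from the start of the new file.
                System.out.println("file rotated");
            }
        };

        // Poll once per second, starting from the end of the file
        // (tail -f semantics); create() spawns a daemon thread.
        final Tailer tailer =
                Tailer.create(new File("/var/log/app.log"), listener, 1000, true);

        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                tailer.stop();
            }
        });
    }
}
```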

Regarding the GeoIPInterceptor, we can provide links for downloading the
Maxmind database separately without bundling the IP database with Flume
releases.

This is how the Logstash project does it.

Because of the large number of events expected, I was planning to use
Lucene for the speed of executing range queries over its trie-based
indexing [5]; the results can also be cached in memory once a lookup has
been executed.

I can perform some benchmarks with and without Lucene and see if the
performance differences justify using it for the lookups.

My gut feeling is that using Lucene will lead to shorter processing times
as the volume of events increases.
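To make the lookup idea concrete: each block in the Maxmind CSV maps an IP
range to a location, so the range boundaries can be indexed as numeric long
fields and a lookup becomes a numeric range query. A small helper for the
IPv4-to-long conversion (the field names mentioned in the comment are
hypothetical):

```java
public class IpRange {

    // Convert a dotted-quad IPv4 address to its unsigned 32-bit value,
    // held in a long. Each block's boundaries would be indexed as numeric
    // fields (e.g. "blockStart", "blockEnd") so that a lookup becomes
    // blockStart <= ip <= blockEnd, expressible with
    // NumericRangeQuery.newLongRange(...) in Lucene 4.x.
    public static long ipToLong(String ip) {
        String[] parts = ip.split("\\.");
        long value = 0;
        for (String part : parts) {
            value = (value << 8) | Integer.parseInt(part);
        }
        return value;
    }
}
```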

The RedisSource and RedisSink features will just be simple sources and
sinks. The sink will push [1] events to the Redis server and the source
will do a blocking pop [2] as it waits for new events to arrive on the
Redis server.
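At their core the two components reduce to a few calls on the Jedis client
[7] mentioned in the original proposal. A rough sketch (host, port and list
key are placeholder values, and error handling and reconnection logic are
omitted):

```java
import java.util.List;
import redis.clients.jedis.Jedis;

public class RedisSketch {

    private static final String LIST_KEY = "flume-events"; // placeholder key

    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379); // placeholder host/port
        try {
            // Sink side: append an event body to the tail of the list (RPUSH).
            jedis.rpush(LIST_KEY, "an event body");

            // Source side: block until an event is available, with a
            // 5-second timeout (BLPOP). The reply is [key, value].
            List<String> reply = jedis.blpop(5, LIST_KEY);
            if (reply != null) {
                System.out.println("popped: " + reply.get(1));
            }
        } finally {
            jedis.close();
        }
    }
}
```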

I am still trying out a few things; this part is not yet finalized.

Regarding contributing features as plugins, how are plugins typically
contributed and managed?

Do I have to create a GitHub repo and manage it independently, or are they
contributed as patches to the Flume project?

[1] http://redis.io/commands/rpush
[2] http://redis.io/commands/blpop
[3] http://logstash.net/docs/1.2.1/
[4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
[5]
http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/NumericRangeQuery.html

*Author and Instructor for the Upcoming Book and Lecture Series*
*Massive Log Data Aggregation, Processing, Searching and Visualization with
Open Source Software*
*http://massivelogdata.com*


On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek <wh...@cloudera.com> wrote:

> Re: GrokInterceptor
>
> This functionality is already available in the form of the Apache Flume
> MorphlineInterceptor [1] with the grok command [2]. While grok is very
> useful, consider that grok alone often isn't enough - you typically need
> some other log event processing commands as well, for example as contained
> in morphlines [3].
>
> Re: FileSource
>
> True file tailing would be great.
>
> Merging multiple lines into one event can already be done with the
> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a
> morphline directly into that new FileSource?
>
> Re: GeoIPInterceptor
>
> Seems to me that it would be more flexible, powerful and reusable to add
> this kind of functionality as a morphline command - contributions welcome!
>
> Finally, a word of caution, Maxmind is a good geo db, and I've used it
> before, but it has some LGPL issues that may or may not be workable in this
> context. Maxmind db fits into RAM - Lucene seems like overkill here - you
> can do fast maxmind lookups directly without Lucene.
>
> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> [2]
> http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
> [3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [4]
> http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine
>
> Wolfgang.
>
> >
> > *FileSource*
> >
> > Using the Tailer feature from Apache Commons I/O utility [1], we can tail
> > specific files for events.
> >
> > This allows us, regardless of the operating system, to watch files for
> > future events as they occur.
> >
> > It also allows us to step in and determine if two or more events should
> > be merged into one event if newline characters are present in an event.
> >
> > We can configure certain regular expressions that determine whether a
> > specific line is a new event or part of the previous event.
> >
> > Essentially, this source will have the ability to merge multiple lines
> into
> > one event before it is passed on to interceptors.
> >
> > It has been complicated to group multiple lines into a single event
> > with the Spooling Directory Source or Exec Source. I tried creating
> > custom deserializers but it was hard to get around the logic used to
> > parse the files.
> >
> > Using the Spooling Directory Source also means we cannot watch the
> > original files, so we need a background process to copy the log files
> > into the spooling directory, which requires additional setup.
> >
> > The tail command is also not available on all operating systems out of
> > the box.
> >
> >
> > *GrokInterceptor*
> >
> > With this interceptor we can parse semi-structured and unstructured
> > text and log data in the headers and body of the event into something
> > structured that can be easily queried. I plan to use the information in
> > [2] and [3] for this.
> > With this interceptor, we can extract HTTP response codes, response
> > times, user agents, IP addresses and a whole bunch of useful data
> > points from free-form text.
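As an illustration, a grok expression built from the standard pattern names
in [2] could turn a line like `55.3.244.1 GET /index.html 15824 0.043` into
named fields:

```
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
```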
> >
> >
> >
> > *GeoIPInterceptor*
> >
> > This is for IP intelligence.
> >
> > This interceptor will allow us to use the value of an IP address in the
> > event header or body of the request to estimate the geographical location
> > of the IP address.
> >
> > Using the database available here [4], we can inject the two-letter code
> or
> > country name of the IP address into the event.
> >
> > We can also deduce other values such as city name, postal code,
> > latitude, longitude, Internet Service Provider and organization name.
> >
> > This can be very helpful in analyzing traffic patterns and target
> audience
> > from webserver or application logs.
> >
> > The database is loaded into a Lucene index when the agent is started
> > up. The index is only created once, if it does not already exist.
> >
> > As the interceptor comes across events, it maps the IP address to a
> variety
> > of values that can be injected into the events.
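A sketch of the lookup itself, assuming the legacy Maxmind Java API that
accompanies the database in [4] and the direct in-memory approach Wolfgang
mentions (the database file name and sample IP are placeholders, and the
.dat file must be downloaded separately):

```java
import java.io.IOException;
import com.maxmind.geoip.Location;
import com.maxmind.geoip.LookupService;

public class GeoIpSketch {

    public static void main(String[] args) throws IOException {
        // Load the GeoLite City database entirely into memory for fast
        // lookups; the .dat file is downloaded separately from [4].
        LookupService lookup =
                new LookupService("GeoLiteCity.dat",
                        LookupService.GEOIP_MEMORY_CACHE);
        try {
            Location loc = lookup.getLocation("8.8.8.8");
            if (loc != null) {
                // These are the kinds of values the interceptor would
                // inject into the event headers.
                System.out.println(loc.countryCode + " / " + loc.countryName
                        + " / " + loc.city + " / "
                        + loc.latitude + "," + loc.longitude);
            }
        } finally {
            lookup.close();
        }
    }
}
```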
> >
> >
> >
> > *RedisSink*
> >
> > This can provide another option for setting up a fan-in and/or fan-out
> > architecture.
> >
> > The RedisSink can serve as a queue that is used as a source by another
> > agent down the line.
> >
> > *References*
> > [1]
> >
> http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
> > [2] https://github.com/NFLabs/java-grok
> > [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
> > [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
> > [5] http://dev.maxmind.com/geoip/legacy/csv/
> > [6] http://redis.io/documentation
> > [7] https://github.com/xetorthio/jedis
> >
>
>

Re: New Features Proposed for Apache Flume

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

Don't want to beat a dead horse, but I just stumbled upon this email.  Note
how Israel wasn't aware of MorphlineInterceptor when he suggested
GrokInterceptor.  I think that's because MorphlineInterceptor lives under
the org.apache.flume.sink.*solr*.... package.

See:
* http://search-hadoop.com/m/0jVep1J1hJL&subj=MorphlineInterceptor+questions
*
http://search-hadoop.com/m/23imV1tSCQK1&subj=Questions+about+Morphline+Solr+Sink+structure

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



Re: New Features Proposed for Apache Flume

Posted by Wolfgang Hoschek <wh...@cloudera.com>.
FYI, I've just added a new morphline command that returns Geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup - https://issues.cloudera.org/browse/CDK-227

This can then be used in the MorphlineInterceptor or Morphline Sink.

Wolfgang.
