Posted to users@kafka.apache.org by Jeremy Hanna <je...@gmail.com> on 2011/09/29 19:12:58 UTC

Aggregating tomcat, log4j, other logs in realtime

We have a number of web servers in ec2 and periodically we just blow them away and create new ones.  That makes keeping logs problematic.  We're looking for a way to stream the logs from those various sources directly to a central log server - either just a single server or hdfs or something like that.

My question is whether kafka is a good fit for that or should I be looking more along the lines of flume or scribe?

Many thanks.

Jeremy

Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Jay Kreps <ja...@gmail.com>.
Hmm, I think I am still confused. Is the question how to set up Kafka to
consume from log files being produced on a particular server, even though
those log files may get rotated during consumption? Or are you saying you
want to get all the log files onto a central server somewhere, so the
question is whether there is an off-the-shelf method for consuming messages
and outputting them to log files in a rotating fashion? I think it is the
latter. I don't think we have something for that, though it is not much more
than a for loop, so it would be easy to add. It would be good to upgrade the
console consumer to output to a file with optional file rolling instead of
to the console.

-Jay
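The "for loop" Jay describes is essentially a consumer loop feeding a size-rolled file writer. Below is a minimal sketch of the rolling-writer half only, with the Kafka consumer itself stubbed out; the class and file-naming scheme are invented for illustration, not part of any Kafka release:

```python
import os

class RollingFileWriter:
    """Append lines to a file, rolling to a new numbered file once the
    current one exceeds max_bytes (roughly what an upgraded console
    consumer with optional file rolling might do)."""

    def __init__(self, directory, prefix, max_bytes=1024 * 1024):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.index = 0
        os.makedirs(directory, exist_ok=True)
        self._open()

    def _open(self):
        # Files are named <prefix>.<index>.log, e.g. web.0.log, web.1.log, ...
        self.path = os.path.join(self.directory,
                                 f"{self.prefix}.{self.index}.log")
        self.fh = open(self.path, "a")

    def write(self, line):
        self.fh.write(line + "\n")
        self.fh.flush()
        if self.fh.tell() >= self.max_bytes:
            # Current file is full: close it and open the next one.
            self.fh.close()
            self.index += 1
            self._open()

    def close(self):
        self.fh.close()
```

A consumer would simply call `writer.write(message)` for each message it pulls off the topic.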

On Thu, Sep 29, 2011 at 12:38 PM, Eric Hauser <ew...@gmail.com> wrote:

> Jun,
>
> I was referring to the logic that would be necessary for the consumer
> of the topic to rotate the log files on the centralized log server.
> With Flume you would handle this via configuration:
>
> collectorSink("file://var/logs/flume/webdata/%Y-%m-%d/%H00/", "web-")
>
> You would probably just use log4j or what not in your Kafka consumer
> to handle this.
>
> On Thu, Sep 29, 2011 at 3:20 PM, Jun Rao <ju...@gmail.com> wrote:
> > Eric,
> >
> > Thanks for the analysis. A couple of comments:
> >
> > Kafka recently added the end-to-end compression feature and we will be
> > releasing it soon. Please see
> > https://issues.apache.org/jira/browse/KAFKA-79for details.
> >
> > About the file rolling support, are you referring to Kafka log? Kafka
> logs
> > are rolled based on a preconfigured size.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Sep 29, 2011 at 11:25 AM, Eric Hauser <ew...@gmail.com>
> wrote:
> >
> >> Jeremy,
> >>
> >> I've used both Flume and Kafka, and I can provide some info for
> comparison:
> >>
> >> Flume
> >> - The current Flume release 0.9.4 has some pretty nasty bugs in it
> >> (most have been fixed in trunk).
> >> - Flume is a more complex to maintain operations-wise (IMO) than Kafka
> >> since you have to setup masters and collectors (you don't necessarily
> >> need collectors if you aren't writing to HDFS)
> >> - Flume has a well defined pattern for doing what you want:
> >>
> >>
> http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/
> >>
> >> Kafka
> >> - If you need multiple Kafka partitions for the logs, you will want to
> >> partition by host so the messages arrive in order for the same host
> >> - You can use the same piped technique as Flume to publish to Kafka,
> >> but you'll have to write a little code to publish and subscribe to the
> >> stream
> >> - Kafka does not provide any of the file rolling, compression, etc.
> >> that Flume provides
> >> - If you ever want to do anything more interesting with those log
> >> files than just send them to one location, publishing them to Kafka
> >> would allow you to add additional consumers later.  Flume has a
> >> concept of fanout sinks, but I don't care for the way it works.
> >>
> >>
> >>
> >> On Thu, Sep 29, 2011 at 1:48 PM, Jun Rao <ju...@gmail.com> wrote:
> >> > Jeremy,
> >> >
> >> > Yes, Kafka will be a good fit for that.
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > On Thu, Sep 29, 2011 at 10:12 AM, Jeremy Hanna
> >> > <je...@gmail.com>wrote:
> >> >
> >> >> We have a number of web servers in ec2 and periodically we just blow
> >> them
> >> >> away and create new ones.  That makes keeping logs problematic.
>  We're
> >> >> looking for a way to stream the logs from those various sources
> directly
> >> to
> >> >> a central log server - either just a single server or hdfs or
> something
> >> like
> >> >> that.
> >> >>
> >> >> My question is whether kafka is a good fit for that or should I be
> >> looking
> >> >> more along the lines of flume or scribe?
> >> >>
> >> >> Many thanks.
> >> >>
> >> >> Jeremy
> >> >
> >>
> >
>

Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Eric Hauser <ew...@gmail.com>.
Jun,

I was referring to the logic that would be necessary for the consumer
of the topic to rotate the log files on the centralized log server.
With Flume you would handle this via configuration:

collectorSink("file:///var/logs/flume/webdata/%Y-%m-%d/%H00/", "web-")

You would probably just use log4j or what not in your Kafka consumer
to handle this.


Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Anurag <an...@gmail.com>.
Jun,
Compression is an awesome feature.
Re: file rolling - I was referring to the Apache log rotation, not Kafka.



Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Jun Rao <ju...@gmail.com>.
Eric,

Thanks for the analysis. A couple of comments:

Kafka recently added the end-to-end compression feature and we will be
releasing it soon. Please see
https://issues.apache.org/jira/browse/KAFKA-79 for details.

About the file rolling support, are you referring to the Kafka log? Kafka logs
are rolled based on a preconfigured size.

Thanks,

Jun


Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Evan Chan <ev...@ooyala.com>.
One more point to this thread.  It's really hard to do partitioning in
Flume.
If you need partitioning but don't want to deal with a set of central
brokers, and don't need persistence, you can check out the new Storm project
(github.com/nathanmarz).

-Evan






Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Anurag <an...@gmail.com>.
Eric,
Thanks... we use logrotate on an hourly basis; just wanted to know if
there's anything different that we might be missing.

-anurag



Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Eric Hauser <ew...@gmail.com>.
Anurag,

I wouldn't tail the log files, but instead make use of Apache's
facilities to pipe the logs to another program:

http://httpd.apache.org/docs/2.2/mod/core.html#errorlog
http://httpd.apache.org/docs/2.0/programs/rotatelogs.html



Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Anurag <an...@gmail.com>.
Eric/Jun,
Can you throw some light on how to handle Apache log rotation? AFAIK,
even if we write custom code to tail a file, the file handle is lost
on rotation, which might result in some loss of data.
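The usual workaround, for what it's worth, is to detect the rotation and reopen; it narrows the data-loss window but does not close it. A sketch (a hypothetical helper, not from any library) that reopens the file when the path's inode changes:

```python
import os
import time

def follow(path, poll=0.5):
    """Generator that tails `path` and reopens it when log rotation swaps
    in a new file, detected by a changed inode. Lines written between the
    rotation and the reopen can still be missed, which is exactly the
    data-loss window described above."""
    fh = open(path)
    inode = os.fstat(fh.fileno()).st_ino
    while True:
        line = fh.readline()
        if line:
            yield line.rstrip("\n")
            continue
        try:
            if os.stat(path).st_ino != inode:  # file was rotated
                fh.close()
                fh = open(path)
                inode = os.fstat(fh.fileno()).st_ino
                continue
        except FileNotFoundError:
            # Rotation in progress: the new file is not there yet.
            pass
        time.sleep(poll)
```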



Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Jeremy Hanna <je...@gmail.com>.
Thanks a lot for the comparison Eric.  Really good to hear a perspective from a user of both.



Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Eric Hauser <ew...@gmail.com>.
Jeremy,

I've used both Flume and Kafka, and I can provide some info for comparison:

Flume
- The current Flume release 0.9.4 has some pretty nasty bugs in it
(most have been fixed in trunk).
- Flume is more complex to maintain operations-wise (IMO) than Kafka
since you have to set up masters and collectors (you don't necessarily
need collectors if you aren't writing to HDFS)
- Flume has a well defined pattern for doing what you want:
http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/

Kafka
- If you need multiple Kafka partitions for the logs, you will want to
partition by host so the messages arrive in order for the same host
- You can use the same piped technique as Flume to publish to Kafka,
but you'll have to write a little code to publish and subscribe to the
stream
- Kafka does not provide any of the file rolling, compression, etc.
that Flume provides
- If you ever want to do anything more interesting with those log
files than just send them to one location, publishing them to Kafka
would allow you to add additional consumers later.  Flume has a
concept of fanout sinks, but I don't care for the way it works.
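Eric's first Kafka point can be made concrete: hash the source hostname to pick a partition, so every line from one host lands on the same partition and stays in order. A toy sketch (crc32 chosen arbitrarily for illustration; this is not Kafka's built-in partitioner, though the producer lets you plug in one like it):

```python
import zlib

def partition_for(host, num_partitions):
    # Deterministically map a hostname to a partition number so that all
    # messages from that host go to one partition and preserve ordering.
    return zlib.crc32(host.encode("utf-8")) % num_partitions
```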




Re: Aggregating tomcat, log4j, other logs in realtime

Posted by Jun Rao <ju...@gmail.com>.
Jeremy,

Yes, Kafka will be a good fit for that.

Thanks,

Jun
