Posted to common-user@hadoop.apache.org by Ricky Ho <rh...@adobe.com> on 2009/04/10 06:06:14 UTC

HDFS as a logfile ??

I want to analyze the traffic pattern and statistics of a distributed application.  I am thinking of having the application write the events as log entries into HDFS, and then later I can use a Map/Reduce task to do the analysis in parallel.  Is this a good approach?

In this case, does HDFS support concurrent writes (appends) to a file?  Another question is whether the write API is thread-safe.
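
To make this concrete, the kind of analysis I have in mind would be a small job along these lines (just an illustrative sketch, assuming one log entry per line with the event type as the first whitespace-separated token; all class names and paths below are made up):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts log entries per event type.  Assumes one entry per line, with the
// event type as the first whitespace-separated token (a made-up format).
public class TrafficStats {

  public static class EventMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text eventType = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\\s+");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        eventType.set(fields[0]);
        context.write(eventType, ONE);                   // emit (eventType, 1)
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text eventType, Iterable<IntWritable> counts,
        Context context) throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      context.write(eventType, new IntWritable(total));  // emit (eventType, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "traffic-stats");
    job.setJarByClass(TrafficStats.class);
    job.setMapperClass(EventMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /logs/raw
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /logs/stats
    job.waitForCompletion(true);
  }
}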

Rgds,
Ricky

Re: HDFS as a logfile ??

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Also, Chukwa (a project already in Hadoop contrib) is designed to do  
something similar with Hadoop directly:

http://wiki.apache.org/hadoop/Chukwa

I think some of the examples even mention Apache logs.  Haven't used  
it personally, but it looks nice.

Brian

On Apr 9, 2009, at 11:14 PM, Alex Loddengaard wrote:

> This is a great idea and a common application, Ricky.  Scribe is probably
> useful for you as well:
>
> <http://sourceforge.net/projects/scribeserver/>
> <http://images.google.com/imgres?imgurl=http://farm3.static.flickr.com/2211/2197670659_b42810b8ba.jpg&imgrefurl=http://www.flickr.com/photos/niallkennedy/2197670659/&usg=__WLc-p9Gi_p3AdA-YuKLRZ-bdgvo=&h=375&w=500&sz=131&hl=en&start=2&sig2=P22LVO1KObby6_DDy8ujYg&um=1&tbnid=QudxiEyFOk1EpM:&tbnh=98&tbnw=130&prev=/images%3Fq%3Dfacebook%2Bscribe%2Bhadoop%26hl%3Den%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26sa%3DN%26um%3D1&ei=48beSa74L4H-swORnPmjDg>
>
> Scribe is what Facebook uses to get its Apache logs to Hadoop.
> Unfortunately, HDFS doesn't (yet) have append, so you'll have to batch log
> files and load them into HDFS in bulk.
>
> Alex
>
> On Thu, Apr 9, 2009 at 9:06 PM, Ricky Ho <rh...@adobe.com> wrote:
>
>> I want to analyze the traffic pattern and statistics of a distributed
>> application.  I am thinking of having the application write the events as
>> log entries into HDFS, and then later I can use a Map/Reduce task to do
>> the analysis in parallel.  Is this a good approach?
>>
>> In this case, does HDFS support concurrent writes (appends) to a file?
>> Another question is whether the write API is thread-safe.
>>
>> Rgds,
>> Ricky
>>


Re: HDFS as a logfile ??

Posted by Alex Loddengaard <al...@cloudera.com>.
This is a great idea and a common application, Ricky.  Scribe is probably
useful for you as well:

<http://sourceforge.net/projects/scribeserver/>
<http://images.google.com/imgres?imgurl=http://farm3.static.flickr.com/2211/2197670659_b42810b8ba.jpg&imgrefurl=http://www.flickr.com/photos/niallkennedy/2197670659/&usg=__WLc-p9Gi_p3AdA-YuKLRZ-bdgvo=&h=375&w=500&sz=131&hl=en&start=2&sig2=P22LVO1KObby6_DDy8ujYg&um=1&tbnid=QudxiEyFOk1EpM:&tbnh=98&tbnw=130&prev=/images%3Fq%3Dfacebook%2Bscribe%2Bhadoop%26hl%3Den%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26sa%3DN%26um%3D1&ei=48beSa74L4H-swORnPmjDg>

Scribe is what Facebook uses to get its Apache logs to Hadoop.
Unfortunately, HDFS doesn't (yet) have append, so you'll have to batch log
files and load them into HDFS in bulk.
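
The batch-and-load step itself is simple; something along these lines works (a minimal
sketch only -- the local rotation policy and the HDFS target path here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies a rotated local log file into HDFS as a new, timestamped file.
// Call this from whatever rotates your logs (e.g. once per hour).
public class LogUploader {
  public static void upload(String localLog) throws Exception {
    Configuration conf = new Configuration();   // picks up your cluster config
    FileSystem hdfs = FileSystem.get(conf);
    Path src = new Path(localLog);
    Path dst = new Path("/logs/raw/" + System.currentTimeMillis() + ".log");
    hdfs.copyFromLocalFile(src, dst);           // whole-file bulk copy; no append needed
  }
}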

Alex

On Thu, Apr 9, 2009 at 9:06 PM, Ricky Ho <rh...@adobe.com> wrote:

> I want to analyze the traffic pattern and statistics of a distributed
> application.  I am thinking of having the application write the events as
> log entries into HDFS, and then later I can use a Map/Reduce task to do the
> analysis in parallel.  Is this a good approach?
>
> In this case, does HDFS support concurrent writes (appends) to a file?
> Another question is whether the write API is thread-safe.
>
> Rgds,
> Ricky
>

Re: HDFS as a logfile ??

Posted by Ariel Rabkin <as...@gmail.com>.
Everything gets dumped into the same files.

We don't assume anything at all about the format of the input data; it
gets dumped into Hadoop sequence files, tagged with some metadata to
say what machine and app it came from, and where it was in the
original stream.
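
In other words, each record ends up looking roughly like this (a simplified
sketch using plain SequenceFile calls, not the actual Chukwa classes; the
key layout below is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes one chunk of log data to a SequenceFile, tagged with where it came from.
public class TaggedLogWriter {
  public static void write(String host, String app, long streamOffset, String chunk)
      throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/chukwa/demo.seq"), Text.class, Text.class);
    // The key records the source machine, application, and position in the
    // original stream; the value is the raw log data, whose format we know
    // nothing about.
    Text key = new Text(host + "/" + app + "/" + streamOffset);
    writer.append(key, new Text(chunk));
    writer.close();
  }
}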

There is a slight penalty from logging to local disk first. In practice, you
often want a local copy anyway, both for redundancy and because it's
much more convenient for human inspection.  Having a separate
collector process is indeed inelegant. However, HDFS copes badly with
many small files, so that pushes you to merge entries across either
hosts or time partitions. And since HDFS doesn't have a flush(),
having one log per source means that log entries don't become visible
quickly enough.   Hence, collectors.
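
So a collector is, at heart, just a loop like this (a heavily simplified sketch,
nothing like the real Chukwa code; the sink path and roll interval are made up):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Merges chunks arriving from many agents into one HDFS SequenceFile at a time,
// rolling to a new file periodically so the data becomes visible to readers.
public class MiniCollector implements Runnable {
  private final BlockingQueue<String[]> incoming = new LinkedBlockingQueue<String[]>();

  // Agents (or an RPC/HTTP front end) hand (sourceTag, data) pairs to the collector here.
  public void offer(String sourceTag, String data) {
    incoming.add(new String[] { sourceTag, data });
  }

  public void run() {
    try {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      while (true) {
        // One output file per interval: closing it is what makes its contents
        // visible, since there is no flush()/append on HDFS yet.
        Path sink = new Path("/chukwa/sink/" + System.currentTimeMillis() + ".seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, sink, Text.class, Text.class);
        long rollAt = System.currentTimeMillis() + 60 * 1000;   // roll every minute
        while (System.currentTimeMillis() < rollAt) {
          String[] chunk = incoming.poll(1, TimeUnit.SECONDS);
          if (chunk != null) {
            writer.append(new Text(chunk[0]), new Text(chunk[1]));
          }
        }
        writer.close();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}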

It isn't gorgeous, but it seems to work fine in practice.

On Mon, Apr 13, 2009 at 8:01 AM, Ricky Ho <rh...@adobe.com> wrote:
> Ari, thanks for your note.
>
> I'd like to understand more about how Chukwa groups log entries ... Say I have appA running on machines X and Y, and appB running on machines Y and Z, each of them calling the Chukwa log API.
>
> Do all entries go into the same HDFS file, or into 4 separate HDFS files based on the app/machine combination?
>
> If the answer to the first question is "yes", then what if appA and appB have different log entry formats?
> If the answer to the second question is "yes", then are all these HDFS files cut at the same time boundary?
>
> It looks like in Chukwa, the application first logs to a daemon, which buffer-writes the log entries into a local file.  A separate process then ships this data to a remote collector daemon, which issues the actual HDFS write.  I see the following overheads ...
>
> 1) The overhead of the extra write to local disk and of shipping the data over to the collector.  If HDFS supported append, the application could write directly to HDFS.
>
> 2) The centralized collector introduces a bottleneck into the otherwise perfectly parallel HDFS architecture.
>
> Am I missing something here?
>

-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department

RE: HDFS as a logfile ??

Posted by Ricky Ho <rh...@adobe.com>.
Ari, thanks for your note.

I'd like to understand more about how Chukwa groups log entries ... Say I have appA running on machines X and Y, and appB running on machines Y and Z, each of them calling the Chukwa log API.

Do all entries go into the same HDFS file, or into 4 separate HDFS files based on the app/machine combination?

If the answer to the first question is "yes", then what if appA and appB have different log entry formats?
If the answer to the second question is "yes", then are all these HDFS files cut at the same time boundary?

It looks like in Chukwa, the application first logs to a daemon, which buffer-writes the log entries into a local file.  A separate process then ships this data to a remote collector daemon, which issues the actual HDFS write.  I see the following overheads ...

1) The overhead of the extra write to local disk and of shipping the data over to the collector.  If HDFS supported append, the application could write directly to HDFS.

2) The centralized collector introduces a bottleneck into the otherwise perfectly parallel HDFS architecture.

Am I missing something here?

Rgds,
Ricky

-----Original Message-----
From: Ariel Rabkin [mailto:asrabkin@gmail.com] 
Sent: Monday, April 13, 2009 7:38 AM
To: core-user@hadoop.apache.org
Subject: Re: HDFS as a logfile ??

Chukwa is a Hadoop subproject aiming to do something similar, though
particularly for the case of Hadoop logs.  You may find it useful.

Hadoop unfortunately does not support concurrent appends.  As a
result, the Chukwa project found itself creating a whole new daemon,
the Chukwa collector, precisely to merge the event streams and write
them out, just once. We're set to do a release within the next week or
two, but in the meantime you can check it out from SVN at
https://svn.apache.org/repos/asf/hadoop/chukwa/trunk

--Ari

On Fri, Apr 10, 2009 at 12:06 AM, Ricky Ho <rh...@adobe.com> wrote:
> I want to analyze the traffic pattern and statistics of a distributed application.  I am thinking of having the application write the events as log entries into HDFS, and then later I can use a Map/Reduce task to do the analysis in parallel.  Is this a good approach?
>
> In this case, does HDFS support concurrent writes (appends) to a file?  Another question is whether the write API is thread-safe.
>
> Rgds,
> Ricky
>



-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department

Re: HDFS as a logfile ??

Posted by Ariel Rabkin <as...@gmail.com>.
Chukwa is a Hadoop subproject aiming to do something similar, though
particularly for the case of Hadoop logs.  You may find it useful.

Hadoop unfortunately does not support concurrent appends.  As a
result, the Chukwa project found itself creating a whole new daemon,
the Chukwa collector, precisely to merge the event streams and write
them out, just once. We're set to do a release within the next week or
two, but in the meantime you can check it out from SVN at
https://svn.apache.org/repos/asf/hadoop/chukwa/trunk

--Ari

On Fri, Apr 10, 2009 at 12:06 AM, Ricky Ho <rh...@adobe.com> wrote:
> I want to analyze the traffic pattern and statistics of a distributed application.  I am thinking of having the application write the events as log entries into HDFS, and then later I can use a Map/Reduce task to do the analysis in parallel.  Is this a good approach?
>
> In this case, does HDFS support concurrent writes (appends) to a file?  Another question is whether the write API is thread-safe.
>
> Rgds,
> Ricky
>



-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department