Posted to user@chukwa.apache.org by Jonathan Mervine <jm...@rcanalytics.com> on 2014/06/27 22:38:24 UTC

Trying to determine if Chukwa is what I need

Hey, I came across Chukwa from a blog post, and it looks like there is a real effort there to collect data from multiple sources and pump it into HDFS.

I was looking at this PDF from the wiki: https://wiki.apache.org/hadoop/Chukwa?action=AttachFile&do=view&target=ChukwaPoster.pdf

The chart in the middle seems to imply that two of the agents you can have are one that takes in streaming data and one that is associated with Log4J and works with log files in particular.

I'm pretty new to Hadoop, so I'm trying to learn a lot about it in a short time, but what I'm looking for is some kind of system that will monitor a directory for files being placed there. I don't know what kind of files they could be: CSVs, PSVs, DOCs, TXTs, and many others. A later stage would be formatting, parsing, and analyzing, but for now I just want to be able to detect when a file is placed there. After a file has been detected, it should be sent on its way to be placed into HDFS. This should be a completely autonomous and automatic process (or as much as possible).

Is this something Chukwa can help me with? If not, do you know of any system that might do what I want? I've read a little about Oozie, Falcon, Flume, Scribe, and a couple of other projects, but I don't think I've found what I'm looking for. Any information you could provide to help me on my way or clear up any misunderstanding I may have would be great!

Thanks
jmervine@rcanalytics.com

Re: Trying to determine if Chukwa is what I need

Posted by Ariel Rabkin <as...@gmail.com>.
Chukwa should be a great fit for this use case. If you use the
DirTailingAdaptor, it should send every file. If none of the files
will get very big, I would use the FileAdaptor -- it sends each file
as one Chunk, so the file is stored as consecutive bytes in HDFS,
which is the most convenient form for analysis. This relies on being
able to buffer the full contents in memory, so they all get written
as one block.
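
As a rough, untested sketch (the RawFiles datatype and the directory
path below are placeholders), the agent commands would look something
like this -- either one FileAdaptor per file you already know about,
or a DirTailingAdaptor told to spawn a FileAdaptor for each file it
finds:

add FileAdaptor RawFiles /data/incoming/report.csv 0
add DirTailingAdaptor RawFiles /data/incoming/ *.csv FileAdaptor 0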

If your files might be big -- many megabytes -- you might hit an
internal buffer-size limitation. Those limits are configurable and
can be adjusted to your needs.

If your files are big enough that you worry about RAM consumption
during the copy, you will want an adaptor that breaks each file into
smaller chunks before sending. You should probably subclass
LWFTAdaptor to get the semantics you want.
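
Roughly, the override might look like the sketch below. This is from
memory rather than from the Chukwa tree, so treat the class, field,
and method names (LWFTAdaptor, type, toWatch, extractRecords,
ChunkImpl) as assumptions and check them against the version you run:

import org.apache.hadoop.chukwa.ChunkImpl;
import org.apache.hadoop.chukwa.datacollection.ChunkReceiver;
import org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.LWFTAdaptor;

// Hypothetical adaptor that caps the size of each Chunk it emits.
public class BoundedChunkAdaptor extends LWFTAdaptor {
  // 512 KB per chunk; pick whatever fits your RAM budget.
  private static final int MAX_CHUNK = 512 * 1024;

  @Override
  protected int extractRecords(ChunkReceiver eq, long buffOffsetInFile, byte[] buf)
      throws InterruptedException {
    int sent = 0;
    while (sent < buf.length) {
      int len = Math.min(MAX_CHUNK, buf.length - sent);
      byte[] piece = new byte[len];
      System.arraycopy(buf, sent, piece, 0, len);
      // The sequence ID marks the offset just past the last byte of this piece.
      eq.add(new ChunkImpl(type, toWatch.getAbsolutePath(),
          buffOffsetInFile + sent + len, piece, this));
      sent += len;
    }
    return sent; // bytes consumed from the buffer
  }
}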

There are a couple ugly cases to think about, regardless of which
adaptor you use.

- What if a file is updated after being placed?
- What if a file is deleted and then a new file with the same name is created?
- What if a file is placed and then immediately moved or deleted,
before Chukwa finishes grabbing it?

I'm not sure which of those cases will come up for you or what you
want to have happen. You might want to do some testing to check that
you get the behavior you want.

I believe Eric's comments here are overly cautious. Depending on
your file sizes and on what kind of downstream processing you need,
you might not get optimal performance from MapReduce jobs run against
the output in HDFS, but if you're just interested in storage, it
should be fine.


On Sat, Jun 28, 2014 at 11:52 AM, Eric Yang <er...@gmail.com> wrote:
>
> Hi Jon,
>
> Chukwa can take files from a directory and ship them to HDFS, with some
> limitations. First, the data needs to be of the same type within a
> directory. Second, Chukwa does not ship identical copies of the files to
> HDFS: it extracts the files into records before the data is shipped to
> HDFS or HBase. The files written to HDFS are optimized for MapReduce jobs
> because they are closed at a fixed interval. The assumption is that the
> collector creates files of similar size, so that MapReduce tasks take a
> roughly even amount of time and parallelize well. Chukwa is designed to
> ship the record entries in log files; it may not perform well shipping
> Word documents or images. Flume is designed to ship original files, so if
> you have a requirement to ship original files rather than records, Flume
> may be the better choice for that problem.
>
> For testing purposes, tailing the files in a directory can be achieved with this command on the Chukwa agent control port (9093):
>
> add DirTailingAdaptor logs /var/log/ *.log filetailer.CharFileTailingAdaptorUTF8 0
>
> This will spawn a CharFileTailingAdaptorUTF8 for each log file in the directory.  If a log file is removed, its adaptor is automatically shut down.
>
> Hope this helps.
>
> regards,
> Eric



-- 
Ari Rabkin asrabkin@gmail.com
Princeton Computer Science Department

Re: Trying to determine if Chukwa is what I need

Posted by Eric Yang <er...@gmail.com>.
Hi Jon,

Chukwa can take files from a directory and ship them to HDFS, with some
limitations.  First, the data needs to be of the same type within a
directory.  Second, Chukwa does not ship identical copies of the files to
HDFS: it extracts the files into records before the data is shipped to HDFS
or HBase.  The files written to HDFS are optimized for MapReduce jobs
because they are closed at a fixed interval.  The assumption is that the
collector creates files of similar size, so that MapReduce tasks take a
roughly even amount of time and parallelize well.  Chukwa is designed to
ship the record entries in log files; it may not perform well shipping Word
documents or images.  Flume is designed to ship original files, so if you
have a requirement to ship original files rather than records, Flume may be
the better choice for that problem.

For testing purposes, tailing the files in a directory can be achieved
with this command on the Chukwa agent control port (9093):

add DirTailingAdaptor logs /var/log/ *.log filetailer.CharFileTailingAdaptorUTF8 0

This will spawn a CharFileTailingAdaptorUTF8 for each log file in the
directory.  If a log file is removed, its adaptor is automatically shut
down.
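
For example, assuming the agent is running locally with the default
control port, you can issue the command over telnet; the list command
should show the adaptors that were spawned:

$ telnet localhost 9093
add DirTailingAdaptor logs /var/log/ *.log filetailer.CharFileTailingAdaptorUTF8 0
list
close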

Hope this helps.

regards,
Eric

