Posted to user@chukwa.apache.org by Bill Graham <bi...@gmail.com> on 2009/11/13 00:35:50 UTC

Chukwa HDFS layout and dataflow

I'm trying to get my head around the various Chukwa directories in HDFS and
how they relate to the dataflow and the related processes. If anyone could
help me check my assumptions below and fill in the blanks, it would be
greatly appreciated. Maybe the output of this thread could be used to seed
documentation about this on the site.

After running the collector (via the chukwa-collector script) and the
chukwa-data-processors script, and adding some test data, I see the
following directories in HDFS. Two entries (bin/watchdog.sh and
tools/expire.sh) also mysteriously appeared in my crontab; if anyone knows
how that happens, I'd like to hear it.

/chukwa/
   archivesProcessing/
   dataSinkArchives/
   demuxProcessing/
   finalArchives/
   logs/
   postProcess/
   repos/
   rolling/
   temp/

Here's what I think is happening:

1. Collectors write chunks to logs/*.chukwa files until either the 64MB
chunk size or a given time interval is reached.

2. Collectors close the chunks and move them to
dataSinkArchives/[YYYYMMDD]/*/*.done

3. Some process (which one?) aggregates, orders and dedups (I think?) the
*.done files and writes the data to
repos/[clusterName]/[dataType]/[YYYYMMDD]/[HH]/[mm]/*.evt

4. Demux runs and somehow uses the demuxProcessing/ directory. Mine is
empty.

5. At midnight some process aggregates (?) the daily data to
finalArchives/[YYYYMMDD]/*/chukwaArchive-part-*
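
To make my assumption in step 3 concrete, here is a rough Python sketch of
how I picture a repos/ path breaking down. It assumes the layout really is
cluster, data type, an eight-digit year-month-day, hour, minute, and then an
.evt file, which I have not confirmed. The function name and the example
cluster, data type and file names used with it are made up for illustration.

```python
import re
from datetime import datetime

# Assumed (unconfirmed) layout, taken from step 3:
#   /chukwa/repos/<clusterName>/<dataType>/<YYYYMMDD>/<HH>/<mm>/<file>.evt
REPOS_EVT = re.compile(
    r"^/chukwa/repos/(?P<cluster>[^/]+)/(?P<data_type>[^/]+)/"
    r"(?P<day>\d{8})/(?P<hour>\d{2})/(?P<minute>\d{2})/(?P<name>[^/]+\.evt)$"
)


def parse_repos_path(path):
    """Split a repos/ path into (cluster, data_type, bucket_time, file_name).

    Returns None if the path does not match the assumed layout.
    """
    m = REPOS_EVT.match(path)
    if m is None:
        return None
    stamp = m.group("day") + m.group("hour") + m.group("minute")
    try:
        bucket_time = datetime.strptime(stamp, "%Y%m%d%H%M")
    except ValueError:
        # Digits were present but do not form a valid date/time.
        return None
    return m.group("cluster"), m.group("data_type"), bucket_time, m.group("name")
```

If a path from a real cluster comes back as None, then my picture of the
layout is wrong somewhere.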

These directories are empty in my cluster, so I'm not sure what/who uses
them:

archivesProcessing/mrInput
postProcess/
rolling/daily/[YYYYMMDD]/[clusterName]/[dataType]
rolling/hourly/[YYYYMMDD]/[clusterName]/[dataType]
temp/hourlyrolling/[clusterName]/[dataType]/[YYYYMMDD]

What I'm ultimately trying to find out is which files or directories I
should use as input if I were to write my own MR summary jobs for 5-minute,
1-hour and 24-hour summaries. My guess is the repos/ dir. And should I put
the output anywhere special within the Chukwa part of HDFS?
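
For example, if repos/ is the right input, I imagine building the input
directory list for a summary window roughly like the Python sketch below.
This is only a guess at the approach: it assumes the repos/ layout from step
3 and that the minute directories fall on 5-minute boundaries, neither of
which I have confirmed. The function name and its parameters are made up.

```python
from datetime import datetime, timedelta


def repos_input_dirs(cluster, data_type, start, window,
                     step_minutes=5, root="/chukwa/repos"):
    """List the bucket directories covering [start, start + window).

    Assumes (unconfirmed) directories of the form
    <root>/<cluster>/<data_type>/<YYYYMMDD>/<HH>/<mm>, with one directory
    every step_minutes minutes.
    """
    step = timedelta(minutes=step_minutes)
    end = start + window
    dirs = []
    t = start
    while t < end:
        dirs.append(f"{root}/{cluster}/{data_type}/{t:%Y%m%d}/{t:%H}/{t:%M}")
        t += step
    return dirs
```

A 1-hour summary would then take twelve of these directories as input and a
24-hour summary 288, rolling over to the next date directory at midnight.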

thanks,
Bill