Posted to commits@chukwa.apache.org by as...@apache.org on 2010/03/23 19:14:56 UTC
svn commit: r926693 - in
/hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs: dataflow.xml
programming.xml site.xml
Author: asrabkin
Date: Tue Mar 23 18:14:56 2010
New Revision: 926693
URL: http://svn.apache.org/viewvc?rev=926693&view=rev
Log:
CHUKWA-458. Documentation for 0.4
Added:
hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml
Modified:
hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml
hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml
Added: hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml?rev=926693&view=auto
==============================================================================
--- hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml (added)
+++ hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml Tue Mar 23 18:14:56 2010
@@ -0,0 +1,129 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN"
+"http://forrest.apache.org/dtd/document-v20.dtd">
+
+<document>
+ <header>
+ <title>Guide to Chukwa Storage Layout</title>
+ </header>
+ <body>
+
+<section><title>Overview</title>
+<p>This document describes how Chukwa data is stored in HDFS and the processes that act on it.</p>
+</section>
+
+<section><title>HDFS File System Structure</title>
+
+<p>The general layout of the Chukwa filesystem is as follows.</p>
+
+<source>
+/chukwa/
+ archivesProcessing/
+ dataSinkArchives/
+ demuxProcessing/
+ finalArchives/
+ logs/
+ postProcess/
+ repos/
+ rolling/
+ temp/
+</source>
+</section>
+
+<section><title>Raw Log Collection and Aggregation Workflow</title>
+
+<p>What data is stored where is best described by stepping through the Chukwa workflow.</p>
+
+<ol>
+<li>Collectors write chunks to <code>logs/*.chukwa</code> files until a 64MB chunk size is reached or a given time interval has passed.
+ <ul><li><code>logs/*.chukwa</code></li></ul>
+</li>
+<li>Collectors close chunks and rename them to <code>*.done</code>
+<ul>
+<li>from <code>logs/*.chukwa</code></li>
+<li>to <code>logs/*.done</code></li>
+</ul>
+</li>
+<li>DemuxManager checks for <code>*.done</code> files every 20 seconds.
+ <ol>
+ <li>If <code>*.done</code> files exist, it moves them into place for demux processing:
+ <ul>
+ <li>from: <code>logs/*.done</code></li>
+ <li>to: <code>demuxProcessing/mrInput</code></li>
+ </ul>
+ </li>
+ <li>The Demux MapReduce job is run on the data in <code>demuxProcessing/mrInput</code>.</li>
+ <li>If demux succeeds within 3 attempts, it archives the completed files:
+ <ul>
+ <li>from: <code>demuxProcessing/mrOutput</code></li>
+ <li>to: <code>dataSinkArchives/[yyyyMMdd]/*/*.done</code> </li>
+ </ul>
+ </li>
+ <li>Otherwise, it moves the completed files to an error folder:
+ <ul>
+ <li>from: <code>demuxProcessing/mrOutput</code></li>
+ <li>to: <code>dataSinkArchives/InError/[yyyyMMdd]/*/*.done</code> </li>
+ </ul>
+ </li>
+ </ol>
+</li>
+<li>PostProcessManager wakes up every few minutes and aggregates, orders, and de-duplicates record files.
+ <ul><li>from: <code>postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt</code></li>
+ <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt</code></li>
+ </ul>
+</li>
+<li>HourlyChukwaRecordRolling runs M/R jobs at 16 minutes past the hour to group the five-minute logs into hourly files.
+ <ul>
+ <li>from: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt</code></li>
+ <li>to: <code>temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]</code></li>
+ <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt</code></li>
+ <li>leaves: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/</code> </li>
+ </ul>
+</li>
+<li>DailyChukwaRecordRolling runs M/R jobs at 1:30 AM to group the hourly logs into daily files.
+ <ul>
+ <li>from: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt</code></li>
+ <li>to: <code>temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]</code></li>
+ <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt</code></li>
+ <li>leaves: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/</code> </li>
+ </ul>
+ </li>
+<li>ChukwaArchiveManager runs roughly every half hour, using M/R to aggregate and then remove dataSinkArchives data.
+ <ul>
+ <li>from: <code>dataSinkArchives/[yyyyMMdd]/*/*.done</code></li>
+ <li>to: <code>archivesProcessing/mrInput</code></li>
+ <li>to: <code>archivesProcessing/mrOutput</code></li>
+ <li>to: <code>finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*</code> </li>
+ </ul>
+ </li>
+ </ol>
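The rolling steps above hinge on date-based path construction. Below is a minimal Java sketch of composing the hourly-rolled target path; the class and method names are hypothetical illustrations, not code from the Chukwa tree:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class RolledPathDemo {
    // Builds a path shaped like the HourlyChukwaRecordRolling target:
    //   repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
    // Illustrative sketch only; not Chukwa source.
    static String hourlyDonePath(String cluster, String dataType, long tsMillis, int part) {
        SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
        SimpleDateFormat hour = new SimpleDateFormat("HH");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));
        hour.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date d = new Date(tsMillis);
        return String.format("repos/%s/%s/%s/%s/%s_HourlyDone_%s_%s.%d.evt",
                cluster, dataType, day.format(d), hour.format(d),
                dataType, day.format(d), hour.format(d), part);
    }

    public static void main(String[] args) {
        // 2010-03-23 18:00:00 UTC
        System.out.println(hourlyDonePath("demoCluster", "Df", 1269367200000L, 1));
    }
}
```

A DailyChukwaRecordRolling target could be composed the same way, dropping the hour component from both the directory and the file name.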
+ </section>
+
+<section>
+<title>Log Directories Requiring Cleanup</title>
+
+<p>The following directories will grow over time and will need to be periodically pruned:</p>
+
+<ul>
+<li><code>finalArchives/[yyyyMMdd]/*</code></li>
+<li><code>repos/[clusterName]/[dataType]/[yyyyMMdd]/*.evt</code> </li>
+</ul>
+</section>
+</body>
+</document>
\ No newline at end of file
Modified: hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml?rev=926693&r1=926692&r2=926693&view=diff
==============================================================================
--- hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml (original)
+++ hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml Tue Mar 23 18:14:56 2010
@@ -32,8 +32,8 @@ pipeline, see the <a href="design.html">
</p>
<p>
-In particular, this document discusses the Chukwa archive file formats, and
-the layout of the Chukwa storage directories.</p>
+In particular, this document discusses the Chukwa archive file formats, the
+demux and archiving mapreduce jobs, and the layout of the Chukwa storage directories.</p>
@@ -178,5 +178,86 @@ created with a disambiguating suffix.</p
</section>
+<section><title>Demux</title>
+
+<p>A key use for Chukwa is processing arriving data, in parallel, using MapReduce.
+The most common way to do this is using the Chukwa demux framework.
+As <a href="dataflow.html">data flows through Chukwa</a>, the demux job is often the
+first job that runs.
+</p>
+
+<p>By default, Chukwa uses the TsProcessor. This parser tries to
+   extract the real log statement from the log entry using the ISO8601 date
+   format. If that fails, it falls back to the time at which the chunk was written to
+   disk (the collector timestamp).</p>
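This fallback behavior can be pictured with a small sketch. The following is an assumption-laden illustration (not the actual TsProcessor source): parse a leading <code>yyyy-MM-dd HH:mm:ss,SSS</code> prefix, and use the collector timestamp when parsing fails:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TsSketch {
    // Illustrative only: mimic the TsProcessor idea of reading an
    // ISO8601-style "yyyy-MM-dd HH:mm:ss,SSS" prefix from a log line,
    // falling back to the collector timestamp if parsing fails.
    static long extractTime(String logLine, long collectorTs) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            // The date prefix occupies the first 23 characters of the line.
            return fmt.parse(logLine.substring(0, 23)).getTime();
        } catch (ParseException | StringIndexOutOfBoundsException e) {
            return collectorTs;  // fall back to the chunk write time
        }
    }

    public static void main(String[] args) {
        System.out.println(extractTime("2010-03-23 18:14:56,000 INFO demux started", 0L));
        System.out.println(extractTime("no timestamp here", 42L));
    }
}
```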
+
+<section>
+<title>Writing a custom demux Mapper</title>
+
+<p>If you want to extract some specific information and perform more processing, you
+   need to write your own parser. Like any M/R program, you have to write at least
+   the Map side for your parser. The reduce side is Identity by default.</p>
+
+<p>On the Map side, you can write your own parser from scratch or extend the AbstractProcessor class,
+   which hides all the low-level handling of the chunk. See
+ <code>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.Df</code> for an example
+ of a Map class for use with Demux.
+ </p>
+
+<p>For Chukwa to invoke your Mapper code, you have
+ to specify which data types it should run on.
+ Edit <code>${CHUKWA_HOME}/conf/chukwa-demux-conf.xml</code> and add the following lines:
+ </p>
+<source>
+ <property>
+ <name>MyDataType</name>
+ <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.MyParser</value>
+ <description>Parser class for MyDataType.</description>
+ </property>
+</source>
+<p>You can use the same parser for several different recordTypes.</p>
+</section>
+
+<section><title>Writing a custom reduce</title>
+
+<p>You only need to implement a reduce side if you need to group records together.
+The interface you need to implement is <code>ReduceProcessor</code>:
+</p>
+<source>
+public interface ReduceProcessor
+{
+ public String getDataType();
+ public void process(ChukwaRecordKey key,Iterator<ChukwaRecord> values,
+ OutputCollector<ChukwaRecordKey,
+ ChukwaRecord> output, Reporter reporter);
+}
+</source>
+
+<p>The Map side is linked to the reduce side by setting your reduce class as the
+   reduce type: <code>key.setReduceType("MyReduceClass");</code>.
+   Note that in the current version of Chukwa, your class needs to be in the package
+   <code>org.apache.hadoop.chukwa.extraction.demux.processor</code>.
+See <code>org.apache.hadoop.chukwa.extraction.demux.processor.reducer.SystemMetrics</code>
+for an example of a Demux reducer.</p>
+</section>
+
+<section>
+<title>Output</title>
+<p> Your data will be sorted by RecordType and then by the key field. The default
+   implementation uses the following grouping for all records:</p>
+<ol>
+<li>Time partition (Time up to the hour)</li>
+<li>Machine name (physical input source)</li>
+<li>Record timestamp </li>
+</ol>
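The hour-granularity time partition above can be sketched in a few lines (illustrative only, assuming epoch-millisecond timestamps):

```java
public class HourPartitionDemo {
    // Truncate an epoch-millisecond timestamp to the start of its hour,
    // as in the "time up to the hour" partition. Illustrative sketch,
    // not Chukwa source.
    static long hourPartition(long tsMillis) {
        return tsMillis - (tsMillis % 3600000L);
    }

    public static void main(String[] args) {
        // 2010-03-23 18:14:56 UTC -> 2010-03-23 18:00:00 UTC
        System.out.println(hourPartition(1269368096000L));
    }
}
```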
+
+<p>The demux process uses the recordType to save records of the same type
+together in the same directory:
+<code><cluster name>/<record type>/</code>
+</p></section>
+
+</section>
+
+
</body>
</document>
\ No newline at end of file
Modified: hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml?rev=926693&r1=926692&r2=926693&view=diff
==============================================================================
--- hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml (original)
+++ hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml Tue Mar 23 18:14:56 2010
@@ -44,6 +44,7 @@ See http://forrest.apache.org/docs/linki
<index label="Architecture" href="design.html" />
<admin label="Admin Guide" href="admin.html" />
<agent label="Agent Configuration Guide" href="agent.html" />
+ <programming label="Guide to Chukwa Storage Layout" href="dataflow.html" />
<programming label="Programming Guide" href="programming.html" />
<api label="API Docs" href="ext:api/index"/>
<wiki label="Wiki" href="ext:wiki" />