Posted to commits@chukwa.apache.org by as...@apache.org on 2010/03/23 19:14:56 UTC

svn commit: r926693 - in /hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs: dataflow.xml programming.xml site.xml

Author: asrabkin
Date: Tue Mar 23 18:14:56 2010
New Revision: 926693

URL: http://svn.apache.org/viewvc?rev=926693&view=rev
Log:
CHUKWA-458. Documentation for 0.4

Added:
    hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml
Modified:
    hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml
    hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml

Added: hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml?rev=926693&view=auto
==============================================================================
--- hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml (added)
+++ hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml Tue Mar 23 18:14:56 2010
@@ -0,0 +1,129 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" 
+"http://forrest.apache.org/dtd/document-v20.dtd">
+
+<document>
+  <header>
+    <title>Guide to Chukwa Storage Layout</title>
+  </header>
+  <body>
+
+<section><title>Overview</title>
+<p>This document describes how Chukwa data is stored in HDFS and the processes that act on it.</p>
+</section>
+
+<section><title>HDFS File System Structure</title>
+
+<p>The general layout of the Chukwa filesystem is as follows.</p>
+
+<source>
+/chukwa/
+   archivesProcessing/
+   dataSinkArchives/
+   demuxProcessing/
+   finalArchives/
+   logs/
+   postProcess/
+   repos/
+   rolling/
+   temp/
+</source>
+</section>
+
+<section><title>Raw Log Collection and Aggregation Workflow</title>
+
+<p>What data is stored where is best described by stepping through the Chukwa workflow.</p>
+
+<ol>
+<li>Collectors write chunks to <code>logs/*.chukwa</code> files until a file reaches 64MB or a given time interval has passed.
+  <ul><li><code>logs/*.chukwa</code></li></ul> 
+</li>
+<li>Collectors close the files and rename them to <code>*.done</code>
+<ul>
+<li>from <code>logs/*.chukwa</code></li>
+<li>to <code>logs/*.done</code></li>
+</ul>
+</li>
+<li>DemuxManager checks for <code>*.done</code> files every 20 seconds (a simplified sketch of this handoff follows the list).
+ <ol>
+  <li>If <code>*.done</code> files exist, it moves them into place for demux processing:
+   <ul>
+     <li>from: <code>logs/*.done</code></li>
+     <li>to: <code>demuxProcessing/mrInput</code></li>
+   </ul>
+  </li>
+  <li>The Demux MapReduce job is run on the data in <code>demuxProcessing/mrInput</code>.</li>
+  <li>If demux succeeds within 3 attempts, DemuxManager archives the completed files:
+    <ul>
+     <li>from: <code>demuxProcessing/mrOutput</code></li>
+     <li>to: <code>dataSinkArchives/[yyyyMMdd]/*/*.done</code> </li>
+    </ul>
+  </li>
+  <li>Otherwise, it moves the completed files to an error folder:
+    <ul>
+     <li>from: <code>demuxProcessing/mrOutput</code></li>
+     <li>to: <code>dataSinkArchives/InError/[yyyyMMdd]/*/*.done</code> </li>
+    </ul>
+   </li>
+  </ol>
+</li>
+<li>PostProcessManager wakes up every few minutes and aggregates, orders, and de-duplicates record files.
+  <ul><li>from: <code>postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt</code></li>
+  <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt</code></li>
+  </ul>
+</li>
+<li>HourlyChukwaRecordRolling runs MapReduce jobs at 16 minutes past the hour to roll the five-minute record files into hourly files.
+  <ul>
+  <li>from: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt</code></li>
+  <li>to: <code>temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]</code></li>
+  <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt</code></li>
+  <li>leaves: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/</code> </li>
+  </ul>
+</li>
+<li>DailyChukwaRecordRolling runs MapReduce jobs at 1:30 AM to roll the hourly files into daily files.
+  <ul>
+  <li>from: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt</code></li>
+  <li>to: <code>temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]</code></li>
+  <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt</code></li>
+  <li>leaves: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/</code> </li>
+  </ul>
+  </li> 
+<li>ChukwaArchiveManager runs roughly every half hour, aggregating and removing dataSinkArchives data using MapReduce.
+  <ul>
+  <li>from: <code>dataSinkArchives/[yyyyMMdd]/*/*.done</code></li>
+  <li>to: <code>archivesProcessing/mrInput</code></li>
+  <li>to: <code>archivesProcessing/mrOutput</code></li>
+  <li>to: <code>finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*</code> </li>
+  </ul>
+  </li> 
+ </ol>
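+<p>The <code>*.done</code> handoff in step 3 can be pictured with the Hadoop
+ FileSystem API. The following is a minimal sketch of that kind of check, not
+ Chukwa's actual DemuxManager implementation; the paths match the layout above,
+ but the <code>DoneFileSweep</code> class and everything else are illustrative:</p>
+<source>
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+public class DoneFileSweep {
+  public static void main(String[] args) throws Exception {
+    FileSystem fs = FileSystem.get(new Configuration());
+    // Look for completed sink files in the log directory.
+    FileStatus[] done = fs.globStatus(new Path("/chukwa/logs/*.done"));
+    if (done == null) return;  // nothing matched the pattern
+    for (FileStatus f : done) {
+      // Move each completed file into the demux input directory.
+      Path dest = new Path("/chukwa/demuxProcessing/mrInput",
+          f.getPath().getName());
+      fs.rename(f.getPath(), dest);
+    }
+  }
+}
+</source>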
+ </section> 
+
+<section>
+<title>Log Directories Requiring Cleanup</title>
+
+<p>The following directories will grow over time and will need to be pruned periodically (a sketch of one approach follows the list):</p>
+
+<ul>
+<li><code>finalArchives/[yyyyMMdd]/*</code></li>
+<li><code>repos/[clusterName]/[dataType]/[yyyyMMdd]/*.evt</code> </li>
+</ul>
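+<p>Chukwa does not prune these directories itself. As one illustration, the
+ sketch below deletes <code>finalArchives</code> date directories older than a
+ retention window using the Hadoop FileSystem API; the 30-day window and the
+ <code>PruneFinalArchives</code> class are assumptions to adapt, not part of
+ Chukwa:</p>
+<source>
+import java.text.SimpleDateFormat;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+public class PruneFinalArchives {
+  public static void main(String[] args) throws Exception {
+    long retentionMs = 30L * 24 * 60 * 60 * 1000;  // 30 days, an assumption
+    long cutoff = System.currentTimeMillis() - retentionMs;
+    SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
+    FileSystem fs = FileSystem.get(new Configuration());
+    // Each child of finalArchives/ is a [yyyyMMdd] directory.
+    for (FileStatus dir : fs.listStatus(new Path("/chukwa/finalArchives"))) {
+      if (day.parse(dir.getPath().getName()).getTime() &#60; cutoff) {
+        fs.delete(dir.getPath(), true);  // recursive delete
+      }
+    }
+  }
+}
+</source>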
+</section>
+</body>
+</document>
\ No newline at end of file

Modified: hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml?rev=926693&r1=926692&r2=926693&view=diff
==============================================================================
--- hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml (original)
+++ hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml Tue Mar 23 18:14:56 2010
@@ -32,8 +32,8 @@ pipeline, see the <a href="design.html">
 </p>
 
 <p>
-In particular, this document discusses the Chukwa archive file formats, and 
-the layout of the Chukwa storage directories.</p>
+In particular, this document discusses the Chukwa archive file formats, the
+demux and archiving MapReduce jobs, and the layout of the Chukwa storage directories.</p>
 
 
 
@@ -178,5 +178,86 @@ created with a disambiguating suffix.</p
 </section>
 
 
+<section><title>Demux</title>
+
+<p>A key use of Chukwa is processing arriving data in parallel with MapReduce.
+The most common way to do this is with the Chukwa demux framework.
+As <a href="dataflow.html">data flows through Chukwa</a>, the demux job is often the
+first job that runs.
+</p>
+
+<p>By default, Chukwa uses the TsProcessor. This parser tries to extract the
+ timestamp of the real log statement from the log entry using the ISO 8601 date
+ format. If that fails, it falls back to the time at which the chunk was written
+ to disk (the collector timestamp).</p>
+
+<section>
+<title>Writing a custom demux Mapper</title>
+
+<p>If you want to extract specific information and perform more processing, you
+ need to write your own parser. Like any MapReduce program, you have to write at
+ least the map side of your parser. The reduce side is the identity by default.</p>
+
+<p>On the map side, you can write your own parser from scratch or extend the
+ AbstractProcessor class, which hides the low-level handling of chunks. See
+ <code>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.Df</code> for an example
+ of a Map class for use with Demux.
+ </p>
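+<p>As an illustration, a minimal parser might look like the sketch below. It
+ assumes the 0.4 AbstractProcessor API, in which subclasses implement
+ <code>parse()</code> and can use the inherited <code>buildGenericRecord()</code>
+ helper and protected <code>key</code> field; the log format and field names are
+ hypothetical, so check the Df example above for the authoritative signatures:</p>
+<source>
+package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;
+
+import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
+import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
+import org.apache.hadoop.mapred.OutputCollector;
+import org.apache.hadoop.mapred.Reporter;
+
+// Hypothetical parser for a "MyDataType" entry of the form "&#60;millis&#62; &#60;message&#62;".
+public class MyParser extends AbstractProcessor {
+  @Override
+  protected void parse(String recordEntry,
+      OutputCollector&#60;ChukwaRecordKey, ChukwaRecord&#62; output,
+      Reporter reporter) throws Throwable {
+    String[] parts = recordEntry.split(" ", 2);
+    long timestamp = Long.parseLong(parts[0]);
+    ChukwaRecord record = new ChukwaRecord();
+    // Fills in the standard key and metadata fields for this record.
+    buildGenericRecord(record, recordEntry, timestamp, "MyDataType");
+    record.add("message", parts[1]);
+    output.collect(key, record);  // "key" is populated by buildGenericRecord
+  }
+}
+</source>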
+ 
+<p>For Chukwa to invoke your Mapper code, you have
+ to specify which data types it should run on.
+ Edit <code>${CHUKWA_HOME}/conf/chukwa-demux-conf.xml</code> and add the following lines:
+ </p>
+<source>
+      &#60;property&#62;
+            &#60;name&#62;MyDataType&#60;/name&#62; 
+            &#60;value&#62;org.apache.hadoop.chukwa.extraction.demux.processor.mapper.MyParser&#60;/value&#62;
+            &#60;description&#62;Parser class for MyDataType.&#60;/description&#62;
+      &#60;/property&#62;
+</source>
+<p>You can use the same parser for several different recordTypes.</p>
+</section>
+
+<section><title>Writing a custom reducer</title>
+
+<p>You only need to implement a reduce side if you need to group records together. 
+The interface you need to implement is <code>ReduceProcessor</code>:
+</p>
+<source>
+public interface ReduceProcessor {
+  public String getDataType();
+
+  public void process(ChukwaRecordKey key, Iterator&#60;ChukwaRecord&#62; values,
+      OutputCollector&#60;ChukwaRecordKey, ChukwaRecord&#62; output,
+      Reporter reporter);
+}
+</source>
+
+<p>The link between the map side and the reduce side is made by setting your
+ reducer class as the reduce type on the output key:
+ <code>key.setReduceType("MyReduceClass");</code>.
+ Note that in the current version of Chukwa, your class needs to be in the package
+ <code>org.apache.hadoop.chukwa.extraction.demux.processor</code>.
+See <code>org.apache.hadoop.chukwa.extraction.demux.processor.reducer.SystemMetrics</code>
+for an example of a Demux reducer, or the sketch below.</p>
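+<p>For illustration, a minimal <code>ReduceProcessor</code> implementation might
+ look like the following sketch. The class name, data type, and keep-the-last-record
+ logic are hypothetical, and you should adjust the package and imports to wherever
+ <code>ReduceProcessor</code> lives in your Chukwa version:</p>
+<source>
+package org.apache.hadoop.chukwa.extraction.demux.processor;
+
+import java.util.Iterator;
+import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
+import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
+import org.apache.hadoop.mapred.OutputCollector;
+import org.apache.hadoop.mapred.Reporter;
+
+// Hypothetical reducer that emits only the last record seen for each key.
+public class MyReduceClass implements ReduceProcessor {
+  public String getDataType() {
+    return "MyDataType";
+  }
+
+  public void process(ChukwaRecordKey key, Iterator&#60;ChukwaRecord&#62; values,
+      OutputCollector&#60;ChukwaRecordKey, ChukwaRecord&#62; output,
+      Reporter reporter) {
+    try {
+      ChukwaRecord last = null;
+      while (values.hasNext()) {
+        last = values.next();  // walk to the final record for this key
+      }
+      if (last != null) {
+        output.collect(key, last);
+      }
+    } catch (Exception e) {
+      reporter.incrCounter("MyReduceClass", "errors", 1);
+    }
+  }
+}
+</source>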
+</section>
+
+<section>
+<title>Output</title>
+<p> Your data will be sorted by RecordType and then by the key field. The default
+ implementation uses the following grouping for all records:</p>
+<ol>
+<li>Time partition (Time up to the hour)</li>
+<li>Machine name (physical input source)</li>
+<li>Record timestamp </li>
+</ol>
+
+<p>The demux process uses the recordType to save records of the same type
+together in the same directory: 
+<code>&#60;cluster name&#62;/&#60;record type&#62;/</code>
+</p></section>
+
+</section>
+
+
 </body>
 </document>
\ No newline at end of file

Modified: hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml?rev=926693&r1=926692&r2=926693&view=diff
==============================================================================
--- hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml (original)
+++ hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/site.xml Tue Mar 23 18:14:56 2010
@@ -44,6 +44,7 @@ See http://forrest.apache.org/docs/linki
     <index      label="Architecture"       href="design.html" />
     <admin      label="Admin Guide"    href="admin.html" />
     <agent      label="Agent Configuration Guide" href="agent.html" />
+    <programming      label="Guide to Chukwa Storage Layout" href="dataflow.html" />
     <programming      label="Programming Guide" href="programming.html" />
     <api        label="API Docs"       href="ext:api/index"/>
     <wiki       label="Wiki"           href="ext:wiki" />