Posted to commits@chukwa.apache.org by ey...@apache.org on 2014/07/22 18:08:09 UTC

svn commit: r1612600 - in /chukwa/trunk: ./ src/site/ src/site/apt/ src/site/resources/images/

Author: eyang
Date: Tue Jul 22 16:08:09 2014
New Revision: 1612600

URL: http://svn.apache.org/r1612600
Log:
CHUKWA-721. Updated Chukwa document to reflect changes in Chukwa 0.6.  (Eric Yang)

Added:
    chukwa/trunk/src/site/apt/pipeline.apt
      - copied, changed from r1607772, chukwa/trunk/src/site/apt/collector.apt
    chukwa/trunk/src/site/apt/user.apt
      - copied, changed from r1607772, chukwa/trunk/src/site/apt/admin.apt
Removed:
    chukwa/trunk/src/site/apt/admin.apt
    chukwa/trunk/src/site/apt/collector.apt
Modified:
    chukwa/trunk/CHANGES.txt
    chukwa/trunk/pom.xml
    chukwa/trunk/src/site/apt/Quick_Start_Guide.apt
    chukwa/trunk/src/site/apt/agent.apt
    chukwa/trunk/src/site/apt/dataflow.apt
    chukwa/trunk/src/site/apt/design.apt
    chukwa/trunk/src/site/apt/hicc.apt
    chukwa/trunk/src/site/apt/index.apt
    chukwa/trunk/src/site/apt/programming.apt
    chukwa/trunk/src/site/resources/images/chukwa_architecture.png
    chukwa/trunk/src/site/site.xml

Modified: chukwa/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/chukwa/trunk/CHANGES.txt?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/CHANGES.txt (original)
+++ chukwa/trunk/CHANGES.txt Tue Jul 22 16:08:09 2014
@@ -34,6 +34,8 @@ Release 0.6 - Unreleased
 
   IMPROVEMENTS
 
+    CHUKWA-721. Updated Chukwa document to reflect changes in Chukwa 0.6.  (Eric Yang)
+
     CHUKWA-718. Updated Chukwa Agent REST API document and generation method.  (Eric Yang)
 
     CHUKWA-710. Set TCP socket reuse option for server sockets. (Shreyas Subramanya)

Modified: chukwa/trunk/pom.xml
URL: http://svn.apache.org/viewvc/chukwa/trunk/pom.xml?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/pom.xml (original)
+++ chukwa/trunk/pom.xml Tue Jul 22 16:08:09 2014
@@ -1067,7 +1067,7 @@
               <plugin>
                 <groupId>org.apache.maven.plugins</groupId>
                 <artifactId>maven-site-plugin</artifactId>
-                <version>3.0</version>
+                <version>3.3</version>
                 <dependencies>
                   <dependency><!-- add support for ssh/scp -->
                     <groupId>org.apache.maven.wagon</groupId>
@@ -1239,7 +1239,7 @@
             <artifactId>maven-jxr-plugin</artifactId>
             <version>2.3</version>
         </plugin>
-        <plugin>
+<!--        <plugin>
             <artifactId>maven-pmd-plugin</artifactId>
             <version>2.6</version>
             <reportSets>
@@ -1262,7 +1262,7 @@
                 <threshold>Normal</threshold>
                 <effort>Max</effort>
             </configuration>
-        </plugin>
+        </plugin>-->
         <plugin>
           <groupId>org.apache.maven.plugins</groupId>
           <artifactId>maven-project-info-reports-plugin</artifactId>

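The site documentation touched by the remaining hunks is rendered by the maven-site-plugin configured above; a minimal invocation, assuming a standard Maven 3 setup inside the chukwa/trunk checkout, is:

---
mvn clean site
---
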
Modified: chukwa/trunk/src/site/apt/Quick_Start_Guide.apt
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/Quick_Start_Guide.apt?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/Quick_Start_Guide.apt (original)
+++ chukwa/trunk/src/site/apt/Quick_Start_Guide.apt Tue Jul 22 16:08:09 2014
@@ -24,7 +24,7 @@ Pre-requisites
 
   Chukwa should work on any POSIX platform, but GNU/Linux is the only production platform that has been tested extensively. Chukwa has also been used successfully on Mac OS X, which several members of the Chukwa team use for development.
 
-  The only absolute software requirements are Java 1.6 or better and Hadoop 0.20.205.0+. HICC, the Chukwa visualization interface, requires HBase 0.90.4.
+  The only absolute software requirements are Java 1.6 or better, ZooKeeper 3.4.x, HBase 0.96.x and Hadoop 1.x.
 
   The Chukwa cluster management scripts rely on ssh; these scripts, however, are not required if you have some alternate mechanism for starting and stopping daemons.
 
@@ -35,9 +35,7 @@ Installing Chukwa
 
   * A Hadoop and HBase cluster on which Chukwa will process data (referred to as the Chukwa cluster). 
   
-  * A collector process, that writes collected data to HBase. 
-  
-  * One or more agent processes, that send monitoring data to the collector. The nodes with active agent processes are referred to as the monitored source nodes.
+  * One or more agent processes, that send monitoring data to HBase. The nodes with active agent processes are referred to as the monitored source nodes.
   
   * Data analytics script, summarize Hadoop Cluster Health.
 
@@ -53,9 +51,9 @@ First Steps
 
   * Un-tar the release, via tar xzf.
 
-  * Make sure a copy of Chukwa is available on each node being monitored, and on each node that will run a collector.
+  * Make sure a copy of Chukwa is available on each node being monitored.
 
-  * We refer to the directory containing Chukwa as CHUKWA_HOME. It may be helpful to set CHUKWA_HOME explicitly in your environment, but Chukwa does not require that you do so.
+  * We refer to the directory containing Chukwa as CHUKWA_HOME. It may be helpful to set CHUKWA_HOME explicitly in your environment for convenience.
 
 Setting Up Chukwa Cluster
 
@@ -82,19 +80,9 @@ bin/hbase shell < CHUKWA_HOME/etc/chukwa
 
   This procedure initializes the default Chukwa HBase schema.
 
-* Configuring And Starting Chukwa Collector
-
- [[1]] Edit CHUKWA_HOME/etc/chukwa/chukwa-env.sh. Make sure that JAVA_HOME, HADOOP_HOME, HADOOP_CONF_DIR, HBASE_HOME, and HBASE_CONF_DIR are set correctly.
-
- [[2]] In CHUKWA_HOME, run:
-
----
-bin/chukwa collector
----
-
 * Configuring And Starting Chukwa Agent
 
- [[1]] Add collector hostnames to CHUKWA_HOME/etc/chukwa/collectors. One host per line.
+ [[1]] Edit CHUKWA_HOME/etc/chukwa/chukwa-env.sh. Make sure that JAVA_HOME, HADOOP_HOME, HADOOP_CONF_DIR, HBASE_HOME, and HBASE_CONF_DIR are set correctly.
 
  [[2]] In CHUKWA_HOME, run:
 
@@ -146,4 +134,4 @@ http://<server>:4080/hicc/
   
   [[2]] The default user name and password is "admin" without quotes.
   
-  [[3]] Metrics data collected by Chukwa collector will be browsable through Graph Explorer widget.
+  [[3]] Metrics data collected by the Chukwa agent will be browsable through the Graph Explorer widget.

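Once the agent from the Quick Start steps above is running, a quick sanity check is to talk to its control socket (a sketch assuming the agent runs on the local host with the default control port 9093 described in the Agent Guide):

---
telnet localhost 9093
list
close
---

The <list> command should print the adaptors registered from <initial_adaptors>.
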
Modified: chukwa/trunk/src/site/apt/agent.apt
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/agent.apt?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/agent.apt (original)
+++ chukwa/trunk/src/site/apt/agent.apt Tue Jul 22 16:08:09 2014
@@ -14,7 +14,7 @@
 ~~ limitations under the License.
 ~~
 
-Overview
+Agent Configuration Guide
 
   In a normal Chukwa installation, an <Agent> process runs on every 
 machine being monitored. This process is responsible for all the data collection
@@ -38,27 +38,27 @@ Agent Control
 use to inspect and control it.  By default, Agents listen for incoming commands
 on port 9093. Commands are case-insensitive
 
-*--------------------*--------------------------------------*--------------:
-| Command            | Purpose                              | Options      |
-*--------------------*--------------------------------------*--------------:
-| <add>              | Start an adaptor.                    | See below    |
-*--------------------*--------------------------------------*--------------:
-| <close>            | Close socket connection to agent.    | None         |
-*--------------------*--------------------------------------*--------------:
-| <help>             | Display a list of available commands | None         |
-*--------------------*--------------------------------------*--------------:
-| <list>             | List currently running adaptors      | None         |
-*--------------------*--------------------------------------*--------------:
-| <reloadcollectors> | Re-read list of collectors           | None         |
-*--------------------*--------------------------------------*--------------:
-| <stop>             | Stop adaptor, abruptly               | Adaptor name |
-*--------------------*--------------------------------------*--------------:
-| <stopall>          | Stop all adaptors, abruptly          | Adaptor name |
-*--------------------*--------------------------------------*--------------:
-| <shutdown>         | Stop adaptor, gracefully             | Adaptor name |
-*--------------------*--------------------------------------*--------------:
-| <stopagent>        | Stop agent process                   | None         |
-*--------------------*--------------------------------------*--------------:
+*--------------------*-----------------------------------------*--------------:
+| Command            | Purpose                                 | Options      |
+*--------------------*-----------------------------------------*--------------:
+| <add>              | Start an adaptor.                       | See below    |
+*--------------------*-----------------------------------------*--------------:
+| <close>            | Close socket connection to agent.       | None         |
+*--------------------*-----------------------------------------*--------------:
+| <help>             | Display a list of available commands    | None         |
+*--------------------*-----------------------------------------*--------------:
+| <list>             | List currently running adaptors         | None         |
+*--------------------*-----------------------------------------*--------------:
+| <reloadcollectors> | Re-read list of collectors (deprecated) | None         |
+*--------------------*-----------------------------------------*--------------:
+| <stop>             | Stop adaptor, abruptly                  | Adaptor name |
+*--------------------*-----------------------------------------*--------------:
+| <stopall>          | Stop all adaptors, abruptly             | Adaptor name |
+*--------------------*-----------------------------------------*--------------:
+| <shutdown>         | Stop adaptor, gracefully                | Adaptor name |
+*--------------------*-----------------------------------------*--------------:
+| <stopagent>        | Stop agent process                      | None         |
+*--------------------*-----------------------------------------*--------------:
 
 
   The add command is by far the most complex; it takes several mandatory and 
@@ -87,10 +87,8 @@ Command-line options
   Normally, agents are configured via the file <conf/chukwa-agent-conf.xml.>
 However, there are a few command-line options that are sometimes useful in
 troubleshooting. If you specify "local" as an option, then the agent will print
-chunks to standard out, rather than to a collector. If you specify a URI, then
-that will be used as collector, overriding the collectors specified in
-<conf/collectors>.  These options are intended for testing and debugging,
-not for production use.
+chunks to standard out, rather than to pipeline writers.
+This option is intended for testing and debugging, not for production use.
 
 ---
 bin/chukwa agent local

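The <add> command referenced above takes an adaptor class, a data type, adaptor-specific parameters, and an initial offset. A minimal sketch of tailing a log file (the adaptor class, data type, and path here are illustrative, not values fixed by this commit):

---
add filetailer.FileTailingAdaptor DebugProcessor /var/log/messages 0
---
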
Modified: chukwa/trunk/src/site/apt/dataflow.apt
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/dataflow.apt?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/dataflow.apt (original)
+++ chukwa/trunk/src/site/apt/dataflow.apt Tue Jul 22 16:08:09 2014
@@ -14,7 +14,7 @@
 ~~ limitations under the License.
 ~~
 
-Chukwa Storage Layout
+HDFS Storage Layout
 
 Overview
 
@@ -41,11 +41,11 @@ Raw Log Collection and Aggregation Workf
 
   What data is stored where is best described by stepping through the Chukwa workflow.
 
-  [[1]] Collectors write chunks to <logs/*.chukwa> files until a 64MB chunk size is reached or a given time interval has passed.
+  [[1]] Agents write chunks to <logs/*.chukwa> files until a 64MB chunk size is reached or a given time interval has passed.
 
         * <logs/*.chukwa> 
 
-  [[2]] Collectors close chunks and rename them to <*.done>
+  [[2]] Agents close chunks and rename them to <*.done>
 
         * from <logs/*.chukwa>
 

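The sink files written by the agents can be inspected directly in HDFS; a minimal sketch, assuming the default <hdfs:///chukwa/logs> location mentioned in the Programming Guide and a configured Hadoop client:

---
hadoop fs -ls /chukwa/logs
---

Files still ending in <.chukwa> are being written; files renamed to <.done> are ready for processing.
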
Modified: chukwa/trunk/src/site/apt/design.apt
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/design.apt?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/design.apt (original)
+++ chukwa/trunk/src/site/apt/design.apt Tue Jul 22 16:08:09 2014
@@ -26,10 +26,9 @@ stages. This will facilitate future inno
 
 Chukwa has five primary components:
 
-  * <<Agents>> that run on each machine and emit data.
+  * <<Adaptors>> that collect data from various data sources.
 
-  * <<Collectors>> that receive data from the agent and write
-    it to stable storage.
+  * <<Agents>> that run on each machine and emit data.
 
   * <<ETL Processes>> for parsing and archiving the data.
 
@@ -42,7 +41,7 @@ Chukwa has five primary components:
 dwell times at each stage. A more detailed figure is available at the end
 of this document.
 
-[./images/datapipeline.png] A picture of the chukwa data pipeline
+[./images/chukwa_architecture.png] Architecture
 	  
 Agents and Adaptors
 
@@ -102,31 +101,9 @@ it's usually save to specify 0 as an ID,
 something else. For instance, it lets you do things like only tail the second 
 half of a file. 
 		
-Collectors
-
-  Rather than have each adaptor write directly to HDFS, data is sent across 
-the network to a <collector> process, that does the HDFS writes.  
-Each collector receives data from up to several hundred hosts, and writes all
-this data to a single <sink file>, which is a Hadoop sequence file of
-serialized Chunks. Periodically, collectors close their sink files, rename 
-them to mark them available for processing, and resume writing a new file.  
-Data is sent to collectors over HTTP.  
-
-  Collectors thus drastically reduce the number of HDFS files generated by Chukwa,
-from one per machine or adaptor per unit time, to a handful per cluster.  
-The decision to put collectors between data sources and the data store has 
-other benefits. Collectors hide the details of the HDFS file system in use, 
-such as its Hadoop version, from the adaptors.  This is a significant aid to 
-configuration.  It is especially helpful when using Chukwa to monitor a 
-development cluster running a different version of Hadoop or when using 
-Chukwa to monitor a non-Hadoop cluster.  
-
-  For more information on configuring collectors, see the 
-{{{./collector.html}Collector documentation}}.
-		
 ETL Processes
 
-  Collectors can write data directly to HBase or sequence files. 
+  Chukwa Agents can write data directly to HBase or sequence files. 
 This is convenient for rapidly getting data committed to stable storage. 
 
   HBase provides index by primary key, and manage data compaction.  It is
@@ -146,10 +123,10 @@ precisely how they group the data.)
 
   Demux, in contrast, take Chunks as input and parse them to produce
 ChukwaRecords, which are sets of key-value pairs.  Demux can run as a
-MapReduce job or as part of Chukwa Collector.
+MapReduce job or as part of HBaseWriter.
 
   For details on controlling this part of the pipeline, see the 
-{{{./admin.html}Administration guide}}. For details about the file
+{{{./pipeline.html}Pipeline guide}}. For details about the file
 formats, and how to use the collected data, see the {{{./programming.html}
 Programming guide}}.
 
@@ -169,6 +146,20 @@ which in turn is populated by collector 
 that runs on the collected data, after Demux. The  
 {{{./admin.html}Administration guide}} has details on setting up HICC.
 
-  And now, the architecture picture of Chukwa: 
+Collectors (Deprecated)
 
-[./images/chukwa_architecture.png] Architecture
+  The original design goal of the collector was to reduce the number of TCP connections
+needed to collect data from various sources, and to provide high availability and wire
+compatibility across versions.  Data transfer reliability has since been improved
+in the HDFS and HBase clients.  The original problem that the Chukwa Collector
+tried to solve is no longer a high priority in the Chukwa data collection framework
+because both HDFS and HBase are in a better position to solve the data transport
+and replication problems.  Hence, HDFS datanodes and HBase region servers are the
+replacement for Chukwa collectors.
+
+  Chukwa has adopted HBase to ensure data arrives within milliseconds and
+is made available to downstream applications at the same time.  This
+enables monitoring applications to have a near real-time view as soon as
+data arrives in the system.  File rolling and archiving are replaced
+by HBase region server minor and major compactions.
+		

Modified: chukwa/trunk/src/site/apt/hicc.apt
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/hicc.apt?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/hicc.apt (original)
+++ chukwa/trunk/src/site/apt/hicc.apt Tue Jul 22 16:08:09 2014
@@ -14,7 +14,7 @@
 ~~ limitations under the License.
 ~~
 
-Overview
+HICC Operation Manual
 
   HICC stands for Hadoop Infrastructure Care Center.  It is the central dashboard
 for visualize and monitoring of metrics collected by Chukwa.
@@ -87,6 +87,13 @@ make additional choices or provide any n
   * If an option is dimmed, it is not available.  For example, you can not edit
 name of the dashboard, if you are not the owner of the dashboard.
 
+* Dashboard basics
+
+  Each dashboard can have multiple tabs, and each tab is divided into rows and columns.
+You can add more widgets by selecting Options > choose a widget > click Add.
+To remove a widget, click the Close button in the top right hand corner of
+the widget.
+
 * Tab basics
 
   Tab provides a way to organize related information together.  As you create widgets,
@@ -115,13 +122,6 @@ information organized.
 
   [[2]] Enter a new name for the tab and press Return.
 
-* Dashboard basics
-
-  Each dashboard can have multiple tabs and each tab is divided into row and columns.
-You can add more widgets by selecting Options > choose a widget > click Add.
-To remove a widgets by clicking on Close button on the top right hand corner of
-the widget.
-
 User accounts
 
   You should set up an account for each person who uses HICC on a regular basis.

Modified: chukwa/trunk/src/site/apt/index.apt
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/index.apt?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/index.apt (original)
+++ chukwa/trunk/src/site/apt/index.apt Tue Jul 22 16:08:09 2014
@@ -16,31 +16,40 @@
 Overview
 
   Log processing was one of the original purposes of MapReduce. Unfortunately,
-using Hadoop for MapReduce processing of logs is somewhat troublesome.
-Logs are generated incrementally across many machines, but Hadoop MapReduce
+using Hadoop MapReduce to monitor Hadoop can be inefficient.  The batch
+processing nature of Hadoop MapReduce prevents the system from providing real-time
+status of the cluster.
+
+  We started this journey at the beginning of 2008, and many Hadoop components
+have been built since to improve the overall reliability of the system and
+the timeliness of monitoring. We have adopted HBase to facilitate lower
+latency for random reads, and we use in-memory updates and write-ahead logs to
+improve reliability for root cause analysis.
+
+  Logs are generated incrementally across many machines, but Hadoop MapReduce
 works best on a small number of large files. Merging the reduced output
 of multiple runs may require additional mapreduce jobs.  This creates some 
 overhead for data management on Hadoop.
 
   Chukwa is a Hadoop subproject devoted to bridging that gap between logs
 processing and Hadoop ecosystem.  Chukwa is a scalable distributed monitoring 
-and analysis system, particularly logs from Hadoop and other large systems.
+and analysis system, particularly for logs from Hadoop and other distributed systems.
 
   The Chukwa Documentation provides the information you need to get
-started using Chukwa. You should start with the {{{./design.html}
-Architecture and Design document}}.
+started using Chukwa. The {{{./design.html} Architecture and Design document}}
+provides a high level view of the Chukwa design.
 
-  If you're trying to set up a Chukwa cluster from scratch, you should
-read the {{{./admin.html}Chukwa Administration Guide}} which
-shows you how to setup and deploy Chukwa.
+  If you're trying to set up a Chukwa cluster from scratch, the
+{{{./user.html} User Guide}} describes the setup and deployment procedure.
 
   If you want to configure the Chukwa agent process, to control what's
-collected, you should read the {{{./agent.html}Agent Guide}}. There's
-also a  {{{./collector.html}Collector Guide}} describing that part of
-the pipeline.
+collected, you should read the {{{./agent.html} Agent Guide}}. There is
+also a {{{./pipeline.html} Pipeline Guide}} describing the configuration
+parameters for the ETL processes in the data pipeline.
      
-  And if you want to use collected data, read the
-{{{./programming.html}User and Programming Guide}}
+  And if you want to extend Chukwa to monitor other data sources, the
+{{{./programming.html} Programming Guide}} may be handy for learning
+about the Chukwa programming API.
 
   If you have more questions, you can ask on the
-{{{mailto:chukwa-user@incubator.apache.org}Chukwa mailing lists}}
+{{{mailto:user@chukwa.apache.org}Chukwa mailing lists}}

Copied: chukwa/trunk/src/site/apt/pipeline.apt (from r1607772, chukwa/trunk/src/site/apt/collector.apt)
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/pipeline.apt?p2=chukwa/trunk/src/site/apt/pipeline.apt&p1=chukwa/trunk/src/site/apt/collector.apt&r1=1607772&r2=1612600&rev=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/collector.apt (original)
+++ chukwa/trunk/src/site/apt/pipeline.apt Tue Jul 22 16:08:09 2014
@@ -14,42 +14,44 @@
 ~~ limitations under the License.
 ~~
 
-Basic Operation
+Pipeline Configuration Guide
 
-  Chukwa Collectors are responsible for accepting incoming data from Agents,
-and storing the data.  Most commonly, collectors simply write all received 
-to HBase or HDFS.  
+Basic Options
+
+  The Chukwa pipeline is responsible for accepting incoming data from Agents,
+and for extracting, transforming and loading the data into destination storage.  Most commonly,
+the pipeline simply writes all received data to HBase or HDFS.
 
 * HBase
 
-  For enabling streaming data to HBase, chukwa collector writer class can
-be configured in <chukwa-collector-conf.xml>.
+  To enable streaming of data to HBase, the chukwa pipeline can
+be configured in <chukwa-agent-conf.xml>.
 
 ---
 <property>
-  <name>chukwaCollector.writerClass</name>
+  <name>chukwa.pipeline</name>
   <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
 </property>
 ---
 
 In this mode, HBase configuration is specified in <chukwa-env.sh>.
 HBASE_CONF_DIR should reference the HBase configuration directory to enable
-Chukwa Collector to load <hbase-site.xml> from class path.
+the Chukwa agent to load <hbase-site.xml> from the class path.
 
 * HDFS
 
-  For enabling streaming data to HDFS, chukwa collector writer class can
-be configured in <chukwa-collector-conf.xml>.
+  To enable streaming of data to HDFS, the chukwa pipeline can be configured in
+<chukwa-agent-conf.xml>.
 
 ---
 <property>
-  <name>chukwaCollector.writerClass</name>
+  <name>chukwa.pipeline</name>
   <value>org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
 </property>
 ---
 
   In this mode, the filesystem to write to is determined by the option
-<writer.hdfs.filesystem> in <chukwa-collector-conf.xml>.
+<writer.hdfs.filesystem> in <chukwa-agent-conf.xml>.
 
 ---
 <property>
@@ -60,51 +62,29 @@ be configured in <chukwa-collector-conf.
 ---
 
   This is the only option that you really need to specify to get a working 
-collector.
-
-  By default, collectors listen on port 8080. This can be configured
-in <chukwa-collector.conf.xml>
-  	
-Configuration Knobs
-
-  There's a bunch more "standard" knobs worth knowing about. These
-are mostly documented in <chukwa-collector-conf.xml>
-  	
-  It's also possible to do limited configuration on the command line. This is
-primarily intended for debugging.  You can say 'writer=pretend' to get the 
-collector to print incoming chunks on standard out, or portno=xyz to override
-the default port number.
-
----
-bin/chukwa collector writer=pretend portno=8081
----
+pipeline.
 
 Advanced Options
 
   There are some advanced options, not necessarily documented in the
-collector conf file, that are helpful in using Chukwa in nonstandard ways.
+agent conf file, that are helpful in using Chukwa in nonstandard ways.
 While normally Chukwa writes sequence files to HDFS, it's possible to
-specify an alternate Writer class. The option 
-<chukwaCollector.writerClass> specifies a Java class to instantiate
-and use as a writer. See the <ChukwaWriter> javadoc for details.
+specify an alternate pipeline class. The option <chukwa.pipeline> specifies 
+a Java class to instantiate and use as a writer. See the <ChukwaWriter> 
+javadoc for details.
 
-  One particularly useful Writer class is <PipelineStageWriter>, which
+  One particularly useful pipeline class is <PipelineStageWriter>, which
 lets you string together a series of <PipelineableWriters>
 for pre-processing or post-processing incoming data.
 As an example, the SocketTeeWriter class allows other programs to get 
-incoming chunks fed to them over a socket by the collector.
+incoming chunks fed to them over a socket by Chukwa agent.
 	  	
   Stages in the pipeline should be listed, comma-separated, in option 
-<chukwaCollector.pipeline>
+<chukwa.pipeline>
 	  	
 ---
 <property>
-  <name>chukwaCollector.writerClass</name>
-  <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
-</property>
-
-<property>
-  <name>chukwaCollector.pipeline</name>
+  <name>chukwa.pipeline</name>
   <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
 </property>
 ---
@@ -137,7 +117,7 @@ key value pairs to HBase table.  HBaseWr
 ---
 
   * <<hbase.writer.halt.on.schema.mismatch>> If this option is set to true, 
-    and HBase table schema is mismatched with demux parser, collector will 
+    and the HBase table schema is mismatched with the demux parser, the agent will 
     shut down itself.
 
 ---
@@ -167,7 +147,7 @@ SeqFileWriter
   The <SeqFileWriter> streams chunks of data to HDFS, and write data in
 temp filename with <.chukwa> suffix.  When the file is completed writing,
 the filename is renamed with <.done> suffix.  SeqFileWriter has the following
-configuration in <chukwa-collector-conf.xml>.
+configuration in <chukwa-agent-conf.xml>.
 
   * <<writer.hdfs.filesystem>> Location to name node address
 
@@ -200,14 +180,14 @@ configuration in <chukwa-collector-conf.
 ---
 
   * <<chukwaCollector.isFixedTimeRotatorScheme>> A flag to indicate that the 
-    collector should close at a fixed offset after every rotateInterval. 
+    agent should close at a fixed offset after every rotateInterval. 
     The default value is false which uses the default scheme where 
-    collectors close after regular rotateIntervals.
+    agents close after regular rotateIntervals.
     If set to true then specify chukwaCollector.fixedTimeIntervalOffset value.
     e.g., if isFixedTimeRotatorScheme is true and fixedTimeIntervalOffset is
-    set to 10000 and rotateInterval is set to 300000, then the collector will
+    set to 10000 and rotateInterval is set to 300000, then the agent will
     close its files at 10 seconds past the 5 minute mark, if
-    isFixedTimeRotatorScheme is false, collectors will rotate approximately
+    isFixedTimeRotatorScheme is false, agents will rotate approximately
     once every 5 minutes
 
 ---
@@ -231,7 +211,7 @@ configuration in <chukwa-collector-conf.
 SocketTeeWriter
 
   The <SocketTeeWriter> allows external processes to watch
-the stream of chunks passing through the collector. This allows certain kinds
+the stream of chunks passing through the agent. This allows certain kinds
 of real-time monitoring to be done on-top of Chukwa.
 	  	
   SocketTeeWriter listens on a port (specified by conf option
@@ -269,77 +249,3 @@ while(true) {
 }
 ---
 	  	
-Acknowledgement mode
-
-  Chukwa supports two different reliability strategies.
-The first, default strategy, is as follows: collectors write data to HDFS, and
-as soon as the HDFS write call returns success, report success to the agent,
-which advances its checkpoint state.
-
-  This is potentially a problem if HDFS (or some other storage tier) has
-non-durable or asynchronous writes. As a result, Chukwa offers a mechanism,
-asynchronous acknowledgement, for coping with this case.
-
-  This mechanism can be enabled by setting option <httpConnector.asyncAcks>.
-This option applies to both agents and collectors. On the collector side, it
-tells the collector to return asynchronous acknowledgements. On the agent side,
-it tells agents to look for and process them correctly. Agents with the option
-set to false should work OK with collectors where it's set to true. The
-reverse is not generally true: agents will expect a collector to be able to
-answer questions about the state of the filesystem.
-
-* Theory
-
-  In this approach, rather than try to build a fault tolerant collector,
-Chukwa agents look <<through>> the collectors to the underlying state of the
-filesystem. This filesystem state is what is used to detect and recover from
-failure. Recovery is handled entirely by the agent, without requiring anything
-at all from the failed collector.
-
-  When an agent sends data to a collector, the collector responds with the name
-of the HDFS file in which the data will be stored and the future location of
-the data within the file. This is very easy to compute -- since each file is
-only written by a single collector, the only requirement is to enqueue the
-data and add up lengths.
-
-  Every few minutes, each agent process polls a collector to find the length of
-each file to which data is being written. The length of the file is then
-compared with the offset at which each chunk was to be written. If the file
-length exceeds this value, then the data has been committed and the agent
-process advances its checkpoint accordingly. (Note that the length returned by
-the filesystem is the amount of data that has been successfully replicated.)
-There is nothing essential about the role of collectors in monitoring the
-written files. Collectors store no per-agent state. The reason to poll
-collectors, rather than the filesystem directly, is to reduce the load on
-the filesystem master and to shield agents from the details of the storage
-system.
-
-  The collector component that handles these requests is
-<datacollection.collector.servlet.CommitCheckServlet>.
-This will be started if <httpConnector.asyncAcks> is true in the
-collector configuration.
-
-  On error, agents resume from their last checkpoint and pick a new collector.
-In the event of a failure, the total volume of data retransmitted is bounded by
-the period between collector file rotations.
-
-  The solution is end-to-end. Authoritative copies of data can only exist in
-two places: the nodes where data was originally produced, and the HDFS file
-system where it will ultimately be stored. Collectors only hold soft state;
-the only ``hard'' state stored by Chukwa is the agent checkpoints. Below is a
-diagram of the flow of messages in this protocol.
-
-* Configuration
-
-  In addition to <httpConnector.asyncAcks> (which enables asynchronous
-acknowledgement) a number of options affect this mode of operation.
-
-  * <chukwaCollector.asyncAcks.scanperiod> affects how often collectors
-will check the filesystem for commits. It defaults to twice the rotation
-interval.
-
-  * <chukwaCollector.asyncAcks.scanpaths> determines where in HDFS
-collectors will look. It defaults to the data sink dir plus the archive dir.
-
-  In the future, Zookeeper could be used instead to track rotations.
-

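Putting the HDFS options above together, a minimal sketch of the relevant <chukwa-agent-conf.xml> entries (the namenode URL is a placeholder):

---
<property>
  <name>chukwa.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>

<property>
  <name>writer.hdfs.filesystem</name>
  <value>hdfs://namenode.example.com:8020/</value>
</property>
---
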
Modified: chukwa/trunk/src/site/apt/programming.apt
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/programming.apt?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/programming.apt (original)
+++ chukwa/trunk/src/site/apt/programming.apt Tue Jul 22 16:08:09 2014
@@ -97,7 +97,7 @@ Sink File Format
   As data is collected, Chukwa dumps it into <sink files> in HDFS. By
 default, these are located in <hdfs:///chukwa/logs>.  If the file name 
 ends in .chukwa, that means the file is still being written to. Every few minutes, 
-the collector will close the file, and rename the file to '*.done'.  This 
+the agent will close the file, and rename the file to '*.done'.  This 
 marks the file as available for processing.
 
   Each sink file is a Hadoop sequence file, containing a succession of 
@@ -149,7 +149,7 @@ first job that runs.
   By default, Chukwa will use the default TsProcessor. This parser will try to
 extract the real log statement from the log entry using the ISO8601 date 
 format. If it fails, it will use the time at which the chunk was written to
-disk (collector timestamp).
+disk (agent timestamp).
 
 * Writing a custom demux Mapper
 
@@ -218,10 +218,10 @@ implementation use the following groupin
 
 * Demux Data To HBase
 
-  Demux parsers can be configured to run in Chukwa Collector.  See 
-{{{./collector.html}Collector configuration guide}}.  HBaseWriter is not a
-real map reduce job.  It is designed to reuse Demux parsers for extraction,
-transformation and load purpose.  There are some limitations to consider before implementing
+  Demux parsers can be configured in <${CHUKWA_HOME}/etc/chukwa/chukwa-demux-conf.xml>.  See the
+{{{./pipeline.html} Pipeline configuration guide}}.  HBaseWriter is not a
+real MapReduce job.  It is designed to reuse Demux parsers for extraction and transformation purposes.
+There are some limitations to consider before implementing a
 Demux parser for loading data to HBase.  In a MapReduce job, multiple values can be merged and 
 grouped into a key/value pair in the shuffle/combine and merge phases.  This kind of aggregation is 
 unsupported by Demux in HBaseWriter because the data are not merged in memory, but sent to HBase.

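As an illustration of the <chukwa-demux-conf.xml> wiring referenced above, the assumption here is that each property maps a data type name to a parser class; the <MyLog> data type is purely hypothetical and the default <TsProcessor> is used as the parser:

---
<property>
  <name>MyLog</name>
  <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.TsProcessor</value>
</property>
---
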
Copied: chukwa/trunk/src/site/apt/user.apt (from r1607772, chukwa/trunk/src/site/apt/admin.apt)
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/apt/user.apt?p2=chukwa/trunk/src/site/apt/user.apt&p1=chukwa/trunk/src/site/apt/admin.apt&r1=1607772&r2=1612600&rev=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/apt/admin.apt (original)
+++ chukwa/trunk/src/site/apt/user.apt Tue Jul 22 16:08:09 2014
@@ -14,7 +14,7 @@
 ~~ limitations under the License.
 ~~
 
-Chukwa Administration Guide
+Chukwa User Guide
 
   This chapter is the detailed configuration guide to Chukwa configuration.
 
@@ -40,24 +40,20 @@ production platform that has been tested
 successfully on Mac OS X, which several members of the Chukwa team use for 
 development.
 
-  The only absolute software requirements are {{{http://java.sun.com}Java 1.6}}
-or better and {{{http://hadoop.apache.org/}Hadoop 0.20.205.1+}}.
+  The only absolute software requirements are Java 1.6 or better,
+ZooKeeper 3.4.x, HBase 0.96.x and Hadoop 1.x.
   
-  HICC, the Chukwa visualization interface, requires {{{http://hbase.apache.org}HBase 0.90.4+}}.
-
   The Chukwa cluster management scripts rely on <ssh>; these scripts, however,
 are not required if you have some alternate mechanism for starting and stopping
 daemons.
 
 Installing Chukwa
 
-  A minimal Chukwa deployment has three components:
+  A minimal Chukwa deployment has five components:
 
   * A Hadoop and HBase cluster on which Chukwa will process data (referred to as the Chukwa cluster).
 
-  * A collector process, that writes collected data to HBase.
-
-  * One or more agent processes, that send monitoring data to the collector. 
+  * One or more agent processes, that send monitoring data to HBase.
     The nodes with active agent processes are referred to as the monitored 
     source nodes.
 
@@ -76,8 +72,7 @@ Installing Chukwa
 
   * Un-tar the release, via <tar xzf>.
 
-  * Make sure a copy of Chukwa is available on each node being monitored, and on
-each node that will run a collector.
+  * Make sure a copy of Chukwa is available on each node being monitored.
 
   * We refer to the directory containing Chukwa as <CHUKWA_HOME>. It may
 be helpful to set <CHUKWA_HOME> explicitly in your environment,
@@ -85,9 +80,6 @@ but Chukwa does not require that you do 
 
 * General Configuration
 
-  Agents and collectors are configured differently, but part of the process
-is common to both.
-
   * Make sure that <JAVA_HOME> is set correctly and points to a Java 1.6 JRE. 
 It's generally best to set this in <etc/chukwa/chukwa-env.sh>.
 
@@ -103,27 +95,19 @@ Agents
 
   Agents are the Chukwa processes that actually produce data. This section
 describes how to configure and run them. More details are available in the
-{{{./agent.html}Agent configuration guide}}.
+{{{./agent.html} Agent configuration guide}}.
 
 * Configuration
 
-  This section describes how to set up the agent process on the source nodes.
-
-  The one mandatory configuration step is to set up 
-<$CHUKWA_HOME/etc/chukwa/collectors>. This file should contain a list
-of hosts that will run Chukwa collectors. Agents will pick a random collector
-from this list to try sending to, and will fail-over to another listed collector
-on error.  The file should look something like:
-
----
-http://<collector1HostName>:<collector1Port>/
-http://<collector2HostName>:<collector2Port>/
-http://<collector3HostName>:<collector3Port>/
----
+  First, edit <$CHUKWA_HOME/etc/chukwa/chukwa-env.sh>.  In addition to 
+the general directions given above, you should set <HADOOP_CONF_DIR> and
+<HBASE_CONF_DIR>.  This should be the Hadoop deployment Chukwa will use to 
+store collected data.  You will get a version mismatch error if this is 
+configured incorrectly.
 
   Edit the <CHUKWA_HOME/etc/chukwa/initial_adaptors> configuration file. 
 This is where you tell Chukwa what log files to monitor. See
-{{{./agent.html#Adaptors}the adaptor configuration guide}} for
+{{{./agent.html#Adaptors} the adaptor configuration guide}} for
 a list of available adaptors.
 
   There are a number of optional settings in 
@@ -150,6 +134,30 @@ not NFS-mount, directory.
   * Setting the option <chukwaAgent.control.remote> will disallow remote 
 connections to the agent control socket.
 
+** Use HBase For Data Storage
+
+  * Configuring the pipeline: set HBaseWriter as your writer, or add it 
+    to the pipeline if you are using 
+
+---
+  <property>
+    <name>chukwa.agent.connector</name>
+    <value>org.apache.hadoop.chukwa.datacollection.connector.PipelineConnector</value>
+  </property>
+
+  <property>
+    <name>chukwa.pipeline</name>
+    <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
+  </property>
+---
+
+** Use HDFS For Data Storage
+
+  The one mandatory configuration parameter is <writer.hdfs.filesystem>.
+This should be set to the HDFS root URL on which Chukwa will store data.
+Various optional configuration options are described in 
+{{{./pipeline.html} the pipeline configuration guide}}.
+
 * Starting, Stopping, And Monitoring
 
   To run an agent process on a single node, use <bin/chukwa agent>.
@@ -191,73 +199,15 @@ Setup HBase Table
 allow real-time charting. This section describes how to configure HBase and 
 HICC to work together.
 
-  * Presently, we support HBase 0.90.4+. If you have HBase 0.89 jars anywhere, 
+  * Presently, we support HBase 0.96+. If you have older HBase jars anywhere, 
 they will cause linkage errors.  Check for and remove them.
 
   * Setting up tables:
 
 ---
-/path/to/hbase-0.90.4/bin/hbase shell < etc/chukwa/hbase.schema
+hbase/bin/hbase shell < etc/chukwa/hbase.schema
 ---
 
-Collectors
-
-  This section describes how to set up the Chukwa collectors.
-For more details, see {{{./collector.html}the collector configuration guide}}.
-
-* Configuration
-
-  First, edit <$CHUKWA_HOME/etc/chukwa/chukwa-env.sh> In addition to 
-the general directions given above, you should set <HADOOP_CONF_DIR> and
-<HBASE_CONF_DIR>.  This should be the Hadoop deployment Chukwa will use to 
-store collected data.  You will get a version mismatch error if this is 
-configured incorrectly.
-
-  Next, edit <$CHUKWA_HOME/etc/chukwa/chukwa-collector-conf.xml>.
-
-** Use HBase For Data Storage
-
-  * Configuring the collector: set HBaseWriter as your writer, or add it 
-    to the pipeline if you are using 
-
----
-  <property>
-    <name>chukwaCollector.writerClass</name>
-    <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
-  </property>
-
-  <property>
-    <name>chukwaCollector.pipeline</name>
-    <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
-  </property>
----
-
-** Use HDFS For Data Storage
-
-  The one mandatory configuration parameter is <writer.hdfs.filesystem>.
-This should be set to the HDFS root URL on which Chukwa will store data.
-Various optional configuration options are described in 
-{{{./collector.html}the collector configuration guide}}
-and in the collector configuration file itself.
-
-* Starting, Stopping, And Monitoring
-
-  To run a collector process on a single node, use <bin/chukwa collector>.
-
-  Typically, collectors run as daemons. The script <bin/start-collectors.sh> 
-will ssh to each collector listed in <etc/chukwa/collectors> and start a
-collector, running in the background. The script <bin/stop-collectors.sh> 
-does the reverse.
-
-  You can, of course, use any other daemon-management system you like. 
-For instance, <tools/init.d> includes init scripts for running
-Chukwa collectors.
-
-  To check if a collector is working properly, you can simply access
-<http://collectorhost:collectorport/chukwa?ping=true> with a web browser.
-If the collector is running, you should see a status page with a handful of 
-statistics.
-
 ETL Processes (Optional)
 
   For storing data to HDFS, the archive and demux mapreduce jobs can be 
@@ -335,14 +285,6 @@ org.apache.hadoop.chukwa.datacollection.
 ps ax | grep org.apache.hadoop.chukwa.datacollection.agent.ChukwaAgent
 ---
 
-* UNIX Processes For Chukwa Collectors
-
-  Chukwa Collector name is identified by:
-
----
-org.apache.hadoop.chukwa.datacollection.collector.CollectorStub
----
-
 * UNIX Processes For Chukwa Data Processes
 
   Chukwa Data Processors are identified by:
@@ -358,9 +300,8 @@ visible from the process list.
 
 * Checks For Disk Full 
 
-  If anything is wrong, use /etc/init.d/chukwa-agent and 
-CHUKWA_HOME/tools/init.d/chukwa-system-metrics stop to shutdown Chukwa.  
-Look at agent.log and collector.log file to determine the problems. 
+  If anything is wrong, use /etc/init.d/chukwa-agent stop to shut down Chukwa.  
+Look at the agent.log file to determine the problem. 
 
   The most common problem is the log files are growing unbounded. Set up a 
 cron job to remove old log files:

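Pulling the user guide steps above together, a minimal start-and-verify sequence on a monitored node might look like this (a sketch; paths are relative to CHUKWA_HOME):

---
bin/chukwa agent
ps ax | grep org.apache.hadoop.chukwa.datacollection.agent.ChukwaAgent
---
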
Modified: chukwa/trunk/src/site/resources/images/chukwa_architecture.png
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/resources/images/chukwa_architecture.png?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
Binary files - no diff available.

Modified: chukwa/trunk/src/site/site.xml
URL: http://svn.apache.org/viewvc/chukwa/trunk/src/site/site.xml?rev=1612600&r1=1612599&r2=1612600&view=diff
==============================================================================
--- chukwa/trunk/src/site/site.xml (original)
+++ chukwa/trunk/src/site/site.xml Tue Jul 22 16:08:09 2014
@@ -39,20 +39,21 @@
       <item name="Zookeeper" href="http://zookeeper.apache.org/"/>
     </links>
 
-    <menu name="Chukwa 0.5">
+    <menu name="Table of Contents">
       <item name="Overview" href="index.html"/>
       <item name="Quick Start Guide" href="Quick_Start_Guide.html"/>
-      <item name="HICC User Guide" href="hicc.html"/>
-      <item name="Administration Guide" href="admin.html">
+      <item name="User Guide" href="user.html">
         <item name="Agent" href="agent.html"/>
-        <item name="Collector" href="collector.html"/>
+        <item name="Pipeline" href="pipeline.html"/>
+        <item name="HICC" href="hicc.html"/>
       </item>
-      <item name="Architecture" href="design.html"/>
-      <item name="Chukwa Storage Layout" href="dataflow.html"/>
       <item name="Programming Guide" href="programming.html">
         <item name="Agent REST API" href="apidocs/agent-rest.html"/>
         <item name="Javadocs" href="apidocs/index.html"/>
       </item>
+      <item name="Architecture" href="design.html">
+        <item name="HDFS Layout" href="dataflow.html"/>
+      </item>
       <item name="Wiki" href="http://wiki.apache.org/hadoop/Chukwa/"/>
       <item name="FAQ" href="http://wiki.apache.org/hadoop/Chukwa/FAQ"/>
     </menu>