[Hadoop Wiki] "Sending information to Chukwa" by Jerome Boulon
http://wiki.apache.org/hadoop/Sending_information_to_Chukwa

#format wiki
#language en


= How to push new information to Chukwa =

== Add a new data source (input source) ==
=== Using Log4J ===
Chukwa comes with a Log4J appender. Here are the steps you need to follow in order to use it:

  1.  Create a log4j.properties file that contains the following information:

    log4j.rootLogger=INFO, chukwa
    log4j.appender.chukwa=org.apache.hadoop.chukwa.inputtools.log4j.ChukwaDailyRollingFileAppender
    log4j.appender.chukwa.File=${CHUKWA_HOME}/logs/${RECORD_TYPE}.log
    log4j.appender.chukwa.DatePattern='.'yyyy-MM-dd
    log4j.appender.chukwa.recordType=${RECORD_TYPE}
    log4j.appender.chukwa.layout=org.apache.log4j.PatternLayout
    log4j.appender.chukwa.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

  1.  Add these parameters to your Java command line:
      * -DCHUKWA_HOME=${CHUKWA_HOME} -DRECORD_TYPE=<YourRecordType_Here> -Dlog4j.configuration=log4j.properties
      * -DRECORD_TYPE=<YourRecordType_Here> is the most important parameter.
      * You can only store one record type per file, so if you need to split your logs into different record types, just create one appender per data type (see the Hadoop log4j configuration file for an example)

  1.  Start your program; all your log statements should now be written to ${CHUKWA_HOME}/logs/<YourRecordType_Here>.log (a minimal example program follows)
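
A minimal example of a program whose output is collected this way (the class name and record type "MyApp" are illustrative, not part of Chukwa):

{{{#!java
// Launch with the system properties from step 2 above, e.g.:
//   java -DCHUKWA_HOME=/opt/chukwa -DRECORD_TYPE=MyApp -Dlog4j.configuration=log4j.properties MyApp
import org.apache.log4j.Logger;

public class MyApp
{
       private static final Logger log = Logger.getLogger(MyApp.class);

       public static void main(String[] args)
       {
           // Each statement below ends up in ${CHUKWA_HOME}/logs/MyApp.log
           // via the ChukwaDailyRollingFileAppender configured above
           log.info("application started");
           log.warn("something worth flagging");
       }
}
}}}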

=== Static file like /var/log/messages ===

   1. Edit ${CHUKWA_HOME}/conf/initial_adaptors 
   1. Add a line similar to this one:
      * add org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8NewLine SysLog 0 /var/log/messages 0

This line will automatically register the "CharFileTailingAdaptorUTF8NewLine" adaptor for /var/log/messages.

=== Register a file from another application/language ===

   1. Open a socket from your application to the ChukwaLocalAgent
   1. Write this line to the socket:
      * add org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8NewLine <RecordType> <StartOffset> <fileName> <StartOffset>
      * Where <RecordType> is the data type that will identify your data
      * Where <StartOffset> is the offset in the file at which to start
      * Where <fileName> is the local path to the file on your machine
   1. Close the socket (a minimal example follows)
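
A minimal sketch in Java (any language that can open a TCP socket works). The agent host/port and the record type / file path below are assumptions; use the values from your own agent configuration:

{{{#!java
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class RegisterFile
{
       public static void main(String[] args) throws Exception
       {
           // Host and control port of the local Chukwa agent are assumptions; adapt to your setup
           try (Socket socket = new Socket("localhost", 9093);
                Writer out = new OutputStreamWriter(socket.getOutputStream(), "UTF-8"))
           {
               // Same "add" command format as above; the record type and file are placeholders
               out.write("add org.apache.hadoop.chukwa.datacollection.adaptor.filetailer."
                   + "CharFileTailingAdaptorUTF8NewLine MyAppLog 0 /var/log/myapp.log 0\n");
               out.flush();
           } // leaving the try-with-resources block closes the socket
       }
}
}}}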


== Extract information from this new data source ==

=== Using the default TimeStamp Parser ===

By default, Chukwa uses the TsProcessor.

This parser will try to extract the timestamp of the real log statement from the log entry using the %d{ISO8601} date format.
If it fails, it will use the time at which the chunk has been written to disk (the collector timestamp).

Your log will automatically be available from the Web Log viewer under the <YourRecordTypeHere> directory.
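
For example, a log entry written with the ConversionPattern shown above starts with an ISO8601 timestamp that the TsProcessor can read (the logger name and message below are made up):

{{{
2008-10-23 14:55:12,345 INFO org.apache.hadoop.chukwa.Example: starting up
}}}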
 
=== Using a specific Parser ===
If you want to extract specific information or do more processing, you need to write your own parser.
Like any M/R program, you have to write at least the map side of your parser. The reduce side is the identity by default.

==== MAP side of the parser ====
You can either write your own from scratch or extend the AbstractProcessor class, which hides all the low-level operations on the chunk.
Then you have to register your parser with the demux (the link between the RecordType and the parser).

==== Parser registration ====
   * Edit ${CHUKWA_HOME}/conf/chukwa-demux-conf.xml and add the following lines:

   <property>
    <name><YourRecordType_Here></name>
    <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.MyParser</value>
    <description>Parser class for <YourRecordType_Here></description>
   </property>

(Tip: you can use the same parser for several different record types.)

==== Parser implementation ====

{{{#!java

// Note: import paths may vary slightly between Chukwa versions.
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyParser extends AbstractProcessor
{
       protected void parse(String recordEntry,
                            OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
                            Reporter reporter)
       {
           try
           {
               // Extract the Log4j information, i.e. timestamp, logLevel, logger, ...
               // %d{ISO8601} produces a 23-character timestamp such as "2008-10-23 14:55:12,345"
               SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
               String dStr = recordEntry.substring(0, 23);
               int start = 24;
               int idx = recordEntry.indexOf(' ', start);
               String logLevel = recordEntry.substring(start, idx);
               start = idx + 1;
               idx = recordEntry.indexOf(' ', start);
               // idx - 1 drops the trailing ':' the ConversionPattern appends to the logger name
               String className = recordEntry.substring(start, idx - 1);
               String body = recordEntry.substring(idx + 1);

               Date d = sdf.parse(dStr);

               ChukwaRecordKey key = new ChukwaRecordKey();
               key.setKey("<YOUR_KEY_HERE>");
               key.setReduceType("<YOUR_RECORD_TYPE_HERE>");

               ChukwaRecord record = new ChukwaRecord();
               record.setTime(d.getTime());

               // Parse your line here and add your {key,value} pairs;
               // the Log4j fields extracted above are used as an example
               record.add("logLevel", logLevel);
               record.add("class", className);
               record.add("body", body);

               // Output your record
               output.collect(key, record);
           }
           catch (Exception e)
           {
               // date-parsing or output failures end up here
               throw new RuntimeException(e);
           }
       }
}

}}}

(Tip: see the org.apache.hadoop.chukwa.extraction.demux.processor.mapper.Df class for an example of a parser class.)

==== REDUCE side of the parser ====
You only need to implement a reduce side if you need to group records together.
The link between the map side and the reduce side is made by setting your reduce class as the reduce type on the map side: key.setReduceType("<YourReduceClassHere>");

Here is the interface that you need to implement:

{{{#!java
public interface ReduceProcessor
{
           public String getDataType();
           public void process(ChukwaRecordKey key,Iterator<ChukwaRecord> values,
                      OutputCollector<ChukwaRecordKey, 
                      ChukwaRecord> output, Reporter reporter);
}
}}}

(Tip: see the org.apache.hadoop.chukwa.extraction.demux.processor.reducer.SystemMetrics class for an example of a reduce class.)
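
Here is a minimal sketch of such a reducer. The class name, reduce type, and the "recordCount" field are illustrative only, and the import paths may differ slightly between Chukwa versions:

{{{#!java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.chukwa.extraction.demux.processor.reducer.ReduceProcessor;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyReduceProcessor implements ReduceProcessor
{
           public String getDataType()
           {
               return "MyReduceType"; // illustrative name
           }

           public void process(ChukwaRecordKey key, Iterator<ChukwaRecord> values,
                      OutputCollector<ChukwaRecordKey, ChukwaRecord> output, Reporter reporter)
           {
               try
               {
                   // Collapse all records grouped under this key into one summary record
                   ChukwaRecord summary = new ChukwaRecord();
                   int count = 0;
                   while (values.hasNext())
                   {
                       ChukwaRecord r = values.next();
                       summary.setTime(r.getTime());
                       count++;
                   }
                   summary.add("recordCount", Integer.toString(count));
                   output.collect(key, summary);
               }
               catch (IOException e)
               {
                   throw new RuntimeException(e);
               }
           }
}
}}}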

==== Parser key field ====

Your data is going to be sorted by RecordType and then by the key field.
The default implementation uses the following grouping for all records (an illustrative key built this way is sketched after the list):
   1. Time partition (timestamp truncated to the hour)
   1. Machine name (physical input source)
   1. Record timestamp
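
A hypothetical key built along those lines (the "/"-separated layout and the helper class are illustrative assumptions, not the exact format Chukwa uses):

{{{#!java
// Illustrative only: building a key that follows the grouping described above
public class KeyExample
{
       static String buildKey(long timestamp, String machine)
       {
           // Truncate the record timestamp down to the hour for the time partition
           long hourPartition = (timestamp / (3600 * 1000L)) * (3600 * 1000L);
           return hourPartition + "/" + machine + "/" + timestamp;
       }

       public static void main(String[] args)
       {
           System.out.println(buildKey(System.currentTimeMillis(), "host01.example.com"));
       }
}
}}}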

==== Output directory ====
The demux process uses the recordType to save similar records (same recordType) together in the same directory:
<Your_Cluster_Information>/<Your_Record_Type>/
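
For example, with a (hypothetical) cluster named "cluster1" and a record type "SysLog", the demux output for those records would end up under cluster1/SysLog/.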