Posted to commits@kafka.apache.org by ne...@apache.org on 2012/01/04 03:29:47 UTC

svn commit: r1227032 - in /incubator/kafka/site: configuration.html design.html downloads.html includes/header.html projects.html quickstart.html

Author: nehanarkhede
Date: Wed Jan  4 02:29:46 2012
New Revision: 1227032

URL: http://svn.apache.org/viewvc?rev=1227032&view=rev
Log:
KAFKA-230 0.7.0 release documentation; patched by jaykreps and nehanarkhede; reviewed by junrao

Modified:
    incubator/kafka/site/configuration.html
    incubator/kafka/site/design.html
    incubator/kafka/site/downloads.html
    incubator/kafka/site/includes/header.html
    incubator/kafka/site/projects.html
    incubator/kafka/site/quickstart.html

Modified: incubator/kafka/site/configuration.html
URL: http://svn.apache.org/viewvc/incubator/kafka/site/configuration.html?rev=1227032&r1=1227031&r2=1227032&view=diff
==============================================================================
--- incubator/kafka/site/configuration.html (original)
+++ incubator/kafka/site/configuration.html Wed Jan  4 02:29:46 2012
@@ -18,6 +18,11 @@
     <td>Each broker is uniquely identified by an id. This id serves as the brokers "name", and allows the broker to be moved to a different host/port without confusing consumers.</td>
 </tr>
 <tr>
+    <td><code>enable.zookeeper</code></td>
+    <td>true</td>
+    <td>Enable zookeeper registration in the server.</td>
+</tr>
+<tr>
      <td><code>log.flush.interval</code></td>
      <td>500</td>
      <td>Controls the number of messages accumulated in each topic (partition) before the data is flushed to disk and made available to consumers.</td>  
@@ -25,7 +30,7 @@
 <tr>
     <td><code>log.default.flush.scheduler.interval.ms</code></td>
     <td>3000</td>
-    <td>Controls the interval at which logs are checked to see if they need to be flushed to disk.</td>
+    <td>Controls the interval at which logs are checked to see if they need to be flushed to disk. A background thread will run at a frequency specified by this parameter and will check each log to see if it has exceeded its flush.interval time, and if so it will flush it.</td>
 </tr>
 <tr>
     <td><code>log.default.flush.interval.ms</code> </td>
@@ -49,6 +54,11 @@
     <td>Topic-specific retention time that overrides <code>log.retention.hours</code>, e.g., topic1:10,topic2:20</td>
 </tr>
 <tr>
+    <td><code>log.retention.size</code></td>
+    <td>-1</td>
+    <td>The maximum size of the log before deleting it. This controls how large a log is allowed to grow.</td>
+</tr>
+<tr>
     <td><code>log.cleanup.interval.mins</code></td>
     <td>10</td>
     <td>Controls how often the log cleaner checks logs eligible for deletion. A log file is eligible for deletion if it hasn't been modified for <code>log.retention.hours</code> hours.</td>
@@ -64,22 +74,42 @@
     <td>Controls the maximum size of a single log file.</td>
 </tr>
 <tr>
+    <td><code>max.socket.request.bytes</code></td>
+    <td>104857600</td>
+    <td>The maximum number of bytes in a socket request.</td>
+</tr>
+<tr>
+    <td><code>monitoring.period.secs</code></td>
+    <td>600</td>
+    <td>The interval, in seconds, at which performance statistics are measured.</td>
+</tr>
+<tr>
     <td><code>num.threads</code></td>
     <td>Runtime.getRuntime().availableProcessors</td>
     <td>Controls the number of worker threads in the broker to serve requests.</td>
 </tr>
 <tr>
-    <td><code>num.partitions</code> </td>
+    <td><code>num.partitions</code></td>
     <td>1</td>
     <td>Specifies the default number of partitions per topic.</td>
 </tr>
 <tr>
+    <td><code>socket.send.buffer</code></td>
+    <td>102400</td>
+    <td>The SO_SNDBUF buffer of the socket server sockets.</td>
+</tr>
+<tr>
+    <td><code>socket.receive.buffer</code></td>
+    <td>102400</td>
+    <td>The SO_RCVBUF buffer of the socket server sockets.</td>
+</tr>
+<tr>
     <td><code>topic.partition.count.map</code></td>
     <td>none</td>
     <td>Override parameter to control the number of partitions for selected topics. E.g., topic1:10,topic2:20</td>
 </tr>
 <tr>
-    <td><code>zk.connect</code> </td>
+    <td><code>zk.connect</code></td>
     <td>localhost:2182/kafka</td>
     <td>Specifies the zookeeper connection string in the form hostname:port/chroot. Here the chroot is a base directory which is prepended to all path operations (this effectively namespaces all kafka znodes to allow sharing with other applications on the same zookeeper cluster)</td>
 </tr>
@@ -168,6 +198,26 @@ the size of those queues</td>
     <td>-1</td>
     <td>By default, this value is -1 and a consumer blocks indefinitely if no new message is available for consumption. By setting the value to a positive integer, a timeout exception is thrown to the consumer if no message is available for consumption after the specified timeout value.</td>
 </tr>
+<tr>
+    <td><code>rebalance.retries.max</code> </td>
+    <td>4</td>
+    <td>The maximum number of retries during a rebalance.</td>
+</tr>
+<tr>
+    <td><code>mirror.topics.whitelist</code></td>
+    <td>""</td>
+    <td>Whitelist of topics for this mirror's embedded consumer to consume. At most one of whitelist/blacklist may be specified.</td>
+</tr>
+<tr>
+    <td><code>mirror.topics.blacklist</code></td>
+    <td>""</td>
+    <td>Topics to skip mirroring. At most one of whitelist/blacklist may be specified.</td>
+</tr>
+<tr>
+    <td><code>mirror.consumer.numthreads</code></td>
+    <td>4</td>
+    <td>The number of threads to use per topic for the mirroring consumer.</td>
+</tr>
 </table>
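
For illustration, a minimal sketch of wiring the consumer settings above into a config, following the ConsumerConfig / Consumer.createJavaConsumerConnector pattern shown in quickstart.html. The import paths and the groupid value are assumptions here, not taken from the tables above.

<pre>
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class ConsumerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zk.connect", "127.0.0.1:2181");   // zookeeper connection string
        props.put("groupid", "test_group");          // consumer group id (assumed value)
        props.put("consumer.timeout.ms", "5000");    // throw a timeout exception after 5s with no messages
        props.put("rebalance.retries.max", "4");     // retry a failed rebalance up to 4 times

        ConsumerConfig config = new ConsumerConfig(props);
        ConsumerConnector connector = Consumer.createJavaConsumerConnector(config);
        // ... create message streams and consume, as shown in quickstart.html ...
        connector.shutdown();
    }
}
</pre>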
 
 
@@ -207,7 +257,51 @@ the size of those queues</td>
     <td>null. Either this parameter or broker.partition.info needs to be specified by the user</td>
     <td>For using the zookeeper based automatic broker discovery, use this config to pass in the zookeeper connection url to the zookeeper cluster where the Kafka brokers are registered.</td>
 </tr>
-<tr><td></td><td><b>The following config parameters are related to</b> <code><b>kafka.producer.async.AsyncSyncProducer</b></code></td><tr>
+<tr>
+    <td><code>buffer.size</code></td>
+    <td>102400</td>
+    <td>the socket buffer size, in bytes</td>
+</tr>
+<tr>
+    <td><code>connect.timeout.ms</code></td>
+    <td>5000</td>
+    <td>the maximum time spent by <code>kafka.producer.SyncProducer</code> trying to connect to the kafka broker. Once it elapses, the producer throws an ERROR and stops.</td>
+</tr>
+<tr>
+    <td><code>socket.timeout.ms</code></td>
+    <td>30000</td>
+    <td>The socket timeout in milliseconds</td>
+</tr>
+<tr>
+    <td><code>reconnect.interval</code> </td>
+    <td>30000</td>
+    <td>the number of produce requests after which <code>kafka.producer.SyncProducer</code> tears down the socket connection to the broker and establishes it again</td>
+</tr>
+<tr>
+    <td><code>max.message.size</code> </td>
+    <td>1000000</td>
+    <td>the maximum number of bytes that the kafka.producer.SyncProducer can send as a single message payload</td>
+</tr>
+<tr>
+    <td><code>compression.codec</code></td>
+    <td>0 (No compression)</td>
+    <td>This parameter allows you to specify the compression codec for all data generated by this producer.</td>
+</tr>
+<tr>
+    <td><code>compressed.topics</code></td>
+    <td>null</td>
+    <td>This parameter allows you to set whether compression should be turned on for particular topics. If the compression codec is anything other than NoCompressionCodec and the list of compressed topics is non-empty, compression is enabled only for the specified topics. If the list of compressed topics is empty, the specified compression codec is enabled for all topics. If the compression codec is NoCompressionCodec, compression is disabled for all topics.</td>
+</tr>
+<tr>
+    <td><code>zk.read.num.retries</code></td>
+    <td>3</td>
+    <td>The producer using the zookeeper software load balancer maintains a ZK cache that gets updated by the zookeeper watcher listeners. During some events like a broker bounce, the producer ZK cache can get into an inconsistent state, for a small time period. In this time period, it could end up picking a broker partition that is unavailable. When this happens, the ZK cache needs to be updated. This parameter specifies the number of times the producer attempts to refresh this ZK cache.</td>
+</tr>
+<tr>
+	<td colspan="3" style="text-align: center">
+	Options for Asynchronous Producers (<code>producer.type=async</code>)
+	</td>
+</tr>
 <tr>
     <td><code>queue.time</code></td>
     <td>5000</td>
@@ -245,32 +339,7 @@ the size of those queues</td>
     <td>null</td>
     <td>the <code>java.util.Properties()</code> object used to initialize the custom <code>callback.handler</code> through its <code>init()</code> API</td>
 </tr>
-<tr><td></td><td><b>The following config parameters are related to</b> <code><b>kafka.producer.SyncProducer</b></code></td><tr>
-<tr>
-    <td><code>buffer.size</code></td>
-    <td>102400</td>
-    <td>the socket buffer size, in bytes</td>
-</tr>
-<tr>
-    <td><code>connect.timeout.ms</code></td>
-    <td>5000</td>
-    <td>the maximum time spent by <code>kafka.producer.SyncProducer</code> trying to connect to the kafka broker. Once it elapses, the producer throws an ERROR and stops.</td>
-</tr>
-<tr>
-    <td><code>socket.timeout.ms</code></td>
-    <td>30000</td>
-    <td>The socket timeout in milliseconds</td>
-</tr>
-<tr>
-    <td><code>reconnect.interval</code> </td>
-    <td>30000</td>
-    <td>the number of produce requests after which <code>kafka.producer.SyncProducer</code> tears down the socket connection to the broker and establishes it again</td>
-</tr>
-<tr>
-    <td><code>max.message.size</code> </td>
-    <td>1000000</td>
-    <td>the maximum number of bytes that the kafka.producer.SyncProducer can send as a single message payload</td>
-</tr></table>
+</table>
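
For illustration, a minimal sketch that ties several of the producer settings above together in the style of the quickstart examples. The import paths, topic name, and property values here are assumed; compression.codec=1 selects GZIP as in quickstart.html.

<pre>
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.javaapi.producer.ProducerData;
import kafka.producer.ProducerConfig;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zk.connect", "127.0.0.1:2181");      // discover brokers through zookeeper
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("producer.type", "async");            // buffer writes and send them in batches
        props.put("queue.time", "5000");                // maximum time (ms) to buffer data before sending
        props.put("compression.codec", "1");            // 1 = GZIP
        props.put("compressed.topics", "test-topic");   // compress only this topic

        ProducerConfig config = new ProducerConfig(props);
        Producer<String, String> producer = new Producer<String, String>(config);
        producer.send(new ProducerData<String, String>("test-topic", "test-message"));
        producer.close();
    }
}
</pre>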
 
 
 <!--#include virtual="includes/footer.html" -->

Modified: incubator/kafka/site/design.html
URL: http://svn.apache.org/viewvc/incubator/kafka/site/design.html?rev=1227032&r1=1227031&r2=1227032&view=diff
==============================================================================
--- incubator/kafka/site/design.html (original)
+++ incubator/kafka/site/design.html Wed Jan  4 02:29:46 2012
@@ -3,17 +3,17 @@
 <h2>Why we built this</h2>
 
 <p>
-Kafka is a messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn's activity stream processing pipeline.
+Kafka is a messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn's activity stream and operational data processing pipeline. It is now used at a <a href="https://cwiki.apache.org/confluence/display/KAFKA/Powered+By">variety of different companies</a> for various data pipeline and messaging uses.
 </p>
 
 <p>
-Activity stream data is a normal part of any website for reporting on usage of the site. Activity data is things like page views, information about what content was shown, searches, etc. This kind of thing is usually handled by logging the activity out to some kind of file and then periodically aggregating these files for analysis.
+Activity stream data is a normal part of any website for reporting on usage of the site. Activity data is things like page views, information about what content was shown, searches, etc. This kind of thing is usually handled by logging the activity out to some kind of file and then periodically aggregating these files for analysis. Operational data is data about the performance of servers (CPU, IO usage, request times, service logs, etc) and a variety of different approaches to aggregating operational data are used.
 </p>
 
 <p>
-In recent years, however, activity data has become a critical part of the production features of websites, and a slightly more sophisticated set of infrastructure is needed.
+In recent years, activity and operational data has become a critical part of the production features of websites, and a slightly more sophisticated set of infrastructure is needed.
 
-<h2>Use cases for activity stream data</h2>
+<h2>Use cases for activity stream and operational data</h2>
 <ul>
 	<li>"News feed" features that broadcast the activity of your friends.</li>
 	<li>Relevance and ranking uses count ratings, votes, or click-through to determine which of a given set of items is most relevant.</li>
@@ -44,6 +44,16 @@ The following diagram gives a simplified
 Note that a single kafka cluster handles all activity data from all different sources. This provides a single pipeline of data for both online and offline consumers. This tier acts as a buffer between live activity and asynchronous processing. We also use kafka to replicate all data to a different datacenter for offline consumption.
 </p>
 
+<p>
+It is not intended that a single Kafka cluster span data centers, but Kafka is intended to support multi-datacenter data flow topologies. This is done by allowing mirroring or "syncing" between clusters. This feature is very simple: the mirror cluster simply acts as a consumer of the source cluster. This means it is possible for a single cluster to join data from many datacenters into a single location. Here is an example of a possible multi-datacenter topology aimed at supporting batch loads:
+</p>
+
+<img src="images/kafka_multidc.png">
+
+<p>
+Note that there is no correspondence between nodes in the two clusters&mdash;they may be of different sizes and contain different numbers of nodes, and a single cluster can mirror any number of source clusters. More details on using the mirroring feature can be found <a href="https://cwiki.apache.org/confluence/display/KAFKA/Kafka+mirroring">here</a>.
+</p>
+
 <h1>Major Design Elements</h1>
 
 <p>
@@ -149,6 +159,16 @@ We expect a common use case to be multip
 For more background on the sendfile and zero-copy support in Java, see this <a href="http://www.ibm.com/developerworks/linux/library/j-zerocopy">article</a> on IBM developerworks.	
 </p>
 
+<h2>End-to-end Batch Compression</h2>
+<p>
+In many cases the bottleneck is actually not CPU but network. This is particularly true for a data pipeline that needs to send messages across data centers. Of course the user can always send compressed messages without any support needed from Kafka, but this can lead to very poor compression ratios as much of the redundancy is due to repetition between messages (e.g. field names in JSON or user agents in web logs or common string values). Efficient compression requires compressing multiple messages together rather than compressing each message individually. Ideally this would be possible in an end-to-end fashion&mdash;that is, data would be compressed prior to sending by the producer and remain compressed on the server, only being decompressed by the eventual consumers.
+</p>
+<p>
+Kafka supports this by allowing recursive message sets. A batch of messages can be clumped together, compressed, and sent to the server in this form. This batch of messages will all be delivered to the same consumer and will remain in compressed form until it arrives there.
+</p>
+<p>
+Kafka supports GZIP and Snappy compression protocols. More details on compression can be found <a href="https://cwiki.apache.org/confluence/display/KAFKA/Compression">here</a>.
+
 <h2>Consumer state</h2>
 
 <p>
@@ -201,7 +221,7 @@ Kafka is built to be run across a cluste
 Currently, there is no built-in load balancing between the producers and the brokers in Kafka; in our own usage we publish from a large number of heterogeneous machines and so it is desirable that the publisher not need any explicit knowledge of the cluster topology. We rely on a hardware load balancer to distribute the producer load across multiple brokers. We will consider adding this in a future release to allow semantic partitioning of messages (i.e. publishing all messages to a particular broker based on some id to ensure an ordered stream of updates within that id).
 </p>
 <p>
-Kafka does have built-in load balancing between the consumers and the brokers. To achieve this co-ordination, each broker and each consumer register its state and maintains its metadata in Zookeeper. When there is a broker or a consumer change, each consumer is notified about the change through the zookeeper watcher. The consumer then reads the current information about all relevant brokers and consumers, and determines which brokers it should consume data from.
+Kafka has built-in load balancing between the consumers and the brokers. To achieve this co-ordination, each broker and each consumer registers its state and maintains its metadata in Zookeeper. When there is a broker or a consumer change, each consumer is notified about the change through the zookeeper watcher. The consumer then reads the current information about all relevant brokers and consumers, and determines which brokers it should consume data from.
 </p>
 <p>
 This kind of cluster-aware balancing of consumption has several advantages:
@@ -214,9 +234,13 @@ This kind of cluster-aware balancing of 
 
 <h2>Producer</h2>
 
-<h3>Automatic load balancing</h3>
+<h3>Automatic producer load balancing</h3>
+<p>
+Kafka supports client-side load balancing for message producers, or use of a dedicated load balancer to balance TCP connections. A dedicated layer-4 load balancer works by balancing TCP connections over Kafka brokers. In this configuration all messages from a given producer go to a single broker. The advantage of using a layer-4 load balancer is that each producer only needs a single TCP connection, and no connection to zookeeper is needed. The disadvantage is that the balancing is done at the TCP connection level, and hence it may not be well balanced (if some producers produce many more messages than others, evenly dividing up the connections per broker may not result in evenly dividing up the messages per broker).
+</p>
+<p>
+Client-side zookeeper-based load balancing solves some of these problems. It allows the producer to dynamically discover new brokers, and balance load on a per-request basis. Likewise it allows the producer to partition data according to some key instead of randomly, which enables stickiness on the consumer (e.g. partitioning data consumption by user id). This feature is called "semantic partitioning", and is described in more detail below.
+</p>
 <p>
-In v0.6, we introduced built-in automatic load balancing between the producers and the brokers in Kafka; Currently, in our own usage we publish from a large number of heterogeneous machines and so it is desirable that the publisher not need any explicit knowledge of the cluster topology. We rely on a hardware load balancer to distribute the producer load across multiple brokers. An advantage of using the hardware load balancer is the “healthcheck” service that detects if a broker is down and forwards the producer request to another healthy broker. In v0.6, this “healthcheck” feature is provided in the cluster-aware producer. Producers discover the available brokers in a cluster and the number of partitions on each, by registering watchers in zookeeper. Since the number of broker partitions is configurable per topic, zookeeper watchers are registered on the following events -
+The zookeeper-based load balancing works as follows. Zookeeper watchers are registered on the following events&mdash;
 </p>
 <ul>
 <li>a new broker comes up</li>
@@ -225,7 +249,7 @@ In v0.6, we introduced built-in automati
 <li>a broker gets registered for an existing topic</li>
 </ul>
 <p>
-Internally, the producer maintains an elastic pool of connections to the brokers, one per broker. This pool is kept updated to establish/maintain connections to all the live brokers, through the zookeeper watcher callbacks. When a producer request for a particular topic comes in, a broker partition is picked by the partitioner (see section on Semantic partitioning). The available producer connection is used from the pool to send the data to the selected  broker partition.
+Internally, the producer maintains an elastic pool of connections to the brokers, one per broker. This pool is kept updated to establish/maintain connections to all the live brokers, through the zookeeper watcher callbacks. When a producer request for a particular topic comes in, a broker partition is picked by the partitioner (see section on semantic partitioning). The available producer connection is used from the pool to send the data to the selected broker partition.
 </p>
 
 <h3>Asynchronous send</h3>
@@ -235,7 +259,7 @@ Asynchronous non-blocking operations are
 
 <h3>Semantic partitioning</h3>
 <p>
-Consider an application that would like to maintain an aggregation of the number of profile visitors for each member. It would like to send all profile visit events for a member to a particular partition and, hence, have all updates for a member to appear in the same stream for the same consumer thread. In v0.6, we added the capability to the cluster aware producer to be able to semantically map messages to the available kafka nodes and partitions. This allows partitioning the stream of messages with some semantic partition function based on some key in the message to spread them over broker machines. The partitioning function can be customized by providing an implementation of the kafka.producer.Partitioner interface, default being the random partitioner. For the example above, the key would be member_id and the partitioning function would be hash(member_id)%num_partitions.
+Consider an application that would like to maintain an aggregation of the number of profile visitors for each member. It would like to send all profile visit events for a member to a particular partition and, hence, have all updates for a member appear in the same stream for the same consumer thread. The producer is able to semantically map messages to the available kafka nodes and partitions. This allows partitioning the stream of messages with some semantic partition function based on some key in the message to spread them over broker machines. The partitioning function can be customized by providing an implementation of the kafka.producer.Partitioner interface, the default being the random partitioner. For the example above, the key would be member_id and the partitioning function would be hash(member_id)%num_partitions.
 </p>
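
For illustration, a minimal sketch of such a member_id partitioner, assuming the 0.7 kafka.producer.Partitioner interface exposes partition(key, numPartitions):

<pre>
import kafka.producer.Partitioner;

// Routes all profile-visit events for a member to the same partition:
// effectively hash(member_id) % num_partitions, as in the example above.
public class MemberIdPartitioner implements Partitioner<String> {
    public int partition(String memberId, int numPartitions) {
        // mask the sign bit so the result is always a valid partition index
        return (memberId.hashCode() & 0x7fffffff) % numPartitions;
    }
}
</pre>

The producer would then be configured to use this class in place of the default random partitioner, with the member id supplied as the message key.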
 
 <h2>Support for Hadoop and other batch data load</h2>

Modified: incubator/kafka/site/downloads.html
URL: http://svn.apache.org/viewvc/incubator/kafka/site/downloads.html?rev=1227032&r1=1227031&r2=1227032&view=diff
==============================================================================
--- incubator/kafka/site/downloads.html (original)
+++ incubator/kafka/site/downloads.html Wed Jan  4 02:29:46 2012
@@ -2,6 +2,25 @@
 
 <h2>Downloads</h2>
 
-We have not yet done an Apache release and Apache does not allow us to host non-Apache releases on this site. You can download the previous releases <a href="http://sna-projects.com/kafka/downloads.php">here</a>.
+The current stable version is 0.7.0-incubating. You can verify your download by following these <a href="http://www.apache.org/info/verification.html">procedures</a> and using these <a href="http://svn.apache.org/repos/asf/incubator/kafka/KEYS">KEYS</a>.
+
+<h3>0.7.0 Release</h3>
+<ul>
+	<li>
+		<a href="http://people.apache.org/~nehanarkhede/kafka-0.7.0-incubating/RELEASE-NOTES.html">Release Notes</a>
+	</li>
+ 	<li>
+		Download: <a href="http://people.apache.org/~nehanarkhede/kafka-0.7.0-incubating/kafka-0.7.0-incubating-src.tar.gz">kafka-0.7.0-incubating-src.tar.gz</a> (<a href="http://people.apache.org/~nehanarkhede/kafka-0.7.0-incubating/kafka-0.7.0-incubating-src.tar.gz.asc">asc</a>, <a href="http://people.apache.org/~nehanarkhede/kafka-0.7.0-incubating/kafka-0.7.0-incubating-src.tar.gz.md5">md5</a>)
+	</li>
+</ul>
+<h3>Older Releases</h3>
+<p>
+You can download the previous releases <a href="http://sna-projects.com/kafka/downloads.php">here</a>.
+</p>
+
+<h3>Disclaimer</h3>
+<p>
+Apache Kafka is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
+</p>
 
 <!--#include virtual="includes/footer.html" -->

Modified: incubator/kafka/site/includes/header.html
URL: http://svn.apache.org/viewvc/incubator/kafka/site/includes/header.html?rev=1227032&r1=1227031&r2=1227032&view=diff
==============================================================================
--- incubator/kafka/site/includes/header.html (original)
+++ incubator/kafka/site/includes/header.html Wed Jan  4 02:29:46 2012
@@ -26,7 +26,7 @@
 			<div class="lsidebar">
 				<ul>
 					<li><a href="downloads.html">download</a></li>
-					<li><a href="api-docs/0.6">api&nbsp;docs</a></li>
+					<li><a href="http://people.apache.org/~nehanarkhede/kafka-0.7.0-incubating/docs">api&nbsp;docs</a></li>
 					<li><a href="quickstart.html">quickstart</a></li>
 					<li><a href="design.html">design</a></li>
 					<li><a href="configuration.html">configuration</a></li>

Modified: incubator/kafka/site/projects.html
URL: http://svn.apache.org/viewvc/incubator/kafka/site/projects.html?rev=1227032&r1=1227031&r2=1227032&view=diff
==============================================================================
--- incubator/kafka/site/projects.html (original)
+++ incubator/kafka/site/projects.html Wed Jan  4 02:29:46 2012
@@ -1,7 +1,10 @@
 <!--#include virtual="includes/header.html" -->
 
 <h1>Current Work</h1>
-	
+<p>
+  Here is a <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+12311720+AND+labels+%3D+newbie">list of JIRAs</a> you can work on to contribute some quick and easy patches to Kafka.
+</p>	
+
 <p>
   Below is a list of major projects we know people are currently pursuing. If you have thoughts on these or want to help, please <a href="mailto: kafka-dev@incubator.apache.org">let us know</a>.
 </p>
@@ -109,4 +112,4 @@ Currently, the <a href="http://sna-proje
 <h3>Size based retention policy</h3>
 The kafka server garbage collects logs according to a time-based retention policy (log.retention.hours). Ideally, the server should also support a size based retention policy (log.retention.size) to prevent any one topic from occupying too much disk space. Please refer to the <a href="http://linkedin.jira.com/browse/KAFKA-3">JIRA</a> to contribute.
 
-<!--#include virtual="includes/footer.html" -->
\ No newline at end of file
+<!--#include virtual="includes/footer.html" -->

Modified: incubator/kafka/site/quickstart.html
URL: http://svn.apache.org/viewvc/incubator/kafka/site/quickstart.html?rev=1227032&r1=1227031&r2=1227032&view=diff
==============================================================================
--- incubator/kafka/site/quickstart.html (original)
+++ incubator/kafka/site/quickstart.html Wed Jan  4 02:29:46 2012
@@ -4,7 +4,7 @@
 	
 <h3> Step 1: Download the code </h3>
 
-<a href="downloads" title="Kafka downloads">Download</a> a recent stable release.
+<a href="downloads.html" title="Kafka downloads">Download</a> a recent stable release.
 
 <pre>
 <b>&gt; tar xzf kafka-&lt;VERSION&gt;.tgz</b>
@@ -34,73 +34,39 @@ jkreps-mn-2:kafka-trunk jkreps$ bin/kafk
 
 <h3>Step 3: Send some messages</h3>
 
-A toy producer script is available to send plain text messages. To use it, run the following command:
+Kafka comes with a command line client that will take input from standard in and send it out as messages to the Kafka cluster. By default each line will be sent as a separate message. The topic <i>test</i> is created automatically when messages are sent to it. Ignoring the log output, you should see something like this:
 
 <pre>
-<b>&gt; bin/kafka-producer-shell.sh --server kafka://localhost:9092 --topic test</b>
-> hello
-sent: hello (14 bytes)
-> world
-sent: world (14 bytes)
+&gt; <b>bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic test</b> 
+This is a message
+This is another message
 </pre>
 
-<h3>Step 5: Start a consumer</h3>
+<h3>Step 4: Start a consumer</h3>
 
-Start a toy consumer to dump out the messages you sent to the console:
+Kafka also has a command line consumer that will dump out messages to standard out.
 
 <pre>
-<b>&gt; bin/kafka-consumer-shell.sh --topic test --props config/consumer.properties</b>
-Starting consumer...
-...
-consumed: hello
-consumed: world
+<b>&gt; bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning</b>
+This is a message
+This is another message
 </pre>
-
+<p>
 If you have each of the above commands running in a different terminal then you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.
+</p>
+<p>
+Both of these command line tools have additional options. Running the command with no arguments will display usage information documenting them in more detail.	
+</p>
 
-<h3>Step 6: Write some code</h3>
+<h3>Step 5: Write some code</h3>
 
 Below are some very simple examples of using Kafka for sending messages; more complete examples can be found in the Kafka source code in the examples/ directory.
 
 <h4>Producer Code</h4>
 
-<h5>1. Log4j appender </h5>
-
-Data can also be produced to a Kafka server in the form of a log4j appender. In this way, minimal code needs to be written in order to send some data across to the Kafka server. 
-Here is an example of how to use the Kafka Log4j appender -
-
-Start by defining the Kafka appender in your log4j.properties file.
-<pre>
-<small>// define the kafka log4j appender config parameters</small>
-log4j.appender.KAFKA=kafka.producer.KafkaLog4jAppender
-<small>// <b>REQUIRED</b>: set the hostname of the kafka server</small>
-log4j.appender.KAFKA.Host=localhost
-<small>// <b>REQUIRED</b>: set the port on which the Kafka server is listening for connections</small>
-log4j.appender.KAFKA.Port=9092
-<small>// <b>REQUIRED</b>: the topic under which the logger messages are to be posted</small>
-log4j.appender.KAFKA.Topic=test
-<small>// the serializer to be used to turn an object into a Kafka message. Defaults to kafka.producer.DefaultStringEncoder</small>
-log4j.appender.KAFKA.Serializer=kafka.test.AppenderStringSerializer
-<small>// do not set the above KAFKA appender as the root appender</small>
-log4j.rootLogger=INFO
-<small>// set the logger for your package to be the KAFKA appender</small>
-log4j.logger.your.test.package=INFO, KAFKA
-</pre>
-
-Data can be sent using a log4j appender as follows -
-
-<pre>
-Logger logger = Logger.getLogger([your.test.class])    
-logger.info("message from log4j appender");
-</pre>
-
-If your log4j appender fails to send messages, please verify that the correct 
-log4j properties file is being used. You can add 
-<code>-Dlog4j.debug=true</code> to your VM parameters to verify this.
-
-<h5>2. Producer API </h5>
+<h5>Producer API </h5>
 
-With release 0.6, we introduced a new producer API - <code>kafka.producer.Producer&lt;T&gt;</code>. Here are examples of using the producer -
+Here are examples of using the producer API - <code>kafka.producer.Producer&lt;T&gt;</code> -
 
 <ol>
 <li>First, start a local instance of the zookeeper server
@@ -209,12 +175,13 @@ ProducerData&lt;String, String&gt; data 
 producer.send(data);	
 </pre>
 </li>
-<li>Use the asynchronous producer. This buffers writes in memory until either <code>batch.size</code> or <code>queue.time</code> is reached. After that, data is sent to the Kafka brokers
+<li>Use the asynchronous producer along with GZIP compression. This buffers writes in memory until either <code>batch.size</code> or <code>queue.time</code> is reached. After that, data is sent to the Kafka brokers
 <pre>
 Properties props = new Properties();
 props.put("zk.connect"‚ "127.0.0.1:2181");
 props.put("serializer.class", "kafka.serializer.StringEncoder");
 props.put("producer.type", "async");
+props.put("compression.codec", "1");
 ProducerConfig config = new ProducerConfig(props);
 Producer&lt;String, String&gt; producer = new Producer&lt;String, String&gt;(config);
 ProducerData&lt;String, String&gt; data = new ProducerData&lt;String, String&gt;("test-topic", "test-message");
@@ -226,6 +193,40 @@ producer.send(data);
 </li>
 </ol>
 
+<h5>Log4j appender </h5>
+
+Data can also be produced to a Kafka server in the form of a log4j appender. In this way, minimal code needs to be written in order to send some data across to the Kafka server. 
+Here is an example of how to use the Kafka Log4j appender -
+
+Start by defining the Kafka appender in your log4j.properties file.
+<pre>
+<small>// define the kafka log4j appender config parameters</small>
+log4j.appender.KAFKA=kafka.producer.KafkaLog4jAppender
+<small>// <b>REQUIRED</b>: set the hostname of the kafka server</small>
+log4j.appender.KAFKA.Host=localhost
+<small>// <b>REQUIRED</b>: set the port on which the Kafka server is listening for connections</small>
+log4j.appender.KAFKA.Port=9092
+<small>// <b>REQUIRED</b>: the topic under which the logger messages are to be posted</small>
+log4j.appender.KAFKA.Topic=test
+<small>// the serializer to be used to turn an object into a Kafka message. Defaults to kafka.producer.DefaultStringEncoder</small>
+log4j.appender.KAFKA.Serializer=kafka.test.AppenderStringSerializer
+<small>// do not set the above KAFKA appender as the root appender</small>
+log4j.rootLogger=INFO
+<small>// set the logger for your package to be the KAFKA appender</small>
+log4j.logger.your.test.package=INFO, KAFKA
+</pre>
+
+Data can be sent using a log4j appender as follows -
+
+<pre>
+Logger logger = Logger.getLogger([your.test.class]);
+logger.info("message from log4j appender");
+</pre>
+
+If your log4j appender fails to send messages, please verify that the correct 
+log4j properties file is being used. You can add 
+<code>-Dlog4j.debug=true</code> to your VM parameters to verify this.
+
 <h4>Consumer Code</h4>
 
 The consumer code is slightly more complex as it enables multithreaded consumption:
@@ -242,15 +243,15 @@ ConsumerConfig consumerConfig = new Cons
 ConsumerConnector consumerConnector = Consumer.createJavaConsumerConnector(consumerConfig);
 
 // create 4 partitions of the stream for topic “test”, to allow 4 threads to consume
-Map&lt;String, List&lt;KafkaMessageStream&gt;&gt; topicMessageStreams = 
+Map&lt;String, List&lt;KafkaMessageStream&lt;Message&gt;&gt;&gt; topicMessageStreams = 
     consumerConnector.createMessageStreams(ImmutableMap.of("test", 4));
-List&lt;KafkaMessageStream&gt; streams = topicMessageStreams.get("test");
+List&lt;KafkaMessageStream&lt;Message&gt;&gt; streams = topicMessageStreams.get("test");
 
 // create list of 4 threads to consume from each of the partitions 
 ExecutorService executor = Executors.newFixedThreadPool(4);
 
 // consume the messages in the threads
-for(final KafkaMessageStream stream: streams) {
+for(final KafkaMessageStream&lt;Message&gt; stream: streams) {
   executor.submit(new Runnable() {
     public void run() {
       for(Message message: stream) {
@@ -295,10 +296,10 @@ while (true) {
 
   <small>// get the message set from the consumer and print them out</small>
   ByteBufferMessageSet messages = consumer.fetch(fetchRequest);
-  for(Message message : messages) {
-    System.out.println("consumed: " + Utils.toString(message.payload(), "UTF-8"));
+  for(MessageAndOffset msg : messages) {
+    System.out.println("consumed: " + Utils.toString(msg.message.payload(), "UTF-8"));
     <small>// advance the offset after consuming each message</small>
-    offset += MessageSet.entrySize(message);
+    offset = msg.offset;
   }
 }
 </pre>
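
For illustration, the updated fetch loop assembled into a self-contained sketch. The SimpleConsumer and FetchRequest constructor arguments, the import paths, and the method-style accessors msg.message()/msg.offset() are assumptions about the 0.7 Java API; the loop logic mirrors the change above, advancing the offset to that of each consumed message.

<pre>
import kafka.api.FetchRequest;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;
import kafka.utils.Utils;

public class SimpleConsumerSketch {
    public static void main(String[] args) {
        // example values: broker on localhost:9092, 10s socket timeout, ~1MB socket buffer
        SimpleConsumer consumer = new SimpleConsumer("localhost", 9092, 10000, 1024000);
        long offset = 0;
        while (true) {
            // fetch up to 1MB from partition 0 of topic "test", starting at the current offset
            FetchRequest fetchRequest = new FetchRequest("test", 0, offset, 1000000);
            ByteBufferMessageSet messages = consumer.fetch(fetchRequest);
            for (MessageAndOffset msg : messages) {
                System.out.println("consumed: " + Utils.toString(msg.message().payload(), "UTF-8"));
                // advance the offset past the message just consumed
                offset = msg.offset();
            }
        }
    }
}
</pre>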