Posted to commits@kafka.apache.org by jk...@apache.org on 2014/03/06 18:14:17 UTC

svn commit: r1574966 - in /kafka/site/081: configuration.html design.html introduction.html ops.html

Author: jkreps
Date: Thu Mar  6 17:14:17 2014
New Revision: 1574966

URL: http://svn.apache.org/r1574966
Log:
KAFKA-1295: Misc. typo fixes from Evan Zacks.


Modified:
    kafka/site/081/configuration.html
    kafka/site/081/design.html
    kafka/site/081/introduction.html
    kafka/site/081/ops.html

Modified: kafka/site/081/configuration.html
URL: http://svn.apache.org/viewvc/kafka/site/081/configuration.html?rev=1574966&r1=1574965&r2=1574966&view=diff
==============================================================================
--- kafka/site/081/configuration.html (original)
+++ kafka/site/081/configuration.html Thu Mar  6 17:14:17 2014
@@ -20,7 +20,7 @@ Topic-level configurations and defaults 
     <tr>
       <td>broker.id</td>
       <td></td>
-      <td>Each broker is uniquely identified by a non-negative integer id. This id serves as the brokers "name", and allows the broker to be moved to a different host/port without confusing consumers. You can choose any number you like so long as it is unique.
+      <td>Each broker is uniquely identified by a non-negative integer id. This id serves as the broker's "name" and allows the broker to be moved to a different host/port without confusing consumers. You can choose any number you like so long as it is unique.
 	</td>
     </tr>
     <tr>
@@ -239,7 +239,7 @@ Zookeeper also allows you to add a "chro
     <tr>
       <td>replica.lag.time.max.ms</td>
       <td>10000</td>
-      <td>If a follower hasn't sent any fetch requests for this window of time, the leader will remove the follower from ISR and treat it as dead.</td>
+      <td>If a follower hasn't sent any fetch requests for this window of time, the leader will remove the follower from ISR (in-sync replicas) and treat it as dead.</td>
     </tr>
     <tr>
       <td>replica.lag.max.messages</td>
@@ -301,12 +301,12 @@ Zookeeper also allows you to add a "chro
     <tr>
       <td>zookeeper.connection.timeout.ms</td>
       <td>6000</td>
-      <td>The max time that the client waits to establish a connection to zookeeper.</td>
+      <td>The maximum amount of time that the client waits to establish a connection to zookeeper.</td>
     </tr>
     <tr>
       <td>zookeeper.sync.time.ms</td>
       <td>2000</td>
-      <td>How far a ZK follower can be behind a ZK leader</td>
+      <td>How far a ZK follower can be behind a ZK leader.</td>
     </tr>
     <tr>
       <td>controlled.shutdown.enable</td>

Modified: kafka/site/081/design.html
URL: http://svn.apache.org/viewvc/kafka/site/081/design.html?rev=1574966&r1=1574965&r2=1574966&view=diff
==============================================================================
--- kafka/site/081/design.html (original)
+++ kafka/site/081/design.html Thu Mar  6 17:14:17 2014
@@ -153,14 +153,14 @@ These are not the strongest possible sem
 <p>
 Not all use cases require such strong guarantees. For uses which are latency sensitive we allow the producer to specify the durability level it desires. If the producer specifies that it wants to wait on the message being committed this can take on the order of 10 ms. However the producer can also specify that it wants to perform the send completely asynchronously or that it wants to wait only until the leader (but not necessarily the followers) have the message.
 <p>
-Now let's describe the semantics from the point-of-view of the consumer. All replicas have the exact same log with the same offsets. The consumer controls it's position in this log. If the consumer never crashed it could just store this position in memory, but if the producer fails and we want this topic partition to be taken over by another process the new process will need to choose an appropriate position from which to start processing. Let's say the consumer reads some messages it has several options for processing the messages and updating its position.
+Now let's describe the semantics from the point-of-view of the consumer. All replicas have the exact same log with the same offsets. The consumer controls its position in this log. If the consumer never crashed it could just store this position in memory, but if the producer fails and we want this topic partition to be taken over by another process the new process will need to choose an appropriate position from which to start processing. Let's say the consumer reads some messages -- it has several options for processing the messages and updating its position.
 <ol>
   <li>It can read the messages, then save its position in the log, and finally process the messages. In this case there is a possibility that the consumer process crashes after saving its position but before saving the output of its message processing. In this case the process that took over processing would start at the saved position even though a few messages prior to that position had not been processed. This corresponds to "at-most-once" semantics as in the case of a consumer failure messages may not be processed.
   <li>It can read the messages, process the messages, and finally save its position. In this case there is a possibility that the consumer process crashes after processing messages but before saving its position. In this case when the new process takes over the first few messages it receives will already have been processed. This corresponds to the "at-least-once" semantics in the case of consumer failure. In many cases messages have a primary key and so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).
-  <li>So what about exactly once semantics (i.e. the thing you actually want)? The limitation here is not actually a feature of the messaging system but rather the need to co-ordinate the consumers position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage for the consumer position and the storage of the consumers output. But this can be handled more simply and generally by simply letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As example of this our Hadoop ETL that populates data in HDFS stores its offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
+  <li>So what about exactly once semantics (i.e. the thing you actually want)? The limitation here is not actually a feature of the messaging system but rather the need to co-ordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage for the consumer position and the storage of the consumers output. But this can be handled more simply and generally by simply letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example of this, our Hadoop ETL that populates data in HDFS stores its offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
 </ol>
 <p>
-So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka gives the offset which makes implementing this straight-forward.
+So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
 
 <h3><a id="replication">4.7 Replication</a></h3>
 <p>
@@ -189,7 +189,7 @@ Kafka will remain available in the prese
 
 <h4>Replicated Logs: Quorums, ISRs, and State Machines (Oh my!)</h4>
 
-At it's heart a Kafka partition is a replicated log. The replicated log is one of the most basic primitives in distributed data systems, and there are many approaches for implementing one. A replicated log can be used by other systems as a primitive for implementing other distributed systems in the <a href="http://en.wikipedia.org/wiki/State_machine_replication">state-machine style</a>.
+At its heart a Kafka partition is a replicated log. The replicated log is one of the most basic primitives in distributed data systems, and there are many approaches for implementing one. A replicated log can be used by other systems as a primitive for implementing other distributed systems in the <a href="http://en.wikipedia.org/wiki/State_machine_replication">state-machine style</a>.
 <p>
 A replicated log models the process of coming into consensus on the order of a series of values (generally numbering the log entries 0, 1, 2, ...). There are many ways to implement this, but the simplest and fastest is with a leader who chooses the ordering of values provided to it. As long as the leader remains alive, all followers need to only copy the values and ordering, the leader chooses.
 <p>

Modified: kafka/site/081/introduction.html
URL: http://svn.apache.org/viewvc/kafka/site/081/introduction.html?rev=1574966&r1=1574965&r2=1574966&view=diff
==============================================================================
--- kafka/site/081/introduction.html (original)
+++ kafka/site/081/introduction.html Thu Mar  6 17:14:17 2014
@@ -43,7 +43,7 @@ Each partition has one server which acts
 
 <h4>Producers</h4>
 
-Producers publish data to the topics of their choice. The producer is able to chose which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message). More on the use of partitioning in a second.
+Producers publish data to the topics of their choice. The producer is able to choose which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message). More on the use of partitioning in a second.
 
 <h4>Consumers</h4>
 
@@ -53,7 +53,7 @@ Consumers label themselves with a consum
 <p>
 If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
 <p>
-If all the consumers instances have different consumer groups then this works like publish-subscribe and all messages are broadcast to all consumers. 
+If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers. 
 <p>
 More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is cluster of consumers instead of a single process.
 <p>
@@ -63,9 +63,9 @@ More commonly, however, we have found th
   A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
 </div>
 <p>
-Kafka has stronger ordering guarantees than a traditional messaging system too.
+Kafka has stronger ordering guarantees than a traditional messaging system, too.
 <p>
-A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then the server hands out messages in the order they are stored. However although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only on process to consume from a queue, but of course this means that there is no parallelism in processing.
+A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then the server hands out messages in the order they are stored. However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
 <p>
 Kafka does it better. By having a notion of parallelism&mdash;the partition&mdash;within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.
 <p>
@@ -73,10 +73,10 @@ Not that partitioning means Kafka only p
 
 <h4>Guarantees</h4>
 
-At a high-level Kafka gives the following guarantees
+At a high-level Kafka gives the following guarantees:
 <ul>
-  <li>Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a lower offset then M2 and appear earlier in the log.
+  <li>Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
   <li>A consumer instance sees messages in the order they are stored in the log.
   <li>For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log.
 </ul>
-More details on these guarantees are given in the design section of the documentation.
\ No newline at end of file
+More details on these guarantees are given in the design section of the documentation.

Modified: kafka/site/081/ops.html
URL: http://svn.apache.org/viewvc/kafka/site/081/ops.html?rev=1574966&r1=1574965&r2=1574966&view=diff
==============================================================================
--- kafka/site/081/ops.html (original)
+++ kafka/site/081/ops.html Thu Mar  6 17:14:17 2014
@@ -26,7 +26,7 @@ The most important producer configuratio
 </ul>
 The most important consumer configuration is the fetch size.
 <p>
-All configurations are documented in the <a href="configuration.html">configuration</a> page.
+All configurations are documented in the <a href="#configuration">configuration</a> section.
 <p>
 <h4><a id="prodconfig">A Production Server Config</a></h4>
 Here is our server production server configuration: