Posted to commits@kafka.apache.org by jg...@apache.org on 2017/06/20 23:29:19 UTC

kafka git commit: MINOR: Add Processing Guarantees to Streams docs

Repository: kafka
Updated Branches:
  refs/heads/trunk 491774bd5 -> f28fc1100


MINOR: Add Processing Guarantees to Streams docs

Author: Guozhang Wang <wa...@gmail.com>

Reviewers: Apurva Mehta <ap...@confluent.io>, Matthias J. Sax <ma...@confluent.io>, Jason Gustafson <ja...@confluent.io>

Closes #3345 from guozhangwang/KMinor-streams-eos-docs


Project: http://git-wip-us.apache.org/repos/asf/kafka/repo
Commit: http://git-wip-us.apache.org/repos/asf/kafka/commit/f28fc110
Tree: http://git-wip-us.apache.org/repos/asf/kafka/tree/f28fc110
Diff: http://git-wip-us.apache.org/repos/asf/kafka/diff/f28fc110

Branch: refs/heads/trunk
Commit: f28fc1100b5f6e79baeb6b2f2363e86e5cd12306
Parents: 491774b
Author: Guozhang Wang <wa...@gmail.com>
Authored: Tue Jun 20 16:29:03 2017 -0700
Committer: Jason Gustafson <ja...@confluent.io>
Committed: Tue Jun 20 16:29:08 2017 -0700

----------------------------------------------------------------------
 docs/documentation.html |  6 +++---
 docs/js/templateData.js |  4 ++--
 docs/streams.html       | 41 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 41 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kafka/blob/f28fc110/docs/documentation.html
----------------------------------------------------------------------
diff --git a/docs/documentation.html b/docs/documentation.html
index 5c70e18..7f297cc 100644
--- a/docs/documentation.html
+++ b/docs/documentation.html
@@ -69,15 +69,15 @@
     <h2><a id="connect" href="#connect">8. Kafka Connect</a></h2>
     <!--#include virtual="connect.html" -->
 
-    <h2><a id="streams" href="/0102/documentation/streams">9. Kafka Streams</a></h2>
+    <h2><a id="streams" href="/0110/documentation/streams">9. Kafka Streams</a></h2>
     <p>
-        Kafka Streams is a client library for processing and analyzing data stored in Kafka and either write the resulting data back to Kafka or send the final output to an external system. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management of application state.
+        Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, exactly-once processing semantics, and simple yet efficient management of application state.
     </p>
     <p>
         Kafka Streams has a <b>low barrier to entry</b>: You can quickly write and run a small-scale proof-of-concept on a single machine; and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads. Kafka Streams transparently handles the load balancing of multiple instances of the same application by leveraging Kafka's parallelism model.
     </p>
 
-    <p>Learn More about Kafka Streams read <a href="/0102/documentation/streams">this</a> Section.</p>
+    <p>To learn more about Kafka Streams, read <a href="/0110/documentation/streams">this</a> section.</p>
 
 <!--#include virtual="../includes/_footer.htm" -->
 <!--#include virtual="../includes/_docs_footer.htm" -->

http://git-wip-us.apache.org/repos/asf/kafka/blob/f28fc110/docs/js/templateData.js
----------------------------------------------------------------------
diff --git a/docs/js/templateData.js b/docs/js/templateData.js
index b4aedf5..2f32444 100644
--- a/docs/js/templateData.js
+++ b/docs/js/templateData.js
@@ -17,6 +17,6 @@ limitations under the License.
 
 // Define variables for doc templates
 var context={
-    "version": "0102",
-    "dotVersion": "0.10.2"
+    "version": "0110",
+    "dotVersion": "0.11.0"
 };
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/kafka/blob/f28fc110/docs/streams.html
----------------------------------------------------------------------
diff --git a/docs/streams.html b/docs/streams.html
index e59c386..1f1adb1 100644
--- a/docs/streams.html
+++ b/docs/streams.html
@@ -46,19 +46,22 @@
         <h2><a id="streams_overview" href="#streams_overview">Overview</a></h2>
 
         <p>
-        Kafka Streams is a client library for processing and analyzing data stored in Kafka and either write the resulting data back to Kafka or send the final output to an external system. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management of application state.
+            Kafka Streams is a client library for processing and analyzing data stored in Kafka.
+            It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management of application state.
         </p>
         <p>
-        Kafka Streams has a <b>low barrier to entry</b>: You can quickly write and run a small-scale proof-of-concept on a single machine; and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads. Kafka Streams transparently handles the load balancing of multiple instances of the same application by leveraging Kafka's parallelism model.
+            Kafka Streams has a <b>low barrier to entry</b>: You can quickly write and run a small-scale proof-of-concept on a single machine; and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads.
+            Kafka Streams transparently handles the load balancing of multiple instances of the same application by leveraging Kafka's parallelism model.
         </p>
         <p>
-        Some highlights of Kafka Streams:
+            Some highlights of Kafka Streams:
         </p>
 
         <ul>
             <li>Designed as a <b>simple and lightweight client library</b>, which can be easily embedded in any Java application and integrated with any existing packaging, deployment and operational tools that users have for their streaming applications.</li>
             <li>Has <b>no external dependencies on systems other than Apache Kafka itself</b> as the internal messaging layer; notably, it uses Kafka's partitioning model to horizontally scale processing while maintaining strong ordering guarantees.</li>
             <li>Supports <b>fault-tolerant local state</b>, which enables very fast and efficient stateful operations like windowed joins and aggregations.</li>
+            <li>Supports <b>exactly-once</b> processing semantics to guarantee that each record will be processed once and only once, even if a failure occurs on either the Streams clients or the Kafka brokers in the middle of processing.</li>
             <li>Employs <b>one-record-at-a-time processing</b> to achieve millisecond processing latency, and supports <b>event-time based windowing operations</b> with late arrival of records.</li>
             <li>Offers necessary stream processing primitives, along with a <b>high-level Streams DSL</b> and a <b>low-level Processor API</b>.</li>
 
@@ -86,6 +89,8 @@
             <li><b>Sink Processor</b>: A sink processor is a special type of stream processor that does not have down-stream processors. It sends any received records from its up-stream processors to a specified Kafka topic.</li>
         </ul>
 
+        <p>
+            Note that normal processor nodes can also access other remote systems while processing the current record; as a result, the processed results can either be streamed back into Kafka or written to an external system.
+        </p>
+
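+        <p>
+            As a purely hypothetical sketch of this idea (the <code>ExternalClient</code> type below is a placeholder for any remote system's client and is not part of Kafka Streams), a processor implemented with the low-level Processor API could both write its results to an external system and forward them downstream:
+        </p>
+<pre>
+import org.apache.kafka.streams.processor.Processor;
+import org.apache.kafka.streams.processor.ProcessorContext;
+
+public class ExternalWriteProcessor implements Processor&lt;String, String&gt; {
+    private ProcessorContext context;
+    private ExternalClient client;               // placeholder client for a remote system
+
+    @Override
+    public void init(ProcessorContext context) {
+        this.context = context;
+        this.client = new ExternalClient();      // connect to the external system
+    }
+
+    @Override
+    public void process(String key, String value) {
+        client.write(key, value);                // write the processed result to the external system
+        context.forward(key, value);             // and/or stream it back into Kafka down-stream
+    }
+
+    @Override
+    public void punctuate(long timestamp) { }
+
+    @Override
+    public void close() {
+        client.close();
+    }
+}
+</pre>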
         <img class="centered" src="/{{version}}/images/streams-architecture-topology.jpg" style="width:400px">
 
         <p>
@@ -158,6 +163,27 @@
         </p>
         <br>
 
+        <h2><a id="streams_processing_guarantee" href="#streams_processing_guarantee">Processing Guarantees</a></h2>
+
+        <p>
+            In stream processing, one of the most frequently asked questions is "does my stream processing system guarantee that each record is processed once and only once, even if some failures are encountered in the middle of processing?"
+            Failing to guarantee exactly-once stream processing is a deal-breaker for many applications that cannot tolerate any data loss or data duplicates, and in such cases a batch-oriented framework is usually used in addition
+            to the stream processing pipeline, in what is known as the <a href="http://lambda-architecture.net/">Lambda Architecture</a>.
+            Prior to 0.11.0.0, Kafka only provided at-least-once delivery guarantees, and hence any stream processing system that leveraged it as the backend storage could not guarantee end-to-end exactly-once semantics.
+            In fact, even for those stream processing systems that claim to support exactly-once processing, as long as they are reading from / writing to Kafka as the source / sink, their applications cannot actually guarantee that
+            no duplicates will be generated throughout the pipeline.
+        </p>
+        <p>
+            Since the 0.11.0.0 release, Kafka allows its producers to send messages to different topic partitions in a <a href="https://kafka.apache.org/documentation/#semantics">transactional and idempotent manner</a>,
+            and Kafka Streams has added end-to-end exactly-once processing semantics by leveraging these features.
+            More specifically, it guarantees that for any record read from the source Kafka topics, its processing results will be reflected exactly once in the output Kafka topics as well as in the state stores for stateful operations.
+            Note that the key difference between the Kafka Streams end-to-end exactly-once guarantee and other stream processing frameworks' claimed guarantees is that Kafka Streams tightly integrates with the underlying Kafka storage system and ensures that
+            commits on the input topic offsets, updates on the state stores, and writes to the output topics are completed atomically, instead of treating Kafka as an external system that may have side effects.
+            For more details on how this is done inside Kafka Streams, see <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics">KIP-129</a>.
+        </p>
+        <p>
+            To achieve exactly-once semantics when running Kafka Streams applications, users can simply set the <code>processing.guarantee</code> config value to <b>exactly_once</b> (the default value is <b>at_least_once</b>).
+            More details can be found in the <a href="#streamsconfigs">Kafka Streams Configs</a> section.
+        </p>
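+        <p>
+            As a minimal, illustrative sketch (the application id, broker address, and topic names below are placeholders, and the single <code>stream(...).to(...)</code> topology is only an example), a Streams application can be switched to exactly-once processing by setting this one config:
+        </p>
+<pre>
+import java.util.Properties;
+import org.apache.kafka.streams.KafkaStreams;
+import org.apache.kafka.streams.StreamsConfig;
+import org.apache.kafka.streams.kstream.KStreamBuilder;
+
+Properties props = new Properties();
+props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-streams-app");       // placeholder application id
+props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker1:9092");    // placeholder broker address
+props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);  // default is at_least_once
+
+KStreamBuilder builder = new KStreamBuilder();
+builder.stream("input-topic").to("output-topic");   // placeholder topics
+
+KafkaStreams streams = new KafkaStreams(builder, props);
+streams.start();
+</pre>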
+
         <h2><a id="streams_architecture" href="#streams_architecture">Architecture</a></h2>
 
         Kafka Streams simplifies application development by building on the Kafka producer and consumer libraries and leveraging the native capabilities of
@@ -685,7 +711,11 @@ Properties settings = new Properties();
 // Set a few key parameters
 settings.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-first-streams-application");
 settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker1:9092");
-settings.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "zookeeper1:2181");
+
+// Set a few user customized parameters
+settings.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
+settings.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, MyTimestampExtractor.class);
+
 // Any further settings
 settings.put(... , ...);
 
@@ -718,7 +748,7 @@ settings.put("consumer." + ConsumerConfig.RECEIVE_BUFFER_CONFIG, 1024 * 1024);
 settings.put("producer." + ProducerConfig.RECEIVE_BUFFER_CONFIG, 64 * 1024);
 // Alternatively, you can use
 settings.put(StreamsConfig.consumerPrefix(ConsumerConfig.RECEIVE_BUFFER_CONFIG), 1024 * 1024);
-settings.put(StremasConfig.producerConfig(ProducerConfig.RECEIVE_BUFFER_CONFIG), 64 * 1024);
+settings.put(StreamsConfig.producerPrefix(ProducerConfig.RECEIVE_BUFFER_CONFIG), 64 * 1024);
 </pre>
 
         <h4><a id="streams_broker_config" href="#streams_broker_config">Broker Configuration</a></h4>
@@ -848,6 +878,7 @@ $ java -cp path-to-app-fatjar.jar com.example.MyStreamsApp
 
         <p> Updates in <code>StreamsConfig</code>: </p>
         <ul>
+          <li> new configuration parameter <code>processing.guarantee</code> was added </li>
           <li> configuration parameter <code>key.serde</code> was deprecated and replaced by <code>default.key.serde</code> </li>
           <li> configuration parameter <code>value.serde</code> was deprecated and replaced by <code>default.value.serde</code> </li>
           <li> configuration parameter <code>timestamp.extractor</code> was deprecated and replaced by <code>default.timestamp.extractor</code> </li>