Posted to commits@samza.apache.org by ma...@apache.org on 2014/06/14 00:21:08 UTC
svn commit: r1602533 [2/5] - in /incubator/samza/site: ./ community/ contribute/ learn/documentation/0.7.0/ learn/documentation/0.7.0/api/ learn/documentation/0.7.0/comparisons/ learn/documentation/0.7.0/container/ learn/documentation/0.7.0/introductio...

Modified: incubator/samza/site/learn/documentation/0.7.0/comparisons/mupd8.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/comparisons/mupd8.html?rev=1602533&r1=1602532&r2=1602533&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/comparisons/mupd8.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/comparisons/mupd8.html Fri Jun 13 22:21:06 2014
@@ -1,4 +1,20 @@
 <!DOCTYPE html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 <html lang="en">
   <head>
     <meta charset="utf-8">
@@ -70,29 +86,63 @@
           </div>
 
           <div class="content">
-            <h2>MUPD8</h2>
+            <!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<h2>MUPD8</h2>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 
-<p><em>People generally want to know how similar systems compare. We&#39;ve done our best to fairly contrast the feature sets of Samza with other systems. But we aren&#39;t experts in these frameworks, and we are, of course, totally biased. If we have goofed anything, please let us know and we will correct it.</em></p>
+<p><em>People generally want to know how similar systems compare. We&rsquo;ve done our best to fairly contrast the feature sets of Samza with other systems. But we aren&rsquo;t experts in these frameworks, and we are, of course, totally biased. If we have goofed anything, please let us know and we will correct it.</em></p>
 
-<h3>Durability</h3>
+<h3 id="toc_0">Durability</h3>
 
 <p>MUPD8 makes no durability or delivery guarantees. Within MUPD8, stream processor tasks receive messages at most once. Samza uses Kafka for messaging, which guarantees message delivery.</p>
 
-<h3>Ordering</h3>
+<h3 id="toc_1">Ordering</h3>
 
 <p>As with durability, developers would ideally like their stream processors to receive messages in exactly the order that they were written.</p>
 
-<p>We don&#39;t entirely follow MUPD8&#39;s description of their ordering guarantees, but it seems to guarantee that all messages will be processed in the order in which they are written to MUPD8 queues, which is comparable to Kafka and Samza&#39;s guarantee.</p>
+<p>We don&rsquo;t entirely follow MUPD8&rsquo;s description of their ordering guarantees, but it seems to guarantee that all messages will be processed in the order in which they are written to MUPD8 queues, which is comparable to Kafka and Samza&rsquo;s guarantee.</p>
 
-<h3>Buffering</h3>
+<h3 id="toc_2">Buffering</h3>
 
 <p>A critical issue for handling large data flows is handling back pressure when one downstream processing stage gets slow.</p>
 
 <p>MUPD8 buffers messages in an in-memory queue when passing messages between two MUPD8 tasks. When a queue fills up, developers have the option to either drop the messages on the floor, log the messages to local disk, or block until the queue frees up. All of these options are sub-optimal. Dropping messages leads to incorrect results. Blocking your stream processor can have a cascading effect, where the slowest processor blocks all upstream processors, which in turn block their upstream processors, until the whole system grinds to a halt. Logging to local disk is the most reasonable, but when a fault occurs, those messages are lost on failover.</p>
 
-<p>By adopting Kafka&#39;s broker as a remote buffer, Samza solves all of these problems. It doesn&#39;t need to block because consumers and producers are decoupled using the Kafka brokers&#39; disks as buffers. Messages are not dropped because Kafka brokers are highly available as of version 0.8. In the event of a failure, when a Samza job is restarted on another machine, its input and output are not lost, because they are stored remotely on replicated Kafka brokers.</p>
+<p>By adopting Kafka&rsquo;s broker as a remote buffer, Samza solves all of these problems. It doesn&rsquo;t need to block because consumers and producers are decoupled using the Kafka brokers&#39; disks as buffers. Messages are not dropped because Kafka brokers are highly available as of version 0.8. In the event of a failure, when a Samza job is restarted on another machine, its input and output are not lost, because they are stored remotely on replicated Kafka brokers.</p>
 
-<h3>State Management</h3>
+<h3 id="toc_3">State Management</h3>
 
 <p>As described in the <a href="introduction.html#state">introduction</a>, stream processors often need to maintain some state as they process messages. Different frameworks have different approaches to handling such state, and what to do in case of a failure.</p>
 
@@ -100,21 +150,21 @@
 
 <p>Samza maintains state locally with the task. This allows state larger than will fit in memory. State is persisted to an output stream to enable recovery should the task fail. We believe this design enables stronger fault tolerance semantics, because the change log captures the evolution of state, allowing the state of a task to be restored to a consistent point in time.</p>
 
-<h3>Deployment and execution</h3>
+<h3 id="toc_4">Deployment and execution</h3>
 
-<p>MUPD8 includes a custom execution framework. The functionality that this framework supports in terms of users and resource limits isn&#39;t clear to us.</p>
+<p>MUPD8 includes a custom execution framework. The functionality that this framework supports in terms of users and resource limits isn&rsquo;t clear to us.</p>
 
 <p>Samza leverages YARN to deploy user code, and execute it in a distributed environment.</p>
 
-<h3>Fault Tolerance</h3>
+<h3 id="toc_5">Fault Tolerance</h3>
 
 <p>What should a stream processing system do when a machine or processor fails?</p>
 
-<p>MUPD8 uses its custom equivalent to YARN to manage fault tolerance. When a stream processor is unable to send a message to a downstream processor, it notifies MUPD8&#39;s coordinator, and all other machines are notified. The machines then send all messages to a new machine based on the key hash that&#39;s used. Messages and state can be lost when this happens.</p>
+<p>MUPD8 uses its custom equivalent to YARN to manage fault tolerance. When a stream processor is unable to send a message to a downstream processor, it notifies MUPD8&rsquo;s coordinator, and all other machines are notified. The machines then send all messages to a new machine based on the key hash that&rsquo;s used. Messages and state can be lost when this happens.</p>
 
-<p>Samza uses YARN to manage fault tolerance. YARN detects when nodes or Samza tasks fail, and notifies Samza&#39;s <a href="../yarn/application-master.html">ApplicationMaster</a>. At that point, it&#39;s up to Samza to decide what to do. Generally, this means re-starting the task on another machine. Since messages are persisted to Kafka brokers remotely, and there are no in-memory queues, no messages should be lost (unless the processors are using async Kafka producers, which offer higher performance but don&#39;t wait for messages to be committed).</p>
+<p>Samza uses YARN to manage fault tolerance. YARN detects when nodes or Samza tasks fail, and notifies Samza&rsquo;s <a href="../yarn/application-master.html">ApplicationMaster</a>. At that point, it&rsquo;s up to Samza to decide what to do. Generally, this means re-starting the task on another machine. Since messages are persisted to Kafka brokers remotely, and there are no in-memory queues, no messages should be lost (unless the processors are using async Kafka producers, which offer higher performance but don&rsquo;t wait for messages to be committed).</p>
 
-<h3>Workflow</h3>
+<h3 id="toc_6">Workflow</h3>
 
 <p>Sometimes more than one job or processing stage is needed to accomplish something. This is the case where you wish to re-partition a stream, for example. MUPD8 has a custom workflow system set up to define how to execute multiple jobs at once, and how to feed stream data from one into the other.</p>
 
@@ -122,23 +172,23 @@
 
 <p>This was motivated by our experience with Hadoop, where the data flow between jobs is implicitly defined by their input and output directories. This decentralized model has proven itself to scale well to a large organization.</p>
 
-<h3>Memory</h3>
+<h3 id="toc_7">Memory</h3>
 
 <p>MUPD8 executes all of its map/update processors inside a single JVM, using threads. This is memory-efficient, as the JVM memory overhead is shared across the threads.</p>
 
 <p>Samza uses a separate JVM for each <a href="../container/samza-container.html">stream processor container</a>. This has the disadvantage of using more memory compared to running multiple stream processing threads within a single JVM. However, the advantage is improved isolation between tasks, which can make them more reliable.</p>
 
-<h3>Isolation</h3>
+<h3 id="toc_8">Isolation</h3>
 
 <p>MUPD8 provides no resource isolation between stream processors. A single badly behaved stream processor can bring down all processors on the node.</p>
 
-<p>Samza uses process level isolation between stream processor tasks, similarly to Hadoop&#39;s approach. We can enforce strict per-process memory limits. In addition, Samza supports CPU limits when used with YARN cgroups. As the YARN support for cgroups develops further, it should also become possible to support disk and network cgroup limits.</p>
+<p>Samza uses process level isolation between stream processor tasks, similarly to Hadoop&rsquo;s approach. We can enforce strict per-process memory limits. In addition, Samza supports CPU limits when used with YARN cgroups. As the YARN support for cgroups develops further, it should also become possible to support disk and network cgroup limits.</p>
 
-<h3>Further Reading</h3>
+<h3 id="toc_9">Further Reading</h3>
 
 <p>The MUPD8 team has published a very good <a href="http://vldb.org/pvldb/vol5/p1814_wanglam_vldb2012.pdf">paper</a> on the design of their system.</p>
 
-<h2><a href="storm.html">Storm &raquo;</a></h2>
+<h2 id="toc_10"><a href="storm.html">Storm &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html?rev=1602533&r1=1602532&r2=1602533&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html Fri Jun 13 22:21:06 2014
@@ -1,4 +1,20 @@
 <!DOCTYPE html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 <html lang="en">
   <head>
     <meta charset="utf-8">
@@ -70,113 +86,147 @@
           </div>
 
           <div class="content">
-            <h2>Storm</h2>
+            <!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<h2>Storm</h2>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 
-<p><em>People generally want to know how similar systems compare. We&#39;ve done our best to fairly contrast the feature sets of Samza with other systems. But we aren&#39;t experts in these frameworks, and we are, of course, totally biased. If we have goofed anything, please let us know and we will correct it.</em></p>
+<p><em>People generally want to know how similar systems compare. We&rsquo;ve done our best to fairly contrast the feature sets of Samza with other systems. But we aren&rsquo;t experts in these frameworks, and we are, of course, totally biased. If we have goofed anything, please let us know and we will correct it.</em></p>
 
 <p><a href="http://storm-project.net/">Storm</a> and Samza are fairly similar. Both systems provide many of the same high-level features: a partitioned stream model, a distributed execution environment, an API for stream processing, fault tolerance, Kafka integration, etc.</p>
 
-<p>Storm and Samza use different words for similar concepts: <em>spouts</em> in Storm are similar to stream consumers in Samza, <em>bolts</em> are similar to tasks, and <em>tuples</em> are similar to messages in Samza. Storm also has some additional building blocks which don&#39;t have direct equivalents in Samza.</p>
+<p>Storm and Samza use different words for similar concepts: <em>spouts</em> in Storm are similar to stream consumers in Samza, <em>bolts</em> are similar to tasks, and <em>tuples</em> are similar to messages in Samza. Storm also has some additional building blocks which don&rsquo;t have direct equivalents in Samza.</p>
 
-<h3>Ordering and Guarantees</h3>
+<h3 id="toc_0">Ordering and Guarantees</h3>
 
 <p>Storm allows you to choose the level of guarantee with which you want your messages to be processed:</p>
 
 <ul>
 <li>The simplest mode is <em>at-most-once delivery</em>, which drops messages if they are not processed correctly, or if the machine doing the processing fails. This mode requires no special logic, and processes messages in the order they were produced by the spout.</li>
-<li>There is also <em>at-least-once delivery</em>, which tracks whether each input tuple (and any downstream tuples it generated) was successfully processed within a configured timeout, by keeping an in-memory record of all emitted tuples. Any tuples that are not fully processed within the timeout are re-emitted by the spout. This implies that a bolt may see the same tuple more than once, and that messages can be processed out-of-order. This mechanism also requires some co-operation from the user code, which must maintain the ancestry of records in order to properly acknowledge its input. This is explained in depth on <a href="https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing">Storm&#39;s wiki</a>.</li>
-<li>Finally, Storm offers <em>exactly-once semantics</em> using its <a href="https://github.com/nathanmarz/storm/wiki/Trident-tutorial">Trident</a> abstraction. This mode uses the same failure detection mechanism as the at-least-once mode. Tuples are actually processed at least once, but Storm&#39;s state implementation allows duplicates to be detected and ignored. (The duplicate detection only applies to state managed by Storm. If your code has other side-effects, e.g. sending messages to a service outside of the topology, it will not have exactly-once semantics.) In this mode, the spout breaks the input stream into batches, and processes batches in strictly sequential order.</li>
+<li>There is also <em>at-least-once delivery</em>, which tracks whether each input tuple (and any downstream tuples it generated) was successfully processed within a configured timeout, by keeping an in-memory record of all emitted tuples. Any tuples that are not fully processed within the timeout are re-emitted by the spout. This implies that a bolt may see the same tuple more than once, and that messages can be processed out-of-order. This mechanism also requires some co-operation from the user code, which must maintain the ancestry of records in order to properly acknowledge its input. This is explained in depth on <a href="https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing">Storm&rsquo;s wiki</a>.</li>
+<li>Finally, Storm offers <em>exactly-once semantics</em> using its <a href="https://github.com/nathanmarz/storm/wiki/Trident-tutorial">Trident</a> abstraction. This mode uses the same failure detection mechanism as the at-least-once mode. Tuples are actually processed at least once, but Storm&rsquo;s state implementation allows duplicates to be detected and ignored. (The duplicate detection only applies to state managed by Storm. If your code has other side-effects, e.g. sending messages to a service outside of the topology, it will not have exactly-once semantics.) In this mode, the spout breaks the input stream into batches, and processes batches in strictly sequential order.</li>
 </ul>
 
-<p>Samza also offers guaranteed delivery &mdash; currently only at-least-once delivery, but support for exactly-once semantics is planned. Within each stream partition, Samza always processes messages in the order they appear in the partition, but there is no guarantee of ordering across different input streams or partitions. This model allows Samza to offer at-least-once delivery without the overhead of ancestry tracking. In Samza, there would be no performance advantage to using at-most-once delivery (i.e. dropping messages on failure), which is why we don&#39;t offer that mode &mdash; message delivery is always guaranteed.</p>
+<p>Samza also offers guaranteed delivery &mdash; currently only at-least-once delivery, but support for exactly-once semantics is planned. Within each stream partition, Samza always processes messages in the order they appear in the partition, but there is no guarantee of ordering across different input streams or partitions. This model allows Samza to offer at-least-once delivery without the overhead of ancestry tracking. In Samza, there would be no performance advantage to using at-most-once delivery (i.e. dropping messages on failure), which is why we don&rsquo;t offer that mode &mdash; message delivery is always guaranteed.</p>
 
 <p>Moreover, because Samza never processes messages in a partition out-of-order, it is better suited for handling keyed data. For example, if you have a stream of database updates &mdash; where later updates may replace earlier updates &mdash; then reordering the messages may change the final result. Provided that all updates for the same key appear in the same stream partition, Samza is able to guarantee a consistent state.</p>
 
-<h3>State Management</h3>
+<h3 id="toc_1">State Management</h3>
 
-<p>Storm&#39;s lower-level API of bolts does not offer any help for managing state in a stream process. A bolt can maintain in-memory state (which is lost if that bolt dies), or it can make calls to a remote database to read and write state. However, a topology can usually process messages at a much higher rate than calls to a remote database can be made, so making a remote call for each message quickly becomes a bottleneck.</p>
+<p>Storm&rsquo;s lower-level API of bolts does not offer any help for managing state in a stream process. A bolt can maintain in-memory state (which is lost if that bolt dies), or it can make calls to a remote database to read and write state. However, a topology can usually process messages at a much higher rate than calls to a remote database can be made, so making a remote call for each message quickly becomes a bottleneck.</p>
 
 <p>As part of its higher-level Trident API, Storm offers automatic <a href="https://github.com/nathanmarz/storm/wiki/Trident-state">state management</a>. It keeps state in memory, and periodically checkpoints it to a remote database (e.g. Cassandra) for durability, so the cost of the remote database call is amortized over several processed tuples. By maintaining metadata alongside the state, Trident is able to achieve exactly-once processing semantics &mdash; for example, if you are counting events, this mechanism allows the counters to be correct, even when machines fail and tuples are replayed.</p>
 
-<p>Storm&#39;s approach of caching and batching state changes works well if the amount of state in each bolt is fairly small &mdash; perhaps less than 100kB. That makes it suitable for keeping track of counters, minimum, maximum and average values of a metric, and the like. However, if you need to maintain a large amount of state, this approach essentially degrades to making a database call per processed tuple, with the associated performance cost.</p>
+<p>Storm&rsquo;s approach of caching and batching state changes works well if the amount of state in each bolt is fairly small &mdash; perhaps less than 100kB. That makes it suitable for keeping track of counters, minimum, maximum and average values of a metric, and the like. However, if you need to maintain a large amount of state, this approach essentially degrades to making a database call per processed tuple, with the associated performance cost.</p>
 
 <p>Samza takes a <a href="../container/state-management.html">completely different approach</a> to state management. Rather than using a remote database for durable storage, each Samza task includes an embedded key-value store, located on the same machine. Reads and writes to this store are very fast, even when the contents of the store are larger than the available memory. Changes to this key-value store are replicated to other machines in the cluster, so that if one machine dies, the state of the tasks it was running can be restored on another machine.</p>
 
 <p>By co-locating storage and processing on the same machine, Samza is able to achieve very high throughput, even when there is a large amount of state. This is necessary if you want to perform stateful operations that are not just counters. For example, if you want to perform a window join of multiple streams, or join a stream with a database table (replicated to Samza through a changelog), or group several related messages into a bigger message, then you need to maintain so much state that it is much more efficient to keep the state local to the task.</p>
 
-<p>A limitation of Samza&#39;s state handling is that it currently does not support exactly-once semantics &mdash; only at-least-once is supported right now. But we&#39;re working on fixing that, so stay tuned for updates.</p>
+<p>A limitation of Samza&rsquo;s state handling is that it currently does not support exactly-once semantics &mdash; only at-least-once is supported right now. But we&rsquo;re working on fixing that, so stay tuned for updates.</p>
 
-<h3>Partitioning and Parallelism</h3>
+<h3 id="toc_2">Partitioning and Parallelism</h3>
 
-<p>Storm&#39;s <a href="https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology">parallelism model</a> is fairly similar to Samza&#39;s. Both frameworks split processing into independent <em>tasks</em> that can run in parallel. Resource allocation is independent of the number of tasks: a small job can keep all tasks in a single process on a single machine; a large job can spread the tasks over many processes on many machines.</p>
+<p>Storm&rsquo;s <a href="https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology">parallelism model</a> is fairly similar to Samza&rsquo;s. Both frameworks split processing into independent <em>tasks</em> that can run in parallel. Resource allocation is independent of the number of tasks: a small job can keep all tasks in a single process on a single machine; a large job can spread the tasks over many processes on many machines.</p>
 
-<p>The biggest difference is that Storm uses one thread per task by default, whereas Samza uses single-threaded processes (containers). A Samza container may contain multiple tasks, but there is only one thread that invokes each of the tasks in turn. This means each container is mapped to exactly one CPU core, which makes the resource model much simpler and reduces interference from other tasks running on the same machine. Storm&#39;s multithreaded model can make better use of excess capacity on an idle machine, at the cost of a less predictable resource model.</p>
+<p>The biggest difference is that Storm uses one thread per task by default, whereas Samza uses single-threaded processes (containers). A Samza container may contain multiple tasks, but there is only one thread that invokes each of the tasks in turn. This means each container is mapped to exactly one CPU core, which makes the resource model much simpler and reduces interference from other tasks running on the same machine. Storm&rsquo;s multithreaded model can make better use of excess capacity on an idle machine, at the cost of a less predictable resource model.</p>
 
-<p>Storm supports <em>dynamic rebalancing</em>, which means adding more threads or processes to a topology without restarting the topology or cluster. This is a convenient feature, especially during development. We haven&#39;t added this to Samza: philosophically we feel that this kind of change should go through a normal configuration management process (i.e. version control, notification, etc.) as it impacts production performance. In other words, the code and configuration of the jobs should fully recreate the state of the cluster.</p>
+<p>Storm supports <em>dynamic rebalancing</em>, which means adding more threads or processes to a topology without restarting the topology or cluster. This is a convenient feature, especially during development. We haven&rsquo;t added this to Samza: philosophically we feel that this kind of change should go through a normal configuration management process (i.e. version control, notification, etc.) as it impacts production performance. In other words, the code and configuration of the jobs should fully recreate the state of the cluster.</p>
 
-<p>When using a transactional spout with Trident (a requirement for achieving exactly-once semantics), parallelism is potentially reduced. Trident relies on a global ordering in its input streams &mdash; that is, ordering across all partitions of a stream, not just within one partition. This means that the topology&#39;s input stream has to go through a single spout instance, effectively ignoring the partitioning of the input stream. This spout may become a bottleneck on high-volume streams. In Samza, all stream processing is parallel &mdash; there are no such choke points.</p>
+<p>When using a transactional spout with Trident (a requirement for achieving exactly-once semantics), parallelism is potentially reduced. Trident relies on a global ordering in its input streams &mdash; that is, ordering across all partitions of a stream, not just within one partition. This means that the topology&rsquo;s input stream has to go through a single spout instance, effectively ignoring the partitioning of the input stream. This spout may become a bottleneck on high-volume streams. In Samza, all stream processing is parallel &mdash; there are no such choke points.</p>
 
-<h3>Deployment &amp; Execution</h3>
+<h3 id="toc_3">Deployment &amp; Execution</h3>
 
-<p>A Storm cluster is composed of a set of nodes running a <em>Supervisor</em> daemon. The supervisor daemons talk to a single master node running a daemon called <em>Nimbus</em>. The Nimbus daemon is responsible for assigning work and managing resources in the cluster. See Storm&#39;s <a href="https://github.com/nathanmarz/storm/wiki/Tutorial">Tutorial</a> page for details. This is quite similar to YARN; though YARN is a bit more fully featured and intended to be multi-framework, Nimbus is better integrated with Storm.</p>
+<p>A Storm cluster is composed of a set of nodes running a <em>Supervisor</em> daemon. The supervisor daemons talk to a single master node running a daemon called <em>Nimbus</em>. The Nimbus daemon is responsible for assigning work and managing resources in the cluster. See Storm&rsquo;s <a href="https://github.com/nathanmarz/storm/wiki/Tutorial">Tutorial</a> page for details. This is quite similar to YARN; though YARN is a bit more fully featured and intended to be multi-framework, Nimbus is better integrated with Storm.</p>
 
 <p>Yahoo! has also released <a href="https://github.com/yahoo/storm-yarn">Storm-YARN</a>. As described in <a href="http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html">this Yahoo! blog post</a>, Storm-YARN is a wrapper that starts a single Storm cluster (complete with Nimbus and Supervisors) inside a YARN grid.</p>
 
-<p>There are a lot of similarities between Storm&#39;s Nimbus and YARN&#39;s ResourceManager, as well as between Storm&#39;s Supervisor and YARN&#39;s Node Managers. Rather than writing our own resource management framework, or running a second one inside of YARN, we decided that Samza should use YARN directly, as a first-class citizen in the YARN ecosystem. YARN is stable, well adopted, fully-featured, and inter-operable with Hadoop. It also provides a bunch of nice features like security (user authentication), cgroup process isolation, etc.</p>
+<p>There are a lot of similarities between Storm&rsquo;s Nimbus and YARN&rsquo;s ResourceManager, as well as between Storm&rsquo;s Supervisor and YARN&rsquo;s Node Managers. Rather than writing our own resource management framework, or running a second one inside of YARN, we decided that Samza should use YARN directly, as a first-class citizen in the YARN ecosystem. YARN is stable, well adopted, fully-featured, and inter-operable with Hadoop. It also provides a bunch of nice features like security (user authentication), cgroup process isolation, etc.</p>
 
 <p>The YARN support in Samza is pluggable, so you can swap it for a different execution framework if you wish.</p>
 
-<h3>Language Support</h3>
+<h3 id="toc_4">Language Support</h3>
 
 <p>Storm is written in Java and Clojure but has good support for non-JVM languages. It follows a model similar to MapReduce Streaming: the non-JVM task is launched in a separate process, data is sent to its stdin, and output is read from its stdout.</p>
 
 <p>Samza is written in Java and Scala. It is built with multi-language support in mind, but currently only supports JVM languages.</p>
 
-<h3>Workflow</h3>
+<h3 id="toc_5">Workflow</h3>
 
 <p>Storm provides modeling of <em>topologies</em> (a processing graph of multiple stages) <a href="https://github.com/nathanmarz/storm/wiki/Tutorial">in code</a>. Trident provides a further <a href="https://github.com/nathanmarz/storm/wiki/Trident-tutorial">higher-level API</a> on top of this, including familiar relational-like operators such as filters, grouping, aggregation and joins. This means the entire topology is wired up in one place, which has the advantage that it is documented in code, but has the disadvantage that the entire topology needs to be developed and deployed as a whole.</p>
 
 <p>In Samza, each job is an independent entity. You can define multiple jobs in a single codebase, or you can have separate teams working on different jobs using different codebases. Each job is deployed, started and stopped independently. Jobs communicate only through named streams, and you can add jobs to the system without affecting any other jobs. This makes Samza well suited for handling the data flow in a large company.</p>
 
-<p>Samza&#39;s approach can be emulated in Storm by connecting two separate topologies via a broker, such as Kafka. However, Storm&#39;s implementation of exactly-once semantics only works within a single topology.</p>
+<p>Samza&rsquo;s approach can be emulated in Storm by connecting two separate topologies via a broker, such as Kafka. However, Storm&rsquo;s implementation of exactly-once semantics only works within a single topology.</p>
 
-<h3>Maturity</h3>
+<h3 id="toc_6">Maturity</h3>
 
-<p>We can&#39;t speak to Storm&#39;s maturity, but it has an <a href="https://github.com/nathanmarz/storm/wiki/Powered-By">impressive number of adopters</a>, a strong feature set, and seems to be under active development. It integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).</p>
+<p>We can&rsquo;t speak to Storm&rsquo;s maturity, but it has an <a href="https://github.com/nathanmarz/storm/wiki/Powered-By">impressive number of adopters</a>, a strong feature set, and seems to be under active development. It integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).</p>
 
-<p>Samza is pretty immature, though it builds on solid components. YARN is fairly new, but is already being run on 3000+ node clusters at Yahoo!, and the project is under active development by both <a href="http://hortonworks.com/">Hortonworks</a> and <a href="http://www.cloudera.com/content/cloudera/en/home.html">Cloudera</a>. Kafka has a strong <a href="https://cwiki.apache.org/KAFKA/powered-by.html">powered by</a> page, and has seen increased adoption recently. It&#39;s also frequently used with Storm. Samza is a brand new project that is in use at LinkedIn. Our hope is that others will find it useful, and adopt it as well.</p>
+<p>Samza is pretty immature, though it builds on solid components. YARN is fairly new, but is already being run on 3000+ node clusters at Yahoo!, and the project is under active development by both <a href="http://hortonworks.com/">Hortonworks</a> and <a href="http://www.cloudera.com/content/cloudera/en/home.html">Cloudera</a>. Kafka has a strong <a href="https://cwiki.apache.org/KAFKA/powered-by.html">powered by</a> page, and has seen increased adoption recently. It&rsquo;s also frequently used with Storm. Samza is a brand new project that is in use at LinkedIn. Our hope is that others will find it useful, and adopt it as well.</p>
 
-<h3>Buffering &amp; Latency</h3>
+<h3 id="toc_7">Buffering &amp; Latency</h3>
 
 <p>Storm uses <a href="http://zeromq.org/">ZeroMQ</a> for non-durable communication between bolts, which enables extremely low latency transmission of tuples. Samza does not have an equivalent mechanism, and always writes task output to a stream.</p>
 
-<p>On the flip side, when a bolt is trying to send messages using ZeroMQ, and the consumer can&#39;t read them fast enough, the ZeroMQ buffer in the producer&#39;s process begins to fill up with messages. If this buffer grows too much, the topology&#39;s processing timeout may be reached, which causes messages to be re-emitted at the spout and makes the problem worse by adding even more messages to the buffer. In order to prevent such overflow, you can configure a maximum number of messages that can be in flight in the topology at any one time; when that threshold is reached, the spout blocks until some of the messages in flight are fully processed. This mechanism allows back pressure, but requires <a href="http://nathanmarz.github.io/storm/doc/backtype/storm/Config.html#TOPOLOGY_MAX_SPOUT_PENDING">topology.max.spout.pending</a> to be carefully configured. If a single bolt in a topology starts running slow, the processing in the entire topology grinds to a halt.</p>
+<p>On the flip side, when a bolt is trying to send messages using ZeroMQ, and the consumer can&rsquo;t read them fast enough, the ZeroMQ buffer in the producer&rsquo;s process begins to fill up with messages. If this buffer grows too much, the topology&rsquo;s processing timeout may be reached, which causes messages to be re-emitted at the spout and makes the problem worse by adding even more messages to the buffer. In order to prevent such overflow, you can configure a maximum number of messages that can be in flight in the topology at any one time; when that threshold is reached, the spout blocks until some of the messages in flight are fully processed. This mechanism allows back pressure, but requires <a href="http://nathanmarz.github.io/storm/doc/backtype/storm/Config.html#TOPOLOGY_MAX_SPOUT_PENDING">topology.max.spout.pending</a> to be carefully configured. If a single bolt in a topology starts running slow, the processing in the entire topology grinds to a halt.</p>
 
-<p>A lack of a broker between bolts also adds complexity when trying to deal with fault tolerance and messaging semantics.  Storm has a <a href="https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing">clever mechanism</a> for detecting tuples that failed to be processed, but Samza doesn&#39;t need such a mechanism because every input and output stream is fault-tolerant and replicated.</p>
+<p>A lack of a broker between bolts also adds complexity when trying to deal with fault tolerance and messaging semantics.  Storm has a <a href="https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing">clever mechanism</a> for detecting tuples that failed to be processed, but Samza doesn&rsquo;t need such a mechanism because every input and output stream is fault-tolerant and replicated.</p>
 
 <p>Samza takes a different approach to buffering. We buffer to disk at every hop between StreamTasks. This decision, and its trade-offs, are described in detail on the <a href="introduction.html">Comparison Introduction</a> page. This design decision makes durability guarantees easy, and has the advantage of allowing the buffer to absorb a large backlog of messages if a job has fallen behind in its processing. However, it comes at the price of slightly higher latency.</p>
 
-<p>As described in the <em>workflow</em> section above, Samza&#39;s approach can be emulated in Storm, but comes with a loss in functionality.</p>
+<p>As described in the <em>workflow</em> section above, Samza&rsquo;s approach can be emulated in Storm, but comes with a loss in functionality.</p>
 
-<h3>Isolation</h3>
+<h3 id="toc_8">Isolation</h3>
 
-<p>Storm provides standard UNIX process-level isolation. Your topology can impact another topology&#39;s performance (or vice-versa) if too much CPU, disk, network, or memory is used.</p>
+<p>Storm provides standard UNIX process-level isolation. Your topology can impact another topology&rsquo;s performance (or vice-versa) if too much CPU, disk, network, or memory is used.</p>
 
 <p>Samza relies on YARN to provide resource-level isolation. Currently, YARN provides explicit controls for memory and CPU limits (through <a href="../yarn/isolation.html">cgroups</a>), and both have been used successfully with Samza. No isolation for disk or network is provided by YARN at this time.</p>
 
-<h3>Distributed RPC</h3>
+<h3 id="toc_9">Distributed RPC</h3>
 
 <p>In Storm, you can write topologies which not only accept a stream of fixed events, but also allow clients to run distributed computations on demand. The query is sent into the topology as a tuple on a special spout, and when the topology has computed the answer, it is returned to the client (who was synchronously waiting for the answer). This facility is called <a href="https://github.com/nathanmarz/storm/wiki/Distributed-RPC">Distributed RPC</a> (DRPC).</p>
 
-<p>Samza does not currently have an equivalent API to DRPC, but you can build it yourself using Samza&#39;s stream processing primitives.</p>
+<p>Samza does not currently have an equivalent API to DRPC, but you can build it yourself using Samza&rsquo;s stream processing primitives.</p>
 
-<h3>Data Model</h3>
+<h3 id="toc_10">Data Model</h3>
 
 <p>Storm models all messages as <em>tuples</em> with a defined data model but pluggable serialization.</p>
 
-<p>Samza&#39;s serialization and data model are both pluggable. We are not terribly opinionated about which approach is best.</p>
+<p>Samza&rsquo;s serialization and data model are both pluggable. We are not terribly opinionated about which approach is best.</p>
 
-<h2><a href="../api/overview.html">API Overview &raquo;</a></h2>
+<h2 id="toc_11"><a href="../api/overview.html">API Overview &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html?rev=1602533&r1=1602532&r2=1602533&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html Fri Jun 13 22:21:06 2014
@@ -1,4 +1,20 @@
 <!DOCTYPE html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 <html lang="en">
   <head>
     <meta charset="utf-8">
@@ -70,9 +86,43 @@
           </div>
 
           <div class="content">
-            <h2>Checkpointing</h2>
+            <!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<h2>Checkpointing</h2>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 
-<p>Samza provides fault-tolerant processing of streams: Samza guarantees that messages won&#39;t be lost, even if your job crashes, if a machine dies, if there is a network fault, or something else goes wrong. In order to provide this guarantee, Samza expects the <a href="streams.html">input system</a> to meet the following requirements:</p>
+<p>Samza provides fault-tolerant processing of streams: Samza guarantees that messages won&rsquo;t be lost, even if your job crashes, if a machine dies, if there is a network fault, or something else goes wrong. In order to provide this guarantee, Samza expects the <a href="streams.html">input system</a> to meet the following requirements:</p>
 
 <ul>
 <li>The stream may be sharded into one or more <em>partitions</em>. Each partition is independent from the others, and is replicated across multiple machines (the stream continues to be available, even if a machine fails).</li>
@@ -88,11 +138,11 @@
 
 <p><img src="/img/0.7.0/learn/documentation/container/checkpointing.svg" alt="Illustration of checkpointing" class="diagram-large"></p>
 
-<p>When a Samza container starts up, it looks for the most recent checkpoint and starts consuming messages from the checkpointed offsets. If the previous container failed unexpectedly, the most recent checkpoint may be slightly behind the current offsets (i.e. the job may have consumed some more messages since the last checkpoint was written), but we can&#39;t know for sure. In that case, the job may process a few messages again.</p>
+<p>When a Samza container starts up, it looks for the most recent checkpoint and starts consuming messages from the checkpointed offsets. If the previous container failed unexpectedly, the most recent checkpoint may be slightly behind the current offsets (i.e. the job may have consumed some more messages since the last checkpoint was written), but we can&rsquo;t know for sure. In that case, the job may process a few messages again.</p>
 
-<p>This guarantee is called <em>at-least-once processing</em>: Samza ensures that your job doesn&#39;t miss any messages, even if containers need to be restarted. However, it is possible for your job to see the same message more than once when a container is restarted. We are planning to address this in a future version of Samza, but for now it is just something to be aware of: for example, if you are counting page views, a forcefully killed container could cause events to be slightly over-counted. You can reduce duplication by checkpointing more frequently, at a slight performance cost.</p>
+<p>This guarantee is called <em>at-least-once processing</em>: Samza ensures that your job doesn&rsquo;t miss any messages, even if containers need to be restarted. However, it is possible for your job to see the same message more than once when a container is restarted. We are planning to address this in a future version of Samza, but for now it is just something to be aware of: for example, if you are counting page views, a forcefully killed container could cause events to be slightly over-counted. You can reduce duplication by checkpointing more frequently, at a slight performance cost.</p>
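 
 <p>The effect of the checkpoint interval on duplication can be illustrated with a small simulation. This is a sketch under stated assumptions, not Samza code: the class <code>AtLeastOnceSketch</code>, its methods and the integer offsets are hypothetical, and a checkpoint is modelled as simply remembering the next offset to read.</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of at-least-once processing; none of these names come from Samza.
final class AtLeastOnceSketch {

    /**
     * Processes offsets 0..totalMessages-1, crashing once after crashAfter
     * messages and restarting from the last checkpoint.
     * Returns how many times each offset was processed.
     */
    static Map<Integer, Integer> run(int totalMessages, int checkpointEvery, int crashAfter) {
        Map<Integer, Integer> timesSeen = new TreeMap<>();
        int checkpoint = 0; // next offset to read after a restart

        // First container: processes messages until it is killed.
        for (int offset = 0; offset < crashAfter; offset++) {
            timesSeen.merge(offset, 1, Integer::sum);
            if ((offset + 1) % checkpointEvery == 0) {
                checkpoint = offset + 1; // checkpoint written after this message
            }
        }

        // Replacement container: resumes from the checkpointed offset.
        for (int offset = checkpoint; offset < totalMessages; offset++) {
            timesSeen.merge(offset, 1, Integer::sum);
        }
        return timesSeen;
    }

    /** Counts the offsets that were processed more than once. */
    static long duplicates(Map<Integer, Integer> timesSeen) {
        return timesSeen.values().stream().filter(count -> count > 1).count();
    }

    public static void main(String[] args) {
        System.out.println("checkpoint every 4: " + duplicates(run(10, 4, 7)) + " duplicate(s)");
        System.out.println("checkpoint every 1: " + duplicates(run(10, 1, 7)) + " duplicate(s)");
    }
}
```

 <p>In this model, <code>run(10, 4, 7)</code> restarts from offset 4 after the crash, so offsets 4, 5 and 6 are each processed twice, while checkpointing after every message leaves no duplicates. No offset is ever skipped, which is the at-least-once property.</p>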
 
-<p>For checkpoints to be effective, they need to be written somewhere where they will survive faults. Samza allows you to write checkpoints to the file system (using FileSystemCheckpointManager), but that doesn&#39;t help if the machine fails and the container needs to be restarted on another machine. The most common configuration is to use Kafka for checkpointing. You can enable this with the following job configuration:</p>
+<p>For checkpoints to be effective, they need to be written somewhere where they will survive faults. Samza allows you to write checkpoints to the file system (using FileSystemCheckpointManager), but that doesn&rsquo;t help if the machine fails and the container needs to be restarted on another machine. The most common configuration is to use Kafka for checkpointing. You can enable this with the following job configuration:</p>
 <div class="highlight"><pre><code class="text language-text" data-lang="text"># The name of your job determines the name under which checkpoints will be stored
 job.name=example-job
 
@@ -145,9 +195,22 @@ systems.kafka.streams.my-special-topic.s
 
 <p>Note that the example configuration above causes your tasks to start consuming from the oldest offset <em>every time a container starts up</em>. This is useful in case you have some in-memory state in your tasks that you need to rebuild from source data in an input stream. If you are using streams in this way, you may also find <a href="streams.html">bootstrap streams</a> useful.</p>
 
-<p>If you want to make a one-off change to a job&#39;s consumer offsets, for example to force old messages to be processed again with a new version of your code, you can use CheckpointTool to manipulate the job&#39;s checkpoint. The tool is included in Samza&#39;s <a href="/contribute/code.html">source repository</a> and documented in the README.</p>
+<h3 id="toc_0">Manipulating Checkpoints Manually</h3>
 
-<h2><a href="state-management.html">State Management &raquo;</a></h2>
+<p>If you want to make a one-off change to a job&rsquo;s consumer offsets, for example to force old messages to be <a href="../jobs/reprocessing.html">processed again</a> with a new version of your code, you can use CheckpointTool to inspect and manipulate the job&rsquo;s checkpoint. The tool is included in Samza&rsquo;s <a href="/contribute/code.html">source repository</a>.</p>
+
+<p>To inspect a job&rsquo;s latest checkpoint, you need to specify your job&rsquo;s config file, so that the tool knows which job it is dealing with:</p>
+<div class="highlight"><pre><code class="text language-text" data-lang="text">samza-example/target/bin/checkpoint-tool.sh \
+  --config-path=file:///path/to/job/config.properties
+</code></pre></div>
+<p>This command prints out the latest checkpoint in a properties file format. You can save the output to a file, and edit it as you wish. For example, to jump back to the oldest possible point in time, you can set all the offsets to 0. Then you can feed that properties file back into checkpoint-tool.sh and save the modified checkpoint:</p>
+<div class="highlight"><pre><code class="text language-text" data-lang="text">samza-example/target/bin/checkpoint-tool.sh \
+  --config-path=file:///path/to/job/config.properties \
+  --new-offsets=file:///path/to/new/offsets.properties
+</code></pre></div>
+<p>Note that Samza only reads checkpoints on container startup. In order for your checkpoint change to take effect, you need to first stop the job, then save the modified offsets, and then start the job again. If you write a checkpoint while the job is running, it will most likely have no effect.</p>
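 
 <p>Editing the saved offsets by hand is error-prone for jobs with many partitions, so the "set all the offsets to 0" step could be scripted. The following is a minimal sketch that assumes the saved checkpoint is an ordinary Java properties file; the class <code>CheckpointOffsets</code> and the example keys are hypothetical, and the real key names are whatever checkpoint-tool.sh prints for your job.</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper; it is not part of Samza or of checkpoint-tool.sh.
final class CheckpointOffsets {

    /** Parses properties-format text and returns the same keys with every offset set to "0". */
    static Map<String, String> resetToZero(String checkpointText) {
        Properties parsed = new Properties();
        try {
            parsed.load(new StringReader(checkpointText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> reset = new TreeMap<>(); // sorted for stable output
        for (String key : parsed.stringPropertyNames()) {
            reset.put(key, "0");
        }
        return reset;
    }

    /** Renders key=value lines; keys containing '=', ':' or spaces would need escaping. */
    static String render(Map<String, String> offsets) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : offsets.entrySet()) {
            out.append(entry.getKey()).append('=').append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up keys, used only to show the transformation.
        String saved = "example-stream.partition.0=1234\nexample-stream.partition.1=5678\n";
        System.out.print(render(resetToZero(saved)));
    }
}
```

 <p>The rendered text can be saved to a file and passed to checkpoint-tool.sh with --new-offsets while the job is stopped.</p>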
+
+<h2 id="toc_1"><a href="state-management.html">State Management &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html?rev=1602533&r1=1602532&r2=1602533&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html Fri Jun 13 22:21:06 2014
@@ -1,4 +1,20 @@
 <!DOCTYPE html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 <html lang="en">
   <head>
     <meta charset="utf-8">
@@ -70,15 +86,49 @@
           </div>
 
           <div class="content">
-            <h2>Event Loop</h2>
+            <!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<h2>Event Loop</h2>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 
-<p>The event loop is the <a href="samza-container.html">container</a>&#39;s single thread that is in charge of <a href="streams.html">reading and writing messages</a>, <a href="metrics.html">flushing metrics</a>, <a href="checkpointing.html">checkpointing</a>, and <a href="windowing.html">windowing</a>.</p>
+<p>The event loop is the <a href="samza-container.html">container</a>&rsquo;s single thread that is in charge of <a href="streams.html">reading and writing messages</a>, <a href="metrics.html">flushing metrics</a>, <a href="checkpointing.html">checkpointing</a>, and <a href="windowing.html">windowing</a>.</p>
 
 <p>Samza uses a single thread because every container is designed to use a single CPU core; to get more parallelism, simply run more containers. This uses a bit more memory than multithreaded parallelism, because each JVM has some overhead, but it simplifies resource management and improves isolation between jobs. This helps Samza jobs run reliably on a multitenant cluster, where many different jobs written by different people are running at the same time.</p>
 
-<p>You are strongly discouraged from using threads in your job&#39;s code. Samza uses multiple threads internally for communicating with input and output streams, but all message processing and user code runs on a single-threaded event loop. In general, Samza is not thread-safe.</p>
+<p>You are strongly discouraged from using threads in your job&rsquo;s code. Samza uses multiple threads internally for communicating with input and output streams, but all message processing and user code runs on a single-threaded event loop. In general, Samza is not thread-safe.</p>
 
-<h3>Event Loop Internals</h3>
+<h3 id="toc_0">Event Loop Internals</h3>
 
 <p>A container may have multiple <a href="../api/javadocs/org/apache/samza/system/SystemConsumer.html">SystemConsumers</a> for consuming messages from different input systems. Each SystemConsumer reads messages on its own thread, but writes messages into a shared in-process message queue. The container uses this queue to funnel all of the messages into the event loop.</p>
 
@@ -94,9 +144,9 @@
 
 <p>The container does this, in a loop, until it is shut down. Note that although there can be multiple task instances within a container (depending on the number of input stream partitions), their process() and window() methods are all called on the same thread, never concurrently on different threads.</p>
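 
 <p>The threading model described in this section can be sketched as follows: several reader threads feed a shared in-process queue, and a single thread drains it. This is an illustration only and assumes nothing about the actual implementation of the Samza container; <code>EventLoopSketch</code> and the message strings are hypothetical, and adding a message to a list stands in for calling process().</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of "many reader threads, one event loop thread".
final class EventLoopSketch {

    /** Starts one reader thread per consumer and handles every message on the calling thread. */
    static List<String> run(int consumers, int messagesPerConsumer) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<Thread> readers = new ArrayList<>();

        // Each reader stands in for a consumer that reads on its own thread.
        for (int c = 0; c < consumers; c++) {
            final int id = c;
            Thread reader = new Thread(() -> {
                for (int m = 0; m < messagesPerConsumer; m++) {
                    queue.add("consumer-" + id + "/message-" + m);
                }
            });
            readers.add(reader);
            reader.start();
        }

        // The event loop: all messages are handled here, one at a time, on one thread.
        List<String> processed = new ArrayList<>();
        int expected = consumers * messagesPerConsumer;
        try {
            while (processed.size() < expected) {
                processed.add(queue.take()); // stands in for process()
            }
            for (Thread reader : readers) {
                reader.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the queue", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(2, 3).size() + " messages processed on one thread");
    }
}
```

 <p>Because only one thread ever touches <code>processed</code>, the handling code needs no locking; the queue is the only structure shared between threads.</p>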
 
-<h3>Lifecycle Listeners</h3>
+<h3 id="toc_1">Lifecycle Listeners</h3>
 
-<p>Sometimes, you need to run your own code at specific points in a task&#39;s lifecycle. For example, you might want to set up some context in the container whenever a new message arrives, or perform some operations on startup or shutdown.</p>
+<p>Sometimes, you need to run your own code at specific points in a task&rsquo;s lifecycle. For example, you might want to set up some context in the container whenever a new message arrives, or perform some operations on startup or shutdown.</p>
 
 <p>To receive notifications when such events happen, you can implement the <a href="../api/javadocs/org/apache/samza/task/TaskLifecycleListenerFactory.html">TaskLifecycleListenerFactory</a> interface. It returns a <a href="../api/javadocs/org/apache/samza/task/TaskLifecycleListener.html">TaskLifecycleListener</a>, whose methods are called by Samza at the appropriate times.</p>
 
@@ -109,7 +159,7 @@ task.lifecycle.listeners=my-listener
 </code></pre></div>
 <p>The Samza container creates one instance of your <a href="../api/javadocs/org/apache/samza/task/TaskLifecycleListener.html">TaskLifecycleListener</a>. If the container has multiple task instances (processing different input stream partitions), the beforeInit, afterInit, beforeClose and afterClose methods are called for each task instance. The <a href="../api/javadocs/org/apache/samza/task/TaskContext.html">TaskContext</a> argument of those methods gives you more information about the partitions.</p>
 
-<h2><a href="jmx.html">JMX &raquo;</a></h2>
+<h2 id="toc_2"><a href="jmx.html">JMX &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/jmx.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/jmx.html?rev=1602533&r1=1602532&r2=1602533&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/jmx.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/jmx.html Fri Jun 13 22:21:06 2014
@@ -1,4 +1,20 @@
 <!DOCTYPE html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 <html lang="en">
   <head>
     <meta charset="utf-8">
@@ -70,9 +86,43 @@
           </div>
 
           <div class="content">
-            <h2>JMX</h2>
+            <!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<h2>JMX</h2>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 
-<p>Samza&#39;s containers and YARN ApplicationMaster enable <a href="http://docs.oracle.com/javase/tutorial/jmx/">JMX</a> by default. JMX can be used for managing the JVM; for example, you can connect to it using <a href="http://docs.oracle.com/javase/7/docs/technotes/guides/management/jconsole.html">jconsole</a>, which is included in the JDK.</p>
+<p>Samza&rsquo;s containers and YARN ApplicationMaster enable <a href="http://docs.oracle.com/javase/tutorial/jmx/">JMX</a> by default. JMX can be used for managing the JVM; for example, you can connect to it using <a href="http://docs.oracle.com/javase/7/docs/technotes/guides/management/jconsole.html">jconsole</a>, which is included in the JDK.</p>
 
 <p>You can tell Samza to publish its internal <a href="metrics.html">metrics</a>, and any custom metrics you define, as JMX MBeans. To enable this, set the following properties in your job configuration:</p>
 <div class="highlight"><pre><code class="text language-text" data-lang="text"># Define a Samza metrics reporter called &quot;jmx&quot;, which publishes to JMX
@@ -81,12 +131,12 @@ metrics.reporter.jmx.class=org.apache.sa
 # Use it (if you have multiple reporters defined, separate them with commas)
 metrics.reporters=jmx
 </code></pre></div>
-<p>JMX needs to be configured to use a specific port, but in a distributed environment, there is no way of knowing in advance which ports are available on the machines running your containers. Therefore Samza chooses the JMX port randomly. If you need to connect to it, you can find the port by looking in the container&#39;s logs, which report the JMX server details as follows:</p>
+<p>JMX needs to be configured to use a specific port, but in a distributed environment, there is no way of knowing in advance which ports are available on the machines running your containers. Therefore Samza chooses the JMX port randomly. If you need to connect to it, you can find the port by looking in the container&rsquo;s logs, which report the JMX server details as follows:</p>
 <div class="highlight"><pre><code class="text language-text" data-lang="text">2014-06-02 21:50:17 JmxServer [INFO] According to InetAddress.getLocalHost.getHostName we are samza-grid-1234.example.com
 2014-06-02 21:50:17 JmxServer [INFO] Started JmxServer registry port=50214 server port=50215 url=service:jmx:rmi://localhost:50215/jndi/rmi://localhost:50214/jmxrmi
 2014-06-02 21:50:17 JmxServer [INFO] If you are tunneling, you might want to try JmxServer registry port=50214 server port=50215 url=service:jmx:rmi://samza-grid-1234.example.com:50215/jndi/rmi://samza-grid-1234.example.com:50214/jmxrmi
 </code></pre></div>
-<h2><a href="../jobs/job-runner.html">JobRunner &raquo;</a></h2>
+<h2 id="toc_0"><a href="../jobs/job-runner.html">JobRunner &raquo;</a></h2>
 
 
           </div>
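
Editorial note on jmx.html: the log lines quoted above give a registry port, a server port and a ready-made service URL. The Java sketch below shows one way a client could rebuild that URL from the host and the two ports and then count the MBeans the container exposes. The class name JmxUrlSketch and its methods are hypothetical and are not part of Samza; only the standard javax.management.remote classes are used, and the host and port values are the ones from the sample log.

```java
import java.io.IOException;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of Samza.
class JmxUrlSketch {

  // Rebuilds the URL in the shape printed by the JmxServer log lines:
  // the server port comes first, the registry port second.
  static String buildUrl(String host, int registryPort, int serverPort) {
    return "service:jmx:rmi://" + host + ":" + serverPort
        + "/jndi/rmi://" + host + ":" + registryPort + "/jmxrmi";
  }

  // Connects to the given JMX service URL and returns the number of
  // registered MBeans. Requires a reachable JMX server.
  static int countMBeans(String url) throws IOException {
    JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
    try {
      MBeanServerConnection connection = connector.getMBeanServerConnection();
      // A null name and null query match every registered MBean.
      Set<ObjectName> names = connection.queryNames(null, null);
      return names.size();
    } finally {
      connector.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // Host and ports taken from the sample log output; replace with your own.
    String url = buildUrl("samza-grid-1234.example.com", 50214, 50215);
    System.out.println(url);
    // Only attempt a network connection when explicitly asked to.
    if (args.length > 0 && args[0].equals("--connect")) {
      System.out.println(countMBeans(url) + " MBeans registered");
    }
  }
}
```

Because Samza picks the ports at random, the host and both port numbers have to be read from the container's log each time the container restarts.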

Modified: incubator/samza/site/learn/documentation/0.7.0/container/metrics.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/metrics.html?rev=1602533&r1=1602532&r2=1602533&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/metrics.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/metrics.html Fri Jun 13 22:21:06 2014
@@ -1,4 +1,20 @@
 <!DOCTYPE html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 <html lang="en">
   <head>
     <meta charset="utf-8">
@@ -70,11 +86,45 @@
           </div>
 
           <div class="content">
-            <h2>Metrics</h2>
+            <!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<h2>Metrics</h2>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 
-<p>When you&#39;re running a stream processing job in production, it&#39;s important that you have good metrics to track the health of your job. In order to make this easy, Samza includes a metrics library. It is used by Samza itself to generate some standard metrics such as message throughput, but you can also use it in your task code to emit custom metrics.</p>
+<p>When you&rsquo;re running a stream processing job in production, it&rsquo;s important that you have good metrics to track the health of your job. In order to make this easy, Samza includes a metrics library. It is used by Samza itself to generate some standard metrics such as message throughput, but you can also use it in your task code to emit custom metrics.</p>
 
-<p>Metrics can be reported in various ways. You can expose them via <a href="jmx.html">JMX</a>, which is useful in development. In production, a common setup is for each Samza container to periodically publish its metrics to a &quot;metrics&quot; Kafka topic, in which the metrics from all Samza jobs are aggregated. You can then consume this stream in another Samza job, and send the metrics to your favorite graphing system such as <a href="http://graphite.wikidot.com/">Graphite</a>.</p>
+<p>Metrics can be reported in various ways. You can expose them via <a href="jmx.html">JMX</a>, which is useful in development. In production, a common setup is for each Samza container to periodically publish its metrics to a &ldquo;metrics&rdquo; Kafka topic, in which the metrics from all Samza jobs are aggregated. You can then consume this stream in another Samza job, and send the metrics to your favorite graphing system such as <a href="http://graphite.wikidot.com/">Graphite</a>.</p>
 
 <p>To set up your job to publish metrics to Kafka, you can use the following configuration:</p>
 <div class="highlight"><pre><code class="text language-text" data-lang="text"># Define a metrics reporter called &quot;snapshot&quot;, which publishes metrics
@@ -90,7 +140,7 @@ metrics.reporter.snapshot.stream=kafka.m
 serializers.registry.metrics.class=org.apache.samza.serializers.MetricsSnapshotSerdeFactory
 systems.kafka.streams.metrics.samza.msg.serde=metrics
 </code></pre></div>
-<p>With this configuration, the job automatically sends several JSON-encoded messages to the &quot;metrics&quot; topic in Kafka every 60 seconds. The messages look something like this:</p>
+<p>With this configuration, the job automatically sends several JSON-encoded messages to the &ldquo;metrics&rdquo; topic in Kafka every 60 seconds. The messages look something like this:</p>
 <div class="highlight"><pre><code class="text language-text" data-lang="text">{
   &quot;header&quot;: {
     &quot;container-name&quot;: &quot;samza-container-0&quot;,
@@ -120,7 +170,7 @@ systems.kafka.streams.metrics.samza.msg.
 </code></pre></div>
 <p>There is a separate message for each task instance, and the header tells you the job name, job ID and partition of the task. The metrics allow you to see how many messages have been processed and sent, the current offset in the input stream partition, and other details. There are additional messages which give you metrics about the JVM (heap size, garbage collection information, threads etc.), internal metrics of the Kafka producers and consumers, and more.</p>
 
-<p>It&#39;s easy to generate custom metrics in your job, if there&#39;s some value you want to keep an eye on. You can use Samza&#39;s built-in metrics framework, which is similar in design to Coda Hale&#39;s <a href="http://metrics.codahale.com/">metrics</a> library. </p>
+<p>It&rsquo;s easy to generate custom metrics in your job, if there&rsquo;s some value you want to keep an eye on. You can use Samza&rsquo;s built-in metrics framework, which is similar in design to Coda Hale&rsquo;s <a href="http://metrics.codahale.com/">metrics</a> library. </p>
 
 <p>You can register your custom metrics through a <a href="../api/javadocs/org/apache/samza/metrics/MetricsRegistry.html">MetricsRegistry</a>. Your stream task needs to implement <a href="../api/javadocs/org/apache/samza/task/InitableTask.html">InitableTask</a>, so that you can get the metrics registry from the <a href="../api/javadocs/org/apache/samza/task/TaskContext.html">TaskContext</a>. This simple example shows how to count the number of messages processed by your task:</p>
 <div class="highlight"><pre><code class="text language-text" data-lang="text">public class MyJavaStreamTask implements StreamTask, InitableTask {
@@ -143,7 +193,7 @@ systems.kafka.streams.metrics.samza.msg.
 
 <p>If you want to report metrics in some other way, e.g. directly to a graphing system (without going via Kafka), you can implement a <a href="../api/javadocs/org/apache/samza/metrics/MetricsReporterFactory.html">MetricsReporterFactory</a> and reference it in your job configuration.</p>
 
-<h2><a href="windowing.html">Windowing &raquo;</a></h2>
+<h2 id="toc_0"><a href="windowing.html">Windowing &raquo;</a></h2>
 
 
           </div>
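
Editorial note on metrics.html: the JMX page says that multiple reporters are separated with commas, and this page defines a reporter called "snapshot". Assuming both reporters are defined in the same job configuration under those names (their class and stream properties are truncated in this diff and are not repeated here), enabling both would presumably look like this:

```text
# Enable both the Kafka snapshot reporter and the JMX reporter.
# Assumes "snapshot" and "jmx" are each defined elsewhere in this config.
metrics.reporters=snapshot,jmx
```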

Modified: incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html?rev=1602533&r1=1602532&r2=1602533&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html Fri Jun 13 22:21:06 2014
@@ -1,4 +1,20 @@
 <!DOCTYPE html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 <html lang="en">
   <head>
     <meta charset="utf-8">
@@ -70,7 +86,41 @@
           </div>
 
           <div class="content">
-            <h2>SamzaContainer</h2>
+            <!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<h2>SamzaContainer</h2>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
 
 <p>The SamzaContainer is responsible for managing the startup, execution, and shutdown of one or more <a href="../api/overview.html">StreamTask</a> instances. Each SamzaContainer typically runs as an independent Java virtual machine. A Samza job can consist of several SamzaContainers, potentially running on different machines.</p>
 
@@ -78,26 +128,26 @@
 
 <ol>
 <li>Get last checkpointed offset for each input stream partition that it consumes</li>
-<li>Create a &quot;reader&quot; thread for every input stream partition that it consumes</li>
+<li>Create a &ldquo;reader&rdquo; thread for every input stream partition that it consumes</li>
 <li>Start metrics reporters to report metrics</li>
-<li>Start a checkpoint timer to save your task&#39;s input stream offsets every so often</li>
-<li>Start a window timer to trigger your task&#39;s <a href="../api/javadocs/org/apache/samza/task/WindowableTask.html">window method</a>, if it is defined</li>
+<li>Start a checkpoint timer to save your task&rsquo;s input stream offsets every so often</li>
+<li>Start a window timer to trigger your task&rsquo;s <a href="../api/javadocs/org/apache/samza/task/WindowableTask.html">window method</a>, if it is defined</li>
 <li>Instantiate and initialize your StreamTask once for each input stream partition</li>
 <li>Start an event loop that takes messages from the input stream reader threads, and gives them to your StreamTasks</li>
 <li>Notify lifecycle listeners during each one of these steps</li>
 </ol>
 
-<p>Let&#39;s start in the middle, with the instantiation of a StreamTask. The following sections of the documentation cover the other steps.</p>
+<p>Let&rsquo;s start in the middle, with the instantiation of a StreamTask. The following sections of the documentation cover the other steps.</p>
 
-<h3>Tasks and Partitions</h3>
+<h3 id="toc_0">Tasks and Partitions</h3>
 
-<p>When the container starts, it creates instances of the <a href="../api/overview.html">task class</a> that you&#39;ve written. If the task class implements the <a href="../api/javadocs/org/apache/samza/task/InitableTask.html">InitableTask</a> interface, the SamzaContainer will also call the init() method.</p>
+<p>When the container starts, it creates instances of the <a href="../api/overview.html">task class</a> that you&rsquo;ve written. If the task class implements the <a href="../api/javadocs/org/apache/samza/task/InitableTask.html">InitableTask</a> interface, the SamzaContainer will also call the init() method.</p>
 <div class="highlight"><pre><code class="text language-text" data-lang="text">/** Implement this if you want a callback when your task starts up. */
 public interface InitableTask {
   void init(Config config, TaskContext context);
 }
 </code></pre></div>
-<p>How many instances of your task class are created depends on the number of partitions in the job&#39;s input streams. If the input streams have ten partitions, there will be ten instantiations of your task class: one for each partition. The first task instance will receive all messages for partition one, the second instance will receive all messages for partition two, and so on.</p>
+<p>How many instances of your task class are created depends on the number of partitions in the job&rsquo;s input streams. If the input streams have ten partitions, there will be ten instantiations of your task class: one for each partition. The first task instance will receive all messages for partition one, the second instance will receive all messages for partition two, and so on.</p>
 
 <p><img src="/img/0.7.0/learn/documentation/container/tasks-and-partitions.svg" alt="Illustration of tasks consuming partitions" class="diagram-large"></p>
 
@@ -107,17 +157,17 @@ public interface InitableTask {
 
 <p>There is <a href="https://issues.apache.org/jira/browse/SAMZA-71">work underway</a> to make the assignment of partitions to tasks more flexible in future versions of Samza.</p>
 
-<h3>Containers and resource allocation</h3>
+<h3 id="toc_1">Containers and resource allocation</h3>
 
 <p>Although the number of task instances is fixed &mdash; determined by the number of input partitions &mdash; you can configure how many containers you want to use for your job. If you are <a href="../jobs/yarn-jobs.html">using YARN</a>, the number of containers determines what CPU and memory resources are allocated to your job.</p>
 
 <p>If the data volume on your input streams is small, it might be sufficient to use just one SamzaContainer. In that case, Samza still creates one task instance per input partition, but all those tasks run within the same container. At the other extreme, you can create as many containers as you have partitions, and Samza will assign one task instance to each container.</p>
 
-<p>Each SamzaContainer is designed to use one CPU core, so it uses a <a href="event-loop.html">single-threaded event loop</a> for execution. It&#39;s not advisable to create your own threads within a SamzaContainer. If you need more parallelism, please configure your job to use more containers.</p>
+<p>Each SamzaContainer is designed to use one CPU core, so it uses a <a href="event-loop.html">single-threaded event loop</a> for execution. It&rsquo;s not advisable to create your own threads within a SamzaContainer. If you need more parallelism, please configure your job to use more containers.</p>
 
-<p>Any <a href="state-management.html">state</a> in your job belongs to a task instance, not to a container. This is a key design decision for Samza&#39;s scalability: as your job&#39;s resource requirements grow and shrink, you can simply increase or decrease the number of containers, but the number of task instances remains unchanged. As you scale up or down, the same state remains attached to each task instance. Task instances may be moved from one container to another, and any persistent state managed by Samza will be moved with it. This allows the job&#39;s processing semantics to remain unchanged, even as you change the job&#39;s parallelism.</p>
+<p>Any <a href="state-management.html">state</a> in your job belongs to a task instance, not to a container. This is a key design decision for Samza&rsquo;s scalability: as your job&rsquo;s resource requirements grow and shrink, you can simply increase or decrease the number of containers, but the number of task instances remains unchanged. As you scale up or down, the same state remains attached to each task instance. Task instances may be moved from one container to another, and any persistent state managed by Samza will be moved with it. This allows the job&rsquo;s processing semantics to remain unchanged, even as you change the job&rsquo;s parallelism.</p>
 
-<h3>Joining multiple input streams</h3>
+<h3 id="toc_2">Joining multiple input streams</h3>
 
 <p>If your job has multiple input streams, Samza provides a simple but powerful mechanism for joining data from different streams: each task instance receives messages from one partition of <em>each</em> of the input streams. For example, say you have two input streams, A and B, each with four partitions. Samza creates four task instances to process them, and assigns the partitions as follows:</p>
 
@@ -131,9 +181,9 @@ public interface InitableTask {
 
 <p>Thus, if you want two events in different streams to be processed by the same task instance, you need to ensure they are sent to the same partition number. You can achieve this by using the same partitioning key when <a href="../api/overview.html">sending the messages</a>. Joining streams is discussed in detail in the <a href="state-management.html">state management</a> section.</p>
 
-<p>There is one caveat in all of this: Samza currently assumes that a stream&#39;s partition count will never change. Partition splitting or repartitioning is not supported. If an input stream has N partitions, it is expected that it has always had, and will always have N partitions. If you want to re-partition a stream, you can write a job that reads messages from the stream, and writes them out to a new stream with the required number of partitions. For example, you could read messages from PageViewEvent, and write them to PageViewEventRepartition.</p>
+<p>There is one caveat in all of this: Samza currently assumes that a stream&rsquo;s partition count will never change. Partition splitting or repartitioning is not supported. If an input stream has N partitions, it is expected that it has always had, and will always have N partitions. If you want to re-partition a stream, you can write a job that reads messages from the stream, and writes them out to a new stream with the required number of partitions. For example, you could read messages from PageViewEvent, and write them to PageViewEventRepartition.</p>
 
-<h2><a href="streams.html">Streams &raquo;</a></h2>
+<h2 id="toc_3"><a href="streams.html">Streams &raquo;</a></h2>
 
 
           </div>
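
Editorial note on samza-container.html: the page says that each task instance receives one partition of each input stream, and that two events reach the same task only if they are sent to the same partition number. The Java sketch below models both rules. The class PartitionAssignmentSketch is hypothetical and is not Samza code; the zero-based partition numbering and the hash-modulo choice of partition are assumptions made for illustration, since the real partitioner depends on the messaging system in use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the assignment rule, not part of Samza.
class PartitionAssignmentSketch {

  // Task i consumes partition i of every input stream.
  // Partitions are numbered from 0 here, which is an assumption.
  static Map<Integer, List<String>> assign(List<String> streams, int partitionCount) {
    Map<Integer, List<String>> tasks = new LinkedHashMap<>();
    for (int partition = 0; partition < partitionCount; partition++) {
      List<String> inputs = new ArrayList<>();
      for (String stream : streams) {
        inputs.add(stream + "/" + partition);
      }
      tasks.put(partition, inputs);
    }
    return tasks;
  }

  // Assumed partitioner: hash of the key modulo the partition count.
  // floorMod keeps the result non-negative for negative hash codes.
  static int partitionFor(Object key, int partitionCount) {
    return Math.floorMod(key.hashCode(), partitionCount);
  }

  public static void main(String[] args) {
    List<String> streams = new ArrayList<>();
    streams.add("A");
    streams.add("B");
    // Two streams with four partitions each yield four task instances.
    System.out.println(assign(streams, 4));
    // The same key maps to the same partition number in both streams,
    // so both events are handled by the same task instance.
    System.out.println(partitionFor("user-42", 4));
  }
}
```

The sketch also shows why the partition count must stay fixed: changing the second argument of partitionFor moves existing keys to different partitions, and therefore to different task instances.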