Posted to commits@samza.apache.org by ya...@apache.org on 2014/06/18 07:50:57 UTC

svn commit: r1603358 [29/30] - in /incubator/samza/site: ./ community/ contribute/ css/ learn/documentation/0.7.0/api/ learn/documentation/0.7.0/api/javadocs/ learn/documentation/0.7.0/api/javadocs/org/apache/samza/ learn/documentation/0.7.0/api/javado...

Modified: incubator/samza/site/learn/documentation/0.7.0/comparisons/introduction.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/comparisons/introduction.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/comparisons/introduction.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/comparisons/introduction.html Wed Jun 18 05:50:54 2014
@@ -124,7 +124,7 @@
 
 <p>Here are a few of the high-level design decisions that we think make Samza a bit different from other stream processing projects.</p>
 
-<h3 id="toc_0">The Stream Model</h3>
+<h3 id="the-stream-model">The Stream Model</h3>
 
 <p>Streams are the input and output to Samza jobs. Samza has a very strong model of a stream&mdash;it is more than just a simple message exchange mechanism. A stream in Samza is a partitioned, ordered-per-partition, replayable, multi-subscriber, lossless sequence of messages. Streams are not just inputs and outputs to the system, but also buffers that isolate processing stages from each other.</p>
 
@@ -142,7 +142,7 @@
 
 <p>MapReduce is sometimes criticized for writing to disk more than necessary. However, this criticism applies less to stream processing: batch processing like MapReduce is often used for processing large historical collections of data in a short period of time (e.g. querying a month of data in ten minutes), whereas stream processing mostly needs to keep up with the steady-state flow of data (processing 10 minutes&rsquo; worth of data in 10 minutes). This means that the raw throughput requirements for stream processing are, generally, orders of magnitude lower than for batch processing.</p>
 
-<h3 id="toc_1"><a name="state"></a> State</h3>
+<h3 id="-state"><a name="state"></a> State</h3>
 
 <p>Only the very simplest stream processing problems are stateless (i.e. can process one message at a time, independently of all other messages). Many stream processing applications require a job to maintain some state. For example:</p>
 
@@ -172,7 +172,7 @@ example above, where you have a stream o
 
 <p>Partitioned local state is not always appropriate, and not required &mdash; nothing in Samza prevents calls to external databases. If you cannot produce a feed of changes from your database, or you need to rely on logic that exists only in a remote service, then it may be more convenient to call a remote service from your Samza job. But if you want to use local state, it works out of the box.</p>
 
-<h3 id="toc_2">Execution Framework</h3>
+<h3 id="execution-framework">Execution Framework</h3>
 
 <p>One final decision we made was to not build a custom distributed execution system in Samza. Instead, execution is pluggable, and currently completely handled by YARN. This has two benefits.</p>
 
@@ -182,7 +182,7 @@ example above, where you have a stream o
 
 <p>We think there will be a lot of innovation both in open source virtualization frameworks like Mesos and YARN and in commercial cloud providers like Amazon, so it makes sense to integrate with them.</p>
 
-<h2 id="toc_3"><a href="mupd8.html">MUPD8 &raquo;</a></h2>
+<h2 id="mupd8-&raquo;"><a href="mupd8.html">MUPD8 &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/comparisons/mupd8.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/comparisons/mupd8.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/comparisons/mupd8.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/comparisons/mupd8.html Wed Jun 18 05:50:54 2014
@@ -124,17 +124,17 @@
 
 <p><em>People generally want to know how similar systems compare. We&rsquo;ve done our best to fairly contrast the feature sets of Samza with other systems. But we aren&rsquo;t experts in these frameworks, and we are, of course, totally biased. If we have goofed anything, please let us know and we will correct it.</em></p>
 
-<h3 id="toc_0">Durability</h3>
+<h3 id="durability">Durability</h3>
 
 <p>MUPD8 makes no durability or delivery guarantees. Within MUPD8, stream processor tasks receive messages at most once. Samza uses Kafka for messaging, which guarantees message delivery.</p>
 
-<h3 id="toc_1">Ordering</h3>
+<h3 id="ordering">Ordering</h3>
 
 <p>As with durability, developers would ideally like their stream processors to receive messages in exactly the order that they were written.</p>
 
 <p>We don&rsquo;t entirely follow MUPD8&rsquo;s description of their ordering guarantees, but it seems to guarantee that all messages will be processed in the order in which they are written to MUPD8 queues, which is comparable to Kafka and Samza&rsquo;s guarantee.</p>
 
-<h3 id="toc_2">Buffering</h3>
+<h3 id="buffering">Buffering</h3>
 
 <p>A critical issue for handling large data flows is handling back pressure when one downstream processing stage gets slow.</p>
 
@@ -142,7 +142,7 @@
 
 <p>By adopting Kafka&rsquo;s broker as a remote buffer, Samza solves all of these problems. It doesn&rsquo;t need to block because consumers and producers are decoupled using the Kafka brokers&#39; disks as buffers. Messages are not dropped because Kafka brokers are highly available as of version 0.8. In the event of a failure, when a Samza job is restarted on another machine, its input and output are not lost, because they are stored remotely on replicated Kafka brokers.</p>
 
-<h3 id="toc_3">State Management</h3>
+<h3 id="state-management">State Management</h3>
 
 <p>As described in the <a href="introduction.html#state">introduction</a>, stream processors often need to maintain some state as they process messages. Different frameworks have different approaches to handling such state, and what to do in case of a failure.</p>
 
@@ -150,13 +150,13 @@
 
 <p>Samza maintains state locally with the task. This allows the state to be larger than will fit in memory. State is persisted to an output stream to enable recovery should the task fail. We believe this design enables stronger fault tolerance semantics, because the change log captures the evolution of state, allowing the state of a task to be restored to a consistent point in time.</p>
 
-<h3 id="toc_4">Deployment and execution</h3>
+<h3 id="deployment-and-execution">Deployment and execution</h3>
 
 <p>MUPD8 includes a custom execution framework. The functionality that this framework supports in terms of users and resource limits isn&rsquo;t clear to us.</p>
 
 <p>Samza leverages YARN to deploy user code, and execute it in a distributed environment.</p>
 
-<h3 id="toc_5">Fault Tolerance</h3>
+<h3 id="fault-tolerance">Fault Tolerance</h3>
 
 <p>What should a stream processing system do when a machine or processor fails?</p>
 
@@ -164,7 +164,7 @@
 
 <p>Samza uses YARN to manage fault tolerance. YARN detects when nodes or Samza tasks fail, and notifies Samza&rsquo;s <a href="../yarn/application-master.html">ApplicationMaster</a>. At that point, it&rsquo;s up to Samza to decide what to do. Generally, this means re-starting the task on another machine. Since messages are persisted to Kafka brokers remotely, and there are no in-memory queues, no messages should be lost (unless the processors are using async Kafka producers, which offer higher performance but don&rsquo;t wait for messages to be committed).</p>
 
-<h3 id="toc_6">Workflow</h3>
+<h3 id="workflow">Workflow</h3>
 
 <p>Sometimes more than one job or processing stage is needed to accomplish something. This is the case when you wish to re-partition a stream, for example. MUPD8 includes a custom workflow system that defines how to execute multiple jobs at once, and how to feed stream data from one into the other.</p>
 
@@ -172,23 +172,23 @@
 
 <p>This was motivated by our experience with Hadoop, where the data flow between jobs is implicitly defined by their input and output directories. This decentralized model has proven itself to scale well to a large organization.</p>
 
-<h3 id="toc_7">Memory</h3>
+<h3 id="memory">Memory</h3>
 
 <p>MUPD8 executes all of its map/update processors inside a single JVM, using threads. This is memory-efficient, as the JVM memory overhead is shared across the threads.</p>
 
 <p>Samza uses a separate JVM for each <a href="../container/samza-container.html">stream processor container</a>. This has the disadvantage of using more memory compared to running multiple stream processing threads within a single JVM. However, the advantage is improved isolation between tasks, which can make them more reliable.</p>
 
-<h3 id="toc_8">Isolation</h3>
+<h3 id="isolation">Isolation</h3>
 
 <p>MUPD8 provides no resource isolation between stream processors. A single badly behaved stream processor can bring down all processors on the node.</p>
 
 <p>Samza uses process level isolation between stream processor tasks, similarly to Hadoop&rsquo;s approach. We can enforce strict per-process memory limits. In addition, Samza supports CPU limits when used with YARN cgroups. As the YARN support for cgroups develops further, it should also become possible to support disk and network cgroup limits.</p>
 
-<h3 id="toc_9">Further Reading</h3>
+<h3 id="further-reading">Further Reading</h3>
 
 <p>The MUPD8 team has published a very good <a href="http://vldb.org/pvldb/vol5/p1814_wanglam_vldb2012.pdf">paper</a> on the design of their system.</p>
 
-<h2 id="toc_10"><a href="storm.html">Storm &raquo;</a></h2>
+<h2 id="storm-&raquo;"><a href="storm.html">Storm &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html Wed Jun 18 05:50:54 2014
@@ -128,7 +128,7 @@
 
 <p>Storm and Samza use different words for similar concepts: <em>spouts</em> in Storm are similar to stream consumers in Samza, <em>bolts</em> are similar to tasks, and <em>tuples</em> are similar to messages in Samza. Storm also has some additional building blocks which don&rsquo;t have direct equivalents in Samza.</p>
 
-<h3 id="toc_0">Ordering and Guarantees</h3>
+<h3 id="ordering-and-guarantees">Ordering and Guarantees</h3>
 
 <p>Storm allows you to choose the level of guarantee with which you want your messages to be processed:</p>
 
@@ -142,7 +142,7 @@
 
 <p>Moreover, because Samza never processes messages in a partition out-of-order, it is better suited for handling keyed data. For example, if you have a stream of database updates &mdash; where later updates may replace earlier updates &mdash; then reordering the messages may change the final result. Provided that all updates for the same key appear in the same stream partition, Samza is able to guarantee a consistent state.</p>
 
-<h3 id="toc_1">State Management</h3>
+<h3 id="state-management">State Management</h3>
 
 <p>Storm&rsquo;s lower-level API of bolts does not offer any help for managing state in a stream process. A bolt can maintain in-memory state (which is lost if that bolt dies), or it can make calls to a remote database to read and write state. However, a topology can usually process messages at a much higher rate than calls to a remote database can be made, so making a remote call for each message quickly becomes a bottleneck.</p>
 
@@ -156,7 +156,7 @@
 
 <p>A limitation of Samza&rsquo;s state handling is that it currently does not support exactly-once semantics &mdash; only at-least-once is supported right now. But we&rsquo;re working on fixing that, so stay tuned for updates.</p>
 
-<h3 id="toc_2">Partitioning and Parallelism</h3>
+<h3 id="partitioning-and-parallelism">Partitioning and Parallelism</h3>
 
 <p>Storm&rsquo;s <a href="https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology">parallelism model</a> is fairly similar to Samza&rsquo;s. Both frameworks split processing into independent <em>tasks</em> that can run in parallel. Resource allocation is independent of the number of tasks: a small job can keep all tasks in a single process on a single machine; a large job can spread the tasks over many processes on many machines.</p>
 
@@ -166,7 +166,7 @@
 
 <p>When using a transactional spout with Trident (a requirement for achieving exactly-once semantics), parallelism is potentially reduced. Trident relies on a global ordering in its input streams &mdash; that is, ordering across all partitions of a stream, not just within one partition. This means that the topology&rsquo;s input stream has to go through a single spout instance, effectively ignoring the partitioning of the input stream. This spout may become a bottleneck on high-volume streams. In Samza, all stream processing is parallel &mdash; there are no such choke points.</p>
 
-<h3 id="toc_3">Deployment &amp; Execution</h3>
+<h3 id="deployment-&amp;-execution">Deployment &amp; Execution</h3>
 
 <p>A Storm cluster is composed of a set of nodes running a <em>Supervisor</em> daemon. The supervisor daemons talk to a single master node running a daemon called <em>Nimbus</em>. The Nimbus daemon is responsible for assigning work and managing resources in the cluster. See Storm&rsquo;s <a href="https://github.com/nathanmarz/storm/wiki/Tutorial">Tutorial</a> page for details. This is quite similar to YARN; though YARN is a bit more fully featured and intended to be multi-framework, Nimbus is better integrated with Storm.</p>
 
@@ -176,13 +176,13 @@
 
 <p>The YARN support in Samza is pluggable, so you can swap it for a different execution framework if you wish.</p>
 
-<h3 id="toc_4">Language Support</h3>
+<h3 id="language-support">Language Support</h3>
 
 <p>Storm is written in Java and Clojure but has good support for non-JVM languages. It follows a model similar to MapReduce Streaming: the non-JVM task is launched in a separate process, data is sent to its stdin, and output is read from its stdout.</p>
 
 <p>Samza is written in Java and Scala. It is built with multi-language support in mind, but currently only supports JVM languages.</p>
 
-<h3 id="toc_5">Workflow</h3>
+<h3 id="workflow">Workflow</h3>
 
 <p>Storm provides modeling of <em>topologies</em> (a processing graph of multiple stages) <a href="https://github.com/nathanmarz/storm/wiki/Tutorial">in code</a>. Trident provides a further <a href="https://github.com/nathanmarz/storm/wiki/Trident-tutorial">higher-level API</a> on top of this, including familiar relational-like operators such as filters, grouping, aggregation and joins. This means the entire topology is wired up in one place, which has the advantage that it is documented in code, but has the disadvantage that the entire topology needs to be developed and deployed as a whole.</p>
 
@@ -190,13 +190,13 @@
 
 <p>Samza&rsquo;s approach can be emulated in Storm by connecting two separate topologies via a broker, such as Kafka. However, Storm&rsquo;s implementation of exactly-once semantics only works within a single topology.</p>
 
-<h3 id="toc_6">Maturity</h3>
+<h3 id="maturity">Maturity</h3>
 
 <p>We can&rsquo;t speak to Storm&rsquo;s maturity, but it has an <a href="https://github.com/nathanmarz/storm/wiki/Powered-By">impressive number of adopters</a>, a strong feature set, and seems to be under active development. It integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).</p>
 
 <p>Samza is pretty immature, though it builds on solid components. YARN is fairly new, but is already being run on 3000+ node clusters at Yahoo!, and the project is under active development by both <a href="http://hortonworks.com/">Hortonworks</a> and <a href="http://www.cloudera.com/content/cloudera/en/home.html">Cloudera</a>. Kafka has a strong <a href="https://cwiki.apache.org/KAFKA/powered-by.html">powered by</a> page, and has seen increased adoption recently. It&rsquo;s also frequently used with Storm. Samza is a brand new project that is in use at LinkedIn. Our hope is that others will find it useful, and adopt it as well.</p>
 
-<h3 id="toc_7">Buffering &amp; Latency</h3>
+<h3 id="buffering-&amp;-latency">Buffering &amp; Latency</h3>
 
 <p>Storm uses <a href="http://zeromq.org/">ZeroMQ</a> for non-durable communication between bolts, which enables extremely low latency transmission of tuples. Samza does not have an equivalent mechanism, and always writes task output to a stream.</p>
 
@@ -208,25 +208,25 @@
 
 <p>As described in the <em>workflow</em> section above, Samza&rsquo;s approach can be emulated in Storm, but comes with a loss in functionality.</p>
 
-<h3 id="toc_8">Isolation</h3>
+<h3 id="isolation">Isolation</h3>
 
 <p>Storm provides standard UNIX process-level isolation. Your topology can impact another topology&rsquo;s performance (or vice-versa) if too much CPU, disk, network, or memory is used.</p>
 
 <p>Samza relies on YARN to provide resource-level isolation. Currently, YARN provides explicit controls for memory and CPU limits (through <a href="../yarn/isolation.html">cgroups</a>), and both have been used successfully with Samza. No isolation for disk or network is provided by YARN at this time.</p>
 
-<h3 id="toc_9">Distributed RPC</h3>
+<h3 id="distributed-rpc">Distributed RPC</h3>
 
 <p>In Storm, you can write topologies which not only accept a stream of fixed events, but also allow clients to run distributed computations on demand. The query is sent into the topology as a tuple on a special spout, and when the topology has computed the answer, it is returned to the client (who was synchronously waiting for the answer). This facility is called <a href="https://github.com/nathanmarz/storm/wiki/Distributed-RPC">Distributed RPC</a> (DRPC).</p>
 
 <p>Samza does not currently have an equivalent API to DRPC, but you can build it yourself using Samza&rsquo;s stream processing primitives.</p>
 
-<h3 id="toc_10">Data Model</h3>
+<h3 id="data-model">Data Model</h3>
 
 <p>Storm models all messages as <em>tuples</em> with a defined data model but pluggable serialization.</p>
 
 <p>Samza&rsquo;s serialization and data model are both pluggable. We are not terribly opinionated about which approach is best.</p>
 
-<h2 id="toc_11"><a href="../api/overview.html">API Overview &raquo;</a></h2>
+<h2 id="api-overview-&raquo;"><a href="../api/overview.html">API Overview &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html Wed Jun 18 05:50:54 2014
@@ -143,7 +143,7 @@
 <p>This guarantee is called <em>at-least-once processing</em>: Samza ensures that your job doesn&rsquo;t miss any messages, even if containers need to be restarted. However, it is possible for your job to see the same message more than once when a container is restarted. We are planning to address this in a future version of Samza, but for now it is just something to be aware of: for example, if you are counting page views, a forcefully killed container could cause events to be slightly over-counted. You can reduce duplication by checkpointing more frequently, at a slight performance cost.</p>
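
 <p>For example, assuming the default commit interval of 60 seconds (task.commit.ms=60000, as in the configuration below), you could checkpoint every 10 seconds instead:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"># Checkpoint every 10 seconds rather than the default 60 seconds. Fewer
 # messages are replayed after a restart, at the cost of more frequent
 # writes to the checkpoint topic.
 task.commit.ms=10000
 </code></pre></div>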
 
 <p>For checkpoints to be effective, they need to be written somewhere where they will survive faults. Samza allows you to write checkpoints to the file system (using FileSystemCheckpointManager), but that doesn&rsquo;t help if the machine fails and the container needs to be restarted on another machine. The most common configuration is to use Kafka for checkpointing. You can enable this with the following job configuration:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># The name of your job determines the name under which checkpoints will be stored
+<div class="highlight"><pre><code class="language-text" data-lang="text"># The name of your job determines the name under which checkpoints will be stored
 job.name=example-job
 
 # Define a system called &quot;kafka&quot; for consuming and producing to a Kafka cluster
@@ -159,7 +159,7 @@ task.commit.ms=60000
 <p>In this configuration, Samza writes checkpoints to a separate Kafka topic called __samza_checkpoint_&lt;job-name&gt;_&lt;job-id&gt; (in the example configuration above, the topic would be called __samza_checkpoint_example-job_1). Once per minute, Samza automatically sends a message to this topic, in which the current offsets of the input streams are encoded. When a Samza container starts up, it looks for the most recent offset message in this topic, and loads that checkpoint.</p>
 
 <p>Sometimes it can be useful to use checkpoints only for some input streams, but not for others. In this case, you can tell Samza to ignore any checkpointed offsets for a particular stream name:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># Ignore any checkpoints for the topic &quot;my-special-topic&quot;
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Ignore any checkpoints for the topic &quot;my-special-topic&quot;
 systems.kafka.streams.my-special-topic.samza.reset.offset=true
 
 # Always start consuming &quot;my-special-topic&quot; at the oldest available offset
@@ -195,22 +195,22 @@ systems.kafka.streams.my-special-topic.s
 
 <p>Note that the example configuration above causes your tasks to start consuming from the oldest offset <em>every time a container starts up</em>. This is useful in case you have some in-memory state in your tasks that you need to rebuild from source data in an input stream. If you are using streams in this way, you may also find <a href="streams.html">bootstrap streams</a> useful.</p>
 
-<h3 id="toc_0">Manipulating Checkpoints Manually</h3>
+<h3 id="manipulating-checkpoints-manually">Manipulating Checkpoints Manually</h3>
 
 <p>If you want to make a one-off change to a job&rsquo;s consumer offsets, for example to force old messages to be <a href="../jobs/reprocessing.html">processed again</a> with a new version of your code, you can use CheckpointTool to inspect and manipulate the job&rsquo;s checkpoint. The tool is included in Samza&rsquo;s <a href="/contribute/code.html">source repository</a>.</p>
 
 <p>To inspect a job&rsquo;s latest checkpoint, you need to specify your job&rsquo;s config file, so that the tool knows which job it is dealing with:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">samza-example/target/bin/checkpoint-tool.sh \
+<div class="highlight"><pre><code class="language-text" data-lang="text">samza-example/target/bin/checkpoint-tool.sh \
   --config-path=file:///path/to/job/config.properties
 </code></pre></div>
 <p>This command prints out the latest checkpoint in a properties file format. You can save the output to a file, and edit it as you wish. For example, to jump back to the oldest possible point in time, you can set all the offsets to 0. Then you can feed that properties file back into checkpoint-tool.sh and save the modified checkpoint:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">samza-example/target/bin/checkpoint-tool.sh \
+<div class="highlight"><pre><code class="language-text" data-lang="text">samza-example/target/bin/checkpoint-tool.sh \
   --config-path=file:///path/to/job/config.properties \
   --new-offsets=file:///path/to/new/offsets.properties
 </code></pre></div>
 <p>Note that Samza only reads checkpoints on container startup. In order for your checkpoint change to take effect, you need to first stop the job, then save the modified offsets, and then start the job again. If you write a checkpoint while the job is running, it will most likely have no effect.</p>
 
-<h2 id="toc_1"><a href="state-management.html">State Management &raquo;</a></h2>
+<h2 id="state-management-&raquo;"><a href="state-management.html">State Management &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html Wed Jun 18 05:50:54 2014
@@ -128,7 +128,7 @@
 
 <p>You are strongly discouraged from using threads in your job&rsquo;s code. Samza uses multiple threads internally for communicating with input and output streams, but all message processing and user code runs on a single-threaded event loop. In general, Samza is not thread-safe.</p>
 
-<h3 id="toc_0">Event Loop Internals</h3>
+<h3 id="event-loop-internals">Event Loop Internals</h3>
 
 <p>A container may have multiple <a href="../api/javadocs/org/apache/samza/system/SystemConsumer.html">SystemConsumers</a> for consuming messages from different input systems. Each SystemConsumer reads messages on its own thread, but writes messages into a shared in-process message queue. The container uses this queue to funnel all of the messages into the event loop.</p>
 
@@ -144,14 +144,14 @@
 
 <p>The container does this, in a loop, until it is shut down. Note that although there can be multiple task instances within a container (depending on the number of input stream partitions), their process() and window() methods are all called on the same thread, never concurrently on different threads.</p>
 
-<h3 id="toc_1">Lifecycle Listeners</h3>
+<h3 id="lifecycle-listeners">Lifecycle Listeners</h3>
 
 <p>Sometimes, you need to run your own code at specific points in a task&rsquo;s lifecycle. For example, you might want to set up some context in the container whenever a new message arrives, or perform some operations on startup or shutdown.</p>
 
 <p>To receive notifications when such events happen, you can implement the <a href="../api/javadocs/org/apache/samza/task/TaskLifecycleListenerFactory.html">TaskLifecycleListenerFactory</a> interface. It returns a <a href="../api/javadocs/org/apache/samza/task/TaskLifecycleListener.html">TaskLifecycleListener</a>, whose methods are called by Samza at the appropriate times.</p>
 
 <p>You can then tell Samza to use your lifecycle listener with the following properties in your job configuration:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># Define a listener called &quot;my-listener&quot; by giving the factory class name
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Define a listener called &quot;my-listener&quot; by giving the factory class name
 task.lifecycle.listener.my-listener.class=com.example.foo.MyListenerFactory
 
 # Enable it in this job (multiple listeners can be separated by commas)
@@ -159,7 +159,7 @@ task.lifecycle.listeners=my-listener
 </code></pre></div>
 <p>The Samza container creates one instance of your <a href="../api/javadocs/org/apache/samza/task/TaskLifecycleListener.html">TaskLifecycleListener</a>. If the container has multiple task instances (processing different input stream partitions), the beforeInit, afterInit, beforeClose and afterClose methods are called for each task instance. The <a href="../api/javadocs/org/apache/samza/task/TaskContext.html">TaskContext</a> argument of those methods gives you more information about the partitions.</p>
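
 <p>As an illustration &mdash; the method names and signatures here are a sketch, so consult the javadocs above for the exact API in your Samza version &mdash; a factory and listener that log lifecycle events might look like this:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text">// Sketch only: check the TaskLifecycleListener javadocs for the exact API.
 public class LoggingListenerFactory implements TaskLifecycleListenerFactory {
   public TaskLifecycleListener getLifecyleListener(String name, Config config) {
     return new LoggingListener(name);
   }
 }
 
 public class LoggingListener implements TaskLifecycleListener {
   private final String name;
 
   public LoggingListener(String name) { this.name = name; }
 
   public void beforeInit(Config config, TaskContext context) {
     System.out.println(name + &quot;: initializing task for partition &quot; + context.getPartition());
   }
 
   public void afterInit(Config config, TaskContext context) {}
   public void beforeProcess(IncomingMessageEnvelope envelope, Config config, TaskContext context) {}
   public void afterProcess(IncomingMessageEnvelope envelope, Config config, TaskContext context) {}
   public void beforeClose(Config config, TaskContext context) {}
 
   public void afterClose(Config config, TaskContext context) {
     System.out.println(name + &quot;: task shut down&quot;);
   }
 }
 </code></pre></div>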
 
-<h2 id="toc_2"><a href="jmx.html">JMX &raquo;</a></h2>
+<h2 id="jmx-&raquo;"><a href="jmx.html">JMX &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/jmx.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/jmx.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/jmx.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/jmx.html Wed Jun 18 05:50:54 2014
@@ -125,18 +125,18 @@
 <p>Samza&rsquo;s containers and YARN ApplicationMaster enable <a href="http://docs.oracle.com/javase/tutorial/jmx/">JMX</a> by default. JMX can be used for managing the JVM; for example, you can connect to it using <a href="http://docs.oracle.com/javase/7/docs/technotes/guides/management/jconsole.html">jconsole</a>, which is included in the JDK.</p>
 
 <p>You can tell Samza to publish its internal <a href="metrics.html">metrics</a>, and any custom metrics you define, as JMX MBeans. To enable this, set the following properties in your job configuration:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># Define a Samza metrics reporter called &quot;jmx&quot;, which publishes to JMX
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Define a Samza metrics reporter called &quot;jmx&quot;, which publishes to JMX
 metrics.reporter.jmx.class=org.apache.samza.metrics.reporter.JmxReporterFactory
 
 # Use it (if you have multiple reporters defined, separate them with commas)
 metrics.reporters=jmx
 </code></pre></div>
 <p>JMX needs to be configured to use a specific port, but in a distributed environment, there is no way of knowing in advance which ports are available on the machines running your containers. Therefore Samza chooses the JMX port randomly. If you need to connect to it, you can find the port by looking in the container&rsquo;s logs, which report the JMX server details as follows:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">2014-06-02 21:50:17 JmxServer [INFO] According to InetAddress.getLocalHost.getHostName we are samza-grid-1234.example.com
+<div class="highlight"><pre><code class="language-text" data-lang="text">2014-06-02 21:50:17 JmxServer [INFO] According to InetAddress.getLocalHost.getHostName we are samza-grid-1234.example.com
 2014-06-02 21:50:17 JmxServer [INFO] Started JmxServer registry port=50214 server port=50215 url=service:jmx:rmi://localhost:50215/jndi/rmi://localhost:50214/jmxrmi
 2014-06-02 21:50:17 JmxServer [INFO] If you are tunneling, you might want to try JmxServer registry port=50214 server port=50215 url=service:jmx:rmi://samza-grid-1234.example.com:50215/jndi/rmi://samza-grid-1234.example.com:50214/jmxrmi
 </code></pre></div>
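 <p>Once you have the host and port, you can connect a JMX client to the service URL from the log output. For example, jconsole accepts a JMX service URL on the command line:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"># Host and ports taken from the JmxServer log lines above
 jconsole service:jmx:rmi://samza-grid-1234.example.com:50215/jndi/rmi://samza-grid-1234.example.com:50214/jmxrmi
 </code></pre></div>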
-<h2 id="toc_0"><a href="../jobs/job-runner.html">JobRunner &raquo;</a></h2>
+<h2 id="jobrunner-&raquo;"><a href="../jobs/job-runner.html">JobRunner &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/metrics.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/metrics.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/metrics.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/metrics.html Wed Jun 18 05:50:54 2014
@@ -127,7 +127,7 @@
 <p>Metrics can be reported in various ways. You can expose them via <a href="jmx.html">JMX</a>, which is useful in development. In production, a common setup is for each Samza container to periodically publish its metrics to a &ldquo;metrics&rdquo; Kafka topic, in which the metrics from all Samza jobs are aggregated. You can then consume this stream in another Samza job, and send the metrics to your favorite graphing system such as <a href="http://graphite.wikidot.com/">Graphite</a>.</p>
 
 <p>To set up your job to publish metrics to Kafka, you can use the following configuration:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># Define a metrics reporter called &quot;snapshot&quot;, which publishes metrics
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Define a metrics reporter called &quot;snapshot&quot;, which publishes metrics
 # every 60 seconds.
 metrics.reporters=snapshot
 metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
@@ -141,7 +141,7 @@ serializers.registry.metrics.class=org.a
 systems.kafka.streams.metrics.samza.msg.serde=metrics
 </code></pre></div>
 <p>With this configuration, the job automatically sends several JSON-encoded messages to the &ldquo;metrics&rdquo; topic in Kafka every 60 seconds. The messages look something like this:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">{
+<div class="highlight"><pre><code class="language-text" data-lang="text">{
   &quot;header&quot;: {
     &quot;container-name&quot;: &quot;samza-container-0&quot;,
     &quot;host&quot;: &quot;samza-grid-1234.example.com&quot;,
@@ -173,7 +173,7 @@ systems.kafka.streams.metrics.samza.msg.
 <p>It&rsquo;s easy to generate custom metrics in your job, if there&rsquo;s some value you want to keep an eye on. You can use Samza&rsquo;s built-in metrics framework, which is similar in design to Coda Hale&rsquo;s <a href="http://metrics.codahale.com/">metrics</a> library. </p>
 
 <p>You can register your custom metrics through a <a href="../api/javadocs/org/apache/samza/metrics/MetricsRegistry.html">MetricsRegistry</a>. Your stream task needs to implement <a href="../api/javadocs/org/apache/samza/task/InitableTask.html">InitableTask</a>, so that you can get the metrics registry from the <a href="../api/javadocs/org/apache/samza/task/TaskContext.html">TaskContext</a>. This simple example shows how to count the number of messages processed by your task:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">public class MyJavaStreamTask implements StreamTask, InitableTask {
+<div class="highlight"><pre><code class="language-text" data-lang="text">public class MyJavaStreamTask implements StreamTask, InitableTask {
   private Counter messageCount;
 
   public void init(Config config, TaskContext context) {
@@ -193,7 +193,7 @@ systems.kafka.streams.metrics.samza.msg.
 
 <p>If you want to report metrics in some other way, e.g. directly to a graphing system (without going via Kafka), you can implement a <a href="../api/javadocs/org/apache/samza/metrics/MetricsReporterFactory.html">MetricsReporterFactory</a> and reference it in your job configuration.</p>
 
-<h2 id="toc_0"><a href="windowing.html">Windowing &raquo;</a></h2>
+<h2 id="windowing-&raquo;"><a href="windowing.html">Windowing &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html Wed Jun 18 05:50:54 2014
@@ -139,10 +139,10 @@
 
 <p>Let&rsquo;s start in the middle, with the instantiation of a StreamTask. The following sections of the documentation cover the other steps.</p>
 
-<h3 id="toc_0">Tasks and Partitions</h3>
+<h3 id="tasks-and-partitions">Tasks and Partitions</h3>
 
 <p>When the container starts, it creates instances of the <a href="../api/overview.html">task class</a> that you&rsquo;ve written. If the task class implements the <a href="../api/javadocs/org/apache/samza/task/InitableTask.html">InitableTask</a> interface, the SamzaContainer will also call the init() method.</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">/** Implement this if you want a callback when your task starts up. */
+<div class="highlight"><pre><code class="language-text" data-lang="text">/** Implement this if you want a callback when your task starts up. */
 public interface InitableTask {
   void init(Config config, TaskContext context);
 }
@@ -151,13 +151,13 @@ public interface InitableTask {
 
 <p><img src="/img/0.7.0/learn/documentation/container/tasks-and-partitions.svg" alt="Illustration of tasks consuming partitions" class="diagram-large"></p>
 
-<p>The number of partitions in the input streams is determined by the systems from which you are consuming. For example, if your input system is Kafka, you can specify the number of partitions when you create a topic.</p>
+<p>The number of partitions in the input streams is determined by the systems from which you are consuming. For example, if your input system is Kafka, you can specify the number of partitions when you create a topic from the command line, or using the num.partitions setting in Kafka&rsquo;s server properties file.</p>
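+
+<p>For example, with the Kafka command line tools, a 12-partition topic might be created like this (the script name and flags vary between Kafka versions, so treat this as a sketch):</p>
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Kafka 0.8.1+; older releases ship a kafka-create-topic.sh script instead
+bin/kafka-topics.sh --create --zookeeper localhost:2181 \
+  --topic PageViewEvent --partitions 12 --replication-factor 2
+</code></pre></div>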
 
 <p>If a Samza job has more than one input stream, the number of task instances for the Samza job is the maximum number of partitions across all input streams. For example, if a Samza job is reading from PageViewEvent (12 partitions), and ServiceMetricEvent (14 partitions), then the Samza job would have 14 task instances (numbered 0 through 13). Task instances 12 and 13 only receive events from ServiceMetricEvent, because there is no corresponding PageViewEvent partition.</p>
 
 <p>There is <a href="https://issues.apache.org/jira/browse/SAMZA-71">work underway</a> to make the assignment of partitions to tasks more flexible in future versions of Samza.</p>
 
-<h3 id="toc_1">Containers and resource allocation</h3>
+<h3 id="containers-and-resource-allocation">Containers and resource allocation</h3>
 
 <p>Although the number of task instances is fixed &mdash; determined by the number of input partitions &mdash; you can configure how many containers you want to use for your job. If you are <a href="../jobs/yarn-jobs.html">using YARN</a>, the number of containers determines what CPU and memory resources are allocated to your job.</p>
 
@@ -167,7 +167,7 @@ public interface InitableTask {
 
 <p>Any <a href="state-management.html">state</a> in your job belongs to a task instance, not to a container. This is a key design decision for Samza&rsquo;s scalability: as your job&rsquo;s resource requirements grow and shrink, you can simply increase or decrease the number of containers, but the number of task instances remains unchanged. As you scale up or down, the same state remains attached to each task instance. Task instances may be moved from one container to another, and any persistent state managed by Samza will be moved with it. This allows the job&rsquo;s processing semantics to remain unchanged, even as you change the job&rsquo;s parallelism.</p>
 
-<h3 id="toc_2">Joining multiple input streams</h3>
+<h3 id="joining-multiple-input-streams">Joining multiple input streams</h3>
 
 <p>If your job has multiple input streams, Samza provides a simple but powerful mechanism for joining data from different streams: each task instance receives messages from one partition of <em>each</em> of the input streams. For example, say you have two input streams, A and B, each with four partitions. Samza creates four task instances to process them, and assigns the partitions as follows:</p>
 
@@ -183,7 +183,7 @@ public interface InitableTask {
 
 <p>There is one caveat in all of this: Samza currently assumes that a stream&rsquo;s partition count will never change. Partition splitting or repartitioning is not supported. If an input stream has N partitions, it is expected that it has always had, and will always have N partitions. If you want to re-partition a stream, you can write a job that reads messages from the stream, and writes them out to a new stream with the required number of partitions. For example, you could read messages from PageViewEvent, and write them to PageViewEventRepartition.</p>
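
 <p>Such a repartitioning job can be a trivial StreamTask. In this hypothetical sketch, the output stream name and the use of the incoming message&#39;s key as the partition key are assumptions for illustration:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text">public class RepartitionTask implements StreamTask {
   // Assumed output topic; create it with the desired number of partitions
   private static final SystemStream OUTPUT =
       new SystemStream(&quot;kafka&quot;, &quot;PageViewEventRepartition&quot;);
 
   public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
     // Re-emit each message unchanged; its key determines the new partition
     collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getKey(), envelope.getMessage()));
   }
 }
 </code></pre></div>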
 
-<h2 id="toc_3"><a href="streams.html">Streams &raquo;</a></h2>
+<h2 id="streams-&raquo;"><a href="streams.html">Streams &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/serialization.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/serialization.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/serialization.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/serialization.html Wed Jun 18 05:50:54 2014
@@ -131,7 +131,7 @@
 </ol>
 
 <p>You can use whatever makes sense for your job; Samza doesn&rsquo;t impose any particular data model or serialization scheme on you. However, the cleanest solution is usually to use Samza&rsquo;s serde layer. The following configuration example shows how to use it.</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># Define a system called &quot;kafka&quot;
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Define a system called &quot;kafka&quot;
 systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
 
 # The job is going to consume a topic called &quot;PageViewEvent&quot; from the &quot;kafka&quot; system
@@ -163,7 +163,7 @@ stores.LastPageViewPerUser.msg.serde=jso
 
 <p>All the Samza APIs for sending and receiving messages are typed as <em>Object</em>. This means that you have to cast messages to the correct type before you can use them. It&rsquo;s a little bit more code, but it has the advantage that Samza is not restricted to any particular data model.</p>
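
 <p>For example, if your input uses the json serde configured above, the received object can be cast before use. This snippet assumes the json serde decodes JSON objects into maps, and the field name is illustrative:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text">public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
   // Cast the deserialized message to the type you expect
   Map&lt;String, Object&gt; event = (Map&lt;String, Object&gt;) envelope.getMessage();
   String userId = (String) event.get(&quot;user-id&quot;);
 }
 </code></pre></div>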
 
-<h2 id="toc_0"><a href="checkpointing.html">Checkpointing &raquo;</a></h2>
+<h2 id="checkpointing-&raquo;"><a href="checkpointing.html">Checkpointing &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/state-management.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/state-management.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/state-management.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/state-management.html Wed Jun 18 05:50:54 2014
@@ -128,11 +128,11 @@
 
 <p>However, being able to maintain state opens up many possibilities for sophisticated stream processing jobs: joining input streams, grouping messages and aggregating groups of messages. By analogy to SQL, the <em>select</em> and <em>where</em> clauses of a query are usually stateless, but <em>join</em>, <em>group by</em> and aggregation functions like <em>sum</em> and <em>count</em> require state. Samza doesn&rsquo;t yet provide a higher-level SQL-like language, but it does provide lower-level primitives that you can use to implement streaming aggregation and joins.</p>
 
-<h3 id="toc_0">Common use cases for stateful processing</h3>
+<h3 id="common-use-cases-for-stateful-processing">Common use cases for stateful processing</h3>
 
 <p>First, let&rsquo;s look at some simple examples of stateful stream processing that might be seen in the backend of a consumer website. Later in this page we&rsquo;ll discuss how to implement these applications using Samza&rsquo;s built-in key-value storage capabilities.</p>
 
-<h4 id="toc_1">Windowed aggregation</h4>
+<h4 id="windowed-aggregation">Windowed aggregation</h4>
 
 <p><em>Example: Counting the number of page views for each user per hour</em></p>
 
@@ -140,7 +140,7 @@
 
 <p>The simplest implementation keeps this state in memory (e.g. a hash map in the task instances), and writes it to a database or output stream at the end of every time window. However, you need to consider what happens when a container fails and your in-memory state is lost. You might be able to restore it by processing all the messages in the current window again, but that might take a long time if the window covers a long period of time. Samza can speed up this recovery by making the state fault-tolerant rather than trying to recompute it.</p>
 
-<h4 id="toc_2">Table-table join</h4>
+<h4 id="table-table-join">Table-table join</h4>
 
 <p><em>Example: Join a table of user profiles to a table of user settings by user_id and emit the joined stream</em></p>
 
@@ -158,7 +158,7 @@
 
 <p>Each of these use cases is a massively complex data normalization problem that can be thought of as constructing a materialized view over many input tables. Samza can help implement such data processing pipelines robustly.</p>
 
-<h4 id="toc_3">Stream-table join</h4>
+<h4 id="stream-table-join">Stream-table join</h4>
 
 <p><em>Example: Augment a stream of page view events with the user&rsquo;s ZIP code (perhaps to allow aggregation by zip code in a later stage)</em></p>
 
@@ -166,7 +166,7 @@
 
 <p>In data warehouse terminology, you can think of the raw event stream as rows in the central fact table, which needs to be joined with dimension tables so that you can use attributes of the dimensions in your analysis.</p>
 
-<h4 id="toc_4">Stream-stream join</h4>
+<h4 id="stream-stream-join">Stream-stream join</h4>
 
 <p><em>Example: Join a stream of ad clicks to a stream of ad impressions (to link the information on when the ad was shown to the information on when it was clicked)</em></p>
 
@@ -174,21 +174,21 @@
 
 <p>In order to perform a join between streams, your job needs to buffer events for the time window over which you want to join. For short time windows, you can do this in memory (at the risk of losing events if the machine fails). You can also use Samza&rsquo;s state store to buffer events, which supports buffering more messages than you can fit in memory.</p>
 
-<h4 id="toc_5">More</h4>
+<h4 id="more">More</h4>
 
 <p>There are many variations of joins and aggregations, but most are essentially variations and combinations of the above patterns.</p>
 
-<h3 id="toc_6">Approaches to managing task state</h3>
+<h3 id="approaches-to-managing-task-state">Approaches to managing task state</h3>
 
 <p>So how do systems support this kind of stateful processing? We&rsquo;ll lead in by describing what we have seen in other stream processing systems, and then describe what Samza does.</p>
 
-<h4 id="toc_7">In-memory state with checkpointing</h4>
+<h4 id="in-memory-state-with-checkpointing">In-memory state with checkpointing</h4>
 
 <p>A simple approach, common in academic stream processing systems, is to periodically save the task&rsquo;s entire in-memory data to durable storage. This approach works well if the in-memory state consists of only a few values. However, you have to store the complete task state on each checkpoint, which becomes increasingly expensive as task state grows. Unfortunately, many non-trivial use cases for joins and aggregation have large amounts of state &mdash; often many gigabytes. This makes full dumps of the state impractical.</p>
 
 <p>Some academic systems produce <em>diffs</em> in addition to full checkpoints, which are smaller if only some of the state has changed since the last checkpoint. <a href="../comparisons/storm.html">Storm&rsquo;s Trident abstraction</a> similarly keeps an in-memory cache of state, and periodically writes any changes to a remote store such as Cassandra. However, this optimization only helps if most of the state remains unchanged. In some use cases, such as stream joins, it is normal to have a lot of churn in the state, so this technique essentially degrades to making a remote database request for every message (see below).</p>
 
-<h4 id="toc_8">Using an external store</h4>
+<h4 id="using-an-external-store">Using an external store</h4>
 
 <p>Another common pattern for stateful processing is to store the state in an external database or key-value store. Conventional database replication can be used to make that database fault-tolerant. The architecture looks something like this:</p>
 
@@ -204,7 +204,7 @@
 <li><strong>Reprocessing</strong>: Sometimes it can be useful to re-run a stream process on a large amount of historical data, e.g. after updating your processing task&rsquo;s code. However, the issues above make this impractical for jobs that make external queries.</li>
 </ol>
 
-<h3 id="toc_9">Local state in Samza</h3>
+<h3 id="local-state-in-samza">Local state in Samza</h3>
 
 <p>Samza allows tasks to maintain state in a way that is different from the approaches described above:</p>
 
@@ -235,7 +235,7 @@
 
 <p>Nothing prevents you from using an external database if you want to, but for many use cases, Samza&rsquo;s local state is a powerful tool for enabling stateful stream processing.</p>
 
-<h3 id="toc_10">Key-value storage</h3>
+<h3 id="key-value-storage">Key-value storage</h3>
 
 <p>Any storage engine can be plugged into Samza, as described below. Out of the box, Samza ships with a key-value store implementation that is built on <a href="https://code.google.com/p/leveldb">LevelDB</a> using a <a href="https://github.com/fusesource/leveldbjni">JNI API</a>.</p>
 
@@ -244,7 +244,7 @@
 <p>Samza includes an additional in-memory caching layer in front of LevelDB, which avoids the cost of deserialization for frequently-accessed objects and batches writes. If the same key is updated multiple times in quick succession, the batching coalesces those updates into a single write. The writes are flushed to the changelog when a task <a href="checkpointing.html">commits</a>.</p>
 
 <p>To use a key-value store in your job, add the following to your job config:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># Use the key-value store implementation for a store called &quot;my-store&quot;
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Use the key-value store implementation for a store called &quot;my-store&quot;
 stores.my-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
 
 # Use the Kafka topic &quot;my-store-changelog&quot; as the changelog stream for this store.
@@ -260,7 +260,7 @@ stores.my-store.msg.serde=string
 <p>See the <a href="serialization.html">serialization section</a> for more information on the <em>serde</em> options.</p>
 
 <p>Here is a simple example that writes every incoming message to the store:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">public class MyStatefulTask implements StreamTask, InitableTask {
+<div class="highlight"><pre><code class="language-text" data-lang="text">public class MyStatefulTask implements StreamTask, InitableTask {
   private KeyValueStore&lt;String, String&gt; store;
 
   public void init(Config config, TaskContext context) {
@@ -275,7 +275,7 @@ stores.my-store.msg.serde=string
 }
 </code></pre></div>
 <p>Here is the complete key-value store API:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">public interface KeyValueStore&lt;K, V&gt; {
+<div class="highlight"><pre><code class="language-text" data-lang="text">public interface KeyValueStore&lt;K, V&gt; {
   V get(K key);
   void put(K key, V value);
   void putAll(List&lt;Entry&lt;K,V&gt;&gt; entries);
@@ -285,7 +285,7 @@ stores.my-store.msg.serde=string
 }
 </code></pre></div>
 <p>Here is a list of additional configurations accepted by the key-value store, along with their default values:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># The number of writes to batch together
+<div class="highlight"><pre><code class="language-text" data-lang="text"># The number of writes to batch together
 stores.my-store.write.batch.size=500
 
 # The number of objects to keep in Samza&#39;s cache (in front of LevelDB).
@@ -312,11 +312,11 @@ stores.my-store.leveldb.compression=snap
 # to change this unless you are a compulsive fiddler.
 stores.my-store.leveldb.block.size.bytes=4096
 </code></pre></div>
-<h3 id="toc_11">Implementing common use cases with the key-value store</h3>
+<h3 id="implementing-common-use-cases-with-the-key-value-store">Implementing common use cases with the key-value store</h3>
 
 <p>Earlier in this section we discussed some example use cases for stateful stream processing. Let&rsquo;s look at how each of these could be implemented using a key-value storage engine such as Samza&rsquo;s LevelDB.</p>
 
-<h4 id="toc_12">Windowed aggregation</h4>
+<h4 id="windowed-aggregation">Windowed aggregation</h4>
 
 <p><em>Example: Counting the number of page views for each user per hour</em></p>
 
@@ -329,13 +329,13 @@ stores.my-store.leveldb.block.size.bytes
 
 <p>Note that this job effectively pauses at the hour mark to output its results. This is totally fine for Samza, as scanning over the contents of the key-value store is quite fast. The input stream is buffered while the job is doing this hourly work.</p>
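
 <p>Here is a sketch of this pattern, assuming a store configured as in the examples above, input partitioned by user ID, and task.window.ms set to one hour; the store, stream, and key names are illustrative:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text">public class PageViewCounterTask implements StreamTask, WindowableTask, InitableTask {
   private KeyValueStore&lt;String, Integer&gt; store;
 
   public void init(Config config, TaskContext context) {
     store = (KeyValueStore&lt;String, Integer&gt;) context.getStore(&quot;my-store&quot;);
   }
 
   public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
     String userId = (String) envelope.getKey(); // assumed: input keyed by user ID
     Integer count = store.get(userId);
     store.put(userId, count == null ? 1 : count + 1);
   }
 
   public void window(MessageCollector collector, TaskCoordinator coordinator) {
     // Called every task.window.ms: emit all counters, then reset them
     KeyValueIterator&lt;String, Integer&gt; entries = store.all();
     while (entries.hasNext()) {
       Entry&lt;String, Integer&gt; entry = entries.next();
       collector.send(new OutgoingMessageEnvelope(
           new SystemStream(&quot;kafka&quot;, &quot;UserPageViewCounts&quot;), entry.getKey(), entry.getValue()));
       store.delete(entry.getKey());
     }
     entries.close();
   }
 }
 </code></pre></div>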
 
-<h4 id="toc_13">Table-table join</h4>
+<h4 id="table-table-join">Table-table join</h4>
 
 <p><em>Example: Join a table of user profiles to a table of user settings by user_id and emit the joined stream</em></p>
 
 <p>Implementation: The job subscribes to the change streams for the user profiles database and the user settings database, both partitioned by user_id. The job keeps a key-value store keyed by user_id, which contains the latest profile record and the latest settings record for each user_id. When a new event comes in from either stream, the job looks up the current value in its store, updates the appropriate fields (depending on whether it was a profile update or a settings update), and writes back the new joined record to the store. The changelog of the store doubles as the output stream of the task.</p>
 
-<h4 id="toc_14">Table-stream join</h4>
+<h4 id="table-stream-join">Table-stream join</h4>
 
 <p><em>Example: Augment a stream of page view events with the user&rsquo;s ZIP code (perhaps to allow aggregation by zip code in a later stage)</em></p>
 
@@ -343,7 +343,7 @@ stores.my-store.leveldb.block.size.bytes
 
 <p>If the next stage needs to aggregate by ZIP code, the ZIP code can be used as the partitioning key of the job&rsquo;s output stream. That ensures that all the events for the same ZIP code are sent to the same stream partition.</p>
 
-<h4 id="toc_15">Stream-stream join</h4>
+<h4 id="stream-stream-join">Stream-stream join</h4>
 
 <p><em>Example: Join a stream of ad clicks to a stream of ad impressions (to link the information on when the ad was shown to the information on when it was clicked)</em></p>
 
@@ -351,13 +351,13 @@ stores.my-store.leveldb.block.size.bytes
 
 <p>Implementation: Partition the ad click and ad impression streams by the impression ID or user ID (assuming that two events with the same impression ID always have the same user ID). The task keeps two stores, one containing click events and one containing impression events, using the impression ID as key for both stores. When the job receives a click event, it looks for the corresponding impression in the impression store, and vice versa. If a match is found, the joined pair is emitted and the entry is deleted. If no match is found, the event is written to the appropriate store. Periodically the job scans over both stores and deletes any old events that were not matched within the time window of the join.</p>
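
 <p>A condensed sketch of the matching logic follows; the store and stream names are assumptions, and the periodic cleanup of old unmatched events is omitted:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text">public class AdJoinTask implements StreamTask, InitableTask {
   private KeyValueStore&lt;String, String&gt; clicks;
   private KeyValueStore&lt;String, String&gt; impressions;
 
   public void init(Config config, TaskContext context) {
     clicks = (KeyValueStore&lt;String, String&gt;) context.getStore(&quot;clicks&quot;);
     impressions = (KeyValueStore&lt;String, String&gt;) context.getStore(&quot;impressions&quot;);
   }
 
   public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
     String impressionId = (String) envelope.getKey();
     boolean isClick = envelope.getSystemStreamPartition().getStream().equals(&quot;AdClickEvent&quot;);
     // Look for the counterpart event in the opposite store
     String match = isClick ? impressions.get(impressionId) : clicks.get(impressionId);
     if (match != null) {
       // Matched: emit both events as a joined record (naive string concatenation)
       collector.send(new OutgoingMessageEnvelope(
           new SystemStream(&quot;kafka&quot;, &quot;AdClickImpressionJoin&quot;), impressionId, match + &quot;|&quot; + envelope.getMessage()));
       (isClick ? impressions : clicks).delete(impressionId);
     } else {
       // No match yet: buffer this event until its counterpart arrives
       (isClick ? clicks : impressions).put(impressionId, (String) envelope.getMessage());
     }
   }
 }
 </code></pre></div>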
 
-<h3 id="toc_16">Other storage engines</h3>
+<h3 id="other-storage-engines">Other storage engines</h3>
 
 <p>Samza&rsquo;s fault-tolerance mechanism (sending a local store&rsquo;s writes to a replicated changelog) is completely decoupled from the storage engine&rsquo;s data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the <a href="../api/javadocs/org/apache/samza/storage/StorageEngine.html">StorageEngine</a> interface. Samza&rsquo;s model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task. </p>
 
 <p>Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), <a href="http://infolab.stanford.edu/%7Eullman/mmds/ch4.pdf">approximate algorithms</a> such as <a href="http://en.wikipedia.org/wiki/Bloom_filter">bloom filters</a> and <a href="http://research.google.com/pubs/pub40671.html">hyperloglog</a>, or full-text indexes such as <a href="http://lucene.apache.org">Lucene</a>. (Patches accepted!)</p>
 
-<h3 id="toc_17">Fault tolerance semantics with state</h3>
+<h3 id="fault-tolerance-semantics-with-state">Fault tolerance semantics with state</h3>
 
 <p>As discussed in the section on <a href="checkpointing.html">checkpointing</a>, Samza currently only supports at-least-once delivery guarantees in the presence of failure (this is sometimes referred to as &ldquo;guaranteed delivery&rdquo;). This means that if a task fails, no messages are lost, but some messages may be redelivered.</p>
 
@@ -365,7 +365,7 @@ stores.my-store.leveldb.block.size.bytes
 
 <p>However, for non-idempotent operations such as counting, at-least-once delivery guarantees can give incorrect results. If a Samza task fails and is restarted, it may double-count some messages that were processed shortly before the failure. We are planning to address this limitation in a future release of Samza.</p>
 
-<h2 id="toc_18"><a href="metrics.html">Metrics &raquo;</a></h2>
+<h2 id="metrics-&raquo;"><a href="metrics.html">Metrics &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/streams.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/streams.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/streams.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/streams.html Wed Jun 18 05:50:54 2014
@@ -123,7 +123,7 @@
 -->
 
 <p>The <a href="samza-container.html">samza container</a> reads and writes messages using the <a href="../api/javadocs/org/apache/samza/system/SystemConsumer.html">SystemConsumer</a> and <a href="../api/javadocs/org/apache/samza/system/SystemProducer.html">SystemProducer</a> interfaces. You can integrate any message broker with Samza by implementing these two interfaces.</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">public interface SystemConsumer {
+<div class="highlight"><pre><code class="language-text" data-lang="text">public interface SystemConsumer {
   void start();
 
   void stop();
@@ -171,25 +171,25 @@ public class OutgoingMessageEnvelope {
 
 <p>The job configuration file can include properties that are specific to a particular consumer and producer implementation. For example, the configuration would typically indicate the hostname and port of the message broker to use, and perhaps connection options.</p>
 
-<h3 id="toc_0">How streams are processed</h3>
+<h3 id="how-streams-are-processed">How streams are processed</h3>
 
 <p>If a job is consuming messages from more than one input stream, and all input streams have messages available, messages are processed in a round-robin fashion by default. For example, if a job is consuming AdImpressionEvent and AdClickEvent, the task instance&rsquo;s process() method is called with a message from AdImpressionEvent, then a message from AdClickEvent, then another message from AdImpressionEvent, &hellip; and continues to alternate between the two.</p>
 
 <p>If one of the input streams has no new messages available (the most recent message has already been consumed), that stream is skipped, and the job continues to consume from the other inputs. The job keeps checking the empty stream, and resumes consuming from it as soon as new messages become available.</p>
 
-<h4 id="toc_1">MessageChooser</h4>
+<h4 id="messagechooser">MessageChooser</h4>
 
 <p>When a Samza container has several incoming messages on different stream partitions, how does it decide which to process first? The behavior is determined by a <a href="../api/javadocs/org/apache/samza/system/chooser/MessageChooser.html">MessageChooser</a>. The default chooser is RoundRobinChooser, but you can override it by implementing a custom chooser.</p>
 
 <p>To plug in your own message chooser, you need to implement the <a href="../api/javadocs/org/apache/samza/system/chooser/MessageChooserFactory.html">MessageChooserFactory</a> interface, and set the &ldquo;task.chooser.class&rdquo; configuration to the fully-qualified class name of your implementation:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">task.chooser.class=com.example.samza.YourMessageChooserFactory
+<div class="highlight"><pre><code class="language-text" data-lang="text">task.chooser.class=com.example.samza.YourMessageChooserFactory
 </code></pre></div>
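 <p>For illustration, here is a minimal sketch of a chooser that always prefers a stream named &ldquo;urgent&rdquo; (the stream and class names are hypothetical; see the linked javadocs for the authoritative interface definitions):</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text">public class UrgentFirstChooser implements MessageChooser {
   private final Deque&lt;IncomingMessageEnvelope&gt; urgent = new ArrayDeque&lt;IncomingMessageEnvelope&gt;();
   private final Deque&lt;IncomingMessageEnvelope&gt; normal = new ArrayDeque&lt;IncomingMessageEnvelope&gt;();
 
   public void update(IncomingMessageEnvelope envelope) {
     String stream = envelope.getSystemStreamPartition().getStream();
     (&quot;urgent&quot;.equals(stream) ? urgent : normal).addLast(envelope);
   }
 
   public IncomingMessageEnvelope choose() {
     // Returning null tells the container that no message is preferred yet.
     return !urgent.isEmpty() ? urgent.pollFirst() : normal.pollFirst();
   }
 
   public void start() {}
   public void stop() {}
   public void register(SystemStreamPartition systemStreamPartition, String offset) {}
 }
 
 public class YourMessageChooserFactory implements MessageChooserFactory {
   public MessageChooser getChooser(Config config, MetricsRegistry registry) {
     return new UrgentFirstChooser();
   }
 }
 </code></pre></div>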
-<h4 id="toc_2">Prioritizing input streams</h4>
+<h4 id="prioritizing-input-streams">Prioritizing input streams</h4>
 
 <p>There are certain times when messages from one stream should be processed with higher priority than messages from another stream. For example, some Samza jobs consume two streams: one stream is fed by a real-time system and the other stream is fed by a batch system. In this case, it&rsquo;s useful to prioritize the real-time stream over the batch stream, so that the real-time processing doesn&rsquo;t slow down if there is a sudden burst of data on the batch stream.</p>
 
 <p>Samza provides a mechanism to prioritize one stream over another by setting this configuration parameter: systems.&lt;system&gt;.streams.&lt;stream&gt;.samza.priority=&lt;number&gt;. For example:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">systems.kafka.streams.my-real-time-stream.samza.priority=2
+<div class="highlight"><pre><code class="language-text" data-lang="text">systems.kafka.streams.my-real-time-stream.samza.priority=2
 systems.kafka.streams.my-batch-stream.samza.priority=1
 </code></pre></div>
 <p>This declares that my-real-time-stream&rsquo;s messages should be processed with higher priority than my-batch-stream&rsquo;s messages. If my-real-time-stream has any messages available, they are processed first. Only if there are no messages currently waiting on my-real-time-stream does the Samza job continue processing my-batch-stream.</p>
@@ -198,7 +198,7 @@ systems.kafka.streams.my-batch-stream.sa
 
 <p>It&rsquo;s also valid to define priorities for only some streams. All non-prioritized streams are treated as having the lowest priority, and share a MessageChooser.</p>
 
-<h4 id="toc_3">Bootstrapping</h4>
+<h4 id="bootstrapping">Bootstrapping</h4>
 
 <p>Sometimes, a Samza job needs to fully consume a stream (from offset 0 up to the most recent message) before it processes messages from any other stream. This is useful in situations where the stream contains some prerequisite data that the job needs, and it doesn&rsquo;t make sense to process messages from other streams until the job has loaded that prerequisite data. Samza supports this use case with <em>bootstrap streams</em>.</p>
 
@@ -207,7 +207,7 @@ systems.kafka.streams.my-batch-stream.sa
 <p>Another difference between a bootstrap stream and a high-priority stream is that the bootstrap stream&rsquo;s special treatment is temporary: when it has been fully consumed (we say it has &ldquo;caught up&rdquo;), its priority drops to be the same as all the other input streams.</p>
 
 <p>To configure a stream called &ldquo;my-bootstrap-stream&rdquo; to be a fully-consumed bootstrap stream, use the following settings:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">systems.kafka.streams.my-bootstrap-stream.samza.bootstrap=true
+<div class="highlight"><pre><code class="language-text" data-lang="text">systems.kafka.streams.my-bootstrap-stream.samza.bootstrap=true
 systems.kafka.streams.my-bootstrap-stream.samza.reset.offset=true
 systems.kafka.streams.my-bootstrap-stream.samza.offset.default=oldest
 </code></pre></div>
@@ -215,16 +215,16 @@ systems.kafka.streams.my-bootstrap-strea
 
 <p>It is valid to define multiple bootstrap streams. In this case, the order in which they are bootstrapped is determined by the priority.</p>
 
-<h4 id="toc_4">Batching</h4>
+<h4 id="batching">Batching</h4>
 
 <p>In some cases, you can improve performance by consuming several messages from the same stream partition in sequence. Samza supports this mode of operation, called <em>batching</em>.</p>
 
 <p>For example, if you want to read 100 messages in a row from each stream partition (regardless of the MessageChooser), you can use this configuration parameter:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">task.consumer.batch.size=100
+<div class="highlight"><pre><code class="language-text" data-lang="text">task.consumer.batch.size=100
 </code></pre></div>
 <p>With this setting, Samza tries to read a message from the most recently used <a href="../api/javadocs/org/apache/samza/system/SystemStreamPartition.html">SystemStreamPartition</a>. This behavior continues either until no more messages are available for that SystemStreamPartition, or until the batch size has been reached. When that happens, Samza defers to the MessageChooser to determine the next message to process. It then continues consuming from the chosen message&rsquo;s SystemStreamPartition until the batch size is again reached.</p>
 
-<h2 id="toc_5"><a href="serialization.html">Serialization &raquo;</a></h2>
+<h2 id="serialization-&raquo;"><a href="serialization.html">Serialization &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/container/windowing.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/windowing.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/container/windowing.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/container/windowing.html Wed Jun 18 05:50:54 2014
@@ -125,13 +125,13 @@
 <p>Sometimes a stream processing job needs to do something in regular time intervals, regardless of how many incoming messages the job is processing. For example, say you want to report the number of page views per minute. To do this, you increment a counter every time you see a page view event. Once per minute, you send the current counter value to an output stream and reset the counter to zero.</p>
 
 <p>Samza&rsquo;s <em>windowing</em> feature provides a way for tasks to do something in regular time intervals, for example once per minute. To enable windowing, you just need to set one property in your job configuration:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># Call the window() method every 60 seconds
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Call the window() method every 60 seconds
 task.window.ms=60000
 </code></pre></div>
 <p>Next, your stream task needs to implement the <a href="../api/javadocs/org/apache/samza/task/WindowableTask.html">WindowableTask</a> interface. This interface defines a window() method which is called by Samza in the regular interval that you configured.</p>
 
 <p>For example, this is how you would implement a basic per-minute event counter:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">public class EventCounterTask implements StreamTask, WindowableTask {
+<div class="highlight"><pre><code class="language-text" data-lang="text">public class EventCounterTask implements StreamTask, WindowableTask {
 
   public static final SystemStream OUTPUT_STREAM =
     new SystemStream(&quot;kafka&quot;, &quot;events-per-minute&quot;);
@@ -155,7 +155,7 @@ task.window.ms=60000
 
 <p>Note that Samza uses <a href="event-loop.html">single-threaded execution</a>, so the window() call can never happen concurrently with a process() call. This has the advantage that you don&rsquo;t need to worry about thread safety in your code (no need to synchronize anything), but the downside that the window() call may be delayed if your process() method takes a long time to return.</p>
 
-<h2 id="toc_0"><a href="event-loop.html">Event Loop &raquo;</a></h2>
+<h2 id="event-loop-&raquo;"><a href="event-loop.html">Event Loop &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/introduction/architecture.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/introduction/architecture.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/introduction/architecture.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/introduction/architecture.html Wed Jun 18 05:50:54 2014
@@ -148,7 +148,7 @@
 
 <p>Before going in-depth on each of these three layers, it should be noted that Samza&rsquo;s support is not limited to Kafka and YARN. Both Samza&rsquo;s execution and streaming layers are pluggable, allowing developers to implement alternatives if they prefer.</p>
 
-<h3 id="toc_0">Kafka</h3>
+<h3 id="kafka">Kafka</h3>
 
 <p><a href="http://kafka.apache.org/">Kafka</a> is a distributed pub/sub and message queueing system that provides at-least once messaging guarantees (i.e. the system guarantees that no messages are lost, but in certain fault scenarios, a consumer might receive the same message more than once), and highly available partitions (i.e. a stream&rsquo;s partitions continue to be available even if a machine goes down).</p>
 
@@ -163,7 +163,7 @@
 
 <p>For more details on Kafka, see Kafka&rsquo;s <a href="http://kafka.apache.org/documentation.html">documentation</a> pages.</p>
 
-<h3 id="toc_1">YARN</h3>
+<h3 id="yarn">YARN</h3>
 
 <p><a href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a> (Yet Another Resource Negotiator) is Hadoop&rsquo;s next-generation cluster scheduler. It allows you to allocate a number of <em>containers</em> (processes) in a cluster of machines, and execute arbitrary commands on them.</p>
 
@@ -178,11 +178,11 @@
 
 <p>Samza uses YARN to manage deployment, fault tolerance, logging, resource isolation, security, and locality. A brief overview of YARN follows; see <a href="http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/">this page from Hortonworks</a> for a much more detailed overview.</p>
 
-<h4 id="toc_2">YARN Architecture</h4>
+<h4 id="yarn-architecture">YARN Architecture</h4>
 
 <p>YARN has three important pieces: a <em>ResourceManager</em>, a <em>NodeManager</em>, and an <em>ApplicationMaster</em>. In a YARN grid, every machine runs a NodeManager, which is responsible for launching processes on that machine. A ResourceManager talks to all of the NodeManagers to tell them what to run. Applications, in turn, talk to the ResourceManager when they wish to run something on the cluster. The third piece, the ApplicationMaster, is actually application-specific code that runs in the YARN cluster. It&rsquo;s responsible for managing the application&rsquo;s workload, asking for containers (usually UNIX processes), and handling notifications when one of its containers fails.</p>
 
-<h4 id="toc_3">Samza and YARN</h4>
+<h4 id="samza-and-yarn">Samza and YARN</h4>
 
 <p>Samza provides a YARN ApplicationMaster and a YARN job runner out of the box. The integration between Samza and YARN is outlined in the following diagram (different colors indicate different host machines):</p>
 
@@ -190,7 +190,7 @@
 
 <p>The Samza client talks to the YARN RM when it wants to start a new Samza job. The YARN RM talks to a YARN NM to allocate space on the cluster for Samza&rsquo;s ApplicationMaster. Once the NM allocates space, it starts the Samza AM. After the Samza AM starts, it asks the YARN RM for one or more YARN containers to run <a href="../container/samza-container.html">SamzaContainers</a>. Again, the RM works with NMs to allocate space for the containers. Once the space has been allocated, the NMs start the Samza containers.</p>
 
-<h3 id="toc_4">Samza</h3>
+<h3 id="samza">Samza</h3>
 
 <p>Samza uses YARN and Kafka to provide a framework for stage-wise stream processing and partitioning. Everything, put together, looks like this (different colors indicate different host machines):</p>
 
@@ -198,10 +198,10 @@
 
 <p>The Samza client uses YARN to run a Samza job: YARN starts and supervises one or more <a href="../container/samza-container.html">SamzaContainers</a>, and your processing code (using the <a href="../api/overview.html">StreamTask</a> API) runs inside those containers. The input and output for the Samza StreamTasks come from Kafka brokers that are (usually) co-located on the same machines as the YARN NMs.</p>
 
-<h3 id="toc_5">Example</h3>
+<h3 id="example">Example</h3>
 
 <p>Let&rsquo;s take a look at a real example: suppose we want to count the number of page views. In SQL, you would write something like:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">SELECT user_id, COUNT(*) FROM PageViewEvent GROUP BY user_id
+<div class="highlight"><pre><code class="language-text" data-lang="text">SELECT user_id, COUNT(*) FROM PageViewEvent GROUP BY user_id
 </code></pre></div>
 <p>Although Samza doesn&rsquo;t support SQL right now, the idea is the same. Two jobs are required to calculate this query: one to group messages by user ID, and the other to do the counting.</p>
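 
 <p>To make the second job concrete, here is a rough sketch of the counting task (the class and stream names are hypothetical), combining the StreamTask API with <a href="../container/windowing.html">windowing</a> so that counts are emitted periodically:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text">public class UserPageViewCounterTask implements StreamTask, WindowableTask {
   private final Map&lt;String, Integer&gt; counts = new HashMap&lt;String, Integer&gt;();
 
   public void process(IncomingMessageEnvelope envelope,
                       MessageCollector collector,
                       TaskCoordinator coordinator) {
     // The first job grouped PageViewEvent by user_id, so all events for a
     // given user arrive at the same task instance.
     String userId = (String) envelope.getKey();
     Integer count = counts.get(userId);
     counts.put(userId, count == null ? 1 : count + 1);
   }
 
   public void window(MessageCollector collector, TaskCoordinator coordinator) {
     // Periodically emit the running count for each user.
     for (Map.Entry&lt;String, Integer&gt; entry : counts.entrySet()) {
       collector.send(new OutgoingMessageEnvelope(
           new SystemStream(&quot;kafka&quot;, &quot;page-view-counts&quot;), entry.getKey(), entry.getValue()));
     }
   }
 }
 </code></pre></div>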
 
@@ -215,7 +215,7 @@
 
 <p>By partitioning topics, and by breaking a stream process down into jobs and parallel tasks that run on multiple machines, Samza scales to streams with very high message throughput. By using YARN and Kafka, Samza achieves fault-tolerance: if a process or machine fails, it is automatically restarted on another machine and continues processing messages from the point where it left off.</p>
 
-<h2 id="toc_6"><a href="../comparisons/introduction.html">Comparison Introduction &raquo;</a></h2>
+<h2 id="comparison-introduction-&raquo;"><a href="../comparisons/introduction.html">Comparison Introduction &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/introduction/background.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/introduction/background.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/introduction/background.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/introduction/background.html Wed Jun 18 05:50:54 2014
@@ -124,7 +124,7 @@
 
 <p>This page provides some background about stream processing, describes what Samza is, and why it was built.</p>
 
-<h3 id="toc_0">What is messaging?</h3>
+<h3 id="what-is-messaging?">What is messaging?</h3>
 
 <p>Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (ActiveMQ, RabbitMQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream <em>consumers</em> read messages from these systems, and process them or take actions based on the message contents.</p>
 
@@ -140,7 +140,7 @@
 
 <p>A messaging system lets you decouple all of this work from the actual web page serving.</p>
 
-<h3 id="toc_1">What is stream processing?</h3>
+<h3 id="what-is-stream-processing?">What is stream processing?</h3>
 
 <p>A messaging system is a fairly low-level piece of infrastructure&mdash;it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems.</p>
 
@@ -148,7 +148,7 @@
 
 <p>Stream processing is a higher level of abstraction on top of messaging systems, and it&rsquo;s meant to address precisely this category of problems.</p>
 
-<h3 id="toc_2">Samza</h3>
+<h3 id="samza">Samza</h3>
 
 <p>Samza is a stream processing framework with the following features:</p>
 
@@ -162,7 +162,7 @@
 <li><strong>Processor isolation:</strong> Samza works with Apache YARN, which supports Hadoop&rsquo;s security model, and resource isolation through Linux cgroups.</li>
 </ul>
 
-<h3 id="toc_3">Alternatives</h3>
+<h3 id="alternatives">Alternatives</h3>
 
 <p>The available open source stream processing systems are actually quite young, and no single system offers a complete solution. New problems in this area include: how a stream processor&rsquo;s state should be managed, whether or not a stream should be buffered remotely on disk, what to do when duplicate messages are received or messages are lost, and how to model underlying messaging systems.</p>
 
@@ -177,7 +177,7 @@
 
 <p>For a more in-depth discussion on Samza, and how it relates to other stream processing systems, have a look at Samza&rsquo;s <a href="../comparisons/introduction.html">Comparisons</a> documentation.</p>
 
-<h2 id="toc_4"><a href="concepts.html">Concepts &raquo;</a></h2>
+<h2 id="concepts-&raquo;"><a href="concepts.html">Concepts &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/introduction/concepts.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/introduction/concepts.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/introduction/concepts.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/introduction/concepts.html Wed Jun 18 05:50:54 2014
@@ -124,7 +124,7 @@
 
 <p>This page gives an introduction to the high-level concepts in Samza.</p>
 
-<h3 id="toc_0">Streams</h3>
+<h3 id="streams">Streams</h3>
 
 <p>Samza processes <em>streams</em>. A stream is composed of immutable <em>messages</em> of a similar type or category. For example, a stream could be all the clicks on a website, or all the updates to a particular database table, or all the logs produced by a service, or any other type of event data. Messages can be appended to a stream or read from a stream. A stream can have any number of <em>consumers</em>, and reading from a stream doesn&rsquo;t delete the message (so each message is effectively broadcast to all consumers). Messages can optionally have an associated key which is used for partitioning, which we&rsquo;ll talk about in a second.</p>
 
@@ -132,13 +132,13 @@
 
 <p><img src="/img/0.7.0/learn/documentation/introduction/job.png" alt="job"></p>
 
-<h3 id="toc_1">Jobs</h3>
+<h3 id="jobs">Jobs</h3>
 
 <p>A Samza <em>job</em> is code that performs a logical transformation on a set of input streams to append output messages to a set of output streams.</p>
 
 <p>If scalability were not a concern, streams and jobs would be all we need. However, in order to scale the throughput of the stream processor, we chop streams and jobs up into smaller units of parallelism: <em>partitions</em> and <em>tasks</em>.</p>
 
-<h3 id="toc_2">Partitions</h3>
+<h3 id="partitions">Partitions</h3>
 
 <p>Each stream is broken into one or more partitions. Each partition in the stream is a totally ordered sequence of messages.</p>
 
@@ -148,7 +148,7 @@
 
 <p><img src="/img/0.7.0/learn/documentation/introduction/stream.png" alt="stream"></p>
 
-<h3 id="toc_3">Tasks</h3>
+<h3 id="tasks">Tasks</h3>
 
 <p>A job is scaled by breaking it into multiple <em>tasks</em>. The <em>task</em> is the unit of parallelism of the job, just as the partition is to the stream. Each task consumes data from one partition for each of the job&rsquo;s input streams.</p>
 
@@ -160,7 +160,7 @@
 
 <p><img src="/img/0.7.0/learn/documentation/introduction/job_detail.png" alt="job-detail"></p>
 
-<h3 id="toc_4">Dataflow Graphs</h3>
+<h3 id="dataflow-graphs">Dataflow Graphs</h3>
 
 <p>We can compose multiple jobs to create a dataflow graph, where the nodes are streams containing data, and the edges are jobs performing transformations. This composition is done purely through the streams the jobs take as input and output. The jobs are otherwise totally decoupled: they need not be implemented in the same code base, and adding, removing, or restarting a downstream job will not impact an upstream job.</p>
 
@@ -168,11 +168,11 @@
 
 <p><img src="/img/0.7.0/learn/documentation/introduction/dag.png" width="430" alt="Directed acyclic job graph"></p>
 
-<h3 id="toc_5">Containers</h3>
+<h3 id="containers">Containers</h3>
 
 <p>Partitions and tasks are both <em>logical</em> units of parallelism&mdash;they don&rsquo;t correspond to any particular assignment of computational resources (CPU, memory, disk space, etc.). Containers are the unit of physical parallelism, and a container is essentially a Unix process (or Linux <a href="http://en.wikipedia.org/wiki/Cgroups">cgroup</a>). Each container runs one or more tasks. The number of tasks is determined automatically from the number of partitions in the input and is fixed, but the number of containers (and the CPU and memory resources associated with them) is specified by the user at run time and can be changed at any time.</p>
 
-<h2 id="toc_6"><a href="architecture.html">Architecture &raquo;</a></h2>
+<h2 id="architecture-&raquo;"><a href="architecture.html">Architecture &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/jobs/configuration.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/configuration.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/jobs/configuration.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/jobs/configuration.html Wed Jun 18 05:50:54 2014
@@ -123,7 +123,7 @@
 -->
 
 <p>All Samza jobs have a configuration file that defines the job. A very basic configuration file looks like this:</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text"># Job
+<div class="highlight"><pre><code class="language-text" data-lang="text"># Job
 job.factory.class=samza.job.local.LocalJobFactory
 job.name=hello-world
 
@@ -149,7 +149,7 @@ systems.example-system.samza.msg.serde=j
 <li>The system section defines systems that your StreamTask can read from, along with the types of serdes used for sending keys and messages from that system. Usually you&rsquo;ll define a Kafka system if you&rsquo;re reading from Kafka, although you can also specify your own self-implemented Samza-compatible systems. See the <a href="/startup/hello-samza/0.7.0">hello-samza example project</a>&rsquo;s Wikipedia system for a good example of a self-implemented system.</li>
 </ol>
 
-<h3 id="toc_0">Required Configuration</h3>
+<h3 id="required-configuration">Required Configuration</h3>
 
 <p>Configuration keys that absolutely must be defined for a Samza job are:</p>
 
@@ -160,11 +160,11 @@ systems.example-system.samza.msg.serde=j
 <li>task.inputs</li>
 </ul>
 
-<h3 id="toc_1">Configuration Keys</h3>
+<h3 id="configuration-keys">Configuration Keys</h3>
 
 <p>A complete list of configuration keys can be found on the <a href="configuration-table.html">Configuration Table</a> page.</p>
 
-<h2 id="toc_2"><a href="packaging.html">Packaging &raquo;</a></h2>
+<h2 id="packaging-&raquo;"><a href="packaging.html">Packaging &raquo;</a></h2>
 
 
           </div>

Modified: incubator/samza/site/learn/documentation/0.7.0/jobs/job-runner.html
URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/job-runner.html?rev=1603358&r1=1603357&r2=1603358&view=diff
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/jobs/job-runner.html (original)
+++ incubator/samza/site/learn/documentation/0.7.0/jobs/job-runner.html Wed Jun 18 05:50:54 2014
@@ -123,19 +123,19 @@
 -->
 
 <p>Samza jobs are started using a script called run-job.sh.</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">samza-example/target/bin/run-job.sh \
+<div class="highlight"><pre><code class="language-text" data-lang="text">samza-example/target/bin/run-job.sh \
   --config-factory=samza.config.factories.PropertiesConfigFactory \
   --config-path=file://$PWD/config/hello-world.properties
 </code></pre></div>
 <p>You provide two parameters to the run-job.sh script. One is the config location, and the other is a factory class that is used to read your configuration file. The run-job.sh script is actually executing a Samza class called JobRunner. The JobRunner uses your ConfigFactory to get a Config object from the config path.</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">public interface ConfigFactory {
+<div class="highlight"><pre><code class="language-text" data-lang="text">public interface ConfigFactory {
   Config getConfig(URI configUri);
 }
 </code></pre></div>
 <p>The Config object is just a wrapper around Map&lt;String, String&gt;, with some nice helper methods. Out of the box, Samza ships with the PropertiesConfigFactory, but developers can implement any kind of ConfigFactory they wish.</p>
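 
 <p>For example, a toy factory that builds the configuration from environment variables (a hypothetical convention, purely for illustration) could collect them into a map and wrap it in Samza&rsquo;s MapConfig:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text">public class EnvConfigFactory implements ConfigFactory {
   public Config getConfig(URI configUri) {
     // Translate variables such as SAMZA_JOB_NAME into keys such as job.name.
     Map&lt;String, String&gt; map = new HashMap&lt;String, String&gt;();
     for (Map.Entry&lt;String, String&gt; e : System.getenv().entrySet()) {
       if (e.getKey().startsWith(&quot;SAMZA_&quot;)) {
         map.put(e.getKey().substring(6).toLowerCase().replace('_', '.'), e.getValue());
       }
     }
     return new MapConfig(map);
   }
 }
 </code></pre></div>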
 
 <p>Once the JobRunner gets your configuration, it gives your configuration to the StreamJobFactory class defined by the &ldquo;job.factory&rdquo; property. Samza ships with two job factory implementations: LocalJobFactory and YarnJobFactory. The StreamJobFactory&rsquo;s responsibility is to give the JobRunner a job that it can run.</p>
-<div class="highlight"><pre><code class="text language-text" data-lang="text">public interface StreamJob {
+<div class="highlight"><pre><code class="language-text" data-lang="text">public interface StreamJob {
   StreamJob submit();
 
   StreamJob kill();
@@ -151,7 +151,7 @@
 
 <p>This flow differs slightly when you use YARN, but we&rsquo;ll get to that later.</p>
 
-<h2 id="toc_0"><a href="configuration.html">Configuration &raquo;</a></h2>
+<h2 id="configuration-&raquo;"><a href="configuration.html">Configuration &raquo;</a></h2>
 
 
           </div>