Posted to commits@samza.apache.org by aj...@apache.org on 2023/01/18 19:33:31 UTC

svn commit: r1906774 [33/49] - in /samza/site: ./ archive/ blog/ case-studies/ community/ contribute/ img/latest/learn/documentation/api/ learn/documentation/latest/ learn/documentation/latest/api/ learn/documentation/latest/api/javadocs/ learn/documen...

Modified: samza/site/learn/documentation/latest/core-concepts/core-concepts.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/core-concepts/core-concepts.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/core-concepts/core-concepts.html (original)
+++ samza/site/learn/documentation/latest/core-concepts/core-concepts.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/core-concepts/core-concepts">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/core-concepts/core-concepts">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/core-concepts/core-concepts">1.6.0</a></li>
 
               
@@ -638,14 +652,13 @@
    See the License for the specific language governing permissions and
    limitations under the License.
 -->
-
 <ul>
-<li><a href="#introduction">Introduction</a></li>
-<li><a href="#streams-partitions">Streams, Partitions</a></li>
-<li><a href="#stream-application">Stream Application</a></li>
-<li><a href="#state">State</a></li>
-<li><a href="#time">Time</a></li>
-<li><a href="#processing-guarantee">Processing guarantee</a></li>
+  <li><a href="#introduction">Introduction</a></li>
+  <li><a href="#streams-partitions">Streams, Partitions</a></li>
+  <li><a href="#stream-application">Stream Application</a></li>
+  <li><a href="#state">State</a></li>
+  <li><a href="#time">Time</a></li>
+  <li><a href="#processing-guarantee">Processing guarantee</a></li>
 </ul>
 
 <h3 id="introduction">Introduction</h3>
@@ -656,61 +669,58 @@
 
 <p><em>Pluggability at every level:</em> Process and transform data from any source. Samza offers built-in integrations with <a href="/learn/documentation/latest/connectors/kafka.html">Apache Kafka</a>, <a href="/learn/documentation/latest/connectors/kinesis.html">AWS Kinesis</a>, <a href="/learn/documentation/latest/connectors/kinesis.html">Azure EventHubs</a>, ElasticSearch and <a href="/learn/documentation/latest/connectors/hdfs.html">Apache Hadoop</a>. Also, it’s quite easy to integrate with your own sources.</p>
 
-<p><em>Samza as an embedded library:</em> Integrate effortlessly with your existing applications eliminating the need to spin up and operate a separate cluster for stream processing. Samza can be used as a light-weight client-library <a href="/learn/documentation/latest/deployment/standalone.html">embedded</a> in your Java/Scala applications. </p>
+<p><em>Samza as an embedded library:</em> Integrate effortlessly with your existing applications eliminating the need to spin up and operate a separate cluster for stream processing. Samza can be used as a light-weight client-library <a href="/learn/documentation/latest/deployment/standalone.html">embedded</a> in your Java/Scala applications.</p>
 
 <p><em>Write once, Run anywhere:</em> <a href="/learn/documentation/latest/deployment/deployment-model.html">Flexible deployment options</a>  to run applications anywhere - from public clouds to containerized environments to bare-metal hardware.</p>
 
-<p><em>Samza as a managed service:</em> Run stream-processing as a managed service by integrating with popular cluster-managers including <a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Apache YARN</a>. </p>
+<p><em>Samza as a managed service:</em> Run stream-processing as a managed service by integrating with popular cluster-managers including <a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Apache YARN</a>.</p>
 
 <p><em>Fault-tolerance:</em>  Transparently migrate tasks along with their associated state in the event of failures. Samza supports <a href="/learn/documentation/latest/architecture/architecture-overview.html#host-affinity">host-affinity</a> and <a href="/learn/documentation/latest/architecture/architecture-overview.html#incremental-checkpoints">incremental checkpointing</a> to enable fast recovery from failures.</p>
 
-<p><em>Massive scale:</em> Battle-tested on applications that use several terabytes of state and run on thousands of cores. It <a href="/powered-by/">powers</a> multiple large companies including LinkedIn, Uber, TripAdvisor, Slack etc. </p>
+<p><em>Massive scale:</em> Battle-tested on applications that use several terabytes of state and run on thousands of cores. It <a href="/powered-by/">powers</a> multiple large companies including LinkedIn, Uber, TripAdvisor, Slack etc.</p>
 
-<p>Next, we will introduce Samza’s terminology. You will realize that it is extremely easy to <a href="/startup/quick-start/latest">get started</a> with building your first application. </p>
+<p>Next, we will introduce Samza’s terminology. You will realize that it is extremely easy to <a href="/startup/quick-start/latest">get started</a> with building your first application.</p>
 
 <h3 id="streams-partitions">Streams, Partitions</h3>
+<p>Samza processes your data in the form of streams. A <em>stream</em> is a collection of immutable messages, usually of the same type or category. Each message in a stream is modelled as a key-value pair.</p>
 
-<p>Samza processes your data in the form of streams. A <em>stream</em> is a collection of immutable messages, usually of the same type or category. Each message in a stream is modelled as a key-value pair. </p>
-
-<p><img src="/img/latest/learn/documentation/core-concepts/streams-partitions.png" alt="diagram-medium">
-<br/>
-A stream can have multiple producers that write data to it and multiple consumers that read data from it. Data in a stream can be unbounded (eg: a Kafka topic) or bounded (eg: a set of files on HDFS). </p>
+<p><img src="/img/latest/learn/documentation/core-concepts/streams-partitions.png" alt="diagram-medium" />
+<br />
+A stream can have multiple producers that write data to it and multiple consumers that read data from it. Data in a stream can be unbounded (eg: a Kafka topic) or bounded (eg: a set of files on HDFS).</p>
 
-<p>A stream is sharded into multiple partitions for scaling how its data is processed. Each <em>partition</em> is an ordered, replayable sequence of records. When a message is written to a stream, it ends up in one of its partitions. Each message in a partition is uniquely identified by an <em>offset</em>. </p>
+<p>A stream is sharded into multiple partitions for scaling how its data is processed. Each <em>partition</em> is an ordered, replayable sequence of records. When a message is written to a stream, it ends up in one of its partitions. Each message in a partition is uniquely identified by an <em>offset</em>.</p>
 
 <p>Samza supports pluggable systems that can implement the stream abstraction. As an example, Kafka implements a stream as a topic while a database might implement a stream as a sequence of updates to its tables.</p>
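+
+<p>For instance, with a local Kafka setup like the one used in the quick-start guides later in these docs, you could create a four-partition stream as a Kafka topic using the standard Kafka CLI. The topic name, partition count and script path below are illustrative only:</p>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># create a stream named "page-views" sharded into 4 partitions
+./deploy/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic page-views --partitions 4 --replication-factor 1
+</code></pre></div></div>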
 
 <h3 id="stream-application">Stream Application</h3>
-
 <p>A <em>stream application</em> processes messages from input streams, transforms them and emits results to an output stream or a database. It is built by chaining multiple operators, each of which takes in one or more streams and transforms them.</p>
 
-<p><img src="/img/latest/learn/documentation/core-concepts/stream-application.png" alt="diagram-medium"></p>
+<p><img src="/img/latest/learn/documentation/core-concepts/stream-application.png" alt="diagram-medium" /></p>
 
-<p>Samza offers foure top-level APIs to help you build your stream applications: <br/>
-1. The <a href="/learn/documentation/latest/api/high-level-api.html">High Level Streams API</a>,  which offers several built-in operators like map, filter, etc. This is the recommended API for most use-cases. <br/>
-2. The <a href="/learn/documentation/latest/api/low-level-api.html">Low Level Task API</a>, which allows greater flexibility to define your processing-logic and offers greater control <br/>
-3. <a href="/learn/documentation/latest/api/samza-sql.html">Samza SQL</a>, which offers a declarative SQL interface to create your applications <br/>
-4. <a href="/learn/documentation/latest/api/beam-api.html">Apache Beam API</a>, which offers the full Java API from <a href="https://beam.apache.org/">Apache beam</a> while Python and Go are work-in-progress.</p>
+<p>Samza offers four top-level APIs to help you build your stream applications: <br /></p>
+<ol>
+  <li>The <a href="/learn/documentation/latest/api/high-level-api.html">High Level Streams API</a>, which offers several built-in operators like map, filter, etc. This is the recommended API for most use-cases; a short sketch using it follows this list. <br /></li>
+  <li>The <a href="/learn/documentation/latest/api/low-level-api.html">Low Level Task API</a>, which offers greater flexibility and finer control over your processing logic. <br /></li>
+  <li><a href="/learn/documentation/latest/api/samza-sql.html">Samza SQL</a>, which offers a declarative SQL interface to create your applications. <br /></li>
+  <li><a href="/learn/documentation/latest/api/beam-api.html">Apache Beam API</a>, which offers the full Java API from <a href="https://beam.apache.org/">Apache Beam</a>, while Python and Go support is work-in-progress.</li>
+</ol>
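+
+<p>As a taste of the High Level Streams API, here is a minimal, hypothetical application that filters one Kafka stream into another. The stream names, serdes and filter predicate are placeholders, not part of the official examples:</p>
+
+<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.samza.application.StreamApplication;
+import org.apache.samza.application.descriptors.StreamApplicationDescriptor;
+import org.apache.samza.operators.MessageStream;
+import org.apache.samza.operators.OutputStream;
+import org.apache.samza.serializers.StringSerde;
+import org.apache.samza.system.kafka.descriptors.KafkaInputDescriptor;
+import org.apache.samza.system.kafka.descriptors.KafkaOutputDescriptor;
+import org.apache.samza.system.kafka.descriptors.KafkaSystemDescriptor;
+
+public class FilterBotsApplication implements StreamApplication {
+  @Override
+  public void describe(StreamApplicationDescriptor app) {
+    KafkaSystemDescriptor kafka = new KafkaSystemDescriptor("kafka");
+    KafkaInputDescriptor&lt;String&gt; pageViews =
+        kafka.getInputDescriptor("page-views", new StringSerde());
+    KafkaOutputDescriptor&lt;String&gt; filtered =
+        kafka.getOutputDescriptor("filtered-page-views", new StringSerde());
+
+    // Chain built-in operators: consume, filter, and emit to the output stream.
+    MessageStream&lt;String&gt; input = app.getInputStream(pageViews);
+    OutputStream&lt;String&gt; output = app.getOutputStream(filtered);
+    input.filter(msg -&gt; !msg.contains("bot")).sendTo(output);
+  }
+}
+</code></pre></div></div>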
 
 <h3 id="state">State</h3>
-
-<p>Samza supports both stateless and stateful stream processing. <em>Stateless processing</em>, as the name implies, does not retain any state associated with the current message after it has been processed. A good example of this is filtering an incoming stream of user-records by a field (eg:userId) and writing the filtered messages to their own stream. </p>
+<p>Samza supports both stateless and stateful stream processing. <em>Stateless processing</em>, as the name implies, does not retain any state associated with the current message after it has been processed. A good example of this is filtering an incoming stream of user-records by a field (eg: userId) and writing the filtered messages to their own stream.</p>
 
 <p>In contrast, <em>stateful processing</em> requires you to record some state about a message even after processing it. Consider the example of counting the number of unique users to a website every five minutes. This requires you to store information about each user seen thus far for de-duplication. Samza offers a fault-tolerant, scalable state-store for this purpose.</p>
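+
+<p>As a rough sketch of what stateful logic looks like with the Low Level Task API - the class and store names below are hypothetical, and the store must also be declared in the job's configuration:</p>
+
+<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.samza.context.Context;
+import org.apache.samza.storage.kv.KeyValueStore;
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.task.InitableTask;
+import org.apache.samza.task.MessageCollector;
+import org.apache.samza.task.StreamTask;
+import org.apache.samza.task.TaskCoordinator;
+
+public class UserCountTask implements StreamTask, InitableTask {
+  private KeyValueStore&lt;String, Integer&gt; store;
+
+  @Override
+  @SuppressWarnings("unchecked")
+  public void init(Context context) {
+    // Obtain the fault-tolerant state store declared as "user-counts" in config.
+    store = (KeyValueStore&lt;String, Integer&gt;) context.getTaskContext().getStore("user-counts");
+  }
+
+  @Override
+  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
+      TaskCoordinator coordinator) {
+    // Remember each user seen so far; this state survives task restarts.
+    String userId = (String) envelope.getKey();
+    Integer seen = store.get(userId);
+    store.put(userId, seen == null ? 1 : seen + 1);
+  }
+}
+</code></pre></div></div>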
 
 <h3 id="time">Time</h3>
-
-<p>Time is a fundamental concept in stream processing, especially in how it is modeled and interpreted by the system. Samza supports two notions of time. By default, all built-in Samza operators use processing time. In processing time, the timestamp of a message is determined by when it is processed by the system. For example, an event generated by a sensor could be processed by Samza several milliseconds later. </p>
+<p>Time is a fundamental concept in stream processing, especially in how it is modeled and interpreted by the system. Samza supports two notions of time. By default, all built-in Samza operators use processing time. In processing time, the timestamp of a message is determined by when it is processed by the system. For example, an event generated by a sensor could be processed by Samza several milliseconds later.</p>
 
 <p>On the other hand, in event time, the timestamp of an event is determined by when it actually occurred at the source. For example, a sensor which generates an event could embed the time of occurrence as a part of the event itself. Samza provides event-time based processing by its integration with <a href="https://beam.apache.org/documentation/runners/samza/">Apache BEAM</a>.</p>
 
 <h3 id="processing-guarantee">Processing guarantee</h3>
-
 <p>Samza supports at-least once processing. As the name implies, this ensures that each message in the input stream is processed by the system at-least once. This guarantees no data-loss even when there are failures, thereby making Samza a practical choice for building fault-tolerant applications.</p>
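+
+<p>Under the hood, this guarantee is driven by periodic checkpointing of input offsets; after a failure, processing resumes from the last checkpoint and any messages since then are replayed. A minimal sketch of the relevant configuration, assuming Kafka-backed checkpoints (the commit interval shown is illustrative):</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Store checkpoints in a Kafka topic.
+task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
+# Commit processed offsets every 60 seconds; a smaller interval means less reprocessing after a failure.
+task.commit.ms=60000
+</code></pre></div></div>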
 
 <p>Next Steps: We are now ready to have a closer look at Samza’s architecture.</p>
+<h3 id="architecture-"><a href="/learn/documentation/latest/architecture/architecture-overview.html">Architecture »</a></h3>
 
-<h3 id="architecture"><a href="/learn/documentation/latest/architecture/architecture-overview.html">Architecture &raquo;</a></h3>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/deployment/deployment-model.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/deployment/deployment-model.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/deployment/deployment-model.html (original)
+++ samza/site/learn/documentation/latest/deployment/deployment-model.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/deployment/deployment-model">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/deployment/deployment-model">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/deployment/deployment-model">1.6.0</a></li>
 
               
@@ -640,49 +654,47 @@
 -->
 
 <h3 id="overview">Overview</h3>
-
 <p>A unique thing about Samza is that it provides multiple ways to run your applications. Each deployment model comes with its own benefits, so you have flexibility in choosing whichever fits your needs. Since Samza supports “Write Once, Run Anywhere”, your application logic does not change depending on where you deploy it.</p>
 
 <h3 id="running-samza-on-yarn">Running Samza on YARN</h3>
-
 <p>Samza integrates with <a href="/learn/documentation/latest/deployment/yarn.html">Apache YARN</a> for running stream-processing as a managed service. We leverage YARN for isolation, multi-tenancy, resource-management and deployment for your applications. In this mode, you write your Samza application and submit it to be scheduled on a YARN cluster. You also specify its resource requirements - the number of containers needed, and the number of cores and memory required per container. Samza then works with YARN to provision resources for your application and run it across a cluster of machines. It also handles failures of individual instances and automatically restarts them.</p>
 
-<p>When multiple applications share the same YARN cluster, they need to be isolated from each other. For this purpose, Samza works with YARN to enforce cpu and memory limits. Any application that uses more than its requested share of memory or cpu is automatically terminated - thereby, allowing multi-tenancy. Just like you would for any YARN-based application, you can use YARN&rsquo;s <a href="/learn/documentation/latest/deployment/yarn.html#application-master-ui">web UI</a> to manage your Samza jobs, view their logs etc.</p>
+<p>When multiple applications share the same YARN cluster, they need to be isolated from each other. For this purpose, Samza works with YARN to enforce cpu and memory limits. Any application that uses more than its requested share of memory or cpu is automatically terminated - thereby, allowing multi-tenancy. Just like you would for any YARN-based application, you can use YARN’s <a href="/learn/documentation/latest/deployment/yarn.html#application-master-ui">web UI</a> to manage your Samza jobs, view their logs etc.</p>
 
 <h3 id="running-samza-in-standalone-mode">Running Samza in standalone mode</h3>
 
-<p>Often you want to embed and integrate Samza as a component within a larger application. To enable this, Samza supports a <a href="/learn/documentation/latest/deployment/standalone.html">standalone mode</a> of deployment allowing greater control over your application&rsquo;s life-cycle. In this model, Samza can be used just like any library you import within your Java application. This is identical to using a <a href="https://kafka.apache.org/">high-level Kafka consumer</a> to process your streams.</p>
+<p>Often you want to embed and integrate Samza as a component within a larger application. To enable this, Samza supports a <a href="/learn/documentation/latest/deployment/standalone.html">standalone mode</a> of deployment allowing greater control over your application’s life-cycle. In this model, Samza can be used just like any library you import within your Java application. This is identical to using a <a href="https://kafka.apache.org/">high-level Kafka consumer</a> to process your streams.</p>
 
-<p>You can increase your application&rsquo;s capacity by spinning up multiple instances. These instances will then dynamically coordinate with each other and distribute work among themselves. If an instance fails for some reason, the tasks running on it will be re-assigned to the remaining ones. By default, Samza uses <a href="https://zookeeper.apache.org/">Zookeeper</a> for coordination across individual instances. The coordination logic by itself is pluggable and hence, can integrate with other frameworks.</p>
+<p>You can increase your application’s capacity by spinning up multiple instances. These instances will then dynamically coordinate with each other and distribute work among themselves. If an instance fails for some reason, the tasks running on it will be re-assigned to the remaining ones. By default, Samza uses <a href="https://zookeeper.apache.org/">Zookeeper</a> for coordination across individual instances. The coordination logic by itself is pluggable and hence, can integrate with other frameworks.</p>
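+
+<p>As a sketch, Zookeeper-based coordination for an embedded application boils down to two job configs (the connection string below is illustrative):</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Coordinate instances of this application through Zookeeper.
+job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
+job.coordinator.zk.connect=localhost:2181
+</code></pre></div></div>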
 
 <p>This mode allows you to bring any cluster-manager or hosting-environment of your choice (eg: <a href="https://kubernetes.io/">Kubernetes</a>, <a href="https://mesosphere.github.io/marathon/">Marathon</a>) to run your application. You are also free to control memory-limits and multi-tenancy on your own - since Samza is used as a light-weight library.</p>
 
 <h3 id="choosing-a-deployment-model">Choosing a deployment model</h3>
 
-<p>A common question that we get asked is - &ldquo;Should I use YARN or standalone?&rdquo;. Here are some guidelines when choosing your deployment model. Since your application logic does not change, it is quite easy to port from one to the other.</p>
-
-<ul>
-<li>Would you like Samza to be embedded as a component of a larger application?
-
-<ul>
-<li>If so, then you should use standalone.</li>
-</ul></li>
-<li>Would you like to have out-of-the-box resource management (e.g. CPU/memory limits, restarts on failures)?
-
-<ul>
-<li>If so, then you should use YARN.</li>
-</ul></li>
-<li>Would you like to run your application on any other cluster manager - eg: Kubernetes?
-
-<ul>
-<li>If so, then you should use standalone.</li>
-</ul></li>
-<li>Would you like to run centrally-managed tools and dashboards?
+<p>A common question that we get asked is - “Should I use YARN or standalone?”. Here are some guidelines when choosing your deployment model. Since your application logic does not change, it is quite easy to port from one to the other.</p>
 
 <ul>
-<li>If so, then you should use YARN.</li>
-<li>Note: You can still have tools and dashboards when using standalone, but you will need to run them yourself wherever your application is deployed.</li>
-</ul></li>
+  <li>Would you like Samza to be embedded as a component of a larger application?
+    <ul>
+      <li>If so, then you should use standalone.</li>
+    </ul>
+  </li>
+  <li>Would you like to have out-of-the-box resource management (e.g. CPU/memory limits, restarts on failures)?
+    <ul>
+      <li>If so, then you should use YARN.</li>
+    </ul>
+  </li>
+  <li>Would you like to run your application on any other cluster manager - eg: Kubernetes?
+    <ul>
+      <li>If so, then you should use standalone.</li>
+    </ul>
+  </li>
+  <li>Would you like to run centrally-managed tools and dashboards?
+    <ul>
+      <li>If so, then you should use YARN.</li>
+      <li>Note: You can still have tools and dashboards when using standalone, but you will need to run them yourself wherever your application is deployed.</li>
+    </ul>
+  </li>
 </ul>
 
            

Modified: samza/site/learn/documentation/latest/deployment/standalone.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/deployment/standalone.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/deployment/standalone.html (original)
+++ samza/site/learn/documentation/latest/deployment/standalone.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/deployment/standalone">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/deployment/standalone">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/deployment/standalone">1.6.0</a></li>
 
               
@@ -640,16 +654,16 @@
 -->
 
 <ul>
-<li><a href="#introduction">Introduction</a></li>
-<li><a href="#quick-start-guide">Quick start</a>
-
-<ul>
-<li><a href="#setup-zookeeper">Installing Zookeeper and Kafka</a></li>
-<li><a href="#build-binaries">Building binaries</a></li>
-<li><a href="#deploy-binaries">Running the application</a></li>
-<li><a href="#inspect-results">Inspecting results</a></li>
-</ul></li>
-<li><a href="#coordinator-internals">Coordinator internals</a></li>
+  <li><a href="#introduction">Introduction</a></li>
+  <li><a href="#quick-start-guide">Quick start</a>
+    <ul>
+      <li><a href="#setup-zookeeper">Installing Zookeeper and Kafka</a></li>
+      <li><a href="#build-binaries">Building binaries</a></li>
+      <li><a href="#deploy-binaries">Running the application</a></li>
+      <li><a href="#inspect-results">Inspecting results</a></li>
+    </ul>
+  </li>
+  <li><a href="#coordinator-internals">Coordinator internals</a></li>
 </ul>
 
 <h3 id="introduction">Introduction</h3>
@@ -663,38 +677,50 @@
 <h3 id="quick-start">Quick start</h3>
 
 <p>The <a href="https://github.com/apache/samza-hello-samza/">Hello-samza</a> project includes multiple examples of Samza standalone applications. Let us first check out the repository.</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>git clone https://gitbox.apache.org/repos/asf/samza-hello-samza.git hello-samza
-<span class="nb">cd</span> hello-samza 
-</code></pre></div>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://gitbox.apache.org/repos/asf/samza-hello-samza.git hello-samza
+<span class="nb">cd </span>hello-samza 
+</code></pre></div></div>
+
 <h4 id="installing-zookeeper-and-kafka">Installing Zookeeper and Kafka</h4>
 
-<p>We will use the <code>./bin/grid</code> script from the <code>hello-samza</code> project to setup up Zookeeper and Kafka locally.</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>./bin/grid start zookeeper
+<p>We will use the <code class="language-plaintext highlighter-rouge">./bin/grid</code> script from the <code class="language-plaintext highlighter-rouge">hello-samza</code> project to set up Zookeeper and Kafka locally.</p>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bin/grid start zookeeper
 ./bin/grid start kafka
-</code></pre></div>
+</code></pre></div></div>
+
 <h4 id="building-the-binaries">Building the binaries</h4>
 
-<p>Let us now build the <code>hello-samza</code> project from its sources.</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>mvn clean package
-mkdir -p deploy/samza
-tar -xvf ./target/hello-samza-1.1.0-dist.tar.gz -C deploy/samza
-</code></pre></div>
+<p>Let us now build the <code class="language-plaintext highlighter-rouge">hello-samza</code> project from its sources.</p>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean package
+<span class="nb">mkdir</span> <span class="nt">-p</span> deploy/samza
+<span class="nb">tar</span> <span class="nt">-xvf</span> ./target/hello-samza-1.1.0-dist.tar.gz <span class="nt">-C</span> deploy/samza
+</code></pre></div></div>
+
 <h4 id="running-the-application">Running the application</h4>
 
-<p>We are ready to run the example application <a href="https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/wikipedia/application/WikipediaZkLocalApplication.java">WikipediaZkLocalApplication</a>. This application reads messages from the wikipedia-edits topic, and calculates counts, every ten seconds, for all edits that were made during that window. It emits these results to another topic named <code>wikipedia-stats</code>.</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>./deploy/samza/bin/run-class.sh samza.examples.wikipedia.application.WikipediaZkLocalApplication  --config job.config.loader.factory<span class="o">=</span>org.apache.samza.config.loaders.PropertiesConfigLoaderFactory --config job.config.loader.properties.path<span class="o">=</span><span class="nv">$PWD</span>/deploy/samza/config/wikipedia-application-local-runner.properties
-</code></pre></div>
+<p>We are ready to run the example application <a href="https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/wikipedia/application/WikipediaZkLocalApplication.java">WikipediaZkLocalApplication</a>. This application reads messages from the wikipedia-edits topic and, every ten seconds, calculates counts for all edits made during that window. It emits these results to another topic named <code class="language-plaintext highlighter-rouge">wikipedia-stats</code>.</p>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./deploy/samza/bin/run-class.sh samza.examples.wikipedia.application.WikipediaZkLocalApplication  <span class="nt">--config</span> job.config.loader.factory<span class="o">=</span>org.apache.samza.config.loaders.PropertiesConfigLoaderFactory <span class="nt">--config</span> job.config.loader.properties.path<span class="o">=</span><span class="nv">$PWD</span>/deploy/samza/config/wikipedia-application-local-runner.properties
+</code></pre></div></div>
+
 <p>You can run the above command again to spin up a new instance of your application.</p>
 
 <h4 id="inspecting-results">Inspecting results</h4>
 
 <p>To inspect events in the output topic, run the following command.</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>./deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic wikipedia-stats
-</code></pre></div>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./deploy/kafka/bin/kafka-console-consumer.sh  <span class="nt">--zookeeper</span> localhost:2181 <span class="nt">--topic</span> wikipedia-stats
+</code></pre></div></div>
+
 <p>You should see the output messages emitted to the Kafka topic.</p>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>{&quot;is-talk&quot;:2,&quot;bytes-added&quot;:5276,&quot;edits&quot;:13,&quot;unique-titles&quot;:13}
-{&quot;is-bot-edit&quot;:1,&quot;is-talk&quot;:3,&quot;bytes-added&quot;:4211,&quot;edits&quot;:30,&quot;unique-titles&quot;:30,&quot;is-unpatrolled&quot;:1,&quot;is-new&quot;:2,&quot;is-minor&quot;:7}
-</code></pre></div>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"is-talk":2,"bytes-added":5276,"edits":13,"unique-titles":13}
+{"is-bot-edit":1,"is-talk":3,"bytes-added":4211,"edits":30,"unique-titles":30,"is-unpatrolled":1,"is-new":2,"is-minor":7}
+</code></pre></div></div>
+
 <h3 id="standalone-coordinator-internals">Standalone Coordinator Internals</h3>
 
 <p>Samza runs your application by logically breaking its execution down into multiple tasks. A task is the unit of parallelism for your application, with each task consuming data from one or more partitions of your input streams.
@@ -705,10 +731,18 @@ pluggable - with a Zookeeper-based imple
 <p>Here is a typical sequence of actions when using Zookeeper for coordination. A minimal launch sketch follows this list.</p>
 
 <ol>
-<li><p>Everytime you spawn a new instance of your application, it registers itself with Zookeeper as a participant.</p></li>
-<li><p>There is always a single leader - which acts as the coordinator. The coordinator which manages the assignment of tasks across the individual containers. The coordinator also monitors the liveness of individual containers and redistributes the tasks among the remaining ones during a failure.  </p></li>
-<li><p>Whenever a new instance joins or leaves the group, it triggers a notification to the leader. The leader can then recompute assignments of tasks to the live instances.</p></li>
-<li><p>Once the leader publishes new partition assignments, all running instances pick up the new assignment and resume processing.</p></li>
+  <li>
+    <p>Every time you spawn a new instance of your application, it registers itself with Zookeeper as a participant.</p>
+  </li>
+  <li>
+    <p>There is always a single leader, which acts as the coordinator. The coordinator manages the assignment of tasks across the individual containers. It also monitors the liveness of individual containers and redistributes the tasks among the remaining ones during a failure.</p>
+  </li>
+  <li>
+    <p>Whenever a new instance joins or leaves the group, it triggers a notification to the leader. The leader can then recompute assignments of tasks to the live instances.</p>
+  </li>
+  <li>
+    <p>Once the leader publishes new partition assignments, all running instances pick up the new assignment and resume processing.</p>
+  </li>
 </ol>
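+
+<p>For reference, each "instance" above is typically just a JVM that launches the application through Samza's <code class="language-plaintext highlighter-rouge">LocalApplicationRunner</code>. Here is a minimal, hypothetical launcher - the application class is a placeholder, and other required configs are omitted for brevity:</p>
+
+<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.samza.config.MapConfig;
+import org.apache.samza.runtime.LocalApplicationRunner;
+
+public class Launcher {
+  public static void main(String[] args) {
+    Map&lt;String, String&gt; props = new HashMap&lt;&gt;();
+    // Each JVM started with this main() registers with Zookeeper as one participant.
+    props.put("job.coordinator.factory", "org.apache.samza.zk.ZkJobCoordinatorFactory");
+    props.put("job.coordinator.zk.connect", "localhost:2181");
+
+    LocalApplicationRunner runner =
+        new LocalApplicationRunner(new MyStreamApplication(), new MapConfig(props));
+    runner.run();
+    runner.waitForFinish();
+  }
+}
+</code></pre></div></div>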
 
            

Modified: samza/site/learn/documentation/latest/deployment/yarn.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/deployment/yarn.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/deployment/yarn.html (original)
+++ samza/site/learn/documentation/latest/deployment/yarn.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/deployment/yarn">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/deployment/yarn">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/deployment/yarn">1.6.0</a></li>
 
               
@@ -640,77 +654,84 @@
 -->
 
 <ul>
-<li><a href="#introduction">Introduction</a></li>
-<li><a href="#starting-your-application-on-yarn">Running on YARN: Quickstart</a>
-
-<ul>
-<li><a href="#setting-up-a-single-node-yarn-cluster-optional">Setting up a single node YARN cluster</a></li>
-<li><a href="#submitting-the-application-to-yarn">Submitting the application to YARN</a></li>
-</ul></li>
-<li><a href="#application-master-ui">Application Master UI</a></li>
-<li><a href="#configuration">Configuration</a>
-
-<ul>
-<li><a href="#configuring-parallelism">Configuring parallelism</a></li>
-<li><a href="#configuring-resources">Configuring resources</a>
-
-<ul>
-<li><a href="#memory">Memory</a></li>
-<li><a href="#cpu">CPU</a></li>
-</ul></li>
-<li><a href="#configuring-retries">Configuring retries</a></li>
-<li><a href="#configuring-rm-high-availability-and-nm-work-preserving-recovery">Configuring RM high-availability and NM work-preserving recovery</a>
-
-<ul>
-<li><a href="#resource-manager-high-availability">Resource Manager high-availability</a></li>
-<li><a href="#nodemanager-work-preserving-recovery">NodeManager work-preserving recovery</a></li>
-</ul></li>
-<li><a href="#configuring-host-affinity">Configuring host-affinity</a></li>
-<li><a href="#configuring-security">Configuring security</a>
-
-<ul>
-<li><a href="#delegation-token-management-strategy">Delegation token management strategy</a></li>
-<li><a href="#security-components">Security Components</a>
-
-<ul>
-<li><a href="#securitymanager">SecurityManager</a></li>
-</ul></li>
-<li><a href="#security-configuration">Security Configuration</a>
-
-<ul>
-<li><a href="#job">Job</a></li>
-<li><a href="#yarn">YARN</a></li>
-</ul></li>
-</ul></li>
-</ul></li>
-<li><a href="#coordinator-internals">Coordinator Internals</a></li>
+  <li><a href="#introduction">Introduction</a></li>
+  <li><a href="#starting-your-application-on-yarn">Running on YARN: Quickstart</a>
+    <ul>
+      <li><a href="#setting-up-a-single-node-yarn-cluster-optional">Setting up a single node YARN cluster</a></li>
+      <li><a href="#submitting-the-application-to-yarn">Submitting the application to YARN</a></li>
+    </ul>
+  </li>
+  <li><a href="#application-master-ui">Application Master UI</a></li>
+  <li><a href="#configuration">Configuration</a>
+    <ul>
+      <li><a href="#configuring-parallelism">Configuring parallelism</a></li>
+      <li><a href="#configuring-resources">Configuring resources</a>
+        <ul>
+          <li><a href="#memory">Memory</a></li>
+          <li><a href="#cpu">CPU</a></li>
+        </ul>
+      </li>
+      <li><a href="#configuring-retries">Configuring retries</a></li>
+      <li><a href="#configuring-rm-high-availability-and-nm-work-preserving-recovery">Configuring RM high-availability and NM work-preserving recovery</a>
+        <ul>
+          <li><a href="#resource-manager-high-availability">Resource Manager high-availability</a></li>
+          <li><a href="#nodemanager-work-preserving-recovery">NodeManager work-preserving recovery</a></li>
+        </ul>
+      </li>
+      <li><a href="#configuring-host-affinity">Configuring host-affinity</a></li>
+      <li><a href="#configuring-security">Configuring security</a>
+        <ul>
+          <li><a href="#delegation-token-management-strategy">Delegation token management strategy</a></li>
+          <li><a href="#security-components">Security Components</a>
+            <ul>
+              <li><a href="#securitymanager">SecurityManager</a></li>
+            </ul>
+          </li>
+          <li><a href="#security-configuration">Security Configuration</a>
+            <ul>
+              <li><a href="#job">Job</a></li>
+              <li><a href="#yarn">YARN</a></li>
+            </ul>
+          </li>
+        </ul>
+      </li>
+    </ul>
+  </li>
+  <li><a href="#coordinator-internals">Coordinator Internals</a></li>
 </ul>
 
 <h3 id="introduction">Introduction</h3>
 
-<p>Apache YARN is part of the Hadoop project and provides the ability to run distributed applications on a cluster. A YARN cluster minimally consists of a Resource Manager (RM) and multiple Node Managers (NM). The RM is responsible for managing the resources in the cluster and allocating them to applications. Every node in the cluster has an NM (Node Manager), which is responsible for managing containers on that node - starting them, monitoring their resource usage and reporting the same to the RM. </p>
+<p>Apache YARN is part of the Hadoop project and provides the ability to run distributed applications on a cluster. A YARN cluster minimally consists of a Resource Manager (RM) and multiple Node Managers (NM). The RM is responsible for managing the resources in the cluster and allocating them to applications. Every node in the cluster has an NM (Node Manager), which is responsible for managing containers on that node - starting them, monitoring their resource usage and reporting the same to the RM.</p>
 
 <p>Applications are run on the cluster by implementing a coordinator called an ApplicationMaster (AM). The AM is responsible for requesting resources including CPU, memory from the Resource Manager (RM) on behalf of the application. Samza provides its own implementation of the AM for each job.</p>
 
 <h3 id="running-on-yarn-quickstart">Running on YARN: Quickstart</h3>
 
-<p>We will demonstrate running a Samza application on YARN by using the <code>hello-samza</code> example. Lets first checkout our repository.</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>git clone https://github.com/apache/samza-hello-samza.git
-<span class="nb">cd</span> samza-hello-samza
+<p>We will demonstrate running a Samza application on YARN by using the <code class="language-plaintext highlighter-rouge">hello-samza</code> example. Let’s first check out our repository.</p>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/apache/samza-hello-samza.git
+<span class="nb">cd </span>samza-hello-samza
 git checkout latest
-</code></pre></div>
+</code></pre></div></div>
+
 <h4 id="set-up-a-single-node-yarn-cluster">Set up a single node YARN cluster</h4>
 
-<p>You can use the <code>grid</code> script included as part of the <a href="https://github.com/apache/samza-hello-samza/">hello-samza</a> repository to setup a single-node cluster. The script also starts Zookeeper and Kafka locally.</p>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>./bin/grid bootstrap
-</code></pre></div>
+<p>You can use the <code class="language-plaintext highlighter-rouge">grid</code> script included as part of the <a href="https://github.com/apache/samza-hello-samza/">hello-samza</a> repository to set up a single-node cluster. The script also starts Zookeeper and Kafka locally.</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bin/grid bootstrap
+</code></pre></div></div>
+
 <h3 id="submitting-the-application-to-yarn">Submitting the application to YARN</h3>
 
-<p>Now that we have a YARN cluster ready, lets build our application. The below command does a maven-build and generates an archive in the <code>./target</code> folder. </p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>./bin/build-package.sh
-</code></pre></div>
+<p>Now that we have a YARN cluster ready, let’s build our application. The command below does a Maven build and generates an archive in the <code class="language-plaintext highlighter-rouge">./target</code> folder.</p>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bin/build-package.sh
+</code></pre></div></div>
+
 <p>You can inspect the structure of the generated archive. To run on YARN, Samza jobs should be packaged with the following structure.</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>samza-job-name-folder
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>samza-job-name-folder
 ├── bin
 │   ├── run-app.sh
 │   ├── run-class.sh
@@ -723,38 +744,49 @@ git checkout latest
     ├── samza-kafka_2.11-0.14.0.jar
     ├── samza-yarn_2.11-0.14.0.jar
     └── ...
-</code></pre></div>
-<p>Once the archive is built, the <code>run-app.sh</code> script can be used to submit the application to YARN&rsquo;s resource manager. The script takes 2 CLI parameters - the config factory and the config file for the application. As an example, lets run our <a href="https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/cookbook/FilterExample.java">FilterExample</a> on YARN as follows:</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>$ ./deploy/samza/bin/run-app.sh --config-path<span class="o">=</span>./deploy/samza/config/filter-example.properties
-</code></pre></div>
-<p>Congratulations, you&rsquo;ve successfully submitted your first job to YARN! You can view the YARN Web UI to view its status. </p>
+</code></pre></div></div>
+
+<p>Once the archive is built, the <code class="language-plaintext highlighter-rouge">run-app.sh</code> script can be used to submit the application to YARN’s resource manager. The script takes two CLI parameters - the config factory and the config file for the application. As an example, let’s run our <a href="https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/cookbook/FilterExample.java">FilterExample</a> on YARN as follows:</p>
+
+<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./deploy/samza/bin/run-app.sh <span class="nt">--config-path</span><span class="o">=</span>./deploy/samza/config/filter-example.properties
+</code></pre></div></div>
+
+<p>Congratulations, you’ve successfully submitted your first job to YARN! You can visit the YARN Web UI to check its status.</p>
 
 <h3 id="application-master-ui">Application Master UI</h3>
 
-<p>The YARN RM provides a Web UI to view the status of applications in the cluster, their containers and logs. By default, it can be accessed from <code>localhost:8088</code> on the RM host. 
-<img src="/img/latest/learn/documentation/yarn/yarn-am-ui.png" alt="diagram-medium"></p>
+<p>The YARN RM provides a Web UI to view the status of applications in the cluster, their containers and logs. By default, it can be accessed from <code class="language-plaintext highlighter-rouge">localhost:8088</code> on the RM host. 
+<img src="/img/latest/learn/documentation/yarn/yarn-am-ui.png" alt="diagram-medium" /></p>
 
-<p>In addition to YARN&rsquo;s UI, Samza also offers a REST end-point and a web interface for its ApplicationMaster. To access it, simply click on the Tracking UI link corresponding to your application. 
-Samza&rsquo;s Application Master UI provides you the ability to view:</p>
+<p>In addition to YARN’s UI, Samza also offers a REST end-point and a web interface for its ApplicationMaster. To access it, simply click on the Tracking UI link corresponding to your application. 
+Samza’s Application Master UI provides you the ability to view:</p>
 
 <ul>
-<li><p>Job-level runtime metadata - eg: JMX endpoints, running JVM version
-<img src="/img/latest/learn/documentation/yarn/am-runtime-metadata.png" alt="diagram-small"></p></li>
-<li><p>Information about individual containers eg: their uptime, status and logs
-<img src="/img/latest/learn/documentation/yarn/am-container-info.png" alt="diagram-small"></p></li>
-<li><p>Task Groups eg: Information on individual tasks, where they run and which partitions are consumed from what host
-<img src="/img/latest/learn/documentation/yarn/am-job-model.png" alt="diagram-small"></p></li>
-<li><p>Runtime configs for your application
-<img src="/img/latest/learn/documentation/yarn/am-runtime-configs.png" alt="diagram-small"></p></li>
+  <li>
+    <p>Job-level runtime metadata - eg: JMX endpoints, running JVM version
+<img src="/img/latest/learn/documentation/yarn/am-runtime-metadata.png" alt="diagram-small" /></p>
+  </li>
+  <li>
+    <p>Information about individual containers eg: their uptime, status and logs
+<img src="/img/latest/learn/documentation/yarn/am-container-info.png" alt="diagram-small" /></p>
+  </li>
+  <li>
+    <p>Task Groups eg: Information on individual tasks, where they run and which partitions are consumed from what host
+<img src="/img/latest/learn/documentation/yarn/am-job-model.png" alt="diagram-small" /></p>
+  </li>
+  <li>
+    <p>Runtime configs for your application
+<img src="/img/latest/learn/documentation/yarn/am-runtime-configs.png" alt="diagram-small" /></p>
+  </li>
 </ul>
 
 <h3 id="configuration">Configuration</h3>
 
-<p>In this section, we&rsquo;ll look at configuring your jobs when running on YARN.</p>
+<p>In this section, we’ll look at configuring your jobs when running on YARN.</p>
 
 <h4 id="configuring-parallelism">Configuring parallelism</h4>
 
-<p><a href="/learn/documentation/latest/architecture/architecture-overview.html#container">Recall</a> that Samza scales your applications by breaking them into multiple tasks. On YARN, these tasks are executed on one or more containers, each of which is a Java process. You can control the number of containers allocated to your application by configuring <code>cluster-manager.container.count</code>. For example, if we are consuming from an input topic with 5 partitions, Samza will create 5 tasks, each of which process one partition. Tasks are equally distributed among available containers. The number of containers can be utmost the number of tasks - since, we cannot have idle containers without any tasks assigned to them. </p>
+<p><a href="/learn/documentation/latest/architecture/architecture-overview.html#container">Recall</a> that Samza scales your applications by breaking them into multiple tasks. On YARN, these tasks are executed on one or more containers, each of which is a Java process. You can control the number of containers allocated to your application by configuring <code class="language-plaintext highlighter-rouge">cluster-manager.container.count</code>. For example, if we are consuming from an input topic with 5 partitions, Samza will create 5 tasks, each of which processes one partition. Tasks are equally distributed among available containers. The number of containers can be at most the number of tasks, since we cannot have idle containers without any tasks assigned to them.</p>
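+
+<p>For the 5-partition example above, an illustrative configuration might be:</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Distribute the job's 5 tasks across 2 containers (e.g. 3 tasks + 2 tasks).
+cluster-manager.container.count=2
+</code></pre></div></div>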
 
 <h4 id="configuring-resources">Configuring resources</h4>
 
@@ -762,21 +794,21 @@ Samza&rsquo;s Application Master UI prov
 
 <h5 id="memory">Memory</h5>
 
-<p>You can configure the memory-limit per-container using <code>cluster-manager.container.memory.mb</code> and memory-limit for the AM using <code>yarn.am.container.memory.mb</code>. If your container process exceeds its configured memory-limits, it is automatically killed by YARN. </p>
+<p>You can configure the memory-limit per-container using <code class="language-plaintext highlighter-rouge">cluster-manager.container.memory.mb</code> and memory-limit for the AM using <code class="language-plaintext highlighter-rouge">yarn.am.container.memory.mb</code>. If your container process exceeds its configured memory-limits, it is automatically killed by YARN.</p>
 
 <h5 id="cpu">CPU</h5>
 
-<p>Similar to configuring memory-limits, you can configure the maximum number of vCores (virtual cores) each container can use by setting <code>cluster-manager.container.cpu.cores</code>. A <em>vCore</em> is YARN&rsquo;s abstraction over a physical core on a NodeManager which allows for over-provisioning. YARN supports <a href="(http://riccomini.name/posts/hadoop/2013-06-14-yarn-with-cgroups/)">isolation</a> of cpu cores using Linux CGroups.</p>
+<p>Similar to configuring memory-limits, you can configure the maximum number of vCores (virtual cores) each container can use by setting <code class="language-plaintext highlighter-rouge">cluster-manager.container.cpu.cores</code>. A <em>vCore</em> is YARN’s abstraction over a physical core on a NodeManager, which allows for over-provisioning. YARN supports <a href="http://riccomini.name/posts/hadoop/2013-06-14-yarn-with-cgroups/">isolation</a> of CPU cores using Linux CGroups.</p>
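+
+<p>Putting the memory and CPU settings together, a hypothetical resource configuration could look like this (all values are illustrative):</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Per-container limits enforced by YARN.
+cluster-manager.container.memory.mb=4096
+cluster-manager.container.cpu.cores=2
+# Memory for Samza's ApplicationMaster.
+yarn.am.container.memory.mb=1024
+</code></pre></div></div>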
 
 <h4 id="configuring-retries">Configuring retries</h4>
 
 <p>Failures are common when running any distributed system and should be handled gracefully. The Samza AM automatically restarts containers during a failure. The following properties govern this behavior.</p>
 
-<p><code>cluster-manager.container.retry.count</code>: This property determines the maximum number of times Samza will attempt to restart a failed container within a time window. If this property is set to 0, any failed container immediately causes the whole job to fail. If it is set to a negative number, there is no limit on the number of retries.</p>
+<p><code class="language-plaintext highlighter-rouge">cluster-manager.container.retry.count</code>: This property determines the maximum number of times Samza will attempt to restart a failed container within a time window. If this property is set to 0, any failed container immediately causes the whole job to fail. If it is set to a negative number, there is no limit on the number of retries.</p>
 
-<p><code>cluster-manager.container.retry.window.ms</code>:  This property determines how frequently a container is allowed to fail before we give up and fail the job. If the same container has failed more than cluster-manager.container.retry.count times and the time between failures is less than this property, then Samza terminates the job. There is no limit to the number of times we restart a container, if the time between failures is greater than cluster-manager.container.retry.window.ms.</p>
+<p><code class="language-plaintext highlighter-rouge">cluster-manager.container.retry.window.ms</code>: This property determines how frequently a container is allowed to fail before we give up and fail the job. If the same container has failed more than cluster-manager.container.retry.count times and the time between failures is less than this property, then Samza terminates the job. There is no limit to the number of times a container is restarted if the time between failures is greater than cluster-manager.container.retry.window.ms.</p>
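+
+<p>For example, the following illustrative settings fail the job only if the same container fails more than 3 times within a 5-minute window:</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cluster-manager.container.retry.count=3
+cluster-manager.container.retry.window.ms=300000
+</code></pre></div></div>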
 
-<h2 id="yarn-operations-best-practices">YARN - Operations Best practices</h2>
+<h2 id="yarn---operations-best-practices">YARN - Operations Best practices</h2>
 
 <p>Although this section is not Samza specific, it describes some best practices for running a YARN cluster in production.</p>
 
@@ -784,8 +816,9 @@ Samza&rsquo;s Application Master UI prov
 
 <p>The Resource Manager (RM) provides services like scheduling, heartbeats, liveness monitoring to all applications running in the YARN cluster. Losing the host running the RM would kill every application running on the cluster - making it a single point of failure. The High Availability feature introduced in Hadoop 2.4 adds redundancy by allowing multiple stand-by RMs.</p>
 
-<p>To configure YARN&rsquo;s ResourceManager to be highly available Resource Manager, set your yarn-site.xml file with the following configs:</p>
-<div class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span><span class="nt">&lt;property&gt;</span>
+<p>To make YARN’s ResourceManager highly available, add the following configs to your yarn-site.xml file:</p>
+
+<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;property&gt;</span>
   <span class="nt">&lt;name&gt;</span>yarn.resourcemanager.ha.enabled<span class="nt">&lt;/name&gt;</span>
   <span class="nt">&lt;value&gt;</span>true<span class="nt">&lt;/value&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
@@ -817,20 +850,21 @@ Samza&rsquo;s Application Master UI prov
   <span class="nt">&lt;name&gt;</span>yarn.resourcemanager.zk-address<span class="nt">&lt;/name&gt;</span>
   <span class="nt">&lt;value&gt;</span>zk1:2181,zk2:2181,zk3:2181<span class="nt">&lt;/value&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
-</code></pre></div>
+</code></pre></div></div>
+
 <h3 id="reserving-memory-for-other-services">Reserving memory for other services</h3>
 
-<p>Often, other services including monitoring daemons like Samza-REST run on the same nodes in the YARN cluster. You can configure <code>yarn.nodemanager.resource.system-reserved-memory-mb</code> to control the amount of physical memory reserved for non-YARN processes.</p>
+<p>Often, other services including monitoring daemons like Samza-REST run on the same nodes in the YARN cluster. You can configure <code class="language-plaintext highlighter-rouge">yarn.nodemanager.resource.system-reserved-memory-mb</code> to control the amount of physical memory reserved for non-YARN processes.</p>
 
-<p>Another behaviour to keep in mind is that the Resource Manager allocates memory and cpu on the cluster in increments of <code>yarn.scheduler.minimum-allocation-mb</code> and <code>yarn.scheduler.minimum-allocation-vcores</code>. Hence, requesting allocations that are not multiples of the above configs will cause internal fragmentation.</p>
+<p>Another behaviour to keep in mind is that the Resource Manager allocates memory and cpu on the cluster in increments of <code class="language-plaintext highlighter-rouge">yarn.scheduler.minimum-allocation-mb</code> and <code class="language-plaintext highlighter-rouge">yarn.scheduler.minimum-allocation-vcores</code>. Hence, requesting allocations that are not multiples of the above configs will cause internal fragmentation.</p>
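+
+<p>A sketch of the corresponding <code class="language-plaintext highlighter-rouge">yarn-site.xml</code> entries, with illustrative values (whether the reserved-memory setting takes effect depends on your NodeManager resource configuration):</p>
+
+<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;property&gt;
+  &lt;name&gt;yarn.nodemanager.resource.system-reserved-memory-mb&lt;/name&gt;
+  &lt;value&gt;2048&lt;/value&gt;
+&lt;/property&gt;
+&lt;property&gt;
+  &lt;name&gt;yarn.scheduler.minimum-allocation-mb&lt;/name&gt;
+  &lt;value&gt;1024&lt;/value&gt;
+&lt;/property&gt;
+</code></pre></div></div>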
 
 <h3 id="nodemanager-work-preserving-recovery">NodeManager work-preserving recovery</h3>
 
-<p>Often, NMs have to be bounced in the cluster for upgrades or maintenance reasons. By default, bouncing a Node Manager kills all containers running on its host. Work-preserving NM Restart enables NodeManagers to be restarted without losing active containers running on the node. You can turn on this feature by setting <code>yarn.nodemanager.recovery.enabled</code> to <code>true</code> in <code>yarn-site.xml</code>. You should also set <code>yarn.nodemanager.recovery.dir</code> to a directory where the NM should store its state needed for recovery.</p>
+<p>Often, NMs have to be bounced in the cluster for upgrades or maintenance reasons. By default, bouncing a Node Manager kills all containers running on its host. Work-preserving NM Restart enables NodeManagers to be restarted without losing active containers running on the node. You can turn on this feature by setting <code class="language-plaintext highlighter-rouge">yarn.nodemanager.recovery.enabled</code> to <code class="language-plaintext highlighter-rouge">true</code> in <code class="language-plaintext highlighter-rouge">yarn-site.xml</code>. You should also set <code class="language-plaintext highlighter-rouge">yarn.nodemanager.recovery.dir</code> to a directory where the NM should store its state needed for recovery.</p>
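+
+<p>A minimal sketch for <code class="language-plaintext highlighter-rouge">yarn-site.xml</code> (the recovery directory shown is a placeholder):</p>
+
+<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;property&gt;
+  &lt;name&gt;yarn.nodemanager.recovery.enabled&lt;/name&gt;
+  &lt;value&gt;true&lt;/value&gt;
+&lt;/property&gt;
+&lt;property&gt;
+  &lt;name&gt;yarn.nodemanager.recovery.dir&lt;/name&gt;
+  &lt;value&gt;/var/lib/hadoop-yarn/nm-recovery&lt;/value&gt;
+&lt;/property&gt;
+</code></pre></div></div>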
 
 <h3 id="configuring-state-store-directories">Configuring state-store directories</h3>
 
-<p>When a stateful Samza job is deployed in YARN, the state stores for the tasks are located in the current working directory of YARN’s attempt. YARN&rsquo;s DeletionService cleans up the working directories after an application exits. To ensure durability of Samza&rsquo;s state, its stores need to be persisted outside the scope of YARN&rsquo;s DeletionService. You can set this location by configuring an environment variable named <code>LOGGED_STORE_BASE_DIR</code> across the cluster.</p>
+<p>When a stateful Samza job is deployed in YARN, the state stores for the tasks are located in the current working directory of YARN’s attempt. YARN’s DeletionService cleans up the working directories after an application exits. To ensure durability of Samza’s state, its stores need to be persisted outside the scope of YARN’s DeletionService. You can set this location by configuring an environment variable named <code class="language-plaintext highlighter-rouge">LOGGED_STORE_BASE_DIR</code> across the cluster.</p>
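+
+<p>As a sketch, assuming you provision NodeManager environment variables through <code class="language-plaintext highlighter-rouge">yarn-env.sh</code> (the path below is a placeholder and must point to durable storage outside YARN’s working directories):</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># In yarn-env.sh, or however environment variables are managed on your nodes
+export LOGGED_STORE_BASE_DIR=/export/samza/state
+</code></pre></div></div>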
 
 <p>To manage disk space and clean-up state stores that are no longer necessary, Samza-REST supports periodic, long-running tasks named <a href="/learn/documentation/latest/rest/monitors.html">monitors</a>.</p>
 
@@ -841,17 +875,20 @@ Samza&rsquo;s Application Master UI prov
 <h4 id="management-of-kerberos-tokens">Management of Kerberos tokens</h4>
 
 <p>One challenge for long-running applications on YARN is how they periodically renew their Kerberos tokens. Samza handles this by having the AM periodically create tokens and refresh them in a staging directory on HDFS. This directory is accessible only by the containers of your job. You can set your Kerberos principal and Kerberos keytab file as follows:</p>
-<div class="highlight"><pre><code class="language-properties" data-lang="properties"><span></span><span class="c"># Use the SamzaYarnSecurityManagerFactory, which fetches and renews the Kerberos delegation tokens when the job is running in a secure environment.</span>
-<span class="na">job.security.manager.factory</span><span class="o">=</span><span class="s">org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory</span>
 
-<span class="c"># Kerberos principal</span>
-<span class="na">yarn.kerberos.principal</span><span class="o">=</span><span class="s">your-principal-name</span>
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Use the SamzaYarnSecurityManagerFactory, which fetches and renews the Kerberos delegation tokens when the job is running in a secure environment.
+</span><span class="py">job.security.manager.factory</span><span class="p">=</span><span class="s">org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory</span>
+
+<span class="c"># Kerberos principal
+</span><span class="py">yarn.kerberos.principal</span><span class="p">=</span><span class="s">your-principal-name</span>
+
+<span class="c"># Path of the keytab file (local path)
+</span><span class="py">yarn.kerberos.keytab</span><span class="p">=</span><span class="s">/tmp/keytab</span>
+</code></pre></div></div>
+
+<p>By default, Kerberos tokens on YARN have a maximum life-time of 7 days, beyond which they auto-expire. Often streaming applications are long-running and don’t terminate within this life-time. To get around this, you can configure YARN’s Resource Manager to automatically re-create tokens on your behalf by setting these configs in your <code class="language-plaintext highlighter-rouge">yarn-site.xml</code> file.</p>
 
-<span class="c"># Path of the keytab file (local path)</span>
-<span class="na">yarn.kerberos.keytab</span><span class="o">=</span><span class="s">/tmp/keytab</span>
-</code></pre></div>
-<p>By default, Kerberos tokens on YARN have a maximum life-time of 7 days, beyond which they auto-expire. Often streaming applications are long-running and don&rsquo;t terminate within this life-time. To get around this, you can configure YARN&rsquo;s Resource Manager to automatically re-create tokens on your behalf by setting these configs in your <code>yarn-site.xml</code> file. </p>
-<div class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span><span class="nt">&lt;property&gt;</span>
+<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;property&gt;</span>
 <span class="nt">&lt;name&gt;</span>hadoop.proxyuser.yarn.hosts<span class="nt">&lt;/name&gt;</span>
 <span class="nt">&lt;value&gt;</span>*<span class="nt">&lt;/value&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
@@ -860,32 +897,50 @@ Samza&rsquo;s Application Master UI prov
 <span class="nt">&lt;name&gt;</span>hadoop.proxyuser.yarn.groups<span class="nt">&lt;/name&gt;</span>
 <span class="nt">&lt;value&gt;</span>*<span class="nt">&lt;/value&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
-</code></pre></div>
+</code></pre></div></div>
 <h1 id="samza-coordinator-internals">Samza Coordinator Internals</h1>
 
-<p>In this section, we will discuss some of implementation internals of the Samza ApplicationMaster (AM). </p>
+<p>In this section, we will discuss some of the implementation internals of the Samza ApplicationMaster (AM).</p>
 
 <p>The Samza AM is the control hub for a Samza application running on a YARN cluster. It is responsible for coordinating work assignment across individual containers. It includes the following components:</p>
 
 <ul>
-<li>YARNClusterResourceManager, which handles interactions with YARN and provides APIs for requesting resources and starting containers.</li>
-<li>ContainerProcessManager, which uses the above APIs to manage Samza containers - including restarting them on failure, ensuring they stay in a healthy state.</li>
+  <li>YARNClusterResourceManager, which handles interactions with YARN and provides APIs for requesting resources and starting containers.</li>
+  <li>ContainerProcessManager, which uses the above APIs to manage Samza containers - including restarting them on failure, ensuring they stay in a healthy state.</li>
 </ul>
 
-<p><img src="/img/latest/learn/documentation/yarn/coordinator-internals.png" alt="diagram-small"></p>
+<p><img src="/img/latest/learn/documentation/yarn/coordinator-internals.png" alt="diagram-small" /></p>
 
-<p>Here&rsquo;s a life-cycle of a Samza job submitted to YARN:</p>
+<p>Here’s a life-cycle of a Samza job submitted to YARN:</p>
 
 <ul>
-<li><p>The <code>run-app.sh</code> script is started providing the location of your application&rsquo;s binaries and its config file. The script instantiates an ApplicationRunner, which is the main entry-point responsible for running your application.</p></li>
-<li><p>The ApplicationRunner parses your configs and writes them to a special Kafka topic named - the coordinator stream for distributing them. It proceeds to submit a request to YARN to launch your application. </p></li>
-<li><p>The first step in launching any YARN application is starting its Application Master (AM).</p></li>
-<li><p>The ResourceManager allocates an available host and starts the Samza AM. </p></li>
-<li><p>The Samza AM is then responsible for managing the overall application. It reads configs from the Coordinator Stream and computes work-assignments for individual containers. </p></li>
-<li><p>It also determines the hosts each container should run on taking data-locality into account. It proceeds to request resources on those nodes using the <code>YARNClusterResourceManager</code> APIs.</p></li>
-<li><p>Once resources have been allocated, it proceeds to start the containers on the allocated hosts.</p></li>
-<li><p>When it is started, each container first queries the Samza AM to determine its work-assignments and configs. It then proceeds to execute its assigned tasks. </p></li>
-<li><p>The Samza AM periodically monitors each container using heartbeats and ensure they stay alive. </p></li>
+  <li>
+    <p>The <code class="language-plaintext highlighter-rouge">run-app.sh</code> script is started, providing the location of your application’s binaries and its config file (an illustrative invocation is sketched after this list). The script instantiates an ApplicationRunner, which is the main entry point responsible for running your application.</p>
+  </li>
+  <li>
+    <p>The ApplicationRunner parses your configs and writes them to a special Kafka topic, called the coordinator stream, for distribution. It proceeds to submit a request to YARN to launch your application.</p>
+  </li>
+  <li>
+    <p>The first step in launching any YARN application is starting its Application Master (AM).</p>
+  </li>
+  <li>
+    <p>The ResourceManager allocates an available host and starts the Samza AM.</p>
+  </li>
+  <li>
+    <p>The Samza AM is then responsible for managing the overall application. It reads configs from the Coordinator Stream and computes work-assignments for individual containers.</p>
+  </li>
+  <li>
+    <p>It also determines the hosts each container should run on, taking data-locality into account. It proceeds to request resources on those nodes using the <code class="language-plaintext highlighter-rouge">YARNClusterResourceManager</code> APIs.</p>
+  </li>
+  <li>
+    <p>Once resources have been allocated, it proceeds to start the containers on the allocated hosts.</p>
+  </li>
+  <li>
+    <p>When it is started, each container first queries the Samza AM to determine its work-assignments and configs. It then proceeds to execute its assigned tasks.</p>
+  </li>
+  <li>
+    <p>The Samza AM periodically monitors each container using heartbeats and ensures they stay alive.</p>
+  </li>
 </ul>
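+
+<p>For orientation, the first step above might look like the following invocation. This is a hedged sketch based on the Samza tutorials; the flags and paths are illustrative assumptions, so consult your own deployment scripts for the exact form:</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Submit the application to YARN (paths and config options are placeholders)
+./deploy/samza/bin/run-app.sh \
+  --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
+  --config-path=file://$PWD/deploy/samza/config/my-app.properties
+</code></pre></div></div>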
 
            

Modified: samza/site/learn/documentation/latest/hadoop/consumer.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/hadoop/consumer.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/hadoop/consumer.html (original)
+++ samza/site/learn/documentation/latest/hadoop/consumer.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/hadoop/consumer">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/hadoop/consumer">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/hadoop/consumer">1.6.0</a></li>
 
               
@@ -639,7 +653,7 @@
    limitations under the License.
 -->
 
-<p>You can configure your Samza job to read from HDFS files. The <a href="https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java">HdfsSystemConsumer</a> can read from HDFS files. Avro encoded records are supported out of the box and it is easy to extend to support other formats (plain text, csv, json etc). See <code>Event format</code> section below.</p>
+<p>You can configure your Samza job to read from HDFS files using the <a href="https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java">HdfsSystemConsumer</a>. Avro-encoded records are supported out of the box, and it is easy to extend support to other formats (plain text, CSV, JSON, etc.). See the <code class="language-plaintext highlighter-rouge">Event format</code> section below.</p>
 
 <h3 id="environment">Environment</h3>
 
@@ -654,9 +668,9 @@
 <p><a href="https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java">HdfsSystemConsumer</a> currently supports reading from avro files. The received <a href="../api/javadocs/org/apache/samza/system/IncomingMessageEnvelope.html">IncomingMessageEnvelope</a> contains three significant fields:</p>
 
 <ol>
-<li>The key which is empty</li>
-<li>The message which is set to the avro <a href="https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/generic/GenericRecord.html">GenericRecord</a></li>
-<li>The stream partition which is set to the name of the HDFS file</li>
+  <li>The key which is empty</li>
+  <li>The message which is set to the avro <a href="https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/generic/GenericRecord.html">GenericRecord</a></li>
+  <li>The stream partition which is set to the name of the HDFS file</li>
 </ol>
 
 <p>To extend the support beyond avro files (e.g. json, csv, etc.), you can implement the interface <a href="https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java">SingleFileHdfsReader</a> (take a look at the implementation of <a href="https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java">AvroFileHdfsReader</a> as a sample).</p>
@@ -670,7 +684,8 @@
 <h3 id="basic-configuration">Basic Configuration</h3>
 
 <p>Here are a few of the basic configs to set up HdfsSystemConsumer:</p>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span># The HDFS system consumer is implemented under the org.apache.samza.system.hdfs package,
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The HDFS system consumer is implemented under the org.apache.samza.system.hdfs package,
 # so use HdfsSystemFactory as the system factory for your system
 systems.hdfs-clickstream.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory
 
@@ -680,15 +695,18 @@ task.inputs=hdfs-clickstream.hdfs:/data/
 # You can specify a white list of files you want your job to process (in Java Pattern style)
 systems.hdfs-clickstream.partitioner.defaultPartitioner.whitelist=.*avro
 
-# You can specify a black list of files you don&#39;t want your job to process (in Java Pattern style),
-# by default it&#39;s empty.
+# You can specify a black list of files you don't want your job to process (in Java Pattern style),
+# by default it's empty.
 # Note that you can have both white list and black list, in which case both will be applied.
 systems.hdfs-clickstream.partitioner.defaultPartitioner.blacklist=somefile.avro
-</code></pre></div>
+
+</code></pre></div></div>
+
 <h3 id="security-configuration">Security Configuration</h3>
 
 <p>The following additional configs are required when accessing HDFS clusters that have Kerberos enabled:</p>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span># Use the SamzaYarnSecurityManagerFactory, which fetches and renews the Kerberos delegation tokens when the job is running in a secure environment.
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use the SamzaYarnSecurityManagerFactory, which fetches and renews the Kerberos delegation tokens when the job is running in a secure environment.
 job.security.manager.factory=org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory
 
 # Kerberos principal
@@ -696,27 +714,32 @@ yarn.kerberos.principal=your-principal-n
 
 # Path of the keytab file (local path)
 yarn.kerberos.keytab=/tmp/keytab
-</code></pre></div>
+</code></pre></div></div>
+
 <h3 id="advanced-configuration">Advanced Configuration</h3>
 
 <p>Some of the advanced configurations you might need to set up:</p>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span># Specify the group pattern for advanced partitioning.
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Specify the group pattern for advanced partitioning.
 systems.hdfs-clickstream.partitioner.defaultPartitioner.groupPattern=part-[id]-.*
-</code></pre></div>
-<p>The advanced partitioning goes beyond the basic assumption that each file is a partition. With advanced partitioning you can group files into partitions arbitrarily. For example, if you have a set of files as [part-01-a.avro, part-01-b.avro, part-02-a.avro, part-02-b.avro, part-03-a.avro] that you want to organize into three partitions as (part-01-a.avro, part-01-b.avro), (part-02-a.avro, part-02-b.avro), (part-03-a.avro), where the numbers in the middle act as a &ldquo;group identifier&rdquo;, you can then set this property to be &ldquo;part-[id]-.<em>&rdquo; (note that *</em>[id]** is a reserved term here, i.e. you have to literally put it as <strong>[id]</strong>). The partitioner will apply this pattern to all file names and extract the &ldquo;group identifier&rdquo; (&ldquo;[id]&rdquo; in the pattern), then use the &ldquo;group identifier&rdquo; to group files into partitions.</p>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span># Specify the type of files your job want to process (support avro only for now)
+</code></pre></div></div>
+
+<p>The advanced partitioning goes beyond the basic assumption that each file is a partition. With advanced partitioning, you can group files into partitions arbitrarily. For example, if you have a set of files such as [part-01-a.avro, part-01-b.avro, part-02-a.avro, part-02-b.avro, part-03-a.avro] that you want to organize into three partitions as (part-01-a.avro, part-01-b.avro), (part-02-a.avro, part-02-b.avro), (part-03-a.avro), where the numbers in the middle act as a “group identifier”, you can then set this property to be “part-[id]-.*” (note that <strong>[id]</strong> is a reserved term here, i.e. you have to literally put it as <strong>[id]</strong>). The partitioner will apply this pattern to all file names and extract the “group identifier” (“[id]” in the pattern), then use the “group identifier” to group files into partitions.</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Specify the type of files your job want to process (support avro only for now)
 systems.hdfs-clickstream.consumer.reader=avro
 
 # Max number of retries (per-partition) before the container fails.
 system.hdfs-clickstream.consumer.numMaxRetries=10
-</code></pre></div>
+
+</code></pre></div></div>
+
 <p>For the list of all configs, check out the configuration table page <a href="../jobs/configuration-table.html">here</a>.</p>
 
 <h3 id="more-information">More Information</h3>
-
 <p><a href="https://issues.apache.org/jira/secure/attachment/12827670/HDFSSystemConsumer.pdf">HdfsSystemConsumer design doc</a></p>
 
-<h2 id="writing-to-hdfs"><a href="./producer.html">Writing to HDFS &raquo;</a></h2>
+<h2 id="writing-to-hdfs-"><a href="./producer.html">Writing to HDFS »</a></h2>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/hadoop/overview.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/hadoop/overview.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/hadoop/overview.html (original)
+++ samza/site/learn/documentation/latest/hadoop/overview.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/hadoop/overview">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/hadoop/overview">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/hadoop/overview">1.6.0</a></li>
 
               
@@ -641,7 +655,7 @@
 
 <p>Samza provides a unified data processing model for both stream and batch processing. The primary difference between batch and streaming is whether the input size is bounded or unbounded. Batch data sources are typically bounded (e.g. static files on HDFS), whereas streams are unbounded (e.g. a topic in Kafka). Under the hood, the same highly-efficient stream-processing engine handles both types.</p>
 
-<p><img src="/img/latest/learn/documentation/hadoop/unified_batch_streaming.png" alt="Unified Batch and Streaming" style="max-width: 100%; height: auto;" onclick="window.open(this.src)"></p>
+<p><img src="/img/latest/learn/documentation/hadoop/unified_batch_streaming.png" alt="Unified Batch and Streaming" style="max-width: 100%; height: auto;" onclick="window.open(this.src)" /></p>
 
 <h3 id="unified-api-for-batch-and-streaming">Unified API for Batch and Streaming</h3>
 
@@ -649,13 +663,13 @@
 
 <h3 id="multi-stage-batch-pipeline">Multi-stage Batch Pipeline</h3>
 
-<p>Complex data pipelines usually consist multiple stages, with data shuffled (repartitioned) between stages to enable key-based operations such as windowing, aggregation, and join. Samza <a href="/startup/preview/index.html">High Level Streams API</a> provides an operator named <code>partitionBy</code> to create such multi-stage pipelines. Internally, Samza creates a physical stream, called an “intermediate stream”, based on the system configured as in <code>job.default.system</code>. Samza repartitions the output of the previous stage by sending it to the intermediate stream with the appropriate partition count and partition key. It then feeds it to the next stage of the pipeline. The lifecycle of intermediate streams is completely managed by Samza so from the user perspective the data shuffling is automatic.</p>
+<p>Complex data pipelines usually consist of multiple stages, with data shuffled (repartitioned) between stages to enable key-based operations such as windowing, aggregation, and join. Samza’s <a href="/startup/preview/index.html">High Level Streams API</a> provides an operator named <code class="language-plaintext highlighter-rouge">partitionBy</code> to create such multi-stage pipelines. Internally, Samza creates a physical stream, called an “intermediate stream”, based on the system configured in <code class="language-plaintext highlighter-rouge">job.default.system</code>. Samza repartitions the output of the previous stage by sending it to the intermediate stream with the appropriate partition count and partition key. It then feeds it to the next stage of the pipeline. The lifecycle of intermediate streams is completely managed by Samza, so from the user’s perspective the data shuffling is automatic.</p>
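+
+<p>A hedged configuration sketch: the system name <code class="language-plaintext highlighter-rouge">kafka</code> and the factory wiring below are illustrative assumptions, showing only where the intermediate-stream system is chosen:</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Intermediate (shuffle) streams created by partitionBy are materialized
+# on the system configured here ('kafka' is a placeholder system name)
+job.default.system=kafka
+systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
+</code></pre></div></div>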
 
 <p>For a single-stage pipeline, dealing with bounded data sets is straightforward: the system consumer “knows” the end of a particular partition, and it will emit an end-of-stream token once a partition is complete. Samza will shut down the container when all its input partitions are complete.</p>
 
-<p>For a multi-stage pipeline, however, things become tricky since intermediate streams are often physically unbounded data streams, e.g. Kafka, and the downstream stages don&rsquo;t know when to shut down since unbounded streams don&rsquo;t have an end. To solve this problem, Samza uses in-band end-of-stream control messages in the intermediate stream along with user data messages. The upstream stage broadcasts end-of-stream control messages to every partition of the intermediate stream, and the downstream stage will aggregate the end-of-stream messages for each partition. When one end-of-stream message has been received for every upstream task in a partition, the downstream stage will conclude that the partition has no more messages, and the task will shut down. For pipelines with more than 2 stages, the end-of-stream control messages will be propagated from the source to the last stage, and each stage will perform the end-of-stream aggregation and then shuts down. The following diagram shows the flow:</p>
+<p>For a multi-stage pipeline, however, things become tricky since intermediate streams are often physically unbounded data streams, e.g. Kafka, and the downstream stages don’t know when to shut down since unbounded streams don’t have an end. To solve this problem, Samza uses in-band end-of-stream control messages in the intermediate stream along with user data messages. The upstream stage broadcasts end-of-stream control messages to every partition of the intermediate stream, and the downstream stage aggregates the end-of-stream messages for each partition. When one end-of-stream message has been received for every upstream task in a partition, the downstream stage concludes that the partition has no more messages, and the task shuts down. For pipelines with more than 2 stages, the end-of-stream control messages are propagated from the source to the last stage, and each stage performs the end-of-stream aggregation and then shuts down. The following diagram shows the flow:</p>
 
-<p><img src="/img/latest/learn/documentation/hadoop/multi_stage_batch.png" alt="Multi-stage Batch Processing" style="max-width: 100%; height: auto;" onclick="window.open(this.src)"></p>
+<p><img src="/img/latest/learn/documentation/hadoop/multi_stage_batch.png" alt="Multi-stage Batch Processing" style="max-width: 100%; height: auto;" onclick="window.open(this.src)" /></p>
 
 <h3 id="state-and-fault-tolerance">State and Fault-tolerance</h3>
 
@@ -663,7 +677,7 @@
 
 <p>During a job restart, batch processing behaves completely differently from streaming. In batch, it is expected to be a re-run, and all the internal streams, including intermediate, checkpoint and changelog streams, need to be fresh. Since some systems only support retention-based stream cleanup, e.g. Kafka without deletion enabled, Samza creates a new set of internal streams for each job run. To achieve this, Samza internally generates a unique <strong>run.id</strong> for each job run. The <strong>run.id</strong> is appended to the physical names of the internal streams, which will be used in the job in each run. Samza also performs due diligence to delete/purge the streams from the previous run. The cleanup happens when the job is restarted.</p>
 
-<h2 id="reading-from-hdfs"><a href="./consumer.html">Reading from HDFS &raquo;</a></h2>
+<h2 id="reading-from-hdfs-"><a href="./consumer.html">Reading from HDFS »</a></h2>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/hadoop/producer.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/hadoop/producer.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/hadoop/producer.html (original)
+++ samza/site/learn/documentation/latest/hadoop/producer.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/hadoop/producer">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/hadoop/producer">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/hadoop/producer">1.6.0</a></li>
 
               
@@ -639,14 +653,15 @@
    limitations under the License.
 -->
 
-<p>The <code>samza-hdfs</code> module implements a Samza Producer to write to HDFS. The current implementation includes a ready-to-use <code>HdfsSystemProducer</code>, and three <code>HdfsWriter</code>s: One that writes messages of raw bytes to a <code>SequenceFile</code> of <code>BytesWritable</code> keys and values. Another writes UTF-8 <code>String</code>s to a <code>SequenceFile</code> with <code>LongWritable</code> keys and <code>Text</code> values.
+<p>The <code class="language-plaintext highlighter-rouge">samza-hdfs</code> module implements a Samza Producer to write to HDFS. The current implementation includes a ready-to-use <code class="language-plaintext highlighter-rouge">HdfsSystemProducer</code> and three <code class="language-plaintext highlighter-rouge">HdfsWriter</code>s: one that writes messages of raw bytes to a <code class="language-plaintext highlighter-rouge">SequenceFile</code> of <code class="language-plaintext highlighter-rouge">BytesWritable</code> keys and values, and another that writes UTF-8 <code class="language-plaintext highlighter-rouge">String</code>s to a <code class="language-plaintext highlighter-rouge">SequenceFile</code> with <code class="language-plaintext highlighter-rouge">LongWritable</code> keys and <code class="language-plaintext highlighter-rouge">Text</code> values.
 The last one writes out Avro data files including the schema automatically reflected from the POJO objects fed to it.</p>
 
 <h3 id="configuring-an-hdfssystemproducer">Configuring an HdfsSystemProducer</h3>
 
-<p>You can configure an HdfsSystemProducer like any other Samza system: using configuration keys and values set in a <code>job.properties</code> file.
-You might configure the system producer for use by your <code>StreamTasks</code> like this:</p>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span># set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to &#39;hdfs-clickstream&#39;
+<p>You can configure an HdfsSystemProducer like any other Samza system: using configuration keys and values set in a <code class="language-plaintext highlighter-rouge">job.properties</code> file.
+You might configure the system producer for use by your <code class="language-plaintext highlighter-rouge">StreamTasks</code> like this:</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to 'hdfs-clickstream'
 systems.hdfs-clickstream.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory
 
 # define a serializer/deserializer for the hdfs-clickstream system
@@ -658,7 +673,7 @@ systems.hdfs-clickstream.samza.msg.serde
 # Assign a Metrics implementation via a label we defined earlier in the props file
 systems.hdfs-clickstream.streams.metrics.samza.msg.serde=some-metrics-impl
 
-# Assign the implementation class for this system&#39;s HdfsWriter
+# Assign the implementation class for this system's HdfsWriter
 systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.TextSequenceFileHdfsWriter
 #systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.AvroDataFileHdfsWriter
 
@@ -670,7 +685,7 @@ systems.hdfs-clickstream.producer.hdfs.c
 # is currently /BASE/JOB_NAME/DATE_PATH/FILES, where BASE is set below
 systems.hdfs-clickstream.producer.hdfs.base.output.dir=/user/me/analytics/clickstream_data
 
-# Assign the implementation class for the HdfsWriter&#39;s Bucketer
+# Assign the implementation class for the HdfsWriter's Bucketer
 systems.hdfs-clickstream.producer.hdfs.bucketer.class=org.apache.samza.system.hdfs.writer.JobNameDateTimeBucketer
 
 # Configure the DATE_PATH the Bucketer will set to bucket output files by day for this job run.
@@ -681,8 +696,9 @@ systems.hdfs-clickstream.producer.hdfs.b
 # (records for AvroDataFileHdfsWriter) are written.
 systems.hdfs-clickstream.producer.hdfs.write.batch.size.bytes=134217728
 #systems.hdfs-clickstream.producer.hdfs.write.batch.size.records=10000
-</code></pre></div>
-<p>The above configuration assumes a Metrics and Serde implemnetation has been properly configured against the <code>some-serde-impl</code> and <code>some-metrics-impl</code> labels somewhere else in the same <code>job.properties</code> file. Each of these properties has a reasonable default, so you can leave out the ones you don&rsquo;t need to customize for your job run.</p>
+</code></pre></div></div>
+
+<p>The above configuration assumes that Metrics and Serde implementations have been properly configured against the <code class="language-plaintext highlighter-rouge">some-serde-impl</code> and <code class="language-plaintext highlighter-rouge">some-metrics-impl</code> labels somewhere else in the same <code class="language-plaintext highlighter-rouge">job.properties</code> file. Each of these properties has a reasonable default, so you can leave out the ones you don’t need to customize for your job run.</p>
 
            
         </div>