Posted to commits@samza.apache.org by aj...@apache.org on 2023/01/18 19:33:31 UTC

svn commit: r1906774 [42/49] - in /samza/site: ./ archive/ blog/ case-studies/ community/ contribute/ img/latest/learn/documentation/api/ learn/documentation/latest/ learn/documentation/latest/api/ learn/documentation/latest/api/javadocs/ learn/documen...

Modified: samza/site/learn/documentation/latest/yarn/yarn-host-affinity.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/yarn/yarn-host-affinity.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/yarn/yarn-host-affinity.html (original)
+++ samza/site/learn/documentation/latest/yarn/yarn-host-affinity.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/yarn/yarn-host-affinity">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/yarn/yarn-host-affinity">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/yarn/yarn-host-affinity">1.6.0</a></li>
 
               
@@ -644,12 +658,12 @@
 <p>We define a <em>Stateful Samza Job</em> as a Samza job that uses a key-value store in its implementation, along with an associated changelog stream. In stateful Samza jobs, a task may be configured to use multiple stores. For each store there is a 1:1 mapping between the task instance and the data store. Since the allocation of containers to machines in the Yarn cluster is left entirely to Yarn, Samza does not guarantee that a container (and hence its associated task(s)) is re-deployed on the same machine. Containers can get shuffled in any of the following cases:</p>
 
 <ol>
-<li>When a job is upgraded by pointing <code>yarn.package.path</code> to the new package path and re-submitted.</li>
-<li>When a job is simply restarted by Yarn or the user</li>
-<li>When a container failure or premption triggers the SamzaAppMaster to re-allocate on another available resource</li>
+  <li>When a job is upgraded by pointing <code>yarn.package.path</code> to the new package path and re-submitted.</li>
+  <li>When a job is simply restarted by Yarn or the user.</li>
+  <li>When a container failure or preemption triggers the SamzaAppMaster to re-allocate the container on another available resource.</li>
 </ol>
 
-<p>In any of the above cases, the task&rsquo;s co-located data needs to be restored every time a container starts-up. Restoring data each time can be expensive, especially for applications that have a large data set. This behavior slows the start-up time for the job so much that the job is no longer &ldquo;near realtime&rdquo;. Furthermore, if multiple stateful samza jobs restart around the same time in the cluster and they all share the same changelog system, then it is possible to quickly saturate the changelog system&rsquo;s network and cause a DDoS.</p>
+<p>In any of the above cases, the task’s co-located data needs to be restored every time a container starts up. Restoring data each time can be expensive, especially for applications that have a large data set. This behavior slows the start-up time for the job so much that the job is no longer “near realtime”. Furthermore, if multiple stateful Samza jobs restart around the same time in the cluster and they all share the same changelog system, then it is possible to quickly saturate the changelog system’s network and cause a DDoS.</p>
 
 <p>For instance, consider a Samza job performing a Stream-Table join. Typically, such a job requires the dataset to be available on all processors before they begin processing the input stream. The dataset is usually large (on the order of 1 TB or more) read-only data that will be used to join or add attributes to incoming messages. The job may initialize this cache by populating it with data directly from a remote store or changelog stream. This cache initialization happens each time the container is restarted. This causes significant latency during job start-up.</p>
 
@@ -657,91 +671,97 @@
 
 <h2 id="how-does-it-work">How does it work?</h2>
 
-<p>When a stateful Samza job is deployed in Yarn, the state stores for the tasks are co-located in the current working directory of Yarn&rsquo;s application attempt.</p>
+<p>When a stateful Samza job is deployed in Yarn, the state stores for the tasks are co-located in the current working directory of Yarn’s application attempt.</p>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="nv">container_working_dir</span><span class="o">=</span><span class="si">${</span><span class="nv">yarn</span><span class="p">.nodemanager.local-dirs</span><span class="si">}</span>/usercache/<span class="si">${</span><span class="nv">user</span><span class="si">}</span>/appcache/application_<span class="si">${</span><span class="nv">appid</span><span class="si">}</span>/container_<span class="si">${</span><span class="nv">contid</span><span class="si">}</span>/
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">container_working_dir</span><span class="o">=</span><span class="k">${</span><span class="nv">yarn</span><span class="p">.nodemanager.local-dirs</span><span class="k">}</span>/usercache/<span class="k">${</span><span class="nv">user</span><span class="k">}</span>/appcache/application_<span class="k">${</span><span class="nv">appid</span><span class="k">}</span>/container_<span class="k">${</span><span class="nv">contid</span><span class="k">}</span>/
 
-<span class="c1"># Data Stores</span>
-ls <span class="si">${</span><span class="nv">container_working_dir</span><span class="si">}</span>/state/<span class="si">${</span><span class="nv">store</span><span class="p">-name</span><span class="si">}</span>/<span class="si">${</span><span class="nv">task_name</span><span class="si">}</span>/</code></pre></figure>
+<span class="c"># Data Stores</span>
+<span class="nb">ls</span> <span class="k">${</span><span class="nv">container_working_dir</span><span class="k">}</span>/state/<span class="k">${</span><span class="nv">store</span><span class="p">-name</span><span class="k">}</span>/<span class="k">${</span><span class="nv">task_name</span><span class="k">}</span>/</code></pre></figure>
 
-<p>This allows the Node Manager&rsquo;s (NM) DeletionService to clean-up the working directory once the application completes or fails. In order to re-use local state store, the state store needs to be persisted outside the scope of NM&rsquo;s deletion service. The cluster administrator should set this location as an environment variable in Yarn - <code>LOGGED_STORE_BASE_DIR</code>.</p>
+<p>This allows the Node Manager’s (NM) DeletionService to clean up the working directory once the application completes or fails. In order to re-use a local state store, it needs to be persisted outside the scope of the NM’s deletion service. The cluster administrator should set this location as an environment variable in Yarn - <code>LOGGED_STORE_BASE_DIR</code>.</p>
 
-<p><img src="/img/latest/learn/documentation/yarn/samza-host-affinity.png" alt="Yarn host affinity component diagram" style="max-width: 100%; height: auto;" onclick="window.open(this.src)"/></p>
+<p><img src="/img/latest/learn/documentation/yarn/samza-host-affinity.png" alt="Yarn host affinity component diagram" style="max-width: 100%; height: auto;" onclick="window.open(this.src)" /></p>
 
-<p>Each time a task commits, Samza writes the last materialized offset from the changelog stream to the checksumed file on disk. This is also done on container shutdown. Thus, there is an <em>OFFSET</em> file associated with each state stores&rsquo; changelog partitions, that is consumed by the tasks in the container.</p>
+<p>Each time a task commits, Samza writes the last materialized offset from the changelog stream to a checksummed file on disk. This is also done on container shutdown. Thus, there is an <em>OFFSET</em> file associated with each state store’s changelog partitions that are consumed by the tasks in the container.</p>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="si">${</span><span class="nv">LOGGED_STORE_BASE_DIR</span><span class="si">}</span>/<span class="si">${</span><span class="nv">job</span><span class="p">.name</span><span class="si">}</span>-<span class="si">${</span><span class="nv">job</span><span class="p">.id</span><span class="si">}</span>/<span class="si">${</span><span class="nv">store</span><span class="p">.name</span><span class="si">}</span>/<span class="si">${</span><span class="nv">task</span><span class="p">.name</span><span class="si">}</span>/OFFSET</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="k">${</span><span class="nv">LOGGED_STORE_BASE_DIR</span><span class="k">}</span>/<span class="k">${</span><span class="nv">job</span><span class="p">.name</span><span class="k">}</span>-<span class="k">${</span><span class="nv">job</span><span class="p">.id</span><span class="k">}</span>/<span class="k">${</span><span class="nv">store</span><span class="p">.name</span><span class="k">}</span>/<span class="k">${</span><span class="nv">task</span><span class="p">.name</span><span class="k">}</span>/OFFSET</code></pre></figure>
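+
+<p>For example, on a NodeManager host you can inspect the recorded offset directly. A minimal sketch, assuming a hypothetical job <code>my-job</code> (id <code>1</code>) with a store <code>my-store</code> and task <code>Partition 0</code>:</p>
+
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Print the last checkpointed changelog offset for one task's store
+cat "${LOGGED_STORE_BASE_DIR}/my-job-1/my-store/Partition 0/OFFSET"</code></pre></figure>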
 
 <p>Now, when a container restarts on the same machine and the OFFSET file exists, the Samza container:</p>
 
 <ol>
-<li>Opens the persisted store on disk</li>
-<li>Reads the OFFSET file</li>
-<li>Restores the state store from the OFFSET value</li>
+  <li>Opens the persisted store on disk</li>
+  <li>Reads the OFFSET file</li>
+  <li>Restores the state store from the OFFSET value</li>
 </ol>
 
-<p>This significantly reduces the state restoration time on container start-up as we no longer consume from the beginning of the changelog stream. If the OFFSET file doesn&rsquo;t exist, it creates the state store and consumes from the oldest offset in the changelog to re-create the state. Since the OFFSET file is written on each commit after flushing the store, the recorded offset is guaranteed to correspond to the current contents of the store or some older point, but never newer. This gives at least once semantics for state restore. Therefore, the changelog entries must be idempotent.</p>
+<p>This significantly reduces the state restoration time on container start-up as we no longer consume from the beginning of the changelog stream. If the OFFSET file doesn’t exist, the container creates the state store and consumes from the oldest offset in the changelog to re-create the state. Since the OFFSET file is written on each commit after flushing the store, the recorded offset is guaranteed to correspond to the current contents of the store or some older point, but never newer. This gives at-least-once semantics for state restore. Therefore, the changelog entries must be idempotent (for example, replaying a put of a key’s full value is harmless, whereas replaying an increment is not).</p>
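+
+<p>The restore decision described above can be summarized with the following sketch (illustrative shell pseudocode, not actual Samza code; the path layout follows the figure above):</p>
+
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">store_dir="${LOGGED_STORE_BASE_DIR}/${job_name}-${job_id}/${store_name}/${task_name}"
+if [ -f "${store_dir}/OFFSET" ]; then
+  # Re-open the persisted store and replay the changelog from the recorded offset
+  echo "restore from offset $(cat "${store_dir}/OFFSET")"
+else
+  # No local state: create the store and replay the changelog from the oldest offset
+  echo "full restore from the oldest changelog offset"
+fi</code></pre></figure>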
 
 <p>It is necessary to periodically clean up unused or orphaned state stores on the machines to manage disk space. This feature is being worked on in <a href="https://issues.apache.org/jira/browse/SAMZA-656">SAMZA-656</a>.</p>
 
-<p>In order to re-use local state, Samza has to sucessfully claim the specific hosts from the Resource Manager (RM). To support this, the Samza containers write their locality information to the <a href="../container/coordinator-stream.html">Coordinator Stream</a> every time they start-up successfully. Now, the Samza Application Master (AM) can identify the last known host of a container via the <a href="../container/coordinator-stream.html">Job Coordinator</a>(JC) and the application is no longer agnostic of the container locality. On a container failure (due to any of the above cited reasons), the AM includes the hostname of the expected resource in the <a href="https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java#L239%5D">ResourceRequest</a>.</p>
+<p>In order to re-use local state, Samza has to successfully claim the specific hosts from the Resource Manager (RM). To support this, the Samza containers write their locality information to the <a href="../container/coordinator-stream.html">Coordinator Stream</a> every time they start up successfully. Now, the Samza Application Master (AM) can identify the last known host of a container via the <a href="../container/coordinator-stream.html">Job Coordinator</a> (JC) and the application is no longer agnostic of the container locality. On a container failure (due to any of the above-cited reasons), the AM includes the hostname of the expected resource in the <a href="https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java#L239">ResourceRequest</a>.</p>
 
-<p>Note that the Yarn cluster has to be configured to use <a href="https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a> with continuous-scheduling enabled. With continuous scheduling, the scheduler continuously iterates through all nodes in the cluster, instead of relying on the nodes&rsquo; heartbeat, and schedules work based on previously known status for each node, before relaxing locality. Hence, the scheduler takes care of relaxing locality after the configured delay. This approach can be considered as a &ldquo;<em>best-effort stickiness</em>&rdquo; policy because it is possible that the requested node is not running or does not have sufficient resources at the time of request (even though the state in the data stores may be persisted). For more details on the choice of Fair Scheduler, please refer the <a href="https://issues.apache.org/jira/secure/attachment/12726945/DESIGN-SAMZA-617-2.pdf">design doc</a>.</p>
+<p>Note that the Yarn cluster has to be configured to use the <a href="https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a> with continuous-scheduling enabled. With continuous scheduling, the scheduler continuously iterates through all nodes in the cluster, instead of relying on the nodes’ heartbeats, and schedules work based on the previously known status for each node, before relaxing locality. Hence, the scheduler takes care of relaxing locality after the configured delay. This approach can be considered a “<em>best-effort stickiness</em>” policy because it is possible that the requested node is not running or does not have sufficient resources at the time of request (even though the state in the data stores may be persisted). For more details on the choice of Fair Scheduler, please refer to the <a href="https://issues.apache.org/jira/secure/attachment/12726945/DESIGN-SAMZA-617-2.pdf">design doc</a>.</p>
 
 <h2 id="configuring-yarn-cluster-to-support-host-affinity">Configuring YARN cluster to support Host Affinity</h2>
 
 <ol>
-<li>Enable local state re-use by setting the <code>LOGGED_STORE_BASE_DIR</code> environment variable in yarn-env.sh 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="nb">export</span> <span class="nv">LOGGED<em>STORE</em>BASE_DIR</span><span class="o">=</span>&lt;path-for-state-stores&gt;</code></pre></figure>
-Without this configuration, the state stores are not persisted upon a container shutdown. This will effectively mean you will not re-use local state and hence, host-affinity becomes a moot operation.</li>
-<li><p>Configure Yarn to use Fair Scheduler and enable continuous-scheduling in yarn-site.xml 
-<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span><span class="nt">&lt;property&gt;</span>
-<span class="nt">&lt;name&gt;</span>yarn.resourcemanager.scheduler.class<span class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>The class to use as the resource scheduler.<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler<span class="nt">&lt;/value&gt;</span>
+  <li>Enable local state re-use by setting the <code>LOGGED_STORE_BASE_DIR</code> environment variable in yarn-env.sh.</li>
+</ol>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">export </span><span class="nv">LOGGED_STORE_BASE_DIR</span><span class="o">=</span>&lt;path-for-state-stores&gt;</code></pre></figure>
+<p>Without this configuration, the state stores are not persisted upon a container shutdown. This effectively means that local state will not be re-used and host-affinity becomes moot.</p>
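+
+<p>A quick way to verify the setting took effect is to check that store directories survive a container shutdown. A hedged example, assuming a hypothetical job <code>my-job</code> with id <code>1</code>:</p>
+
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># After a container stops, its store directories should still be present
+ls "${LOGGED_STORE_BASE_DIR}/my-job-1/"</code></pre></figure>
+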
+<ol start="2">
+  <li>Configure Yarn to use the Fair Scheduler and enable continuous-scheduling in yarn-site.xml.</li>
+</ol>
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt">&lt;property&gt;</span>
+    <span class="nt">&lt;name&gt;</span>yarn.resourcemanager.scheduler.class<span class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>The class to use as the resource scheduler.<span class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler<span class="nt">&lt;/value&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;property&gt;</span>
-<span class="nt">&lt;name&gt;</span>yarn.scheduler.fair.continuous-scheduling-enabled<span class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>Enable Continuous Scheduling of Resource Requests<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>true<span class="nt">&lt;/value&gt;</span>
+    <span class="nt">&lt;name&gt;</span>yarn.scheduler.fair.continuous-scheduling-enabled<span class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>Enable Continuous Scheduling of Resource Requests<span class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>true<span class="nt">&lt;/value&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;property&gt;</span>
-<span class="nt">&lt;name&gt;</span>yarn.scheduler.fair.locality-delay-node-ms<span class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>Delay time in milliseconds before relaxing locality at node-level<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>1000<span class="nt">&lt;/value&gt;</span>  <span class="c">&lt;!-- Should be tuned per requirement --&gt;</span>
+    <span class="nt">&lt;name&gt;</span>yarn.scheduler.fair.locality-delay-node-ms<span class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>Delay time in milliseconds before relaxing locality at node-level<span class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>1000<span class="nt">&lt;/value&gt;</span>  <span class="c">&lt;!-- Should be tuned per requirement --&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;property&gt;</span>
-<span class="nt">&lt;name&gt;</span>yarn.scheduler.fair.locality-delay-rack-ms<span class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>Delay time in milliseconds before relaxing locality at rack-level<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>1000<span class="nt">&lt;/value&gt;</span> <span class="c">&lt;!-- Should be tuned per requirement --&gt;</span>
-<span class="nt">&lt;/property&gt;</span></code></pre></figure></p></li>
-<li><p>Configure Yarn Node Manager SIGTERM to SIGKILL timeout to be reasonable time s.t. Node Manager will give Samza Container enough time to perform a clean shutdown in yarn-site.xml 
-<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span><span class="nt">&lt;property&gt;</span>
-<span class="nt">&lt;name&gt;</span>yarn.nodemanager.sleep-delay-before-sigkill.ms<span class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>No. of ms to wait between sending a SIGTERM and SIGKILL to a container<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>600000<span class="nt">&lt;/value&gt;</span> <span class="c">&lt;!-- Set it to 10min to allow enough time for clean shutdown of containers --&gt;</span>
-<span class="nt">&lt;/property&gt;</span></code></pre></figure></p></li>
-<li><p>The Yarn <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a> feature is not required and does not change the behavior of Samza Host Affinity. However, if Rack Awareness is configured in the cluster, make sure the DNSToSwitchMapping implementation is robust. Any failures could cause container requests to fall back to the defaultRack. This will cause ContainerRequests to not match the preferred host, which will degrade Host Affinity. For details, see <a href="https://issues.apache.org/jira/browse/SAMZA-886">SAMZA-866</a></p></li>
+    <span class="nt">&lt;name&gt;</span>yarn.scheduler.fair.locality-delay-rack-ms<span class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>Delay time in milliseconds before relaxing locality at rack-level<span class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>1000<span class="nt">&lt;/value&gt;</span> <span class="c">&lt;!-- Should be tuned per requirement --&gt;</span>
+<span class="nt">&lt;/property&gt;</span></code></pre></figure>
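+
+<p>To confirm that the Resource Manager picked up the Fair Scheduler, you can query its REST API. A sketch, assuming the RM web UI runs on its default port 8088:</p>
+
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># The scheduler info should report a fair scheduler
+curl -s "http://&lt;rm-host&gt;:8088/ws/v1/cluster/scheduler" | grep -io "fairscheduler"</code></pre></figure>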
+
+<ol start="3">
+  <li>Configure the Yarn Node Manager’s SIGTERM-to-SIGKILL timeout in yarn-site.xml to a reasonable value, so that the Node Manager gives the Samza container enough time to perform a clean shutdown.</li>
 </ol>
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt">&lt;property&gt;</span>
+    <span class="nt">&lt;name&gt;</span>yarn.nodemanager.sleep-delay-before-sigkill.ms<span class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>No. of ms to wait between sending a SIGTERM and SIGKILL to a container<span class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>600000<span class="nt">&lt;/value&gt;</span> <span class="c">&lt;!-- Set it to 10min to allow enough time for clean shutdown of containers --&gt;</span>
+<span class="nt">&lt;/property&gt;</span></code></pre></figure>
 
-<h2 id="configuring-a-samza-job-to-use-host-affinity">Configuring a Samza job to use Host Affinity</h2>
+<ol start="4">
+  <li>The Yarn <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a> feature is not required and does not change the behavior of Samza Host Affinity. However, if Rack Awareness is configured in the cluster, make sure the DNSToSwitchMapping implementation is robust. Any failures could cause container requests to fall back to the defaultRack. This will cause ContainerRequests to not match the preferred host, which will degrade Host Affinity. For details, see <a href="https://issues.apache.org/jira/browse/SAMZA-886">SAMZA-886</a>.</li>
+</ol>
 
+<h2 id="configuring-a-samza-job-to-use-host-affinity">Configuring a Samza job to use Host Affinity</h2>
 <p>Any stateful Samza job can leverage this feature to reduce the Mean Time To Restore (MTTR) of its state stores by setting <code>yarn.samza.host-affinity.enabled</code> to true.</p>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>yarn.samza.host-affinity.enabled<span class="o">=</span><span class="nb">true</span>  <span class="c1"># Default: false</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">yarn.samza.host-affinity.enabled<span class="o">=</span><span class="nb">true</span>  <span class="c"># Default: false</span></code></pre></figure>
 
 <p>Enabling this feature for a stateless Samza job should not have any adverse effect on the job.</p>
 
 <h2 id="host-affinity-guarantees">Host-affinity Guarantees</h2>
-
 <p>As noted above, host-affinity cannot be guaranteed all the time due to variable load distribution in the Yarn cluster. Hence, this is a best-effort policy that Samza provides. However, certain scenarios are worth calling out where these guarantees may be hard to achieve or are not applicable.</p>
 
 <ol>
-<li><em>When the number of containers and/or container-task assignment changes across successive application runs</em> - We may be able to re-use local state for a subset of partitions. Currently, there is no logic in the Job Coordinator to handle partitioning of tasks among containers intelligently. Handling this is more involved as relates to <a href="https://issues.apache.org/jira/browse/SAMZA-336">auto-scaling</a> of the containers. However, with <a href="https://issues.apache.org/jira/browse/SAMZA-906">task-container mapping</a>, this will work better for typical container count adjustments.</li>
-<li><em>When SystemStreamPartitionGrouper changes across successive application runs</em> - When the grouper logic used to distribute the partitions across containers changes, the data in the Coordinator Stream (for changelog-task partition assignment etc) and the data stores becomes invalid. Thus, to be safe, we should flush out all state-related data from the Coordinator Stream. An alternative is to overwrite the Task-ChangelogPartition assignment message and the Container Locality message in the Coordinator Stream, before starting up the job again.</li>
+  <li><em>When the number of containers and/or container-task assignment changes across successive application runs</em> - We may be able to re-use local state for a subset of partitions. Currently, there is no logic in the Job Coordinator to handle partitioning of tasks among containers intelligently. Handling this is more involved as it relates to <a href="https://issues.apache.org/jira/browse/SAMZA-336">auto-scaling</a> of the containers. However, with <a href="https://issues.apache.org/jira/browse/SAMZA-906">task-container mapping</a>, this will work better for typical container count adjustments.</li>
+  <li><em>When SystemStreamPartitionGrouper changes across successive application runs</em> - When the grouper logic used to distribute the partitions across containers changes, the data in the Coordinator Stream (for changelog-task partition assignment, etc.) and the data stores becomes invalid. Thus, to be safe, we should flush out all state-related data from the Coordinator Stream. An alternative is to overwrite the Task-ChangelogPartition assignment message and the Container Locality message in the Coordinator Stream before starting up the job again.</li>
 </ol>
 
-<h2 id="resource-localization"><a href="../yarn/yarn-resource-localization.html">Resource Localization &raquo;</a></h2>
+<h2 id="resource-localization-"><a href="../yarn/yarn-resource-localization.html">Resource Localization »</a></h2>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/yarn/yarn-resource-localization.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/yarn/yarn-resource-localization.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/yarn/yarn-resource-localization.html (original)
+++ samza/site/learn/documentation/latest/yarn/yarn-resource-localization.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/yarn/yarn-resource-localization">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/yarn/yarn-resource-localization">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/yarn/yarn-resource-localization">1.6.0</a></li>
 
               
@@ -638,80 +652,75 @@
    See the License for the specific language governing permissions and
    limitations under the License.
 -->
-
 <p>When running Samza jobs on YARN clusters, you may need to download some resources before startup (for example, downloading the job binaries, fetching certificate files, etc.). This step is called Resource Localization.</p>
 
 <h3 id="resource-localization-process">Resource Localization Process</h3>
 
-<p>For Samza jobs running on YARN, resource localization leverages the YARN node manager&rsquo;s localization service. Here is a <a href="https://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/">deep dive</a> on how localization works in YARN. </p>
+<p>For Samza jobs running on YARN, resource localization leverages the YARN node manager’s localization service. Here is a <a href="https://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/">deep dive</a> on how localization works in YARN.</p>
 
-<p>Depending on where and how the resource comes from, fetching the resource is associated with a scheme in the path (such as <code>http</code>, <code>https</code>, <code>hdfs</code>, <code>ftp</code>, <code>file</code>, etc). The scheme maps to a corresponding <code>FileSystem</code> implementation for handling the localization. </p>
+<p>Depending on where and how the resource comes from, fetching the resource is associated with a scheme in the path (such as <code class="language-plaintext highlighter-rouge">http</code>, <code class="language-plaintext highlighter-rouge">https</code>, <code class="language-plaintext highlighter-rouge">hdfs</code>, <code class="language-plaintext highlighter-rouge">ftp</code>, <code class="language-plaintext highlighter-rouge">file</code>, etc). The scheme maps to a corresponding <code class="language-plaintext highlighter-rouge">FileSystem</code> implementation for handling the localization.</p>
 
-<p>There are some predefined <code>FileSystem</code> implementations in Hadoop and Samza, which are provided if you run Samza jobs on YARN:</p>
+<p>There are some predefined <code class="language-plaintext highlighter-rouge">FileSystem</code> implementations in Hadoop and Samza, which are provided if you run Samza jobs on YARN:</p>
 
 <ul>
-<li><code>org.apache.samza.util.hadoop.HttpFileSystem</code>: used for fetching resources based on http or https without client side authentication.</li>
-<li><code>org.apache.hadoop.hdfs.DistributedFileSystem</code>: used for fetching resource from DFS system on Hadoop.</li>
-<li><code>org.apache.hadoop.fs.LocalFileSystem</code>: used for copying resources from local file system to the job directory.</li>
-<li><code>org.apache.hadoop.fs.ftp.FTPFileSystem</code>: used for fetching resources based on ftp.</li>
+  <li><code class="language-plaintext highlighter-rouge">org.apache.samza.util.hadoop.HttpFileSystem</code>: used for fetching resources over http or https without client-side authentication.</li>
+  <li><code class="language-plaintext highlighter-rouge">org.apache.hadoop.hdfs.DistributedFileSystem</code>: used for fetching resources from HDFS on Hadoop.</li>
+  <li><code class="language-plaintext highlighter-rouge">org.apache.hadoop.fs.LocalFileSystem</code>: used for copying resources from the local file system to the job directory.</li>
+  <li><code class="language-plaintext highlighter-rouge">org.apache.hadoop.fs.ftp.FTPFileSystem</code>: used for fetching resources over ftp.</li>
 </ul>
 
-<p>You can create your own file system implementation by creating a class which extends from <code>org.apache.hadoop.fs.FileSystem</code>. </p>
+<p>You can create your own file system implementation by creating a class which extends <code class="language-plaintext highlighter-rouge">org.apache.hadoop.fs.FileSystem</code>.</p>
 
 <h3 id="resource-configuration">Resource Configuration</h3>
-
 <p>You can specify a resource to be localized with the following configuration; a combined example follows the option descriptions below.</p>
 
 <h4 id="required-configuration">Required Configuration</h4>
-
 <ol>
-<li><code>yarn.resources.&lt;resourceName&gt;.path</code>
-
-<ul>
-<li>The path for fetching the resource for localization, e.g. http://hostname.com/packages/myResource</li>
-</ul></li>
+  <li><code class="language-plaintext highlighter-rouge">yarn.resources.&lt;resourceName&gt;.path</code>
+    <ul>
+      <li>The path for fetching the resource for localization, e.g. http://hostname.com/packages/myResource</li>
+    </ul>
+  </li>
 </ol>
 
 <h4 id="optional-configuration">Optional Configuration</h4>
-
 <ol>
-<li><code>yarn.resources.&lt;resourceName&gt;.local.name</code>
-
-<ul>
-<li>The local name used for the localized resource.</li>
-<li>If it is not set, the default will be the <code>&lt;resourceName&gt;</code> specified in <code>yarn.resources.&lt;resourceName&gt;.path</code></li>
-</ul></li>
-<li><code>yarn.resources.&lt;resourceName&gt;.local.type</code>
-
-<ul>
-<li>The type of the resource with valid values from: <code>ARCHIVE</code>, <code>FILE</code>, <code>PATTERN</code>.
-
-<ul>
-<li>ARCHIVE: the localized resource will be an archived directory;</li>
-<li>FILE: the localized resource will be a file;</li>
-<li>PATTERN: the localized resource will be the entries extracted from the archive with the pattern.</li>
-</ul></li>
-<li>If it is not set, the default value is <code>FILE</code>.</li>
-</ul></li>
-<li><code>yarn.resources.&lt;resourceName&gt;.local.visibility</code>
-
-<ul>
-<li>Visibility for the resource with valid values from <code>PUBLIC</code>, <code>PRIVATE</code>, <code>APPLICATION</code>
-
-<ul>
-<li>PUBLIC: visible to everyone </li>
-<li>PRIVATE: visible to just the account which runs the job</li>
-<li>APPLICATION: visible only to the specific application job which has the resource configuration</li>
-</ul></li>
-<li>If it is not set, the default value is <code>APPLICATION</code></li>
-</ul></li>
+  <li><code class="language-plaintext highlighter-rouge">yarn.resources.&lt;resourceName&gt;.local.name</code>
+    <ul>
+      <li>The local name used for the localized resource.</li>
+      <li>If it is not set, the default will be the <code class="language-plaintext highlighter-rouge">&lt;resourceName&gt;</code> specified in <code class="language-plaintext highlighter-rouge">yarn.resources.&lt;resourceName&gt;.path</code></li>
+    </ul>
+  </li>
+  <li><code class="language-plaintext highlighter-rouge">yarn.resources.&lt;resourceName&gt;.local.type</code>
+    <ul>
+      <li>The type of the resource with valid values from: <code class="language-plaintext highlighter-rouge">ARCHIVE</code>, <code class="language-plaintext highlighter-rouge">FILE</code>, <code class="language-plaintext highlighter-rouge">PATTERN</code>.
+        <ul>
+          <li>ARCHIVE: the localized resource will be an archived directory;</li>
+          <li>FILE: the localized resource will be a file;</li>
+          <li>PATTERN: the localized resource will be the entries extracted from the archive with the pattern.</li>
+        </ul>
+      </li>
+      <li>If it is not set, the default value is <code class="language-plaintext highlighter-rouge">FILE</code>.</li>
+    </ul>
+  </li>
+  <li><code class="language-plaintext highlighter-rouge">yarn.resources.&lt;resourceName&gt;.local.visibility</code>
+    <ul>
+      <li>Visibility for the resource with valid values from <code class="language-plaintext highlighter-rouge">PUBLIC</code>, <code class="language-plaintext highlighter-rouge">PRIVATE</code>, <code class="language-plaintext highlighter-rouge">APPLICATION</code>
+        <ul>
+          <li>PUBLIC: visible to everyone</li>
+          <li>PRIVATE: visible to just the account which runs the job</li>
+          <li>APPLICATION: visible only to the specific application job which has the resource configuration</li>
+        </ul>
+      </li>
+      <li>If it is not set, the default value is <code class="language-plaintext highlighter-rouge">APPLICATION</code></li>
+    </ul>
+  </li>
 </ol>
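+
+<p>Putting the options above together, a complete resource configuration might look like the following. This is a sketch; the resource name <code class="language-plaintext highlighter-rouge">myCerts</code>, the URL, and the file names are illustrative:</p>
+
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Appended to the job's .properties file (names and URL are examples)
+cat &gt;&gt; my-job.properties &lt;&lt;'EOF'
+yarn.resources.myCerts.path=https://hostname.com/packages/certs.tar.gz
+yarn.resources.myCerts.local.name=certs
+yarn.resources.myCerts.local.type=ARCHIVE
+yarn.resources.myCerts.local.visibility=APPLICATION
+EOF</code></pre></figure>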
 
 <h3 id="yarn-configuration">YARN Configuration</h3>
+<p>Make sure the scheme used in the <code class="language-plaintext highlighter-rouge">yarn.resources.&lt;resourceName&gt;.path</code> is configured with a corresponding FileSystem implementation in YARN core-site.xml.</p>
 
-<p>Make sure the scheme used in the <code>yarn.resources.&lt;resourceName&gt;.path</code> is configured with a corresponding FileSystem implementation in YARN core-site.xml.</p>
-
-<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span><span class="cp">&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;</span>
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="cp">&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;</span>
 <span class="nt">&lt;configuration&gt;</span>
     <span class="nt">&lt;property&gt;</span>
       <span class="nt">&lt;name&gt;</span>fs.http.impl<span class="nt">&lt;/name&gt;</span>
@@ -719,9 +728,9 @@
     <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;/configuration&gt;</span></code></pre></figure>
 
-<p>If you are using your own scheme (for example, <code>yarn.resources.myResource.path=myScheme://host.com/test</code>), you can link your <a href="https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html">FileSystem</a> implementation with it as follows.</p>
+<p>If you are using your own scheme (for example, <code class="language-plaintext highlighter-rouge">yarn.resources.myResource.path=myScheme://host.com/test</code>), you can link your <a href="https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html">FileSystem</a> implementation with it as follows.</p>
 
-<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span><span class="cp">&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;</span>
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="cp">&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;</span>
 <span class="nt">&lt;configuration&gt;</span>
     <span class="nt">&lt;property&gt;</span>
       <span class="nt">&lt;name&gt;</span>fs.myScheme.impl<span class="nt">&lt;/name&gt;</span>
@@ -729,7 +738,7 @@
     <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;/configuration&gt;</span></code></pre></figure>
 
-<h2 id="yarn-security"><a href="../yarn/yarn-security.html">Yarn Security &raquo;</a></h2>
+<h2 id="yarn-security-"><a href="../yarn/yarn-security.html">Yarn Security »</a></h2>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/yarn/yarn-security.html
URL: http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/yarn/yarn-security.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/yarn/yarn-security.html (original)
+++ samza/site/learn/documentation/latest/yarn/yarn-security.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a href="/learn/documentation/1.8.0/yarn/yarn-security">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a href="/learn/documentation/1.7.0/yarn/yarn-security">1.7.0</a></li>
+
+              
+
               <li class="hide"><a href="/learn/documentation/1.6.0/yarn/yarn-security">1.6.0</a></li>
 
               
@@ -646,59 +660,86 @@
 <p>One of the challenges for a long-lived application running on a secure YARN cluster is its token renewal strategy. Samza takes the following approach to managing token creation and renewal.</p>
 
 <ol>
-<li><p>Client running Samza app needs to kinit into KDC with his credentials and add the HDFS delegation tokens to the launcher context before submitting the application.</p></li>
-<li><p>Next, client prepares the local resources for the application as follows.
+  <li>
+    <p>The client running the Samza app needs to kinit into the KDC with their credentials and add the HDFS delegation tokens to the launcher context before submitting the application (a minimal sketch of this client-side flow follows the list).</p>
+  </li>
+  <li>
+    <p>Next, the client prepares the local resources for the application as follows.
 2.1. First, it creates a staging directory on HDFS. This directory is accessible only by the running user and is used to store resources required for the Application Master (AM) and Containers.
 2.2. The client then adds the keytab as a local resource in the container launcher context for the AM.
-2.3. Finally, it sends the corresponding principal and the path to the keytab file in the staging directory to the coordinator stream. Samza currently uses the staging directory to store both the keytab and refreshed tokens because the access to the directory is secured via Kerberos.</p></li>
-<li><p>Once the resource is allocated for the Application Master, the Node Manager will localizes app resources from HDFS using the HDFS delegation tokens in the launcher context. Same rule applies to Container localization too. </p></li>
-<li><p>When Application Master starts, it localizes the keytab file into its working directory and reads the principal from the coordinator stream.</p></li>
-<li><p>The Application Master periodically re-authenticate itself with the given principal and keytab. In each iteration, it creates new delegation tokens and stores them in the given job specific staging directory on HDFS.</p></li>
-<li><p>Each running container will get new delegation tokens from the credentials file on HDFS before the current ones expire.</p></li>
-<li><p>Application Master and Containers don&rsquo;t communicate with each other for that matter. Each side proceeds independently by reading or writing the tokens on HDFS.</p></li>
+2.3. Finally, it sends the corresponding principal and the path to the keytab file in the staging directory to the coordinator stream. Samza currently uses the staging directory to store both the keytab and the refreshed tokens because access to the directory is secured via Kerberos.</p>
+  </li>
+  <li>
+    <p>Once the resource is allocated for the Application Master, the Node Manager will localize app resources from HDFS using the HDFS delegation tokens in the launcher context. The same rule applies to Container localization too.</p>
+  </li>
+  <li>
+    <p>When the Application Master starts, it localizes the keytab file into its working directory and reads the principal from the coordinator stream.</p>
+  </li>
+  <li>
+    <p>The Application Master periodically re-authenticates itself with the given principal and keytab. In each iteration, it creates new delegation tokens and stores them in the given job-specific staging directory on HDFS.</p>
+  </li>
+  <li>
+    <p>Each running container will get new delegation tokens from the credentials file on HDFS before the current ones expire.</p>
+  </li>
+  <li>
+    <p>The Application Master and Containers don’t communicate with each other for this purpose. Each side proceeds independently by reading or writing the tokens on HDFS.</p>
+  </li>
 </ol>
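+
+<p>A minimal sketch of the client-side flow in step 1 above (the principal, keytab path, and properties file are illustrative):</p>
+
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Obtain Kerberos credentials from the keytab, then submit the job as usual
+kinit -kt /etc/krb5.keytab.user user@EXAMPLE.COM
+bin/run-app.sh --config-path=$PWD/config/my-secure-job.properties</code></pre></figure>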
 
-<p>By default, any HDFS delegation token has a maximum life of 7 days (configured by <code>dfs.namenode.delegation.token.max-lifetime</code> in hdfs-site.xml) and the token is normally renewed every 24 hours (configured by <code>dfs.namenode.delegation.token.renew-interval</code> in hdfs-site.xml). What if the Application Master dies and needs restarts after 7 days? The original HDFS delegation token stored in the launcher context will be invalid no matter what. Luckily, Samza can rely on Resource Manager to handle this scenario. See the Configuration section below for details.  </p>
+<p>By default, any HDFS delegation token has a maximum life of 7 days (configured by <code class="language-plaintext highlighter-rouge">dfs.namenode.delegation.token.max-lifetime</code> in hdfs-site.xml) and the token is normally renewed every 24 hours (configured by <code class="language-plaintext highlighter-rouge">dfs.namenode.delegation.token.renew-interval</code> in hdfs-site.xml). What if the Application Master dies and needs to restart after 7 days? The original HDFS delegation token stored in the launcher context will be invalid no matter what. Luckily, Samza can rely on the Resource Manager to handle this scenario. See the Configuration section below for details.</p>
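+
+<p>You can confirm the effective token settings on a cluster with the <code class="language-plaintext highlighter-rouge">hdfs getconf</code> utility (both values are reported in milliseconds):</p>
+
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hdfs getconf -confKey dfs.namenode.delegation.token.max-lifetime
+hdfs getconf -confKey dfs.namenode.delegation.token.renew-interval</code></pre></figure>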
 
 <h3 id="components">Components</h3>
 
 <h4 id="securitymanager">SecurityManager</h4>
 
-<p>When ApplicationMaster starts, it spawns <code>SamzaAppMasterSecurityManager</code>, which runs on its separate thread. The <code>SamzaAppMasterSecurityManager</code> is responsible for periodically logging in through the given Kerberos keytab and regenerates the HDFS delegation tokens regularly. After each run, it writes new tokens on a pre-defined job specific directory on HDFS. The frequency of this process is determined by <code>yarn.token.renewal.interval.seconds</code>.</p>
+<p>When the ApplicationMaster starts, it spawns a <code class="language-plaintext highlighter-rouge">SamzaAppMasterSecurityManager</code>, which runs on its own thread. The <code class="language-plaintext highlighter-rouge">SamzaAppMasterSecurityManager</code> is responsible for periodically logging in with the given Kerberos keytab and regenerating the HDFS delegation tokens regularly. After each run, it writes the new tokens to a pre-defined job-specific directory on HDFS. The frequency of this process is determined by <code class="language-plaintext highlighter-rouge">yarn.token.renewal.interval.seconds</code>.</p>
 
-<p>Each container, upon start, runs a <code>SamzaContainerSecurityManager</code>. It reads from the credentials file on HDFS and refreshes its delegation tokens at the same interval.</p>
+<p>Each container, upon start, runs a <code class="language-plaintext highlighter-rouge">SamzaContainerSecurityManager</code>. It reads from the credentials file on HDFS and refreshes its delegation tokens at the same interval.</p>
 
 <h3 id="configuration">Configuration</h3>
 
 <ol>
-<li>For the Samza job, the following job configurations are required on a YARN cluster with security enabled.
-# Job
-job.security.manager.factory=org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory</li>
+  <li>For the Samza job, the following job configuration is required on a YARN cluster with security enabled.
+    <figure class="highlight"><pre><code class="language-properties" data-lang="properties"><span class="c"># Job</span>
+<span class="py">job.security.manager.factory</span><span class="p">=</span><span class="s">org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory</span></code></pre></figure>
+  </li>
 </ol>
 
-<figure class="highlight"><pre><code class="language-properties" data-lang="properties"><span></span><span class="na">yarn.kerberos.principal</span><span class="o">=</span><span class="s">user/localhost</span>
-<span class="na">yarn.kerberos.keytab</span><span class="o">=</span><span class="s">/etc/krb5.keytab.user</span>
-<span class="na">yarn.token.renewal.interval.seconds</span><span class="o">=</span><span class="s">86400</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-properties" data-lang="properties"><span class="c"># YARN</span>
+<span class="py">yarn.kerberos.principal</span><span class="p">=</span><span class="s">user/localhost</span>
+<span class="py">yarn.kerberos.keytab</span><span class="p">=</span><span class="s">/etc/krb5.keytab.user</span>
+<span class="py">yarn.token.renewal.interval.seconds</span><span class="p">=</span><span class="s">86400</span></code></pre></figure>
 
 <ol>
-<li>Configure the Hadoop cluster to enable Resource Manager to recreate and renew the delegation token on behalf of the application user. This will address the following 2 scenarios.</li>
-</ol>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>* When Application Master dies unexpectedly and needs a restart after 7 days (the default maximum lifespan a delegation token can be renewed).
+  <li>
+    <p>Configure the Hadoop cluster to enable the Resource Manager to recreate and renew the delegation token on behalf of the application user. This will address the following two scenarios.</p>
 
-* When the Samza job terminates and log aggregation is turned on for the job. Node managers need to be able to upload all the local application logs to HDFS.
+    <ul>
+      <li>
+        <p>When the Application Master dies unexpectedly and needs a restart after 7 days (the default maximum lifetime for which a delegation token can be renewed).</p>
+      </li>
+      <li>
+        <p>When the Samza job terminates and log aggregation is turned on for the job, node managers need to be able to upload all the local application logs to HDFS.</p>
+      </li>
+    </ul>
+
+    <ol>
+      <li>Enable the resource manager as a privileged user in yarn-site.xml.</li>
+    </ol>
+  </li>
+</ol>
 
-1. Enable the resource manager as a privileged user in yarn-site.xml.
-</code></pre></div>
-<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span>        <span class="nt">&lt;property&gt;</span>
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml">        <span class="nt">&lt;property&gt;</span>
             <span class="nt">&lt;name&gt;</span>yarn.resourcemanager.proxy-user-privileges.enabled<span class="nt">&lt;/name&gt;</span>
             <span class="nt">&lt;value&gt;</span>true<span class="nt">&lt;/value&gt;</span>
         <span class="nt">&lt;/property&gt;</span>
     </code></pre></figure>
-<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>2. Make `yarn` as a proxy user, in core-site.xml
-</code></pre></div>
-<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span>        <span class="nt">&lt;property&gt;</span>
+
+<ol start="2">
+  <li>Make <code class="language-plaintext highlighter-rouge">yarn</code> a proxy user, in core-site.xml.</li>
+</ol>
+
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml">        <span class="nt">&lt;property&gt;</span>
             <span class="nt">&lt;name&gt;</span>hadoop.proxyuser.yarn.hosts<span class="nt">&lt;/name&gt;</span>
             <span class="nt">&lt;value&gt;</span>*<span class="nt">&lt;/value&gt;</span>
         <span class="nt">&lt;/property&gt;</span>
@@ -708,6 +749,7 @@ job.security.manager.factory=org.apache.
         <span class="nt">&lt;/property&gt;</span>
     </code></pre></figure>
 
+
            
         </div>
       </div>

Modified: samza/site/learn/tutorials/latest/deploy-samza-job-from-hdfs.html
URL: http://svn.apache.org/viewvc/samza/site/learn/tutorials/latest/deploy-samza-job-from-hdfs.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/tutorials/latest/deploy-samza-job-from-hdfs.html (original)
+++ samza/site/learn/tutorials/latest/deploy-samza-job-from-hdfs.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -545,11 +551,11 @@
    limitations under the License.
 -->
 
-<p>This tutorial uses <a href="../../../startup/hello-samza/latest/">hello-samza</a> to illustrate how to run a Samza job if you want to publish the Samza job&rsquo;s .tar.gz package to HDFS.</p>
+<p>This tutorial uses <a href="../../../startup/hello-samza/latest/">hello-samza</a> to illustrate how to run a Samza job if you want to publish the Samza job’s .tar.gz package to HDFS.</p>
 
 <h3 id="upload-the-package">Upload the package</h3>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>hadoop fs -put ./target/hello-samza-1.1.0-dist.tar.gz /path/for/tgz</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop fs <span class="nt">-put</span> ./target/hello-samza-1.1.0-dist.tar.gz /path/for/tgz</code></pre></figure>
 
 <h3 id="add-hdfs-configuration">Add HDFS configuration</h3>
 
@@ -559,7 +565,7 @@
 
 <p>Change the yarn.package.path in the properties file to your HDFS location.</p>
 
-<figure class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span></span><span class="na">yarn.package.path</span><span class="o">=</span><span class="s">hdfs://&lt;hdfs name node ip&gt;:&lt;hdfs name node port&gt;/path/to/tgz</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-jproperties" data-lang="jproperties">yarn.package.path=hdfs://&lt;hdfs name node ip&gt;:&lt;hdfs name node port&gt;/path/to/tgz</code></pre></figure>
 
 <p>Then you should be able to run the Samza job as described in <a href="../../../startup/hello-samza/latest/">hello-samza</a>.</p>
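+
+<p>For example, from the extracted hello-samza directory (the properties file name is illustrative):</p>
+
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">bin/run-app.sh --config-path=$PWD/config/wikipedia-parser.properties</code></pre></figure>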
 

Modified: samza/site/learn/tutorials/latest/deploy-samza-to-CDH.html
URL: http://svn.apache.org/viewvc/samza/site/learn/tutorials/latest/deploy-samza-to-CDH.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/tutorials/latest/deploy-samza-to-CDH.html (original)
+++ samza/site/learn/tutorials/latest/deploy-samza-to-CDH.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" href="/releases/1.6.0">1.6.0</a>
       
         
@@ -547,39 +553,40 @@
 
 <p>The tutorial assumes you have successfully run <a href="../../../startup/hello-samza/latest/">hello-samza</a> and now you want to deploy the job to your Cloudera Data Hub (<a href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html">CDH</a>). This tutorial is based on CDH 5.4.0 and uses hello-samza as the example job.</p>
 
-<h3 id="compile-package-for-cdh-5-4-0">Compile Package for CDH 5.4.0</h3>
+<h3 id="compile-package-for-cdh-540">Compile Package for CDH 5.4.0</h3>
 
 <p>We need to use a specific compile option to build the hello-samza package for CDH 5.4.0.</p>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>mvn clean package -Dhadoop.version<span class="o">=</span>cdh5.4.0</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">mvn clean package <span class="nt">-Dhadoop</span>.version<span class="o">=</span>cdh5.4.0</code></pre></figure>
 
 <h3 id="upload-package-to-cluster">Upload Package to Cluster</h3>
 
-<p>There are a few ways of uploading the package to the cluster&rsquo;s HDFS. If you do not have the job package in your cluster, <strong>scp</strong> from you local machine to the cluster. Then run</p>
+<p>There are a few ways of uploading the package to the cluster’s HDFS. If you do not have the job package in your cluster, <strong>scp</strong> it from your local machine to the cluster. Then run</p>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>hadoop fs -put path/to/hello-samza-1.1.0-dist.tar.gz /path/for/tgz</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop fs <span class="nt">-put</span> path/to/hello-samza-1.1.0-dist.tar.gz /path/for/tgz</code></pre></figure>
 
 <h3 id="get-deploying-scripts">Get Deploying Scripts</h3>
 
 <p>Untar the job package (assuming you will run from the current directory).</p>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>tar -xvf path/to/samza-job-package-1.1.0-dist.tar.gz -C ./</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">tar</span> <span class="nt">-xvf</span> path/to/samza-job-package-1.1.0-dist.tar.gz <span class="nt">-C</span> ./</code></pre></figure>
 
 <h3 id="add-package-path-to-properties-file">Add Package Path to Properties File</h3>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>vim config/wikipedia-parser.properties</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">vim config/wikipedia-parser.properties</code></pre></figure>
 
 <p>Change the yarn package path:</p>
 
-<figure class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span></span><span class="na">yarn.package.path</span><span class="o">=</span><span class="s">hdfs://&lt;hdfs name node ip&gt;:&lt;hdfs name node port&gt;/path/to/tgz</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-jproperties" data-lang="jproperties">yarn.package.path=hdfs://&lt;hdfs name node ip&gt;:&lt;hdfs name node port&gt;/path/to/tgz</code></pre></figure>
 
 <h3 id="set-yarn-environment-variable">Set Yarn Environment Variable</h3>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="nb">export</span> <span class="nv">HADOOP_CONF_DIR</span><span class="o">=</span>/etc/hadoop/conf</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">export </span><span class="nv">HADOOP_CONF_DIR</span><span class="o">=</span>/etc/hadoop/conf</code></pre></figure>
 
 <h3 id="run-samza-job">Run Samza Job</h3>
 
-<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>bin/run-app.sh --config-path<span class="o">=</span><span class="nv">$PWD</span>/config/wikipedia-parser.properties</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash">bin/run-app.sh <span class="nt">--config-path</span><span class="o">=</span><span class="nv">$PWD</span>/config/wikipedia-parser.properties</code></pre></figure>
+
 
            
         </div>