Posted to commits@apex.apache.org by th...@apache.org on 2016/09/07 02:14:39 UTC

[1/6] apex-site git commit: Update apex-3.4 documentation from master to include security changes and development best practices.

Repository: apex-site
Updated Branches:
  refs/heads/asf-site 974baceda -> d396fa83b


http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/operator_development/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/operator_development/index.html b/docs/apex-3.4/operator_development/index.html
index a08e0d3..f03bff4 100644
--- a/docs/apex-3.4/operator_development/index.html
+++ b/docs/apex-3.4/operator_development/index.html
@@ -161,6 +161,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -610,7 +617,7 @@ ports.</p>
 replaced.</li>
 </ol>
 <h1 id="malhar-operator-library">Malhar Operator Library</h1>
-<p>To see the full list of Apex Malhar operators along with related documentation, visit <a href="https://github.com/apache/incubator-apex-malhar">Apex Malhar on Github</a></p>
+<p>To see the full list of Apex Malhar operators along with related documentation, visit <a href="https://github.com/apache/apex-malhar">Apex Malhar on Github</a></p>
               
             </div>
           </div>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/search.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/search.html b/docs/apex-3.4/search.html
index 0ce9901..72484a3 100644
--- a/docs/apex-3.4/search.html
+++ b/docs/apex-3.4/search.html
@@ -98,6 +98,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/security/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/security/index.html b/docs/apex-3.4/security/index.html
index 527af0f..3a2080f 100644
--- a/docs/apex-3.4/security/index.html
+++ b/docs/apex-3.4/security/index.html
@@ -102,6 +102,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -188,32 +195,11 @@
                 <h1 id="security">Security</h1>
 <p>Applications built on Apex run as native YARN applications on Hadoop. The security framework and apparatus in Hadoop apply to the applications. The default security mechanism in Hadoop is Kerberos.</p>
 <h2 id="kerberos-authentication">Kerberos Authentication</h2>
-<p>Kerberos is a ticket based authentication system that provides authentication in a distributed environment where authentication is needed between multiple users, hosts and services. It is the de-facto authentication mechanism supported in Hadoop. To use Kerberos authentication, the Hadoop installation must first be configured for secure mode with Kerberos. Please refer to the administration guide of your Hadoop distribution on how to do that. Once Hadoop is configured, there is some configuration needed on Apex side as well.</p>
+<p>Kerberos is a ticket based authentication system that provides authentication in a distributed environment where authentication is needed between multiple users, hosts and services. It is the de-facto authentication mechanism supported in Hadoop. To use Kerberos authentication, the Hadoop installation must first be configured for secure mode with Kerberos. Please refer to the administration guide of your Hadoop distribution on how to do that. Once Hadoop is configured, some configuration is needed on the Apex side as well.</p>
 <h2 id="configuring-security">Configuring security</h2>
-<p>There is Hadoop configuration and CLI configuration. Hadoop configuration may be optional.</p>
-<h3 id="hadoop-configuration">Hadoop Configuration</h3>
-<p>An Apex application uses delegation tokens to authenticate with the ResourceManager (YARN) and NameNode (HDFS) and these tokens are issued by those servers respectively. Since the application is long-running,
-the tokens should be valid for the lifetime of the application. Hadoop has a configuration setting for the maximum lifetime of the tokens and they should be set to cover the lifetime of the application. There are separate settings for ResourceManager and NameNode delegation
-tokens.</p>
-<p>The ResourceManager delegation token max lifetime is specified in <code>yarn-site.xml</code> and can be specified as follows for example for a lifetime of 1 year</p>
-<pre><code class="xml">&lt;property&gt;
-  &lt;name&gt;yarn.resourcemanager.delegation.token.max-lifetime&lt;/name&gt;
-  &lt;value&gt;31536000000&lt;/value&gt;
-&lt;/property&gt;
-</code></pre>
-
-<p>The NameNode delegation token max lifetime is specified in
-hdfs-site.xml and can be specified as follows for example for a lifetime of 1 year</p>
-<pre><code class="xml">&lt;property&gt;
-   &lt;name&gt;dfs.namenode.delegation.token.max-lifetime&lt;/name&gt;
-   &lt;value&gt;31536000000&lt;/value&gt;
- &lt;/property&gt;
-</code></pre>
-
+<p>The Apex command line interface (CLI) program, <code>apex</code>, is used to launch applications on the Hadoop cluster and to perform various other operations and administrative tasks on the applications. In a secure cluster, additional configuration is needed for the CLI program.</p>
 <h3 id="cli-configuration">CLI Configuration</h3>
-<p>The Apex command line interface is used to launch
-applications along with performing various other operations and administrative tasks on the applications. When Kerberos security is enabled in Hadoop, a Kerberos ticket granting ticket (TGT) or the Kerberos credentials of the user are needed by the CLI program <code>apex</code> to authenticate with Hadoop for any operation. Kerberos credentials are composed of a principal and either a <em>keytab</em> or a password. For security and operational reasons only keytabs are supported in Hadoop and by extension in Apex platform. When user credentials are specified, all operations including launching
-application are performed as that user.</p>
+<p>When Kerberos security is enabled in Hadoop, a Kerberos ticket granting ticket (TGT) or the Kerberos credentials of the user are needed by the CLI program <code>apex</code> to authenticate with Hadoop for any operation. Kerberos credentials are composed of a principal and either a <em>keytab</em> or a password. For security and operational reasons, only keytabs are supported in Hadoop and, by extension, in the Apex platform. When user credentials are specified, all operations, including launching applications, are performed as that user.</p>
 <h4 id="using-kinit">Using kinit</h4>
 <p>A Kerberos ticket granting ticket (TGT) can be obtained by using the Kerberos command <code>kinit</code>. Detailed documentation for the command can be found online or in man pages. A sample usage of this command is</p>
 <pre><code>kinit -k -t path-tokeytab-file kerberos-principal
@@ -235,7 +221,96 @@ home directory. The location of this file will be <code>$HOME/.dt/dt-site.xml</c
 </code></pre>
 
 <p>The property <code>dt.authentication.principal</code> specifies the Kerberos user principal and <code>dt.authentication.keytab</code> specifies the absolute path to the keytab file for the user.</p>
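+<p>As an illustration, these two properties might be configured in the <code>dt-site.xml</code> file roughly as in the following sketch; the values shown are placeholders for the actual Kerberos principal and the absolute path to the keytab file.</p>
+<pre><code class="xml">&lt;!-- placeholder values; replace with the actual principal and keytab path --&gt;
+&lt;property&gt;
+        &lt;name&gt;dt.authentication.principal&lt;/name&gt;
+        &lt;value&gt;kerberos-principal&lt;/value&gt;
+&lt;/property&gt;
+
+&lt;property&gt;
+        &lt;name&gt;dt.authentication.keytab&lt;/name&gt;
+        &lt;value&gt;absolute-path-to-keytab-file&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+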
+<h3 id="web-services-security">Web Services security</h3>
+<p>Alongside every Apex application runs an application master process called Streaming Container Manager (STRAM). STRAM manages the application by handling its various control aspects, such as orchestrating the execution of the application on the cluster, playing a key role in scalability and fault tolerance, and providing application insight by collecting statistics, among other functions.</p>
+<p>STRAM provides a web service interface to introspect the state of the application and its various components and to make dynamic changes to the applications. Some examples of supported functionality are getting resource usage and partition information of various operators, getting operator statistics and changing properties of running operators.</p>
+<p>Access to the web services can be secured to prevent unauthorized access. By default, security is automatically enabled in Hadoop secure mode environments and disabled in non-secure environments. How the security actually works is described in the <code>Security architecture</code> section below.</p>
+<p>There are additional options available for finer-grained control over enabling it. This can be configured on a per-application basis using an application attribute, and it can also be enabled or disabled based on the Hadoop security configuration. The following security options are available:</p>
+<ul>
+<li>Enable - Enable Authentication</li>
+<li>Follow Hadoop Authentication - Enable authentication if secure mode is enabled in Hadoop, the default</li>
+<li>Follow Hadoop HTTP Authentication - Enable authentication only if HTTP authentication is enabled in Hadoop and not just secure mode.</li>
+<li>Disable - Disable Authentication</li>
+</ul>
+<p>To specify the security option for an application, the following configuration can be added to the <code>dt-site.xml</code> file:</p>
+<pre><code class="xml">&lt;property&gt;
+        &lt;name&gt;dt.application.name.attr.STRAM_HTTP_AUTHENTICATION&lt;/name&gt;
+        &lt;value&gt;security-option&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<p>The security option value can be <code>ENABLED</code>, <code>FOLLOW_HADOOP_AUTH</code>, <code>FOLLOW_HADOOP_HTTP_AUTH</code> or <code>DISABLE</code> for the four options above respectively.</p>
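+<p>As a concrete illustration, for a hypothetical application named <code>myapp</code>, the template above could be filled in with the application name and one of the values listed, for example:</p>
+<pre><code class="xml">&lt;!-- myapp is a hypothetical application name --&gt;
+&lt;property&gt;
+        &lt;name&gt;dt.application.myapp.attr.STRAM_HTTP_AUTHENTICATION&lt;/name&gt;
+        &lt;value&gt;ENABLED&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+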
 <p>The subsequent sections talk about how security works in Apex. This information is not needed by users but is intended for the inquisitive technical audience who want to know how security works.</p>
+<h3 id="token-refresh">Token Refresh</h3>
+<p>Apex applications, at runtime, use delegation tokens to authenticate with Hadoop services when communicating with them, as described in the security architecture section below. The delegation tokens are originally issued by these Hadoop services and have an expiry period, typically 7 days. The tokens become invalid beyond this time and the applications will no longer be able to communicate with the Hadoop services. For long-running applications this presents a problem.</p>
+<p>To solve this problem, one of two approaches can be used. The first approach is to change the Hadoop configuration itself to extend the token expiry period. This may not be possible in all environments, as it requires a change in the security policy (the tokens will now be valid for a longer period of time) and administrator privileges on Hadoop. The second approach is to use a feature available in Apex to auto-refresh the tokens before they expire. Both approaches are detailed below, and users can choose the one that works best for them.</p>
+<h4 id="hadoop-configuration-approach">Hadoop configuration approach</h4>
+<p>An Apex application uses delegation tokens to authenticate with the Hadoop services, Resource Manager (YARN) and Name Node (HDFS), and these tokens are issued by those services respectively. Since the application is long-running, the tokens can expire while the application is still running. Hadoop uses configuration settings for the maximum lifetime of these tokens.</p>
+<p>There are separate settings for the ResourceManager and NameNode delegation tokens. In this approach the user increases the values of these settings to cover the lifetime of the application. Once these settings are changed, the YARN and HDFS services have to be restarted. The values of these settings are of type <code>long</code> and have an upper limit, so applications cannot run forever. This limitation is not present with the next approach described below.</p>
+<p>The Resource Manager delegation token max lifetime is specified in <code>yarn-site.xml</code>; for example, it can be set as follows for a lifetime of 1 year:</p>
+<pre><code class="xml">&lt;property&gt;
+  &lt;name&gt;yarn.resourcemanager.delegation.token.max-lifetime&lt;/name&gt;
+  &lt;value&gt;31536000000&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<p>The Name Node delegation token max lifetime is specified in <code>hdfs-site.xml</code>; for example, it can be set as follows for a lifetime of 1 year:</p>
+<pre><code class="xml">&lt;property&gt;
+   &lt;name&gt;dfs.namenode.delegation.token.max-lifetime&lt;/name&gt;
+   &lt;value&gt;31536000000&lt;/value&gt;
+ &lt;/property&gt;
+</code></pre>
+
+<h4 id="auto-refresh-approach">Auto-refresh approach</h4>
+<p>In this approach the application, in anticipation of a token expiring, obtains a new token to replace the current one. It keeps repeating the process whenever a token is close to expiry so that the application can continue to run indefinitely.</p>
+<p>This requires the application to have access to a keytab file at runtime, because obtaining a new token requires a keytab. The keytab file should be present in HDFS so that the application can access it at runtime. The user can provide an HDFS location for the keytab file using a setting; otherwise, the keytab file specified for the <code>apex</code> CLI program above will be copied from the local filesystem into HDFS before the application is started and made available to the application. There are other optional settings available to configure the behavior of this feature. All the settings are described below.</p>
+<p>The location of the keytab can be specified by using the following setting in <code>dt-site.xml</code>. If it is not specified then the file specified in <code>dt.authentication.keytab</code> is copied into HDFS and used.</p>
+<pre><code class="xml">&lt;property&gt;
+        &lt;name&gt;dt.authentication.store.keytab&lt;/name&gt;
+        &lt;value&gt;hdfs-path-to-keytab-file&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<p>The expiry period of the Resource Manager and Name Node tokens needs to be known so that the application can renew them before they expire. These are automatically obtained using the <code>yarn.resourcemanager.delegation.token.max-lifetime</code> and <code>dfs.namenode.delegation.token.max-lifetime</code> properties from the Hadoop configuration files. Sometimes, however, these properties are not available or not kept up to date on the nodes running the applications. If that is the case, the following properties can be used to specify the expiry period; the values are in milliseconds. The example below shows how to specify these with values of 7 days.</p>
+<pre><code class="xml">&lt;property&gt;
+        &lt;name&gt;dt.resourcemanager.delegation.token.max-lifetime&lt;/name&gt;
+        &lt;value&gt;604800000&lt;/value&gt;
+&lt;/property&gt;
+
+&lt;property&gt;
+        &lt;name&gt;dt.namenode.delegation.token.max-lifetime&lt;/name&gt;
+        &lt;value&gt;604800000&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<p>As explained earlier, new tokens are obtained before the old ones expire. How early the new tokens are obtained before expiry is controlled by a setting. This setting is specified as a factor of the token expiration with a value between 0.0 and 1.0. The default value is <code>0.7</code>. This factor is multiplied with the expiration time to determine when to refresh the tokens. This setting can be changed by the user; the following example shows how this can be done:</p>
+<pre><code class="xml">&lt;property&gt;
+        &lt;name&gt;dt.authentication.token.refresh.factor&lt;/name&gt;
+        &lt;value&gt;0.7&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<h3 id="impersonation">Impersonation</h3>
+<p>The CLI program <code>apex</code> supports Hadoop proxy user impersonation, allowing applications to be launched and other operations to be performed as a user different from the one specified by the Kerberos credentials. The Kerberos credentials are still used for authentication. This is useful in scenarios where a system using <code>apex</code> has to support multiple users but only has a single set of Kerberos credentials, those of a system user.</p>
+<h4 id="usage">Usage</h4>
+<p>To use this feature, set the following environment variable to the name of the user being impersonated before running <code>apex</code>; the operations will then be performed as that user. For example, when launching an application, the application will run as the specified user and not as the user specified by the Kerberos credentials.</p>
+<pre><code>HADOOP_USER_NAME=&lt;username&gt;
+</code></pre>
+
+<h4 id="hadoop-configuration">Hadoop Configuration</h4>
+<p>For this feature to work, additional configuration settings are needed in Hadoop. These settings allow a specified user, such as a system user, to impersonate other users. The example snippet below shows these settings. In this example, the specified user can impersonate users belonging to any group and can do so from any host. Note that the user specified here is different from the user specified above in Usage: there it is the user being impersonated, while here it is the impersonating user, such as a system user.</p>
+<pre><code class="xml">&lt;property&gt;
+  &lt;name&gt;hadoop.proxyuser.&lt;username&gt;.groups&lt;/name&gt;
+  &lt;value&gt;*&lt;/value&gt;
+&lt;/property&gt;
+
+&lt;property&gt;
+  &lt;name&gt;hadoop.proxyuser.&lt;username&gt;.hosts&lt;/name&gt;
+  &lt;value&gt;*&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
 <h2 id="security-architecture">Security architecture</h2>
 <p>In this section we will see how security works for applications built on Apex. We will look at the different methodologies involved in running the applications and in each case we will look into the different components that are involved. We will go into the architecture of these components and look at the different security mechanisms that are in play.</p>
 <h3 id="application-launch">Application Launch</h3>
@@ -272,8 +347,12 @@ home directory. The location of this file will be <code>$HOME/.dt/dt-site.xml</c
 <p>When operators are running, there will be differences in effective processing rates between them due to intrinsic reasons such as operator logic, or external reasons such as differing availability of CPU, memory, network bandwidth, etc., since the operators run in different containers. To maximize performance and utilization, the data flow is handled asynchronously to the regular operator function, and a buffer is used to temporarily store the data being produced by the operator. This buffered data is served by a buffer server over the network connection to the downstream streaming container containing the operator that is supposed to receive the data from this operator. This connection is secured by a token called the buffer server token. These tokens are also generated and seeded by STRAM when the streaming containers are deployed and started, and it uses different tokens for different buffer servers for better security.</p>
 <h5 id="namenode-delegation-token">NameNode Delegation Token</h5>
 <p>Like STRAM, streaming containers also need to communicate with NameNode to use HDFS persistence for reasons such as saving the state of the operators. In secure mode they also use NameNode delegation tokens for authentication. These tokens are also seeded by STRAM for the streaming containers.</p>
+<h4 id="stram-web-services">Stram Web Services</h4>
+<p>Clients connect to STRAM and make web service requests to obtain operational information about running applications. When security is enabled, we want this connection to also be authenticated. In this mode the client passes a web service token in the request and STRAM checks this token. If the token is valid, the request is processed; otherwise it is denied.</p>
+<p>How does the client get the web service token in the first place? The client has to first connect to STRAM via the Resource Manager Web Services Proxy, a service run by Hadoop to proxy requests to application web services. This connection is authenticated by the proxy service using a protocol called SPNEGO when secure mode is enabled. SPNEGO is Kerberos over HTTP, and the client also needs to support it. If the authentication is successful, the proxy forwards the request to STRAM. STRAM, in processing the request, generates and sends back a web service token similar to a delegation token. This token is then used by the client in subsequent requests it makes directly to STRAM, and STRAM is able to validate it since it generated the token in the first place.</p>
+<p><img alt="" src="../images/security/image03.png" /></p>
 <h2 id="conclusion">Conclusion</h2>
-<p>We looked at the different security requirements for distributed applications when they run in a secure Hadoop environment and looked at how Apex solves this.</p>
+<p>We looked at the different security configuration options that are available in Apex, examined in detail the security requirements for distributed applications in a secure Hadoop environment, and saw how the various security mechanisms in Apex address them.</p>
               
             </div>
           </div>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/sitemap.xml
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/sitemap.xml b/docs/apex-3.4/sitemap.xml
index 7af727b..ef8957a 100644
--- a/docs/apex-3.4/sitemap.xml
+++ b/docs/apex-3.4/sitemap.xml
@@ -4,7 +4,7 @@
     
     <url>
      <loc>/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -13,31 +13,37 @@
         
     <url>
      <loc>/apex_development_setup/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/application_development/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/application_packages/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/operator_development/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/autometrics/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
+     <changefreq>daily</changefreq>
+    </url>
+        
+    <url>
+     <loc>/development_best_practices/</loc>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
@@ -47,13 +53,13 @@
         
     <url>
      <loc>/apex_cli/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/security/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
@@ -62,7 +68,7 @@
     
     <url>
      <loc>/compatibility/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
     


[4/6] apex-site git commit: from c3a284ba04d860705af016afe3348f0e523f48c1

Posted by th...@apache.org.
http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/operator_development/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/operator_development/index.html b/content/docs/apex-3.4/operator_development/index.html
index a08e0d3..f03bff4 100644
--- a/content/docs/apex-3.4/operator_development/index.html
+++ b/content/docs/apex-3.4/operator_development/index.html
@@ -161,6 +161,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -610,7 +617,7 @@ ports.</p>
 replaced.</li>
 </ol>
 <h1 id="malhar-operator-library">Malhar Operator Library</h1>
-<p>To see the full list of Apex Malhar operators along with related documentation, visit <a href="https://github.com/apache/incubator-apex-malhar">Apex Malhar on Github</a></p>
+<p>To see the full list of Apex Malhar operators along with related documentation, visit <a href="https://github.com/apache/apex-malhar">Apex Malhar on Github</a></p>
               
             </div>
           </div>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/search.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/search.html b/content/docs/apex-3.4/search.html
index 0ce9901..72484a3 100644
--- a/content/docs/apex-3.4/search.html
+++ b/content/docs/apex-3.4/search.html
@@ -98,6 +98,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/security/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/security/index.html b/content/docs/apex-3.4/security/index.html
index 527af0f..3a2080f 100644
--- a/content/docs/apex-3.4/security/index.html
+++ b/content/docs/apex-3.4/security/index.html
@@ -102,6 +102,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -188,32 +195,11 @@
                 <h1 id="security">Security</h1>
 <p>Applications built on Apex run as native YARN applications on Hadoop. The security framework and apparatus in Hadoop apply to the applications. The default security mechanism in Hadoop is Kerberos.</p>
 <h2 id="kerberos-authentication">Kerberos Authentication</h2>
-<p>Kerberos is a ticket based authentication system that provides authentication in a distributed environment where authentication is needed between multiple users, hosts and services. It is the de-facto authentication mechanism supported in Hadoop. To use Kerberos authentication, the Hadoop installation must first be configured for secure mode with Kerberos. Please refer to the administration guide of your Hadoop distribution on how to do that. Once Hadoop is configured, there is some configuration needed on Apex side as well.</p>
+<p>Kerberos is a ticket based authentication system that provides authentication in a distributed environment where authentication is needed between multiple users, hosts and services. It is the de-facto authentication mechanism supported in Hadoop. To use Kerberos authentication, the Hadoop installation must first be configured for secure mode with Kerberos. Please refer to the administration guide of your Hadoop distribution on how to do that. Once Hadoop is configured, some configuration is needed on the Apex side as well.</p>
 <h2 id="configuring-security">Configuring security</h2>
-<p>There is Hadoop configuration and CLI configuration. Hadoop configuration may be optional.</p>
-<h3 id="hadoop-configuration">Hadoop Configuration</h3>
-<p>An Apex application uses delegation tokens to authenticate with the ResourceManager (YARN) and NameNode (HDFS) and these tokens are issued by those servers respectively. Since the application is long-running,
-the tokens should be valid for the lifetime of the application. Hadoop has a configuration setting for the maximum lifetime of the tokens and they should be set to cover the lifetime of the application. There are separate settings for ResourceManager and NameNode delegation
-tokens.</p>
-<p>The ResourceManager delegation token max lifetime is specified in <code>yarn-site.xml</code> and can be specified as follows for example for a lifetime of 1 year</p>
-<pre><code class="xml">&lt;property&gt;
-  &lt;name&gt;yarn.resourcemanager.delegation.token.max-lifetime&lt;/name&gt;
-  &lt;value&gt;31536000000&lt;/value&gt;
-&lt;/property&gt;
-</code></pre>
-
-<p>The NameNode delegation token max lifetime is specified in
-hdfs-site.xml and can be specified as follows for example for a lifetime of 1 year</p>
-<pre><code class="xml">&lt;property&gt;
-   &lt;name&gt;dfs.namenode.delegation.token.max-lifetime&lt;/name&gt;
-   &lt;value&gt;31536000000&lt;/value&gt;
- &lt;/property&gt;
-</code></pre>
-
+<p>The Apex command line interface (CLI) program, <code>apex</code>, is used to launch applications on the Hadoop cluster and to perform various other operations and administrative tasks on the applications. In a secure cluster, additional configuration is needed for the CLI program.</p>
 <h3 id="cli-configuration">CLI Configuration</h3>
-<p>The Apex command line interface is used to launch
-applications along with performing various other operations and administrative tasks on the applications. When Kerberos security is enabled in Hadoop, a Kerberos ticket granting ticket (TGT) or the Kerberos credentials of the user are needed by the CLI program <code>apex</code> to authenticate with Hadoop for any operation. Kerberos credentials are composed of a principal and either a <em>keytab</em> or a password. For security and operational reasons only keytabs are supported in Hadoop and by extension in Apex platform. When user credentials are specified, all operations including launching
-application are performed as that user.</p>
+<p>When Kerberos security is enabled in Hadoop, a Kerberos ticket granting ticket (TGT) or the Kerberos credentials of the user are needed by the CLI program <code>apex</code> to authenticate with Hadoop for any operation. Kerberos credentials are composed of a principal and either a <em>keytab</em> or a password. For security and operational reasons, only keytabs are supported in Hadoop and, by extension, in the Apex platform. When user credentials are specified, all operations, including launching applications, are performed as that user.</p>
 <h4 id="using-kinit">Using kinit</h4>
 <p>A Kerberos ticket granting ticket (TGT) can be obtained by using the Kerberos command <code>kinit</code>. Detailed documentation for the command can be found online or in man pages. A sample usage of this command is</p>
 <pre><code>kinit -k -t path-tokeytab-file kerberos-principal
@@ -235,7 +221,96 @@ home directory. The location of this file will be <code>$HOME/.dt/dt-site.xml</c
 </code></pre>
 
 <p>The property <code>dt.authentication.principal</code> specifies the Kerberos user principal and <code>dt.authentication.keytab</code> specifies the absolute path to the keytab file for the user.</p>
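+<p>As an illustration, these two properties might be configured in the <code>dt-site.xml</code> file roughly as in the following sketch; the values shown are placeholders for the actual Kerberos principal and the absolute path to the keytab file.</p>
+<pre><code class="xml">&lt;!-- placeholder values; replace with the actual principal and keytab path --&gt;
+&lt;property&gt;
+        &lt;name&gt;dt.authentication.principal&lt;/name&gt;
+        &lt;value&gt;kerberos-principal&lt;/value&gt;
+&lt;/property&gt;
+
+&lt;property&gt;
+        &lt;name&gt;dt.authentication.keytab&lt;/name&gt;
+        &lt;value&gt;absolute-path-to-keytab-file&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+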
+<h3 id="web-services-security">Web Services security</h3>
+<p>Alongside every Apex application runs an application master process called Streaming Container Manager (STRAM). STRAM manages the application by handling its various control aspects, such as orchestrating the execution of the application on the cluster, playing a key role in scalability and fault tolerance, and providing application insight by collecting statistics, among other functions.</p>
+<p>STRAM provides a web service interface to introspect the state of the application and its various components and to make dynamic changes to the applications. Some examples of supported functionality are getting resource usage and partition information of various operators, getting operator statistics and changing properties of running operators.</p>
+<p>Access to the web services can be secured to prevent unauthorized access. By default, security is automatically enabled in Hadoop secure mode environments and disabled in non-secure environments. How the security actually works is described in the <code>Security architecture</code> section below.</p>
+<p>There are additional options available for finer-grained control over enabling it. This can be configured on a per-application basis using an application attribute, and it can also be enabled or disabled based on the Hadoop security configuration. The following security options are available:</p>
+<ul>
+<li>Enable - Enable Authentication</li>
+<li>Follow Hadoop Authentication - Enable authentication if secure mode is enabled in Hadoop, the default</li>
+<li>Follow Hadoop HTTP Authentication - Enable authentication only if HTTP authentication is enabled in Hadoop and not just secure mode.</li>
+<li>Disable - Disable Authentication</li>
+</ul>
+<p>To specify the security option for an application, the following configuration can be added to the <code>dt-site.xml</code> file:</p>
+<pre><code class="xml">&lt;property&gt;
+        &lt;name&gt;dt.application.name.attr.STRAM_HTTP_AUTHENTICATION&lt;/name&gt;
+        &lt;value&gt;security-option&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<p>The security option value can be <code>ENABLED</code>, <code>FOLLOW_HADOOP_AUTH</code>, <code>FOLLOW_HADOOP_HTTP_AUTH</code> or <code>DISABLE</code> for the four options above respectively.</p>
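+<p>As a concrete illustration, for a hypothetical application named <code>myapp</code>, the template above could be filled in with the application name and one of the values listed, for example:</p>
+<pre><code class="xml">&lt;!-- myapp is a hypothetical application name --&gt;
+&lt;property&gt;
+        &lt;name&gt;dt.application.myapp.attr.STRAM_HTTP_AUTHENTICATION&lt;/name&gt;
+        &lt;value&gt;ENABLED&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+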
 <p>The subsequent sections talk about how security works in Apex. This information is not needed by users but is intended for the inquisitive technical audience who want to know how security works.</p>
+<h3 id="token-refresh">Token Refresh</h3>
+<p>Apex applications, at runtime, use delegation tokens to authenticate with Hadoop services when communicating with them, as described in the security architecture section below. The delegation tokens are originally issued by these Hadoop services and have an expiry period, typically 7 days. The tokens become invalid beyond this time and the applications will no longer be able to communicate with the Hadoop services. For long-running applications this presents a problem.</p>
+<p>To solve this problem, one of two approaches can be used. The first approach is to change the Hadoop configuration itself to extend the token expiry period. This may not be possible in all environments, as it requires a change in the security policy (the tokens will now be valid for a longer period of time) and administrator privileges on Hadoop. The second approach is to use a feature available in Apex to auto-refresh the tokens before they expire. Both approaches are detailed below, and users can choose the one that works best for them.</p>
+<h4 id="hadoop-configuration-approach">Hadoop configuration approach</h4>
+<p>An Apex application uses delegation tokens to authenticate with the Hadoop services, Resource Manager (YARN) and Name Node (HDFS), and these tokens are issued by those services respectively. Since the application is long-running, the tokens can expire while the application is still running. Hadoop uses configuration settings for the maximum lifetime of these tokens.</p>
+<p>There are separate settings for the ResourceManager and NameNode delegation tokens. In this approach the user increases the values of these settings to cover the lifetime of the application. Once these settings are changed, the YARN and HDFS services have to be restarted. The values of these settings are of type <code>long</code> and have an upper limit, so applications cannot run forever. This limitation is not present with the next approach described below.</p>
+<p>The Resource Manager delegation token max lifetime is specified in <code>yarn-site.xml</code>; for example, it can be set as follows for a lifetime of 1 year:</p>
+<pre><code class="xml">&lt;property&gt;
+  &lt;name&gt;yarn.resourcemanager.delegation.token.max-lifetime&lt;/name&gt;
+  &lt;value&gt;31536000000&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<p>The Name Node delegation token max lifetime is specified in <code>hdfs-site.xml</code>; for example, it can be set as follows for a lifetime of 1 year:</p>
+<pre><code class="xml">&lt;property&gt;
+   &lt;name&gt;dfs.namenode.delegation.token.max-lifetime&lt;/name&gt;
+   &lt;value&gt;31536000000&lt;/value&gt;
+ &lt;/property&gt;
+</code></pre>
+
+<h4 id="auto-refresh-approach">Auto-refresh approach</h4>
+<p>In this approach the application, in anticipation of a token expiring, obtains a new token to replace the current one. It keeps repeating the process whenever a token is close to expiry so that the application can continue to run indefinitely.</p>
+<p>This requires the application to have access to a keytab file at runtime, because obtaining a new token requires a keytab. The keytab file should be present in HDFS so that the application can access it at runtime. The user can provide an HDFS location for the keytab file using a setting; otherwise, the keytab file specified for the <code>apex</code> CLI program above will be copied from the local filesystem into HDFS before the application is started and made available to the application. There are other optional settings available to configure the behavior of this feature. All the settings are described below.</p>
+<p>The location of the keytab can be specified by using the following setting in <code>dt-site.xml</code>. If it is not specified then the file specified in <code>dt.authentication.keytab</code> is copied into HDFS and used.</p>
+<pre><code class="xml">&lt;property&gt;
+        &lt;name&gt;dt.authentication.store.keytab&lt;/name&gt;
+        &lt;value&gt;hdfs-path-to-keytab-file&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<p>The expiry period of the Resource Manager and Name Node tokens needs to be known so that the application can renew them before they expire. These are automatically obtained using the <code>yarn.resourcemanager.delegation.token.max-lifetime</code> and <code>dfs.namenode.delegation.token.max-lifetime</code> properties from the Hadoop configuration files. Sometimes, however, these properties are not available or not kept up to date on the nodes running the applications. If that is the case, the following properties can be used to specify the expiry period; the values are in milliseconds. The example below shows how to specify these with values of 7 days.</p>
+<pre><code class="xml">&lt;property&gt;
+        &lt;name&gt;dt.resourcemanager.delegation.token.max-lifetime&lt;/name&gt;
+        &lt;value&gt;604800000&lt;/value&gt;
+&lt;/property&gt;
+
+&lt;property&gt;
+        &lt;name&gt;dt.namenode.delegation.token.max-lifetime&lt;/name&gt;
+        &lt;value&gt;604800000&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<p>As explained earlier, new tokens are obtained before the old ones expire. How early the new tokens are obtained before expiry is controlled by a setting. This setting is specified as a factor of the token expiration with a value between 0.0 and 1.0. The default value is <code>0.7</code>. This factor is multiplied with the expiration time to determine when to refresh the tokens. This setting can be changed by the user; the following example shows how this can be done:</p>
+<pre><code class="xml">&lt;property&gt;
+        &lt;name&gt;dt.authentication.token.refresh.factor&lt;/name&gt;
+        &lt;value&gt;0.7&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
+<h3 id="impersonation">Impersonation</h3>
+<p>The CLI program <code>apex</code> supports Hadoop proxy user impersonation, allowing applications to be launched and other operations to be performed as a user different from the one specified by the Kerberos credentials. The Kerberos credentials are still used for authentication. This is useful in scenarios where a system using <code>apex</code> has to support multiple users but only has a single set of Kerberos credentials, those of a system user.</p>
+<h4 id="usage">Usage</h4>
+<p>To use this feature, set the following environment variable to the name of the user being impersonated before running <code>apex</code>; the operations will then be performed as that user. For example, when launching an application, the application will run as the specified user and not as the user specified by the Kerberos credentials.</p>
+<pre><code>HADOOP_USER_NAME=&lt;username&gt;
+</code></pre>
+
+<h4 id="hadoop-configuration">Hadoop Configuration</h4>
+<p>For this feature to work, additional configuration settings are needed in Hadoop. These settings allow a specified user, such as a system user, to impersonate other users. The example snippet below shows these settings. In this example, the specified user can impersonate users belonging to any group and can do so from any host. Note that the user specified here is different from the user specified above in Usage: there it is the user being impersonated, while here it is the impersonating user, such as a system user.</p>
+<pre><code class="xml">&lt;property&gt;
+  &lt;name&gt;hadoop.proxyuser.&lt;username&gt;.groups&lt;/name&gt;
+  &lt;value&gt;*&lt;/value&gt;
+&lt;/property&gt;
+
+&lt;property&gt;
+  &lt;name&gt;hadoop.proxyuser.&lt;username&gt;.hosts&lt;/name&gt;
+  &lt;value&gt;*&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>
+
 <h2 id="security-architecture">Security architecture</h2>
 <p>In this section we will see how security works for applications built on Apex. We will look at the different methodologies involved in running the applications and in each case we will look into the different components that are involved. We will go into the architecture of these components and look at the different security mechanisms that are in play.</p>
 <h3 id="application-launch">Application Launch</h3>
@@ -272,8 +347,12 @@ home directory. The location of this file will be <code>$HOME/.dt/dt-site.xml</c
 <p>When operators are running, there will be differences in effective processing rates between them due to intrinsic reasons such as operator logic, or external reasons such as differing availability of CPU, memory, network bandwidth, etc., since the operators run in different containers. To maximize performance and utilization, the data flow is handled asynchronously to the regular operator function, and a buffer is used to temporarily store the data being produced by the operator. This buffered data is served by a buffer server over the network connection to the downstream streaming container containing the operator that is supposed to receive the data from this operator. This connection is secured by a token called the buffer server token. These tokens are also generated and seeded by STRAM when the streaming containers are deployed and started, and it uses different tokens for different buffer servers for better security.</p>
 <h5 id="namenode-delegation-token">NameNode Delegation Token</h5>
 <p>Like STRAM, streaming containers also need to communicate with NameNode to use HDFS persistence for reasons such as saving the state of the operators. In secure mode they also use NameNode delegation tokens for authentication. These tokens are also seeded by STRAM for the streaming containers.</p>
+<h4 id="stram-web-services">Stram Web Services</h4>
+<p>Clients connect to STRAM and make web service requests to obtain operational information about running applications. When security is enabled, we want this connection to also be authenticated. In this mode the client passes a web service token in the request and STRAM checks this token. If the token is valid, the request is processed; otherwise it is denied.</p>
+<p>How does the client get the web service token in the first place? The client has to first connect to STRAM via the Resource Manager Web Services Proxy, a service run by Hadoop to proxy requests to application web services. This connection is authenticated by the proxy service using a protocol called SPNEGO when secure mode is enabled. SPNEGO is Kerberos over HTTP, and the client also needs to support it. If the authentication is successful, the proxy forwards the request to STRAM. STRAM, in processing the request, generates and sends back a web service token similar to a delegation token. This token is then used by the client in subsequent requests it makes directly to STRAM, and STRAM is able to validate it since it generated the token in the first place.</p>
+<p><img alt="" src="../images/security/image03.png" /></p>
 <h2 id="conclusion">Conclusion</h2>
-<p>We looked at the different security requirements for distributed applications when they run in a secure Hadoop environment and looked at how Apex solves this.</p>
+<p>We looked at the different security configuration options that are available in Apex, examined in detail the security requirements for distributed applications in a secure Hadoop environment, and saw how the various security mechanisms in Apex address them.</p>
               
             </div>
           </div>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/sitemap.xml
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/sitemap.xml b/content/docs/apex-3.4/sitemap.xml
index 7af727b..ef8957a 100644
--- a/content/docs/apex-3.4/sitemap.xml
+++ b/content/docs/apex-3.4/sitemap.xml
@@ -4,7 +4,7 @@
     
     <url>
      <loc>/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -13,31 +13,37 @@
         
     <url>
      <loc>/apex_development_setup/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/application_development/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/application_packages/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/operator_development/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/autometrics/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
+     <changefreq>daily</changefreq>
+    </url>
+        
+    <url>
+     <loc>/development_best_practices/</loc>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
@@ -47,13 +53,13 @@
         
     <url>
      <loc>/apex_cli/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/security/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
@@ -62,7 +68,7 @@
     
     <url>
      <loc>/compatibility/</loc>
-     <lastmod>2016-05-13</lastmod>
+     <lastmod>2016-09-06</lastmod>
      <changefreq>daily</changefreq>
     </url>
     

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/malhar-contributing.html
----------------------------------------------------------------------
diff --git a/content/malhar-contributing.html b/content/malhar-contributing.html
index 5813f8c..ca4765c 100644
--- a/content/malhar-contributing.html
+++ b/content/malhar-contributing.html
@@ -101,7 +101,7 @@
 </ul>
 <h2 id="implementing-an-operator">Implementing an operator</h2>
 <ul>
-<li>Look at the <a href="/docs/apex/operator_development">Operator Development Guide</a> and the <a href="/docs/malhar/development_best_practices">Best Practices Guide</a> on how to implement an operator and what the dos and don&#39;ts are.</li>
+<li>Look at the <a href="/docs/apex/operator_development">Operator Development Guide</a> and the <a href="/docs/apex/development_best_practices">Best Practices Guide</a> on how to implement an operator and what the dos and don&#39;ts are.</li>
 <li>Refer to existing operator implementations when in doubt or unsure about how to implement some functionality. You can also email the <a href="/community.html#mailing-lists">dev mailing list</a> with any questions.</li>
 <li>Write unit tests for operators<ul>
 <li>Refer to unit tests for existing operators.</li>


[5/6] apex-site git commit: from c3a284ba04d860705af016afe3348f0e523f48c1

Posted by th...@apache.org.
http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/mkdocs/search_index.json
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/mkdocs/search_index.json b/content/docs/apex-3.4/mkdocs/search_index.json
index 3512a2f..611f195 100644
--- a/content/docs/apex-3.4/mkdocs/search_index.json
+++ b/content/docs/apex-3.4/mkdocs/search_index.json
@@ -12,7 +12,7 @@
         }, 
         {
             "location": "/apex_development_setup/", 
-            "text": "Apache Apex Development Environment Setup\n\n\nThis document discusses the steps needed for setting up a development environment for creating applications that run on the Apache Apex platform.\n\n\nDevelopment Tools\n\n\nThere are a few tools that will be helpful when developing Apache Apex applications, including:\n\n\n\n\n\n\ngit\n - A revision control system (version 1.7.1 or later). There are multiple git clients available for Windows (\nhttp://git-scm.com/download/win\n for example), so download and install a client of your choice.\n\n\n\n\n\n\njava JDK\n (not JRE) - Includes the Java Runtime Environment as well as the Java compiler and a variety of tools (version 1.7.0_79 or later). Can be downloaded from the Oracle website.\n\n\n\n\n\n\nmaven\n - Apache Maven is a build system for Java projects (version 3.0.5 or later). It can be downloaded from \nhttps://maven.apache.org/download.cgi\n.\n\n\n\n\n\n\nIDE\n (Optional) - If you prefer to use an IDE (Integra
 ted Development Environment) such as \nNetBeans\n, \nEclipse\n or \nIntelliJ\n, install that as well.\n\n\n\n\n\n\nAfter installing these tools, make sure that the directories containing the executable files are in your PATH environment variable.\n\n\n\n\nWindows\n - Open a console window and enter the command \necho %PATH%\n to see the value of the \nPATH\n variable and verify that the above directories for Java, git, and maven executables are present.  JDK executables like \njava\n and \njavac\n, the directory might be something like \nC:\\Program Files\\Java\\jdk1.7.0\\_80\\bin\n; for \ngit\n it might be \nC:\\Program Files\\Git\\bin\n; and for maven it might be \nC:\\Users\\user\\Software\\apache-maven-3.3.3\\bin\n.  If not, you can change its value clicking on the button at \nControl Panel\n \n \nAdvanced System Settings\n \n \nAdvanced tab\n \n \nEnvironment Variables\n.\n\n\nLinux and Mac\n - Open a console/terminal window and enter the command \necho $PATH\n to see the value
  of the \nPATH\n variable and verify that the above directories for Java, git, and maven executables are present.  If not, make sure software is downloaded and installed, and optionally PATH reference is added and exported  in a \n~/.profile\n or \n~/.bash_profile\n.  For example to add maven located in \n/sfw/maven/apache-maven-3.3.3\n to PATH add the line: \nexport PATH=$PATH:/sfw/maven/apache-maven-3.3.3/bin\n\n\n\n\nConfirm by running the following commands and comparing with output that show in the table below:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCommand\n\n\nOutput\n\n\n\n\n\n\njavac -version\n\n\njavac 1.7.0_80\n\n\n\n\n\n\njava -version\n\n\njava version \n1.7.0_80\n\n\nJava(TM) SE Runtime Environment (build 1.7.0_80-b15)\n\n\nJava HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)\n\n\n\n\n\n\ngit --version\n\n\ngit version 2.6.1.windows.1\n\n\n\n\n\n\nmvn --version\n\n\nApache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T06:57:37-05:00)\n\n\n...\n
 \n\n\n\n\n\n\n\n\n\n\nCreating New Apex Project\n\n\nAfter development tools are configured, you can now use the maven archetype to create a basic Apache Apex project.  \nNote:\n When executing the commands below, replace \n3.4.0\n by \nlatest available version\n of Apache Apex.\n\n\n\n\n\n\nWindows\n - Create a new Windows command file called \nnewapp.cmd\n by copying the lines below, and execute it.  When you run this file, the properties will be displayed and you will be prompted with \nY: :\n; just press \nEnter\n to complete the project generation.  The caret (^) at the end of some lines indicates that a continuation line follows. \n\n\n@echo off\n@rem Script for creating a new application\nsetlocal\nmvn archetype:generate ^\n -DarchetypeGroupId=org.apache.apex ^\n -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 ^\n -DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp ^\n -Dversion=1.0-SNAPSHOT\nendlocal\n\n\n\n\n\n\n\nLinux\n - Execute
  the lines below in a terminal window.  New project will be created in the curent working directory.  The backslash (\\) at the end of the lines indicates continuation.\n\n\nmvn archetype:generate \\\n -DarchetypeGroupId=org.apache.apex \\\n -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 \\\n -DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp \\\n -Dversion=1.0-SNAPSHOT\n\n\n\n\n\n\n\nWhen the run completes successfully, you should see a new directory named \nmyapexapp\n containing a maven project for building a basic Apache Apex application. It includes 3 source files:\nApplication.java\n,  \nRandomNumberGenerator.java\n and \nApplicationTest.java\n. You can now build the application by stepping into the new directory and running the maven package command:\n\n\ncd myapexapp\nmvn clean package -DskipTests\n\n\n\nThe build should create the application package file \nmyapexapp/target/myapexapp-1.0-SNAPSHOT.apa\n. This application package c
 an then be used to launch example application via \napex\n CLI, or other visual management tools.  When running, this application will generate a stream of random numbers and print them out, each prefixed by the string \nhello world:\n.\n\n\nRunning Unit Tests\n\n\nTo run unit tests on Linux or OSX, simply run the usual maven command, for example: \nmvn test\n.\n\n\nOn Windows, an additional file, \nwinutils.exe\n, is required; download it from\n\nhttps://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip\n\nand unpack the archive to, say, \nC:\\hadoop\n; this file should be present under\n\nhadoop-common-2.2.0-bin-master\\bin\n within it.\n\n\nSet the \nHADOOP_HOME\n environment variable system-wide to\n\nc:\\hadoop\\hadoop-common-2.2.0-bin-master\n as described at:\n\nhttps://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/sysdm_advancd_environmnt_addchange_variable.mspx?mfr=true\n. You should now be able to run unit tests normally.\n\n\nIf you 
 prefer not to set the variable globally, you can set it on the command line or within\nyour IDE. For example, on the command line, specify the maven\nproperty \nhadoop.home.dir\n:\n\n\nmvn -Dhadoop.home.dir=c:\\hadoop\\hadoop-common-2.2.0-bin-master test\n\n\n\nor set the environment variable separately:\n\n\nset HADOOP_HOME=c:\\hadoop\\hadoop-common-2.2.0-bin-master\nmvn test\n\n\n\nWithin your IDE, set the environment variable and then run the desired\nunit test in the usual way. For example, with NetBeans you can add:\n\n\nEnv.HADOOP_HOME=c:/hadoop/hadoop-common-2.2.0-bin-master\n\n\n\nat \nProperties \n Actions \n Run project \n Set Properties\n.\n\n\nSimilarly, in Eclipse (Mars) add it to the\nproject properties at \nProperties \n Run/Debug Settings \n ApplicationTest\n\n Environment\n tab.\n\n\nBuilding Apex Demos\n\n\nIf you want to see more substantial Apex demo applications and the associated source code, you can follow these simple steps to check out and build them.\n\n\n\
 n\n\n\nCheck out the source code repositories:\n\n\ngit clone https://github.com/apache/incubator-apex-core\ngit clone https://github.com/apache/incubator-apex-malhar\n\n\n\n\n\n\n\nSwitch to the appropriate release branch and build each repository:\n\n\ncd incubator-apex-core\nmvn clean install -DskipTests\n\ncd incubator-apex-malhar\nmvn clean install -DskipTests\n\n\n\n\n\n\n\nThe \ninstall\n argument to the \nmvn\n command installs resources from each project to your local maven repository (typically \n.m2/repository\n under your home directory), and \nnot\n to the system directories, so Administrator privileges are not required. The  \n-DskipTests\n argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.\n\n\nAfter the build completes, you should see the demo application package files in the target directory under each demo subdirect
 ory in \nincubator-apex-malhar/demos\n.\n\n\nSandbox\n\n\nTo jump start development with an Apache Hadoop single node cluster, \nDataTorrent Sandbox\n powered by VirtualBox is available on Windows, Linux, or Mac platforms.  The sandbox is configured by default to run with 6GB RAM; if your development machine has 16GB or more, you can increase the sandbox RAM to 8GB or more using the VirtualBox console.  This will yield better performance and support larger applications.  The advantage of developing in the sandbox is that most of the tools (e.g. \njdk\n, \ngit\n, \nmaven\n), Hadoop YARN and HDFS, and a distribution of Apache Apex and DataTorrent RTS are pre-installed.  The disadvantage is that the sandbox is a memory-limited environment, and requires settings changes and restarts to adjust memory available for development and testing.", 
+            "text": "Apache Apex Development Environment Setup\n\n\nThis document discusses the steps needed for setting up a development environment for creating applications that run on the Apache Apex platform.\n\n\nDevelopment Tools\n\n\nThere are a few tools that will be helpful when developing Apache Apex applications, including:\n\n\n\n\n\n\ngit\n - A revision control system (version 1.7.1 or later). There are multiple git clients available for Windows (\nhttp://git-scm.com/download/win\n for example), so download and install a client of your choice.\n\n\n\n\n\n\njava JDK\n (not JRE) - Includes the Java Runtime Environment as well as the Java compiler and a variety of tools (version 1.7.0_79 or later). Can be downloaded from the Oracle website.\n\n\n\n\n\n\nmaven\n - Apache Maven is a build system for Java projects (version 3.0.5 or later). It can be downloaded from \nhttps://maven.apache.org/download.cgi\n.\n\n\n\n\n\n\nIDE\n (Optional) - If you prefer to use an IDE (Integra
 ted Development Environment) such as \nNetBeans\n, \nEclipse\n or \nIntelliJ\n, install that as well.\n\n\n\n\n\n\nAfter installing these tools, make sure that the directories containing the executable files are in your PATH environment variable.\n\n\n\n\nWindows\n - Open a console window and enter the command \necho %PATH%\n to see the value of the \nPATH\n variable and verify that the above directories for Java, git, and maven executables are present.  JDK executables like \njava\n and \njavac\n, the directory might be something like \nC:\\Program Files\\Java\\jdk1.7.0\\_80\\bin\n; for \ngit\n it might be \nC:\\Program Files\\Git\\bin\n; and for maven it might be \nC:\\Users\\user\\Software\\apache-maven-3.3.3\\bin\n.  If not, you can change its value clicking on the button at \nControl Panel\n \n \nAdvanced System Settings\n \n \nAdvanced tab\n \n \nEnvironment Variables\n.\n\n\nLinux and Mac\n - Open a console/terminal window and enter the command \necho $PATH\n to see the value
  of the \nPATH\n variable and verify that the above directories for Java, git, and maven executables are present.  If not, make sure software is downloaded and installed, and optionally PATH reference is added and exported  in a \n~/.profile\n or \n~/.bash_profile\n.  For example to add maven located in \n/sfw/maven/apache-maven-3.3.3\n to PATH add the line: \nexport PATH=$PATH:/sfw/maven/apache-maven-3.3.3/bin\n\n\n\n\nConfirm by running the following commands and comparing with output that show in the table below:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCommand\n\n\nOutput\n\n\n\n\n\n\njavac -version\n\n\njavac 1.7.0_80\n\n\n\n\n\n\njava -version\n\n\njava version \n1.7.0_80\n\n\nJava(TM) SE Runtime Environment (build 1.7.0_80-b15)\n\n\nJava HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)\n\n\n\n\n\n\ngit --version\n\n\ngit version 2.6.1.windows.1\n\n\n\n\n\n\nmvn --version\n\n\nApache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T06:57:37-05:00)\n\n\n...\n
 \n\n\n\n\n\n\n\n\n\n\nCreating New Apex Project\n\n\nAfter development tools are configured, you can now use the maven archetype to create a basic Apache Apex project.  \nNote:\n When executing the commands below, replace \n3.4.0\n by \nlatest available version\n of Apache Apex.\n\n\n\n\n\n\nWindows\n - Create a new Windows command file called \nnewapp.cmd\n by copying the lines below, and execute it.  When you run this file, the properties will be displayed and you will be prompted with \nY: :\n; just press \nEnter\n to complete the project generation.  The caret (^) at the end of some lines indicates that a continuation line follows. \n\n\n@echo off\n@rem Script for creating a new application\nsetlocal\nmvn archetype:generate ^\n -DarchetypeGroupId=org.apache.apex ^\n -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 ^\n -DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp ^\n -Dversion=1.0-SNAPSHOT\nendlocal\n\n\n\n\n\n\n\nLinux\n - Execute
 the lines below in a terminal window.  A new project will be created in the current working directory.  The backslash (\\) at the end of the lines indicates continuation.\n\n\nmvn archetype:generate \\\n -DarchetypeGroupId=org.apache.apex \\\n -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 \\\n -DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp \\\n -Dversion=1.0-SNAPSHOT\n\n\n\n\n\n\n\nWhen the run completes successfully, you should see a new directory named \nmyapexapp\n containing a maven project for building a basic Apache Apex application. It includes 3 source files:\nApplication.java\n,  \nRandomNumberGenerator.java\n and \nApplicationTest.java\n. You can now build the application by stepping into the new directory and running the maven package command:\n\n\ncd myapexapp\nmvn clean package -DskipTests\n\n\n\nThe build should create the application package file \nmyapexapp/target/myapexapp-1.0-SNAPSHOT.apa\n. This application package c
 an then be used to launch example application via \napex\n CLI, or other visual management tools.  When running, this application will generate a stream of random numbers and print them out, each prefixed by the string \nhello world:\n.\n\n\nRunning Unit Tests\n\n\nTo run unit tests on Linux or OSX, simply run the usual maven command, for example: \nmvn test\n.\n\n\nOn Windows, an additional file, \nwinutils.exe\n, is required; download it from\n\nhttps://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip\n\nand unpack the archive to, say, \nC:\\hadoop\n; this file should be present under\n\nhadoop-common-2.2.0-bin-master\\bin\n within it.\n\n\nSet the \nHADOOP_HOME\n environment variable system-wide to\n\nc:\\hadoop\\hadoop-common-2.2.0-bin-master\n as described at:\n\nhttps://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/sysdm_advancd_environmnt_addchange_variable.mspx?mfr=true\n. You should now be able to run unit tests normally.\n\n\nIf you 
 prefer not to set the variable globally, you can set it on the command line or within\nyour IDE. For example, on the command line, specify the maven\nproperty \nhadoop.home.dir\n:\n\n\nmvn -Dhadoop.home.dir=c:\\hadoop\\hadoop-common-2.2.0-bin-master test\n\n\n\nor set the environment variable separately:\n\n\nset HADOOP_HOME=c:\\hadoop\\hadoop-common-2.2.0-bin-master\nmvn test\n\n\n\nWithin your IDE, set the environment variable and then run the desired\nunit test in the usual way. For example, with NetBeans you can add:\n\n\nEnv.HADOOP_HOME=c:/hadoop/hadoop-common-2.2.0-bin-master\n\n\n\nat \nProperties \n Actions \n Run project \n Set Properties\n.\n\n\nSimilarly, in Eclipse (Mars) add it to the\nproject properties at \nProperties \n Run/Debug Settings \n ApplicationTest\n\n Environment\n tab.\n\n\nBuilding Apex Demos\n\n\nIf you want to see more substantial Apex demo applications and the associated source code, you can follow these simple steps to check out and build them.\n\n\n\
 n\n\n\nCheck out the source code repositories:\n\n\ngit clone https://github.com/apache/apex-core\ngit clone https://github.com/apache/apex-malhar\n\n\n\n\n\n\n\nSwitch to the appropriate release branch and build each repository:\n\n\ncd apex-core\nmvn clean install -DskipTests\n\ncd apex-malhar\nmvn clean install -DskipTests\n\n\n\n\n\n\n\nThe \ninstall\n argument to the \nmvn\n command installs resources from each project to your local maven repository (typically \n.m2/repository\n under your home directory), and \nnot\n to the system directories, so Administrator privileges are not required. The  \n-DskipTests\n argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.\n\n\nAfter the build completes, you should see the demo application package files in the target directory under each demo subdirectory in \napex-malhar/demos\n.\n\n\nSandb
 ox\n\n\nTo jump start development with an Apache Hadoop single node cluster, \nDataTorrent Sandbox\n powered by VirtualBox is available on Windows, Linux, or Mac platforms.  The sandbox is configured by default to run with 6GB RAM; if your development machine has 16GB or more, you can increase the sandbox RAM to 8GB or more using the VirtualBox console.  This will yield better performance and support larger applications.  The advantage of developing in the sandbox is that most of the tools (e.g. \njdk\n, \ngit\n, \nmaven\n), Hadoop YARN and HDFS, and a distribution of Apache Apex and DataTorrent RTS are pre-installed.  The disadvantage is that the sandbox is a memory-limited environment, and requires settings changes and restarts to adjust memory available for development and testing.", 
             "title": "Development Setup"
         }, 
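For orientation, the project generated by the archetype wires up a very small DAG: the random number source feeds a console writer. A minimal sketch of the generated Application class is shown below; the application name, operator names and the RandomNumberGenerator port name follow the archetype sources and should be treated as illustrative rather than authoritative.

    package com.example.myapexapp;

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.api.annotation.ApplicationAnnotation;
    import com.datatorrent.lib.io.ConsoleOutputOperator;

    // Sketch of the archetype-generated application: a random number source
    // streamed into a console output operator.
    @ApplicationAnnotation(name = "MyFirstApplication")
    public class Application implements StreamingApplication
    {
      @Override
      public void populateDAG(DAG dag, Configuration conf)
      {
        RandomNumberGenerator randomGenerator = dag.addOperator("randomGenerator", RandomNumberGenerator.class);
        ConsoleOutputOperator console = dag.addOperator("console", ConsoleOutputOperator.class);
        dag.addStream("randomData", randomGenerator.out, console.input);
      }
    }

ApplicationTest.java exercises the same DAG in-process (typically through LocalMode) when the unit tests described in this section are run.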
         {
@@ -37,7 +37,7 @@
         }, 
         {
             "location": "/apex_development_setup/#building-apex-demos", 
-            "text": "If you want to see more substantial Apex demo applications and the associated source code, you can follow these simple steps to check out and build them.    Check out the source code repositories:  git clone https://github.com/apache/incubator-apex-core\ngit clone https://github.com/apache/incubator-apex-malhar    Switch to the appropriate release branch and build each repository:  cd incubator-apex-core\nmvn clean install -DskipTests\n\ncd incubator-apex-malhar\nmvn clean install -DskipTests    The  install  argument to the  mvn  command installs resources from each project to your local maven repository (typically  .m2/repository  under your home directory), and  not  to the system directories, so Administrator privileges are not required. The   -DskipTests  argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.  A
 fter the build completes, you should see the demo application package files in the target directory under each demo subdirectory in  incubator-apex-malhar/demos .", 
+            "text": "If you want to see more substantial Apex demo applications and the associated source code, you can follow these simple steps to check out and build them.    Check out the source code repositories:  git clone https://github.com/apache/apex-core\ngit clone https://github.com/apache/apex-malhar    Switch to the appropriate release branch and build each repository:  cd apex-core\nmvn clean install -DskipTests\n\ncd apex-malhar\nmvn clean install -DskipTests    The  install  argument to the  mvn  command installs resources from each project to your local maven repository (typically  .m2/repository  under your home directory), and  not  to the system directories, so Administrator privileges are not required. The   -DskipTests  argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.  After the build completes, you should see
  the demo application package files in the target directory under each demo subdirectory in  apex-malhar/demos .", 
             "title": "Building Apex Demos"
         }, 
         {
@@ -821,6 +821,71 @@
             "title": "System Metrics"
         }, 
         {
+            "location": "/development_best_practices/", 
+            "text": "Development Best Practices\n\n\nThis document describes the best practices to follow when developing operators and other application components such as partitoners, stream codecs etc on the Apache Apex platform.\n\n\nOperators\n\n\nThese are general guidelines for all operators that are covered in the current section. The subsequent sections talk about special considerations for input and output operators.\n\n\n\n\nWhen writing a new operator to be used in an application, consider breaking it down into\n\n\nAn abstract operator that encompasses the core functionality but leaves application specific schemas and logic to the implementation.\n\n\nAn optional concrete operator also in the library that extends the abstract operator and provides commonly used schema types such as strings, byte[] or POJOs.\n\n\n\n\n\n\nFollow these conventions for the life cycle methods:\n\n\nDo one time initialization of entities that apply for the entire lifetime of the operator in t
 he \nsetup\n method, e.g., factory initializations. Initializations in \nsetup\n are done in the container where the operator is deployed. Allocating memory for fields in the constructor is not efficient as it would lead to extra garbage in memory for the following reason. The operator is instantiated on the client from where the application is launched, serialized and started on one of the Hadoop nodes in a container. So the constructor is first called on the client and if it were to initialize any of the fields, that state would be saved during serialization. In the Hadoop container the operator is deserialized and started. This would invoke the constructor again, which will initialize the fields but their state will get overwritten by the serialized state and the initial values would become garbage in memory.\n\n\nDo one time initialization for live entities in the \nactivate\n method, e.g., opening connections to a database server or starting a thread for asynchronous operations. The \
 nactivate\n method is called right before processing starts so it is a better place for these initializations than at \nsetup\n which can lead to a delay before processing data from the live entity.  \n\n\nPerform periodic tasks based on processing time in application window boundaries.\n\n\nPerform initializations needed for each application window in \nbeginWindow\n.\n\n\nPerform aggregations needed for each application window  in \nendWindow\n.\n\n\nTeardown of live entities (inverse of tasks performed during activate) should be in the \ndeactivate\n method.\n\n\nTeardown of lifetime entities (those initialized in setup method) should happen in the \nteardown\n method.\n\n\nIf the operator implementation is not finalized mark it with the \n@Evolving\n annotation.\n\n\n\n\n\n\nIf the operator needs to perform operations based on event time of the individual tuples and not the processing time, extend and use the \nWindowedOperator\n. Refer to documentation of that operator for deta
 ils on how to use it.\n\n\nIf an operator needs to do some work when it is not receiving any input, it should implement \nIdleTimeHandler\n interface. This interface contains \nhandleIdleTime\n method which will be called whenever the platform isn\u2019t doing anything else and the operator can do the work in this method. If for any reason the operator does not have any work to do when this method is called, it should sleep for a small amount of time such as that specified by the \nSPIN_MILLIS\n attribute so that it does not cause a busy wait when called repeatedly by the platform. Also, the method should not block and return in a reasonable amount of time that is less than the streaming window size (which is 500ms by default).\n\n\nOften operators have customizable parameters such as information about locations of external systems or parameters that modify the behavior of the operator. Users should be able to specify these easily without having to change source code. This can be do
 ne by making them properties of the operator because they can then be initialized from external properties files.\n\n\nWhere possible default values should be provided for the properties in the source code.\n\n\nValidation rules should be specified for the properties using javax constraint validations that check whether the values specified for the properties are in the correct format, range or other operator requirements. Required properties should have at least a \n@NotNull\n validation specifying that they have to be specified by the user.\n\n\n\n\n\n\n\n\nCheckpointing\n\n\nCheckpointing is a process of snapshotting the state of an operator and saving it so that in case of failure the state can be used to restore the operator to a prior state and continue processing. It is automatically performed by the platform at a configurable interval. All operators in the application are checkpointed in a distributed fashion, thus allowing the entire state of the application to be saved and
  available for recovery if needed. Here are some things to remember when it comes to checkpointing:\n\n\n\n\nThe process of checkpointing involves snapshotting the state by serializing the operator and saving it to a store. This is done using a \nStorageAgent\n. By default a \nStorageAgent\n is already provided by the platform and it is called \nAsyncFSStorageAgent\n. It serializes the operator using Kryo and saves the serialized state asynchronously to a filesystem such as HDFS. There are other implementations of \nStorageAgent\n available such as \nGeodeKeyValueStorageAgent\n that stores the serialized state in Geode which is an in-memory replicated data grid.\n\n\nAll variables in the operator marked neither transient nor final are saved so any variables in the operator that are not part of the state should be marked transient. Specifically any variables like connection objects, i/o streams, ports are transient, because they need to be setup again on failure recovery.\n\n\nIf the
  operator does not keep any state between windows, mark it with the \n@Stateless\n annotation. This results in efficiencies during checkpointing and recovery. The operator will not be checkpointed and is always restored to the initial state\n\n\nThe checkpoint interval can be set using the \nCHECKPOINT_WINDOW_COUNT\n attribute which specifies the interval in terms of number of streaming windows.\n\n\nIf the correct functioning of the operator requires the \nendWindow\n method be called before checkpointing can happen, then the checkpoint interval should align with application window interval i.e., it should be a multiple of application window interval. In this case the operator should be marked with \nOperatorAnnotation\n and \ncheckpointableWithinAppWindow\n set to false. If the window intervals are configured by the user and they don\u2019t align, it will result in a DAG validation error and application won\u2019t launch.\n\n\nIn some cases the operator state related to a piece of
  data needs to be purged once that data is no longer required by the application, otherwise the state will continue to build up indefinitely. The platform provides a way to let the operator know about this using a callback listener called \nCheckpointNotificationListener\n. This listener has a callback method called \ncommitted\n, which is called by the platform from time to time with a window id that has been processed successfully by all the operators in the DAG and hence is no longer needed. The operator can delete all the state corresponding to window ids less than or equal to the provided window id.\n\n\nSometimes operators need to perform some tasks just before checkpointing. For example, filesystem operators may want to flush the files just before checkpoint so they can be sure that all pending data is written to disk and no data is lost if there is an operator failure just after the checkpoint and the operator restarts from the checkpoint. To do this the operator would imple
 ment the same \nCheckpointNotificationListener\n interface and implement the \nbeforeCheckpoint\n method where it can do these tasks.\n\n\nIf the operator is going to have a large state, checkpointing the entire state each time becomes unviable. Furthermore, the amount of memory needed to hold the state could be larger than the amount of physical memory available. In these cases the operator should checkpoint the state incrementally and also manage the memory for the state more efficiently. The platform provides a utility called \nManagedState\n that uses a combination of in-memory and disk cache to efficiently store and retrieve data in a performant, fault tolerant way and also checkpoint it in an incremental fashion. There are operators in the platform that use \nManagedState\n and can be used as a reference on how to use this utility such as Dedup or Join operators.\n\n\nInput Operators\n\n\nInput operators have additional requirements:\n\n\n\n\nThe \nemitTuples\n method impl
 emented by the operator, is called by the platform, to give the operator an opportunity to emit some data. This method is always called within a window boundary but can be called multiple times within the same window. There are some important guidelines on how to implement this method:\n\n\nThis should not be a blocking method and should return in a reasonable time that is less than the streaming window size (which is 500ms by default). This also applies to other callback methods called by the platform such as \nbeginWindow\n, \nendWindow\n etc., but is more important here since this method will be called continuously by the platform.\n\n\nIf the operator needs to interact with external systems to obtain data and this can potentially take a long time, then this should be performed asynchronously in a different thread. Refer to the threading section below for the guidelines when using threading.\n\n\nIn each invocation, the method can emit any number of data tuples.\n\n\n\n\n\n\n\n\n
 Idempotence\n\n\nMany applications write data to external systems using output operators. To ensure that data is present exactly once in the external system even in a failure recovery scenario, the output operators expect the replayed windows during recovery contain the same data as before the failure. This is called idempotency. Since operators within the DAG are merely responding to input data provided to them by the upstream operators and the input operator has no upstream operator, the responsibility of idempotent replay falls on the input operators.\n\n\n\n\nFor idempotent replay of data, the operator needs to store some meta-information for every window that would allow it to identify what data was sent in that window. This is called the idempotent state.\n\n\nIf the external source of the input operator allows replayability, this could be information such as offset of last piece of data in the window, an identifier of the last piece of data itself or number of data tuples sen
 t.\n\n\nHowever if the external source does not allow replayability from an operator specified point, then the entire data sent within the window may need to be persisted by the operator.\n\n\n\n\nThe platform provides a utility called \nWindowDataManager\n to allow operators to save and retrieve idempotent state every window. Operators should use this to implement idempotency.\n\n\nOutput Operators\n\n\nOutput operators typically connect to external storage systems such as filesystems, databases or key value stores to store data.\n\n\n\n\nIn some situations, the external systems may not be functioning in a reliable fashion. They may have prolonged outages or performance problems. If the operator is being designed to work in such environments, it needs to be able to handle these problems gracefully and not block the DAG or fail. In these scenarios the operator should cache the data into a local store such as HDFS and interact with external systems in a separate threa
 d so as to not have problems in the operator lifecycle thread. This pattern is called the \nReconciler\n pattern and there are operators that implement this pattern available in the library for reference.\n\n\n\n\nEnd-to-End Exactly Once\n\n\nWhen output operators store data in external systems, it is important that they do not lose data or write duplicate data when there is a failure event and the DAG recovers from that failure. In failure recovery, the windows from the previous checkpoint are replayed and the operator receives this data again. The operator should ensure that it does not write this data again. Operator developers should figure out how to do this specifically for the operators they are developing depending on the logic of the operators. Below are examples of how a couple of existing output operators do this for reference.\n\n\n\n\nFile output operator that writes data to files keeps track of the file lengths in the state. These lengths are checkpointed and restored 
 on failure recovery. On restart, the operator truncates the file to the length equal to the length in the recovered state. This makes the data in the file same as it was at the time of checkpoint before the failure. The operator now writes the replayed data from the checkpoint in regular fashion as any other data. This ensures no data is lost or duplicated in the file.\n\n\nThe JDBC output operator that writes data to a database table writes the data in a window in a single transaction. It also writes the current window id into a meta table along with the data as part of the same transaction. It commits the transaction at the end of the window. When there is an operator failure before the final commit, the state of the database is that it contains the data from the previous fully processed window and its window id since the current window transaction isn\u2019t yet committed. On recovery, the operator reads this window id back from the meta table. It ignores all the replayed windows
  whose window id is less than or equal to the recovered window id and thus ensures that it does not duplicate data already present in the database. It starts writing data normally again when window id of data becomes greater than recovered window thus ensuring no data is lost.\n\n\n\n\nPartitioning\n\n\nPartitioning allows an operation to be scaled to handle more pieces of data than before but with a similar SLA. This is done by creating multiple instances of an operator and distributing the data among them. Input operators can also be partitioned to stream more pieces of data into the application. The platform provides a lot of flexibility and options for partitioning. Partitioning can happen once at startup or can be dynamically changed anytime while the application is running, and it can be done in a stateless or stateful way by distributing state from the old partitions to new partitions.\n\n\nIn the platform, the responsibility for partitioning is shared among different entitie
 s. These are:\n\n\n\n\nA \npartitioner\n that specifies \nhow\n to partition the operator, specifically it takes an old set of partitions and creates a new set of partitions. At the start of the application the old set has one partition and the partitioner can return more than one partitions to start the application with multiple partitions. The partitioner can have any custom JAVA logic to determine the number of new partitions, set their initial state as a brand new state or derive it from the state of the old partitions. It also specifies how the data gets distributed among the new partitions. The new set doesn't have to contain only new partitions, it can carry over some old partitions if desired.\n\n\nAn optional \nstatistics (stats) listener\n that specifies \nwhen\n to partition. The reason it is optional is that it is needed only when dynamic partitioning is needed. With the stats listener, the stats can be used to determine when to partition.\n\n\nIn some cases the \noperat
 or\n itself should be aware of partitioning and would need to provide supporting code.\n\n\nIn case of input operators each partition should have a property or a set of properties that allow it to distinguish itself from the other partitions and fetch unique data.\n\n\n\n\n\n\nWhen an operator that was originally a single instance is split into multiple partitions with each partition working on a subset of data, the results of the partitions may need to be combined together to compute the final result. The combining logic would depend on the logic of the operator. This would be specified by the developer using a \nUnifier\n, which is deployed as another operator by the platform. If no \nUnifier\n is specified, the platform inserts a \ndefault unifier\n that merges the results of the multiple partition streams into a single stream. Each output port can have a different \nUnifier\n and this is specified by returning the corresponding \nUnifier\n in the \ngetUnifier\n method of the out
 put port. The operator developer should provide a custom \nUnifier\n wherever applicable.\n\n\nThe Apex \nengine\n that brings everything together and effects the partitioning.\n\n\n\n\nSince partitioning is critical for scalability of applications, operators must support it. There should be a strong reason for an operator to not support partitioning, such as, the logic performed by the operator not lending itself to parallelism. In order to support partitioning, an operator developer, apart from developing the functionality of the operator, may also need to provide a partitioner, stats listener and supporting code in the operator as described in the steps above. The next sections delve into this. \n\n\nOut of the box partitioning\n\n\nThe platform comes with some built-in partitioning utilities that can be used in certain scenarios.\n\n\n\n\n\n\nStatelessPartitioner\n provides a default partitioner, that can be used for an operator in certain conditions. If the operator satisfies t
 hese conditions, the partitioner can be specified for the operator with a simple setting and no other partitioning code is needed. The conditions are:\n\n\n\n\nNo dynamic partitioning is needed, see next point about dynamic partitioning. \n\n\nThere is no distinct initial state for the partitions, i.e., all partitions start with the same initial state submitted during application launch.\n\n\n\n\nTypically input or output operators do not fall into this category, although there are some exceptions. This partitioner is mainly used with operators that are in the middle of the DAG, after the input and before the output operators. When used with non-input operators, only the data for the first declared input port is distributed among the different partitions. All other input ports are treated as broadcast and all partitions receive all the data for that port.\n\n\n\n\n\n\nStatelessThroughputBasedPartitioner\n in Malhar provides a dynamic partitioner based on throughput thresholds. Simil
 arly \nStatelessLatencyBasedPartitioner\n provides a latency based dynamic partitioner in RTS. If these partitioners can be used, then separate partitioning related code is not needed. The conditions under which these can be used are:\n\n\n\n\nThere is no distinct initial state for the partitions.\n\n\nThere is no state being carried over by the operator from one window to the next i.e., operator is stateless.\n\n\n\n\n\n\n\n\nCustom partitioning\n\n\nIn many cases, operators don\u2019t satisfy the above conditions and a built-in partitioner cannot be used. Custom partitioning code needs to be written by the operator developer. Below are guidelines for it.\n\n\n\n\nSince the operator developer is providing a \npartitioner\n for the operator, the partitioning code should be added to the operator itself by making the operator implement the Partitioner interface and implementing the required methods, rather than creating a separate partitioner. The advantage is the user of the operator
  does not have to explicitly figure out the partitioner and set it for the operator but still has the option to override this built-in partitioner with a different one.\n\n\nThe \npartitioner\n is responsible for setting the initial state of the new partitions, whether it is at the start of the application or when partitioning is happening while the application is running as in the dynamic partitioning case. In the dynamic partitioning scenario, the partitioner needs to take the state from the old partitions and distribute it among the new partitions. It is important to note that apart from the checkpointed state the partitioner also needs to distribute idempotent state.\n\n\nThe \npartitioner\n interface has two methods, \ndefinePartitions\n and \npartitioned\n. The method \ndefinePartitions\n is first called to determine the new partitions, and if enough resources are available on the cluster, the \npartitioned\n method is called passing in the new partitions. This happens both dur
 ing initial partitioning and dynamic partitioning. If resources are not available, partitioning is abandoned and existing partitions continue to run untouched. This means that any processing intensive operations should be deferred to the \npartitioned\n call instead of doing them in \ndefinePartitions\n, as they may not be needed if there are not enough resources available in the cluster.\n\n\nThe \npartitioner\n, along with creating the new partitions, should also specify how the data gets distributed across the new partitions. It should do this by specifying a mapping called \nPartitionKeys\n for each partition that maps the data to that partition. This mapping needs to be specified for every input port in the operator. If the \npartitioner\n wants to use the standard mapping it can use a utility method called \nDefaultPartition.assignPartitionKeys\n.\n\n\nWhen the partitioner is scaling the operator up to more partitions, try to reuse the existing partitions and create new partit
 ions to augment the current set. The reuse can be achieved by the partitioner returning the current partitions unchanged. This will result in the current partitions continuing to run untouched.\n\n\nIn case of dynamic partitioning, as mentioned earlier, a stats listener is also needed to determine when to re-partition. Like the \nPartitioner\n interface, the operator can also implement the \nStatsListener\n interface to provide a stats listener implementation that will be automatically used.\n\n\nThe \nStatsListener\n has access to all operator statistics to make its decision on partitioning. Apart from the statistics that the platform computes for the operators such as throughput, latency etc, operator developers can include their own business metrics by using the AutoMetric feature.\n\n\nIf the operator is not partitionable, mark it so with \nOperatorAnnotation\n and \npartitionable\n element set to false.\n\n\n\n\nStreamCodecs\n\n\nA \nStreamCodec\n is used in partitioning to dis
 tribute the data tuples among the partitions. The \nStreamCodec\n computes an integer hashcode for a data tuple and this is used along with \nPartitionKeys\n mapping to determine which partition or partitions receive the data tuple. If a \nStreamCodec\n is not specified, then a default one is used by the platform which returns the JAVA hashcode of the tuple. \n\n\nStreamCodec\n is also useful in another aspect of the application. It is used to serialize and deserialize the tuple to transfer it between operators. The default \nStreamCodec\n uses Kryo library for serialization. \n\n\nThe following guidelines are useful when considering a custom \nStreamCodec\n\n\n\n\nA custom \nStreamCodec\n is needed if the tuples need to be distributed based on a criteria different from the hashcode of the tuple. If the correct working of an operator depends on the data from the upstream operator being distributed using a custom criteria such as being sticky on a \u201ckey\u201d field within the tup
 le, then a custom \nStreamCodec\n should be provided by the operator developer. This codec can implement the custom criteria. The operator should also return this custom codec in the \ngetStreamCodec\n method of the input port.\n\n\nWhen implementing a custom \nStreamCodec\n for the purpose of using a different criteria to distribute the tuples, the codec can extend an existing \nStreamCodec\n and implement the hashcode method, so that the codec does not have to worry about the serialization and deserialization functionality. The Apex platform provides two pre-built \nStreamCodec\n implementations for this purpose, one is \nKryoSerializableStreamCodec\n that uses Kryo for serialization and another one \nJavaSerializationStreamCodec\n that uses JAVA serialization.\n\n\nDifferent \nStreamCodec\n implementations can be used for different inputs in a stream with multiple inputs when different criteria of distributing the tuples is desired between the multiple inputs. \n\n\n\n\nThreads\n
 \n\nThe operator lifecycle methods such as \nsetup\n, \nbeginWindow\n, \nendWindow\n, \nprocess\n in \nInputPorts\n are all called from a single operator lifecycle thread, by the platform, unbeknownst to the user. So the user does not have to worry about dealing with the issues arising from multi-threaded code. Use of separate threads in an operator is discouraged because in most cases the motivation for this is parallelism, but parallelism can already be achieved by using multiple partitions and furthermore mistakes can be made easily when writing multi-threaded code. When dealing with high volume and velocity data, the corner cases with incorrectly written multi-threaded code are encountered more easily and exposed. However, there are times when separate threads are needed, for example, when interacting with external systems the delay in retrieving or sending data can be large at times, blocking the operator and other DAG processing such as committed windows. In these cases the fo
 llowing guidelines must be followed strictly.\n\n\n\n\nThreads should be started in \nactivate\n and stopped in \ndeactivate\n. In \ndeactivate\n the operator should wait till any threads it launched, have finished execution. It can do so by calling \njoin\n on the threads or if using \nExecutorService\n, calling \nawaitTermination\n on the service.\n\n\nThreads should not call any methods on the ports directly as this can cause concurrency exceptions and also result in invalid states.\n\n\nThreads can share state with the lifecycle methods using data structures that are either explicitly protected by synchronization or are inherently thread safe such as thread safe queues.\n\n\nIf this shared state needs to be protected against failure then it needs to be persisted during checkpoint. To have a consistent checkpoint, the state should not be modified by the thread when it is being serialized and saved by the operator lifecycle thread during checkpoint. Since the checkpoint process ha
 ppens outside the window boundary the thread should be quiesced between \nendWindow\n and \nbeginWindow\n or more efficiently between pre-checkpoint and checkpointed callbacks.", 
+            "title": "Best Practices"
+        }, 
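As a concrete illustration of the StreamCodec guideline in the entry above, a sticky-by-key codec can extend the Kryo-based codec and override only the partitioning criterion. The record layout assumed here ("key,value" strings) and the class name are illustrative.

    import com.datatorrent.lib.codec.KryoSerializableStreamCodec;

    // Distributes tuples to partitions by a key extracted from the tuple instead of the
    // whole tuple's hashcode; serialization is inherited from the Kryo-based base codec.
    public class KeyStickyStreamCodec extends KryoSerializableStreamCodec<String>
    {
      @Override
      public int getPartition(String tuple)
      {
        // Assume "key,value" records; all records with the same key reach the same partition.
        return tuple.split(",", 2)[0].hashCode();
      }
    }

The input port of the operator that needs sticky distribution would then return an instance of this codec from its getStreamCodec() method, as described above.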
+        {
+            "location": "/development_best_practices/#development-best-practices", 
+            "text": "This document describes the best practices to follow when developing operators and other application components such as partitoners, stream codecs etc on the Apache Apex platform.", 
+            "title": "Development Best Practices"
+        }, 
+        {
+            "location": "/development_best_practices/#operators", 
+            "text": "These are general guidelines for all operators that are covered in the current section. The subsequent sections talk about special considerations for input and output operators.   When writing a new operator to be used in an application, consider breaking it down into  An abstract operator that encompasses the core functionality but leaves application specific schemas and logic to the implementation.  An optional concrete operator also in the library that extends the abstract operator and provides commonly used schema types such as strings, byte[] or POJOs.    Follow these conventions for the life cycle methods:  Do one time initialization of entities that apply for the entire lifetime of the operator in the  setup  method, e.g., factory initializations. Initializations in  setup  are done in the container where the operator is deployed. Allocating memory for fields in the constructor is not efficient as it would lead to extra garbage in memory for the following
  reason. The operator is instantiated on the client from where the application is launched, serialized and started on one of the Hadoop nodes in a container. So the constructor is first called on the client and if it were to initialize any of the fields, that state would be saved during serialization. In the Hadoop container the operator is deserialized and started. This would invoke the constructor again, which will initialize the fields but their state will get overwritten by the serialized state and the initial values would become garbage in memory.  Do one time initialization for live entities in the  activate  method, e.g., opening connections to a database server or starting a thread for asynchronous operations. The  activate  method is called right before processing starts so it is a better place for these initializations than at  setup  which can lead to a delay before processing data from the live entity.    Perform periodic tasks based on processing time in application window bou
 ndaries.  Perform initializations needed for each application window in  beginWindow .  Perform aggregations needed for each application window  in  endWindow .  Teardown of live entities (inverse of tasks performed during activate) should be in the  deactivate  method.  Teardown of lifetime entities (those initialized in setup method) should happen in the  teardown  method.  If the operator implementation is not finalized mark it with the  @Evolving  annotation.    If the operator needs to perform operations based on event time of the individual tuples and not the processing time, extend and use the  WindowedOperator . Refer to documentation of that operator for details on how to use it.  If an operator needs to do some work when it is not receiving any input, it should implement  IdleTimeHandler  interface. This interface contains  handleIdleTime  method which will be called whenever the platform isn\u2019t doing anything else and the operator can do the work in this method. If fo
 r any reason the operator does not have any work to do when this method is called, it should sleep for a small amount of time such as that specified by the  SPIN_MILLIS  attribute so that it does not cause a busy wait when called repeatedly by the platform. Also, the method should not block and return in a reasonable amount of time that is less than the streaming window size (which is 500ms by default).  Often operators have customizable parameters such as information about locations of external systems or parameters that modify the behavior of the operator. Users should be able to specify these easily without having to change source code. This can be done by making them properties of the operator because they can then be initialized from external properties files.  Where possible default values should be provided for the properties in the source code.  Validation rules should be specified for the properties using javax constraint validations that check whether the values specified 
 for the properties are in the correct format, range or other operator requirements. Required properties should have at least a  @NotNull  validation specifying that they have to be specified by the user.", 
+            "title": "Operators"
+        }, 
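A skeleton operator following the conventions above might be structured as follows; the class, property and abstract method names are illustrative stand-ins for the application-specific pieces a concrete subclass would supply.

    import javax.validation.constraints.NotNull;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.common.util.BaseOperator;

    // Abstract operator holding the core logic; a concrete subclass supplies the
    // application-specific enrichment and client creation.
    public abstract class AbstractEnrichmentOperator<T> extends BaseOperator
    {
      // Configurable property with a default value and a javax validation constraint.
      @NotNull
      private String serviceUrl = "http://localhost:8080";

      private long tuplesInWindow;              // non-transient: part of the checkpointed state
      private transient Object serviceClient;   // transient: rebuilt in setup() after recovery

      public final transient DefaultOutputPort<T> output = new DefaultOutputPort<T>();

      public final transient DefaultInputPort<T> input = new DefaultInputPort<T>()
      {
        @Override
        public void process(T tuple)
        {
          tuplesInWindow++;
          output.emit(enrich(tuple));
        }
      };

      @Override
      public void setup(OperatorContext context)
      {
        // One-time, lifetime initialization; runs in the container where the operator is deployed.
        serviceClient = createClient(serviceUrl);
      }

      @Override
      public void beginWindow(long windowId)
      {
        tuplesInWindow = 0;                     // per-window initialization
      }

      @Override
      public void endWindow()
      {
        // Per-window aggregation or bookkeeping goes here.
      }

      @Override
      public void teardown()
      {
        serviceClient = null;                   // release lifetime resources
      }

      // Application-specific pieces left to the concrete operator.
      protected abstract T enrich(T tuple);

      protected abstract Object createClient(String url);

      public String getServiceUrl()
      {
        return serviceUrl;
      }

      public void setServiceUrl(String serviceUrl)
      {
        this.serviceUrl = serviceUrl;
      }
    }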
+        {
+            "location": "/development_best_practices/#checkpointing", 
+            "text": "Checkpointing is a process of snapshotting the state of an operator and saving it so that in case of failure the state can be used to restore the operator to a prior state and continue processing. It is automatically performed by the platform at a configurable interval. All operators in the application are checkpointed in a distributed fashion, thus allowing the entire state of the application to be saved and available for recovery if needed. Here are some things to remember when it comes to checkpointing:   The process of checkpointing involves snapshotting the state by serializing the operator and saving it to a store. This is done using a  StorageAgent . By default a  StorageAgent  is already provided by the platform and it is called  AsyncFSStorageAgent . It serializes the operator using Kryo and saves the serialized state asynchronously to a filesystem such as HDFS. There are other implementations of  StorageAgent  available such as  GeodeKeyValueStorageAge
 nt  that stores the serialized state in Geode which is an in-memory replicated data grid.  All variables in the operator marked neither transient nor final are saved so any variables in the operator that are not part of the state should be marked transient. Specifically any variables like connection objects, i/o streams, ports are transient, because they need to be setup again on failure recovery.  If the operator does not keep any state between windows, mark it with the  @Stateless  annotation. This results in efficiencies during checkpointing and recovery. The operator will not be checkpointed and is always restored to the initial state  The checkpoint interval can be set using the  CHECKPOINT_WINDOW_COUNT  attribute which specifies the interval in terms of number of streaming windows.  If the correct functioning of the operator requires the  endWindow  method be called before checkpointing can happen, then the checkpoint interval should align with application window interval i.e.
 , it should be a multiple of application window interval. In this case the operator should be marked with  OperatorAnnotation  and  checkpointableWithinAppWindow  set to false. If the window intervals are configured by the user and they don\u2019t align, it will result in a DAG validation error and application won\u2019t launch.  In some cases the operator state related to a piece of data needs to be purged once that data is no longer required by the application, otherwise the state will continue to build up indefinitely. The platform provides a way to let the operator know about this using a callback listener called  CheckpointNotificationListener . This listener has a callback method called  committed , which is called by the platform from time to time with a window id that has been processed successfully by all the operators in the DAG and hence is no longer needed. The operator can delete all the state corresponding to window ids less than or equal to the provided window id.  So
  metimes operators need to perform some tasks just before checkpointing. For example, filesystem operators may want to flush the files just before checkpoint so they can be sure that all pending data is written to disk and no data is lost if there is an operator failure just after the checkpoint and the operator restarts from the checkpoint. To do this the operator would implement the same  CheckpointNotificationListener  interface and implement the  beforeCheckpoint  method where it can do these tasks.  If the operator is going to have a large state, checkpointing the entire state each time becomes unviable. Furthermore, the amount of memory needed to hold the state could be larger than the amount of physical memory available. In these cases the operator should checkpoint the state incrementally and also manage the memory for the state more efficiently. The platform provides a utility called  ManagedState  that uses a combination of in-memory and disk cache to efficiently store and 
 retrieve data in a performant, fault tolerant way and also checkpoint it in an incremental fashion. There are operators in the platform that use  ManagedState  and can be used as a reference on how to use this utility such as Dedup or Join operators.", 
+            "title": "Checkpointing"
+        }, 
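The checkpoint-related callbacks described above can be combined in a single operator, for example to keep per-window state bounded; the class name and the kind of state kept are illustrative.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.Operator.CheckpointNotificationListener;
    import com.datatorrent.common.util.BaseOperator;

    // Keeps per-window counts (non-transient, so they are checkpointed), flushes pending
    // work in beforeCheckpoint and purges state once the platform reports a window committed.
    public class WindowStateOperator extends BaseOperator implements CheckpointNotificationListener
    {
      private final Map<Long, Long> countsPerWindow = new HashMap<Long, Long>();

      private transient long currentWindowId;
      private long currentCount;

      public final transient DefaultInputPort<Object> input = new DefaultInputPort<Object>()
      {
        @Override
        public void process(Object tuple)
        {
          currentCount++;
        }
      };

      @Override
      public void beginWindow(long windowId)
      {
        currentWindowId = windowId;
        currentCount = 0;
      }

      @Override
      public void endWindow()
      {
        countsPerWindow.put(currentWindowId, currentCount);
      }

      @Override
      public void beforeCheckpoint(long windowId)
      {
        // Flush any buffered output here so the checkpointed state matches what was persisted.
      }

      @Override
      public void checkpointed(long windowId)
      {
        // Nothing to do in this sketch once the checkpoint has been taken.
      }

      @Override
      public void committed(long windowId)
      {
        // State for windows up to and including windowId is no longer needed for recovery.
        Iterator<Map.Entry<Long, Long>> it = countsPerWindow.entrySet().iterator();
        while (it.hasNext()) {
          if (it.next().getKey() <= windowId) {
            it.remove();
          }
        }
      }
    }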
+        {
+            "location": "/development_best_practices/#input-operators", 
+            "text": "Input operators have additional requirements:   The  emitTuples  method implemented by the operator, is called by the platform, to give the operator an opportunity to emit some data. This method is always called within a window boundary but can be called multiple times within the same window. There are some important guidelines on how to implement this method:  This should not be a blocking method and should return in a reasonable time that is less than the streaming window size (which is 500ms by default). This also applies to other callback methods called by the platform such as  beginWindow ,  endWindow  etc., but is more important here since this method will be called continuously by the platform.  If the operator needs to interact with external systems to obtain data and this can potentially take a long time, then this should be performed asynchronously in a different thread. Refer to the threading section below for the guidelines when using threading.  In 
 each invocation, the method can emit any number of data tuples.", 
+            "title": "Input Operators"
+        }, 
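Putting these guidelines together, an input operator talking to a slow external source can keep the slow interaction on a worker thread while emitTuples only drains an in-memory buffer and returns quickly. All class, field and method names below are illustrative, and readFromExternalSource() is a placeholder for the real external call.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.InputOperator;
    import com.datatorrent.api.Operator.ActivationListener;
    import com.datatorrent.common.util.BaseOperator;

    // Worker thread performs the potentially slow reads; emitTuples never blocks the
    // operator lifecycle thread and emits a bounded number of tuples per invocation.
    public class AsyncSourceOperator extends BaseOperator
        implements InputOperator, ActivationListener<OperatorContext>
    {
      private static final int MAX_TUPLES_PER_CALL = 1000;

      public final transient DefaultOutputPort<String> output = new DefaultOutputPort<String>();

      private transient BlockingQueue<String> buffer;
      private transient volatile boolean running;
      private transient Thread reader;

      @Override
      public void activate(OperatorContext context)
      {
        buffer = new ArrayBlockingQueue<String>(10000);
        running = true;
        reader = new Thread(new Runnable()
        {
          @Override
          public void run()
          {
            while (running) {
              String record = readFromExternalSource();  // may block or take long
              try {
                if (record != null) {
                  buffer.put(record);
                } else {
                  Thread.sleep(10);  // avoid a busy loop when no data is available
                }
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
              }
            }
          }
        }, "external-reader");
        reader.start();
      }

      @Override
      public void emitTuples()
      {
        // Called many times per window; emit what is buffered and return immediately.
        for (int i = 0; i < MAX_TUPLES_PER_CALL; i++) {
          String record = buffer.poll();
          if (record == null) {
            break;
          }
          output.emit(record);
        }
      }

      @Override
      public void deactivate()
      {
        running = false;
        reader.interrupt();
        try {
          reader.join();  // wait for the worker thread to finish before shutdown proceeds
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }

      private String readFromExternalSource()
      {
        return null;  // placeholder for the real external read
      }
    }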
+        {
+            "location": "/development_best_practices/#idempotence", 
+            "text": "Many applications write data to external systems using output operators. To ensure that data is present exactly once in the external system even in a failure recovery scenario, the output operators expect the replayed windows during recovery contain the same data as before the failure. This is called idempotency. Since operators within the DAG are merely responding to input data provided to them by the upstream operators and the input operator has no upstream operator, the responsibility of idempotent replay falls on the input operators.   For idempotent replay of data, the operator needs to store some meta-information for every window that would allow it to identify what data was sent in that window. This is called the idempotent state.  If the external source of the input operator allows replayability, this could be information such as offset of last piece of data in the window, an identifier of the last piece of data itself or number of data tuples sent.  How
 ever if the external source does not allow replayability from an operator specified point, then the entire data sent within the window may need to be persisted by the operator.    The platform provides a utility called  WindowDataManager  to allow operators to save and retrieve idempotent state every window. Operators should use this to implement idempotency.", 
+            "title": "Idempotence"
+        }, 
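The replay logic can be sketched for a source that is addressable by offsets. The small MetadataStore interface below is a hypothetical stand-in for the role WindowDataManager plays (persisting a small piece of per-window metadata at every window boundary); it is declared inline only to keep the sketch self-contained, and the offset handling is illustrative.

    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.InputOperator;
    import com.datatorrent.common.util.BaseOperator;

    // During replay the operator re-emits exactly the offset range it recorded for each
    // window before the failure; new data is only read once replay has caught up.
    public class IdempotentSourceOperator extends BaseOperator implements InputOperator
    {
      /** Hypothetical per-window metadata store; WindowDataManager fills this role in Malhar. */
      public interface MetadataStore
      {
        void save(long windowId, long[] offsetRange);

        long[] load(long windowId);

        long largestCompletedWindow();
      }

      public final transient DefaultOutputPort<String> output = new DefaultOutputPort<String>();

      private transient MetadataStore metadataStore;   // would be created/configured in setup()
      private transient long currentWindowId;
      private transient long windowStartOffset;
      private transient boolean replaying;

      private long nextOffset;                          // checkpointed read position

      @Override
      public void beginWindow(long windowId)
      {
        currentWindowId = windowId;
        replaying = windowId <= metadataStore.largestCompletedWindow();
        if (replaying) {
          // Re-emit exactly the range recorded for this window before the failure.
          long[] range = metadataStore.load(windowId);
          emitRange(range[0], range[1]);
          nextOffset = range[1];
        } else {
          windowStartOffset = nextOffset;
        }
      }

      @Override
      public void emitTuples()
      {
        if (!replaying) {
          nextOffset = emitNewData(nextOffset);   // read and emit fresh data, non-blocking
        }
      }

      @Override
      public void endWindow()
      {
        if (!replaying) {
          // Record what this window contained so a replay after failure is identical.
          metadataStore.save(currentWindowId, new long[] {windowStartOffset, nextOffset});
        }
      }

      private void emitRange(long from, long to)
      {
        // placeholder: read offsets [from, to) from the source and emit them
      }

      private long emitNewData(long from)
      {
        return from;  // placeholder: emit available data and return the new offset
      }
    }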
+        {
+            "location": "/development_best_practices/#output-operators", 
+            "text": "Output operators typically connect to external storage systems such as filesystems, databases or key value stores to store data.   In some situations, the external systems may not be functioning in a reliable fashion. They may be having prolonged outages or performance problems. If the operator is being designed to work in such environments, it needs to be able to to handle these problems gracefully and not block the DAG or fail. In these scenarios the operator should cache the data into a local store such as HDFS and interact with external systems in a separate thread so as to not have problems in the operator lifecycle thread. This pattern is called the  Reconciler  pattern and there are operators that implement this pattern available in the library for reference.", 
+            "title": "Output Operators"
+        }, 
+        {
+            "location": "/development_best_practices/#end-to-end-exactly-once", 
+            "text": "When output operators store data in external systems, it is important that they do not lose data or write duplicate data when there is a failure event and the DAG recovers from that failure. In failure recovery, the windows from the previous checkpoint are replayed and the operator receives this data again. The operator should ensure that it does not write this data again. Operator developers should figure out how to do this specifically for the operators they are developing depending on the logic of the operators. Below are examples of how a couple of existing output operators do this for reference.   File output operator that writes data to files keeps track of the file lengths in the state. These lengths are checkpointed and restored on failure recovery. On restart, the operator truncates the file to the length equal to the length in the recovered state. This makes the data in the file same as it was at the time of checkpoint before the failure. The operator 
 now writes the replayed data from the checkpoint in regular fashion as any other data. This ensures no data is lost or duplicated in the file.  The JDBC output operator that writes data to a database table writes the data in a window in a single transaction. It also writes the current window id into a meta table along with the data as part of the same transaction. It commits the transaction at the end of the window. When there is an operator failure before the final commit, the state of the database is that it contains the data from the previous fully processed window and its window id since the current window transaction isn\u2019t yet committed. On recovery, the operator reads this window id back from the meta table. It ignores all the replayed windows whose window id is less than or equal to the recovered window id and thus ensures that it does not duplicate data already present in the database. It starts writing data normally again when window id of data becomes greater than rec
 overed window thus ensuring no data is lost.", 
+            "title": "End-to-End Exactly Once"
+        }, 
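The window-id technique used by the JDBC example above can be sketched generically. The placeholder methods hide the actual JDBC calls, and the meta-table layout they imply is an assumption rather than the exact Malhar implementation.

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.common.util.BaseOperator;

    // Data and the id of the window it belongs to are committed in one transaction, so
    // replayed windows that are already in the database can be detected and skipped.
    public class TransactionalDbOutputOperator extends BaseOperator
    {
      private transient long committedWindowId;   // window id already stored in the database
      private transient long currentWindowId;
      private transient boolean skipWindow;       // true while replaying windows already written

      public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
      {
        @Override
        public void process(String tuple)
        {
          if (!skipWindow) {
            insertRow(tuple);   // placeholder: INSERT within the open transaction
          }
        }
      };

      @Override
      public void setup(OperatorContext context)
      {
        committedWindowId = readLastWindowIdFromMetaTable();   // placeholder query
      }

      @Override
      public void beginWindow(long windowId)
      {
        currentWindowId = windowId;
        // Replayed windows that the database already contains must not be written again.
        skipWindow = windowId <= committedWindowId;
      }

      @Override
      public void endWindow()
      {
        if (!skipWindow) {
          // The window id is written in the same transaction as the data, so data and
          // recorded progress become visible atomically when the transaction commits.
          updateLastWindowIdInMetaTable(currentWindowId);
          commitTransaction();
        }
      }

      private long readLastWindowIdFromMetaTable()
      {
        return -1;   // placeholder
      }

      private void insertRow(String tuple)
      {
        // placeholder
      }

      private void updateLastWindowIdInMetaTable(long windowId)
      {
        // placeholder
      }

      private void commitTransaction()
      {
        // placeholder
      }
    }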
+        {
+            "location": "/development_best_practices/#partitioning", 
+            "text": "Partitioning allows an operation to be scaled to handle more pieces of data than before but with a similar SLA. This is done by creating multiple instances of an operator and distributing the data among them. Input operators can also be partitioned to stream more pieces of data into the application. The platform provides a lot of flexibility and options for partitioning. Partitioning can happen once at startup or can be dynamically changed anytime while the application is running, and it can be done in a stateless or stateful way by distributing state from the old partitions to new partitions.  In the platform, the responsibility for partitioning is shared among different entities. These are:   A  partitioner  that specifies  how  to partition the operator, specifically it takes an old set of partitions and creates a new set of partitions. At the start of the application the old set has one partition and the partitioner can return more than one partitions to sta
 rt the application with multiple partitions. The partitioner can have any custom JAVA logic to determine the number of new partitions, set their initial state as a brand new state or derive it from the state of the old partitions. It also specifies how the data gets distributed among the new partitions. The new set doesn't have to contain only new partitions, it can carry over some old partitions if desired.  An optional  statistics (stats) listener  that specifies  when  to partition. The reason it is optional is that it is needed only when dynamic partitioning is needed. With the stats listener, the stats can be used to determine when to partition.  In some cases the  operator  itself should be aware of partitioning and would need to provide supporting code.  In case of input operators each partition should have a property or a set of properties that allow it to distinguish itself from the other partitions and fetch unique data.    When an operator that was originally a single ins
 tance is split into multiple partitions with each partition working on a subset of data, the results of the partitions may need to be combined together to compute the final result. The combining logic would depend on the logic of the operator. This would be specified by the developer using a  Unifier , which is deployed as another operator by the platform. If no  Unifier  is specified, the platform inserts a  default unifier  that merges the results of the multiple partition streams into a single stream. Each output port can have a different  Unifier  and this is specified by returning the corresponding  Unifier  in the  getUnifier  method of the output port. The operator developer should provide a custom  Unifier  wherever applicable.  The Apex  engine  that brings everything together and effects the partitioning.   Since partitioning is critical for scalability of applications, operators must support it. There should be a strong reason for an operator to not support partitioning, 
 such as, the logic performed by the operator not lending itself to parallelism. In order to support partitioning, an operator developer, apart from developing the functionality of the operator, may also need to provide a partitioner, stats listener and supporting code in the operator as described in the steps above. The next sections delve into this.", 
+            "title": "Partitioning"
+        }, 
+        {
+            "location": "/development_best_practices/#out-of-the-box-partitioning", 
+            "text": "The platform comes with some built-in partitioning utilities that can be used in certain scenarios.    StatelessPartitioner  provides a default partitioner, that can be used for an operator in certain conditions. If the operator satisfies these conditions, the partitioner can be specified for the operator with a simple setting and no other partitioning code is needed. The conditions are:   No dynamic partitioning is needed, see next point about dynamic partitioning.   There is no distinct initial state for the partitions, i.e., all partitions start with the same initial state submitted during application launch.   Typically input or output operators do not fall into this category, although there are some exceptions. This partitioner is mainly used with operators that are in the middle of the DAG, after the input and before the output operators. When used with non-input operators, only the data for the first declared input port is distributed among the different 
 partitions. All other input ports are treated as broadcast and all partitions receive all the data for that port.    StatelessThroughputBasedPartitioner  in Malhar provides a dynamic partitioner based on throughput thresholds. Similarly  StatelessLatencyBasedPartitioner  provides a latency based dynamic partitioner in RTS. If these partitioners can be used, then separate partitioning related code is not needed. The conditions under which these can be used are:   There is no distinct initial state for the partitions.  There is no state being carried over by the operator from one window to the next i.e., operator is stateless.", 
+            "title": "Out of the box partitioning"
+        }, 
+        {
+            "location": "/development_best_practices/#custom-partitioning", 
+            "text": "In many cases, operators don\u2019t satisfy the above conditions and a built-in partitioner cannot be used. Custom partitioning code needs to be written by the operator developer. Below are guidelines for it.   Since the operator developer is providing a  partitioner  for the operator, the partitioning code should be added to the operator itself by making the operator implement the Partitioner interface and implementing the required methods, rather than creating a separate partitioner. The advantage is the user of the operator does not have to explicitly figure out the partitioner and set it for the operator but still has the option to override this built-in partitioner with a different one.  The  partitioner  is responsible for setting the initial state of the new partitions, whether it is at the start of the application or when partitioning is happening while the application is running as in the dynamic partitioning case. In the dynamic partitioning scenario, 
 the partitioner needs to take the state from the old partitions and distribute it among the new partitions. It is important to note that apart from the checkpointed state the partitioner also needs to distribute idempotent state.  The  partitioner  interface has two methods,  definePartitions  and  partitioned . The method  definePartitions  is first called to determine the new partitions, and if enough resources are available on the cluster, the  partitioned  method is called passing in the new partitions. This happens both during initial partitioning and dynamic partitioning. If resources are not available, partitioning is abandoned and existing partitions continue to run untouched. This means that any processing intensive operations should be deferred to the  partitioned  call instead of doing them in  definePartitions , as they may not be needed if there are not enough resources available in the cluster.  The  partitioner , along with creating the new partitions, should also spec
 ify how the data gets distributed across the new partitions. It should do this by specifying a mapping called  PartitionKeys  for each partition that maps the data to that partition. This mapping needs to be specified for every input port in the operator. If the  partitioner  wants to use the standard mapping it can use a utility method called  DefaultPartition.assignPartitionKeys .  When the partitioner is scaling the operator up to more partitions, try to reuse the existing partitions and create new partitions to augment the current set. The reuse can be achieved by the partitioner returning the current partitions unchanged. This will result in the current partitions continuing to run untouched.  In case of dynamic partitioning, as mentioned earlier, a stats listener is also needed to determine when to re-partition. Like the  Partitioner  interface, the operator can also implement the  StatsListener  interface to provide a stats listener implementation that will be automatically u
 sed.  The  StatsListener  has access to all operator statistics to make its decision on partitioning. Apart from the statistics that the platform computes for the operators such as throughput, latency etc, operator developers can include their own business metrics by using the AutoMetric feature.  If the operator is not partitionable, mark it so with  OperatorAnnotation  and  partitionable  element set to false.", 
+            "title": "Custom partitioning"
+        }, 
+        {
+            "location": "/development_best_practices/#streamcodecs", 
+            "text": "A  StreamCodec  is used in partitioning to distribute the data tuples among the partitions. The  StreamCodec  computes an integer hashcode for a data tuple and this is used along with  PartitionKeys  mapping to determine which partition or partitions receive the data tuple. If a  StreamCodec  is not specified, then a default one is used by the platform which returns the JAVA hashcode of the tuple.   StreamCodec  is also useful in another aspect of the application. It is used to serialize and deserialize the tuple to transfer it between operators. The default  StreamCodec  uses Kryo library for serialization.   The following guidelines are useful when considering a custom  StreamCodec   A custom  StreamCodec  is needed if the tuples need to be distributed based on a criteria different from the hashcode of the tuple. If the correct working of an operator depends on the data from the upstream operator being distributed using a custom criteria such as being sticky o
 n a \u201ckey\u201d field within the tuple, then a custom  StreamCodec  should be provided by the operator developer. This codec can implement the custom criteria. The operator should also return this custom codec in the  getStreamCodec  method of the input port.  When implementing a custom  StreamCodec  for the purpose of using a different criteria to distribute the tuples, the codec can extend an existing  StreamCodec  and implement the hashcode method, so that the codec does not have to worry about the serialization and deserialization functionality. The Apex platform provides two pre-built  StreamCodec  implementations for this purpose, one is  KryoSerializableStreamCodec  that uses Kryo for serialization and another one  JavaSerializationStreamCodec  that uses JAVA serialization.  Different  StreamCodec  implementations can be used for different inputs in a stream with multiple inputs when different criteria of distributing the tuples is desired between the multiple inputs.", 
+            "title": "StreamCodecs"
+        }, 
+        {
+            "location": "/development_best_practices/#threads", 
+            "text": "The operator lifecycle methods such as  setup ,  beginWindow ,  endWindow ,  process  in  InputPorts  are all called from a single operator lifecycle thread, by the platform, unbeknownst to the user. So the user does not have to worry about dealing with the issues arising from multi-threaded code. Use of separate threads in an operator is discouraged because in most cases the motivation for this is parallelism, but parallelism can already be achieved by using multiple partitions and furthermore mistakes can be made easily when writing multi-threaded code. When dealing with high volume and velocity data, the corner cases with incorrectly written multi-threaded code are encountered more easily and exposed. However, there are times when separate threads are needed, for example, when interacting with external systems the delay in retrieving or sending data can be large at times, blocking the operator and other DAG processing such as committed windows. In these cases
  the following guidelines must be followed strictly.   Threads should be started in  activate  and stopped in  deactivate . In  deactivate  the operator should wait till any threads it launched, have finished execution. It can do so by calling  join  on the threads or if using  ExecutorService , calling  awaitTermination  on the service.  Threads should not call any methods on the ports directly as this can cause concurrency exceptions and also result in invalid states.  Threads can share state with the lifecycle methods using data structures that are either explicitly protected by synchronization or are inherently thread safe such as thread safe queues.  If this shared state needs to be protected against failure then it needs to be persisted during checkpoint. To have a consistent checkpoint, the state should not be modified by the thread when it is being serialized and saved by the operator lifecycle thread during checkpoint. Since the checkpoint process happens outside the window
  boundary the thread should be quiesced between  endWindow  and  beginWindow  or more efficiently between pre-checkpoint and checkpointed callbacks.", 
+            "title": "Threads"
+        }, 
+        {
             "location": "/apex_cli/", 
             "text": "Apache Apex Command Line Interface\n\n\nApex CLI, the Apache Apex command line interface, can be used to launch, monitor, and manage Apache Apex applications.  It provides a developer friendly way of interacting with Apache Apex platform.  Another advantage of Apex CLI is to provide scope, by connecting and executing commands in a context of specific application.  Apex CLI enables easy integration with existing enterprise toolset for automated application monitoring and management.  Currently the following high level tasks are supported.\n\n\n\n\nLaunch or kill applications\n\n\nView system metrics including load, throughput, latency, etc.\n\n\nStart or stop tuple recording\n\n\nRead operator, stream, port properties and attributes\n\n\nWrite to operator properties\n\n\nDynamically change the application logical plan\n\n\nCreate custom macros\n\n\n\n\nApex CLI Commands\n\n\nApex CLI can be launched by running following command\n\n\napex\n\n\n\nHelp on all comman
 ds is available via \u201chelp\u201d command in the CLI\n\n\nGlobal Commands\n\n\nGLOBAL COMMANDS EXCEPT WHEN CHANGING LOGICAL PLAN:\n\nalias alias-name command\n    Create a command alias\n\nbegin-macro name\n    Begin Macro Definition ($1...$9 to access parameters and type 'end' to end the definition)\n\nconnect app-id\n    Connect to an app\n\ndump-properties-file out-file jar-file class-name\n    Dump the properties file of an app class\n\necho [arg ...]\n    Echo the arguments\n\nexit\n    Exit the CLI\n\nget-app-info app-id\n    Get the information of an app\n\nget-app-package-info app-package-file\n    Get info on the app package file\n\nget-app-package-operator-properties app-package-file operator-class\n    Get operator properties within the given app package\n\nget-app-package-operators [options] app-package-file [search-term]\n    Get operators within the given app package\n    Options:\n            -parent    Specify the parent class for the operators\n\nget-config-param
 eter [parameter-name]\n    Get the configuration parameter\n\nget-jar-operator-classes [options] jar-files-comma-separated [search-term]\n    List operators in a jar list\n    Options:\n            -parent    Specify the parent class for the operators\n\nget-jar-operator-properties jar-files-comma-separated operator-class-name\n    List properties in specified operator\n\nhelp [command]\n    Show help\n\nkill-app app-id [app-id ...]\n    Kill an app\n\n  launch [options] jar-file/json-file/properties-file/app-package-file [matching-app-name]\n    Launch an app\n    Options:\n            -apconf \napp package configuration file\n        Specify an application\n                                                            configuration file\n                                                            within the app\n                                                            package if launching\n                                                            an app package.\n            -a
 rchives \ncomma separated list of archives\n    Specify comma\n                                                            separated archives\n                                                            to be unarchived on\n                                                            the compute machines.\n            -conf \nconfiguration file\n                      Specify an\n                                                            application\n                                                            configuration file.\n            -D \nproperty=value\n                             Use value for given\n                                                            property.\n            -exactMatch                                     Only consider\n                                                            applications with\n                                                            exact app name\n            -files \ncomma separated list of files\n          Specify comma\n 
                                                            separated files to\n                                                            be copied on the\n                                                            compute machines.\n            -ignorepom                                      Do not run maven to\n                                                            find the dependency\n            -libjars \ncomma separated list of libjars\n      Specify comma\n                                                            separated jar files\n                                                            or other resource\n                                                            files to include in\n                                                            the classpath.\n            -local                                          Run application in\n                                                            local mode.\n            -originalAppId \napplication id\n       
           Specify original\n                                                            application\n                                                            identifier for restart.\n            -queue \nqueue name\n                             Specify the queue to\n                                                            launch the application\n\nlist-application-attributes\n    Lists the application attributes\nlist-apps [pattern]\n    List applications\nlist-operator-attributes\n    Lists the operator attributes\nlist-port-attributes\n    Lists the port attributes\nset-pager on/off\n    Set the pager program for output\nshow-logical-plan [options] jar-file/app-package-file [class-name]\n    List apps in a jar or show logical plan of an app class\n    Options:\n            -exactMatch                                Only consider exact match\n                                                       for app name\n            -ignorepom                                 Do not run 
 maven to find\n                                                       the dependency\n            -libjars \ncomma separated list of jars\n    Specify comma separated\n                                                       jar/resource files to\n                                                       include in the classpath.\nshutdown-app app-id [app-id ...]\n    Shutdown an app\nsource file\n    Execute the commands in a file\n\n\n\n\nCommands after connecting to an application\n\n\nCOMMANDS WHEN CONNECTED TO AN APP (via connect \nappid\n) EXCEPT WHEN CHANGING LOGICAL PLAN:\n\nbegin-logical-plan-change\n    Begin Logical Plan Change\ndump-properties-file out-file [jar-file] [class-name]\n    Dump the properties file of an app class\nget-app-attributes [attribute-name]\n    Get attributes of the connected app\nget-app-info [app-id]\n    Get the information of an app\nget-operator-attributes operator-name [attribute-name]\n    Get attributes of an operator\nget-operator-properties op
 erator-name [property-name]\n    Get properties of a logical operator\nget-physical-operator-properties [options] operator-id\n    Get properties of a physical operator\n    Options:\n            -propertyName \nproperty name\n    The name of the property whose\n                                             value needs to be retrieved\n            -waitTime \nwait time\n            How long to wait to get the result\nget-port-attributes operator-name port-name [attribute-name]\n    Get attributes of a port\nget-recording-info [operator-id] [start-time]\n    Get tuple recording info\nkill-app [app-id ...]\n    Kill an app\nkill-container container-id [container-id ...]\n    Kill a container\nlist-containers\n    List containers\nlist-operators [pattern]\n    List operators\nset-operator-property operator-name property-name property-value\n    Set a property of an operator\nset-physical-operator-property operator-id property-name property-value\n    Set a property of an operator\nshow-
 logical-plan [options] [jar-file/app-package-file] [class-name]\n    Show logical plan of an app class\n    Options:\n            -exactMatch                                Only consider exact match\n                                                       for app name\n            -ignorepom                                 Do not run maven to find\n                                                       the dependency\n            -libjars \ncomma separated list of jars\n    Specify comma separated\n                                                       jar/resource files to\n                                                       include in the classpath.\nshow-physical-plan\n    Show physical plan\nshutdown-app [app-id ...]\n    Shutdown an app\nstart-recording operator-id [port-name] [num-windows]\n    Start recording\nstop-recording operator-id [port-name]\n    Stop recording\nwait timeout\n    Wait for completion of current application\n\n\n\n\nCommands when changing the logical p
 lan\n\n\nCOMMANDS WHEN CHANGING LOGICAL PLAN (via begin-logical-plan-change):\n\nabort\n    Abort the plan change\nadd-stream-sink stream-name to-operator-name to-port-name\n    Add a sink to an existing stream\ncreate-operator operator-name class-name\n    Create an operator\ncreate-stream stream-name from-operator-name from-port-name to-operator-name to-port-name\n    Create a stream\nhelp [command]\n    Show help\nremove-operator operator-name\n    Remove an operator\nremove-stream stream-name\n    Remove a stream\nset-operator-attribute operator-name attr-name attr-value\n    Set an attribute of an operator\nset-operator-property operator-name property-name property-value\n    Set a property of an operator\nset-port-attribute operator-name port-name attr-name attr-value\n    Set an attribute of a port\nset-stream-attribute stream-name attr-name attr-value\n    Set an attribute of a stream\nshow-queue\n    Show the queue of the plan change\nsubmit\n    Submit the plan change\n\n\
 n\n\nExamples\n\n\nAn example of defining a custom macro.  The macro updates a running application by inserting a new operator.  It takes three parameters and executes a logical plan changes.\n\n\napex\n begin-macro add-console-output\nmacro\n begin-logical-plan-change\nmacro\n create-operator $1 com.datatorrent.lib.io.ConsoleOutputOperator\nmacro\n create-stream stream_$1 $2 $3 $1 in\nmacro\n submit\n\n\n\n\nThen execute the \nadd-console-output\n macro like this\n\n\napex\n add-console-output xyz opername portname\n\n\n\n\nThis macro then expands to run the following command\n\n\nbegin-logical-plan-change\ncreate-operator xyz com.datatorrent.lib.io.ConsoleOutputOperator\ncreate-stream stream_xyz opername portname xyz in\nsubmit\n\n\n\n\nNote\n:  To perform runtime logical plan changes, like ability to add new operators,\nthey must be part of the jar files that were deployed at application launch time.", 
             "title": "Apex CLI"
@@ -857,7 +922,7 @@
         }, 
         {
             "location": "/security/", 
-            "text": "Security\n\n\nApplications built on Apex run as native YARN applications on Hadoop. The security framework and apparatus in Hadoop apply to the applications. The default security mechanism in Hadoop is Kerberos.\n\n\nKerberos Authentication\n\n\nKerberos is a ticket based authentication system that provides authentication in a distributed environment where authentication is needed between multiple users, hosts and services. It is the de-facto authentication mechanism supported in Hadoop. To use Kerberos authentication, the Hadoop installation must first be configured for secure mode with Kerberos. Please refer to the administration guide of your Hadoop distribution on how to do that. Once Hadoop is configured, there is some configuration needed on Apex side as well.\n\n\nConfiguring security\n\n\nThere is Hadoop configuration and CLI configuration. Hadoop configuration may be optional.\n\n\nHadoop Configuration\n\n\nAn Apex application uses delegation tokens to 
 authenticate with the ResourceManager (YARN) and NameNode (HDFS) and these tokens are issued by those servers respectively. Since the application is long-running,\nthe tokens should be valid for the lifetime of the application. Hadoop has a configuration setting for the maximum lifetime of the tokens and they should be set to cover the lifetime of the application. There are separate settings for ResourceManager and NameNode delegation\ntokens.\n\n\nThe ResourceManager delegation token max lifetime is specified in \nyarn-site.xml\n and can be specified as follows for example for a lifetime of 1 year\n\n\nproperty\n\n  \nname\nyarn.resourcemanager.delegation.token.max-lifetime\n/name\n\n  \nvalue\n31536000000\n/value\n\n\n/property\n\n\n\n\n\nThe NameNode delegation token max lifetime is specified in\nhdfs-site.xml and can be specified as follows for example for a lifetime of 1 year\n\n\nproperty\n\n   \nname\ndfs.namenode.delegation.token.max-lifetime\n/name\n\n   \nvalue\n3153600000
 0\n/value\n\n \n/property\n\n\n\n\n\nCLI Configuration\n\n\nThe Apex command line interface is used to launch\napplications along with performing various other operations and administrative tasks on the applications. \u00a0When Kerberos security is enabled in Hadoop, a Kerberos ticket granting ticket (TGT) or the Kerberos credentials of the user are needed b

<TRUNCATED>

[3/6] apex-site git commit: Update apex-3.4 documentation from master to include security changes and development best practices.

Posted by th...@apache.org.
Update apex-3.4 documentation from master to include security changes and development best practices.


Project: http://git-wip-us.apache.org/repos/asf/apex-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/apex-site/commit/21e76a00
Tree: http://git-wip-us.apache.org/repos/asf/apex-site/tree/21e76a00
Diff: http://git-wip-us.apache.org/repos/asf/apex-site/diff/21e76a00

Branch: refs/heads/asf-site
Commit: 21e76a006707cf1871eacbc5ab99eb17cf8e3d2b
Parents: 974bace
Author: Thomas Weise <th...@datatorrent.com>
Authored: Tue Sep 6 19:06:26 2016 -0700
Committer: Thomas Weise <th...@datatorrent.com>
Committed: Tue Sep 6 19:06:26 2016 -0700

----------------------------------------------------------------------
 docs/apex-3.4/__init__.pyc                      | Bin 166 -> 163 bytes
 docs/apex-3.4/apex_cli/index.html               |  11 +-
 docs/apex-3.4/apex_development_setup/index.html |  17 +-
 .../apex-3.4/application_development/index.html |  15 +-
 docs/apex-3.4/application_packages/index.html   |   7 +
 docs/apex-3.4/autometrics/index.html            |  13 +-
 docs/apex-3.4/compatibility/index.html          |   7 +
 .../development_best_practices/index.html       | 376 +++++++++++++++++++
 docs/apex-3.4/images/security/image03.png       | Bin 0 -> 18677 bytes
 docs/apex-3.4/index.html                        |  11 +-
 docs/apex-3.4/license/highlight.js/LICENSE      |  24 --
 docs/apex-3.4/main.html                         |  10 +
 docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js       |   7 +
 docs/apex-3.4/mkdocs/search_index.json          | 126 ++++++-
 docs/apex-3.4/operator_development/index.html   |   9 +-
 docs/apex-3.4/search.html                       |   7 +
 docs/apex-3.4/security/index.html               | 129 +++++--
 docs/apex-3.4/sitemap.xml                       |  24 +-
 18 files changed, 705 insertions(+), 88 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/__init__.pyc
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/__init__.pyc b/docs/apex-3.4/__init__.pyc
index f478a23..5d767d8 100644
Binary files a/docs/apex-3.4/__init__.pyc and b/docs/apex-3.4/__init__.pyc differ

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/apex_cli/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/apex_cli/index.html b/docs/apex-3.4/apex_cli/index.html
index f6c491e..c45aec1 100644
--- a/docs/apex-3.4/apex_cli/index.html
+++ b/docs/apex-3.4/apex_cli/index.html
@@ -102,6 +102,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -436,7 +443,7 @@ they must be part of the jar files that were deployed at application launch time
         <a href="../security/" class="btn btn-neutral float-right" title="Security">Next <span class="icon icon-circle-arrow-right"></span></a>
       
       
-        <a href="../autometrics/" class="btn btn-neutral" title="AutoMetric API"><span class="icon icon-circle-arrow-left"></span> Previous</a>
+        <a href="../development_best_practices/" class="btn btn-neutral" title="Best Practices"><span class="icon icon-circle-arrow-left"></span> Previous</a>
       
     </div>
   
@@ -462,7 +469,7 @@ they must be part of the jar files that were deployed at application launch time
     <span class="rst-current-version" data-toggle="rst-current-version">
       
       
-        <span><a href="../autometrics/" style="color: #fcfcfc;">&laquo; Previous</a></span>
+        <span><a href="../development_best_practices/" style="color: #fcfcfc;">&laquo; Previous</a></span>
       
       
         <span style="margin-left: 15px"><a href="../security/" style="color: #fcfcfc">Next &raquo;</a></span>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/apex_development_setup/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/apex_development_setup/index.html b/docs/apex-3.4/apex_development_setup/index.html
index 75a7891..1af03d1 100644
--- a/docs/apex-3.4/apex_development_setup/index.html
+++ b/docs/apex-3.4/apex_development_setup/index.html
@@ -119,6 +119,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -306,22 +313,22 @@ project properties at <em>Properties &#8658; Run/Debug Settings &#8658; Applicat
 <ol>
 <li>
 <p>Check out the source code repositories:</p>
-<pre><code>git clone https://github.com/apache/incubator-apex-core
-git clone https://github.com/apache/incubator-apex-malhar
+<pre><code>git clone https://github.com/apache/apex-core
+git clone https://github.com/apache/apex-malhar
 </code></pre>
 </li>
 <li>
 <p>Switch to the appropriate release branch and build each repository:</p>
-<pre><code>cd incubator-apex-core
+<pre><code>cd apex-core
 mvn clean install -DskipTests
 
-cd incubator-apex-malhar
+cd apex-malhar
 mvn clean install -DskipTests
 </code></pre>
 </li>
 </ol>
 <p>The <code>install</code> argument to the <code>mvn</code> command installs resources from each project to your local maven repository (typically <code>.m2/repository</code> under your home directory), and <strong>not</strong> to the system directories, so Administrator privileges are not required. The  <code>-DskipTests</code> argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.</p>
-<p>After the build completes, you should see the demo application package files in the target directory under each demo subdirectory in <code>incubator-apex-malhar/demos</code>.</p>
+<p>After the build completes, you should see the demo application package files in the target directory under each demo subdirectory in <code>apex-malhar/demos</code>.</p>
 <h2 id="sandbox">Sandbox</h2>
 <p>To jump start development with an Apache Hadoop single node cluster, <a href="https://www.datatorrent.com/download">DataTorrent Sandbox</a> powered by VirtualBox is available on Windows, Linux, or Mac platforms.  The sandbox is configured by default to run with 6GB RAM; if your development machine has 16GB or more, you can increase the sandbox RAM to 8GB or more using the VirtualBox console.  This will yield better performance and support larger applications.  The advantage of developing in the sandbox is that most of the tools (e.g. <em>jdk</em>, <em>git</em>, <em>maven</em>), Hadoop YARN and HDFS, and a distribution of Apache Apex and DataTorrent RTS are pre-installed.  The disadvantage is that the sandbox is a memory-limited environment, and requires settings changes and restarts to adjust memory available for development and testing.</p>
               

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/application_development/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/application_development/index.html b/docs/apex-3.4/application_development/index.html
index 8c8f184..20d8e2e 100644
--- a/docs/apex-3.4/application_development/index.html
+++ b/docs/apex-3.4/application_development/index.html
@@ -187,6 +187,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -278,7 +285,7 @@ operators to the <a href="../operator_development/">Operator Development Guide</
 <h1 id="running-a-test-application">Running A Test Application</h1>
 <p>If you are starting with the Apex platform for the first time,
 it can be informative to launch an existing application and see it run.
-One of the simplest examples provided in <a href="https://github.com/apache/incubator-apex-malhar">Apex-Malhar repository</a> is a Pi demo application,
+One of the simplest examples provided in <a href="https://github.com/apache/apex-malhar">Apex-Malhar repository</a> is a Pi demo application,
 which computes the value of PI using random numbers.  After <a href="../apex_development_setup/">setting up development environment</a>
 Pi demo can be launched as follows:</p>
 <ol>
@@ -907,7 +914,7 @@ project name \u201cMalhar\u201d as part of our efforts to foster community
 innovation. These operators can be used in a DAG as is, while others
 have properties that can be set to specify the
 desired computation. Those interested in details, should refer to
-<a href="https://github.com/apache/incubator-apex-malhar">Apex-Malhar operator library</a>.</p>
+<a href="https://github.com/apache/apex-malhar">Apex-Malhar operator library</a>.</p>
 <p>The platform is a Hadoop YARN native
 application. It runs in a Hadoop cluster just like any
 other YARN application (MapReduce etc.) and is designed to seamlessly
@@ -1281,7 +1288,7 @@ DAG in local mode within the IDE.</p>
 <li>The <code>operators</code> field is the list of operators the application has. You can specify the name, the Java class, and the properties of each operator here.</li>
 <li>The <code>streams</code> field is the list of streams that connects the operators together to form the DAG. Each stream consists of the stream name, the operator and port that it connects from, and the list of operators and ports that it connects to. Note that you can connect from <em>one</em> output port of an operator to <em>multiple</em> different input ports of different operators.</li>
 </ul>
-<p>In Apex Malhar, there is an <a href="https://github.com/apache/incubator-apex-malhar/blob/master/demos/pi/src/main/resources/app/PiJsonDemo.json">example</a> in the Pi Demo doing just that.</p>
+<p>In Apex Malhar, there is an <a href="https://github.com/apache/apex-malhar/blob/master/demos/pi/src/main/resources/app/PiJsonDemo.json">example</a> in the Pi Demo doing just that.</p>
 <h3 id="properties-file-dag-specification">Properties File DAG Specification</h3>
 <p>The platform also supports specification of a DAG via a properties
 file. The aim here to make it easy for tools to create and run an
@@ -2625,7 +2632,7 @@ details refer to  <a href="http://docs.datatorrent.com/configuration/">Configura
 <hr />
 <h1 id="demos">Demos</h1>
 <p>The source code for the demos is available in the open-source
-<a href="https://github.com/apache/incubator-apex-malhar">Apache Apex-Malhar repository</a>.
+<a href="https://github.com/apache/apex-malhar">Apache Apex-Malhar repository</a>.
 All of these do computations in real-time. Developers are encouraged to
 review them as they use various features of the platform and provide an
 opportunity for quick learning.</p>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/application_packages/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/application_packages/index.html b/docs/apex-3.4/application_packages/index.html
index 654c764..d4aff60 100644
--- a/docs/apex-3.4/application_packages/index.html
+++ b/docs/apex-3.4/application_packages/index.html
@@ -129,6 +129,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/autometrics/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/autometrics/index.html b/docs/apex-3.4/autometrics/index.html
index 5d01dec..4712619 100644
--- a/docs/apex-3.4/autometrics/index.html
+++ b/docs/apex-3.4/autometrics/index.html
@@ -128,6 +128,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -234,7 +241,7 @@
 <p>When an operator is partitioned, it is useful to aggregate the values of auto-metrics across all its partitions every window to get a logical view of these metrics. The application master performs these aggregations using metrics aggregators.</p>
 <p>The AutoMetric API helps to achieve this by providing an interface for writing aggregators- <code>AutoMetric.Aggregator</code>. Any implementation of <code>AutoMetric.Aggregator</code> can be set as an operator attribute - <code>METRICS_AGGREGATOR</code> for a particular operator which in turn is used for aggregating physical metrics.</p>
 <h2 id="default-aggregators">Default aggregators</h2>
-<p><a href="https://github.com/apache/incubator-apex-core/blob/master/common/src/main/java/com/datatorrent/common/metric/MetricsAggregator.java"><code>MetricsAggregator</code></a> is a simple implementation of <code>AutoMetric.Aggregator</code> that platform uses as a default for summing up primitive types - int, long, float and double.</p>
+<p><a href="https://github.com/apache/apex-core/blob/master/common/src/main/java/com/datatorrent/common/metric/MetricsAggregator.java"><code>MetricsAggregator</code></a> is a simple implementation of <code>AutoMetric.Aggregator</code> that platform uses as a default for summing up primitive types - int, long, float and double.</p>
 <p><code>MetricsAggregator</code> is just a collection of <code>SingleMetricAggregator</code>s. There are multiple implementations of <code>SingleMetricAggregator</code> that perform sum, min, max, avg which are present in Apex core and Apex malhar.</p>
 <p>For the <code>LineReceiver</code> operator, the application developer need not specify any aggregator. The platform will automatically inject an instance of <code>MetricsAggregator</code> that contains two <code>LongSumAggregator</code>s - one for <code>length</code> and one for <code>count</code>. This aggregator will report sum of length and sum of count across all the partitions of <code>LineReceiver</code>.</p>
 <h2 id="building-custom-aggregators">Building custom aggregators</h2>
@@ -358,7 +365,7 @@
   
     <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
       
-        <a href="../apex_cli/" class="btn btn-neutral float-right" title="Apex CLI">Next <span class="icon icon-circle-arrow-right"></span></a>
+        <a href="../development_best_practices/" class="btn btn-neutral float-right" title="Best Practices">Next <span class="icon icon-circle-arrow-right"></span></a>
       
       
         <a href="../operator_development/" class="btn btn-neutral" title="Operators"><span class="icon icon-circle-arrow-left"></span> Previous</a>
@@ -390,7 +397,7 @@
         <span><a href="../operator_development/" style="color: #fcfcfc;">&laquo; Previous</a></span>
       
       
-        <span style="margin-left: 15px"><a href="../apex_cli/" style="color: #fcfcfc">Next &raquo;</a></span>
+        <span style="margin-left: 15px"><a href="../development_best_practices/" style="color: #fcfcfc">Next &raquo;</a></span>
       
     </span>
 </div>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/compatibility/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/compatibility/index.html b/docs/apex-3.4/compatibility/index.html
index ee9fece..9c682ee 100644
--- a/docs/apex-3.4/compatibility/index.html
+++ b/docs/apex-3.4/compatibility/index.html
@@ -102,6 +102,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/development_best_practices/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/development_best_practices/index.html b/docs/apex-3.4/development_best_practices/index.html
new file mode 100644
index 0000000..c2a143f
--- /dev/null
+++ b/docs/apex-3.4/development_best_practices/index.html
@@ -0,0 +1,376 @@
+<!DOCTYPE html>
+<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
+<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
+<head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  
+  
+  
+  <title>Best Practices - Apache Apex Documentation</title>
+  
+
+  <link rel="shortcut icon" href="../favicon.ico">
+  
+
+  
+  <link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
+
+  <link rel="stylesheet" href="../css/theme.css" type="text/css" />
+  <link rel="stylesheet" href="../css/theme_extra.css" type="text/css" />
+  <link rel="stylesheet" href="../css/highlight.css">
+
+  
+  <script>
+    // Current page data
+    var mkdocs_page_name = "Best Practices";
+    var mkdocs_page_input_path = "development_best_practices.md";
+    var mkdocs_page_url = "/development_best_practices/";
+  </script>
+  
+  <script src="../js/jquery-2.1.1.min.js"></script>
+  <script src="../js/modernizr-2.8.3.min.js"></script>
+  <script type="text/javascript" src="../js/highlight.pack.js"></script>
+  <script src="../js/theme.js"></script> 
+
+  
+</head>
+
+<body class="wy-body-for-nav" role="document">
+
+  <div class="wy-grid-for-nav">
+
+    
+    <nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
+      <div class="wy-side-nav-search">
+        <a href=".." class="icon icon-home"> Apache Apex Documentation</a>
+        <div role="search">
+  <form id ="rtd-search-form" class="wy-form" action="../search.html" method="get">
+    <input type="text" name="q" placeholder="Search docs" />
+  </form>
+</div>
+      </div>
+
+      <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
+        <ul class="current">
+          
+            <li>
+    <li class="toctree-l1 ">
+        <a class="" href="..">Apache Apex</a>
+        
+    </li>
+<li>
+          
+            <li>
+    <ul class="subnav">
+    <li><span>Development</span></li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../apex_development_setup/">Development Setup</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../application_development/">Applications</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../application_packages/">Packages</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../operator_development/">Operators</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../autometrics/">AutoMetric API</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 current">
+        <a class="current" href="./">Best Practices</a>
+        
+            <ul>
+            
+                <li class="toctree-l3"><a href="#development-best-practices">Development Best Practices</a></li>
+                
+                    <li><a class="toctree-l4" href="#operators">Operators</a></li>
+                
+                    <li><a class="toctree-l4" href="#input-operators">Input Operators</a></li>
+                
+                    <li><a class="toctree-l4" href="#output-operators">Output Operators</a></li>
+                
+                    <li><a class="toctree-l4" href="#partitioning">Partitioning</a></li>
+                
+                    <li><a class="toctree-l4" href="#threads">Threads</a></li>
+                
+            
+            </ul>
+        
+    </li>
+
+        
+    </ul>
+<li>
+          
+            <li>
+    <ul class="subnav">
+    <li><span>Operations</span></li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../apex_cli/">Apex CLI</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../security/">Security</a>
+        
+    </li>
+
+        
+    </ul>
+<li>
+          
+            <li>
+    <li class="toctree-l1 ">
+        <a class="" href="../compatibility/">Compatibility</a>
+        
+    </li>
+<li>
+          
+        </ul>
+      </div>
+      &nbsp;
+    </nav>
+
+    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
+
+      
+      <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
+        <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
+        <a href="..">Apache Apex Documentation</a>
+      </nav>
+
+      
+      <div class="wy-nav-content">
+        <div class="rst-content">
+          <div role="navigation" aria-label="breadcrumbs navigation">
+  <ul class="wy-breadcrumbs">
+    <li><a href="..">Docs</a> &raquo;</li>
+    
+      
+        
+          <li>Development &raquo;</li>
+        
+      
+    
+    <li>Best Practices</li>
+    <li class="wy-breadcrumbs-aside">
+      
+    </li>
+  </ul>
+  <hr/>
+</div>
+          <div role="main">
+            <div class="section">
+              
+                <h1 id="development-best-practices">Development Best Practices</h1>
+<p>This document describes the best practices to follow when developing operators and other application components, such as partitioners and stream codecs, on the Apache Apex platform.</p>
+<h2 id="operators">Operators</h2>
+<p>This section covers general guidelines that apply to all operators. The subsequent sections discuss additional considerations for input and output operators.</p>
+<ul>
+<li>When writing a new operator to be used in an application, consider breaking it down into<ul>
+<li>An abstract operator that encompasses the core functionality but leaves application-specific schemas and logic to the implementation.</li>
+<li>An optional concrete operator, also in the library, that extends the abstract operator and provides commonly used schema types such as strings, byte[] or POJOs.</li>
+</ul>
+</li>
+<li>Follow these conventions for the life cycle methods:<ul>
+<li>Do one-time initialization of entities that apply for the entire lifetime of the operator in the <strong>setup</strong> method, e.g., factory initializations. Initializations in <strong>setup</strong> are done in the container where the operator is deployed. Allocating memory for fields in the constructor is not efficient as it would lead to extra garbage in memory for the following reason. The operator is instantiated on the client from where the application is launched, serialized, and started on one of the Hadoop nodes in a container. So the constructor is first called on the client and if it were to initialize any of the fields, that state would be saved during serialization. In the Hadoop container the operator is deserialized and started. This would invoke the constructor again, which will initialize the fields, but their state will get overwritten by the serialized state and the initial values would become garbage in memory.</li>
+<li>Do one-time initialization of live entities in the <strong>activate</strong> method, e.g., opening connections to a database server or starting a thread for asynchronous operations. The <strong>activate</strong> method is called right before processing starts, so it is a better place for these initializations than <strong>setup</strong>, where they could cause a delay before data from the live entity is processed.</li>
+<li>Perform periodic tasks based on processing time at application window boundaries.</li>
+<li>Perform initializations needed for each application window in <strong>beginWindow</strong>.</li>
+<li>Perform aggregations needed for each application window in <strong>endWindow</strong>.</li>
+<li>Teardown of live entities (inverse of tasks performed during activate) should be in the <strong>deactivate</strong> method.</li>
+<li>Teardown of lifetime entities (those initialized in setup method) should happen in the <strong>teardown</strong> method.</li>
+<li>If the operator implementation is not finalized, mark it with the <strong>@Evolving</strong> annotation.</li>
+</ul>
+</li>
+<li>If the operator needs to perform operations based on event time of the individual tuples and not the processing time, extend and use the <strong>WindowedOperator</strong>. Refer to documentation of that operator for details on how to use it.</li>
+<li>If an operator needs to do some work when it is not receiving any input, it should implement the <strong>IdleTimeHandler</strong> interface. This interface contains the <strong>handleIdleTime</strong> method, which will be called whenever the platform isn\u2019t doing anything else, and the operator can do the work in this method. If for any reason the operator does not have any work to do when this method is called, it should sleep for a small amount of time, such as that specified by the <strong>SPIN_MILLIS</strong> attribute, so that it does not cause a busy wait when called repeatedly by the platform. Also, the method should not block and should return in a reasonable amount of time that is less than the streaming window size (which is 500ms by default).</li>
+<li>Often operators have customizable parameters such as information about locations of external systems or parameters that modify the behavior of the operator. Users should be able to specify these easily without having to change source code. This can be done by making them properties of the operator because they can then be initialized from external properties files.<ul>
+<li>Where possible default values should be provided for the properties in the source code.</li>
+<li>Validation rules should be specified for the properties using javax constraint validations that check whether the specified values meet format, range or other operator requirements. Required properties should have at least a <strong>@NotNull</strong> validation specifying that they have to be specified by the user. A minimal operator sketch illustrating these conventions follows this list.</li>
+</ul>
+</li>
+</ul>
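+<p>As a rough illustration of the conventions above, here is a minimal sketch of an operator. It is not taken from the Apex or Malhar libraries; the class, property and field names are illustrative only:</p>
+<pre><code>import javax.validation.constraints.NotNull;
+
+import com.datatorrent.api.Context.OperatorContext;
+import com.datatorrent.api.Operator;
+
+// Illustrative operator showing the lifecycle and property conventions; ports are omitted for brevity.
+public class ExternalLookupSketch implements Operator, Operator.ActivationListener&lt;OperatorContext&gt;
+{
+  @NotNull
+  private String serviceUrl;                // required property, validated at launch time
+
+  private long recordsInWindow;             // checkpointed state
+  private transient Object serviceClient;   // live entity, rebuilt after recovery, hence transient
+
+  @Override
+  public void setup(OperatorContext context) { /* one-time setup of lifetime entities, e.g. factories */ }
+
+  @Override
+  public void activate(OperatorContext context) { /* open connections to live entities just before processing starts */ }
+
+  @Override
+  public void beginWindow(long windowId) { recordsInWindow = 0; /* per-window initialization */ }
+
+  @Override
+  public void endWindow() { /* per-window aggregation or emission */ }
+
+  @Override
+  public void deactivate() { /* close connections opened in activate */ }
+
+  @Override
+  public void teardown() { /* release anything created in setup */ }
+
+  public void setServiceUrl(String serviceUrl) { this.serviceUrl = serviceUrl; }
+
+  public String getServiceUrl() { return serviceUrl; }
+}
+</code></pre>
+<p>The <code>serviceUrl</code> property in this sketch can then be set from a configuration file like any other operator property, and the <strong>@NotNull</strong> constraint will fail validation at launch time if it is missing.</p>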
+<h3 id="checkpointing">Checkpointing</h3>
+<p>Checkpointing is a process of snapshotting the state of an operator and saving it so that in case of failure the state can be used to restore the operator to a prior state and continue processing. It is automatically performed by the platform at a configurable interval. All operators in the application are checkpointed in a distributed fashion, thus allowing the entire state of the application to be saved and available for recovery if needed. Here are some things to remember when it comes to checkpointing:</p>
+<ul>
+<li>The process of checkpointing involves snapshotting the state by serializing the operator and saving it to a store. This is done using a <strong>StorageAgent</strong>. By default a <em>StorageAgent</em> is already provided by the platform and it is called <strong>AsyncFSStorageAgent</strong>. It serializes the operator using Kryo and saves the serialized state asynchronously to a filesystem such as HDFS. There are other implementations of <em>StorageAgent</em> available such as <strong>GeodeKeyValueStorageAgent</strong> that stores the serialized state in Geode which is an in-memory replicated data grid.</li>
+<li>All variables in the operator that are marked neither transient nor final are saved, so any variables that are not part of the state should be marked transient. Specifically, variables such as connection objects, I/O streams and ports should be transient, because they need to be set up again on failure recovery.</li>
+<li>If the operator does not keep any state between windows, mark it with the <strong>@Stateless</strong> annotation. This results in efficiencies during checkpointing and recovery. The operator will not be checkpointed and is always restored to the initial state.</li>
+<li>The checkpoint interval can be set using the <strong>CHECKPOINT_WINDOW_COUNT</strong> attribute which specifies the interval in terms of number of streaming windows.</li>
+<li>If the correct functioning of the operator requires that the <strong>endWindow</strong> method be called before checkpointing can happen, then the checkpoint interval should align with the application window interval, i.e., it should be a multiple of it. In this case the operator should be marked with <strong>OperatorAnnotation</strong> and <strong>checkpointableWithinAppWindow</strong> set to false. If the window intervals configured by the user don\u2019t align, it will result in a DAG validation error and the application won\u2019t launch.</li>
+<li>In some cases the operator state related to a piece of data needs to be purged once that data is no longer required by the application, otherwise the state will continue to build up indefinitely. The platform provides a way to let the operator know about this using a callback listener called <strong>CheckpointNotificationListener</strong>. This listener has a callback method called <strong>committed</strong>, which is called by the platform from time to time with a window id that has been processed successfully by all the operators in the DAG and hence is no longer needed. The operator can delete all the state corresponding to window ids less than or equal to the provided window id.</li>
+<li>Sometimes operators need to perform some tasks just before checkpointing. For example, filesystem operators may want to flush their files just before a checkpoint so they can be sure that all pending data is written to disk and no data is lost if there is an operator failure just after the checkpoint and the operator restarts from the checkpoint. To do this, the operator would implement the same <em>CheckpointNotificationListener</em> interface and implement the <strong>beforeCheckpoint</strong> method where it can do these tasks. A sketch of these callbacks follows this list.</li>
+<li>If the operator is going to have a large state, checkpointing the entire state each time becomes unviable. Furthermore, the amount of memory needed to hold the state could be larger than the amount of physical memory available. In these cases the operator should checkpoint the state incrementally and also manage the memory for the state more efficiently. The platform provides a utility called <strong>ManagedState</strong> that uses a combination of in-memory and disk cache to efficiently store and retrieve data in a performant, fault tolerant way and also checkpoint it in an incremental fashion. Operators in the platform that use <em>ManagedState</em>, such as the Dedup and Join operators, can be used as a reference on how to use this utility.</li>
+</ul>
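+<p>The following is a hedged sketch of the <em>CheckpointNotificationListener</em> callbacks mentioned above. The class name is illustrative and the actual buffering and purging logic is only indicated by comments:</p>
+<pre><code>import com.datatorrent.api.Context.OperatorContext;
+import com.datatorrent.api.Operator;
+
+// Illustrative operator reacting to checkpoint related callbacks.
+public class CheckpointAwareWriterSketch implements Operator, Operator.CheckpointNotificationListener
+{
+  @Override
+  public void setup(OperatorContext context) { }
+
+  @Override
+  public void beginWindow(long windowId) { }
+
+  @Override
+  public void endWindow() { }
+
+  @Override
+  public void teardown() { }
+
+  @Override
+  public void beforeCheckpoint(long windowId)
+  {
+    // flush buffered output so what is on disk is consistent with the checkpointed state
+  }
+
+  @Override
+  public void checkpointed(long windowId)
+  {
+    // called after the state up to windowId has been checkpointed; often nothing to do here
+  }
+
+  @Override
+  public void committed(long windowId)
+  {
+    // windowId has been fully processed by all operators in the DAG; it is now safe to
+    // purge any recovery state kept for windows with id &lt;= windowId
+  }
+}
+</code></pre>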
+<h2 id="input-operators">Input Operators</h2>
+<p>Input operators have additional requirements:</p>
+<ul>
+<li>The <strong>emitTuples</strong> method, implemented by the operator, is called by the platform to give the operator an opportunity to emit some data. This method is always called within a window boundary but can be called multiple times within the same window. There are some important guidelines on how to implement this method:<ul>
+<li>This should not be a blocking method and should return in a reasonable time that is less than the streaming window size (which is 500ms by default). This also applies to other callback methods called by the platform such as <em>beginWindow</em>, <em>endWindow</em> etc., but is more important here since this method will be called continuously by the platform.</li>
+<li>If the operator needs to interact with external systems to obtain data and this can potentially take a long time, then this should be performed asynchronously in a different thread. Refer to the Threads section below for guidelines on using threads. A sketch of this asynchronous pattern follows this list.</li>
+<li>In each invocation, the method can emit any number of data tuples.</li>
+</ul>
+</li>
+</ul>
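+<p>Below is a hedged sketch of a non-blocking <strong>emitTuples</strong> implementation that drains a thread-safe buffer filled by a separate fetch thread. The class name, the buffer and the batch size are illustrative assumptions, not an existing library operator:</p>
+<pre><code>import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+
+import com.datatorrent.api.Context.OperatorContext;
+import com.datatorrent.api.DefaultOutputPort;
+import com.datatorrent.api.InputOperator;
+
+public class AsyncFetchInputSketch implements InputOperator
+{
+  private static final int MAX_TUPLES_PER_CALL = 1000;  // bound the work done per invocation
+
+  public final transient DefaultOutputPort&lt;String&gt; output = new DefaultOutputPort&lt;String&gt;();
+
+  // filled by a separate fetch thread started in activate and stopped in deactivate
+  // (see the Threads section below); thread safe so emitTuples can drain it
+  private final transient Queue&lt;String&gt; buffer = new ConcurrentLinkedQueue&lt;String&gt;();
+
+  @Override
+  public void emitTuples()
+  {
+    // emit a bounded number of tuples and return quickly; never block here
+    for (int i = 0; i &lt; MAX_TUPLES_PER_CALL; i++) {
+      String tuple = buffer.poll();
+      if (tuple == null) {
+        break;
+      }
+      output.emit(tuple);
+    }
+  }
+
+  @Override
+  public void setup(OperatorContext context) { /* one-time initialization */ }
+
+  @Override
+  public void beginWindow(long windowId) { }
+
+  @Override
+  public void endWindow() { }
+
+  @Override
+  public void teardown() { }
+}
+</code></pre>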
+<h3 id="idempotence">Idempotence</h3>
+<p>Many applications write data to external systems using output operators. To ensure that data is present exactly once in the external system even in a failure recovery scenario, the output operators expect that the replayed windows during recovery contain the same data as before the failure. This is called idempotency. Since operators within the DAG are merely responding to input data provided to them by the upstream operators and the input operator has no upstream operator, the responsibility of idempotent replay falls on the input operators.</p>
+<ul>
+<li>For idempotent replay of data, the operator needs to store some meta-information for every window that would allow it to identify what data was sent in that window. This is called the idempotent state.<ul>
+<li>If the external source of the input operator allows replayability, this could be information such as the offset of the last piece of data in the window, an identifier of the last piece of data itself, or the number of data tuples sent.</li>
+<li>However if the external source does not allow replayability from an operator specified point, then the entire data sent within the window may need to be persisted by the operator.</li>
+</ul>
+</li>
+<li>The platform provides a utility called <em>WindowDataManager</em> that allows operators to save and retrieve the idempotent state every window. Operators should use it to implement idempotency; a conceptual sketch follows this list.</li>
+</ul>
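+<p>The following conceptual sketch shows idempotent replay for an input operator reading from a replayable source by offset. The nested <code>IdempotentStore</code> interface is a hypothetical stand-in for Malhar's <em>WindowDataManager</em>, whose exact method signatures vary between releases; the rest of the names are illustrative as well.</p>
+<pre><code>import java.io.IOException;
+
+// Conceptual sketch only; not wired into the Apex operator interfaces.
+public class IdempotentReplaySketch
+{
+  // Hypothetical per-window metadata store; in practice use WindowDataManager.
+  public interface IdempotentStore
+  {
+    void save(long windowId, long endOffset) throws IOException;  // persist idempotent state
+    Long retrieve(long windowId) throws IOException;               // null if nothing recorded
+    long largestCompletedWindow();                                 // last window with saved state
+  }
+
+  private IdempotentStore store;   // set up during operator setup
+  private long currentWindowId;
+  private long offset;             // current read position in the source
+  private long replayLimit = -1;   // end offset to replay up to; -1 when not replaying
+
+  public void beginWindow(long windowId) throws IOException
+  {
+    currentWindowId = windowId;
+    if (windowId &lt;= store.largestCompletedWindow()) {
+      // recovery: emit exactly the data that was emitted in this window before the failure
+      Long savedEndOffset = store.retrieve(windowId);
+      replayLimit = savedEndOffset == null ? offset : savedEndOffset;
+    } else {
+      replayLimit = -1;
+    }
+  }
+
+  public void emitTuples()
+  {
+    // per-call emission cap omitted for brevity
+    long limit = replayLimit &gt;= 0 ? replayLimit : Long.MAX_VALUE;
+    while (offset &lt; limit &amp;&amp; sourceHasData()) {
+      emit(readAt(offset));
+      offset++;
+    }
+  }
+
+  public void endWindow() throws IOException
+  {
+    if (replayLimit &lt; 0) {
+      // normal processing: record how far this window went
+      store.save(currentWindowId, offset);
+    }
+  }
+
+  // source access and emission details omitted
+  private boolean sourceHasData() { return false; }
+  private Object readAt(long position) { return null; }
+  private void emit(Object tuple) { }
+}
+</code></pre>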
+<h2 id="output-operators">Output Operators</h2>
+<p>Output operators typically connect to external storage systems such as filesystems, databases or key value stores to store data.</p>
+<ul>
+<li>In some situations, the external systems may not function reliably; they may have prolonged outages or performance problems. If the operator is being designed to work in such environments, it needs to handle these problems gracefully and not block the DAG or fail. In these scenarios the operator should cache the data in a local store such as HDFS and interact with the external system from a separate thread, so that the operator lifecycle thread is not affected. This pattern is called the <strong>Reconciler</strong> pattern; operators that implement it are available in the library for reference, and a conceptual sketch follows this list.</li>
+</ul>
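+<p>A conceptual sketch of the <em>Reconciler</em> pattern is shown below: tuples are handed off cheaply in the lifecycle thread and a separate thread reconciles them with the external system. The names are illustrative and the durable spooling and retry details are omitted; the library operators implementing this pattern should be consulted for a complete treatment.</p>
+<pre><code>import java.util.concurrent.LinkedBlockingQueue;
+
+// Conceptual sketch only; the queue stands in for a durable local spool such as HDFS.
+public class ReconcilerSketch
+{
+  private final LinkedBlockingQueue&lt;String&gt; spooled = new LinkedBlockingQueue&lt;String&gt;();
+
+  // called from the operator lifecycle thread: cheap, never blocks on the external system
+  public void process(String tuple)
+  {
+    spooled.offer(tuple);
+  }
+
+  // body of the reconciling thread, started in activate() and stopped in deactivate()
+  public void reconcile() throws InterruptedException
+  {
+    while (!Thread.currentThread().isInterrupted()) {
+      String tuple = spooled.take();
+      writeToExternalSystem(tuple);  // outages here slow this thread, not the DAG
+    }
+  }
+
+  private void writeToExternalSystem(String tuple)
+  {
+    // external interaction, retries and error handling omitted
+  }
+}
+</code></pre>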
+<h3 id="end-to-end-exactly-once">End-to-End Exactly Once</h3>
+<p>When output operators store data in external systems, it is important that they do not lose data or write duplicate data when there is a failure and the DAG recovers from it. During failure recovery, the windows from the previous checkpoint are replayed and the operator receives this data again; the operator must ensure that it does not write it again. How to do this depends on the logic of the specific operator being developed. Below are examples of how a couple of existing output operators do this, for reference.</p>
+<ul>
+<li>The file output operator that writes data to files keeps track of the file lengths in its state. These lengths are checkpointed and restored on failure recovery. On restart, the operator truncates the file to the length in the recovered state, which makes the data in the file the same as it was at the time of the checkpoint before the failure. The operator then writes the replayed data from the checkpoint in the regular fashion, as any other data. This ensures no data is lost or duplicated in the file.</li>
+<li>The JDBC output operator that writes data to a database table writes the data in a window in a single transaction. It also writes the current window id into a meta table as part of the same transaction, and commits the transaction at the end of the window. If the operator fails before the final commit, the database contains the data from the previous fully processed window and its window id, since the current window's transaction isn't yet committed. On recovery, the operator reads this window id back from the meta table. It ignores all replayed windows whose window id is less than or equal to the recovered window id, and thus does not duplicate data already present in the database. It resumes writing data normally when the window id becomes greater than the recovered window id, so no data is lost. A sketch of this pattern follows this list.</li>
+</ul>
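+<p>A sketch of the window id bookkeeping used by the JDBC example above is shown below, using plain JDBC. The table and column names are examples only, and connection setup, statement reuse and error handling are omitted.</p>
+<pre><code>import java.sql.Connection;
+import java.sql.PreparedStatement;
+import java.sql.ResultSet;
+import java.sql.SQLException;
+
+// Illustrative end-to-end exactly-once bookkeeping for a JDBC output operator.
+public class ExactlyOnceJdbcSketch
+{
+  private Connection connection;    // obtained during setup, with autocommit disabled
+  private long committedWindowId;   // recovered from the meta table during setup
+  private long currentWindowId;
+  private boolean replaying;
+
+  public void setup() throws SQLException
+  {
+    PreparedStatement ps = connection.prepareStatement("SELECT window_id FROM dt_meta WHERE operator_id = ?");
+    ps.setInt(1, 1);
+    ResultSet rs = ps.executeQuery();
+    committedWindowId = rs.next() ? rs.getLong(1) : -1L;
+    rs.close();
+    ps.close();
+  }
+
+  public void beginWindow(long windowId)
+  {
+    currentWindowId = windowId;
+    // windows at or below the recovered id were already written before the failure
+    replaying = windowId &lt;= committedWindowId;
+  }
+
+  public void process(String tuple) throws SQLException
+  {
+    if (replaying) {
+      return;  // skip data that is already in the database
+    }
+    PreparedStatement ps = connection.prepareStatement("INSERT INTO target_table (line) VALUES (?)");
+    ps.setString(1, tuple);
+    ps.executeUpdate();
+    ps.close();
+  }
+
+  public void endWindow() throws SQLException
+  {
+    if (replaying) {
+      return;
+    }
+    PreparedStatement ps = connection.prepareStatement("UPDATE dt_meta SET window_id = ? WHERE operator_id = ?");
+    ps.setLong(1, currentWindowId);
+    ps.setInt(2, 1);
+    ps.executeUpdate();
+    ps.close();
+    connection.commit();  // data and window id become visible atomically
+  }
+}
+</code></pre>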
+<h2 id="partitioning">Partitioning</h2>
+<p>Partitioning allows an operation to be scaled to handle more pieces of data than before but with a similar SLA. This is done by creating multiple instances of an operator and distributing the data among them. Input operators can also be partitioned to stream more pieces of data into the application. The platform provides a lot of flexibility and options for partitioning. Partitioning can happen once at startup or can be dynamically changed anytime while the application is running, and it can be done in a stateless or stateful way by distributing state from the old partitions to new partitions.</p>
+<p>In the platform, the responsibility for partitioning is shared among different entities. These are:</p>
+<ol>
+<li>A <strong>partitioner</strong> that specifies <em>how</em> to partition the operator; specifically, it takes an old set of partitions and creates a new set of partitions. At the start of the application the old set has one partition, and the partitioner can return more than one partition to start the application with multiple partitions. The partitioner can have any custom JAVA logic to determine the number of new partitions and to set their initial state, either as brand new state or derived from the state of the old partitions. It also specifies how the data gets distributed among the new partitions. The new set doesn't have to contain only new partitions; it can carry over some old partitions if desired.</li>
+<li>An optional <strong>statistics (stats) listener</strong> that specifies <em>when</em> to partition. It is optional because it is needed only for dynamic partitioning, where operator statistics are used to determine when to re-partition.</li>
+<li>In some cases the <em>operator</em> itself should be aware of partitioning and would need to provide supporting code.<ul>
+<li>In the case of input operators, each partition should have a property or a set of properties that allows it to distinguish itself from the other partitions and fetch unique data.</li>
+</ul>
+</li>
+<li>When an operator that was originally a single instance is split into multiple partitions, with each partition working on a subset of the data, the results of the partitions may need to be combined to compute the final result. The combining logic depends on the logic of the operator and is specified by the developer using a <strong>Unifier</strong>, which is deployed as another operator by the platform. If no <em>Unifier</em> is specified, the platform inserts a <strong>default unifier</strong> that merges the results of the multiple partition streams into a single stream. Each output port can have a different <em>Unifier</em>, specified by returning the corresponding <em>Unifier</em> in the <strong>getUnifier</strong> method of the output port. The operator developer should provide a custom <em>Unifier</em> wherever applicable; an example follows this list.</li>
+<li>The Apex <em>engine</em> that brings everything together and effects the partitioning.</li>
+</ol>
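+<p>As an example of the <em>Unifier</em> mentioned in the list above, the following sketch merges the partial sums produced by the partitions of a hypothetical counting operator into a single total per window.</p>
+<pre><code>import com.datatorrent.api.DefaultOutputPort;
+import com.datatorrent.api.Operator.Unifier;
+import com.datatorrent.common.util.BaseOperator;
+
+// Illustrative unifier: sums the partial results arriving from all partitions.
+public class SumUnifier extends BaseOperator implements Unifier&lt;Long&gt;
+{
+  public final transient DefaultOutputPort&lt;Long&gt; output = new DefaultOutputPort&lt;Long&gt;();
+
+  private long sum;
+
+  @Override
+  public void beginWindow(long windowId)
+  {
+    sum = 0;
+  }
+
+  @Override
+  public void process(Long partialSum)
+  {
+    sum += partialSum;
+  }
+
+  @Override
+  public void endWindow()
+  {
+    output.emit(sum);
+  }
+}
+</code></pre>
+<p>The partitioned operator would then declare its output port along these lines so that the platform deploys the unifier:</p>
+<pre><code>public final transient DefaultOutputPort&lt;Long&gt; count = new DefaultOutputPort&lt;Long&gt;()
+{
+  @Override
+  public Unifier&lt;Long&gt; getUnifier()
+  {
+    return new SumUnifier();
+  }
+};
+</code></pre>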
+<p>Since partitioning is critical for the scalability of applications, operators must support it. There should be a strong reason for an operator not to support partitioning, such as the logic performed by the operator not lending itself to parallelism. To support partitioning, an operator developer, apart from developing the functionality of the operator, may also need to provide a partitioner, a stats listener and supporting code in the operator, as described in the steps above. The next sections delve into this.</p>
+<h3 id="out-of-the-box-partitioning">Out of the box partitioning</h3>
+<p>The platform comes with some built-in partitioning utilities that can be used in certain scenarios.</p>
+<ul>
+<li>
+<p><strong>StatelessPartitioner</strong> provides a default partitioner that can be used for an operator under certain conditions. If the operator satisfies these conditions, the partitioner can be specified for the operator with a simple setting (illustrated after this list) and no other partitioning code is needed. The conditions are:</p>
+<ul>
+<li>No dynamic partitioning is needed (see the next point about dynamic partitioning).</li>
+<li>There is no distinct initial state for the partitions, i.e., all partitions start with the same initial state submitted during application launch.</li>
+</ul>
+<p>Typically input or output operators do not fall into this category, although there are some exceptions. This partitioner is mainly used with operators that are in the middle of the DAG, after the input and before the output operators. When used with non-input operators, only the data for the first declared input port is distributed among the different partitions. All other input ports are treated as broadcast and all partitions receive all the data for that port.</p>
+</li>
+<li>
+<p><strong>StatelessThroughputBasedPartitioner</strong> in Malhar provides a dynamic partitioner based on throughput thresholds. Similarly, <strong>StatelessLatencyBasedPartitioner</strong> provides a latency-based dynamic partitioner in RTS. If these partitioners can be used, then no separate partitioning code is needed. The conditions under which they can be used are:</p>
+<ul>
+<li>There is no distinct initial state for the partitions.</li>
+<li>There is no state carried over by the operator from one window to the next, i.e., the operator is stateless.</li>
+</ul>
+</li>
+</ul>
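+<p>For reference, the following sketch shows how a fixed number of <em>StatelessPartitioner</em> partitions can be requested when composing the DAG. The <code>WordCounter</code> operator and the other names are hypothetical; the same attribute can also be set in a configuration file.</p>
+<pre><code>import org.apache.hadoop.conf.Configuration;
+
+import com.datatorrent.api.Context.OperatorContext;
+import com.datatorrent.api.DAG;
+import com.datatorrent.api.StreamingApplication;
+import com.datatorrent.common.partitioner.StatelessPartitioner;
+
+// Illustrative application fragment; WordCounter is a hypothetical operator that
+// satisfies the StatelessPartitioner conditions listed above.
+public class PartitionedApplication implements StreamingApplication
+{
+  @Override
+  public void populateDAG(DAG dag, Configuration conf)
+  {
+    WordCounter counter = dag.addOperator("counter", new WordCounter());
+    // run four partitions of the counter
+    dag.setAttribute(counter, OperatorContext.PARTITIONER, new StatelessPartitioner&lt;WordCounter&gt;(4));
+    // input and output operators and streams omitted
+  }
+}
+</code></pre>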
+<h3 id="custom-partitioning">Custom partitioning</h3>
+<p>In many cases, operators don't satisfy the above conditions and a built-in partitioner cannot be used, so the operator developer needs to write custom partitioning code. Guidelines for this follow, along with a sketch of a simple custom partitioner after the list.</p>
+<ul>
+<li>Since the operator developer is providing a <em>partitioner</em> for the operator, the partitioning code should be added to the operator itself, by making the operator implement the Partitioner interface and its required methods, rather than by creating a separate partitioner. The advantage is that the user of the operator does not have to explicitly figure out the partitioner and set it for the operator, but still has the option to override this built-in partitioner with a different one.</li>
+<li>The <em>partitioner</em> is responsible for setting the initial state of the new partitions, whether at the start of the application or when partitioning happens while the application is running, as in the dynamic partitioning case. In the dynamic partitioning scenario, the partitioner needs to take the state from the old partitions and distribute it among the new partitions. It is important to note that, apart from the checkpointed state, the partitioner also needs to distribute the idempotent state.</li>
+<li>The <em>partitioner</em> interface has two methods, <strong>definePartitions</strong> and <strong>partitioned</strong>. The method <em>definePartitions</em> is called first to determine the new partitions; if enough resources are available on the cluster, the <em>partitioned</em> method is then called, passing in the new partitions. This happens during both initial partitioning and dynamic partitioning. If resources are not available, partitioning is abandoned and the existing partitions continue to run untouched. This means that any processing-intensive operations should be deferred to the <em>partitioned</em> call instead of being done in <em>definePartitions</em>, as they may not be needed if there are not enough resources available in the cluster.</li>
+<li>The <em>partitioner</em>, along with creating the new partitions, should also specify how the data gets distributed across the new partitions. It should do this by specifying a mapping called <strong>PartitionKeys</strong> for each partition that maps the data to that partition. This mapping needs to be specified for every input port in the operator. If the <em>partitioner</em> wants to use the standard mapping it can use a utility method called <strong>DefaultPartition.assignPartitionKeys</strong>.</li>
+<li>When the partitioner is scaling the operator up to more partitions, try to reuse the existing partitions and create new partitions to augment the current set. The reuse can be achieved by the partitioner returning the current partitions unchanged. This will result in the current partitions continuing to run untouched.</li>
+<li>In the case of dynamic partitioning, as mentioned earlier, a stats listener is also needed to determine when to re-partition. As with the <em>Partitioner</em> interface, the operator can implement the <em>StatsListener</em> interface itself to provide a stats listener implementation that is used automatically.</li>
+<li>The <em>StatsListener</em> has access to all operator statistics to make its decision on partitioning. Apart from the statistics that the platform computes for the operators, such as throughput and latency, operator developers can include their own business metrics by using the AutoMetric feature.</li>
+<li>If the operator is not partitionable, mark it as such with the <em>OperatorAnnotation</em> annotation, setting its <em>partitionable</em> element to false.</li>
+</ul>
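+<p>The following sketch shows the shape of a custom partitioner built into an input-style operator, as suggested in the first guideline above. It starts the operator with a fixed number of partitions, each carrying a distinct piece of initial state; all names are illustrative. For operators with input ports the partitioner would additionally assign <em>PartitionKeys</em>, for example through <em>DefaultPartition.assignPartitionKeys</em>.</p>
+<pre><code>import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+
+import com.datatorrent.api.DefaultPartition;
+import com.datatorrent.api.Partitioner;
+import com.datatorrent.common.util.BaseOperator;
+
+// Illustrative operator that acts as its own partitioner and starts with a fixed
+// number of partitions, each configured with a distinct slice of the input source.
+public class RangeReaderOperator extends BaseOperator implements Partitioner&lt;RangeReaderOperator&gt;
+{
+  private int partitionCount = 2;  // desired number of partitions
+  private int sliceId;             // distinguishes this partition's share of the data
+
+  @Override
+  public Collection&lt;Partition&lt;RangeReaderOperator&gt;&gt; definePartitions(
+      Collection&lt;Partition&lt;RangeReaderOperator&gt;&gt; partitions, PartitioningContext context)
+  {
+    // keep this lightweight; expensive work belongs in partitioned()
+    List&lt;Partition&lt;RangeReaderOperator&gt;&gt; newPartitions =
+        new ArrayList&lt;Partition&lt;RangeReaderOperator&gt;&gt;(partitionCount);
+    for (int i = 0; i &lt; partitionCount; i++) {
+      RangeReaderOperator instance = new RangeReaderOperator();
+      instance.partitionCount = partitionCount;
+      instance.sliceId = i;  // distinct initial state per partition
+      newPartitions.add(new DefaultPartition&lt;RangeReaderOperator&gt;(instance));
+    }
+    return newPartitions;
+  }
+
+  @Override
+  public void partitioned(Map&lt;Integer, Partition&lt;RangeReaderOperator&gt;&gt; partitions)
+  {
+    // called once resources are guaranteed; perform any expensive initialization here
+  }
+}
+</code></pre>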
+<h3 id="streamcodecs">StreamCodecs</h3>
+<p>A <strong>StreamCodec</strong> is used in partitioning to distribute the data tuples among the partitions. The <em>StreamCodec</em> computes an integer hashcode for a data tuple and this is used along with <em>PartitionKeys</em> mapping to determine which partition or partitions receive the data tuple. If a <em>StreamCodec</em> is not specified, then a default one is used by the platform which returns the JAVA hashcode of the tuple. </p>
+<p><em>StreamCodec</em> is also useful in another aspect of the application: it is used to serialize and deserialize the tuple to transfer it between operators. The default <em>StreamCodec</em> uses the Kryo library for serialization.</p>
+<p>The following guidelines are useful when considering a custom <em>StreamCodec</em>:</p>
+<ul>
+<li>A custom <em>StreamCodec</em> is needed if the tuples need to be distributed based on a criterion different from the hashcode of the tuple. If the correct working of an operator depends on the data from the upstream operator being distributed using a custom criterion, such as being sticky on a "key" field within the tuple, then the operator developer should provide a custom <em>StreamCodec</em> that implements this criterion. The operator should also return this custom codec in the <strong>getStreamCodec</strong> method of the input port (see the sketch after this list).</li>
+<li>When implementing a custom <em>StreamCodec</em> for the purpose of using a different criterion to distribute the tuples, the codec can extend an existing <em>StreamCodec</em> and only override the partition-hash computation, so that it does not have to re-implement the serialization and deserialization functionality. The Apex platform provides two pre-built <em>StreamCodec</em> implementations for this purpose: <strong>KryoSerializableStreamCodec</strong>, which uses Kryo for serialization, and <strong>JavaSerializationStreamCodec</strong>, which uses JAVA serialization.</li>
+<li>Different <em>StreamCodec</em> implementations can be used for the different inputs connected to a stream when a different criterion for distributing the tuples is desired for each input.</li>
+</ul>
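+<p>A sketch of a key-sticky codec following the guidelines above is shown below. It assumes the partition hash is supplied by the codec's <code>getPartition</code> method and that <code>KeyedTuple</code> is a hypothetical tuple class with a <code>getKey()</code> accessor.</p>
+<pre><code>import com.datatorrent.lib.codec.KryoSerializableStreamCodec;
+
+// Illustrative codec that keeps tuples with the same key sticky to the same partition,
+// while inheriting Kryo based serialization from the base class.
+public class KeyStreamCodec extends KryoSerializableStreamCodec&lt;KeyedTuple&gt;
+{
+  @Override
+  public int getPartition(KeyedTuple tuple)
+  {
+    // distribute by key instead of the tuple's own hashcode
+    return tuple.getKey().hashCode();
+  }
+}
+</code></pre>
+<p>The downstream operator would return this codec from its input port, for example:</p>
+<pre><code>public final transient DefaultInputPort&lt;KeyedTuple&gt; input = new DefaultInputPort&lt;KeyedTuple&gt;()
+{
+  @Override
+  public void process(KeyedTuple tuple)
+  {
+    // processing logic
+  }
+
+  @Override
+  public StreamCodec&lt;KeyedTuple&gt; getStreamCodec()
+  {
+    return new KeyStreamCodec();
+  }
+};
+</code></pre>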
+<h2 id="threads">Threads</h2>
+<p>The operator lifecycle methods such as <strong>setup</strong>, <strong>beginWindow</strong>, <strong>endWindow</strong> and <strong>process</strong> in <em>InputPorts</em> are all called by the platform from a single operator lifecycle thread, so the user does not have to worry about the issues arising from multi-threaded code. Use of separate threads in an operator is discouraged: in most cases the motivation is parallelism, which can already be achieved by using multiple partitions, and mistakes are easily made when writing multi-threaded code. When dealing with high-volume, high-velocity data, the corner cases of incorrectly written multi-threaded code are exposed more easily. However, there are times when separate threads are needed; for example, when interacting with external systems the delay in retrieving or sending data can occasionally be large, blocking the operator and other DAG processing such as committed windows. In these cases the following guidelines must be followed strictly (a sketch follows this list).</p>
+<ul>
+<li>Threads should be started in <strong>activate</strong> and stopped in <strong>deactivate</strong>. In <em>deactivate</em> the operator should wait until any threads it launched have finished execution, by calling <strong>join</strong> on the threads or, if using an <strong>ExecutorService</strong>, calling <strong>awaitTermination</strong> on the service.</li>
+<li>Threads should not call any methods on the ports directly as this can cause concurrency exceptions and also result in invalid states.</li>
+<li>Threads can share state with the lifecycle methods using data structures that are either explicitly protected by synchronization or are inherently thread safe such as thread safe queues.</li>
+<li>If this shared state needs to be protected against failure, it needs to be persisted during checkpoint. For a consistent checkpoint, the state must not be modified by the thread while it is being serialized and saved by the operator lifecycle thread. Since checkpointing happens outside the window boundary, the thread should be quiesced between <strong>endWindow</strong> and <strong>beginWindow</strong>, or more efficiently between the pre-checkpoint and checkpointed callbacks.</li>
+</ul>
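+<p>The following sketch pulls the above guidelines together for an input operator that fetches data on a separate thread; the class name and fetch logic are illustrative.</p>
+<pre><code>import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
+
+import com.datatorrent.api.Context.OperatorContext;
+import com.datatorrent.api.DefaultOutputPort;
+import com.datatorrent.api.InputOperator;
+import com.datatorrent.api.Operator.ActivationListener;
+import com.datatorrent.common.util.BaseOperator;
+
+// Illustrative input operator: a fetch thread hands data to the lifecycle thread
+// through a thread-safe queue and never touches the ports itself.
+public class ThreadedFetchOperator extends BaseOperator
+    implements InputOperator, ActivationListener&lt;OperatorContext&gt;
+{
+  public final transient DefaultOutputPort&lt;String&gt; output = new DefaultOutputPort&lt;String&gt;();
+
+  // state shared between the fetch thread and the lifecycle thread
+  private final transient LinkedBlockingQueue&lt;String&gt; queue = new LinkedBlockingQueue&lt;String&gt;();
+  private transient ExecutorService executor;
+  private transient volatile boolean running;
+
+  @Override
+  public void activate(OperatorContext context)
+  {
+    running = true;
+    executor = Executors.newSingleThreadExecutor();
+    executor.submit(new Runnable()
+    {
+      @Override
+      public void run()
+      {
+        while (running) {
+          // blocking fetch from the external system goes here; this sketch just idles
+          try {
+            Thread.sleep(100);
+          } catch (InterruptedException e) {
+            Thread.currentThread().interrupt();
+            return;
+          }
+          queue.offer("fetched-record");
+        }
+      }
+    });
+  }
+
+  @Override
+  public void deactivate()
+  {
+    running = false;
+    executor.shutdown();
+    try {
+      executor.awaitTermination(10, TimeUnit.SECONDS);  // wait for the fetch thread to finish
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+    }
+  }
+
+  @Override
+  public void emitTuples()
+  {
+    // only the lifecycle thread emits on ports
+    String tuple;
+    while ((tuple = queue.poll()) != null) {
+      output.emit(tuple);
+    }
+  }
+}
+</code></pre>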
+              
+            </div>
+          </div>
+          <footer>
+  
+    <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
+      
+        <a href="../apex_cli/" class="btn btn-neutral float-right" title="Apex CLI">Next <span class="icon icon-circle-arrow-right"></span></a>
+      
+      
+        <a href="../autometrics/" class="btn btn-neutral" title="AutoMetric API"><span class="icon icon-circle-arrow-left"></span> Previous</a>
+      
+    </div>
+  
+
+  <hr/>
+
+  <div role="contentinfo">
+    <!-- Copyright etc -->
+    
+  </div>
+
+  Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
+</footer>
+	  
+        </div>
+      </div>
+
+    </section>
+
+  </div>
+
+<div class="rst-versions" role="note" style="cursor: pointer">
+    <span class="rst-current-version" data-toggle="rst-current-version">
+      
+      
+        <span><a href="../autometrics/" style="color: #fcfcfc;">&laquo; Previous</a></span>
+      
+      
+        <span style="margin-left: 15px"><a href="../apex_cli/" style="color: #fcfcfc">Next &raquo;</a></span>
+      
+    </span>
+</div>
+
+</body>
+</html>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/images/security/image03.png
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/images/security/image03.png b/docs/apex-3.4/images/security/image03.png
new file mode 100755
index 0000000..175feb8
Binary files /dev/null and b/docs/apex-3.4/images/security/image03.png differ

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/index.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/index.html b/docs/apex-3.4/index.html
index c944ecc..56fcb03 100644
--- a/docs/apex-3.4/index.html
+++ b/docs/apex-3.4/index.html
@@ -109,6 +109,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -184,7 +191,7 @@
 <li>Simple API supports generic Java code</li>
 </ul>
 <p>Platform has been demonstated to scale linearly across Hadoop clusters under extreme loads of billions of events per second.  Hardware and process failures are quickly recovered with HDFS-backed checkpointing and automatic operator recovery, preserving application state and resuming execution in seconds.  Functional and operational specifications are separated.  Apex provides a simple API, which enables users to write generic, reusable code.  The code is dropped in as-is and platform automatically handles the various operational concerns, such as state management, fault tolerance, scalability, security, metrics, etc.  This frees users to focus on functional development, and lets platform provide operability support.</p>
-<p>The core Apex platform is supplemented by Malhar, a library of connector and logic functions, enabling rapid application development.  These operators and modules provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySql, Cassandra, MongoDB, Redis, HBase, CouchDB, generic JDBC, and other database connectors.  In addition to the operators, the library contains a number of demos applications, demonstrating operator features and capabilities.  To see the full list of available operators and related documentation, visit <a href="https://github.com/apache/incubator-apex-malhar">Apex Malhar on Github</a></p>
+<p>The core Apex platform is supplemented by Malhar, a library of connector and logic functions, enabling rapid application development.  These operators and modules provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySql, Cassandra, MongoDB, Redis, HBase, CouchDB, generic JDBC, and other database connectors.  In addition to the operators, the library contains a number of demos applications, demonstrating operator features and capabilities.  To see the full list of available operators and related documentation, visit <a href="https://github.com/apache/apex-malhar">Apex Malhar on Github</a></p>
 <p>For additional information visit <a href="http://apex.apache.org/">Apache Apex</a>.</p>
 <p><a href="http://apex.apache.org/"><img alt="" src="./favicon.ico" /></a></p>
               
@@ -232,5 +239,5 @@
 
 <!--
 MkDocs version : 0.15.3
-Build Date UTC : 2016-05-13 22:25:11.258707
+Build Date UTC : 2016-09-07 01:53:39.631895
 -->

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/license/highlight.js/LICENSE
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/license/highlight.js/LICENSE b/docs/apex-3.4/license/highlight.js/LICENSE
deleted file mode 100644
index 422deb7..0000000
--- a/docs/apex-3.4/license/highlight.js/LICENSE
+++ /dev/null
@@ -1,24 +0,0 @@
-Copyright (c) 2006, Ivan Sagalaev
-All rights reserved.
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-    * Neither the name of highlight.js nor the names of its contributors 
-      may be used to endorse or promote products derived from this software 
-      without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND ANY
-EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE REGENTS AND CONTRIBUTORS BE LIABLE FOR ANY
-DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/main.html
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/main.html b/docs/apex-3.4/main.html
new file mode 100644
index 0000000..79c9f4e
--- /dev/null
+++ b/docs/apex-3.4/main.html
@@ -0,0 +1,10 @@
+{% extends "base.html" %}
+
+{#
+The entry point for the ReadTheDocs Theme.
+ 
+Any theme customisations should override this file to redefine blocks defined in
+the various templates. The custom theme should only need to define a main.html
+which `{% extends "base.html" %}` and defines various blocks which will replace
+the blocks defined in base.html and its included child templates.
+#}

http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js b/docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js
new file mode 100644
index 0000000..b72449a
--- /dev/null
+++ b/docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js
@@ -0,0 +1,7 @@
+/**
+ * lunr - http://lunrjs.com - A bit like Solr, but much smaller and not as bright - 0.5.7
+ * Copyright (C) 2014 Oliver Nightingale
+ * MIT Licensed
+ * @license
+ */
+!function(){var t=function(e){var n=new t.Index;return n.pipeline.add(t.trimmer,t.stopWordFilter,t.stemmer),e&&e.call(n,n),n};t.version="0.5.7",t.utils={},t.utils.warn=function(t){return function(e){t.console&&console.warn&&console.warn(e)}}(this),t.EventEmitter=function(){this.events={}},t.EventEmitter.prototype.addListener=function(){var t=Array.prototype.slice.call(arguments),e=t.pop(),n=t;if("function"!=typeof e)throw new TypeError("last argument must be a function");n.forEach(function(t){this.hasHandler(t)||(this.events[t]=[]),this.events[t].push(e)},this)},t.EventEmitter.prototype.removeListener=function(t,e){if(this.hasHandler(t)){var n=this.events[t].indexOf(e);this.events[t].splice(n,1),this.events[t].length||delete this.events[t]}},t.EventEmitter.prototype.emit=function(t){if(this.hasHandler(t)){var e=Array.prototype.slice.call(arguments,1);this.events[t].forEach(function(t){t.apply(void 0,e)})}},t.EventEmitter.prototype.hasHandler=function(t){return t in this.events},t.to
 kenizer=function(t){if(!arguments.length||null==t||void 0==t)return[];if(Array.isArray(t))return t.map(function(t){return t.toLowerCase()});for(var e=t.toString().replace(/^\s+/,""),n=e.length-1;n>=0;n--)if(/\S/.test(e.charAt(n))){e=e.substring(0,n+1);break}return e.split(/(?:\s+|\-)/).filter(function(t){return!!t}).map(function(t){return t.toLowerCase()})},t.Pipeline=function(){this._stack=[]},t.Pipeline.registeredFunctions={},t.Pipeline.registerFunction=function(e,n){n in this.registeredFunctions&&t.utils.warn("Overwriting existing registered function: "+n),e.label=n,t.Pipeline.registeredFunctions[e.label]=e},t.Pipeline.warnIfFunctionNotRegistered=function(e){var n=e.label&&e.label in this.registeredFunctions;n||t.utils.warn("Function is not registered with pipeline. This may cause problems when serialising the index.\n",e)},t.Pipeline.load=function(e){var n=new t.Pipeline;return e.forEach(function(e){var i=t.Pipeline.registeredFunctions[e];if(!i)throw new Error("Cannot load un-re
 gistered function: "+e);n.add(i)}),n},t.Pipeline.prototype.add=function(){var e=Array.prototype.slice.call(arguments);e.forEach(function(e){t.Pipeline.warnIfFunctionNotRegistered(e),this._stack.push(e)},this)},t.Pipeline.prototype.after=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var i=this._stack.indexOf(e)+1;this._stack.splice(i,0,n)},t.Pipeline.prototype.before=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var i=this._stack.indexOf(e);this._stack.splice(i,0,n)},t.Pipeline.prototype.remove=function(t){var e=this._stack.indexOf(t);this._stack.splice(e,1)},t.Pipeline.prototype.run=function(t){for(var e=[],n=t.length,i=this._stack.length,o=0;n>o;o++){for(var r=t[o],s=0;i>s&&(r=this._stack[s](r,o,t),void 0!==r);s++);void 0!==r&&e.push(r)}return e},t.Pipeline.prototype.reset=function(){this._stack=[]},t.Pipeline.prototype.toJSON=function(){return this._stack.map(function(e){return t.Pipeline.warnIfFunctionNotRegistered(e),e.label})},t.Vector=function(){this._magni
 tude=null,this.list=void 0,this.length=0},t.Vector.Node=function(t,e,n){this.idx=t,this.val=e,this.next=n},t.Vector.prototype.insert=function(e,n){var i=this.list;if(!i)return this.list=new t.Vector.Node(e,n,i),this.length++;for(var o=i,r=i.next;void 0!=r;){if(e<r.idx)return o.next=new t.Vector.Node(e,n,r),this.length++;o=r,r=r.next}return o.next=new t.Vector.Node(e,n,r),this.length++},t.Vector.prototype.magnitude=function(){if(this._magniture)return this._magnitude;for(var t,e=this.list,n=0;e;)t=e.val,n+=t*t,e=e.next;return this._magnitude=Math.sqrt(n)},t.Vector.prototype.dot=function(t){for(var e=this.list,n=t.list,i=0;e&&n;)e.idx<n.idx?e=e.next:e.idx>n.idx?n=n.next:(i+=e.val*n.val,e=e.next,n=n.next);return i},t.Vector.prototype.similarity=function(t){return this.dot(t)/(this.magnitude()*t.magnitude())},t.SortedSet=function(){this.length=0,this.elements=[]},t.SortedSet.load=function(t){var e=new this;return e.elements=t,e.length=t.length,e},t.SortedSet.prototype.add=function(){Arr
 ay.prototype.slice.call(arguments).forEach(function(t){~this.indexOf(t)||this.elements.splice(this.locationFor(t),0,t)},this),this.length=this.elements.length},t.SortedSet.prototype.toArray=function(){return this.elements.slice()},t.SortedSet.prototype.map=function(t,e){return this.elements.map(t,e)},t.SortedSet.prototype.forEach=function(t,e){return this.elements.forEach(t,e)},t.SortedSet.prototype.indexOf=function(t,e,n){var e=e||0,n=n||this.elements.length,i=n-e,o=e+Math.floor(i/2),r=this.elements[o];return 1>=i?r===t?o:-1:t>r?this.indexOf(t,o,n):r>t?this.indexOf(t,e,o):r===t?o:void 0},t.SortedSet.prototype.locationFor=function(t,e,n){var e=e||0,n=n||this.elements.length,i=n-e,o=e+Math.floor(i/2),r=this.elements[o];if(1>=i){if(r>t)return o;if(t>r)return o+1}return t>r?this.locationFor(t,o,n):r>t?this.locationFor(t,e,o):void 0},t.SortedSet.prototype.intersect=function(e){for(var n=new t.SortedSet,i=0,o=0,r=this.length,s=e.length,a=this.elements,h=e.elements;;){if(i>r-1||o>s-1)brea
 k;a[i]!==h[o]?a[i]<h[o]?i++:a[i]>h[o]&&o++:(n.add(a[i]),i++,o++)}return n},t.SortedSet.prototype.clone=function(){var e=new t.SortedSet;return e.elements=this.toArray(),e.length=e.elements.length,e},t.SortedSet.prototype.union=function(t){var e,n,i;return this.length>=t.length?(e=this,n=t):(e=t,n=this),i=e.clone(),i.add.apply(i,n.toArray()),i},t.SortedSet.prototype.toJSON=function(){return this.toArray()},t.Index=function(){this._fields=[],this._ref="id",this.pipeline=new t.Pipeline,this.documentStore=new t.Store,this.tokenStore=new t.TokenStore,this.corpusTokens=new t.SortedSet,this.eventEmitter=new t.EventEmitter,this._idfCache={},this.on("add","remove","update",function(){this._idfCache={}}.bind(this))},t.Index.prototype.on=function(){var t=Array.prototype.slice.call(arguments);return this.eventEmitter.addListener.apply(this.eventEmitter,t)},t.Index.prototype.off=function(t,e){return this.eventEmitter.removeListener(t,e)},t.Index.load=function(e){e.version!==t.version&&t.utils.wa
 rn("version mismatch: current "+t.version+" importing "+e.version);var n=new this;return n._fields=e.fields,n._ref=e.ref,n.documentStore=t.Store.load(e.documentStore),n.tokenStore=t.TokenStore.load(e.tokenStore),n.corpusTokens=t.SortedSet.load(e.corpusTokens),n.pipeline=t.Pipeline.load(e.pipeline),n},t.Index.prototype.field=function(t,e){var e=e||{},n={name:t,boost:e.boost||1};return this._fields.push(n),this},t.Index.prototype.ref=function(t){return this._ref=t,this},t.Index.prototype.add=function(e,n){var i={},o=new t.SortedSet,r=e[this._ref],n=void 0===n?!0:n;this._fields.forEach(function(n){var r=this.pipeline.run(t.tokenizer(e[n.name]));i[n.name]=r,t.SortedSet.prototype.add.apply(o,r)},this),this.documentStore.set(r,o),t.SortedSet.prototype.add.apply(this.corpusTokens,o.toArray());for(var s=0;s<o.length;s++){var a=o.elements[s],h=this._fields.reduce(function(t,e){var n=i[e.name].length;if(!n)return t;var o=i[e.name].filter(function(t){return t===a}).length;return t+o/n*e.boost}
 ,0);this.tokenStore.add(a,{ref:r,tf:h})}n&&this.eventEmitter.emit("add",e,this)},t.Index.prototype.remove=function(t,e){var n=t[this._ref],e=void 0===e?!0:e;if(this.documentStore.has(n)){var i=this.documentStore.get(n);this.documentStore.remove(n),i.forEach(function(t){this.tokenStore.remove(t,n)},this),e&&this.eventEmitter.emit("remove",t,this)}},t.Index.prototype.update=function(t,e){var e=void 0===e?!0:e;this.remove(t,!1),this.add(t,!1),e&&this.eventEmitter.emit("update",t,this)},t.Index.prototype.idf=function(t){var e="@"+t;if(Object.prototype.hasOwnProperty.call(this._idfCache,e))return this._idfCache[e];var n=this.tokenStore.count(t),i=1;return n>0&&(i=1+Math.log(this.tokenStore.length/n)),this._idfCache[e]=i},t.Index.prototype.search=function(e){var n=this.pipeline.run(t.tokenizer(e)),i=new t.Vector,o=[],r=this._fields.reduce(function(t,e){return t+e.boost},0),s=n.some(function(t){return this.tokenStore.has(t)},this);if(!s)return[];n.forEach(function(e,n,s){var a=1/s.length*t
 his._fields.length*r,h=this,u=this.tokenStore.expand(e).reduce(function(n,o){var r=h.corpusTokens.indexOf(o),s=h.idf(o),u=1,l=new t.SortedSet;if(o!==e){var c=Math.max(3,o.length-e.length);u=1/Math.log(c)}return r>-1&&i.insert(r,a*s*u),Object.keys(h.tokenStore.get(o)).forEach(function(t){l.add(t)}),n.union(l)},new t.SortedSet);o.push(u)},this);var a=o.reduce(function(t,e){return t.intersect(e)});return a.map(function(t){return{ref:t,score:i.similarity(this.documentVector(t))}},this).sort(function(t,e){return e.score-t.score})},t.Index.prototype.documentVector=function(e){for(var n=this.documentStore.get(e),i=n.length,o=new t.Vector,r=0;i>r;r++){var s=n.elements[r],a=this.tokenStore.get(s)[e].tf,h=this.idf(s);o.insert(this.corpusTokens.indexOf(s),a*h)}return o},t.Index.prototype.toJSON=function(){return{version:t.version,fields:this._fields,ref:this._ref,documentStore:this.documentStore.toJSON(),tokenStore:this.tokenStore.toJSON(),corpusTokens:this.corpusTokens.toJSON(),pipeline:this.
 pipeline.toJSON()}},t.Index.prototype.use=function(t){var e=Array.prototype.slice.call(arguments,1);e.unshift(this),t.apply(this,e)},t.Store=function(){this.store={},this.length=0},t.Store.load=function(e){var n=new this;return n.length=e.length,n.store=Object.keys(e.store).reduce(function(n,i){return n[i]=t.SortedSet.load(e.store[i]),n},{}),n},t.Store.prototype.set=function(t,e){this.has(t)||this.length++,this.store[t]=e},t.Store.prototype.get=function(t){return this.store[t]},t.Store.prototype.has=function(t){return t in this.store},t.Store.prototype.remove=function(t){this.has(t)&&(delete this.store[t],this.length--)},t.Store.prototype.toJSON=function(){return{store:this.store,length:this.length}},t.stemmer=function(){var t={ational:"ate",tional:"tion",enci:"ence",anci:"ance",izer:"ize",bli:"ble",alli:"al",entli:"ent",eli:"e",ousli:"ous",ization:"ize",ation:"ate",ator:"ate",alism:"al",iveness:"ive",fulness:"ful",ousness:"ous",aliti:"al",iviti:"ive",biliti:"ble",logi:"log"},e={ica
 te:"ic",ative:"",alize:"al",iciti:"ic",ical:"ic",ful:"",ness:""},n="[^aeiou]",i="[aeiouy]",o=n+"[^aeiouy]*",r=i+"[aeiou]*",s="^("+o+")?"+r+o,a="^("+o+")?"+r+o+"("+r+")?$",h="^("+o+")?"+r+o+r+o,u="^("+o+")?"+i,l=new RegExp(s),c=new RegExp(h),p=new RegExp(a),f=new RegExp(u),d=/^(.+?)(ss|i)es$/,v=/^(.+?)([^s])s$/,m=/^(.+?)eed$/,g=/^(.+?)(ed|ing)$/,y=/.$/,S=/(at|bl|iz)$/,w=new RegExp("([^aeiouylsz])\\1$"),x=new RegExp("^"+o+i+"[^aeiouwxy]$"),k=/^(.+?[^aeiou])y$/,b=/^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/,E=/^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/,_=/^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/,O=/^(.+?)(s|t)(ion)$/,F=/^(.+?)e$/,P=/ll$/,T=new RegExp("^"+o+i+"[^aeiouwxy]$"),$=function(n){var i,o,r,s,a,h,u;if(n.length<3)return n;if(r=n.substr(0,1),"y"==r&&(n=r.toUpperCase()+n.substr(1)),s=d,a=v,s.test(n)?n=n.replace(s,"$1$2"):a.test(n)&&(n=n.repl
 ace(a,"$1$2")),s=m,a=g,s.test(n)){var $=s.exec(n);s=l,s.test($[1])&&(s=y,n=n.replace(s,""))}else if(a.test(n)){var $=a.exec(n);i=$[1],a=f,a.test(i)&&(n=i,a=S,h=w,u=x,a.test(n)?n+="e":h.test(n)?(s=y,n=n.replace(s,"")):u.test(n)&&(n+="e"))}if(s=k,s.test(n)){var $=s.exec(n);i=$[1],n=i+"i"}if(s=b,s.test(n)){var $=s.exec(n);i=$[1],o=$[2],s=l,s.test(i)&&(n=i+t[o])}if(s=E,s.test(n)){var $=s.exec(n);i=$[1],o=$[2],s=l,s.test(i)&&(n=i+e[o])}if(s=_,a=O,s.test(n)){var $=s.exec(n);i=$[1],s=c,s.test(i)&&(n=i)}else if(a.test(n)){var $=a.exec(n);i=$[1]+$[2],a=c,a.test(i)&&(n=i)}if(s=F,s.test(n)){var $=s.exec(n);i=$[1],s=c,a=p,h=T,(s.test(i)||a.test(i)&&!h.test(i))&&(n=i)}return s=P,a=c,s.test(n)&&a.test(n)&&(s=y,n=n.replace(s,"")),"y"==r&&(n=r.toLowerCase()+n.substr(1)),n};return $}(),t.Pipeline.registerFunction(t.stemmer,"stemmer"),t.stopWordFilter=function(e){return-1===t.stopWordFilter.stopWords.indexOf(e)?e:void 0},t.stopWordFilter.stopWords=new t.SortedSet,t.stopWordFilter.stopWords.length=119
 ,t.stopWordFilter.stopWords.elements=["","a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"],t.Pipeline.registerFunction(t.stopWordFilter,"stopWordFilter"),t.trimmer=function(t){return t.replace(/^\W+/,"").replace(/\W+$/,"")},t.Pipeline.registerFunction(t.tr
 immer,"trimmer"),t.TokenStore=function(){this.root={docs:{}},this.length=0},t.TokenStore.load=function(t){var e=new this;return e.root=t.root,e.length=t.length,e},t.TokenStore.prototype.add=function(t,e,n){var n=n||this.root,i=t[0],o=t.slice(1);return i in n||(n[i]={docs:{}}),0===o.length?(n[i].docs[e.ref]=e,void(this.length+=1)):this.add(o,e,n[i])},t.TokenStore.prototype.has=function(t){if(!t)return!1;for(var e=this.root,n=0;n<t.length;n++){if(!e[t[n]])return!1;e=e[t[n]]}return!0},t.TokenStore.prototype.getNode=function(t){if(!t)return{};for(var e=this.root,n=0;n<t.length;n++){if(!e[t[n]])return{};e=e[t[n]]}return e},t.TokenStore.prototype.get=function(t,e){return this.getNode(t,e).docs||{}},t.TokenStore.prototype.count=function(t,e){return Object.keys(this.get(t,e)).length},t.TokenStore.prototype.remove=function(t,e){if(t){for(var n=this.root,i=0;i<t.length;i++){if(!(t[i]in n))return;n=n[t[i]]}delete n.docs[e]}},t.TokenStore.prototype.expand=function(t,e){var n=this.getNode(t),i=n
 .docs||{},e=e||[];return Object.keys(i).length&&e.push(t),Object.keys(n).forEach(function(n){"docs"!==n&&e.concat(this.expand(t+n,e))},this),e},t.TokenStore.prototype.toJSON=function(){return{root:this.root,length:this.length}},function(t,e){"function"==typeof define&&define.amd?define(e):"object"==typeof exports?module.exports=e():t.lunr=e()}(this,function(){return t})}();


[6/6] apex-site git commit: from c3a284ba04d860705af016afe3348f0e523f48c1

Posted by th...@apache.org.
from c3a284ba04d860705af016afe3348f0e523f48c1


Project: http://git-wip-us.apache.org/repos/asf/apex-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/apex-site/commit/d396fa83
Tree: http://git-wip-us.apache.org/repos/asf/apex-site/tree/d396fa83
Diff: http://git-wip-us.apache.org/repos/asf/apex-site/diff/d396fa83

Branch: refs/heads/asf-site
Commit: d396fa83ba89d2d98679a262f7c2cd1fb9b1f883
Parents: 21e76a0
Author: Thomas Weise <th...@datatorrent.com>
Authored: Tue Sep 6 19:07:05 2016 -0700
Committer: Thomas Weise <th...@datatorrent.com>
Committed: Tue Sep 6 19:07:05 2016 -0700

----------------------------------------------------------------------
 content/community.html                          |   2 +-
 content/docs/apex-3.4/__init__.pyc              | Bin 166 -> 163 bytes
 content/docs/apex-3.4/apex_cli/index.html       |  11 +-
 .../apex-3.4/apex_development_setup/index.html  |  17 +-
 .../apex-3.4/application_development/index.html |  15 +-
 .../apex-3.4/application_packages/index.html    |   7 +
 content/docs/apex-3.4/autometrics/index.html    |  13 +-
 content/docs/apex-3.4/compatibility/index.html  |   7 +
 .../development_best_practices/index.html       | 376 +++++++++++++++++++
 .../docs/apex-3.4/images/security/image03.png   | Bin 0 -> 18677 bytes
 content/docs/apex-3.4/index.html                |  11 +-
 content/docs/apex-3.4/main.html                 |  10 +
 .../docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js   |   7 +
 content/docs/apex-3.4/mkdocs/search_index.json  | 126 ++++++-
 .../apex-3.4/operator_development/index.html    |   9 +-
 content/docs/apex-3.4/search.html               |   7 +
 content/docs/apex-3.4/security/index.html       | 129 +++++--
 content/docs/apex-3.4/sitemap.xml               |  24 +-
 content/malhar-contributing.html                |   2 +-
 19 files changed, 707 insertions(+), 66 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/community.html
----------------------------------------------------------------------
diff --git a/content/community.html b/content/community.html
index 47fd341..4f4bbe3 100644
--- a/content/community.html
+++ b/content/community.html
@@ -90,7 +90,7 @@
 <h2 id="powered-by-apex">Powered By Apex</h2>
 <p>List of organizations using Apache Apex is available at <a href="/powered-by-apex.html">Powered by Apex</a>.</p>
 <h1 id="contributing">Contributing</h1>
-<p>Looking for ideas to get involved? Please see <a href="https://issues.apache.org/jira/issues/?jql=project%20in%20%28APEXCORE%2C%20APEXMALHAR%29%20and%20labels%20%3D%20newbie">JIRA tickets for newcomers</a> and pick a ticket. Please also sign up to the dev mailing list and JIRA. </p>
+<p>Looking for ideas to get involved? Please see <a href="https://issues.apache.org/jira/issues/?jql=project%20in%20%28APEXCORE%2C%20APEXMALHAR%29%20and%20resolution%20%3D%20Unresolved%20and%20labels%20%3D%20newbie">JIRA tickets for newcomers</a> and pick a ticket. Please also sign up to the dev mailing list and JIRA. </p>
 <p><strong>To learn more about contributing to the project, <a href="/contributing.html">check out the contributing guidelines</a>.</strong></p>
 <p>The Apex Project is made up of two repositories:</p>
 <ul>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/__init__.pyc
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/__init__.pyc b/content/docs/apex-3.4/__init__.pyc
index f478a23..5d767d8 100644
Binary files a/content/docs/apex-3.4/__init__.pyc and b/content/docs/apex-3.4/__init__.pyc differ

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/apex_cli/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/apex_cli/index.html b/content/docs/apex-3.4/apex_cli/index.html
index f6c491e..c45aec1 100644
--- a/content/docs/apex-3.4/apex_cli/index.html
+++ b/content/docs/apex-3.4/apex_cli/index.html
@@ -102,6 +102,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -436,7 +443,7 @@ they must be part of the jar files that were deployed at application launch time
         <a href="../security/" class="btn btn-neutral float-right" title="Security">Next <span class="icon icon-circle-arrow-right"></span></a>
       
       
-        <a href="../autometrics/" class="btn btn-neutral" title="AutoMetric API"><span class="icon icon-circle-arrow-left"></span> Previous</a>
+        <a href="../development_best_practices/" class="btn btn-neutral" title="Best Practices"><span class="icon icon-circle-arrow-left"></span> Previous</a>
       
     </div>
   
@@ -462,7 +469,7 @@ they must be part of the jar files that were deployed at application launch time
     <span class="rst-current-version" data-toggle="rst-current-version">
       
       
-        <span><a href="../autometrics/" style="color: #fcfcfc;">&laquo; Previous</a></span>
+        <span><a href="../development_best_practices/" style="color: #fcfcfc;">&laquo; Previous</a></span>
       
       
         <span style="margin-left: 15px"><a href="../security/" style="color: #fcfcfc">Next &raquo;</a></span>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/apex_development_setup/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/apex_development_setup/index.html b/content/docs/apex-3.4/apex_development_setup/index.html
index 75a7891..1af03d1 100644
--- a/content/docs/apex-3.4/apex_development_setup/index.html
+++ b/content/docs/apex-3.4/apex_development_setup/index.html
@@ -119,6 +119,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -306,22 +313,22 @@ project properties at <em>Properties &#8658; Run/Debug Settings &#8658; Applicat
 <ol>
 <li>
 <p>Check out the source code repositories:</p>
-<pre><code>git clone https://github.com/apache/incubator-apex-core
-git clone https://github.com/apache/incubator-apex-malhar
+<pre><code>git clone https://github.com/apache/apex-core
+git clone https://github.com/apache/apex-malhar
 </code></pre>
 </li>
 <li>
 <p>Switch to the appropriate release branch and build each repository:</p>
-<pre><code>cd incubator-apex-core
+<pre><code>cd apex-core
 mvn clean install -DskipTests
 
-cd incubator-apex-malhar
+cd apex-malhar
 mvn clean install -DskipTests
 </code></pre>
 </li>
 </ol>
 <p>The <code>install</code> argument to the <code>mvn</code> command installs resources from each project to your local maven repository (typically <code>.m2/repository</code> under your home directory), and <strong>not</strong> to the system directories, so Administrator privileges are not required. The  <code>-DskipTests</code> argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.</p>
-<p>After the build completes, you should see the demo application package files in the target directory under each demo subdirectory in <code>incubator-apex-malhar/demos</code>.</p>
+<p>After the build completes, you should see the demo application package files in the target directory under each demo subdirectory in <code>apex-malhar/demos</code>.</p>
 <h2 id="sandbox">Sandbox</h2>
 <p>To jump start development with an Apache Hadoop single node cluster, <a href="https://www.datatorrent.com/download">DataTorrent Sandbox</a> powered by VirtualBox is available on Windows, Linux, or Mac platforms.  The sandbox is configured by default to run with 6GB RAM; if your development machine has 16GB or more, you can increase the sandbox RAM to 8GB or more using the VirtualBox console.  This will yield better performance and support larger applications.  The advantage of developing in the sandbox is that most of the tools (e.g. <em>jdk</em>, <em>git</em>, <em>maven</em>), Hadoop YARN and HDFS, and a distribution of Apache Apex and DataTorrent RTS are pre-installed.  The disadvantage is that the sandbox is a memory-limited environment, and requires settings changes and restarts to adjust memory available for development and testing.</p>
               

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/application_development/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/application_development/index.html b/content/docs/apex-3.4/application_development/index.html
index 8c8f184..20d8e2e 100644
--- a/content/docs/apex-3.4/application_development/index.html
+++ b/content/docs/apex-3.4/application_development/index.html
@@ -187,6 +187,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -278,7 +285,7 @@ operators to the <a href="../operator_development/">Operator Development Guide</
 <h1 id="running-a-test-application">Running A Test Application</h1>
 <p>If you are starting with the Apex platform for the first time,
 it can be informative to launch an existing application and see it run.
-One of the simplest examples provided in <a href="https://github.com/apache/incubator-apex-malhar">Apex-Malhar repository</a> is a Pi demo application,
+One of the simplest examples provided in <a href="https://github.com/apache/apex-malhar">Apex-Malhar repository</a> is a Pi demo application,
 which computes the value of PI using random numbers.  After <a href="../apex_development_setup/">setting up development environment</a>
 Pi demo can be launched as follows:</p>
 <ol>
@@ -907,7 +914,7 @@ project name \u201cMalhar\u201d as part of our efforts to foster community
 innovation. These operators can be used in a DAG as is, while others
 have properties�that can be set to specify the
 desired computation. Those interested in details, should refer to
-<a href="https://github.com/apache/incubator-apex-malhar">Apex-Malhar operator library</a>.</p>
+<a href="https://github.com/apache/apex-malhar">Apex-Malhar operator library</a>.</p>
 <p>The platform is a Hadoop YARN native
 application. It runs in a Hadoop cluster just like any
 other YARN application (MapReduce etc.) and is designed to seamlessly
@@ -1281,7 +1288,7 @@ DAG in local mode within the IDE.</p>
 <li>The <code>operators</code> field is the list of operators the application has. You can specifiy the name, the Java class, and the properties of each operator here.</li>
 <li>The <code>streams</code> field is the list of streams that connects the operators together to form the DAG. Each stream consists of the stream name, the operator and port that it connects from, and the list of operators and ports that it connects to. Note that you can connect from <em>one</em> output port of an operator to <em>multiple</em> different input ports of different operators.</li>
 </ul>
-<p>In Apex Malhar, there is an <a href="https://github.com/apache/incubator-apex-malhar/blob/master/demos/pi/src/main/resources/app/PiJsonDemo.json">example</a> in the Pi Demo doing just that.</p>
+<p>In Apex Malhar, there is an <a href="https://github.com/apache/apex-malhar/blob/master/demos/pi/src/main/resources/app/PiJsonDemo.json">example</a> in the Pi Demo doing just that.</p>
 <h3 id="properties-file-dag-specification">Properties File DAG Specification</h3>
 <p>The platform also supports specification of a DAG via a properties
 file. The aim here to make it easy for tools to create and run an
@@ -2625,7 +2632,7 @@ details refer to  <a href="http://docs.datatorrent.com/configuration/">Configura
 <hr />
 <h1 id="demos">Demos</h1>
 <p>The source code for the demos is available in the open-source
-<a href="https://github.com/apache/incubator-apex-malhar">Apache Apex-Malhar repository</a>.
+<a href="https://github.com/apache/apex-malhar">Apache Apex-Malhar repository</a>.
 All of these do computations in real-time. Developers are encouraged to
 review them as they use various features of the platform and provide an
 opportunity for quick learning.</p>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/application_packages/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/application_packages/index.html b/content/docs/apex-3.4/application_packages/index.html
index 654c764..d4aff60 100644
--- a/content/docs/apex-3.4/application_packages/index.html
+++ b/content/docs/apex-3.4/application_packages/index.html
@@ -129,6 +129,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/autometrics/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/autometrics/index.html b/content/docs/apex-3.4/autometrics/index.html
index 5d01dec..4712619 100644
--- a/content/docs/apex-3.4/autometrics/index.html
+++ b/content/docs/apex-3.4/autometrics/index.html
@@ -128,6 +128,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -234,7 +241,7 @@
 <p>When an operator is partitioned, it is useful to aggregate the values of auto-metrics across all its partitions every window to get a logical view of these metrics. The application master performs these aggregations using metrics aggregators.</p>
 <p>The AutoMetric API helps to achieve this by providing an interface for writing aggregators- <code>AutoMetric.Aggregator</code>. Any implementation of <code>AutoMetric.Aggregator</code> can be set as an operator attribute - <code>METRICS_AGGREGATOR</code> for a particular operator which in turn is used for aggregating physical metrics.</p>
 <h2 id="default-aggregators">Default aggregators</h2>
-<p><a href="https://github.com/apache/incubator-apex-core/blob/master/common/src/main/java/com/datatorrent/common/metric/MetricsAggregator.java"><code>MetricsAggregator</code></a> is a simple implementation of <code>AutoMetric.Aggregator</code> that platform uses as a default for summing up primitive types - int, long, float and double.</p>
+<p><a href="https://github.com/apache/apex-core/blob/master/common/src/main/java/com/datatorrent/common/metric/MetricsAggregator.java"><code>MetricsAggregator</code></a> is a simple implementation of <code>AutoMetric.Aggregator</code> that platform uses as a default for summing up primitive types - int, long, float and double.</p>
 <p><code>MetricsAggregator</code> is just a collection of <code>SingleMetricAggregator</code>s. There are multiple implementations of <code>SingleMetricAggregator</code> that perform sum, min, max, avg which are present in Apex core and Apex malhar.</p>
 <p>For the <code>LineReceiver</code> operator, the application developer need not specify any aggregator. The platform will automatically inject an instance of <code>MetricsAggregator</code> that contains two <code>LongSumAggregator</code>s - one for <code>length</code> and one for <code>count</code>. This aggregator will report sum of length and sum of count across all the partitions of <code>LineReceiver</code>.</p>
 <h2 id="building-custom-aggregators">Building custom aggregators</h2>
@@ -358,7 +365,7 @@
   
     <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
       
-        <a href="../apex_cli/" class="btn btn-neutral float-right" title="Apex CLI">Next <span class="icon icon-circle-arrow-right"></span></a>
+        <a href="../development_best_practices/" class="btn btn-neutral float-right" title="Best Practices">Next <span class="icon icon-circle-arrow-right"></span></a>
       
       
         <a href="../operator_development/" class="btn btn-neutral" title="Operators"><span class="icon icon-circle-arrow-left"></span> Previous</a>
@@ -390,7 +397,7 @@
         <span><a href="../operator_development/" style="color: #fcfcfc;">&laquo; Previous</a></span>
       
       
-        <span style="margin-left: 15px"><a href="../apex_cli/" style="color: #fcfcfc">Next &raquo;</a></span>
+        <span style="margin-left: 15px"><a href="../development_best_practices/" style="color: #fcfcfc">Next &raquo;</a></span>
       
     </span>
 </div>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/compatibility/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/compatibility/index.html b/content/docs/apex-3.4/compatibility/index.html
index ee9fece..9c682ee 100644
--- a/content/docs/apex-3.4/compatibility/index.html
+++ b/content/docs/apex-3.4/compatibility/index.html
@@ -102,6 +102,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/development_best_practices/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/development_best_practices/index.html b/content/docs/apex-3.4/development_best_practices/index.html
new file mode 100644
index 0000000..c2a143f
--- /dev/null
+++ b/content/docs/apex-3.4/development_best_practices/index.html
@@ -0,0 +1,376 @@
+<!DOCTYPE html>
+<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
+<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
+<head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  
+  
+  
+  <title>Best Practices - Apache Apex Documentation</title>
+  
+
+  <link rel="shortcut icon" href="../favicon.ico">
+  
+
+  
+  <link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
+
+  <link rel="stylesheet" href="../css/theme.css" type="text/css" />
+  <link rel="stylesheet" href="../css/theme_extra.css" type="text/css" />
+  <link rel="stylesheet" href="../css/highlight.css">
+
+  
+  <script>
+    // Current page data
+    var mkdocs_page_name = "Best Practices";
+    var mkdocs_page_input_path = "development_best_practices.md";
+    var mkdocs_page_url = "/development_best_practices/";
+  </script>
+  
+  <script src="../js/jquery-2.1.1.min.js"></script>
+  <script src="../js/modernizr-2.8.3.min.js"></script>
+  <script type="text/javascript" src="../js/highlight.pack.js"></script>
+  <script src="../js/theme.js"></script> 
+
+  
+</head>
+
+<body class="wy-body-for-nav" role="document">
+
+  <div class="wy-grid-for-nav">
+
+    
+    <nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
+      <div class="wy-side-nav-search">
+        <a href=".." class="icon icon-home"> Apache Apex Documentation</a>
+        <div role="search">
+  <form id ="rtd-search-form" class="wy-form" action="../search.html" method="get">
+    <input type="text" name="q" placeholder="Search docs" />
+  </form>
+</div>
+      </div>
+
+      <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
+        <ul class="current">
+          
+            <li>
+    <li class="toctree-l1 ">
+        <a class="" href="..">Apache Apex</a>
+        
+    </li>
+<li>
+          
+            <li>
+    <ul class="subnav">
+    <li><span>Development</span></li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../apex_development_setup/">Development Setup</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../application_development/">Applications</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../application_packages/">Packages</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../operator_development/">Operators</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../autometrics/">AutoMetric API</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 current">
+        <a class="current" href="./">Best Practices</a>
+        
+            <ul>
+            
+                <li class="toctree-l3"><a href="#development-best-practices">Development Best Practices</a></li>
+                
+                    <li><a class="toctree-l4" href="#operators">Operators</a></li>
+                
+                    <li><a class="toctree-l4" href="#input-operators">Input Operators</a></li>
+                
+                    <li><a class="toctree-l4" href="#output-operators">Output Operators</a></li>
+                
+                    <li><a class="toctree-l4" href="#partitioning">Partitioning</a></li>
+                
+                    <li><a class="toctree-l4" href="#threads">Threads</a></li>
+                
+            
+            </ul>
+        
+    </li>
+
+        
+    </ul>
+<li>
+          
+            <li>
+    <ul class="subnav">
+    <li><span>Operations</span></li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../apex_cli/">Apex CLI</a>
+        
+    </li>
+
+        
+            
+    <li class="toctree-l1 ">
+        <a class="" href="../security/">Security</a>
+        
+    </li>
+
+        
+    </ul>
+<li>
+          
+            <li>
+    <li class="toctree-l1 ">
+        <a class="" href="../compatibility/">Compatibility</a>
+        
+    </li>
+<li>
+          
+        </ul>
+      </div>
+      &nbsp;
+    </nav>
+
+    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
+
+      
+      <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
+        <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
+        <a href="..">Apache Apex Documentation</a>
+      </nav>
+
+      
+      <div class="wy-nav-content">
+        <div class="rst-content">
+          <div role="navigation" aria-label="breadcrumbs navigation">
+  <ul class="wy-breadcrumbs">
+    <li><a href="..">Docs</a> &raquo;</li>
+    
+      
+        
+          <li>Development &raquo;</li>
+        
+      
+    
+    <li>Best Practices</li>
+    <li class="wy-breadcrumbs-aside">
+      
+    </li>
+  </ul>
+  <hr/>
+</div>
+          <div role="main">
+            <div class="section">
+              
+                <h1 id="development-best-practices">Development Best Practices</h1>
+<p>This document describes the best practices to follow when developing operators and other application components, such as partitioners and stream codecs, on the Apache Apex platform.</p>
+<h2 id="operators">Operators</h2>
+<p>This section covers general guidelines that apply to all operators. The subsequent sections describe additional considerations for input and output operators.</p>
+<ul>
+<li>When writing a new operator to be used in an application, consider breaking it down into<ul>
+<li>An abstract operator that encompasses the core functionality but leaves application specific schemas and logic to the implementation.</li>
+<li>An optional concrete operator also in the library that extends the abstract operator and provides commonly used schema types such as strings, byte[] or POJOs.</li>
+</ul>
+</li>
+<li>Follow these conventions for the life cycle methods (a minimal skeleton illustrating them follows this list):<ul>
+<li>Do one-time initialization of entities that apply for the entire lifetime of the operator in the <strong>setup</strong> method, e.g., factory initializations. Initializations in <strong>setup</strong> are done in the container where the operator is deployed. Allocating memory for fields in the constructor is not efficient, as it leads to extra garbage in memory for the following reason. The operator is instantiated on the client from where the application is launched, serialized, and started on one of the Hadoop nodes in a container. So the constructor is first called on the client, and if it were to initialize any of the fields, that state would be saved during serialization. In the Hadoop container the operator is deserialized and started. This invokes the constructor again, which initializes the fields, but their state gets overwritten by the serialized state and the initial values become garbage in memory.</li>
+<li>Do one-time initialization of live entities in the <strong>activate</strong> method, e.g., opening connections to a database server or starting a thread for asynchronous operations. The <strong>activate</strong> method is called right before processing starts, so it is a better place for these initializations than <strong>setup</strong>, which could otherwise introduce a delay before data from the live entity is processed.</li>
+<li>Perform periodic tasks based on processing time at application window boundaries.</li>
+<li>Perform initializations needed for each application window in <strong>beginWindow</strong>.</li>
+<li>Perform aggregations needed for each application window  in <strong>endWindow</strong>.</li>
+<li>Teardown of live entities (inverse of tasks performed during activate) should be in the <strong>deactivate</strong> method.</li>
+<li>Teardown of lifetime entities (those initialized in setup method) should happen in the <strong>teardown</strong> method.</li>
+<li>If the operator implementation is not finalized, mark it with the <strong>@Evolving</strong> annotation.</li>
+</ul>
+</li>
+<li>If the operator needs to perform operations based on event time of the individual tuples and not the processing time, extend and use the <strong>WindowedOperator</strong>. Refer to documentation of that operator for details on how to use it.</li>
+<li>If an operator needs to do some work when it is not receiving any input, it should implement the <strong>IdleTimeHandler</strong> interface. This interface contains the <strong>handleIdleTime</strong> method, which is called whenever the platform isn’t doing anything else, and the operator can do the work in this method. If for any reason the operator does not have any work to do when this method is called, it should sleep for a small amount of time, such as that specified by the <strong>SPIN_MILLIS</strong> attribute, so that it does not cause a busy wait when called repeatedly by the platform. Also, the method should not block and should return in a reasonable amount of time that is less than the streaming window size (which is 500ms by default).</li>
+<li>Often operators have customizable parameters such as information about locations of external systems or parameters that modify the behavior of the operator. Users should be able to specify these easily without having to change source code. This can be done by making them properties of the operator because they can then be initialized from external properties files.<ul>
+<li>Where possible default values should be provided for the properties in the source code.</li>
+<li>Validation rules should be specified for the properties using javax constraint validations that check whether the values specified for the properties are in the correct format, range or other operator requirements. Required properties should have at least a <strong>@NotNull</strong> validation specifying that they have to be specified by the user.</li>
+</ul>
+</li>
+</ul>
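+<p>To make the conventions above concrete, here is a minimal, illustrative skeleton (not an operator from the Apex library; the class and the <code>ExternalClient</code> placeholder are made up) showing where the different kinds of initialization and teardown typically go:</p>
+<pre><code class="java">import com.datatorrent.api.Context.OperatorContext;
+import com.datatorrent.api.DefaultInputPort;
+import com.datatorrent.api.DefaultOutputPort;
+import com.datatorrent.api.Operator.ActivationListener;
+import com.datatorrent.common.util.BaseOperator;
+
+// Illustrative skeleton; ExternalClient stands for any live resource such as a connection.
+public class SkeletonOperator extends BaseOperator implements ActivationListener&lt;OperatorContext&gt;
+{
+  public final transient DefaultOutputPort&lt;String&gt; output = new DefaultOutputPort&lt;String&gt;();
+
+  public final transient DefaultInputPort&lt;String&gt; input = new DefaultInputPort&lt;String&gt;()
+  {
+    @Override
+    public void process(String tuple)
+    {
+      countInWindow++;
+    }
+  };
+
+  private long countInWindow;               // checkpointed state
+  private transient ExternalClient client;  // live entity, re-created on recovery
+
+  @Override
+  public void setup(OperatorContext context)
+  {
+    // one-time initialization of lifetime entities (factories, buffers) -- not in the constructor
+  }
+
+  @Override
+  public void activate(OperatorContext context)
+  {
+    client = new ExternalClient();          // open live connections just before processing starts
+  }
+
+  @Override
+  public void beginWindow(long windowId)
+  {
+    countInWindow = 0;                      // per-window initialization
+  }
+
+  @Override
+  public void endWindow()
+  {
+    output.emit("tuples in window: " + countInWindow);  // per-window aggregation
+  }
+
+  @Override
+  public void deactivate()
+  {
+    client.close();                         // inverse of activate
+  }
+
+  @Override
+  public void teardown()
+  {
+    // release lifetime resources allocated in setup
+  }
+
+  private static class ExternalClient
+  {
+    void close() { }
+  }
+}
+</code></pre>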
+<h3 id="checkpointing">Checkpointing</h3>
+<p>Checkpointing is a process of snapshotting the state of an operator and saving it so that in case of failure the state can be used to restore the operator to a prior state and continue processing. It is automatically performed by the platform at a configurable interval. All operators in the application are checkpointed in a distributed fashion, thus allowing the entire state of the application to be saved and available for recovery if needed. Here are some things to remember when it comes to checkpointing:</p>
+<ul>
+<li>The process of checkpointing involves snapshotting the state by serializing the operator and saving it to a store. This is done using a <strong>StorageAgent</strong>. By default a <em>StorageAgent</em> is already provided by the platform and it is called <strong>AsyncFSStorageAgent</strong>. It serializes the operator using Kryo and saves the serialized state asynchronously to a filesystem such as HDFS. There are other implementations of <em>StorageAgent</em> available such as <strong>GeodeKeyValueStorageAgent</strong> that stores the serialized state in Geode which is an in-memory replicated data grid.</li>
+<li>All variables in the operator that are marked neither transient nor final are saved, so any variables in the operator that are not part of the state should be marked transient. Specifically, variables like connection objects, I/O streams and ports should be transient, because they need to be set up again on failure recovery.</li>
+<li>If the operator does not keep any state between windows, mark it with the <strong>@Stateless</strong> annotation. This results in efficiencies during checkpointing and recovery. The operator will not be checkpointed and is always restored to the initial state.</li>
+<li>The checkpoint interval can be set using the <strong>CHECKPOINT_WINDOW_COUNT</strong> attribute which specifies the interval in terms of number of streaming windows.</li>
+<li>If the correct functioning of the operator requires that the <strong>endWindow</strong> method be called before checkpointing can happen, then the checkpoint interval should align with the application window interval, i.e., it should be a multiple of the application window interval. In this case the operator should be marked with <strong>OperatorAnnotation</strong> and <strong>checkpointableWithinAppWindow</strong> set to false. If the window intervals are configured by the user and they don’t align, it will result in a DAG validation error and the application won’t launch.</li>
+<li>In some cases the operator state related to a piece of data needs to be purged once that data is no longer required by the application, otherwise the state will continue to build up indefinitely. The platform provides a way to let the operator know about this using a callback listener called <strong>CheckpointNotificationListener</strong>. This listener has a callback method called <strong>committed</strong>, which is called by the platform from time to time with a window id that has been processed successfully by all the operators in the DAG and hence is no longer needed. The operator can delete all the state corresponding to window ids less than or equal to the provided window id.</li>
+<li>Sometimes operators need to perform some tasks just before checkpointing. For example, filesystem operators may want to flush the files just before checkpoint so they can be sure that all pending data is written to disk and no data is lost if there is an operator failure just after the checkpoint and the operator restarts from the checkpoint. To do this the operator would implement the same <em>CheckpointNotificationListener</em> interface and implement the <strong>beforeCheckpoint</strong> method where it can do these tasks (a combined sketch of these callbacks follows this list).</li>
+<li>If the operator is going to have a large state, checkpointing the entire state each time becomes unviable. Furthermore, the amount of memory needed to hold the state could be larger than the amount of physical memory available. In these cases the operator should checkpoint the state incrementally and also manage the memory for the state more efficiently. The platform provides a utility called <strong>ManagedState</strong> that uses a combination of in-memory and on-disk caching to efficiently store and retrieve data in a performant, fault-tolerant way and also checkpoint it in an incremental fashion. There are operators in the platform, such as the Dedup and Join operators, that use <em>ManagedState</em> and can be used as a reference on how to use this utility.</li>
+</ul>
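+<p>The following sketch (illustrative, not library code) combines these callbacks: per-window bookkeeping is kept in a non-transient, non-final field so it is checkpointed, buffered output is flushed in <code>beforeCheckpoint</code>, and state for fully processed windows is purged in <code>committed</code>:</p>
+<pre><code class="java">import java.util.HashMap;
+import java.util.Iterator;
+import java.util.Map;
+
+import com.datatorrent.api.Operator.CheckpointNotificationListener;
+import com.datatorrent.common.util.BaseOperator;
+
+public class CheckpointAwareWriter extends BaseOperator implements CheckpointNotificationListener
+{
+  // neither transient nor final, so it is part of the checkpointed state
+  private Map&lt;Long, Long&gt; bytesWrittenPerWindow = new HashMap&lt;Long, Long&gt;();
+
+  @Override
+  public void beginWindow(long windowId)
+  {
+    bytesWrittenPerWindow.put(windowId, 0L);  // updated as data is written (writing omitted)
+  }
+
+  @Override
+  public void beforeCheckpoint(long windowId)
+  {
+    // flush buffered output so the external data matches the state about to be checkpointed
+  }
+
+  @Override
+  public void checkpointed(long windowId)
+  {
+    // nothing to do after the checkpoint in this sketch
+  }
+
+  @Override
+  public void committed(long windowId)
+  {
+    // windows up to and including windowId are fully processed by the DAG; purge their state
+    Iterator&lt;Map.Entry&lt;Long, Long&gt;&gt; it = bytesWrittenPerWindow.entrySet().iterator();
+    while (it.hasNext()) {
+      if (it.next().getKey() &lt;= windowId) {
+        it.remove();
+      }
+    }
+  }
+}
+</code></pre>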
+<h2 id="input-operators">Input Operators</h2>
+<p>Input operators have additional requirements:</p>
+<ul>
+<li>The <strong>emitTuples</strong> method, implemented by the operator, is called by the platform to give the operator an opportunity to emit some data. This method is always called within a window boundary but can be called multiple times within the same window. There are some important guidelines on how to implement this method:<ul>
+<li>This should not be a blocking method and should return in a reasonable time that is less than the streaming window size (which is 500ms by default). This also applies to other callback methods called by the platform such as <em>beginWindow</em>, <em>endWindow</em> etc., but is more important here since this method will be called continuously by the platform.</li>
+<li>If the operator needs to interact with external systems to obtain data and this can potentially take a long time, then this should be performed asynchronously in a different thread. Refer to the threading section below for the guidelines when using threading; a sketch of the queue-draining <em>emitTuples</em> that goes with such a thread follows this list.</li>
+<li>In each invocation, the method can emit any number of data tuples.</li>
+</ul>
+</li>
+</ul>
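+<p>A sketch of the non-blocking pattern (illustrative names; the poller thread that fills the queue is shown in the Threads section at the end of this document): <code>emitTuples</code> only drains an in-memory queue and bounds the work done in a single call.</p>
+<pre><code class="java">import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+
+import com.datatorrent.api.DefaultOutputPort;
+import com.datatorrent.api.InputOperator;
+import com.datatorrent.common.util.BaseOperator;
+
+// Illustrative input operator: a separate poller thread (not shown) fills the buffer.
+public class QueueDrainingInputOperator extends BaseOperator implements InputOperator
+{
+  private static final int MAX_TUPLES_PER_CALL = 1000;
+
+  public final transient DefaultOutputPort&lt;String&gt; output = new DefaultOutputPort&lt;String&gt;();
+
+  private final transient Queue&lt;String&gt; buffer = new ConcurrentLinkedQueue&lt;String&gt;();
+
+  @Override
+  public void emitTuples()
+  {
+    // emit whatever has arrived so far; bounded so a single call returns quickly and never blocks
+    for (int i = 0; i &lt; MAX_TUPLES_PER_CALL; i++) {
+      String tuple = buffer.poll();
+      if (tuple == null) {
+        break;  // nothing available right now; do not wait
+      }
+      output.emit(tuple);
+    }
+  }
+}
+</code></pre>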
+<h3 id="idempotence">Idempotence</h3>
+<p>Many applications write data to external systems using output operators. To ensure that data is present exactly once in the external system even in a failure recovery scenario, the output operators expect that the replayed windows during recovery contain the same data as before the failure. This is called idempotency. Since operators within the DAG are merely responding to input data provided to them by the upstream operators, and the input operator has no upstream operator, the responsibility of idempotent replay falls on the input operators.</p>
+<ul>
+<li>For idempotent replay of data, the operator needs to store some meta-information for every window that would allow it to identify what data was sent in that window. This is called the idempotent state.<ul>
+<li>If the external source of the input operator allows replayability, this could be information such as the offset of the last piece of data in the window, an identifier of the last piece of data itself, or the number of data tuples sent.</li>
+<li>However if the external source does not allow replayability from an operator specified point, then the entire data sent within the window may need to be persisted by the operator.</li>
+</ul>
+</li>
+<li>The platform provides a utility called <em>WindowDataManager</em> to allow operators to save and retrieve idempotent state every window. Operators should use this to implement idempotency; a schematic sketch follows this list.</li>
+</ul>
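+<p>The sketch below illustrates the shape of idempotent state for an input operator whose source is replayable from an offset. The <code>PerWindowStateStore</code> interface is a made-up stand-in for <em>WindowDataManager</em> (whose exact method signatures are not reproduced here); the reading and emitting of data is only indicated by comments.</p>
+<pre><code class="java">import com.datatorrent.api.DefaultOutputPort;
+import com.datatorrent.api.InputOperator;
+import com.datatorrent.common.util.BaseOperator;
+
+public class IdempotentReaderSketch extends BaseOperator implements InputOperator
+{
+  // Stand-in for WindowDataManager: persists a small piece of state per completed window.
+  public interface PerWindowStateStore
+  {
+    void save(long windowId, long[] offsetRange);
+    long[] load(long windowId);
+    long largestCompletedWindow();
+  }
+
+  public final transient DefaultOutputPort&lt;String&gt; output = new DefaultOutputPort&lt;String&gt;();
+
+  private transient PerWindowStateStore store;  // initialized in setup (omitted)
+  private transient long currentWindowId;
+  private transient long windowStartOffset;
+  private transient boolean replaying;
+  private long offset;                          // checkpointed read position in the source
+
+  @Override
+  public void beginWindow(long windowId)
+  {
+    currentWindowId = windowId;
+    replaying = windowId &lt;= store.largestCompletedWindow();
+    if (replaying) {
+      long[] range = store.load(windowId);
+      // re-read exactly range[0]..range[1] from the source and emit it (omitted)
+      offset = range[1];
+    } else {
+      windowStartOffset = offset;
+    }
+  }
+
+  @Override
+  public void emitTuples()
+  {
+    if (!replaying) {
+      // normal path: read new data starting at "offset", emit it and advance "offset" (omitted)
+    }
+  }
+
+  @Override
+  public void endWindow()
+  {
+    if (!replaying) {
+      store.save(currentWindowId, new long[] {windowStartOffset, offset});
+    }
+  }
+}
+</code></pre>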
+<h2 id="output-operators">Output Operators</h2>
+<p>Output operators typically connect to external storage systems such as filesystems, databases or key value stores to store data.</p>
+<ul>
+<li>In some situations, the external systems may not be functioning in a reliable fashion. They may have prolonged outages or performance problems. If the operator is being designed to work in such environments, it needs to be able to handle these problems gracefully and not block the DAG or fail. In these scenarios the operator should cache the data into a local store such as HDFS and interact with external systems in a separate thread so as not to cause problems in the operator lifecycle thread. This pattern is called the <strong>Reconciler</strong> pattern and there are operators that implement this pattern available in the library for reference.</li>
+</ul>
+<h3 id="end-to-end-exactly-once">End-to-End Exactly Once</h3>
+<p>When output operators store data in external systems, it is important that they do not lose data or write duplicate data when there is a failure event and the DAG recovers from that failure. In failure recovery, the windows from the previous checkpoint are replayed and the operator receives this data again. The operator should ensure that it does not write this data again. Operator developers should figure out how to do this specifically for the operators they are developing depending on the logic of the operators. Below are examples of how a couple of existing output operators do this for reference.</p>
+<ul>
+<li>The file output operator that writes data to files keeps track of the file lengths in its state. These lengths are checkpointed and restored on failure recovery. On restart, the operator truncates the file to the length in the recovered state. This makes the data in the file the same as it was at the time of the checkpoint before the failure. The operator then writes the replayed data from the checkpoint in the regular fashion, as with any other data. This ensures no data is lost or duplicated in the file.</li>
+<li>The JDBC output operator that writes data to a database table writes the data in a window in a single transaction. It also writes the current window id into a meta table along with the data as part of the same transaction. It commits the transaction at the end of the window. When there is an operator failure before the final commit, the database contains the data from the previous fully processed window and its window id, since the current window's transaction isn’t yet committed. On recovery, the operator reads this window id back from the meta table. It ignores all the replayed windows whose window id is less than or equal to the recovered window id, and thus ensures that it does not duplicate data already present in the database. It starts writing data normally again when the window id of the data becomes greater than the recovered window id, thus ensuring no data is lost (a condensed sketch of this pattern follows this list).</li>
+</ul>
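+<p>A condensed sketch of the JDBC pattern described above, written against plain <code>java.sql</code>; the table and column names (<code>data_table</code>, <code>meta_table</code>, <code>win_id</code>) are invented for illustration and error handling is omitted:</p>
+<pre><code class="java">import java.sql.Connection;
+import java.sql.PreparedStatement;
+import java.sql.ResultSet;
+import java.sql.SQLException;
+import java.sql.Statement;
+
+// Data and the window id are committed in a single transaction; replayed windows are skipped.
+public class ExactlyOnceJdbcSketch
+{
+  private final Connection con;
+  private long committedWindowId;  // largest window id already present in the database
+
+  public ExactlyOnceJdbcSketch(Connection con) throws SQLException
+  {
+    this.con = con;
+    con.setAutoCommit(false);
+    try (Statement st = con.createStatement();
+        ResultSet rs = st.executeQuery("SELECT win_id FROM meta_table")) {
+      committedWindowId = rs.next() ? rs.getLong(1) : -1L;
+    }
+  }
+
+  public void writeWindow(long windowId, Iterable&lt;String&gt; rows) throws SQLException
+  {
+    if (windowId &lt;= committedWindowId) {
+      return;  // replayed window; this data is already in the database
+    }
+    try (PreparedStatement ins = con.prepareStatement("INSERT INTO data_table (payload) VALUES (?)");
+        PreparedStatement meta = con.prepareStatement("UPDATE meta_table SET win_id = ?")) {
+      for (String row : rows) {
+        ins.setString(1, row);
+        ins.executeUpdate();
+      }
+      meta.setLong(1, windowId);
+      meta.executeUpdate();
+      con.commit();  // data and window id become visible together
+      committedWindowId = windowId;
+    }
+  }
+}
+</code></pre>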
+<h2 id="partitioning">Partitioning</h2>
+<p>Partitioning allows an operation to be scaled to handle more pieces of data than before but with a similar SLA. This is done by creating multiple instances of an operator and distributing the data among them. Input operators can also be partitioned to stream more pieces of data into the application. The platform provides a lot of flexibility and options for partitioning. Partitioning can happen once at startup or can be dynamically changed anytime while the application is running, and it can be done in a stateless or stateful way by distributing state from the old partitions to new partitions.</p>
+<p>In the platform, the responsibility for partitioning is shared among different entities. These are:</p>
+<ol>
+<li>A <strong>partitioner</strong> that specifies <em>how</em> to partition the operator; specifically, it takes an old set of partitions and creates a new set of partitions. At the start of the application the old set has one partition, and the partitioner can return more than one partition to start the application with multiple partitions. The partitioner can have any custom Java logic to determine the number of new partitions and to set their initial state, either as brand new state or derived from the state of the old partitions. It also specifies how the data gets distributed among the new partitions. The new set doesn't have to contain only new partitions; it can carry over some old partitions if desired.</li>
+<li>An optional <strong>statistics (stats) listener</strong> that specifies <em>when</em> to partition. It is optional because it is needed only when dynamic partitioning is desired. With a stats listener, the operator statistics can be used to determine when to partition.</li>
+<li>In some cases the <em>operator</em> itself should be aware of partitioning and would need to provide supporting code.<ul>
+<li>In case of input operators each partition should have a property or a set of properties that allow it to distinguish itself from the other partitions and fetch unique data.</li>
+</ul>
+</li>
+<li>When an operator that was originally a single instance is split into multiple partitions with each partition working on a subset of data, the results of the partitions may need to be combined to compute the final result. The combining logic would depend on the logic of the operator. This would be specified by the developer using a <strong>Unifier</strong>, which is deployed as another operator by the platform. If no <em>Unifier</em> is specified, the platform inserts a <strong>default unifier</strong> that merges the results of the multiple partition streams into a single stream. Each output port can have a different <em>Unifier</em> and this is specified by returning the corresponding <em>Unifier</em> in the <strong>getUnifier</strong> method of the output port. The operator developer should provide a custom <em>Unifier</em> wherever applicable; a sketch is shown below.</li>
+<li>The Apex <em>engine</em> that brings everything together and effects the partitioning.</li>
+</ol>
+<p>Since partitioning is critical for scalability of applications, operators must support it. There should be a strong reason for an operator not to support partitioning, such as the logic performed by the operator not lending itself to parallelism. In order to support partitioning, an operator developer, apart from developing the functionality of the operator, may also need to provide a partitioner, a stats listener and supporting code in the operator as described in the steps above. The next sections delve into this.</p>
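+<p>As an example of item 4 above, an operator that counts tuples in each partition needs a unifier that sums the partial counts. A minimal, illustrative sketch:</p>
+<pre><code class="java">import com.datatorrent.api.DefaultOutputPort;
+import com.datatorrent.api.Operator.Unifier;
+import com.datatorrent.common.util.BaseOperator;
+
+// Sums the partial counts emitted by the partitions of the upstream operator.
+public class SumUnifier extends BaseOperator implements Unifier&lt;Long&gt;
+{
+  public final transient DefaultOutputPort&lt;Long&gt; output = new DefaultOutputPort&lt;Long&gt;();
+
+  private transient long sum;
+
+  @Override
+  public void beginWindow(long windowId)
+  {
+    sum = 0;
+  }
+
+  @Override
+  public void process(Long partialCount)
+  {
+    sum += partialCount;
+  }
+
+  @Override
+  public void endWindow()
+  {
+    output.emit(sum);
+  }
+}
+</code></pre>
+<p>The partitioned operator would then return this unifier from its output port, for example by overriding <code>public Unifier&lt;Long&gt; getUnifier() { return new SumUnifier(); }</code> on its <code>DefaultOutputPort&lt;Long&gt;</code>.</p>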
+<h3 id="out-of-the-box-partitioning">Out of the box partitioning</h3>
+<p>The platform comes with some built-in partitioning utilities that can be used in certain scenarios.</p>
+<ul>
+<li>
+<p><strong>StatelessPartitioner</strong> provides a default partitioner that can be used for an operator under certain conditions. If the operator satisfies these conditions, the partitioner can be specified for the operator with a simple setting (shown in the sketch after this list) and no other partitioning code is needed. The conditions are:</p>
+<ul>
+<li>No dynamic partitioning is needed (see the next point about dynamic partitioning).</li>
+<li>There is no distinct initial state for the partitions, i.e., all partitions start with the same initial state submitted during application launch.</li>
+</ul>
+<p>Typically input or output operators do not fall into this category, although there are some exceptions. This partitioner is mainly used with operators that are in the middle of the DAG, after the input and before the output operators. When used with non-input operators, only the data for the first declared input port is distributed among the different partitions. All other input ports are treated as broadcast and all partitions receive all the data for that port.</p>
+</li>
+<li>
+<p><strong>StatelessThroughputBasedPartitioner</strong> in Malhar provides a dynamic partitioner based on throughput thresholds. Similarly, <strong>StatelessLatencyBasedPartitioner</strong> provides a latency-based dynamic partitioner in RTS. If these partitioners can be used, then separate partitioning-related code is not needed. The conditions under which these can be used are:</p>
+<ul>
+<li>There is no distinct initial state for the partitions.</li>
+<li>There is no state being carried over by the operator from one window to the next, i.e., the operator is stateless.</li>
+</ul>
+</li>
+</ul>
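+<p>When the conditions hold, the <em>StatelessPartitioner</em> can be set as the <code>PARTITIONER</code> attribute of the operator, either in a properties file or in code. A sketch (the operator name and partition count are illustrative):</p>
+<pre><code class="java">import org.apache.hadoop.conf.Configuration;
+
+import com.datatorrent.api.Context.OperatorContext;
+import com.datatorrent.api.DAG;
+import com.datatorrent.api.StreamingApplication;
+import com.datatorrent.common.partitioner.StatelessPartitioner;
+import com.datatorrent.common.util.BaseOperator;
+
+public class PartitionedApplication implements StreamingApplication
+{
+  // placeholder for any operator that satisfies the StatelessPartitioner conditions
+  public static class MyFilterOperator extends BaseOperator
+  {
+    // ports and logic omitted
+  }
+
+  @Override
+  public void populateDAG(DAG dag, Configuration conf)
+  {
+    MyFilterOperator filter = dag.addOperator("filter", new MyFilterOperator());
+    // start the operator with four identical partitions
+    dag.setAttribute(filter, OperatorContext.PARTITIONER,
+        new StatelessPartitioner&lt;MyFilterOperator&gt;(4));
+    // input, output and streams omitted
+  }
+}
+</code></pre>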
+<h3 id="custom-partitioning">Custom partitioning</h3>
+<p>In many cases, operators don’t satisfy the above conditions and a built-in partitioner cannot be used. Custom partitioning code then needs to be written by the operator developer. Below are guidelines for writing it.</p>
+<ul>
+<li>Since the operator developer is providing a <em>partitioner</em> for the operator, the partitioning code should be added to the operator itself by making the operator implement the Partitioner interface and implementing the required methods, rather than creating a separate partitioner (a skeleton follows this list). The advantage is that the user of the operator does not have to explicitly figure out the partitioner and set it for the operator, but still has the option to override this built-in partitioner with a different one.</li>
+<li>The <em>partitioner</em> is responsible for setting the initial state of the new partitions, whether it is at the start of the application or when partitioning is happening while the application is running as in the dynamic partitioning case. In the dynamic partitioning scenario, the partitioner needs to take the state from the old partitions and distribute it among the new partitions. It is important to note that apart from the checkpointed state the partitioner also needs to distribute idempotent state.</li>
+<li>The <em>partitioner</em> interface has two methods, <strong>definePartitions</strong> and <strong>partitioned</strong>. The method <em>definePartitions</em> is first called to determine the new partitions, and if enough resources are available on the cluster, the <em>partitioned</em> method is called, passing in the new partitions. This happens both during initial partitioning and dynamic partitioning. If resources are not available, partitioning is abandoned and existing partitions continue to run untouched. This means that any processing-intensive operations should be deferred to the <em>partitioned</em> call instead of doing them in <em>definePartitions</em>, as they may not be needed if there are not enough resources available in the cluster.</li>
+<li>The <em>partitioner</em>, along with creating the new partitions, should also specify how the data gets distributed across the new partitions. It should do this by specifying a mapping called <strong>PartitionKeys</strong> for each partition that maps the data to that partition. This mapping needs to be specified for every input port in the operator. If the <em>partitioner</em> wants to use the standard mapping it can use a utility method called <strong>DefaultPartition.assignPartitionKeys</strong>.</li>
+<li>When the partitioner is scaling the operator up to more partitions, try to reuse the existing partitions and create new partitions to augment the current set. The reuse can be achieved by the partitioner returning the current partitions unchanged. This will result in the current partitions continuing to run untouched.</li>
+<li>In case of dynamic partitioning, as mentioned earlier, a stats listener is also needed to determine when to re-partition. Like the <em>Partitioner</em> interface, the operator can also implement the <em>StatsListener</em> interface to provide a stats listener implementation that will be automatically used.</li>
+<li>The <em>StatsListener</em> has access to all operator statistics to make its decision on partitioning. Apart from the statistics that the platform computes for the operators, such as throughput and latency, operator developers can include their own business metrics by using the AutoMetric feature.</li>
+<li>If the operator is not partitionable, mark it as such with <em>OperatorAnnotation</em> and its <em>partitionable</em> element set to false.</li>
+</ul>
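+<p>A skeleton of an operator that acts as its own partitioner and stats listener, as suggested above. The scaling decision and the redistribution of state are application specific and only indicated by comments; <code>DefaultPartition.assignPartitionKeys</code> is used on the assumption that the standard hashcode-based mapping is acceptable.</p>
+<pre><code class="java">import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Map;
+
+import com.datatorrent.api.DefaultInputPort;
+import com.datatorrent.api.DefaultPartition;
+import com.datatorrent.api.Partitioner;
+import com.datatorrent.api.StatsListener;
+import com.datatorrent.common.util.BaseOperator;
+
+public class SelfPartitioningOperator extends BaseOperator
+    implements Partitioner&lt;SelfPartitioningOperator&gt;, StatsListener
+{
+  public final transient DefaultInputPort&lt;String&gt; input = new DefaultInputPort&lt;String&gt;()
+  {
+    @Override
+    public void process(String tuple)
+    {
+      // operator logic omitted
+    }
+  };
+
+  @Override
+  public Collection&lt;Partition&lt;SelfPartitioningOperator&gt;&gt; definePartitions(
+      Collection&lt;Partition&lt;SelfPartitioningOperator&gt;&gt; partitions, PartitioningContext context)
+  {
+    int newCount = 4;  // illustrative; derive from stats, configuration or the old partitions
+    Collection&lt;Partition&lt;SelfPartitioningOperator&gt;&gt; newPartitions =
+        new ArrayList&lt;Partition&lt;SelfPartitioningOperator&gt;&gt;();
+    for (int i = 0; i &lt; newCount; i++) {
+      // distribute checkpointed and idempotent state from the old partitions here if needed
+      newPartitions.add(new DefaultPartition&lt;SelfPartitioningOperator&gt;(new SelfPartitioningOperator()));
+    }
+    // standard distribution of data on the first input port
+    DefaultPartition.assignPartitionKeys(newPartitions, input);
+    return newPartitions;
+  }
+
+  @Override
+  public void partitioned(Map&lt;Integer, Partition&lt;SelfPartitioningOperator&gt;&gt; partitions)
+  {
+    // processing-intensive post-partitioning work goes here, once resources are guaranteed
+  }
+
+  @Override
+  public Response processStats(BatchedOperatorStats stats)
+  {
+    Response response = new Response();
+    response.repartitionRequired = false;  // set to true when the stats justify repartitioning
+    return response;
+  }
+}
+</code></pre>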
+<h3 id="streamcodecs">StreamCodecs</h3>
+<p>A <strong>StreamCodec</strong> is used in partitioning to distribute the data tuples among the partitions. The <em>StreamCodec</em> computes an integer hashcode for a data tuple and this is used along with the <em>PartitionKeys</em> mapping to determine which partition or partitions receive the data tuple. If a <em>StreamCodec</em> is not specified, then a default one is used by the platform which returns the Java hashcode of the tuple.</p>
+<p><em>StreamCodec</em> is also useful in another aspect of the application. It is used to serialize and deserialize the tuple to transfer it between operators. The default <em>StreamCodec</em> uses the Kryo library for serialization.</p>
+<p>The following guidelines are useful when considering a custom <em>StreamCodec</em>:</p>
+<ul>
+<li>A custom <em>StreamCodec</em> is needed if the tuples need to be distributed based on a criterion different from the hashcode of the tuple. If the correct working of an operator depends on the data from the upstream operator being distributed using a custom criterion, such as being sticky on a “key” field within the tuple, then a custom <em>StreamCodec</em> should be provided by the operator developer. This codec can implement the custom criterion. The operator should also return this custom codec in the <strong>getStreamCodec</strong> method of the input port.</li>
+<li>When implementing a custom <em>StreamCodec</em> for the purpose of using a different criterion to distribute the tuples, the codec can extend an existing <em>StreamCodec</em> and implement only the partition hash method, so that the codec does not have to worry about the serialization and deserialization functionality. The Apex platform provides two pre-built <em>StreamCodec</em> implementations for this purpose: <strong>KryoSerializableStreamCodec</strong>, which uses Kryo for serialization, and <strong>JavaSerializationStreamCodec</strong>, which uses Java serialization (a sketch follows this list).</li>
+<li>Different <em>StreamCodec</em> implementations can be used for different inputs of a stream with multiple inputs when different criteria for distributing the tuples are desired for those inputs.</li>
+</ul>
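+<p>A sketch of a key-sticky codec: it extends <code>KryoSerializableStreamCodec</code>, so serialization is inherited, and bases the partition hash on a key field. The tuple class and its fields are made up for illustration.</p>
+<pre><code class="java">import com.datatorrent.api.DefaultInputPort;
+import com.datatorrent.api.StreamCodec;
+import com.datatorrent.common.util.BaseOperator;
+import com.datatorrent.lib.codec.KryoSerializableStreamCodec;
+
+public class KeyStickyReceiver extends BaseOperator
+{
+  // Illustrative tuple type with a key field.
+  public static class Event
+  {
+    public String key;
+    public String payload;
+  }
+
+  // Tuples with the same key always hash to the same value and hence to the same partition.
+  public static class KeyStreamCodec extends KryoSerializableStreamCodec&lt;Event&gt;
+  {
+    @Override
+    public int getPartition(Event tuple)
+    {
+      return tuple.key.hashCode();
+    }
+  }
+
+  // The receiving operator advertises the codec on its input port.
+  public final transient DefaultInputPort&lt;Event&gt; input = new DefaultInputPort&lt;Event&gt;()
+  {
+    @Override
+    public void process(Event tuple)
+    {
+      // operator logic omitted
+    }
+
+    @Override
+    public StreamCodec&lt;Event&gt; getStreamCodec()
+    {
+      return new KeyStreamCodec();
+    }
+  };
+}
+</code></pre>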
+<h2 id="threads">Threads</h2>
+<p>The operator lifecycle methods such as <strong>setup</strong>, <strong>beginWindow</strong>, <strong>endWindow</strong> and <strong>process</strong> in <em>InputPorts</em> are all called by the platform from a single operator lifecycle thread, so the user does not have to worry about the issues arising from multi-threaded code. Use of separate threads in an operator is discouraged because in most cases the motivation for this is parallelism, but parallelism can already be achieved by using multiple partitions, and furthermore mistakes are easily made when writing multi-threaded code. When dealing with high volume and velocity data, the corner cases of incorrectly written multi-threaded code are encountered and exposed more easily. However, there are times when separate threads are needed; for example, when interacting with external systems the delay in retrieving or sending data can be large at times, blocking the operator and other DAG processing such as committed windows. In these cases the following guidelines must be followed strictly.</p>
+<ul>
+<li>Threads should be started in <strong>activate</strong> and stopped in <strong>deactivate</strong>. In <em>deactivate</em> the operator should wait until any threads it launched have finished execution. It can do so by calling <strong>join</strong> on the threads or, if using an <strong>ExecutorService</strong>, calling <strong>awaitTermination</strong> on the service. A full sketch of this pattern follows this list.</li>
+<li>Threads should not call any methods on the ports directly as this can cause concurrency exceptions and also result in invalid states.</li>
+<li>Threads can share state with the lifecycle methods using data structures that are either explicitly protected by synchronization or are inherently thread safe such as thread safe queues.</li>
+<li>If this shared state needs to be protected against failure, then it needs to be persisted during checkpoint. To have a consistent checkpoint, the state should not be modified by the thread while it is being serialized and saved by the operator lifecycle thread during checkpoint. Since the checkpoint process happens outside the window boundary, the thread should be quiesced between <strong>endWindow</strong> and <strong>beginWindow</strong> or, more efficiently, between the <strong>beforeCheckpoint</strong> and <strong>checkpointed</strong> callbacks.</li>
+</ul>
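+<p>A sketch putting these guidelines together for the asynchronous input pattern referenced in the Input Operators section (names are illustrative; the call to the external system is only indicated by a comment):</p>
+<pre><code class="java">import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+
+import com.datatorrent.api.Context.OperatorContext;
+import com.datatorrent.api.DefaultOutputPort;
+import com.datatorrent.api.InputOperator;
+import com.datatorrent.api.Operator.ActivationListener;
+import com.datatorrent.common.util.BaseOperator;
+
+public class AsyncPollingInputOperator extends BaseOperator
+    implements InputOperator, ActivationListener&lt;OperatorContext&gt;
+{
+  public final transient DefaultOutputPort&lt;String&gt; output = new DefaultOutputPort&lt;String&gt;();
+
+  // thread safe structure shared between the poller thread and the operator lifecycle thread
+  private final transient Queue&lt;String&gt; buffer = new ConcurrentLinkedQueue&lt;String&gt;();
+  private transient ExecutorService executor;
+  private transient volatile boolean running;
+
+  @Override
+  public void activate(OperatorContext context)
+  {
+    running = true;
+    executor = Executors.newSingleThreadExecutor();
+    executor.submit(new Runnable()
+    {
+      @Override
+      public void run()
+      {
+        while (running) {
+          // fetch from the external system here (omitted); this placeholder just sleeps briefly
+          try {
+            Thread.sleep(100);
+          } catch (InterruptedException e) {
+            return;
+          }
+          buffer.offer("fetched record");  // hand data over via the queue, never touch ports here
+        }
+      }
+    });
+  }
+
+  @Override
+  public void emitTuples()
+  {
+    String tuple;
+    while ((tuple = buffer.poll()) != null) {
+      output.emit(tuple);  // only the operator lifecycle thread uses the ports
+    }
+  }
+
+  @Override
+  public void deactivate()
+  {
+    running = false;
+    executor.shutdown();
+    try {
+      executor.awaitTermination(10, TimeUnit.SECONDS);  // wait for the poller thread to finish
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+    }
+  }
+}
+</code></pre>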
+              
+            </div>
+          </div>
+          <footer>
+  
+    <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
+      
+        <a href="../apex_cli/" class="btn btn-neutral float-right" title="Apex CLI">Next <span class="icon icon-circle-arrow-right"></span></a>
+      
+      
+        <a href="../autometrics/" class="btn btn-neutral" title="AutoMetric API"><span class="icon icon-circle-arrow-left"></span> Previous</a>
+      
+    </div>
+  
+
+  <hr/>
+
+  <div role="contentinfo">
+    <!-- Copyright etc -->
+    
+  </div>
+
+  Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
+</footer>
+	  
+        </div>
+      </div>
+
+    </section>
+
+  </div>
+
+<div class="rst-versions" role="note" style="cursor: pointer">
+    <span class="rst-current-version" data-toggle="rst-current-version">
+      
+      
+        <span><a href="../autometrics/" style="color: #fcfcfc;">&laquo; Previous</a></span>
+      
+      
+        <span style="margin-left: 15px"><a href="../apex_cli/" style="color: #fcfcfc">Next &raquo;</a></span>
+      
+    </span>
+</div>
+
+</body>
+</html>

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/images/security/image03.png
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/images/security/image03.png b/content/docs/apex-3.4/images/security/image03.png
new file mode 100755
index 0000000..175feb8
Binary files /dev/null and b/content/docs/apex-3.4/images/security/image03.png differ

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/index.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/index.html b/content/docs/apex-3.4/index.html
index c944ecc..56fcb03 100644
--- a/content/docs/apex-3.4/index.html
+++ b/content/docs/apex-3.4/index.html
@@ -109,6 +109,13 @@
     </li>
 
         
+            
+    <li class="toctree-l1 ">
+        <a class="" href="development_best_practices/">Best Practices</a>
+        
+    </li>
+
+        
     </ul>
 <li>
           
@@ -184,7 +191,7 @@
 <li>Simple API supports generic Java code</li>
 </ul>
 <p>The platform has been demonstrated to scale linearly across Hadoop clusters under extreme loads of billions of events per second.  Hardware and process failures are quickly recovered with HDFS-backed checkpointing and automatic operator recovery, preserving application state and resuming execution in seconds.  Functional and operational specifications are separated.  Apex provides a simple API, which enables users to write generic, reusable code.  The code is dropped in as-is and the platform automatically handles the various operational concerns, such as state management, fault tolerance, scalability, security, metrics, etc.  This frees users to focus on functional development, and lets the platform provide operability support.</p>
-<p>The core Apex platform is supplemented by Malhar, a library of connector and logic functions, enabling rapid application development.  These operators and modules provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySql, Cassandra, MongoDB, Redis, HBase, CouchDB, generic JDBC, and other database connectors.  In addition to the operators, the library contains a number of demos applications, demonstrating operator features and capabilities.  To see the full list of available operators and related documentation, visit <a href="https://github.com/apache/incubator-apex-malhar">Apex Malhar on Github</a></p>
+<p>The core Apex platform is supplemented by Malhar, a library of connector and logic functions, enabling rapid application development.  These operators and modules provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySql, Cassandra, MongoDB, Redis, HBase, CouchDB, generic JDBC, and other database connectors.  In addition to the operators, the library contains a number of demo applications, demonstrating operator features and capabilities.  To see the full list of available operators and related documentation, visit <a href="https://github.com/apache/apex-malhar">Apex Malhar on Github</a></p>
 <p>For additional information visit <a href="http://apex.apache.org/">Apache Apex</a>.</p>
 <p><a href="http://apex.apache.org/"><img alt="" src="./favicon.ico" /></a></p>
               
@@ -232,5 +239,5 @@
 
 <!--
 MkDocs version : 0.15.3
-Build Date UTC : 2016-05-13 22:25:11.258707
+Build Date UTC : 2016-09-07 01:53:39.631895
 -->

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/main.html
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/main.html b/content/docs/apex-3.4/main.html
new file mode 100644
index 0000000..79c9f4e
--- /dev/null
+++ b/content/docs/apex-3.4/main.html
@@ -0,0 +1,10 @@
+{% extends "base.html" %}
+
+{#
+The entry point for the ReadTheDocs Theme.
+ 
+Any theme customisations should override this file to redefine blocks defined in
+the various templates. The custom theme should only need to define a main.html
+which `{% extends "base.html" %}` and defines various blocks which will replace
+the blocks defined in base.html and its included child templates.
+#}

http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js b/content/docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js
new file mode 100644
index 0000000..b72449a
--- /dev/null
+++ b/content/docs/apex-3.4/mkdocs/js/lunr-0.5.7.min.js
@@ -0,0 +1,7 @@
+/**
+ * lunr - http://lunrjs.com - A bit like Solr, but much smaller and not as bright - 0.5.7
+ * Copyright (C) 2014 Oliver Nightingale
+ * MIT Licensed
+ * @license
+ */
+!function(){var t=function(e){var n=new t.Index;return n.pipeline.add(t.trimmer,t.stopWordFilter,t.stemmer),e&&e.call(n,n),n};t.version="0.5.7",t.utils={},t.utils.warn=function(t){return function(e){t.console&&console.warn&&console.warn(e)}}(this),t.EventEmitter=function(){this.events={}},t.EventEmitter.prototype.addListener=function(){var t=Array.prototype.slice.call(arguments),e=t.pop(),n=t;if("function"!=typeof e)throw new TypeError("last argument must be a function");n.forEach(function(t){this.hasHandler(t)||(this.events[t]=[]),this.events[t].push(e)},this)},t.EventEmitter.prototype.removeListener=function(t,e){if(this.hasHandler(t)){var n=this.events[t].indexOf(e);this.events[t].splice(n,1),this.events[t].length||delete this.events[t]}},t.EventEmitter.prototype.emit=function(t){if(this.hasHandler(t)){var e=Array.prototype.slice.call(arguments,1);this.events[t].forEach(function(t){t.apply(void 0,e)})}},t.EventEmitter.prototype.hasHandler=function(t){return t in this.events},t.to
 kenizer=function(t){if(!arguments.length||null==t||void 0==t)return[];if(Array.isArray(t))return t.map(function(t){return t.toLowerCase()});for(var e=t.toString().replace(/^\s+/,""),n=e.length-1;n>=0;n--)if(/\S/.test(e.charAt(n))){e=e.substring(0,n+1);break}return e.split(/(?:\s+|\-)/).filter(function(t){return!!t}).map(function(t){return t.toLowerCase()})},t.Pipeline=function(){this._stack=[]},t.Pipeline.registeredFunctions={},t.Pipeline.registerFunction=function(e,n){n in this.registeredFunctions&&t.utils.warn("Overwriting existing registered function: "+n),e.label=n,t.Pipeline.registeredFunctions[e.label]=e},t.Pipeline.warnIfFunctionNotRegistered=function(e){var n=e.label&&e.label in this.registeredFunctions;n||t.utils.warn("Function is not registered with pipeline. This may cause problems when serialising the index.\n",e)},t.Pipeline.load=function(e){var n=new t.Pipeline;return e.forEach(function(e){var i=t.Pipeline.registeredFunctions[e];if(!i)throw new Error("Cannot load un-re
 gistered function: "+e);n.add(i)}),n},t.Pipeline.prototype.add=function(){var e=Array.prototype.slice.call(arguments);e.forEach(function(e){t.Pipeline.warnIfFunctionNotRegistered(e),this._stack.push(e)},this)},t.Pipeline.prototype.after=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var i=this._stack.indexOf(e)+1;this._stack.splice(i,0,n)},t.Pipeline.prototype.before=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var i=this._stack.indexOf(e);this._stack.splice(i,0,n)},t.Pipeline.prototype.remove=function(t){var e=this._stack.indexOf(t);this._stack.splice(e,1)},t.Pipeline.prototype.run=function(t){for(var e=[],n=t.length,i=this._stack.length,o=0;n>o;o++){for(var r=t[o],s=0;i>s&&(r=this._stack[s](r,o,t),void 0!==r);s++);void 0!==r&&e.push(r)}return e},t.Pipeline.prototype.reset=function(){this._stack=[]},t.Pipeline.prototype.toJSON=function(){return this._stack.map(function(e){return t.Pipeline.warnIfFunctionNotRegistered(e),e.label})},t.Vector=function(){this._magni
 tude=null,this.list=void 0,this.length=0},t.Vector.Node=function(t,e,n){this.idx=t,this.val=e,this.next=n},t.Vector.prototype.insert=function(e,n){var i=this.list;if(!i)return this.list=new t.Vector.Node(e,n,i),this.length++;for(var o=i,r=i.next;void 0!=r;){if(e<r.idx)return o.next=new t.Vector.Node(e,n,r),this.length++;o=r,r=r.next}return o.next=new t.Vector.Node(e,n,r),this.length++},t.Vector.prototype.magnitude=function(){if(this._magniture)return this._magnitude;for(var t,e=this.list,n=0;e;)t=e.val,n+=t*t,e=e.next;return this._magnitude=Math.sqrt(n)},t.Vector.prototype.dot=function(t){for(var e=this.list,n=t.list,i=0;e&&n;)e.idx<n.idx?e=e.next:e.idx>n.idx?n=n.next:(i+=e.val*n.val,e=e.next,n=n.next);return i},t.Vector.prototype.similarity=function(t){return this.dot(t)/(this.magnitude()*t.magnitude())},t.SortedSet=function(){this.length=0,this.elements=[]},t.SortedSet.load=function(t){var e=new this;return e.elements=t,e.length=t.length,e},t.SortedSet.prototype.add=function(){Arr
 ay.prototype.slice.call(arguments).forEach(function(t){~this.indexOf(t)||this.elements.splice(this.locationFor(t),0,t)},this),this.length=this.elements.length},t.SortedSet.prototype.toArray=function(){return this.elements.slice()},t.SortedSet.prototype.map=function(t,e){return this.elements.map(t,e)},t.SortedSet.prototype.forEach=function(t,e){return this.elements.forEach(t,e)},t.SortedSet.prototype.indexOf=function(t,e,n){var e=e||0,n=n||this.elements.length,i=n-e,o=e+Math.floor(i/2),r=this.elements[o];return 1>=i?r===t?o:-1:t>r?this.indexOf(t,o,n):r>t?this.indexOf(t,e,o):r===t?o:void 0},t.SortedSet.prototype.locationFor=function(t,e,n){var e=e||0,n=n||this.elements.length,i=n-e,o=e+Math.floor(i/2),r=this.elements[o];if(1>=i){if(r>t)return o;if(t>r)return o+1}return t>r?this.locationFor(t,o,n):r>t?this.locationFor(t,e,o):void 0},t.SortedSet.prototype.intersect=function(e){for(var n=new t.SortedSet,i=0,o=0,r=this.length,s=e.length,a=this.elements,h=e.elements;;){if(i>r-1||o>s-1)brea
 k;a[i]!==h[o]?a[i]<h[o]?i++:a[i]>h[o]&&o++:(n.add(a[i]),i++,o++)}return n},t.SortedSet.prototype.clone=function(){var e=new t.SortedSet;return e.elements=this.toArray(),e.length=e.elements.length,e},t.SortedSet.prototype.union=function(t){var e,n,i;return this.length>=t.length?(e=this,n=t):(e=t,n=this),i=e.clone(),i.add.apply(i,n.toArray()),i},t.SortedSet.prototype.toJSON=function(){return this.toArray()},t.Index=function(){this._fields=[],this._ref="id",this.pipeline=new t.Pipeline,this.documentStore=new t.Store,this.tokenStore=new t.TokenStore,this.corpusTokens=new t.SortedSet,this.eventEmitter=new t.EventEmitter,this._idfCache={},this.on("add","remove","update",function(){this._idfCache={}}.bind(this))},t.Index.prototype.on=function(){var t=Array.prototype.slice.call(arguments);return this.eventEmitter.addListener.apply(this.eventEmitter,t)},t.Index.prototype.off=function(t,e){return this.eventEmitter.removeListener(t,e)},t.Index.load=function(e){e.version!==t.version&&t.utils.wa
 rn("version mismatch: current "+t.version+" importing "+e.version);var n=new this;return n._fields=e.fields,n._ref=e.ref,n.documentStore=t.Store.load(e.documentStore),n.tokenStore=t.TokenStore.load(e.tokenStore),n.corpusTokens=t.SortedSet.load(e.corpusTokens),n.pipeline=t.Pipeline.load(e.pipeline),n},t.Index.prototype.field=function(t,e){var e=e||{},n={name:t,boost:e.boost||1};return this._fields.push(n),this},t.Index.prototype.ref=function(t){return this._ref=t,this},t.Index.prototype.add=function(e,n){var i={},o=new t.SortedSet,r=e[this._ref],n=void 0===n?!0:n;this._fields.forEach(function(n){var r=this.pipeline.run(t.tokenizer(e[n.name]));i[n.name]=r,t.SortedSet.prototype.add.apply(o,r)},this),this.documentStore.set(r,o),t.SortedSet.prototype.add.apply(this.corpusTokens,o.toArray());for(var s=0;s<o.length;s++){var a=o.elements[s],h=this._fields.reduce(function(t,e){var n=i[e.name].length;if(!n)return t;var o=i[e.name].filter(function(t){return t===a}).length;return t+o/n*e.boost}
 ,0);this.tokenStore.add(a,{ref:r,tf:h})}n&&this.eventEmitter.emit("add",e,this)},t.Index.prototype.remove=function(t,e){var n=t[this._ref],e=void 0===e?!0:e;if(this.documentStore.has(n)){var i=this.documentStore.get(n);this.documentStore.remove(n),i.forEach(function(t){this.tokenStore.remove(t,n)},this),e&&this.eventEmitter.emit("remove",t,this)}},t.Index.prototype.update=function(t,e){var e=void 0===e?!0:e;this.remove(t,!1),this.add(t,!1),e&&this.eventEmitter.emit("update",t,this)},t.Index.prototype.idf=function(t){var e="@"+t;if(Object.prototype.hasOwnProperty.call(this._idfCache,e))return this._idfCache[e];var n=this.tokenStore.count(t),i=1;return n>0&&(i=1+Math.log(this.tokenStore.length/n)),this._idfCache[e]=i},t.Index.prototype.search=function(e){var n=this.pipeline.run(t.tokenizer(e)),i=new t.Vector,o=[],r=this._fields.reduce(function(t,e){return t+e.boost},0),s=n.some(function(t){return this.tokenStore.has(t)},this);if(!s)return[];n.forEach(function(e,n,s){var a=1/s.length*t
 his._fields.length*r,h=this,u=this.tokenStore.expand(e).reduce(function(n,o){var r=h.corpusTokens.indexOf(o),s=h.idf(o),u=1,l=new t.SortedSet;if(o!==e){var c=Math.max(3,o.length-e.length);u=1/Math.log(c)}return r>-1&&i.insert(r,a*s*u),Object.keys(h.tokenStore.get(o)).forEach(function(t){l.add(t)}),n.union(l)},new t.SortedSet);o.push(u)},this);var a=o.reduce(function(t,e){return t.intersect(e)});return a.map(function(t){return{ref:t,score:i.similarity(this.documentVector(t))}},this).sort(function(t,e){return e.score-t.score})},t.Index.prototype.documentVector=function(e){for(var n=this.documentStore.get(e),i=n.length,o=new t.Vector,r=0;i>r;r++){var s=n.elements[r],a=this.tokenStore.get(s)[e].tf,h=this.idf(s);o.insert(this.corpusTokens.indexOf(s),a*h)}return o},t.Index.prototype.toJSON=function(){return{version:t.version,fields:this._fields,ref:this._ref,documentStore:this.documentStore.toJSON(),tokenStore:this.tokenStore.toJSON(),corpusTokens:this.corpusTokens.toJSON(),pipeline:this.
 pipeline.toJSON()}},t.Index.prototype.use=function(t){var e=Array.prototype.slice.call(arguments,1);e.unshift(this),t.apply(this,e)},t.Store=function(){this.store={},this.length=0},t.Store.load=function(e){var n=new this;return n.length=e.length,n.store=Object.keys(e.store).reduce(function(n,i){return n[i]=t.SortedSet.load(e.store[i]),n},{}),n},t.Store.prototype.set=function(t,e){this.has(t)||this.length++,this.store[t]=e},t.Store.prototype.get=function(t){return this.store[t]},t.Store.prototype.has=function(t){return t in this.store},t.Store.prototype.remove=function(t){this.has(t)&&(delete this.store[t],this.length--)},t.Store.prototype.toJSON=function(){return{store:this.store,length:this.length}},t.stemmer=function(){var t={ational:"ate",tional:"tion",enci:"ence",anci:"ance",izer:"ize",bli:"ble",alli:"al",entli:"ent",eli:"e",ousli:"ous",ization:"ize",ation:"ate",ator:"ate",alism:"al",iveness:"ive",fulness:"ful",ousness:"ous",aliti:"al",iviti:"ive",biliti:"ble",logi:"log"},e={ica
 te:"ic",ative:"",alize:"al",iciti:"ic",ical:"ic",ful:"",ness:""},n="[^aeiou]",i="[aeiouy]",o=n+"[^aeiouy]*",r=i+"[aeiou]*",s="^("+o+")?"+r+o,a="^("+o+")?"+r+o+"("+r+")?$",h="^("+o+")?"+r+o+r+o,u="^("+o+")?"+i,l=new RegExp(s),c=new RegExp(h),p=new RegExp(a),f=new RegExp(u),d=/^(.+?)(ss|i)es$/,v=/^(.+?)([^s])s$/,m=/^(.+?)eed$/,g=/^(.+?)(ed|ing)$/,y=/.$/,S=/(at|bl|iz)$/,w=new RegExp("([^aeiouylsz])\\1$"),x=new RegExp("^"+o+i+"[^aeiouwxy]$"),k=/^(.+?[^aeiou])y$/,b=/^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/,E=/^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/,_=/^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/,O=/^(.+?)(s|t)(ion)$/,F=/^(.+?)e$/,P=/ll$/,T=new RegExp("^"+o+i+"[^aeiouwxy]$"),$=function(n){var i,o,r,s,a,h,u;if(n.length<3)return n;if(r=n.substr(0,1),"y"==r&&(n=r.toUpperCase()+n.substr(1)),s=d,a=v,s.test(n)?n=n.replace(s,"$1$2"):a.test(n)&&(n=n.repl
 ace(a,"$1$2")),s=m,a=g,s.test(n)){var $=s.exec(n);s=l,s.test($[1])&&(s=y,n=n.replace(s,""))}else if(a.test(n)){var $=a.exec(n);i=$[1],a=f,a.test(i)&&(n=i,a=S,h=w,u=x,a.test(n)?n+="e":h.test(n)?(s=y,n=n.replace(s,"")):u.test(n)&&(n+="e"))}if(s=k,s.test(n)){var $=s.exec(n);i=$[1],n=i+"i"}if(s=b,s.test(n)){var $=s.exec(n);i=$[1],o=$[2],s=l,s.test(i)&&(n=i+t[o])}if(s=E,s.test(n)){var $=s.exec(n);i=$[1],o=$[2],s=l,s.test(i)&&(n=i+e[o])}if(s=_,a=O,s.test(n)){var $=s.exec(n);i=$[1],s=c,s.test(i)&&(n=i)}else if(a.test(n)){var $=a.exec(n);i=$[1]+$[2],a=c,a.test(i)&&(n=i)}if(s=F,s.test(n)){var $=s.exec(n);i=$[1],s=c,a=p,h=T,(s.test(i)||a.test(i)&&!h.test(i))&&(n=i)}return s=P,a=c,s.test(n)&&a.test(n)&&(s=y,n=n.replace(s,"")),"y"==r&&(n=r.toLowerCase()+n.substr(1)),n};return $}(),t.Pipeline.registerFunction(t.stemmer,"stemmer"),t.stopWordFilter=function(e){return-1===t.stopWordFilter.stopWords.indexOf(e)?e:void 0},t.stopWordFilter.stopWords=new t.SortedSet,t.stopWordFilter.stopWords.length=119
 ,t.stopWordFilter.stopWords.elements=["","a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"],t.Pipeline.registerFunction(t.stopWordFilter,"stopWordFilter"),t.trimmer=function(t){return t.replace(/^\W+/,"").replace(/\W+$/,"")},t.Pipeline.registerFunction(t.tr
 immer,"trimmer"),t.TokenStore=function(){this.root={docs:{}},this.length=0},t.TokenStore.load=function(t){var e=new this;return e.root=t.root,e.length=t.length,e},t.TokenStore.prototype.add=function(t,e,n){var n=n||this.root,i=t[0],o=t.slice(1);return i in n||(n[i]={docs:{}}),0===o.length?(n[i].docs[e.ref]=e,void(this.length+=1)):this.add(o,e,n[i])},t.TokenStore.prototype.has=function(t){if(!t)return!1;for(var e=this.root,n=0;n<t.length;n++){if(!e[t[n]])return!1;e=e[t[n]]}return!0},t.TokenStore.prototype.getNode=function(t){if(!t)return{};for(var e=this.root,n=0;n<t.length;n++){if(!e[t[n]])return{};e=e[t[n]]}return e},t.TokenStore.prototype.get=function(t,e){return this.getNode(t,e).docs||{}},t.TokenStore.prototype.count=function(t,e){return Object.keys(this.get(t,e)).length},t.TokenStore.prototype.remove=function(t,e){if(t){for(var n=this.root,i=0;i<t.length;i++){if(!(t[i]in n))return;n=n[t[i]]}delete n.docs[e]}},t.TokenStore.prototype.expand=function(t,e){var n=this.getNode(t),i=n
 .docs||{},e=e||[];return Object.keys(i).length&&e.push(t),Object.keys(n).forEach(function(n){"docs"!==n&&e.concat(this.expand(t+n,e))},this),e},t.TokenStore.prototype.toJSON=function(){return{root:this.root,length:this.length}},function(t,e){"function"==typeof define&&define.amd?define(e):"object"==typeof exports?module.exports=e():t.lunr=e()}(this,function(){return t})}();


[2/6] apex-site git commit: Update apex-3.4 documentation from master to include security changes and development best practices.

Posted by th...@apache.org.
http://git-wip-us.apache.org/repos/asf/apex-site/blob/21e76a00/docs/apex-3.4/mkdocs/search_index.json
----------------------------------------------------------------------
diff --git a/docs/apex-3.4/mkdocs/search_index.json b/docs/apex-3.4/mkdocs/search_index.json
index 3512a2f..611f195 100644
--- a/docs/apex-3.4/mkdocs/search_index.json
+++ b/docs/apex-3.4/mkdocs/search_index.json
@@ -12,7 +12,7 @@
         }, 
         {
             "location": "/apex_development_setup/", 
-            "text": "Apache Apex Development Environment Setup\n\n\nThis document discusses the steps needed for setting up a development environment for creating applications that run on the Apache Apex platform.\n\n\nDevelopment Tools\n\n\nThere are a few tools that will be helpful when developing Apache Apex applications, including:\n\n\n\n\n\n\ngit\n - A revision control system (version 1.7.1 or later). There are multiple git clients available for Windows (\nhttp://git-scm.com/download/win\n for example), so download and install a client of your choice.\n\n\n\n\n\n\njava JDK\n (not JRE) - Includes the Java Runtime Environment as well as the Java compiler and a variety of tools (version 1.7.0_79 or later). Can be downloaded from the Oracle website.\n\n\n\n\n\n\nmaven\n - Apache Maven is a build system for Java projects (version 3.0.5 or later). It can be downloaded from \nhttps://maven.apache.org/download.cgi\n.\n\n\n\n\n\n\nIDE\n (Optional) - If you prefer to use an IDE (Integra
 ted Development Environment) such as \nNetBeans\n, \nEclipse\n or \nIntelliJ\n, install that as well.\n\n\n\n\n\n\nAfter installing these tools, make sure that the directories containing the executable files are in your PATH environment variable.\n\n\n\n\nWindows\n - Open a console window and enter the command \necho %PATH%\n to see the value of the \nPATH\n variable and verify that the above directories for Java, git, and maven executables are present.  JDK executables like \njava\n and \njavac\n, the directory might be something like \nC:\\Program Files\\Java\\jdk1.7.0\\_80\\bin\n; for \ngit\n it might be \nC:\\Program Files\\Git\\bin\n; and for maven it might be \nC:\\Users\\user\\Software\\apache-maven-3.3.3\\bin\n.  If not, you can change its value clicking on the button at \nControl Panel\n \n \nAdvanced System Settings\n \n \nAdvanced tab\n \n \nEnvironment Variables\n.\n\n\nLinux and Mac\n - Open a console/terminal window and enter the command \necho $PATH\n to see the value
  of the \nPATH\n variable and verify that the above directories for Java, git, and maven executables are present.  If not, make sure software is downloaded and installed, and optionally PATH reference is added and exported  in a \n~/.profile\n or \n~/.bash_profile\n.  For example to add maven located in \n/sfw/maven/apache-maven-3.3.3\n to PATH add the line: \nexport PATH=$PATH:/sfw/maven/apache-maven-3.3.3/bin\n\n\n\n\nConfirm by running the following commands and comparing with output that show in the table below:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCommand\n\n\nOutput\n\n\n\n\n\n\njavac -version\n\n\njavac 1.7.0_80\n\n\n\n\n\n\njava -version\n\n\njava version \n1.7.0_80\n\n\nJava(TM) SE Runtime Environment (build 1.7.0_80-b15)\n\n\nJava HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)\n\n\n\n\n\n\ngit --version\n\n\ngit version 2.6.1.windows.1\n\n\n\n\n\n\nmvn --version\n\n\nApache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T06:57:37-05:00)\n\n\n...\n
 \n\n\n\n\n\n\n\n\n\n\nCreating New Apex Project\n\n\nAfter development tools are configured, you can now use the maven archetype to create a basic Apache Apex project.  \nNote:\n When executing the commands below, replace \n3.4.0\n by \nlatest available version\n of Apache Apex.\n\n\n\n\n\n\nWindows\n - Create a new Windows command file called \nnewapp.cmd\n by copying the lines below, and execute it.  When you run this file, the properties will be displayed and you will be prompted with \nY: :\n; just press \nEnter\n to complete the project generation.  The caret (^) at the end of some lines indicates that a continuation line follows. \n\n\n@echo off\n@rem Script for creating a new application\nsetlocal\nmvn archetype:generate ^\n -DarchetypeGroupId=org.apache.apex ^\n -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 ^\n -DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp ^\n -Dversion=1.0-SNAPSHOT\nendlocal\n\n\n\n\n\n\n\nLinux\n - Execute
  the lines below in a terminal window.  New project will be created in the curent working directory.  The backslash (\\) at the end of the lines indicates continuation.\n\n\nmvn archetype:generate \\\n -DarchetypeGroupId=org.apache.apex \\\n -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 \\\n -DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp \\\n -Dversion=1.0-SNAPSHOT\n\n\n\n\n\n\n\nWhen the run completes successfully, you should see a new directory named \nmyapexapp\n containing a maven project for building a basic Apache Apex application. It includes 3 source files:\nApplication.java\n,  \nRandomNumberGenerator.java\n and \nApplicationTest.java\n. You can now build the application by stepping into the new directory and running the maven package command:\n\n\ncd myapexapp\nmvn clean package -DskipTests\n\n\n\nThe build should create the application package file \nmyapexapp/target/myapexapp-1.0-SNAPSHOT.apa\n. This application package c
 an then be used to launch example application via \napex\n CLI, or other visual management tools.  When running, this application will generate a stream of random numbers and print them out, each prefixed by the string \nhello world:\n.\n\n\nRunning Unit Tests\n\n\nTo run unit tests on Linux or OSX, simply run the usual maven command, for example: \nmvn test\n.\n\n\nOn Windows, an additional file, \nwinutils.exe\n, is required; download it from\n\nhttps://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip\n\nand unpack the archive to, say, \nC:\\hadoop\n; this file should be present under\n\nhadoop-common-2.2.0-bin-master\\bin\n within it.\n\n\nSet the \nHADOOP_HOME\n environment variable system-wide to\n\nc:\\hadoop\\hadoop-common-2.2.0-bin-master\n as described at:\n\nhttps://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/sysdm_advancd_environmnt_addchange_variable.mspx?mfr=true\n. You should now be able to run unit tests normally.\n\n\nIf you 
 prefer not to set the variable globally, you can set it on the command line or within\nyour IDE. For example, on the command line, specify the maven\nproperty \nhadoop.home.dir\n:\n\n\nmvn -Dhadoop.home.dir=c:\\hadoop\\hadoop-common-2.2.0-bin-master test\n\n\n\nor set the environment variable separately:\n\n\nset HADOOP_HOME=c:\\hadoop\\hadoop-common-2.2.0-bin-master\nmvn test\n\n\n\nWithin your IDE, set the environment variable and then run the desired\nunit test in the usual way. For example, with NetBeans you can add:\n\n\nEnv.HADOOP_HOME=c:/hadoop/hadoop-common-2.2.0-bin-master\n\n\n\nat \nProperties \n Actions \n Run project \n Set Properties\n.\n\n\nSimilarly, in Eclipse (Mars) add it to the\nproject properties at \nProperties \n Run/Debug Settings \n ApplicationTest\n\n Environment\n tab.\n\n\nBuilding Apex Demos\n\n\nIf you want to see more substantial Apex demo applications and the associated source code, you can follow these simple steps to check out and build them.\n\n\n\
 n\n\n\nCheck out the source code repositories:\n\n\ngit clone https://github.com/apache/incubator-apex-core\ngit clone https://github.com/apache/incubator-apex-malhar\n\n\n\n\n\n\n\nSwitch to the appropriate release branch and build each repository:\n\n\ncd incubator-apex-core\nmvn clean install -DskipTests\n\ncd incubator-apex-malhar\nmvn clean install -DskipTests\n\n\n\n\n\n\n\nThe \ninstall\n argument to the \nmvn\n command installs resources from each project to your local maven repository (typically \n.m2/repository\n under your home directory), and \nnot\n to the system directories, so Administrator privileges are not required. The  \n-DskipTests\n argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.\n\n\nAfter the build completes, you should see the demo application package files in the target directory under each demo subdirect
 ory in \nincubator-apex-malhar/demos\n.\n\n\nSandbox\n\n\nTo jump start development with an Apache Hadoop single node cluster, \nDataTorrent Sandbox\n powered by VirtualBox is available on Windows, Linux, or Mac platforms.  The sandbox is configured by default to run with 6GB RAM; if your development machine has 16GB or more, you can increase the sandbox RAM to 8GB or more using the VirtualBox console.  This will yield better performance and support larger applications.  The advantage of developing in the sandbox is that most of the tools (e.g. \njdk\n, \ngit\n, \nmaven\n), Hadoop YARN and HDFS, and a distribution of Apache Apex and DataTorrent RTS are pre-installed.  The disadvantage is that the sandbox is a memory-limited environment, and requires settings changes and restarts to adjust memory available for development and testing.", 
+            "text": "Apache Apex Development Environment Setup\n\n\nThis document discusses the steps needed for setting up a development environment for creating applications that run on the Apache Apex platform.\n\n\nDevelopment Tools\n\n\nThere are a few tools that will be helpful when developing Apache Apex applications, including:\n\n\n\n\n\n\ngit\n - A revision control system (version 1.7.1 or later). There are multiple git clients available for Windows (\nhttp://git-scm.com/download/win\n for example), so download and install a client of your choice.\n\n\n\n\n\n\njava JDK\n (not JRE) - Includes the Java Runtime Environment as well as the Java compiler and a variety of tools (version 1.7.0_79 or later). Can be downloaded from the Oracle website.\n\n\n\n\n\n\nmaven\n - Apache Maven is a build system for Java projects (version 3.0.5 or later). It can be downloaded from \nhttps://maven.apache.org/download.cgi\n.\n\n\n\n\n\n\nIDE\n (Optional) - If you prefer to use an IDE (Integra
 ted Development Environment) such as \nNetBeans\n, \nEclipse\n or \nIntelliJ\n, install that as well.\n\n\n\n\n\n\nAfter installing these tools, make sure that the directories containing the executable files are in your PATH environment variable.\n\n\n\n\nWindows\n - Open a console window and enter the command \necho %PATH%\n to see the value of the \nPATH\n variable and verify that the above directories for Java, git, and maven executables are present.  JDK executables like \njava\n and \njavac\n, the directory might be something like \nC:\\Program Files\\Java\\jdk1.7.0\\_80\\bin\n; for \ngit\n it might be \nC:\\Program Files\\Git\\bin\n; and for maven it might be \nC:\\Users\\user\\Software\\apache-maven-3.3.3\\bin\n.  If not, you can change its value clicking on the button at \nControl Panel\n \n \nAdvanced System Settings\n \n \nAdvanced tab\n \n \nEnvironment Variables\n.\n\n\nLinux and Mac\n - Open a console/terminal window and enter the command \necho $PATH\n to see the value
  of the \nPATH\n variable and verify that the above directories for Java, git, and maven executables are present.  If not, make sure software is downloaded and installed, and optionally PATH reference is added and exported  in a \n~/.profile\n or \n~/.bash_profile\n.  For example to add maven located in \n/sfw/maven/apache-maven-3.3.3\n to PATH add the line: \nexport PATH=$PATH:/sfw/maven/apache-maven-3.3.3/bin\n\n\n\n\nConfirm by running the following commands and comparing with output that show in the table below:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCommand\n\n\nOutput\n\n\n\n\n\n\njavac -version\n\n\njavac 1.7.0_80\n\n\n\n\n\n\njava -version\n\n\njava version \n1.7.0_80\n\n\nJava(TM) SE Runtime Environment (build 1.7.0_80-b15)\n\n\nJava HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)\n\n\n\n\n\n\ngit --version\n\n\ngit version 2.6.1.windows.1\n\n\n\n\n\n\nmvn --version\n\n\nApache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T06:57:37-05:00)\n\n\n...\n
 \n\n\n\n\n\n\n\n\n\n\nCreating New Apex Project\n\n\nAfter development tools are configured, you can now use the maven archetype to create a basic Apache Apex project.  \nNote:\n When executing the commands below, replace \n3.4.0\n by \nlatest available version\n of Apache Apex.\n\n\n\n\n\n\nWindows\n - Create a new Windows command file called \nnewapp.cmd\n by copying the lines below, and execute it.  When you run this file, the properties will be displayed and you will be prompted with \nY: :\n; just press \nEnter\n to complete the project generation.  The caret (^) at the end of some lines indicates that a continuation line follows. \n\n\n@echo off\n@rem Script for creating a new application\nsetlocal\nmvn archetype:generate ^\n -DarchetypeGroupId=org.apache.apex ^\n -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 ^\n -DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp ^\n -Dversion=1.0-SNAPSHOT\nendlocal\n\n\n\n\n\n\n\nLinux\n - Execute
  the lines below in a terminal window.  New project will be created in the curent working directory.  The backslash (\\) at the end of the lines indicates continuation.\n\n\nmvn archetype:generate \\\n -DarchetypeGroupId=org.apache.apex \\\n -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 \\\n -DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp \\\n -Dversion=1.0-SNAPSHOT\n\n\n\n\n\n\n\nWhen the run completes successfully, you should see a new directory named \nmyapexapp\n containing a maven project for building a basic Apache Apex application. It includes 3 source files:\nApplication.java\n,  \nRandomNumberGenerator.java\n and \nApplicationTest.java\n. You can now build the application by stepping into the new directory and running the maven package command:\n\n\ncd myapexapp\nmvn clean package -DskipTests\n\n\n\nThe build should create the application package file \nmyapexapp/target/myapexapp-1.0-SNAPSHOT.apa\n. This application package c
 an then be used to launch example application via \napex\n CLI, or other visual management tools.  When running, this application will generate a stream of random numbers and print them out, each prefixed by the string \nhello world:\n.\n\n\nRunning Unit Tests\n\n\nTo run unit tests on Linux or OSX, simply run the usual maven command, for example: \nmvn test\n.\n\n\nOn Windows, an additional file, \nwinutils.exe\n, is required; download it from\n\nhttps://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip\n\nand unpack the archive to, say, \nC:\\hadoop\n; this file should be present under\n\nhadoop-common-2.2.0-bin-master\\bin\n within it.\n\n\nSet the \nHADOOP_HOME\n environment variable system-wide to\n\nc:\\hadoop\\hadoop-common-2.2.0-bin-master\n as described at:\n\nhttps://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/sysdm_advancd_environmnt_addchange_variable.mspx?mfr=true\n. You should now be able to run unit tests normally.\n\n\nIf you 
 prefer not to set the variable globally, you can set it on the command line or within\nyour IDE. For example, on the command line, specify the maven\nproperty \nhadoop.home.dir\n:\n\n\nmvn -Dhadoop.home.dir=c:\\hadoop\\hadoop-common-2.2.0-bin-master test\n\n\n\nor set the environment variable separately:\n\n\nset HADOOP_HOME=c:\\hadoop\\hadoop-common-2.2.0-bin-master\nmvn test\n\n\n\nWithin your IDE, set the environment variable and then run the desired\nunit test in the usual way. For example, with NetBeans you can add:\n\n\nEnv.HADOOP_HOME=c:/hadoop/hadoop-common-2.2.0-bin-master\n\n\n\nat \nProperties \n Actions \n Run project \n Set Properties\n.\n\n\nSimilarly, in Eclipse (Mars) add it to the\nproject properties at \nProperties \n Run/Debug Settings \n ApplicationTest\n\n Environment\n tab.\n\n\nBuilding Apex Demos\n\n\nIf you want to see more substantial Apex demo applications and the associated source code, you can follow these simple steps to check out and build them.\n\n\n\
 n\n\n\nCheck out the source code repositories:\n\n\ngit clone https://github.com/apache/apex-core\ngit clone https://github.com/apache/apex-malhar\n\n\n\n\n\n\n\nSwitch to the appropriate release branch and build each repository:\n\n\ncd apex-core\nmvn clean install -DskipTests\n\ncd apex-malhar\nmvn clean install -DskipTests\n\n\n\n\n\n\n\nThe \ninstall\n argument to the \nmvn\n command installs resources from each project to your local maven repository (typically \n.m2/repository\n under your home directory), and \nnot\n to the system directories, so Administrator privileges are not required. The  \n-DskipTests\n argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.\n\n\nAfter the build completes, you should see the demo application package files in the target directory under each demo subdirectory in \napex-malhar/demos\n.\n\n\nSandb
 ox\n\n\nTo jump start development with an Apache Hadoop single node cluster, \nDataTorrent Sandbox\n powered by VirtualBox is available on Windows, Linux, or Mac platforms.  The sandbox is configured by default to run with 6GB RAM; if your development machine has 16GB or more, you can increase the sandbox RAM to 8GB or more using the VirtualBox console.  This will yield better performance and support larger applications.  The advantage of developing in the sandbox is that most of the tools (e.g. \njdk\n, \ngit\n, \nmaven\n), Hadoop YARN and HDFS, and a distribution of Apache Apex and DataTorrent RTS are pre-installed.  The disadvantage is that the sandbox is a memory-limited environment, and requires settings changes and restarts to adjust memory available for development and testing.", 
             "title": "Development Setup"
         }, 
         {
@@ -37,7 +37,7 @@
         }, 
         {
             "location": "/apex_development_setup/#building-apex-demos", 
-            "text": "If you want to see more substantial Apex demo applications and the associated source code, you can follow these simple steps to check out and build them.    Check out the source code repositories:  git clone https://github.com/apache/incubator-apex-core\ngit clone https://github.com/apache/incubator-apex-malhar    Switch to the appropriate release branch and build each repository:  cd incubator-apex-core\nmvn clean install -DskipTests\n\ncd incubator-apex-malhar\nmvn clean install -DskipTests    The  install  argument to the  mvn  command installs resources from each project to your local maven repository (typically  .m2/repository  under your home directory), and  not  to the system directories, so Administrator privileges are not required. The   -DskipTests  argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.  A
 fter the build completes, you should see the demo application package files in the target directory under each demo subdirectory in  incubator-apex-malhar/demos .", 
+            "text": "If you want to see more substantial Apex demo applications and the associated source code, you can follow these simple steps to check out and build them.    Check out the source code repositories:  git clone https://github.com/apache/apex-core\ngit clone https://github.com/apache/apex-malhar    Switch to the appropriate release branch and build each repository:  cd apex-core\nmvn clean install -DskipTests\n\ncd apex-malhar\nmvn clean install -DskipTests    The  install  argument to the  mvn  command installs resources from each project to your local maven repository (typically  .m2/repository  under your home directory), and  not  to the system directories, so Administrator privileges are not required. The   -DskipTests  argument skips running unit tests since they take a long time. If this is a first-time installation, it might take several minutes to complete because maven will download a number of associated plugins.  After the build completes, you should see
  the demo application package files in the target directory under each demo subdirectory in  apex-malhar/demos .", 
             "title": "Building Apex Demos"
         }, 
         {
@@ -821,6 +821,71 @@
             "title": "System Metrics"
         }, 
         {
+            "location": "/development_best_practices/", 
+            "text": "Development Best Practices\n\n\nThis document describes the best practices to follow when developing operators and other application components such as partitoners, stream codecs etc on the Apache Apex platform.\n\n\nOperators\n\n\nThese are general guidelines for all operators that are covered in the current section. The subsequent sections talk about special considerations for input and output operators.\n\n\n\n\nWhen writing a new operator to be used in an application, consider breaking it down into\n\n\nAn abstract operator that encompasses the core functionality but leaves application specific schemas and logic to the implementation.\n\n\nAn optional concrete operator also in the library that extends the abstract operator and provides commonly used schema types such as strings, byte[] or POJOs.\n\n\n\n\n\n\nFollow these conventions for the life cycle methods:\n\n\nDo one time initialization of entities that apply for the entire lifetime of the operator in t
 he \nsetup\n method, e.g., factory initializations. Initializations in \nsetup\n are done in the container where the operator is deployed. Allocating memory for fields in the constructor is not efficient as it would lead to extra garbage in memory for the following reason. The operator is instantiated on the client from where the application is launched, serialized and started one of the Hadoop nodes in a container. So the constructor is first called on the client and if it were to initialize any of the fields, that state would be saved during serialization. In the Hadoop container the operator is deserialized and started. This would invoke the constructor again, which will initialize the fields but their state will get overwritten by the serialized state and the initial values would become garbage in memory.\n\n\nDo one time initialization for live entities in \nactivate\n method, e.g., opening connections to a database server or starting a thread for asynchronous operations. The \
 nactivate\n method is called right before processing starts so it is a better place for these initializations than at \nsetup\n which can lead to a delay before processing data from the live entity.  \n\n\nPerform periodic tasks based on processing time in application window boundaries.\n\n\nPerform initializations needed for each application window in \nbeginWindow\n.\n\n\nPerform aggregations needed for each application window  in \nendWindow\n.\n\n\nTeardown of live entities (inverse of tasks performed during activate) should be in the \ndeactivate\n method.\n\n\nTeardown of lifetime entities (those initialized in setup method) should happen in the \nteardown\n method.\n\n\nIf the operator implementation is not finalized mark it with the \n@Evolving\n annotation.\n\n\n\n\n\n\nIf the operator needs to perform operations based on event time of the individual tuples and not the processing time, extend and use the \nWindowedOperator\n. Refer to documentation of that operator for deta
 ils on how to use it.\n\n\nIf an operator needs to do some work when it is not receiving any input, it should implement \nIdleTimeHandler\n interface. This interface contains \nhandleIdleTime\n method which will be called whenever the platform isn\u2019t doing anything else and the operator can do the work in this method. If for any reason the operator does not have any work to do when this method is called, it should sleep for a small amount of time such as that specified by the \nSPIN_MILLIS\n attribute so that it does not cause a busy wait when called repeatedly by the platform. Also, the method should not block and return in a reasonable amount of time that is less than the streaming window size (which is 500ms by default).\n\n\nOften operators have customizable parameters such as information about locations of external systems or parameters that modify the behavior of the operator. Users should be able to specify these easily without having to change source code. This can be do
 ne by making them properties of the operator because they can then be initialized from external properties files.\n\n\nWhere possible default values should be provided for the properties in the source code.\n\n\nValidation rules should be specified for the properties using javax constraint validations that check whether the values specified for the properties are in the correct format, range or other operator requirements. Required properties should have at least a \n@NotNull\n validation specifying that they have to be specified by the user.\n\n\n\n\n\n\n\n\nCheckpointing\n\n\nCheckpointing is a process of snapshotting the state of an operator and saving it so that in case of failure the state can be used to restore the operator to a prior state and continue processing. It is automatically performed by the platform at a configurable interval. All operators in the application are checkpointed in a distributed fashion, thus allowing the entire state of the application to be saved and
  available for recovery if needed. Here are some things to remember when it comes to checkpointing:\n\n\n\n\nThe process of checkpointing involves snapshotting the state by serializing the operator and saving it to a store. This is done using a \nStorageAgent\n. By default a \nStorageAgent\n is already provided by the platform and it is called \nAsyncFSStorageAgent\n. It serializes the operator using Kryo and saves the serialized state asynchronously to a filesystem such as HDFS. There are other implementations of \nStorageAgent\n available such as \nGeodeKeyValueStorageAgent\n that stores the serialized state in Geode which is an in-memory replicated data grid.\n\n\nAll variables in the operator marked neither transient nor final are saved so any variables in the operator that are not part of the state should be marked transient. Specifically any variables like connection objects, i/o streams, ports are transient, because they need to be setup again on failure recovery.\n\n\nIf the
  operator does not keep any state between windows, mark it with the \n@Stateless\n annotation. This results in efficiencies during checkpointing and recovery. The operator will not be checkpointed and is always restored to the initial state\n\n\nThe checkpoint interval can be set using the \nCHECKPOINT_WINDOW_COUNT\n attribute which specifies the interval in terms of number of streaming windows.\n\n\nIf the correct functioning of the operator requires the \nendWindow\n method be called before checkpointing can happen, then the checkpoint interval should align with application window interval i.e., it should be a multiple of application window interval. In this case the operator should be marked with \nOperatorAnnotation\n and \ncheckpointableWithinAppWindow\n set to false. If the window intervals are configured by the user and they don\u2019t align, it will result in a DAG validation error and application won\u2019t launch.\n\n\nIn some cases the operator state related to a piece of
  data needs to be purged once that data is no longer required by the application, otherwise the state will continue to build up indefinitely. The platform provides a way to let the operator know about this using a callback listener called \nCheckpointNotificationListener\n. This listener has a callback method called \ncommitted\n, which is called by the platform from time to time with a window id that has been processed successfully by all the operators in the DAG and hence is no longer needed. The operator can delete all the state corresponding to window ids less than or equal to the provided window id.\n\n\nSometimes operators need to perform some tasks just before checkpointing. For example, filesystem operators may want to flush the files just before checkpoint so they can be sure that all pending data is written to disk and no data is lost if there is an operator failure just after the checkpoint and the operator restarts from the checkpoint. To do this the operator would imple
 ment the same \nCheckpointNotificationListener\n interface and implement the \nbeforeCheckpoint\n method where it can do these tasks.\n\n\nIf the operator is going to have a large state, checkpointing the entire state each time becomes unviable. Furthermore, the amount of memory needed to hold the state could be larger than the amount of physical memory available. In these cases the operator should checkpoint the state incrementally and also manage the memory for the state more efficiently. The platform provides a utiltiy called \nManagedState\n that uses a combination of in memory and disk cache to efficiently store and retrieve data in a performant, fault tolerant way and also checkpoint it in an incremental fashion. There are operators in the platform that use \nManagedState\n and can be used as a reference on how to use this utility such as Dedup or Join operators.\n\n\n\n\nInput Operators\n\n\nInput operators have additional requirements:\n\n\n\n\nThe \nemitTuples\n method impl
 emented by the operator, is called by the platform, to give the operator an opportunity to emit some data. This method is always called within a window boundary but can be called multiple times within the same window. There are some important guidelines on how to implement this method:\n\n\nThis should not be a blocking method and should return in a reasonable time that is less than the streaming window size (which is 500ms by default). This also applies to other callback methods called by the platform such as \nbeginWindow\n, \nendWindow\n etc., but is more important here since this method will be called continuously by the platform.\n\n\nIf the operator needs to interact with external systems to obtain data and this can potentially take a long time, then this should be performed asynchronously in a different thread. Refer to the threading section below for the guidelines when using threading.\n\n\nIn each invocation, the method can emit any number of data tuples.\n\n\n\n\n\n\n\n\n
 Idempotence\n\n\nMany applications write data to external systems using output operators. To ensure that data is present exactly once in the external system even in a failure recovery scenario, the output operators expect the replayed windows during recovery contain the same data as before the failure. This is called idempotency. Since operators within the DAG are merely responding to input data provided to them by the upstream operators and the input operator has no upstream operator, the responsibility of idempotent replay falls on the input operators.\n\n\n\n\nFor idempotent replay of data, the operator needs to store some meta-information for every window that would allow it to identify what data was sent in that window. This is called the idempotent state.\n\n\nIf the external source of the input operator allows replayability, this could be information such as offset of last piece of data in the window, an identifier of the last piece of data itself or number of data tuples sen
 t.\n\n\nHowever if the external source does not allow replayability from an operator specified point, then the entire data sent within the window may need to be persisted by the operator.\n\n\n\n\n\n\nThe platform provides a utility called \nWindowDataManager\n to allow operators to save and retrieve idempotent state every window. Operators should use this to implement idempotency.\n\n\n\n\nOutput Operators\n\n\nOutput operators typically connect to external storage systems such as filesystems, databases or key value stores to store data.\n\n\n\n\nIn some situations, the external systems may not be functioning in a reliable fashion. They may be having prolonged outages or performance problems. If the operator is being designed to work in such environments, it needs to be able to to handle these problems gracefully and not block the DAG or fail. In these scenarios the operator should cache the data into a local store such as HDFS and interact with external systems in a separate threa
 d so as to not have problems in the operator lifecycle thread. This pattern is called the \nReconciler\n pattern and there are operators that implement this pattern available in the library for reference.\n\n\n\n\nEnd-to-End Exactly Once\n\n\nWhen output operators store data in external systems, it is important that they do not lose data or write duplicate data when there is a failure event and the DAG recovers from that failure. In failure recovery, the windows from the previous checkpoint are replayed and the operator receives this data again. The operator should ensure that it does not write this data again. Operator developers should figure out how to do this specifically for the operators they are developing depending on the logic of the operators. Below are examples of how a couple of existing output operators do this for reference.\n\n\n\n\nFile output operator that writes data to files keeps track of the file lengths in the state. These lengths are checkpointed and restored 
 on failure recovery. On restart, the operator truncates the file to the length equal to the length in the recovered state. This makes the data in the file same as it was at the time of checkpoint before the failure. The operator now writes the replayed data from the checkpoint in regular fashion as any other data. This ensures no data is lost or duplicated in the file.\n\n\nThe JDBC output operator that writes data to a database table writes the data in a window in a single transaction. It also writes the current window id into a meta table along with the data as part of the same transaction. It commits the transaction at the end of the window. When there is an operator failure before the final commit, the state of the database is that it contains the data from the previous fully processed window and its window id since the current window transaction isn\u2019t yet committed. On recovery, the operator reads this window id back from the meta table. It ignores all the replayed windows
  whose window id is less than or equal to the recovered window id and thus ensures that it does not duplicate data already present in the database. It starts writing data normally again when window id of data becomes greater than recovered window thus ensuring no data is lost.\n\n\n\n\nPartitioning\n\n\nPartitioning allows an operation to be scaled to handle more pieces of data than before but with a similar SLA. This is done by creating multiple instances of an operator and distributing the data among them. Input operators can also be partitioned to stream more pieces of data into the application. The platform provides a lot of flexibility and options for partitioning. Partitioning can happen once at startup or can be dynamically changed anytime while the application is running, and it can be done in a stateless or stateful way by distributing state from the old partitions to new partitions.\n\n\nIn the platform, the responsibility for partitioning is shared among different entitie
 s. These are:\n\n\n\n\nA \npartitioner\n that specifies \nhow\n to partition the operator, specifically it takes an old set of partitions and creates a new set of partitions. At the start of the application the old set has one partition and the partitioner can return more than one partitions to start the application with multiple partitions. The partitioner can have any custom JAVA logic to determine the number of new partitions, set their initial state as a brand new state or derive it from the state of the old partitions. It also specifies how the data gets distributed among the new partitions. The new set doesn't have to contain only new partitions, it can carry over some old partitions if desired.\n\n\nAn optional \nstatistics (stats) listener\n that specifies \nwhen\n to partition. The reason it is optional is that it is needed only when dynamic partitioning is needed. With the stats listener, the stats can be used to determine when to partition.\n\n\nIn some cases the \noperat
 or\n itself should be aware of partitioning and would need to provide supporting code.\n\n\nIn case of input operators each partition should have a property or a set of properties that allow it to distinguish itself from the other partitions and fetch unique data.\n\n\n\n\n\n\nWhen an operator that was originally a single instance is split into multiple partitions with each partition working on a subset of data, the results of the partitions may need to be combined together to compute the final result. The combining logic would depend on the logic of the operator. This would be specified by the developer using a \nUnifier\n, which is deployed as another operator by the platform. If no \nUnifier\n is specified, the platform inserts a \ndefault unifier\n that merges the results of the multiple partition streams into a single stream. Each output port can have a different \nUnifier\n and this is specified by returning the corresponding \nUnifier\n in the \ngetUnifier\n method of the out
 put port. The operator developer should provide a custom \nUnifier\n wherever applicable.\n\n\nThe Apex \nengine\n that brings everything together and effects the partitioning.\n\n\n\n\nSince partitioning is critical for scalability of applications, operators must support it. There should be a strong reason for an operator to not support partitioning, such as, the logic performed by the operator not lending itself to parallelism. In order to support partitioning, an operator developer, apart from developing the functionality of the operator, may also need to provide a partitioner, stats listener and supporting code in the operator as described in the steps above. The next sections delve into this. \n\n\nOut of the box partitioning\n\n\nThe platform comes with some built-in partitioning utilities that can be used in certain scenarios.\n\n\n\n\n\n\nStatelessPartitioner\n provides a default partitioner, that can be used for an operator in certain conditions. If the operator satisfies t
 hese conditions, the partitioner can be specified for the operator with a simple setting and no other partitioning code is needed. The conditions are:\n\n\n\n\nNo dynamic partitioning is needed, see next point about dynamic partitioning. \n\n\nThere is no distinct initial state for the partitions, i.e., all partitions start with the same initial state submitted during application launch.\n\n\n\n\nTypically input or output operators do not fall into this category, although there are some exceptions. This partitioner is mainly used with operators that are in the middle of the DAG, after the input and before the output operators. When used with non-input operators, only the data for the first declared input port is distributed among the different partitions. All other input ports are treated as broadcast and all partitions receive all the data for that port.\n\n\n\n\n\n\nStatelessThroughputBasedPartitioner\n in Malhar provides a dynamic partitioner based on throughput thresholds. Simil
 arly \nStatelessLatencyBasedPartitioner\n provides a latency based dynamic partitioner in RTS. If these partitioners can be used, then separate partitioning related code is not needed. The conditions under which these can be used are:\n\n\n\n\nThere is no distinct initial state for the partitions.\n\n\nThere is no state being carried over by the operator from one window to the next i.e., operator is stateless.\n\n\n\n\n\n\n\n\nCustom partitioning\n\n\nIn many cases, operators don\u2019t satisfy the above conditions and a built-in partitioner cannot be used. Custom partitioning code needs to be written by the operator developer. Below are guidelines for it.\n\n\n\n\nSince the operator developer is providing a \npartitioner\n for the operator, the partitioning code should be added to the operator itself by making the operator implement the Partitioner interface and implementing the required methods, rather than creating a separate partitioner. The advantage is the user of the operator
  does not have to explicitly figure out the partitioner and set it for the operator but still has the option to override this built-in partitioner with a different one.\n\n\nThe \npartitioner\n is responsible for setting the initial state of the new partitions, whether it is at the start of the application or when partitioning is happening while the application is running as in the dynamic partitioning case. In the dynamic partitioning scenario, the partitioner needs to take the state from the old partitions and distribute it among the new partitions. It is important to note that apart from the checkpointed state the partitioner also needs to distribute idempotent state.\n\n\nThe \npartitioner\n interface has two methods, \ndefinePartitions\n and \npartitioned\n. The method \ndefinePartitons\n is first called to determine the new partitions, and if enough resources are available on the cluster, the \npartitioned\n method is called passing in the new partitions. This happens both dur
 ing initial partitioning and dynamic partitioning. If resources are not available, partitioning is abandoned and existing partitions continue to run untouched. This means that any processing intensive operations should be deferred to the \npartitioned\n call instead of doing them in \ndefinePartitions\n, as they may not be needed if there are not enough resources available in the cluster.\n\n\nThe \npartitioner\n, along with creating the new partitions, should also specify how the data gets distributed across the new partitions. It should do this by specifying a mapping called \nPartitionKeys\n for each partition that maps the data to that partition. This mapping needs to be specified for every input port in the operator. If the \npartitioner\n wants to use the standard mapping it can use a utility method called \nDefaultPartition.assignPartitionKeys\n.\n\n\nWhen the partitioner is scaling the operator up to more partitions, try to reuse the existing partitions and create new partit
 ions to augment the current set. The reuse can be achieved by the partitioner returning the current partitions unchanged. This will result in the current partitions continuing to run untouched.\n\n\nIn case of dynamic partitioning, as mentioned earlier, a stats listener is also needed to determine when to re-partition. Like the \nPartitioner\n interface, the operator can also implement the \nStatsListener\n interface to provide a stats listener implementation that will be automatically used.\n\n\nThe \nStatsListener\n has access to all operator statistics to make its decision on partitioning. Apart from the statistics that the platform computes for the operators such as throughput, latency etc, operator developers can include their own business metrics by using the AutoMetric feature.\n\n\nIf the operator is not partitionable, mark it so with \nOperatorAnnotation\n and \npartitionable\n element set to false.\n\n\n\n\nStreamCodecs\n\n\nA \nStreamCodec\n is used in partitioning to dis
 tribute the data tuples among the partitions. The \nStreamCodec\n computes an integer hashcode for a data tuple and this is used along with \nPartitionKeys\n mapping to determine which partition or partitions receive the data tuple. If a \nStreamCodec\n is not specified, then a default one is used by the platform which returns the JAVA hashcode of the tuple. \n\n\nStreamCodec\n is also useful in another aspect of the application. It is used to serialize and deserialize the tuple to transfer it between operators. The default \nStreamCodec\n uses Kryo library for serialization. \n\n\nThe following guidelines are useful when considering a custom \nStreamCodec\n\n\n\n\nA custom \nStreamCodec\n is needed if the tuples need to be distributed based on a criteria different from the hashcode of the tuple. If the correct working of an operator depends on the data from the upstream operator being distributed using a custom criteria such as being sticky on a \u201ckey\u201d field within the tup
 le, then a custom \nStreamCodec\n should be provided by the operator developer. This codec can implement the custom criteria. The operator should also return this custom codec in the \ngetStreamCodec\n method of the input port.\n\n\nWhen implementing a custom \nStreamCodec\n for the purpose of using a different criteria to distribute the tuples, the codec can extend an existing \nStreamCodec\n and implement the hashcode method, so that the codec does not have to worry about the serialization and deserialization functionality. The Apex platform provides two pre-built \nStreamCodec\n implementations for this purpose, one is \nKryoSerializableStreamCodec\n that uses Kryo for serialization and another one \nJavaSerializationStreamCodec\n that uses JAVA serialization.\n\n\nDifferent \nStreamCodec\n implementations can be used for different inputs in a stream with multiple inputs when different criteria of distributing the tuples is desired between the multiple inputs. \n\n\n\n\nThreads\n
 \n\nThe operator lifecycle methods such as \nsetup\n, \nbeginWindow\n, \nendWindow\n, \nprocess\n in \nInputPorts\n are all called from a single operator lifecycle thread, by the platform, unbeknownst to the user. So the user does not have to worry about dealing with the issues arising from multi-threaded code. Use of separate threads in an operator is discouraged because in most cases the motivation for this is parallelism, but parallelism can already be achieved by using multiple partitions and furthermore mistakes can be made easily when writing multi-threaded code. When dealing with high volume and velocity data, the corner cases with incorrectly written multi-threaded code are encountered more easily and exposed. However, there are times when separate threads are needed, for example, when interacting with external systems the delay in retrieving or sending data can be large at times, blocking the operator and other DAG processing such as committed windows. In these cases the fo
 llowing guidelines must be followed strictly.\n\n\n\n\nThreads should be started in \nactivate\n and stopped in \ndeactivate\n. In \ndeactivate\n the operator should wait till any threads it launched, have finished execution. It can do so by calling \njoin\n on the threads or if using \nExecutorService\n, calling \nawaitTermination\n on the service.\n\n\nThreads should not call any methods on the ports directly as this can cause concurrency exceptions and also result in invalid states.\n\n\nThreads can share state with the lifecycle methods using data structures that are either explicitly protected by synchronization or are inherently thread safe such as thread safe queues.\n\n\nIf this shared state needs to be protected against failure then it needs to be persisted during checkpoint. To have a consistent checkpoint, the state should not be modified by the thread when it is being serialized and saved by the operator lifecycle thread during checkpoint. Since the checkpoint process ha
 ppens outside the window boundary the thread should be quiesced between \nendWindow\n and \nbeginWindow\n or more efficiently between pre-checkpoint and checkpointed callbacks.", 
+            "title": "Best Practices"
+        }, 
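For the StreamCodec guidance in the page above, a minimal sketch of a key-sticky codec follows. KeyedTuple and its key field are illustrative placeholders; the codec extends the Malhar KryoSerializableStreamCodec named in the text, so only the partitioning criterion is overridden while serialization stays Kryo-based.

import com.datatorrent.lib.codec.KryoSerializableStreamCodec;

// Tuple type with the field used for sticky distribution (illustrative).
class KeyedTuple
{
  String key;
  String payload;
}

// All tuples with the same key hash identically, so they reach the same partition.
public class KeyStreamCodec extends KryoSerializableStreamCodec<KeyedTuple>
{
  @Override
  public int getPartition(KeyedTuple tuple)
  {
    return tuple.key.hashCode();
  }
}

An operator whose correctness depends on key-sticky input would then return an instance of this codec from the getStreamCodec method of its input port, as described in the text.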
+        {
+            "location": "/development_best_practices/#development-best-practices", 
+            "text": "This document describes the best practices to follow when developing operators and other application components such as partitoners, stream codecs etc on the Apache Apex platform.", 
+            "title": "Development Best Practices"
+        }, 
+        {
+            "location": "/development_best_practices/#operators", 
+            "text": "These are general guidelines for all operators that are covered in the current section. The subsequent sections talk about special considerations for input and output operators.   When writing a new operator to be used in an application, consider breaking it down into  An abstract operator that encompasses the core functionality but leaves application specific schemas and logic to the implementation.  An optional concrete operator also in the library that extends the abstract operator and provides commonly used schema types such as strings, byte[] or POJOs.    Follow these conventions for the life cycle methods:  Do one time initialization of entities that apply for the entire lifetime of the operator in the  setup  method, e.g., factory initializations. Initializations in  setup  are done in the container where the operator is deployed. Allocating memory for fields in the constructor is not efficient as it would lead to extra garbage in memory for the following
  reason. The operator is instantiated on the client from where the application is launched, serialized and started one of the Hadoop nodes in a container. So the constructor is first called on the client and if it were to initialize any of the fields, that state would be saved during serialization. In the Hadoop container the operator is deserialized and started. This would invoke the constructor again, which will initialize the fields but their state will get overwritten by the serialized state and the initial values would become garbage in memory.  Do one time initialization for live entities in  activate  method, e.g., opening connections to a database server or starting a thread for asynchronous operations. The  activate  method is called right before processing starts so it is a better place for these initializations than at  setup  which can lead to a delay before processing data from the live entity.    Perform periodic tasks based on processing time in application window bou
 ndaries.  Perform initializations needed for each application window in  beginWindow .  Perform aggregations needed for each application window  in  endWindow .  Teardown of live entities (inverse of tasks performed during activate) should be in the  deactivate  method.  Teardown of lifetime entities (those initialized in setup method) should happen in the  teardown  method.  If the operator implementation is not finalized mark it with the  @Evolving  annotation.    If the operator needs to perform operations based on event time of the individual tuples and not the processing time, extend and use the  WindowedOperator . Refer to documentation of that operator for details on how to use it.  If an operator needs to do some work when it is not receiving any input, it should implement  IdleTimeHandler  interface. This interface contains  handleIdleTime  method which will be called whenever the platform isn\u2019t doing anything else and the operator can do the work in this method. If fo
 r any reason the operator does not have any work to do when this method is called, it should sleep for a small amount of time such as that specified by the  SPIN_MILLIS  attribute so that it does not cause a busy wait when called repeatedly by the platform. Also, the method should not block and return in a reasonable amount of time that is less than the streaming window size (which is 500ms by default).  Often operators have customizable parameters such as information about locations of external systems or parameters that modify the behavior of the operator. Users should be able to specify these easily without having to change source code. This can be done by making them properties of the operator because they can then be initialized from external properties files.  Where possible default values should be provided for the properties in the source code.  Validation rules should be specified for the properties using javax constraint validations that check whether the values specified 
 for the properties are in the correct format, range or other operator requirements. Required properties should have at least a  @NotNull  validation specifying that they have to be specified by the user.", 
+            "title": "Operators"
+        }, 
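As a concrete illustration of the lifecycle conventions listed above, the following minimal sketch shows where each kind of initialization, per-window work and teardown belongs, and how a required, validated property is declared. The class, property and port names are illustrative, not part of the Apex library.

import javax.validation.constraints.NotNull;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.Operator;
import com.datatorrent.common.util.BaseOperator;

public class ExampleOperator extends BaseOperator
    implements Operator.ActivationListener<OperatorContext>
{
  @NotNull                              // required property, validated at launch
  private String serviceUrl;            // configurable from external properties files

  private long tuplesInWindow;          // checkpointed state: neither transient nor final
  private transient Object connection;  // live entity: rebuilt on recovery, so transient

  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String tuple)
    {
      tuplesInWindow++;
      output.emit(tuple);
    }
  };

  @Override
  public void setup(OperatorContext context)
  {
    // one-time initialization of lifetime entities (factories, buffers)
  }

  @Override
  public void activate(OperatorContext context)
  {
    // open connections or start helper threads just before processing begins
  }

  @Override
  public void beginWindow(long windowId)
  {
    tuplesInWindow = 0;                 // per-window initialization
  }

  @Override
  public void endWindow()
  {
    // per-window aggregation and emission of window results
  }

  @Override
  public void deactivate()
  {
    // close connections, stop helper threads (inverse of activate)
  }

  @Override
  public void teardown()
  {
    // release lifetime entities initialized in setup
  }

  public void setServiceUrl(String url) { this.serviceUrl = url; }
  public String getServiceUrl() { return serviceUrl; }
}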
+        {
+            "location": "/development_best_practices/#checkpointing", 
+            "text": "Checkpointing is a process of snapshotting the state of an operator and saving it so that in case of failure the state can be used to restore the operator to a prior state and continue processing. It is automatically performed by the platform at a configurable interval. All operators in the application are checkpointed in a distributed fashion, thus allowing the entire state of the application to be saved and available for recovery if needed. Here are some things to remember when it comes to checkpointing:   The process of checkpointing involves snapshotting the state by serializing the operator and saving it to a store. This is done using a  StorageAgent . By default a  StorageAgent  is already provided by the platform and it is called  AsyncFSStorageAgent . It serializes the operator using Kryo and saves the serialized state asynchronously to a filesystem such as HDFS. There are other implementations of  StorageAgent  available such as  GeodeKeyValueStorageAge
 nt  that stores the serialized state in Geode which is an in-memory replicated data grid.  All variables in the operator marked neither transient nor final are saved so any variables in the operator that are not part of the state should be marked transient. Specifically any variables like connection objects, i/o streams, ports are transient, because they need to be setup again on failure recovery.  If the operator does not keep any state between windows, mark it with the  @Stateless  annotation. This results in efficiencies during checkpointing and recovery. The operator will not be checkpointed and is always restored to the initial state  The checkpoint interval can be set using the  CHECKPOINT_WINDOW_COUNT  attribute which specifies the interval in terms of number of streaming windows.  If the correct functioning of the operator requires the  endWindow  method be called before checkpointing can happen, then the checkpoint interval should align with application window interval i.e.
 , it should be a multiple of application window interval. In this case the operator should be marked with  OperatorAnnotation  and  checkpointableWithinAppWindow  set to false. If the window intervals are configured by the user and they don\u2019t align, it will result in a DAG validation error and application won\u2019t launch.  In some cases the operator state related to a piece of data needs to be purged once that data is no longer required by the application, otherwise the state will continue to build up indefinitely. The platform provides a way to let the operator know about this using a callback listener called  CheckpointNotificationListener . This listener has a callback method called  committed , which is called by the platform from time to time with a window id that has been processed successfully by all the operators in the DAG and hence is no longer needed. The operator can delete all the state corresponding to window ids less than or equal to the provided window id.  So
 metimes operators need to perform some tasks just before checkpointing. For example, filesystem operators may want to flush the files just before checkpoint so they can be sure that all pending data is written to disk and no data is lost if there is an operator failure just after the checkpoint and the operator restarts from the checkpoint. To do this the operator would implement the same  CheckpointNotificationListener  interface and implement the  beforeCheckpoint  method where it can do these tasks.  If the operator is going to have a large state, checkpointing the entire state each time becomes unviable. Furthermore, the amount of memory needed to hold the state could be larger than the amount of physical memory available. In these cases the operator should checkpoint the state incrementally and also manage the memory for the state more efficiently. The platform provides a utiltiy called  ManagedState  that uses a combination of in memory and disk cache to efficiently store and 
 retrieve data in a performant, fault tolerant way and also checkpoint it in an incremental fashion. There are operators in the platform that use  ManagedState  and can be used as a reference on how to use this utility such as Dedup or Join operators.", 
+            "title": "Checkpointing"
+        }, 
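A sketch of the checkpoint callbacks described above, assuming the interface shape given in the text: CheckpointNotificationListener, nested in Operator, adding beforeCheckpoint to the checkpointed and committed callbacks. The per-window bookkeeping map is illustrative.

import java.util.TreeMap;

import com.datatorrent.api.Operator;
import com.datatorrent.common.util.BaseOperator;

public class CheckpointAwareOperator extends BaseOperator
    implements Operator.CheckpointNotificationListener
{
  // per-window state that can be discarded once the window is committed
  private TreeMap<Long, Long> countsByWindow = new TreeMap<>();
  private transient long currentWindowId;
  private long count;

  @Override
  public void beginWindow(long windowId)
  {
    currentWindowId = windowId;
  }

  @Override
  public void endWindow()
  {
    countsByWindow.put(currentWindowId, count);
  }

  @Override
  public void beforeCheckpoint(long windowId)
  {
    // flush buffered output (e.g. file streams) so the snapshot is consistent
  }

  @Override
  public void checkpointed(long windowId)
  {
    // state up to windowId has been saved by the StorageAgent
  }

  @Override
  public void committed(long windowId)
  {
    // every operator in the DAG has processed up to windowId; purge obsolete state
    countsByWindow.headMap(windowId, true).clear();
  }
}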
+        {
+            "location": "/development_best_practices/#input-operators", 
+            "text": "Input operators have additional requirements:   The  emitTuples  method implemented by the operator, is called by the platform, to give the operator an opportunity to emit some data. This method is always called within a window boundary but can be called multiple times within the same window. There are some important guidelines on how to implement this method:  This should not be a blocking method and should return in a reasonable time that is less than the streaming window size (which is 500ms by default). This also applies to other callback methods called by the platform such as  beginWindow ,  endWindow  etc., but is more important here since this method will be called continuously by the platform.  If the operator needs to interact with external systems to obtain data and this can potentially take a long time, then this should be performed asynchronously in a different thread. Refer to the threading section below for the guidelines when using threading.  In 
 each invocation, the method can emit any number of data tuples.", 
+            "title": "Input Operators"
+        }, 
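The sketch below illustrates the emitTuples guidelines above together with the threading guidance from the page: a helper thread started in activate fills a thread-safe queue, and emitTuples only drains what is already buffered, so it never blocks. The data source is simulated here; a real operator would read from its external system inside the helper thread.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.Operator;
import com.datatorrent.common.util.BaseOperator;

public class AsyncReaderInputOperator extends BaseOperator
    implements InputOperator, Operator.ActivationListener<OperatorContext>
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  private transient BlockingQueue<String> buffer;
  private transient volatile boolean running;
  private transient Thread reader;

  @Override
  public void setup(OperatorContext context)
  {
    buffer = new ArrayBlockingQueue<>(10000);
  }

  @Override
  public void activate(OperatorContext context)
  {
    running = true;
    reader = new Thread(new Runnable()
    {
      @Override
      public void run()
      {
        while (running) {
          try {
            // a real operator would fetch from its external source here
            buffer.put(Long.toString(System.nanoTime()));
            Thread.sleep(1);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      }
    }, "reader");
    reader.start();
  }

  @Override
  public void emitTuples()
  {
    // called repeatedly within a window; drain what is available and return quickly
    String tuple;
    while ((tuple = buffer.poll()) != null) {
      output.emit(tuple);
    }
  }

  @Override
  public void deactivate()
  {
    running = false;
    try {
      reader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}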
+        {
+            "location": "/development_best_practices/#idempotence", 
+            "text": "Many applications write data to external systems using output operators. To ensure that data is present exactly once in the external system even in a failure recovery scenario, the output operators expect that the replayed windows during recovery contain the same data as before the failure. This is called idempotency. Since operators within the DAG are merely responding to input data provided to them by the upstream operators and the input operator has no upstream operator, the responsibility of idempotent replay falls on the input operators.   For idempotent replay of data, the operator needs to store some meta-information for every window that would allow it to identify what data was sent in that window. This is called the idempotent state.  If the external source of the input operator allows replayability, this could be information such as the offset of the last piece of data in the window, an identifier of the last piece of data itself, or the number of data tuples sent.  However, if the external source does not allow replayability from an operator-specified point, then the entire data sent within the window may need to be persisted by the operator.    The platform provides a utility called  WindowDataManager  to allow operators to save and retrieve idempotent state for every window. Operators should use this to implement idempotency. A conceptual sketch of the replay flow follows this entry.", 
+            "title": "Idempotence"
+        }, 
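The following is only a conceptual sketch of idempotent replay. The  persistWindowMeta ,  readWindowMeta  and  largestPersistedWindow  helpers are hypothetical stand-ins for the per-window save and retrieve calls an operator would make on  WindowDataManager , and the source-reading methods are left abstract; the point is the control flow of recording what each window emitted and re-emitting exactly that range on replay.

    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.InputOperator;
    import com.datatorrent.common.util.BaseOperator;

    public abstract class IdempotentReaderSketch extends BaseOperator implements InputOperator
    {
      public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

      private long currentWindowId;
      private long windowStartOffset;   // offset of the first record emitted in this window
      private long nextOffset;          // offset of the next record to read from the source
      private transient boolean replaying;

      @Override
      public void beginWindow(long windowId)
      {
        currentWindowId = windowId;
        windowStartOffset = nextOffset;
        // windows up to the last window whose meta was persisted are replayed, not re-read
        replaying = windowId <= largestPersistedWindow();
        if (replaying) {
          long[] range = readWindowMeta(windowId);  // {startOffset, endOffset} recorded earlier
          emitRange(range[0], range[1]);            // re-emit exactly the same data as before the failure
          nextOffset = range[1];
        }
      }

      @Override
      public void emitTuples()
      {
        if (!replaying) {
          nextOffset = emitNewDataFrom(nextOffset);  // normal reading path
        }
      }

      @Override
      public void endWindow()
      {
        if (!replaying) {
          // remember what this window contained so it can be replayed identically
          persistWindowMeta(currentWindowId, new long[] {windowStartOffset, nextOffset});
        }
      }

      // hypothetical helpers; in a real operator these map to WindowDataManager
      // and to the external source's read API
      protected abstract long largestPersistedWindow();
      protected abstract long[] readWindowMeta(long windowId);
      protected abstract void persistWindowMeta(long windowId, long[] range);
      protected abstract void emitRange(long startOffset, long endOffset);
      protected abstract long emitNewDataFrom(long offset);
    }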
+        {
+            "location": "/development_best_practices/#output-operators", 
+            "text": "Output operators typically connect to external storage systems such as filesystems, databases or key-value stores to store data.   In some situations, the external systems may not be functioning in a reliable fashion. They may have prolonged outages or performance problems. If the operator is being designed to work in such environments, it needs to be able to handle these problems gracefully and not block the DAG or fail. In these scenarios the operator should cache the data into a local store such as HDFS and interact with the external systems in a separate thread so as not to cause problems in the operator lifecycle thread. This pattern is called the  Reconciler  pattern and there are operators in the library that implement this pattern and can be used for reference. A simplified sketch follows this entry.", 
+            "title": "Output Operators"
+        }, 
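A simplified sketch of the Reconciler idea, not the library implementation: tuples are spooled to a local queue on the lifecycle thread while a separate writer thread pushes them to the external system, so a slow or flaky external system does not block the DAG. A production operator would spool to durable storage such as HDFS rather than memory; all names here are illustrative.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.Operator.ActivationListener;
    import com.datatorrent.common.util.BaseOperator;

    public class ReconcilingOutputSketch extends BaseOperator implements ActivationListener<OperatorContext>
    {
      private final transient BlockingQueue<String> spool = new LinkedBlockingQueue<>();
      private transient Thread writerThread;
      private transient volatile boolean running;

      public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
      {
        @Override
        public void process(String tuple)
        {
          spool.add(tuple);  // cheap, never blocks on the external system
        }
      };

      @Override
      public void activate(OperatorContext context)
      {
        running = true;
        writerThread = new Thread(new Runnable()
        {
          @Override
          public void run()
          {
            while (running || !spool.isEmpty()) {
              try {
                String tuple = spool.poll(100, TimeUnit.MILLISECONDS);
                if (tuple != null) {
                  writeToExternalSystem(tuple);  // may be slow or retried; only this thread waits
                }
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
              }
            }
          }
        });
        writerThread.start();
      }

      @Override
      public void deactivate()
      {
        running = false;
        try {
          writerThread.join();  // wait for the writer to drain and finish
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }

      private void writeToExternalSystem(String tuple)
      {
        // placeholder for the actual external write, with retries/backoff as needed
      }
    }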
+        {
+            "location": "/development_best_practices/#end-to-end-exactly-once", 
+            "text": "When output operators store data in external systems, it is important that they do not lose data or write duplicate data when there is a failure event and the DAG recovers from that failure. In failure recovery, the windows from the previous checkpoint are replayed and the operator receives this data again. The operator should ensure that it does not write this data again. Operator developers should figure out how to do this specifically for the operators they are developing, depending on the logic of the operators. Below are examples of how a couple of existing output operators do this for reference.   The file output operator that writes data to files keeps track of the file lengths in its state. These lengths are checkpointed and restored on failure recovery. On restart, the operator truncates the file to the length recorded in the recovered state. This makes the data in the file the same as it was at the time of the checkpoint before the failure. The operator then writes the replayed data from the checkpoint in the regular fashion, as any other data. This ensures no data is lost or duplicated in the file.  The JDBC output operator that writes data to a database table writes the data in a window in a single transaction. It also writes the current window id into a meta table along with the data as part of the same transaction. It commits the transaction at the end of the window. When there is an operator failure before the final commit, the state of the database is that it contains the data from the previous fully processed window and its window id, since the current window\u2019s transaction isn\u2019t yet committed. On recovery, the operator reads this window id back from the meta table. It ignores all the replayed windows whose window id is less than or equal to the recovered window id and thus ensures that it does not duplicate data already present in the database. It starts writing data normally again when the window id of the data becomes greater than the recovered window id, thus ensuring no data is lost. A simplified sketch of the JDBC approach follows this entry.", 
+            "title": "End-to-End Exactly Once"
+        }, 
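A simplified sketch of the JDBC approach described above. This is not the Malhar JDBC operator; the table and column names, the connection handling and the per-tuple inserts are made up for illustration, and batching and error handling are omitted.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.common.util.BaseOperator;

    public class ExactlyOnceJdbcOutputSketch extends BaseOperator
    {
      private String databaseUrl;                    // configurable property, e.g. a JDBC URL
      private transient Connection connection;
      private transient long committedWindowId;     // largest window id found in the meta table (0 if empty)
      private transient long currentWindowId;
      private transient boolean skipWindow;

      public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
      {
        @Override
        public void process(String tuple)
        {
          if (skipWindow) {
            return;  // this window was already written before the failure
          }
          // a real operator would batch and reuse statements instead of one insert per tuple
          try (PreparedStatement ps = connection.prepareStatement("INSERT INTO events(payload) VALUES (?)")) {
            ps.setString(1, tuple);
            ps.executeUpdate();
          } catch (SQLException e) {
            throw new RuntimeException(e);
          }
        }
      };

      @Override
      public void setup(OperatorContext context)
      {
        try {
          connection = DriverManager.getConnection(databaseUrl);
          try (Statement st = connection.createStatement();
              ResultSet rs = st.executeQuery("SELECT MAX(window_id) FROM window_meta")) {
            if (rs.next()) {
              committedWindowId = rs.getLong(1);
            }
          }
          connection.setAutoCommit(false);  // commit once per window in endWindow()
        } catch (SQLException e) {
          throw new RuntimeException(e);
        }
      }

      @Override
      public void beginWindow(long windowId)
      {
        currentWindowId = windowId;
        skipWindow = windowId <= committedWindowId;  // replayed window already present in the database
      }

      @Override
      public void endWindow()
      {
        if (skipWindow) {
          return;
        }
        try (PreparedStatement ps = connection.prepareStatement("INSERT INTO window_meta(window_id) VALUES (?)")) {
          ps.setLong(1, currentWindowId);
          ps.executeUpdate();
          connection.commit();  // the window's data and its window id become visible atomically
        } catch (SQLException e) {
          throw new RuntimeException(e);
        }
      }
    }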
+        {
+            "location": "/development_best_practices/#partitioning", 
+            "text": "Partitioning allows an operation to be scaled to handle more pieces of data than before but with a similar SLA. This is done by creating multiple instances of an operator and distributing the data among them. Input operators can also be partitioned to stream more pieces of data into the application. The platform provides a lot of flexibility and options for partitioning. Partitioning can happen once at startup or can be dynamically changed at any time while the application is running, and it can be done in a stateless or stateful way by distributing state from the old partitions to the new partitions.  In the platform, the responsibility for partitioning is shared among different entities. These are:   A  partitioner  that specifies  how  to partition the operator; specifically, it takes an old set of partitions and creates a new set of partitions. At the start of the application the old set has one partition and the partitioner can return more than one partition to start the application with multiple partitions. The partitioner can have any custom JAVA logic to determine the number of new partitions, set their initial state as a brand new state or derive it from the state of the old partitions. It also specifies how the data gets distributed among the new partitions. The new set doesn't have to contain only new partitions; it can carry over some old partitions if desired.  An optional  statistics (stats) listener  that specifies  when  to partition. The reason it is optional is that it is needed only when dynamic partitioning is needed. With the stats listener, the stats can be used to determine when to partition.  In some cases the  operator  itself should be aware of partitioning and would need to provide supporting code. In the case of input operators, each partition should have a property or a set of properties that allows it to distinguish itself from the other partitions and fetch unique data.    When an operator that was originally a single instance is split into multiple partitions with each partition working on a subset of data, the results of the partitions may need to be combined together to compute the final result. The combining logic depends on the logic of the operator. It is specified by the developer using a  Unifier , which is deployed as another operator by the platform. If no  Unifier  is specified, the platform inserts a  default unifier  that merges the results of the multiple partition streams into a single stream. Each output port can have a different  Unifier  and this is specified by returning the corresponding  Unifier  in the  getUnifier  method of the output port. The operator developer should provide a custom  Unifier  wherever applicable (a sketch follows this entry).  The Apex  engine  that brings everything together and effects the partitioning.   Since partitioning is critical for the scalability of applications, operators must support it. There should be a strong reason for an operator to not support partitioning, such as the logic performed by the operator not lending itself to parallelism. In order to support partitioning, an operator developer, apart from developing the functionality of the operator, may also need to provide a partitioner, a stats listener and supporting code in the operator as described in the steps above. The next sections delve into this.", 
+            "title": "Partitioning"
+        }, 
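A small sketch of a custom  Unifier  (all names are illustrative): each partition emits a partial sum at the end of the window and the unifier, supplied through the output port's  getUnifier  method, combines them into the final sum.

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.Operator.Unifier;
    import com.datatorrent.common.util.BaseOperator;

    public class PartialSumOperator extends BaseOperator
    {
      private long partialSum;

      public final transient DefaultInputPort<Long> input = new DefaultInputPort<Long>()
      {
        @Override
        public void process(Long value)
        {
          partialSum += value;  // each partition sums only the tuples routed to it
        }
      };

      public final transient DefaultOutputPort<Long> sumOutput = new DefaultOutputPort<Long>()
      {
        @Override
        public Unifier<Long> getUnifier()
        {
          return new SumUnifier();  // combine the per-partition partial sums instead of the default merge
        }
      };

      @Override
      public void endWindow()
      {
        sumOutput.emit(partialSum);
        partialSum = 0;
      }

      public static class SumUnifier extends BaseOperator implements Unifier<Long>
      {
        public final transient DefaultOutputPort<Long> output = new DefaultOutputPort<>();
        private long sum;

        @Override
        public void process(Long tuple)
        {
          sum += tuple;  // merge partial sums arriving from all partitions
        }

        @Override
        public void endWindow()
        {
          output.emit(sum);
          sum = 0;
        }
      }
    }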
+        {
+            "location": "/development_best_practices/#out-of-the-box-partitioning", 
+            "text": "The platform comes with some built-in partitioning utilities that can be used in certain scenarios.    StatelessPartitioner  provides a default partitioner that can be used for an operator under certain conditions. If the operator satisfies these conditions, the partitioner can be specified for the operator with a simple setting (illustrated below) and no other partitioning code is needed. The conditions are:   No dynamic partitioning is needed, see the next point about dynamic partitioning.   There is no distinct initial state for the partitions, i.e., all partitions start with the same initial state submitted during application launch.   Typically input or output operators do not fall into this category, although there are some exceptions. This partitioner is mainly used with operators that are in the middle of the DAG, after the input and before the output operators. When used with non-input operators, only the data for the first declared input port is distributed among the different partitions. All other input ports are treated as broadcast and all partitions receive all the data for that port.    StatelessThroughputBasedPartitioner  in Malhar provides a dynamic partitioner based on throughput thresholds. Similarly,  StatelessLatencyBasedPartitioner  provides a latency-based dynamic partitioner in RTS. If these partitioners can be used, then separate partitioning-related code is not needed. The conditions under which these can be used are:   There is no distinct initial state for the partitions.  There is no state being carried over by the operator from one window to the next, i.e., the operator is stateless.", 
+            "title": "Out of the box partitioning"
+        }, 
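A sketch of the simple setting for  StatelessPartitioner , done here in Java when building the DAG; the application and the pass-through operator are hypothetical, and the PARTITIONER attribute can also be set through the configuration file instead.

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.common.partitioner.StatelessPartitioner;
    import com.datatorrent.common.util.BaseOperator;

    public class PartitionedApp implements StreamingApplication
    {
      // hypothetical stateless operator that satisfies the StatelessPartitioner conditions above
      public static class PassThroughOperator extends BaseOperator
      {
        public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();
        public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
        {
          @Override
          public void process(String tuple)
          {
            output.emit(tuple);
          }
        };
      }

      @Override
      public void populateDAG(DAG dag, Configuration conf)
      {
        PassThroughOperator op = dag.addOperator("process", new PassThroughOperator());
        // run the operator as 4 static partitions; no custom partitioning code is needed
        dag.setAttribute(op, OperatorContext.PARTITIONER, new StatelessPartitioner<PassThroughOperator>(4));
        // streams connecting "process" to the rest of the DAG are omitted in this sketch
      }
    }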
+        {
+            "location": "/development_best_practices/#custom-partitioning", 
+            "text": "In many cases, operators don\u2019t satisfy the above conditions and a built-in partitioner cannot be used. Custom partitioning code needs to be written by the operator developer. Below are guidelines for it (a sketch follows this entry).   Since the operator developer is providing a  partitioner  for the operator, the partitioning code should be added to the operator itself by making the operator implement the  Partitioner  interface and implementing the required methods, rather than creating a separate partitioner. The advantage is that the user of the operator does not have to explicitly figure out the partitioner and set it for the operator, but still has the option to override this built-in partitioner with a different one.  The  partitioner  is responsible for setting the initial state of the new partitions, whether it is at the start of the application or when partitioning is happening while the application is running, as in the dynamic partitioning case. In the dynamic partitioning scenario, the partitioner needs to take the state from the old partitions and distribute it among the new partitions. It is important to note that, apart from the checkpointed state, the partitioner also needs to distribute the idempotent state.  The  Partitioner  interface has two methods,  definePartitions  and  partitioned . The method  definePartitions  is first called to determine the new partitions, and if enough resources are available on the cluster, the  partitioned  method is called passing in the new partitions. This happens both during initial partitioning and dynamic partitioning. If resources are not available, partitioning is abandoned and the existing partitions continue to run untouched. This means that any processing-intensive operations should be deferred to the  partitioned  call instead of doing them in  definePartitions , as they may not be needed if there are not enough resources available in the cluster.  The  partitioner , along with creating the new partitions, should also specify how the data gets distributed across the new partitions. It should do this by specifying a mapping called  PartitionKeys  for each partition that maps the data to that partition. This mapping needs to be specified for every input port in the operator. If the  partitioner  wants to use the standard mapping it can use a utility method called  DefaultPartition.assignPartitionKeys .  When the partitioner is scaling the operator up to more partitions, try to reuse the existing partitions and create new partitions to augment the current set. The reuse can be achieved by the partitioner returning the current partitions unchanged. This will result in the current partitions continuing to run untouched.  In case of dynamic partitioning, as mentioned earlier, a stats listener is also needed to determine when to re-partition. Like the  Partitioner  interface, the operator can also implement the  StatsListener  interface to provide a stats listener implementation that will be used automatically.  The  StatsListener  has access to all operator statistics to make its decision on partitioning. Apart from the statistics that the platform computes for the operators, such as throughput, latency, etc., operator developers can include their own business metrics by using the AutoMetric feature.  If the operator is not partitionable, mark it so with the  OperatorAnnotation  annotation with its  partitionable  element set to false.", 
+            "title": "Custom partitioning"
+        }, 
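A sketch of an operator that carries its own partitioner, as recommended above. The operator and its property are illustrative, and the exact signatures of the  DefaultPartition.assignPartitionKeys  utility and  PartitioningContext  should be checked against the Apex API version in use.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Map;

    import com.datatorrent.api.DefaultPartition;
    import com.datatorrent.api.Partitioner;
    import com.datatorrent.common.util.BaseOperator;

    public class SelfPartitioningOperator extends BaseOperator
        implements Partitioner<SelfPartitioningOperator>
    {
      private int partitionCount = 2;  // desired number of partitions, settable as an operator property

      @Override
      public Collection<Partition<SelfPartitioningOperator>> definePartitions(
          Collection<Partition<SelfPartitioningOperator>> partitions, PartitioningContext context)
      {
        // build the new set of partitions; initial state here is simply copied from this instance's properties
        Collection<Partition<SelfPartitioningOperator>> newPartitions = new ArrayList<>(partitionCount);
        for (int i = 0; i < partitionCount; i++) {
          SelfPartitioningOperator instance = new SelfPartitioningOperator();
          instance.partitionCount = partitionCount;
          newPartitions.add(new DefaultPartition<>(instance));
        }
        // use the standard mapping of tuples to partitions for the first declared input port
        // (assumes the operator has at least one input port)
        DefaultPartition.assignPartitionKeys(newPartitions, context.getInputPorts().get(0));
        return newPartitions;
      }

      @Override
      public void partitioned(Map<Integer, Partition<SelfPartitioningOperator>> partitions)
      {
        // called once the new partitions are deployed; any heavy post-partitioning work belongs here
      }
    }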
+        {
+            "location": "/development_best_practices/#streamcodecs", 
+            "text": "A  StreamCodec  is used in partitioning to distribute the data tuples among the partitions. The  StreamCodec  computes an integer hashcode for a data tuple and this is used along with the  PartitionKeys  mapping to determine which partition or partitions receive the data tuple. If a  StreamCodec  is not specified, then a default one is used by the platform, which returns the JAVA hashcode of the tuple.   StreamCodec  is also useful in another aspect of the application. It is used to serialize and deserialize the tuple to transfer it between operators. The default  StreamCodec  uses the Kryo library for serialization.   The following guidelines are useful when considering a custom  StreamCodec :  A custom  StreamCodec  is needed if the tuples need to be distributed based on a criterion different from the hashcode of the tuple. If the correct working of an operator depends on the data from the upstream operator being distributed using a custom criterion, such as being sticky on a \u201ckey\u201d field within the tuple, then a custom  StreamCodec  should be provided by the operator developer. This codec can implement the custom criterion. The operator should also return this custom codec in the  getStreamCodec  method of the input port.  When implementing a custom  StreamCodec  for the purpose of using a different criterion to distribute the tuples, the codec can extend an existing  StreamCodec  and implement the hashcode method, so that the codec does not have to worry about the serialization and deserialization functionality. The Apex platform provides two pre-built  StreamCodec  implementations for this purpose: one is  KryoSerializableStreamCodec  that uses Kryo for serialization and the other is  JavaSerializationStreamCodec  that uses JAVA serialization.  Different  StreamCodec  implementations can be used for the different inputs of a stream with multiple inputs when a different criterion for distributing the tuples is desired for each input. A sketch of a key-based codec follows this entry.", 
+            "title": "StreamCodecs"
+        }, 
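A sketch of a key-sticky codec; the  Event  tuple type and the enclosing operator are made up. The codec extends  KryoSerializableStreamCodec  so serialization is inherited, only the partition hash is overridden, and the input port returns the codec from  getStreamCodec .

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.StreamCodec;
    import com.datatorrent.common.util.BaseOperator;
    import com.datatorrent.lib.codec.KryoSerializableStreamCodec;

    public class KeyedOperator extends BaseOperator
    {
      // hypothetical tuple type with a key field
      public static class Event
      {
        public String key;
        public long value;
      }

      public static class KeyStreamCodec extends KryoSerializableStreamCodec<Event>
      {
        @Override
        public int getPartition(Event tuple)
        {
          return tuple.key.hashCode();  // sticky on the key instead of the whole tuple's hashcode
        }
      }

      public final transient DefaultInputPort<Event> input = new DefaultInputPort<Event>()
      {
        @Override
        public void process(Event tuple)
        {
          // per-key processing here relies on the sticky distribution provided by the codec
        }

        @Override
        public StreamCodec<Event> getStreamCodec()
        {
          return new KeyStreamCodec();
        }
      };
    }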
+        {
+            "location": "/development_best_practices/#threads", 
+            "text": "The operator lifecycle methods such as  setup ,  beginWindow ,  endWindow  and  process  in  InputPorts  are all called by the platform from a single operator lifecycle thread, so the user does not have to worry about the issues arising from multi-threaded code. Use of separate threads in an operator is discouraged because in most cases the motivation for this is parallelism, but parallelism can already be achieved by using multiple partitions, and furthermore mistakes are easily made when writing multi-threaded code. When dealing with high volume and velocity data, the corner cases of incorrectly written multi-threaded code are encountered and exposed more easily. However, there are times when separate threads are needed, for example, when interacting with external systems the delay in retrieving or sending data can be large at times, blocking the operator and other DAG processing such as committed windows. In these cases the following guidelines must be followed strictly (a sketch follows this entry).   Threads should be started in  activate  and stopped in  deactivate . In  deactivate  the operator should wait until any threads it launched have finished execution. It can do so by calling  join  on the threads or, if using an  ExecutorService , calling  awaitTermination  on the service.  Threads should not call any methods on the ports directly as this can cause concurrency exceptions and also result in invalid states.  Threads can share state with the lifecycle methods using data structures that are either explicitly protected by synchronization or are inherently thread safe, such as thread-safe queues.  If this shared state needs to be protected against failure, then it needs to be persisted during checkpoint. To have a consistent checkpoint, the state should not be modified by the thread while it is being serialized and saved by the operator lifecycle thread during checkpoint. Since the checkpoint process happens outside the window boundary, the thread should be quiesced between  endWindow  and  beginWindow  or, more efficiently, between the pre-checkpoint and checkpointed callbacks.", 
+            "title": "Threads"
+        }, 
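A sketch pulling the threading guidelines together (names are illustrative): the fetcher thread is started in  activate  and stopped in  deactivate , shares data with the lifecycle thread only through a thread-safe queue, and never touches the ports directly.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.InputOperator;
    import com.datatorrent.api.Operator.ActivationListener;
    import com.datatorrent.common.util.BaseOperator;

    public class ThreadedFetchInputOperator extends BaseOperator
        implements InputOperator, ActivationListener<OperatorContext>
    {
      public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

      private final transient Queue<String> buffer = new ConcurrentLinkedQueue<>();
      private transient ExecutorService executor;
      private transient volatile boolean running;

      @Override
      public void activate(OperatorContext context)
      {
        running = true;
        executor = Executors.newSingleThreadExecutor();
        executor.submit(new Runnable()
        {
          @Override
          public void run()
          {
            while (running) {
              // potentially slow external call, kept off the operator lifecycle thread
              buffer.add(fetchFromExternalSystem());
            }
          }
        });
      }

      @Override
      public void deactivate()
      {
        running = false;
        executor.shutdownNow();
        try {
          executor.awaitTermination(10, TimeUnit.SECONDS);  // wait for the fetcher to finish
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }

      @Override
      public void emitTuples()
      {
        // only the operator lifecycle thread touches the output port
        String tuple;
        while ((tuple = buffer.poll()) != null) {
          output.emit(tuple);
        }
      }

      private String fetchFromExternalSystem()
      {
        return "...";  // placeholder for a blocking read from the external source
      }
    }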
+        {
             "location": "/apex_cli/", 
             "text": "Apache Apex Command Line Interface\n\n\nApex CLI, the Apache Apex command line interface, can be used to launch, monitor, and manage Apache Apex applications.  It provides a developer friendly way of interacting with Apache Apex platform.  Another advantage of Apex CLI is to provide scope, by connecting and executing commands in a context of specific application.  Apex CLI enables easy integration with existing enterprise toolset for automated application monitoring and management.  Currently the following high level tasks are supported.\n\n\n\n\nLaunch or kill applications\n\n\nView system metrics including load, throughput, latency, etc.\n\n\nStart or stop tuple recording\n\n\nRead operator, stream, port properties and attributes\n\n\nWrite to operator properties\n\n\nDynamically change the application logical plan\n\n\nCreate custom macros\n\n\n\n\nApex CLI Commands\n\n\nApex CLI can be launched by running following command\n\n\napex\n\n\n\nHelp on all comman
 ds is available via \u201chelp\u201d command in the CLI\n\n\nGlobal Commands\n\n\nGLOBAL COMMANDS EXCEPT WHEN CHANGING LOGICAL PLAN:\n\nalias alias-name command\n    Create a command alias\n\nbegin-macro name\n    Begin Macro Definition ($1...$9 to access parameters and type 'end' to end the definition)\n\nconnect app-id\n    Connect to an app\n\ndump-properties-file out-file jar-file class-name\n    Dump the properties file of an app class\n\necho [arg ...]\n    Echo the arguments\n\nexit\n    Exit the CLI\n\nget-app-info app-id\n    Get the information of an app\n\nget-app-package-info app-package-file\n    Get info on the app package file\n\nget-app-package-operator-properties app-package-file operator-class\n    Get operator properties within the given app package\n\nget-app-package-operators [options] app-package-file [search-term]\n    Get operators within the given app package\n    Options:\n            -parent    Specify the parent class for the operators\n\nget-config-param
 eter [parameter-name]\n    Get the configuration parameter\n\nget-jar-operator-classes [options] jar-files-comma-separated [search-term]\n    List operators in a jar list\n    Options:\n            -parent    Specify the parent class for the operators\n\nget-jar-operator-properties jar-files-comma-separated operator-class-name\n    List properties in specified operator\n\nhelp [command]\n    Show help\n\nkill-app app-id [app-id ...]\n    Kill an app\n\n  launch [options] jar-file/json-file/properties-file/app-package-file [matching-app-name]\n    Launch an app\n    Options:\n            -apconf \napp package configuration file\n        Specify an application\n                                                            configuration file\n                                                            within the app\n                                                            package if launching\n                                                            an app package.\n            -a
 rchives \ncomma separated list of archives\n    Specify comma\n                                                            separated archives\n                                                            to be unarchived on\n                                                            the compute machines.\n            -conf \nconfiguration file\n                      Specify an\n                                                            application\n                                                            configuration file.\n            -D \nproperty=value\n                             Use value for given\n                                                            property.\n            -exactMatch                                     Only consider\n                                                            applications with\n                                                            exact app name\n            -files \ncomma separated list of files\n          Specify comma\n 
                                                            separated files to\n                                                            be copied on the\n                                                            compute machines.\n            -ignorepom                                      Do not run maven to\n                                                            find the dependency\n            -libjars \ncomma separated list of libjars\n      Specify comma\n                                                            separated jar files\n                                                            or other resource\n                                                            files to include in\n                                                            the classpath.\n            -local                                          Run application in\n                                                            local mode.\n            -originalAppId \napplication id\n       
           Specify original\n                                                            application\n                                                            identifier for restart.\n            -queue \nqueue name\n                             Specify the queue to\n                                                            launch the application\n\nlist-application-attributes\n    Lists the application attributes\nlist-apps [pattern]\n    List applications\nlist-operator-attributes\n    Lists the operator attributes\nlist-port-attributes\n    Lists the port attributes\nset-pager on/off\n    Set the pager program for output\nshow-logical-plan [options] jar-file/app-package-file [class-name]\n    List apps in a jar or show logical plan of an app class\n    Options:\n            -exactMatch                                Only consider exact match\n                                                       for app name\n            -ignorepom                                 Do not run 
 maven to find\n                                                       the dependency\n            -libjars \ncomma separated list of jars\n    Specify comma separated\n                                                       jar/resource files to\n                                                       include in the classpath.\nshutdown-app app-id [app-id ...]\n    Shutdown an app\nsource file\n    Execute the commands in a file\n\n\n\n\nCommands after connecting to an application\n\n\nCOMMANDS WHEN CONNECTED TO AN APP (via connect \nappid\n) EXCEPT WHEN CHANGING LOGICAL PLAN:\n\nbegin-logical-plan-change\n    Begin Logical Plan Change\ndump-properties-file out-file [jar-file] [class-name]\n    Dump the properties file of an app class\nget-app-attributes [attribute-name]\n    Get attributes of the connected app\nget-app-info [app-id]\n    Get the information of an app\nget-operator-attributes operator-name [attribute-name]\n    Get attributes of an operator\nget-operator-properties op
 erator-name [property-name]\n    Get properties of a logical operator\nget-physical-operator-properties [options] operator-id\n    Get properties of a physical operator\n    Options:\n            -propertyName \nproperty name\n    The name of the property whose\n                                             value needs to be retrieved\n            -waitTime \nwait time\n            How long to wait to get the result\nget-port-attributes operator-name port-name [attribute-name]\n    Get attributes of a port\nget-recording-info [operator-id] [start-time]\n    Get tuple recording info\nkill-app [app-id ...]\n    Kill an app\nkill-container container-id [container-id ...]\n    Kill a container\nlist-containers\n    List containers\nlist-operators [pattern]\n    List operators\nset-operator-property operator-name property-name property-value\n    Set a property of an operator\nset-physical-operator-property operator-id property-name property-value\n    Set a property of an operator\nshow-
 logical-plan [options] [jar-file/app-package-file] [class-name]\n    Show logical plan of an app class\n    Options:\n            -exactMatch                                Only consider exact match\n                                                       for app name\n            -ignorepom                                 Do not run maven to find\n                                                       the dependency\n            -libjars \ncomma separated list of jars\n    Specify comma separated\n                                                       jar/resource files to\n                                                       include in the classpath.\nshow-physical-plan\n    Show physical plan\nshutdown-app [app-id ...]\n    Shutdown an app\nstart-recording operator-id [port-name] [num-windows]\n    Start recording\nstop-recording operator-id [port-name]\n    Stop recording\nwait timeout\n    Wait for completion of current application\n\n\n\n\nCommands when changing the logical p
 lan\n\n\nCOMMANDS WHEN CHANGING LOGICAL PLAN (via begin-logical-plan-change):\n\nabort\n    Abort the plan change\nadd-stream-sink stream-name to-operator-name to-port-name\n    Add a sink to an existing stream\ncreate-operator operator-name class-name\n    Create an operator\ncreate-stream stream-name from-operator-name from-port-name to-operator-name to-port-name\n    Create a stream\nhelp [command]\n    Show help\nremove-operator operator-name\n    Remove an operator\nremove-stream stream-name\n    Remove a stream\nset-operator-attribute operator-name attr-name attr-value\n    Set an attribute of an operator\nset-operator-property operator-name property-name property-value\n    Set a property of an operator\nset-port-attribute operator-name port-name attr-name attr-value\n    Set an attribute of a port\nset-stream-attribute stream-name attr-name attr-value\n    Set an attribute of a stream\nshow-queue\n    Show the queue of the plan change\nsubmit\n    Submit the plan change\n\n\
 n\n\nExamples\n\n\nAn example of defining a custom macro.  The macro updates a running application by inserting a new operator.  It takes three parameters and executes a logical plan changes.\n\n\napex\n begin-macro add-console-output\nmacro\n begin-logical-plan-change\nmacro\n create-operator $1 com.datatorrent.lib.io.ConsoleOutputOperator\nmacro\n create-stream stream_$1 $2 $3 $1 in\nmacro\n submit\n\n\n\n\nThen execute the \nadd-console-output\n macro like this\n\n\napex\n add-console-output xyz opername portname\n\n\n\n\nThis macro then expands to run the following command\n\n\nbegin-logical-plan-change\ncreate-operator xyz com.datatorrent.lib.io.ConsoleOutputOperator\ncreate-stream stream_xyz opername portname xyz in\nsubmit\n\n\n\n\nNote\n:  To perform runtime logical plan changes, like ability to add new operators,\nthey must be part of the jar files that were deployed at application launch time.", 
             "title": "Apex CLI"
@@ -857,7 +922,7 @@
         }, 
         {
             "location": "/security/", 
-            "text": "Security\n\n\nApplications built on Apex run as native YARN applications on Hadoop. The security framework and apparatus in Hadoop apply to the applications. The default security mechanism in Hadoop is Kerberos.\n\n\nKerberos Authentication\n\n\nKerberos is a ticket based authentication system that provides authentication in a distributed environment where authentication is needed between multiple users, hosts and services. It is the de-facto authentication mechanism supported in Hadoop. To use Kerberos authentication, the Hadoop installation must first be configured for secure mode with Kerberos. Please refer to the administration guide of your Hadoop distribution on how to do that. Once Hadoop is configured, there is some configuration needed on Apex side as well.\n\n\nConfiguring security\n\n\nThere is Hadoop configuration and CLI configuration. Hadoop configuration may be optional.\n\n\nHadoop Configuration\n\n\nAn Apex application uses delegation tokens to 
 authenticate with the ResourceManager (YARN) and NameNode (HDFS) and these tokens are issued by those servers respectively. Since the application is long-running,\nthe tokens should be valid for the lifetime of the application. Hadoop has a configuration setting for the maximum lifetime of the tokens and they should be set to cover the lifetime of the application. There are separate settings for ResourceManager and NameNode delegation\ntokens.\n\n\nThe ResourceManager delegation token max lifetime is specified in \nyarn-site.xml\n and can be specified as follows for example for a lifetime of 1 year\n\n\nproperty\n\n  \nname\nyarn.resourcemanager.delegation.token.max-lifetime\n/name\n\n  \nvalue\n31536000000\n/value\n\n\n/property\n\n\n\n\n\nThe NameNode delegation token max lifetime is specified in\nhdfs-site.xml and can be specified as follows for example for a lifetime of 1 year\n\n\nproperty\n\n   \nname\ndfs.namenode.delegation.token.max-lifetime\n/name\n\n   \nvalue\n3153600000
 0\n/value\n\n \n/property\n\n\n\n\n\nCLI Configuration\n\n\nThe Apex command line interface is used to launch\napplications along with performing various other operations and administrative tasks on the applications. \u00a0When Kerberos security is enabled in Hadoop, a Kerberos ticket granting ticket (TGT) or the Kerberos credentials of the user are needed by the CLI program \napex\n to authentica

<TRUNCATED>