Posted to commits@apex.apache.org by th...@apache.org on 2015/11/12 04:23:15 UTC

[20/22] incubator-apex-malhar git commit: Using DataTorrent rss feed for test, #comment MLHR-1899

http://git-wip-us.apache.org/repos/asf/incubator-apex-malhar/blob/90d5774f/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed_updated.rss
----------------------------------------------------------------------
diff --git a/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed_updated.rss b/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed_updated.rss
new file mode 100644
index 0000000..2403483
--- /dev/null
+++ b/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed_updated.rss
@@ -0,0 +1,1134 @@
+<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
+	xmlns:content="http://purl.org/rss/1.0/modules/content/"
+	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
+	xmlns:dc="http://purl.org/dc/elements/1.1/"
+	xmlns:atom="http://www.w3.org/2005/Atom"
+	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
+	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
+	>
+
+<channel>
+	<title>DataTorrent</title>
+	<atom:link href="https://www.datatorrent.com/feed/" rel="self" type="application/rss+xml" />
+	<link>https://www.datatorrent.com</link>
+	<description></description>
+	<lastBuildDate>Tue, 10 Nov 2015 08:00:45 +0000</lastBuildDate>
+	<language>en-US</language>
+	<sy:updatePeriod>hourly</sy:updatePeriod>
+	<sy:updateFrequency>1</sy:updateFrequency>
+	<generator>http://wordpress.org/?v=4.2.5</generator>
+	<item>
+		<title>An introduction to checkpointing in Apache Apex</title>
+		<link>https://www.datatorrent.com/blog-introduction-to-checkpoint/</link>
+		<comments>https://www.datatorrent.com/blog-introduction-to-checkpoint/#comments</comments>
+		<pubDate>Tue, 10 Nov 2015 08:00:45 +0000</pubDate>
+		<dc:creator><![CDATA[Gaurav Gupta]]></dc:creator>
+				<category><![CDATA[How-to]]></category>
+		<category><![CDATA[Apache Apex]]></category>
+
+		<guid isPermaLink="false">https://www.datatorrent.com/?p=2254</guid>
+		<description><![CDATA[<p>Know how Apex makes checkpointing easy Big data is evolving in a big way. As it booms, the issue of fault tolerance becomes more and more exigent. What happens if a node fails? Will your application recover from the effects of data or process corruption? In a conventional world, the simplest solution for such a problem would have [&#8230;]</p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-introduction-to-checkpoint/">An introduction to checkpointing in Apache Apex</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></description>
+				<content:encoded><![CDATA[<h5 id="an-introduction-to-checkpointing-in-apex" class="c5">Know how Apex makes checkpointing easy</h5>
+<p class="c2"><span class="c0">Big data is evolving<span class="c0"> </span><span class="c3">i</span><span class="c0">n a big way. As it booms, the issue of </span><span class="c0">fault</span><span class="c3"> </span><span class="c0">tolerance</span><span class="c0"> becomes more and more exigent. What happens if a node fails? Will your application recover from the effects of data or process corruption?</span></span></p>
+<p class="c2"><span class="c0">In a conventional world, the simplest solution for such a problem would have been a restart of the offending processes from the beginning. However, that was the conventional world, with data sizes still within the reach of imagination. In the world of big data, the size of data cannot be imagined. </span><span class="c3">Let alone imagine, the growth is almost incomprehensible.</span><span class="c0"> A complete restart would mean </span><span class="c3">wasting </span><span class="c0">precious resources, be it time, or CPU capacity. Such a restart in a real-time scenario would also be unpredictable. After all, how do you recover data that changed by the second (or even less) accurately? </span></p>
+<p class="c2"><span class="c0">Fault tolerance</span><span class="c0"> is not just a need, it is an absolute necessity. </span><span class="c3">The lack of a fault tolerance mechanism affects SLAs</span><span class="c3">. </span><span class="c3">A system can be made fault-tolerant by using checkpointing</span><span class="c3">. </span><span class="c0">You can think of a checkpointing mechanism as a recovery process; a system process saves snapshots</span><span class="c3"> of </span><span class="c0">application states periodically, and uses these snapshots for recovery in case of failures. </span><span class="c3">A</span><span class="c0"> platform developer alone should ensure that a big data platform is checkpoint</span><span class="c3">&#8211;</span><span class="c0">compliant. Application developers should only be concerned with their business logic, thus ensuring clear distinction between operability and functional behavior. </span></p>
+<p class="c2"><span class="c3 c7">Apex treats checkpointing as a native function, allowing application developers to stop worrying about application consistency</span><span class="c3">. </span></p>
+<p class="c2"><span class="c0">Apache</span><span class="c0 c12">®</span><span class="c0"> Apex is the industry’s only open-source platform that checkpoints intelligently, while relieving application developers of the need to worry about their platforms being failsafe.</span><span class="c3"> </span><span class="c0">By transparently checkpointing the state of operators to HDFS periodically, Apex ensures that the operators are recoverable on any node within a cluster</span><span class="c3">. </span><span class="c3">The Apex infrastructure is designed to scale, while ensuring easy recovery at any point during failure</span><span class="c3">. The </span><span class="c0">Apex checkpointing mechanism uses the DFS interfa</span><span class="c3">ce. thereby being agnostic to the DFS implementation.</span></p>
+<p class="c2"><span class="c3 c7">Apex introduces checkpointing by maintaining operator states within HDFS.</span></p>
+<p class="c2"><span class="c3">Apex serializes the state of operators to local disks, and then asynchronously copies serialized state to HDFS. The state is asynchronously copied to HDFS in order to ensure that the performance of applications is not affected by the HDFS latency. An operator is considered “checkpointed” only after the serialized state is copied to HDFS. In case of </span><span class="c3 c8"><a class="c1" href="https://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html#h.1gf8i83">Exactly-Once </a></span><span class="c3">recovery mechanism, platform checkpoints at every window boundary and it behaves in synchronous mode i.e the operator is blocked till the state is copied to HDFS. </span></p>
+<p class="c2"><span class="c3">Although Apex is designed to checkpoint at window boundaries, developers can control how optimal a checkpoint operation is.</span><span class="c3"> </span><span class="c3">Developers can control how often checkpointing is triggered. They can do this by configuring the window boundary at which checkpointing will occur, by using the </span><span class="c3 c10"><a class="c1" href="https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.OperatorContext.html#CHECKPOINT_WINDOW_COUNT">CHECKPOINT_WINDOW_COUNT</a></span><span class="c3"> attribute. </span><span class="c0">Frequent checkpoints hamper the overall application performance. This is in stark contrast to sparsely placed checkpoints, which are dangerous because they might make application recovery a </span><span class="c3">time-consuming</span><span class="c0"> task. </span></p>
+<p class="c2"><span class="c3 c7">See this example to see how the apex checkpointing mechanism works. </span></p>
+<p class="c2"><span class="c3">In our example, let us set CHECKPOINT_WINDOW_COUNT to 1. This diagram shows the flow of data within window boundaries.  You will see that at the end of each window, Apex checkpoints data in order to ensure that it is consistent. If the operator crashes during window n+1, Apex restores it to nearest stable state, in this example, the state obtained by introducing checkpointing at the end of window n. The operator now starts processing window n+1 from the beginning. If CHECKPOINT_WINDOW_COUNT was set to 2, then there would have been one checkpoint before window n, and another checkpoint after window n+1.</span></p>
+<p class="c2"><img title="" src="https://www.datatorrent.com/wp-content/uploads/2015/10/image001.png" alt="checkpointing.png" /></p>
+<p class="c2"><span class="c3 c7">Judicious checkpointing ensures optimum application performance</span></p>
+<p class="c2"><span class="c3">Checkpointing is a costly and a resource-intensive operation, indicating that an overindulgence will impact an application’s performance. </span><span class="c3">To</span><span class="c3 c9"> </span><span class="c3">act as a deterrent to the performance slowdown because of checkpointing, Apex checkpoints non-transient data only.</span><span class="c3 c9"> </span><span class="c3">For example, in JDBC operator, connection can be reinitialized at the time of setup, it should be marked as transient, and thus omitted from getting checkpointed. </span></p>
+<p class="c2"><span class="c0">Application developers must know </span><span class="c3">whether</span><span class="c0"> to checkpoint or not; the thumb rule dictates that </span><span class="c0">operators</span><span class="c0"> for which computation depends on the previous state must be checkpointed. An example is the Counter operator, which tracks the number of tuples processed by the system. Because the operator relies on </span><span class="c3">its previous state to proceed, it</span><span class="c0"> needs to be checkpointed. Some operators are stateless; their computation does not depend on the previous state. Application developers can omit such operators from checkpointing operations by using the </span><span class="c3 c8"><a class="c1" href="https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.OperatorContext.html#STATELESS">STATELESS</a></span><span class="c0"> attribute or </span><span class="c3">by annotating the operator with </span><span class="c3 c
 8"><a class="c1" href="https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/annotation/Stateless.html">Stateless</a></span><span class="c0">.</span></p>
+<p class="c2"><span class="c0">This post is an introduction to checkpointing</span><span class="c3"> </span><span class="c0">in Apex. For </span><span class="c3">more details</span><span class="c0">, </span><span class="c3">see our </span><span class="c3 c8"><a class="c1" href="http://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html">Application Developer Guide</a></span><span class="c3">.</span></p>
+<h3 id="conclusion">Resources</h3>
+<p>Download DataTorrent Sandbox <a href="http://web.datatorrent.com/DataTorrent-RTS-Sandbox-Edition-Download.html" target="_blank">here</a></p>
+<p>Download DataTorrent Enterprise Edition <a href="http://web.datatorrent.com/DataTorrent-RTS-Enteprise-Edition-Download.html" target="_blank">here</a></p>
+<p>Join Apache Apex Meetups <a href="https://www.datatorrent.com/meetups/">here</a></p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-introduction-to-checkpoint/">An introduction to checkpointing in Apache Apex</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></content:encoded>
+			<wfw:commentRss>https://www.datatorrent.com/blog-introduction-to-checkpoint/feed/</wfw:commentRss>
+		<slash:comments>0</slash:comments>
+		</item>
+		<item>
+		<title>Dimensions Computation (Aggregate Navigator) Part 2: Implementation</title>
+		<link>https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/</link>
+		<comments>https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/#comments</comments>
+		<pubDate>Thu, 05 Nov 2015 09:00:10 +0000</pubDate>
+		<dc:creator><![CDATA[tim farkas]]></dc:creator>
+				<category><![CDATA[Uncategorized]]></category>
+		<category><![CDATA[Apache Apex]]></category>
+
+		<guid isPermaLink="false">https://www.datatorrent.com/?p=2401</guid>
+		<description><![CDATA[<p>Overview While the theory of computing the aggregations is correct, some more work is required to provide a scalable implementation of Dimensions Computation. As can be seen from the formulas provided in the previous post, the number of aggregations to maintain grows rapidly as the number of unique key values, aggregators, dimension combinations, and time [&#8230;]</p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/">Dimensions Computation (Aggregate Navigator) Part 2: Implementation</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></description>
+				<content:encoded><![CDATA[<h3 id="overview">Overview</h3>
+<p>While the theory of computing the aggregations is correct, some more work is required to provide a scalable implementation of Dimensions Computation. As can be seen from the formulas provided in the previous post, the number of aggregations to maintain grows rapidly as the number of unique key values, aggregators, dimension combinations, and time buckets grows. Additionally, a scalable implementation of Dimensions Computation must be capable of handling hundreds of thousands of events per second. In order to achieve this level of performance, a balance must be struck between the speed afforded by in-memory processing and the need to persist large quantities of data. This balance is achieved by performing Dimensions Computation in three phases:</p>
+<ol>
+<li>The <strong>Pre-Aggregation</strong> phase.</li>
+<li>The <strong>Unification</strong> phase.</li>
+<li>The <strong>Aggregation Storage</strong> phase.</li>
+</ol>
+<p>The sections below describe the details of each phase of Dimensions Computation, and also provide the code snippets required to implement each phase in DataTorrent.</p>
+<h3 id="the-pre-aggregation-phase">The Pre-aggregation Phase</h3>
+<h4 id="the-theory">The Theory</h4>
+<p>This phase allows Dimensions Computation to scale by reducing the number of events entering the system. How this is achieved can be described by the following example:</p>
+<ul>
+<li>Let’s say we have 500,000 <strong>AdEvents</strong>/second entering our system, and we want to perform Dimension Computation on those events.</li>
+</ul>
+<p>Although each <strong>AdEvent</strong> will contribute to many aggregations (as described by the formulas in the previous post), the number of unique key values in the <strong>AdEvents</strong> will likely be much smaller than 500,000. So the total number of aggregations produced by 500,000 events will also be much smaller than 500,000. Let’s say for the sake of this example that the number of aggregations produced will be on the order of 10,000. This means that if we perform Dimension Computation on batches of 500,000 tuples we can reduce 500,000 events to 10,000 aggregations.</p>
+<p>The process can be sped up even further by utilizing partitioning. If a partition can handle 500,000 events/second, then 8 partitions would be able to handle 4,000,000 events/second. And these 4,000,000 events/second would then be compressed into 80,000 aggregations/second. These aggregations are then passed on to the Unification stage of processing.</p>
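+<p>As a rough sketch (this uses the generic Apex partitioning attribute and StatelessPartitioner, which are assumptions here rather than something mandated by this post), the pre-aggregation operator could be split into 8 partitions like this:</p>
+<pre class="prettyprint"><code>// Minimal sketch: request 8 static partitions of the pre-aggregation operator.
+dag.setAttribute(dimensions, Context.OperatorContext.PARTITIONER,
+    new StatelessPartitioner&lt;DimensionsComputationFlexibleSingleSchemaPOJO&gt;(8));</code></pre>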
+<p><strong>Note</strong> that these 80,000 aggregations will not be complete aggregations for two reasons:</p>
+<ol>
+<li>The aggregations do not incorporate the values of events received in previous batches. This drawback is corrected by the <strong>Aggregation Storage</strong> phase.</li>
+<li>The aggregations computed by different partitions may share the same key values and time buckets. This drawback is corrected by the <strong>Unification</strong> phase.</li>
+</ol>
+<h4 id="the-code">The Code</h4>
+<p>Setting up the Pre-Aggregation phase of Dimensions Computation involves configuring a Dimension Computation operator. There are several flavors of the Dimension Computation operator; the easiest to use out of the box for Java and dtAssemble is <strong>DimensionsComputationFlexibleSingleSchemaPOJO</strong>. This operator can receive any POJO as input (like our AdEvent) and requires the following configuration:</p>
+<ul>
+<li><strong>A JSON Schema:</strong> The JSON schema specifies the keys, aggregates, aggregators, dimension combinations, and time buckets to be used for Dimension Computation. An example of a schema that could be used for <strong>AdEvents</strong> is the following:</li>
+</ul>
+<pre class="prettyprint"><code class=" hljs json">{"<span class="hljs-attribute">keys</span>":<span class="hljs-value">[{"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span class="hljs-string">"advertiser"</span></span>,"<span class="hljs-attribute">type</span>":<span class="hljs-value"><span class="hljs-string">"string"</span></span>},
+         {"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span class="hljs-string">"location"</span></span>,"<span class="hljs-attribute">type</span>":<span class="hljs-value"><span class="hljs-string">"string"</span></span>}]</span>,
+ "<span class="hljs-attribute">timeBuckets</span>":<span class="hljs-value">[<span class="hljs-string">"1m"</span>,<span class="hljs-string">"1h"</span>,<span class="hljs-string">"1d"</span>]</span>,
+ "<span class="hljs-attribute">values</span>":
+  <span class="hljs-value">[{"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span class="hljs-string">"impressions"</span></span>,"<span class="hljs-attribute">type</span>":<span class="hljs-value"><span class="hljs-string">"long"</span></span>,"<span class="hljs-attribute">aggregators</span>":<span class="hljs-value">[<span class="hljs-string">"SUM"</span>,<span class="hljs-string">"MAX"</span>,<span class="hljs-string">"MIN"</span>]</span>},
+   {"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span class="hljs-string">"clicks"</span></span>,"<span class="hljs-attribute">type</span>":<span class="hljs-value"><span class="hljs-string">"long"</span></span>,"<span class="hljs-attribute">aggregators</span>":<span class="hljs-value">[<span class="hljs-string">"SUM"</span>,<span class="hljs-string">"MAX"</span>,<span class="hljs-string">"MIN"</span>]</span>},
+   {"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span class="hljs-string">"cost"</span></span>,"<span class="hljs-attribute">type</span>":<span class="hljs-value"><span class="hljs-string">"double"</span></span>,"<span class="hljs-attribute">aggregators</span>":<span class="hljs-value">[<span class="hljs-string">"SUM"</span>,<span class="hljs-string">"MAX"</span>,<span class="hljs-string">"MIN"</span>]</span>},
+   {"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span class="hljs-string">"revenue"</span></span>,"<span class="hljs-attribute">type</span>":<span class="hljs-value"><span class="hljs-string">"double"</span></span>,"<span class="hljs-attribute">aggregators</span>":<span class="hljs-value">[<span class="hljs-string">"SUM"</span>,<span class="hljs-string">"MAX"</span>,<span class="hljs-string">"MIN"</span>]</span>}]</span>,
+ "<span class="hljs-attribute">dimensions</span>":
+  <span class="hljs-value">[{"<span class="hljs-attribute">combination</span>":<span class="hljs-value">[]</span>},
+   {"<span class="hljs-attribute">combination</span>":<span class="hljs-value">[<span class="hljs-string">"location"</span>]</span>},
+   {"<span class="hljs-attribute">combination</span>":<span class="hljs-value">[<span class="hljs-string">"advertiser"</span>]</span>},
+   {"<span class="hljs-attribute">combination</span>":<span class="hljs-value">[<span class="hljs-string">"advertiser"</span>,<span class="hljs-string">"location"</span>]</span>}]
+</span>}</code></pre>
+<ul>
+<li>A map from key names to the Java expression used to extract the key from an incoming POJO.</li>
+<li>A map from aggregate names to the Java expression used to extract the aggregate from an incoming POJO.</li>
+</ul>
+<p>An example of how to configure a Dimensions Computation operator to process <strong>AdEvents</strong> is as follows:</p>
+<pre class="prettyprint"><code class=" hljs avrasm">DimensionsComputationFlexibleSingleSchemaPOJO dimensions = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"DimensionsComputation"</span>, DimensionsComputationFlexibleSingleSchemaPOJO<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+
+Map&lt;String, String&gt; keyToExpression = Maps<span class="hljs-preprocessor">.newHashMap</span>()<span class="hljs-comment">;</span>
+keyToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"advertiser"</span>, <span class="hljs-string">"getAdvertiser()"</span>)<span class="hljs-comment">;</span>
+keyToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"location"</span>, <span class="hljs-string">"getLocation()"</span>)<span class="hljs-comment">;</span>
+keyToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"time"</span>, <span class="hljs-string">"getTime()"</span>)<span class="hljs-comment">;</span>
+
+Map&lt;String, String&gt; aggregateToExpression = Maps<span class="hljs-preprocessor">.newHashMap</span>()<span class="hljs-comment">;</span>
+aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"cost"</span>, <span class="hljs-string">"getCost()"</span>)<span class="hljs-comment">;</span>
+aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"revenue"</span>, <span class="hljs-string">"getRevenue()"</span>)<span class="hljs-comment">;</span>
+aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"impressions"</span>, <span class="hljs-string">"getImpressions()"</span>)<span class="hljs-comment">;</span>
+aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"clicks"</span>, <span class="hljs-string">"getClicks()"</span>)<span class="hljs-comment">;</span>
+
+dimensions<span class="hljs-preprocessor">.setKeyToExpression</span>(keyToExpression)<span class="hljs-comment">;</span>
+dimensions<span class="hljs-preprocessor">.setAggregateToExpression</span>(aggregateToExpression)<span class="hljs-comment">;</span>
+//Here eventSchema is a string containing the JSON listed above.
+dimensions<span class="hljs-preprocessor">.setConfigurationSchemaJSON</span>(eventSchema)<span class="hljs-comment">;</span></code></pre>
+<h3 id="the-unification-phase">The Unification Phase</h3>
+<h4 id="the-theory-1">The Theory</h4>
+<p>The Unification phase is relatively simple. It combines the outputs of all the partitions in the Pre-Aggregation phase into a single stream which can be passed on to the storage phase. It has the added benefit of reducing the number of aggregations even further. This is because the aggregations produced by different partitions which share the same key and time bucket can be combined to produce a single aggregation. For example, if the Unification phase receives 80,000 aggregations/second, you can expect 20,000 aggregations/second after unification.</p>
+<h4 id="the-code-1">The Code</h4>
+<p>The Unification phase is implemented as a unifier that can be set on your dimensions computation operator.</p>
+<pre class="prettyprint"><code class=" hljs vbnet">dimensions.setUnifier(<span class="hljs-keyword">new</span> DimensionsComputationUnifierImpl&lt;InputEvent, <span class="hljs-keyword">Aggregate</span>&gt;());</code></pre>
+<h3 id="the-aggregation-storage-phase">The Aggregation Storage Phase</h3>
+<h4 id="the-theory-2">The Theory</h4>
+<p>The total number of aggregations produced by Dimension Computation is large, and it only increases with time (due to time bucketing). Aggregations are persisted to HDFS using HDHT. This persistence is performed by the Dimensions Store and serves two purposes:</p>
+<ul>
+<li>Functions as storage so that aggregations can be retrieved for visualization.</li>
+<li>Functions as storage, allowing aggregations to be combined with the incomplete aggregates produced by Unification.</li>
+</ul>
+<h5 id="visualization">Visualization</h5>
+<p>The DimensionsStore allows you to visualize your aggregations over time. This is done by receiving queries from, and sending results to, the UI via WebSocket.</p>
+<h5 id="aggregation">Aggregation</h5>
+<p>The store produces complete aggregations by combining the incomplete aggregations received from the Unification stage with aggregations persisted to HDFS.</p>
+<h5 id="scalability">Scalability</h5>
+<p>Since the work done by the DimensionsStore is IO intensive, it cannot handle hundreds of thousands of events per second. The purpose of the Pre-Aggregation and Unification phases is to reduce the cardinality of events so that the Store will almost always have a small number of events to handle. However, in cases where there are many unique values for keys, the Pre-Aggregation and Unification phases will not be sufficient to reduce the cardinality of events handled by the Dimension Store. In such cases it is possible to partition the Dimensions Store so that each partition handles the aggregates for a subset of the dimension combinations and time buckets.</p>
+<h4 id="the-code-2">The Code</h4>
+<p>Configuration of the DimensionsStore involves the following:</p>
+<ul>
+<li>Setting the JSON Schema.</li>
+<li>Connecting Query and Result operators that are used to send queries to and receive results from the DimensionsStore.</li>
+<li>Setting an HDHT File Implementation.</li>
+<li>Setting an HDFS path for storing aggregation data.</li>
+</ul>
+<p>An example of configuring the store is as follows:</p>
+<pre class="prettyprint"><code class=" hljs avrasm">AppDataSingleSchemaDimensionStoreHDHT store = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"Store"</span>, AppDataSingleSchemaDimensionStoreHDHT<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+
+TFileImpl hdsFile = new TFileImpl<span class="hljs-preprocessor">.DTFileImpl</span>()<span class="hljs-comment">;</span>
+hdsFile<span class="hljs-preprocessor">.setBasePath</span>(basePath)<span class="hljs-comment">;</span>
+store<span class="hljs-preprocessor">.setFileStore</span>(hdsFile)<span class="hljs-comment">;</span>
+store<span class="hljs-preprocessor">.setConfigurationSchemaJSON</span>(eventSchema)<span class="hljs-comment">;</span>
+
+String gatewayAddress = dag<span class="hljs-preprocessor">.getValue</span>(DAG<span class="hljs-preprocessor">.GATEWAY</span>_CONNECT_ADDRESS)<span class="hljs-comment">;</span>
+URI uri = URI<span class="hljs-preprocessor">.create</span>(<span class="hljs-string">"ws://"</span> + gatewayAddress + <span class="hljs-string">"/pubsub"</span>)<span class="hljs-comment">;</span>
+
+PubSubWebSocketAppDataQuery wsIn = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"Query"</span>, PubSubWebSocketAppDataQuery<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+wsIn<span class="hljs-preprocessor">.setUri</span>(uri)<span class="hljs-comment">;</span>
+wsIn<span class="hljs-preprocessor">.setTopic</span>(<span class="hljs-string">"Query Topic"</span>)<span class="hljs-comment">;</span>
+
+PubSubWebSocketAppDataResult wsOut = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"QueryResult"</span>, PubSubWebSocketAppDataResult<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+wsOut<span class="hljs-preprocessor">.setUri</span>(uri)<span class="hljs-comment">;</span>
+wsOut<span class="hljs-preprocessor">.setTopic</span>(<span class="hljs-string">"Result Topic"</span>)<span class="hljs-comment">;</span>
+
+dag<span class="hljs-preprocessor">.addStream</span>(<span class="hljs-string">"Query"</span>, wsIn<span class="hljs-preprocessor">.outputPort</span>, store<span class="hljs-preprocessor">.query</span>)<span class="hljs-comment">;</span>
+dag<span class="hljs-preprocessor">.addStream</span>(<span class="hljs-string">"QueryResult"</span>, store<span class="hljs-preprocessor">.queryResult</span>, wsOut<span class="hljs-preprocessor">.input</span>)<span class="hljs-comment">;</span></code></pre>
+<h3 id="putting-it-all-together">Putting it all Together</h3>
+<p>When you combine all the pieces described above, an application that visualizes <strong>AdEvents</strong> looks like this:</p>
+<pre class="prettyprint"><code class=" hljs avrasm">@ApplicationAnnotation(name=<span class="hljs-string">"AdEventDemo"</span>)
+public class AdEventDemo implements StreamingApplication
+{
+  public static final String EVENT_SCHEMA = <span class="hljs-string">"adsGenericEventSchema.json"</span><span class="hljs-comment">;</span>
+
+  @Override
+  public void populateDAG(DAG dag, Configuration conf)
+  {
+    //This loads the adsGenericEventSchema<span class="hljs-preprocessor">.json</span> file, which is a jar resource file.
+    String eventSchema = SchemaUtils<span class="hljs-preprocessor">.jarResourceFileToString</span>(EVENT_SCHEMA)<span class="hljs-comment">;</span>
+
+    //Operator that receives Ad Events
+    AdEventReceiver receiver = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"Event Receiver"</span>, AdEventReceiver<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+
+    //Dimension Computation
+    DimensionsComputationFlexibleSingleSchemaPOJO dimensions = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"DimensionsComputation"</span>, DimensionsComputationFlexibleSingleSchemaPOJO<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+
+    Map&lt;String, String&gt; keyToExpression = Maps<span class="hljs-preprocessor">.newHashMap</span>()<span class="hljs-comment">;</span>
+    keyToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"advertiser"</span>, <span class="hljs-string">"getAdvertiser()"</span>)<span class="hljs-comment">;</span>
+    keyToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"location"</span>, <span class="hljs-string">"getLocation()"</span>)<span class="hljs-comment">;</span>
+    keyToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"time"</span>, <span class="hljs-string">"getTime()"</span>)<span class="hljs-comment">;</span>
+
+    Map&lt;String, String&gt; aggregateToExpression = Maps<span class="hljs-preprocessor">.newHashMap</span>()<span class="hljs-comment">;</span>
+    aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"cost"</span>, <span class="hljs-string">"getCost()"</span>)<span class="hljs-comment">;</span>
+    aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"revenue"</span>, <span class="hljs-string">"getRevenue()"</span>)<span class="hljs-comment">;</span>
+    aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"impressions"</span>, <span class="hljs-string">"getImpressions()"</span>)<span class="hljs-comment">;</span>
+    aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span class="hljs-string">"clicks"</span>, <span class="hljs-string">"getClicks()"</span>)<span class="hljs-comment">;</span>
+
+    dimensions<span class="hljs-preprocessor">.setKeyToExpression</span>(keyToExpression)<span class="hljs-comment">;</span>
+    dimensions<span class="hljs-preprocessor">.setAggregateToExpression</span>(aggregateToExpression)<span class="hljs-comment">;</span>
+    dimensions<span class="hljs-preprocessor">.setConfigurationSchemaJSON</span>(eventSchema)<span class="hljs-comment">;</span>
+
+    dimensions<span class="hljs-preprocessor">.setUnifier</span>(new DimensionsComputationUnifierImpl&lt;InputEvent, Aggregate&gt;())<span class="hljs-comment">;</span>
+
+    //Dimension Store
+    AppDataSingleSchemaDimensionStoreHDHT store = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"Store"</span>, AppDataSingleSchemaDimensionStoreHDHT<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+
+    TFileImpl hdsFile = new TFileImpl<span class="hljs-preprocessor">.DTFileImpl</span>()<span class="hljs-comment">;</span>
+    hdsFile<span class="hljs-preprocessor">.setBasePath</span>(<span class="hljs-string">"dataStorePath"</span>)<span class="hljs-comment">;</span>
+    store<span class="hljs-preprocessor">.setFileStore</span>(hdsFile)<span class="hljs-comment">;</span>
+    store<span class="hljs-preprocessor">.setConfigurationSchemaJSON</span>(eventSchema)<span class="hljs-comment">;</span>
+
+    String gatewayAddress = dag<span class="hljs-preprocessor">.getValue</span>(DAG<span class="hljs-preprocessor">.GATEWAY</span>_CONNECT_ADDRESS)<span class="hljs-comment">;</span>
+    URI uri = URI<span class="hljs-preprocessor">.create</span>(<span class="hljs-string">"ws://"</span> + gatewayAddress + <span class="hljs-string">"/pubsub"</span>)<span class="hljs-comment">;</span>
+
+    PubSubWebSocketAppDataQuery wsIn = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"Query"</span>, PubSubWebSocketAppDataQuery<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+    wsIn<span class="hljs-preprocessor">.setUri</span>(uri)<span class="hljs-comment">;</span>
+    wsIn<span class="hljs-preprocessor">.setTopic</span>(<span class="hljs-string">"Query Topic"</span>)<span class="hljs-comment">;</span>
+
+    PubSubWebSocketAppDataResult wsOut = dag<span class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"QueryResult"</span>, PubSubWebSocketAppDataResult<span class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
+    wsOut<span class="hljs-preprocessor">.setUri</span>(uri)<span class="hljs-comment">;</span>
+    wsOut<span class="hljs-preprocessor">.setTopic</span>(<span class="hljs-string">"Result Topic"</span>)<span class="hljs-comment">;</span>
+
+    //Configure Streams
+
+    dag<span class="hljs-preprocessor">.addStream</span>(<span class="hljs-string">"Query"</span>, wsIn<span class="hljs-preprocessor">.outputPort</span>, store<span class="hljs-preprocessor">.query</span>)<span class="hljs-comment">;</span>
+    dag<span class="hljs-preprocessor">.addStream</span>(<span class="hljs-string">"QueryResult"</span>, store<span class="hljs-preprocessor">.queryResult</span>, wsOut<span class="hljs-preprocessor">.input</span>)<span class="hljs-comment">;</span>
+
+    dag<span class="hljs-preprocessor">.addStream</span>(<span class="hljs-string">"InputStream"</span>, receiver<span class="hljs-preprocessor">.output</span>, dimensions<span class="hljs-preprocessor">.input</span>)<span class="hljs-comment">;</span>
+    dag<span class="hljs-preprocessor">.addStream</span>(<span class="hljs-string">"DimensionalData"</span>, dimensions<span class="hljs-preprocessor">.output</span>, store<span class="hljs-preprocessor">.input</span>)<span class="hljs-comment">;</span>
+  }
+}</code></pre>
+<h3 id="visualizing-the-aggregations">Visualizing The Aggregations</h3>
+<p>When you launch your application you can visualize the aggregations of AdEvents over time by adding a widget to a visualization dashboard.</p>
+<p><img title="" src="https://docs.google.com/drawings/d/1wcHlgORqQYRdlnvkp3K7R9-2BllsW2jrgSBERqOq2jg/pub?w=960&amp;h=720" alt="enter image description here" /></p>
+<h3 id="conclusion">Conclusion</h3>
+<p>Aggregating huge amounts of data in real time is a challenge that many enterprises face today. Dimensions Computation is valuable for aggregating data, and DataTorrent provides an implementation of Dimensions Computation that allows users to integrate data aggregation with their applications with minimal effort.</p>
+<h3 id="conclusion">Resources</h3>
+<p>Download DataTorrent Sandbox <a href="http://web.datatorrent.com/DataTorrent-RTS-Sandbox-Edition-Download.html" target="_blank">here</a></p>
+<p>Download DataTorrent Enterprise Edition <a href="http://web.datatorrent.com/DataTorrent-RTS-Enteprise-Edition-Download.html" target="_blank">here</a></p>
+<p>Join Apache Apex Meetups <a href="https://www.datatorrent.com/meetups/">here</a></p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/">Dimensions Computation (Aggregate Navigator) Part 2: Implementation</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></content:encoded>
+			<wfw:commentRss>https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/feed/</wfw:commentRss>
+		<slash:comments>0</slash:comments>
+		</item>
+		<item>
+		<title>Dimensions Computation (Aggregate Navigator) Part 1: Intro</title>
+		<link>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/</link>
+		<comments>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/#comments</comments>
+		<pubDate>Tue, 03 Nov 2015 08:00:29 +0000</pubDate>
+		<dc:creator><![CDATA[tim farkas]]></dc:creator>
+				<category><![CDATA[Uncategorized]]></category>
+
+		<guid isPermaLink="false">https://www.datatorrent.com/?p=2399</guid>
+		<description><![CDATA[<p>Introduction In the world of big data, enterprises have a common problem. They have large volumes of data flowing into their systems for which they need to observe historical trends in real-time. Consider the case of a digital advertising publisher that is receiving hundreds of thousands of click events every second. Looking at the history [&#8230;]</p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/">Dimensions Computation (Aggregate Navigator) Part 1: Intro</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></description>
+				<content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
+<p>In the world of big data, enterprises have a common problem. They have large volumes of data flowing into their systems for which they need to observe historical trends in real-time. Consider the case of a digital advertising publisher that is receiving hundreds of thousands of click events every second. Looking at the history of individual clicks and impressions doesn’t tell the publisher much about what is going on. A technique the publisher might employ is to track the total number of clicks and impressions for every second, minute, hour, and day. Such a technique might help find global trends in their systems, but may not provide enough granularity to take action on localized trends. The technique will need to be powerful enough to spot local trends. For example, the total clicks and impressions for an advertiser, a geographical area, or a combination of the two can provide some actionable insight. This process of receiving individual events, aggregating them over time, and
  drilling down into the data using some parameters like “advertiser” and “location” is called Dimensions Computation.</p>
+<p>Dimensions Computation is a powerful mechanism that allows you to spot trends in your streaming data in real-time. In this post we’ll cover the key concepts behind Dimensions Computation and outline the process of performing Dimensions Computation. We will also show you how to use DataTorrent’s out-of-the-box enterprise operators to easily add Dimensions Computation to your application.</p>
+<h2 id="the-process">The Process</h2>
+<p>Let us continue with our example of an advertising publisher. Let us now see the steps that the publisher might take to ensure that large volumes of raw advertisement data is converted into meaningful historical views of their advertisement events.</p>
+<h3 id="the-data">The Data</h3>
+<p>Typically advertising publishers receive packets of information for each advertising event. The events that a publisher receives might look like this:</p>
+<pre class="prettyprint"><code class=" hljs cs">    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> AdEvent
+    {
+        <span class="hljs-comment">//The name of the company that is advertising</span>
+      <span class="hljs-keyword">public</span> String advertiser;
+      <span class="hljs-comment">//The geographical location of the person initiating the event</span>
+      <span class="hljs-keyword">public</span> String location;
+      <span class="hljs-comment">//How much the advertiser was charged for the event</span>
+      <span class="hljs-keyword">public</span> <span class="hljs-keyword">double</span> cost;
+      <span class="hljs-comment">//How much revenue was generated for the advertiser</span>
+      <span class="hljs-keyword">public</span> <span class="hljs-keyword">double</span> revenue;
+      <span class="hljs-comment">//The number of impressions the advertiser received from this event</span>
+      <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> impressions;
+      <span class="hljs-comment">//The number of clicks the advertiser received from this event</span>
+      <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> clicks;
+      <span class="hljs-comment">//The timestamp of the event in milliseconds</span>
+      <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> time;
+    }</code></pre>
+<p>The class <strong>AdEvent</strong> contains two types of data:</p>
+<ul>
+<li><strong>Aggregates</strong>: The data that is combined using aggregators.</li>
+<li><strong>Keys</strong>: The data that is used to select aggregations at a finer granularity.</li>
+</ul>
+<h4 id="aggregates">Aggregates</h4>
+<p>The aggregates in our <strong>AdEvent</strong> object are the pieces of data that we must combine using aggregators in order to obtain a meaningful historical view. In this case, we can think of combining cost, revenue, impressions, and clicks. So these are our aggregates. We won’t obtain anything useful by aggregating the location and advertiser strings in our <strong>AdEvent</strong>, so those are not considered aggregates. It’s important to note that aggregates are considered separate entities. This means that the cost field of an <strong>AdEvent</strong> cannot be combined with its revenue field; cost values can only be aggregated with other cost values and revenue values can only be aggregated with other revenue values.</p>
+<p>In summary the aggregates in our <strong>AdEvent</strong> are:</p>
+<ul>
+<li><strong>cost</strong></li>
+<li><strong>revenue</strong></li>
+<li><strong>impressions</strong></li>
+<li><strong>clicks</strong></li>
+</ul>
+<h4 id="keys">Keys</h4>
+<p>The keys in our <strong>AdEvent</strong> object are used for selecting aggregations at a finer granularity. For example, it would make sense to look at the number of clicks for a particular advertiser, the number of clicks for a certain location, and the number of clicks for a certain location and advertiser combination. So location and advertiser are keys. Time is also another key, since it is useful to look at the number of clicks received in a particular time range (for example, 12:00 pm through 1:00 pm, or 12:00 pm through 12:01 pm).</p>
+<p>In summary the keys in our <strong>AdEvent</strong> are:</p>
+<ul>
+<li><strong>advertiser</strong></li>
+<li><strong>location</strong></li>
+<li><strong>time</strong></li>
+</ul>
+<h3 id="computing-the-aggregations">Computing The Aggregations</h3>
+<p>When the publisher receives a new <strong>AdEvent</strong>, the event is added to running aggregations in real time. The keys and aggregates in the <strong>AdEvent</strong> are used to compute aggregations. How the aggregations are computed and the number of aggregations computed are determined by three tunable parameters:</p>
+<ul>
+<li><strong>Aggregators</strong></li>
+<li><strong>Dimensions Combinations</strong></li>
+<li><strong>Time Buckets</strong></li>
+</ul>
+<h4 id="aggregators">Aggregators</h4>
+<p>Dimensions Computation supports more than just one type of aggregation, and multiple aggregators can be used to combine incoming data at once. Some of the aggregators available out-of-the-box are:</p>
+<ul>
+<li><strong>Sum</strong></li>
+<li><strong>Count</strong></li>
+<li><strong>Min</strong></li>
+<li><strong>Max</strong></li>
+</ul>
+<p>As an example, suppose the publisher is not using the keys in their <strong>AdEvents</strong> and wants to perform a Sum and a Max aggregation.</p>
+<p><strong>1.</strong> An AdEvent arrives. The AdEvent is aggregated into the Sum and Max aggregations.<br />
+<img title="" src="https://docs.google.com/drawings/d/1upf5hv-UDT4BKhm7yTrcuFZYqnI263vMTXioKhr_qTo/pub?w=960&amp;h=720" alt="Adding Aggregate" /><br />
+<strong>2.</strong> Another AdEvent arrives. The AdEvent is aggregated into the Sum and Max aggregations.<br />
+<img title="" src="https://docs.google.com/drawings/d/10gTXjMyxanYo9UFc76IShPxOi5G7U5tvQKtfwqGyIws/pub?w=960&amp;h=720" alt="Adding Aggregate" /></p>
+<p>As can be seen from the example above, each <strong>AdEvent</strong> contributes to two aggregations.</p>
+<h4 id="dimension-combinations">Dimension Combinations</h4>
+<p>Each <strong>AdEvent</strong> does not necessarily contribute to only one aggregation for each aggregator. In our advertisement example there are 4 <strong>dimension combinations</strong> over which aggregations can be computed.</p>
+<ul>
+<li><strong>advertiser:</strong> This dimension combination is comprised of just the advertiser value. This means that all the aggregates for <strong>AdEvents</strong> with a particular value for advertiser (for example, Gamestop) are aggregated.</li>
+<li><strong>location:</strong> This dimension combination is comprised of just the location value. This means that all the aggregates for <strong>AdEvents</strong> with a particular value for location (for example, California) are aggregated.</li>
+<li><strong>advertiser, location:</strong> This dimension combination is comprised of the advertiser and location values. This means that all the aggregates for <strong>AdEvents</strong> with the same advertiser and location combination (for example, Gamestop, California) are aggregated.</li>
+<li><strong>the empty combination:</strong> This combination is a <em>global aggregation</em> because it doesn’t use any of the keys in the <strong>AdEvent</strong>. This means that all the <strong>AdEvents</strong> are aggregated.</li>
+</ul>
+<p>Therefore if a publisher is using the four dimension combinations enumerated above along with the sum and max aggregators, the number of aggregations being maintained would be:</p>
+<p>NUM_AGGS = 2 x <em>(number of unique advertisers)</em> + 2 x <em>(number of unique locations)</em> + 2 x <em>(number of unique advertiser and location combinations)</em> + 2</p>
+<p>And each individual <strong>AdEvent</strong> will contribute to <em>(number of aggregators)</em> x <em>(number of dimension combinations)</em> aggregations.</p>
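+<p>As a purely illustrative calculation (the cardinalities 2, 2, and 3 below are the ones used in the example that follows, and the variables are hypothetical), the formula works out as:</p>
+<pre class="prettyprint"><code>// Minimal sketch: evaluating NUM_AGGS for the Sum and Max aggregators with
+// 2 unique advertisers, 2 unique locations and 3 unique (advertiser, location) pairs.
+int aggregators = 2;                 // Sum and Max
+int numAggs = aggregators * 2        // per-advertiser aggregations
+            + aggregators * 2        // per-location aggregations
+            + aggregators * 3        // per (advertiser, location) aggregations
+            + aggregators;           // the global (empty) combination
+// numAggs == 16, and each AdEvent contributes to 2 x 4 = 8 of them</code></pre>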
+<p>Here is an example of how NUM_AGGS aggregations are computed:</p>
+<p><strong>1.</strong> An <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to aggregations for each aggregator and each dimension combination.<br />
+<img title="" src="https://docs.google.com/drawings/d/1qx8gLu615KneLDspsGkAS0_OlkX-DyvCUA7DAJtYJys/pub?w=960&amp;h=720" alt="Adding Aggregate" /><br />
+<strong>2.</strong> Another <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to aggregations for each aggregator and each dimension combination.<br />
+<img title="" src="https://docs.google.com/drawings/d/1FA2IyxewwzXtJ9A8JfJPrKtx-pfWHtHpVXp8lb8lKmE/pub?w=960&amp;h=720" alt="Adding Aggregate" /><br />
+<strong>3.</strong> Another <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to aggregations for each aggregator and each dimension combination.<br />
+<img title="" src="https://docs.google.com/drawings/d/15sxwfZeYOKBiapoD2o721M4rZs-bZBxhF3MeXelnu6M/pub?w=960&amp;h=720" alt="Adding Aggregate" /></p>
+<p>As can be seen from the example above, each <strong>AdEvent</strong> contributes to 2 x 4 = 8 aggregations and there are 2 x 2 + 2 x 2 + 2 x 3 + 2 = 16 aggregations in total.</p>
+<h4 id="time-buckets">Time Buckets</h4>
+<p>In addition to computing multiple aggregations for each dimension combination, aggregations can also be performed over time buckets. Time buckets are windows of time (for example, 1:00 pm through 1:01 pm) that are specified by a simple notation: 1m is one minute, 1h is one hour, 1d is one day. When aggregations are performed over time buckets, separate aggregations are maintained for each time bucket. Aggregations for a time bucket are comprised only of events with a time stamp that falls into that time bucket.</p>
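+<p>For illustration only (this helper is a hypothetical sketch, not part of the product API), the millisecond <strong>time</strong> field of an <strong>AdEvent</strong> can be mapped to its one-minute bucket by truncation:</p>
+<pre class="prettyprint"><code>// Minimal sketch: truncate an event timestamp (milliseconds) to the start of its 1m bucket.
+public static long minuteBucketStart(long timeMillis)
+{
+  final long ONE_MINUTE_MILLIS = 60L * 1000L;
+  return (timeMillis / ONE_MINUTE_MILLIS) * ONE_MINUTE_MILLIS;   // e.g. 1:00:37 pm -&gt; 1:00:00 pm
+}</code></pre>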
+<p>An example of how these time bucketed aggregations are computed is as follows:</p>
+<p>Let’s say our advertisement publisher is interested in computing the Sum and Max of <strong>AdEvents</strong> for the dimension combinations comprised of <strong>advertiser</strong> and <strong>location</strong> over 1 minute and 1 hour time buckets.</p>
+<p><strong>1.</strong> An <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to the aggregations for the appropriate aggregator, dimension combination and time bucket.</p>
+<p><img title="" src="https://docs.google.com/drawings/d/11voOdqkagpGKcWn5HOiWWAn78fXlpGl7aXUa3tG5sQc/pub?w=960&amp;h=720" alt="Adding Aggregate" /></p>
+<p><strong>2.</strong> Another <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to the aggregations for the appropriate aggregator, dimension combination and time bucket.</p>
+<p><img title="" src="https://docs.google.com/drawings/d/1ffovsxWZfHnSc_Z30RzGIXgzQeHjCnyZBoanO_xT_e4/pub?w=960&amp;h=720" alt="Adding Aggregate" /></p>
+<h4 id="conclusion">Conclusion</h4>
+<p>In summary, the three tunable parameters discussed above (<strong>Aggregators</strong>, <strong>Dimension Combinations</strong>, and <strong>Time Buckets</strong>) determine how aggregations are computed. In the examples provided in the <strong>Aggregators</strong>, <strong>Dimension Combinations</strong>, and <strong>Time Buckets</strong> sections respectively, we have incrementally increased the complexity with which the aggregations are performed. The examples provided in the <strong>Aggregators</strong> and <strong>Dimension Combinations</strong> sections were for illustration purposes only; the example provided in the <strong>Time Buckets</strong> section provides an accurate view of how aggregations are computed within DataTorrent&#8217;s enterprise operators.</p>
+<p>Download DataTorrent Sandbox <a href="http://web.datatorrent.com/DataTorrent-RTS-Sandbox-Edition-Download.html" target="_blank">here</a></p>
+<p>Download DataTorrent Enterprise Edition <a href="http://web.datatorrent.com/DataTorrent-RTS-Enteprise-Edition-Download.html" target="_blank">here</a></p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/">Dimensions Computation (Aggregate Navigator) Part 1: Intro</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></content:encoded>
+			<wfw:commentRss>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/feed/</wfw:commentRss>
+		<slash:comments>0</slash:comments>
+		</item>
+		<item>
+		<title>Cisco ACI, Big Data, and DataTorrent</title>
+		<link>https://www.datatorrent.com/blog_cisco_aci/</link>
+		<comments>https://www.datatorrent.com/blog_cisco_aci/#comments</comments>
+		<pubDate>Tue, 27 Oct 2015 22:30:07 +0000</pubDate>
+		<dc:creator><![CDATA[Charu Madan]]></dc:creator>
+				<category><![CDATA[Uncategorized]]></category>
+
+		<guid isPermaLink="false">https://www.datatorrent.com/?p=2348</guid>
+		<description><![CDATA[<p>By: Harry Petty, Data Center and Cloud Networking, Cisco  (This blog has been developed in association with Farid Jiandani, Product Manager with Cisco’s Insieme Networks Business Unit and Charu Madan, Director Business Development at DataTorrent. It was originally published on Cisco Blogs) If you work for an enterprise that’s looking to hit its digital sweet [&#8230;]</p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog_cisco_aci/">Cisco ACI, Big Data, and DataTorrent</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></description>
+				<content:encoded><![CDATA[<p>By: Harry Petty, Data Center and Cloud Networking, Cisco</p>
+<p class="c0 c11"><a name="h.gjdgxs"></a><span class="c1"> (</span><span class="c4 c13">This blog has been developed in association with Farid Jiandani, Product Manager with Cisco’s Insieme Networks Business Unit and Charu Madan, Director Business Development at DataTorrent. It was originally published on <a href="http://blogs.cisco.com/datacenter/aci-big-data-and-datatorrent" target="_blank">Cisco Blogs</a>)</span></p>
+<p class="c0"><span class="c1">If you work for an enterprise that’s looking to hit its digital sweet spot, then you’re scrutinizing your sales, marketing and operations to see where you should make digital investments to innovate and improve productivity. Super-fast data processing at scale is being used to obtain real-time insights for digital business and Internet of Things (IoT) initiatives.</span></p>
+<p class="c0"><span class="c1">According to Gartner Group, one of the cool vendors in this area of providing super- fast big data analysis using in-memory streaming analytics is called DataTorrent, a startup founded by long-time ex-Yahoo! veterans with vast experience managing big data for leading edge applications and infrastructure at massive scale. Their goal is to empower today’s enterprises to experience the full potential and business impact of big data with a platform that processes and analyzes data in real-time.</span></p>
+<p class="c0"><span class="c1 c2">DataTorrent RTS</span></p>
+<p class="c0"><span class="c4 c6">DataTorrent RTS is an open core</span><span class="c2 c4 c6">, enterprise-grade product powered by Apache Apex. </span><span class="c4 c6">DataTorrent RTS provides a single, unified batch and stream processing platform that enables organizations to reduce time to market, development costs and operational expenditures for big data analytics applications. </span></p>
+<p class="c0"><span class="c1 c2">DataTorrent RTS Integration with ACI</span></p>
+<p class="c0"><span class="c4 c6">A member of the Cisco ACI ecosystem, DataTorrent announced on September 29th DataTorrent RTS integration with Cisco </span><span class="c4 c6"><a class="c7" href="https://www.google.com/url?q=http://www.cisco.com/c/en/us/solutions/data-center-virtualization/application-centric-infrastructure/index.html&amp;sa=D&amp;usg=AFQjCNFMZhMYdUmPuuqrUI5IZmrvEhlK5g">Application Centric Infrastructure (ACI)</a></span><span class="c4 c6"> through the Application Policy Infrastructure Controller (APIC) to help create more efficient IT operations, bringing together network operations management and big data application management and development: </span></p>
+<p class="c0"><span class="c4"><a class="c7" href="https://www.google.com/url?q=https://www.datatorrent.com/press-releases/datatorrent-integrates-with-cisco-aci-to-help-secure-big-data-processing-through-a-unified-data-and-network-fabric/&amp;sa=D&amp;usg=AFQjCNG4S_2-OY5ox5nCf_0_Qj7s-x9pyw">DataTorrent Integrates with Cisco ACI to Help Secure Big Data Processing Through a Unified Data and Network Fabric</a></span><span class="c4">. </span><span class="c2 c4">The joint solution enables</span></p>
+<ul class="c8 lst-kix_list_2-0 start">
+<li class="c12 c0"><span class="c4">A unified fabric approach for managing </span><span class="c2 c4">Applications, Data </span><span class="c4">and </span><span class="c2 c4">Network</span></li>
+<li class="c0 c12"><span class="c4">A highly secure and automated Big Data application platform which uses the power of Cisco ACI for automation and security policy management </span></li>
+<li class="c12 c0"><span class="c4">The creation, repository, and enforcement point for Cisco ACI application policies for big data applications</span></li>
+</ul>
+<p class="c0"><span class="c4">With the ACI integration, secure connectivity to diverse data sets becomes a part of a user defined policy which is automated and does not compromise on security and access management. As an example, if one of the DataTorrent Big Data application needs access to say a Kafka source, then all nodes need to be opened up. This leaves the environment vulnerable and prone to attacks. With ACI, the access management policies and contracts help define the connectivity contracts and only the right node and right application gets access. See Figure 1 and 2 for the illustration of this concept. </span></p>
+<p class="c0"><span class="c1 c2">Figure 1:</span></p>
+<p class="c0"><a href="https://www.datatorrent.com/wp-content/uploads/2015/10/image00.jpg"><img class="alignnone size-full wp-image-2349" src="https://www.datatorrent.com/wp-content/uploads/2015/10/image00.jpg" alt="image00" width="432" height="219" /></a></p>
+<p class="c0"><span class="c1 c2">Figure 2</span></p>
+<p class="c0"><a href="https://www.datatorrent.com/wp-content/uploads/2015/10/image011.png"><img class="alignnone size-full wp-image-2350" src="https://www.datatorrent.com/wp-content/uploads/2015/10/image011.png" alt="image01" width="904" height="493" /></a></p>
+<p class="c0"><span class="c1 c2">ACI Support for Big Data Solutions</span></p>
+<p class="c0"><span class="c1">The openness and the flexibility of ACI allow big data customers to run a wide variety of different applications within their fabric alongside Hadoop. Due to the elasticity of ACI, customers are able to run batch processing alongside stream processing and other data base applications in a seamless fashion. In traditional Hadoop environments, the network is segmented based off of individual server nodes (see Figure 1). This makes it difficult to elastically allow access to and from different applications. Ultimately, within the ACI framework, logical demarcation points can be created based on application workloads rather than physical server groups (a set of Hadoop nodes should not be considered as a bunch of individual server nodes, rather a single group.)</span></p>
+<p class="c0"><span class="c1 c2">A Robust and Active Ecosystem</span></p>
+<p class="c0"><span class="c1">Many vendors claim they have a broad ecosystem of vendors, but sometimes that’s pure marketing, without any real integration efforts going on behind the slideware. But Cisco’s Application Centric Infrastructure (ACI) has a very active ecosystem of industry leaders who are putting significant resources into integration efforts, taking advantage of ACI’s open Northbound and Southbound API’s. DataTorrent is just one example of an innovative company that is using ACI integration to add value to their solutions and deliver real benefits to their channel partners and customers.</span></p>
+<p class="c0"><span class="c1">Stay tuned for more success stories to come: we’ll continue to showcase industry leaders that are taking advantage of the open ACI API’s.</span></p>
+<p class="c0"><span class="c1">Additional References</span></p>
+<p class="c0"><span class="c3"><a class="c7" href="https://www.google.com/url?q=https://www.cisco.com/go/aci&amp;sa=D&amp;usg=AFQjCNHPa1zEn6-1fEWQeCgZ-QmP9te5ig">https://www.cisco.com/go/aci</a></span></p>
+<p class="c0"><span class="c3"><a class="c7" href="https://www.google.com/url?q=https://www.cisco.com/go/aciecosystem&amp;sa=D&amp;usg=AFQjCNGmS3P3mOU0DQen5F43--fDi25uWw">https://www.cisco.com/go/aciecosystem</a></span></p>
+<p class="c11 c0"><span class="c3"><a class="c7" href="https://www.google.com/url?q=http://www.datatorrent/com&amp;sa=D&amp;usg=AFQjCNHbzoCVBy0azkWTbjpqdyxPqkCo9g">http://www.datatorrent/</a></span></p>
+<p>&nbsp;</p>
+<p>&nbsp;</p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog_cisco_aci/">Cisco ACI, Big Data, and DataTorrent</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></content:encoded>
+			<wfw:commentRss>https://www.datatorrent.com/blog_cisco_aci/feed/</wfw:commentRss>
+		<slash:comments>0</slash:comments>
+		</item>
+		<item>
+		<title>Write Your First Apache Apex Application in Scala</title>
+		<link>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/</link>
+		<comments>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/#comments</comments>
+		<pubDate>Tue, 27 Oct 2015 01:58:25 +0000</pubDate>
+		<dc:creator><![CDATA[Tushar Gosavi]]></dc:creator>
+				<category><![CDATA[Uncategorized]]></category>
+
+		<guid isPermaLink="false">https://www.datatorrent.com/?p=2280</guid>
+		<description><![CDATA[<p>* Extend your Scala expertise to building Apache Apex applications * Scala is modern, multi-paradigm programing language that integrates features of functional as well as object-oriented languages elegantly. Big Data frameworks are already exploring Scala as a language of choice for implementations. Apache Apex is developed in Java, the Apex APIs are such that writing [&#8230;]</p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/">Write Your First Apache Apex Application in Scala</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></description>
+				<content:encoded><![CDATA[<p><em>* Extend your Scala expertise to building Apache Apex applications *</em></p>
+<p>Scala is a modern, multi-paradigm programming language that elegantly integrates features of functional and object-oriented languages. Big data frameworks are already adopting Scala as a language of choice for implementations. Apache Apex is developed in Java, and its APIs are designed so that writing applications is straightforward. Developers can use any programming language that runs on the JVM and can access Java classes; because Scala interoperates well with Java, writing Apex applications in Scala is a fuss-free experience. This post explains how to write an Apache Apex application in Scala.</p>
+<p>Writing an <a href="http://www.datatorrent.com/apex" target="_blank">Apache Apex</a> application in Scala is simple.</p>
+<h2 id="operators-within-the-application">Operators within the application</h2>
+<p>We will develop a word count application in Scala. The application looks for new files in a directory; when new files become available, it reads them, computes a count for each word, and prints the results on stdout. The application requires the following operators:</p>
+<ul>
+<li><strong>LineReader</strong> &#8211; This operator monitors directories for new files periodically. After a new file is detected, LineReader reads the file line-by-line, and makes lines available on the output port for the next operator.</li>
+<li><strong>Parser</strong> &#8211; This operator receives lines read by LineReader on its input port. Parser breaks the line into words, and makes individual words available on the output port.</li>
+<li><strong>UniqueCounter</strong> &#8211; This operator computes the count of each word received on its input port.</li>
+<li><strong>ConsoleOutputOperator</strong> &#8211; This operator prints unique counts of words on standard output.</li>
+</ul>
+<h2 id="build-the-scala-word-count-application">Build the Scala word count application</h2>
+<p>Now, we will generate a sample application using the Maven archetype:generate goal.</p>
+<h3 id="generate-a-sample-application">Generate a sample application.</h3>
+<pre class="prettyprint"><code class="language-bash hljs ">mvn archetype:generate -DarchetypeRepository=https://www.datatorrent.com/maven/content/repositories/releases -DarchetypeGroupId=com.datatorrent -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=<span class="hljs-number">3.0</span>.<span class="hljs-number">0</span> -DgroupId=com.datatorrent -Dpackage=com.datatorrent.wordcount -DartifactId=wordcount -Dversion=<span class="hljs-number">1.0</span>-SNAPSHOT</code></pre>
+<p>This creates a directory called <strong>wordcount</strong>, with a sample application and build script. Let us see how to modify this application into the Scala-based word count application that we are looking to develop.</p>
+<h3 id="add-the-scala-build-plugin">Add the Scala build plugin.</h3>
+<p>Apache Apex uses Maven for building the framework and operator library, and Maven provides a plugin for compiling Scala files. To enable this plugin, add the following snippet to the <code>build -&gt; plugins</code> section of the <code>pom.xml</code> file located in the application directory.</p>
+<pre class="prettyprint"><code class="language-xml hljs ">  <span class="hljs-tag">&lt;<span class="hljs-title">plugin</span>&gt;</span>
+    <span class="hljs-tag">&lt;<span class="hljs-title">groupId</span>&gt;</span>net.alchim31.maven<span class="hljs-tag">&lt;/<span class="hljs-title">groupId</span>&gt;</span>
+    <span class="hljs-tag">&lt;<span class="hljs-title">artifactId</span>&gt;</span>scala-maven-plugin<span class="hljs-tag">&lt;/<span class="hljs-title">artifactId</span>&gt;</span>
+    <span class="hljs-tag">&lt;<span class="hljs-title">version</span>&gt;</span>3.2.1<span class="hljs-tag">&lt;/<span class="hljs-title">version</span>&gt;</span>
+    <span class="hljs-tag">&lt;<span class="hljs-title">executions</span>&gt;</span>
+      <span class="hljs-tag">&lt;<span class="hljs-title">execution</span>&gt;</span>
+        <span class="hljs-tag">&lt;<span class="hljs-title">goals</span>&gt;</span>
+          <span class="hljs-tag">&lt;<span class="hljs-title">goal</span>&gt;</span>compile<span class="hljs-tag">&lt;/<span class="hljs-title">goal</span>&gt;</span>
+          <span class="hljs-tag">&lt;<span class="hljs-title">goal</span>&gt;</span>testCompile<span class="hljs-tag">&lt;/<span class="hljs-title">goal</span>&gt;</span>
+        <span class="hljs-tag">&lt;/<span class="hljs-title">goals</span>&gt;</span>
+      <span class="hljs-tag">&lt;/<span class="hljs-title">execution</span>&gt;</span>
+    <span class="hljs-tag">&lt;/<span class="hljs-title">executions</span>&gt;</span>
+  <span class="hljs-tag">&lt;/<span class="hljs-title">plugin</span>&gt;</span></code></pre>
+<p>Also, specify the Scala library as a dependency in the pom.xml file.<br />
+Add the Scala library.</p>
+<pre class="prettyprint"><code class="language-xml hljs "><span class="hljs-tag">&lt;<span class="hljs-title">dependency</span>&gt;</span>
+ <span class="hljs-tag">&lt;<span class="hljs-title">groupId</span>&gt;</span>org.scala-lang<span class="hljs-tag">&lt;/<span class="hljs-title">groupId</span>&gt;</span>
+ <span class="hljs-tag">&lt;<span class="hljs-title">artifactId</span>&gt;</span>scala-library<span class="hljs-tag">&lt;/<span class="hljs-title">artifactId</span>&gt;</span>
+ <span class="hljs-tag">&lt;<span class="hljs-title">version</span>&gt;</span>2.11.2<span class="hljs-tag">&lt;/<span class="hljs-title">version</span>&gt;</span>
+<span class="hljs-tag">&lt;/<span class="hljs-title">dependency</span>&gt;</span></code></pre>
+<p>We are now set to write a Scala application.</p>
+<h2 id="write-your-scala-word-count-application">Write your Scala word count application</h2>
+<h3 id="linereader">LineReader</h3>
+<p><a href="https://github.com/apache/incubator-apex-malhar" target="_blank">Apache Malhar library</a> contains an <a href="https://github.com/apache/incubator-apex-malhar/blob/1f5676b455f7749d11c7cd200216d0d4ad7fce32/library/src/main/java/com/datatorrent/lib/io/AbstractFTPInputOperator.java" target="_blank">AbstractFileInputOperator</a> operator that monitors and reads files from a directory. This operator has capabilities such as support for scaling, fault tolerance, and exactly once processing. To complete the functionality, override a few methods:<br />
+<em>readEntity</em> : Reads a line from a file.<br />
+<em>emit</em> : Emits data read on the output port.<br />
+We have overridden openFile to obtain an instance of BufferedReader, which is required for reading lines from the file, and closeFile to close that BufferedReader.</p>
+<pre class="prettyprint"><code class="language-scala hljs "><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LineReader</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">AbstractFileInputOperator</span>[<span class="hljs-title">String</span>] {</span>
+
+  <span class="hljs-annotation">@transient</span>
+  <span class="hljs-keyword">val</span> out : DefaultOutputPort[String] = <span class="hljs-keyword">new</span> DefaultOutputPort[String]();
+
+  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> readEntity(): String = br.readLine()
+
+  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> emit(line: String): Unit = out.emit(line)
+
+  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> openFile(path: Path): InputStream = {
+    <span class="hljs-keyword">val</span> in = <span class="hljs-keyword">super</span>.openFile(path)
+    br = <span class="hljs-keyword">new</span> BufferedReader(<span class="hljs-keyword">new</span> InputStreamReader(in))
+    <span class="hljs-keyword">return</span> in
+  }
+
+  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> closeFile(is: InputStream): Unit = {
+    br.close()
+    <span class="hljs-keyword">super</span>.closeFile(is)
+  }
+
+  <span class="hljs-annotation">@transient</span>
+  <span class="hljs-keyword">private</span> <span class="hljs-keyword">var</span> br : BufferedReader = <span class="hljs-keyword">null</span>
+}</code></pre>
+<p>Some Apex API classes are not serializable and must be marked transient. Scala supports the @transient annotation for such scenarios. Any object that is not part of the operator's state must be annotated with @transient. For example, in this code we have annotated the buffered reader and the output port as transient.</p>
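+<p>As a minimal sketch (a hypothetical operator, not part of the word count application; imports omitted as in the other snippets), the distinction looks like this: fields that belong to the operator state are checkpointed, while @transient fields are rebuilt in setup().</p>
+<pre class="prettyprint"><code class="language-scala">// Sketch only: checkpointed state vs. transient, rebuilt-on-setup fields
+class StatefulOperator extends BaseOperator {
+  // Part of the operator state: saved and restored across checkpoints
+  var processedCount: Long = 0L
+
+  // Not operator state (and not serializable): mark @transient and rebuild in setup()
+  @transient
+  private var buffer: StringBuilder = null
+
+  override def setup(context: Context.OperatorContext): Unit = {
+    buffer = new StringBuilder()
+  }
+}</code></pre>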
+<h3 id="parser">Parser</h3>
+<p>Parser splits lines using the built-in Java split function, and emits individual words on the output port.</p>
+<pre class="prettyprint"><code class="language-scala hljs "><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Parser</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">BaseOperator</span> {</span>
+  <span class="hljs-annotation">@BeanProperty</span>
+  <span class="hljs-keyword">var</span> regex : String = <span class="hljs-string">" "</span>
+
+  <span class="hljs-annotation">@transient</span>
+  <span class="hljs-keyword">val</span> out = <span class="hljs-keyword">new</span> DefaultOutputPort[String]()
+
+  <span class="hljs-annotation">@transient</span>
+  <span class="hljs-keyword">val</span> in = <span class="hljs-keyword">new</span> DefaultInputPort[String]() {
+    <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> process(t: String): Unit = {
+      <span class="hljs-keyword">for</span>(w &lt;- t.split(regex)) out.emit(w)
+    }
+  }
+}</code></pre>
+<p>Scala automatically generates setters and getters for fields annotated with @BeanProperty. Properties annotated with @BeanProperty can be set at launch time through configuration files, and can also be modified while an application is running. For example, the regular expression used for splitting can be specified in the configuration file.</p>
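+<p>As an illustration of what @BeanProperty generates (a hypothetical usage snippet, not part of the application), the annotated Scala field gets Java-style accessors that the platform and tests can call:</p>
+<pre class="prettyprint"><code class="language-scala">// Illustration only: @BeanProperty adds Java-style getters/setters to the Scala field
+val parser = new Parser
+parser.setRegex(",")       // setter generated by @BeanProperty
+println(parser.getRegex)   // getter generated by @BeanProperty; prints ","</code></pre>
+<p>Following the property naming pattern used for the input directory below, the regex for the operator named <em>parser</em> could likewise be set at launch time with a key such as <code>dt.application.WordCount.operator.parser.prop.regex</code>.</p>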
+<h3 id="uniquecount-and-consoeloutputoperator">UniqueCount and ConsoelOutputOperator</h3>
+<p>For this application, let us use UniqueCounter and ConsoleOutputOperator as is.</p>
+<h3 id="put-together-the-word-count-application">Put together the word count application</h3>
+<p>Writing the main application class in Scala is similar to doing it in Java. Override the populateDAG() method, which receives a DAG instance; add operators to that instance using the addOperator() method, and finally connect the operators with the addStream() method.</p>
+<pre class="prettyprint"><code class="language-scala hljs "><span class="hljs-annotation">@ApplicationAnnotation</span>(name=<span class="hljs-string">"WordCount"</span>)
+<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Application</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">StreamingApplication</span> {</span>
+  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> populateDAG(dag: DAG, configuration: Configuration): Unit = {
+    <span class="hljs-keyword">val</span> input = dag.addOperator(<span class="hljs-string">"input"</span>, <span class="hljs-keyword">new</span> LineReader)
+    <span class="hljs-keyword">val</span> parser = dag.addOperator(<span class="hljs-string">"parser"</span>, <span class="hljs-keyword">new</span> Parser)
+    <span class="hljs-keyword">val</span> counter = dag.addOperator(<span class="hljs-string">"counter"</span>, <span class="hljs-keyword">new</span> UniqueCounter[String])
+    <span class="hljs-keyword">val</span> out = dag.addOperator(<span class="hljs-string">"console"</span>, <span class="hljs-keyword">new</span> ConsoleOutputOperator)
+
+    dag.addStream[String](<span class="hljs-string">"lines"</span>, input.out, parser.in)
+    dag.addStream[String](<span class="hljs-string">"words"</span>, parser.out, counter.data)
+    dag.addStream[java.util.HashMap[String,Integer]](<span class="hljs-string">"counts"</span>, counter.count, out.input)
+  }
+}</code></pre>
+<h2 id="running-application">Running application</h2>
+<p>Before running the word count application, specify the input directory for the input operator. You can use the default configuration file for this. Open the <em>src/main/resources/META-INF/properties.xml</em> file, and add the following lines inside the enclosing configuration element. Do not forget to replace “username” with your Hadoop username.</p>
+<pre class="prettyprint"><code class="language-xml hljs "><span class="hljs-tag">&lt;<span class="hljs-title">property</span>&gt;</span>
+ <span class="hljs-tag">&lt;<span class="hljs-title">name</span>&gt;</span>dt.application.WordCount.operator.input.prop.directory<span class="hljs-tag">&lt;/<span class="hljs-title">name</span>&gt;</span>
+  <span class="hljs-tag">&lt;<span class="hljs-title">value</span>&gt;</span>/user/username/data<span class="hljs-tag">&lt;/<span class="hljs-title">value</span>&gt;</span>
+<span class="hljs-tag">&lt;/<span class="hljs-title">property</span>&gt;</span></code></pre>
+<p>Build the application from the application directory using this command:</p>
+<pre class="prettyprint"><code class="language-bash hljs ">mvn clean install -DskipTests</code></pre>
+<p>You should now have an application package in the target directory.</p>
+<p>Now, launch this application package using dtcli.</p>
+<pre class="prettyprint"><code class="language-bash hljs ">$ dtcli
+DT CLI <span class="hljs-number">3.2</span>.<span class="hljs-number">0</span>-SNAPSHOT <span class="hljs-number">28.09</span>.<span class="hljs-number">2015</span> @ <span class="hljs-number">12</span>:<span class="hljs-number">45</span>:<span class="hljs-number">15</span> IST rev: <span class="hljs-number">8</span>e49cfb branch: devel-<span class="hljs-number">3</span>
+dt&gt; launch target/wordcount-<span class="hljs-number">1.0</span>-SNAPSHOT.apa
+{<span class="hljs-string">"appId"</span>: <span class="hljs-string">"application_1443354392775_0010"</span>}
+dt (application_1443354392775_0010) &gt;</code></pre>
+<p>Add some text files to the <em>/user/username/data</em> directory on your HDFS to see how the application works. You can see the words along with their counts in the container log of the console operator.</p>
+<h2 id="summary">Summary</h2>
+<p>Scala classes are JVM classes: they can inherit from Java classes, and they can create and call Java objects transparently. That is why you can easily extend your Scala expertise to building Apex applications.<br />
+To get started with creating your first application, see <a href="https://www.datatorrent.com/buildingapps/">https://www.datatorrent.com/buildingapps/</a>.</p>
+<h2 id="see-also">See Also</h2>
+<ul>
+<li>Building Applications with Apache Apex and Malhar <a href="https://www.datatorrent.com/buildingapps/">https://www.datatorrent.com/buildingapps/</a></li>
+<li>Scala tutorial for Java programmers <a href="http://docs.scala-lang.org/tutorials/scala-for-java-programmers.html">http://docs.scala-lang.org/tutorials/scala-for-java-programmers.html</a></li>
+<li>Application Developer Guide <a href="https://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html">https://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html</a></li>
+<li>Operator Developer Guide <a href="https://www.datatorrent.com/docs/guides/OperatorDeveloperGuide.html">https://www.datatorrent.com/docs/guides/OperatorDeveloperGuide.html</a></li>
+<li>Malhar Operator Library <a href="https://www.datatorrent.com/docs/guides/MalharStandardOperatorLibraryTemplatesGuide.html">https://www.datatorrent.com/docs/guides/MalharStandardOperatorLibraryTemplatesGuide.html</a></li>
+</ul>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/">Write Your First Apache Apex Application in Scala</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></content:encoded>
+			<wfw:commentRss>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/feed/</wfw:commentRss>
+		<slash:comments>0</slash:comments>
+		</item>
+		<item>
+		<title>Apache Apex Performance Benchmarks</title>
+		<link>https://www.datatorrent.com/blog-apex-performance-benchmark/</link>
+		<comments>https://www.datatorrent.com/blog-apex-performance-benchmark/#comments</comments>
+		<pubDate>Tue, 20 Oct 2015 13:23:27 +0000</pubDate>
+		<dc:creator><![CDATA[Vlad Rozov]]></dc:creator>
+				<category><![CDATA[Uncategorized]]></category>
+
+		<guid isPermaLink="false">https://www.datatorrent.com/?p=2261</guid>
+		<description><![CDATA[<p>Why another benchmark blog? As engineers, we are skeptical of performance benchmarks developed and published by software vendors. Most of the time such benchmarks are biased towards the company’s own product in comparison to other vendors. Any reported advantage may be the result of selecting a specific use case better handled by the product or [&#8230;]</p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-apex-performance-benchmark/">Apache Apex Performance Benchmarks</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></description>
+				<content:encoded><![CDATA[<p><b id="apex-performance-benchmarks" class="c2 c16"><span class="c0">Why another benchmark blog?</span></b></p>
+<p class="c2">As engineers, we are skeptical of performance benchmarks developed and published by software vendors. Most of the time such benchmarks are biased towards the company’s own product in comparison to other vendors. Any reported advantage may be the result of selecting a specific use case better handled by the product or using optimized configuration for one’s own product compared to out of the box configuration for other vendors’ products.</p>
+<p class="c2">So, why another blog on the topic? The reason I decided to write this blog is to explain the rationale behind <a href="http://www.datatorrent.com">DataTorrent’s</a> effort to develop and maintain a benchmark performance suite, how the suite is used to certify various releases, and seek community opinion on how the performance benchmark may be improved.</p>
+<p class="c2 c4">Note: the performance numbers given here are only for reference and by no means a comprehensive performance evaluation of <a href="http://apex.apache.org/">Apache APEX</a>; performance numbers can vary depending on different configurations or different use cases.</p>
+<p class="c12 c2 subtitle"><strong>Benchmark application.</strong><img class=" aligncenter" title="" src="https://www.datatorrent.com/wp-content/uploads/2015/10/image02.png" alt="" /></p>
+<p class="c2">To evaluate the performance of the <a href="http://apex.apache.org/">Apache APEX</a>  platform,  we use Benchmark application that has a simple <a href="https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/">DAG</a> with only two operators. The first operator (<span class="c3">wordGenerato</span>r) emits tuples and the second operator (<span class="c3">counter</span>) counts tuples received. For such trivial operations, operators add minimum overhead to CPU and memory consumption allowing measurement of <a href="http://apex.apache.org/">Apache APEX</a>  platform throughput. As operators don’t change from release to release, this application is suitable for comparing the platform performance across releases.</p>
+<p class="c2">Tuples are byte arrays with configurable length, minimizing complexity of tuples serialization and at the same time allowing examination of  performance of the platform against several different tuple sizes. The emitter (<span class="c3">wordGenerator</span>) operator may be configured to use the same byte array avoiding the operator induced garbage collection. Or it may be configured to allocate new byte array for every new tuple emitted, more closely simulating real application behavior.</p>
+<p class="c2">The consumer (<span class="c3">counter</span>) operator collects the total number of tuples received and the wall-clock time in milliseconds passed between begin and end window. It writes the collected data to the log at the end of every 10th window.</p>
+<p class="c2">The data stream (<span class="c3">Generator2Counter</span>) connects the first operator output port to the second operator input port. The benchmark suite exercises all possible configurations for the stream locality:</p>
+<ul class="c8 lst-kix_2ql03f9wui4c-0 start">
+<li class="c2 c7">thread local (<span class="c3">THREAD_LOCAL</span>) when both operators are deployed into the same thread within a container effectively serializing operators computation;</li>
+<li class="c2 c7">container local (<span class="c3">CONTAINER_LOCAL)</span><span class="c3"> </span>when both operators are deployed into the same container and execute in two different threads;</li>
+<li class="c2 c7">node local (<span class="c3">NODE_LOCAL</span>)<sup><a href="#ftnt_ref1">[1]</a></sup> when operators are deployed into two different containers running on the same yarn node;</li>
+<li class="c2 c7">rack local (RACK_LOCAL)<sup><a href="#ftnt_ref2">[2]</a></sup> when operators are deployed into two different containers running on yarn nodes residing on the same rack</li>
+<li class="c2 c7">no locality when operators are deployed into two different containers running on any hadoop cluster node.</li>
+</ul>
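+<p class="c2">Locality is set on the stream itself when the DAG is wired. A minimal sketch (reusing the word count stream names from the previous post purely as stand-ins) looks like this; DAG.Locality also provides CONTAINER_LOCAL, NODE_LOCAL and RACK_LOCAL, and leaving it unset means no locality constraint:</p>
+<pre class="prettyprint"><code class="language-scala">// Sketch: pinning a stream's locality while building the DAG
+dag.addStream[String]("lines", input.out, parser.in)
+   .setLocality(DAG.Locality.THREAD_LOCAL)</code></pre>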
+<p class="c2 c12 subtitle"><span class="c0"><b><a href="http://apex.apache.org/">Apache APEX</a> release performance certification</b></span></p>
+<p class="c2">The benchmark application is a part of <a href="http://apex.apache.org/">Apache APEX</a> release certification. It is executed on <a href="http://www.datatorrent.com">DataTorrent’s</a> development Hadoop cluster by an automated script that launches the application with all supported <span class="c3">Generator2Counter</span> stream localities and 64, 128, 256, 512, 1024, 2048 and a tuple byte array length of 4096. The script collects the number of tuples emitted, the number of tuples processed and the <span class="c3">counter</span> operator latency for the running application and shuts down the application after it runs for 5 minutes, whereupon it moves on to the next configuration. For all configurations, the script runs between 6 and 8 hours depending on the development cluster load.</p>
+<p class="c12 c2 subtitle"><span class="c0"><b>Benchmark results</b></span></p>
+<p class="c2">As each supported stream locality has distinct performance characteristics (with exception of rack local and no locality due to the development cluster being setup on a single rack), I use a separate chart for each stream locality.</p>
+<p class="c2">Overall the results are self explanatory and I hope that anyone who uses, plans to use or plans to contribute to the <a href="http://apex.apache.org/">Apache APEX</a> project finds it useful. A few notes that seems to be worth mentioning:</p>
+<ul class="c8 lst-kix_5u2revq5rd1r-0 start">
+<li class="c2 c7">There is no performance regression in APEX release 3.0 compared to release 2.0</li>
+<li class="c2 c7">Benchmark was executed with default settings for buffer server spooling (turned on by default in release 3.0 and off in release 2.0). As the result, the benchmark application required just 2 GB of memory for the <span class="c3">wordGenerator</span> operator container in release 3.0, while it was necessary to allocate 8 GB to the same in release 2.0</li>
+<li class="c2 c7">When tuple size increases, JVM garbage collection starts to play a major role in performance benchmark compared to locality</li>
+<li class="c2 c7">Thread local outperforms all other stream localities only for trivial operators that we specifically designed for the benchmark.</li>
+<li class="c2 c7">The benchmark was performed on the development cluster while other developers were using it<img title="" src="https://www.datatorrent.com/wp-content/uploads/2015/10/image03.png" alt="" /></li>
+</ul>
+<p class="c2"><img title="" src="https://www.datatorrent.com/wp-content/uploads/2015/10/image01.png" alt="" /></p>
+<p class="c2 c17"><img title="" src="https://www.datatorrent.com/wp-content/uploads/2015/10/image002.png" alt="" /></p>
+<hr class="c10" />
+<div>
+<p class="c2 c13"><a name="ftnt_ref1"></a>[1]<span class="c6"> NODE_LOCAL is currently excluded from the benchmark test due to known limitation. Please see </span><span class="c6 c9"><a class="c5" href="https://malhar.atlassian.net/browse/APEX-123">APEX-123</a></span></p>
+</div>
+<div>
+<p class="c2 c13"><a name="ftnt_ref2"></a>[2]<span class="c6"> RACK_LOCAL is not yet fully implemented by APEX and is currently equivalent to no locality specified</span></p>
+</div>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-apex-performance-benchmark/">Apache Apex Performance Benchmarks</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></content:encoded>
+			<wfw:commentRss>https://www.datatorrent.com/blog-apex-performance-benchmark/feed/</wfw:commentRss>
+		<slash:comments>0</slash:comments>
+		</item>
+		<item>
+		<title>Introduction to dtGateway</title>
+		<link>https://www.datatorrent.com/blog-introduction-to-dtgateway/</link>
+		<comments>https://www.datatorrent.com/blog-introduction-to-dtgateway/#comments</comments>
+		<pubDate>Tue, 06 Oct 2015 13:00:48 +0000</pubDate>
+		<dc:creator><![CDATA[David Yan]]></dc:creator>
+				<category><![CDATA[Uncategorized]]></category>
+
+		<guid isPermaLink="false">https://www.datatorrent.com/?p=2247</guid>
+		<description><![CDATA[<p>A platform, no matter how much it can do, and how technically superb it is, does not delight users without a proper UI or an API. That’s why there are products such as Cloudera Manager and Apache Ambari to improve the usability of the Hadoop platform. At DataTorrent, in addition to excellence in technology, we [&#8230;]</p>
+<p>The post <a rel="nofollow" href="https://www.datatorrent.com/blog-introduction-to-dtgateway/">Introduction to dtGateway</a> appeared first on <a rel="nofollow" href="https://www.datatorrent.com">DataTorrent</a>.</p>
+]]></description>
+				<content:encoded><![CDATA[<p>A platform, no matter how much it can do, and how technically superb it is, does not delight users without a proper UI or an API. That’s why there are products such as Cloudera Manager and Apache Ambari to improve the 

<TRUNCATED>