Posted to commits@storm.apache.org by pt...@apache.org on 2014/05/27 20:39:09 UTC

svn commit: r1597847 [6/6] - in /incubator/storm/site: _posts/ publish/ publish/2012/08/02/ publish/2012/09/06/ publish/2013/01/11/ publish/2013/12/08/ publish/2014/04/10/ publish/2014/04/17/ publish/2014/04/19/ publish/2014/04/21/ publish/2014/04/22/ ...

Modified: incubator/storm/site/publish/documentation/Trident-tutorial.html
URL: http://svn.apache.org/viewvc/incubator/storm/site/publish/documentation/Trident-tutorial.html?rev=1597847&r1=1597846&r2=1597847&view=diff
==============================================================================
--- incubator/storm/site/publish/documentation/Trident-tutorial.html (original)
+++ incubator/storm/site/publish/documentation/Trident-tutorial.html Tue May 27 18:39:07 2014
@@ -67,11 +67,11 @@
 <div id="aboutcontent">
 <h1 id="trident-tutorial">Trident tutorial</h1>
 
-<p>Trident is a high-level abstraction for doing realtime computing on top of Storm. It allows you to seamlessly intermix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. If you&#8217;re familiar with high level batch processing tools like Pig or Cascading, the concepts of Trident will be very familiar – Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.</p>
+<p>Trident is a high-level abstraction for doing realtime computing on top of Storm. It allows you to seamlessly intermix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. If you’re familiar with high level batch processing tools like Pig or Cascading, the concepts of Trident will be very familiar – Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.</p>
 
 <h2 id="illustrative-example">Illustrative example</h2>
 
-<p>Let&#8217;s look at an illustrative example of Trident. This example will do two things:</p>
+<p>Let’s look at an illustrative example of Trident. This example will do two things:</p>
 
 <ol>
   <li>Compute streaming word count from an input stream of sentences</li>
@@ -89,7 +89,7 @@ FixedBatchSpout spout = new FixedBatchSp
 spout.setCycle(true);
 </code></p>
 
-<p>This spout cycles through that set of sentences over and over to produce the sentence stream. Here&#8217;s the code to do the streaming word count part of the computation:</p>
+<p>This spout cycles through that set of sentences over and over to produce the sentence stream. Here’s the code to do the streaming word count part of the computation:</p>
 
 <p><code>java
 TridentTopology topology = new TridentTopology();        
@@ -101,7 +101,7 @@ TridentState wordCounts =
        .parallelismHint(6);
 </code></p>
 
-<p>Let&#8217;s go through the code line by line. First a TridentTopology object is created, which exposes the interface for constructing Trident computations. TridentTopology has a method called newStream that creates a new stream of data in the topology reading from an input source. In this case, the input source is just the FixedBatchSpout defined from before. Input sources can also be queue brokers like Kestrel or Kafka. Trident keeps track of a small amount of state for each input source (metadata about what it has consumed) in Zookeeper, and the &#8220;spout1&#8221; string here specifies the node in Zookeeper where Trident should keep that metadata.</p>
+<p>Let’s go through the code line by line. First a TridentTopology object is created, which exposes the interface for constructing Trident computations. TridentTopology has a method called newStream that creates a new stream of data in the topology reading from an input source. In this case, the input source is just the FixedBatchSpout defined from before. Input sources can also be queue brokers like Kestrel or Kafka. Trident keeps track of a small amount of state for each input source (metadata about what it has consumed) in Zookeeper, and the “spout1” string here specifies the node in Zookeeper where Trident should keep that metadata.</p>
 
 <p>Trident processes the stream as small batches of tuples. For example, the incoming stream of sentences might be divided into batches like so:</p>
 
@@ -109,9 +109,9 @@ TridentState wordCounts =
 
 <p>Generally the size of those small batches will be on the order of thousands or millions of tuples, depending on your incoming throughput.</p>
 
-<p>Trident provides a fully fledged batch processing API to process those small batches. The API is very similar to what you see in high level abstractions for Hadoop like Pig or Cascading: you can do group by&#8217;s, joins, aggregations, run functions, run filters, and so on. Of course, processing each small batch in isolation isn&#8217;t that interesting, so Trident provides functions for doing aggregations across batches and persistently storing those aggregations – whether in memory, in Memcached, in Cassandra, or some other store. Finally, Trident has first-class functions for querying sources of realtime state. That state could be updated by Trident (like in this example), or it could be an independent source of state.</p>
+<p>Trident provides a fully fledged batch processing API to process those small batches. The API is very similar to what you see in high level abstractions for Hadoop like Pig or Cascading: you can do group by’s, joins, aggregations, run functions, run filters, and so on. Of course, processing each small batch in isolation isn’t that interesting, so Trident provides functions for doing aggregations across batches and persistently storing those aggregations – whether in memory, in Memcached, in Cassandra, or some other store. Finally, Trident has first-class functions for querying sources of realtime state. That state could be updated by Trident (like in this example), or it could be an independent source of state.</p>
 
-<p>Back to the example, the spout emits a stream containing one field called &#8220;sentence&#8221;. The next line of the topology definition applies the Split function to each tuple in the stream, taking the &#8220;sentence&#8221; field and splitting it into words. Each sentence tuple creates potentially many word tuples – for instance, the sentence &#8220;the cow jumped over the moon&#8221; creates six &#8220;word&#8221; tuples. Here&#8217;s the definition of Split:</p>
+<p>Back to the example, the spout emits a stream containing one field called “sentence”. The next line of the topology definition applies the Split function to each tuple in the stream, taking the “sentence” field and splitting it into words. Each sentence tuple creates potentially many word tuples – for instance, the sentence “the cow jumped over the moon” creates six “word” tuples. Here’s the definition of Split:</p>
 
 <p><code>java
 public class Split extends BaseFunction {
@@ -124,9 +124,9 @@ public class Split extends BaseFunction 
 }
 </code></p>
 
-<p>As you can see, it&#8217;s really simple. It simply grabs the sentence, splits it on whitespace, and emits a tuple for each word.</p>
+<p>As you can see, it’s really simple. It grabs the sentence, splits it on whitespace, and emits a tuple for each word.</p>
 
-<p>The rest of the topology computes word count and keeps the results persistently stored. First the stream is grouped by the &#8220;word&#8221; field. Then, each group is persistently aggregated using the Count aggregator. The persistentAggregate function knows how to store and update the results of the aggregation in a source of state. In this example, the word counts are kept in memory, but this can be trivially swapped to use Memcached, Cassandra, or any other persistent store. Swapping this topology to store counts in Memcached is as simple as replacing the persistentAggregate line with this (using <a href="https://github.com/nathanmarz/trident-memcached">trident-memcached</a>), where the &#8220;serverLocations&#8221; is a list of host/ports for the Memcached cluster:</p>
+<p>The rest of the topology computes word count and keeps the results persistently stored. First the stream is grouped by the “word” field. Then, each group is persistently aggregated using the Count aggregator. The persistentAggregate function knows how to store and update the results of the aggregation in a source of state. In this example, the word counts are kept in memory, but this can be trivially swapped to use Memcached, Cassandra, or any other persistent store. Swapping this topology to store counts in Memcached is as simple as replacing the persistentAggregate line with this (using <a href="https://github.com/nathanmarz/trident-memcached">trident-memcached</a>), where “serverLocations” is a list of host/ports for the Memcached cluster:</p>
 
 <p><code>java
 .persistentAggregate(MemcachedState.transactional(serverLocations), new Count(), new Fields("count"))        
@@ -135,11 +135,11 @@ MemcachedState.transactional()
 
 <p>The values stored by persistentAggregate represent the aggregation of all batches ever emitted by the stream.</p>
 
-<p>One of the cool things about Trident is that it has fully fault-tolerant, exactly-once processing semantics. This makes it easy to reason about your realtime processing. Trident persists state in a way so that if failures occur and retries are necessary, it won&#8217;t perform multiple updates to the database for the same source data.</p>
+<p>One of the cool things about Trident is that it has fully fault-tolerant, exactly-once processing semantics. This makes it easy to reason about your realtime processing. Trident persists state in a way so that if failures occur and retries are necessary, it won’t perform multiple updates to the database for the same source data.</p>
 
 <p>The persistentAggregate method transforms a Stream into a TridentState object. In this case the TridentState object represents all the word counts. We will use this TridentState object to implement the distributed query portion of the computation.</p>
 
-<p>The next part of the topology implements a low latency distributed query on the word counts. The query takes as input a whitespace separated list of words and return the sum of the counts for those words. These queries are executed just like normal RPC calls, except they are parallelized in the background. Here&#8217;s an example of how you might invoke one of these queries:</p>
+<p>The next part of the topology implements a low latency distributed query on the word counts. The query takes as input a whitespace-separated list of words and returns the sum of the counts for those words. These queries are executed just like normal RPC calls, except they are parallelized in the background. Here’s an example of how you might invoke one of these queries:</p>
 
 <p><code>java
 DRPCClient client = new DRPCClient("drpc.server.location", 3772);
@@ -147,7 +147,7 @@ System.out.println(client.execute("words
 // prints the JSON-encoded result, e.g.: "[[5078]]"
 </code></p>
 
-<p>As you can see, it looks just like a regular remote procedure call (RPC), except it&#8217;s executing in parallel across a Storm cluster. The latency for small queries like this are typically around 10ms. More intense DRPC queries can take longer of course, although the latency largely depends on how many resources you have allocated for the computation.</p>
+<p>As you can see, it looks just like a regular remote procedure call (RPC), except it’s executing in parallel across a Storm cluster. The latency for small queries like this is typically around 10ms. More intense DRPC queries can take longer of course, although the latency largely depends on how many resources you have allocated for the computation.</p>
 
 <p>The implementation of the distributed query portion of the topology looks like this:</p>
 
@@ -160,22 +160,22 @@ topology.newDRPCStream("words")
        .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
 </code></p>
 
-<p>The same TridentTopology object is used to create the DRPC stream, and the function is named &#8220;words&#8221;. The function name corresponds to the function name given in the first argument of execute when using a DRPCClient.</p>
+<p>The same TridentTopology object is used to create the DRPC stream, and the function is named “words”. The function name corresponds to the function name given in the first argument of execute when using a DRPCClient.</p>
 
-<p>Each DRPC request is treated as its own little batch processing job that takes as input a single tuple representing the request. The tuple contains one field called &#8220;args&#8221; that contains the argument provided by the client. In this case, the argument is a whitespace separated list of words.</p>
+<p>Each DRPC request is treated as its own little batch processing job that takes as input a single tuple representing the request. The tuple contains one field called “args” holding the argument provided by the client. In this case, the argument is a whitespace-separated list of words.</p>
 
-<p>First, the Split function is used to split the arguments for the request into its constituent words. The stream is grouped by &#8220;word&#8221;, and the stateQuery operator is used to query the TridentState object that the first part of the topology generated. stateQuery takes in a source of state – in this case, the word counts computed by the other portion of the topology – and a function for querying that state. In this case, the MapGet function is invoked, which gets the count for each word. Since the DRPC stream is grouped the exact same way as the TridentState was (by the &#8220;word&#8221; field), each word query is routed to the exact partition of the TridentState object that manages updates for that word.</p>
+<p>First, the Split function is used to split the arguments for the request into its constituent words. The stream is grouped by “word”, and the stateQuery operator is used to query the TridentState object that the first part of the topology generated. stateQuery takes in a source of state – in this case, the word counts computed by the other portion of the topology – and a function for querying that state. In this case, the MapGet function is invoked, which gets the count for each word. Since the DRPC stream is grouped the exact same way as the TridentState was (by the “word” field), each word query is routed to the exact partition of the TridentState object that manages updates for that word.</p>
 
-<p>Next, words that didn&#8217;t have a count are filtered out via the FilterNull filter and the counts are summed using the Sum aggregator to get the result. Then, Trident automatically sends the result back to the waiting client.</p>
+<p>Next, words that didn’t have a count are filtered out via the FilterNull filter and the counts are summed using the Sum aggregator to get the result. Then, Trident automatically sends the result back to the waiting client.</p>
 
-<p>Trident is intelligent about how it executes a topology to maximize performance. There&#8217;s two interesting things happening automatically in this topology:</p>
+<p>Trident is intelligent about how it executes a topology to maximize performance. There are two interesting things happening automatically in this topology:</p>
 
 <ol>
-  <li>Operations that read from or write to state (like persistentAggregate and stateQuery) automatically batch operations to that state. So if there&#8217;s 20 updates that need to be made to the database for the current batch of processing, rather than do 20 read requests and 20 writes requests to the database, Trident will automatically batch up the reads and writes, doing only 1 read request and 1 write request (and in many cases, you can use caching in your State implementation to eliminate the read request). So you get the best of both words of convenience – being able to express your computation in terms of what should be done with each tuple – and performance.</li>
+  <li>Operations that read from or write to state (like persistentAggregate and stateQuery) automatically batch operations to that state. So if there are 20 updates that need to be made to the database for the current batch of processing, rather than do 20 read requests and 20 write requests to the database, Trident will automatically batch up the reads and writes, doing only 1 read request and 1 write request (and in many cases, you can use caching in your State implementation to eliminate the read request). So you get the best of both worlds of convenience – being able to express your computation in terms of what should be done with each tuple – and performance.</li>
  <li>Trident aggregators are heavily optimized. Rather than transfer all tuples for a group to the same machine and then run the aggregator, Trident will do partial aggregations when possible before sending tuples over the network. For example, the Count aggregator computes the count on each partition, sends the partial count over the network, and then sums together all the partial counts to get the total count. This technique is similar to the use of combiners in MapReduce (a sketch follows this list).</li>
 </ol>
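+
+<p>To make the second point concrete, here is a minimal sketch of a combiner-style count written against Trident’s CombinerAggregator interface (the built-in Count aggregator behaves along these lines):</p>
+
+<p><code>java
+public class Count implements CombinerAggregator<Long> {
+    // Called once per input tuple on each partition to produce a partial value.
+    public Long init(TridentTuple tuple) {
+        return 1L;
+    }
+
+    // Merges two partial values, both within a partition and after the
+    // partial counts have been sent over the network.
+    public Long combine(Long val1, Long val2) {
+        return val1 + val2;
+    }
+
+    // Identity value, used when a partition has no tuples for a group.
+    public Long zero() {
+        return 0L;
+    }
+}
+</code></p>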
 
-<p>Let&#8217;s look at another example of Trident.</p>
+<p>Let’s look at another example of Trident.</p>
 
 <h2 id="reach">Reach</h2>
 
@@ -189,26 +189,26 @@ TridentState urlToTweeters =
 TridentState tweetersToFollowers =
        topology.newStaticState(getTweeterToFollowersState());</p>
 
-<p>topology.newDRPCStream(&#8220;reach&#8221;)
-       .stateQuery(urlToTweeters, new Fields(&#8220;args&#8221;), new MapGet(), new Fields(&#8220;tweeters&#8221;))
-       .each(new Fields(&#8220;tweeters&#8221;), new ExpandList(), new Fields(&#8220;tweeter&#8221;))
+<p>topology.newDRPCStream("reach")
+       .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
+       .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
        .shuffle()
-       .stateQuery(tweetersToFollowers, new Fields(&#8220;tweeter&#8221;), new MapGet(), new Fields(&#8220;followers&#8221;))
+       .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
        .parallelismHint(200)
-       .each(new Fields(&#8220;followers&#8221;), new ExpandList(), new Fields(&#8220;follower&#8221;))
-       .groupBy(new Fields(&#8220;follower&#8221;))
-       .aggregate(new One(), new Fields(&#8220;one&#8221;))
+       .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
+       .groupBy(new Fields("follower"))
+       .aggregate(new One(), new Fields("one"))
        .parallelismHint(20)
-       .aggregate(new Count(), new Fields(&#8220;reach&#8221;));
+       .aggregate(new Count(), new Fields("reach"));
 ```</p>
 
 <p>The topology creates TridentState objects representing each external database using the newStaticState method. These can then be queried in the topology. Like all sources of state, queries to these databases will be automatically batched for maximum efficiency.</p>
 
-<p>The topology definition is straightforward – it&#8217;s just a simple batch processing job. First, the urlToTweeters database is queried to get the list of people who tweeted the URL for this request. That returns a list, so the ExpandList function is invoked to create a tuple for each tweeter.</p>
+<p>The topology definition is straightforward – it’s just a simple batch processing job. First, the urlToTweeters database is queried to get the list of people who tweeted the URL for this request. That returns a list, so the ExpandList function is invoked to create a tuple for each tweeter.</p>
 
-<p>Next, the followers for each tweeter must be fetched. It&#8217;s important that this step be parallelized, so shuffle is invoked to evenly distribute the tweeters among all workers for the topology. Then, the followers database is queried to get the list of followers for each tweeter. You can see that this portion of the topology is given a large parallelism since this is the most intense portion of the computation.</p>
+<p>Next, the followers for each tweeter must be fetched. It’s important that this step be parallelized, so shuffle is invoked to evenly distribute the tweeters among all workers for the topology. Then, the followers database is queried to get the list of followers for each tweeter. You can see that this portion of the topology is given a large parallelism since this is the most intense portion of the computation.</p>
 
-<p>Next, the set of followers is uniqued and counted. This is done in two steps. First a &#8220;group by&#8221; is done on the batch by &#8220;follower&#8221;, running the &#8220;One&#8221; aggregator on each group. The &#8220;One&#8221; aggregator simply emits a single tuple containing the number one for each group. Then, the ones are summed together to get the unique count of the followers set. Here&#8217;s the definition of the &#8220;One&#8221; aggregator:</p>
+<p>Next, the set of followers is uniqued and counted. This is done in two steps. First a “group by” is done on the batch by “follower”, running the “One” aggregator on each group. The “One” aggregator simply emits a single tuple containing the number one for each group. Then, the ones are summed together to get the unique count of the followers set. Here’s the definition of the “One” aggregator:</p>
 
 <p>```java
 public class One implements CombinerAggregator<Integer> {
@@ -226,15 +226,15 @@ public class One implements CombinerAggr
 }
 ```</p>
 
-<p>This is a &#8220;combiner aggregator&#8221;, which knows how to do partial aggregations before transferring tuples over the network to maximize efficiency. Sum is also defined as a combiner aggregator, so the global sum done at the end of the topology will be very efficient.</p>
+<p>This is a “combiner aggregator”, which knows how to do partial aggregations before transferring tuples over the network to maximize efficiency. Sum is also defined as a combiner aggregator, so the global sum done at the end of the topology will be very efficient.</p>
 
-<p>Let&#8217;s now look at Trident in more detail.</p>
+<p>Let’s now look at Trident in more detail.</p>
 
 <h2 id="fields-and-tuples">Fields and tuples</h2>
 
-<p>The Trident data model is the TridentTuple which is a named list of values. During a topology, tuples are incrementally built up through a sequence of operations. Operations generally take in a set of input fields and emit a set of &#8220;function fields&#8221;. The input fields are used to select a subset of the tuple as input to the operation, while the &#8220;function fields&#8221; name the fields the operation emits.</p>
+<p>The Trident data model is the TridentTuple, which is a named list of values. As a topology runs, tuples are incrementally built up through a sequence of operations. Operations generally take in a set of input fields and emit a set of “function fields”. The input fields are used to select a subset of the tuple as input to the operation, while the “function fields” name the fields the operation emits.</p>
 
-<p>Consider this example. Suppose you have a stream called &#8220;stream&#8221; that contains the fields &#8220;x&#8221;, &#8220;y&#8221;, and &#8220;z&#8221;. To run a filter MyFilter that takes in &#8220;y&#8221; as input, you would say:</p>
+<p>Consider this example. Suppose you have a stream called “stream” that contains the fields “x”, “y”, and “z”. To run a filter MyFilter that takes in “y” as input, you would say:</p>
 
 <p><code>java
 stream.each(new Fields("y"), new MyFilter())
@@ -250,9 +250,9 @@ public class MyFilter extends BaseFilter
 }
 </code></p>
 
-<p>This will keep all tuples whose &#8220;y&#8221; field is less than 10. The TridentTuple given as input to MyFilter will only contain the &#8220;y&#8221; field. Note that Trident is able to project a subset of a tuple extremely efficiently when selecting the input fields: the projection is essentially free.</p>
+<p>This will keep all tuples whose “y” field is less than 10. The TridentTuple given as input to MyFilter will only contain the “y” field. Note that Trident is able to project a subset of a tuple extremely efficiently when selecting the input fields: the projection is essentially free.</p>
 
-<p>Let&#8217;s now look at how &#8220;function fields&#8221; work. Suppose you had this function:</p>
+<p>Let’s now look at how “function fields” work. Suppose you had this function:</p>
 
 <p><code>java
 public class AddAndMultiply extends BaseFunction {
@@ -264,21 +264,21 @@ public class AddAndMultiply extends Base
 }
 </code></p>
 
-<p>This function takes two numbers as input and emits two new values: the addition of the numbers and the multiplication of the numbers. Suppose you had a stream with the fields &#8220;x&#8221;, &#8220;y&#8221;, and &#8220;z&#8221;. You would use this function like this:</p>
+<p>This function takes two numbers as input and emits two new values: the addition of the numbers and the multiplication of the numbers. Suppose you had a stream with the fields “x”, “y”, and “z”. You would use this function like this:</p>
 
 <p><code>java
 stream.each(new Fields("x", "y"), new AddAndMultiply(), new Fields("added", "multiplied"));
 </code></p>
 
-<p>The output of functions is additive: the fields are added to the input tuple. So the output of this each call would contain tuples with the five fields &#8220;x&#8221;, &#8220;y&#8221;, &#8220;z&#8221;, &#8220;added&#8221;, and &#8220;multiplied&#8221;. &#8220;added&#8221; corresponds to the first value emitted by AddAndMultiply, while &#8220;multiplied&#8221; corresponds to the second value.</p>
+<p>The output of functions is additive: the fields are added to the input tuple. So the output of this each call would contain tuples with the five fields “x”, “y”, “z”, “added”, and “multiplied”. “added” corresponds to the first value emitted by AddAndMultiply, while “multiplied” corresponds to the second value.</p>
 
-<p>With aggregators, on the other hand, the function fields replace the input tuples. So if you had a stream containing the fields &#8220;val1&#8221; and &#8220;val2&#8221;, and you did this:</p>
+<p>With aggregators, on the other hand, the function fields replace the input tuples. So if you had a stream containing the fields “val1” and “val2”, and you did this:</p>
 
 <p><code>java
 stream.aggregate(new Fields("val2"), new Sum(), new Fields("sum"))
 </code></p>
 
-<p>The output stream would only contain a single tuple with a single field called &#8220;sum&#8221;, representing the sum of all &#8220;val2&#8221; fields in that batch.</p>
+<p>The output stream would only contain a single tuple with a single field called “sum”, representing the sum of all “val2” fields in that batch.</p>
 
 <p>With grouped streams, the output will contain the grouping fields followed by the fields emitted by the aggregator. For example:</p>
 
@@ -287,26 +287,26 @@ stream.groupBy(new Fields("val1"))
      .aggregate(new Fields("val2"), new Sum(), new Fields("sum"))
 </code></p>
 
-<p>In this example, the output will contain the fields &#8220;val1&#8221; and &#8220;sum&#8221;.</p>
+<p>In this example, the output will contain the fields “val1” and “sum”.</p>
 
 <h2 id="state">State</h2>
 
-<p>A key problem to solve with realtime computation is how to manage state so that updates are idempotent in the face of failures and retries. It&#8217;s impossible to eliminate failures, so when a node dies or something else goes wrong, batches need to be retried. The question is – how do you do state updates (whether external databases or state internal to the topology) so that it&#8217;s like each message was only processed only once?</p>
+<p>A key problem to solve with realtime computation is how to manage state so that updates are idempotent in the face of failures and retries. It’s impossible to eliminate failures, so when a node dies or something else goes wrong, batches need to be retried. The question is – how do you do state updates (whether external databases or state internal to the topology) so that it’s as if each message were processed only once?</p>
 
-<p>This is a tricky problem, and can be illustrated with the following example. Suppose that you&#8217;re doing a count aggregation of your stream and want to store the running count in a database. If you store only the count in the database and it&#8217;s time to apply a state update for a batch, there&#8217;s no way to know if you applied that state update before. The batch could have been attempted before, succeeded in updating the database, and then failed at a later step. Or the batch could have been attempted before and failed to update the database. You just don&#8217;t know.</p>
+<p>This is a tricky problem, and can be illustrated with the following example. Suppose that you’re doing a count aggregation of your stream and want to store the running count in a database. If you store only the count in the database and it’s time to apply a state update for a batch, there’s no way to know if you applied that state update before. The batch could have been attempted before, succeeded in updating the database, and then failed at a later step. Or the batch could have been attempted before and failed to update the database. You just don’t know.</p>
 
 <p>Trident solves this problem by doing two things:</p>
 
 <ol>
-  <li>Each batch is given a unique id called the &#8220;transaction id&#8221;. If a batch is retried it will have the exact same transaction id.</li>
-  <li>State updates are ordered among batches. That is, the state updates for batch 3 won&#8217;t be applied until the state updates for batch 2 have succeeded.</li>
+  <li>Each batch is given a unique id called the “transaction id”. If a batch is retried it will have the exact same transaction id.</li>
+  <li>State updates are ordered among batches. That is, the state updates for batch 3 won’t be applied until the state updates for batch 2 have succeeded.</li>
 </ol>
 
-<p>With these two primitives, you can achieve exactly-once semantics with your state updates. Rather than store just the count in the database, what you can do instead is store the transaction id with the count in the database as an atomic value. Then, when updating the count, you can just compare the transaction id in the database with the transaction id for the current batch. If they&#8217;re the same, you skip the update – because of the strong ordering, you know for sure that the value in the database incorporates the current batch. If they&#8217;re different, you increment the count.</p>
+<p>With these two primitives, you can achieve exactly-once semantics with your state updates. Rather than store just the count in the database, what you can do instead is store the transaction id with the count in the database as an atomic value. Then, when updating the count, you can just compare the transaction id in the database with the transaction id for the current batch. If they’re the same, you skip the update – because of the strong ordering, you know for sure that the value in the database incorporates the current batch. If they’re different, you increment the count.</p>
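+
+<p>As a rough illustration of that read-compare-update logic (the database handle and stored value class here are hypothetical placeholders, not Trident APIs – and as described next, Trident’s State abstraction does this for you):</p>
+
+<p><code>java
+class TxCount {
+    long txid;   // transaction id of the last batch applied
+    long count;  // running count that includes that batch
+}
+
+void applyBatch(Database db, String key, long batchTxid, long batchCount) {
+    TxCount stored = db.get(key);  // read the atomic (txid, count) pair
+    if (stored != null && stored.txid == batchTxid) {
+        // Same txid: strong ordering of batch commits means the stored value
+        // already incorporates this batch, so skip the update.
+        return;
+    }
+    TxCount updated = new TxCount();
+    updated.count = (stored == null ? 0 : stored.count) + batchCount;
+    updated.txid = batchTxid;
+    db.put(key, updated);  // write count and txid together as one atomic value
+}
+</code></p>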
 
-<p>Of course, you don&#8217;t have to do this logic manually in your topologies. This logic is wrapped by the State abstraction and done automatically. Nor is your State object required to implement the transaction id trick: if you don&#8217;t want to pay the cost of storing the transaction id in the database, you don&#8217;t have to. In that case the State will have at-least-once-processing semantics in the case of failures (which may be fine for your application). You can read more about how to implement a State and the various fault-tolerance tradeoffs possible <a href="/documentation/Trident-state">in this doc</a>.</p>
+<p>Of course, you don’t have to do this logic manually in your topologies. This logic is wrapped by the State abstraction and done automatically. Nor is your State object required to implement the transaction id trick: if you don’t want to pay the cost of storing the transaction id in the database, you don’t have to. In that case the State will have at-least-once-processing semantics in the case of failures (which may be fine for your application). You can read more about how to implement a State and the various fault-tolerance tradeoffs possible <a href="/documentation/Trident-state">in this doc</a>.</p>
 
-<p>A State is allowed to use whatever strategy it wants to store state. So it could store state in an external database or it could keep the state in-memory but backed by HDFS (like how HBase works). State&#8217;s are not required to hold onto state forever. For example, you could have an in-memory State implementation that only keeps the last X hours of data available and drops anything older. Take a look at the implementation of the <a href="https://github.com/nathanmarz/trident-memcached/blob/master/src/jvm/trident/memcached/MemcachedState.java">Memcached integration</a> for an example State implementation.</p>
+<p>A State is allowed to use whatever strategy it wants to store state. So it could store state in an external database or it could keep the state in-memory but backed by HDFS (like how HBase works). States are not required to hold onto state forever. For example, you could have an in-memory State implementation that only keeps the last X hours of data available and drops anything older. Take a look at the implementation of the <a href="https://github.com/nathanmarz/trident-memcached/blob/master/src/jvm/trident/memcached/MemcachedState.java">Memcached integration</a> for an example State implementation.</p>
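+
+<p>For reference, the State interface itself is tiny – roughly the shape sketched below – and implementations are free to do whatever they want between the commit hooks (see the linked Memcached integration for a complete example):</p>
+
+<p><code>java
+public interface State {
+    void beginCommit(Long txid); // called before the state updates for a batch begin
+    void commit(Long txid);      // called once all updates for that batch have succeeded
+}
+</code></p>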
 
 <h2 id="execution-of-trident-topologies">Execution of Trident topologies</h2>
 
@@ -320,7 +320,7 @@ stream.groupBy(new Fields("val1"))
 
 <h2 id="conclusion">Conclusion</h2>
 
-<p>Trident makes realtime computation elegant. You&#8217;ve seen how high throughput stream processing, state manipulation, and low-latency querying can be seamlessly intermixed via Trident&#8217;s API. Trident lets you express your realtime computations in a natural way while still getting maximal performance.</p>
+<p>Trident makes realtime computation elegant. You’ve seen how high throughput stream processing, state manipulation, and low-latency querying can be seamlessly intermixed via Trident’s API. Trident lets you express your realtime computations in a natural way while still getting maximal performance.</p>
 
 </div>
 </div>

Modified: incubator/storm/site/publish/documentation/Troubleshooting.html
URL: http://svn.apache.org/viewvc/incubator/storm/site/publish/documentation/Troubleshooting.html?rev=1597847&r1=1597846&r2=1597847&view=diff
==============================================================================
--- incubator/storm/site/publish/documentation/Troubleshooting.html (original)
+++ incubator/storm/site/publish/documentation/Troubleshooting.html Tue May 27 18:39:07 2014
@@ -80,7 +80,7 @@
 <p>Solutions:</p>
 
 <ul>
-  <li>You may have a misconfigured subnet, where nodes can&#8217;t locate other nodes based on their hostname. ZeroMQ sometimes crashes the process when it can&#8217;t resolve a host. There are two solutions:</li>
+  <li>You may have a misconfigured subnet, where nodes can’t locate other nodes based on their hostname. ZeroMQ sometimes crashes the process when it can’t resolve a host. There are two solutions:</li>
   <li>Make a mapping from hostname to IP address in /etc/hosts</li>
   <li>Set up an internal DNS so that nodes can locate each other based on hostname.</li>
 </ul>
@@ -97,7 +97,7 @@
 <p>Solutions:</p>
 
 <ul>
-  <li>Storm doesn&#8217;t work with ipv6. You can force ipv4 by adding <code>-Djava.net.preferIPv4Stack=true</code> to the supervisor child options and restarting the supervisor. </li>
+  <li>Storm doesn’t work with IPv6. You can force IPv4 by adding <code>-Djava.net.preferIPv4Stack=true</code> to the supervisor child options (see the snippet after this list) and restarting the supervisor.</li>
   <li>You may have a misconfigured subnet. See the solutions for <code>Worker processes are crashing on startup with no stack trace</code></li>
 </ul>
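+
+<p>For example, in storm.yaml on the supervisor nodes (preserving whatever value <code>supervisor.childopts</code> already has, if any):</p>
+
+<p><code>
+supervisor.childopts: "-Djava.net.preferIPv4Stack=true"
+</code></p>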
 
@@ -131,32 +131,32 @@
   <li>Try deleting the local dirs for the supervisors and restarting the daemons. Supervisors create a unique id for themselves and store it locally. When that id is copied to other nodes, Storm gets confused. </li>
 </ul>
 
-<h3 id="multiple-defaultsyaml-found-error">&#8220;Multiple defaults.yaml found&#8221; error</h3>
+<h3 id="multiple-defaultsyaml-found-error">“Multiple defaults.yaml found” error</h3>
 
 <p>Symptoms:</p>
 
 <ul>
-  <li>When deploying a topology with &#8220;storm jar&#8221;, you get this error</li>
+  <li>When deploying a topology with “storm jar”, you get this error</li>
 </ul>
 
 <p>Solution:</p>
 
 <ul>
-  <li>You&#8217;re most likely including the Storm jars inside your topology jar. When packaging your topology jar, don&#8217;t include the Storm jars as Storm will put those on the classpath for you.</li>
+  <li>You’re most likely including the Storm jars inside your topology jar. When packaging your topology jar, don’t include the Storm jars as Storm will put those on the classpath for you.</li>
 </ul>
 
-<h3 id="nosuchmethoderror-when-running-storm-jar">&#8220;NoSuchMethodError&#8221; when running storm jar</h3>
+<h3 id="nosuchmethoderror-when-running-storm-jar">“NoSuchMethodError” when running storm jar</h3>
 
 <p>Symptoms:</p>
 
 <ul>
-  <li>When running storm jar, you get a cryptic &#8220;NoSuchMethodError&#8221;</li>
+  <li>When running storm jar, you get a cryptic “NoSuchMethodError”</li>
 </ul>
 
 <p>Solution:</p>
 
 <ul>
-  <li>You&#8217;re deploying your topology with a different version of Storm than you built your topology against. Make sure the storm client you use comes from the same version as the version you compiled your topology against.</li>
+  <li>You’re deploying your topology with a different version of Storm than you built your topology against. Make sure the storm client you use comes from the same version as the version you compiled your topology against.</li>
 </ul>
 
 <h3 id="kryo-concurrentmodificationexception">Kryo ConcurrentModificationException</h3>
@@ -200,7 +200,7 @@ Caused by: java.util.ConcurrentModificat
 <p>Solution: </p>
 
 <ul>
-  <li>This means that you&#8217;re emitting a mutable object as an output tuple. Everything you emit into the output collector must be immutable. What&#8217;s happening is that your bolt is modifying the object while it is being serialized to be sent over the network.</li>
+  <li>This means that you’re emitting a mutable object as an output tuple. Everything you emit into the output collector must be immutable. What’s happening is that your bolt is modifying the object while it is being serialized to be sent over the network.</li>
 </ul>
 
 <h3 id="nullpointerexception-from-deep-inside-storm">NullPointerException from deep inside Storm</h3>
@@ -234,7 +234,7 @@ Caused by: java.lang.NullPointerExceptio
 <p>Solution:</p>
 
 <ul>
-  <li>This is caused by having multiple threads issue methods on the <code>OutputCollector</code>. All emits, acks, and fails must happen on the same thread. One subtle way this can happen is if you make a <code>IBasicBolt</code> that emits on a separate thread. <code>IBasicBolt</code>&#8217;s automatically ack after execute is called, so this would cause multiple threads to use the <code>OutputCollector</code> leading to this exception. When using a basic bolt, all emits must happen in the same thread that runs <code>execute</code>.</li>
+  <li>This is caused by having multiple threads issue methods on the <code>OutputCollector</code>. All emits, acks, and fails must happen on the same thread. One subtle way this can happen is if you make an <code>IBasicBolt</code> that emits on a separate thread. <code>IBasicBolt</code>s automatically ack after execute is called, so this would cause multiple threads to use the <code>OutputCollector</code>, leading to this exception. When using a basic bolt, all emits must happen in the same thread that runs <code>execute</code>.</li>
 </ul>
 
 </div>

Modified: incubator/storm/site/publish/documentation/Tutorial.html
URL: http://svn.apache.org/viewvc/incubator/storm/site/publish/documentation/Tutorial.html?rev=1597847&r1=1597846&r2=1597847&view=diff
==============================================================================
--- incubator/storm/site/publish/documentation/Tutorial.html (original)
+++ incubator/storm/site/publish/documentation/Tutorial.html Tue May 27 18:39:07 2014
@@ -65,27 +65,27 @@
   </ul>
 </div>
 <div id="aboutcontent">
-<p>In this tutorial, you&#8217;ll learn how to create Storm topologies and deploy them to a Storm cluster. Java will be the main language used, but a few examples will use Python to illustrate Storm&#8217;s multi-language capabilities.</p>
+<p>In this tutorial, you’ll learn how to create Storm topologies and deploy them to a Storm cluster. Java will be the main language used, but a few examples will use Python to illustrate Storm’s multi-language capabilities.</p>
 
 <h2 id="preliminaries">Preliminaries</h2>
 
-<p>This tutorial uses examples from the <a href="http://github.com/nathanmarz/storm-starter">storm-starter</a> project. It&#8217;s recommended that you clone the project and follow along with the examples. Read <a href="Setting-up-development-environment.html">Setting up a development environment</a> and <a href="Creating-a-new-Storm-project.html">Creating a new Storm project</a> to get your machine set up. </p>
+<p>This tutorial uses examples from the <a href="http://github.com/nathanmarz/storm-starter">storm-starter</a> project. It’s recommended that you clone the project and follow along with the examples. Read <a href="Setting-up-development-environment.html">Setting up a development environment</a> and <a href="Creating-a-new-Storm-project.html">Creating a new Storm project</a> to get your machine set up. </p>
 
 <h2 id="components-of-a-storm-cluster">Components of a Storm cluster</h2>
 
-<p>A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run &#8220;MapReduce jobs&#8221;, on Storm you run &#8220;topologies&#8221;. &#8220;Jobs&#8221; and &#8220;topologies&#8221; themselves are very different &#8211; one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).</p>
+<p>A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run “MapReduce jobs”, on Storm you run “topologies”. “Jobs” and “topologies” themselves are very different – one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).</p>
 
-<p>There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called &#8220;Nimbus&#8221; that is similar to Hadoop&#8217;s &#8220;JobTracker&#8221;. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.</p>
+<p>There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called “Nimbus” that is similar to Hadoop’s “JobTracker”. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.</p>
 
-<p>Each worker node runs a daemon called the &#8220;Supervisor&#8221;. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.</p>
+<p>Each worker node runs a daemon called the “Supervisor”. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.</p>
 
 <p><img src="images/storm-cluster.png" alt="Storm cluster" /></p>
 
-<p>All coordination between Nimbus and the Supervisors is done through a <a href="http://zookeeper.apache.org/">Zookeeper</a> cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk. This means you can kill -9 Nimbus or the Supervisors and they&#8217;ll start back up like nothing happened. This design leads to Storm clusters being incredibly stable.</p>
+<p>All coordination between Nimbus and the Supervisors is done through a <a href="http://zookeeper.apache.org/">Zookeeper</a> cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk. This means you can kill -9 Nimbus or the Supervisors and they’ll start back up like nothing happened. This design leads to Storm clusters being incredibly stable.</p>
 
 <h2 id="topologies">Topologies</h2>
 
-<p>To do realtime computation on Storm, you create what are called &#8220;topologies&#8221;. A topology is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes.</p>
+<p>To do realtime computation on Storm, you create what are called “topologies”. A topology is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes.</p>
 
 <p>Running a topology is straightforward. First, you package all your code and dependencies into a single jar. Then, you run a command like the following:</p>
 
@@ -99,19 +99,19 @@ storm jar all-my-code.jar backtype.storm
 
 <h2 id="streams">Streams</h2>
 
-<p>The core abstraction in Storm is the &#8220;stream&#8221;. A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you may transform a stream of tweets into a stream of trending topics.</p>
+<p>The core abstraction in Storm is the “stream”. A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you may transform a stream of tweets into a stream of trending topics.</p>
 
-<p>The basic primitives Storm provides for doing stream transformations are &#8220;spouts&#8221; and &#8220;bolts&#8221;. Spouts and bolts have interfaces that you implement to run your application-specific logic.</p>
+<p>The basic primitives Storm provides for doing stream transformations are “spouts” and “bolts”. Spouts and bolts have interfaces that you implement to run your application-specific logic.</p>
 
 <p>A spout is a source of streams. For example, a spout may read tuples off of a <a href="http://github.com/nathanmarz/storm-kestrel">Kestrel</a> queue and emit them as a stream. Or a spout may connect to the Twitter API and emit a stream of tweets.</p>
 
 <p>A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Complex stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. Bolts can run functions, filter tuples, do streaming aggregations, do streaming joins, talk to databases, and more.</p>
 
-<p>Networks of spouts and bolts are packaged into a &#8220;topology&#8221; which is the top-level abstraction that you submit to Storm clusters for execution. A topology is a graph of stream transformations where each node is a spout or bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream.</p>
+<p>Networks of spouts and bolts are packaged into a “topology” which is the top-level abstraction that you submit to Storm clusters for execution. A topology is a graph of stream transformations where each node is a spout or bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream.</p>
 
 <p><img src="images/topology.png" alt="A Storm topology" /></p>
 
-<p>Links between nodes in your topology indicate how tuples should be passed around. For example, if there is a link between Spout A and Bolt B, a link from Spout A to Bolt C, and a link from Bolt B to Bolt C, then everytime Spout A emits a tuple, it will send the tuple to both Bolt B and Bolt C. All of Bolt B&#8217;s output tuples will go to Bolt C as well.</p>
+<p>Links between nodes in your topology indicate how tuples should be passed around. For example, if there is a link between Spout A and Bolt B, a link from Spout A to Bolt C, and a link from Bolt B to Bolt C, then every time Spout A emits a tuple, it will send the tuple to both Bolt B and Bolt C. All of Bolt B’s output tuples will go to Bolt C as well.</p>
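+
+<p>Wired up with the TopologyBuilder API covered later in this tutorial, that example would look roughly like this (the spout and bolt classes are placeholders):</p>
+
+<p><code>java
+TopologyBuilder builder = new TopologyBuilder();
+builder.setSpout("spout-a", new SpoutA());
+// Bolt B subscribes to Spout A's stream
+builder.setBolt("bolt-b", new BoltB()).shuffleGrouping("spout-a");
+// Bolt C subscribes to both Spout A's and Bolt B's streams
+builder.setBolt("bolt-c", new BoltC())
+       .shuffleGrouping("spout-a")
+       .shuffleGrouping("bolt-b");
+</code></p>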
 
 <p>Each node in a Storm topology executes in parallel. In your topology, you can specify how much parallelism you want for each node, and then Storm will spawn that number of threads across the cluster to do the execution.</p>
 
@@ -121,7 +121,7 @@ storm jar all-my-code.jar backtype.storm
 
 <p>Storm uses tuples as its data model. A tuple is a named list of values, and a field in a tuple can be an object of any type. Out of the box, Storm supports all the primitive types, strings, and byte arrays as tuple field values. To use an object of another type, you just need to implement <a href="Serialization.html">a serializer</a> for the type.</p>
 
-<p>Every node in a topology must declare the output fields for the tuples it emits. For example, this bolt declares that it emits 2-tuples with the fields &#8220;double&#8221; and &#8220;triple&#8221;:</p>
+<p>Every node in a topology must declare the output fields for the tuples it emits. For example, this bolt declares that it emits 2-tuples with the fields “double” and “triple”:</p>
 
 <p>```java
 public class DoubleAndTripleBolt extends BaseRichBolt {
@@ -149,7 +149,7 @@ public void declareOutputFields(OutputFi
 
 <h2 id="a-simple-topology">A simple topology</h2>
 
-<p>Let&#8217;s take a look at a simple topology to explore the concepts more and see how the code shapes up. Let&#8217;s look at the <code>ExclamationTopology</code> definition from storm-starter:</p>
+<p>Let’s take a look at a simple topology to explore the concepts more and see how the code shapes up. Let’s look at the <code>ExclamationTopology</code> definition from storm-starter:</p>
 
 <p><code>java
 TopologyBuilder builder = new TopologyBuilder();        
@@ -160,17 +160,17 @@ builder.setBolt("exclaim2", new Exclamat
         .shuffleGrouping("exclaim1");
 </code></p>
 
-<p>This topology contains a spout and two bolts. The spout emits words, and each bolt appends the string &#8220;!!!&#8221; to its input. The nodes are arranged in a line: the spout emits to the first bolt which then emits to the second bolt. If the spout emits the tuples [&#8220;bob&#8221;] and [&#8220;john&#8221;], then the second bolt will emit the words [&#8220;bob!!!!!!&#8221;] and [&#8220;john!!!!!!&#8221;].</p>
+<p>This topology contains a spout and two bolts. The spout emits words, and each bolt appends the string “!!!” to its input. The nodes are arranged in a line: the spout emits to the first bolt which then emits to the second bolt. If the spout emits the tuples [“bob”] and [“john”], then the second bolt will emit the words [“bob!!!!!!”] and [“john!!!!!!”].</p>
 
-<p>This code defines the nodes using the <code>setSpout</code> and <code>setBolt</code> methods. These methods take as input a user-specified id, an object containing the processing logic, and the amount of parallelism you want for the node. In this example, the spout is given id &#8220;words&#8221; and the bolts are given ids &#8220;exclaim1&#8221; and &#8220;exclaim2&#8221;. </p>
+<p>This code defines the nodes using the <code>setSpout</code> and <code>setBolt</code> methods. These methods take as input a user-specified id, an object containing the processing logic, and the amount of parallelism you want for the node. In this example, the spout is given id “words” and the bolts are given ids “exclaim1” and “exclaim2”. </p>
 
 <p>The object containing the processing logic implements the <a href="/apidocs/backtype/storm/topology/IRichSpout.html">IRichSpout</a> interface for spouts and the <a href="/apidocs/backtype/storm/topology/IRichBolt.html">IRichBolt</a> interface for bolts. </p>
 
 <p>The last parameter, how much parallelism you want for the node, is optional. It indicates how many threads should execute that component across the cluster. If you omit it, Storm will only allocate one thread for that node.</p>
 
-<p><code>setBolt</code> returns an <a href="/apidocs/backtype/storm/topology/InputDeclarer.html">InputDeclarer</a> object that is used to define the inputs to the Bolt. Here, component &#8220;exclaim1&#8221; declares that it wants to read all the tuples emitted by component &#8220;words&#8221; using a shuffle grouping, and component &#8220;exclaim2&#8221; declares that it wants to read all the tuples emitted by component &#8220;exclaim1&#8221; using a shuffle grouping. &#8220;shuffle grouping&#8221; means that tuples should be randomly distributed from the input tasks to the bolt&#8217;s tasks. There are many ways to group data between components. These will be explained in a few sections.</p>
+<p><code>setBolt</code> returns an <a href="/apidocs/backtype/storm/topology/InputDeclarer.html">InputDeclarer</a> object that is used to define the inputs to the Bolt. Here, component “exclaim1” declares that it wants to read all the tuples emitted by component “words” using a shuffle grouping, and component “exclaim2” declares that it wants to read all the tuples emitted by component “exclaim1” using a shuffle grouping. “shuffle grouping” means that tuples should be randomly distributed from the input tasks to the bolt’s tasks. There are many ways to group data between components. These will be explained in a few sections.</p>
 
-<p>If you wanted component &#8220;exclaim2&#8221; to read all the tuples emitted by both component &#8220;words&#8221; and component &#8220;exclaim1&#8221;, you would write component &#8220;exclaim2&#8221;&#8217;s definition like this:</p>
+<p>If you wanted component “exclaim2” to read all the tuples emitted by both component “words” and component “exclaim1”, you would write component “exclaim2”’s definition like this:</p>
 
 <p><code>java
 builder.setBolt("exclaim2", new ExclamationBolt(), 5)
@@ -180,7 +180,7 @@ builder.setBolt("exclaim2", new Exclamat
 
 <p>As you can see, input declarations can be chained to specify multiple sources for the Bolt.</p>
 
-<p>Let&#8217;s dig into the implementations of the spouts and bolts in this topology. Spouts are responsible for emitting new messages into the topology. <code>TestWordSpout</code> in this topology emits a random word from the list [&#8220;nathan&#8221;, &#8220;mike&#8221;, &#8220;jackson&#8221;, &#8220;golda&#8221;, &#8220;bertels&#8221;] as a 1-tuple every 100ms. The implementation of <code>nextTuple()</code> in TestWordSpout looks like this:</p>
+<p>Let’s dig into the implementations of the spouts and bolts in this topology. Spouts are responsible for emitting new messages into the topology. <code>TestWordSpout</code> in this topology emits a random word from the list [“nathan”, “mike”, “jackson”, “golda”, “bertels”] as a 1-tuple every 100ms. The implementation of <code>nextTuple()</code> in TestWordSpout looks like this:</p>
 
 <p><code>java
 public void nextTuple() {
@@ -194,7 +194,7 @@ public void nextTuple() {
 
 <p>As you can see, the implementation is very straightforward.</p>
 
-<p><code>ExclamationBolt</code> appends the string &#8220;!!!&#8221; to its input. Let&#8217;s take a look at the full implementation for <code>ExclamationBolt</code>:</p>
+<p><code>ExclamationBolt</code> appends the string “!!!” to its input. Let’s take a look at the full implementation for <code>ExclamationBolt</code>:</p>
 
 <p>```java
 public static class ExclamationBolt implements IRichBolt {
@@ -221,15 +221,15 @@ public Map getComponentConfiguration() {
 } } ```
 </code></pre>
 
-<p>The <code>prepare</code> method provides the bolt with an <code>OutputCollector</code> that is used for emitting tuples from this bolt. Tuples can be emitted at anytime from the bolt &#8211; in the <code>prepare</code>, <code>execute</code>, or <code>cleanup</code> methods, or even asynchronously in another thread. This <code>prepare</code> implementation simply saves the <code>OutputCollector</code> as an instance variable to be used later on in the <code>execute</code> method.</p>
+<p>The <code>prepare</code> method provides the bolt with an <code>OutputCollector</code> that is used for emitting tuples from this bolt. Tuples can be emitted at any time from the bolt – in the <code>prepare</code>, <code>execute</code>, or <code>cleanup</code> methods, or even asynchronously in another thread. This <code>prepare</code> implementation simply saves the <code>OutputCollector</code> as an instance variable to be used later in the <code>execute</code> method.</p>
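 
 <p>That implementation amounts to something like this sketch (the field name <code>_collector</code> is illustrative):</p>
 
 <p><code>java
 private OutputCollector _collector;
 
 public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
     // Save the collector so execute() can emit and ack tuples later.
     _collector = collector;
 }
 </code></p>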
 
-<p>The <code>execute</code> method receives a tuple from one of the bolt&#8217;s inputs. The <code>ExclamationBolt</code> grabs the first field from the tuple and emits a new tuple with the string &#8220;!!!&#8221; appended to it. If you implement a bolt that subscribes to multiple input sources, you can find out which component the <a href="/apidocs/backtype/storm/tuple/Tuple.html">Tuple</a> came from by using the <code>Tuple#getSourceComponent</code> method.</p>
+<p>The <code>execute</code> method receives a tuple from one of the bolt’s inputs. The <code>ExclamationBolt</code> grabs the first field from the tuple and emits a new tuple with the string “!!!” appended to it. If you implement a bolt that subscribes to multiple input sources, you can find out which component the <a href="/apidocs/backtype/storm/tuple/Tuple.html">Tuple</a> came from by using the <code>Tuple#getSourceComponent</code> method.</p>
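 
 <p>The <code>execute</code> body being described looks roughly like this:</p>
 
 <p><code>java
 public void execute(Tuple tuple) {
     // Emit a new tuple, anchored to the input, with "!!!" appended.
     _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
     // Ack the input so Storm knows it has been fully processed.
     _collector.ack(tuple);
 }
 </code></p>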
 
-<p>There&#8217;s a few other things going in in the <code>execute</code> method, namely that the input tuple is passed as the first argument to <code>emit</code> and the input tuple is acked on the final line. These are part of Storm&#8217;s reliability API for guaranteeing no data loss and will be explained later in this tutorial. </p>
+<p>There are a few other things going on in the <code>execute</code> method, namely that the input tuple is passed as the first argument to <code>emit</code> and the input tuple is acked on the final line. These are part of Storm’s reliability API for guaranteeing no data loss and will be explained later in this tutorial. </p>
 
-<p>The <code>cleanup</code> method is called when a Bolt is being shutdown and should cleanup any resources that were opened. There&#8217;s no guarantee that this method will be called on the cluster: for example, if the machine the task is running on blows up, there&#8217;s no way to invoke the method. The <code>cleanup</code> method is intended for when you run topologies in <a href="Local-mode.html">local mode</a> (where a Storm cluster is simulated in process), and you want to be able to run and kill many topologies without suffering any resource leaks.</p>
+<p>The <code>cleanup</code> method is called when a Bolt is being shut down and should clean up any resources that were opened. There’s no guarantee that this method will be called on the cluster: for example, if the machine the task is running on blows up, there’s no way to invoke the method. The <code>cleanup</code> method is intended for when you run topologies in <a href="Local-mode.html">local mode</a> (where a Storm cluster is simulated in process), and you want to be able to run and kill many topologies without suffering any resource leaks.</p>
 
-<p>The <code>declareOutputFields</code> method declares that the <code>ExclamationBolt</code> emits 1-tuples with one field called &#8220;word&#8221;.</p>
+<p>The <code>declareOutputFields</code> method declares that the <code>ExclamationBolt</code> emits 1-tuples with one field called “word”.</p>
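 
 <p>In code, that declaration is a one-liner:</p>
 
 <p><code>java
 public void declareOutputFields(OutputFieldsDeclarer declarer) {
     declarer.declare(new Fields("word"));
 }
 </code></p>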
 
 <p>The <code>getComponentConfiguration</code> method allows you to configure various aspects of how this component runs. This is a more advanced topic that is explained further on <a href="Configuration.html">Configuration</a>.</p>
 
@@ -255,13 +255,13 @@ public void declareOutputFields(OutputFi
 
 <h2 id="running-exclamationtopology-in-local-mode">Running ExclamationTopology in local mode</h2>
 
-<p>Let&#8217;s see how to run the <code>ExclamationTopology</code> in local mode and see that it&#8217;s working.</p>
+<p>Let’s see how to run the <code>ExclamationTopology</code> in local mode and verify that it’s working.</p>
 
-<p>Storm has two modes of operation: local mode and distributed mode. In local mode, Storm executes completely in process by simulating worker nodes with threads. Local mode is useful for testing and development of topologies. When you run the topologies in storm-starter, they&#8217;ll run in local mode and you&#8217;ll be able to see what messages each component is emitting. You can read more about running topologies in local mode on <a href="Local-mode.html">Local mode</a>.</p>
+<p>Storm has two modes of operation: local mode and distributed mode. In local mode, Storm executes completely in process by simulating worker nodes with threads. Local mode is useful for testing and development of topologies. When you run the topologies in storm-starter, they’ll run in local mode and you’ll be able to see what messages each component is emitting. You can read more about running topologies in local mode on <a href="Local-mode.html">Local mode</a>.</p>
 
 <p>In distributed mode, Storm operates as a cluster of machines. When you submit a topology to the master, you also submit all the code necessary to run the topology. The master will take care of distributing your code and allocating workers to run your topology. If workers go down, the master will reassign them somewhere else. You can read more about running topologies on a cluster on <a href="Running-topologies-on-a-production-cluster.html">Running topologies on a production cluster</a>. </p>
 
-<p>Here&#8217;s the code that runs <code>ExclamationTopology</code> in local mode:</p>
+<p>Here’s the code that runs <code>ExclamationTopology</code> in local mode:</p>
 
 <p>```java
 Config conf = new Config();
@@ -269,9 +269,9 @@ conf.setDebug(true);
 conf.setNumWorkers(2);</p>
 
 <p>LocalCluster cluster = new LocalCluster();
-cluster.submitTopology(&#8220;test&#8221;, conf, builder.createTopology());
+cluster.submitTopology("test", conf, builder.createTopology());
 Utils.sleep(10000);
-cluster.killTopology(&#8220;test&#8221;);
+cluster.killTopology("test");
 cluster.shutdown();
 ```</p>
 
@@ -286,7 +286,7 @@ cluster.shutdown();
   <li><strong>TOPOLOGY_DEBUG</strong> (set with <code>setDebug</code>), when set to true, tells Storm to log every message emitted by a component. This is useful in local mode when testing topologies, but you probably want to keep this turned off when running topologies on the cluster.</li>
 </ol>
 
-<p>There&#8217;s many other configurations you can set for the topology. The various configurations are detailed on <a href="/apidocs/backtype/storm/Config.html">the Javadoc for Config</a>.</p>
+<p>There are many other configurations you can set for the topology. The various configurations are detailed on <a href="/apidocs/backtype/storm/Config.html">the Javadoc for Config</a>.</p>
 
 <p>To learn about how to set up your development environment so that you can run topologies in local mode (such as in Eclipse), see <a href="Creating-a-new-Storm-project.html">Creating a new Storm project</a>.</p>
 
@@ -298,40 +298,40 @@ cluster.shutdown();
 
 <p>When a task for Bolt A emits a tuple to Bolt B, which task should it send the tuple to?</p>
 
-<p>A &#8220;stream grouping&#8221; answers this question by telling Storm how to send tuples between sets of tasks. Before we dig into the different kinds of stream groupings, let&#8217;s take a look at another topology from <a href="http://github.com/nathanmarz/storm-starter">storm-starter</a>. This <a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/WordCountTopology.java">WordCountTopology</a> reads sentences off of a spout and streams out of <code>WordCountBolt</code> the total number of times it has seen that word before:</p>
+<p>A “stream grouping” answers this question by telling Storm how to send tuples between sets of tasks. Before we dig into the different kinds of stream groupings, let’s take a look at another topology from <a href="http://github.com/nathanmarz/storm-starter">storm-starter</a>. This <a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/WordCountTopology.java">WordCountTopology</a> reads sentences off of a spout, and <code>WordCountBolt</code> streams out the running count of each word it has seen:</p>
 
 <p>```java
 TopologyBuilder builder = new TopologyBuilder();</p>
 
-<p>builder.setSpout(&#8220;sentences&#8221;, new RandomSentenceSpout(), 5);      <br />
-builder.setBolt(&#8220;split&#8221;, new SplitSentence(), 8)
-        .shuffleGrouping(&#8220;sentences&#8221;);
-builder.setBolt(&#8220;count&#8221;, new WordCount(), 12)
-        .fieldsGrouping(&#8220;split&#8221;, new Fields(&#8220;word&#8221;));
+<p>builder.setSpout("sentences", new RandomSentenceSpout(), 5);      <br />
+builder.setBolt("split", new SplitSentence(), 8)
+        .shuffleGrouping("sentences");
+builder.setBolt("count", new WordCount(), 12)
+        .fieldsGrouping("split", new Fields("word"));
 ```</p>
 
 <p><code>SplitSentence</code> emits a tuple for each word in each sentence it receives, and <code>WordCount</code> keeps a map in memory from word to count. Each time <code>WordCount</code> receives a word, it updates its state and emits the new word count.</p>
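 
 <p>A <code>WordCount</code> bolt along these lines can be sketched as follows (close to, though not necessarily identical to, the storm-starter version):</p>
 
 <p><code>java
 public static class WordCount extends BaseBasicBolt {
     Map counts = new HashMap();
 
     public void execute(Tuple tuple, BasicOutputCollector collector) {
         String word = tuple.getString(0);
         // Look up the running count for this word, defaulting to zero.
         Integer count = (Integer) counts.get(word);
         if (count == null) count = 0;
         count++;
         counts.put(word, count);
         // Emit the updated running count.
         collector.emit(new Values(word, count));
     }
 
     public void declareOutputFields(OutputFieldsDeclarer declarer) {
         declarer.declare(new Fields("word", "count"));
     }
 }
 </code></p>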
 
-<p>There&#8217;s a few different kinds of stream groupings.</p>
+<p>There are a few different kinds of stream groupings.</p>
 
-<p>The simplest kind of grouping is called a &#8220;shuffle grouping&#8221; which sends the tuple to a random task. A shuffle grouping is used in the <code>WordCountTopology</code> to send tuples from <code>RandomSentenceSpout</code> to the <code>SplitSentence</code> bolt. It has the effect of evenly distributing the work of processing the tuples across all of <code>SplitSentence</code> bolt&#8217;s tasks.</p>
+<p>The simplest kind of grouping is called a “shuffle grouping”, which sends the tuple to a random task. A shuffle grouping is used in the <code>WordCountTopology</code> to send tuples from <code>RandomSentenceSpout</code> to the <code>SplitSentence</code> bolt. It has the effect of evenly distributing the work of processing the tuples across all of the <code>SplitSentence</code> bolt’s tasks.</p>
 
-<p>A more interesting kind of grouping is the &#8220;fields grouping&#8221;. A fields grouping is used between the <code>SplitSentence</code> bolt and the <code>WordCount</code> bolt. It is critical for the functioning of the <code>WordCount</code> bolt that the same word always go to the same task. Otherwise, more than one task will see the same word, and they&#8217;ll each emit incorrect values for the count since each has incomplete information. A fields grouping lets you group a stream by a subset of its fields. This causes equal values for that subset of fields to go to the same task. Since <code>WordCount</code> subscribes to <code>SplitSentence</code>&#8217;s output stream using a fields grouping on the &#8220;word&#8221; field, the same word always goes to the same task and the bolt produces the correct output.</p>
+<p>A more interesting kind of grouping is the “fields grouping”. A fields grouping is used between the <code>SplitSentence</code> bolt and the <code>WordCount</code> bolt. It is critical for the functioning of the <code>WordCount</code> bolt that the same word always go to the same task. Otherwise, more than one task will see the same word, and they’ll each emit incorrect values for the count since each has incomplete information. A fields grouping lets you group a stream by a subset of its fields. This causes equal values for that subset of fields to go to the same task. Since <code>WordCount</code> subscribes to <code>SplitSentence</code>’s output stream using a fields grouping on the “word” field, the same word always goes to the same task and the bolt produces the correct output.</p>
 
 <p>Fields groupings are the basis of implementing streaming joins and streaming aggregations as well as a plethora of other use cases. Underneath the hood, fields groupings are implemented using mod hashing.</p>
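 
 <p>Conceptually, the task choice works like the following sketch (illustrative only, not Storm’s actual internals):</p>
 
 <p><code>java
 // Equal grouping-field values hash equally, so they always map to the same task.
 int chooseTask(Object groupingFieldValues, int numTasks) {
     // Mask off the sign bit so the index is non-negative.
     return (groupingFieldValues.hashCode() &amp; Integer.MAX_VALUE) % numTasks;
 }
 </code></p>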
 
-<p>There&#8217;s a few other kinds of stream groupings. You can read more about them on <a href="Concepts.html">Concepts</a>. </p>
+<p>There are a few other kinds of stream groupings. You can read more about them on <a href="Concepts.html">Concepts</a>. </p>
 
 <h2 id="defining-bolts-in-other-languages">Defining Bolts in other languages</h2>
 
 <p>Bolts can be defined in any language. Bolts written in another language are executed as subprocesses, and Storm communicates with those subprocesses with JSON messages over stdin/stdout. The communication protocol just requires a ~100-line adapter library, and Storm ships with adapter libraries for Ruby, Python, and Fancy. </p>
 
-<p>Here&#8217;s the definition of the <code>SplitSentence</code> bolt from <code>WordCountTopology</code>:</p>
+<p>Here’s the definition of the <code>SplitSentence</code> bolt from <code>WordCountTopology</code>:</p>
 
 <p>```java
 public static class SplitSentence extends ShellBolt implements IRichBolt {
     public SplitSentence() {
-        super(&#8220;python&#8221;, &#8220;splitsentence.py&#8221;);
+        super("python", "splitsentence.py");
     }</p>
 
 <pre><code>public void declareOutputFields(OutputFieldsDeclarer declarer) {
@@ -339,14 +339,14 @@ public static class SplitSentence extend
 } } ```
 </code></pre>
 
-<p><code>SplitSentence</code> overrides <code>ShellBolt</code> and declares it as running using <code>python</code> with the arguments <code>splitsentence.py</code>. Here&#8217;s the implementation of <code>splitsentence.py</code>:</p>
+<p><code>SplitSentence</code> extends <code>ShellBolt</code> and declares it as running using <code>python</code> with the argument <code>splitsentence.py</code>. Here’s the implementation of <code>splitsentence.py</code>:</p>
 
 <p>```python
 import storm</p>
 
 <p>class SplitSentenceBolt(storm.BasicBolt):
     def process(self, tup):
-        words = tup.values[0].split(&#8220; &#8220;)
+        words = tup.values[0].split(" ")
         for word in words:
           storm.emit([word])</p>
 
@@ -357,15 +357,15 @@ import storm</p>
 
 <h2 id="guaranteeing-message-processing">Guaranteeing message processing</h2>
 
-<p>Earlier on in this tutorial, we skipped over a few aspects of how tuples are emitted. Those aspects were part of Storm&#8217;s reliability API: how Storm guarantees that every message coming off a spout will be fully processed. See <a href="Guaranteeing-message-processing.html">Guaranteeing message processing</a> for information on how this works and what you have to do as a user to take advantage of Storm&#8217;s reliability capabilities.</p>
+<p>Earlier on in this tutorial, we skipped over a few aspects of how tuples are emitted. Those aspects were part of Storm’s reliability API: how Storm guarantees that every message coming off a spout will be fully processed. See <a href="Guaranteeing-message-processing.html">Guaranteeing message processing</a> for information on how this works and what you have to do as a user to take advantage of Storm’s reliability capabilities.</p>
 
 <h2 id="transactional-topologies">Transactional topologies</h2>
 
-<p>Storm guarantees that every message will be played through the topology at least once. A common question asked is &#8220;how do you do things like counting on top of Storm? Won&#8217;t you overcount?&#8221; Storm has a feature called transactional topologies that let you achieve exactly-once messaging semantics for most computations. Read more about transactional topologies <a href="Transactional-topologies.html">here</a>. </p>
+<p>Storm guarantees that every message will be played through the topology at least once. A common question asked is “how do you do things like counting on top of Storm? Won’t you overcount?” Storm has a feature called transactional topologies that lets you achieve exactly-once messaging semantics for most computations. Read more about transactional topologies <a href="Transactional-topologies.html">here</a>. </p>
 
 <h2 id="distributed-rpc">Distributed RPC</h2>
 
-<p>This tutorial showed how to do basic stream processing on top of Storm. There&#8217;s lots more things you can do with Storm&#8217;s primitives. One of the most interesting applications of Storm is Distributed RPC, where you parallelize the computation of intense functions on the fly. Read more about Distributed RPC <a href="Distributed-RPC.html">here</a>. </p>
+<p>This tutorial showed how to do basic stream processing on top of Storm. There are many more things you can do with Storm’s primitives. One of the most interesting applications of Storm is Distributed RPC, where you parallelize the computation of intense functions on the fly. Read more about Distributed RPC <a href="Distributed-RPC.html">here</a>. </p>
 
 <h2 id="conclusion">Conclusion</h2>
 

Modified: incubator/storm/site/publish/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
URL: http://svn.apache.org/viewvc/incubator/storm/site/publish/documentation/Understanding-the-parallelism-of-a-Storm-topology.html?rev=1597847&r1=1597846&r2=1597847&view=diff
==============================================================================
--- incubator/storm/site/publish/documentation/Understanding-the-parallelism-of-a-Storm-topology.html (original)
+++ incubator/storm/site/publish/documentation/Understanding-the-parallelism-of-a-Storm-topology.html Tue May 27 18:39:07 2014
@@ -87,7 +87,7 @@
 
 <h1 id="configuring-the-parallelism-of-a-topology">Configuring the parallelism of a topology</h1>
 
-<p>Note that in Storm’s terminology &#8220;parallelism&#8221; is specifically used to describe the so-called <em>parallelism hint</em>, which means the initial number of executor (threads) of a component. In this document though we use the term &#8220;parallelism&#8221; in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. We will specifically call out when &#8220;parallelism&#8221; is used in the normal, narrow definition of Storm.</p>
+<p>Note that in Storm’s terminology “parallelism” is specifically used to describe the so-called <em>parallelism hint</em>, which means the initial number of executors (threads) of a component. In this document, though, we use the term “parallelism” in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. We will specifically call out when “parallelism” is used in the normal, narrow definition of Storm.</p>
 
 <p>The following sections give an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the table lists only some of them. Storm currently has the following <a href="Configuration.html">order of precedence for configuration settings</a>: <code>defaults.yaml</code> &lt; <code>storm.yaml</code> &lt; topology-specific configuration &lt; internal component-specific configuration &lt; external component-specific configuration.</p>
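 
 <p>To make the last two levels concrete: internal component-specific configuration is whatever a component returns from its <code>getComponentConfiguration</code> method, while external component-specific configuration is set when wiring the topology, along these lines (the setting name and values here are illustrative):</p>
 
 <p><code>java
 Config conf = new Config();
 conf.setNumWorkers(2); // topology-specific configuration
 
 topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
                // external component-specific configuration
                .addConfiguration("topology.tick.tuple.freq.secs", 10)
                .shuffleGrouping("blue-spout");
 </code></p>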
 
@@ -151,17 +151,17 @@ topologyBuilder.setBolt("green-bolt", ne
 Config conf = new Config();
 conf.setNumWorkers(2); // use two worker processes</p>
 
-<p>topologyBuilder.setSpout(&#8220;blue-spout&#8221;, new BlueSpout(), 2); // set parallelism hint to 2</p>
+<p>topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2</p>
 
-<p>topologyBuilder.setBolt(&#8220;green-bolt&#8221;, new GreenBolt(), 2)
+<p>topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
                .setNumTasks(4)
-               .shuffleGrouping(&#8220;blue-spout&#8221;);</p>
+               .shuffleGrouping("blue-spout");</p>
 
-<p>topologyBuilder.setBolt(&#8220;yellow-bolt&#8221;, new YellowBolt(), 6)
-               .shuffleGrouping(&#8220;green-bolt&#8221;);</p>
+<p>topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
+               .shuffleGrouping("green-bolt");</p>
 
 <p>StormSubmitter.submitTopology(
-        &#8220;mytopology&#8221;,
+        "mytopology",
         conf,
         topologyBuilder.createTopology()
     );
@@ -187,9 +187,9 @@ conf.setNumWorkers(2); // use two worker
 <p>Here is an example of using the CLI tool:</p>
 
 <p>```
-# Reconfigure the topology &#8220;mytopology&#8221; to use 5 worker processes,
-# the spout &#8220;blue-spout&#8221; to use 3 executors and
-# the bolt &#8220;yellow-bolt&#8221; to use 10 executors.</p>
+# Reconfigure the topology "mytopology" to use 5 worker processes,
+# the spout "blue-spout" to use 3 executors, and
+# the bolt "yellow-bolt" to use 10 executors.</p>
 
 <p>$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
 ```</p>

Modified: incubator/storm/site/publish/documentation/Using-non-JVM-languages-with-Storm.html
URL: http://svn.apache.org/viewvc/incubator/storm/site/publish/documentation/Using-non-JVM-languages-with-Storm.html?rev=1597847&r1=1597846&r2=1597847&view=diff
==============================================================================
--- incubator/storm/site/publish/documentation/Using-non-JVM-languages-with-Storm.html (original)
+++ incubator/storm/site/publish/documentation/Using-non-JVM-languages-with-Storm.html Tue May 27 18:39:07 2014
@@ -68,9 +68,9 @@
 <ul>
   <li>two pieces: creating topologies and implementing spouts and bolts in other languages</li>
   <li>creating topologies in another language is easy since topologies are just thrift structures (link to storm.thrift)</li>
-  <li>implementing spouts and bolts in another language is called a &#8220;multilang components&#8221; or &#8220;shelling&#8221;
+  <li>implementing spouts and bolts in another language is called “multilang components” or “shelling”
     <ul>
-      <li>Here&#8217;s a specification of the protocol: <a href="Multilang-protocol.html">Multilang protocol</a></li>
+      <li>Here’s a specification of the protocol: <a href="Multilang-protocol.html">Multilang protocol</a></li>
       <li>the thrift structure lets you define multilang components explicitly as a program and a script (e.g., python and the file implementing your bolt)</li>
       <li>In Java, you override ShellBolt or ShellSpout to create multilang components
         <ul>
@@ -89,7 +89,7 @@
       </li>
     </ul>
   </li>
-  <li>&#8220;storm shell&#8221; command makes constructing jar and uploading to nimbus easy
+  <li>“storm shell” command makes constructing jar and uploading to nimbus easy
     <ul>
       <li>makes jar and uploads it</li>
       <li>calls your program with host/port of nimbus and the jarfile id</li>
@@ -111,9 +111,9 @@ union ComponentObject {
 }
 </code></p>
 
-<p>For a non-JVM DSL, you would want to make use of &#8220;2&#8221; and &#8220;3&#8221;. ShellComponent lets you specify a script to run that component (e.g., your python code). And JavaObject lets you specify native java spouts and bolts for the component (and Storm will use reflection to create that spout or bolt).</p>
+<p>For a non-JVM DSL, you would want to make use of “2” and “3”. ShellComponent lets you specify a script to run that component (e.g., your python code). And JavaObject lets you specify native java spouts and bolts for the component (and Storm will use reflection to create that spout or bolt).</p>
 
-<p>There&#8217;s a &#8220;storm shell&#8221; command that will help with submitting a topology. Its usage is like this:</p>
+<p>There’s a “storm shell” command that will help with submitting a topology. Its usage is like this:</p>
 
 <p><code>
 storm shell resources/ python topology.py arg1 arg2
@@ -125,7 +125,7 @@ storm shell resources/ python topology.p
 python topology.py arg1 arg2 {nimbus-host} {nimbus-port} {uploaded-jar-location}
 </code></p>
 
-<p>Then you can connect to Nimbus using the Thrift API and submit the topology, passing {uploaded-jar-location} into the submitTopology method. For reference, here&#8217;s the submitTopology definition:</p>
+<p>Then you can connect to Nimbus using the Thrift API and submit the topology, passing {uploaded-jar-location} into the submitTopology method. For reference, here’s the submitTopology definition:</p>
 
 <p><code>
 void submitTopology(1: string name, 2: string uploadedJarLocation, 3: string jsonConf, 4: StormTopology topology)