Posted to commits@samza.apache.org by ya...@apache.org on 2014/08/15 07:22:29 UTC

[02/39] SAMZA-259: Restructure documentation folders

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/jobs/logging.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/jobs/logging.md b/docs/learn/documentation/versioned/jobs/logging.md
new file mode 100644
index 0000000..6af3d4d
--- /dev/null
+++ b/docs/learn/documentation/versioned/jobs/logging.md
@@ -0,0 +1,93 @@
+---
+layout: page
+title: Logging
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Samza uses [SLF4J](http://www.slf4j.org/) for all of its logging. By default, Samza only depends on slf4j-api, so you must add an SLF4J runtime dependency to your Samza packages for whichever underlying logging platform you wish to use.
+
+### Log4j
+
+The [hello-samza](/startup/hello-samza/{{site.version}}) project shows how to use [log4j](http://logging.apache.org/log4j/1.2/) with Samza. To turn on log4j logging, you just need to make sure slf4j-log4j12 is in your SamzaContainer's classpath. In Maven, this can be done by adding the following dependency to your Samza package project.
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.slf4j</groupId>
+  <artifactId>slf4j-log4j12</artifactId>
+  <scope>runtime</scope>
+  <version>1.6.2</version>
+</dependency>
+{% endhighlight %}
+
+If you're not using Maven, just make sure that slf4j-log4j12 ends up in your Samza package's lib directory.
+
+#### Log4j configuration
+
+Samza's [run-class.sh](packaging.html) script will automatically set the following Java option if log4j.xml exists in your [Samza package's](packaging.html) lib directory.
+
+{% highlight bash %}
+-Dlog4j.configuration=file:$base_dir/lib/log4j.xml
+{% endhighlight %}
+
+The [run-class.sh](packaging.html) script will also set the following Java system properties:
+
+{% highlight bash %}
+-Dsamza.log.dir=$SAMZA_LOG_DIR -Dsamza.container.name=$SAMZA_CONTAINER_NAME
+{% endhighlight %}
+
+These settings are very useful if you're using a file-based appender. For example, you can use a daily rolling appender by configuring log4j.xml like this:
+
+{% highlight xml %}
+<appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
+   <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
+   <param name="DatePattern" value="'.'yyyy-MM-dd" />
+   <layout class="org.apache.log4j.PatternLayout">
+    <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
+   </layout>
+</appender>
+{% endhighlight %}
+
+Setting up a file-based appender is recommended as a better alternative to using standard out. Standard out log files (see below) don't roll, and can get quite large if used for logging.
+
+**NOTE:** If you use the `task.opts` configuration property, the log configuration is disrupted. This is a known bug; please see [SAMZA-109](https://issues.apache.org/jira/browse/SAMZA-109) for a workaround.
+
+### Log Directory
+
+Samza will look for the `SAMZA_LOG_DIR` environment variable when it executes. If this variable is defined, all logs will be written to that directory. If the environment variable is empty or not defined, then Samza will use `/tmp` instead. This environment variable can also be referenced inside log4j.xml files (see above).
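+
+For example, when running a job locally you might point Samza's logs at a dedicated directory before starting it (the paths here are illustrative):
+
+{% highlight bash %}
+# All container and GC logs will be written under this directory instead of /tmp
+export SAMZA_LOG_DIR=/var/log/samza
+bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/config/my-job.properties
+{% endhighlight %}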
+
+### Garbage Collection Logging
+
+Samza will automatically set the following garbage collection logging settings, and will write the GC log to `$SAMZA_LOG_DIR/gc.log`.
+
+{% highlight bash %}
+-XX:+PrintGCDateStamps -Xloggc:$SAMZA_LOG_DIR/gc.log
+{% endhighlight %}
+
+#### Rotation
+
+In older versions of Java, it is impossible to have GC logs roll over based on time or size without the use of a secondary tool. This means that your GC logs will never be deleted until a Samza job ceases to run. As of [Java 6 Update 34](http://www.oracle.com/technetwork/java/javase/2col/6u34-bugfixes-1733379.html), and [Java 7 Update 2](http://www.oracle.com/technetwork/java/javase/7u2-relnotes-1394228.html), [new GC command line switches](http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6941923) have been added to support this functionality. If you are using a version of Java that supports GC log rotation, it's highly recommended that you turn it on.
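+
+For example, on a JVM that supports rotation, you might append flags along these lines to the GC options (values are illustrative):
+
+{% highlight bash %}
+# Roll the GC log when it reaches 10MB, keeping at most 10 files
+-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M
+{% endhighlight %}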
+
+### YARN
+
+If YARN is [securely configured](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html), then when a Samza job executes on a YARN grid, the `$SAMZA_LOG_DIR` environment variable will point to a directory that is secured such that only the user executing the Samza job can read and write to it.
+
+#### STDOUT
+
+Samza's [ApplicationMaster](../yarn/application-master.html) pipes all STDOUT and STDERR output to logs/stdout and logs/stderr, respectively. These files are never rotated.
+
+## [Reprocessing &raquo;](reprocessing.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/jobs/packaging.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/jobs/packaging.md b/docs/learn/documentation/versioned/jobs/packaging.md
new file mode 100644
index 0000000..9e55f9a
--- /dev/null
+++ b/docs/learn/documentation/versioned/jobs/packaging.md
@@ -0,0 +1,47 @@
+---
+layout: page
+title: Packaging
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+The [JobRunner](job-runner.html) page talks about run-job.sh, and how it's used to start a job either locally (ProcessJobFactory/ThreadJobFactory) or with YARN (YarnJobFactory). The diagram that shows the execution flow also shows a run-container.sh script. This script, along with a run-am.sh script, is what Samza actually calls to execute its code.
+
+```
+bin/run-am.sh
+bin/run-container.sh
+```
+
+The run-container.sh script is responsible for starting the [SamzaContainer](../container/samza-container.html). The run-am.sh script is responsible for starting Samza's application master for YARN. Thus, the run-am.sh script is only used by the YarnJob, but both YarnJob and ProcessJob use run-container.sh.
+
+Typically, these two scripts are bundled into a tar.gz file that has a structure like this:
+
+```
+bin/run-am.sh
+bin/run-class.sh
+bin/run-job.sh
+bin/run-container.sh
+lib/*.jar
+```
+
+To run a Samza job, you extract its tar.gz file, and execute the run-job.sh script, as described in the JobRunner section. There are a number of interesting implications of this packaging scheme. First, you'll notice that there is no configuration in the package. Second, you'll notice that the lib directory contains all of the JARs that you'll need to run your Samza job.
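+
+A typical deployment, then, is just a matter of extracting the package and pointing run-job.sh at externally supplied configuration (file and directory names are illustrative):
+
+{% highlight bash %}
+# The package is self-contained: bin/ scripts plus every JAR the job needs in lib/
+mkdir -p deploy/my-samza-job
+tar -xzf my-samza-job-0.0.1-dist.tar.gz -C deploy/my-samza-job
+deploy/my-samza-job/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:///path/to/my-job.properties
+{% endhighlight %}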
+
+The reason that configuration is decoupled from your Samza job packaging is that it allows configuration to be updated without having to re-build the entire Samza package. This makes life easier for everyone when you just need to tweak one parameter, and don't want to have to worry about which branch your package was built from, or whether trunk is in a stable state. It also has the added benefit of forcing configuration to be fully resolved at runtime. This means that the configuration for a job is resolved at the time run-job.sh is called (using the --config-path and --config-factory parameters), and from that point on, the configuration is immutable, and passed where it needs to be by Samza (and YARN, if you're using it).
+
+The second statement, that your Samza package contains all JARs that it needs to run, means that a Samza package is entirely self-contained. This allows Samza jobs to run on independent Samza versions without conflicting with each other. This is in contrast to Hadoop, where JARs are pulled in from the local machine that the job is running on (using environment variables). With Samza, you might run your job on version 0.7.0, and someone else might run their job on version 0.8.0, without any problem.
+
+## [YARN Jobs &raquo;](yarn-jobs.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/jobs/reprocessing.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/jobs/reprocessing.md b/docs/learn/documentation/versioned/jobs/reprocessing.md
new file mode 100644
index 0000000..28d9925
--- /dev/null
+++ b/docs/learn/documentation/versioned/jobs/reprocessing.md
@@ -0,0 +1,83 @@
+---
+layout: page
+title: Reprocessing previously processed data
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+From time to time you may want to deploy a new version of your Samza job that computes results differently. Perhaps you fixed a bug or introduced a new feature. For example, say you have a Samza job that classifies messages as spam or not-spam, using a machine learning model that you train offline. Periodically you want to deploy an updated version of your Samza job which includes the latest classification model.
+
+When you start up a new version of your job, a question arises: what do you want to do with messages that were previously processed with the old version of your job? The answer depends on the behavior you want:
+
+1. **No reprocessing:** By default, Samza assumes that messages processed by the old version don't need to be processed again. When the new version starts up, it will resume processing at the point where the old version left off (assuming you have [checkpointing](../container/checkpointing.html) enabled). If this is the behavior you want, there's nothing special you need to do.
+
+2. **Simple rewind:** Perhaps you want to go back and re-process old messages using the new version of your job. For example, maybe the old version of your classifier marked things as spam too aggressively, so you now want to revisit its previous spam/not-spam decisions using an improved classifier. You can do this by restarting the job at an older point in time in the stream, and running through all the messages since that time. Thus your job starts off reprocessing messages that it has already seen, but it then seamlessly continues with new messages when the reprocessing is done.
+
+   This approach requires an input system such as Kafka, which allows you to jump back in time to a previous point in the stream. We discuss below how this works in practice.
+
+3. **Parallel rewind:** This approach avoids a downside of the *simple rewind* approach. With simple rewind, any new messages that appear while the job is reprocessing old data are queued up, and are processed when the reprocessing is done. The queueing delay needn't be long, because Samza can stream through historical data very quickly, but some latency-sensitive applications need to process messages faster.
+
+   In the *parallel rewind* approach, you run two jobs in parallel: one job continues to handle live updates with low latency (the *real-time job*), while the other is started at an older point in the stream and reprocesses historical data (the *reprocessing job*). The two jobs consume the same input stream at different points in time, and eventually the reprocessing job catches up with the real-time job.
+
+   There are a few details that you need to think through before deploying parallel rewind, which we discuss below.
+
+### Jumping Back in Time
+
+A common aspect of the *simple rewind* and *parallel rewind* approaches is: you have a job which jumps back to an old point in time in the input streams, and consumes all messages since that time. You achieve this by working with Samza's checkpoints.
+
+Normally, when a Samza job starts up, it reads the latest checkpoint to determine at which offset in the input streams it needs to resume processing. If you need to rewind to an earlier time, you do that in one of two ways:
+
+1. You can stop the job, manipulate its last checkpoint to point to an older offset, and start the job up again. Samza includes a command-line tool called [CheckpointTool](../container/checkpointing.html#toc_0) which you can use to manipulate checkpoints.
+2. You can start a new job with a different *job.name* or *job.id* (e.g. increment *job.id* every time you need to jump back in time). This gives the job a new checkpoint stream, with none of the old checkpoint information. You also need to set [samza.offset.default=oldest](../container/checkpointing.html), so that when the job starts up without checkpoint, it starts consuming at the oldest offset available.
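+
+As a sketch, the second approach might look like this in your job's properties file (the system name "kafka" is an example; see the [checkpointing](../container/checkpointing.html) page for the exact offset property):
+
+{% highlight jproperties %}
+# Incrementing job.id gives the job a fresh checkpoint stream
+job.name=spam-classifier
+job.id=2
+# With no checkpoint, start from the oldest offset available in each input stream
+systems.kafka.samza.offset.default=oldest
+{% endhighlight %}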
+
+With either of these approaches you can get Samza to reprocess the entire history of messages in the input system. Input systems such as Kafka can retain a large amount of history &mdash; see discussion below. In order to speed up the reprocessing of historical data, you can increase the container count (*yarn.container.count* if you're running Samza on YARN) to boost your job's computational resources.
+
+If your job maintains any [persistent state](../container/state-management.html), you need to be careful when jumping back in time: resetting a checkpoint does not automatically change persistent state, so you could end up reprocessing old messages while using state from a later point in time. In most cases, a job that jumps back in time should start with an empty state. You can reset the state by deleting the changelog topic, or by changing the name of the changelog topic in your job configuration.
+
+When you're jumping back in time, you're using Samza somewhat like a batch processing framework (e.g. MapReduce) &mdash; with the difference that your job doesn't stop when it has processed all the historical data, but instead continues running, incrementally processing the stream of new messages as they come in. This has the advantage that you don't need to write and maintain separate batch and streaming versions of your job: you can just use the same Samza API for processing both real-time and historical data.
+
+### Retention of history
+
+Samza doesn't maintain history itself &mdash; that is the responsibility of the input system, such as Kafka. How far back in time you can jump depends on the amount of history that is retained in that system.
+
+Kafka is designed to keep a fairly large amount of history: it is common for Kafka brokers to keep one or two weeks of message history accessible, even for high volume topics. The retention period is mostly determined by how much disk space you have available. Kafka's performance [remains high](http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines) even if you have terabytes of history.
+
+There are two different kinds of history which require different configuration:
+
+* **Activity events** are things like user tracking events, web server log events and the like. This kind of stream is typically configured with a time-based retention, e.g. a few weeks. Events older than the retention period are deleted (or archived in an offline system such as HDFS).
+* **Database changes** are events that show inserts, updates and deletes in a database. In this kind of stream, each event typically has a primary key, and a newer event for a key overwrites any older events for the same key. If the same key is updated many times, you're only really interested in the most recent value. (The [changelog streams](../container/state-management.html) used by Samza's persistent state fall in this category.)
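+
+With Kafka 0.8.1 or later, these two kinds of history map to different per-topic settings, along these lines (topic names and values are illustrative):
+
+{% highlight bash %}
+# Activity events: delete messages older than roughly two weeks
+bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic page-views \
+  --config retention.ms=1209600000
+# Database changes: compact the log, keeping only the latest value for each key
+bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic user-accounts \
+  --config cleanup.policy=compact
+{% endhighlight %}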
+
+In a database change stream, when you're reprocessing data, you typically want to reprocess the entire database. You don't want to miss a value just because it was last updated more than a few weeks ago. In other words, you don't want change events to be deleted just because they are older than some threshold. In this case, when you're jumping back in time, you need to rewind to the *beginning of time*, to the first change ever made to the database (known in Kafka as "offset 0").
+
+Fortunately this can be done efficiently, using a Kafka feature called [log compaction](http://kafka.apache.org/documentation.html#compaction). 
+
+For example, imagine your database contains counters: every time something happens, you increment the appropriate counters and update the database with the new counter values. Every update is sent to the changelog, and because there are many updates, the changelog stream will take up a lot of space. With log compaction turned on, Kafka deduplicates the stream in the background, keeping only the most recent counter value for each key, and deleting any old values for the same counter. This reduces the size of the stream so much that you can keep the most recent update for every key, even if it was last updated long ago.
+
+With log compaction enabled, the stream of database changes becomes a full copy of the entire database. By jumping back to offset 0, your Samza job can scan over the entire database and reprocess it. This is a very powerful way of building scalable applications.
+
+### Details of Parallel Rewind
+
+If you are taking the *parallel rewind* approach described above, running two jobs in parallel, you need to configure them carefully to avoid problems. In particular, some things to look out for:
+
+* Make sure that the two jobs don't interfere with each other. They need different *job.name* or *job.id* configuration properties, so that each job gets its own checkpoint stream. If the jobs maintain [persistent state](../container/state-management.html), each job needs its own changelog (two different jobs writing to the same changelog produces undefined results).
+* What happens to job output? If the job sends its results to an output stream, or writes to a database, then the easiest solution is for each job to have a separate output stream or database table. If they write to the same output, you need to take care to ensure that newer data isn't overwritten with older data (due to race conditions between the two jobs).
+* Do you need to support A/B testing between the old and the new version of your job, e.g. to test whether the new version improves your metrics? Parallel rewind is ideal for this: each job writes to a separate output, and clients or consumers of the output can read from either the old or the new version's output, depending on whether a user is in test group A or B.
+* Reclaiming resources: you might want to keep the old version of your job running for a while, even when the new version has finished reprocessing historical data (especially if the old version's output is being used in an A/B test). However, eventually you'll want to shut it down, and delete the checkpoint and changelog streams belonging to the old version.
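+
+As a sketch, the reprocessing job might differ from the real-time job in only a few properties (job, store and stream names are illustrative):
+
+{% highlight jproperties %}
+# Same code as the real-time job, but a different job.id gives the reprocessing
+# job its own checkpoint stream
+job.name=spam-classifier
+job.id=reprocess-1
+# A renamed changelog keeps the reprocessing job's state separate
+stores.scores.changelog=kafka.spam-scores-changelog-reprocess-1
+# Start from the oldest available offset rather than from a checkpoint
+systems.kafka.samza.offset.default=oldest
+{% endhighlight %}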
+
+Samza gives you a lot of flexibility for reprocessing historical data, and you don't need to program against a separate batch processing API to take advantage of it. If you're mindful of these issues, you can build a data system that is very robust, but still gives you lots of freedom to change your processing logic in future.
+
+## [Application Master &raquo;](../yarn/application-master.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/jobs/yarn-jobs.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/jobs/yarn-jobs.md b/docs/learn/documentation/versioned/jobs/yarn-jobs.md
new file mode 100644
index 0000000..58ca50d
--- /dev/null
+++ b/docs/learn/documentation/versioned/jobs/yarn-jobs.md
@@ -0,0 +1,34 @@
+---
+layout: page
+title: YARN Jobs
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+When you define `job.factory.class=org.apache.samza.job.yarn.YarnJobFactory` in your job's configuration, Samza will use YARN to execute your job. The YarnJobFactory will use the YARN_HOME environment variable on the machine where run-job.sh is executed to find the appropriate YARN configuration, which defines where the YARN ResourceManager is. The YarnJob will then work with the ResourceManager to get your job started on the YARN cluster.
+
+If you want to use YARN to run your Samza job, you'll also need to define the location of your Samza job's package. For example, you might say:
+
+{% highlight jproperties %}
+yarn.package.path=http://my.http.server/jobs/ingraphs-package-0.0.55.tgz
+{% endhighlight %}
+
+This .tgz file follows the conventions outlined on the [Packaging](packaging.html) page (it has bin/run-am.sh and bin/run-container.sh). YARN NodeManagers will take responsibility for downloading this .tgz file to the appropriate machines and untarring it. From there, YARN will execute run-am.sh for the Samza ApplicationMaster and run-container.sh for each SamzaContainer.
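+
+Putting it together, a minimal YARN job configuration might look something like this (names and values are illustrative):
+
+{% highlight jproperties %}
+job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
+job.name=my-samza-job
+yarn.package.path=http://my.http.server/jobs/my-samza-job-0.0.1-dist.tgz
+# Number of SamzaContainers to run the job's partitions
+yarn.container.count=2
+{% endhighlight %}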
+
+<!-- TODO document yarn.container.count and other key configs -->
+
+## [Logging &raquo;](logging.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/operations/kafka.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/operations/kafka.md b/docs/learn/documentation/versioned/operations/kafka.md
new file mode 100644
index 0000000..29833e4
--- /dev/null
+++ b/docs/learn/documentation/versioned/operations/kafka.md
@@ -0,0 +1,34 @@
+---
+layout: page
+title: Kafka
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<!-- TODO kafka page should be fleshed out a bit -->
+
+<!-- TODO when 0.8.1 is released, update with state management config information -->
+
+Kafka has a great [operations wiki](http://kafka.apache.org/08/ops.html), which provides some detail on how to operate Kafka at scale.
+
+### Auto-Create Topics
+
+Kafka brokers should be configured to automatically create topics. Without this, it's going to be very cumbersome to run Samza jobs, since jobs will write to arbitrary (and sometimes new) topics.
+
+{% highlight jproperties %}
+auto.create.topics.enable=true
+{% endhighlight %}

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/operations/security.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/operations/security.md b/docs/learn/documentation/versioned/operations/security.md
new file mode 100644
index 0000000..b7ef24e
--- /dev/null
+++ b/docs/learn/documentation/versioned/operations/security.md
@@ -0,0 +1,72 @@
+---
+layout: page
+title: Security
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Samza provides no security of its own. All security is implemented in the stream system, or in the environment in which Samza containers run.
+
+### Securing Streaming Systems
+
+Samza does not provide any security at the stream system level. It is up to individual streaming systems to enforce their own security. If a stream system requires usernames and passwords in order to consume from specific streams, these values must be supplied via configuration, and used by the StreamConsumer/StreamConsumerFactory implementation. The same holds true if the streaming system uses SSL certificates or Kerberos: the environment in which Samza runs must provide the appropriate certificate or Kerberos ticket, and the StreamConsumer must be implemented to use these certificates or tickets.
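+
+For example, a StreamConsumerFactory implementation might read credentials from properties like these (the property names are entirely illustrative &mdash; they depend on your implementation, not on Samza itself):
+
+{% highlight jproperties %}
+systems.my-secure-system.username=samza-job-user
+systems.my-secure-system.password.file=/secure/path/to/password
+{% endhighlight %}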
+
+#### Securing Kafka
+
+Kafka provides no security for its topics, and therefore Samza doesn't provide any security when using Kafka topics.
+
+### Securing Samza's Environment
+
+The most important thing to keep in mind when securing an environment that Samza containers run in is that **Samza containers execute arbitrary user code**. They must be considered an adversarial application, and the environment must be locked down accordingly.
+
+#### Configuration
+
+Samza reads all configuration at the time a Samza job is started using the run-job.sh script. If configuration contains sensitive information, then care must be taken to provide the JobRunner with the configuration. This means implementing a ConfigFactory that understands the configuration security model, and resolves configuration to Samza's Config object in a secure way.
+
+For the duration of a Samza job's execution, the configuration is kept in memory. The only times configuration is visible are:
+
+1. When configuration is resolved using a ConfigFactory.
+2. When the configuration is printed to STDOUT as run-job.sh runs.
+3. When the configuration is written to the logs as a Samza container starts.
+
+If configuration contains sensitive data, then these three points must be secured.
+
+#### Ports
+
+The only port that a Samza container opens by default is an unsecured JMX port that is randomly selected at start time. If this is not desired, JMX can be disabled through configuration. See the [Configuration](configuration.html) page for details.
+
+Users might open ports from inside a Samza container. If this is not desired, then the user that executes the Samza container must have the appropriate permissions revoked, usually using iptables.
+
+#### Logs
+
+Samza container logs contain configuration, and might contain arbitrary sensitive data logged by the user. A secure log directory must be provided to the Samza container.
+
+#### Starting a Samza Job
+
+If operators do not wish to allow Samza containers to be executed by arbitrary users, then the mechanism by which Samza containers are deployed must be secured. Usually, this means controlling execution of the run-job.sh script. The recommended pattern is to lock down the machines that Samza containers run on, execute run-job.sh from either a blessed web service or a special machine, and only allow access to that service or machine by specific users.
+
+#### Shell Scripts
+
+Please see the [Packaging](packaging.html) section for details on the shell scripts that Samza uses. Samza containers allow users to execute arbitrary shell commands, so user permissions must be locked down to prevent users from damaging the environment or reading sensitive data.
+
+#### YARN
+
+<!-- TODO make the security page link to the actual YARN security document, when we write it. -->
+
+Samza provides out-of-the-box YARN integration. Take a look at Samza's YARN Security page for details.
+
+## [Kafka &raquo;](kafka.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/yarn/application-master.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/yarn/application-master.md b/docs/learn/documentation/versioned/yarn/application-master.md
new file mode 100644
index 0000000..d20aece
--- /dev/null
+++ b/docs/learn/documentation/versioned/yarn/application-master.md
@@ -0,0 +1,69 @@
+---
+layout: page
+title: Application Master
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+YARN is Hadoop's next-generation cluster manager. It allows developers to deploy and execute arbitrary commands on a grid. If you're unfamiliar with YARN, or the concept of an ApplicationMaster (AM), please read Hadoop's [YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) page.
+
+### Integration
+
+Samza's main integration with YARN comes in the form of a Samza ApplicationMaster. This is the chunk of code responsible for managing a Samza job in a YARN grid. It decides what to do when a stream processor fails, which machines a Samza job's [containers](../container/samza-container.html) should run on, and so on.
+
+When the Samza ApplicationMaster starts up, it does the following:
+
+1. Receives configuration from YARN via the STREAMING_CONFIG environment variable.
+2. Starts a JMX server on a random port.
+3. Instantiates a metrics registry and reporters to keep track of relevant metrics.
+4. Registers the AM with YARN's RM.
+5. Gets the total number of partitions for the Samza job using each input stream's PartitionManager (see the [Streams](../container/streams.html) page for details).
+6. Reads the total number of containers requested from the Samza job's configuration.
+7. Assigns each partition to a container (called a Task Group in Samza's AM dashboard).
+8. Makes a [ResourceRequest](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/yarn/api/records/ResourceRequest.html) to YARN for each container.
+9. Polls the YARN RM every second to check for allocated and released containers.
+
+From this point on, the ApplicationMaster just reacts to events from the RM.
+
+### Fault Tolerance
+
+Whenever a container is allocated, the AM will work with the YARN NM to start a SamzaContainer (with appropriate partitions assigned to it) in the container. If a container fails with a non-zero return code, the AM will request a new container, and restart the SamzaContainer. If a SamzaContainer fails too many times, too quickly, the ApplicationMaster will fail the whole Samza job with a non-zero return code. See the yarn.container.retry.count and yarn.container.retry.window.ms [configuration](../jobs/configuration.html) parameters for details.
+
+When the AM receives a reboot signal from YARN, it will throw a SamzaException. This will trigger a clean and successful shutdown of the AM (YARN won't think the AM failed).
+
+If the AM itself fails, YARN will handle restarting it. When the AM is restarted, all containers that were running are killed, and the AM starts from scratch, executing the same list of operations shown above. The AM will request new containers for its SamzaContainers, and proceed as though it had just started for the first time. YARN has a yarn.resourcemanager.am.max-retries configuration parameter that's defined in [yarn-site.xml](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml). This parameter defaults to 1, which means that, by default, a single AM failure will cause your Samza job to stop running.
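+
+If a single AM failure shouldn't stop your job, you can raise this limit in yarn-site.xml (the value here is illustrative):
+
+{% highlight xml %}
+<property>
+  <name>yarn.resourcemanager.am.max-retries</name>
+  <value>3</value>
+</property>
+{% endhighlight %}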
+
+### Dashboard
+
+Samza's ApplicationMaster comes with a dashboard to show useful information such as:
+
+1. Where containers are located.
+2. Links to logs.
+3. The Samza job's configuration.
+4. Container failure count.
+
+You can find this dashboard by going to your YARN grid's ResourceManager page (usually something like [http://localhost:8088/cluster](http://localhost:8088/cluster)), and clicking on the "ApplicationMaster" link of a running Samza job.
+
+<img src="/img/{{site.version}}/learn/documentation/yarn/samza-am-dashboard.png" alt="Screenshot of ApplicationMaster dashboard" class="diagram-large">
+
+### Security
+
+The Samza dashboard's HTTP access is currently unsecured, even when running YARN in secure mode. This means that users with access to a YARN grid could port-scan a Samza ApplicationMaster's HTTP server, and open the dashboard in a browser to view its contents. In this way, sensitive configuration can be viewed by anyone, so care should be taken. There are plans to secure Samza's ApplicationMaster using [Hadoop's security](http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_installing_manually_book/content/rpm-chap14-2-3-1.html) features ([SPNEGO](http://en.wikipedia.org/wiki/SPNEGO)).
+
+See Samza's [security](../operations/security.html) page for more details.
+
+## [Isolation &raquo;](isolation.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/yarn/isolation.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/yarn/isolation.md b/docs/learn/documentation/versioned/yarn/isolation.md
new file mode 100644
index 0000000..1eb3bf5
--- /dev/null
+++ b/docs/learn/documentation/versioned/yarn/isolation.md
@@ -0,0 +1,46 @@
+---
+layout: page
+title: Isolation
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+When running Samza jobs in a shared, distributed environment, the stream processors can have an impact on one another's performance. A stream processor that uses 100% of a machine's CPU will slow down all other stream processors on the machine.
+
+One of [YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)'s responsibilities is to manage resources so that this doesn't happen. Each of YARN's Node Managers (NM) has a chunk of "resources" dedicated to it. The YARN Resource Manager (RM) will only allow a container to be allocated on a NM if it has enough resources to satisfy the container's needs.
+
+YARN currently supports resource management for memory and CPU.
+
+### Memory
+
+YARN will automatically enforce memory limits for all containers that it executes. All containers must have a max-memory size defined when they're created. If the sum of all memory usage for processes associated with a single YARN container exceeds this maximum, YARN will kill the container.
+
+Samza supports memory limits using the yarn.container.memory.mb and yarn.am.container.memory.mb configuration parameters. Keep in mind that this is simply the amount of memory YARN will allow a [SamzaContainer](../container/samza-container.html) or [ApplicationMaster](application-master.html) to have. You'll still need to configure your heap settings appropriately using task.opts, when using Java (the default is -Xmx160M). See the [Configuration](../jobs/configuration.html) and [Packaging](../jobs/packaging.html) pages for details.
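+
+For example, you might size the YARN memory limits and the JVM heap together, leaving some headroom for off-heap usage (values are illustrative):
+
+{% highlight jproperties %}
+# YARN kills the container if its processes exceed this limit
+yarn.container.memory.mb=1024
+yarn.am.container.memory.mb=1024
+# Keep the Java heap comfortably below the YARN limit
+task.opts=-Xmx768M
+{% endhighlight %}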
+
+### CPU
+
+YARN has the concept of a virtual core. Each NM is assigned a total number of virtual cores (32, by default). When a container request is made, it must specify how many virtual cores it needs. The YARN RM will only assign the container to a NM that has enough virtual cores to satisfy the request.
+
+#### CGroups
+
+Unlike memory limits, which YARN can enforce itself (by looking at the /proc folder), CPU isolation can't be enforced by YARN directly, since it must be done at the Linux kernel level. One of YARN's interesting new features is its support for Linux [CGroups](https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt). CGroups are a way to control process utilization at the kernel level in Linux.
+
+If YARN is set up to use CGroups, then YARN will guarantee that a container gets at least the amount of CPU that it requires. Currently, YARN will give a container more CPU if it's available. For details on enforcing "at most" CPU usage, see [YARN-810](https://issues.apache.org/jira/browse/YARN-810).
+
+See [this blog post](http://riccomini.name/posts/hadoop/2013-06-14-yarn-with-cgroups/) for details on setting up YARN with CGroups.
+
+## [Security &raquo;](../operations/security.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.md b/docs/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.md
deleted file mode 100644
index ab33dd3..0000000
--- a/docs/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.md
+++ /dev/null
@@ -1,42 +0,0 @@
----
-layout: page
-title: Deploying a Samza job from HDFS
----
-<!--
-   Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
--->
-
-This tutorial uses [hello-samza](../../../startup/hello-samza/0.7.0/) to illustrate how to run a Samza job if you want to publish the Samza job's .tar.gz package to HDFS.
-
-### Upload the package
-
-{% highlight bash %}
-hadoop fs -put ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz /path/for/tgz
-{% endhighlight %}
-
-### Add HDFS configuration
-
-Put your cluster's hdfs-site.xml file into the ~/.samza/conf directory (the same place as yarn-site.xml). If you have set HADOOP\_CONF\_DIR, put hdfs-site.xml into that configuration directory instead, if it is not already there.
-
-### Change properties file
-
-Change the yarn.package.path in the properties file to your HDFS location.
-
-{% highlight jproperties %}
-yarn.package.path=hdfs://<hdfs name node ip>:<hdfs name node port>/path/to/tgz
-{% endhighlight %}
-
-Then you should be able to run the Samza job as described in [hello-samza](../../../startup/hello-samza/0.7.0/).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/0.7.0/index.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/0.7.0/index.md b/docs/learn/tutorials/0.7.0/index.md
deleted file mode 100644
index 91bddc5..0000000
--- a/docs/learn/tutorials/0.7.0/index.md
+++ /dev/null
@@ -1,39 +0,0 @@
----
-layout: page
-title: Tutorials
----
-<!--
-   Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
--->
-
-[Remote Debugging with Samza](remote-debugging-samza.html)
-
-[Deploying a Samza Job from HDFS](deploy-samza-job-from-hdfs.html)
-
-[Run Hello-samza in Multi-node YARN](run-in-multi-node-yarn.html)
-
-[Run Hello-samza without Internet](run-hello-samza-without-internet.html)
-
-<!-- TODO a bunch of tutorials
-[Log Walkthrough](log-walkthrough.html)
-<a href="configuring-kafka-system.html">Configuring a Kafka System</a><br/>
-<a href="joining-streams.html">Joining Streams</a><br/>
-<a href="sort-stream.html">Sorting a Stream</a><br/>
-<a href="group-by-count.html">Group-by and Counting</a><br/>
-<a href="initialize-close.html">Initializing and Closing</a><br/>
-<a href="windowing.html">Windowing</a><br/>
-<a href="committing.html">Committing</a><br/>
--->

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/0.7.0/remote-debugging-samza.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/0.7.0/remote-debugging-samza.md b/docs/learn/tutorials/0.7.0/remote-debugging-samza.md
deleted file mode 100644
index 89d0856..0000000
--- a/docs/learn/tutorials/0.7.0/remote-debugging-samza.md
+++ /dev/null
@@ -1,100 +0,0 @@
----
-layout: page
-title: Remote Debugging with Samza
----
-<!--
-   Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
--->
-
-Let's use Eclipse to attach a remote debugger to a Samza container. If you're an IntelliJ user, you'll have to fill in the blanks, but the process should be pretty similar. This tutorial assumes you've already run through the [Hello Samza](../../../startup/hello-samza/0.7.0/) tutorial.
-
-### Get the Code
-
-Start by checking out Samza, so we have access to the source.
-
-{% highlight bash %}
-git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git
-{% endhighlight %}
-
-Next, grab hello-samza.
-
-{% highlight bash %}
-git clone git://git.apache.org/incubator-samza-hello-samza.git
-{% endhighlight %}
-
-### Setup the Environment
-
-Now, let's setup the Eclipse project files.
-
-{% highlight bash %}
-cd incubator-samza
-./gradlew eclipse
-{% endhighlight %}
-
-Let's also release Samza to Maven's local repository, so hello-samza has access to the JARs that it needs.
-
-{% highlight bash %}
-./gradlew -PscalaVersion=2.9.2 clean publishToMavenLocal
-{% endhighlight %}
-
-Next, open Eclipse, and import the Samza source code into your workspace: "File" &gt; "Import" &gt; "Existing Projects into Workspace" &gt; "Browse". Select 'incubator-samza' folder, and hit 'finish'.
-
-### Enable Remote Debugging
-
-Now, go back to the hello-samza project, and edit ./samza-job-package/src/main/config/wikipedia-feed.properties to add the following line:
-
-{% highlight jproperties %}
-task.opts=-agentlib:jdwp=transport=dt_socket,address=localhost:9009,server=y,suspend=y
-{% endhighlight %}
-
-The [task.opts](../../documentation/0.7.0/jobs/configuration-table.html) configuration parameter is a way to override Java parameters at runtime for your Samza containers. In this example, we're setting the agentlib parameter to enable remote debugging on localhost, port 9009. In a more realistic environment, you might also set Java heap settings (-Xmx, -Xms, etc), as well as garbage collection and logging settings.
-
-*NOTE: If you're running multiple Samza containers on the same machine, there is a potential for port collisions. You must configure your task.opts to assign different ports for different Samza jobs. If a Samza job has more than one container (e.g. if you're using YARN with yarn.container.count=2), those containers must be run on different machines.*
-
-### Start the Grid
-
-Now that the Samza job has been set up to enable remote debugging when a Samza container starts, let's start ZooKeeper, Kafka, and YARN.
-
-{% highlight bash %}
-bin/grid
-{% endhighlight %}
-
-If you get a complaint that JAVA_HOME is not set, then you'll need to set it. This can be done on OSX by running:
-
-{% highlight bash %}
-export JAVA_HOME=$(/usr/libexec/java_home)
-{% endhighlight %}
-
-Once the grid starts, you can start the wikipedia-feed Samza job.
-
-{% highlight bash %}
-mvn clean package
-mkdir -p deploy/samza
-tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza
-deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
-{% endhighlight %}
-
-When the wikipedia-feed job starts up, a single Samza container will be created to process all incoming messages. This is the container that we'll want to connect to from the remote debugger.
-
-### Connect the Remote Debugger
-
-Switch back to Eclipse, and set a break point in TaskInstance.process by clicking on a line inside TaskInstance.process, and clicking "Run" &gt; "Toggle Breakpoint". A blue circle should appear to the left of the line. This will let you see incoming messages as they arrive.
-
-Set up a remote debugging session: "Run" &gt; "Debug Configurations..." &gt; right click on "Remote Java Application" &gt; "New". Set the name to 'wikipedia-feed-debug'. Set the port to 9009 (matching the port in the task.opts configuration). Click "Source" &gt; "Add..." &gt; "Java Project". Select all of the Samza projects that you imported (i.e. samza-api, samza-core, etc). If you would like to set breakpoints in your own Stream task, also add the project that contains your StreamTask implementation. Click 'Debug'.
-
-After a few moments, Eclipse should connect to the wikipedia-feed job, and ask you to switch to Debug mode. Once in debug, you'll see that it's broken at the TaskInstance.process method. From here, you can step through code, inspect variable values, etc.
-
-Congratulations, you've got a remote debug connection to your StreamTask!

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/0.7.0/run-hello-samza-without-internet.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/0.7.0/run-hello-samza-without-internet.md b/docs/learn/tutorials/0.7.0/run-hello-samza-without-internet.md
deleted file mode 100644
index a5503ef..0000000
--- a/docs/learn/tutorials/0.7.0/run-hello-samza-without-internet.md
+++ /dev/null
@@ -1,78 +0,0 @@
----
-layout: page
-title: Run Hello Samza without Internet
----
-<!--
-   Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
--->
-
-This tutorial helps you run [Hello Samza](../../../startup/hello-samza/0.7.0/) if you cannot connect to the internet.
-
-### Test Your Connection
-
-Try connecting to irc.wikimedia.org. Sometimes your company's firewall blocks this service.
-
-{% highlight bash %}
-telnet irc.wikimedia.org 6667
-{% endhighlight %}
-
-You should see something like this:
-
-```
-Trying 208.80.152.178...
-Connected to ekrem.wikimedia.org.
-Escape character is '^]'.
-NOTICE AUTH :*** Processing connection to irc.pmtpa.wikimedia.org
-NOTICE AUTH :*** Looking up your hostname...
-NOTICE AUTH :*** Checking Ident
-NOTICE AUTH :*** Found your hostname
-```
-
-Otherwise, you may have a connection problem.
-
-### Use Local Data to Run Hello Samza
-
-We provide an alternative to get wikipedia feed data. Instead of running
-
-{% highlight bash %}
-deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
-{% endhighlight %}
-
-You will run
-
-{% highlight bash %}
-bin/produce-wikipedia-raw-data.sh
-{% endhighlight %}
-
-This script reads Wikipedia feed data from a local file and produces it to the Kafka broker. By default, it produces to the Kafka broker at localhost:9092 and uses localhost:2181 as the ZooKeeper address. You can override both:
-
-{% highlight bash %}
-bin/produce-wikipedia-raw-data.sh -b yourKafkaBrokerAddress -z yourZookeeperAddress
-{% endhighlight %}
-
-Now you can go back to Generate Wikipedia Statistics section in [Hello Samza](../../../startup/hello-samza/0.7.0/) and follow the remaining steps.
-
-### A Little Explanation
-
-The goal of
-
-{% highlight bash %}
-deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
-{% endhighlight %}
-
-is to deploy a Samza job which listens to the Wikipedia API, receives the feed in real time, and produces it to the Kafka topic wikipedia-raw. The alternative in this tutorial reads a local Wikipedia feed file in an infinite loop and produces the data to the same wikipedia-raw topic. The follow-up job, wikipedia-parser, gets its data from the wikipedia-raw topic, so as long as that topic contains correct data, everything works. The Samza jobs are connected only through Kafka and do not depend on each other directly.
-
-

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/0.7.0/run-in-multi-node-yarn.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/0.7.0/run-in-multi-node-yarn.md b/docs/learn/tutorials/0.7.0/run-in-multi-node-yarn.md
deleted file mode 100644
index c079233..0000000
--- a/docs/learn/tutorials/0.7.0/run-in-multi-node-yarn.md
+++ /dev/null
@@ -1,174 +0,0 @@
----
-layout: page
-title: Run Hello-samza in Multi-node YARN
----
-<!--
-   Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
--->
-
-You must first successfully run the [hello-samza](../../../startup/hello-samza/0.7.0/) project on single-node YARN by following the [hello-samza](../../../startup/hello-samza/0.7.0/) tutorial. Now it's time to run the Samza job on a "real" YARN grid (with more than one node).
-
-## Set Up Multi-node YARN
-
-If you already have a multi-node YARN cluster (such as CDH5 cluster), you can skip this set-up section.
-
-### Basic YARN Setting
-
-1\. Download [YARN 2.3](http://mirror.symnds.com/software/Apache/hadoop/common/hadoop-2.3.0/hadoop-2.3.0.tar.gz) to /tmp and untar it.
-
-{% highlight bash %}
-cd /tmp
-tar -xvf hadoop-2.3.0.tar.gz
-cd hadoop-2.3.0
-{% endhighlight %}
-
-2\. Set up environment variables.
-
-{% highlight bash %}
-export HADOOP_YARN_HOME=$(pwd)
-mkdir conf
-export HADOOP_CONF_DIR=$HADOOP_YARN_HOME/conf
-{% endhighlight %}
-
-3\. Configure the YARN settings file.
-
-{% highlight bash %}
-cp ./etc/hadoop/yarn-site.xml conf
-vi conf/yarn-site.xml
-{% endhighlight %}
-
-Add the following property to yarn-site.xml:
-
-{% highlight xml %}
-<property>
-    <name>yarn.resourcemanager.hostname</name>
-    <!-- hostname that is accessible from all NMs -->
-    <value>yourHostname</value>
-</property>
-{% endhighlight %}
-
-Download and add capacity-scheduler.xml.
-
-{% highlight bash %}
-curl http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/capacity-scheduler.xml?view=co > conf/capacity-scheduler.xml
-{% endhighlight %}
-
-### Set Up HTTP Filesystem for YARN
-
-The goal of these steps is to configure YARN to read from an HTTP filesystem, because we will use an HTTP server to deploy the Samza job package. If you want to deploy the package from HDFS instead, you can skip steps 4 through 6 and follow [Deploying a Samza Job from HDFS](deploy-samza-job-from-hdfs.html).
-
-4\. Download Scala package and untar it.
-
-{% highlight bash %}
-cd /tmp
-curl http://www.scala-lang.org/files/archive/scala-2.10.3.tgz > scala-2.10.3.tgz
-tar -xvf scala-2.10.3.tgz
-{% endhighlight %}
-
-5\. Add the Scala jars and the grizzled-slf4j logging jar.
-
-{% highlight bash %}
-cp /tmp/scala-2.10.3/lib/scala-compiler.jar $HADOOP_YARN_HOME/share/hadoop/hdfs/lib
-cp /tmp/scala-2.10.3/lib/scala-library.jar $HADOOP_YARN_HOME/share/hadoop/hdfs/lib
-curl http://search.maven.org/remotecontent?filepath=org/clapper/grizzled-slf4j_2.10/1.0.1/grizzled-slf4j_2.10-1.0.1.jar > $HADOOP_YARN_HOME/share/hadoop/hdfs/lib/grizzled-slf4j_2.10-1.0.1.jar
-{% endhighlight %}
-
-6\. Add the HTTP configuration to core-site.xml (create the core-site.xml file if it does not exist).
-
-{% highlight bash %}
-vi $HADOOP_YARN_HOME/conf/core-site.xml
-{% endhighlight %}
-
-Add the following code:
-
-{% highlight xml %}
-<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
-<configuration>
-    <property>
-      <name>fs.http.impl</name>
-      <value>org.apache.samza.util.hadoop.HttpFileSystem</value>
-    </property>
-</configuration>
-{% endhighlight %}
-
-### Distribute Hadoop Files to Slaves
-
-7\. Copy the Hadoop directory on your host machine to the slave machines (172.21.100.35, in this example), add the slave to conf/slaves, and start YARN:
-
-{% highlight bash %}
-scp -r . 172.21.100.35:/tmp/hadoop-2.3.0
-echo 172.21.100.35 > conf/slaves
-sbin/start-yarn.sh
-{% endhighlight %}
-
-* If you get "172.21.100.35: Error: JAVA_HOME is not set and could not be found.", add a conf/hadoop-env.sh file on the failing machine (172.21.100.35, in this case) containing "export JAVA_HOME=/export/apps/jdk/JDK-1_6_0_27" (or wherever your JAVA_HOME actually is).
-
-8\. Validate that your nodes are up by visiting http://yourHostname:8088/cluster/nodes.
-
-## Deploy Samza Job
-
-Some of the following steps are identical to those in [hello-samza](../../../startup/hello-samza/0.7.0/). You may skip them if you have already done them.
-
-1\. Download Samza and publish it to your local Maven repository.
-
-{% highlight bash %}
-cd /tmp
-git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git
-cd incubator-samza
-./gradlew clean publishToMavenLocal
-cd ..
-{% endhighlight %}
-
-2\. Download the hello-samza project and edit the job properties file.
-
-{% highlight bash %}
-git clone git://github.com/linkedin/hello-samza.git
-cd hello-samza
-vi samza-job-package/src/main/config/wikipedia-feed.properties
-{% endhighlight %}
-
-Change the yarn.package.path property to be:
-
-{% highlight jproperties %}
-yarn.package.path=http://yourHostname:8000/samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz
-{% endhighlight %}
-
-3\. Compile hello-samza.
-
-{% highlight bash %}
-mvn clean package
-mkdir -p deploy/samza
-tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza
-{% endhighlight %}
-
-4\. Deploy the Samza job package to an HTTP server.
-
-Open a new terminal, and run:
-
-{% highlight bash %}
-cd /tmp/hello-samza && python -m SimpleHTTPServer
-{% endhighlight %}
-
-Go back to the original terminal (not the one running the HTTP server):
-
-{% highlight bash %}
-deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
-{% endhighlight %}
-
-Go to http://yourHostname:8088 and find the wikipedia-feed job. Click on the ApplicationMaster link to see that it's running.
-
-Congratulations! You are now running the Samza job in a "real" YARN grid!
-

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/versioned/deploy-samza-job-from-hdfs.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/versioned/deploy-samza-job-from-hdfs.md b/docs/learn/tutorials/versioned/deploy-samza-job-from-hdfs.md
new file mode 100644
index 0000000..ec455d7
--- /dev/null
+++ b/docs/learn/tutorials/versioned/deploy-samza-job-from-hdfs.md
@@ -0,0 +1,42 @@
+---
+layout: page
+title: Deploying a Samza job from HDFS
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+This tutorial uses [hello-samza](../../../startup/hello-samza/{{site.version}}/) to illustrate how to run a Samza job when you publish the job's .tar.gz package to HDFS.
+
+### Upload the package
+
+{% highlight bash %}
+hadoop fs -put ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz /path/for/tgz
+{% endhighlight %}
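+
+You can confirm the upload with a directory listing:
+
+{% highlight bash %}
+hadoop fs -ls /path/for/tgz
+{% endhighlight %}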
+
+### Add HDFS configuration
+
+Put your cluster's hdfs-site.xml file into the ~/.samza/conf directory (the same place as yarn-site.xml). If you have set HADOOP\_CONF\_DIR, put hdfs-site.xml into that configuration directory instead, if it is not already there.
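+
+A minimal sketch of the copy, assuming your cluster's client configuration lives in /etc/hadoop/conf (a hypothetical path; use your cluster's actual location):
+
+{% highlight bash %}
+# Place hdfs-site.xml next to yarn-site.xml so the Samza scripts can find it
+mkdir -p ~/.samza/conf
+cp /etc/hadoop/conf/hdfs-site.xml ~/.samza/conf/
+{% endhighlight %}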
+
+### Change properties file
+
+Change the yarn.package.path in the properties file to your HDFS location.
+
+{% highlight jproperties %}
+yarn.package.path=hdfs://<hdfs name node ip>:<hdfs name node port>/path/to/tgz
+{% endhighlight %}
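+
+For example, with the upload path used above and a hypothetical name node at namenode.example.com:8020:
+
+{% highlight jproperties %}
+# namenode.example.com:8020 is a placeholder; substitute your name node's address
+yarn.package.path=hdfs://namenode.example.com:8020/path/for/tgz/samza-job-package-0.7.0-dist.tar.gz
+{% endhighlight %}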
+
+Then you should be able to run the Samza job as described in [hello-samza](../../../startup/hello-samza/{{site.version}}/).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/versioned/index.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/versioned/index.md b/docs/learn/tutorials/versioned/index.md
new file mode 100644
index 0000000..91bddc5
--- /dev/null
+++ b/docs/learn/tutorials/versioned/index.md
@@ -0,0 +1,39 @@
+---
+layout: page
+title: Tutorials
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+[Remote Debugging with Samza](remote-debugging-samza.html)
+
+[Deploying a Samza Job from HDFS](deploy-samza-job-from-hdfs.html)
+
+[Run Hello-samza in Multi-node YARN](run-in-multi-node-yarn.html)
+
+[Run Hello-samza without Internet](run-hello-samza-without-internet.html)
+
+<!-- TODO a bunch of tutorials
+[Log Walkthrough](log-walkthrough.html)
+<a href="configuring-kafka-system.html">Configuring a Kafka System</a><br/>
+<a href="joining-streams.html">Joining Streams</a><br/>
+<a href="sort-stream.html">Sorting a Stream</a><br/>
+<a href="group-by-count.html">Group-by and Counting</a><br/>
+<a href="initialize-close.html">Initializing and Closing</a><br/>
+<a href="windowing.html">Windowing</a><br/>
+<a href="committing.html">Committing</a><br/>
+-->

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/versioned/remote-debugging-samza.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/versioned/remote-debugging-samza.md b/docs/learn/tutorials/versioned/remote-debugging-samza.md
new file mode 100644
index 0000000..b84584e
--- /dev/null
+++ b/docs/learn/tutorials/versioned/remote-debugging-samza.md
@@ -0,0 +1,100 @@
+---
+layout: page
+title: Remote Debugging with Samza
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Let's use Eclipse to attach a remote debugger to a Samza container. If you're an IntelliJ user, you'll have to fill in the blanks, but the process should be pretty similar. This tutorial assumes you've already run through the [Hello Samza](../../../startup/hello-samza/{{site.version}}/) tutorial.
+
+### Get the Code
+
+Start by checking out Samza, so we have access to the source.
+
+{% highlight bash %}
+git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git
+{% endhighlight %}
+
+Next, grab hello-samza.
+
+{% highlight bash %}
+git clone git://git.apache.org/incubator-samza-hello-samza.git
+{% endhighlight %}
+
+### Set Up the Environment
+
+Now, let's set up the Eclipse project files.
+
+{% highlight bash %}
+cd incubator-samza
+./gradlew eclipse
+{% endhighlight %}
+
+Let's also release Samza to Maven's local repository, so hello-samza has access to the JARs that it needs.
+
+{% highlight bash %}
+./gradlew -PscalaVersion=2.9.2 clean publishToMavenLocal
+{% endhighlight %}
+
+Next, open Eclipse, and import the Samza source code into your workspace: "File" &gt; "Import" &gt; "Existing Projects into Workspace" &gt; "Browse". Select the 'incubator-samza' folder, and click 'Finish'.
+
+### Enable Remote Debugging
+
+Now, go back to the hello-samza project, and edit ./samza-job-package/src/main/config/wikipedia-feed.properties to add the following line:
+
+{% highlight jproperties %}
+task.opts=-agentlib:jdwp=transport=dt_socket,address=localhost:9009,server=y,suspend=y
+{% endhighlight %}
+
+The [task.opts](../../documentation/{{site.version}}/jobs/configuration-table.html) configuration parameter lets you set JVM options for your Samza containers at runtime. In this example, we're using the agentlib parameter to enable remote debugging on localhost, port 9009. In a more realistic environment, you might also set Java heap settings (-Xmx, -Xms, etc.), as well as garbage collection and logging settings.
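+
+For illustration, a combined setting might look like the following (the heap and GC flags are hypothetical additions, not something this tutorial requires):
+
+{% highlight jproperties %}
+# Hypothetical example: remote debugging plus heap sizing and GC logging
+task.opts=-Xmx768M -Xms768M -XX:+PrintGCDetails -agentlib:jdwp=transport=dt_socket,address=localhost:9009,server=y,suspend=y
+{% endhighlight %}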
+
+*NOTE: If you're running multiple Samza containers on the same machine, there is a potential for port collisions. You must configure task.opts to assign a different debug port to each Samza job, as sketched below. If a Samza job has more than one container (e.g. if you're using YARN with yarn.container.count=2), those containers must run on different machines.*
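+
+For instance, a second job on the same machine could pick an unused port (9010 here is arbitrary):
+
+{% highlight jproperties %}
+# In the second job's properties file: same options, different debug port
+task.opts=-agentlib:jdwp=transport=dt_socket,address=localhost:9010,server=y,suspend=y
+{% endhighlight %}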
+
+### Start the Grid
+
+Now that the Samza job has been set up to enable remote debugging when a container starts, let's start ZooKeeper, Kafka, and YARN.
+
+{% highlight bash %}
+bin/grid
+{% endhighlight %}
+
+If you get a complaint that JAVA_HOME is not set, then you'll need to set it. This can be done on OSX by running:
+
+{% highlight bash %}
+export JAVA_HOME=$(/usr/libexec/java_home)
+{% endhighlight %}
+
+Once the grid starts, you can start the wikipedia-feed Samza job.
+
+{% highlight bash %}
+mvn clean package
+mkdir -p deploy/samza
+tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza
+deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
+{% endhighlight %}
+
+When the wikipedia-feed job starts up, a single Samza container will be created to process all incoming messages. This is the container that we'll want to connect to from the remote debugger.
+
+### Connect the Remote Debugger
+
+Switch back to Eclipse, and set a break point in TaskInstance.process by clicking on a line inside TaskInstance.process, and clicking "Run" &gt; "Toggle Breakpoint". A blue circle should appear to the left of the line. This will let you see incoming messages as they arrive.
+
+Set up a remote debugging session: "Run" &gt; "Debug Configurations..." &gt; right click on "Remote Java Application" &gt; "New". Set the name to 'wikipedia-feed-debug'. Set the port to 9009 (matching the port in the task.opts configuration). Click "Source" &gt; "Add..." &gt; "Java Project". Select all of the Samza projects that you imported (i.e. samza-api, samza-core, etc.). If you would like to set breakpoints in your own stream task, also add the project that contains your StreamTask implementation. Click 'Debug'.
+
+After a few moments, Eclipse should connect to the wikipedia-feed job and ask you to switch to the Debug perspective. Once in debug mode, you'll see that execution has stopped at the TaskInstance.process method. From here, you can step through code, inspect variable values, etc.
+
+Congratulations, you've got a remote debug connection to your StreamTask!

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/versioned/run-hello-samza-without-internet.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/versioned/run-hello-samza-without-internet.md b/docs/learn/tutorials/versioned/run-hello-samza-without-internet.md
new file mode 100644
index 0000000..e276cdb
--- /dev/null
+++ b/docs/learn/tutorials/versioned/run-hello-samza-without-internet.md
@@ -0,0 +1,78 @@
+---
+layout: page
+title: Run Hello Samza without Internet
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+This tutorial helps you run [Hello Samza](../../../startup/hello-samza/{{site.version}}/) if you cannot connect to the internet.
+
+### Test Your Connection
+
+Check that you can reach irc.wikimedia.org (your company's firewall may block this service) by connecting with telnet:
+
+{% highlight bash %}
+telnet irc.wikimedia.org 6667
+{% endhighlight %}
+
+You should see something like this:
+
+```
+Trying 208.80.152.178...
+Connected to ekrem.wikimedia.org.
+Escape character is '^]'.
+NOTICE AUTH :*** Processing connection to irc.pmtpa.wikimedia.org
+NOTICE AUTH :*** Looking up your hostname...
+NOTICE AUTH :*** Checking Ident
+NOTICE AUTH :*** Found your hostname
+```
+
+Otherwise, you may have a connection problem.
+
+### Use Local Data to Run Hello Samza
+
+We provide an alternative way to get the Wikipedia feed data. Instead of running
+
+{% highlight bash %}
+deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
+{% endhighlight %}
+
+you will run:
+
+{% highlight bash %}
+bin/produce-wikipedia-raw-data.sh
+{% endhighlight %}
+
+This script reads Wikipedia feed data from a local file and produces it to a Kafka broker. By default, it uses localhost:9092 as the Kafka broker and localhost:2181 as the ZooKeeper address. You can override them:
+
+{% highlight bash %}
+bin/produce-wikipedia-raw-data.sh -b yourKafkaBrokerAddress -z yourZookeeperAddress
+{% endhighlight %}
+
+Now you can go back to the Generate Wikipedia Statistics section in [Hello Samza](../../../startup/hello-samza/{{site.version}}/) and follow the remaining steps.
+
+### A Little Explanation
+
+The goal of
+
+{% highlight bash %}
+deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
+{% endhighlight %}
+
+is to deploy a Samza job that listens to the Wikipedia API, receives the feed in real time, and produces it to the Kafka topic wikipedia-raw. The alternative in this tutorial reads a local Wikipedia feed file in an infinite loop and produces the data to the same wikipedia-raw topic. The follow-up job, wikipedia-parser, consumes from the wikipedia-raw topic, so as long as that topic contains correct data, everything works. All Samza jobs are connected only through Kafka and do not depend on each other directly.
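+
+Either way, you can verify that wikipedia-raw is receiving data by tailing the topic with the console consumer in the hello-samza deploy directory:
+
+{% highlight bash %}
+deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
+{% endhighlight %}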
+
+

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/tutorials/versioned/run-in-multi-node-yarn.md
----------------------------------------------------------------------
diff --git a/docs/learn/tutorials/versioned/run-in-multi-node-yarn.md b/docs/learn/tutorials/versioned/run-in-multi-node-yarn.md
new file mode 100644
index 0000000..b5e6dcb
--- /dev/null
+++ b/docs/learn/tutorials/versioned/run-in-multi-node-yarn.md
@@ -0,0 +1,174 @@
+---
+layout: page
+title: Run Hello-samza in Multi-node YARN
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+You must first successfully run the [hello-samza](../../../startup/hello-samza/{{site.version}}/) project on a single-node YARN cluster by following the [hello-samza](../../../startup/hello-samza/{{site.version}}/) tutorial. Now it's time to run the Samza job in a "real" YARN grid (with more than one node).
+
+## Set Up Multi-node YARN
+
+If you already have a multi-node YARN cluster (such as a CDH5 cluster), you can skip this set-up section.
+
+### Basic YARN Setup
+
+1\. Download [YARN 2.3](http://mirror.symnds.com/software/Apache/hadoop/common/hadoop-2.3.0/hadoop-2.3.0.tar.gz) to /tmp and untar it.
+
+{% highlight bash %}
+cd /tmp
+tar -xvf hadoop-2.3.0.tar.gz
+cd hadoop-2.3.0
+{% endhighlight %}
+
+2\. Set up environment variables.
+
+{% highlight bash %}
+export HADOOP_YARN_HOME=$(pwd)
+mkdir conf
+export HADOOP_CONF_DIR=$HADOOP_YARN_HOME/conf
+{% endhighlight %}
+
+3\. Configure the YARN settings file.
+
+{% highlight bash %}
+cp ./etc/hadoop/yarn-site.xml conf
+vi conf/yarn-site.xml
+{% endhighlight %}
+
+Add the following property to yarn-site.xml:
+
+{% highlight xml %}
+<property>
+    <name>yarn.resourcemanager.hostname</name>
+    <!-- hostname that is accessible from all NMs -->
+    <value>yourHostname</value>
+</property>
+{% endhighlight %}
+
+Download and add capacity-scheduler.xml.
+
+{% highlight bash %}
+curl http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/capacity-scheduler.xml?view=co > conf/capacity-scheduler.xml
+{% endhighlight %}
+
+### Set Up HTTP Filesystem for YARN
+
+The goal of these steps is to configure YARN to read from an HTTP filesystem, because we will use an HTTP server to deploy the Samza job package. If you want to deploy the package from HDFS instead, you can skip steps 4 through 6 and follow [Deploying a Samza Job from HDFS](deploy-samza-job-from-hdfs.html).
+
+4\. Download Scala package and untar it.
+
+{% highlight bash %}
+cd /tmp
+curl http://www.scala-lang.org/files/archive/scala-2.10.3.tgz > scala-2.10.3.tgz
+tar -xvf scala-2.10.3.tgz
+{% endhighlight %}
+
+5\. Add the Scala jars and the grizzled-slf4j logging jar.
+
+{% highlight bash %}
+cp /tmp/scala-2.10.3/lib/scala-compiler.jar $HADOOP_YARN_HOME/share/hadoop/hdfs/lib
+cp /tmp/scala-2.10.3/lib/scala-library.jar $HADOOP_YARN_HOME/share/hadoop/hdfs/lib
+curl http://search.maven.org/remotecontent?filepath=org/clapper/grizzled-slf4j_2.10/1.0.1/grizzled-slf4j_2.10-1.0.1.jar > $HADOOP_YARN_HOME/share/hadoop/hdfs/lib/grizzled-slf4j_2.10-1.0.1.jar
+{% endhighlight %}
+
+6\. Add the HTTP configuration to core-site.xml (create the core-site.xml file if it does not exist).
+
+{% highlight bash %}
+vi $HADOOP_YARN_HOME/conf/core-site.xml
+{% endhighlight %}
+
+Add the following code:
+
+{% highlight xml %}
+<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+<configuration>
+    <property>
+      <name>fs.http.impl</name>
+      <value>org.apache.samza.util.hadoop.HttpFileSystem</value>
+    </property>
+</configuration>
+{% endhighlight %}
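+
+Once the job package is being served over HTTP (see the deploy steps below), you can optionally sanity-check this wiring with the hadoop CLI. This sketch assumes the package URL from later in this tutorial is already live and that the conf directory above is on HADOOP_CONF_DIR:
+
+{% highlight bash %}
+# Should fetch the package through the HttpFileSystem configured above
+bin/hadoop fs -copyToLocal http://yourHostname:8000/samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz /tmp/http-fs-check.tar.gz
+{% endhighlight %}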
+
+### Distribute Hadoop Files to Slaves
+
+7\. Copy the Hadoop directory on your host machine to the slave machines (172.21.100.35, in this example), add the slave to conf/slaves, and start YARN:
+
+{% highlight bash %}
+scp -r . 172.21.100.35:/tmp/hadoop-2.3.0
+echo 172.21.100.35 > conf/slaves
+sbin/start-yarn.sh
+{% endhighlight %}
+
+* If you get "172.21.100.35: Error: JAVA_HOME is not set and could not be found.", add a conf/hadoop-env.sh file on the failing machine (172.21.100.35, in this case) containing "export JAVA_HOME=/export/apps/jdk/JDK-1_6_0_27" (or wherever your JAVA_HOME actually is), as sketched below.
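+
+A minimal sketch of that fix, run on the failing slave (the JDK path is just an example; substitute your own):
+
+{% highlight bash %}
+# On the slave: tell Hadoop's scripts where the local JDK lives
+echo "export JAVA_HOME=/export/apps/jdk/JDK-1_6_0_27" >> /tmp/hadoop-2.3.0/conf/hadoop-env.sh
+{% endhighlight %}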
+
+8\. Validate that your nodes are up by visiting http://yourHostname:8088/cluster/nodes.
+
+## Deploy Samza Job
+
+Some of the following steps are identical to those in [hello-samza](../../../startup/hello-samza/{{site.version}}/). You may skip them if you have already done them.
+
+1\. Download Samza and publish it to your local Maven repository.
+
+{% highlight bash %}
+cd /tmp
+git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git
+cd incubator-samza
+./gradlew clean publishToMavenLocal
+cd ..
+{% endhighlight %}
+
+2\. Download the hello-samza project and edit the job properties file.
+
+{% highlight bash %}
+git clone git://github.com/linkedin/hello-samza.git
+cd hello-samza
+vi samza-job-package/src/main/config/wikipedia-feed.properties
+{% endhighlight %}
+
+Change the yarn.package.path property to be:
+
+{% highlight jproperties %}
+yarn.package.path=http://yourHostname:8000/samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz
+{% endhighlight %}
+
+3\. Compile hello-samza.
+
+{% highlight bash %}
+mvn clean package
+mkdir -p deploy/samza
+tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza
+{% endhighlight %}
+
+4\. Deploy the Samza job package to an HTTP server.
+
+Open a new terminal, and run:
+
+{% highlight bash %}
+cd /tmp/hello-samza && python -m SimpleHTTPServer
+{% endhighlight %}
+
+Go back to the original terminal (not the one running the HTTP server). SimpleHTTPServer serves the current directory on port 8000, which matches the yarn.package.path URL you configured above. Now run the job:
+
+{% highlight bash %}
+deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
+{% endhighlight %}
+
+Go to http://yourHostname:8088 and find the wikipedia-feed job. Click on the ApplicationMaster link to see that it's running.
+
+Congratulations! You are now running the Samza job in a "real" YARN grid!
+

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/startup/download/index.md
----------------------------------------------------------------------
diff --git a/docs/startup/download/index.md b/docs/startup/download/index.md
index 6ee6b1f..e08202d 100644
--- a/docs/startup/download/index.md
+++ b/docs/startup/download/index.md
@@ -21,7 +21,7 @@ title: Download
 
 Samza is released as a source artifact, and also through Maven.
 
-If you just want to play around with Samza for the first time, go to [Hello Samza](/startup/hello-samza/0.7.0).
+If you just want to play around with Samza for the first time, go to [Hello Samza](/startup/hello-samza/{{site.version}}).
 
 ### Source Releases
 
@@ -80,7 +80,7 @@ A Maven-based Samza project can pull in all required dependencies Samza dependen
 </dependency>
 {% endhighlight %}
 
-[Hello Samza](/startup/hello-samza/0.7.0) is a working Maven project that illustrates how to build projects that have Samza jobs in them.
+[Hello Samza](/startup/hello-samza/{{site.version}}) is a working Maven project that illustrates how to build projects that have Samza jobs in them.
 
 #### Repositories
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/startup/hello-samza/0.7.0/index.md
----------------------------------------------------------------------
diff --git a/docs/startup/hello-samza/0.7.0/index.md b/docs/startup/hello-samza/0.7.0/index.md
deleted file mode 100644
index 92d5ba2..0000000
--- a/docs/startup/hello-samza/0.7.0/index.md
+++ /dev/null
@@ -1,116 +0,0 @@
----
-layout: page
-title: Hello Samza
----
-<!--
-   Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
--->
-The [hello-samza](https://github.com/apache/incubator-samza-hello-samza) project is a stand-alone project designed to help you run your first Samza job.
-
-### Get the Code
-
-Check out the hello-samza project:
-
-{% highlight bash %}
-git clone git://git.apache.org/incubator-samza-hello-samza.git hello-samza
-cd hello-samza
-{% endhighlight %}
-
-This project contains everything you'll need to run your first Samza jobs.
-
-### Start a Grid
-
-A Samza grid usually comprises three different systems: [YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), [Kafka](http://kafka.apache.org/), and [ZooKeeper](http://zookeeper.apache.org/). The hello-samza project comes with a script called "grid" to help you set up these systems. Start by running:
-
-{% highlight bash %}
-bin/grid bootstrap
-{% endhighlight %}
-
-This command will download, install, and start ZooKeeper, Kafka, and YARN. It will also check out the latest version of Samza and build it. All package files will be put in a sub-directory called "deploy" inside hello-samza's root folder.
-
-If you get a complaint that JAVA_HOME is not set, then you'll need to set it to the path where Java is installed on your system.
-
-Once the grid command completes, you can verify that YARN is up and running by going to [http://localhost:8088](http://localhost:8088). This is the YARN UI.
-
-### Build a Samza Job Package
-
-Before you can run a Samza job, you need to build a package for it. This package is what YARN uses to deploy your jobs on the grid.
-
-{% highlight bash %}
-mvn clean package
-mkdir -p deploy/samza
-tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza
-{% endhighlight %}
-
-### Run a Samza Job
-
-After you've built your Samza package, you can start a job on the grid using the run-job.sh script.
-
-{% highlight bash %}
-deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
-{% endhighlight %}
-
-The job will consume a feed of real-time edits from Wikipedia, and produce them to a Kafka topic called "wikipedia-raw". Give the job a minute to start up, and then tail the Kafka topic:
-
-{% highlight bash %}
-deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic wikipedia-raw
-{% endhighlight %}
-
-Pretty neat, right? Now, check out the YARN UI again ([http://localhost:8088](http://localhost:8088)). This time around, you'll see your Samza job is running!
-
-If you cannot see any output from the Kafka consumer, you may have a connection problem. Check [here](../../../learn/tutorials/0.7.0/run-hello-samza-without-internet.html).
-
-### Generate Wikipedia Statistics
-
-Let's calculate some statistics based on the messages in the wikipedia-raw topic. Start two more jobs:
-
-{% highlight bash %}
-deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-parser.properties
-deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-stats.properties
-{% endhighlight %}
-
-The first job (wikipedia-parser) parses the messages in wikipedia-raw, and extracts information about the size of the edit, who made the change, etc. You can take a look at its output with:
-
-{% highlight bash %}
-deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic wikipedia-edits
-{% endhighlight %}
-
-The last job (wikipedia-stats) reads messages from the wikipedia-edits topic, and calculates counts, every ten seconds, for all edits that were made during that window. It outputs these counts to the wikipedia-stats topic.
-
-{% highlight bash %}
-deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic wikipedia-stats
-{% endhighlight %}
-
-The messages in the stats topic look like this:
-
-{% highlight json %}
-{"is-talk":2,"bytes-added":5276,"edits":13,"unique-titles":13}
-{"is-bot-edit":1,"is-talk":3,"bytes-added":4211,"edits":30,"unique-titles":30,"is-unpatrolled":1,"is-new":2,"is-minor":7}
-{"bytes-added":3180,"edits":19,"unique-titles":19,"is-unpatrolled":1,"is-new":1,"is-minor":3}
-{"bytes-added":2218,"edits":18,"unique-titles":18,"is-unpatrolled":2,"is-new":2,"is-minor":3}
-{% endhighlight %}
-
-If you check the YARN UI again, you'll see that all three jobs are now listed.
-
-### Shutdown
-
-After you're done, you can clean everything up using the same grid script.
-
-{% highlight bash %}
-bin/grid stop all
-{% endhighlight %}
-
-Congratulations! You've now set up a local grid that includes YARN, Kafka, and ZooKeeper, and run a Samza job on it. Next up, check out the [Background](/learn/documentation/0.7.0/introduction/background.html) and [API Overview](/learn/documentation/0.7.0/api/overview.html) pages.