Posted to commits@samza.apache.org by cr...@apache.org on 2013/08/12 18:21:41 UTC

[11/15] initial import.

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/comparisons/mupd8.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/comparisons/mupd8.md b/docs/learn/documentation/0.7.0/comparisons/mupd8.md
new file mode 100644
index 0000000..bb0d5a1
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/comparisons/mupd8.md
@@ -0,0 +1,72 @@
+---
+layout: page
+title: MUPD8
+---
+
+*People generally want to know how similar systems compare. We've done our best to fairly contrast the feature sets of Samza with other systems. But we aren't experts in these frameworks, and we are, of course, totally biased. If we have goofed anything let us know and we will correct it.*
+
+### Durability
+
+MUPD8 makes no durability or delivery guarantees. Within MUPD8, stream processor tasks receive messages at most once. Samza uses Kafka for messaging, which guarantees message delivery.
+
+### Ordering
+
+As with durability, developers would ideally like their stream processors to receive messages in exactly the order that they were written.
+
+We don't entirely follow MUPD8's description of their ordering guarantees, but it seems to guarantee that all messages will be processed in the order in which they are written to MUPD8 queues, which is comparable to Kafka and Samza's guarantee.
+
+### Buffering
+
+A critical issue for handling large data flows is handling back pressure when one downstream processing stage gets slow.
+
+MUPD8 buffers messages in an in-memory queue when passing messages between two MUPD8 tasks. When a queue fills up, developers are left to either drop the messages on the floor, log the messages to local disk, or block until the queue frees up. All of these options are sub-optimal. Dropping messages destroys durability guarantees. Blocking your stream processor can result in back pressure, where the slowest processor blocks all upstream processors, which in turn block their upstream processors, until the whole system comes to a grinding halt. Logging to local disk is the most reasonable option, but when a fault occurs, those messages will be lost on failover.
+
+By adopting Kafka's broker as a remote buffer, Samza solves all of these problems. It doesn't need to block because consumers and producers are decoupled, using the Kafka brokers' disks as async buffers. Messages shouldn't be dropped because Kafka's 0.8 brokers should be highly available. In the event of a failure, when a Samza job resumes on another system, its input and output are not lost because they are stored remotely on replicated Kafka brokers.
+
+### State Management
+
+Stream processors frequently will accrue state as they process messages. For example, they might be incrementing a counter when a certain type of message is seen. They might also be storing messages in memory while trying to join them with messages from another stream (e.g. ad impressions vs. ad clicks). A design decision that needs to be made is how (if at all) to handle this in-memory state in situations where a failure occurs.
+
+MUPD8 uses a write-back caching strategy to manage in-memory state that is periodically written back to Cassandra.
+
+Samza maintains state locally with the task. This allows state larger than will fit in memory. State is persisted to an output stream for recovery purposes should the task fail. In the long run, we believe this design will be better suited to strong fault tolerance semantics, as the change log captures the evolution of state, allowing a task to be restored to a consistent point in time.
+
+### Deployment and execution
+
+MUPD8 includes a custom execution framework. The functionality that this framework supports in terms of users and resource limits isn't clear to us.
+
+Samza simply leverages YARN to deploy user code, and execute it in a distributed environment.
+
+### Fault Tolerance
+
+What should a stream processing system do when a machine or processor fails?
+
+MUPD8 uses its custom rolled equivalent to YARN to manage fault tolerance. When a stream processor is unable to send a message to a downstream processor, it notifies MUPD8's coordinator, and all other machines are notified. The machines then send all messages to a new machine based on the key hash that's used. Messages and state can both be lost when this happens.
+
+Samza uses YARN to manage fault tolerance. YARN will detect when nodes or Samza tasks fail, and will notify Samza's [ApplicationMaster](../yarn/application-master.html). At that point, it's up to Samza to decide what to do. Generally, this means re-starting the task on another machine. Since messages are persisted to Kafka brokers remotely, and there are no in-memory queues, no messages should be lost unless the processors are using async Kafka producers.
+
+### Workflow
+
+Sometimes more than one job or processing stage is needed to accomplish something. This is the case when you wish to re-partition a stream, for example. MUPD8 has a custom workflow system set up to define how to execute multiple jobs at once, and how to feed stream data from one into the other.
+
+Samza makes the individual jobs the level of granularity of execution. Jobs communicate via named input and output streams. This implicitly defines a data flow graph between all running jobs. We chose this model to enable data flow graphs with processing stages owned by different engineers on different teams working in different code bases without the need to wire everything together into a single topology.
+
+This was motivated by our experience with Hadoop where the data flow between jobs is implicitly wired together by their input and output directories. We have had good experience making this decentralized model work well.
+
+### Memory
+
+MUPD8 executes all of its map/update processors inside a single JVM, using threads. This should shrink the memory footprint of a stream processor by amortizing JVM overhead across the number of processors currently being executed.
+
+Samza tends to use more memory since it has distinct JVMs for each stream processor container ([TaskRunner](../container/task-runner.html)), rather than running multiple stream processors in the same JVM, which is what MUPD8 does. The benefit of having separate processes, however, is isolation.
+
+### Isolation
+
+MUPD8 provides no stream processor isolation. A single badly behaved stream processor can bring down all processors on the node.
+
+Samza uses process level isolation between stream processor tasks. This is also the approach that Hadoop takes. We can enforce strict per-process memory footprints. In addition, Samza supports CPU limits when used with YARN CGroups. As YARN CGroup maturity progresses, the possibility to support disk and network CGroup limits should become available as well.
+
+### Further Reading
+
+The MUPD8 team has published a very good [paper](http://vldb.org/pvldb/vol5/p1814_wanglam_vldb2012.pdf) on the design of their system.
+
+## [Storm »](storm.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/comparisons/storm.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/comparisons/storm.md b/docs/learn/documentation/0.7.0/comparisons/storm.md
new file mode 100644
index 0000000..0372d7c
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/comparisons/storm.md
@@ -0,0 +1,94 @@
+---
+layout: page
+title: Storm
+---
+
+*People generally want to know how similar systems compare. We've done our best to fairly contrast the feature sets of Samza with other systems. But we aren't experts in these frameworks, and we are, of course, totally biased. If we have goofed anything let us know and we will correct it.*
+
+[Storm](http://storm-project.net/) and Samza are fairly similar. Both systems provide many of the same features: a partitioned stream model, a distributed execution environment, an API for stream processing, fault tolerance, Kafka integration, etc.
+
+### Ordering and Guarantees
+
+Storm has more conceptual building blocks than Samza. "Spouts" in Storm are similar to Streams in Samza, and Samza does not have an equivalent of their transient zeromq communication.
+
+There are also several approaches to handling delivery guarantees.
+
+The primary approach is implemented by keeping a record of all emitted records in memory until they are acknowledged by all elements of a particular processing graph. In this mode, messages that time out are re-emitted. This seems to imply that messages can be processed out of order. This mechanism requires some co-operation from the user code, which must maintain the ancestry of records in order to properly acknowledge its input. This is detailed in depth on [Storm's wiki](https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing).
+
+Out-of-order processing is a problem for handling keyed data. For example, if you have a stream of database updates where later updates may replace earlier updates, then reordering them may change the output.
+
+This mechanism also implies that individual stages may produce back pressure up the processing graph, so the graphs are probably mostly limited to a single logical function. However, multiple graphs could likely be stitched together using Spouts in between to buffer.
+
+Storm offers a secondary approach to delivery guarantees called [transactional topologies](https://github.com/nathanmarz/storm/wiki/Transactional-topologies). These require an underlying system similar to Kafka that maintains strongly sequenced messages. Transactional topologies seem to be limited to a single input stream.
+
+Samza always offers guaranteed delivery and ordering of input within a stream partition. We make no guarantee of ordering between different input streams or input stream partitions. Since all stages are replayable, there is no need for the user code to track its ancestry.
+
+Like Storm's transactional topologies, Samza provides an "offset", which is a sequential integer uniquely denoting the message in that stream partition. That is, the first message in a stream partition has offset 0, the second offset 1, etc. Samza always records the position of a job in its input streams as a vector of offsets for the input stream partitions it consumes.
+
+Storm has integrated these transaction ids into some of its storage abstractions to help with deduplicating updates. We have a different take on ensuring the semantics of output in the presence of failures; however, we have not yet implemented this.
+
+### State Management
+
+We are not aware of any state management facilities in Storm though transactional topologies have plugins for external storage to use the transaction id for deduping. In this case, Storm will manage only the metadata necessary to make a topology transactional. It's still up to the Bolt implementer to handle transaction IDs, and store state in a remote database, somewhere.
+
+Samza provides [built-in primitives](../container/state-management.html) for managing large amounts of state.
+
+### Partitioning and Parallelism
+
+Storm's [parallelism model](https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology) maps fairly closely to Samza's. The biggest difference is that Samza holds only a single job per process, and the process is single-threaded regardless of the number of tasks it contains. Storm's more optimistic parallelism model can make better use of excess capacity on an idle machine. However, this significantly complicates the resource model. In Samza, since each container maps exactly to a CPU core, a job run in 100 containers will use 100 CPU cores. This allows us to better model the CPU usage on a machine and ensure that we don't see uneven performance based on the other tasks that happen to be collocated on that machine.
+
+Storm supports "dynamic rebalancing", which means adding more threads or processes to a topology without restarting the topology or cluster. This is a convenient feature, especially during development. We haven't added this yet, as philosophically we feel that these kinds of changes should go through a normal configuration management process (i.e. version control, notification, etc.) as they impact production performance. In other words, the jobs + configs should fully recreate the state of the cluster.
+
+### Deployment & Execution
+
+A Storm cluster is composed of a series of nodes running a "Supervisor" daemon. The supervisor daemons talk to a single master node running a daemon called "Nimbus". The Nimbus daemon is responsible for assigning work and managing resources in the cluster. See Storm's [Tutorial](https://github.com/nathanmarz/storm/wiki/Tutorial) page for details. This is quite similar to YARN; though YARN is a bit more fully featured and intended to be multi-framework, Nimbus is better integrated with Storm.
+
+Yahoo! has also released [Storm-YARN](https://github.com/yahoo/storm-yarn). As described in [this Yahoo! blog post](http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html), Storm-YARN is a wrapper that starts a single Storm cluster (complete with Nimbus, and Supervisors) inside a YARN grid.
+
+Anyone familiar with YARN will recognize the similarity between Storm's "Nimbus" daemon, and YARN's ResourceManager, and Storm's "Supervisor" daemon, and YARN's Node Managers. Rather than writing its own resource management framework, or running a second one inside of YARN, Samza simply uses YARN directly, as a first-class citizen in the Samza ecosystem. YARN is stable, well adopted, fully-featured, and inter-operable with Hadoop. It also provides a bunch of nice features like Security, CGroup process isolation, etc.
+
+### Language Support
+
+Storm is written in Java and Clojure but has good support for non-JVM languages. It follows a model similar to MapReduce Streaming by piping input and output streams fed to externally managed processes.
+
+On top of this, Storm provides [Trident](https://github.com/nathanmarz/storm/wiki/Trident-tutorial), a DSL that's meant to make writing Storm topologies easier.
+
+Samza is built with language support in mind, but currently only supports JVM languages.
+
+### Workflow
+
+Storm provides modeling of "Topologies" (a processing graph of multiple stages) [in code](https://github.com/nathanmarz/storm/wiki/Tutorial). This manual wiring together of the flow can serve as nice documentation of the processing flow.
+
+Each job in a Samza graph is an independent entity that communicates with other jobs through a named stream rather than manually wiring them together. All the jobs on a cluster comprise a single (potentially disconnected) data flow graph. Each job can be stopped or started independently and there is no code coupling between jobs.
+
+### Maturity
+
+We can't speak to Storm's maturity, but it has an [impressive number of adopters](https://github.com/nathanmarz/storm/wiki/Powered-By), a strong feature set, and seems to be under active development. It integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).
+
+Samza is pretty immature, though it builds on solid components. YARN is fairly new, but is already being run on 3000+ node clusters at Yahoo!, and the project is under active development by both [Hortonworks](http://hortonworks.com/) and [Cloudera](http://www.cloudera.com/content/cloudera/en/home.html). Kafka has a strong [powered by](https://cwiki.apache.org/KAFKA/powered-by.html) page, and has seen its share of adoption, recently. It's also frequently used with Storm. Samza is a brand new project that is in use at LinkedIn. Our hope is that others will find it useful, and adopt it as well.
+
+### Buffering & Latency
+
+Within a single topology, Storm has producers and consumers, but no broker (to use Kafka's terminology). This design decision leads to a number of interesting properties.
+
+Since Storm uses ZeroMQ without intermediate brokers, the transmission of messages from one Bolt to another is extremely low latency. It's just a network hop.
+
+On the flip side, when a Bolt is trying to send messages using ZeroMQ, and the consumer can't read them fast enough, the ZeroMQ buffer in the producer's process begins to fill up with messages. When it becomes full, you have the option to drop them, log to local disk, or block until space becomes available again. These options are outlined in the [MUPD8 comparison](mupd8) page, as well, and none of them are ideal. This style of stream processing runs the risk of completely grinding to a halt (or dropping messages) if a single Bolt has a throughput issue. This problem is commonly known as back pressure. When back pressure occurs, Storm essentially offloads the problem to the Spout implementation. In cases where the Spout can't handle large volumes of back-logged messages, the same problem occurs. In systems like Kafka, where large volumes of backlogged messages are supported, the entire topology just reads messages from the spout at a lower rate.
+
+A lack of a broker between bolts also adds complexity when trying to deal with fault tolerance and messaging semantics. Storm has a very well written page on [Transactional Topologies](https://github.com/nathanmarz/storm/wiki/Transactional-topologies) that describes this problem, and Storm's solution, in depth.
+
+Samza takes a different approach to buffering. We buffer to disk at every hop between StreamTasks. This decision, and its trade-offs, are described in detail on the [Comparison Introduction](introduction.html) page's "stream model" section. This design decision lets us cheat a little bit when it comes to things like durability guarantees and exactly once messaging semantics, but it comes at the price of increased latency, since everything must be written to disk in Kafka.
+
+### Isolation
+
+Storm provides standard UNIX process-level isolation. Your topology can impact another topology's performance (or vice-versa) if too much CPU, disk, network, or memory is used.
+
+Samza relies on YARN to provide resource-level isolation. Currently, YARN provides explicit controls for memory and CPU limits (through [CGroups](../yarn/isolation.html)), and both have been used successfully with Samza. No isolation for disk or network is provided by YARN at this time.
+
+### Data Model
+
+Storm models all messages as "Tuples" with a defined data model but pluggable serialization.
+
+Samza's serialization and data model are both pluggable. We are not terribly opinionated about which approach is best.
+
+## [API Overview »](../api/overview.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/checkpointing.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/checkpointing.md b/docs/learn/documentation/0.7.0/container/checkpointing.md
new file mode 100644
index 0000000..b0cd04f
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/checkpointing.md
@@ -0,0 +1,45 @@
+---
+layout: page
+title: Checkpointing
+---
+
+On the [Streams](streams.html) page, one important detail was glossed over. When a TaskRunner instantiates a StreamConsumer for an input stream/partition pair, how does the TaskRunner know where in the stream to start reading messages? If you recall, Kafka has the concept of an offset, which defines a specific location in a topic/partition pair. The idea is that an offset can be used to reference a specific point in a stream/partition pair. When you read messages from Kafka, you can supply an offset to specify the point from which you'd like to read. After you read, you increment your offset, and get the next message.
+
+![diagram](/img/0.7.0/learn/documentation/container/checkpointing.png)
+
+This diagram looks the same as on the [Streams](streams.html) page, except that there are black lines at different points in each input stream/partition pair. These lines represent the current offset for each stream consumer. As the stream consumer reads, the offset increases, and moves closer to the "head" of the stream. The diagram also illustrates that the offsets might be staggered, such that some offsets are farther along in their stream/partition than others.
+
+If a StreamConsumer is reading messages for a TaskRunner, and the TaskRunner stops for some reason (due to hardware failure, re-deployment, or whatever), the StreamConsumer should start where it left off when the TaskRunner starts back up again. We're able to do this because Kafka buffers messages on a remote server (the broker). Since the messages are available when we come back, we can just start from our last offset, and continue moving forward, without losing data.
+
+The TaskRunner supports this ability using something called a CheckpointManager.
+
+```
+public interface CheckpointManager {
+  public void writeCheckpoint(Partition partition, Checkpoint checkpoint);
+
+  public Checkpoint readLastCheckpoint(Partition partition);
+
+  public void close();
+}
+
+public class Checkpoint {
+  private final Map<String, String> offsets;
+  ...
+}
+```
+
+As you can see, the checkpoint manager provides a way to write out checkpoints for a given partition. Right now, the checkpoint contains a map. The map's keys are input stream names, and the map's values are each input stream's offset. Each checkpoint is managed per-partition. For example, if you have page-view-event and service-metric-event defined as streams in your Samza job's configuration file, the TaskRunner would supply a checkpoint with two keys in each checkpoint offset map (one for page-view-event and the other for service-metric-event).
+
+Samza provides two checkpoint managers: FileSystemCheckpointManager and KafkaCheckpointManager. The KafkaCheckpointManager is what you generally want to use. The way that KafkaCheckpointManager works is as follows: it writes checkpoint messages for your Samza job to a special Kafka topic. This topic's name is \_\_samza\_checkpoint\_your-job-name. For example, if you had a Samza job called "my-first-job", the Kafka topic would be called \_\_samza\_checkpoint\_my-first-job. This Kafka topic is partitioned identically to your Samza job's partition count. If your Samza job has 10 partitions, the checkpoint topic for your Samza job will also have 10 partitions. Every time that the TaskRunner calls writeCheckpoint, a checkpoint message will be sent to the partition that corresponds with the partition for the checkpoint that the TaskRunner wishes to write.
+
+![diagram](/img/0.7.0/learn/documentation/container/checkpointing-2.png)
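+
+The naming and partitioning rule described above can be sketched in a few lines. This is only an illustration of the rule as stated on this page; the class and method names are hypothetical and are not part of Samza.
+
+```java
+/** Illustrative helper; the class and method names are hypothetical. */
+public class CheckpointTopicSketch {
+  /** Builds the checkpoint topic name for a job, as described in the text. */
+  static String checkpointTopic(String jobName) {
+    return "__samza_checkpoint_" + jobName;
+  }
+
+  /** A checkpoint for a job partition goes to the same-numbered topic partition. */
+  static int checkpointTopicPartition(int jobPartition) {
+    return jobPartition;
+  }
+
+  public static void main(String[] args) {
+    System.out.println(checkpointTopic("my-first-job"));
+    System.out.println(checkpointTopicPartition(7));
+  }
+}
+```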
+
+When the TaskRunner starts for the first time, the offset behavior of the StreamConsumers is undefined. If the system for the StreamConsumer is Kafka, we fall back to the autooffset.reset setting. If autooffset.reset is set to "largest", we start reading messages from the head of the stream; if it's set to "smallest", we read from the tail. If it's undefined, the TaskRunner will fail.
+
+The TaskRunner calls writeCheckpoint at a windowed interval (e.g. every 10 seconds). If the TaskRunner fails, and restarts, it simply calls readLastCheckpoint for each partition. In the case of the KafkaCheckpointManager, this readLastCheckpoint method will read the last message that was written to the checkpoint topic for each partition in the job. One edge case to consider is that StreamConsumers might have read messages from an offset that hasn't yet been checkpointed. In such a case, when the TaskRunner reads the last checkpoint for each partition, the offsets might be farther back in the stream. When this happens, your StreamTask could get duplicate messages (i.e. it saw message X, failed, restarted at an offset prior to message X, and then reads message X again). Thus, Samza currently provides at least once messaging. You might get duplicates. Caveat emptor.
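+
+The duplicate-delivery edge case can be made concrete with a small simulation. The sketch below does not use Samza's classes: it models a stream as a list, checkpoints after a fixed number of messages rather than on a timer, and simulates a failure by stopping early. All names in it are hypothetical.
+
+```java
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+/** Hypothetical, self-contained simulation; these are not Samza's real classes. */
+public class AtLeastOnceSketch {
+  /**
+   * Processes messages starting at startOffset, checkpointing after every
+   * checkpointEvery messages, and stops once failAfter messages have been
+   * processed to simulate a crash. Returns the last checkpointed offset,
+   * which is where a restarted task would resume.
+   */
+  static int run(List<String> stream, int startOffset, int checkpointEvery,
+                 int failAfter, List<String> seen) {
+    int checkpoint = startOffset;
+    int processed = 0;
+    for (int offset = startOffset; offset < stream.size(); offset++) {
+      if (processed == failAfter) {
+        return checkpoint; // crash: progress since the last checkpoint is forgotten
+      }
+      seen.add(stream.get(offset));
+      processed++;
+      if (processed % checkpointEvery == 0) {
+        checkpoint = offset + 1; // next offset to read after a restart
+      }
+    }
+    return checkpoint;
+  }
+
+  public static void main(String[] args) {
+    List<String> stream = Arrays.asList("m0", "m1", "m2", "m3", "m4");
+    List<String> seen = new ArrayList<>();
+    int resumeFrom = run(stream, 0, 2, 3, seen);         // fails after 3 messages
+    run(stream, resumeFrom, 2, Integer.MAX_VALUE, seen); // restarts from the checkpoint
+    System.out.println(seen);
+  }
+}
+```
+
+In this run the task fails after processing m2, but the last checkpoint was written after m1, so the restarted task resumes at offset 2 and sees m2 a second time.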
+
+<!-- TODO Add a link to the fault tolerance SEP when one exists -->
+
+*Note that there are design proposals in the works to give exactly once messaging.*
+
+## [State Management &raquo;](state-management.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/event-loop.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/event-loop.md b/docs/learn/documentation/0.7.0/container/event-loop.md
new file mode 100644
index 0000000..4069ef0
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/event-loop.md
@@ -0,0 +1,61 @@
+---
+layout: page
+title: Event Loop
+---
+
+The event loop is the [TaskRunner](task-runner.html)'s single thread that is in charge of [reading](streams.html), [writing](streams.html), [metrics flushing](metrics.html), [checkpointing](checkpointing.html), and [windowing](windowing.html). It's the code that puts all of this stuff together. Each StreamConsumer reads messages on its own thread, but writes messages into a centralized message queue. The TaskRunner uses this queue to funnel all of the messages into the event loop. Here's how the event loop works:
+
+1. Take a message from the incoming message queue (the queue into which the StreamConsumers put their messages)
+2. Give the message to the appropriate StreamTask by calling process() on it
+3. Send any StreamTask output from the process() call to the appropriate StreamProducers
+4. Call window() on the StreamTask if it implements WindowableTask, and the window time has expired
+5. Send any StreamTask output from the window() call to the appropriate StreamProducers
+6. Write checkpoints for any partitions that are past the defined checkpoint commit interval
+
+The TaskRunner does this, in a loop, until it is shut down.
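+
+The six steps above can be sketched as a single-threaded loop. This is a simplified illustration, not the TaskRunner's actual code: the `Task` interface and `CountingTask` class are hypothetical stand-ins, messages are plain strings, and the window and checkpoint intervals are counted in messages rather than in elapsed time.
+
+```java
+import java.util.ArrayDeque;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Queue;
+
+/** Hypothetical stand-ins for illustration; these are not Samza's real types. */
+public class EventLoopSketch {
+  interface Task {
+    List<String> process(String message); // output produced for one input message
+    List<String> window();                // output produced when the window fires
+  }
+
+  /** Upper-cases each message and reports how many it saw in each window. */
+  static class CountingTask implements Task {
+    private int count = 0;
+
+    public List<String> process(String message) {
+      count++;
+      return Arrays.asList(message.toUpperCase());
+    }
+
+    public List<String> window() {
+      List<String> out = Arrays.asList("count=" + count);
+      count = 0;
+      return out;
+    }
+  }
+
+  /** Drains the queue, windowing and checkpointing every `interval` messages. */
+  static List<String> runLoop(Queue<String> incoming, Task task, int interval) {
+    List<String> sent = new ArrayList<>();
+    int sinceLast = 0;
+    while (!incoming.isEmpty()) {          // the TaskRunner would loop until shut down
+      String message = incoming.poll();    // step 1
+      sent.addAll(task.process(message));  // steps 2 and 3
+      if (++sinceLast >= interval) {
+        sent.addAll(task.window());        // steps 4 and 5
+        sent.add("checkpoint");            // step 6 (stand-in for a checkpoint write)
+        sinceLast = 0;
+      }
+    }
+    return sent;
+  }
+
+  public static void main(String[] args) {
+    Queue<String> incoming = new ArrayDeque<>(Arrays.asList("a", "b", "c"));
+    System.out.println(runLoop(incoming, new CountingTask(), 2));
+  }
+}
+```
+
+Because everything happens on one thread, a slow process() call delays windowing and checkpointing for every message behind it in the queue.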
+
+### Lifecycle Listeners
+
+Sometimes, it's useful to receive notifications when a specific event happens in the TaskRunner. For example, you might want to reset some context in the container whenever a new message arrives. To accomplish this, Samza provides a TaskLifecycleListener interface, that can be wired into the TaskRunner through configuration.
+
+```
+/**
+ * Used to get before/after notifications before initializing/closing all tasks
+ * in a given container (JVM/process).
+ */
+public interface TaskLifecycleListener {
+  /**
+   * Called before all tasks in TaskRunner are initialized.
+   */
+  void beforeInit(Config config, TaskContext context);
+
+  /**
+   * Called after all tasks in TaskRunner are initialized.
+   */
+  void afterInit(Config config, TaskContext context);
+
+  /**
+   * Called before a message is processed by a task.
+   */
+  void beforeProcess(MessageEnvelope envelope, Config config, TaskContext context);
+
+  /**
+   * Called after a message is processed by a task.
+   */
+  void afterProcess(MessageEnvelope envelope, Config config, TaskContext context);
+
+  /**
+   * Called before all tasks in TaskRunner are closed.
+   */
+  void beforeClose(Config config, TaskContext context);
+
+  /**
+   * Called after all tasks in TaskRunner are closed.
+   */
+  void afterClose(Config config, TaskContext context);
+}
+```
+
+The TaskRunner will notify any lifecycle listeners whenever one of these events occurs. Usually, you don't really need to worry about lifecycle, but it's there if you need it.
+
+## [JMX &raquo;](jmx.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/index.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/index.md b/docs/learn/documentation/0.7.0/container/index.md
new file mode 100644
index 0000000..17751de
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/index.md
@@ -0,0 +1,18 @@
+---
+layout: page
+title: Container
+---
+
+The API section shows how a Samza StreamTask is written. To execute a StreamTask, Samza has a container that wraps around your StreamTask. The Samza container manages:
+
+* Metrics
+* Configuration
+* Lifecycle
+* Checkpointing
+* State management
+* Serialization
+* Data transport
+
+This container is called a TaskRunner. Read on to learn more about Samza's TaskRunner.
+
+## [JobRunner &raquo;](job-runner.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/jmx.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/jmx.md b/docs/learn/documentation/0.7.0/container/jmx.md
new file mode 100644
index 0000000..a9fcc77
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/jmx.md
@@ -0,0 +1,13 @@
+---
+layout: page
+title: JMX
+---
+
+The Samza TaskRunner (and YARN Application Master) will turn on JMX using a randomly selected port, since Samza is meant to be run in a distributed environment, and it's unknown which ports will be available prior to runtime. The port will be output in the TaskRunner's logs with a line like this:
+
+    2013-07-05 20:42:36 JmxServer [INFO] According to InetAddress.getLocalHost.getHostName we are Chriss-MacBook-Pro.local
+    2013-07-05 20:42:36 JmxServer [INFO] Started JmxServer port=64905 url=service:jmx:rmi:///jndi/rmi://Chriss-MacBook-Pro.local:64905/jmxrmi
+
+Any metrics that are registered in the TaskRunner will be visible through JMX. To toggle JMX, see the [Configuration](../jobs/configuration.html) section.
+
+## [JobRunner &raquo;](../jobs/job-runner.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/metrics.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/metrics.md b/docs/learn/documentation/0.7.0/container/metrics.md
new file mode 100644
index 0000000..4a3e403
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/metrics.md
@@ -0,0 +1,50 @@
+---
+layout: page
+title: Metrics
+---
+
+Samza also provides a metrics library that the TaskRunner uses. It allows a StreamTask to create counters and gauges. The TaskRunner then writes those metrics to metrics infrastructure through a MetricsReporter implementation.
+
+```
+public class MyJavaStreamerTask implements StreamTask, InitableTask {
+  // Assigned in init(), so the field cannot be static or final.
+  private Counter messageCount;
+
+  public void init(Config config, TaskContextPartition context) {
+    this.messageCount = context.getMetricsRegistry().newCounter(MyJavaStreamerTask.class.toString(), "MessageCount");
+  }
+
+  @Override
+  public void process(MessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
+    System.out.println(envelope.getMessage().toString());
+    messageCount.inc();
+  }
+}
+```
+
+Samza's metrics design is very similar to Coda Hale's [metrics](https://github.com/codahale/metrics) library. It has two important interfaces:
+
+```
+public interface MetricsRegistry {
+  Counter newCounter(String group, String name);
+
+  <T> Gauge<T> newGauge(String group, String name, T value);
+}
+
+public interface MetricsReporter {
+  void report(MessageCollector collector, ReadableMetricsRegistry registry, long currentTimeMillis, Partition partition);
+}
+```
+
+### MetricsRegistry
+
+When the TaskRunner starts up, as with StreamTask instantiation, it creates a MetricsRegistry for every partition in the Samza job.
+
+![diagram](/img/0.7.0/learn/documentation/container/metrics.png)
+
+The TaskRunner, itself, also gets a MetricsRegistry that it can use to create counters and gauges. It uses this registry to measure a lot of relevant metrics for itself.
+
+### MetricsReporter
+
+The other important interface is the MetricsReporter. The TaskRunner uses MetricsReporter implementations to send its MetricsRegistry counters and gauges to whatever metrics infrastructure the reporter uses. A Samza job's configuration determines which MetricsReporters the TaskRunner will use. Out of the box, Samza comes with a MetricsSnapshotReporter that sends JSON metrics messages to a Kafka topic.
+
+## [Windowing &raquo;](windowing.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/state-management.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/state-management.md b/docs/learn/documentation/0.7.0/container/state-management.md
new file mode 100644
index 0000000..0e3b9b1
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/state-management.md
@@ -0,0 +1,115 @@
+---
+layout: page
+title: State Management
+---
+
+Samza allows tasks to maintain persistent, mutable state that is physically co-located with each task. The state is highly available: in the event of a task failure it will not be lost when the task fails over to another machine.
+
+A key-value store implementation is provided out of the box that covers many use cases. Other store implementations can be plugged in for different types of storage.
+
+State is naturally partitioned with the tasks, with one store per task. When there is a backing changelog, the stream will also be co-partitioned with the tasks. Possible extensions to handle non-partitioned state (e.g. a global lookup dictionary) are discussed at the end.
+
+Restoring state can be done either by having a dedicated stream that captures the changes to the local store, or by rebuilding the state off the input streams.
+
+### Use Cases
+
+We have a few use-cases in mind for this functionality.
+
+#### Windowed Aggregation
+
+Example: Counting member page views by hour
+
+Implementation: The stream is partitioned by the aggregation key (member\_id). Each new input record would cause the job to retrieve and update the aggregate (the page view count). When the window is complete (i.e. the hour is over), the job outputs the current stored aggregate value.
+
+#### Table-Table Join
+
+Example: Join profile to user\_settings by member\_id and emit the joined stream
+
+Implementation: The job subscribes to the change streams for profile and for user\_settings, both partitioned by member\_id. The job keeps a local store containing both the profile and settings data. When a record comes in from either profile or settings, the job looks up the value for that member and updates the appropriate section (either profile or settings). The changelog for the local store can be used as the output stream if the desired output stream is simply the join of the two inputs.
+
+#### Table-Stream Join
+
+Example: Join member geo region to page view data
+
+Implementation: The job subscribes to the profile stream (for geo) and page views stream, both partitioned by member\_id. It keeps a local store of member\_id => geo that it updates off the profile feed. When a page view arrives it does a lookup in this store to join on the geo data.
+
+#### Stream-Stream Join
+
+Example: Join ad clicks to ad impressions by some shared key
+
+Implementation: Partition ad click and ad impression by the join key. Keep a store of unmatched clicks and unmatched impressions. When a click comes in try to find its matching impression in the impression store, and vice versa when an impression comes in check the click store. If a match is found emit the joined pair and delete the entry. If no match is found store the event to wait for a match. Since this is presumably a left outer join (i.e. every click has a corresponding impression but not vice versa) we will periodically scan the impression table and delete old impressions for which no click arrived.
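+
+The matching logic described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name, the method names, and the use of in-memory maps in place of Samza's local stores are all hypothetical, and the joined record is represented as a simple concatenated string.
+
+```java
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.Map;
+
+// Hypothetical sketch: plain maps stand in for the task's local stores.
+final class ClickImpressionJoinSketch {
+  // Unmatched events, keyed by the shared join key.
+  private final Map<String, String> unmatchedClicks = new HashMap<>();
+  private final Map<String, String> unmatchedImpressions = new HashMap<>();
+  // Arrival time of each buffered impression, used when expiring old entries.
+  private final Map<String, Long> impressionArrivalMs = new HashMap<>();
+
+  /** Returns the joined pair if a matching impression is buffered, otherwise stores the click and returns null. */
+  String onClick(String key, String click) {
+    String impression = unmatchedImpressions.remove(key);
+    if (impression != null) {
+      impressionArrivalMs.remove(key);
+      return impression + "|" + click;
+    }
+    unmatchedClicks.put(key, click);
+    return null;
+  }
+
+  /** Returns the joined pair if a matching click is buffered, otherwise stores the impression and returns null. */
+  String onImpression(String key, String impression, long nowMs) {
+    String click = unmatchedClicks.remove(key);
+    if (click != null) {
+      return impression + "|" + click;
+    }
+    unmatchedImpressions.put(key, impression);
+    impressionArrivalMs.put(key, nowMs);
+    return null;
+  }
+
+  /** Deletes impressions older than maxAgeMs for which no click arrived; returns how many were deleted. */
+  int expireImpressions(long nowMs, long maxAgeMs) {
+    int dropped = 0;
+    Iterator<Map.Entry<String, Long>> it = impressionArrivalMs.entrySet().iterator();
+    while (it.hasNext()) {
+      Map.Entry<String, Long> entry = it.next();
+      if (nowMs - entry.getValue() > maxAgeMs) {
+        unmatchedImpressions.remove(entry.getKey());
+        it.remove();
+        dropped++;
+      }
+    }
+    return dropped;
+  }
+}
+```
+
+In a real job the two maps would be backed by the task's store so that unmatched events survive a failover, and the expiry scan would presumably run from a periodic callback rather than being called by hand.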
+
+#### More
+
+Of course there are infinite variations on joins and aggregations, but most amount to essentially variations on the above.
+
+### Usage
+
+To declare a new store for usage you add the following to your job config:
+
+    # Use the key-value store implementation for this store
+    stores.my-store.factory=samza.storage.kv.KeyValueStorageEngineFactory
+    # Log changes to the store to a stream
+    stores.my-store.changelog=my-stream-name
+    # The serialization format to use
+    stores.my-store.serde=string
+    # The system to use for the changelog
+    stores.my-store.system=kafka
+
+Example code:
+
+    public class MyStatefulTask implements StreamTask, InitableTask {
+      private KeyValueStore<String, String> store;
+      
+      public void init(Config config, TaskContextPartition context) {
+        // The store name must match the name declared in the job config (my-store).
+        this.store = (KeyValueStore<String, String>) context.getStore("my-store");
+      }
+
+      public void process(MessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
+        System.out.println("Adding " + envelope.getMessage() + " => " + envelope.getMessage() + " to the store.");
+        store.put((String) envelope.getMessage(), (String) envelope.getMessage());
+      }
+    }
+
+This shows the put() API, but KeyValueStore gives a fairly general key-value interface:
+
+    public interface KeyValueStore<K, V> {
+      V get(K key);
+      void put(K key, V value);
+      void putAll(List<Entry<K,V>> entries);
+      void delete(K key);
+      KeyValueIterator<K,V> range(K from, K to);
+      KeyValueIterator<K,V> all();
+    }
+
+### Implementing Storage Engines
+
+The above code shows usage of the key-value storage engine, but it is not too hard to implement an alternate storage engine. To do so, you implement methods to restore the contents of the store from a stream, flush any cached content on commit, and close the store:
+
+    public interface StorageEngine {
+      void restore(StreamConsumer consumer);
+      void flush();
+      void close();
+    }
+
+The user specifies the type of storage engine they want by passing in a factory for that store in their configuration.
+
+### Fault Tolerance Semantics with State
+
+Samza currently only supports at-least-once delivery guarantees. We will extend this to exact atomic semantics across outputs to multiple streams/partitions in the future.
+
+<!-- TODO add fault tolerance semantics SEP link when one exists
+The most feasible plan for exact semantics seems to me to be journalling non-deterministic decisions proposal outlined in the fault-tolerance semantics wiki. I propose we use that as a working plan.
+
+To ensure correct semantics in the presence of faults we need to ensure that the task restores to the exact state at the time of the last commit.
+
+If the task is fed off replayable inputs then it can simply replay these inputs to recreate its state.
+
+If the task has a changelog to log its state then there is the possibility that the log contains several entries beyond the last commit point. The store should only restore up to the last commit point to ensure that the state is in the correct position with respect to the inputs–the remaining changelog will then be repeated and de-duplicated as the task begins executing.
+-->
+
+### Shared State
+
+Originally we had discussed possibly allowing some facility for global lookup dictionaries that are un-partitioned; however, this does not work with our fault-tolerance semantics proposal, as the container-wide state changes out of band (effectively acting like a separate database or service). This would not work with proposed message de-duplication features since the task output is not deterministic.
+
+## [Metrics &raquo;](metrics.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/streams.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/streams.md b/docs/learn/documentation/0.7.0/container/streams.md
new file mode 100644
index 0000000..e755789
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/streams.md
@@ -0,0 +1,40 @@
+---
+layout: page
+title: Streams
+---
+
+The [TaskRunner](task-runner.html) reads and writes messages using the StreamConsumer and StreamProducer interfaces.
+
+```
+public interface StreamConsumer {
+  StreamConsumerMessageEnvelope getMessage();
+
+  void close();
+}
+
+public interface StreamConsumerMessageEnvelope {
+  ByteBuffer getMessage();
+
+  String getOffsetId();
+}
+
+public interface StreamProducer<K> {
+  void send(ByteBuffer bytes);
+
+  void send(K k, ByteBuffer bytes);
+
+  void commit();
+
+  void close();
+}
+```
+
+Out of the box, Samza supports reads and writes to Kafka (i.e. it has a KafkaStreamConsumer/KafkaStreamProducer), but the stream interfaces are pluggable, and most message bus systems can be plugged in, with some degree of support.
+
+A number of stream-related properties should be defined in your Samza job's configuration file. These properties define systems that Samza can read from, the streams on these systems, and how to serialize and deserialize the messages from the streams. For example, you might wish to read PageViewEvent from a specific Kafka cluster. The system properties in the configuration file would define how to connect to the Kafka cluster. The stream section would define PageViewEvent as an input stream. The serializer in the configuration would define the serde to use to decode PageViewEvent messages.
+
+When the TaskRunner starts up, it will use the stream-related properties in your configuration to instantiate consumers for each stream partition. For example, if your input stream is PageViewEvent, which has 12 partitions, then the TaskRunner would create 12 KafkaStreamConsumers. Each stream consumer will read ByteBuffers from one partition, deserialize each ByteBuffer to an object, and put the object into a queue. This queue is what the [event loop](event-loop.html) will use to feed messages to your StreamTask instances.
+
+The process method in StreamTask takes a MessageCollector parameter. When the TaskRunner calls process() on one of your StreamTask instances, it provides the collector. After the process() method completes, the TaskRunner takes any output messages that your StreamTask wrote to the collector, serializes the messages, and calls the send() method on the appropriate StreamProducer.
+
+## [Checkpointing &raquo;](checkpointing.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/task-runner.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/task-runner.md b/docs/learn/documentation/0.7.0/container/task-runner.md
new file mode 100644
index 0000000..2e94926
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/task-runner.md
@@ -0,0 +1,43 @@
+---
+layout: page
+title: TaskRunner
+---
+
+The TaskRunner is Samza's stream processing container. It is responsible for managing the startup, execution, and shutdown of one or more StreamTask instances.
+
+When a TaskRunner starts up, it does the following:
+
+1. Get last checkpointed offset for each input stream/partition pair
+2. Create a "reader" thread for every input stream/partition pair
+3. Start metrics reporters to report metrics
+4. Start a checkpoint timer to save your task's input stream offsets every so often
+5. Start a window timer to trigger your StreamTask's window method, if it is defined
+6. Instantiate and initialize your StreamTask once for each input stream partition
+7. Start an event loop that takes messages from the input stream reader threads, and gives them to your StreamTasks
+8. Notify lifecycle listeners during each one of these steps
+
+Let's go over each of these items, starting in the middle, with the instantiation of a StreamTask.
+
+### Tasks and Partitions
+
+When the TaskRunner starts, it creates an instance of the StreamTask that you've written. If the StreamTask implements the InitableTask interface, the TaskRunner will also call the init() method.
+
+```
+public interface InitableTask {
+  void init(Config config, TaskContextPartition context);
+}
+```
+
+It doesn't just do this once, though. It creates the StreamTask once for each partition in your Samza job. If your Samza job has ten partitions, there will be ten instantiations of your StreamTask: one for each partition. The StreamTask instance for partition one will receive all messages for partition one, the instance for partition two will receive all messages for partition two, and so on.
+
+The number of partitions that a Samza job has is determined by the number of partitions in its input streams. If a Samza job is set up to read from a topic called PageViewEvent, which has 12 partitions, then the Samza job will have 12 partitions when it executes.
+
+![diagram](/img/0.7.0/learn/documentation/container/tasks-and-partitions.png)
+
+If a Samza job has more than one input stream, then the number of partitions for the Samza job will be the maximum number of partitions across all input streams. For example, if a Samza job is reading from PageViewEvent, which has 12 partitions, and ServiceMetricEvent, which has 14 partitions, then the Samza job would have 14 partitions (0 through 13).
+
+When the TaskRunner's StreamConsumer threads read messages from each input stream partition, each message is tagged with the partition number that it came from, and is fed to the StreamTask instance that corresponds to that partition. This design has two important properties. The first is that when a Samza job has more than one input stream, and those streams have different numbers of partitions (e.g. one has 12 partitions and the other has 14), some of your StreamTask instances will not receive messages from all streams. In the PageViewEvent/ServiceMetricEvent example, the last two StreamTask instances would only receive messages from the ServiceMetricEvent topic (partitions 12 and 13). The lower 12 instances would receive messages from both streams. If your Samza job is reading more than one input stream, you probably want all input streams to have the same number of partitions, especially if you're trying to join streams together. The second important property is that Samza assumes that a stream's partition count will never change. No partition splitting is supported. If an input stream has N partitions, it is expected that it has had, and will always have, N partitions. If you want to re-partition, you must read messages from the stream, and write them out to a new stream that has the number of partitions that you want. For example, you could read messages from PageViewEvent, and write them to PageViewEventRepartition, which could have 14 partitions. If you did this, then you would achieve balance between PageViewEventRepartition and ServiceMetricEvent.
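+
+The partition-count rule and its consequence for imbalanced streams can be worked through in a few lines of Java. This is a standalone sketch, not Samza code: the class and method names are hypothetical, and the stream names and partition counts are the ones used in the example above.
+
+```java
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+// Hypothetical helper that mirrors the rules described in the text.
+final class JobPartitionSketch {
+  /** The job's partition count is the maximum partition count across all input streams. */
+  static int jobPartitionCount(Map<String, Integer> partitionsPerStream) {
+    int max = 0;
+    for (int count : partitionsPerStream.values()) {
+      max = Math.max(max, count);
+    }
+    return max;
+  }
+
+  /** Lists the streams that contain the given partition, i.e. the streams that feed that StreamTask instance. */
+  static List<String> streamsFeeding(int partition, Map<String, Integer> partitionsPerStream) {
+    List<String> streams = new ArrayList<>();
+    for (Map.Entry<String, Integer> entry : partitionsPerStream.entrySet()) {
+      if (partition < entry.getValue()) {
+        streams.add(entry.getKey());
+      }
+    }
+    return streams;
+  }
+
+  public static void main(String[] args) {
+    Map<String, Integer> inputs = new LinkedHashMap<>();
+    inputs.put("PageViewEvent", 12);
+    inputs.put("ServiceMetricEvent", 14);
+    System.out.println(jobPartitionCount(inputs));   // prints 14
+    System.out.println(streamsFeeding(3, inputs));   // prints [PageViewEvent, ServiceMetricEvent]
+    System.out.println(streamsFeeding(13, inputs));  // prints [ServiceMetricEvent]
+  }
+}
+```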
+
+This design is important because it guarantees that any state that your StreamTask keeps in memory will be isolated on a per-partition basis. For example, if you refer back to the page-view counting job we used as an example in the [Architecture](../introduction/architecture.html) section, we might have a Map&lt;Integer, Integer&gt; that keeps track of page view counts per member ID. If we were to have just one StreamTask per Samza job, for instance, then the member ID counts from different partitions would be inter-mingled in the same map. This inter-mingling would prevent us from moving partitions between processes or machines, which is something that we want to do with YARN. You can imagine a case where you started with one TaskRunner in a single YARN container. Your Samza job might be unable to keep up with only one container, so you ask for a second YARN container to run some of the StreamTask partitions. In such a case, how would we split the counts such that each container gets only the member ID counts for the partitions it is in charge of? This is effectively impossible if we've inter-mingled the StreamTask's state together. This is why we isolate StreamTask instances on a per-partition basis: to make partition migration possible.
+
+## [Streams &raquo;](streams.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/container/windowing.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/container/windowing.md b/docs/learn/documentation/0.7.0/container/windowing.md
new file mode 100644
index 0000000..0a2e647
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/container/windowing.md
@@ -0,0 +1,16 @@
+---
+layout: page
+title: Windowing
+---
+
+Referring back to the "count PageViewEvent by member ID" example in the [Architecture](../introduction/architecture.html) section, one thing that we left out was what we do with the counts. Let's say that the Samza job wants to update the member ID counts in a database once every minute. Here's how it would work. The Samza job that does the counting would keep a Map&lt;Integer, Integer&gt; in memory, which maps member IDs to page view counts. Every time a message arrives, the job would take the member ID in the PageViewEvent, and use it to increment the member ID's count in the in-memory map. Then, once a minute, the StreamTask would update the database (total_count += current_count) for every member ID in the map, and then reset the count map.
+
+Windowing is how we achieve this. If a StreamTask implements the WindowableTask interface, the TaskRunner will call the window() method on the task at a configured interval.
+
+```
+public interface WindowableTask {
+  void window(MessageCollector collector, TaskCoordinator coordinator);
+}
+```
+
+If you choose to implement the WindowableTask interface, you can use the Samza job's configuration to define how often the TaskRunner should call your window() method. In the PageViewEvent example (above), you would define it to flush every 60000 milliseconds (60 seconds).
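+
+The count-then-flush pattern described above can be sketched without any Samza dependencies. Everything here is an assumption made for illustration: the class name, the simplified process() and window() signatures, and the in-memory map that stands in for the database. In a real job, process() and window() would be the StreamTask and WindowableTask callbacks, and the TaskRunner would call window() on the configured interval.
+
+```java
+import java.util.HashMap;
+import java.util.Map;
+
+// Hypothetical sketch of the page view counting task; not the Samza API.
+final class PageViewWindowSketch {
+  // Counts accumulated since the last window() call, keyed by member ID.
+  private final Map<Integer, Integer> currentCounts = new HashMap<>();
+  // Stand-in for the external database holding total counts.
+  private final Map<Integer, Integer> database = new HashMap<>();
+
+  /** Called once per message: increment the in-memory count for the member. */
+  void process(int memberId) {
+    currentCounts.merge(memberId, 1, Integer::sum);
+  }
+
+  /** Called once per interval: total_count += current_count for every member, then reset. */
+  void window() {
+    for (Map.Entry<Integer, Integer> entry : currentCounts.entrySet()) {
+      database.merge(entry.getKey(), entry.getValue(), Integer::sum);
+    }
+    currentCounts.clear();
+  }
+
+  int storedTotal(int memberId) {
+    return database.getOrDefault(memberId, 0);
+  }
+
+  int pendingCount(int memberId) {
+    return currentCounts.getOrDefault(memberId, 0);
+  }
+}
+```
+
+Batching the writes this way means the database sees one update per member per interval rather than one per message, at the cost of losing the pending counts if the task fails between flushes.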

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/index.html
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/index.html b/docs/learn/documentation/0.7.0/index.html
new file mode 100644
index 0000000..7806baf
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/index.html
@@ -0,0 +1,73 @@
+---
+layout: page
+title: Documentation
+---
+
+<h4>Introduction</h4>
+
+<div class="documentation-second-level">
+  <a href="introduction/background.html">Background</a><br/>
+  <a href="introduction/concepts.html">Concepts</a><br/>
+  <a href="introduction/architecture.html">Architecture</a>
+</div>
+
+<h4>Comparisons</h4>
+
+<div class="documentation-second-level">
+  <a href="comparisons/introduction.html">Introduction</a><br/>
+  <a href="comparisons/mupd8.html">MUPD8</a><br/>
+  <a href="comparisons/storm.html">Storm</a>
+<!-- TODO comparisons pages
+  <a href="comparisons/aurora.html">Aurora</a><br/>
+  <a href="comparisons/jms.html">JMS</a><br/>
+  <a href="comparisons/s4.html">S4</a><br/>
+-->
+</div>
+
+<h4>API</h4>
+
+<div class="documentation-second-level">
+  <a href="api/overview.html">Overview</a><br/>
+  <a href="api/javadocs">Javadocs</a><br/>
+</div>
+
+<h4>Container</h4>
+
+<div class="documentation-second-level">
+  <a href="container/task-runner.html">TaskRunner</a><br/>
+  <a href="container/streams.html">Streams</a><br/>
+  <a href="container/checkpointing.html">Checkpointing</a><br/>
+  <a href="container/state-management.html">State Management</a><br/>
+  <a href="container/metrics.html">Metrics</a><br/>
+  <a href="container/windowing.html">Windowing</a><br/>
+  <a href="container/event-loop.html">Event Loop</a><br/>
+  <a href="container/jmx.html">JMX</a>
+</div>
+
+<h4>Jobs</h4>
+
+<div class="documentation-second-level">
+  <a href="jobs/job-runner.html">JobRunner</a><br/>
+  <a href="jobs/configuration.html">Configuration</a><br/>
+  <a href="jobs/packaging.html">Packaging</a><br/>
+  <a href="jobs/yarn-jobs.html">YARN Jobs</a><br>
+  <a href="jobs/logging.html">Logging</a>
+</div>
+
+<h4>YARN</h4>
+
+<div class="documentation-second-level">
+  <a href="yarn/application-master.html">Application Master</a><br/>
+  <a href="yarn/isolation.html">Isolation</a>
+<!-- TODO write yarn pages
+  <a href="">Fault Tolerance</a><br/>
+  <a href="">Security</a><br/>
+-->
+</div>
+
+<h4>Operations</h4>
+
+<div class="documentation-second-level">
+  <a href="operations/security.html">Security</a><br/>
+  <a href="operations/kafka.html">Kafka</a>
+</div>

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/introduction/architecture.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/introduction/architecture.md b/docs/learn/documentation/0.7.0/introduction/architecture.md
new file mode 100644
index 0000000..74470d1
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/introduction/architecture.md
@@ -0,0 +1,90 @@
+---
+layout: page
+title: Architecture
+---
+
+Samza is made up of three layers:
+
+1. A streaming layer.
+2. An execution layer.
+3. A processing layer.
+
+Samza provides out of the box support for all three layers.
+
+1. **Streaming:** [Kafka](http://kafka.apache.org/)
+2. **Execution:** [YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
+3. **Processing:** [Samza API](../api/overview.html)
+
+These three pieces fit together to form Samza.
+
+![diagram-medium](/img/0.7.0/learn/documentation/introduction/samza-ecosystem.png)
+
+This architecture should be familiar to anyone that's used Hadoop.
+
+![diagram-medium](/img/0.7.0/learn/documentation/introduction/samza-hadoop.png)
+
+Before going in-depth on each of these three layers, it should be noted that Samza's support is not limited to these systems. Both Samza's execution and streaming layers are pluggable, and allow developers to implement alternatives if they prefer.
+
+### Kafka
+
+[Kafka](http://kafka.apache.org/) is a distributed pub/sub and message queueing system that provides at-least-once messaging guarantees, and highly available partitions (i.e. a stream's partitions will be available, even if a machine goes down).
+
+In Kafka, each stream is called a "topic". Each topic is partitioned, to make things scalable. When a "producer" sends a message to a topic, the producer provides a key, which is used to determine which partition the message should be sent to. Kafka "brokers", each of which is in charge of some partitions, receive the messages that the producer sends, and store them on disk in a log file. Kafka "consumers" can then read from a topic by getting messages from all of a topic's partitions.
+
+This has some interesting properties. First, all messages partitioned by the same key are guaranteed to be in the same Kafka topic partition. This means, if you wish to read all messages for a specific member ID, you only have to read the messages from the partition that the member ID is on, not the whole topic (assuming the topic is partitioned by member ID). Second, since a Kafka broker's file is a log, you can reference any point in the log file using an "offset". This offset determines where a consumer is in a topic/partition pair. After every message a consumer reads from a topic/partition pair, the offset is incremented.
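+
+The first property follows from the partition being a deterministic function of the key. The sketch below shows one common way to do this (a hash of the key, modulo the partition count). It is an assumption for illustration only: Kafka's own partitioner may use a different hash function, and the class and method names are hypothetical.
+
+```java
+// Hypothetical sketch of key-based partitioning; not Kafka's actual partitioner.
+final class KeyPartitionSketch {
+  /** Maps a key to a partition in the range [0, numPartitions). */
+  static int partitionFor(String key, int numPartitions) {
+    // Mask the sign bit so that a negative hash code cannot produce a negative partition.
+    return (key.hashCode() & 0x7fffffff) % numPartitions;
+  }
+
+  public static void main(String[] args) {
+    int first = partitionFor("member-42", 12);
+    int second = partitionFor("member-42", 12);
+    // The same key always maps to the same partition.
+    System.out.println(first == second);  // prints true
+  }
+}
+```
+
+Because the mapping depends on the partition count, changing the number of partitions would move keys to different partitions.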
+
+For more details on Kafka, see Kafka's [introduction](http://kafka.apache.org/introduction.html) and [design](http://kafka.apache.org/design.html) pages.
+
+### YARN
+
+[YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) (Yet Another Resource Negotiator) is Hadoop's next-generation cluster scheduler. It allows you to allocate a number of "containers" (processes) in a cluster of machines, and execute arbitrary commands on them.
+
+When an application interacts with YARN, it looks something like this:
+
+1. **Application**: I want to run command X on two machines with 512M memory
+2. **YARN**: Cool, where's your code?
+3. **Application**: http://path.to.host/jobs/download/my.tgz
+4. **YARN**: I'm running your job on node-1.grid and node-2.grid
+
+Samza uses YARN to manage:
+
+* Deployment
+* Fault tolerance
+* Logging
+* Isolation
+* Security
+* Locality
+
+This page covers a brief overview of YARN, but [this page](http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/) from Hortonworks contains a much better overview.
+
+#### YARN Architecture
+
+YARN has three important pieces: a ResourceManager, a NodeManager, and an ApplicationMaster. In a YARN grid, every computer runs a NodeManager, which is responsible for running processes on the local machine. A ResourceManager talks to all of the NodeManagers to tell them what to run. Applications, in turn, talk to the ResourceManager when they wish to run something on the cluster. The flow, when starting a new application, goes from user application to YARN RM, to YARN NM. The third piece, the ApplicationMaster, is actually application-specific code that runs in the YARN cluster. It's responsible for managing the application's workload, asking for containers (usually, UNIX processes), and handling notifications when one of its containers fails.
+
+#### Samza and YARN
+
+Samza provides a YARN ApplicationMaster, and YARN job runner out of the box. The integration between Samza and YARN is outlined in the following diagram (different colors indicate different host machines).
+
+![diagram-small](/img/0.7.0/learn/documentation/introduction/samza-yarn-integration.png)
+
+The Samza client talks to the YARN RM when it wants to start a new Samza job. The YARN RM talks to a YARN NM to allocate space on the cluster for Samza's ApplicationMaster. Once the NM allocates space, it starts the Samza AM. After the Samza AM starts, it asks the YARN RM for one, or more, YARN containers to run Samza [TaskRunners](../container/task-runner.html). Again, the RM works with NMs to allocate space for the containers. Once the space has been allocated, the NMs start the Samza containers.
+
+### Samza
+
+Samza uses YARN and Kafka to provide a framework for stage-wise stream processing and partitioning. Everything, put together, looks like this (different colors indicate different host machines):
+
+![diagram-small](/img/0.7.0/learn/documentation/introduction/samza-yarn-kafka-integration.png)
+
+The Samza client uses YARN to run a Samza job. The Samza [TaskRunners](../container/task-runner.html) run in one, or more, YARN containers, and execute user-written Samza [StreamTasks](../api/overview.html). The input and output for the Samza StreamTasks come from Kafka brokers that are (usually) co-located on the same machines as the YARN NMs.
+
+### Example
+
+Let's take a look at a real example. Suppose that we wanted to count page views grouped by member ID. In SQL, it would look something like: SELECT COUNT(\*) FROM PageViewEvent GROUP BY member_id. Although Samza doesn't support SQL right now, the idea is the same. Two jobs are required to calculate this query: one to group messages by member ID, and the other to do the counting. The counting and grouping can't be done in the same Samza job because the input topic might not be partitioned by the member ID. Anyone familiar with Hadoop will recognize this as a Map/Reduce operation, where you first map data by a particular key, and then count in the reduce step.
+
+![diagram-large](/img/0.7.0/learn/documentation/introduction/group-by-example.png)
+
+The input topic is partitioned using Kafka. Each Samza process reads messages from one or more of the input topic's partitions, and emits them back out to a different Kafka topic keyed by the message's member ID attribute. The Kafka brokers receive these messages, and buffer them on disk until the second job (the counting job on the bottom of the diagram) reads the messages, and increments its counters.
+
+There are some neat things to consider about this example. First, we're leveraging the fact that Kafka topics are inherently partitioned. This lets us run one or more Samza processes, and assign them each some partitions to read from. Second, since we're guaranteed that, for a given key, all messages will be on the same partition, we can actually split up the aggregation (counting). For example, if the first job's output had four partitions, we could assign two partitions to the first count process, and the other two partitions to the second count process. We'd be guaranteed that for any given member ID, all of their messages will be consumed by either the first process or the second, but not both. This means we'll get accurate counts, even when partitioning. Third, the fact that we're using Kafka, which buffers messages on its brokers, also means that we don't have to worry as much about failures. If a process or machine fails, we can use YARN to start the process on another machine. When the process starts up again, it can get its last offset, and resume reading messages where it left off.
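+
+The third point, resuming from the last offset, can be illustrated with a toy log. This is a sketch under loud assumptions: a Java list stands in for one Kafka topic partition, a local variable stands in for the saved checkpoint, and the class and method names are hypothetical.
+
+```java
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+// Hypothetical sketch: a list plays the role of one partition's log on a broker.
+final class OffsetResumeSketch {
+  /** Reads every message from startOffset onward into out, and returns the next offset to read. */
+  static long consume(List<String> log, long startOffset, List<String> out) {
+    long offset = startOffset;
+    while (offset < log.size()) {
+      out.add(log.get((int) offset));
+      offset++;
+    }
+    return offset;
+  }
+
+  public static void main(String[] args) {
+    List<String> log = new ArrayList<>(Arrays.asList("m0", "m1", "m2"));
+
+    // First run: read everything, then save the offset as a checkpoint.
+    List<String> beforeFailure = new ArrayList<>();
+    long checkpoint = consume(log, 0, beforeFailure);
+
+    // The process fails; meanwhile the broker keeps buffering new messages.
+    log.add("m3");
+    log.add("m4");
+
+    // Restarted process: resume from the checkpoint instead of from the beginning.
+    List<String> afterRestart = new ArrayList<>();
+    consume(log, checkpoint, afterRestart);
+    System.out.println(afterRestart);  // prints [m3, m4]
+  }
+}
+```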
+
+## [Comparison Introduction &raquo;](../comparisons/introduction.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/introduction/background.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/introduction/background.md b/docs/learn/documentation/0.7.0/introduction/background.md
new file mode 100644
index 0000000..1437611
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/introduction/background.md
@@ -0,0 +1,55 @@
+---
+layout: page
+title: Background
+---
+
+This page provides some background about stream processing, describes what Samza is, and why it was built.
+
+### What is messaging?
+
+Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (Active MQ, Rabbit MQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream "consumers" read messages from these systems, and process or take action based on the message contents.
+
+Suppose that you have a server that's serving web pages. You can have the web server send a "user viewed page" event to a messaging system. You might then have consumers:
+
+* Put the message into Hadoop
+* Count page views and update a dashboard
+* Trigger an alert if a page view fails
+* Send an email notification to another user
+* Join the page view event with the user's profile, and send the message back to the messaging system
+
+A messaging system lets you decouple all of this work from the actual web page serving.
+
+### What is stream processing?
+
+A messaging system is a fairly low-level piece of infrastructure---it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems.
+
+Consider the counting example, above (count page views and update a dashboard). What happens when the machine that your consumer is running on fails, and your "current count" is lost? How do you recover? Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? Your counts will be off. What if you want to count page views grouped by the page URL? How can you do that in a distributed environment?
+
+Stream processing is a higher level of abstraction on top of messaging systems, and it's meant to address precisely this category of problems.
+
+### Samza
+
+Samza is a stream processing framework with the following features:
+
+* **Simple API:** Samza provides a very simple call-back based "process message" API.
+* **Managed state:** Samza manages snapshotting and restoration of a stream processor's state. Samza will restore a stream processor's state to a snapshot consistent with the processor's last read messages when the processor is restarted. Samza is built to handle large amounts of state (even many gigabytes per partition).
+* **Fault tolerance:** Samza will work with YARN to transparently migrate your tasks whenever a machine in the cluster fails.
+* **Durability:** Samza uses Kafka to guarantee that no messages will ever be lost.
+* **Scalability:** Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
+* **Pluggable:** Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
+* **Processor isolation:** Samza works with Apache YARN to provide security and resource scheduling, along with resource isolation through Linux CGroups.
+
+### Alternatives
+
+The open source stream processing systems that are available are actually quite young, and no single system offers a complete solution. Problems like how a stream processor's state should be managed, whether a stream should be buffered remotely on disk or not, what to do when duplicate messages are received or messages are lost, and how to model underlying messaging systems are all pretty new.
+
+Samza's main differentiators are:
+
+* Samza supports fault-tolerant local state. State can be thought of as tables that are split up and maintained with the processing tasks. State is itself modeled as a stream. When a processor is restarted, the state stream is entirely replayed to restore it.
+* Streams are ordered, partitioned, replayable, and fault tolerant.
+* YARN is used for processor isolation, security, and fault tolerance.
+* All streams are materialized to disk.
+
+For a more in-depth discussion on Samza, and how it relates to other stream processing systems, have a look at Samza's [Comparisons](../comparisons/introduction.html) documentation.
+
+## [Concepts &raquo;](concepts.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/introduction/concepts.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/introduction/concepts.md b/docs/learn/documentation/0.7.0/introduction/concepts.md
new file mode 100644
index 0000000..206133d
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/introduction/concepts.md
@@ -0,0 +1,58 @@
+---
+layout: page
+title: Concepts
+---
+
+This page gives an introduction to the high-level concepts in Samza.
+
+### Streams
+
+Samza processes *streams*. A stream is composed of immutable *messages* of a similar type or category. Example streams might include all the clicks on a website, or all the updates to a particular database table, or any other type of event data. Messages can be appended to a stream or read from a stream. A stream can have any number of readers, and reading from a stream doesn't delete the message, so a message written to a stream is effectively broadcast to all readers. Messages can optionally have an associated key, which is used for partitioning (described below).
+
+Samza supports pluggable *systems* that implement the stream abstraction: in Kafka a stream is a topic, in a database we might read a stream by consuming updates from a table, in Hadoop we might tail a directory of files in HDFS.
+
+![job](/img/0.7.0/learn/documentation/introduction/job.png)
+
+### Jobs
+
+A Samza *job* is code that performs a logical transformation on a set of input streams to append output messages to a set of output streams.
+
+If scalability were not a concern, streams and jobs would be all we would need. But to let us scale our jobs and streams, we chop these two things up into smaller units of parallelism below the stream and job, namely *partitions* and *tasks*.
+
+### Partitions
+
+Each stream is broken into one or more partitions. Each partition in the stream is a totally ordered sequence of messages.
+
+Each position in this sequence has a unique identifier called the *offset*. The offset can be a sequential integer, byte offset, or string depending on the underlying system implementation.
+
+Each message appended to a stream is appended to only one of the stream's partitions. The assignment of the message to its partition is done with a key chosen by the writer (in the click example above, data might be partitioned by user id).
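
As a rough sketch of key-based assignment, the snippet below hashes a key and takes it modulo the partition count, so every message with the same key lands in the same partition. The class and method names are hypothetical, and hash-modulo is only one common scheme, assumed here for illustration; the underlying system may assign partitions differently.

```java
// Hypothetical illustration; not part of the Samza API.
final class KeyPartitioner {
  // Maps a message key to one of numPartitions partitions.
  // Math.floorMod keeps the result non-negative even when hashCode() is negative.
  static int partitionFor(String key, int numPartitions) {
    return Math.floorMod(key.hashCode(), numPartitions);
  }

  public static void main(String[] args) {
    int numPartitions = 4;
    // Both "alice" messages map to the same partition.
    for (String userId : new String[] {"alice", "bob", "alice"}) {
      System.out.println(userId + " -> partition " + partitionFor(userId, numPartitions));
    }
  }
}
```

Because the mapping depends only on the key and the partition count, all clicks for one user id stay in a single, totally ordered partition.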
+
+![stream](/img/0.7.0/learn/documentation/introduction/stream.png)
+
+### Tasks
+
+A job is itself distributed by breaking it into multiple *tasks*. The *task* is the unit of parallelism of the job, just as the partition is to the stream. Each task consumes data from one partition for each of the job's input streams.
+
+The task processes messages from each of its input partitions *in order by offset*. There is no defined ordering between partitions.
+
+The position of the task in its input partitions can be represented by a set of offsets, one for each partition.
+
+The number of tasks a job has is fixed and does not change (though the computational resources assigned to the job may go up and down). The number of tasks a job has also determines the maximum parallelism of the job as each task processes messages sequentially. There cannot be more tasks than input partitions (or there would be some task with no input).
+
+The partitions assigned to a task will never change: if a task is on a machine that fails, the task will be restarted elsewhere, still consuming the same stream partitions.
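
To make the idea of a task's position concrete, here is a minimal sketch that tracks one offset per input stream for a single task. `TaskPosition` and its methods are hypothetical names, not Samza classes, and offsets are kept as strings only because the text above says an offset may be an integer, a byte offset, or a string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; these names are not Samza classes.
final class TaskPosition {
  // The partition number this task reads from each of the job's input streams.
  private final int partition;
  // One entry per input stream: stream name -> offset of the last processed message.
  private final Map<String, String> lastOffsets = new LinkedHashMap<>();

  TaskPosition(int partition) {
    this.partition = partition;
  }

  int partition() {
    return partition;
  }

  // Called after a message is processed; a later offset replaces the earlier one.
  void recordProcessed(String stream, String offset) {
    lastOffsets.put(stream, offset);
  }

  // A copy of the current position: the set of offsets, one per input partition.
  Map<String, String> snapshot() {
    return new LinkedHashMap<>(lastOffsets);
  }
}
```

A restarted task that is handed such a snapshot knows where to resume in each of its partitions, which is why the partition assignment has to stay fixed.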
+
+![job-detail](/img/0.7.0/learn/documentation/introduction/job_detail.png)
+
+### Dataflow Graphs
+
+We can compose multiple jobs to create a data flow graph where the nodes are streams containing data and the edges are jobs performing transformations. This composition is done purely through the streams the jobs take as input and output&mdash;the jobs are otherwise totally decoupled: they need not be implemented in the same code base, and adding, removing, or restarting a downstream job will not impact an upstream job.
+
+These graphs are often acyclic&mdash;that is, data usually doesn't flow from a job, through other jobs, back to itself. However, this is not a requirement.
+
+![dag](/img/0.7.0/learn/documentation/introduction/dag.png)
+
+### Containers
+
+Partitions and tasks are both *logical* units of parallelism; they don't correspond to any particular assignment of computational resources (CPU, memory, disk space, etc.). Containers are the unit of physical parallelism, and a container is essentially just a Unix process (or Linux [cgroup](http://en.wikipedia.org/wiki/Cgroups)). Each container runs one or more tasks. The number of tasks is determined automatically from the number of partitions in the input and is fixed, but the number of containers (and the CPU and memory resources associated with them) is specified by the user at run time and can be changed at any time.
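
The relationship between a fixed set of tasks and a variable number of containers can be sketched as follows. The round-robin policy and the `TaskGrouper` name are assumptions made for illustration; the text above only says that each container runs one or more tasks, not how Samza chooses which ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the grouping logic Samza actually uses.
final class TaskGrouper {
  // Spreads task ids 0..taskCount-1 across containerCount containers, round-robin.
  static List<List<Integer>> assign(int taskCount, int containerCount) {
    if (containerCount < 1) {
      throw new IllegalArgumentException("need at least one container");
    }
    List<List<Integer>> containers = new ArrayList<>();
    for (int c = 0; c < containerCount; c++) {
      containers.add(new ArrayList<Integer>());
    }
    for (int task = 0; task < taskCount; task++) {
      containers.get(task % containerCount).add(task);
    }
    return containers;
  }

  public static void main(String[] args) {
    // Same five tasks, regrouped when the container count changes.
    System.out.println(assign(5, 2));
    System.out.println(assign(5, 3));
  }
}
```

Changing the container count only regroups the same tasks; the tasks themselves, and the partitions they consume, stay as they were.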
+
+## [Architecture &raquo;](architecture.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/jobs/configuration-table.html
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/configuration-table.html b/docs/learn/documentation/0.7.0/jobs/configuration-table.html
new file mode 100644
index 0000000..41353b2
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/jobs/configuration-table.html
@@ -0,0 +1,224 @@
+<html>
+  <body>
+    <table cellspacing="2" border="1" cellpadding="2">
+      <tbody>
+        <tr><th>Name</th><th>Default</th><th>Description</th></tr>
+        <tr>
+          <td><strong>Job</strong></td>
+          <td> </td>
+          <td> </td>
+        </tr>
+        <tr>
+          <td>job.factory.class</td>
+          <td>none</td>
+          <td><strong>Required:</strong> The job factory to use when running a task. This can be either samza.job.local.LocalJobFactory, or samza.job.yarn.YarnJobFactory.</td>
+        </tr>
+        <tr>
+          <td>job.name</td>
+          <td>none</td>
+          <td><strong>Required:</strong> The name of your job. This is the name that will appear on the Samza dashboard, when your job is running.</td>
+        </tr>
+        <tr>
+          <td>job.id</td>
+          <td>1</td>
+          <td>An ID string that is used to distinguish between multiple concurrent executions of the same Samza job.</td>
+        </tr>
+        <tr>
+          <td><strong>Task</strong></td>
+          <td> </td>
+          <td> </td>
+        </tr>
+        <tr>
+          <td>task.class</td>
+          <td>none</td>
+          <td><strong>Required:</strong> The fully qualified class name of the StreamTask to execute. For example, samza.task.example.MyStreamerTask.</td>
+        </tr>
+        <tr>
+          <td>task.execute</td>
+          <td>bin/run-task.sh</td>
+          <td>The command that a StreamJob should invoke to start the TaskRunner.</td>
+        </tr>
+        <tr>
+          <td>task.message.buffer.size</td>
+          <td>10000</td>
+          <td>The number of messages that the TaskRunner will buffer in the event loop queue, before it begins blocking StreamConsumers.</td>
+        </tr>
+        <tr>
+          <td>task.inputs</td>
+          <td>none</td>
+          <td><strong>Required:</strong> A CSV list of stream names from which the TaskRunner should read messages for your StreamTasks (e.g. page-view-event,service-metrics).</td>
+        </tr>
+        <tr>
+          <td>task.window.ms</td>
+          <td>-1</td>
+          <td>How often the TaskRunner should call window() on a WindowableTask. A negative number tells the TaskRunner to never call window().</td>
+        </tr>
+        <tr>
+          <td>task.commit.ms</td>
+          <td>60000</td>
+          <td>How often the TaskRunner should call writeCheckpoint for a partition.</td>
+        </tr>
+        <tr>
+          <td>task.command.class</td>
+          <td>samza.task.ShellCommandBuilder</td>
+          <td>The class to use to build environment variables for the task.execute command.</td>
+        </tr>
+        <tr>
+          <td>task.lifecycle.listeners</td>
+          <td>none</td>
+          <td>A CSV list of lifecycle listener names that the TaskRunner notifies when lifecycle events occur (e.g. my-lifecycle-manager).</td>
+        </tr>
+        <tr>
+          <td>task.lifecycle.listener.%s.class</td>
+          <td>none</td>
+          <td>The class name for a lifecycle listener factory (e.g. task.lifecycle.listener.my-lifecycle-manager.class=foo.bar.MyLifecycleManagerFactory)</td>
+        </tr>
+        <tr>
+          <td>task.checkpoint.factory</td>
+          <td>none</td>
+          <td>The class name for the checkpoint manager to use (e.g. samza.task.state.KafkaCheckpointManagerFactory)</td>
+        </tr>
+        <tr>
+          <td>task.checkpoint.failure.retry.ms</td>
+          <td>10000</td>
+          <td>If readLastCheckpoint or writeCheckpoint fails, the TaskRunner will wait this interval before retrying the checkpoint.</td>
+        </tr>
+        <tr>
+          <td colspan="1">task.opts</td>
+          <td colspan="1">none</td>
+          <td colspan="1">JVM options that should be attached to each JVM that is running StreamTasks. If you wish to reference the log directory from this parameter, use logs/. <span>If you wish to reference code in the Samza job's TGZ package use __package/.</span></td>
+        </tr>
+        <tr>
+          <td><strong>System</strong></td>
+          <td> </td>
+          <td> </td>
+        </tr>
+        <tr>
+          <td>systems.%s.samza.consumer.factory</td>
+          <td>none</td>
+          <td>The StreamConsumerFactory class to use when creating a new StreamConsumer for this system (e.g. samza.stream.kafka.KafkaConsumerFactory).</td>
+        </tr>
+        <tr>
+          <td>systems.%s.samza.producer.factory</td>
+          <td>none</td>
+          <td>The StreamProducerFactory class to use when creating a new StreamProducer for this system (e.g. samza.stream.kafka.KafkaProducerFactory).</td>
+        </tr>
+        <tr>
+          <td>systems.%s.samza.partition.manager</td>
+          <td>none</td>
+          <td>The PartitionManager class to use when fetching partition information about streams for the system (e.g. samza.stream.kafka.KafkaPartitionManager).</td>
+        </tr>
+        <tr>
+          <td>systems.%s.producer.reconnect.interval.ms</td>
+          <td>10000</td>
+          <td>If a producer fails, the TaskRunner will wait this interval before retrying.</td>
+        </tr>
+        <tr>
+          <td>systems.%s.*</td>
+          <td>none</td>
+          <td>For both Kafka and Databus, any configuration you supply under this namespace will be given to the underlying Kafka consumer/producer, and Databus consumer/producer. This is useful for configuring things like autooffset.reset, socket buffer size, fetch size, batch size, etc.</td>
+        </tr>
+        <tr>
+          <td><strong>Stream</strong></td>
+          <td> </td>
+          <td> </td>
+        </tr>
+        <tr>
+          <td>streams.%s.system</td>
+          <td>none</td>
+          <td>The name of the system associated with this stream (e.g. kafka-aggregate-tracking). This name must match with a system defined in the configuration file.</td>
+        </tr>
+        <tr>
+          <td>streams.%s.stream</td>
+          <td>none</td>
+          <td>The name of the stream in the system (e.g. PageViewEvent).</td>
+        </tr>
+        <tr>
+          <td>streams.%s.serde</td>
+          <td>none</td>
+          <td>The serde to use to serialize and deserialize messages for this stream. If undefined, the TaskRunner will try to fall back to the default serde, if it's defined.</td>
+        </tr>
+        <tr>
+          <td>streams.%s.consumer.reset.offset</td>
+          <td>false</td>
+          <td>If set to true, the TaskRunner will ignore the last checkpoint offset for this stream, and use null as the offset for the stream instead. In the case of Kafka's consumer, it will fall back to autooffset.reset. In the case of Databus' consumer, it will fall back to SCN 0.</td>
+        </tr>
+        <tr>
+          <td>streams.%s.consumer.failure.retry.ms</td>
+          <td>10000</td>
+          <td>If a StreamConsumer fails, the TaskRunner will wait this interval before retrying.</td>
+        </tr>
+        <tr>
+          <td>streams.%s.consumer.max.bytes.per.sec</td>
+          <td>none</td>
+          <td>The maximum number of bytes that the TaskRunner will allow from all partitions that it's reading for this stream. For example, if you have an input stream with two partitions, and 1 MB/sec max, then the maximum bytes the TaskRunner will read per second from all of the input stream's partitions is 1 MB/sec.</td>
+        </tr>
+        <tr>
+          <td>streams.%s.producer.reconnect.interval.ms</td>
+          <td>10000</td>
+          <td>If a StreamProducer fails, the TaskRunner will wait this interval before retrying.</td>
+        </tr>
+        <tr>
+          <td><strong>Serdes</strong></td>
+          <td> </td>
+          <td> </td>
+        </tr>
+        <tr>
+          <td>serializers.registry.%s.class</td>
+          <td>none</td>
+          <td>The name of a class that implements both SerializerFactory and DeserializerFactory (e.g. serializers.registry.json.class=samza.serializers.JsonSerdeFactory).</td>
+        </tr>
+        <tr>
+          <td>serializers.default</td>
+          <td>none</td>
+          <td>The default serde to use, if one is not defined for an input or output stream (e.g. serializers.default=json).</td>
+        </tr>
+        <tr>
+          <td><strong>YARN</strong></td>
+          <td> </td>
+          <td> </td>
+        </tr>
+        <tr>
+          <td>yarn.package.path</td>
+          <td>none</td>
+          <td>The tgz location of your Samza job. This tgz file is well structured. See the YARN section for details.</td>
+        </tr>
+        <tr>
+          <td>yarn.container.memory.mb</td>
+          <td>768</td>
+          <td>How much memory to ask for (per-container), when Samza is starting a YARN container.</td>
+        </tr>
+        <tr>
+          <td>yarn.container.count</td>
+          <td>1</td>
+          <td>How many containers to start when a Samza job is started in YARN. Partitions are divided evenly among the containers.</td>
+        </tr>
+        <tr>
+          <td colspan="1">yarn.am.opts</td>
+          <td colspan="1">none</td>
+          <td colspan="1"><span>JVM options that should be attached to each JVM that is running the ApplicationMaster. If you wish to reference the log directory from this parameter, use logs/. If you wish to reference code in the Samza job's TGZ package use __package/.</span></td>
+        </tr>
+        <tr>
+          <td><strong>Metrics</strong></td>
+          <td> </td>
+          <td> </td>
+        </tr>
+        <tr>
+          <td>metrics.reporter.%s.class</td>
+          <td>none</td>
+          <td>The package and class for a metrics reporter (e.g. metrics.reporter.foo-bar.class=samza.metrics.reporter.MetricsSnapshotReporter).</td>
+        </tr>
+        <tr>
+          <td>metrics.reporter.%s.window.ms</td>
+          <td>10000</td>
+          <td>How often the TaskRunner tells the metrics reporter to send its metrics.</td>
+        </tr>
+        <tr>
+          <td>metrics.reporters</td>
+          <td>none</td>
+          <td>A CSV list of metric reporter names (e.g. metrics.reporters=foo-bar).</td>
+        </tr>
+      </tbody>
+    </table>
+  </body>
+</html>

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/jobs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/configuration.md b/docs/learn/documentation/0.7.0/jobs/configuration.md
new file mode 100644
index 0000000..01035ba
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/jobs/configuration.md
@@ -0,0 +1,45 @@
+---
+layout: page
+title: Configuration
+---
+
+All Samza jobs have a configuration file that defines the job. A very basic configuration file looks like this:
+
+```
+# Job
+job.factory.class=samza.job.local.LocalJobFactory
+job.name=hello-world
+
+# Task
+task.class=samza.task.example.MyJavaStreamerTask
+task.inputs=example-stream
+
+# Serializers
+serializers.registry.json.class=samza.serializers.JsonSerdeFactory
+serializers.default=json
+
+# Streams
+streams.example-stream.system=example-system
+streams.example-stream.stream=some-stream
+
+# Systems
+systems.example-system.samza.consumer.factory=samza.stream.example.ExampleConsumerFactory
+systems.example-system.samza.partition.manager=samza.stream.example.ExamplePartitionManager
+```
+
+There are five major sections to a configuration file. The job section defines things like the name of the job, and whether to use the YarnJobFactory or LocalJobFactory. The task section is where you specify the class name for your StreamTask. It's also where you define what the input streams are for your task. The system section defines systems that you can read from. Usually, you'll define a Kafka system, if you're reading from Kafka. After that you'll need to define the stream(s) that you want to read from, which systems they're coming from, and how to deserialize objects from the stream.
+
+### Required Configuration
+
+Configuration keys that absolutely must be defined for a Samza job are:
+
+* job.factory.class
+* job.name
+* task.class
+* task.inputs
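
A job launcher could check for these keys before doing any other work. The sketch below is a hypothetical helper, not something Samza provides; it treats the configuration as a plain map of strings and reports which of the four required keys are absent or blank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Samza.
final class RequiredConfigCheck {
  // The four keys listed above as required for every job.
  static final List<String> REQUIRED_KEYS = Arrays.asList(
      "job.factory.class", "job.name", "task.class", "task.inputs");

  // Returns the required keys that are absent or blank in the given config.
  static List<String> missingKeys(Map<String, String> config) {
    List<String> missing = new ArrayList<>();
    for (String key : REQUIRED_KEYS) {
      String value = config.get(key);
      if (value == null || value.trim().isEmpty()) {
        missing.add(key);
      }
    }
    return missing;
  }
}
```

An empty result means all four keys are present; anything else names exactly what still has to be added to the properties file.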
+
+### Configuration Keys
+
+A complete list of configuration keys can be found on the [Configuration Table](configuration-table.html) page.
+
+## [Packaging &raquo;](packaging.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/jobs/job-runner.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/job-runner.md b/docs/learn/documentation/0.7.0/jobs/job-runner.md
new file mode 100644
index 0000000..4c2ab4c
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/jobs/job-runner.md
@@ -0,0 +1,46 @@
+---
+layout: page
+title: JobRunner
+---
+
+Samza jobs are started using a script called run-job.sh.
+
+```
+samza-example/target/bin/run-job.sh \
+  --config-factory=samza.config.factories.PropertiesConfigFactory \
+  --config-path=file://$PWD/config/hello-world.properties
+```
+
+You provide two parameters to the run-job.sh script. One is the config location, and the other is a factory class that is used to read your configuration file. The run-job.sh script is actually executing a Samza class called JobRunner. The JobRunner uses your ConfigFactory to get a Config object from the config path.
+
+```
+public interface ConfigFactory {
+  Config getConfig(URI configUri);
+}
+```
+
+The Config object is just a wrapper around Map<String, String>, with some nice helper methods. Out of the box, Samza ships with the PropertiesConfigFactory, but developers can implement any kind of ConfigFactory they wish.
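
To give a feel for what a properties-based factory has to do, here is a minimal sketch that reads a properties file from a URI into a string map. It is not Samza's PropertiesConfigFactory: the `PropertiesMapLoader` name is hypothetical, and it returns a plain `Map<String, String>` rather than a `Config` so that the example stands on its own.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration; a real ConfigFactory would return Samza's Config type.
final class PropertiesMapLoader {
  // Reads the properties file at configUri (e.g. a file:// URI) into a map.
  static Map<String, String> load(URI configUri) {
    Properties props = new Properties();
    try (InputStream in = configUri.toURL().openStream()) {
      props.load(in);
    } catch (IOException e) {
      throw new UncheckedIOException("Could not read " + configUri, e);
    }
    return toMap(props);
  }

  // Copies every string-valued property into a plain map.
  static Map<String, String> toMap(Properties props) {
    Map<String, String> config = new HashMap<>();
    for (String name : props.stringPropertyNames()) {
      config.put(name, props.getProperty(name));
    }
    return config;
  }
}
```

A factory for another format (JSON, a database, a remote service) would follow the same shape: resolve the URI, read the source, and hand back key-value pairs.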
+
+Once the JobRunner gets your configuration, it gives your configuration to the StreamJobFactory class defined by the "job.factory.class" property. Samza ships with two job factory implementations: LocalJobFactory and YarnJobFactory. The StreamJobFactory's responsibility is to give the JobRunner a job that it can run.
+
+```
+public interface StreamJob {
+  StreamJob submit();
+
+  StreamJob kill();
+
+  ApplicationStatus waitForFinish(long timeoutMs);
+
+  ApplicationStatus waitForStatus(ApplicationStatus status, long timeoutMs);
+
+  ApplicationStatus getStatus();
+}
+```
+
+Once the JobRunner gets a job, it calls submit() on the job. This method is what tells the StreamJob implementation to start the TaskRunner. In the case of LocalJobFactory, the job uses a run-task.sh script to execute the TaskRunner in a separate process, which starts one TaskRunner locally on the machine where you ran run-job.sh.
+
+![diagram](/img/0.7.0/learn/documentation/container/job-flow.png)
+
+This flow differs slightly when you use YARN, but we'll get to that later.
+
+## [Configuration &raquo;](configuration.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/5ff71e51/docs/learn/documentation/0.7.0/jobs/logging.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/logging.md b/docs/learn/documentation/0.7.0/jobs/logging.md
new file mode 100644
index 0000000..9ef9ca1
--- /dev/null
+++ b/docs/learn/documentation/0.7.0/jobs/logging.md
@@ -0,0 +1,53 @@
+---
+layout: page
+title: Logging
+---
+
+Samza uses [SLF4J](http://www.slf4j.org/) for all of its logging. By default, only slf4j-api is used, so you must add an SLF4J runtime dependency to your Samza packages for whichever underlying logging platform you wish to use.
+
+### Log4j
+
+The [hello-samza](/startup/hello-samza/0.7.0) project shows how to use [log4j](http://logging.apache.org/log4j/1.2/) with Samza. To turn on log4j logging, you just need to make sure slf4j-log4j12 is in your Samza TaskRunner's classpath. In Maven, this can be done by adding the following dependency to your Samza package project.
+
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-log4j12</artifactId>
+      <scope>runtime</scope>
+      <version>1.6.2</version>
+    </dependency>
+
+If you're not using Maven, just make sure that slf4j-log4j12 ends up in your Samza package's lib directory.
+
+#### log4j.xml
+
+Samza's [run-class.sh](packaging.html) script will automatically set the following setting if log4j.xml exists in your [Samza package's](packaging.html) lib directory.
+
+    -Dlog4j.configuration=file:$base_dir/lib/log4j.xml
+
+<!-- TODO add notes showing how to use task.opts for gc logging
+#### task.opts
+-->
+
+### Log Directory
+
+Samza will look for the _SAMZA_\__LOG_\__DIR_ environment variable when it executes. If this variable is defined, all logs will be written to this directory. If the environment variable is empty, or not defined, then Samza will use /tmp. This environment variable can also be referenced inside log4j.xml files.
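
The fallback rule above amounts to a one-line check. The sketch below restates it in Java under a hypothetical class name; it is not taken from Samza's scripts, which may implement the same rule differently.

```java
// Hypothetical illustration of the fallback rule described above.
final class LogDirResolver {
  // Uses the given value if it is set and non-empty, otherwise falls back to /tmp.
  static String resolve(String envValue) {
    return (envValue == null || envValue.isEmpty()) ? "/tmp" : envValue;
  }

  public static void main(String[] args) {
    System.out.println(resolve(System.getenv("SAMZA_LOG_DIR")));
  }
}
```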
+
+### Garbage Collection Logging
+
+Samza will automatically set the following garbage collection logging options, and will write the output to _$SAMZA_\__LOG_\__DIR_/gc.log.
+
+    -XX:+PrintGCDateStamps -Xloggc:$SAMZA_LOG_DIR/gc.log
+
+#### Rotation
+
+In older versions of Java, it is impossible to have GC logs roll over based on time or size without the use of a secondary tool. This means that your GC logs will never be deleted until a Samza job ceases to run. As of [Java 6 Update 34](http://www.oracle.com/technetwork/java/javase/2col/6u34-bugfixes-1733379.html), and [Java 7 Update 2](http://www.oracle.com/technetwork/java/javase/7u2-relnotes-1394228.html), [new GC command line switches](http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6941923) have been added to support this functionality. If you are using a version of Java that supports GC log rotation, it's highly recommended that you turn it on.
+
+### YARN
+
+When a Samza job executes on a YARN grid, the _$SAMZA_\__LOG_\__DIR_ environment variable will point to a directory that is secured such that only the user executing the Samza job can read and write to it, if YARN is [securely configured](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
+
+#### STDOUT
+
+YARN pipes all STDOUT and STDERR output to logs/stdout and logs/stderr, respectively. These files are never rotated.
+
+## [Application Master &raquo;](../yarn/application-master.html)