Posted to commits@samza.apache.org by ma...@apache.org on 2014/06/10 13:06:43 UTC

[15/21] SAMZA-7: Rewrite Container section of docs to bring it up-to-date. Reviewed by Jakob Homan.

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/26c1e27d/docs/learn/documentation/0.7.0/jobs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/configuration.md b/docs/learn/documentation/0.7.0/jobs/configuration.md
index d4a516e..3bb80ef 100644
--- a/docs/learn/documentation/0.7.0/jobs/configuration.md
+++ b/docs/learn/documentation/0.7.0/jobs/configuration.md
@@ -15,7 +15,7 @@ task.class=samza.task.example.MyJavaStreamerTask
 task.inputs=example-system.example-stream
 
 # Serializers
-serializers.registry.json.class=samza.serializers.JsonSerdeFactory
+serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
 serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
 
 # Systems
@@ -24,7 +24,12 @@ systems.example-system.samza.key.serde=string
 systems.example-system.samza.msg.serde=json
 ```
 
-There are four major sections to a configuration file. The job section defines things like the name of the job, and whether to use the YarnJobFactory or LocalJobFactory. The task section is where you specify the class name for your StreamTask. It's also where you define what the input streams are for your task. The serializers section defines the classes of the serdes used for serialization and deserialization of specific objects that are received and sent along different streams. The system section defines systems that your StreamTask can read from along with the types of serdes used for sending keys and messages from that system. Usually, you'll define a Kafka system, if you're reading from Kafka, although you can also specify your own self-implemented Samza-compatible systems. See the hello-samza example project's Wikipedia system for a good example of a self-implemented system.
+There are four major sections to a configuration file:
+
+1. The job section defines things like the name of the job, and whether to use the YarnJobFactory or the LocalJobFactory (see the example just after this list).
+2. The task section is where you specify the class name for your [StreamTask](../api/overview.html). It's also where you define what the [input streams](../container/streams.html) are for your task.
+3. The serializers section defines the classes of the [serdes](../container/serialization.html) used for serialization and deserialization of specific objects that are received and sent along different streams.
+4. The system section defines the systems that your StreamTask can read from, along with the serdes used for the keys and messages sent to and from those systems. Usually you'll define a Kafka system if you're reading from Kafka, although you can also specify your own self-implemented Samza-compatible systems. See the [hello-samza example project](/startup/hello-samza/0.7.0)'s Wikipedia system for a good example of a self-implemented system.
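+
+For example, a job section might look like this (a minimal sketch; example-job is an arbitrary name, and YarnJobFactory is shown because it's the factory used for YARN deployments):
+
+```
+# Job
+job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
+job.name=example-job
+```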
 
 ### Required Configuration
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/26c1e27d/docs/learn/documentation/0.7.0/jobs/job-runner.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/job-runner.md b/docs/learn/documentation/0.7.0/jobs/job-runner.md
index c73b234..b41a410 100644
--- a/docs/learn/documentation/0.7.0/jobs/job-runner.md
+++ b/docs/learn/documentation/0.7.0/jobs/job-runner.md
@@ -37,9 +37,7 @@ public interface StreamJob {
 }
 ```
 
-Once the JobRunner gets a job, it calls submit() on the job. This method is what tells the StreamJob implementation to start the TaskRunner. In the case of LocalJobRunner, it uses a run-container.sh script to execute the TaskRunner in a separate process, which will start one TaskRunner locally on the machine that you ran run-job.sh on.
-
-![diagram](/img/0.7.0/learn/documentation/container/job-flow.png)
+Once the JobRunner gets a job, it calls submit() on the job. This method tells the StreamJob implementation to start the SamzaContainer. In the case of LocalJobRunner, it uses the run-container.sh script to start a single SamzaContainer in a separate process on the machine where you ran run-job.sh.
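+
+For reference, a job is usually submitted from the command line along these lines (a sketch; the config factory class and the properties file path are assumptions based on a typical setup):
+
+```
+bin/run-job.sh \
+  --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
+  --config-path=file://$PWD/config/my-job.properties
+```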
 
 This flow differs slightly when you use YARN, but we'll get to that later.
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/26c1e27d/docs/learn/documentation/0.7.0/jobs/logging.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/logging.md b/docs/learn/documentation/0.7.0/jobs/logging.md
index 6bb6bf4..65a755c 100644
--- a/docs/learn/documentation/0.7.0/jobs/logging.md
+++ b/docs/learn/documentation/0.7.0/jobs/logging.md
@@ -7,7 +7,7 @@ Samza uses [SLF4J](http://www.slf4j.org/) for all of its logging. By default, Sa
 
 ### Log4j
 
-The [hello-samza](/startup/hello-samza/0.7.0) project shows how to use [log4j](http://logging.apache.org/log4j/1.2/) with Samza. To turn on log4j logging, you just need to make sure slf4j-log4j12 is in your Samza TaskRunner's classpath. In Maven, this can be done by adding the following dependency to your Samza package project.
+The [hello-samza](/startup/hello-samza/0.7.0) project shows how to use [log4j](http://logging.apache.org/log4j/1.2/) with Samza. To turn on log4j logging, you just need to make sure slf4j-log4j12 is in your SamzaContainer's classpath. In Maven, this can be done by adding the following dependency to your Samza package project.
 
     <dependency>
       <groupId>org.slf4j</groupId>
@@ -18,7 +18,7 @@ The [hello-samza](/startup/hello-samza/0.7.0) project shows how to use [log4j](h
 
 If you're not using Maven, just make sure that slf4j-log4j12 ends up in your Samza package's lib directory.
 
-#### log4j.xml
+#### Log4j configuration
 
 Samza's [run-class.sh](packaging.html) script will automatically set the following setting if log4j.xml exists in your [Samza package's](packaging.html) lib directory.
 
@@ -42,9 +42,7 @@ These settings are very useful if you're using a file-based appender. For exampl
 
 Setting up a file-based appender is recommended as a better alternative to using standard out. Standard out log files (see below) don't roll, and can get quite large if used for logging.
 
-<!-- TODO add notes showing how to use task.opts for gc logging
-#### task.opts
--->
+**NOTE:** If you use the task.opts configuration property, the log configuration is disrupted. This is a known bug; please see [SAMZA-109](https://issues.apache.org/jira/browse/SAMZA-109) for a workaround.
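+
+As a concrete example of the file-based appender recommended above, a log4j.xml along these lines could be used (a sketch; the samza.log.dir system property is an assumption about the settings mentioned earlier, so substitute your own log directory if your setup doesn't define it):
+
+    <?xml version="1.0" encoding="UTF-8" ?>
+    <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
+    <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
+      <!-- Rolling file appender; ${samza.log.dir} is assumed to be set by run-class.sh -->
+      <appender name="RollingAppender" class="org.apache.log4j.RollingFileAppender">
+        <param name="File" value="${samza.log.dir}/container.log" />
+        <param name="MaxFileSize" value="256MB" />
+        <param name="MaxBackupIndex" value="10" />
+        <layout class="org.apache.log4j.PatternLayout">
+          <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss.SSS} [%t] %-5p %c{1} - %m%n" />
+        </layout>
+      </appender>
+      <root>
+        <priority value="info" />
+        <appender-ref ref="RollingAppender" />
+      </root>
+    </log4j:configuration>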
 
 ### Log Directory
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/26c1e27d/docs/learn/documentation/0.7.0/jobs/packaging.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/packaging.md b/docs/learn/documentation/0.7.0/jobs/packaging.md
index 62c089a..4f06625 100644
--- a/docs/learn/documentation/0.7.0/jobs/packaging.md
+++ b/docs/learn/documentation/0.7.0/jobs/packaging.md
@@ -10,7 +10,7 @@ bin/run-am.sh
 bin/run-container.sh
 ```
 
-The run-container.sh script is responsible for starting the TaskRunner. The run-am.sh script is responsible for starting Samza's application master for YARN. Thus, the run-am.sh script is only used by the YarnJob, but both YarnJob and ProcessJob use run-container.sh.
+The run-container.sh script is responsible for starting the [SamzaContainer](../container/samza-container.html). The run-am.sh script is responsible for starting Samza's application master for YARN. Thus, the run-am.sh script is only used by the YarnJob, but both YarnJob and ProcessJob use run-container.sh.
 
 Typically, these two scripts are bundled into a tar.gz file that has a structure like this:
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/26c1e27d/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md b/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md
index 3d971cd..5dbbe54 100644
--- a/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md
+++ b/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md
@@ -3,7 +3,7 @@ layout: page
 title: YARN Jobs
 ---
 
-When you define job.factory.class=samza.job.yarn.YarnJobFactory in your job's configuration, Samza will use YARN to execute your job. The YarnJobFactory will use the YARN_HOME environment variable on the machine that run-job.sh is executed on to get the appropriate YARN configuration, which will define where the YARN resource manager is. The YarnJob will work with the resource manager to get your job started on the YARN cluster.
+When you define job.factory.class=org.apache.samza.job.yarn.YarnJobFactory in your job's configuration, Samza will use YARN to execute your job. The YarnJobFactory will use the YARN_HOME environment variable on the machine that run-job.sh is executed on to get the appropriate YARN configuration, which will define where the YARN resource manager is. The YarnJob will work with the resource manager to get your job started on the YARN cluster.
 
 If you want to use YARN to run your Samza job, you'll also need to define the location of your Samza job's package. For example, you might say:
 
@@ -11,6 +11,8 @@ If you want to use YARN to run your Samza job, you'll also need to define the lo
 yarn.package.path=http://my.http.server/jobs/ingraphs-package-0.0.55.tgz
 ```
 
-This .tgz file follows the conventions outlined on the [Packaging](packaging.html) page (it has bin/run-am.sh and bin/run-container.sh). YARN NodeManagers will take responsibility for downloading this .tgz file on the appropriate machines, and untar'ing them. From there, YARN will execute run-am.sh or run-container.sh for the Samza Application Master, and TaskRunner, respectively.
+This .tgz file follows the conventions outlined on the [Packaging](packaging.html) page (it has bin/run-am.sh and bin/run-container.sh). YARN NodeManagers will take responsibility for downloading this .tgz file to the appropriate machines and untarring it. From there, YARN will execute run-am.sh or run-container.sh to start the Samza Application Master and SamzaContainer, respectively.
+
+<!-- TODO document yarn.container.count and other key configs -->
 
 ## [Logging &raquo;](logging.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/26c1e27d/docs/learn/documentation/0.7.0/yarn/application-master.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/yarn/application-master.md b/docs/learn/documentation/0.7.0/yarn/application-master.md
index 0da6dc0..92e1e18 100644
--- a/docs/learn/documentation/0.7.0/yarn/application-master.md
+++ b/docs/learn/documentation/0.7.0/yarn/application-master.md
@@ -7,7 +7,7 @@ YARN is Hadoop's next-generation cluster manager. It allows developers to deploy
 
 ### Integration
 
-Samza's main integration with YARN comes in the form of a Samza ApplicationMaster. This is the chunk of code responsible for managing a Samza job in a YARN grid. It decides what to do when a stream processor fails, which machines a Samza job's [TaskRunner](../container/task-runner.html) should run on, and so on.
+Samza's main integration with YARN comes in the form of a Samza ApplicationMaster. This is the chunk of code responsible for managing a Samza job in a YARN grid. It decides what to do when a stream processor fails, which machines a Samza job's [containers](../container/samza-container.html) should run on, and so on.
 
 When the Samza ApplicationMaster starts up, it does the following:
 
@@ -25,11 +25,11 @@ From this point on, the ApplicationMaster just reacts to events from the RM.
 
 ### Fault Tolerance
 
-Whenever a container is allocated, the AM will work with the YARN NM to start a TaskRunner (with appropriate partitions assigned to it) in the container. If a container fails with a non-zero return code, the AM will request a new container, and restart the TaskRunner. If a TaskRunner fails too many times, too quickly, the ApplicationMaster will fail the whole Samza job with a non-zero return code. See the yarn.countainer.retry.count and yarn.container.retry.window.ms [configuration](../jobs/configuration.html) parameters for details.
+Whenever a container is allocated, the AM will work with the YARN NM to start a SamzaContainer (with appropriate partitions assigned to it) in the container. If a container fails with a non-zero return code, the AM will request a new container and restart the SamzaContainer. If a SamzaContainer fails too many times, too quickly, the ApplicationMaster will fail the whole Samza job with a non-zero return code. See the yarn.container.retry.count and yarn.container.retry.window.ms [configuration](../jobs/configuration.html) parameters for details.
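+
+For example, to tolerate up to five container restarts within a five-minute window before the whole job is failed, you might set something like this (a sketch; the values are arbitrary):
+
+```
+yarn.container.retry.count=5
+yarn.container.retry.window.ms=300000
+```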
 
 When the AM receives a reboot signal from YARN, it will throw a SamzaException. This will trigger a clean and successful shutdown of the AM (YARN won't think the AM failed).
 
-If the AM, itself, fails, YARN will handle restarting the AM. When the AM is restarted, all containers that were running will be killed, and the AM will start from scratch. The same list of operations, shown above, will be executed. The AM will request new containers for its TaskRunners, and proceed as though it has just started for the first time. YARN has a yarn.resourcemanager.am.max-retries configuration parameter that's defined in [yarn-site.xml](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml). This configuration defaults to 1, which means that, by default, a single AM failure will cause your Samza job to stop running.
+If the AM itself fails, YARN will handle restarting the AM. When the AM is restarted, all containers that were running will be killed, and the AM will start from scratch. The same list of operations, shown above, will be executed. The AM will request new containers for its SamzaContainers, and proceed as though it had just started for the first time. YARN has a yarn.resourcemanager.am.max-retries configuration parameter that's defined in [yarn-site.xml](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml). This configuration defaults to 1, which means that, by default, a single AM failure will cause your Samza job to stop running.
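+
+If you want YARN to retry the AM on failure, you can raise that limit in your cluster's yarn-site.xml, for example (a sketch; the value is arbitrary):
+
+```
+<property>
+  <name>yarn.resourcemanager.am.max-retries</name>
+  <value>3</value>
+</property>
+```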
 
 ### Dashboard
 
@@ -42,7 +42,7 @@ Samza's ApplicationMaster comes with a dashboard to show useful information such
 
 You can find this dashboard by going to your YARN grid's ResourceManager page (usually something like [http://localhost:8088/cluster](http://localhost:8088/cluster)), and clicking on the "ApplicationMaster" link of a running Samza job.
 
-![diagram](/img/0.7.0/learn/documentation/yarn/samza-am-dashboard.png)
+<img src="/img/0.7.0/learn/documentation/yarn/samza-am-dashboard.png" alt="Screenshot of ApplicationMaster dashboard" class="diagram-large">
 
 ### Security
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/26c1e27d/docs/learn/documentation/0.7.0/yarn/isolation.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/yarn/isolation.md b/docs/learn/documentation/0.7.0/yarn/isolation.md
index c685729..1a4f315 100644
--- a/docs/learn/documentation/0.7.0/yarn/isolation.md
+++ b/docs/learn/documentation/0.7.0/yarn/isolation.md
@@ -13,7 +13,7 @@ YARN currently supports resource management for memory and CPU.
 
 YARN will automatically enforce memory limits for all containers that it executes. All containers must have a max-memory size defined when they're created. If the sum of all memory usage for processes associated with a single YARN container exceeds this maximum, YARN will kill the container.
 
-Samza supports memory limits using the yarn.container.memory.mb and yarn.am.container.memory.mb configuration parameters. Keep in mind that this is simply the amount of memory YARN will allow a Samza [TaskRunner](../container/task-runner.html) or [ApplicationMaster](application-master.html) to have. You'll still need to configure your heap settings appropriately using task.opts, when using Java (the default is -Xmx160M). See the [Configuration](../jobs/configuration.html) and [Packaging](../jobs/packaging.html) pages for details.
+Samza supports memory limits using the yarn.container.memory.mb and yarn.am.container.memory.mb configuration parameters. Keep in mind that this is simply the amount of memory YARN will allow a [SamzaContainer](../container/samza-container.html) or [ApplicationMaster](application-master.html) to have. You'll still need to configure your JVM heap settings appropriately using task.opts when running Java (the default is -Xmx160M). See the [Configuration](../jobs/configuration.html) and [Packaging](../jobs/packaging.html) pages for details.
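+
+For example, a job that needs a larger heap might use settings along these lines (a sketch; the values are arbitrary, and the JVM heap should be kept comfortably below the container memory limit):
+
+```
+yarn.container.memory.mb=1024
+yarn.am.container.memory.mb=1024
+task.opts=-Xmx768M
+```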
 
 ### CPU