You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by ja...@apache.org on 2018/11/27 09:48:54 UTC

[47/50] samza git commit: Clean up standalone docs

Clean up standalone docs


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/d6b89882
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/d6b89882
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/d6b89882

Branch: refs/heads/master
Commit: d6b8988215f959348c4b740b9d40064c3599fdbc
Parents: 072cff6
Author: Jagadish <jv...@linkedin.com>
Authored: Tue Nov 27 00:26:06 2018 -0800
Committer: Jagadish <jv...@linkedin.com>
Committed: Tue Nov 27 00:26:06 2018 -0800

----------------------------------------------------------------------
 .../versioned/deployment/standalone.md          | 177 ++++---------------
 1 file changed, 35 insertions(+), 142 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/d6b89882/docs/learn/documentation/versioned/deployment/standalone.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/deployment/standalone.md b/docs/learn/documentation/versioned/deployment/standalone.md
index 146f9cb..d8bd541 100644
--- a/docs/learn/documentation/versioned/deployment/standalone.md
+++ b/docs/learn/documentation/versioned/deployment/standalone.md
@@ -1,6 +1,6 @@
 ---
 layout: page
-title: Run as embedded library.
+title: Run as embedded library
 ---
 <!--
    Licensed to the Apache Software Foundation (ASF) under one or more
@@ -20,145 +20,46 @@ title: Run as embedded library.
 -->
 
 - [Introduction](#introduction)
-- [User guide](#user-guide)
-     - [Setup dependencies](#setup-dependencies)
-     - [Configuration](#configuration)
-     - [Code sample](#code-sample)
-- [Quick start guide](#quick-start-guide)
-  - [Setup zookeeper](#setup-zookeeper)
-  - [Setup kafka](#setup-kafka)
-  - [Build binaries](#build-binaries)
-  - [Deploy binaries](#deploy-binaries)
-  - [Inspect results](#inspect-results)
+- [Quick start](#quick-start-guide)
+  - [Installing Zookeeper and Kafka](#setup-zookeeper)
+  - [Building binaries](#build-binaries)
+  - [Running the application](#deploy-binaries)
+  - [Inspecting results](#inspect-results)
 - [Coordinator internals](#coordinator-internals)
 
-#
 
-# Introduction
+### Introduction
 
-With Samza 0.13.0, the deployment model of samza jobs has been simplified and decoupled from YARN. _Standalone_ model provides the stream processing capabilities of samza packaged in the form of a library with pluggable coordination. This library model offers an easier integration path  and promotes a flexible deployment model for an application. Using the standalone mode, you can leverage Samza processors directly in your application and deploy Samza applications to self-managed clusters.
+Often you want to embed and integrate Samza as a component within a larger application. To enable this, Samza supports a standalone mode of deployment allowing greater control over your application’s life-cycle. In this model, Samza can be used just like any library you import within your Java application. This is identical to using a high-level Kafka consumer to process your streams.
 
-A standalone application typically is comprised of multiple _stream processors_. A _stream processor_ encapsulates a user defined processing function and is responsible for processing a subset of input topic partitions. A stream processor of a standalone application is uniquely identified by a _processorId_.
+You can increase your application’s capacity by spinning up multiple instances. These instances will then dynamically coordinate with each other and distribute work among themselves. If an instance fails for some reason, the tasks running on it will be re-assigned to the remaining ones. By default, Samza uses Zookeeper for coordination across individual instances. The coordination logic by itself is pluggable and hence, can integrate with other frameworks.
 
-Samza provides pluggable job _coordinator_ layer to perform leader election and assign work to the stream processors. Standalone supports Zookeeper coordination out of the box and uses it for distributed coordination between the stream processors of standalone application. A processor can become part of a standalone application by setting its app.name(Ex: app.name=group\_1) and joining the group.
+This mode allows you to bring any cluster-manager or hosting-environment of your choice(eg: Kubernetes, Marathon) to run your application. You are also free to control memory-limits, multi-tenancy on your own - since Samza is used as a light-weight library.
 
-In samza standalone, the input topic partitions are distributed between the available processors dynamically at runtime. In each standalone application, one stream processor will be chosen as a leader initially to mediate the assignment of input topic partitions to the stream processors. If the number of available processors changes(for example, if a processors is shutdown or added), then the leader processor will regenerate the partition assignments and re-distribute it to all the processors.
 
-On processor group change, the act of re-assigning input topic partitions to the remaining live processors in the group is known as rebalancing the group. On failure of the leader processor of a standalone application, an another stream processor of the standalone application will be chosen as leader.
+### Quick start
 
-## User guide
-
-Samza standalone is designed to help you to have more control over the deployment of the application. So it is your responsibility to configure and deploy the processors. In case of ZooKeeper coordination, you have to configure the URL for an instance of ZooKeeper.
-
-A stream processor is identified by a unique processorID which is generated by the pluggable ProcessorIdGenerator abstraction. ProcessorId of the stream processor is used with the coordination service. Samza supports UUID based ProcessorIdGenerator out of the box.
-
-The diagram below shows a input topic with three partitions and an standalone application with three processors consuming messages from it.
-
-<img src="/img/versioned/learn/documentation/standalone/standalone-application.jpg" alt="Standalone application" height="550px" width="700px" align="middle">
-
-When a group is first initialized, each stream processor typically starts processing messages from either the earliest or latest offset of the input topic partition. The messages in each partition are sequentially delivered to the user defined processing function. As the stream processor makes progress, it commits the offsets of the messages it has successfully processed. For example, in the figure above, the stream processor position is at offset 7 and its last committed offset is at offset 3.
-
-When a input partition is reassigned to another processor in the group, the initial position is set to the last committed offset. If the processor-1 in the example above suddenly crashed, then the live processor taking over the partition would begin consumption from offset 3. In that case, it would not have to reprocess the messages up to the crashed processor's position of 3.
-
-### Setup dependencies
-
-Add the following samza-standalone maven dependencies to your project.
-
-```xml
-<dependency>
-    <groupId>org.apache.samza</groupId>
-    <artifactId>samza-kafka_2.11</artifactId>
-    <version>1.0</version>
-</dependency>
-<dependency>
-    <groupId>org.apache.samza</groupId>
-    <artifactId>samza-core_2.11</artifactId>
-    <version>1.0</version>
-</dependency>
-<dependency>
-    <groupId>org.apache.samza</groupId>
-    <artifactId>samza-api</artifactId>
-    <version>1.0</version>
-</dependency>
-```
-
-### Configuration
-
-A samza standalone application requires you to define the following mandatory configurations:
-
-```bash
-job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
-job.coordinator.zk.connect=your_zk_connection(for local zookeeper, use localhost:2181)
-task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory 
-```
-
-You have to configure the stream processor with the kafka brokers as defined in the following sample(we have assumed that the broker is running on localhost):
-
-```bash 
-systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
-systems.kafka.samza.msg.serde=json
-systems.kafka.consumer.zookeeper.connect=localhost:2181
-systems.kafka.producer.bootstrap.servers=localhost:9092 
-```
-
-### Code sample
-
-Here&#39;s a sample standalone application with app.name set to sample-test. Running this class would launch a stream processor.
-
-```java
-public class PageViewEventExample implements StreamApplication {
-
-  public static void main(String[] args) {
-    CommandLine cmdLine = new CommandLine();
-    OptionSet options = cmdLine.parser().parse(args);
-    Config config = cmdLine.loadConfig(options);
-
-    ApplicationRunner runner = ApplicationRunners.getApplicationRunner(ApplicationClassUtils.fromConfig(config), config);
-    runner.run();
-    runner.waitForFinish();
-  }
-
-  @Override
-  public void describe(StreamAppDescriptor appDesc) {
-     MessageStream<PageViewEvent> pageViewEvents = null;
-     pageViewEvents = appDesc.getInputStream("inputStream", new JsonSerdeV2<>(PageViewEvent.class));
-     OutputStream<KV<String, PageViewCount>> pageViewEventPerMemberStream =
-         appDesc.getOutputStream("outputStream",  new JsonSerdeV2<>(PageViewEvent.class));
-     pageViewEvents.sendTo(pageViewEventPerMemberStream);
-  }
-}
-```
-
-## Quick start guide
-
-The [Hello-samza](https://github.com/apache/samza-hello-samza/) project contains sample Samza standalone applications. Here are step by step instruction guide to install, build and run a standalone application binaries using the local zookeeper cluster for coordination. Check out the hello-samza project by running the following commands:
+The [Hello-samza](https://github.com/apache/samza-hello-samza/) project includes multiple examples of Samza standalone applications. Let us first check out the repository.
 
 ```bash
 git clone https://git.apache.org/samza-hello-samza.git hello-samza
 cd hello-samza 
 ```
 
-### Setup Zookeeper
-
-Run the following command to install and start a local zookeeper cluster.
-
-```bash
-./bin/grid install zookeeper
-./bin/grid start zookeeper
-```
 
-### Setup Kafka
+#### Installing Zookeeper and Kafka
 
-Run the following command to install and start a local kafka cluster.
+We will use the `./bin/grid` script from the `hello-samza` project to setup up Zookeeper and Kafka locally.
 
 ```bash
-./bin/grid install kafka
+./bin/grid start zookeeper
 ./bin/grid start zookeeper
 ```
 
-### Build binaries
 
-Before you can run the standalone job, you need to build a package for it using the following command.
+#### Building the binaries
+
+Let us now build the `hello-samza` project from its sources.
 
 ```bash
 mvn clean package
@@ -166,52 +67,44 @@ mkdir -p deploy/samza
 tar -xvf ./target/hello-samza-1.0.0-dist.tar.gz -C deploy/samza
 ```
 
-### Deploy binaries
+#### Running the application
 
-To run the sample standalone application [WikipediaZkLocalApplication](https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/wikipedia/application/WikipediaZkLocalApplication.java)
+We are ready to run the example application [WikipediaZkLocalApplication](https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/wikipedia/application/WikipediaZkLocalApplication.java). This application reads messages from the wikipedia-edits topic, and calculates counts, every ten seconds, for all edits that were made during that window. It emits these results to another topic named `wikipedia-stats`.
 
 ```bash
-./bin/deploy.sh
 ./deploy/samza/bin/run-class.sh samza.examples.wikipedia.application.WikipediaZkLocalApplication  --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-application-local-runner.properties
 ```
 
-### Inspect results
+You can run the above command again to spin up a new instance of your application.
 
-The standalone application reads messages from the wikipedia-edits topic, and calculates counts, every ten seconds, for all edits that were made during that window. It outputs these counts to the local wikipedia-stats kafka topic. To inspect events in output topic, run the following command.
+#### Inspecting results
+
+To inspect events in output topic, run the following command.
 
 ```bash
 ./deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic wikipedia-stats
 ```
 
-Events produced to the output topic from the standalone application launched above will be of the following form:
+You should see the output messages emitted to the Kafka topic.
 
 ```
 {"is-talk":2,"bytes-added":5276,"edits":13,"unique-titles":13}
 {"is-bot-edit":1,"is-talk":3,"bytes-added":4211,"edits":30,"unique-titles":30,"is-unpatrolled":1,"is-new":2,"is-minor":7}
 ```
 
-# Coordinator internals
-
-A samza application is comprised of multiple stream processors. A processor can become part of a standalone application by setting its app.name(Ex: app.name=group\_1) and joining the group. In samza standalone, the input topic partitions are distributed between the available processors dynamically at runtime. If the number of available processors changes(for example, if some processors are shutdown or added), then the partition assignments will be regenerated and re-distributed to all the processors. One processor will be elected as leader and it will generate the partition assignments and distribute it to the other processors in the group.
-
-To mediate the partition assignments between processors, samza standalone relies upon a coordination service. The main responsibilities of coordination service are the following:
-
-**Leader Election** - Elects a single processor to generate the partition assignments and distribute it to other processors in the group.
-
-**Distributed barrier** - Coordination primitive used by the processors to reach consensus(agree) on an partition assignment.
-
-By default, embedded samza uses Zookeeper for coordinating between processors of an application and store the partition assignment state. Coordination sequence for a standalone application is listed below:
-
-1. Each processor(participant) will register with the coordination service(e.g: Zookeeper) with its participant ID.
+### Standalone Coordinator Internals
 
-2. One of the participants will be elected as the leader.
+Samza runs your application by logically breaking its execution down into multiple tasks. A task is the unit of parallelism for your application, with each task consumeing data from one or more partitions of your input streams.
+Obviously, there is a limit to the throughput you can achieve with a single instance. You can increase parallelism by simply running multiple instances of your application. The individual instances can be executed on the same machine, or distributed across
+machines. Likewise, to scale down your parallelism, you can shut down a few instances of your application. Samza will coordinate among available instances and dynamically assign the tasks them. The coordination logic itself is 
+pluggable - with a Zookeeper-based implementation provided out of the box.
 
-3. The leader will monitor the list of all the active participants.
+Here is a typical sequence of actions when using Zookeeper for coordination.
 
-4. Whenever the list of the participants changes in a group, the leader will generate a new partition assignments for the current participants and persist it to a common storage.
+1. Everytime you spawn a new instance of your application, it registers itself with Zookeeper as a participant.
 
-5. Participants are notified that the new partition assignment is available. Notification is done through the coordination service(e.g. ZooKeeper).
+2. There is always a single leader - which acts as the coordinator. The coordinator which manages the assignment of tasks across the individual containers. The coordinator also monitors the liveness of individual containers and redistributes the tasks among the remaining ones during a failure.  
 
-6. The participants will stop processing, pick up the new partition assignment, and then resume processing.
+3. Whenever a new instance joins or leaves the group, it triggers a notification to the leader. The leader can then recompute assignments of tasks to the live instances.
 
-In order to ensure that no two partitions are processed by different processors, processing is paused and all the processors will synchronize on a distributed barrier. Once all the processors are paused, the new partition assignments are applied, after which processing resumes.
\ No newline at end of file
+4. Once the leader publishes new partition assignments, all running instances pick up the new assignment and resume processing.