Posted to commits@samza.apache.org by ja...@apache.org on 2018/11/27 11:22:44 UTC

samza git commit: Use consistent font/heading sizes for all pages

Repository: samza
Updated Branches:
  refs/heads/master 6cdcdefd4 -> f86749b69


Use consistent font/heading sizes for all pages

Author: Jagadish <jv...@linkedin.com>

Reviewers: Jagadish <ja...@apache.org>

Closes #819 from vjagadish1989/website-reorg34


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/f86749b6
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/f86749b6
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/f86749b6

Branch: refs/heads/master
Commit: f86749b69ff9022dce2a1afadb3793d4c92da802
Parents: 6cdcdef
Author: Jagadish <jv...@linkedin.com>
Authored: Tue Nov 27 03:21:25 2018 -0800
Committer: Jagadish <jv...@linkedin.com>
Committed: Tue Nov 27 03:21:25 2018 -0800

----------------------------------------------------------------------
 .../versioned/architecture/architecture-overview.md | 10 +++++-----
 .../documentation/versioned/connectors/eventhubs.md | 10 +++++-----
 .../documentation/versioned/connectors/hdfs.md      | 16 ++++++++--------
 .../documentation/versioned/connectors/kinesis.md   |  6 +++---
 .../versioned/core-concepts/core-concepts.md        | 14 +++++++-------
 .../documentation/versioned/deployment/yarn.md      | 13 ++++++-------
 6 files changed, 34 insertions(+), 35 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/f86749b6/docs/learn/documentation/versioned/architecture/architecture-overview.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/architecture/architecture-overview.md b/docs/learn/documentation/versioned/architecture/architecture-overview.md
index 282352c..8bfe574 100644
--- a/docs/learn/documentation/versioned/architecture/architecture-overview.md
+++ b/docs/learn/documentation/versioned/architecture/architecture-overview.md
@@ -49,7 +49,7 @@ Just like a task is the logical unit of parallelism for your application, a cont
 Each application also has a coordinator which manages the assignment of tasks across the individual containers. The coordinator monitors the liveness of individual containers and redistributes the tasks among the remaining ones during a failure. <br/><br/>
 The coordinator itself is pluggable, enabling Samza to support multiple deployment options. You can use Samza as a lightweight embedded library that easily integrates with a larger application. Alternatively, you can deploy and run it as a managed framework using a cluster-manager like YARN. It is worth noting that Samza is the only system that offers first-class support for both these deployment options. Some systems, like Kafka Streams, only support the embedded-library model, while others, like Flink and Spark Streaming, only offer the framework model for stream processing.
 
-## Threading model and ordering
+### Threading model and ordering
 
 Samza offers a flexible threading model to run each task. When running your applications, you can control the number of workers needed to process your data. You can also configure the number of threads each worker uses to run its assigned tasks. Each thread can run one or more tasks. Tasks don’t share any state - hence, you don’t have to worry about coordination across these threads. 
 
@@ -57,14 +57,14 @@ Another common scenario in stream processing is to interact with remote services
 By default, all messages delivered to a task are processed by the same thread. This guarantees in-order processing of messages within a partition. However, some applications don’t care about in-order processing of messages. For such use-cases, Samza also supports processing messages out-of-order within a single partition. This typically offers higher throughput by allowing for multiple concurrent messages in each partition.
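For reference, both knobs are exposed as plain job configurations; a minimal sketch with illustrative values (exact names and defaults may vary by Samza version):

{% highlight properties %}
# Number of workers (containers) requested for the job
job.container.count=4
# Size of the thread pool each worker uses to run its assigned tasks
job.container.thread.pool.size=8
# Values greater than 1 permit out-of-order processing within a partition
task.max.concurrency=4
{% endhighlight %}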
 
-## Incremental checkpointing 
+### Incremental checkpointing 
 ![diagram-large](/img/{{site.version}}/learn/documentation/architecture/incremental-checkpointing.png)
 
 Samza guarantees that messages won’t be lost even if your job crashes, a machine dies, there is a network fault, or something else goes wrong. To achieve this property, each task periodically persists the last processed offsets for its input stream partitions. If a task needs to be restarted on a different worker due to a failure, it resumes processing from its latest checkpoint. 
 
 Samza’s checkpointing mechanism ensures that each task also stores the contents of its state-store consistently with its last processed offsets. Checkpoints are flushed incrementally, i.e., the state-store only flushes the delta since the previous checkpoint instead of flushing its entire state.
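As a sketch, checkpointing is enabled through job configuration, assuming Kafka-backed checkpoints and an illustrative commit interval:

{% highlight properties %}
# Persist per-task checkpoints to a Kafka topic
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Flush checkpoints (and state-store deltas) every 60 seconds
task.commit.ms=60000
{% endhighlight %}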
 
-## State management
+### State management
 Samza offers scalable, high-performance storage to enable you to build stateful stream-processing applications. This is implemented by associating each Samza task with its own instance of a local database (aka. a state-store). The state-store associated with a particular task only stores data corresponding to the partitions processed by that task. This is important: when you scale out your job by giving it more computing resources, Samza transparently migrates the tasks from one machine to another. By giving each task its own state, tasks can be relocated without affecting your overall application. 
 ![diagram-large](/img/{{site.version}}/learn/documentation/architecture/state-store.png)
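A local state-store is declared per store through configuration; a minimal sketch, where the store name `my-store` and the serdes are illustrative:

{% highlight properties %}
# Back each task's instance of my-store with an embedded RocksDB database
stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.my-store.key.serde=string
stores.my-store.msg.serde=integer
{% endhighlight %}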
 
@@ -74,11 +74,11 @@ Here are some key advantages of this architecture. <br/>
 - Each job has its own store, to avoid the isolation issues in a shared remote database (if you make an expensive query, it affects only the current task, nobody else). <br/>
 - Different storage engines can be plugged in - for example, a remote data-store that enables richer query capabilities <br/>
 
-## Fault tolerance of state
+### Fault tolerance of state
 Distributed stream processing systems need to recover quickly from failures to resume their processing. While having a durable local store offers great performance, we should still guarantee fault-tolerance. For this purpose, Samza replicates every change to the local store into a separate stream (called a _changelog_ for the store). This allows you to later recover the data in the store by reading the contents of the changelog from the beginning. A log-compacted Kafka topic is typically used as a changelog, since Kafka automatically retains the most recent value for each key.
 ![diagram-large](/img/{{site.version}}/learn/documentation/architecture/fault-tolerance.png)
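Enabling a changelog for a store is a single line of configuration; a sketch assuming the `my-store` example above and a Kafka system named `kafka`:

{% highlight properties %}
# Replicate every write to my-store into a log-compacted Kafka topic
stores.my-store.changelog=kafka.my-store-changelog
{% endhighlight %}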
 
-## Host affinity
+### Host affinity
 If your application has several terabytes of state, then bootstrapping it every time by reading the changelog will stall progress. So, it’s critical to be able to recover state swiftly during failures. For this purpose, Samza takes data-locality into account when scheduling tasks on hosts. This is implemented by persisting metadata about the host each task is currently running on. 
 
 During a new deployment of the application, Samza tries to re-schedule the tasks on the same hosts they were previously on. This enables the task to re-use the snapshot of its local-state from its previous run on that host. We call this feature _host-affinity_ since it tries to preserve the assignment of tasks to hosts. This is a key differentiator that enables Samza applications to scale to several terabytes of local-state with effectively zero downtime.
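Host-affinity is controlled by a single flag; a sketch (property name per recent Samza releases):

{% highlight properties %}
# Prefer re-scheduling each task on the host that holds its local state
job.host-affinity.enabled=true
{% endhighlight %}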

http://git-wip-us.apache.org/repos/asf/samza/blob/f86749b6/docs/learn/documentation/versioned/connectors/eventhubs.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/connectors/eventhubs.md b/docs/learn/documentation/versioned/connectors/eventhubs.md
index 9fdc861..11de3ff 100644
--- a/docs/learn/documentation/versioned/connectors/eventhubs.md
+++ b/docs/learn/documentation/versioned/connectors/eventhubs.md
@@ -19,7 +19,7 @@ title: Event Hubs Connector
    limitations under the License.
 -->
 
-## EventHubs I/O: QuickStart
+### EventHubs I/O: QuickStart
 
 The Samza EventHubs connector provides access to [Azure EventHubs](https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-features), Microsoft’s data streaming service on Azure. An event hub is similar to a Kafka topic and can have multiple partitions with producers and consumers. Each message produced or consumed from an event hub is an instance of [EventData](https://docs.microsoft.com/en-us/java/api/com.microsoft.azure.eventhubs._event_data). 
 
@@ -67,9 +67,9 @@ Hence, you should also provide your SAS keys and tokens to access the stream. Yo
 #### Data Model
 Each event produced and consumed from an EventHubs stream is an instance of [EventData](https://docs.microsoft.com/en-us/java/api/com.microsoft.azure.eventhubs._event_data), which wraps a byte-array payload. When producing to EventHubs, Samza serializes your object into an `EventData` payload before sending it over the wire. Likewise, when consuming messages from EventHubs, messages are de-serialized into typed objects using the provided Serde. 
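For example, a consumer that de-serializes each payload into a String can be declared with the descriptor API; a minimal sketch in which the stream, namespace, entity, and SAS credentials are placeholders:

{% highlight java %}
EventHubsSystemDescriptor systemDescriptor = new EventHubsSystemDescriptor("eventhubs");

// Each EventData payload is de-serialized into a String by the provided serde
EventHubsInputDescriptor<KV<String, String>> inputDescriptor =
    systemDescriptor.getInputDescriptor("my-stream", "my-namespace", "my-entity", new StringSerde())
        .withSasKeyName("YOUR-SAS-KEY-NAME")
        .withSasKey("YOUR-SAS-KEY");
{% endhighlight %}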
 
-## Configuration
+### Configuration
 
-###Producer partitioning
+#### Producer partitioning
 
 You can use `#withPartitioningMethod` to control how outgoing messages are partitioned. The following partitioning schemes are supported:
 
@@ -85,7 +85,7 @@ EventHubsSystemDescriptor systemDescriptor = new EventHubsSystemDescriptor("even
 {% endhighlight %}
 
 
-### Consumer groups
+#### Consumer groups
 
 Event Hubs supports the notion of [consumer groups](https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-features#consumer-groups) which enable multiple applications to have their own view of the event stream. Each partition is exclusively consumed by one consumer in the group. Each event hub stream has a pre-defined consumer group named $Default. You can define your own consumer group for your job using `withConsumerGroup`.
 
@@ -97,7 +97,7 @@ EventHubsInputDescriptor<KV<String, String>> inputDescriptor =
 {% endhighlight %}
 
 
-### Consumer buffer size
+#### Consumer buffer size
 
 When the consumer reads a message from EventHubs, it appends the message to a shared producer-consumer queue corresponding to its partition. This config determines the per-partition queue size. Setting a higher value for this config typically achieves higher throughput at the expense of increased on-heap memory.
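A sketch of tuning this for a system named `eventhubs`; the property name below follows the connector's naming scheme but is an assumption - check the configuration reference for the exact key:

{% highlight properties %}
# Per-partition size of the shared producer-consumer queue (assumed key)
systems.eventhubs.eventhubs.receive.queue.size=100
{% endhighlight %}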
 

http://git-wip-us.apache.org/repos/asf/samza/blob/f86749b6/docs/learn/documentation/versioned/connectors/hdfs.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/connectors/hdfs.md b/docs/learn/documentation/versioned/connectors/hdfs.md
index 9b79f24..ece7bbf 100644
--- a/docs/learn/documentation/versioned/connectors/hdfs.md
+++ b/docs/learn/documentation/versioned/connectors/hdfs.md
@@ -19,17 +19,17 @@ title: HDFS Connector
    limitations under the License.
 -->
 
-## Overview
+### Overview
 
 The HDFS connector allows your Samza jobs to read data stored in HDFS files. Likewise, you can write processed results to HDFS. 
 To interact with HDFS, Samza requires your job to run on the same YARN cluster.
 
-## Consuming from HDFS
-### Input Partitioning
+### Consuming from HDFS
+#### Input Partitioning
 
 Partitioning works at the level of individual directories and files. Each directory is treated as its own stream and each of its files is treated as a _partition_. For example, Samza creates 5 partitions when it's reading from a directory containing 5 files. There is no way to parallelize the consumption when reading from a single file - you can only have one container to process the file.
 
-### Input Event format
+#### Input Event format
 Samza supports Avro natively, and it's easy to extend to other serialization formats. Each Avro record read from HDFS is wrapped into a message-envelope. The [envelope](../api/javadocs/org/apache/samza/system/IncomingMessageEnvelope.html) contains these 3 fields:
 
 - The key, which is empty
@@ -40,12 +40,12 @@ Samza supports avro natively, and it's easy to extend to other serialization for
 
 To support non-avro input formats, you can implement the [SingleFileHdfsReader](https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java) interface.
 
-### EndOfStream
+#### EndOfStream
 
 While streaming sources like Kafka are unbounded, files on HDFS have finite data and have a notion of EOF. When reading from HDFS, your Samza job automatically exits after consuming all the data. You can implement [EndOfStreamListenerTask](../api/javadocs/org/apache/samza/task/EndOfStreamListenerTask.html) to get a callback once EOF has been reached. 
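A minimal sketch of such a listener combined with a low-level task; the class name is illustrative:

{% highlight java %}
public class MyHdfsTask implements StreamTask, EndOfStreamListenerTask {
  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    // Process each record read from the HDFS files
  }

  @Override
  public void onEndOfStream(MessageCollector collector, TaskCoordinator coordinator) {
    // Invoked once after all input files have been fully consumed
  }
}
{% endhighlight %}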
 
 
-### Defining streams
+#### Defining streams
 
 Samza uses the notion of a _system_ to describe any I/O source it interacts with. To consume from HDFS, you should create a new system that points to `HdfsSystemFactory`. You can then associate multiple streams with this _system_. Each stream should have a _physical name_, which should be set to the name of the directory on HDFS.
 
@@ -68,7 +68,7 @@ systems.hdfs.partitioner.defaultPartitioner.blacklist=somefile.avro
 {% endhighlight %}
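For completeness, a sketch of declaring the system and mapping a stream to an HDFS directory; the stream name and path are placeholders:

{% highlight properties %}
# Declare an HDFS system backed by HdfsSystemFactory
systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory
# Map a stream to a directory on HDFS via its physical name
streams.hdfs-input.samza.system=hdfs
streams.hdfs-input.samza.physical.name=hdfs://localhost:9000/path/to/folder
{% endhighlight %}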
 
 
-## Producing to HDFS
+### Producing to HDFS
 
 #### Output format
 
@@ -104,7 +104,7 @@ systems.hdfs.producer.hdfs.write.batch.size.bytes=134217728
 systems.hdfs.producer.hdfs.write.batch.size.records=10000
 {% endhighlight %}
 
-## Security 
+### Security 
 
 You can access Kerberos-enabled HDFS clusters by providing your principal and the path to your keytab file. Samza takes care of automatically creating and renewing your Kerberos tokens periodically. 
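A sketch of what this looks like; the property names here are assumptions, so consult the configuration reference for the exact keys:

{% highlight properties %}
# Principal and keytab used to authenticate with a Kerberized cluster (assumed keys)
sensitive.hdfs.kerberos.principal=user@EXAMPLE.COM
sensitive.hdfs.kerberos.keytab.path=/etc/security/keytabs/user.keytab
{% endhighlight %}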
 

http://git-wip-us.apache.org/repos/asf/samza/blob/f86749b6/docs/learn/documentation/versioned/connectors/kinesis.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/connectors/kinesis.md b/docs/learn/documentation/versioned/connectors/kinesis.md
index 85149f6..e319e92 100644
--- a/docs/learn/documentation/versioned/connectors/kinesis.md
+++ b/docs/learn/documentation/versioned/connectors/kinesis.md
@@ -19,7 +19,7 @@ title: Kinesis Connector
    limitations under the License.
 -->
 
-## Kinesis I/O: Quickstart
+### Kinesis I/O: Quickstart
 
 The Samza Kinesis connector allows you to interact with [Amazon Kinesis Data Streams](https://aws.amazon.com/kinesis/data-streams),
 Amazon’s data streaming service. The `hello-samza` project includes an example of processing Kinesis streams using Samza. Here is the complete [source code](https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/kinesis/KinesisHelloSamza.java) and [configs](https://github.com/apache/samza-hello-samza/blob/master/src/main/config/kinesis-hello-samza.properties).
@@ -32,9 +32,9 @@ Each message consumed from the stream is an instance of a Kinesis [Record](http:
 Samza’s [KinesisSystemConsumer](https://github.com/apache/samza/blob/master/samza-aws/src/main/java/org/apache/samza/system/kinesis/consumer/KinesisSystemConsumer.java)
 wraps the Record into a [KinesisIncomingMessageEnvelope](https://github.com/apache/samza/blob/master/samza-aws/src/main/java/org/apache/samza/system/kinesis/consumer/KinesisIncomingMessageEnvelope.java).
 
-## Consuming from Kinesis
+### Consuming from Kinesis
 
-### Basic Configuration
+#### Basic Configuration
 
 Here is the required configuration for consuming messages from Kinesis. 
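A sketch of such a configuration, where the system name, stream name, region, and credentials are placeholders:

{% highlight properties %}
# Declare a Kinesis system backed by KinesisSystemFactory
systems.kinesis.samza.factory=org.apache.samza.system.kinesis.KinesisSystemFactory
streams.kinesis-stream.samza.system=kinesis
# Region and credentials for the stream
systems.kinesis.streams.kinesis-stream.aws.region=us-east-1
systems.kinesis.streams.kinesis-stream.aws.accessKey=YOUR-ACCESS-KEY
sensitive.systems.kinesis.streams.kinesis-stream.aws.secretKey=YOUR-SECRET-KEY
{% endhighlight %}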
 

http://git-wip-us.apache.org/repos/asf/samza/blob/f86749b6/docs/learn/documentation/versioned/core-concepts/core-concepts.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/core-concepts/core-concepts.md b/docs/learn/documentation/versioned/core-concepts/core-concepts.md
index b69de3d..8e2ce93 100644
--- a/docs/learn/documentation/versioned/core-concepts/core-concepts.md
+++ b/docs/learn/documentation/versioned/core-concepts/core-concepts.md
@@ -25,7 +25,7 @@ title: Core concepts
 - [Time](#time)
 - [Processing guarantee](#processing-guarantee)
 
-## Introduction
+### Introduction
 
 Apache Samza is a scalable data processing engine that allows you to process and analyze your data in real-time. Here is a summary of Samza’s features that simplify building your applications:
 
@@ -46,7 +46,7 @@ _**Unified API:**_ Use a simple API to describe your application-logic in a mann
 Next, we will introduce Samza’s terminology. You will realize that it is extremely easy to [get started](/quickstart/{{site.version}}) with building your first application. 
 
 
-## Streams, Partitions
+### Streams, Partitions
 Samza processes your data in the form of streams. A _stream_ is a collection of immutable messages, usually of the same type or category. Each message in a stream is modelled as a key-value pair. 
 
 ![diagram-medium](/img/{{site.version}}/learn/documentation/core-concepts/streams-partitions.png)
@@ -57,7 +57,7 @@ A stream is sharded into multiple partitions for scaling how its data is process
 
 Samza supports pluggable systems that can implement the stream abstraction. As an example, Kafka implements a stream as a topic while a database might implement a stream as a sequence of updates to its tables.
 
-## Stream Application
+### Stream Application
 A _stream application_ processes messages from input streams, transforms them, and emits results to an output stream or a database. It is built by chaining multiple operators, each of which takes in one or more streams and transforms them.
 
 ![diagram-medium](/img/{{site.version}}/learn/documentation/core-concepts/stream-application.png)
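As an illustration of chaining operators, here is a minimal sketch using the High Level Streams API; the class, stream names, and filter predicate are illustrative, and the descriptor packages assume Samza 1.0:

{% highlight java %}
import org.apache.samza.application.StreamApplication;
import org.apache.samza.application.descriptors.StreamApplicationDescriptor;
import org.apache.samza.operators.MessageStream;
import org.apache.samza.operators.OutputStream;
import org.apache.samza.serializers.StringSerde;
import org.apache.samza.system.kafka.descriptors.KafkaSystemDescriptor;

public class FilterExample implements StreamApplication {
  @Override
  public void describe(StreamApplicationDescriptor app) {
    KafkaSystemDescriptor kafka = new KafkaSystemDescriptor("kafka");
    // Consume page-views, drop empty records, and emit the rest
    MessageStream<String> pageViews =
        app.getInputStream(kafka.getInputDescriptor("page-views", new StringSerde()));
    OutputStream<String> filtered =
        app.getOutputStream(kafka.getOutputDescriptor("filtered-page-views", new StringSerde()));
    pageViews.filter(msg -> !msg.isEmpty()).sendTo(filtered);
  }
}
{% endhighlight %}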
@@ -67,20 +67,20 @@ Samza offers three top-level APIs to help you build your stream applications: <b
 2. The [Low Level Task API](/learn/documentation/{{site.version}}/api/low-level-api.html), which gives you greater flexibility to define your processing-logic and finer-grained control <br/>
 3. [Samza SQL](/learn/documentation/{{site.version}}/api/samza-sql.html), which offers a declarative SQL interface to create your applications <br/>
 
-## State
+### State
 Samza supports both stateless and stateful stream processing. _Stateless processing_, as the name implies, does not retain any state associated with the current message after it has been processed. A good example of this is filtering an incoming stream of user-records by a field (e.g., userId) and writing the filtered messages to their own stream. 
 
 In contrast, _stateful processing_ requires you to record some state about a message even after processing it. Consider the example of counting the number of unique users to a website every five minutes. This requires you to store information about each user seen thus far for de-duplication. Samza offers a fault-tolerant, scalable state-store for this purpose.
 
-## Time
+### Time
 Time is a fundamental concept in stream processing, especially in how it is modeled and interpreted by the system. Samza supports two notions of time. By default, all built-in Samza operators use processing time. In processing time, the timestamp of a message is determined by when it is processed by the system. For example, an event generated by a sensor could be processed by Samza several milliseconds later. 
 
 On the other hand, in event time, the timestamp of an event is determined by when it actually occurred at the source. For example, a sensor which generates an event could embed the time of occurrence as a part of the event itself. Samza provides event-time based processing through its integration with [Apache Beam](https://beam.apache.org/documentation/runners/samza/).
 
-## Processing guarantee
+### Processing guarantee
 Samza supports at-least-once processing. As the name implies, this ensures that each message in the input stream is processed by the system at least once. This guarantees no data-loss even when there are failures, thereby making Samza a practical choice for building fault-tolerant applications.
 
 
 Next Steps: We are now ready to have a closer look at Samza’s architecture.
-## [Architecture &raquo;](/learn/documentation/{{site.version}}/architecture/architecture-overview.html)
+### [Architecture &raquo;](/learn/documentation/{{site.version}}/architecture/architecture-overview.html)
 

http://git-wip-us.apache.org/repos/asf/samza/blob/f86749b6/docs/learn/documentation/versioned/deployment/yarn.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/deployment/yarn.md b/docs/learn/documentation/versioned/deployment/yarn.md
index b32ba68..3a46cea 100644
--- a/docs/learn/documentation/versioned/deployment/yarn.md
+++ b/docs/learn/documentation/versioned/deployment/yarn.md
@@ -20,11 +20,10 @@ title: Run on YARN
 -->
 
 - [Introduction](#introduction)
-- [Starting your application on YARN](#starting-your-application-on-yarn)
+- [Running on YARN: Quickstart](#running-on-yarn-quickstart)
    - [Setting up a single node YARN cluster](#set-up-a-single-node-yarn-cluster)
     - [Submitting the application to YARN](#submitting-the-application-to-yarn)
 - [Application Master UI](#application-master-ui)
-- [Viewing logs](#viewing-logs)
 - [Configuration](#configuration)
     - [Configuring parallelism](#configuring-parallelism)
     - [Configuring resources](#configuring-resources)
@@ -45,13 +44,13 @@ title: Run on YARN
 - [Coordinator Internals](#coordinator-internals)
 
 
-## Introduction
+### Introduction
 
 Apache YARN is part of the Hadoop project and provides the ability to run distributed applications on a cluster. A YARN cluster minimally consists of a Resource Manager (RM) and multiple Node Managers (NM). The RM is responsible for managing the resources in the cluster and allocating them to applications. Every node in the cluster has an NM, which is responsible for managing containers on that node - starting them, monitoring their resource usage, and reporting it to the RM. 
 
 Applications are run on the cluster by implementing a coordinator called an ApplicationMaster (AM). The AM is responsible for requesting resources, such as CPU and memory, from the Resource Manager (RM) on behalf of the application. Samza provides its own implementation of the AM for each job.
 
-## Running on YARN: Quickstart
+### Running on YARN: Quickstart
 
 We will demonstrate running a Samza application on YARN by using the `hello-samza` example. Let's first check out the repository.
 
@@ -61,7 +60,7 @@ cd samza-hello-samza
 git checkout latest
 ```
 
-### Set up a single node YARN cluster
+#### Set up a single node YARN cluster
 
 You can use the `grid` script included as part of the [hello-samza](https://github.com/apache/samza-hello-samza/) repository to set up a single-node cluster. The script also starts Zookeeper and Kafka locally.
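Assuming the standard hello-samza layout, the one-time setup is:

```
./bin/grid bootstrap
```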
 
@@ -104,7 +103,7 @@ $ ./deploy/samza/bin/run-app.sh --config-factory=org.apache.samza.config.factori
 Congratulations, you've successfully submitted your first job to YARN! You can use the YARN Web UI to view its status. 
 
 
-## Application Master UI
+### Application Master UI
 
 The YARN RM provides a Web UI to view the status of applications in the cluster, their containers and logs. By default, it can be accessed from `localhost:8088` on the RM host. 
 ![diagram-medium](/img/{{site.version}}/learn/documentation/yarn/yarn-am-ui.png)
@@ -127,7 +126,7 @@ Samza's Application Master UI provides you the ability to view:
 ![diagram-small](/img/{{site.version}}/learn/documentation/yarn/am-runtime-configs.png)
 
 
-### Configurations
+### Configuration
 
 In this section, we'll look at configuring your jobs when running on YARN.
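For example, parallelism and per-container resources are governed by a handful of configs; a sketch with illustrative values:

```
# Resources to request for each container, and how many containers to run
cluster-manager.container.memory.mb=2048
cluster-manager.container.cpu.cores=1
job.container.count=2
```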