Posted to commits@samza.apache.org by ja...@apache.org on 2018/10/17 16:48:48 UTC

samza git commit: Added Samza Configurations to website

Repository: samza
Updated Branches:
  refs/heads/master 8b8526682 -> 058776d65


Added Samza Configurations to website

vjagadish
Added `CONFIGURATIONS` under `DOCUMENTATION`
Updated `configuration.md` page to work with new configs

Do we have documentation about `SystemDescriptors` anywhere on the website?
I was thinking of adding it to the `configuration.md` page otherwise.

Author: Daniel Chen <dc...@linkedin.com>

Reviewers: Jagadish <ja...@apache.org>

Closes #723 from dxichen/add-configs-to-website


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/058776d6
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/058776d6
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/058776d6

Branch: refs/heads/master
Commit: 058776d65e79a3d4973f748dc5f57ac7ad36d72e
Parents: 8b85266
Author: Daniel Chen <dc...@linkedin.com>
Authored: Wed Oct 17 09:40:39 2018 -0700
Committer: Jagadish <jv...@linkedin.com>
Committed: Wed Oct 17 09:40:39 2018 -0700

----------------------------------------------------------------------
 docs/learn/documentation/versioned/index.html   |  2 +-
 .../versioned/jobs/configuration.md             | 56 ++++++++++----------
 .../versioned/jobs/samza-configurations.md      |  4 +-
 3 files changed, 30 insertions(+), 32 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/058776d6/docs/learn/documentation/versioned/index.html
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/index.html b/docs/learn/documentation/versioned/index.html
index 94f7e18..50bfd2d 100644
--- a/docs/learn/documentation/versioned/index.html
+++ b/docs/learn/documentation/versioned/index.html
@@ -21,7 +21,7 @@ title: Documentation
 
 <h4><a href="core-concepts/core-concepts.html">CORE CONCEPTS</a></h4>
 <h4><a href="architecture/architecture-overview.html">ARCHITECTURE</a></h4>
-
+<h4><a href="jobs/configuration.html">CONFIGURATIONS</a></h4>
 
 <h4>API</h4>
 

http://git-wip-us.apache.org/repos/asf/samza/blob/058776d6/docs/learn/documentation/versioned/jobs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/jobs/configuration.md b/docs/learn/documentation/versioned/jobs/configuration.md
index 4aac9bf..aafb870 100644
--- a/docs/learn/documentation/versioned/jobs/configuration.md
+++ b/docs/learn/documentation/versioned/jobs/configuration.md
@@ -19,48 +19,46 @@ title: Configuration
    limitations under the License.
 -->
 
-All Samza jobs have a configuration file that defines the job. A very basic configuration file looks like this:
+All Samza applications have a [properties format](https://en.wikipedia.org/wiki/.properties) file that defines their configurations.
+A complete list of configuration keys can be found on the [__Samza Configurations Table__](samza-configurations.html) page. 
+ 
+A very basic configuration file looks like this:
 
 {% highlight jproperties %}
-# Job
-job.factory.class=org.apache.samza.job.local.ThreadJobFactory
-job.name=hello-world
-
-# Task
-task.class=samza.task.example.MyJavaStreamerTask
-task.inputs=example-system.example-stream
-
-# Serializers
+# Application Configurations
+job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
+app.name=hello-world
+job.default.system=example-system
 serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
 serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
 
-# Systems
+# Systems & Streams Configurations
 systems.example-system.samza.factory=samza.stream.example.ExampleConsumerFactory
 systems.example-system.samza.key.serde=string
 systems.example-system.samza.msg.serde=json
-{% endhighlight %}
 
-There are four major sections to a configuration file:
+# Checkpointing
+task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
 
-1. The job section defines things like the name of the job, and whether to use the YarnJobFactory or ProcessJobFactory/ThreadJobFactory (See the job.factory.class property in [Configuration Table](configuration-table.html)).
-2. The task section is where you specify the class name for your [StreamTask](../api/overview.html). It's also where you define what the [input streams](../container/streams.html) are for your task.
-3. The serializers section defines the classes of the [serdes](../container/serialization.html) used for serialization and deserialization of specific objects that are received and sent along different streams.
-4. The system section defines systems that your StreamTask can read from along with the types of serdes used for sending keys and messages from that system. Usually, you'll define a Kafka system, if you're reading from Kafka, although you can also specify your own self-implemented Samza-compatible systems. See the [hello-samza example project](/startup/hello-samza/{{site.version}})'s Wikipedia system for a good example of a self-implemented system.
+# State Storage
+stores.example-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
+stores.example-store.key.serde=string
+stores.example-store.value.serde=json
 
-### Required Configuration
-
-Configuration keys that absolutely must be defined for a Samza job are:
+# Metrics
+metrics.reporter.example-reporter.class=org.apache.samza.metrics.reporter.JmxReporterFactory
+metrics.reporters=example-reporter
+{% endhighlight %}
 
-* `job.factory.class`
-* `job.name`
-* `task.class`
-* `task.inputs`
+There are six sections in a configuration file:
 
-### Configuration Keys
+1. The [__Application__](samza-configurations.html#application-configurations) section defines things like the name of the job, the job factory (see the job.factory.class property in the [Configuration Table](samza-configurations.html)), the class name for your [StreamTask](../api/overview.html), and the serdes used for serialization and deserialization of specific objects that are sent and received along different streams.
+2. The [__Systems & Streams__](samza-configurations.html#systems-streams) section defines systems that your StreamTask can read from along with the types of serdes used for sending keys and messages from that system. You may use any of the [predefined systems](../connectors/overview.html) that Samza ships with, although you can also specify your own self-implemented Samza-compatible systems. See the [hello-samza example project](/startup/hello-samza/{{site.version}})'s Wikipedia system for a good example of a self-implemented system.
+3. The [__Checkpointing__](samza-configurations.html#checkpointing) section defines how message processing state is saved, which provides fault-tolerant processing of streams (see [Checkpointing](../container/checkpointing.html) for more details).
+4. The [__State Storage__](samza-configurations.html#state-storage) section defines the [stateful stream processing](../container/state-management.html) settings for Samza.
+5. The [__Deployment__](samza-configurations.html#deployment) section defines how the Samza application will be deployed (to a cluster manager such as YARN, or as a standalone library), as well as settings for each option; a minimal sketch follows this list. See [Deployment Models](/deployment/deployment-model.html) for more details.
+6. The [__Metrics__](samza-configurations.html#metrics) section defines how the application's metrics will be monitored and collected (see [Monitoring](../operations/monitoring.html)).
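The example above does not show a __Deployment__ section. A minimal sketch for a YARN deployment, assuming the job package has already been uploaded (the path below is illustrative only), could look like this:

{% highlight jproperties %}
# Deployment (YARN); the package path below is illustrative only.
yarn.package.path=hdfs://namenode/path/to/hello-world-dist.tar.gz
job.container.count=2
{% endhighlight %}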
 
-A complete list of configuration keys can be found on the [Samza Configurations](samza-configurations.html) page.  Note
-that configuration keys prefixed with "sensitive." are treated specially, in that the values associated with such keys
+Note that configuration keys prefixed with `sensitive.` are treated specially, in that the values associated with such keys
 will be masked in logs and Samza's YARN ApplicationMaster UI.  This is to prevent accidental disclosure only; no
 encryption is done.
-
-## [Packaging &raquo;](packaging.html)
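As an illustration, a hypothetical credential placed under a `sensitive.`-prefixed key (the key name and value below are made up) would have its value masked:

{% highlight jproperties %}
# The value below is masked in logs and in the YARN ApplicationMaster UI.
# Key name and value are illustrative only.
sensitive.example.password=my-secret-value
{% endhighlight %}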

http://git-wip-us.apache.org/repos/asf/samza/blob/058776d6/docs/learn/documentation/versioned/jobs/samza-configurations.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/jobs/samza-configurations.md b/docs/learn/documentation/versioned/jobs/samza-configurations.md
index ea76210..0928ee2 100644
--- a/docs/learn/documentation/versioned/jobs/samza-configurations.md
+++ b/docs/learn/documentation/versioned/jobs/samza-configurations.md
@@ -57,7 +57,7 @@ These are the basic properties for setting up a Samza application.
 |job.host-affinity.enabled|false|This property indicates whether host-affinity is enabled or not. Host-affinity refers to the ability of Samza to request and allocate a container on the same host every time the job is deployed. When host-affinity is enabled, Samza makes a "best-effort" to honor the host-affinity constraint. The property `cluster-manager.container.request.timeout.ms` determines how long to wait before de-prioritizing the host-affinity constraint and assigning the container to any available resource.|
 |task.window.ms|-1|If task.class implements [WindowableTask](../api/javadocs/org/apache/samza/task/WindowableTask.html), it can receive a windowing callback at regular intervals. This property specifies the time between window() calls, in milliseconds. If the number is negative (the default), window() is never called. A `window()` call will never occur concurrently with the processing of a message. If a message is being processed when a window() call is due, the invocation of window happens after processing the message. This property is set automatically when using join or window operators in a High Level API StreamApplication. Note: task.window.ms should be set to be much larger than average process or window call duration to avoid starving regular processing.|
 |task.log4j.system| |Specify the system name for the StreamAppender. If this property is not specified in the config, an exception will be thrown. (See [Stream Log4j Appender](logging.html#stream-log4j-appender)) Example: task.log4j.system=kafka|
-|serializers.registry.<br>**_serde-name_**.class| |Use this property to register a serializer/deserializer, which defines a way of encoding data as an array of bytes (used for messages in streams, and for data in persistent storage). You can give a serde any serde-name you want, and reference that name in properties like systems.\*.samza.key.serde, systems.\*.samza.msg.serde, streams.\*.samza.key.serde, streams.\*.samza.msg.serde, stores.\*.key.serde and stores.\*.msg.serde. The value of this property is the fully-qualified name of a Java class that implements SerdeFactory. Samza ships with the following serde implementations, which can be used with their predefined serde name without adding them to the registry explicitly:<br><br>`org.apache.samza.serializers.ByteSerdeFactory`<br>A no-op serde which passes through the undecoded byte array. Its predefined serde-name is `byte`.<br><br>`org.apache.samza.serializers.ByteBufferSerdeFactory`<br>Encodes `java.nio.ByteBuffer` objects. Its predefined serde-name is `bytebuffer`.<br><br>`org.apache.samza.serializers.IntegerSerdeFactory`<br>Encodes `java.lang.Integer` objects as binary (4 bytes fixed-length big-endian encoding). Its predefined serde-name is `integer`.<br><br>`org.apache.samza.serializers.StringSerdeFactory`<br>Encodes `java.lang.String` objects as UTF-8. Its predefined serde-name is `string`.<br><br>`org.apache.samza.serializers.JsonSerdeFactory`<br>Encodes nested structures of `java.util.Map`, `java.util.List` etc. as JSON. Note: This Serde enforces a dash-separated property naming convention, while JsonSerdeV2 doesn't. This serde is primarily meant for Samza's internal usage, and is publicly available for backwards compatibility. Its predefined serde-name is `json`.<br><br>`org.apache.samza.serializers.JsonSerdeV2Factory`<br>Encodes nested structures of `java.util.Map`, `java.util.List` etc. as JSON. Note: This Serde uses Jackson's default (camelCase) property naming convention. This serde should be preferred over JsonSerde, especially in High Level API, unless the dasherized naming convention is required (e.g., for backwards compatibility).<br><br>`org.apache.samza.serializers.LongSerdeFactory`<br>Encodes `java.lang.Long` as binary (8 bytes fixed-length big-endian encoding). Its predefined serde-name is `long`.<br><br>`org.apache.samza.serializers.DoubleSerdeFactory`<br>Encodes `java.lang.Double` as binary (8 bytes double-precision float point). Its predefined serde-name is `double`.<br><br>`org.apache.samza.serializers.UUIDSerdeFactory`<br>Encodes `java.util.UUID` objects.<br><br>`org.apache.samza.serializers.SerializableSerdeFactory`<br>Encodes `java.io.Serializable` objects. Its predefined serde-name is `serializable`.<br><br>`org.apache.samza.serializers.MetricsSnapshotSerdeFactory`<br>Encodes `org.apache.samza.metrics.reporter.MetricsSnapshot` objects (which are used for reporting metrics) as JSON.<br><br>`org.apache.samza.serializers.KafkaSerdeFactory`<br>Adapter which allows existing `kafka.serializer.Encoder` and `kafka.serializer.Decoder` implementations to be used as Samza serdes. Set `serializers.registry.serde-name.encoder` and `serializers.registry.serde-name.decoder` to the appropriate class names.|
+|serializers.registry.<br>**_serde-name_**.class| |Use this property to register a serializer/deserializer, which defines a way of encoding data as an array of bytes (used for messages in streams, and for data in persistent storage). You can give a serde any serde-name you want, and reference that name in properties like systems.\*.samza.key.serde, systems.\*.samza.msg.serde, streams.\*.samza.key.serde, streams.\*.samza.msg.serde, stores.\*.key.serde and stores.\*.msg.serde. The value of this property is the fully-qualified name of a Java class that implements SerdeFactory. Samza ships with the following serde implementations:<br><br>`org.apache.samza.serializers.ByteSerdeFactory`<br>A no-op serde which passes through the undecoded byte array.<br><br>`org.apache.samza.serializers.ByteBufferSerdeFactory`<br>Encodes `java.nio.ByteBuffer` objects.<br><br>`org.apache.samza.serializers.IntegerSerdeFactory`<br>Encodes `java.lang.Integer` objects as binary (4 bytes fixed-length big-endian encoding).<br><br>`org.apache.samza.serializers.StringSerdeFactory`<br>Encodes `java.lang.String` objects as UTF-8.<br><br>`org.apache.samza.serializers.JsonSerdeFactory`<br>Encodes nested structures of `java.util.Map`, `java.util.List` etc. as JSON. Note: This Serde enforces a dash-separated property naming convention, while JsonSerdeV2 doesn't. This serde is primarily meant for Samza's internal usage, and is publicly available for backwards compatibility.<br><br>`org.apache.samza.serializers.JsonSerdeV2Factory`<br>Encodes nested structures of `java.util.Map`, `java.util.List` etc. as JSON. Note: This Serde uses Jackson's default (camelCase) property naming convention. This serde should be preferred over JsonSerde, especially in High Level API, unless the dasherized naming convention is required (e.g., for backwards compatibility).<br><br>`org.apache.samza.serializers.LongSerdeFactory`<br>Encodes `java.lang.Long` as binary (8 bytes fixed-length big-endian encoding).<br><br>`org.apache.samza.serializers.DoubleSerdeFactory`<br>Encodes `java.lang.Double` as binary (8 bytes double-precision floating point).<br><br>`org.apache.samza.serializers.UUIDSerdeFactory`<br>Encodes `java.util.UUID` objects.<br><br>`org.apache.samza.serializers.SerializableSerdeFactory`<br>Encodes `java.io.Serializable` objects.<br><br>`org.apache.samza.serializers.MetricsSnapshotSerdeFactory`<br>Encodes `org.apache.samza.metrics.reporter.MetricsSnapshot` objects (which are used for reporting metrics) as JSON.<br><br>`org.apache.samza.serializers.KafkaSerdeFactory`<br>Adapter which allows existing `kafka.serializer.Encoder` and `kafka.serializer.Decoder` implementations to be used as Samza serdes. Set `serializers.registry.serde-name.encoder` and `serializers.registry.serde-name.decoder` to the appropriate class names.|
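As a sketch of how a registered serde-name is referenced from other properties (the serde-name `string` and the system name `example-system` below are illustrative):

{% highlight jproperties %}
# Register a string serde under the serde-name "string"...
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
# ...and reference that serde-name for an example system's keys and messages.
systems.example-system.samza.key.serde=string
systems.example-system.samza.msg.serde=string
{% endhighlight %}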
 
 #### <a name="advanced-application-configurations"></a> [1.1 Advanced Application Configurations](#advanced-application-configurations)
 
@@ -279,7 +279,7 @@ Samza supports both standalone and clustered ([YARN](yarn-jobs.html)) [deploymen
 #### <a name="yarn-cluster-deployment"></a>[5.1 YARN Cluster Deployment](#yarn-cluster-deployment)
 |Name|Default|Description|
 |--- |--- |--- |
-|yarn.package.path| |Required for YARN jobs: The URL from which the job package can be downloaded, for example a http:// or hdfs:// URL. The job package is a .tar.gz file with a specific directory structure.|
+|yarn.package.path| |__Required for YARN jobs:__ The URL from which the job package can be downloaded, for example a http:// or hdfs:// URL. The job package is a .tar.gz file with a specific directory structure.|
 |job.container.count|1|The number of YARN containers to request for running your job. This is the main parameter for controlling the scale (allocated computing resources) of your job: to increase the parallelism of processing, you need to increase the number of containers. The minimum is one container, and the maximum number of containers is the number of task instances (usually the number of input stream partitions). Task instances are evenly distributed across the number of containers that you specify.|
 |cluster-manager.container.memory.mb|1024|How much memory, in megabytes, to request from the cluster manager per container of your job. Along with cluster-manager.container.cpu.cores, this property determines how many containers the cluster manager will run on one machine. If the container exceeds this limit, it will be killed, so it is important that the container's actual memory use remains below the limit. The amount of memory used is normally the JVM heap size (configured with task.opts), plus the size of any off-heap memory allocation (for example stores.*.container.cache.size.bytes), plus a safety margin to allow for JVM overheads.|
 |cluster-manager.container.cpu.cores|1|The number of CPU cores to request per container of your job. Each node in the cluster has a certain number of CPU cores available, so this number (along with cluster-manager.container.memory.mb) determines how many containers can be run on one machine.|
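Putting these properties together, a minimal sketch of a YARN deployment configuration (the package URL below is illustrative only) might be:

{% highlight jproperties %}
# URL from which YARN downloads the job package (.tar.gz); illustrative only.
yarn.package.path=hdfs://namenode/path/to/my-job-dist.tar.gz
# Two containers, each with 2 GB of memory and one CPU core.
job.container.count=2
cluster-manager.container.memory.mb=2048
cluster-manager.container.cpu.cores=1
{% endhighlight %}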