You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by GitBox <gi...@apache.org> on 2019/10/08 17:19:24 UTC

[GitHub] [samza] prateekm commented on a change in pull request #1179: SAMZA-2334: SSPGrouperProxy should check if job is stateful, not if host-affinity is enabled

prateekm commented on a change in pull request #1179: SAMZA-2334: SSPGrouperProxy should check if job is stateful, not if host-affinity is enabled
URL: https://github.com/apache/samza/pull/1179#discussion_r332633787
 
 

 ##########
 File path: docs/learn/documentation/versioned/jobs/samza-configurations.md
 ##########
 @@ -77,9 +77,10 @@ These are the basic properties for setting up a Samza application.
 |job.coordinator.replication.<br>factor|300000|The frequency at which the input streams' partition count change should be detected. When the input partition count change is detected, Samza will automatically restart a stateless job or fail a stateful job. A longer time interval is recommended for jobs w/ large number of input system stream partitions, since gathering partition count may incur measurable overhead to the job. You can completely disable partition count monitoring by setting this value to 0 or a negative integer, which will also disable auto-restart/failing behavior of a Samza job on partition count changes.|
 |job.systemstreampartition.<br>grouper.factory|`org.apache.samza.`<br>`container.grouper.stream.`<br>`GroupByPartitionFactory`|A factory class that is used to determine how input SystemStreamPartitions are grouped together for processing in individual StreamTask instances. The factory must implement the SystemStreamPartitionGrouperFactory interface. Once this configuration is set, it can't be changed, since doing so could violate state semantics, and lead to a loss of data.<br><br>`org.apache.samza.container.grouper.stream.`<br>`GroupByPartitionFactory`<br>Groups input stream partitions according to their partition number. This grouping leads to a single StreamTask processing all messages for a single partition (e.g. partition 0) across all input streams that have a partition 0. Therefore, the default is that you get one StreamTask for all input partitions with the same partition number. Using this strategy, if two input streams have a partition 0, then messages from both partitions will be routed to a single StreamTask. This partitioning strategy is useful for joining and aggregating streams.<br><br>`org.apache.samza.container.grouper.stream.`<br>`GroupBySystemStreamPartitionFactory`<br>Assigns each SystemStreamPartition to its own unique StreamTask. The GroupBySystemStreamPartitionFactory is useful in cases where you want increased parallelism (more containers), and don't care about co-locating partitions for grouping or joins, since it allows for a greater number of StreamTasks to be divided up amongst Samza containers.|
 |job.systemstreampartition.<br>matcher.class| |If you want to enable static partition assignment, then this is a required configuration. The value of this property is a fully-qualified Java class name that implements the interface org.apache.samza.system.SystemStreamPartitionMatcher. Samza ships with two matcher classes:<br><br>`org.apache.samza.system.RangeSystemStreamPartitionMatcher`<br>This classes uses a comma separated list of range(s) to determine which partition matches, and thus statically assigned to the Job. For example "2,3,1-2", statically assigns partition 1, 2, and 3 for all the specified system and streams (topics in case of Kafka) to the job. For config validation each element in the comma separated list much conform to one of the following regex:<br>`(\\d+)`" or"`(\\d+-\\d+)`"<br>`JobConfig.SSP_MATCHER_CLASS_RANGE` constant has the canonical name of this class.<br><br>`org.apache.samza.system.RegexSystemStreamPartitionMatcher`<br>This classes uses a standard Java supported regex to determine which partition matches, and thus statically assigned to the Job. For example "[1-2]", statically assigns partition 1 and 2 for all the specified system and streams (topics in case of Kafka) to the job. JobConfig.SSP_MATCHER_CLASS_REGEX constant has the canonical name of this class.|
-|job.systemstreampartition.<br>matcher.config.<br>range| |If `job.systemstreampartition.matcher.class` is specified, and the value of this property is `org.apache.samza.system.RangeSystemStreamPartitionMatcher`, then this property is a required configuration. Specify a comma separated list of range(s) to determine which partition matches, and thus statically assigned to the Job. For example "2,3,11-20", statically assigns partition 2, 3, and 11 to 20 for all the specified system and streams (topics in case of Kafka) to the job. A singel configuration value like "19" is valid as well. This statically assigns partition 19. For config validation each element in the comma separated list much conform to one of the following regex:<br>"`(\\d+)`" or "`(\\d+-\\d+)`"|
+|job.systemstreampartition.<br>matcher.config.<br>range| |If `job.systemstreampartition.matcher.class` is specified, and the value of this property is `org.apache.samza.system.RangeSystemStreamPartitionMatcher`, then this property is a required configuration. Specify a comma separated list of range(s) to determine which partition matches, and thus statically assigned to the Job. For example "2,3,11-20", statically assigns partition 2, 3, and 11 to 20 for all the specified system and streams (topics in case of Kafka) to the job. A single configuration value like "19" is valid as well. This statically assigns partition 19. For config validation each element in the comma separated list much conform to one of the following regex:<br>"`(\\d+)`" or "`(\\d+-\\d+)`"|
 |job.systemstreampartition.<br>matcher.config.<br>regex| |If `job.systemstreampartition.matcher.class` is specified, and the value of this property is `org.apache.samza.system.RegexSystemStreamPartitionMatcher`, then this property is a required configuration. The value should be a valid Java supported regex. For example "[1-2]", statically assigns partition 1 and 2 for all the specified system and streams (topics in case of Kakfa) to the job.|
 |job.systemstreampartition.<br>matcher.config.<br>job.factory.regex| |This configuration can be used to specify the Java supported regex to match the StreamJobFactory for which the static partition assignment should be enabled. This configuration enables the partition assignment feature to be used for custom StreamJobFactory(ies) as well.<br>This config defaults to the following value: "_org\\\\.apache\\\\.samza\\\\.job\\\\.local(.*ProcessJobFactory &#124; .*ThreadJobFactory)_", which enables static partition assignment when job.factory.class is set to `org.apache.samza.job.local.ProcessJobFactory` or `org.apache.samza.job.local.ThreadJobFactory`.|
+|job.systemstreampartition.<br>grouper.proxy.enabled|true|When enabled, this provides a proxy layer on top of the grouper configured by `job.systemstreampartition.grouper.factory` that allows stateful jobs to expand or contract their partition count by a multiple of the previous count so that events from an input stream partition are not stored in the local state of a different task. This will prevent erroneous results. If this configuration is disabled or if the job is stateless, the proxy will call the grouper directly and skip the expansion or contraction logic. See [SEP-5](https://cwiki.apache.org/confluence/display/SAMZA/SEP-5%3A+Enable+partition+expansion+of+input+streams) for more details.|
 
 Review comment:
   Can we rename this config to reflect the feature (partition count increase) instead of the implementation (grouper proxy)? Same for the docs.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services