You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Anoop Krishnakumar <an...@gmail.com> on 2020/02/20 08:16:20 UTC

Scaling samza jobs in standalone deployment

Hello Team,

I am testing scaling of Samza jobs in standalone deployment model. Jobs are
deployed in Kubernetes with zookeeper as job coordinator. The input Kafka
topic has two partitions and I am running two Samza instances. The
configuration supplied to both instances are identical except app.id, which
is 1,2 respectively.

The expected result was each Samza instance would process one of the two
partitions. In reality both instances subscribes to partition 0 and 1, and
I could see duplicate messages in the output topic. Is there anything that
I am missing?

Samza Version: 1.3.0
Kafka Version: 2.1
Zookeeper Version: 3.4

Configuration:
app.id=1
app.name=test
app.class=insights.stream.AlertStreamApplication
job.default.system=kafka
job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
# Task Checkpoint
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
job.coordinator.zk.connect=<ZK IP>:2181
systems.kafka.producer.bootstrap.servers=<Kafka IP>:9092
# Kafka System
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.default.stream.replication.factor=1
# Input Streams
streams.alert.samza.system=kafka
streams.alert.samza.physical.name=alert
# Output Streams
streams.events.samza.system=kafka
streams.events.samza.physical.name=events

Thanks,
Anoop

Re: Scaling samza jobs in standalone deployment

Posted by Jordan Messec <jm...@apple.com.INVALID>.
app.id <http://app.id/> is used to identify an instance of your application. Each application consumes all partitions for all topics. With unique app.id <http://app.id/>, what you really have is two Samza jobs running simultaneously.

Sounds to me like what you’re expecting is a single Samza job with two containers.

Jordan

> On Feb 20, 2020, at 12:16 AM, Anoop Krishnakumar <an...@gmail.com> wrote:
> 
> Hello Team,
> 
> I am testing scaling of Samza jobs in standalone deployment model. Jobs are
> deployed in Kubernetes with zookeeper as job coordinator. The input Kafka
> topic has two partitions and I am running two Samza instances. The
> configuration supplied to both instances are identical except app.id, which
> is 1,2 respectively.
> 
> The expected result was each Samza instance would process one of the two
> partitions. In reality both instances subscribes to partition 0 and 1, and
> I could see duplicate messages in the output topic. Is there anything that
> I am missing?
> 
> Samza Version: 1.3.0
> Kafka Version: 2.1
> Zookeeper Version: 3.4
> 
> Configuration:
> app.id=1
> app.name=test
> app.class=insights.stream.AlertStreamApplication
> job.default.system=kafka
> job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
> task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
> # Task Checkpoint
> task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
> job.coordinator.zk.connect=<ZK IP>:2181
> systems.kafka.producer.bootstrap.servers=<Kafka IP>:9092
> # Kafka System
> systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
> systems.kafka.default.stream.replication.factor=1
> # Input Streams
> streams.alert.samza.system=kafka
> streams.alert.samza.physical.name=alert
> # Output Streams
> streams.events.samza.system=kafka
> streams.events.samza.physical.name=events
> 
> Thanks,
> Anoop