Posted to users@kafka.apache.org by "Devaki, Srinivas" <me...@eightnoteight.space> on 2021/01/23 21:06:48 UTC

Suggestions on canarying traffic for kafka consumers.

Hi Folks,

Canarying traffic is an excellent way of reducing the impact of
releasing a new version that contains a bug. Such canarying is somewhat
easier with some queueing backends like SQS and Redis. In SQS, for
example, each canary container/instance can self-regulate how much
throughput it processes by looking at how much throughput the rest of
the containers processed, using some quota logic.

With Kafka consumers, however, partitions are assigned to application
containers/instances, so I'm finding it a bit hard to decide on a way
to split the traffic between the canary and production application
deployments.

As of now these are a few thoughts in my mind.

### Kafka Consumer as Separate Deployment that makes RPC calls to application containers

In this approach, I was thinking of running the Kafka consumer as a
separate deployment that makes RPC calls to the application containers
via a load balancer or Envoy. The load balancer/Envoy would then split
the traffic between the canary and production containers.

### Kafka Stateful Proxy

In this approach, a Kafka proxy sits on top of the Kafka brokers and
runs the consumer groups for each topic, with a partition assignment
strategy that mirrors the brokers' own partition assignment.

This mirroring assignment is to ensure that load is equally split, and
if it's not equally split the problem is not with the proxy but the
problem is at the broker level itself where partitions themselves are
unbalanced across brokers.

Application containers poll this proxy for new items and report back
to it once processing has finished. Essentially we are abstracting the
polling loop out of the application, which only maintains a listener
that polls for items and reports their status back. This can be
implemented as a push or a pull model, based on the pros and cons of
each approach.
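
To make the poll/report-back idea concrete, here is a minimal in-memory
sketch of the pull variant. Everything in it is hypothetical (the
`LeaseProxy` class, its method names, and the record shape are made up
for illustration); a real proxy would fill the queue from a Kafka
consumer and would also have to handle offset commits and timeouts for
items that are never reported back, both of which are left out here.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple

# Hypothetical record shape: (partition, offset, payload).
Record = Tuple[int, int, bytes]


@dataclass
class LeaseProxy:
    """Hands out records to workers and tracks them until they report back."""

    pending: Deque[Record] = field(default_factory=deque)
    in_flight: Dict[Tuple[int, int], Record] = field(default_factory=dict)

    def enqueue(self, record: Record) -> None:
        # In a real proxy this would be fed by the Kafka consumer loop.
        self.pending.append(record)

    def poll(self) -> Optional[Record]:
        # A worker (canary or production) asks for the next item.
        if not self.pending:
            return None
        record = self.pending.popleft()
        self.in_flight[(record[0], record[1])] = record
        return record

    def report(self, partition: int, offset: int, ok: bool) -> None:
        # The worker reports the outcome; failed items are retried first.
        record = self.in_flight.pop((partition, offset))
        if not ok:
            self.pending.appendleft(record)
```

Because workers pull at their own pace, the canary share could then be
controlled by how often canary containers are allowed to call `poll()`.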

### Use of other queueing backends like SQS
A separate Kafka consumer group deployment can be made that exposes
Kafka topics as SQS queues for each application use case. Although
this stacks multiple components and looks unintuitive, this solution
seems to be the simplest of all and is flexible enough to implement
other functionality on top.

### Topic Splitting
Each kafka consumer use case sets up an infrastructure component to
create their own canary/prod topics that are created from the main
kafka topic according to the canary traffic percentage.

This solution felt complex, so I ruled it out. It would be interesting
to hear if anyone has any positive thoughts on this approach.
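
For reference, the routing decision inside such a splitter could look
roughly like the sketch below. The `.canary`/`.prod` topic suffixes and
the choice of hashing the record key with CRC32 are my assumptions, not
an established convention; hashing the key means all records for one
key land in the same derived topic.

```python
import zlib


def target_topic(key: bytes, canary_percent: int, base_topic: str) -> str:
    """Pick the derived topic for a record read from the main topic.

    The "<base>.canary" / "<base>.prod" names are hypothetical.
    """
    # CRC32 of the key gives a stable bucket in [0, 100).
    bucket = zlib.crc32(key) % 100
    suffix = "canary" if bucket < canary_percent else "prod"
    return f"{base_topic}.{suffix}"
```

The splitter would consume the main topic, call `target_topic(...)` for
each record, and produce the record to the returned topic.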

### 2 Kafka Consumer Groups
Canary and production deployments use their own Kafka consumer groups,
and both use the same hash function to decide which deployment
processes which item. To drive x percent of the traffic via the canary,
something like `digest(partition, offset) % 100 < x` can be used.

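There is the problem of resource wastage in this approach, since both
consumer groups read every record and each one discards the share that
belongs to the other. Still, it is a very decent approach to splitting
the traffic between canary and production.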
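
A minimal sketch of that filter, assuming SHA-256 as the digest (any
stable hash would do) and with hypothetical function names. Both
deployments run the same code and differ only in the `is_canary` flag,
so every record is processed by exactly one of them.

```python
import hashlib


def canary_bucket(partition: int, offset: int) -> int:
    """Map (partition, offset) to a stable bucket in [0, 100)."""
    key = f"{partition}:{offset}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % 100


def should_process(partition: int, offset: int,
                   canary_percent: int, is_canary: bool) -> bool:
    """Return True if this deployment owns the record."""
    in_canary_share = canary_bucket(partition, offset) < canary_percent
    return in_canary_share if is_canary else not in_canary_share
```

In each consumer's poll loop, a record would be skipped (but its offset
still committed) whenever `should_process(...)` returns False.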

---------------------------

I wanted to get some input on the above approaches and, most
importantly, to see how others are solving this problem of canarying
Kafka event streams, which may help in reinforcing some of the above
approaches.

Thanks
Srinivas
SRE @ zomato
@eightnoteight