Posted to users@kafka.apache.org by Andrew Otto <ot...@wikimedia.org> on 2022/05/09 18:58:18 UTC

Kafka Stretch Clusters

Hi all,

I'm evaluating <https://phabricator.wikimedia.org/T307944> the feasibility
of setting up a cross datacenter Kafka 'stretch' cluster at The Wikimedia
Foundation.

I've found docs here and there, but they are pretty slim.  My biggest
concern is that while Follower Fetching
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica>
helps with consumer latency in a stretch cluster, nothing addresses producer
latency.  I'd have expected the docs I've read to mention this if it were a
concern, but I haven't seen it.

Specifically, let's say I'm a producer in DC-A, and I want to produce to
partition X with acks=all.  Partition X has 3 replicas, on brokers B1 in
DC-A, B2 in DC-A, and B3 in DC-B.  Currently, the replica on B3 (DC-B) is
the partition leader.  IIUC, when I produce my message to partition X, the
message will cross the DC boundary once for my produce request to B3 (DC-B),
then again when replica B1 (DC-A) fetches, and again when replica B2 (DC-A)
fetches, for a total of 3 crossings between DCs.
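
(For concreteness, here is a minimal sketch of the producer I have in mind.
The topic name and broker addresses are placeholders, not real settings.)

// Minimal acks=all producer sketch; topic and bootstrap servers are made up.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class StretchProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-b1.dc-a:9092,broker-b3.dc-b:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all: the leader waits until all in-sync replicas have the
        // record, so cross-DC follower fetches add to the ack latency.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events-topic", "key", "value"),
                (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("acked partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}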

Questions:
- Am I correct in understanding that each one of these fetches contributes
to the ack latency?

- And, as the number of brokers and replicas increases, the number of times
a message crosses the DC boundary likely increases too?

- When a replica is promoted to partition leader, producer clients will
shuffle their connections around, often ending up connected to a leader in a
remote datacenter. Should I be worried about this unpredictability in
cross-DC network connections and traffic?

I'm really hoping that a stretch cluster will help solve some multi-DC
streaming app architecture woes, but I'm not so sure the potential issues
with partition leader placement are worth it!

Thanks for any insight y'all have,
-Andrew Otto
 Wikimedia Foundation

Re: Kafka Stretch Clusters

Posted by Guozhang Wang <wa...@gmail.com>.
>  Am I correct in assuming
that if the preferred leader is not available, the next replica in the ISR
list is chosen to be the leader?

Yes, that's correct :)
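
(For anyone who wants to verify this on a live cluster, here's a rough
AdminClient sketch that prints each partition's assigned replica order,
where the first entry is the preferred leader, alongside the current leader
and ISR. The topic name and bootstrap server are placeholders.)

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-a1.dc-a:9092");
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin
                .describeTopics(Collections.singletonList("events-topic"))
                .all().get().get("events-topic");
            // For each partition: assigned replica list, current leader, ISR.
            desc.partitions().forEach(p ->
                System.out.printf("partition=%d assigned=%s leader=%s isr=%s%n",
                    p.partition(), p.replicas(),
                    p.leader() == null ? "none" : p.leader().id(), p.isr()));
        }
    }
}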

On Wed, May 11, 2022 at 1:15 PM Andrew Otto <ot...@wikimedia.org> wrote:

> Thanks so much Guozhang!
>
> > 1) For the producer -> leader hop, could you save the cross-DC network?
> >  even if your message's partition has to be determined deterministically
> by the key, in operations you can still see if most of your active
> producers
> are from one DC, then configure your topic partitions to be hosted by
> brokers within the same DC. Generally speaking, there are various ways you
> can consider saving this hop from across DCs.
>
> Hm, perhaps something like this?
> If we run the producer in active/standby mode, so that the producer
> application only ever runs in one DC at a time, could we manage the
> preferred leaders via the replica list order during a failover?  Example:
> If DC-A is the 'active' DC, then the producer would run only in DC-A.  We'd
> ensure that each partition's replica list starts with brokers only in DC-A.
>
>
> Let Broker A1 and A2 be in DC-A, and Broker B1 and B2 in DC-B.  partition 0
> and partition 1 have a replication factor of 4.
>
> p0: [A1, A2, B1,B2]
> p1: [A2, A1, B2, B1]
>
> In order to failover to DC-B, we'd reassign the partition replica list to
> put the DC-B brokers first, like:
> p0: [B1, B2, A1,A2]
> p1: [B2, B1, A2, A1]
>
> Then issue a preferred leader election, stop the producer in DC-A, and
> start it in DC-B.
> We'd incur a producer latency hit during the failover process until both
> partition leaders and the producer are in DC-B, but hopefully that will be
> short lived (minutes)?
>
> With follower fetching, this would still allow consumers in either DC to
> read from the closest replica, so it would allow for active/active reads.
> With at least 2 replicas in each DC, rolling broker restarts would
> hopefully still allow consumers to consume from replicas in their local DC.
>
> ---
> Also, a quick question about leader election.  Am I correct in assuming
> that if the preferred leader is not available, the next replica in the ISR
> list is chosen to be the leader?  Or, is it a random selection from any of
> the ISRs? If it is a random selection, then manually optimizing the replica
> list to reduce producer hops probably isn't worth trying, as we'd get the
> producer hops during normal broker maintenance.
>
> Thank you!
>
>
>
>
>
>
>
> On Mon, May 9, 2022 at 6:00 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hello Andrew.
> >
> > Just to answer your questions first, yes that's correct in your described
> > settings that three round-trips between DCs would incur, but since the
> > replica fetches can be done in parallel, the latency is not a sum of all
> > the round-trips. But if you stay with 2 DCs only, the number of
> round-trips
> > would only depend on the number of follower replicas that are on
> > different DCs with the leader replica.
> >
> > Jumping out of the question and your described settings, there are a
> couple
> > of things you can consider for your design:
> >
> > 1) For the producer -> leader hop, could you save the cross-DC network?
> For
> > example, if your message can potentially go to any partitions (such as it
> > is not key-ed), then you can customize your partitioner as a "rack-aware"
> > one that would always try to pick the partition whose leader co-exist
> > within the same DC as the producer client; even if your message's
> partition
> > has to be determined deterministically by the key, in operations you can
> > still see if most of your active producers are from one DC, then
> configure
> > your topic partitions to be hosted by brokers within the same DC.
> Generally
> > speaking, there are various ways you can consider saving this hop from
> > across DCs.
> >
> > 2) For the leader -> follower hop, you can start from first validating
> how
> > many failures cross DCs that you'd like to tolerate. For example, let's
> say
> > you have 2N+1 replicas per partition, with N+1 replicas including the
> > leader on one DC and N other replicas on the other DC, if we can set the
> > acks to N+2 then it means we will have the data replicated at least on
> one
> > remote replica before returning the request, and hence the data would not
> > be lost if the one whole DC fails, which could be sufficient from many
> > stretching and multi-colo cases. Then in practice, since the cross-colo
> > usually takes more latency, you'd usually get much fewer round-trips
> than N
> > across DC before satisfying the acks. And your average/p99 latencies
> would
> > not increase much compared with just one cross-DC replica.
> >
> >
> > Guozhang
> >
> >
> > On Mon, May 9, 2022 at 11:58 AM Andrew Otto <ot...@wikimedia.org> wrote:
> >
> > > Hi all,
> > >
> > > I'm evaluating <https://phabricator.wikimedia.org/T307944> the
> > feasibility
> > > of setting up a cross datacenter Kafka 'stretch' cluster at The
> Wikimedia
> > > Foundation.
> > >
> > > I've found docs here and there, but they are pretty slim.  My
> > > biggest concern is the fact that while Follower Fetching
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > > >
> > > helps
> > > with potential consumer latency in a stretch cluster, there is nothing
> > that
> > > addresses producer latency.  I'd have expected the docs I've read to
> > > mention this if it was a concern, but I haven't seen it.
> > >
> > > Specifically, let's say I'm a producer in DC-A, and I want to produce
> to
> > > partition X with acks=all.  Partition X has 3 replicas, on brokers B1
> in
> > DC
> > > A, B2 in DC-A and B3 in DC-B.  Currently, the replica on B3(DC-B) is
> the
> > > partition leader.  IIUC, when I produce my message to partition X, that
> > > message will cross the DC boundary for my produce request to B3(DC-B),
> > then
> > > back again when replica B1(DC-A) fetches, and also when replica
> B2(DC-A)
> > > fetches, for a total of 3 times between DCs.
> > >
> > > Questions:
> > > - Am I correct in understanding that each one of these fetches
> > contributes
> > > to the ack latency?
> > >
> > > - And, as the number of brokers and replica increases, the number of
> > times
> > > a message crosses the DC (likely) increases too?
> > >
> > > - When replicas are promoted to be a partition leader,  producer
> clients
> > > will shuffle their connections around, often resulting in them
> connecting
> > > to the leader in a remote datacenter. Should I be worried about this
> > > unpredictability in cross DC network connections and traffic?
> > >
> > > I'm really hoping that a stretch cluster will help solve some Multi DC
> > > streaming app architecture woes, but I'm not so sure the potential
> issues
> > > with partition leaders is worth it!
> > >
> > > Thanks for any insight y'all have,
> > > -Andrew Otto
> > >  Wikimedia Foundation
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: Kafka Stretch Clusters

Posted by Andrew Otto <ot...@wikimedia.org>.
Thanks so much Guozhang!

> 1) For the producer -> leader hop, could you save the cross-DC network?
>  even if your message's partition has to be determined deterministically
by the key, in operations you can still see if most of your active producers
are from one DC, then configure your topic partitions to be hosted by
brokers within the same DC. Generally speaking, there are various ways you
can consider saving this hop from across DCs.

Hm, perhaps something like this?
If we run the producer in active/standby mode, so that the producer
application only ever runs in one DC at a time, could we manage the
preferred leaders via the replica list order during a failover?  Example:
If DC-A is the 'active' DC, then the producer would run only in DC-A.  We'd
ensure that each partition's replica list starts with brokers only in DC-A.


Let brokers A1 and A2 be in DC-A, and brokers B1 and B2 in DC-B.  Partitions
0 and 1 have a replication factor of 4.

p0: [A1, A2, B1, B2]
p1: [A2, A1, B2, B1]

To fail over to DC-B, we'd reassign the partition replica lists to put the
DC-B brokers first, like:
p0: [B1, B2, A1, A2]
p1: [B2, B1, A2, A1]

Then issue a preferred leader election, stop the producer in DC-A, and
start it in DC-B.
We'd incur a producer latency hit during the failover process until both the
partition leaders and the producer are in DC-B, but hopefully that would be
short-lived (minutes)?
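
(Here's a rough sketch of that failover step using the Java AdminClient,
assuming broker ids 1 and 2 are in DC-A, 3 and 4 in DC-B, and a topic named
'events-topic'; all of these names are made up for illustration. Since only
the order of the replica list changes and the broker set stays the same, the
reassignment involves no data movement and should complete quickly.)

import java.util.*;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class FailoverToDcB {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-a1.dc-a:9092");
        try (Admin admin = Admin.create(props)) {
            // Reorder replica lists so the DC-B brokers (3, 4) come first.
            Map<TopicPartition, Optional<NewPartitionReassignment>> reorder = new HashMap<>();
            reorder.put(new TopicPartition("events-topic", 0),
                Optional.of(new NewPartitionReassignment(Arrays.asList(3, 4, 1, 2))));
            reorder.put(new TopicPartition("events-topic", 1),
                Optional.of(new NewPartitionReassignment(Arrays.asList(4, 3, 2, 1))));
            admin.alterPartitionReassignments(reorder).all().get();

            // Then trigger a preferred leader election so the new first
            // replicas (in DC-B) actually take over leadership.
            admin.electLeaders(ElectionType.PREFERRED, reorder.keySet()).all().get();
        }
    }
}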

With follower fetching, this would still allow consumers in either DC to
read from the closest replica, so it would allow for active/active reads.
With at least 2 replicas in each DC, rolling broker restarts would
hopefully still allow consumers to consume from replicas in their local DC.
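
(For reference, a minimal sketch of the consumer side of follower fetching.
It assumes the brokers are configured with broker.rack set to their DC name
and replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector;
the topic, group id, DC name, and bootstrap server are placeholders.)

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LocalDcConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-a1.dc-a:9092");
        props.put("group.id", "dc-a-consumers");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // client.rack should match the broker.rack of the local DC so the
        // leader points this consumer at the closest in-sync replica.
        props.put("client.rack", "DC-A");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events-topic"));
            while (true) {
                consumer.poll(java.time.Duration.ofMillis(500)).forEach(r ->
                    System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset()));
            }
        }
    }
}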

---
Also, a quick question about leader election.  Am I correct in assuming
that if the preferred leader is not available, the next replica in the ISR
list is chosen to be the leader?  Or is it a random selection from any
replica in the ISR?  If it is random, then manually optimizing the replica
list to reduce producer hops probably isn't worth trying, as we'd still get
cross-DC producer hops during normal broker maintenance.

Thank you!







On Mon, May 9, 2022 at 6:00 PM Guozhang Wang <wa...@gmail.com> wrote:

> Hello Andrew.
>
> Just to answer your questions first, yes that's correct in your described
> settings that three round-trips between DCs would incur, but since the
> replica fetches can be done in parallel, the latency is not a sum of all
> the round-trips. But if you stay with 2 DCs only, the number of round-trips
> would only depend on the number of follower replicas that are on
> different DCs with the leader replica.
>
> Jumping out of the question and your described settings, there are a couple
> of things you can consider for your design:
>
> 1) For the producer -> leader hop, could you save the cross-DC network? For
> example, if your message can potentially go to any partitions (such as it
> is not key-ed), then you can customize your partitioner as a "rack-aware"
> one that would always try to pick the partition whose leader co-exist
> within the same DC as the producer client; even if your message's partition
> has to be determined deterministically by the key, in operations you can
> still see if most of your active producers are from one DC, then configure
> your topic partitions to be hosted by brokers within the same DC. Generally
> speaking, there are various ways you can consider saving this hop from
> across DCs.
>
> 2) For the leader -> follower hop, you can start from first validating how
> many failures cross DCs that you'd like to tolerate. For example, let's say
> you have 2N+1 replicas per partition, with N+1 replicas including the
> leader on one DC and N other replicas on the other DC, if we can set the
> acks to N+2 then it means we will have the data replicated at least on one
> remote replica before returning the request, and hence the data would not
> be lost if the one whole DC fails, which could be sufficient from many
> stretching and multi-colo cases. Then in practice, since the cross-colo
> usually takes more latency, you'd usually get much fewer round-trips than N
> across DC before satisfying the acks. And your average/p99 latencies would
> not increase much compared with just one cross-DC replica.
>
>
> Guozhang
>
>
> On Mon, May 9, 2022 at 11:58 AM Andrew Otto <ot...@wikimedia.org> wrote:
>
> > Hi all,
> >
> > I'm evaluating <https://phabricator.wikimedia.org/T307944> the
> feasibility
> > of setting up a cross datacenter Kafka 'stretch' cluster at The Wikimedia
> > Foundation.
> >
> > I've found docs here and there, but they are pretty slim.  My
> > biggest concern is the fact that while Follower Fetching
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> > >
> > helps
> > with potential consumer latency in a stretch cluster, there is nothing
> that
> > addresses producer latency.  I'd have expected the docs I've read to
> > mention this if it was a concern, but I haven't seen it.
> >
> > Specifically, let's say I'm a producer in DC-A, and I want to produce to
> > partition X with acks=all.  Partition X has 3 replicas, on brokers B1 in
> DC
> > A, B2 in DC-A and B3 in DC-B.  Currently, the replica on B3(DC-B) is the
> > partition leader.  IIUC, when I produce my message to partition X, that
> > message will cross the DC boundary for my produce request to B3(DC-B),
> then
> > back again when replica B1(DC-A) fetches, and also when replica B2(DC-A)
> > fetches, for a total of 3 times between DCs.
> >
> > Questions:
> > - Am I correct in understanding that each one of these fetches
> contributes
> > to the ack latency?
> >
> > - And, as the number of brokers and replica increases, the number of
> times
> > a message crosses the DC (likely) increases too?
> >
> > - When replicas are promoted to be a partition leader,  producer clients
> > will shuffle their connections around, often resulting in them connecting
> > to the leader in a remote datacenter. Should I be worried about this
> > unpredictability in cross DC network connections and traffic?
> >
> > I'm really hoping that a stretch cluster will help solve some Multi DC
> > streaming app architecture woes, but I'm not so sure the potential issues
> > with partition leaders is worth it!
> >
> > Thanks for any insight y'all have,
> > -Andrew Otto
> >  Wikimedia Foundation
> >
>
>
> --
> -- Guozhang
>

Re: Kafka Stretch Clusters

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Andrew.

Just to answer your questions first: yes, in the setup you describe the
message would make three cross-DC round trips, but since the replica fetches
can be done in parallel, the latency is not the sum of all the round trips.
And if you stay with only 2 DCs, the number of cross-DC round trips depends
only on the number of follower replicas that are in a different DC from the
leader replica.

Stepping back from the question and the setup you described, there are a
couple of things you can consider for your design:

1) For the producer -> leader hop, could you save the cross-DC network? For
example, if your messages can go to any partition (i.e. they are not keyed),
then you can write a custom "rack-aware" partitioner that always tries to
pick a partition whose leader is in the same DC as the producer client. Even
if a message's partition has to be determined deterministically by its key,
in operations you can still check whether most of your active producers are
in one DC, and then configure your topic partitions to be hosted by brokers
in that DC. Generally speaking, there are various ways to avoid this
cross-DC hop.
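
(A rough sketch of that rack-aware partitioner idea for un-keyed messages is
below. It reads the local DC name from a made-up producer property,
"local.dc", and falls back to all partitions when none has a local leader;
this is illustrative, not an out-of-the-box Kafka feature.)

import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

public class LocalDcPartitioner implements Partitioner {
    private String localDc;

    @Override
    public void configure(Map<String, ?> configs) {
        // "local.dc" is a custom property passed in the producer config.
        localDc = (String) configs.get("local.dc");
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        // Prefer partitions whose current leader's broker.rack is our DC.
        List<PartitionInfo> local = partitions.stream()
            .filter(p -> p.leader() != null && localDc != null
                && localDc.equals(p.leader().rack()))
            .collect(Collectors.toList());
        List<PartitionInfo> candidates = local.isEmpty() ? partitions : local;
        return candidates.get(ThreadLocalRandom.current()
            .nextInt(candidates.size())).partition();
    }

    @Override
    public void close() {}
}

The producer would then set partitioner.class to this class and pass
"local.dc" (e.g. "DC-A") in its config.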

2) For the leader -> follower hop, you can start by deciding how many
cross-DC failures you'd like to tolerate. For example, say you have 2N+1
replicas per partition, with N+1 replicas (including the leader) in one DC
and the other N replicas in the other DC. If we require N+2 acks, the data
is replicated to at least one remote replica before the request returns, and
hence would not be lost if one whole DC fails, which is sufficient for many
stretch and multi-colo cases. In practice, since cross-colo replication
takes more latency, you'd usually need far fewer than N cross-DC round trips
before the acks are satisfied, and your average/p99 latencies would not
increase much compared with just one cross-DC replica.
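
(One note on the acks: the modern Java producer only accepts acks=0, 1, or
all, so the usual way to express a durability floor like "N+2
acknowledgments" is acks=all on the producer plus min.insync.replicas=N+2 on
the topic. Strictly, acks=all waits on the whole current ISR, while
min.insync.replicas is the floor below which produce requests are rejected.
A sketch for N=2, with a made-up topic name:)

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateStretchTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-a1.dc-a:9092");
        try (Admin admin = Admin.create(props)) {
            // 2N+1 = 5 replicas, require N+2 = 4 in sync before acking.
            NewTopic topic = new NewTopic("events-topic", 6, (short) 5)
                .configs(Map.of("min.insync.replicas", "4"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}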


Guozhang


On Mon, May 9, 2022 at 11:58 AM Andrew Otto <ot...@wikimedia.org> wrote:

> Hi all,
>
> I'm evaluating <https://phabricator.wikimedia.org/T307944> the feasibility
> of setting up a cross datacenter Kafka 'stretch' cluster at The Wikimedia
> Foundation.
>
> I've found docs here and there, but they are pretty slim.  My
> biggest concern is the fact that while Follower Fetching
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
> >
> helps
> with potential consumer latency in a stretch cluster, there is nothing that
> addresses producer latency.  I'd have expected the docs I've read to
> mention this if it was a concern, but I haven't seen it.
>
> Specifically, let's say I'm a producer in DC-A, and I want to produce to
> partition X with acks=all.  Partition X has 3 replicas, on brokers B1 in DC
> A, B2 in DC-A and B3 in DC-B.  Currently, the replica on B3(DC-B) is the
> partition leader.  IIUC, when I produce my message to partition X, that
> message will cross the DC boundary for my produce request to B3(DC-B), then
> back again when replica B1(DC-A) fetches, and also when replica B2(DC-A)
> fetches, for a total of 3 times between DCs.
>
> Questions:
> - Am I correct in understanding that each one of these fetches contributes
> to the ack latency?
>
> - And, as the number of brokers and replica increases, the number of times
> a message crosses the DC (likely) increases too?
>
> - When replicas are promoted to be a partition leader,  producer clients
> will shuffle their connections around, often resulting in them connecting
> to the leader in a remote datacenter. Should I be worried about this
> unpredictability in cross DC network connections and traffic?
>
> I'm really hoping that a stretch cluster will help solve some Multi DC
> streaming app architecture woes, but I'm not so sure the potential issues
> with partition leaders is worth it!
>
> Thanks for any insight y'all have,
> -Andrew Otto
>  Wikimedia Foundation
>


-- 
-- Guozhang