Posted to users@kafka.apache.org by Maxime Petazzoni <ma...@turn.com> on 2013/07/12 02:18:54 UTC

Complex multi-datacenter setups

Hi all,

I was wondering if anybody here has, and is willing to share, experience
designing and operating complex multi-datacenter/multi-cluster
Kafka deployments in which data must flow from and to several distinct
Kafka clusters with more complex semantics than what MirrorMaker
provides.

The general, very sensible consensus is that producers of data should
publish to a local Kafka cluster. But if that data is produced in
multiple datacenters, and must be consumed in multiple datacenters as well,
then you need to implement data routing and filtering to organise your
pipeline.

Imagine the following scenario, with three datacenters A, B and C. Data
is being produced (of the same kind, to the same topic) in all three
datacenters. Both datacenters A and B have consumers that want all the
data generated in all three datacenters, but C is only interested in a
subset of what is produced in A and B (according to some specific
filters for example).

This means you have data flowing in both directions between each
datacenter. You need some kind of source-based filtering to prevent data
going back and forth ad vitam aeternam, as well as something to implement
the custom filtering logic where needed, which also means you'd need to
wrap all data in an envelope that records where the data was
originally published.
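
To make that a bit more concrete, here is the kind of envelope and
origin-based filter I have in mind. This is just a sketch in plain Java;
the class and field names are made up, and there is nothing
Kafka-specific in it yet:

    // Rough sketch; class and field names are made up for illustration.
    // The envelope records where an event was first published so that
    // the mirroring layer can do source-based filtering.
    class Envelope {
        final String originDc;   // datacenter where the event was first produced
        final long producedAtMs; // producer-side timestamp
        final byte[] payload;    // the actual event bytes

        Envelope(String originDc, long producedAtMs, byte[] payload) {
            this.originDc = originDc;
            this.producedAtMs = producedAtMs;
            this.payload = payload;
        }
    }

    // On an outbound link from this datacenter, only forward events that
    // were originally produced here; events that were themselves mirrored
    // in from another datacenter are not re-forwarded, which prevents
    // loops (and duplicates) in a full mesh of links.
    class OriginFilter {
        private final String localDc;

        OriginFilter(String localDc) {
            this.localDc = localDc;
        }

        boolean shouldForward(Envelope e) {
            return localDc.equals(e.originDc);
        }
    }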

Is this kind of deployment pretty common in the industry/among the users
of Kafka? I haven't found much online that would help with putting together
this type of architecture. Is it basically roll-your-own: something
similar to MirrorMaker, with a consumer, a filtering component and a
producer, and one of these placed in each direction between each
pair of clusters?

It ultimately boils down to pretty simple "routing" of data, just in a
more complex manner than having all data flow to a single sink location.
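
For the sake of discussion, here is roughly the shape I imagine for one
of those per-link bridges, written against the 0.8 Java consumer and
producer APIs. The connection strings, the topic name and the Filter
interface are all made up, and this is a sketch rather than something I
have actually run:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.javaapi.producer.Producer;
    import kafka.message.MessageAndMetadata;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    // Sketch only: one bridge per (source cluster, destination cluster)
    // link. It consumes from the source cluster, applies that link's
    // filtering rule, and re-publishes whatever passes to the destination.
    public class FilteringBridge {

        // The per-link rule, e.g. origin-based filtering plus whatever
        // custom predicate datacenter C needs.
        public interface Filter {
            boolean accept(byte[] key, byte[] message);
        }

        public static void run(String sourceZk, String destBrokerList,
                               String topic, Filter filter) {
            // Consumer side, pointed at the source cluster's ZooKeeper.
            Properties cprops = new Properties();
            cprops.put("zookeeper.connect", sourceZk);
            cprops.put("group.id", "bridge-" + topic);
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(cprops));

            // Producer side, pointed at the destination cluster's brokers.
            Properties pprops = new Properties();
            pprops.put("metadata.broker.list", destBrokerList);
            pprops.put("serializer.class", "kafka.serializer.DefaultEncoder");
            Producer<byte[], byte[]> producer =
                new Producer<byte[], byte[]>(new ProducerConfig(pprops));

            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap(topic, 1));
            ConsumerIterator<byte[], byte[]> it =
                streams.get(topic).get(0).iterator();

            while (it.hasNext()) {
                MessageAndMetadata<byte[], byte[]> record = it.next();
                if (filter.accept(record.key(), record.message())) {
                    producer.send(new KeyedMessage<byte[], byte[]>(
                        topic, record.key(), record.message()));
                }
            }
        }
    }

The pieces MirrorMaker already handles that this glosses over are the
usual ones: multiple streams, offset commits versus in-flight sends, and
retries. The overall shape is still just consume, filter, produce.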


Let me know what you folks think!

TIA,
/Max
-- 
Maxime Petazzoni
Sr. Platform Engineer
m 408.310.0595
www.turn.com

Re: Complex multi-datacenter setups

Posted by Jun Rao <ju...@gmail.com>.
Our plan is to keep the aggregate cluster in a few, but not all, DCs.

Thanks,

Jun


On Fri, Jul 12, 2013 at 9:30 AM, Maxime Petazzoni <maxime.petazzoni@turn.com> wrote:

> * Jun Rao <ju...@gmail.com> [2013-07-11 21:10:46]:
>
> > What we have at LinkedIn is an extra aggregate cluster per data center. We
> > use mirror maker to copy data from the local cluster in each of the data
> > centers to the aggregate one.
>
> Do you mean you have an aggregate cluster in each datacenter that gets
> all the data from all the other datacenters? And you have that setup in
> all your datacenters?
>
> Don't you end up moving a lot of data around for nothing? I'd like to
> hear more about your setup, if that's something you are at liberty to
> share.
>
> TIA,
> /Max
> --
> Maxime Petazzoni
> Sr. Platform Engineer
> m 408.310.0595
> www.turn.com
>

Re: Complex multi-datacenter setups

Posted by Maxime Petazzoni <ma...@turn.com>.
* Jun Rao <ju...@gmail.com> [2013-07-11 21:10:46]:

> What we have at LinkedIn is an extra aggregate cluster per data center. We
> use mirror maker to copy data from the local cluster in each of the data
> centers to the aggregate one.

Do you mean you have an aggregate cluster in each datacenter that gets
all the data from all the other datacenters? And you have that setup in
all your datacenters?

Don't you end up moving a lot of data around for nothing? I'd like to
hear more about your setup, if that's something you are at liberty to
share.

TIA,
/Max
-- 
Maxime Petazzoni
Sr. Platform Engineer
m 408.310.0595
www.turn.com

Re: Complex multi-datacenter setups

Posted by Jun Rao <ju...@gmail.com>.
What we have at LinkedIn is an extra aggregate cluster per data center. We
use mirror maker to copy data from the local cluster in each of the data
centers to the aggregate one.
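
Roughly, that is one mirror maker instance per source datacenter, with
the consumer config pointing at that datacenter's local cluster and the
producer config pointing at the aggregate cluster. Something along these
lines (the hostnames and file names below are made up):

    # One instance per source datacenter; consumer config points at that
    # datacenter's local cluster, producer config at the aggregate cluster.
    bin/kafka-run-class.sh kafka.tools.MirrorMaker \
      --consumer.config consumer-dc-local.properties \
      --producer.config producer-aggregate.properties \
      --whitelist '.*' \
      --num.streams 4

    # consumer-dc-local.properties
    zookeeper.connect=zk.dc-local.example.com:2181
    group.id=mirrormaker-to-aggregate

    # producer-aggregate.properties
    metadata.broker.list=kafka.aggregate.example.com:9092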

Thanks,

Jun


On Thu, Jul 11, 2013 at 5:18 PM, Maxime Petazzoni <maxime.petazzoni@turn.com> wrote:

> Hi all,
>
> I was wondering if anybody here has, and is willing to share, experience
> designing and operating complex multi-datacenter/multi-cluster
> Kafka deployments in which data must flow from and to several distinct
> Kafka clusters with more complex semantics than what MirrorMaker
> provides.
>
> The general, very sensible consensus is that producers of data should
> publish to a local Kafka cluster. But if that data is produced in
> multiple datacenters, and must be consumed in multiple datacenters as well,
> then you need to implement data routing and filtering to organise your
> pipeline.
>
> Imagine the following scenario, with three datacenters A, B and C. Data
> is being produced (of the same kind, to the same topic) in all three
> datacenters. Both datacenters A and B have consumers that want all the
> data generated in all three datacenters, but C is only interested in a
> subset of what is produced in A and B (according to some specific
> filters for example).
>
> This means you have data flowing in both directions between each
> datacenter. You need some kind of source-based filtering to prevent data
> going back and forth ad vitam aeternam, as well as something to implement
> the custom filtering logic where needed, which also means you'd need to
> wrap all data in an envelope that records where the data was
> originally published.
>
> Is this kind of deployment pretty common in the industry/among the users
> of Kafka? I haven't found much online that would help with putting together
> this type of architecture. Is it basically roll-your-own: something
> similar to MirrorMaker, with a consumer, a filtering component and a
> producer, and one of these placed in each direction between each
> pair of clusters?
>
> It ultimately boils down to pretty simple "routing" of data, just in a
> more complex manner than having all data flow to a single sink location.
>
>
> Let me know what you folks think!
>
> TIA,
> /Max
> --
> Maxime Petazzoni
> Sr. Platform Engineer
> m 408.310.0595
> www.turn.com
>