Posted to users@kafka.apache.org by "Young, Ben" <Be...@sungard.com> on 2015/11/06 09:20:28 UTC

Locality question

Hi,

I've had a look over the website and searched the archives, but I can't find any obvious answers to this, so apologies if it's been asked before.

I'm investigating potentially using Kafka as the transaction log for our in-memory database technology. The idea is that Kafka's partitioning and replication will "automatically" give us sharding and hot-standby capabilities in the db (obviously with a fair amount of work).

The database can ingest hundreds of gigabytes of data extremely quickly, easily enough to saturate any reasonable network connection, so I've thought about co-locating the db on the same nodes of the Kafka cluster that actually store the data, to cut the network out of the loading process entirely. We'd also probably want the db topology to be defined first, and the Kafka partitioning to follow. I can see how to use the partitioner class to assign a specific partition to a key, but I can't currently see how to pin partitions to known machines upfront. Is this possible?
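As a minimal Python sketch of the key-to-partition pinning described above (the shard table and names here are hypothetical; in the real Java client the hook is the Partitioner interface, configured via partitioner.class):

```python
import zlib

# Sketch of pinning known keys to fixed partitions, with a stable hash
# as the fallback for everything else. The shard table is hypothetical,
# for illustration only - not a real Kafka client API.
SHARD_TO_PARTITION = {
    "shard-a": 0,
    "shard-b": 1,
    "shard-c": 2,
}

def partition_for(key: str, num_partitions: int) -> int:
    """Pin known shard keys to fixed partitions; hash the rest."""
    if key in SHARD_TO_PARTITION:
        return SHARD_TO_PARTITION[key]
    # Fallback: a deterministic hash of the key bytes, so the same key
    # always lands on the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Note this only fixes which partition a key goes to, not which machine that partition lives on - which is the second half of the question.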

Does the plan sound reasonable in general? I've also considered a log shipping approach like Flume, but Kafka seems simplest all round, and I really like the idea of just being able to set the log offset to zero to reload on startup.
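That "rewind to zero" idea can be modelled with a toy replayable log (a sketch only; with a real Kafka consumer this is a seek back to the earliest offset, e.g. auto.offset.reset=earliest or the client's seek-to-beginning call, not this toy class):

```python
# Toy model of "set the log offset to zero to reload": an append-only
# log where a consumer can rewind its offset and re-read everything.
class ReplayableLog:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        """Return every record at or after the given offset."""
        return self.records[offset:]

log = ReplayableLog()
for txn in ["insert A", "update A", "insert B"]:
    log.append(txn)

# On startup, rewind to offset 0 and replay the whole transaction log.
replayed = list(log.read_from(0))
```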

Thanks,
Ben Young


Ben Young . Principal Software Engineer . Adaptiv . 



RE: Locality question

Posted by "Young, Ben" <Be...@sungard.com>.
Hi Svante,

Thank you, that's cleared things up. We'll look for something where locality is more built in.

Ben

> -----Original Message-----
> From: Svante Karlsson [mailto:svante.karlsson@csi.se]
> Sent: 12 November 2015 09:17
> To: users@kafka.apache.org
> Subject: Re: Locality question
> 
> If a Kafka partition is replicated to 3 nodes, the leader for that
> partition moves between those nodes over time, which makes the
> co-location pointless: you can only produce to and consume from the
> current leader.
> 
> /svante

Re: Locality question

Posted by Svante Karlsson <sv...@csi.se>.
If a Kafka partition is replicated to 3 nodes, the leader for that
partition moves between those nodes over time, which makes the
co-location pointless: you can only produce to and consume from the
current leader.

/svante
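Svante's point can be sketched with a toy model (a simplification, not Kafka's actual controller logic; the broker ids are hypothetical): the leader is elected from the replica set, so a producer pinned to one host stops being local the moment leadership moves.

```python
# Toy model of partition leadership: the leader is elected among the
# replicas and moves on broker failure, so co-locating a db with one
# fixed broker cannot guarantee locality. Broker ids are hypothetical.
REPLICAS = [0, 1, 2]  # brokers holding copies of one partition

def elect_leader(replicas, alive):
    """Pick the first alive replica as leader (simplified election)."""
    for broker in replicas:
        if broker in alive:
            return broker
    raise RuntimeError("no alive replica; partition is offline")

leader = elect_leader(REPLICAS, alive={0, 1, 2})  # broker 0 leads
failover = elect_leader(REPLICAS, alive={1, 2})   # broker 0 dies: moves
```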



2015-11-12 9:00 GMT+01:00 Young, Ben <Be...@sungard.com>:


RE: Locality question

Posted by "Young, Ben" <Be...@sungard.com>.
Hi,

Any thoughts on this? Perhaps Kafka is not the best way to go for this, but the docs do mention transaction/replication logs as a use case, and I'd have thought locality would have been important for that?

Thanks,
Ben




RE: Locality question

Posted by "Young, Ben" <Be...@sungard.com>.
Hi Prabhjot,

Thanks, yes, there are almost too many ways to do it. One other option is to use HDFS for the locality side of things, but it would be nice if we could use Kafka for both.

Thanks,
Ben

-----Original Message-----
From: Prabhjot Bharaj [mailto:prabhbharaj@gmail.com] 
Sent: 06 November 2015 11:41
To: users@kafka.apache.org
Subject: Re: Locality question


Re: Locality question

Posted by Prabhjot Bharaj <pr...@gmail.com>.
Hi Ben,

I have a similar use case, where data from the processing layer gets onto
a storage layer and data for each db is distributed amongst the storage
machines based on some logic.
The architecture that we had agreed upon was to name topics after storage
machines, so the processing layer can send all data meant for one storage
machine to a topic that only that storage machine will consume.
If you later change the logic for distributing data amongst the storage
machines, the processing machines can take a time-based approach to
sending the data to the respective topics. In this case, you won't need
to know which partitions reside on which machine. If you have, let's say,
x partitions, you can keep writing data to this topic (meant for one
storage machine) without worrying where those partitions reside in the
Kafka cluster.

The other architecture was similar to the one you are planning - to have
db names as topics, each with its own partitions (more than 200 per
topic). This design would have killed our system, because we have lots of
dbs, and having too many topics and partitions on the transfer layer
(like Kafka) might have required higher-end hardware as well.
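As a rough Python sketch of this topic-per-storage-machine routing (the machine names and the hash rule here are hypothetical, not from the actual setup): producers pick a topic from the destination machine, so they never need to know partition placement.

```python
import zlib

# Sketch of topic-per-storage-machine routing: each db's data goes to
# the topic named after one storage machine, chosen by a stable hash.
# Machine names and the routing rule are hypothetical.
STORAGE_MACHINES = ["store-01", "store-02", "store-03"]

def topic_for(db_name: str) -> str:
    """Route a db's data to one storage machine's topic."""
    idx = zlib.crc32(db_name.encode("utf-8")) % len(STORAGE_MACHINES)
    return STORAGE_MACHINES[idx]
```

The consumer side is then trivial: each storage machine subscribes only to its own topic, regardless of how many partitions that topic has or where they live.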


Your other question still remains unanswered here - where to deploy your
storage machines? On the Kafka cluster, or separately (requiring a
network hop for your data)? Both approaches have their pros and cons.

I would love to hear about this from the community.

Thanks,
Prabhjot

On Fri, Nov 6, 2015 at 1:50 PM, Young, Ben <Be...@sungard.com> wrote:



-- 
---------------------------------------------------------
"There are only 10 types of people in the world: Those who understand
binary, and those who don't"