Posted to users@kafka.apache.org by Ted Swerve <te...@gmail.com> on 2016/02/15 17:56:56 UTC

Kafka as master data store

Hello,

Is it viable to use infinite-retention Kafka topics as a master data
store?  I'm not talking massive volumes of data here, but still potentially
extending into tens of terabytes.

Are there any drawbacks or pitfalls to such an approach?  It seems like a
compelling design, but there seem to be mixed messages about its
suitability for this kind of role.

Regards,
Ted

Re: Kafka as master data store

Posted by Ben Stopford <be...@confluent.io>.
Hi Ted

This is an interesting question. 

Kafka has similar resilience properties to other distributed stores such as Cassandra, which are used as master data stores (obviously without the query functions). You’d need to set unclean.leader.election.enable=false and configure sufficient replication to get good resiliency. 
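A minimal sketch of the settings mentioned above, written as plain Python so it runs without a broker. Only unclean.leader.election.enable=false comes from this message; the replication factor of 3, min.insync.replicas of 2 and the producer's acks=all are illustrative assumptions about what "sufficient replication" might look like, and the helper names are hypothetical.

```python
# Sketch only: values marked "assumed" are illustrative, not taken from the thread.

# Broker-side settings (would live in server.properties).
BROKER_SETTINGS = {
    # From the message above: never elect an out-of-sync replica as leader.
    "unclean.leader.election.enable": "false",
    # Assumed: keep three copies of every partition.
    "default.replication.factor": "3",
    # Assumed: an acks=all write needs at least two in-sync replicas.
    "min.insync.replicas": "2",
}

# Producer-side setting (assumed): wait for all in-sync replicas to acknowledge.
PRODUCER_SETTINGS = {"acks": "all"}


def to_properties(settings):
    """Render a dict as Java-style .properties lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas


print(to_properties(BROKER_SETTINGS))
print(to_properties(PRODUCER_SETTINGS))
```

With these assumed values, one replica of a partition can be down and acks=all writes are still accepted; losing a second one makes the partition refuse such writes rather than risk losing acknowledged data.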

One objection to doing this would be that the majority of Kafka usage is for transitory data. This is fair, and I’ve not seen Kafka used as a master data store per se. I have seen it used for reliable messaging, which means not losing data and hence requires similar properties. Certainly there is nothing I can think of that would suggest Kafka would be any worse than other distributed data stores, but to further mitigate concerns you could use Kafka Connect to create a backup in HDFS, on a SAN, etc.

All the best

B 



> On 15 Feb 2016, at 08:56, Ted Swerve <te...@gmail.com> wrote:
> 
> Hello,
> 
> Is it viable to use infinite-retention Kafka topics as a master data
> store?  I'm not talking massive volumes of data here, but still potentially
> extending into tens of terabytes.
> 
> Are there any drawbacks or pitfalls to such an approach?  It seems like a
> compelling design, but there seem to be mixed messages about its
> suitability for this kind of role.
> 
> Regards,
> Ted


Re: Kafka as master data store

Posted by Damian Guy <da...@gmail.com>.
Hi Ted - if the data is keyed, you can use a log-compacted topic and
essentially keep the data 'forever', i.e., you'll always have the latest
version of the data for a given key. However, you'd still want to back up
the data someplace else, just in case.
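To make "the latest version of the data for a given key" concrete, here is a small, idealised simulation of a compaction pass in Python. It is a sketch under assumptions, not how the broker implements it: the compact function and the sample records are hypothetical, and the real log cleaner (enabled per topic with cleanup.policy=compact) works segment by segment, leaves the active segment alone, and keeps delete markers around for a while before dropping them.

```python
def compact(log):
    """Return the log after an idealised compaction pass.

    `log` is a list of (offset, key, value) tuples in offset order.
    Only the newest record per key survives. A value of None is treated
    as a delete marker (tombstone) and removes the key entirely.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    survivors = [record for record in latest.values() if record[2] is not None]
    return sorted(survivors)  # offsets are unique, so this restores log order


# Hypothetical sample data.
log = [
    (0, "user-1", "alice"),
    (1, "user-2", "bob"),
    (2, "user-1", "alice v2"),
    (3, "user-2", None),      # tombstone: user-2 was deleted
    (4, "user-3", "carol"),
]

compacted = compact(log)
print(compacted)
```

Note that surviving records keep their original offsets, so a compacted topic has gaps in its offset sequence; consumers that read forward from an offset are unaffected by this.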

On 16 February 2016 at 21:25, Ted Swerve <te...@gmail.com> wrote:

> I guess I was just drawn in by the elegance of having everything available
> in one well-defined Kafka topic should I start up some new code.
>
> If instead the Kafka topics were on a retention period of say 7 days, that
> would involve firing up a topic to load the warehoused data from HDFS (or a
> more traditional load), and then switching over to the live topic?

Re: Kafka as master data store

Posted by Daniel Schierbeck <da...@zendesk.com.INVALID>.
I'm also very interested in using Kafka as a persistent, distributed commit
log – essentially the write side of a distributed database, with the read
side being an array of various query stores (Elasticsearch, Redis,
whatever) and stream processing systems.

The benefit of retaining data in Kafka indefinitely is the ease with which
it's possible to bootstrap new read-side technologies. I really feel that
there should be a standardized Kafka configuration optimized for this case,
with long-term durability in mind.
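The bootstrapping idea can be sketched without a broker: a new read-side store is built by replaying the retained log from offset 0 and applying each record. Everything below is a hypothetical stand-in written for illustration (InMemoryLog plays the role of a single partition, and the two apply functions play the role of different query stores); it is not a Kafka client API.

```python
class InMemoryLog:
    """Hypothetical stand-in for one Kafka partition: an append-only list."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        """Append a record and return the offset it was written at."""
        self._records.append((key, value))
        return len(self._records) - 1

    def read_from(self, offset):
        """Yield (offset, key, value) from `offset` to the current end."""
        for off in range(offset, len(self._records)):
            key, value = self._records[off]
            yield off, key, value


def bootstrap_view(log, apply, from_offset=0):
    """Replay the log into a fresh view; return the view and the next offset."""
    view = {}
    next_offset = from_offset
    for offset, key, value in log.read_from(from_offset):
        apply(view, key, value)
        next_offset = offset + 1
    return view, next_offset


def latest_value(view, key, value):
    """Read side 1: a key/value lookup holding the newest value per key."""
    view[key] = value


def event_count(view, key, value):
    """Read side 2: how many events have been seen per key."""
    view[key] = view.get(key, 0) + 1


log = InMemoryLog()
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    log.append(key, value)

lookup, resume_at = bootstrap_view(log, latest_value)
counts, _ = bootstrap_view(log, event_count)
print(lookup, counts, resume_at)
```

The returned next offset is what a new consumer would carry on from once it has caught up, which is why full retention makes adding another read side cheap: no separate export or backfill path is needed.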

On Tue, Feb 16, 2016 at 10:26 PM Ted Swerve <te...@gmail.com> wrote:

> I guess I was just drawn in by the elegance of having everything available
> in one well-defined Kafka topic should I start up some new code.
>
> If instead the Kafka topics were on a retention period of say 7 days, that
> would involve firing up a topic to load the warehoused data from HDFS (or a
> more traditional load), and then switching over to the live topic?

Re: Kafka as master data store

Posted by Ted Swerve <te...@gmail.com>.
I guess I was just drawn in by the elegance of having everything available
in one well-defined Kafka topic should I start up some new code.

If instead the Kafka topics were on a retention period of say 7 days, that
would involve firing up a topic to load the warehoused data from HDFS (or a
more traditional load), and then switching over to the live topic?
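One way to picture the switch-over asked about here, as a hedged Python sketch. It assumes the warehoused snapshot records the last offset it covers and that the live topic still retains everything after that offset; the restore function, the snapshot layout and the sample data are all hypothetical.

```python
def restore(snapshot, live_records, earliest_live_offset):
    """Rebuild state from a snapshot, then continue from the live topic.

    snapshot: {"state": dict, "last_offset": int}, the warehoused copy and
        the last offset folded into it (an assumed layout).
    live_records: iterable of (offset, key, value) still retained in the topic.
    earliest_live_offset: oldest offset the topic has not yet deleted.
    """
    resume_at = snapshot["last_offset"] + 1
    if earliest_live_offset > resume_at:
        # Retention has already deleted records the snapshot does not cover.
        raise ValueError("gap between snapshot and retained log")

    state = dict(snapshot["state"])
    for offset, key, value in live_records:
        if offset < resume_at:
            continue  # already reflected in the snapshot
        state[key] = value
    return state


# Hypothetical data: the snapshot covers offsets up to 9, the topic retains 8 onwards.
snapshot = {"state": {"a": 1, "b": 2}, "last_offset": 9}
live = [(8, "b", 2), (9, "a", 1), (10, "a", 5), (11, "c", 7)]

state = restore(snapshot, live, earliest_live_offset=8)
print(state)
```

The gap check is the part that a 7-day retention period makes necessary: if snapshots are taken less often than the retention window, some records exist in neither place.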

On Tue, Feb 16, 2016 at 8:32 AM, Ben Stopford <be...@confluent.io> wrote:

> Ted - it depends on your domain. More conservative approaches to long
> lived data protect against data corruption, which generally means snapshots
> and cold storage.

Re: Kafka as master data store

Posted by Ben Stopford <be...@confluent.io>.
Ted - it depends on your domain. More conservative approaches to long-lived data protect against data corruption, which generally means snapshots and cold storage.


> On 15 Feb 2016, at 21:31, Ted Swerve <te...@gmail.com> wrote:
> 
> Hi Ben, Sharninder,
> 
> Thanks for your responses, I appreciate it.
> 
> Ben - thanks for the tips on settings. A backup could certainly be a
> possibility, although if only with similar durability guarantees, I'm not
> sure what the purpose would be?
> 
> Sharninder - yes, we would only be using the logs as forward-only streams -
> i.e. picking an offset to read from and moving forwards - and would be
> setting retention time to essentially infinite.
> 
> Regards,
> Ted.


Re: Kafka as master data store

Posted by Ted Swerve <te...@gmail.com>.
Hi Ben, Sharninder,

Thanks for your responses, I appreciate it.

Ben - thanks for the tips on settings. A backup could certainly be a
possibility, although if it only offers similar durability guarantees, I'm
not sure what the purpose would be.

Sharninder - yes, we would only be using the logs as forward-only streams -
i.e. picking an offset to read from and moving forwards - and would be
setting retention time to essentially infinite.

Regards,
Ted.

On Tue, Feb 16, 2016 at 5:05 AM, Sharninder Khera <sh...@gmail.com>
wrote:

> This topic comes up often on this list. Kafka can be used as a datastore
> if that’s what your application wants with the caveat that Kafka isn’t
> designed to keep data around forever. There is a default retention time
> after which older data gets deleted. The high level consumer essentially
> reads data as a stream and while you can do sort of random access with the
> low level consumer, it’s not ideal.

Re: Kafka as master data store

Posted by Sharninder Khera <sh...@gmail.com>.
This topic comes up often on this list. Kafka can be used as a datastore if that’s what your application wants, with the caveat that Kafka isn’t designed to keep data around forever. There is a default retention time after which older data gets deleted. The high-level consumer essentially reads data as a stream, and while you can do a sort of random access with the low-level consumer, it’s not ideal.
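As a rough illustration of the retention behaviour described here, the Python sketch below models the time-based rule only. The is_expired helper is hypothetical; the 7-day figure matches the broker default (log.retention.hours=168), and treating -1 as "no time limit" follows the documented meaning of the topic-level retention.ms setting, which is worth checking against the Kafka version in use.

```python
# Broker default retention: 168 hours, i.e. 7 days, in milliseconds.
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

# Assumed convention: -1 disables time-based deletion.
NO_TIME_LIMIT = -1


def is_expired(segment_age_ms, retention_ms):
    """True if a log segment of this age is eligible for deletion."""
    if retention_ms == NO_TIME_LIMIT:
        return False
    return segment_age_ms > retention_ms


eight_days_ms = 8 * 24 * 60 * 60 * 1000
print(is_expired(eight_days_ms, SEVEN_DAYS_MS))
print(is_expired(eight_days_ms, NO_TIME_LIMIT))
```

Size-based retention (retention.bytes) is a separate limit and is not modelled here; a topic meant to be kept indefinitely needs both limits disabled.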



> On 15-Feb-2016, at 10:26 PM, Ted Swerve <te...@gmail.com> wrote:
> 
> Hello,
> 
> Is it viable to use infinite-retention Kafka topics as a master data
> store?  I'm not talking massive volumes of data here, but still potentially
> extending into tens of terabytes.
> 
> Are there any drawbacks or pitfalls to such an approach?  It seems like a
> compelling design, but there seem to be mixed messages about its
> suitability for this kind of role.
> 
> Regards,
> Ted