Posted to users@kafka.apache.org by Daniel Schierbeck <da...@gmail.com> on 2015/07/10 10:46:48 UTC

Using Kafka as a persistent store

I'd like to use Kafka as a persistent store – sort of as an alternative to
HDFS. The idea is that I'd load the data into various other systems in
order to solve specific needs such as full-text search, analytics, indexing
by various attributes, etc. I'd like to keep a single source of truth,
however.

I'm struggling a bit to understand how I can configure a topic to retain
messages indefinitely. I want to make sure that my data isn't deleted. Is
there a guide to configuring Kafka like this?

Re: Using Kafka as a persistent store

Posted by noah <ia...@gmail.com>.
I don't want to endorse this use of Kafka, but assuming you can give your
messages unique identifiers, I believe using log compaction will keep all
unique messages forever. You can read about how consumer offsets stored in
Kafka are managed using a compacted topic here:
http://kafka.apache.org/documentation.html#distributionimpl In that case,
the consumer group id+topic+partition forms a unique message id, and the
brokers read that topic into the offsets cache on startup (and take updates
to the offsets cache via the same topic). If you have a finite, smallish
data set that you want indexed in multiple systems, that might be a good
approach.
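
For reference, creating a compacted topic looks roughly like this on an
0.8.x broker (topic name, partition/replica counts, and ZooKeeper address
are placeholders; the brokers also need log.cleaner.enable=true for the
cleaner to actually run):

    bin/kafka-topics.sh --create --zookeeper localhost:2181 \
      --topic my-events --partitions 8 --replication-factor 2 \
      --config cleanup.policy=compact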

If your data can grow without bound, Kafka doesn't seem to me like a good
choice. Even with compaction, you will still have to sequentially read it
all, message by message, to get it into a different system. As far as I
know, there is no lookup by id, and even seeking to a specific date is a
manual O(log n) process.

(warning: I'm just another user, so I may have a few things wrong.)


Re: Using Kafka as a persistent store

Posted by Shayne S <sh...@gmail.com>.
Thanks, I'm on 0.8.2, so that explains it.

Should retention.ms affect segment rolling? In my experiment it did
(retention.ms = -1), which was unexpected, since I thought only
segment.bytes and segment.ms would control that.
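
For reference, segment rolling can also be tuned per topic; a sketch with
made-up values (1 GiB segments, roll at least every 7 days), assuming these
overrides are available on your broker version:

    bin/kafka-topics.sh --alter --zookeeper localhost:2181 \
      --topic my-events \
      --config segment.bytes=1073741824 \
      --config segment.ms=604800000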

Re: Using Kafka as a persistent store

Posted by Daniel Tamai <da...@gmail.com>.
Using -1 for log.retention.ms should work only for 0.8.3 (
https://issues.apache.org/jira/browse/KAFKA-1990).
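
On 0.8.3+, the per-topic override would then be set like this (a sketch;
topic name and ZooKeeper address are placeholders):

    bin/kafka-topics.sh --alter --zookeeper localhost:2181 \
      --topic my-events \
      --config retention.ms=-1 \
      --config retention.bytes=-1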

Re: Using Kafka as a persistent store

Posted by Shayne S <sh...@gmail.com>.
Did this work for you? I set the topic settings to retention.ms=-1 and
retention.bytes=-1 and it looks like it is deleting segments immediately.

On Sun, Jul 12, 2015 at 8:02 AM, Daniel Schierbeck <
daniel.schierbeck@gmail.com> wrote:

> > On 10. jul. 2015, at 23.03, Jay Kreps <ja...@confluent.io> wrote:
> >
> > If I recall correctly, setting log.retention.ms and log.retention.bytes
> > to -1 disables both.
>
> Thanks!
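
For the broker-wide route Jay mentions, the equivalent server.properties
entries would look like this (a sketch; note the KAFKA-1990 caveat above
about which versions honor -1):

    # server.properties -- broker-wide defaults for all topics
    log.retention.ms=-1
    log.retention.bytes=-1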

Re: Using Kafka as a persistent store

Posted by Gwen Shapira <gs...@cloudera.com>.
Hi,

1. What you described sounds like a reasonable architecture, but may I
ask why JSON? Avro seems better supported in the ecosystem (Confluent's
tools, Hadoop integration, schema evolution, etc.); see the schema sketch
after point 2.

1.5 If all you do is convert data to JSON, Spark Streaming sounds like
difficult-to-manage overkill compared to Flume or a slightly modified
MirrorMaker (or CopyCat, if it exists yet). Any specific reasons for
Spark Streaming?

2. Different compute engines prefer different storage formats because in
most cases that's where optimizations come from. Parquet improves scan
performance for Impala and MR, but would be pretty horrible for NoSQL.
So I wouldn't hold my breath for compute engines to suddenly start
sharing a storage format.
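
To make point 1 concrete, a minimal Avro schema for an event record could
look like this (the record name and fields are made up for illustration):

    {
      "type": "record",
      "name": "Event",
      "namespace": "com.example.events",
      "fields": [
        {"name": "id",        "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "payload",   "type": "string"}
      ]
    }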

Gwen

Re: Using Kafka as a persistent store

Posted by Rad Gruchalski <ra...@gruchalski.com>.
Sounds like the same idea. The nice thing about having such an option is that, with a correct application of containers and a backup-and-restore strategy, one can create an infinite, ordered backup of the raw input stream using Kafka's native storage format.
I understand the point of having the data in other formats in other systems. Impossible to get away from that.
The concept I presented a few days ago is meant to address having “multiple same-looking copies of the truth”.

At the end of the day, if something happens to the operational data, it will have to be recreated from “the truth”. But if the data was once ingested over Kafka, and there is already a pipeline for building operational state from Kafka, why would someone write separate processing logic to get the truth from, say, Hadoop? And if fast, parallel processing of the native Kafka format is required, it can still be done with Samza or Hadoop / what have you.

Kind regards,
Radek Gruchalski

Re: Using Kafka as a persistent store

Posted by James Cheng <jc...@tivo.com>.
For what it's worth, I did something similar to Rad's suggestion of "cold-storage" to add long-term archiving when using Amazon Kinesis. Kinesis is also a message bus, but only has a 24-hour retention window.

I wrote a Kinesis consumer that would take all messages from Kinesis and save them into S3. I stored them in S3 in such a way that the structure mirrors the original Kinesis stream, and all message metadata is preserved (message offsets and primary keys, for example).
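
For illustration, the kind of S3 key layout meant here, with a made-up
bucket name (one object per record or batch, so the stream position is
recoverable from the key itself):

    s3://my-archive-bucket/<stream-name>/<shard-id>/<starting-sequence-number>.json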

This means that I can write a "consumer" that would consume from S3 files in the same way that it would consume from the Kinesis stream itself. And the data is structured such that when you are done reading from S3, you can connect to the Kinesis stream at the point where the S3 archive left off.

This effectively allowed me to add a configurable retention period when consuming from Kinesis.

-James

Re: Using Kafka as a persistent store

Posted by Tim Smith <se...@gmail.com>.
I have had a similar issue where I wanted a single source of truth between
Search and HDFS. First, if you zoom out a little, eventually you are going
to have some compute engine(s) process the data. If you store it in a
compute-neutral tier like Kafka, then you will need to suck the data out at
runtime and stage it for the compute engine to use. So pick your poison:
process at ingest and store multiple copies of the data, one per compute
engine, OR store it in a neutral store and process at runtime. I am not
saying one is better than the other, but that's how I see the trade-off,
so depending on your use cases, YMMV.

What I do is:
- Store raw data into Kafka
- Use Spark Streaming to transform the data to JSON and post it back to Kafka
- Hang multiple data stores off Kafka that ingest the JSON
- Not do any other transformations in the "consumer" stores, and store the
  copy as an immutable event

So I do have multiple copies (one per compute tier) but they all look the
same.

Unless different compute engines natively start to use a common data
storage format, I don't see how one could get away from storing multiple
copies. Primarily, I see that Lucene-based products have their own format,
the Hadoop ecosystem seems to be congregating around Parquet, and the NoSQL
players have their own formats (one per product).

My 2 cents worth :)



Re: Using Kafka as a persistent store

Posted by Daniel Schierbeck <da...@gmail.com>.
Am I correct in assuming that Kafka will only retain a file handle for the last segment of the log? If the number of handles grows unbounded, then it would be an issue. But I plan on writing to this topic continuously anyway, so not separating data into cold and hot storage is the entire point. 
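
(If the broker does hold a handle on every segment file, as Scott reports
below, one blunt mitigation is raising the broker's open-file limit; a
sketch, with the user name and numbers made up:)

    # /etc/security/limits.conf on the broker hosts
    kafka  soft  nofile  100000
    kafka  hard  nofile  100000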

Daniel Schierbeck

> On 13. jul. 2015, at 15.41, Scott Thibault <sc...@multiscalehn.com> wrote:
> 
> We've tried to use Kafka not as a persistent store, but as a long-term
> archival store. An outstanding issue we've had with that is that the
> broker holds on to an open file handle on every file in the log! The other
> issue we've had is when you create a long-term archival log on shared
> storage, you can't simply access that data from another cluster b/c of
> metadata being stored in zookeeper rather than in the log.
> 
> --Scott Thibault
> 
> 
> On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck <
> daniel.schierbeck@gmail.com> wrote:
> 
>> Would it be possible to document how to configure Kafka to never delete
>> messages in a topic? It took a good while to figure this out, and I see it
>> as an important use case for Kafka.
>> 
>> On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck <
>> daniel.schierbeck@gmail.com> wrote:
>> 
>>>> On 10. jul. 2015, at 23.03, Jay Kreps <ja...@confluent.io> wrote:
>>>> 
>>>> If I recall correctly, setting log.retention.ms and log.retention.bytes
>>>> to -1 disables both.
>>> 
>>> Thanks!
>>> 
>>>> On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck <
>>>> daniel.schierbeck@gmail.com> wrote:
>>>> 
>>>>>> On 10. jul. 2015, at 15.16, Shayne S <sh...@gmail.com> wrote:
>>>>>> 
>>>>>> There are two ways you can configure your topics: log compaction, or
>>>>>> no cleaning at all. The choice depends on your use case. Are the
>>>>>> records uniquely identifiable, and will they receive updates? Then
>>>>>> log compaction is the way to go. If they are truly read-only, you can
>>>>>> go without log compaction.
>>>>> 
>>>>> I'd rather be free to use the key for partitioning, and the records are
>>>>> immutable — they're event records — so disabling compaction altogether
>>>>> would be preferable. How is that accomplished?
>>>>> 
>>>>>> We have small processes which consume a topic and perform upserts to
>>>>>> our various database engines. It's easy to change how it all works and
>>>>>> simply consume the single source of truth again.
>>>>>> 
>>>>>> I've written a bit about log compaction here:
>>>>>> http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
>>>>>> 
>>>>>> On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
>>>>>> daniel.schierbeck@gmail.com> wrote:
>>>>>> 
>>>>>>> I'd like to use Kafka as a persistent store – sort of as an
>>>>>>> alternative to HDFS. The idea is that I'd load the data into various
>>>>>>> other systems in order to solve specific needs such as full-text
>>>>>>> search, analytics, indexing by various attributes, etc. I'd like to
>>>>>>> keep a single source of truth, however.
>>>>>>> 
>>>>>>> I'm struggling a bit to understand how I can configure a topic to
>>>>>>> retain messages indefinitely. I want to make sure that my data isn't
>>>>>>> deleted. Is there a guide to configuring Kafka like this?

Re: Using Kafka as a persistent store

Posted by Rad Gruchalski <ra...@gruchalski.com>.
Indeed, the files would have to be moved to some separate, dedicated storage.
There are basically three options, as Kafka does not allow adding logs at runtime:

1. make the consumer able to read from an arbitrary file
2. add the ability to drop files in (I believe this adds a lot of complexity)
3. read the files with another program, as suggested in my first email (see the sketch below)

I’d love to get some input from someone who knows the code and options a bit better!  
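
For option 3, the stock DumpLogSegments tool can already read a segment
file directly; a sketch, with a made-up path to a copied-off segment:

    bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
      --files /cold-storage/my-events-0/00000000000000000000.log \
      --print-data-log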

Kind regards,
Radek Gruchalski

Re: Using Kafka as a persistent store

Posted by Scott Thibault <sc...@multiscalehn.com>.
Yes, consider my e-mail an up vote!

I guess the files would automatically be moved somewhere else to separate
the active from the cold segments? Ideally, one could run an unmodified
consumer application on the cold segments.


--Scott


On Mon, Jul 13, 2015 at 6:57 AM, Rad Gruchalski <ra...@gruchalski.com>
wrote:

> Scott,
>
> This is what I was trying to target in one of my previous responses to
> Daniel. The one in which I suggest another compaction setting for kafka.
>
>
>
>
>
>
>
>
>
>
> Kind regards,
> Radek Gruchalski
> radek@gruchalski.com (mailto:radek@gruchalski.com) (mailto:
> radek@gruchalski.com)
> de.linkedin.com/in/radgruchalski/ (
> http://de.linkedin.com/in/radgruchalski/)
>
> Confidentiality:
> This communication is intended for the above-named person and may be
> confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
>
>
> On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:
>
> > We've tried to use Kafka not as a persistent store, but as a long-term
> > archival store. An outstanding issue we've had with that is that the
> > broker holds on to an open file handle on every file in the log! The
> other
> > issue we've had is when you create a long-term archival log on shared
> > storage, you can't simply access that data from another cluster b/c of
> meta
> > data being stored in zookeeper rather than in the log.
> >
> > --Scott Thibault
> >
> >
> > On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck <
> > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> >
> > > Would it be possible to document how to configure Kafka to never delete
> > > messages in a topic? It took a good while to figure this out, and I
> see it
> > > as an important use case for Kafka.
> > >
> > > On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck <
> > > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> > >
> > > > > On 10. jul. 2015, at 23.03, Jay Kreps <jay@confluent.io (mailto:jay@confluent.io)> wrote:
> > > > >
> > > > > If I recall correctly, setting log.retention.ms (http://log.retention.ms) and
> > > > > log.retention.bytes to -1 disables both.
> > > >
> > > > Thanks!
> > > >
> > > > > On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck <
> > > > > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> > > > >
> > > > > > > On 10. jul. 2015, at 15.16, Shayne S <shaynest113@gmail.com (mailto:shaynest113@gmail.com)> wrote:
> > > > > > >
> > > > > > > There are two ways you can configure your topics, log compaction and with
> > > > > > > no cleaning. The choice depends on your use case. Are the records uniquely
> > > > > > > identifiable and will they receive updates? Then log compaction is the way
> > > > > > > to go. If they are truly read only, you can go without log compaction.
> > > > > >
> > > > > > I'd rather be free to use the key for partitioning, and the records are
> > > > > > immutable — they're event records — so disabling compaction altogether
> > > > > > would be preferable. How is that accomplished?
> > > > > > >
> > > > > > > We have small processes which consume a topic and perform upserts to our
> > > > > > > various database engines. It's easy to change how it all works and simply
> > > > > > > consume the single source of truth again.
> > > > > > >
> > > > > > > I've written a bit about log compaction here:
> > > > > > > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > > > > > >
> > > > > > > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> > > > > > > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> > > > > > >
> > > > > > > > I'd like to use Kafka as a persistent store – sort of as an alternative to
> > > > > > > > HDFS. The idea is that I'd load the data into various other systems in
> > > > > > > > order to solve specific needs such as full-text search, analytics, indexing
> > > > > > > > by various attributes, etc. I'd like to keep a single source of truth,
> > > > > > > > however.
> > > > > > > >
> > > > > > > > I'm struggling a bit to understand how I can configure a topic to retain
> > > > > > > > messages indefinitely. I want to make sure that my data isn't deleted. Is
> > > > > > > > there a guide to configuring Kafka like this?
> >
> >
> >
> >
> >
> >
>
>
>


-- 
*This e-mail is not encrypted.  Due to the unsecured nature of unencrypted
e-mail, there may be some level of risk that the information in this e-mail
could be read by a third party.  Accordingly, the recipient(s) named above
are hereby advised to not communicate protected health information using
this e-mail address.  If you desire to send protected health information
electronically, please contact MultiScale Health Networks at (206)538-6090*

Re: Using Kafka as a persistent store

Posted by Rad Gruchalski <ra...@gruchalski.com>.
Scott,  

This is what I was trying to target in one of my previous responses to Daniel, the one in which I suggest another compaction setting for Kafka.

Kind regards,

Radek Gruchalski

radek@gruchalski.com (mailto:radek@gruchalski.com)
de.linkedin.com/in/radgruchalski/ (http://de.linkedin.com/in/radgruchalski/)

Confidentiality:
This communication is intended for the above-named person and may be confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender immediately.



On Monday, 13 July 2015 at 15:41, Scott Thibault wrote:

> We've tried to use Kafka not as a persistent store, but as a long-term
> archival store. An outstanding issue we've had with that is that the
> broker holds on to an open file handle on every file in the log! The other
> issue we've had is when you create a long-term archival log on shared
> storage, you can't simply access that data from another cluster b/c of
> metadata being stored in zookeeper rather than in the log.
>  
> --Scott Thibault
>  
>  
> On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck <
> daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
>  
> > Would it be possible to document how to configure Kafka to never delete
> > messages in a topic? It took a good while to figure this out, and I see it
> > as an important use case for Kafka.
> >  
> > On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck <
> > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> >  
> > >  
> > > > On 10. jul. 2015, at 23.03, Jay Kreps <jay@confluent.io (mailto:jay@confluent.io)> wrote:
> > > >  
> > > > If I recall correctly, setting log.retention.ms (http://log.retention.ms) and
> > > > log.retention.bytes to -1 disables both.
> > >
> > > Thanks!
> > >
> > > > On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck <
> > > > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> > > >
> > > > > > On 10. jul. 2015, at 15.16, Shayne S <shaynest113@gmail.com (mailto:shaynest113@gmail.com)> wrote:
> > > > > >
> > > > > > There are two ways you can configure your topics, log compaction and with
> > > > > > no cleaning. The choice depends on your use case. Are the records uniquely
> > > > > > identifiable and will they receive updates? Then log compaction is the way
> > > > > > to go. If they are truly read only, you can go without log compaction.
> > > > >
> > > > > I'd rather be free to use the key for partitioning, and the records are
> > > > > immutable — they're event records — so disabling compaction altogether
> > > > > would be preferable. How is that accomplished?
> > > > > >
> > > > > > We have small processes which consume a topic and perform upserts to our
> > > > > > various database engines. It's easy to change how it all works and simply
> > > > > > consume the single source of truth again.
> > > > > >
> > > > > > I've written a bit about log compaction here:
> > > > > > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > > > > >
> > > > > > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> > > > > > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> > > > > >
> > > > > > > I'd like to use Kafka as a persistent store – sort of as an alternative to
> > > > > > > HDFS. The idea is that I'd load the data into various other systems in
> > > > > > > order to solve specific needs such as full-text search, analytics, indexing
> > > > > > > by various attributes, etc. I'd like to keep a single source of truth,
> > > > > > > however.
> > > > > > >
> > > > > > > I'm struggling a bit to understand how I can configure a topic to retain
> > > > > > > messages indefinitely. I want to make sure that my data isn't deleted. Is
> > > > > > > there a guide to configuring Kafka like this?
> > > > > >  
> > > > >  
> > > > >  
> > > >  
> > >  
> >  
> >  
>  
>  
>  
>  
>  
>  



Re: Using Kafka as a persistent store

Posted by Scott Thibault <sc...@multiscalehn.com>.
We've tried to use Kafka not as a persistent store, but as a long-term
archival store.  An outstanding issue we've had with that is that the
broker holds on to an open file handle on every file in the log!  The other
issue we've had is when you create a long-term archival log on shared
storage, you can't simply access that data from another cluster b/c of
metadata being stored in zookeeper rather than in the log.
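
A rough way to observe this on a broker host, assuming a single Kafka process
and that pgrep/lsof are available, is to count the broker's open .log handles:

    lsof -p $(pgrep -f kafka.Kafka) | grep -c '\.log$'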

--Scott Thibault


On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck <
daniel.schierbeck@gmail.com> wrote:

> Would it be possible to document how to configure Kafka to never delete
> messages in a topic? It took a good while to figure this out, and I see it
> as an important use case for Kafka.
>
> On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck <
> daniel.schierbeck@gmail.com> wrote:
>
> >
> > > On 10. jul. 2015, at 23.03, Jay Kreps <ja...@confluent.io> wrote:
> > >
> > > If I recall correctly, setting log.retention.ms and log.retention.bytes to
> > > -1 disables both.
> >
> > Thanks!
> >
> > >
> > > On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck <
> > > daniel.schierbeck@gmail.com> wrote:
> > >
> > >>
> > >>> On 10. jul. 2015, at 15.16, Shayne S <sh...@gmail.com> wrote:
> > >>>
> > >>> There are two ways you can configure your topics, log compaction and with
> > >>> no cleaning. The choice depends on your use case. Are the records uniquely
> > >>> identifiable and will they receive updates? Then log compaction is the way
> > >>> to go. If they are truly read only, you can go without log compaction.
> > >>
> > >> I'd rather be free to use the key for partitioning, and the records are
> > >> immutable — they're event records — so disabling compaction altogether
> > >> would be preferable. How is that accomplished?
> > >>>
> > >>> We have small processes which consume a topic and perform upserts to our
> > >>> various database engines. It's easy to change how it all works and simply
> > >>> consume the single source of truth again.
> > >>>
> > >>> I've written a bit about log compaction here:
> > >>> http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > >>>
> > >>> On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> > >>> daniel.schierbeck@gmail.com> wrote:
> > >>>
> > >>>> I'd like to use Kafka as a persistent store – sort of as an alternative to
> > >>>> HDFS. The idea is that I'd load the data into various other systems in
> > >>>> order to solve specific needs such as full-text search, analytics, indexing
> > >>>> by various attributes, etc. I'd like to keep a single source of truth,
> > >>>> however.
> > >>>>
> > >>>> I'm struggling a bit to understand how I can configure a topic to retain
> > >>>> messages indefinitely. I want to make sure that my data isn't deleted. Is
> > >>>> there a guide to configuring Kafka like this?
> > >>
> >
>



-- 
*This e-mail is not encrypted.  Due to the unsecured nature of unencrypted
e-mail, there may be some level of risk that the information in this e-mail
could be read by a third party.  Accordingly, the recipient(s) named above
are hereby advised to not communicate protected health information using
this e-mail address.  If you desire to send protected health information
electronically, please contact MultiScale Health Networks at (206)538-6090*

Re: Using Kafka as a persistent store

Posted by Daniel Schierbeck <da...@gmail.com>.
Would it be possible to document how to configure Kafka to never delete
messages in a topic? It took a good while to figure this out, and I see it
as an important use case for Kafka.
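
A minimal sketch of what that configuration could look like, assuming the
per-topic retention.ms/retention.bytes overrides discussed in this thread
(topic name is hypothetical, and -1 semantics depend on the Kafka version):

    kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic \
      --partitions 1 --replication-factor 1 \
      --config retention.ms=-1 --config retention.bytes=-1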

On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck <
daniel.schierbeck@gmail.com> wrote:

>
> > On 10. jul. 2015, at 23.03, Jay Kreps <ja...@confluent.io> wrote:
> >
> > > If I recall correctly, setting log.retention.ms and log.retention.bytes to
> > > -1 disables both.
>
> Thanks!
>
> >
> > On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck <
> > daniel.schierbeck@gmail.com> wrote:
> >
> >>
> >>> On 10. jul. 2015, at 15.16, Shayne S <sh...@gmail.com> wrote:
> >>>
> >>> There are two ways you can configure your topics, log compaction and with
> >>> no cleaning. The choice depends on your use case. Are the records uniquely
> >>> identifiable and will they receive updates? Then log compaction is the way
> >>> to go. If they are truly read only, you can go without log compaction.
> >>
> >> I'd rather be free to use the key for partitioning, and the records are
> >> immutable — they're event records — so disabling compaction altogether
> >> would be preferable. How is that accomplished?
> >>>
> >>> We have small processes which consume a topic and perform upserts to our
> >>> various database engines. It's easy to change how it all works and simply
> >>> consume the single source of truth again.
> >>>
> >>> I've written a bit about log compaction here:
> >>> http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> >>>
> >>> On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> >>> daniel.schierbeck@gmail.com> wrote:
> >>>
> >>>> I'd like to use Kafka as a persistent store – sort of as an alternative to
> >>>> HDFS. The idea is that I'd load the data into various other systems in
> >>>> order to solve specific needs such as full-text search, analytics, indexing
> >>>> by various attributes, etc. I'd like to keep a single source of truth,
> >>>> however.
> >>>>
> >>>> I'm struggling a bit to understand how I can configure a topic to retain
> >>>> messages indefinitely. I want to make sure that my data isn't deleted. Is
> >>>> there a guide to configuring Kafka like this?
> >>
>

Re: Using Kafka as a persistent store

Posted by Daniel Schierbeck <da...@gmail.com>.
> On 10. jul. 2015, at 23.03, Jay Kreps <ja...@confluent.io> wrote:
> 
> If I recall correctly, setting log.retention.ms and log.retention.bytes to
> -1 disables both.

Thanks! 

> 
> On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck <
> daniel.schierbeck@gmail.com> wrote:
> 
>> 
>>> On 10. jul. 2015, at 15.16, Shayne S <sh...@gmail.com> wrote:
>>> 
>>> There are two ways you can configure your topics, log compaction and with
>>> no cleaning. The choice depends on your use case. Are the records uniquely
>>> identifiable and will they receive updates? Then log compaction is the way
>>> to go. If they are truly read only, you can go without log compaction.
>>
>> I'd rather be free to use the key for partitioning, and the records are
>> immutable — they're event records — so disabling compaction altogether
>> would be preferable. How is that accomplished?
>>>
>>> We have small processes which consume a topic and perform upserts to our
>>> various database engines. It's easy to change how it all works and simply
>>> consume the single source of truth again.
>>>
>>> I've written a bit about log compaction here:
>>> http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
>>>
>>> On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
>>> daniel.schierbeck@gmail.com> wrote:
>>>
>>>> I'd like to use Kafka as a persistent store – sort of as an alternative to
>>>> HDFS. The idea is that I'd load the data into various other systems in
>>>> order to solve specific needs such as full-text search, analytics, indexing
>>>> by various attributes, etc. I'd like to keep a single source of truth,
>>>> however.
>>>>
>>>> I'm struggling a bit to understand how I can configure a topic to retain
>>>> messages indefinitely. I want to make sure that my data isn't deleted. Is
>>>> there a guide to configuring Kafka like this?
>> 

Re: Using Kafka as a persistent store

Posted by Jay Kreps <ja...@confluent.io>.
If I recall correctly, setting log.retention.ms and log.retention.bytes to
-1 disables both.
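
A minimal broker-level sketch of that (server.properties; whether -1 is
honored can depend on the broker version):

    # disable time- and size-based log retention for the whole broker
    log.retention.ms=-1
    log.retention.bytes=-1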

On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck <
daniel.schierbeck@gmail.com> wrote:

>
> > On 10. jul. 2015, at 15.16, Shayne S <sh...@gmail.com> wrote:
> >
> > There are two ways you can configure your topics, log compaction and with
> > no cleaning. The choice depends on your use case. Are the records uniquely
> > identifiable and will they receive updates? Then log compaction is the way
> > to go. If they are truly read only, you can go without log compaction.
>
> I'd rather be free to use the key for partitioning, and the records are
> immutable — they're event records — so disabling compaction altogether
> would be preferable. How is that accomplished?
> >
> > We have small processes which consume a topic and perform upserts to our
> > various database engines. It's easy to change how it all works and simply
> > consume the single source of truth again.
> >
> > I've written a bit about log compaction here:
> > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> >
> > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> > daniel.schierbeck@gmail.com> wrote:
> >
> >> I'd like to use Kafka as a persistent store – sort of as an alternative to
> >> HDFS. The idea is that I'd load the data into various other systems in
> >> order to solve specific needs such as full-text search, analytics, indexing
> >> by various attributes, etc. I'd like to keep a single source of truth,
> >> however.
> >>
> >> I'm struggling a bit to understand how I can configure a topic to retain
> >> messages indefinitely. I want to make sure that my data isn't deleted. Is
> >> there a guide to configuring Kafka like this?
> >>
>

Re: Using Kafka as a persistent store

Posted by Rad Gruchalski <ra...@gruchalski.com>.
Daniel,  

I understand your point. From what I understand the mode that suits you is what Jay suggested: log.retention.ms (http://log.retention.ms) and log.retention.bytes both set to -1.

A few questions before I continue on something that may already be possible:

1. Is it possible to attach additional storage without having to restart Kafka?
2. If the answer to 1 is yes: will Kafka continue the topic on new storage if all attached disks are full? Or is the assumption that one data_dir = one topic/partition (the code suggests so)?
3. If the answer to 1 is no: is it possible to take segments out without having to restart Kafka?
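
For reference, the storage being attached here is the broker's log.dirs
setting, roughly (a sketch; paths are hypothetical):

    # server.properties: multiple data directories; Kafka places each new
    # partition in one of them, but existing partitions do not move
    log.dirs=/data/kafka1,/data/kafka2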










Kind regards,

Radek Gruchalski

radek@gruchalski.com (mailto:radek@gruchalski.com)
de.linkedin.com/in/radgruchalski/ (http://de.linkedin.com/in/radgruchalski/)

Confidentiality:
This communication is intended for the above-named person and may be confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender immediately.



On Saturday, 11 July 2015 at 22:22, Daniel Schierbeck wrote:

> Radek: I don't see how data could be stored more efficiently than in Kafka
> itself. It's optimized for cheap storage and offers high-performance bulk
> export, exactly what you want from long-term archival.
> On fre. 10. jul. 2015 at 23.16 Rad Gruchalski <radek@gruchalski.com (mailto:radek@gruchalski.com)> wrote:
>  
> > Hello all,
> >  
> > This is a very interesting discussion. I’ve been thinking of a similar use
> > case for Kafka over the last few days.
> > The usual data workflow with Kafka is most likely something like this:
> >  
> > - ingest with Kafka
> > - process with Storm / Samza / whathaveyou
> > - put some processed data back on Kafka
> > - at the same time store the raw data somewhere in case everything
> > has to be reprocessed in the future (hdfs, similar?)
> >  
> > Currently Kafka offers a couple of types of topics: regular stream
> > (non-compacted topic) and a compacted topic (key/value). In case of a
> > stream topic, when the compaction kicks in, the “old” data is truncated. It
> > is lost from Kafka. What if there was an additional compaction setting:
> > cold-store.
> > Instead of trimming old data, Kafka would compile old data into a separate
> > log with its own index. The user would be free to decide what to do with
> > such files: put them on NFS / S3 / Swift / HDFS… Actually, the index file
> > is not needed. The only 3 things are:
> >  
> > - the folder name / partition index
> > - the log itself
> > - topic metadata at the time of taking the data out of the segment
> >  
> > With all this info, reading data back is fairly easy, even without
> > starting Kafka; a sample program goes like this (scala-ish):
> >  
> > val props = new Properties()
> > props.put("log.segment.bytes", "1073741824")
> > props.put("segment.index.bytes", "10485760") // should be 10MB
> >  
> > val log = new Log(
> > new File("/somestorage/kafka-test-0"),
> > cfg,
> > 0L,
> > null )
> >  
> > val fdi = log.activeSegment.read( log.logStartOffset,
> > Some(log.logEndOffset), 1000000 )
> > var msgs = 1
> > fdi.messageSet.iterator.foreach { msgoffset =>
> > println( s" ${msgoffset.message.hasKey} ::: > $msgs ::::>
> > ${msgoffset.offset} :::::: ${msgoffset.nextOffset}" )
> > msgs = msgs + 1
> > val key = new String( msgoffset.message.key.array(), "UTF-8")
> > val msg = new String( msgoffset.message.payload.array(), "UTF-8")
> > println( s" === ${key} " )
> > println( s" === ${msg} " )
> > }
> >  
> >  
> > This reads from the active segment (the last known segment) but it’s easy to
> > make it read from all segments. The interesting thing is - as long as the
> > backup files are well formed, they can be read without having to put them
> > in Kafka itself.
> >  
> > The advantage is: what was once the raw data (as it came in), is the raw
> > data forever, without having to introduce another format for storing this.
> > Another advantage is: in case of reprocessing, no need to write a producer
> > to ingest the data back (it’s possible but not necessary).
> > Such raw Kafka files can be easily processed by Storm / Samza (would need
> > another stream definition) / Hadoop.
> >  
> > This sounds like a very useful addition to Kafka. But I could be
> > overthinking this...
> >  
> >  
> >  
> >  
> >  
> > On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:
> >  
> > >  
> > > > On 10. jul. 2015, at 15.16, Shayne S <shaynest113@gmail.com (mailto:shaynest113@gmail.com)> wrote:
> > > >
> > > > There are two ways you can configure your topics, log compaction and with
> > > > no cleaning. The choice depends on your use case. Are the records uniquely
> > > > identifiable and will they receive updates? Then log compaction is the way
> > > > to go. If they are truly read only, you can go without log compaction.
> > >
> > > I'd rather be free to use the key for partitioning, and the records are
> > > immutable — they're event records — so disabling compaction altogether
> > > would be preferable. How is that accomplished?
> > > >
> > > > We have small processes which consume a topic and perform upserts to our
> > > > various database engines. It's easy to change how it all works and simply
> > > > consume the single source of truth again.
> > > >
> > > > I've written a bit about log compaction here:
> > > > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > > >
> > > > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> > > > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> > > >
> > > > > I'd like to use Kafka as a persistent store – sort of as an alternative to
> > > > > HDFS. The idea is that I'd load the data into various other systems in
> > > > > order to solve specific needs such as full-text search, analytics, indexing
> > > > > by various attributes, etc. I'd like to keep a single source of truth,
> > > > > however.
> > > > >
> > > > > I'm struggling a bit to understand how I can configure a topic to retain
> > > > > messages indefinitely. I want to make sure that my data isn't deleted. Is
> > > > > there a guide to configuring Kafka like this?
> > > >  
> > >  
> >  
> >  
>  
>  
>  



Re: Using Kafka as a persistent store

Posted by Daniel Schierbeck <da...@gmail.com>.
Radek: I don't see how data could be stored more efficiently than in Kafka
itself. It's optimized for cheap storage and offers high-performance bulk
export, exactly what you want from long-term archival.
On fre. 10. jul. 2015 at 23.16 Rad Gruchalski <ra...@gruchalski.com> wrote:

> Hello all,
>
> This is a very interesting discussion. I’ve been thinking of a similar use
> case for Kafka over the last few days.
> The usual data workflow with Kafka is most likely something like this:
>
> - ingest with Kafka
> - process with Storm / Samza / whathaveyou
>   - put some processed data back on Kafka
>   - at the same time store the raw data somewhere in case everything
> has to be reprocessed in the future (hdfs, similar?)
>
> Currently Kafka offers a couple of types of topics: regular stream
> (non-compacted topic) and a compacted topic (key/value). In case of a
> stream topic, when the compaction kicks in, the “old” data is truncated. It
> is lost from Kafka. What if there was an additional compaction setting:
> cold-store.
> Instead of trimming old data, Kafka would compile old data into a separate
> log with its own index. The user would be free to decide what to do with
> such files: put them on NFS / S3 / Swift / HDFS… Actually, the index file
> is not needed. The only 3 things are:
>
>  - the folder name / partition index
>  - the log itself
>  - topic metadata at the time of taking the data out of the segment
>
> With all this info, reading data back is fairly easy, even without
> starting Kafka; a sample program goes like this (scala-ish):
>
>     val props = new Properties()
>     props.put("log.segment.bytes", "1073741824")
>     props.put("segment.index.bytes", "10485760") // should be 10MB
>
>     val log = new Log(
>       new File("/somestorage/kafka-test-0"),
>       cfg,
>       0L,
>       null )
>
>     val fdi = log.activeSegment.read( log.logStartOffset,
> Some(log.logEndOffset), 1000000 )
>     var msgs = 1
>     fdi.messageSet.iterator.foreach { msgoffset =>
>       println( s" ${msgoffset.message.hasKey} ::: > $msgs ::::>
> ${msgoffset.offset} :::::: ${msgoffset.nextOffset}" )
>       msgs = msgs + 1
>       val key = new String( msgoffset.message.key.array(), "UTF-8")
>       val msg = new String( msgoffset.message.payload.array(), "UTF-8")
>       println( s" === ${key} " )
>       println( s" === ${msg} " )
>     }
>
>
> This reads from the active segment (the last known segment) but it’s easy to
> make it read from all segments. The interesting thing is - as long as the
> backup files are well formed, they can be read without having to put them
> in Kafka itself.
>
> The advantage is: what was once the raw data (as it came in), is the raw
> data forever, without having to introduce another format for storing this.
> Another advantage is: in case of reprocessing, no need to write a producer
> to ingest the data back (it’s possible but not necessary).
> Such raw Kafka files can be easily processed by Storm / Samza (would need
> another stream definition) / Hadoop.
>
> This sounds like a very useful addition to Kafka. But I could be
> overthinking this...
>
>
>
>
> On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:
>
> >
> > > On 10. jul. 2015, at 15.16, Shayne S <shaynest113@gmail.com (mailto:shaynest113@gmail.com)> wrote:
> > >
> > > There are two ways you can configure your topics, log compaction and with
> > > no cleaning. The choice depends on your use case. Are the records uniquely
> > > identifiable and will they receive updates? Then log compaction is the way
> > > to go. If they are truly read only, you can go without log compaction.
> >
> > I'd rather be free to use the key for partitioning, and the records are
> > immutable — they're event records — so disabling compaction altogether
> > would be preferable. How is that accomplished?
> > >
> > > We have small processes which consume a topic and perform upserts to our
> > > various database engines. It's easy to change how it all works and simply
> > > consume the single source of truth again.
> > >
> > > I've written a bit about log compaction here:
> > > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > >
> > > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> > > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> > >
> > > > I'd like to use Kafka as a persistent store – sort of as an alternative to
> > > > HDFS. The idea is that I'd load the data into various other systems in
> > > > order to solve specific needs such as full-text search, analytics, indexing
> > > > by various attributes, etc. I'd like to keep a single source of truth,
> > > > however.
> > > >
> > > > I'm struggling a bit to understand how I can configure a topic to retain
> > > > messages indefinitely. I want to make sure that my data isn't deleted. Is
> > > > there a guide to configuring Kafka like this?
> > > >
> > >
> > >
> >
> >
> >
>
>
>

Re: Using Kafka as a persistent store

Posted by Rad Gruchalski <ra...@gruchalski.com>.
Hello all,

This is a very interesting discussion. I’ve been thinking of a similar use case for Kafka over the last few days.  
The usual data workflow with Kafka is most likely something like this:

- ingest with Kafka
- process with Storm / Samza / whathaveyou
  - put some processed data back on Kafka
  - at the same time store the raw data somewhere in case everything has to be reprocessed in the future (hdfs, similar?)

Currently Kafka offers a couple of types of topics: regular stream (non-compacted topic) and a compacted topic (key/value). In case of a stream topic, when the compaction kicks in, the “old” data is truncated. It is lost from Kafka. What if there was an additional compaction setting: cold-store.
Instead of trimming old data, Kafka would compile old data into a separate log with its own index. The user would be free to decide what to do with such files: put them on NFS / S3 / Swift / HDFS… Actually, the index file is not needed. The only 3 things are:

 - the folder name / partition index
 - the log itself
 - topic metadata at the time of taking the data out of the segment
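
For illustration, such an exported partition could look like this on the
external storage, assuming Kafka's usual on-disk layout of one directory per
topic-partition with segment files named by their base offset (the metadata
file is hypothetical):

    /somestorage/kafka-test-0/       # <topic>-<partition>
      00000000000000000000.log      # segment file, named by its base offset
      00000000000003456789.log
      metadata                      # hypothetical: topic metadata at export time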

With all this info, reading data back is fairly easy, even without starting Kafka; a sample program goes like this (scala-ish):

    val props = new Properties()
    props.put("log.segment.bytes", "1073741824")
    props.put("segment.index.bytes", "10485760") // should be 10MB

    // cfg is assumed to be the log's configuration (kafka.log.LogConfig)
    // built from the properties above; its construction varies by version
    val log = new Log(
      new File("/somestorage/kafka-test-0"),
      cfg,
      0L,
      null )

    val fdi = log.activeSegment.read( log.logStartOffset, Some(log.logEndOffset), 1000000 )
    var msgs = 1
    fdi.messageSet.iterator.foreach { msgoffset =>
      println( s" ${msgoffset.message.hasKey} ::: > $msgs ::::> ${msgoffset.offset} :::::: ${msgoffset.nextOffset}" )
      msgs = msgs + 1
      val key = new String( msgoffset.message.key.array(), "UTF-8")
      val msg = new String( msgoffset.message.payload.array(), "UTF-8")
      println( s" === ${key} " )
      println( s" === ${msg} " )
    }


This reads from the active segment (the last known segment) but it’s easy to make it read from all segments. The interesting thing is - as long as the backup files are well formed, they can be read without having to put them in Kafka itself.
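
A rough sketch of reading all segments instead, assuming Log exposes its
segments as logSegments and that LogSegment.read has the same shape as the
activeSegment.read call above (names vary between Kafka versions):

    // iterate every segment of the partition, oldest first
    log.logSegments.foreach { segment =>
      val fdi = segment.read( segment.baseOffset, Some(log.logEndOffset), 1000000 )
      fdi.messageSet.iterator.foreach { msgoffset =>
        val msg = new String( msgoffset.message.payload.array(), "UTF-8")
        println( s" ${msgoffset.offset} :::::: ${msg}" )
      }
    }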

The advantage is: what was once the raw data (as it came in) is the raw data forever, without having to introduce another format for storing this. Another advantage is: in case of reprocessing, no need to write a producer to ingest the data back (it’s possible but not necessary). Such raw Kafka files can be easily processed by Storm / Samza (would need another stream definition) / Hadoop.

This sounds like a very useful addition to Kafka. But I could be overthinking this...

Kind regards,

Radek Gruchalski

radek@gruchalski.com (mailto:radek@gruchalski.com)
de.linkedin.com/in/radgruchalski/ (http://de.linkedin.com/in/radgruchalski/)

Confidentiality:
This communication is intended for the above-named person and may be confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender immediately.



On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:

>  
> > On 10. jul. 2015, at 15.16, Shayne S <shaynest113@gmail.com (mailto:shaynest113@gmail.com)> wrote:
> >  
> > There are two ways you can configure your topics, log compaction and with
> > no cleaning. The choice depends on your use case. Are the records uniquely
> > identifiable and will they receive updates? Then log compaction is the way
> > to go. If they are truly read only, you can go without log compaction.
> >  
>  
>  
> I'd rather be free to use the key for partitioning, and the records are immutable — they're event records — so disabling compaction altogether would be preferable. How is that accomplished?
> >  
> > We have small processes which consume a topic and perform upserts to our
> > various database engines. It's easy to change how it all works and simply
> > consume the single source of truth again.
> >  
> > I've written a bit about log compaction here:
> > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> >  
> > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> > daniel.schierbeck@gmail.com (mailto:daniel.schierbeck@gmail.com)> wrote:
> >  
> > > I'd like to use Kafka as a persistent store – sort of as an alternative to
> > > HDFS. The idea is that I'd load the data into various other systems in
> > > order to solve specific needs such as full-text search, analytics, indexing
> > > by various attributes, etc. I'd like to keep a single source of truth,
> > > however.
> > >  
> > > I'm struggling a bit to understand how I can configure a topic to retain
> > > messages indefinitely. I want to make sure that my data isn't deleted. Is
> > > there a guide to configuring Kafka like this?
> > >  
> >  
> >  
>  
>  
>  



Re: Using Kafka as a persistent store

Posted by Daniel Schierbeck <da...@gmail.com>.
> On 10. jul. 2015, at 15.16, Shayne S <sh...@gmail.com> wrote:
> 
> There are two ways you can configure your topics, log compaction and with
> no cleaning. The choice depends on your use case. Are the records uniquely
> identifiable and will they receive updates? Then log compaction is the way
> to go. If they are truly read only, you can go without log compaction.

I'd rather be free to use the key for partitioning, and the records are immutable — they're event records — so disabling compaction altogether would be preferable. How is that accomplished?
> 
> We have small processes which consume a topic and perform upserts to our
> various database engines. It's easy to change how it all works and simply
> consume the single source of truth again.
> 
> I've written a bit about log compaction here:
> http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> 
> On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> daniel.schierbeck@gmail.com> wrote:
> 
>> I'd like to use Kafka as a persistent store – sort of as an alternative to
>> HDFS. The idea is that I'd load the data into various other systems in
>> order to solve specific needs such as full-text search, analytics, indexing
>> by various attributes, etc. I'd like to keep a single source of truth,
>> however.
>> 
>> I'm struggling a bit to understand how I can configure a topic to retain
>> messages indefinitely. I want to make sure that my data isn't deleted. Is
>> there a guide to configuring Kafka like this?
>> 

Re: Using Kafka as a persistent store

Posted by Shayne S <sh...@gmail.com>.
There are two ways you can configure your topics, log compaction and with
no cleaning. The choice depends on your use case. Are the records uniquely
identifiable and will they receive updates? Then log compaction is the way
to go. If they are truly read only, you can go without log compaction.
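
In per-topic configuration terms, the two options look roughly like this (a
sketch; exact key names and -1 semantics depend on your Kafka version):

    # uniquely identifiable records that receive updates: log compaction
    cleanup.policy=compact

    # truly read-only records: no compaction, no deletion
    cleanup.policy=delete
    retention.ms=-1
    retention.bytes=-1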

We have small processes which consume a topic and perform upserts to our
various database engines. It's easy to change how it all works and simply
consume the single source of truth again.
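
A rough sketch of such a process with the high-level consumer (the topic
name, group id, and the upsert() call are hypothetical):

    import java.util.Properties
    import kafka.consumer.{Consumer, ConsumerConfig}

    val props = new Properties()
    props.put("zookeeper.connect", "localhost:2181")
    props.put("group.id", "db-upserter")
    props.put("auto.offset.reset", "smallest") // re-read the topic from the start

    val connector = Consumer.create(new ConsumerConfig(props))
    val stream = connector.createMessageStreams(Map("events" -> 1))("events").head

    // upsert each record into the downstream store, keyed by the message key
    for (m <- stream)
      upsert(m.key(), m.message()) // upsert() is a hypothetical helper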

I've written a bit about log compaction here:
http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/

On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
daniel.schierbeck@gmail.com> wrote:

> I'd like to use Kafka as a persistent store – sort of as an alternative to
> HDFS. The idea is that I'd load the data into various other systems in
> order to solve specific needs such as full-text search, analytics, indexing
> by various attributes, etc. I'd like to keep a single source of truth,
> however.
>
> I'm struggling a bit to understand how I can configure a topic to retain
> messages indefinitely. I want to make sure that my data isn't deleted. Is
> there a guide to configuring Kafka like this?
>