Posted to users@kafka.apache.org by Anthony Grimes <i...@raynes.me> on 2013/02/22 01:00:18 UTC

Keeping logs forever

Our use case is that we'd like to log away data we don't currently need and 
potentially replay it at some point. We don't want to delete old logs. I 
googled around a bit and I only discovered this particular post: 
http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=FtVVM6SQ@mail.gmail.com%3E

In summary, it appears the primary issue is that Kafka keeps file 
handles of each log segment open. Is there a way to configure this, or 
is a way to do so planned? It appears that an option to deduplicate 
instead of delete was added recently, so doesn't the file handle issue 
exist with that as well (since files aren't being deleted)?
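
For concreteness, a minimal sketch of the broker settings this question touches on; the property names follow 0.8-era server.properties and are assumptions for illustration, not something taken from this thread:

    # keep segments around indefinitely: disable size-based deletion and push
    # the time-based limit out as far as the setting allows (newer versions
    # accept log.retention.ms=-1 to mean "never delete")
    log.retention.bytes=-1
    log.retention.hours=2147483647

    # the "deduplicate instead of delete" option mentioned above is log
    # compaction, selected with a cleanup policy (the exact value name has
    # varied across versions)
    log.cleanup.policy=compact

    # segment size determines how many files, and so how many open handles,
    # each partition eventually needs
    log.segment.bytes=1073741824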

Re: Keeping logs forever

Posted by Milind Parikh <mi...@gmail.com>.
Forever is a long time. The definition of "replay", and how you navigate across
different versions of Kafka, would be key.

Example:
If you are storing market data in Kafka and have a CEP engine running on
top of it, and you would like replayed "transactions" to be fed back to verify
replayability, then you would probably want to manage that through the same
mechanism as existed at that time in the past. This might mean a different
Kafka broker (perhaps 0.7) with a different set of consumers and a
potentially different JVM. This, of course, gets into a rat hole.

Regards
Milind




On Thu, Feb 21, 2013 at 4:29 PM, Eric Tschetter <ch...@metamarkets.com> wrote:

> Anthony,
>
> Is there a reason you wouldn't want to just push the data into something
> built for cheap, long-term storage (like glacier, S3, or HDFS) and perhaps
> "replay" from that instead of from the kafka brokers?  I can't speak for
> Jay, Jun or Neha, but I believe the expected usage of Kafka is essentially
> as a buffering mechanism to take the edge off the natural ebb-n-flow of
> unpredictable internet traffic.  The highly available, long-term storage of
> data is probably not at the top of their list of use cases when making
> design decisions.
>
> --Eric
>
>
> On Thu, Feb 21, 2013 at 6:00 PM, Anthony Grimes <i...@raynes.me> wrote:
>
> > Our use case is that we'd like to log data we don't need away and
> > potentially replay it at some point. We don't want to delete old logs. I
> > googled around a bit and I only discovered this particular post:
> > http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=FtVVM6SQ@mail.gmail.com%3E
> >
> > In summary, it appears the primary issue is that Kafka keeps file handles
> > of each log segment open. Is there a way to configure this, or is a way
> to
> > do so planned? It appears that an option to deduplicate instead of delete
> > was added recently, so doesn't the file handle issue exist with that as
> > well (since files aren't being deleted)?
> >
>

Re: Keeping logs forever

Posted by Eric Tschetter <ch...@metamarkets.com>.
Anthony,

Is there a reason you wouldn't want to just push the data into something
built for cheap, long-term storage (like glacier, S3, or HDFS) and perhaps
"replay" from that instead of from the kafka brokers?  I can't speak for
Jay, Jun or Neha, but I believe the expected usage of Kafka is essentially
as a buffering mechanism to take the edge off the natural ebb-n-flow of
unpredictable internet traffic.  The highly available, long-term storage of
data is probably not at the top of their list of use cases when making
design decisions.

--Eric
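
For what it's worth, a crude sketch of that archive-then-replay pattern; the topic name, paths, hosts, and the replay command are made up for illustration, and a real pipeline would use a proper Hadoop/S3 consumer rather than the console tool:

    # drain the topic from the earliest retained offset into a local file
    bin/kafka-console-consumer.sh --zookeeper zk1:2181 \
        --topic events --from-beginning > /tmp/events-$(date +%Y%m%d).txt

    # park the file in cheap long-term storage, e.g. HDFS
    hadoop fs -put /tmp/events-$(date +%Y%m%d).txt /archive/kafka/events/

    # "replaying" then just means re-reading the archived files downstream
    hadoop fs -cat /archive/kafka/events/* | my-replay-consumer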


On Thu, Feb 21, 2013 at 6:00 PM, Anthony Grimes <i...@raynes.me> wrote:

> Our use case is that we'd like to log data we don't need away and
> potentially replay it at some point. We don't want to delete old logs. I
> googled around a bit and I only discovered this particular post:
> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=FtVVM6SQ@mail.gmail.com%3E
>
> In summary, it appears the primary issue is that Kafka keeps file handles
> of each log segment open. Is there a way to configure this, or is a way to
> do so planned? It appears that an option to deduplicate instead of delete
> was added recently, so doesn't the file handle issue exist with that as
> well (since files aren't being deleted)?
>

Re: Keeping logs forever

Posted by Eric Tschetter <ec...@gmail.com>.
> Apologies for asking another question as a newbie without having really
> tried stuff out, but actually one of our main reasons for wanting to use
> kafka (not the linkedin use case) is exactly the fact that the "buffer" is
> not just for buffering. We want to keep data for days to weeks, and be able
> to add ad-hoc consumers after the fact (obviously we could do that based on
> downstream systems in HDFS), however lets say we have N machines gathering
> approximate runtime statistics to use real time in live web applications;
> it is easy for them to listen to the stream destined for HDFS and keep such
> stats. If we have to add a new machine, or one dies etc. it totally makes
> sense to use the same code and just have it replay the last H hours of
> events to get back up to speed.

>
> So I'm curious if as this thread suggests that there are problems with
> keeping days to weeks of data around them and accessing them.
>
>
Sorry if my comments caused this type of concern. Keeping days to weeks of
data around is normal in Kafka (it defaults to keeping 7 days' worth of data
around, but that's configurable), and replaying from that is definitely
within the realm of what it does well. My comments were more about the
"forever" part; as Jay says, it should be possible, you just have to keep
adding more disks and machines to store all the data.

I believe the replication in 0.8 will also allow data to be migrated if you
lose nodes, so maybe my concerns were unfounded.

--Eric



> Note also we are considering using kafka for (continuous/on demand) high
> performance instrumentation at which point we may not actually have any
> consumers until we need them (we would want a back-window to produce debug
> logs from the event stream after the fact, or replay events into other
> systems), but equally the real time feed may be used for alerting and
> graphite. Also we might eventually allow ad-hoc queries against data in the
> event stream, which may require us to turn event generation on/off in the
> producers, but nonetheless we would efficiently filter the kafka event
> stream based on arbitrary data - something that can't be done with topic
> today (even the suggested hierarchical topics) - if we do it right, we can
> use a schema/producer registry to figure out a small subset of topics that
> might contain the data we need, then use the schema registry to pick the
> AVRO schema used to efficiently filter that subset of topics based on any
> arbitrary set of attributes in the data.
>
> If the latter sounds useful to anyone then we'll of course contribute back
> - I'm also curious on the current state of camel etc, since we were already
> considering building something similar, but it seems like it isn't
> currently (as of recent open source) zookeeper based which seems odd, but
> also we are certainly considering allowing for mixing in more dynamic
> registration where value isn't just schema, but schema + other contextual
> information common to all events from a producer (e.g. source machine,
> application, app version etc).
>
>
> On Feb 21, 2013, at 7:26 PM, Jay Kreps <ja...@gmail.com> wrote:
>
> > You can do this and it should work fine. You would have to keep adding
> > machines to get disk capacity, of course, since your data set would
> > only grow.
> >
> > We will keep an open file descriptor per file, but I think that is
> > okay. Just set the segment size to 1GB, then with 10TB of storage that
> > is only 10k files which should be fine. Adjust the OS open FD limit up
> > a bit if needed. File descriptors don't use too much memory so this
> > should not hurt anything.
> >
> > -Jay
> >
> > On Thu, Feb 21, 2013 at 4:00 PM, Anthony Grimes <i...@raynes.me> wrote:
> >> Our use case is that we'd like to log data we don't need away and
> >> potentially replay it at some point. We don't want to delete old logs. I
> >> googled around a bit and I only discovered this particular post:
> >>
> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=FtVVM6SQ@mail.gmail.com%3E
> >>
> >> In summary, it appears the primary issue is that Kafka keeps file
> handles of
> >> each log segment open. Is there a way to configure this, or is a way to
> do
> >> so planned? It appears that an option to deduplicate instead of delete
> was
> >> added recently, so doesn't the file handle issue exist with that as well
> >> (since files aren't being deleted)?
>
>

Re: Keeping logs forever

Posted by Jay Kreps <ja...@gmail.com>.
Hi Graham,

This sounds like it should work fine. LinkedIn keeps the majority of
things for 7 days. Performance is linear in data size and we have
validated performance up to many TB of data per machine.

The registry you describe sounds like it could potentially be useful.
You would probably have to describe it in more detail for others to
understand all the use cases.

Cheers,

-Jay

On Thu, Feb 21, 2013 at 6:47 PM, graham sanderson <gr...@vast.com> wrote:
> Apologies for asking another question as a newbie without having really tried stuff out, but actually one of our main reasons for wanting to use kafka (not the linkedin use case) is exactly the fact that the "buffer" is not just for buffering. We want to keep data for days to weeks, and be able to add ad-hoc consumers after the fact (obviously we could do that based on downstream systems in HDFS), however lets say we have N machines gathering approximate runtime statistics to use real time in live web applications; it is easy for them to listen to the stream destined for HDFS and keep such stats. If we have to add a new machine, or one dies etc. it totally makes sense to use the same code and just have it replay the last H hours of events to get back up to speed.
>
> So I'm curious if as this thread suggests that there are problems with keeping days to weeks of data around them and accessing them.
>
> Note also we are considering using kafka for (continuous/on demand) high performance instrumentation at which point we may not actually have any consumers until we need them (we would want a back-window to produce debug logs from the event stream after the fact, or replay events into other systems), but equally the real time feed may be used for alerting and graphite. Also we might eventually allow ad-hoc queries against data in the event stream, which may require us to turn event generation on/off in the producers, but nonetheless we would efficiently filter the kafka event stream based on arbitrary data - something that can't be done with topic today (even the suggested hierarchical topics) - if we do it right, we can use a schema/producer registry to figure out a small subset of topics that might contain the data we need, then use the schema registry to pick the AVRO schema used to efficiently filter that subset of topics based on any arbitrary set of attributes in the data.
>
> If the latter sounds useful to anyone then we'll of course contribute back - I'm also curious on the current state of camel etc, since we were already considering building something similar, but it seems like it isn't currently (as of recent open source) zookeeper based which seems odd, but also we are certainly considering allowing for mixing in more dynamic registration where value isn't just schema, but schema + other contextual information common to all events from a producer (e.g. source machine, application, app version etc).
>
>
> On Feb 21, 2013, at 7:26 PM, Jay Kreps <ja...@gmail.com> wrote:
>
>> You can do this and it should work fine. You would have to keep adding
>> machines to get disk capacity, of course, since your data set would
>> only grow.
>>
>> We will keep an open file descriptor per file, but I think that is
>> okay. Just set the segment size to 1GB, then with 10TB of storage that
>> is only 10k files which should be fine. Adjust the OS open FD limit up
>> a bit if needed. File descriptors don't use too much memory so this
>> should not hurt anything.
>>
>> -Jay
>>
>> On Thu, Feb 21, 2013 at 4:00 PM, Anthony Grimes <i...@raynes.me> wrote:
>>> Our use case is that we'd like to log data we don't need away and
>>> potentially replay it at some point. We don't want to delete old logs. I
>>> googled around a bit and I only discovered this particular post:
>>> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=FtVVM6SQ@mail.gmail.com%3E
>>>
>>> In summary, it appears the primary issue is that Kafka keeps file handles of
>>> each log segment open. Is there a way to configure this, or is a way to do
>>> so planned? It appears that an option to deduplicate instead of delete was
>>> added recently, so doesn't the file handle issue exist with that as well
>>> (since files aren't being deleted)?
>

Re: Keeping logs forever

Posted by graham sanderson <gr...@vast.com>.
Apologies for asking another question as a newbie without having really tried stuff out, but one of our main reasons for wanting to use Kafka (not the LinkedIn use case) is exactly the fact that the "buffer" is not just for buffering. We want to keep data for days to weeks, and be able to add ad-hoc consumers after the fact (obviously we could do that from downstream systems in HDFS). However, let's say we have N machines gathering approximate runtime statistics to use in real time in live web applications; it is easy for them to listen to the stream destined for HDFS and keep such stats. If we have to add a new machine, or one dies, etc., it totally makes sense to use the same code and just have it replay the last H hours of events to get back up to speed.

So I'm curious whether, as this thread suggests, there are problems with keeping days to weeks of data around and accessing them.

Note also that we are considering using Kafka for (continuous/on-demand) high-performance instrumentation, at which point we may not actually have any consumers until we need them (we would want a back-window to produce debug logs from the event stream after the fact, or to replay events into other systems), but equally the real-time feed may be used for alerting and graphite. We might also eventually allow ad-hoc queries against data in the event stream, which may require us to turn event generation on/off in the producers. Nonetheless, we would want to efficiently filter the Kafka event stream based on arbitrary data - something that can't be done with topics today (even with the suggested hierarchical topics). If we do it right, we can use a schema/producer registry to figure out a small subset of topics that might contain the data we need, then use the schema registry to pick the AVRO schema used to efficiently filter that subset of topics based on any arbitrary set of attributes in the data.

If the latter sounds useful to anyone then we'll of course contribute back. I'm also curious about the current state of camel etc., since we were already considering building something similar, but it seems like it isn't currently (as of recent open source) zookeeper based, which seems odd. We are also certainly considering allowing for mixing in more dynamic registration where the value isn't just a schema, but schema + other contextual information common to all events from a producer (e.g. source machine, application, app version, etc.).
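
As a rough illustration of that last idea, a sketch in Java of an attribute filter driven by a schema lookup; the "registry" here is just a map from topic name to Avro schema, the class and method names are hypothetical, and record schemas are assumed:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    // Sketch only: filters raw event payloads by an arbitrary attribute, using a
    // per-topic Avro record schema looked up from a schema/producer registry.
    public class AttributeFilter {
        private final Map<String, Schema> schemaByTopic;  // stand-in for the registry

        public AttributeFilter(Map<String, Schema> schemaByTopic) {
            this.schemaByTopic = schemaByTopic;
        }

        // True if the topic's schema declares the attribute and this event carries
        // the expected value.
        public boolean matches(String topic, byte[] payload,
                               String attribute, String expected) throws IOException {
            Schema schema = schemaByTopic.get(topic);
            if (schema == null || schema.getField(attribute) == null) {
                return false;  // this topic cannot contain the attribute at all
            }
            GenericDatumReader<GenericRecord> reader =
                    new GenericDatumReader<GenericRecord>(schema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
            GenericRecord record = reader.read(null, decoder);
            Object actual = record.get(attribute);
            // compare as strings to sidestep Avro's Utf8-vs-String distinction
            return actual != null && actual.toString().equals(expected);
        }
    }

Whether a per-event decode like this is cheap enough depends on volume, but it shows that a schema lookup plus generic decoding is all the machinery the filter itself needs.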


On Feb 21, 2013, at 7:26 PM, Jay Kreps <ja...@gmail.com> wrote:

> You can do this and it should work fine. You would have to keep adding
> machines to get disk capacity, of course, since your data set would
> only grow.
> 
> We will keep an open file descriptor per file, but I think that is
> okay. Just set the segment size to 1GB, then with 10TB of storage that
> is only 10k files which should be fine. Adjust the OS open FD limit up
> a bit if needed. File descriptors don't use too much memory so this
> should not hurt anything.
> 
> -Jay
> 
> On Thu, Feb 21, 2013 at 4:00 PM, Anthony Grimes <i...@raynes.me> wrote:
>> Our use case is that we'd like to log data we don't need away and
>> potentially replay it at some point. We don't want to delete old logs. I
>> googled around a bit and I only discovered this particular post:
>> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=FtVVM6SQ@mail.gmail.com%3E
>> 
>> In summary, it appears the primary issue is that Kafka keeps file handles of
>> each log segment open. Is there a way to configure this, or is a way to do
>> so planned? It appears that an option to deduplicate instead of delete was
>> added recently, so doesn't the file handle issue exist with that as well
>> (since files aren't being deleted)?


Re: Re: Keeping logs forever

Posted by Anthony Grimes <i...@raynes.me>.
Sounds good. Thanks for the input, kind sir!

Jay Kreps wrote:
> You can do this and it should work fine. You would have to keep adding
> machines to get disk capacity, of course, since your data set would
> only grow.
>
> We will keep an open file descriptor per file, but I think that is
> okay. Just set the segment size to 1GB, then with 10TB of storage that
> is only 10k files which should be fine. Adjust the OS open FD limit up
> a bit if needed. File descriptors don't use too much memory so this
> should not hurt anything.
>
> -Jay
>
> On Thu, Feb 21, 2013 at 4:00 PM, Anthony Grimes<i...@raynes.me>  wrote:
>> Our use case is that we'd like to log data we don't need away and
>> potentially replay it at some point. We don't want to delete old logs. I
>> googled around a bit and I only discovered this particular post:
>> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=FtVVM6SQ@mail.gmail.com%3E
>>
>> In summary, it appears the primary issue is that Kafka keeps file handles of
>> each log segment open. Is there a way to configure this, or is a way to do
>> so planned? It appears that an option to deduplicate instead of delete was
>> added recently, so doesn't the file handle issue exist with that as well
>> (since files aren't being deleted)?

Re: Keeping logs forever

Posted by Jay Kreps <ja...@gmail.com>.
You can do this and it should work fine. You would have to keep adding
machines to get disk capacity, of course, since your data set would
only grow.

We will keep an open file descriptor per file, but I think that is
okay. Just set the segment size to 1GB, then with 10TB of storage that
is only 10k files which should be fine. Adjust the OS open FD limit up
a bit if needed. File descriptors don't use too much memory so this
should not hurt anything.

-Jay
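
To make the arithmetic and the knobs concrete, a rough sketch; the property name is the 0.8-era one, the limits shown are typical Linux settings, and the "kafka" service account is an assumption:

    # server.properties: 1 GB segments, as suggested above
    log.segment.bytes=1073741824

    # back-of-the-envelope: 10 TB retained / 1 GB per segment ~= 10,000 segment
    # files per broker (0.8 also keeps an index file per segment, so budget
    # roughly double that in open descriptors)

    # check the broker user's current open-file limit
    ulimit -n

    # raise it, e.g. in /etc/security/limits.conf, then restart the broker
    kafka  soft  nofile  100000
    kafka  hard  nofile  100000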

On Thu, Feb 21, 2013 at 4:00 PM, Anthony Grimes <i...@raynes.me> wrote:
> Our use case is that we'd like to log data we don't need away and
> potentially replay it at some point. We don't want to delete old logs. I
> googled around a bit and I only discovered this particular post:
> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=FtVVM6SQ@mail.gmail.com%3E
>
> In summary, it appears the primary issue is that Kafka keeps file handles of
> each log segment open. Is there a way to configure this, or is a way to do
> so planned? It appears that an option to deduplicate instead of delete was
> added recently, so doesn't the file handle issue exist with that as well
> (since files aren't being deleted)?