Posted to users@kafka.apache.org by "Prunier, Dominique" <do...@emc.com> on 2014/08/14 23:56:11 UTC

Consumer sensitive expiration of topic

Hi,

I'm playing around with Kafka with the idea of implementing a general-purpose message exchanger for a distributed application with high throughput requirements (several hundred thousand messages per second).

In this context, I would like to use a topic as a kind of private mailbox for a single consumer group. Once that consumer group has committed its offset on its private topic, the messages there won't be used anymore and can safely be discarded. Therefore, I was wondering whether you see a way (in the current release or in the future) to have a topic whose expiration policy is based on consumer offsets.

Thanks,

--
Dominique Prunier


RE: Consumer sensitive expiration of topic

Posted by "Prunier, Dominique" <do...@emc.com>.
True, but that is not different in essence from deleting a topic altogether. Basically, this is an administrative operation, not strictly speaking a consumer operation. I'm not expecting this feature to be implemented in the consumers shipped with Kafka, but I could use it in my own code because I know what my consumers are doing.

-----Original Message-----
From: Gwen Shapira [mailto:gshapira@cloudera.com] 
Sent: Thursday, August 28, 2014 3:24 PM
To: users@kafka.apache.org
Subject: Re: Consumer sensitive expiration of topic

Maybe I misunderstand the proposal, but it sounds like an
"irresponsible" consumer can accidentally delete data that others did
not consume yet?

On Thu, Aug 28, 2014 at 10:06 AM, Prunier, Dominique
<do...@emc.com> wrote:
> Jay,
>
> I understand perfectly. I think you have every reason to keep the broker truly consumer-independent; in my view, it is a very wise principle that differentiates Kafka from pretty much all the other solutions.
>
> That is why, instead of making consumer-sensitive topics a broker feature, I now prefer this to be the responsibility of the consumer(s). Simply exposing a remote call to expire a partition up to a given offset would let consumers discard data by offset, most likely at the same time as they commit offsets. It seems simpler to me (as it keeps the broker pretty much as is) and cleaner (as it maintains the current design principles), while giving client applications the flexibility to choose how they handle data expiration.
>
>
> Thanks,
>
> -----Original Message-----
> From: Jay Kreps [mailto:jay.kreps@gmail.com]
> Sent: Thursday, August 28, 2014 12:28 PM
> To: users@kafka.apache.org
> Subject: Re: Consumer sensitive expiration of topic
>
> Hey Dominique,
>
> What you describe makes sense, and it would certainly be possible for
> the broker to more aggressively discard data once it sees that the
> consumer has read it once.
>
> The reason we haven't really made that a priority is that
> modern drives are so large relative to their throughput that discard
> is not usually pressing. Practically speaking, let's say you have a
> single cheap 2TB SATA drive and that you are doing 50k 1KB
> messages per second across all topics on that machine (~50MB/sec). In
> this case you have
>    2*1024*1024*1024*1024 / (50000 * 1024) / 60 / 60 = ~12 hours of retention
> So even under very high load, optimizing discard is not a very pressing concern.
>
> That said this would not be a terrible feature to have.
>
> -Jay
>
> On Thu, Aug 28, 2014 at 8:03 AM, Prunier, Dominique
> <do...@emc.com> wrote:
>> Yeah, I'm really not worried about performance. My concern is disk space, or more specifically, disk space consumed by duplicating the same data in different topics. The primary use case would be a special consumer whose job is to partition the messages from a topic into various "private consumer topics" (without altering it) to provide a filtered subscription service (e.g. for a remote service on a slower network that cannot afford to receive the whole stream and only wants a subset of it).
>>
>> Do you think it would make sense to have a remote API call that manually expires some partition segments by offset (as opposed to by time and/or size)? For example, exposing cleanupLogs with additional parameters to clean up segments on demand? I think it would be more than enough for me, and it could be used for various other things, like manually truncating a topic whose data isn't relevant anymore without recreating it.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: Neha Narkhede [mailto:neha.narkhede@gmail.com]
>> Sent: Wednesday, August 27, 2014 11:36 PM
>> To: users@kafka.apache.org
>> Subject: Re: Consumer sensitive expiration of topic
>>
>> Kafka is designed to maintain a persistent backlog of data on disk
>> efficiently and at scale. Unlike in other messaging systems, doing so does not
>> affect the performance of the system. If you are worried about the messages
>> occupying disk space, you can always set a lower retention on the topic,
>> as long as it is higher than any lag your consumer can accrue. The best plan here
>> is to allocate enough disk space for that retention.
>>
>>
>> On Mon, Aug 25, 2014 at 2:25 PM, Prunier, Dominique <
>> dominique.prunier@emc.com> wrote:
>>
>>> Any ideas on this use case, guys?
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: Prunier, Dominique [mailto:dominique.prunier@emc.com]
>>> Sent: Friday, August 15, 2014 11:02 AM
>>> To: users@kafka.apache.org
>>> Subject: RE: Consumer sensitive expiration of topic
>>>
>>> Hi,
>>>
>>> Thanks for the answer.
>>>
>>> The topics themselves won't be short-lived (as their consumers are supposed
>>> to stay there), but the messages in them will. What I'm trying to achieve is
>>> something similar to this:
>>>
>>> Producers --<topic>--> Processor A0 --<topic_a_1>--> Processor A1 --<topic_a_2>--> ... --<topic_a_N>--> Consumer
>>>                   |--> Processor B0 --<topic_b_1>--> Processor B1 --<topic_b_2>--> ... --<topic_b_N>--> Consumer
>>>                   |--> Processor C0 --<topic_c_1>--> Processor C1 --<topic_c_2>--> ... --<topic_c_N>--> Consumer
>>>
>>> Essentially, the "main" topic is the first one, and the only one consumed by
>>> multiple processors/consumers. These processors know which processor
>>> they should send their data to next by knowing its "private" topic
>>> name. So in this example, once Processor A1 picks up a message from topic_a_1
>>> and commits the offset, the message won't be used by anyone else.
>>>
>>> There is no particular issue with just leaving this as is, but topic_a_1 is
>>> going to buffer quite a lot of data on disk, while essentially the only
>>> thing that we have to deal with here is Processor A1 going down or lagging.
>>> When Processor A1 is healthy, the expiration of topic_a_1 could be kept
>>> very low, avoiding a fair amount of resource use.
>>>
>>> An idea off the top of my head would be an API where you can manually set
>>> the expiration of a topic by specifying offsets for partitions. This way,
>>> once Processor A1 has consumed its messages, it could not only commit the
>>> offsets (which, as far as I understand, has nothing to do with the broker
>>> itself) but also set the expiration of the topic using the same offsets
>>> (which could be done less frequently).
>>>
>>> Does that make sense?
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: Neha Narkhede [mailto:neha.narkhede@gmail.com]
>>> Sent: Thursday, August 14, 2014 8:10 PM
>>> To: users@kafka.apache.org
>>> Subject: Re: Consumer sensitive expiration of topic
>>>
>>> By design, Kafka stores data independent of the number of publishers or
>>> subscribers connecting to it. This provides high performance as the broker
>>> does not have to manage consumers and evict data based on the consumers'
>>> positions. This is one of the main reasons why Kafka is much more
>>> performant than JMS queues.
>>>
>>> It seems like your use case requires the concept of ephemeral topics where
>>> you would like to auto delete a topic once a particular consumer group has
>>> finished consuming data from it. Once 0.8.2 is released with the delete
>>> topic support, we intend to add auto expiration of topics that will delete
>>> topics that have not been accessed in some configurable time.
>>>
>>> Is there a reason why your application needs to create such short lived
>>> topics?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>> On Thu, Aug 14, 2014 at 2:56 PM, Prunier, Dominique <
>>> dominique.prunier@emc.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm playing around with Kafka with the idea of implementing a general-purpose
>>> > message exchanger for a distributed application with high throughput
>>> > requirements (several hundred thousand messages per second).
>>> >
>>> > In this context, I would like to use a topic as a kind of
>>> > private mailbox for a single consumer group. Once that
>>> > consumer group has committed its offset on its private topic, the
>>> > messages there won't be used anymore and can safely be discarded.
>>> > Therefore, I was wondering whether you see a way (in the current release or in
>>> > the future) to have a topic whose expiration policy is based on consumer
>>> > offsets.
>>> >
>>> > Thanks,
>>> >
>>> > --
>>> > Dominique Prunier
>>> >
>>> >
>>>

Re: Consumer sensitive expiration of topic

Posted by Gwen Shapira <gs...@cloudera.com>.
Maybe I misunderstand the proposal, but it sounds like an
"irresponsible" consumer can accidentally delete data that others did
not consume yet?
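
To make the concern concrete, here is a minimal Python sketch (not Kafka code). It assumes that whoever issues an expiry call knows the committed offset of every consumer group reading the partition, which the thread does not establish; the function name, group names, and offsets are all invented for illustration. Under that assumption, the only offset that is safe to expire up to is the smallest committed offset across all groups.

```python
from typing import Optional


def safe_expiry_offset(committed_by_group: dict) -> Optional[int]:
    """Offset below which every known group has already consumed.

    Returns None when no group has committed anything yet, in which
    case nothing should be discarded.
    """
    if not committed_by_group:
        return None
    return min(committed_by_group.values())


# Hypothetical committed offsets for one partition of a shared topic.
shared = {"processor-a": 1200, "processor-b": 450, "processor-c": 980}
# A "private" topic read by a single group, as in the proposal.
private = {"processor-a1": 1200}

print(safe_expiry_offset(shared))   # the slowest group bounds the expiry
print(safe_expiry_offset(private))  # a single group can expire up to its own offset
```

With a single consumer group per private topic, the minimum is simply that group's own committed offset, which is the case the proposal targets. On a shared topic, a consumer that expired up to its own offset (1200 here) would discard data the slower groups have not read yet.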

RE: Consumer sensitive expiration of topic

Posted by "Prunier, Dominique" <do...@emc.com>.
Jay,

I understand perfectly. I think you have every reason to keep the broker truly consumer-independent; in my view, it is a very wise principle that differentiates Kafka from pretty much all the other solutions.

That is why, instead of making consumer-sensitive topics a broker feature, I now prefer this to be the responsibility of the consumer(s). Simply exposing a remote call to expire a partition up to a given offset would let consumers discard data by offset, most likely at the same time as they commit offsets. It seems simpler to me (as it keeps the broker pretty much as is) and cleaner (as it maintains the current design principles), while giving client applications the flexibility to choose how they handle data expiration.


Thanks,

Re: Consumer sensitive expiration of topic

Posted by Jay Kreps <ja...@gmail.com>.
Hey Dominique,

What you describe makes sense, and it would certainly be possible for
the broker to more aggressively discard data once it sees that the
consumer has read it once.

The reason we haven't really made that a priority is that
modern drives are so large relative to their throughput that discard
is not usually pressing. Practically speaking, let's say you have a
single cheap 2TB SATA drive and that you are doing 50k 1KB
messages per second across all topics on that machine (~50MB/sec). In
this case you have
   2*1024*1024*1024*1024 / (50000 * 1024) / 60 / 60 = ~12 hours of retention
So even under very high load, optimizing discard is not a very pressing concern.
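
The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name and parameters are made up for illustration, and it assumes the whole drive is available for log data, with no replication or index overhead.

```python
def retention_hours(disk_bytes: int, msgs_per_sec: int, msg_bytes: int) -> float:
    """Hours of data a disk can hold at a constant ingest rate."""
    bytes_per_sec = msgs_per_sec * msg_bytes
    return disk_bytes / bytes_per_sec / 3600


# 2 TB drive, 50k messages per second of 1 KB each (the numbers used above).
hours = retention_hours(2 * 1024**4, 50_000, 1024)
print(f"{hours:.1f} hours of retention")
```

With these inputs the result comes out at just under 12 hours, in the same ballpark as the estimate in the message.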

That said this would not be a terrible feature to have.

-Jay

RE: Consumer sensitive expiration of topic

Posted by "Prunier, Dominique" <do...@emc.com>.
Yeah, i'm really not worried about performance. Disk space, or more specifically, disk space by duplication of the same data in different topics was my concern. The primary use case would be a special consumer which job would be to partition the messages from a topic into various "private consumer topics" (without altering it) to provide a filtered subscription service (e.g. for a remote service on slower network which cannot afford to receive the whole bunch of data and only wants a subset of it).

Do you think it would make sense to have a remote API call that manually expires some partition segments by offset (as opposed to by time and/or size)? For example, exposing cleanupLogs with additional parameters to clean up segments on demand? I think it would be more than enough for me and could be used for various other things, like manually truncating a topic whose data isn't relevant anymore without recreating it.
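
To make the idea concrete, here is a minimal sketch of the selection step such a call would need: given the base offsets of a partition's segments and a consumer's committed offset, which segments contain only consumed messages? This is plain Python, not Kafka code. The name `expirable_segments` and its arguments are hypothetical, and the sketch assumes the committed offset is the next offset the consumer will read.

```python
from bisect import bisect_right


def expirable_segments(base_offsets, committed_offset):
    """Return the base offsets of segments lying entirely below committed_offset.

    base_offsets: sorted list with the first offset of each segment in one
        partition (hypothetical input, e.g. [0, 100, 200, 300]).
    committed_offset: next offset the consumer will read; everything below
        it is assumed to be consumed.
    The segment that contains committed_offset, and any later one, is kept.
    """
    # Index of the segment that contains committed_offset.
    idx = bisect_right(base_offsets, committed_offset) - 1
    # A committed offset below the first segment means nothing can be expired.
    idx = max(idx, 0)
    return base_offsets[:idx]


if __name__ == "__main__":
    segments = [0, 100, 200, 300]
    print(expirable_segments(segments, 250))
```

Expiring whole segments keeps the operation coarse-grained: a segment is only dropped once every message in it is below the committed offset, so the consumer can always resume from where it left off.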

Thanks,

-----Original Message-----
From: Neha Narkhede [mailto:neha.narkhede@gmail.com] 
Sent: Wednesday, August 27, 2014 11:36 PM
To: users@kafka.apache.org
Subject: Re: Consumer sensitive expiration of topic

Kafka is designed to maintain a persistent backlog of data on disk
efficiently and at scale. Unlike in other messaging systems, doing so does not
affect the performance of the system. If you are worried about the messages
occupying disk space, you can always set a lower retention on the topic,
as long as it is higher than any lag your consumer can accrue. The best approach
here is to plan disk space allocation for that retention.



Re: Consumer sensitive expiration of topic

Posted by Neha Narkhede <ne...@gmail.com>.
Kafka is designed to maintain a persistent backlog of data on disk
efficiently and at scale. Unlike in other messaging systems, doing so does not
affect the performance of the system. If you are worried about the messages
occupying disk space, you can always set a lower retention on the topic,
as long as it is higher than any lag your consumer can accrue. The best approach
here is to plan disk space allocation for that retention.
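
As a rough illustration of that planning, the sketch below estimates the disk space a topic needs for a given retention window. It is back-of-the-envelope arithmetic only: the function name, the message size, the retention period and the replication factor are all assumptions chosen for the example, and it ignores compression, index files and the fact that data is deleted a whole segment at a time.

```python
def retention_disk_bytes(msgs_per_sec, avg_msg_bytes, retention_secs,
                         replication_factor=1):
    """Approximate bytes on disk across the cluster for one topic.

    All inputs are estimates supplied by the caller; the result is an
    upper-level planning figure, not a measurement.
    """
    return msgs_per_sec * avg_msg_bytes * retention_secs * replication_factor


if __name__ == "__main__":
    # Hypothetical numbers: 200,000 msgs/s, 500-byte messages,
    # one hour of retention, three replicas.
    total = retention_disk_bytes(200_000, 500, 3600, replication_factor=3)
    print(f"{total / 1e12:.2f} TB")
```

The retention window itself should be chosen first, as the longest outage or lag the consumer is expected to recover from, with some margin on top.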


On Mon, Aug 25, 2014 at 2:25 PM, Prunier, Dominique <
dominique.prunier@emc.com> wrote:

> Any idea on this usecase guys ?
>
> Thanks,
>

RE: Consumer sensitive expiration of topic

Posted by "Prunier, Dominique" <do...@emc.com>.
Any ideas on this use case, guys?

Thanks,

-----Original Message-----
From: Prunier, Dominique [mailto:dominique.prunier@emc.com] 
Sent: Friday, August 15, 2014 11:02 AM
To: users@kafka.apache.org
Subject: RE: Consumer sensitive expiration of topic

Hi,

Thanks for the answer. 

The topics themselves won't be short-lived (as their consumers are supposed to stay there), but the messages in them will. What I'm trying to achieve is something similar to this:

Producers --<topic>--> Processor A0 --<topic_a_1>--> Processor A1 --<topic_a_2>--> ... --<topic_a_N>--> Consumer
                  |--> Processor B0 --<topic_b_1>--> Processor B1 --<topic_b_2>--> ... --<topic_b_N>--> Consumer
                  |--> Processor C0 --<topic_c_1>--> Processor C1 --<topic_c_2>--> ... --<topic_c_N>--> Consumer

Essentially, the "main" topic is the first one and only one consumed by multiple processors/consumers. These processors know what is the next processor they should send their data to by knowing their "private" topic name. So in this example, once Processor A1 picks a message in topic_a_1 and commits the offset, the message won't be used anymore by anyone else.

There is no particular issue just leaving this as is, but topic_a_1 is going to buffer quite a lot of stuff on disk while essentially, the only thing that we have to deal with here is Processor A1 going down or lagging. When Processor A1 is healthy, the expiration of topic_a_1 could be kept very low and avoid a fair amount of resource use.

An idea off the top of my head would be an API where you can manually set the expiration of a topic by specifying offsets for partitions. This way, once Processor A1 has consumed its messages, it could not only commit the offsets (which, as far as I understand, has nothing to do with the broker itself) but also set the expiration of the topic using the same offsets (which could be done less frequently).

Does it make sense?

Thanks,


RE: Consumer sensitive expiration of topic

Posted by "Prunier, Dominique" <do...@emc.com>.
Hi,

Thanks for the answer. 

The topics themselves won't be short-lived (as their consumers are supposed to stay there), but the messages in them will. What I'm trying to achieve is something similar to this:

Producers --<topic>--> Processor A0 --<topic_a_1>--> Processor A1 --<topic_a_2>--> ... --<topic_a_N>--> Consumer
                  |--> Processor B0 --<topic_b_1>--> Processor B1 --<topic_b_2>--> ... --<topic_b_N>--> Consumer
                  |--> Processor C0 --<topic_c_1>--> Processor C1 --<topic_c_2>--> ... --<topic_c_N>--> Consumer

Essentially, the "main" topic is the first one and only one consumed by multiple processors/consumers. These processors know what is the next processor they should send their data to by knowing their "private" topic name. So in this example, once Processor A1 picks a message in topic_a_1 and commits the offset, the message won't be used anymore by anyone else.

There is no particular issue just leaving this as is, but topic_a_1 is going to buffer quite a lot of stuff on disk while essentially, the only thing that we have to deal with here is Processor A1 going down or lagging. When Processor A1 is healthy, the expiration of topic_a_1 could be kept very low and avoid a fair amount of resource use.

An idea off the top of my head would be an API where you can manually set the expiration of a topic by specifying offsets for partitions. This way, once Processor A1 has consumed its messages, it could not only commit the offsets (which, as far as I understand, has nothing to do with the broker itself) but also set the expiration of the topic using the same offsets (which could be done less frequently).

Does it make sense?

Thanks,

-----Original Message-----
From: Neha Narkhede [mailto:neha.narkhede@gmail.com] 
Sent: Thursday, August 14, 2014 8:10 PM
To: users@kafka.apache.org
Subject: Re: Consumer sensitive expiration of topic

By design, Kafka stores data independent of the number of publishers or
subscribers connecting to it. This provides high performance as the broker
does not have to manage consumers and evict data based on the consumers'
positions. This is one of the main reasons why Kafka is much more
performant than JMS queues.

It seems like your use case requires the concept of ephemeral topics where
you would like to auto delete a topic once a particular consumer group has
finished consuming data from it. Once 0.8.2 is released with the delete
topic support, we intend to add auto expiration of topics that will delete
topics that have not been accessed in some configurable time.

Is there a reason why your application needs to create such short-lived
topics?

Thanks,
Neha



Re: Consumer sensitive expiration of topic

Posted by Neha Narkhede <ne...@gmail.com>.
By design, Kafka stores data independent of the number of publishers or
subscribers connecting to it. This provides high performance as the broker
does not have to manage consumers and evict data based on the consumers'
positions. This is one of the main reasons why Kafka is much more
performant than JMS queues.

It seems like your use case requires the concept of ephemeral topics where
you would like to auto delete a topic once a particular consumer group has
finished consuming data from it. Once 0.8.2 is released with the delete
topic support, we intend to add auto expiration of topics that will delete
topics that have not been accessed in some configurable time.

Is there a reason why your application needs to create such short-lived
topics?

Thanks,
Neha


On Thu, Aug 14, 2014 at 2:56 PM, Prunier, Dominique <
dominique.prunier@emc.com> wrote:

> Hi,
>
> I'm playing around with Kafka with the idea to implement a general purpose
> message exchanger for a distributed application with high throughput
> requirements (multiple hundred thousand messages per sec).
>
> In this context, i would like to be able to use a topic as some form of
> private mailbox for a single consumer group. In this situation, once the
> single consumer group has committed its offset on its private topic, the
> messages there won't be used anymore and can be safely discarded.
> Therefore, i was wondering if you'd see a way (in the current release or in
> the future) to have a topic which expiration policy is based on consumer
> offsets.
>
> Thanks,
>
> --
> Dominique Prunier
>
>

Re: Consumer sensitive expiration of topic

Posted by Joel Koshy <jj...@gmail.com>.
On Thu, Aug 14, 2014 at 09:56:11PM +0000, Prunier, Dominique wrote:
> Hi,
> 
> I'm playing around with Kafka with the idea to implement a general purpose message exchanger for a distributed application with high throughput requirements (multiple hundred thousand messages per sec).
> 
> In this context, i would like to be able to use a topic as some form of private mailbox for a single consumer group. In this situation, once the single consumer group has committed its offset on its private topic, the messages there won't be used anymore and can be safely discarded. Therefore, i was wondering if you'd see a way (in the current release or in the future) to have a topic which expiration policy is based on consumer offsets.

Kafka does not provide any specific consumer-driven expiration policy.
However, it is possible to do the following which I think should
accomplish what you want, but I wouldn't really recommend this:

Use log compaction
(http://kafka.apache.org/documentation.html#compaction). You could
set your topic's retention policy to "compact" and attach a unique key
to every message. After your consuming application consumes a set of
messages, you can "delete" those messages by having your consuming
application produce a tombstone message (a message with a null payload)
to that topic for each of those keys. Those messages will be cleaned out
once the log segment rolls over and the cleaner runs.
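
The effect of tombstones can be illustrated with a toy model of compaction. This is a simplified sketch in plain Python, not the broker's cleaner: the function `compact` and the in-memory log are hypothetical, and tombstones are dropped immediately here, whereas a real cleaner only works on rolled segments and keeps tombstones around for a configurable period before removing them.

```python
def compact(log):
    """Keep only the latest record per key, then drop keys whose latest
    record is a tombstone (value None).

    log: list of (key, value) pairs in offset order.
    Returns a list of (offset, key, value) for the surviving records.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        # A later record with the same key supersedes the earlier one.
        latest[key] = (offset, value)
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    # Preserve the original offset order of the surviving records.
    return sorted(survivors)


if __name__ == "__main__":
    log = [("m1", "a"), ("m2", "b"), ("m1", None), ("m3", "c")]
    print(compact(log))
```

In this model the consumer "deletes" message m1 by appending `("m1", None)` after it has processed it; m2 and m3 survive because no tombstone was written for them.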

That said, I think it is much simpler if you just use the standard
retention policy (and set a relatively low retention period) and just
monitor your consumer lag.
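
A minimal sketch of such a lag check, assuming you can already obtain the log end offset and the committed offset for each partition (how you fetch them is left out). The function names, the per-partition rate and the safety factor are hypothetical choices for illustration.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet committed as consumed.

    Both arguments map partition id -> offset. A partition with no committed
    offset is treated as committed at 0 (an assumption of this sketch).
    """
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


def partitions_at_risk(lag, msgs_per_sec, retention_secs, safety=0.5):
    """Partitions whose lag exceeds a fraction of what retention can hold.

    msgs_per_sec is the expected per-partition write rate; once lag
    approaches msgs_per_sec * retention_secs, unread data starts expiring.
    """
    budget = msgs_per_sec * retention_secs * safety
    return sorted(p for p, n in lag.items() if n > budget)


if __name__ == "__main__":
    lag = consumer_lag({0: 1000, 1: 5000}, {0: 900, 1: 1000})
    print(lag, partitions_at_risk(lag, msgs_per_sec=10, retention_secs=600))
```

Alerting well before the lag reaches the retention budget leaves time to restart or scale the consumer before unread messages are discarded.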

Joel