Posted to dev@kafka.apache.org by Grant Henke <gh...@cloudera.com> on 2015/12/14 17:32:24 UTC

Consumer Offsets Compaction

Since 0.9.0 the internal "__consumer_offsets" topic is being used more
heavily. Because this is a compacted topic, does "log.cleaner.enable" need
to be "true" in order for it to be compacted? Or is there special handling
for internal topics?

If log.cleaner.enable=true is required, should we make it true by default?
Or document that it is required for normal operation?

Thanks,
Grant
-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Consumer Offsets Compaction

Posted by Grant Henke <gh...@cloudera.com>.
Thanks for the responses and confirmation. I have created
https://issues.apache.org/jira/browse/KAFKA-2988 to track the work/changes.
We can continue the discussion here on what changes to make.



-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Consumer Offsets Compaction

Posted by Grant Henke <gh...@cloudera.com>.
I am considering changing these defaults in KAFKA-2988:

log.cleaner.enable=true (was false)
log.cleaner.dedupe.buffer.size=128 MiB (was 500 MiB)
log.cleaner.delete.retention.ms=7 days (was 1 day)
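
For reference, here is how those proposed defaults would read as a
server.properties excerpt (the byte and millisecond conversions are mine):

   log.cleaner.enable=true
   # 128 MiB expressed in bytes
   log.cleaner.dedupe.buffer.size=134217728
   # 7 days expressed in milliseconds
   log.cleaner.delete.retention.ms=604800000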

Thoughts on those values?

Should I add logic to make sure we scale down the buffer size instead of
causing an OutOfMemoryError? Would anyone have an instance small enough to
cause a problem?




-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Consumer Offsets Compaction

Posted by Gwen Shapira <gw...@confluent.io>.
I'm thinking that anyone who actually uses compaction has a non-standard
configuration (at the very least, they had to enable the cleaner, and
probably a few other configurations too... Compaction is a bit fiddly from
what I've seen).

So, I'm in favor of a minimal default buffer sized just for offsets and
Copycat configuration. If you need compaction elsewhere, feel free to resize.

Gwen


Re: Consumer Offsets Compaction

Posted by Grant Henke <gh...@cloudera.com>.
Following up based on some digging. There are some upper and lower bounds
on the buffer size:

log.cleaner.dedupe.buffer.size has:

   - A minimum of 1 MiB per cleaner thread:
     https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/KafkaConfig.scala#L950
   - A maximum of 2 GiB per cleaner thread:
     https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogCleaner.scala#L183

The entry size is 24 bytes (the size of an MD5 hash (16 bytes) + the size
of an offset (8 bytes)).
Note: The hash algorithm is technically interchangeable, but not exposed
via configuration.
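
As a back-of-the-envelope Scala sketch (my own arithmetic, ignoring the
load-factor and thread-count adjustments Jay mentioned), those bounds
translate to the following key capacities:

   val entryBytes = 24  // 16-byte MD5 hash + 8-byte offset
   def maxKeys(bufferBytes: Long): Long = bufferBytes / entryBytes
   maxKeys(1L << 20)      // 1 MiB minimum  -> 43,690 unique keys
   maxKeys(Int.MaxValue)  // ~2 GiB maximum -> ~89,478,485 unique keys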

I would like to enable the log cleaner by default, given that the new
consumer depends on it. The main concern I have with enabling the log
cleaner by default without changing the default size of the dedupe buffer
is the impact it would have on small POC and test deployments that have
small heaps. When moving to the new version, many would just fail with an
OutOfMemoryError.

We could scale the dedupe buffer down to some percentage of the maximum
memory available, not exceeding the configured
log.cleaner.dedupe.buffer.size, and warn if the result is less than the
configured value. But I am not sure that is the best way to handle it either.
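
That logic might look something like the following sketch (the names and
heap fraction are hypothetical; this is not existing Kafka code):

   // Cap the dedupe buffer at a fraction of the max heap, never exceeding
   // the configured value, and warn when it had to be scaled down.
   val heapFraction = 0.1  // hypothetical: at most 10% of the max heap
   def effectiveDedupeBufferSize(configured: Long): Long = {
     val cap = (Runtime.getRuntime.maxMemory * heapFraction).toLong
     val effective = math.min(configured, cap)
     if (effective < configured)
       System.err.println(s"WARN: scaling log.cleaner.dedupe.buffer.size " +
         s"down from $configured to $effective bytes")
     effective
   }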





-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Consumer Offsets Compaction

Posted by Jay Kreps <ja...@confluent.io>.
The buffer determines the maximum number of unique keys in the new
writes that can be processed in one cleaning. Each key requires 24
bytes of space IIRC, so 500 MB = ~21,845,333 unique keys (this is
actually adjusted for some load factor and divided by the number of
cleaner threads). If it is too small, you will need multiple cleanings
to compact new data, so sizing it too small will tend to lead to lots
of additional I/O. The tradeoff is that if we size it for just handling
the offsets topic it could be super small (proportional to the number
of active group-partition combinations), but then people who use log
compaction will see poor performance. If we size it larger, we waste
memory.
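
A quick check of that arithmetic in Scala (the load factor and thread
count below are illustrative, not the exact cleaner internals):

   val rawKeys = (500L * 1024 * 1024) / 24  // = 21,845,333 unique keys
   // Adjusted for a load factor and split across cleaner threads:
   def usableKeys(loadFactor: Double, threads: Int): Long =
     ((rawKeys * loadFactor) / threads).toLong  // e.g. 0.9, 1 thread -> 19,660,799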

-Jay


Re: Consumer Offsets Compaction

Posted by Grant Henke <gh...@cloudera.com>.
Thanks for the background context, Jay.

Do we have any context on what size is small (but still effective for small
deployments) for the compaction buffer? And what is large? What factors
help you choose the correct (or a safe) size?

Currently the default "log.cleaner.dedupe.buffer.size" is 500 MiB. If we
are enabling the log cleaner by default, should we adjust that default size
to be smaller?

On a similar note, log.cleaner.delete.retention.ms currently defaults
to 1 day. I am not sure of the background here either, but would it make
sense to default this setting to 7 days to match the default log retention
and ensure no delete messages are missed by consumers?

Thanks,
Grant




-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Consumer Offsets Compaction

Posted by Jay Kreps <ja...@confluent.io>.
The reasons for disabling it by default were (1) general paranoia about
log compaction when we released it, and (2) avoiding allocating the
compaction buffer. The first concern is now definitely obsolete, but
the second concern is maybe still valid. Basically, that compaction
buffer is a preallocated chunk of memory used in compaction and is
closely tied to the efficiency of the compaction process (so you want
it to be big). But if you're not using compaction then it is just
wasted memory. I guess since the new consumer requires native offsets
(right?) and native offsets require log compaction, maybe we should
just default it to on...

-Jay


Re: Consumer Offsets Compaction

Posted by Jason Gustafson <ja...@confluent.io>.
That's a good point. It doesn't look like there's any special handling for
the offsets topic, so enabling the cleaner by default makes sense to me. If
compaction is not enabled, the offsets topic would grow without bound, so I
wonder if we should even deprecate that setting. Are there any use cases
where it needs to be disabled?

-Jason


Re: Consumer Offsets Compaction

Posted by Gwen Shapira <gw...@confluent.io>.
This makes sense to me. Copycat also works better if topics are compacted.

Just to clarify:
log.cleaner.enable=true just makes the compaction thread run, but doesn't
force compaction on any specific topic. You still need to set
cleanup.policy=compact on the topic, and we should not change defaults here.
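
For example, compaction can be enabled per topic at creation time (an
illustrative 0.9-era command; the topic name and ZooKeeper address are
placeholders):

   bin/kafka-topics.sh --zookeeper localhost:2181 --create \
     --topic my-compacted-topic --partitions 1 --replication-factor 1 \
     --config cleanup.policy=compact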
