Posted to users@kafka.apache.org by Victoria Zuberman <vi...@imperva.com> on 2020/06/02 07:56:29 UTC

Disk space - sharp increase in usage

Hi,

Background:
Kafka cluster
7 brokers, with 4T disk each
version 2.3 (recently upgraded from 0.1.0 via 1.0.1)

Problem:
Used disk space went from 40% to 80%.
Looking for root cause.

Suspects:

  1.  Incoming traffic

Ruled out: according to the metrics, there is no significant change in “bytes in” for the topics in the cluster

  2.  Upgrade

The rise started on the day of the upgrade to 2.3

But we upgraded another cluster in the same way and we don’t see a similar issue there

Is there a known change or issue in 2.3 related to disk space usage?

  3.  Replication factor

Is there a way to see whether the replication factor of any topic was changed recently? Didn’t find it in the metrics...

  4.  Retention

Is there a way to see whether retention was changed recently? Didn’t find it in the metrics...

Would appreciate any other ideas or investigation leads

Thanks,
Victoria


Re: Disk space - sharp increase in usage

Posted by Andrew Otto <ot...@wikimedia.org>.
WMF recently had an issue
<https://phabricator.wikimedia.org/T250133#6063641> where Kafka broker
disks were filling up with log segment data.  It turned out that Kafka was
not deleting old log segments because the oldest log segment had a message
with a Kafka timestamp a year in the future.  Since the oldest log segment
had a message newer than any others, Kafka could not respect the
retention.ms setting and delete any old log segments.  We mitigated this by
setting retention.bytes, which overrode retention.ms and allowed Kafka to
prune old logs.  In our case, a recurrence could be prevented by
setting message.timestamp.difference.max.ms.

Not sure if this is your problem, but it is at least something to check! :)
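
If you want to check for the same thing, something along these lines
might help (topic name, partition, paths, and the retention value are
just placeholders to adapt; kafka-dump-log.sh and kafka-configs.sh ship
with the broker distribution):

  # Look at the record timestamps in the oldest segment of a suspect partition
  bin/kafka-dump-log.sh --print-data-log \
    --files /var/kafka-logs/my-topic-0/00000000000000000000.log | head -n 20

  # Mitigation: cap partition size so old segments can be pruned even when
  # retention.ms cannot fire (value here is an example, ~100 GiB)
  bin/kafka-configs.sh --zookeeper zk-host:2181 --alter \
    --entity-type topics --entity-name my-topic \
    --add-config retention.bytes=107374182400

  # Guard against a recurrence: reject records whose timestamp is more than
  # a day away from the broker's clock (topic-level override)
  bin/kafka-configs.sh --zookeeper zk-host:2181 --alter \
    --entity-type topics --entity-name my-topic \
    --add-config message.timestamp.difference.max.ms=86400000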


Re: Disk space - sharp increase in usage

Posted by Liam Clarke-Hutchinson <li...@adscale.co.nz>.
Hi Victoria,

There are no metrics recording when a config was changed. However, if
you've been capturing the brokers' JMX metrics, the metric
kafka.cluster:name=ReplicasCount,partition=*,topic=*,type=Partition
would show whether the replication factor was increased.

As for retention time, if you're sure there hasn't been an increase in
data ingestion, the best metric to look at is
kafka.log:name=LogSegments..., as an increase there would be caused
either by a large influx of data or by an increase in retention time.

Lastly, check the logs and metrics for the log cleaner, in case any
issues are preventing logs from being cleaned.
kafka.log:name=max-clean-time-secs,type=LogCleaner and
kafka.log:name=time-since-last-run-ms,type=LogCleanerManager would be
most useful here.

The ZK logs won't be much use (ZK being where the config is stored) unless
you had audit logging enabled, which is disabled by default.
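
If you aren't already collecting these in a metrics system, a quick way
to spot-check them is the JmxTool bundled with Kafka (broker hostname
and JMX port below are placeholders; adjust to wherever your brokers
expose JMX):

  bin/kafka-run-class.sh kafka.tools.JmxTool \
    --jmx-url service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi \
    --object-name kafka.log:name=time-since-last-run-ms,type=LogCleanerManager \
    --reporting-interval 60000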

Good luck,

Liam Clarke-Hutchinson



Re: Disk space - sharp increase in usage

Posted by Victoria Zuberman <vi...@imperva.com>.
Regarding the kafka-logs directory: it was an interesting lead, but we checked and it is the same.

Regarding replication factor and retention: I am not looking for current information, I am looking for metrics that can give me information about a change.

Still looking for more ideas



Re: Disk space - sharp increase in usage

Posted by Peter Bukowinski <pm...@gmail.com>.

> On Jun 2, 2020, at 12:56 AM, Victoria Zuberman <vi...@imperva.com> wrote:
> 
> Hi,
> 
> Background:
> Kafka cluster
> 7 brokers, with 4T disk each
> version 2.3 (recently upgraded from 0.1.0 via 1.0.1)
> 
> Problem:
> Used disk space went from 40% to 80%.
> Looking for root cause.
> 
> Suspects:
> 
>  1.  Incoming traffic
> 
> Ruled out, according to metrics no significant change in “bytes in” for topics in cluster
> 
>  2.  Upgrade
> 
> The rise started on the day of the upgrade to 2.3
> 
> But we upgraded another cluster in the same way and we don’t see similar issue there
> 
> Is there a known change or issue at 2.3 related to disk space usage?
> 
>  3.  Replication factor
> 
> Is there a way to see whether replication factor of any topic was changed recently? Didn’t find in metrics...

You can use the kafka-topics.sh script to check the replica count for all your topics. Upgrading would not have affected the replica count, though.
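
For example (ZooKeeper host is a placeholder; each topic's line in the
describe output shows its ReplicationFactor):

  kafka-topics.sh --zookeeper host:2181 --describe | grep ReplicationFactor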

>  4.  Retention
> 
> Is there a way to see whether retention was changed recently? Didn’t find in metrics...

You can use kafka-topics.sh --zookeeper host:2181 --describe --topics-with-overrides
to list any topics with non-default retention, but I’m guessing that’s not it.

If your disk usage went from 40% to 80% on all brokers — effectively doubled — it could be that your Kafka data log directory path(s) changed during the upgrade. As you upgraded each broker and (re)started Kafka, it would have left the existing data under the old path and created new topic partition directories and logs under the new path as it rejoined the cluster. Have you verified that your data log directory locations are the same as they used to be?
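
A quick way to sanity-check that (the properties file path and data
directory below are examples; use whatever your brokers are actually
configured with):

  # What the broker thinks its data directories are
  grep '^log.dirs' /etc/kafka/server.properties

  # Is a stale directory still holding the pre-upgrade data?
  du -sh /var/kafka-logs*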
