Posted to users@kafka.apache.org by Paul Mackles <pm...@adobe.com> on 2013/09/21 21:06:25 UTC

full disk

Hi -

We ran into a situation on our dev cluster (3 nodes, v0.8) where we ran out of disk on one of the nodes. As expected, the broker shut itself down and all of the clients switched over to the other nodes. So far so good.

To free up disk space, I reduced log.retention.hours to something more manageable (from 172 to 12). I did this on all 3 nodes. Since the other 2 nodes were running OK, I first tried to restart the node that ran out of disk. Unfortunately, it kept shutting itself down due to the full disk. From the logs, I think this was because it was trying to sync up the replicas it was responsible for and of course couldn't due to the lack of disk space. My hope was that upon restart, it would see the new retention settings and free up a bunch of disk space before trying to do any syncs.
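
For concreteness, the config change on each node amounted to something like this (the path assumes a stock Kafka layout, and the sed is just a sketch of the edit):

    # on each broker: lower time-based retention in server.properties
    # (the broker has to be restarted before it takes effect)
    sed -i 's/^log\.retention\.hours=.*/log.retention.hours=12/' config/server.properties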

I then restarted the other 2 nodes. They both picked up the new retention settings and freed up a bunch of storage as a result. I then went back and tried to restart the 3rd node, but to no avail: it still failed because of the full disk.

I thought about trying to reassign partitions so that the node in question had less to manage, but that turned out to be a hassle, so I wound up manually deleting some of the old log/segment files. The broker seemed to come back fine after that, but that's not something I would want to do on a production server.
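
For the record, the manual cleanup was along these lines (the data dir and topic name below are just placeholders, and I did this with the broker stopped):

    # list segment files under the data dir, oldest first
    ls -ltr /data/kafka-logs/*/*.log | head

    # delete a few of the oldest segments, removing each .log together
    # with its matching .index file, e.g.:
    rm /data/kafka-logs/sometopic-0/00000000000000000000.log \
       /data/kafka-logs/sometopic-0/00000000000000000000.index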

We obviously need better monitoring/alerting to avoid this situation altogether, but I am wondering if the order of operations at startup could/should be changed to better account for scenarios like this. Or maybe a utility to remove old logs after changing the TTL? Did I miss a better way to handle this?

Thanks,
Paul




Re: full disk

Posted by Jason Rosenberg <jb...@squareup.com>.
I just encountered the same issue (and I ended up following the same
work-around as Paul).

One thing I noticed, too, is that since the broker went down hard with an
IOException when the disk filled up, it also needed to 'recover' most of the
logs on disk as part of the startup sequence. So any log-cleaner task run at
startup would also need to handle log recovery correctly while disk space is
at a premium.

Jason


On Sun, Sep 22, 2013 at 8:10 PM, Jun Rao <ju...@gmail.com> wrote:

> Paul,
>
> This is likely due to that the log cleaner only runs every
> log.cleanup.interval.mins
> (defaults to 10) mins. We probably should consider running the cleaner on
> startup of a broker. Could you file a jira for that?
> Thanks,
> Jun
>
>
> On Sat, Sep 21, 2013 at 12:06 PM, Paul Mackles <pm...@adobe.com> wrote:
>
> > Hi -
> >
> > We ran into a situation on our dev cluster (3 nodes, v0.8) where we ran
> > out of disk on one of the nodes . As expected, the broker shut itself
> down
> > and all of the clients switched over to the other nodes. So far so good.
> >
> > To free up disk space, I reduced log.retention.hours to something more
> > manageable (from 172 to 12). I did this on all 3 nodes. Since the other 2
> > nodes were running OK, I first tried to restart the node which ran out of
> > disk. Unfortunately, it kept shutting itself down due to the full disk.
> > From the logs, I think this was because it was trying to sync-up the
> > replicas it was responsible for and of course couldn't due to the lack of
> > disk space. My hope was that upon restart, it would see the new retention
> > settings and free up a bunch of disk space before trying to do any syncs.
> >
> > I then went and restarted the other 2 nodes. They both picked up the new
> > retention settings and freed up a bunch of storage as a result. I then
> went
> > back and tried to restart the 3rd node but to no avail. It still had
> > problems with the full disks.
> >
> > I thought about trying to reassign partitions so that the node in
> question
> > had less to manage but that turned out to be a hassle so I wound up
> > manually deleting some of the old log/segment files. The broker seemed to
> > come back fine after that but that's not something I would want to do on
> a
> > production server.
> >
> > We obviously need better monitoring/alerting to avoid this situation
> > altogether, but I am wondering if the order of operations at startup
> > could/should be changed to better account for scenarios like this. Or
> maybe
> > a utility to remove old logs after changing ttl? Did I miss a better way
> to
> > handle this?
> >
> > Thanks,
> > Paul
> >
> >
> >
> >
>

Re: full disk

Posted by Jun Rao <ju...@gmail.com>.
Paul,

This is likely because the log cleaner only runs every
log.cleanup.interval.mins (which defaults to 10 minutes). We should probably
consider running the cleaner on broker startup. Could you file a jira for
that?
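
In the meantime, lowering the check interval alongside the shorter retention
might at least give the cleaner a chance to run sooner after startup.
Roughly (untested sketch, stock config layout assumed):

    # on each broker, then restart; the default check interval is 10 minutes
    sed -i 's/^log\.cleanup\.interval\.mins=.*/log.cleanup.interval.mins=1/' config/server.properties
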
Thanks,
Jun


On Sat, Sep 21, 2013 at 12:06 PM, Paul Mackles <pm...@adobe.com> wrote:


Re: full disk

Posted by Jun Rao <ju...@gmail.com>.
Yes, manually removing the old log files is the simplest solution right now.

Thanks,

Jun


On Mon, Sep 23, 2013 at 9:16 AM, Paul Mackles <pm...@adobe.com> wrote:


Re: full disk

Posted by Paul Mackles <pm...@adobe.com>.
Done:

https://issues.apache.org/jira/browse/KAFKA-1063

Out of curiosity, is manually removing the older log files the only option at this point?
