Posted to user@mesos.apache.org by Tom Arnfeld <to...@duedil.com> on 2015/07/08 19:54:51 UTC

Cleaning out old mesos-slave sandbox directories

Hey,


I'm wondering if anyone in the community has a decent solution to this: when a slave restarts and re-registers (perhaps because it was offline for too long), it gets a new slave ID and uses a new directory inside the work_dir for sandboxes.


When this happens, the old slave directories appear not to be tracked by the Mesos GC process and stay around indefinitely. Over time, if enough full slave restarts happen (say, due to reconfiguration), the disks can fill up completely and the mesos slave won't do anything about it.


I'm guessing the simplest case would be a cron job that cleans out the directories based on the timestamps in the directory names...
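
As a rough illustration of that cron-job idea, here is a minimal Python sketch. It assumes the slave directory names start with a `YYYYMMDD-HHMMSS` prefix, as in IDs like 20150515-105200-84152492-5050-9915-S46. The function names, the `keep` parameter and the regular expression are hypothetical, not part of Mesos. The prefix appears to come from the master's ID rather than from when the slave registered, so it is only an approximate age, and the directory of the currently running slave has to be excluded explicitly. The sketch only lists candidates and deletes nothing.

```python
import re
from datetime import datetime, timedelta

# Assumed shape of a slave ID: <date>-<time>-<number>-<number>-<number>-S<number>
SLAVE_ID_RE = re.compile(r"^(\d{8}-\d{6})-\d+-\d+-\d+-S\d+$")


def parse_slave_timestamp(name):
    """Return the datetime encoded in a slave directory name, or None."""
    match = SLAVE_ID_RE.match(name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
    except ValueError:
        # Digits matched the pattern but do not form a valid date/time.
        return None


def stale_slave_dirs(names, now, max_age, keep=()):
    """List directory names whose encoded timestamp is older than max_age.

    `keep` should contain the ID of the slave that is currently running,
    since its name may carry an old timestamp too.
    """
    stale = []
    for name in names:
        if name in keep:
            continue
        started = parse_slave_timestamp(name)
        if started is not None and now - started > max_age:
            stale.append(name)
    return sorted(stale)
```

A cron job could feed this with `os.listdir()` of the `slaves` directory under the work_dir, log the result for a while, and only then start removing what it reports.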


Any input would be great!


Tom.

--

Tom Arnfeld
Senior Developer // DueDil

Re: Cleaning out old mesos-slave sandbox directories

Posted by Vinod Kone <vi...@gmail.com>.
If the patch is clean, I think so. But what makes you think a retry would succeed?

@vinodkone

> On Jul 9, 2015, at 1:26 AM, Tom Arnfeld <to...@duedil.com> wrote:
> 
> Ok, do you think that'd be a change that would be accepted into Mesos if I sent it in?
> 
> Thanks Vinod, btw.
> 
> --
> 
> Tom Arnfeld
> Developer // DueDil
> 
> (+44) 7525940046
> 25 Christopher Street, London, EC2A 2BS
> 
> 
>> On Wed, Jul 8, 2015 at 7:24 PM, Vinod Kone <vi...@gmail.com> wrote:
>> 
>>> On Wed, Jul 8, 2015 at 11:20 AM, Tom Arnfeld <to...@duedil.com> wrote:
>>> Do you know if the mesos-slave will re-schedule something for GC if it fails deletion?
>> 
>> No it doesn't. 
> 

Re: Cleaning out old mesos-slave sandbox directories

Posted by Tom Arnfeld <to...@duedil.com>.
Ok, do you think that'd be a change that would be accepted into Mesos if I sent it in?

Thanks Vinod, btw.

--

Tom Arnfeld
Developer // DueDil

(+44) 7525940046
25 Christopher Street, London, EC2A 2BS

On Wed, Jul 8, 2015 at 7:24 PM, Vinod Kone <vi...@gmail.com> wrote:

> On Wed, Jul 8, 2015 at 11:20 AM, Tom Arnfeld <to...@duedil.com> wrote:
>> Do you know if the mesos-slave will re-schedule something for GC if it
>> fails deletion?
>>
> No it doesn't.

Re: Cleaning out old mesos-slave sandbox directories

Posted by Vinod Kone <vi...@gmail.com>.
On Wed, Jul 8, 2015 at 11:20 AM, Tom Arnfeld <to...@duedil.com> wrote:

> Do you know if the mesos-slave will re-schedule something for GC if it
> fails deletion?
>

No it doesn't.

Re: Cleaning out old mesos-slave sandbox directories

Posted by Tom Arnfeld <to...@duedil.com>.
Good question. There are likely mounts, yup. They should be getting unmounted cleanly, though perhaps not in all cases, so maybe we need to retry deleting things in the GC process.

Do you know if the mesos-slave will re-schedule something for GC if it fails deletion?
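
To make the retry idea concrete, here is a minimal Python sketch of "try the delete again a few times before giving up". It is not how the Mesos GC is implemented. The function name, the `attempts` and `delay` parameters, and the injectable `remove` and `sleep` callables are all assumptions made for illustration. A retry only helps if whatever blocked the first attempt (for example a mount that is released later) goes away in between.

```python
import shutil
import time


def remove_tree_with_retry(path, attempts=3, delay=1.0,
                           remove=shutil.rmtree, sleep=time.sleep):
    """Try to delete a directory tree, retrying on OSError.

    Returns True if the tree is gone afterwards, False if every
    attempt failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            remove(path)
            return True
        except FileNotFoundError:
            # Already gone, which is the outcome we wanted.
            return True
        except OSError as error:
            # E.g. "Directory not empty" or "Device or resource busy".
            last_error = error
            if attempt < attempts:
                sleep(delay)
    print(f"giving up on {path!r} after {attempts} attempts: {last_error}")
    return False
```

Whether this would actually succeed depends on why the first delete failed, so it is worth finding that out before relying on retries.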



--

Tom Arnfeld
Senior Developer // DueDil


Re: Cleaning out old mesos-slave sandbox directories

Posted by Vinod Kone <vi...@gmail.com>.
Are there any special files (mounts etc) in your slave directory? The logic
<https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp#L383>
Mesos uses to delete a directory is likely different from the shell utility
'rm'.
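
One way to check for leftover mounts is to walk the old slave directory and test each subdirectory. The Python sketch below does that with `os.path.ismount`. The function name is hypothetical and this is not what Mesos does internally. Note that on POSIX systems `os.path.ismount` may miss bind mounts that live on the same filesystem, so comparing against /proc/mounts may still be needed.

```python
import os


def find_mount_points(root):
    """Return the sorted paths of directories under root that are mount points."""
    mounts = []
    for dirpath, dirnames, _filenames in os.walk(root, followlinks=False):
        kept = []
        for name in dirnames:
            candidate = os.path.join(dirpath, name)
            if os.path.ismount(candidate):
                mounts.append(candidate)
            else:
                kept.append(name)
        # Do not descend into mounted filesystems.
        dirnames[:] = kept
    return sorted(mounts)
```

Running it against one of the directories that failed to delete would show whether a stale mount is a plausible cause of the "Directory not empty" error.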

On Wed, Jul 8, 2015 at 11:09 AM, Tom Arnfeld <to...@duedil.com> wrote:

> In this instance there were three old slave directories, and there are
> three log lines in the mesos-slave.INFO file;
>
>  I0708 11:24:52.023453  2425 slave.cpp:3499] Garbage collecting old slave
> 20150515-105200-84152492-5050-9915-S46
> I0708 11:24:52.023923  2425 slave.cpp:3499] Garbage collecting old slave
> 20150217-184553-67375276-5050-18563-S74
> I0708 11:24:52.023921  2428 gc.cpp:56] Scheduling
> '/mnt/mesos/mesos-slave/slaves/20150515-105200-84152492-5050-9915-S46' for
> gc 6.99999972599407days in the future
> I0708 11:24:52.054704  2425 slave.cpp:3499] Garbage collecting old slave
> 20150515-105200-84152492-5050-9915-S22
> I0708 11:24:52.054723  2424 gc.cpp:56] Scheduling
> '/mnt/mesos/mesos-slave/slaves/20150217-184553-67375276-5050-18563-S74' for
> gc 6.99999937182815days in the future
> I0708 11:24:52.067934  2425 gc.cpp:56] Scheduling
> '/mnt/mesos/mesos-slave/slaves/20150515-105200-84152492-5050-9915-S22' for
> gc 6.99999922252444days in the future
>
> This happens right after the recovery process finishes after the slave
> boots up. I've looked at another slave that's currently at 99% disk
> capacity and the slave has been up since 27th May 2015, it also has the
> "Garbage collecting old slave" log lines just after boot for ~6 days.
> Looking a little deeper in to this slave logs; this looks like an
> interesting error;
>
>  W0527 17:35:08.935755  1749 gc.cpp:139] Failed to delete
> '/mnt/mesos/mesos-slave/slaves/20150217-184553-67375276-5050-18563-S72':
> Directory not empty
>
> I think I actually discussed this with BenH a while back, we're running
> 0.21.0 on this cluster.
>
> Anyone else seen this before? Using the standard `rm` unix tool clears out
> the directories fine currently, running as the same user as the slave
> (root).
>
> --
>
> Tom Arnfeld
> Senior Developer // DueDil
>
>
> On Wed, Jul 8, 2015 at 7:00 PM, Vinod Kone <vi...@gmail.com> wrote:
>
>>
>> On Wed, Jul 8, 2015 at 10:54 AM, Tom Arnfeld <to...@duedil.com> wrote:
>>
>>> When this happens the old slave directories appear not to be tracked by
>>> the mesos GC process, and stay around indefinitely. Over time if enough
>>> full slave restarts happen (say, due to reconfiguration) the disks can be
>>> completely filled and the mesos slave won't do anything about it.
>>>
>>
>> This shouldn't happen. Old slave directories should be gc'ed by the slave
>> based on their last modification time
>> <https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4059>.
>> Do you see any log lines with  "Garbage collecting old slave" ?
>>
>>
>

Re: Cleaning out old mesos-slave sandbox directories

Posted by Tom Arnfeld <to...@duedil.com>.
In this instance there were three old slave directories, and there are three "Garbage collecting old slave" log lines in the mesos-slave.INFO file:

I0708 11:24:52.023453  2425 slave.cpp:3499] Garbage collecting old slave 20150515-105200-84152492-5050-9915-S46
I0708 11:24:52.023923  2425 slave.cpp:3499] Garbage collecting old slave 20150217-184553-67375276-5050-18563-S74
I0708 11:24:52.023921  2428 gc.cpp:56] Scheduling '/mnt/mesos/mesos-slave/slaves/20150515-105200-84152492-5050-9915-S46' for gc 6.99999972599407days in the future
I0708 11:24:52.054704  2425 slave.cpp:3499] Garbage collecting old slave 20150515-105200-84152492-5050-9915-S22
I0708 11:24:52.054723  2424 gc.cpp:56] Scheduling '/mnt/mesos/mesos-slave/slaves/20150217-184553-67375276-5050-18563-S74' for gc 6.99999937182815days in the future
I0708 11:24:52.067934  2425 gc.cpp:56] Scheduling '/mnt/mesos/mesos-slave/slaves/20150515-105200-84152492-5050-9915-S22' for gc 6.99999922252444days in the future

This happens right after the recovery process finishes when the slave boots up. I've looked at another slave that's currently at 99% disk capacity and has been up since 27th May 2015; it also has the "Garbage collecting old slave" log lines just after boot for ~6 days. Looking a little deeper into this slave's logs, this looks like an interesting error:

W0527 17:35:08.935755  1749 gc.cpp:139] Failed to delete '/mnt/mesos/mesos-slave/slaves/20150217-184553-67375276-5050-18563-S72': Directory not empty
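
To see how often this happens on a slave, the log can be scanned for these warnings. Here is a small Python sketch. The regular expression is derived only from the single warning line above, so the exact format is an assumption, and the function name is hypothetical.

```python
import re

# Pattern modelled on the one "Failed to delete" warning quoted above.
FAILED_DELETE_RE = re.compile(
    r"^W(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ gc\.cpp:\d+\] "
    r"Failed to delete '(?P<path>[^']+)': (?P<reason>.+)$"
)


def failed_gc_deletions(lines):
    """Return (path, reason) pairs for every GC deletion failure in lines."""
    failures = []
    for line in lines:
        match = FAILED_DELETE_RE.match(line.strip())
        if match:
            failures.append((match.group("path"), match.group("reason")))
    return failures
```

Passing an open log file to `failed_gc_deletions` gives the list of directories the GC gave up on, which are the ones that would need manual cleanup.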




I think I actually discussed this with BenH a while back. We're running 0.21.0 on this cluster.

Has anyone else seen this before? Using the standard `rm` unix tool currently clears out the directories fine, running as the same user as the slave (root).

--

Tom Arnfeld
Senior Developer // DueDil

On Wed, Jul 8, 2015 at 7:00 PM, Vinod Kone <vi...@gmail.com> wrote:

> On Wed, Jul 8, 2015 at 10:54 AM, Tom Arnfeld <to...@duedil.com> wrote:
>> When this happens the old slave directories appear not to be tracked by
>> the mesos GC process, and stay around indefinitely. Over time if enough
>> full slave restarts happen (say, due to reconfiguration) the disks can be
>> completely filled and the mesos slave won't do anything about it.
>>
> This shouldn't happen. Old slave directories should be gc'ed by the slave
> based on their last modification time
> <https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4059>. Do
> you see any log lines with  "Garbage collecting old slave" ?

Re: Cleaning out old mesos-slave sandbox directories

Posted by Vinod Kone <vi...@gmail.com>.
On Wed, Jul 8, 2015 at 10:54 AM, Tom Arnfeld <to...@duedil.com> wrote:

> When this happens the old slave directories appear not to be tracked by
> the mesos GC process, and stay around indefinitely. Over time if enough
> full slave restarts happen (say, due to reconfiguration) the disks can be
> completely filled and the mesos slave won't do anything about it.
>

This shouldn't happen. Old slave directories should be gc'ed by the slave
based on their last modification time
<https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4059>. Do
you see any log lines with  "Garbage collecting old slave" ?