
Posted to dev@mesos.apache.org by Venkat Morampudi <ve...@gmail.com> on 2017/10/26 22:43:44 UTC

Sandbox life cycle /age

Hello,
In our production environment, we noticed that one framework had a lot of failed/completed executor folders lying around.
The folders eventually filled up the disk.
 
 
228M /mnt/resource/slaves/c8674097-6e67-4609-b022-3e11de380fe5-S2/frameworks/35e600c2-6f43-402c-856f-9084c0040187-002/executors/52334.1.0
228M /mnt/resource/slaves/c8674097-6e67-4609-b022-3e11de380fe5-S2/frameworks/35e600c2-6f43-402c-856f-9084c0040187-002/executors/52334.2.0
228M /mnt/resource/slaves/c8674097-6e67-4609-b022-3e11de380fe5-S2/frameworks/35e600c2-6f43-402c-856f-9084c0040187-002/executors/52335.1.0
228M /mnt/resource/slaves/c8674097-6e67-4609-b022-3e11de380fe5-S2/frameworks/35e600c2-6f43-402c-856f-9084c0040187-002/executors/52335.2.0
228M /mnt/resource/slaves/c8674097-6e67-4609-b022-3e11de380fe5-S2/frameworks/35e600c2-6f43-402c-856f-9084c0040187-002/executors/52336.1.0
 
http://mesos.apache.org/documentation/latest/sandbox/#sandbox-lifecycle
 
We have our lifecycle cleanup set to the default, which I believe is 7 days.
 
We wanted to know: is this the proper way to clean up the failed/completed executor folders for a running framework?
Or does the framework need to be inactive or completed for garbage collection to work?
Or does the framework itself need to deal with cleaning up its own executors?
 
Bonus question: how does “gc_disk_headroom” actually work? It seems this equation will always return 0: gc_delay * max(0.0, (1.0 - gc_disk_headroom - disk usage))
 
Thanks,
Venkat

Re: Sandbox life cycle /age

Posted by Venkat Morampudi <ve...@gmail.com>.

Hi Benjamin,

Apologies for the delay. GC seems to be working fine: folders older than 2 hours are being deleted. After the config change and restart, the Mesos agent took some time to delete the old folders that were around before the restart. I may have jumped the gun.

Thanks,
Venkat

> On Oct 30, 2017, at 1:01 PM, Benjamin Mahler <bm...@apache.org> wrote:

Re: Sandbox life cycle /age

Posted by Benjamin Mahler <bm...@apache.org>.

Hi Venkat,

You're seeing that files last modified more than your GC delay of 2 hours
ago are *not* getting deleted? Can you show a full listing of
/var/lib/mesos/slave/slaves/? Is there more than one entry there?

On Fri, Oct 27, 2017 at 8:43 AM, Venkat Morampudi <venkatmorampudi@gmail.com> wrote:

Re: Sandbox life cycle /age

Posted by Venkat Morampudi <ve...@gmail.com>.

Hi Tomek,

After changing the GC delay to 2hrs, the existing sandbox folders that are older than the “Max allowed age” are not being deleted. Here are the logs.

Log entries before and after the change:

I1027 15:00:48.055465 12861 slave.cpp:4615] Current disk usage 21.63%. Max allowed age: 1.367499658088657days
I1027 15:02:37.720451 30693 slave.cpp:4615] Current disk usage 21.60%. Max allowed age: 1.368035520611667hrs
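One way to read these two log lines is to run the formula backwards and ask which gc_delay would produce each logged age. The sketch below assumes gc_disk_headroom is 0.1 (the documented default) for both lines; that is an assumption, since the thread does not show which headroom the agent was actually running with, and the logged disk usage is rounded to two decimals. The helper name implied_gc_delay is hypothetical. Under that assumption both lines work out to a gc_delay of roughly 2 (days for the first line, hours for the second).

```python
def implied_gc_delay(max_allowed_age: float, disk_usage: float,
                     gc_disk_headroom: float = 0.1) -> float:
    """Hypothetical helper: invert
    max_allowed_age = gc_delay * (1.0 - gc_disk_headroom - disk_usage).

    The result is in the same unit as max_allowed_age. The default
    gc_disk_headroom of 0.1 is an assumed value, not read from the agent.
    """
    factor = 1.0 - gc_disk_headroom - disk_usage
    if factor <= 0.0:
        # The age was clamped to 0, so it says nothing about gc_delay.
        raise ValueError("disk usage is at or above 1.0 - gc_disk_headroom")
    return max_allowed_age / factor


# Numbers copied from the two log lines above (disk usage as a fraction).
before_days = implied_gc_delay(1.367499658088657, 0.2163)
after_hours = implied_gc_delay(1.368035520611667, 0.2160)
print(f"implied gc_delay before: {before_days:.3f} days")
print(f"implied gc_delay after:  {after_hours:.3f} hours")
```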

Executor info from the node:


[techops@kaiju-dcos-privateslave27 ~]$ date
Fri Oct 27 15:41:59 UTC 2017
[techops@kaiju-dcos-privateslave27 ~]$ ls -l /var/lib/mesos/slave/slaves/3fd7ffe9-945b-4ae1-968e-2c397585960c-S201/frameworks/35e600c2-6f43-402c-856f-9084c0040187-002/executors/
total 452
drwxr-xr-x. 3 root root 4096 Oct 26 09:43 74861.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 09:56 74861.10.0
drwxr-xr-x. 3 root root 4096 Oct 26 10:31 74867.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 11:07 74871.3.0
drwxr-xr-x. 3 root root 4096 Oct 26 11:45 74875.10.0
drwxr-xr-x. 3 root root 4096 Oct 26 13:43 74886.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 13:45 74886.2.0
drwxr-xr-x. 3 root root 4096 Oct 26 13:51 74886.8.0
drwxr-xr-x. 3 root root 4096 Oct 26 13:56 74886.9.0
drwxr-xr-x. 3 root root 4096 Oct 26 13:59 74887.1.0
drwxr-xr-x. 3 root root 4096 Oct 26 14:42 74895.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 14:58 74899.7.0
drwxr-xr-x. 3 root root 4096 Oct 26 14:59 74900.8.0
drwxr-xr-x. 3 root root 4096 Oct 26 15:12 74902.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 17:05 74938.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 17:10 74940.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 17:11 74942.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 17:30 74944.3.0
drwxr-xr-x. 3 root root 4096 Oct 26 17:45 74951.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 17:45 74951.1.0
drwxr-xr-x. 3 root root 4096 Oct 26 18:07 74959.7.0
drwxr-xr-x. 3 root root 4096 Oct 26 18:47 74971.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 19:18 74981.3.0
drwxr-xr-x. 3 root root 4096 Oct 26 20:14 74992.0.0
drwxr-xr-x. 3 root root 4096 Oct 26 20:28 74995.3.0
drwxr-xr-x. 3 root root 4096 Oct 26 20:59 75001.1.0
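A rough, read-only way to check a listing like this against the threshold is to compare each executor directory's modification time with the max allowed age. This is a minimal Python sketch that assumes mtime is a reasonable proxy for sandbox age; the agent decides what to delete from its own GC schedule, so this only approximates what it would remove, and the function name sandboxes_older_than is hypothetical.

```python
import os
import time
from typing import List, Optional


def sandboxes_older_than(executors_dir: str, max_age_seconds: float,
                         now: Optional[float] = None) -> List[str]:
    """Hypothetical helper: names of subdirectories of executors_dir whose
    mtime is more than max_age_seconds in the past.

    Read-only: it deletes nothing and does not consult the agent's GC schedule.
    """
    now = time.time() if now is None else now
    stale = []
    with os.scandir(executors_dir) as entries:
        for entry in entries:
            # Executor sandboxes are directories; skip files and symlinks.
            if not entry.is_dir(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                stale.append(entry.name)
    return sorted(stale)
```

Pointed at the executors directory from the ls -l command above with max_age_seconds set to the logged 1.368 hours (about 4925 seconds), every directory in that listing would be reported, assuming the ls timestamps are also UTC: the newest one (Oct 26 20:59) was already about 18.7 hours old when date printed Fri Oct 27 15:41:59 UTC 2017.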


Thanks,
Venkat

> On Oct 27, 2017, at 6:42 AM, Tomek Janiszewski <ja...@gmail.com> wrote:

Re: Sandbox life cycle /age

Posted by Tomek Janiszewski <ja...@gmail.com>.

A low GC delay means files will be deleted sooner. I don't think there
will be any performance problem, but a low GC delay means you will lose your
sandboxes earlier, and they are useful for debugging purposes.

On Fri, Oct 27, 2017 at 04:40, Venkat Morampudi <venkatmorampudi@gmail.com> wrote:

Re: Sandbox life cycle /age

Posted by Venkat Morampudi <ve...@gmail.com>.

Hi Tomek,

Thanks for the quick reply. After digging a bit into the Mesos code, we were able to understand that the age is actually a threshold age: anything older than the “age” would be GCed. We are going to try different settings, starting with “--gc_disk_headroom=.2 --gc_delay=2hrs”. Is there any downside to going with a very low GC delay?

Thanks,
Venkat 


> On Oct 26, 2017, at 4:28 PM, Tomek Janiszewski <ja...@gmail.com> wrote:

Re: Sandbox life cycle /age

Posted by Tomek Janiszewski <ja...@gmail.com>.

>  gc_delay * max(0.0, (1.0 - gc_disk_headroom - disk usage))

*Example:*
gc_delay = 7days
gc_disk_headroom = 0.1
disk_usage = 0.8
7 * max(0.0, 1 - 0.1 - 0.8) = 7 * max(0.0, 0.1) = 0.7 days = 16 h 48 min
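To make the arithmetic above concrete, here is a minimal Python sketch of the formula as quoted in this thread. The function name max_allowed_age and its argument types are illustrative assumptions, not Mesos code. It also shows when the result actually reaches 0: only once disk usage is at or above 1.0 - gc_disk_headroom (90% with a 0.1 headroom), not always.

```python
from datetime import timedelta


def max_allowed_age(gc_delay: timedelta, gc_disk_headroom: float,
                    disk_usage: float) -> timedelta:
    """Hypothetical helper: gc_delay * max(0.0, 1.0 - gc_disk_headroom - disk_usage).

    gc_disk_headroom and disk_usage are fractions in [0.0, 1.0] (0.8 means 80%).
    Sandboxes older than the returned age are the ones eligible for GC.
    """
    if not 0.0 <= gc_disk_headroom <= 1.0:
        raise ValueError("gc_disk_headroom must be between 0.0 and 1.0")
    # The factor shrinks as disk usage grows and is clamped at 0.0 once
    # disk_usage >= 1.0 - gc_disk_headroom.
    factor = max(0.0, 1.0 - gc_disk_headroom - disk_usage)
    return gc_delay * factor


# The example above: 7 days, 10% headroom, 80% usage -> roughly 0.7 days (16 h 48 min).
print(max_allowed_age(timedelta(days=7), 0.1, 0.8))
# At 95% usage the factor is clamped, so the age threshold becomes zero.
print(max_allowed_age(timedelta(days=7), 0.1, 0.95))
```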

Can you show some logs containing information about GC?

On Fri, Oct 27, 2017 at 00:43, Venkat Morampudi <venkatmorampudi@gmail.com> wrote:
