Posted to user@mesos.apache.org by Tom Arnfeld <to...@duedil.com> on 2014/11/18 10:00:52 UTC

Master memory usage

I've noticed some strange memory usage behaviour of the Mesos master in a small cluster of ours. We have three master nodes in a quorum and are using ZK.

The master in question has 12GB of RAM available, of which the mesos-master process is using 10GB (resident), which seems like quite a lot. That being said, I'm not sure what the memory profile of the master should look like...
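
For reference, a minimal sketch for double-checking the resident figure straight from /proc (Python 2; this assumes a Linux host and a single mesos-master process that pidof can find):

    # Read the kernel's accounting for the mesos-master process.
    # Assumes Linux and exactly one running mesos-master.
    import subprocess

    pid = subprocess.check_output(["pidof", "mesos-master"]).split()[0]
    with open("/proc/%s/status" % pid) as status:
        for line in status:
            if line.startswith(("VmRSS", "VmSize")):
                print line.rstrip()  # VmRSS is the resident set size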


Here's a snapshot of our /stats.json endpoint.


This cluster is running 0.19.1 so perhaps there are some memory leak fixes in a newer release that we need to take advantage of.


Any help would be appreciated!


---------------------------------------------


{"activated_slaves":19,"active_schedulers":1,"active_tasks_gauge":1,"cpus_percent":0.116618075801749,"cpus_total":171.5,"cpus_used":20,"deactivated_slaves":0,"disk_percent":0.0273684210526316,"disk_total":972800,"disk_used":26624,"elected":1,"failed_tasks":11,"finished_tasks":2658,"invalid_status_updates":2638,"killed_tasks":1,"lost_tasks":4,"master/cpus_percent":0.116618075801749,"master/cpus_total":171.5,"master/cpus_used":20,"master/disk_percent":0.0273684210526316,"master/disk_total":972800,"master/disk_used":26624,"master/dropped_messages":16,"master/elected":1,"master/event_queue_size":0,"master/frameworks_active":1,"master/frameworks_inactive":0,"master/invalid_framework_to_executor_messages":0,"master/invalid_status_update_acknowledgements":0,"master/invalid_status_updates":2638,"master/mem_percent":0.279896013864818,"master/mem_total":1181696,"master/mem_used":330752,"master/messages_authenticate":0,"master/messages_deactivate_framework":0,"master/messages_exited_executor":2667,"master/messages_framework_to_executor":0,"master/messages_kill_task":4397,"master/messages_launch_tasks":838024,"master/messages_reconcile_tasks":0,"master/messages_register_framework":27,"master/messages_register_slave":1,"master/messages_reregister_framework":326788,"master/messages_reregister_slave":31,"master/messages_resource_request":0,"master/messages_revive_offers":0,"master/messages_status_update":8009,"master/messages_status_update_acknowledgement":0,"master/messages_unregister_framework":26,"master/messages_unregister_slave":0,"master/outstanding_offers":0,"master/recovery_slave_removals":0,"master/slave_registrations":1,"master/slave_removals":0,"master/slave_reregistrations":18,"master/slaves_active":19,"master/slaves_inactive":0,"master/tasks_failed":11,"master/tasks_finished":2658,"master/tasks_killed":1,"master/tasks_lost":4,"master/tasks_running":1,"master/tasks_staging":0,"master/tasks_starting":0,"master/uptime_secs":1411611.70786125,"master/valid_framework_to_executor_messages":0,"master/valid_status_update_acknowledgements":0,"master/valid_status_updates":5371,"mem_percent":0.279896013864818,"mem_total":1181696,"mem_used":330752,"outstanding_offers":0,"registrar/queued_operations":0,"registrar/registry_size_bytes":4348,"registrar/state_fetch_ms":95.591936,"registrar/state_store_ms":48.622848,"staged_tasks":2675,"started_tasks":26,"system/cpus_total":2,"system/load_15min":0.05,"system/load_1min":0.03,"system/load_5min":0.04,"system/mem_free_bytes":152408064,"system/mem_total_bytes":12631490560,"total_schedulers":1,"uptime":1411611.27369318,"valid_status_updates":5371}

--

Tom Arnfeld
Developer // DueDil


(+44) 7525940046
25 Christopher Street, London, EC2A 2BS

Re: Master memory usage

Posted by Tom Arnfeld <to...@duedil.com>.
I have, and it doesn't seem to add up. That being said, the growth in memory alongside the number of tasks does seem to make sense given the issue you linked to.


I'll upgrade and see where that leaves the issue.

Thanks for your help!

--

Tom Arnfeld
Developer // DueDil

(+44) 7525940046
25 Christopher Street, London, EC2A 2BS

On Thu, Nov 20, 2014 at 11:06 PM, Benjamin Mahler
<be...@gmail.com> wrote:

> Have you done the math on number of tasks * size of task?
> We didn't wipe the .data field in 0.19.1:
> https://issues.apache.org/jira/browse/MESOS-1746

Re: Master memory usage

Posted by Benjamin Mahler <be...@gmail.com>.
Have you done the math on number of tasks * size of task?

We didn't wipe the .data field in 0.19.1:
https://issues.apache.org/jira/browse/MESOS-1746
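
As a rough sanity check on that multiplication, using the figures Tom quotes below (around 2,500 tasks at roughly 80 KB of TaskInfo.data each), the product is only a couple of hundred megabytes, nowhere near the 10GB resident size reported above:

    # Back-of-the-envelope: number of tasks * per-task TaskInfo.data size.
    tasks = 2500
    data_per_task = 80 * 1024        # ~80 KB per task, per the thread

    total_bytes = tasks * data_per_task
    print "%.0f MB" % (total_bytes / 1024.0 / 1024.0)   # ~195 MB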

On Thu, Nov 20, 2014 at 2:51 PM, Tom Arnfeld <to...@duedil.com> wrote:

> That's what I thought. There are around 2500 tasks launched with this master,
> most of which will have been launched by our Hadoop JT. The Hadoop framework
> ships the configuration for the TT using the TaskInfo.data property, and that
> looks to be about 80K per task.
>
> Any debugging suggestions?

Re: Master memory usage

Posted by Tom Arnfeld <to...@duedil.com>.
That's what I thought. There are around 2500 tasks launched with this master, most of which will have been launched by our Hadoop JT. The Hadoop framework ships the configuration for the TT using the TaskInfo.data property, and that looks to be about 80K per task.

Any debugging suggestions?
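
One way to start digging, sketched here (Python 2; the master address is a placeholder, and this assumes the /master/state.json endpoint of this era), is to count live versus completed tasks per framework, since the issue Benjamin links elsewhere in the thread (MESOS-1746) concerns the master not wiping TaskInfo.data for completed tasks:

    import json
    import urllib2

    MASTER = "http://mesos-master.example.com:5050"  # placeholder address/port

    state = json.load(urllib2.urlopen(MASTER + "/master/state.json"))
    for fw in state.get("frameworks", []) + state.get("completed_frameworks", []):
        print "%-30s tasks=%d completed=%d" % (
            fw.get("name", "?"),
            len(fw.get("tasks", [])),
            len(fw.get("completed_tasks", [])))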


--

Tom Arnfeld
Developer // DueDil

(+44) 7525940046
25 Christopher Street, London, EC2A 2BS

On Thu, Nov 20, 2014 at 10:33 PM, Benjamin Mahler
<be...@gmail.com> wrote:

> It shouldn't be that high, especially with the size of the cluster I see in
> your stats.
> Which scheduler(s) are you running, and do they create large TaskInfo
> objects? Just a hunch, as I do not recall any leaks in 0.19.1.

Re: Master memory usage

Posted by Benjamin Mahler <be...@gmail.com>.
It shouldn't be that high, especially with the size of the cluster I see in
your stats.

Which scheduler(s) are you running, and do they create large TaskInfo
objects? Just a hunch, as I do not recall any leaks in 0.19.1.
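
For context, TaskInfo.data is an opaque bytes field a framework can fill in when launching a task; a rough sketch of how a scheduler ends up attaching a large per-task payload (Python 2, using the mesos_pb2 bindings of this era; the names below are illustrative):

    import mesos_pb2

    def make_task(offer, task_id, config_blob):
        task = mesos_pb2.TaskInfo()
        task.task_id.value = task_id
        task.slave_id.value = offer.slave_id.value
        task.name = "task %s" % task_id
        # Whatever goes into .data travels with the task through the master,
        # and 0.19.x does not wipe it when the task completes (MESOS-1746).
        task.data = config_blob  # e.g. tens of KB of serialized configuration
        cpus = task.resources.add()
        cpus.name = "cpus"
        cpus.type = mesos_pb2.Value.SCALAR
        cpus.scalar.value = 1.0
        return task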

On Tue, Nov 18, 2014 at 1:00 AM, Tom Arnfeld <to...@duedil.com> wrote:

>  I've noticed some strange memory usage behaviour of the Mesos master in
> a small cluster of ours. We have three master nodes in a quorum and are
> using ZK.
>
> The master in question has 12GB of RAM available, of which the mesos-master
> process is using 10GB (resident), which seems like quite a lot. That being
> said, I'm not sure what the memory profile of the master should look like...
>
> Here's a snapshot of our /stats.json endpoint.
>
> This cluster is running 0.19.1 so perhaps there are some memory leak fixes
> in a newer release that we need to take advantage of.
>
> Any help would be appreciated!