Posted to reviews@aurora.apache.org by Stephan Erb <se...@apache.org> on 2018/06/18 08:57:11 UTC

Review Request 67627: Add observer flag to disable resource metric collection

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67627/
-----------------------------------------------------------

Review request for Aurora, Renan DelValle, Reza Motamedi, and Santhosh Kumar Shanmugham.


Repository: aurora


Description
-------

Add observer command line option `--disable_task_resource_collection` to
disable the collection of CPU, memory, and disk metrics for observed tasks.
This is useful in setups where metrics cannot be gathered reliably (e.g. when
using PID namespaces) or when collection is expensive due to hundreds of active
tasks per host.
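
For readers who do not have the diff at hand, the general shape of such a flag can be sketched as follows. This is a minimal illustration, not the code from the patch: `ResourceSample`, `SamplingMonitor`, `NullMonitor`, and `make_monitor` are hypothetical names, `argparse` stands in for whatever option parsing the observer actually uses, and returning zeros from the disabled monitor is an arbitrary choice for the sketch rather than a statement about what the UI shows.

```python
import argparse
from collections import namedtuple

# Hypothetical sample type; the real observer's data model may differ.
ResourceSample = namedtuple("ResourceSample", ["cpu", "ram", "disk"])


class SamplingMonitor(object):
    """Stand-in for a monitor that does real (expensive) per-task sampling."""

    def __init__(self, sampler):
        self._sampler = sampler  # callable returning a (cpu, ram, disk) tuple

    def sample(self):
        return ResourceSample(*self._sampler())


class NullMonitor(object):
    """Stand-in used when collection is disabled: does no work at all."""

    def sample(self):
        # Zeros are an assumption of this sketch, not the patch's behavior.
        return ResourceSample(cpu=0.0, ram=0, disk=0)


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disable_task_resource_collection",
        action="store_true",
        default=False,
        help="Disable collection of CPU, memory, and disk metrics "
             "for observed tasks.")
    return parser


def make_monitor(options, sampler):
    """Pick the monitor implementation once, based on the flag."""
    if options.disable_task_resource_collection:
        return NullMonitor()
    return SamplingMonitor(sampler)
```

Choosing the implementation once at startup keeps the per-task code path free of repeated flag checks.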


Diffs
-----

  RELEASE-NOTES.md edc081f502370190597ad028f3275cdfd572f5ca 
  docs/reference/observer-configuration.md c791b3480e5bf35e6eb0fbea908ff3242eab315d 
  src/main/python/apache/aurora/config/BUILD 12e7fe973f456d0847ce63d3b293131a7f4c3bdd 
  src/main/python/apache/aurora/tools/thermos_observer.py fd9465d2e2b3135f3fdf8230777117adaa89337c 
  src/main/python/apache/thermos/monitoring/resource.py 72ed4e5a82dfd8a09e0a8262f6da4992ac98542a 
  src/main/python/apache/thermos/observer/task_observer.py 94cd6c541bb7f8a4c153cc51caa63d2c08888a49 
  src/test/python/apache/thermos/monitoring/test_resource.py 44450647a180f86903ebd37f2a9f4327496597e9 


Diff: https://reviews.apache.org/r/67627/diff/1/


Testing
-------

We are running our Mesos agents with PID namespaces enabled (i.e.
`--isolation='namespaces/ipc,namespaces/pid,...'`). Sometimes the hosts are
also tightly packed with many small tasks (e.g. `~130` active tasks and `~1000`
finished tasks). Even with very relaxed scrape settings of
`--task_process_collection_interval_secs=3000` and
`--task_disk_collection_interval_secs=3000`, it can take between `150ms` and
`2500ms` to render the observer landing page `/main`. This patch reduces this
to about `100ms-150ms`. There is no immediate downside, as metrics reporting is
broken anyway due to the PID namespacing.
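
A measurement like the one above could be reproduced with a small timing helper along these lines. This is only a sketch: `time_requests` and `fetch_main` are hypothetical helpers, and the address `http://localhost:1338` is an assumption that should be replaced with wherever the observer actually listens.

```python
import time
from urllib.request import urlopen


def time_requests(fetch, runs=20):
    """Call fetch() `runs` times; return (min_ms, max_ms) wall-clock latency."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fetch()
        durations.append((time.perf_counter() - start) * 1000.0)
    return min(durations), max(durations)


def fetch_main(base_url="http://localhost:1338"):
    """Fetch the observer landing page once and read the full body.

    The default base_url is an assumed address, not taken from the review.
    """
    with urlopen(base_url + "/main") as response:
        response.read()
```

Against a running observer, `time_requests(fetch_main)` returns the fastest and slowest render time in milliseconds, which can be compared with and without the new flag.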


Thanks,

Stephan Erb


Re: Review Request 67627: Add observer flag to disable resource metric collection

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67627/#review204916
-----------------------------------------------------------


Ship it!




Master (4719fa7) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot




Re: Review Request 67627: Add observer flag to disable resource metric collection

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67627/#review204936
-----------------------------------------------------------



Mostly LGTM.

Will the UI show 0s or empty spaces?

Can you expand on why PID namespaces break metrics?


docs/reference/observer-configuration.md
Lines 27 (patched)
<https://reviews.apache.org/r/67627/#comment287754>

    also disk metrics



src/main/python/apache/aurora/tools/thermos_observer.py
Lines 68 (patched)
<https://reviews.apache.org/r/67627/#comment287753>

    also disk metrics


- Santhosh Kumar Shanmugham

