Posted to reviews@aurora.apache.org by Reza Motamedi <re...@gmail.com> on 2018/03/19 14:58:41 UTC

Review Request 66103: Introduce mesos disk collector

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.


Repository: aurora


Description
-------

When disk isolation is enabled in a Mesos agent, the agent calculates the disk usage for each container.
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal to delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection improves in the agent (for instance by implementing XFS isolation), we benefit from it without any code change. More information about the problem is provided in AURORA-1918.

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.
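
As a rough illustration of the lookup described above, the sketch below pulls `disk_used_bytes` for one executor out of an already-parsed agent response. The payload layout (a list of entries carrying `executor_id` and a `statistics` object) is an assumption modeled on the agent's container listing, and `find_disk_used_bytes` is a hypothetical helper, not the code in this patch.

```python
import json


def find_disk_used_bytes(containers, executor_id):
    """Return disk_used_bytes for executor_id, or None if it is not reported.

    `containers` is assumed to be the parsed JSON list returned by the agent.
    """
    for entry in containers:
        if entry.get("executor_id") == executor_id:
            return entry.get("statistics", {}).get("disk_used_bytes")
    return None


# Hypothetical payload, trimmed to the fields used here.
payload = json.loads("""
[
  {"executor_id": "thermos-job-0", "statistics": {"disk_used_bytes": 4096}},
  {"executor_id": "thermos-job-1", "statistics": {}}
]
""")

print(find_disk_used_bytes(payload, "thermos-job-0"))
```

Returning `None` for a missing entry keeps "not reported" distinct from a real reading of zero bytes.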

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.


Diffs
-----

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/aurora-executor.service 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
  src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
  src/main/python/apache/thermos/monitoring/resource.py f5e3849ca6682c6d4720698be869ca6b9f703b94 
  src/main/python/apache/thermos/observer/task_observer.py 4bb5d239e81fe4659397f899760c0e8853e93786 
  src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 


Diff: https://reviews.apache.org/r/66103/diff/1/


Testing
-------

I added unit tests.
Tested in vagrant and it works as intended.
I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).


Thanks,

Reza Motamedi


Re: Review Request 66103: Introduce mesos disk collector

Posted by Reza Motamedi <re...@gmail.com>.

> On March 19, 2018, 9 p.m., Santhosh Kumar Shanmugham wrote:
> > 3rdparty/python/requirements.txt
> > Lines 23 (patched)
> > <https://reviews.apache.org/r/66103/diff/1/?file=1982409#file1982409line23>
> >
> >     Any reason for not using the more widely used `jq`?

There are two Python libraries for jq:
1) https://pypi.python.org/pypi/jq
2) https://pypi.python.org/pypi/pyjq

Neither of these libs has been updated recently. We also don't need all the functions of jq. I am open to suggestions if either of the above, or another lib, is preferred.


> On March 19, 2018, 9 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/thermos/monitoring/resource.py
> > Line 158 (original), 159 (patched)
> > <https://reviews.apache.org/r/66103/diff/1/?file=1982416#file1982416line159>
> >
> >     Call this `disk_collector_class`? It reads a little weird when we call this `disk_collector`, which suggests it is the actual object to be used.

I agree. Addressed.


> On March 19, 2018, 9 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/thermos/monitoring/resource.py
> > Lines 164 (patched)
> > <https://reviews.apache.org/r/66103/diff/1/?file=1982416#file1982416line164>
> >
> >     Can we combine this via partial function to the `disk_collector_class` argument? This will keep the constructor more idiomatic.

I added `DiskCollectorProvider` to address this.


- Reza


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199448
-----------------------------------------------------------


On March 20, 2018, 5:37 a.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 20, 2018, 5:37 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent, the agent calculates the disk usage for each container.
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal to delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection improves in the agent (for instance by implementing XFS isolation), we benefit from it without any code change. More information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/aurora-executor.service 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/resource.py f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/task_observer.py 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/2/
> 
> 
> Testing
> -------
> 
> I added unit tests.
> Tested in vagrant and it works as intended.
> I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199448
-----------------------------------------------------------



Approach looks good. A few comments on improving the interface and code structure.

Instead of plumbing the new arguments all the way through to `TaskResourceMonitor`, can we build a partial factory method in `ThermoObserver`, since we already have all the information at that point?
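
To make the suggestion concrete, here is a minimal sketch of binding the agent-specific settings with `functools.partial`, so that the monitor only ever sees a one-argument factory. Both classes are simplified stand-ins: the constructor signatures, the endpoint value and the executor id are assumptions for illustration, not the actual code under review.

```python
from functools import partial


class MesosDiskCollector(object):
    """Stand-in for the collector in this patch; the constructor signature is assumed."""

    def __init__(self, root, agent_endpoint, executor_id):
        self.root = root
        self.agent_endpoint = agent_endpoint
        self.executor_id = executor_id


class TaskResourceMonitor(object):
    """Stand-in monitor that only knows how to call a one-argument factory."""

    def __init__(self, disk_collector_factory):
        self._disk_collector_factory = disk_collector_factory

    def collector_for(self, root):
        return self._disk_collector_factory(root)


# Bind the agent-specific settings once, at the point where they are known.
factory = partial(
    MesosDiskCollector,
    agent_endpoint="http://localhost:5051",  # hypothetical value
    executor_id="thermos-job-0",             # hypothetical value
)

monitor = TaskResourceMonitor(factory)
collector = monitor.collector_for("/var/sandbox")
print(collector.agent_endpoint)
```

With this shape the monitor's constructor stays unchanged when a collector needs extra configuration; only the place that builds the factory has to know about it.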


3rdparty/python/requirements.txt
Lines 23 (patched)
<https://reviews.apache.org/r/66103/#comment279767>

    Any reason for not using the more widely used `jq`?



examples/vagrant/systemd/aurora-executor.service
Lines 22-25 (patched)
<https://reviews.apache.org/r/66103/#comment279768>

    Snake-case arguments like `log_to_disk`.



src/main/python/apache/thermos/monitoring/disk.py
Lines 153 (patched)
<https://reviews.apache.org/r/66103/#comment279772>

    Calling it API_URL is misleading, since the agent's HTTP endpoints include the regular status, flags and metrics endpoints as well as the new HTTP API.
    
    s/API_URL/AGENT_HTTP_ENDPOINT/
    
    The new HTTP API lives under /api, and that is not what we are calling here.



src/main/python/apache/thermos/monitoring/disk.py
Lines 154 (patched)
<https://reviews.apache.org/r/66103/#comment279773>

    Can we use the `/containers` endpoint, which should list all containers (AFAIK), and get rid of the `API_PATH` configuration parameter?



src/main/python/apache/thermos/monitoring/resource.py
Line 158 (original), 159 (patched)
<https://reviews.apache.org/r/66103/#comment279779>

    Call this `disk_collector_class`? It reads a little weird when we call this `disk_collector`, which suggests it is the actual object to be used.



src/main/python/apache/thermos/monitoring/resource.py
Lines 164 (patched)
<https://reviews.apache.org/r/66103/#comment279780>

    Can we combine this via partial function to the `disk_collector_class` argument? This will keep the constructor more idiomatic.


- Santhosh Kumar Shanmugham


Re: Review Request 66103: Introduce mesos disk collector

Posted by Reza Motamedi <re...@gmail.com>.

> On March 19, 2018, 11:30 p.m., Kai Huang wrote:
> > src/main/python/apache/thermos/monitoring/disk.py
> > Lines 96 (patched)
> > <https://reviews.apache.org/r/66103/diff/1/?file=1982415#file1982415line96>
> >
> >     Just curious: is there any reason we set this value to -1GB?

It is just a magic number. However, the observer seems to show negative numbers as `-0.0GB`, so I guess I can just set it to `-1`. The point of reporting a negative number is to show the user that something is wrong with the setup.
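
A minimal sketch of that idea, assuming a hypothetical `fetch` callable that returns the usage in bytes or raises on failure. The names `UNKNOWN_DISK_USAGE` and `safe_disk_usage` are illustrative only and do not come from the patch.

```python
import logging

log = logging.getLogger(__name__)

# Sentinel reported when usage cannot be determined; negative so it stands out.
UNKNOWN_DISK_USAGE = -1


def safe_disk_usage(fetch, container_id):
    """Call fetch(container_id); on any failure log a warning and return the sentinel."""
    try:
        return fetch(container_id)
    except Exception as e:
        log.warning("Could not read disk usage for %s: %s", container_id, e)
        return UNKNOWN_DISK_USAGE


def broken_fetch(container_id):
    # Simulates an unreachable agent.
    raise IOError("agent unreachable")


print(safe_disk_usage(lambda cid: 2048, "c1"))
print(safe_disk_usage(broken_fetch, "c2"))
```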


- Reza


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199486
-----------------------------------------------------------


Re: Review Request 66103: Introduce mesos disk collector

Posted by Kai Huang <te...@hotmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199486
-----------------------------------------------------------




src/main/python/apache/aurora/tools/thermos_observer.py
Lines 92 (patched)
<https://reviews.apache.org/r/66103/#comment279790>

    remove the comment?



src/main/python/apache/aurora/tools/thermos_observer.py
Lines 128 (patched)
<https://reviews.apache.org/r/66103/#comment279791>

    separate the args across multiple lines.



src/main/python/apache/thermos/monitoring/disk.py
Lines 96 (patched)
<https://reviews.apache.org/r/66103/#comment279797>

    Just curious: is there any reason we set this value to -1GB?



src/main/python/apache/thermos/monitoring/disk.py
Lines 116 (patched)
<https://reviews.apache.org/r/66103/#comment279794>

    parameterize the template?



src/main/python/apache/thermos/monitoring/disk.py
Lines 121 (patched)
<https://reviews.apache.org/r/66103/#comment279793>

    same here?
    
    s/log.info/log.warn/



src/main/python/apache/thermos/monitoring/resource.py
Lines 257 (patched)
<https://reviews.apache.org/r/66103/#comment279795>

    nit. split args in two lines


- Kai Huang


Re: Review Request 66103: Introduce mesos disk collector

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199418
-----------------------------------------------------------



Master (aaadad7) is red with this patch.
  ./build-support/jenkins/build.sh

	at org.easymock.internal.ReplayState.invoke(ReplayState.java:46)
	at org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:40)
	at org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:94)
	at com.sun.proxy.$Proxy21.changeState(Unknown Source)
	at org.apache.aurora.scheduler.TaskStatusHandlerImpl.lambda$run$0(TaskStatusHandlerImpl.java:158)
	at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
	at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
	at org.apache.aurora.scheduler.storage.testing.StorageTestUtil.lambda$expectWrite$1(StorageTestUtil.java:83)
	at org.easymock.internal.Result.answer(Result.java:106)
	at org.easymock.internal.ReplayState.invokeInner(ReplayState.java:60)
	at org.easymock.internal.ReplayState.invoke(ReplayState.java:46)
	at org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:40)
	at org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:94)
	at com.sun.proxy.$Proxy20.write(Unknown Source)
	at org.apache.aurora.scheduler.TaskStatusHandlerImpl.run(TaskStatusHandlerImpl.java:154)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
	at java.lang.Thread.run(Thread.java:748)

I0319 16:19:42.150 [ShutdownHook, SchedulerMain] Stopping scheduler services. 

1081 tests completed, 1 failed, 1 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:jacocoTestCoverageVerification

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 7m 40s
45 actionable tasks: 36 executed, 9 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


Re: Review Request 66103: Introduce mesos disk collector

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199521
-----------------------------------------------------------



This patch does not apply cleanly against master (b3fa9fe), do you need to rebase?

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On March 20, 2018, 5:37 a.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 20, 2018, 5:37 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch that introduces `MesosDiskCollector` which queries the agent's API endpoint to lookup disk_used_bytes. Note that there is also resource monitoring in thermos executor. Currently, I left the disk collector there to use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/aurora-executor.service 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/resource.py f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/task_observer.py 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/2/
> 
> 
> Testing
> -------
> 
> I added unit tests.
> Tested in vagrant and it works as intended.
> I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Reza Motamedi <re...@gmail.com>.

> On March 21, 2018, 5:39 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/aurora/tools/thermos_observer.py
> > Lines 89 (patched)
> > <https://reviews.apache.org/r/66103/diff/3/?file=1983461#file1983461line89>
> >
> >     The agent's HTTP endpoints can have AuthN/AuthZ enabled. We can either add an option to specify the credentials file to be used while talking to the endpoint or we can call this out as a limitation for now.

Good point. I updated the help message for the `--enable_mesos_disk_collector` argument to make this incompatibility clear.


- Reza


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199680
-----------------------------------------------------------


On March 20, 2018, 5:20 p.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 20, 2018, 5:20 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in thermos executor. Currently, I left the disk collector there to use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
>   src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
>   src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/3/
> 
> 
> Testing
> -------
> 
> I added unit tests.
> Tested in vagrant and it works as intended.
> I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199680
-----------------------------------------------------------




src/main/python/apache/aurora/tools/thermos_observer.py
Lines 89 (patched)
<https://reviews.apache.org/r/66103/#comment280064>

    The agent's HTTP endpoints can have AuthN/AuthZ enabled. We can either add an option to specify the credentials file to be used while talking to the endpoint or we can call this out as a limitation for now.


- Santhosh Kumar Shanmugham


On March 20, 2018, 10:20 a.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 20, 2018, 10:20 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in thermos executor. Currently, I left the disk collector there to use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
>   src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
>   src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/3/
> 
> 
> Testing
> -------
> 
> I added unit tests.
> Tested in vagrant and it works as intended.
> I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199560
-----------------------------------------------------------


Ship it!




Master (b3fa9fe) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On March 20, 2018, 5:20 p.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 20, 2018, 5:20 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in thermos executor. Currently, I left the disk collector there to use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
>   src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
>   src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/3/
> 
> 
> Testing
> -------
> 
> I added unit tests.
> Tested in vagrant and it works as intended.
> I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199735
-----------------------------------------------------------


Ship it!




Master (f32086d) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On March 21, 2018, 11:44 p.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 21, 2018, 11:44 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in thermos executor. Currently, I left the disk collector there to use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
>   src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
>   src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/4/
> 
> 
> Testing
> -------
> 
> I added unit tests.
> Tested in vagrant and it works as intended.
> I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199732
-----------------------------------------------------------



Update RELEASE_NOTES.


src/main/python/apache/thermos/monitoring/disk.py
Lines 132-135 (patched)
<https://reviews.apache.org/r/66103/#comment280170>

    What happens if `GET` returns 5xx or 4xx? Will this crash the Observer? We should add more logging and fail gracefully. In particular, this can happen if someone has HTTP Auth enabled for the Mesos endpoints, in which case deploying this has the potential to crash all the Observers.
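
A minimal sketch of the graceful failure asked for above, not the code in the patch. It assumes Python 3 and the standard library's `urllib`; the function name `fetch_containers_json`, the injectable `opener` parameter, and the timeout value are hypothetical and exist only to make the idea concrete and testable. Any 4xx/5xx response (for example a 401 when HTTP auth is enabled on the agent) or network error is logged and turned into `None` instead of propagating into the Observer.

```python
import logging
import urllib.error
import urllib.request

log = logging.getLogger(__name__)


def fetch_containers_json(url, opener=urllib.request.urlopen, timeout=5.0):
    """Return the raw response body, or None on any HTTP or network failure.

    `opener` is injectable (hypothetical parameter) so the failure paths can
    be exercised without a running agent.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except urllib.error.HTTPError as e:
        # 4xx/5xx from the agent, e.g. 401 when HTTP auth is enabled.
        log.warning('Agent returned HTTP %s for %s', e.code, url)
    except (urllib.error.URLError, OSError) as e:
        # Connection refused, DNS failure, timeout, and similar.
        log.warning('Could not reach agent at %s: %s', url, e)
    return None
```

A caller can treat `None` as "no sample this round" and keep the previous disk value, so a misconfigured endpoint degrades the metric rather than the process.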


- Santhosh Kumar Shanmugham


On March 21, 2018, 4:44 p.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 21, 2018, 4:44 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in thermos executor. Currently, I left the disk collector there to use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
>   src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
>   src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/4/
> 
> 
> Testing
> -------
> 
> I added unit tests.
> Tested in vagrant and it works as intended.
> I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199763
-----------------------------------------------------------


Ship it!




Master (f32086d) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On March 22, 2018, 7:02 a.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 22, 2018, 7:02 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in thermos executor. Currently, I left the disk collector there to use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   RELEASE-NOTES.md 51ab6c724694244bf616b29e9beace4a4a3f5252 
>   docs/reference/observer-configuration.md 8a443c94f7f37f9454989781f722101a97c99f15 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
>   src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
>   src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/6/
> 
> 
> Testing
> -------
> 
> - I added unit tests.
> - Tested in vagrant and it works as intended.
> - I also built and deployed it in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> Here is one specific test setup: On two hosts I created two tasks. Each task creates an identical nested directory structure with files in it. The overall size is 30GB. test_host_1 runs the current version of Observer, and test_host_2 runs Observer with this patch and with mesos_disk_collection enabled. The results are as follows:
> 
> ```
> rezam[7]TEST_HOST_1 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
> Thu Mar 22 04:36:17 UTC 2018
> observer.observer_cpu 108.9
> Thu Mar 22 04:36:27 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:38 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:48 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:58 UTC 2018
> observer.observer_cpu 111.0
> Thu Mar 22 04:37:08 UTC 2018
> observer.observer_cpu 111.0
> Thu Mar 22 04:37:18 UTC 2018
> observer.observer_cpu 111.0
> 
> 
> rezam[7]TEST_HOST_2 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
> Thu Mar 22 04:36:20 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:30 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:40 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:50 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:00 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:10 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:20 UTC 2018
> observer.observer_cpu 1.8
> ```
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Reza Motamedi <re...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
-----------------------------------------------------------

(Updated March 22, 2018, 2:02 p.m.)


Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.


Changes
-------

Adding a test for unexpected response format, e.g. a dict instead of a list.
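
As an illustration of the case this test covers, here is a hedged sketch of defensive parsing, not the code in the patch. It assumes the agent's response is a JSON list of container objects, each carrying an `executor_id` and a `statistics` object with `disk_used_bytes`; the helper name `disk_used_bytes` and its signature are hypothetical. Anything that is not a list (for example a dict) yields `None` rather than an exception.

```python
import json


def disk_used_bytes(body, executor_id):
    """Return disk_used_bytes for executor_id, or None if it cannot be found.

    The response shape (a list of dicts with 'executor_id' and 'statistics')
    is an assumption made for this sketch.
    """
    try:
        containers = json.loads(body)
    except (TypeError, ValueError):
        # Not JSON at all (ValueError covers json.JSONDecodeError).
        return None
    if not isinstance(containers, list):
        # Unexpected format, e.g. a dict instead of a list.
        return None
    for container in containers:
        if not isinstance(container, dict):
            continue
        if container.get('executor_id') != executor_id:
            continue
        stats = container.get('statistics')
        if isinstance(stats, dict):
            return stats.get('disk_used_bytes')
        return None
    return None
```

Returning `None` for every malformed case keeps the decision about a fallback value in one place, in the caller.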


Repository: aurora


Description
-------

When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in thermos executor. Currently, I left the disk collector there to use the `du` implementation. That can be changed in a later patch.

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.


Diffs (updated)
-----

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  RELEASE-NOTES.md 51ab6c724694244bf616b29e9beace4a4a3f5252 
  docs/reference/observer-configuration.md 8a443c94f7f37f9454989781f722101a97c99f15 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
  src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
  src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
  src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
  src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
  src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 


Diff: https://reviews.apache.org/r/66103/diff/6/

Changes: https://reviews.apache.org/r/66103/diff/5-6/


Testing
-------

- I added unit tests.
- Tested in vagrant and it works as intended.
- I also built and deployed in our test environment. To measure the performance improvement, I created jobs with nested folders and noticed a reduction in CPU utilization of the Observer process of at least 60% (1.5 CPU cores to 0.4 CPU cores).

Here is one specific test setup: on two hosts I created two tasks. Each task creates an identical nested directory structure with files in it. The overall size is 30GB. test_host_1 runs the current version of the Observer, and test_host_2 runs the Observer with this patch and also has mesos_disk_collection enabled. The results are as follows:

```
rezam[7]TEST_HOST_1 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:17 UTC 2018
observer.observer_cpu 108.9
Thu Mar 22 04:36:27 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:38 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:48 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:58 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:08 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:18 UTC 2018
observer.observer_cpu 111.0


rezam[7]TEST_HOST_2 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:20 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:30 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:40 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:50 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:00 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:10 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:20 UTC 2018
observer.observer_cpu 1.8
```
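
To put a single number on the samples above, here is a small sketch that averages the `observer.observer_cpu` readings from each host and computes the relative reduction. The helper name `relative_reduction` is made up for this example, and the lists are copied by hand from the output above.

```python
from statistics import mean

# observer.observer_cpu samples copied from the two runs above.
HOST_1_CPU = [108.9, 123.2, 123.2, 123.2, 111.0, 111.0, 111.0]  # current observer
HOST_2_CPU = [1.3, 1.3, 1.3, 1.2, 1.2, 1.2, 1.8]  # observer with this patch


def relative_reduction(before, after):
    """Fractional drop in the mean value going from `before` to `after`."""
    return 1.0 - mean(after) / mean(before)


reduction = relative_reduction(HOST_1_CPU, HOST_2_CPU)
print("mean before: %.1f, mean after: %.1f, reduction: %.1f%%"
      % (mean(HOST_1_CPU), mean(HOST_2_CPU), 100 * reduction))
```

Seven samples per host over about a minute is a short window, so this is only a summary of this one run and not a general benchmark.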


Thanks,

Reza Motamedi


Re: Review Request 66103: Introduce mesos disk collector

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199741
-----------------------------------------------------------


Ship it!




Master (f32086d) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On March 21, 2018, 10:29 p.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 21, 2018, 10:29 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   RELEASE-NOTES.md 51ab6c724694244bf616b29e9beace4a4a3f5252 
>   docs/reference/observer-configuration.md 8a443c94f7f37f9454989781f722101a97c99f15 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
>   src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
>   src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/5/
> 
> 
> Testing
> -------
> 
> - I added unit tests.
> - Tested in vagrant and it works as intended.
> - I also built and deployed in our test environment. To measure the performance improvement, I created jobs with nested folders and noticed a reduction in CPU utilization of the Observer process of at least 60% (1.5 CPU cores to 0.4 CPU cores).
> 
> Here is one specific test setup: on two hosts I created two tasks. Each task creates an identical nested directory structure with files in it. The overall size is 30GB. test_host_1 runs the current version of the Observer, and test_host_2 runs the Observer with this patch and also has mesos_disk_collection enabled. The results are as follows:
> 
> ```
> rezam[7]TEST_HOST_1 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
> Thu Mar 22 04:36:17 UTC 2018
> observer.observer_cpu 108.9
> Thu Mar 22 04:36:27 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:38 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:48 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:58 UTC 2018
> observer.observer_cpu 111.0
> Thu Mar 22 04:37:08 UTC 2018
> observer.observer_cpu 111.0
> Thu Mar 22 04:37:18 UTC 2018
> observer.observer_cpu 111.0
> 
> 
> rezam[7]TEST_HOST_2 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
> Thu Mar 22 04:36:20 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:30 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:40 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:50 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:00 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:10 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:20 UTC 2018
> observer.observer_cpu 1.8
> ```
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>


Re: Review Request 66103: Introduce mesos disk collector

Posted by Reza Motamedi <re...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
-----------------------------------------------------------

(Updated March 22, 2018, 5:29 a.m.)


Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.


Changes
-------

- Addressed the comment about auth-enabled HTTP APIs on the Mesos agent and added additional tests.
- Updated release notes and config notes.
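
This thread does not say how the patch authenticates against the agent. Purely as an illustration, and assuming the agent's HTTP API is protected with HTTP Basic authentication, a request header could be built as in the sketch below. The helper name `basic_auth_header` and the credentials are hypothetical.

```python
import base64


def basic_auth_header(principal, secret):
    """Build an HTTP Basic Authorization header for the given credentials."""
    raw = ("%s:%s" % (principal, secret)).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "Basic " + token}


# Hypothetical credentials, for illustration only.
headers = basic_auth_header("observer", "not-a-real-secret")
print(sorted(headers))
```

The resulting dict can be passed as the `headers` argument of whichever HTTP client is used to call the agent.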


Repository: aurora


Description
-------

When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.


Diffs (updated)
-----

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  RELEASE-NOTES.md 51ab6c724694244bf616b29e9beace4a4a3f5252 
  docs/reference/observer-configuration.md 8a443c94f7f37f9454989781f722101a97c99f15 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
  src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
  src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
  src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
  src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
  src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 


Diff: https://reviews.apache.org/r/66103/diff/5/

Changes: https://reviews.apache.org/r/66103/diff/4-5/


Testing (updated)
-------

- I added unit tests.
- Tested in vagrant and it works as intended.
- I also built and deployed in our test environment. To measure the performance improvement, I created jobs with nested folders and noticed a reduction in CPU utilization of the Observer process of at least 60% (1.5 CPU cores to 0.4 CPU cores).

Here is one specific test setup: on two hosts I created two tasks. Each task creates an identical nested directory structure with files in it. The overall size is 30GB. test_host_1 runs the current version of the Observer, and test_host_2 runs the Observer with this patch and also has mesos_disk_collection enabled. The results are as follows:

```
rezam[7]TEST_HOST_1 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:17 UTC 2018
observer.observer_cpu 108.9
Thu Mar 22 04:36:27 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:38 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:48 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:58 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:08 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:18 UTC 2018
observer.observer_cpu 111.0


rezam[7]TEST_HOST_2 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:20 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:30 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:40 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:50 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:00 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:10 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:20 UTC 2018
observer.observer_cpu 1.8
```


Thanks,

Reza Motamedi


Re: Review Request 66103: Introduce mesos disk collector

Posted by Reza Motamedi <re...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
-----------------------------------------------------------

(Updated March 21, 2018, 11:44 p.m.)


Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.


Repository: aurora


Description
-------

When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.


Diffs (updated)
-----

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
  src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
  src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
  src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
  src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
  src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 


Diff: https://reviews.apache.org/r/66103/diff/4/

Changes: https://reviews.apache.org/r/66103/diff/3-4/


Testing
-------

I added unit tests.
Tested in vagrant and it works as intended.
I also built and deployed in our test environment. To measure the performance improvement, I created jobs with nested folders and noticed a reduction in CPU utilization of the Observer process of at least 60% (1.5 CPU cores to 0.4 CPU cores).


Thanks,

Reza Motamedi


Re: Review Request 66103: Introduce mesos disk collector

Posted by Reza Motamedi <re...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
-----------------------------------------------------------

(Updated March 20, 2018, 5:20 p.m.)


Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.


Changes
-------

- rebase from master


Repository: aurora


Description
-------

When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.


Diffs (updated)
-----

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa 
  src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05 
  src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b 
  src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47 
  src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
  src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 


Diff: https://reviews.apache.org/r/66103/diff/3/

Changes: https://reviews.apache.org/r/66103/diff/2-3/


Testing
-------

I added unit tests.
Tested in vagrant and it works as intended.
I also built and deployed in our test environment. To measure the performance improvement, I created jobs with nested folders and noticed a reduction in CPU utilization of the Observer process of at least 60% (1.5 CPU cores to 0.4 CPU cores).


Thanks,

Reza Motamedi


Re: Review Request 66103: Introduce mesos disk collector

Posted by Reza Motamedi <re...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
-----------------------------------------------------------

(Updated March 20, 2018, 5:37 a.m.)


Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.


Changes
-------

- Address feedback
- Push `DISK_COLLECTION_INTERVAL` into `DiskCollectorSettings`
- Introduce `DiskCollectorProvider` to encapsulate the logic for selecting the right disk collection implementation.
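
A minimal sketch of the provider idea from the list above, assuming a settings object with a flag that picks the implementation. The class names follow this review, but the field names, constructor arguments, the `provide` method, and the stand-in collector bodies are all assumptions and not the code in the diff.

```python
from collections import namedtuple

# Illustrative settings; the field names here are assumptions.
DiskCollectorSettings = namedtuple(
    "DiskCollectorSettings", ["collection_interval_secs", "use_mesos_agent"])


class DuDiskCollector(object):
    """Placeholder for the collector that walks the sandbox directory."""

    def __init__(self, sandbox, settings):
        self.sandbox = sandbox
        self.settings = settings


class MesosDiskCollector(object):
    """Placeholder for the collector that asks the Mesos agent instead."""

    def __init__(self, sandbox, settings):
        self.sandbox = sandbox
        self.settings = settings


class DiskCollectorProvider(object):
    """Chooses a disk collector class based on the settings."""

    def __init__(self, settings):
        self.settings = settings

    def provide(self, sandbox):
        if self.settings.use_mesos_agent:
            cls = MesosDiskCollector
        else:
            cls = DuDiskCollector
        return cls(sandbox, self.settings)


provider = DiskCollectorProvider(DiskCollectorSettings(60, use_mesos_agent=True))
collector = provider.provide("/tmp/example-sandbox")  # hypothetical path
print(type(collector).__name__)
```

Keeping the choice in one provider means callers such as the resource monitor only ask for "a disk collector" and do not need to know which implementation is active.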


Repository: aurora


Description
-------

When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container. 
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up `disk_used_bytes`. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.


Diffs (updated)
-----

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/aurora-executor.service 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
  src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
  src/main/python/apache/thermos/monitoring/resource.py f5e3849ca6682c6d4720698be869ca6b9f703b94 
  src/main/python/apache/thermos/observer/task_observer.py 4bb5d239e81fe4659397f899760c0e8853e93786 
  src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8 
  src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426 


Diff: https://reviews.apache.org/r/66103/diff/2/

Changes: https://reviews.apache.org/r/66103/diff/1-2/


Testing
-------

I added unit tests.
Tested in vagrant and it works as intended.
I also built and deployed in our test environment. To measure the performance improvement, I created jobs with nested folders and noticed a reduction in CPU utilization of the Observer process of at least 60% (1.5 CPU cores to 0.4 CPU cores).


Thanks,

Reza Motamedi