Posted to dev@mesos.apache.org by Jorge Machado <jo...@me.com.INVALID> on 2019/03/22 07:35:07 UTC

[MESOS-8248] - Expose information about GPU assigned to a task

Hi Mesos devs, 

For our use case, we need per-task GPU resource usage from Mesos so that we can build Grafana dashboards on it. To get the metrics into Grafana we will send them to Prometheus; the main problem is how to collect them reliably.
I am proposing the following:

In mesos.proto, and in the copy of mesos.proto under v1, add the following to the ResourceStatistics message:

// GPU statistics for each container
optional int32 gpu_idx = 50;
optional string gpu_uuid = 51;
optional string device_name = 52;
optional uint64 gpu_memory_used_mb = 53;
optional uint64 gpu_memory_total_mb = 54;
optional double gpu_usage = 55;
optional int32 gpu_temperature = 56;
optional int32 gpu_frequency_MHz = 57;
optional int32 gpu_power_used_W = 58;

As a first step, I would like to change NvidiaGpuIsolatorProcess in isolator.cpp and make the NVML calls from its usage method. As I'm new to this, I would appreciate some guidance.

My questions:  

Does the NvidiaGpuIsolatorProcess already run inside the container, or only outside in the agent? (I'm assuming outside.)
From what I saw, the CPU metrics are gathered inside the container; for the GPU we could do it in the NvidiaGpuIsolatorProcess and get the metrics via the host.
Is there anything else I should check?

Thanks a lot

Jorge Machado
www.jmachado.me

Re: [MESOS-8248] - Expose information about GPU assigned to a task

Posted by Gilbert Song <gi...@mesosphere.io>.

Thanks for the feedback, BenM!

Jorge, would you mind addressing BenM's comment above and putting the proposal
in a Google Doc?

We could discuss this proposal in the next Containerization WG meeting on April
4th (please add an agenda item and link your proposal):
https://docs.google.com/document/d/1z55a7tLZFoRWVuUxz1FZwgxkHeugtc2nHR89skFXSpU/edit#heading=h.978qjujkxfvu

-Gilbert

On Fri, Mar 22, 2019 at 12:19 PM Benjamin Mahler <be...@gmail.com>
wrote:

> Containers can be assigned multiple GPUs, so I assume you're thinking of
> putting these metrics in a repeated message? (similar to DiskStatistics)
>
> It has seemed to me we should probably make this Nvidia specific (e.g.
> NvidiaGPUStatistics). In the past we thought generalizing this would be
> good, but there's only Nvidia support at the moment and we haven't been
> able to make sure that other GPU libraries provide the same information.
>
> For each metric can you also include the relevant calls from NVML for
> obtaining the information? Can you also highlight what cadvisor provides to
> make sure we don't miss anything? From my read of their code, it seems to
> be a subset of what you listed?
>
> https://github.com/google/cadvisor/blob/e310755a36728b457fcc1de6b54bb4c6cb38f031/accelerators/nvidia.go#L216-L246
>
> On Fri, Mar 22, 2019 at 6:58 AM Jorge Machado <jo...@me.com.invalid>
> wrote:
>
> > another way would be to just use cadvisor

Re: [MESOS-8248] - Expose information about GPU assigned to a task

Posted by Benjamin Mahler <be...@gmail.com>.

Containers can be assigned multiple GPUs, so I assume you're thinking of
putting these metrics in a repeated message? (similar to DiskStatistics)

It has seemed to me we should probably make this Nvidia specific (e.g.
NvidiaGPUStatistics). In the past we thought generalizing this would be
good, but there's only Nvidia support at the moment and we haven't been
able to make sure that other GPU libraries provide the same information.

For each metric can you also include the relevant calls from NVML for
obtaining the information? Can you also highlight what cadvisor provides to
make sure we don't miss anything? From my read of their code, it seems to
be a subset of what you listed?
https://github.com/google/cadvisor/blob/e310755a36728b457fcc1de6b54bb4c6cb38f031/accelerators/nvidia.go#L216-L246
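
As a rough sketch of the metric-to-NVML mapping asked about above, the proposed fields could be filled per GPU along these lines. This is an illustration under assumptions, not the Mesos implementation: RawGpuSample, NvidiaGpuStatistics, toStatistics and collect are hypothetical names, the struct is a plain C++ stand-in for one entry of a repeated protobuf message, and "mb" is assumed to mean MiB. The NVML functions named in the comments are the ones that report each value; NVML returns memory in bytes and power in milliwatts, so both need converting.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical raw reading for one GPU, in the units NVML reports.
// In an isolator these values would come from the NVML calls named in
// the comments; here they are plain inputs so the sketch builds anywhere.
struct RawGpuSample {
  unsigned int index;            // nvmlDeviceGetIndex
  std::string uuid;              // nvmlDeviceGetUUID
  std::string name;              // nvmlDeviceGetName
  uint64_t memoryUsedBytes;      // nvmlDeviceGetMemoryInfo (nvmlMemory_t::used)
  uint64_t memoryTotalBytes;     // nvmlDeviceGetMemoryInfo (nvmlMemory_t::total)
  unsigned int utilizationPct;   // nvmlDeviceGetUtilizationRates (nvmlUtilization_t::gpu)
  unsigned int temperatureC;     // nvmlDeviceGetTemperature with NVML_TEMPERATURE_GPU
  unsigned int smClockMHz;       // nvmlDeviceGetClockInfo with NVML_CLOCK_SM
  unsigned int powerMilliwatts;  // nvmlDeviceGetPowerUsage
};

// Hypothetical stand-in for one entry of a repeated, Nvidia-specific
// statistics message, using the field names proposed in this thread.
struct NvidiaGpuStatistics {
  int32_t gpu_idx;
  std::string gpu_uuid;
  std::string device_name;
  uint64_t gpu_memory_used_mb;
  uint64_t gpu_memory_total_mb;
  double gpu_usage;  // Assumed to be a percentage (0-100).
  int32_t gpu_temperature;
  int32_t gpu_frequency_MHz;
  int32_t gpu_power_used_W;
};

// Converts one raw sample into the proposed per-GPU statistics.
NvidiaGpuStatistics toStatistics(const RawGpuSample& raw) {
  assert(raw.memoryUsedBytes <= raw.memoryTotalBytes);

  const uint64_t bytesPerMiB = 1024ULL * 1024ULL;

  NvidiaGpuStatistics stats;
  stats.gpu_idx = static_cast<int32_t>(raw.index);
  stats.gpu_uuid = raw.uuid;
  stats.device_name = raw.name;
  stats.gpu_memory_used_mb = raw.memoryUsedBytes / bytesPerMiB;
  stats.gpu_memory_total_mb = raw.memoryTotalBytes / bytesPerMiB;
  stats.gpu_usage = static_cast<double>(raw.utilizationPct);
  stats.gpu_temperature = static_cast<int32_t>(raw.temperatureC);
  stats.gpu_frequency_MHz = static_cast<int32_t>(raw.smClockMHz);
  // Integer division truncates fractions of a watt.
  stats.gpu_power_used_W = static_cast<int32_t>(raw.powerMilliwatts / 1000);
  return stats;
}

// A container can hold several GPUs, so the result is one entry per
// assigned device (the analogue of a repeated field).
std::vector<NvidiaGpuStatistics> collect(
    const std::vector<RawGpuSample>& assigned) {
  std::vector<NvidiaGpuStatistics> result;
  result.reserve(assigned.size());
  for (const RawGpuSample& raw : assigned) {
    result.push_back(toStatistics(raw));
  }
  return result;
}
```

Calling collect with the samples for every GPU assigned to a container yields one statistics entry per device, which is the shape a repeated field would carry.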

On Fri, Mar 22, 2019 at 6:58 AM Jorge Machado <jo...@me.com.invalid> wrote:

> another way would be to just use cadvisor

Re: [MESOS-8248] - Expose information about GPU assigned to a task

Posted by Jorge Machado <jo...@me.com.INVALID>.

Another way would be to just use cadvisor.
