Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/09/10 21:17:35 UTC

[GitHub] [druid] mghosh4 opened a new issue #10378: Adding Worker Count Metrics to Druid Overlord

mghosh4 opened a new issue #10378:
URL: https://github.com/apache/druid/issues/10378


   ### Motivation
   
   The primary motivation for this work is to provide more visibility into worker utilization over time. Monitoring utilization helps cluster administrators decide when to add workers to the pool or remove them from it. As adoption of native ingestion grows, this has become even more important.
   
   ### Proposed changes
   
   I propose adding a new `WorkerCountStatsMonitor` to the Overlord, similar to the existing `TaskCountStatsMonitor` class. It will expose the following metrics:
   
   - `worker/total/count`: Total number of workers
   - `worker/idle/count`: Total number of workers available to accept tasks
   - `worker/used/count`: Total number of workers currently in use
   - `worker/lazy/count`: Total number of workers that have been marked lazy
   - `worker/blacklisted/count`: Total number of workers that have been blacklisted
   
   ### Proposed Design
   
   I am planning to add the following APIs to `TaskRunner`:
   
   ```java
     /**
      * APIs used to emit statistics for {@link WorkerCountStatsMonitor}.
      */
     long getTotalWorkerCount();
   
     long getIdleWorkerCount();
   
     long getUsedWorkerCount();
   
     long getLazyWorkerCount();
   
     long getBlacklistedWorkerCount();
   ```
   
   The implementation of `WorkerCountStatsMonitor` will be similar to that of `TaskCountStatsMonitor`. It will use a `WorkerCountStatsProvider` interface, which will be implemented by `TaskMaster`. `TaskMaster` will in turn use `taskRunner` to obtain the values for the required metrics.
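   
   The flow described above could look roughly like the following. This is a minimal, self-contained sketch and not Druid code: the class `WorkerCountStatsSketch` and the helpers `collect` and `fixedProvider` are hypothetical names used only for illustration, and the wiring to Druid's actual monitor and emitter classes is omitted. Only the getter names and metric names are taken from this proposal.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   
   public class WorkerCountStatsSketch {
   
     /** Assumed shape of the provider, mirroring the getters proposed for TaskRunner. */
     interface WorkerCountStatsProvider {
       long getTotalWorkerCount();
       long getIdleWorkerCount();
       long getUsedWorkerCount();
       long getLazyWorkerCount();
       long getBlacklistedWorkerCount();
     }
   
     /** Reads one snapshot from the provider, keyed by the proposed metric names. */
     static Map<String, Long> collect(WorkerCountStatsProvider provider) {
       Map<String, Long> metrics = new LinkedHashMap<>();
       metrics.put("worker/total/count", provider.getTotalWorkerCount());
       metrics.put("worker/idle/count", provider.getIdleWorkerCount());
       metrics.put("worker/used/count", provider.getUsedWorkerCount());
       metrics.put("worker/lazy/count", provider.getLazyWorkerCount());
       metrics.put("worker/blacklisted/count", provider.getBlacklistedWorkerCount());
       return metrics;
     }
   
     /** Hypothetical stand-in for TaskMaster: returns fixed values instead of asking a task runner. */
     static WorkerCountStatsProvider fixedProvider(
         long total, long idle, long used, long lazy, long blacklisted) {
       return new WorkerCountStatsProvider() {
         @Override public long getTotalWorkerCount() { return total; }
         @Override public long getIdleWorkerCount() { return idle; }
         @Override public long getUsedWorkerCount() { return used; }
         @Override public long getLazyWorkerCount() { return lazy; }
         @Override public long getBlacklistedWorkerCount() { return blacklisted; }
       };
     }
   
     public static void main(String[] args) {
       // A real monitor would hand each value to an emitter; here we just print them.
       collect(fixedProvider(10, 4, 6, 1, 2))
           .forEach((name, value) -> System.out.println(name + " = " + value));
     }
   }
   ```
   
   Keeping the monitor behind a small provider interface means it can be unit tested with fixed values like these, without starting a task runner.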
   
   ### Operational impact
   
   This change has no operational impact. It only adds new metrics for better cluster monitoring.
   
   ### Test plan 
   
   In addition to unit tests, we plan to test this in our local Druid clusters. We will use the statsd-based emitter to collect the newly emitted metrics for visualization.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #10378: Adding Task Slot Count Metrics to Druid Overlord

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #10378:
URL: https://github.com/apache/druid/issues/10378#issuecomment-698005593


   > > * `taskSlot/lazy/count`: Total number of task slots that have been marked lazy
   > > * `taskSlot/blacklisted/count`: Total task slots of peons that have been blacklisted
   > 
   > I think these 2 metrics will fit better to middleManagers/indexers rather than task slots, since middleManagers/indexers can be marked as lazy or blacklisted, not individual task slots. How about `worker/lazy/count` and `worker/blacklisted/count` (or `indexingWorker/lazy/count` and `indexingWorker/blacklisted/count`)?
   
   On second thought, these 2 metrics seem useful as well, since they are easier to compare with other metrics such as total or idle. But I would suggest fixing the descriptions, since task slots are not marked as lazy or blacklisted; middleManagers and indexers are. Maybe they can be "total number of task slots of lazy/blacklisted middleManagers and indexers".
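   
   To make the suggested wording concrete, here is a minimal sketch of that definition: the slot metric is the sum of capacities over workers in a given state, not a count of the workers themselves. The `Worker` class and its fields are hypothetical and are not Druid's actual worker representation.
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   
   public class TaskSlotCountSketch {
   
     /** Hypothetical view of one middleManager or indexer. */
     static final class Worker {
       final int capacity;        // number of task slots on this worker
       final boolean lazy;        // worker has been marked lazy
       final boolean blacklisted; // worker has been blacklisted
   
       Worker(int capacity, boolean lazy, boolean blacklisted) {
         this.capacity = capacity;
         this.lazy = lazy;
         this.blacklisted = blacklisted;
       }
     }
   
     /** Total task slots belonging to workers that are marked lazy. */
     static long lazyTaskSlots(List<Worker> workers) {
       return workers.stream().filter(w -> w.lazy).mapToLong(w -> w.capacity).sum();
     }
   
     /** Total task slots belonging to workers that are blacklisted. */
     static long blacklistedTaskSlots(List<Worker> workers) {
       return workers.stream().filter(w -> w.blacklisted).mapToLong(w -> w.capacity).sum();
     }
   
     public static void main(String[] args) {
       List<Worker> workers = Arrays.asList(
           new Worker(4, false, false),
           new Worker(2, true, false),
           new Worker(3, false, true));
       System.out.println("lazy slots = " + lazyTaskSlots(workers));
       System.out.println("blacklisted slots = " + blacklistedTaskSlots(workers));
     }
   }
   ```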




[GitHub] [druid] jon-wei closed issue #10378: Adding Task Slot Count Metrics to Druid Overlord

Posted by GitBox <gi...@apache.org>.
jon-wei closed issue #10378:
URL: https://github.com/apache/druid/issues/10378


   




[GitHub] [druid] RoseDaniel commented on issue #10378: Adding Peon Metrics to Druid Overlord

Posted by GitBox <gi...@apache.org>.
RoseDaniel commented on issue #10378:
URL: https://github.com/apache/druid/issues/10378#issuecomment-693007122


   This is a fantastic idea @mghosh4 !




[GitHub] [druid] jihoonson commented on issue #10378: Adding Task Slot Count Metrics to Druid Overlord

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #10378:
URL: https://github.com/apache/druid/issues/10378#issuecomment-693130076


   Hi @mghosh4, thanks for the nice proposal! 
   
   > * `taskSlot/total/count`: Total number of task slots per emission period
   > * `taskSlot/idle/count`: Total number of task slots available for adding tasks
   > * `taskSlot/used/count`: Total number of task slots being currently used
   
   These 3 metrics will definitely be useful! To be honest, I can't believe we don't have these yet.
   
   > * `taskSlot/lazy/count`: Total number of task slots that have been marked lazy
   > * `taskSlot/blacklisted/count`: Total task slots of peons that have been blacklisted
   
   I think these 2 metrics will fit better to middleManagers/indexers rather than task slots, since middleManagers/indexers can be marked as lazy or blacklisted, not individual task slots. How about `worker/lazy/count` and `worker/blacklisted/count` (or `indexingWorker/lazy/count` and `indexingWorker/blacklisted/count`)?

