Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/03/08 22:17:03 UTC

[GitHub] [incubator-druid] justinborromeo opened a new issue #7217: [PROPOSAL] API Endpoint for Supervisor Errors

URL: https://github.com/apache/incubator-druid/issues/7217
 
 
   ## Motivation (see #6571)
   
   Currently, there's a status API endpoint that allows users to retrieve the status of a supervisor and its tasks.  However, if errors are occurring, users have no way of determining what exceptions are being thrown without digging into log messages.  If there were an API endpoint that returned the error-level exceptions for a specific supervisor, diagnosing issues would be much easier.
   
   ## Analysis of Options
   
   The two design decisions that need to be made are how the exceptions will be stored and the design of the API.
   
   #### Storage
   
   1. In-Memory: Write each logged exception to an on-heap circular buffer on the leading Overlord, with a configurable maximum number of elements.
   2. Database (use metadata store): Write every logged exception to some sort of system table.  Periodically run a task to purge old events.
   3. Druid?  It would be neat to be able to perform Druid queries (especially SQL) on log data.
   
   #### API
   
   1. Return all the stored exceptions (the number stored is configurable)
   2. Return the last _m_ types of exceptions with the timestamps of their last _n_ occurrences
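
To make API option 2 concrete, here is a minimal sketch of how "the last _m_ types with their last _n_ timestamps" could be tracked. The class name `RecentExceptionTypes` and its methods are hypothetical and are not part of Druid; keying by exception class name is also an assumption, since the proposal does not define what counts as a "type".

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tracker for API option 2 (not an existing Druid class):
// keeps the m most recently seen exception types, each with the
// timestamps of its last n occurrences.
final class RecentExceptionTypes {
  private final int maxOccurrences; // n
  private final LinkedHashMap<String, Deque<Instant>> byType;

  RecentExceptionTypes(final int maxTypes, int maxOccurrences) {
    if (maxTypes <= 0 || maxOccurrences <= 0) {
      throw new IllegalArgumentException("limits must be positive");
    }
    this.maxOccurrences = maxOccurrences;
    // accessOrder=true moves a type to the end whenever it is seen again,
    // so the eldest entry is always the least recently seen type.
    this.byType = new LinkedHashMap<String, Deque<Instant>>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Deque<Instant>> eldest) {
        return size() > maxTypes; // evict once more than m types are held
      }
    };
  }

  synchronized void record(Throwable t, Instant when) {
    // Assumption: the "type" of an exception is its class name.
    Deque<Instant> times =
        byType.computeIfAbsent(t.getClass().getName(), k -> new ArrayDeque<>());
    if (times.size() == maxOccurrences) {
      times.removeFirst(); // drop the oldest timestamp for this type
    }
    times.addLast(when);
  }

  // Returns a copy ordered from least to most recently seen type.
  synchronized Map<String, List<Instant>> snapshot() {
    Map<String, List<Instant>> copy = new LinkedHashMap<>();
    for (Map.Entry<String, Deque<Instant>> e : byType.entrySet()) {
      copy.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
    return copy;
  }
}
```

Compared with option 1, this keeps rare exception types visible even when one noisy exception dominates the log, at the cost of dropping the individual messages.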
   
   ## Proposed Changes and Rationale
   
   For the sake of simplicity, I propose using the in-memory storage approach.  The one disadvantage of this approach is that, unlike the database approach, the stored errors are lost if the Overlord goes down, so users can't use the endpoint for post-mortems.  This likely won't be a significant disadvantage because the logs can always be analyzed in the event of Overlord failure.
   
   __Added Configs__:
   `druid.kafka.ingestion.numLoggedErrorsStoredPerSupervisor`: The number of error log messages to store in memory per supervisor.  Config value is an int.
   
   Also for the sake of simplicity, the following API endpoints will return the stored error log messages:
   
   Get errors from a specific supervisor: `GET /druid/indexer/v1/supervisor/<supervisorId>/errors`
   ```
   {
     "supervisorId":_______________,
     "errors":[
       {
         "timestamp":_______________,
         "errorMessage":_______________
       },
       {
         "timestamp":_______________,
         "errorMessage":_______________
       }
     ]
   }
   ```
   
   Bulk-get errors from all supervisors: `GET /druid/indexer/v1/supervisor/errors`
   ```
   {
     "supervisorErrors": [
       {
         "supervisorId":_______________,
         "errors":[
           {
             "timestamp":_______________,
             "errorMessage":_______________
           },
           {
             "timestamp":_______________,
             "errorMessage":_______________
           }
         ]
       }
     ]
   }
   ```
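
As a rough illustration of how the two response shapes relate, the sketch below builds both bodies from one map keyed by supervisor ID. Everything here is an assumption made for illustration: the class `SupervisorErrorResponses` is hypothetical, the timestamp is rendered as an ISO-8601 string (the proposal leaves the format open), and the store is left unbounded to keep the example short.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not an existing Druid class) that assembles the
// two response bodies as nested maps mirroring the JSON shapes.
final class SupervisorErrorResponses {
  // Sorted by supervisor ID so the bulk response has a stable order.
  private final Map<String, List<Map<String, String>>> errorsBySupervisor = new TreeMap<>();

  synchronized void record(String supervisorId, Instant timestamp, String errorMessage) {
    Map<String, String> error = new LinkedHashMap<>();
    // Assumption: timestamps are serialized as ISO-8601 strings.
    error.put("timestamp", timestamp.toString());
    error.put("errorMessage", errorMessage);
    errorsBySupervisor.computeIfAbsent(supervisorId, k -> new ArrayList<>()).add(error);
  }

  // Body for the per-supervisor endpoint.
  synchronized Map<String, Object> forSupervisor(String supervisorId) {
    List<Map<String, String>> stored = errorsBySupervisor.get(supervisorId);
    List<Map<String, String>> copy = new ArrayList<>();
    if (stored != null) {
      copy.addAll(stored);
    }
    Map<String, Object> body = new LinkedHashMap<>();
    body.put("supervisorId", supervisorId);
    body.put("errors", copy);
    return body;
  }

  // Body for the bulk endpoint: one per-supervisor entry for each known ID.
  synchronized Map<String, Object> forAllSupervisors() {
    List<Map<String, Object>> all = new ArrayList<>();
    for (String id : errorsBySupervisor.keySet()) {
      all.add(forSupervisor(id));
    }
    Map<String, Object> body = new LinkedHashMap<>();
    body.put("supervisorErrors", all);
    return body;
  }
}
```

Each element of `supervisorErrors` is exactly the per-supervisor body, so a client can share one parser between the two endpoints.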
   
   I plan to achieve this behaviour by extending EmittingLogger with an `error(Throwable t, String message, String supervisorId)` method that calls Logger#error() and then writes details about the exception to a CircularBuffer in the corresponding supervisor class.  An additional `errors` method would then be added to SupervisorResource to create an endpoint that returns the contents of the corresponding buffer.
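
The buffering behaviour described above might look roughly like the following. This is a stand-in sketch using only the JDK, not the CircularBuffer or EmittingLogger code itself; the names `SupervisorError` and `SupervisorErrorBuffer` are hypothetical, and combining the message with the throwable's string form is an assumption about what "details about the exception" would contain.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical record of one logged error; field names mirror the JSON
// fields in the proposed responses.
final class SupervisorError {
  final Instant timestamp;
  final String errorMessage;

  SupervisorError(Instant timestamp, String errorMessage) {
    this.timestamp = timestamp;
    this.errorMessage = errorMessage;
  }
}

// Hypothetical bounded buffer, one per supervisor: once `capacity` errors
// are stored, each new error overwrites the oldest one.
final class SupervisorErrorBuffer {
  private final int capacity;
  private final Deque<SupervisorError> errors = new ArrayDeque<>();

  SupervisorErrorBuffer(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  // Would be called right after the normal error-level log call.
  synchronized void record(Throwable t, String message) {
    if (errors.size() == capacity) {
      errors.removeFirst(); // evict the oldest entry
    }
    // Assumption: store the log message followed by the throwable's toString().
    errors.addLast(new SupervisorError(Instant.now(), message + ": " + t));
  }

  // Copy returned to the endpoint, oldest error first.
  synchronized List<SupervisorError> snapshot() {
    return new ArrayList<>(errors);
  }
}
```

Here `capacity` plays the role of `druid.kafka.ingestion.numLoggedErrorsStoredPerSupervisor`, and `snapshot()` returns a copy so the endpoint can serialize it without holding the lock.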
   
   ## Operational impact
   
   No significant operational impact.
   
