You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/04/27 08:13:59 UTC

[GitHub] [druid] kfaraz opened a new issue #11165: [Feature] Improve error reporting for task and query failures

kfaraz opened a new issue #11165:
URL: https://github.com/apache/druid/issues/11165


   ## Motivation
   
   Task and query failures in Druid are often difficult to analyze due to missing, incomplete or vague error messages.
   
   A unified error reporting mechanism would improve the experience of a Druid user through:
   - Easier debugging and RCA (without looking at server logs)
   - Richer error messages detailing what went wrong and possible actions for mitigation
   - Homogeneous error reporting across different Druid services, modules and extensions
   - Specifying the severity of errors and other potential side effects
   - Hiding implementation and other sensitive details from the end user
   
   ## Overview of Changes
   
   ### New Classes
   
   - `ErrorTypeProvider`: Multi-bound interface to be implemented by core Druid as well as any extensions that needs to register error types
     - `String getModuleName()`: Namespace denoting name of the extension (or `"druid"` in case of core Druid). Must be unique across extensions.
     - `List<ErrorType> getErrorTypes()`: List of error types for the extension
   
   - `ErrorType`: Denotes a specific category of an error
     - `int code`: Integer code denoting a specific type of error within the namespace. Must be unique within the module.
     - `String messageFormat`: Contains placeholders that can be replaced to get the full error message 
     - additional details e.g. severity
   
   - `ErrorTypeParams`: Denotes the occurrence of an error. Contains params to identify and format the actual `ErrorType` 
     - `String moduleName`
     - `int code`
     - `List<String> messageArgs`: total length of args is limited (current limit on `TaskStatus.errorMsg` is 100)
   
   - `DruidTypedException`: exception that corresponds to an error type
     - `ErrorTypeParams errorTypeParams`
     - `Throwable cause`: optional
   
   - `ErrorMessageFormatter`: (singleton) class that maintains an in-memory mapping from `(moduleName, code)` pair to `ErrorType`
   
   ### Flow
   
   - Core Druid and extensions register their respective error types on startup on Overlord (extensions that are not loaded on Overlord have been addressed later)
   - An in-memory mapping is maintained from `(moduleName, code)` pair to the respective `ErrorType`
   - The persisted `TaskStatus` of any failed task contains an `ErrorTypeParams` rather than the full error message
   - When the status of a Task is requested, the `ErrorTypeParams` of the `TaskStatus` are used by the `ErrorMessageFormatter` to construct the full error message, which is then sent back in the API response
   
   ## Code Snippets
   
   ### Throwing an Exception
   
   e.g., for an extension `kafka-emitter`:
   ```java
   final String topicName = ...;
   try {
   	// ...
   	// Execution happens here
   	// ...
   } catch (InvalidTopicException topicEx) {
   	throw new DruidTypedException(
   		ErrorTypeParams.of(
   			KafkaEmitterErrorTypes.MODULE_NAME, // "kafka-emitter"
   			KafkaEmitterErrorTypes.INVALID_TOPIC, // integer error code
   			// message arguments
   			topicName),
   		topicEx
   	);
   }
   ```
   
   ### Registering Error Types
   
   Binding the ErrorTypeProvider
   ```java
   @Override
   public void configure(Binder binder) {
   	Multibinder.newSetBinder(binder, ErrorTypeProvider.class).addBinding().to(KafkaEmitterErrorTypeProvider.class);
   }
   ```
   
   Listing the error types
   ```java
   public class KafkaEmitterErrorTypeProvider {
   
   	@Override
   	public String getModuleName() {
   		return KafkaEmitterErrorTypes.MODULE_NAME; // "kafka-emitter";
   	}
   
   	@Override
   	public List<ErrorType> getErrorTypes() {
   		// return the list of all error types for this extension
   		return Arrays.asList(
   			...
   			// example error type for invalid topic
   			ErrorType.of(
   				KafkaEmitterErrorTypes.INVALID_TOPIC, 
   				"The given topic name [%s] is invalid. Please provide a valid topic name.")
   			...
   		);
   	}
   
   }
   ```
   
   Mapping error codes to types:
   ```java
   public class ErrorMessageFormatter {
   
   	private final Map<String, Map<Integer, ErrorType>> moduleToErrorTypes = ...;
   
   	@Inject
   	public ErrorMessageFormatter(
   		Set<ErrorTypeProvider> errorTypeProviders) {
   		
   		for (ErrorTypeProvider provider : errorTypeProviders) {
   			// Ensure that there are no module name clashes
   			final String moduleName = provider.getModuleName();
   
   			// Add all error types to the map
   			Map<Integer, ErrorType> errorTypeMap = new ConcurrentHashMap<>();
   			for (ErrorType errorType : provider.getErrorTypes()) {
   				errorTypeMap.put(errorType.getCode(), errorType);
   			}
   		}
   	}
   }
   
   ```
   
   ### Building the full error message (to serve UI requests)
   
   ```java
   public class ErrorMessageFormatter {
   
   	...
   
   	public String getErrorMessage(ErrorTypeParams errorParams) {
   		ErrorType errorType = moduleToErrorTypes
   								.get(errorParams.getModuleName())
   								.get(errorParams.getCode());
   
   		return String.format(
   			errorType.getMessageFormat(),
   			errorParams.getMessageArgs().toArray());
   	}
   
   	...
   
   }
   
   
   ```
   
   ## Design Concerns
   Pros of using error codes:
   - Any operation that involves persisting of a task/query status would have a smaller memory/disk footprint.
   - Less verbose logs
   
   ### Task Failures
   Ingestion and compaction tasks are managed by the Overlord. Thus, the Overlord needs to be aware of the error types to be able to serve task statuses over REST APIs.
   
   ### Query Failures
   Queries (SQL and native) are submitted over HTTP connections and the response can contain the detailed error message in case of failures. Thus the Broker need not be aware of the list of error types as there is no persistence of query status (and hence no requirement of persisting error codes and formatting the error messages when requested).
   
   ### Extensions that are not loaded on Overlord
   There are several extensions in Druid which are not loaded on the Overlord and run only on the Middle Managers/Peons. As these are not loaded on the Overlord, it is not aware of the error types that these extensions can throw.
   
   The approach here can be similar to that in Query Failures above. While communicating to the Overlord, the Middle Manager can send back both the `ErrorType` object (denotes the category of the error) and the `ErrorTypeParams` (denotes a specific error event). The Overlord can then persist the received `ErrorTypeParams` in its task status while also adding an entry to its error type mappings.
   
   ### Storing the mappings from Error Code to Error Type
   In the design discussed above, the error types are maintained in-memory (in the Overlord). If extensions register too many error codes for rare scenarios, it would have an unnecessarily large memory usage which could have been used otherwise.
   
   An alternative approach could be to persist the error types in the metadata store accessed via a small in-memory cache.
   
   Pros:
   - Only the frequently occurring error types would be present in the warmed up cache.
   - Central repo for all error types that can be accessed by both Overlord and Coordinator
   
   Cons:
   - De-duplication of module names and integer error codes would be more expensive
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #11165: [Feature Proposal] Improve error reporting for task and query failures

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11165:
URL: https://github.com/apache/druid/issues/11165#issuecomment-840119258


   @kfaraz thanks for adding more details. Please find my new comments below.
   
   > > In DruidTypedException, what does "typed" mean?
   > 
   > The `Type` refers to the fact that this exception is raised for a specific `ErrorType` (contains code, moduleName and message format). Sorry, I couldn't think of a more appropriate name for this.
   
   1. No worries. I think it should be something else though as now we are going to focus on the ingestion side. Maybe `IngestionException`?
   
   2. FYI, `TaskQueue.attachCallbacks()` is used only in the local mode of overlord (see `druid.indexer.runner.type` in https://druid.apache.org/docs/latest/configuration/index.html#overlord-operations). The local mode is not used in production, but we should handle it as you suggested.
   
   3. In peons and indexers, tasks can currently throw _any exception anywhere_. Peons and indexers catch those exceptions and report task failures to the overlord. I think you will want to change this to not allowing tasks to throw any exceptions, but only the new type of exceptions (`DruidTypedException` or `IngestionException`) because you should be able to convert them to `TaskStatus.failure()`. This means you will have to modify all places where the task throws any exception to use the new exception type. I'm not sure if there is a less invasive way than this..
   
   4. What do you think about exposing error codes to users? We cannot avoid it in some cases anyway such as where the extension is unloaded and the overlord doesn't understand the error code of existing task statuses. It seems more consistent to me but not harm. Maybe `TaskStatus` can only show the error code and `TaskReport` can show the full details of the error.
   
   5. Semi-related, have you thought about adding the new `CANCELED` state for ingestion tasks? I think you may want to do it because the error code doesn't seem reasoable for canceled tasks. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #11165: [Feature Proposal] Improve error reporting for task and query failures

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #11165:
URL: https://github.com/apache/druid/issues/11165#issuecomment-832720263


   Thanks a lotfor the feedback, @jihoonson ! Sorry if the description was somewhat vague.
   
   > If I understand correctly, the ultimate goal is Richer error messages detailing what went wrong and possible actions for mitigation for Easier debugging and RCA (without looking at server logs)
   Yes, your understanding is correct.
   
   By a `a unified error handling and reporting system`, I simply mean using the error codes and the corresponding exception wherever applicable.
   
   > From the user perspective, richer error message for easier debugging is useful when the error is something that the user can work around by adjusting some user settings.
   
   Agreed.
   
   > The Druid query system already has an error-code based error reporting system (see Query exception class). In the query system, there are a few user-facing errors and the current error reporting system has been working well. I think it already provides richer error message with a potential workaround.
   
   The proposed error code system closely resembles the existing system we have for query failures. We already have sufficient error codes and error messages for such failures. So, as you suggest, we can focus only on ingestion task failures in this proposal.
   
   > Thinking about what the "error code" actually is, is it simply a key to choose the proper error formatter when the error is exposed to users?
   
   Yes, "Error Code" is primarily a key to choose the correct error formatter while showing errors to the users.
   It is meant to serve two more (minor) purposes:
   - It is a concise way of representing specific failures in error logs
   - In a corner case where the Overlord encounters an error code that it does not understand (perhaps in case of an error code arising from an extension that is loaded on, say only middle managers), the error code might be displayed as part of a generic message to the user. e.g. `Error occurred on middle manager while running task. Error Code: kafka-indexer.offset.outofrange.`
   
   > Specifying the severity of errors and other potential side effects
   > Hiding implementation and other sensitive details from the end user
   
   In some places, we currently expose the names of the Exception class in the error message itself. Through this proposal, we can prevent leaks of such information as there would be curated messages corresponding to each known code, and any non-coded error could be represented as a miscellaneous error.
   
   > In DruidTypedException, what does "typed" mean?
   
   The `Type` refers to the fact that this exception is raised for a specific `ErrorType` (contains code, moduleName and message format). Sorry, I couldn't think of a more appropriate name for this.
   
   Please let me know what you think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #11165: [Feature Proposal] Improve error reporting for task and query failures

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #11165:
URL: https://github.com/apache/druid/issues/11165#issuecomment-840313714


   Thanks a lot, @jihoonson !
   All your suggestions sound great. I agree on all of them.
   I will spend some more time on peons and indexers and see if there could be a less invasive method of using the new type of Exception.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #11165: [Feature Proposal] Improve error reporting for task and query failures

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11165:
URL: https://github.com/apache/druid/issues/11165#issuecomment-832341499


   Hmm, it was a bit hard to understand what is the problem you want solve and how your proposal can address it. If I understand correctly, the ultimate goal is `Richer error messages detailing what went wrong and possible actions for mitigation` for `Easier debugging and RCA (without looking at server logs)`. And what you are proposing seems actually 2 things, 1) error propagation using the error code and 2) a unified error handling and reporting system based on 1). The ultimate goal sounds reasonable if we are not doing it already, but what it's unclear to me is how the unified system you are proposing achieves the goal. To be more precise, the error propagation using the error code seems enough to me to achieve the goal. So, here is my question around the point 2). 
   
   - From the user perspective, richer error message for easier debugging is useful when the error is something that the user can work around by adjusting some user settings.
   - The Druid query system already has an error-code based error reporting system (see [Query exception class](https://github.com/apache/druid/blob/master/core/src/main/java/org/apache/druid/query/QueryException.java)). In the query system, there are a few [user-facing errors](http://druid.apache.org/docs/latest/querying/querying.html#query-errors) and the current error reporting system has been working well. I _think_ it already provides richer error message with a potential workaround. 
   - If we didn't have this query error reporting system, it would be worth considering having a unified system that can be used everywhere. However, given that we already have such a system for querying, what is the benefit of having a unified system?
     - If you are thinking of the registrable errors in custom extensions, I would say we don't probably need it for query system. You can currently develop [custom aggregators, complexMetrics, or query types](http://druid.apache.org/docs/latest/development/modules.html#adding-aggregators). Custom aggregators and complexMetrics will mostly need to handle not-enough-resources style errors which are covered by the error types currently we have. If you develop a new query type, it will require you to develop a new query engine as well to process your custom query type. You can freely add your custom error type in this case.
   
   Based on this, I would suggest focusing on the ingestion side in this proposal. What do you think?
   
   Besides the comment on introducing a unified system, here are my comments on other parts.
   
   > - Specifying the severity of errors and other potential side effects
   > - Hiding implementation and other sensitive details from the end user
   
   1. Can you elaborate more on these? What is the problem of the current status you think? And how does the proposal address the problem?
   
   2. How would the proposed error-code-based system integrate with the current ingestion system? Looking at the code snippets, it seems like we have to modify all codes where throw exceptions to convert them to `DruidTypedException`? If this is what you are proposing, it seems pretty error-prone because it's easy to forget catching and converting exceptions especially when they are unchecked exceptions. Also, please add a code snippet of `TaskStatus` after this change. It will help to understand the proposal.
   
   > Reference containing rich documentation for each error code, thus greatly simplifying debugging
   
   3. This is one of the advantages you listed. However, I'm not quite sure how it can simplify debugging because the error code will not be exposed to users? Can you add some more details on this?
   
   4. Thinking about what the "error code" actually is, is it simply a key to choose the proper error formatter when the error is exposed to users?
   
   5. In `DruidTypedException`, what does "typed" mean? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #11165: [Feature Proposal] Improve error reporting for task and query failures

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #11165:
URL: https://github.com/apache/druid/issues/11165#issuecomment-832923618


   > This is one of the advantages you listed. However, I'm not quite sure how it can simplify debugging because the error code will not be exposed to users? Can you add some more details on this?
   
   This is more of an afterthought. 
   
   You are correct that in most cases, error codes would not be exposed to users. But in the corner case where a certain error code is not identified by the Overlord (case discussed above in Extensions not loaded on the Overlord), we might display a generic error message with the error code such as `Error occurred on middle manager while running task. Error Code: kafka-indexer.offset.outofrange`.
   
   The error codes can also be present in logs rather than the verbose messages themselves. To be able to translate these without having to call an API of the Overlord, we would benefit from having a documented set of known error codes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #11165: [Feature Proposal] Improve error reporting for task and query failures

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #11165:
URL: https://github.com/apache/druid/issues/11165#issuecomment-832919549


   > How would the proposed error-code-based system integrate with the current ingestion system? Looking at the code snippets, it seems like we have to modify all codes where throw exceptions to convert them to DruidTypedException? If this is what you are proposing, it seems pretty error-prone because it's easy to forget catching and converting exceptions especially when they are unchecked exceptions. Also, please add a code snippet of TaskStatus after this change. It will help to understand the proposal.
   
   In case of ingestion, the catch-all error handling is already being done in `TaskQueue`. We would need to modify it to handle the new type of exception as below. This would also require converting the exceptions that we throw from the known failure points to the new exception i.e. the `DruidTypedException` discussed above.
   
   ```java
   public class TaskQueue {
   
   	// ...
   	private ListenableFuture<TaskStatus> attachCallbacks(final Task task) {
   		// ...
   		Futures.addCallback(new FutureCallback<TaskStatus> {
   			
   			@Override
   			public void onSuccess(final TaskStatus successStatus) {
   				// persist the successStatus
   			}
   
   			@Override
   			public void onFailure(final Throwable t) {
   				TaskStatus failureStatus;
   				if (t instanceof DruidTypedException) {
   					failureStatus = TaskStatus.failure(task.getId(), ((DruidTypedException) t).getErrorTypeParams());
   				} else {
   					// build a generic error message here
   					failureStatus = TaskStatus.failure(task.getId(), ...);
   				}
   				// persist the failureStatus
   			}
   		});
   	}
   
   	// ...
   
   }
   ```
   
   
   Alongwith this, any other place where we return a `TaskStatus.failure()` would get modified as below:
   ```java
   	TaskStatus status = TaskStatus.failure(
   		taskId,
   		ErrorTypeParams.of(moduleName, errorCode, messageArgs)
   	);
   ```
   
   The primary change to the `TaskStatus` class is as under (also added to the proposal above):
   ```java
   public class TaskStatus {
   
        // Existing field
        private final @Nullable String errorMsg;
   
       // New field. Contains moduleName, errorCode and messageArgs.
       // A TaskStatus with this field as null would fall back
       // to the existing field errorMsg
       private final @Nullable ErrorTypeParams errorTypeParams;
   
       //...
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org