You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/05 00:30:13 UTC

[GitHub] [druid] jihoonson commented on issue #11165: [Feature Proposal] Improve error reporting for task and query failures

jihoonson commented on issue #11165:
URL: https://github.com/apache/druid/issues/11165#issuecomment-832341499


   Hmm, it was a bit hard to understand what is the problem you want solve and how your proposal can address it. If I understand correctly, the ultimate goal is `Richer error messages detailing what went wrong and possible actions for mitigation` for `Easier debugging and RCA (without looking at server logs)`. And what you are proposing seems actually 2 things, 1) error propagation using the error code and 2) a unified error handling and reporting system based on 1). The ultimate goal sounds reasonable if we are not doing it already, but what it's unclear to me is how the unified system you are proposing achieves the goal. To be more precise, the error propagation using the error code seems enough to me to achieve the goal. So, here is my question around the point 2). 
   
   - From the user perspective, richer error message for easier debugging is useful when the error is something that the user can work around by adjusting some user settings.
   - The Druid query system already has an error-code based error reporting system (see [Query exception class](https://github.com/apache/druid/blob/master/core/src/main/java/org/apache/druid/query/QueryException.java)). In the query system, there are a few [user-facing errors](http://druid.apache.org/docs/latest/querying/querying.html#query-errors) and the current error reporting system has been working well. I _think_ it already provides richer error message with a potential workaround. 
   - If we didn't have this query error reporting system, it would be worth considering having a unified system that can be used everywhere. However, given that we already have such a system for querying, what is the benefit of having a unified system?
     - If you are thinking of the registrable errors in custom extensions, I would say we don't probably need it for query system. You can currently develop [custom aggregators, complexMetrics, or query types](http://druid.apache.org/docs/latest/development/modules.html#adding-aggregators). Custom aggregators and complexMetrics will mostly need to handle not-enough-resources style errors which are covered by the error types currently we have. If you develop a new query type, it will require you to develop a new query engine as well to process your custom query type. You can freely add your custom error type in this case.
   
   Based on this, I would suggest focusing on the ingestion side in this proposal. What do you think?
   
   Besides the comment on introducing a unified system, here are my comments on other parts.
   
   > - Specifying the severity of errors and other potential side effects
   > - Hiding implementation and other sensitive details from the end user
   
   1. Can you elaborate more on these? What is the problem of the current status you think? And how does the proposal address the problem?
   
   2. How would the proposed error-code-based system integrate with the current ingestion system? Looking at the code snippets, it seems like we have to modify all codes where throw exceptions to convert them to `DruidTypedException`? If this is what you are proposing, it seems pretty error-prone because it's easy to forget catching and converting exceptions especially when they are unchecked exceptions. Also, please add a code snippet of `TaskStatus` after this change. It will help to understand the proposal.
   
   > Reference containing rich documentation for each error code, thus greatly simplifying debugging
   
   3. This is one of the advantages you listed. However, I'm not quite sure how it can simplify debugging because the error code will not be exposed to users? Can you add some more details on this?
   
   4. Thinking about what the "error code" actually is, is it simply a key to choose the proper error formatter when the error is exposed to users?
   
   5. In `DruidTypedException`, what does "typed" mean? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org