You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by "paul-rogers (via GitHub)" <gi...@apache.org> on 2023/03/05 02:21:53 UTC

[GitHub] [druid] paul-rogers opened a new pull request, #13884: Draft 2 of the Proposed Revised Error Mechanism

paul-rogers opened a new pull request, #13884:
URL: https://github.com/apache/druid/pull/13884

Proposed is a revision to Druid’s error reporting structure. An [earlier PR](https://github.com/apache/druid/pull/13798) and [issue](https://github.com/apache/druid/issues/13123) introduced the concept of better error reporting. Discussion around that work refined and expanded the set of requirements:

1. Better error messages for Druid users.
2. Build on patterns established by MSQ.
3. Require that users think of the persona who can act on each error message.
4. Unique error code for each error condition.
5. Deployments can customize error messages.
6. Easier to search for specific log messages in a log management system.

The key new concepts in this PR are explained below.

This PR is another draft of the concept to solicit further comment.

#### `MSQFault`

This PR models the `DruidException` class on the `MSQFault` class where feasible. [A previous note](https://github.com/apache/druid/pull/13798#issuecomment-1435377632) explained that `MSQFault` is tightly coupled to the MSQ error report framework and is thus not easily reusable. The present version of `DruidException` borrows the idea of creating subclasses for each kind of error, and of holding error “parameters” as distinct values, rather than only as fields in the message text. This PR also borrows the idea of creating groups of error subclass to represent error types.

#### Personas

Writing good error messages is hard. To encourage developers to write better messages, we require the developer to think about who can act on the message: the target persona. All errors are sent to the user, but the user may not be the person who can fix the problem. For example, if the network is down, user operations will fail, but the user is generally not the persona who can fix network issues.

To help, every exception must have a target persona identified. Since classes of errors all tend to have the same target persona, the persona is generally set on a base class, which reduces the overhead of specifying the person each time we raise an exception.

The persona isn’t actually used in this PR: we just require that it be set as a mental exercise.

#### Error Translation

Druid is used in many different ways. Developers tend to use Druid to create and debug new features. New users tend to use Druid on a single machine and play the role of user, app developer and system administrator. In these cases, the end user wants the full text of each error as they play a number of roles.

Once Druid lands in production, these roles are spread out across many different people. Druid is often embedded in part of a larger application. In those cases, the application may have its own terminology. Deployments may also chose to redact certain details from user messages. We call this “error translation.” Even if the text remains in English, the wording is changed to be consistent with the overall application.

To support these use cases, this PR introduces the idea of separating the message “template” from the error “parameters”. Druid provides a default template. Deployments can provide a file with customized templates. A number of features are required to realize this idea.

#### Unique Error Codes

To aid translation, we propose that all errors have a unique text error code. Errors from the `/sql` endpoint already have an error code, but the code is not unique or each kind of error. The existing code is also a bit ad-hoc and unstructured. This draft suggests a three-part pattern:

* General area, such as “SQL” or “MSQ”
* Sub-area, such as “Parse” or “Validation”
* Detail code, such as “InvalidToken” or “InvalidConstant”.

The result is a unique code such as `SQL-Parse-InvalidToken` or `SQL-Validation-InvalidConstant`.

The errors map to an exception class hierarchy: we define a class for SQL Parse errors and another for SQL Validation errors. Each point in the code which raises an exception identifies the detail code. When the “same” error is defined in multiple places, a function can ensure that the various instances all use the same code. We could eventually create a centralized map of error codes if needed. Doing so seems overly complex at this time.

#### Error Parameters

Most exceptions include information. Which token is invalid? Which constant is invalid?

`MSQFault` creates a subclass for each distinct error “type”, defined as a unique set of error parameters. The earlier PR used an error “context” to hold (a subset of) exception parameters. This version combines the two ideas. We hold error parameters as a map of key/value pairs. The reason for a map, rather than fields, will become clear shortly.

We combine error message templates with error parameters using the Apache Commons [`StringSubstitutor`](https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/StringSubstitutor.html) class. For example:

```text
Template: Line [${line}], Position [${column}]: Column [${name}] does not exist in any datasource in the query
Values: {“line”: 3, “Column”: 4, “name”: “foo”}
Result: Line 3, Position 4: Column [foo] does not exist in any datasource in the query
```

Suppose an application prefers to use the words “field” and “table”. Suppose also the application generates the SQL, so that error position in the SQL are not useful to the user. This application can provide its own template:

```text
Template: Field “${name}” not found in any table. Ensure the field is valid.
Result: Field “foo” not found in any table. Ensure the field is valid.
```

Because `StringSubstitutor` needs a map of values, it turns out to be simpler to store error parameters as a map than as individual fields. If the parameters were fields, each error class would need code to build a map from those fields.

Example of raising an error with parameters:

```java
throw new SqlValidationError(
"ColumnNotFound"
"Line [${line}], Position [${column}]: Column [${name}] does not exist in any datasource in the query"
)
.withValue("line", lineNo)
.withValue("column", col)
.withValue("name", columnName);
```

The `SqlValidationError` class provides the top and second-level error code: `SQL-Validation`. The parameters in the template match the names of the values provided. As explained below, order of the values is important.

#### Translation Files

To implement error translation, an application simply provides a custom "error catalog" in `conf/druid/errors.properties`. The file is a normal Java properties file that uses the error code as the key:

```text
SQL-Validation-ColumnNotFound=Field “${name}” not found in any table. Ensure the field is valid.
```

If the file is not found, or cannot be read, or does not contain the key for a given error, then the Druid default format is used instead.

This is the simplest possible solution to get started: we can refine as we gain feedback.

#### Log Message Format

Given the above mechanism, we now have three distinct forms of error text:

* The default Druid formatting
* The translated message, which may redact some information
* The logged form of the error

The logged message should provide all information to the administrator or other "privileged" user who must support the system. The log message must contain these details even if they are not returned to the user. Error logging is controlled by the `getMessage()` method. Thus, this method must return a form of the error that does not do translation.

As we improve error messages, we will add helpful text about how to resolve issues. That text need not appear in the log.

When logs are fed into a log search system (such as Splunk or ElasticSearch), administrators want a simple way to search for a given error. Thus, the log message should include the error code.

Given this, the log format of an error, given by the `getMessage()` method, is of the form:

```text
<code>: <key>=[<value>]...
```

For example:

```text
SQL-Validation-ColumnNotFound: line=[3], column=[4], name=[foo]
```

To ensure the log format is stable, the `values` map in each exception is a `LinkedHashMap`: values appear in the log in the order in which they are added to the map.

#### `getMessage()` and Tests

Since `getMessage()` must produce the log form of a message, we use `toString()` to produce the default Druid format for the message. This format is primarily for debugging and testing.

Messages are returned to the user via the REST response. In the proposed code, we do not serialize `DruidException` directly. Instead, we perform a conversion step from `DruidException` to the form needed for each REST call. The `/sql` endpoint uses one format, other RPC messages use other formats. (The multiple error formats is more of a "bug" than "feature", but we'll fix that another time.)

The mechanism to produce the `/sql` REST response also handles the translation. That mechanism requires Guice injection of the properties with the name of the translation file.

Many Druid tests use the standard mechanism to verify error text: ask a JUnit or Hamcrest matcher to check the return value from `getMessage()`. These tests must change to use the log format. Or, they must change to use custom code to check the value of `toString()` to check the default Druid error text.

Once we settle on a design, a future commit will standardize SQL test error checking.

#### Third Party Errors

Druid uses Calcite to parse SQL. Druid has no control over Calcite messages. In this case, we translate the Calcite exception to a `DruidException` with one parameter, `message`, which holds the original Calcite text. Such text cannot be translated using the mechanism proposed here, though the entire message can be redacted.

In such cases, we could expand the error translator to include a function. That function can perform regular expression matches to "peek inside" the third party error, and reformat the message that way. The result is not elegant, but does solve the problem.

#### Release note

This PR is a draft and is a proposal. If this design turns out to be the final one, we'll update this section to include:

* The text of some error messages will change.
* The error code for `/sql` messages will change. The new IDs are unique.
* Description of how to customize error messages using an error message catalog.

<hr>

This PR has:

- [X] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [X] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [X] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] imply-cheddar commented on pull request #13884: Draft 2 of the Proposed Revised Error Mechanism

Posted by "imply-cheddar (via GitHub)" <gi...@apache.org>.

imply-cheddar commented on PR #13884:
URL: https://github.com/apache/druid/pull/13884#issuecomment-1458931783

   1) The log format should have a fully interpolated message.  People do view errors in logs and a terse message that is only a key and a few context values is not sufficient.  It should be the fully interpolated message + all of the "context" parameters on the same line.
   
   2) It's completely unclear which persona is being targeted by an exception message.  It is important that developers and reviewers and anybody who comes to read the code can see the intended persona next to the error message.  Don't put it in a base class and serialize it out on all of the RPC calls
   
   3) We do not need to support error translation (at least not now).  We need to generate better error messages.  The creation of names (using Strings that could easily accidentally collide between different call points) that sit next to the actual message adds little value and doesn't actually help.  Reviewers will just skip over those strings as they don't add value and it is impossible for a reviewer to fully vet that the new String created is good *and* they are prone to copy-and-paste duplication.  Overall, I think it's a non-requirement at this point and we should focus instead on things that generate good error messages and perhaps go back to error translation at some point in the future if it's required after we are happy with the error messages that we generate by default.
   
   4) Regarding Faults and creating a new class for every, single, error that ever comes.  Personally, I think it's overkill.  It definitely makes sense in some cases and supporting the ability to provide things through a "Fault" style class has nice things, but the overhead of creating new errors is high.  This overhead is the reason that I think it's overkill, for us to get value from it, we have to be willing to make it the only way that we ever generate errors (i.e. we require a new Fault class for every error), but that's a really tall order for errors that are, e.g. just defensive validations that we expect to not actually be hit very often (if ever).  We could perhaps have a world where anything targeting a User persona requires a Fault class (as it's intending to be end-user facing) and all of the Admin/Developer focused ones follow a less-strict "Builder style" exception creation scheme.  Either way, if we want to support `Fault` classes, we need to determine the semantics t
 hat make them required and then have them 100% of the time for those semantics.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] paul-rogers commented on pull request #13884: Draft 2 of the Proposed Revised Error Mechanism

Posted by "paul-rogers (via GitHub)" <gi...@apache.org>.

paul-rogers commented on PR #13884:
URL: https://github.com/apache/druid/pull/13884#issuecomment-1459602618

   Closing as we don't have agreement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] paul-rogers closed pull request #13884: Draft 2 of the Proposed Revised Error Mechanism

Posted by "paul-rogers (via GitHub)" <gi...@apache.org>.

paul-rogers closed pull request #13884: Draft 2 of the Proposed Revised Error Mechanism
URL: https://github.com/apache/druid/pull/13884


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org