You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/10/02 16:08:22 UTC

[GitHub] [druid] a2l007 opened a new pull request #10464: Improved exception handling in case of query timeouts

a2l007 opened a new pull request #10464:
URL: https://github.com/apache/druid/pull/10464


   <!-- Thanks for trying to help us make Apache Druid be the best it can be! Please fill out as much of the following information as is possible (where relevant, and remove it when irrelevant) to help make the intention and scope of this PR clear in order to ease review. -->
   
   This PR separates out query timeout exceptions and avoids it from being thrown as a `QueryInterruptedException`. Timeout exceptions are now handled using a new  `QueryTimeoutException`. Timed out queries will now return HTTP 504 status code instead of the default HTTP 500 status. This makes it easier for clients to identify timeout exceptions and handle it accordingly.
   To preserve  the query metrics behavior, timeout exceptions are still being clubbed under `query/interrupted/count`, but I feel 
   there should be a separate metric for query timeouts which can be done in a separate PR.
   
   Fixes #10028 
   
   <!-- Replace XXXX with the id of the issue fixed in this PR. Remove this section if there is no corresponding issue. Don't reference the issue in the title of this pull-request. -->
   
   <!-- If you are a committer, follow the PR action item checklist for committers:
   https://github.com/apache/druid/blob/master/dev/committer-instructions.md#pr-and-issue-action-item-checklist-for-committers. -->
   
   
   <!--
   In each section, please describe design decisions made, including:
    - Choice of algorithms
    - Behavioral aspects. What configuration values are acceptable? How are corner cases and error conditions handled, such as when there are insufficient resources?
    - Class organization and design (how the logic is split between classes, inheritance, composition, design patterns)
    - Method organization and design (how the logic is split between methods, parameters and return types)
    - Naming (class, method, API, configuration, HTTP endpoint, names of emitted metrics)
   -->
   
   
   <!-- It's good to describe an alternative design (or mention an alternative name) for every design (or naming) decision point and compare the alternatives with the designs that you've implemented (or the names you've chosen) to highlight the advantages of the chosen designs and names. -->
   
   <!-- If there was a discussion of the design of the feature implemented in this PR elsewhere (e. g. a "Proposal" issue, any other issue, or a thread in the development mailing list), link to that discussion from this PR description and explain what have changed in your final design compared to your original proposal or the consensus version in the end of the discussion. If something hasn't changed since the original discussion, you can omit a detailed discussion of those aspects of the design here, perhaps apart from brief mentioning for the sake of readability of this PR description. -->
   
   <!-- Some of the aspects mentioned above may be omitted for simple and small changes. -->
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [x] been tested in a test Druid cluster.
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist above are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `QueryTimeoutException`
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] capistrant commented on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
capistrant commented on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-718868718


   Awesome, this is something my team's users have been asking for. I'm going to review code and test locally today 👍 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 commented on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
a2l007 commented on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721781435


   FYI, I have modified the PR description to add a more release-note friendly description. A PR is out for the documentation fix as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721388168


   > Good point. How does something like this work for release notes? Does the patch author write up release notes for their change that the release manager will use later on?
   
   We usually ask authors to add some summary and major changes in the PR description so that the release manager can easily pick up and add them in the release notes.
   
   > Could you add more to your thoughts that 408 may be a better error code? I thought that 504 sounded ideal since the Broker API is kind of a proxy to the druid query engine. The engine did not produce a response in time for the Broker to return to client so it responds with 504. I also worry about the implications of using a 408 code in relation to the potential for middleware auto-retrying the request when the end client may not want that (something like a browser doing auto retries N times before actually surfacing that 408 to client if it continues). And I've always had the perception of 5XX being a server error and 4XX being more of a client error, like the client made a mistake. But in this case the server isn't coming up with a response in time.
   
   Good point. The reason I mentioned 408 is that the broker does not always act as a gateway, but also performs query execution especially when you have subqueries on top of non-groupBy or multiple joins in the query. I think these queries cause timeout more often because processing subqueries or multiple joins are not parallelizable for now. But I do see why you chose 504 instead of 408 which also seems reasonable.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] capistrant merged pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
capistrant merged pull request #10464:
URL: https://github.com/apache/druid/pull/10464


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721296649


   Hi @a2l007 and @capistrant, thanks for the nice PR and review! We usually treat silence as acceptance, so thanks for merging the PR. However, for this particular PR, I think it needs to get "Design Review" which needs 3 +1s from committers (including the author himself) because it changes some user-facing behavior (https://github.com/apache/druid/blob/master/dev/committer-instructions.md). We have this policy because it's hard to undo once the changed user-facing behavior appears in some release. When we do design review, we usually have one committer reviewing the whole PR line by line, so that others can focus on only the design part. From this aspect, I quickly reviewed only the design part. I have 3 comments so far.
   - The purpose of the PR sounds good to me because the timeout error reporting was inconsistent before. However this change is incompatible which means it can break user applications working based on the previous behavior. Someone can argue that no one could do it because the previous behavior was inconsistent. IMO, this change looks worth, but should be called out in the release notes.
   - Looking at the code, [the status code is defined as 504](https://github.com/apache/druid/pull/10464/files#diff-4bc40182ca7938163c8f2b71355cb5849920526319bc56f83dd55e542f3a4323R42) which means gateway timeout. I'm wondering 408 request timeout is more appropriate in this case.
   - All query errors should be documented. Check out https://druid.apache.org/docs/latest/querying/querying.html#query-errors.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 commented on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
a2l007 commented on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721359443


   @jihoonson and @capistrant Thank you for the feedback.
   
   > Looking at the code, the status code is defined as 504 which means gateway timeout. I'm wondering 408 request timeout is more appropriate in this case.
   
   Concurring with @capistrant 's response, I feel that categorizing this as a server gateway timeout(https://tools.ietf.org/html/rfc2068#section-10.5.5 )makes more sense than returning a 408 (https://tools.ietf.org/html/rfc2068#section-10.4.9) here. I'm interested on hearing your opinion on this.
   
   I can start a follow up PR once the conversation here is resolved.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] capistrant commented on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
capistrant commented on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721306987


   > Hi @a2l007 and @capistrant, thanks for the nice PR and review! We usually treat silence as acceptance, so thanks for merging the PR too. However, for this particular PR, I think it needs to get "Design Review" which needs 3 +1s from committers (including the author himself) because it changes some user-facing behavior (https://github.com/apache/druid/blob/master/dev/committer-instructions.md). We have this policy because it's hard to undo once the changed user-facing behavior appears in some release. 
   
   Apologies for the pre-mature merge. I will keep this in mind in future to make sure I loop in more committers during review process for something client facing like this.
   
   > When we do design review, we usually have one committer reviewing the whole PR line by line, so that others can focus on only the design part. From this aspect, I quickly reviewed only the design part. I have 3 comments so far.
   > 
   > * The purpose of the PR sounds good to me because the timeout error reporting was inconsistent before. However this change is incompatible which means it can break user applications working based on the previous behavior. Someone can argue that no one could do it because the previous behavior was inconsistent. IMO, this change looks worth since the previous timeout error handling was easy to break, but should be called out in the release notes.
   
   Good point. How does something like this work for release notes? Does the patch author write up release notes for this that the release manager will use later on?
   
   > * Looking at the code, [the status code is defined as 504](https://github.com/apache/druid/pull/10464/files#diff-4bc40182ca7938163c8f2b71355cb5849920526319bc56f83dd55e542f3a4323R42) which means gateway timeout. I'm wondering 408 request timeout is more appropriate in this case.
   
   Could you add more to your thoughts that 408 may be a better error code? I thought that 504 sounded ideal since the Broker API is kind of a proxy to the druid query engine. The engine did not produce a response in time for the Broker to return to client so it responds with 504. I also worry about the implications of using a 408 code in relation to the potential for middleware auto-retrying the request when the end client may not want that (something like a browser doing auto retries N times before actually surfacing that 408 to client if it continues). And I've always had the perception of 5XX being a server error and 4XX being more of a client error, like the client made a mistake. But in this case the server isn't coming up with a response in time.
   
   > * All query errors should be documented. Check out https://druid.apache.org/docs/latest/querying/querying.html#query-errors.
   
   shoot I had a feeling there was a doc like this out there, @a2l007 do you want to start a follow on PR for this? I suppose it may be a part of a larger follow on with whatever else we uncover in discussions with Jihoon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 edited a comment on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
a2l007 edited a comment on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721781435


   FYI, I have modified the PR description to add a more release-note friendly description. Raised a PR for the documentation fix as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] capistrant edited a comment on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
capistrant edited a comment on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721306987


   > Hi @a2l007 and @capistrant, thanks for the nice PR and review! We usually treat silence as acceptance, so thanks for merging the PR too. However, for this particular PR, I think it needs to get "Design Review" which needs 3 +1s from committers (including the author himself) because it changes some user-facing behavior (https://github.com/apache/druid/blob/master/dev/committer-instructions.md). We have this policy because it's hard to undo once the changed user-facing behavior appears in some release. 
   
   Apologies for the pre-mature merge. I will keep this in mind in future to make sure I loop in more committers during review process for something client facing like this.
   
   > When we do design review, we usually have one committer reviewing the whole PR line by line, so that others can focus on only the design part. From this aspect, I quickly reviewed only the design part. I have 3 comments so far.
   > 
   > * The purpose of the PR sounds good to me because the timeout error reporting was inconsistent before. However this change is incompatible which means it can break user applications working based on the previous behavior. Someone can argue that no one could do it because the previous behavior was inconsistent. IMO, this change looks worth since the previous timeout error handling was easy to break, but should be called out in the release notes.
   
   Good point. How does something like this work for release notes? Does the patch author write up release notes for their change that the release manager will use later on?
   
   > * Looking at the code, [the status code is defined as 504](https://github.com/apache/druid/pull/10464/files#diff-4bc40182ca7938163c8f2b71355cb5849920526319bc56f83dd55e542f3a4323R42) which means gateway timeout. I'm wondering 408 request timeout is more appropriate in this case.
   
   Could you add more to your thoughts that 408 may be a better error code? I thought that 504 sounded ideal since the Broker API is kind of a proxy to the druid query engine. The engine did not produce a response in time for the Broker to return to client so it responds with 504. I also worry about the implications of using a 408 code in relation to the potential for middleware auto-retrying the request when the end client may not want that (something like a browser doing auto retries N times before actually surfacing that 408 to client if it continues). And I've always had the perception of 5XX being a server error and 4XX being more of a client error, like the client made a mistake. But in this case the server isn't coming up with a response in time.
   
   > * All query errors should be documented. Check out https://druid.apache.org/docs/latest/querying/querying.html#query-errors.
   
   shoot I had a feeling there was a doc like this out there, @a2l007 do you want to start a follow on PR for this? I suppose it may be a part of a larger follow on with whatever else we uncover in discussions with Jihoon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson edited a comment on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721296649


   Hi @a2l007 and @capistrant, thanks for the nice PR and review! We usually treat silence as acceptance, so thanks for merging the PR too. However, for this particular PR, I think it needs to get "Design Review" which needs 3 +1s from committers (including the author himself) because it changes some user-facing behavior (https://github.com/apache/druid/blob/master/dev/committer-instructions.md). We have this policy because it's hard to undo once the changed user-facing behavior appears in some release. When we do design review, we usually have one committer reviewing the whole PR line by line, so that others can focus on only the design part. From this aspect, I quickly reviewed only the design part. I have 3 comments so far.
   - The purpose of the PR sounds good to me because the timeout error reporting was inconsistent before. However this change is incompatible which means it can break user applications working based on the previous behavior. Someone can argue that no one could do it because the previous behavior was inconsistent. IMO, this change looks worth since the previous timeout error handling was easy to break, but should be called out in the release notes.
   - Looking at the code, [the status code is defined as 504](https://github.com/apache/druid/pull/10464/files#diff-4bc40182ca7938163c8f2b71355cb5849920526319bc56f83dd55e542f3a4323R42) which means gateway timeout. I'm wondering 408 request timeout is more appropriate in this case.
   - All query errors should be documented. Check out https://druid.apache.org/docs/latest/querying/querying.html#query-errors.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] capistrant commented on pull request #10464: Improved exception handling in case of query timeouts

Posted by GitBox <gi...@apache.org>.
capistrant commented on pull request #10464:
URL: https://github.com/apache/druid/pull/10464#issuecomment-721179142


   This PR has been approved for about 4 days with no further feedback so I am going to go ahead and merge. Thank you for this patch @a2l007 ... I know that my team is going to really enjoy this new API response dedicated to time outs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org