Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/05/04 06:06:15 UTC

[GitHub] [incubator-druid] dclim commented on issue #7428: Add errors and state to stream supervisor status API endpoint

dclim commented on issue #7428: Add errors and state to stream supervisor status API endpoint
URL: https://github.com/apache/incubator-druid/pull/7428#issuecomment-489297940
 
 
   In `128edad` I made some modifications to the implementation along two main lines:
   
   1) After some consideration, I decided to remove the whole concept of classifying exceptions by their transience, as suggested in https://github.com/apache/incubator-druid/pull/7428#discussion_r277116304. I think it added more complexity than value and, more importantly, could mislead users when we classify an error as transient that in reality will never recover without user intervention. Some examples: in Kafka, `TimeoutException` gets classified as 'transient' if we've previously had a successful run, but without knowing why the timeout is happening, how could you know if it would ever resolve? Is it because the network was congested momentarily, or is it because the Kafka broker got zapped by lightning and is now a smoldering pile of ashes? In Kinesis, the generic `AmazonKinesisException` gets classified as 'non-transient', but I'd bet that there is, or in a future release will be, a subclass that is actually a transient failure we haven't accounted for because it hasn't been written yet. Bottom line: classifying exceptions is fragile, so it's better not to try.
   
   2) In resolving the issue mentioned in https://github.com/apache/incubator-druid/pull/7428#discussion_r278358074 and removing the transience concept in 1), `SeekableStreamSupervisorStateManager` was fairly heavily modified from the original implementation. Most of the other files remain largely the same. I added some missed state capture points in `SeekableStreamSupervisor` and removed some that were capturing failures in code blocks outside the run loop (e.g. I don't want the supervisor reporting an unhealthy state if someone repeatedly hits a status endpoint with a bad request but the main loop is fine).
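
   To illustrate the idea in 2), here is a minimal sketch of a state tracker that records failures only from the main run loop and does not try to classify them by transience. It is not the `SeekableStreamSupervisorStateManager` code from this PR: the class name `RunLoopStateSketch`, its methods, the `unhealthyThreshold` and `maxStoredErrors` parameters, and the state strings are all hypothetical and chosen purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; names and thresholds are assumptions, not Druid's API.
class RunLoopStateSketch {
    private final int unhealthyThreshold;  // consecutive failed runs before reporting unhealthy
    private final int maxStoredErrors;     // how many recent errors to keep for the status response
    private final Deque<Throwable> recentErrors = new ArrayDeque<>();
    private int consecutiveFailures = 0;

    RunLoopStateSketch(int unhealthyThreshold, int maxStoredErrors) {
        this.unhealthyThreshold = unhealthyThreshold;
        this.maxStoredErrors = maxStoredErrors;
    }

    // Meant to be called only when an iteration of the main run loop throws.
    // Errors from request handlers (e.g. a bad status request) are not recorded here.
    void recordRunFailure(Throwable t) {
        consecutiveFailures++;
        recentErrors.addLast(t);
        while (recentErrors.size() > maxStoredErrors) {
            recentErrors.removeFirst();  // drop the oldest error
        }
    }

    // A successful run resets the failure streak; stored errors stay visible.
    void recordRunSuccess() {
        consecutiveFailures = 0;
    }

    // Every exception is treated the same way: only the streak length matters.
    String state() {
        return consecutiveFailures >= unhealthyThreshold ? "UNHEALTHY" : "RUNNING";
    }

    int storedErrorCount() {
        return recentErrors.size();
    }
}
```

   With this shape, a Kafka `TimeoutException` and any other exception are handled identically: the reported state depends only on how many consecutive run loop iterations failed, and the stored errors are surfaced so the user can judge whether intervention is needed.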
