Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2021/04/11 02:34:08 UTC

[GitHub] [kafka] C0urante opened a new pull request #10520: KAFKA-10816: Use root resource as readiness and health probe for Connect distributed mode

C0urante opened a new pull request #10520:
URL: https://github.com/apache/kafka/pull/10520


   [Jira](https://issues.apache.org/jira/browse/KAFKA-10816)
   
   
   To address KAFKA-10816, the root resource (`GET /`) is modified to route its requests through the worker's herder tick thread. As a result, requests to that resource either fail to complete or are met with a 404 response until the worker has finished startup and is ready to handle requests.
   
   In addition to becoming a viable readiness probe, the root resource also serves as a health probe for the worker: if the worker is having trouble reading to the end of the config topic or (re-)joining the Connect cluster, requests to the root resource will block and eventually be met with a 500 timeout response.
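   
   The gist of the approach, as a minimal sketch with invented names (a JAX-RS resource with a plain `ExecutorService` standing in for the herder's tick thread; this is not the actual patch):
   
   ```java
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Future;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;
   import javax.ws.rs.GET;
   import javax.ws.rs.Path;
   import javax.ws.rs.Produces;
   import javax.ws.rs.WebApplicationException;
   import javax.ws.rs.core.MediaType;
   import javax.ws.rs.core.Response;

   // Sketch only: the names and wiring here are invented, not the actual patch.
   @Path("/")
   public class RootResource {
       // Stand-in for the herder's request queue/tick thread
       private final ExecutorService herderTickThread;

       public RootResource(ExecutorService herderTickThread) {
           this.herderTickThread = herderTickThread;
       }

       @GET
       @Produces(MediaType.APPLICATION_JSON)
       public String serverInfo() {
           // The response body is trivial, but it is computed on the herder's
           // thread, so nothing can be served until the worker has started up
           // and the herder is making progress through its tick loop.
           Future<String> reply = herderTickThread.submit(
                   () -> "{\"version\": \"...\", \"kafka_cluster_id\": \"...\"}");
           try {
               return reply.get(90, TimeUnit.SECONDS);
           } catch (TimeoutException e) {
               // Herder stuck (rebalance, config topic read, ...): 500 timeout
               throw new WebApplicationException("Request timed out",
                       Response.Status.INTERNAL_SERVER_ERROR);
           } catch (InterruptedException | ExecutionException e) {
               throw new WebApplicationException(e);
           }
       }
   }
   ```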
   
   The hacky workaround used to check for worker readiness in the `BlockingConnectorTest` is modified to use the root resource instead. No other tests are touched; existing coverage should be sufficient to vet this change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] C0urante commented on pull request #10520: KAFKA-10816: Use root resource as readiness and health probe for Connect distributed mode

Posted by GitBox <gi...@apache.org>.
C0urante commented on pull request #10520:
URL: https://github.com/apache/kafka/pull/10520#issuecomment-817237286


   @ncliang @gharris1727 @kpatelatwork @ddasarathan could one or two of you take a look at this when you have time?



[GitHub] [kafka] tombentley commented on pull request #10520: KAFKA-10816: Use root resource as readiness and health probe for Connect distributed mode

Posted by GitBox <gi...@apache.org>.
tombentley commented on pull request #10520:
URL: https://github.com/apache/kafka/pull/10520#issuecomment-818658508


   @C0urante this looks neat, and much simpler than my attempt. I see this as being part of the contract for `GET /`, so I suppose that technically a small KIP is necessary.  I've not looked at it properly, but I think we'd need to add a sentence to the documentation for `GET /` to note that it is suitable for use as a readiness and health check. 
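   
   For illustration, a client-side readiness/health check against `GET /` could be as small as the sketch below (host and port are assumptions; exit 0 only if the worker answers 200 in time):
   
   ```java
   import java.net.URI;
   import java.net.http.HttpClient;
   import java.net.http.HttpRequest;
   import java.net.http.HttpResponse;
   import java.time.Duration;

   // Hypothetical probe: exits 0 only if GET / answers 200 within the timeout.
   // localhost:8083 is an assumption; adjust for the worker's listeners config.
   public class ConnectReadinessCheck {
       public static void main(String[] args) {
           HttpClient client = HttpClient.newBuilder()
                   .connectTimeout(Duration.ofSeconds(5))
                   .build();
           HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/"))
                   .timeout(Duration.ofSeconds(10))
                   .GET()
                   .build();
           try {
               HttpResponse<Void> response =
                       client.send(request, HttpResponse.BodyHandlers.discarding());
               System.exit(response.statusCode() == 200 ? 0 : 1);
           } catch (Exception e) {
               // Connection refused or timed out: not ready / not healthy
               System.exit(1);
           }
       }
   }
   ```
   
   Something along these lines could back a container or load-balancer health check once the endpoint's probe semantics are codified via KIP.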
   
   



[GitHub] [kafka] kpatelatwork commented on pull request #10520: KAFKA-10816: Use root resource as readiness and health probe for Connect distributed mode

Posted by GitBox <gi...@apache.org>.
kpatelatwork commented on pull request #10520:
URL: https://github.com/apache/kafka/pull/10520#issuecomment-819107528


   > @tombentley that's fair. I was thinking that it's not really a part of the contract that `/` be implemented under the hood in a certain way (i.e., as a static resource or as one that's routed through the herder's request queue), but in order for this change to be useful (and protected against being undone in the future), it's important that this change be codified via KIP so that the endpoint is both documented and supported as a viable health check for distributed workers (and specifically, their herder).
   > 
   > @kpatelatwork I'll address the two scenarios you've outlined separately:
   > 
   > 1. If the probe is called every second, it's unlikely to have a significant impact on worker health. If nothing else is queued up (such as a rebalance, a different request, session key rotation, etc.), then most of the code path that gets traversed will consist of no-ops or trivial operations. I ran some local benchmarking and each iteration of the tick thread took less than two milliseconds; most of the time that was taken for each request to `GET /` was spent outside of the tick thread, but even then, I was able to issue about 325 requests/sec on my local machine with some sloppy, unoptimized benchmarking*. So if someone's using this endpoint reasonably for healthcheck/monitoring purposes and issuing at most one request every second, the additional performance hit should be fairly negligible.
   > 2. I can't envision a realistic scenario where this degrades worker health by competing with other herder requests, even if hundreds of healthcheck requests are issued per second. Yes, they will be handled on the same thread that takes care of "real" herder requests, but the actual work done in the herder thread to respond to healthcheck requests is tiny--just a lookup of a static object. If requests come in so quickly that new ones arrive before a prior tick iteration has completed, they'll get grouped together and handled en masse within a single tick iteration, at which point performance will be extremely high (likely on the order of thousands if not millions of iterations per second). The only circumstance I can envision where this is remotely possible is if the worker is busy doing other work that takes much longer, such as participating in an expensive rebalance, or trying and failing to read to the end of the config topic. At that point, if it continues to receive requests, they'll pile up and things might get out of hand. But in order for that to happen, each request will either have to originate from a different client, or there will have to be a client out there sending out simultaneous requests without waiting for still-active ones to complete. This only seems possible with some kind of malicious intent, but since it's already possible to lock down access to your worker using REST extensions, we probably don't have to worry about DDoS protection for something like this. If we still want to do some kind of caching, I think one second should be more than enough; the longer it gets, the harder we make it to detect bad worker health when things are going south.
   > 
   > **\*** - `time bash -c "for _ in {0..999}; do curl -sS -o /dev/null localhost:8083/ & done && wait"`, if anyone's interested.
   
   Got it. I think there's no need for caching for now; we can revisit if we encounter this issue in the field.



[GitHub] [kafka] C0urante commented on pull request #10520: KAFKA-10816: Use root resource as readiness and health probe for Connect distributed mode

Posted by GitBox <gi...@apache.org>.
C0urante commented on pull request #10520:
URL: https://github.com/apache/kafka/pull/10520#issuecomment-818931717


   @tombentley that's fair. I was thinking that it's not really a part of the contract that `/` be implemented under the hood in a certain way (i.e., as a static resource or as one that's routed through the herder's request queue), but in order for this change to be useful (and protected against being undone in the future), it's important that this change be codified via KIP so that the endpoint is both documented and supported as a viable health check for distributed workers (and specifically, their herder).
   
   @kpatelatwork I'll address the two scenarios you've outlined separately:
   1. If the probe is called every second, it's unlikely to have a significant impact on worker health. If nothing else is queued up (such as a rebalance, a different request, session key rotation, etc.), then most of the code path that gets traversed will consist of no-ops or trivial operations. I ran some local benchmarking and each iteration of the tick thread took less than two milliseconds; most of the time that was taken for each request to `GET /` was spent outside of the tick thread, but even then, I was able to issue about 325 requests/sec on my local machine with some sloppy, unoptimized benchmarking*. So if someone's using this endpoint reasonably for healthcheck/monitoring purposes and issuing at most one request every second, the additional performance hit should be fairly negligible.
   2. I can't envision a realistic scenario where this degrades worker health by competing with other herder requests, even if hundreds of healthcheck requests are issued per second. Yes, they will be handled on the same thread that takes care of "real" herder requests, but the actual work done in the herder thread to respond to healthcheck requests is tiny--just a lookup of a static object. If requests come in so quickly that new ones arrive before a prior tick iteration has completed, they'll get grouped together and handled en masse within a single tick iteration (see the sketch after the footnote), at which point performance will be extremely high (likely on the order of thousands if not millions of iterations per second). The only circumstance I can envision where this is remotely possible is if the worker is busy doing other work that takes much longer, such as participating in an expensive rebalance, or trying and failing to read to the end of the config topic. At that point, if it continues to receive requests, they'll pile up and things might get out of hand. But in order for that to happen, each request will either have to originate from a different client, or there will have to be a client out there sending out simultaneous requests without waiting for still-active ones to complete. This only seems possible with some kind of malicious intent, but since it's already possible to lock down access to your worker using REST extensions, we probably don't have to worry about DDoS protection for something like this. If we still want to do some kind of caching, I think one second should be more than enough; the longer it gets, the harder we make it to detect bad worker health when things are going south.
   
   
   **\*** - `time bash -c "for _ in {0..999}; do curl -sS -o /dev/null localhost:8083/ & done && wait"`, if anyone's interested.
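   
   To make the "grouped together and handled en masse" behavior from point 2 concrete, here is a minimal sketch with invented names (the real tick loop does far more per iteration):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.BlockingQueue;
   import java.util.concurrent.LinkedBlockingQueue;

   // Invented names, for illustration only: each tick iteration drains every
   // request that queued up while the previous iteration was running, so the
   // per-request overhead collapses when probes arrive faster than ticks.
   public class TickLoop {
       private final BlockingQueue<Runnable> requestQueue = new LinkedBlockingQueue<>();

       public void submit(Runnable request) {
           requestQueue.add(request);
       }

       public void tick() {
           List<Runnable> batch = new ArrayList<>();
           requestQueue.drainTo(batch);  // grab everything queued so far, in one pass
           for (Runnable request : batch) {
               request.run();            // a health probe here is just a trivial lookup
           }
           // ... the rest of the iteration: rebalances, config topic reads, etc.
       }
   }
   ```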



[GitHub] [kafka] kpatelatwork commented on pull request #10520: KAFKA-10816: Use root resource as readiness and health probe for Connect distributed mode

Posted by GitBox <gi...@apache.org>.
kpatelatwork commented on pull request #10520:
URL: https://github.com/apache/kafka/pull/10520#issuecomment-818841491


   @C0urante the change looks good. 
   
   I am a newbie here, but correct me if I am wrong about the minor concerns below:
   
   1. I agree with Tom that a small KIP is necessary, as this is a change in behavior.
   2. If the probe is called every second by some monitoring tool, or if someone calls it hundreds of times in a second, won't it compete with other herder tasks in the queue? Should we cache the result of `herder.kafkaClusterId()` in `RootResource` for one minute (roughly along the lines of the sketch below)?
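   
   A hypothetical sketch of what that caching could look like (generic memoization with invented names; the Connect wiring is elided, and this was ultimately not adopted in the PR):
   
   ```java
   import java.util.function.Supplier;

   // Hypothetical sketch of the caching floated in point 2 (not adopted in the
   // PR): memoize a herder lookup for a fixed TTL so rapid-fire probe requests
   // don't each enqueue a herder request.
   public class CachedValue<T> {
       private final Supplier<T> lookup;  // e.g. herder::kafkaClusterId
       private final long ttlMs;
       private T value;
       private long fetchedAtMs;

       public CachedValue(Supplier<T> lookup, long ttlMs) {
           this.lookup = lookup;
           this.ttlMs = ttlMs;
       }

       public synchronized T get() {
           long now = System.currentTimeMillis();
           if (value == null || now - fetchedAtMs > ttlMs) {
               value = lookup.get();  // only hits the herder once per TTL window
               fetchedAtMs = now;
           }
           return value;
       }
   }
   ```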



[GitHub] [kafka] C0urante commented on pull request #10520: KAFKA-10816: Use root resource as readiness and health probe for Connect distributed mode

Posted by GitBox <gi...@apache.org>.
C0urante commented on pull request #10520:
URL: https://github.com/apache/kafka/pull/10520#issuecomment-818932302


   Converting this to DRAFT pending a KIP.

