You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/07/06 18:28:07 UTC

[GitHub] [pinot] walterddr opened a new issue, #9023: Server HealthCheck StatusCallback should not include table/segment IS/EV check

walterddr opened a new issue, #9023:
URL: https://github.com/apache/pinot/issues/9023

   Currently BaseServerStarter will register status health check callback with all the enabled table on that server. 
   
   It will only return health OK once all table/segments are loaded. 
   - This will cause problem for server with large amount of table/data to be loaded. running in a container environment. this will result in health check returns 404 and later container manager (K8s for example) killing the server container and restarting it. ending in an infinite loop
   
   propose to strip table/segment status callback from server health check endpoint and only check server's health. and create /health/services/ endpoint to check server + table resource health and only returns OK if everything is healthy


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mcvsubbu commented on issue #9023: Server HealthCheck StatusCallback should not include table/segment IS/EV check

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1176549687

   We need this check in our environment. Please do not remove it.
   
   Instead, you can use this config variable (set it to 0) to disable it altogether.
   
   https://github.com/apache/pinot/blob/master/pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java#L370


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #9023: Server HealthCheck StatusCallback should not include table/segment IS/EV check

Posted by GitBox <gi...@apache.org>.
walterddr commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1176782703

   ok yes that would be changing the semantics of a REST API. 
   
   counter proposal
   - keep current `health` API behavior.
   - new `health/instance` API that strip table/segment status callback from server health check (only instance health check)
   - new `health/services` API that returns a similar to ServiceStarterManager REST API that returns a map of serviceStatusCallback value 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
walterddr commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1191559119

   ENUM type in Response vs. Different end points
   ==
   IMO it is best to use different endpoints instead of the same endpoint with varying types of response
   
   - [HTTP Response Code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#successful_responses) does suggest that any 200-299 is considered a success, we can potentially encode that along with the different acceptable success stats. But it would be a very non-traditional way of using the HTTP response code.
   - [Container Management in K8s](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) provides the liveness/readiness/startup probes specifically to differentiate different type of condition of the container pod, but I am not seeing any customization to allow parsing of the response content.
    
   Some Thoughts
   ==
   - Initially, this issue is created to deal with running Pinot on K8s, while debugging a slow pod restart failure. thus I added the `health/instance` for liveness probe; and use the normal `health` for readiness probe.
   - I've also discovered that when the slow pod restart returns a 404 (BAD) state. It doesn't tell me exactly which segment was slow loading for which table. thus I added the additional endpoint `health/service` to return the entire service status callback map. 
   
   Seems like it is causing some confusion and addressing both issues at the same time also distract the discussion. So I would say let's focus on the main purpose of the discussion to find a way to allow containerization software to know whether a pinot server is alive vs. a pinot server is ready for serving queries. 
   
   if there's a way to easily configure k8s or ELB to differentiate the 2 server state. I am ok with using just one endpoint.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
walterddr commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1191999067

   sounds good. I will create these. ^ @jadami10 any follow up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mcvsubbu commented on issue #9023: Server HealthCheck StatusCallback should not include table/segment IS/EV check

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1178500610

   The existing behavior of loading all segments was put in place to allow for rolling restarts and upgrades. With this behavior, we can use healthcheck after an upgrade or restart and rest assured that we don't compromise on availability. @walterddr 
    how do you propose to solve this problem in the container environment? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #9023: Server HealthCheck StatusCallback should not include table/segment IS/EV check

Posted by GitBox <gi...@apache.org>.
walterddr commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1178587581

   I think the issue I am trying to solve here is orthogonal to the rolling restart problem, which is an intended restart. I am happy to create another issue to address this problem for the containerized environment.
   
   this issue mainly discusses the unintentional restart triggered by the containerized environment. Here I propose to create an additional endpoint to retrieve "instance availability" and "data availability" separately. It shouldn't affect how the rolling restart issue being addressed, containerized or not. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mcvsubbu commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1184791904

   +1 for new endpoint, yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1188454236

   I'd suggest adding a new API with more information in the response:
   - Status (NOT_STARTED, STARTING, GOOD, BAD, SHUTTING_DOWN)
   - Custom map explaining the status (segments not loaded yet)
   
   It is much more flexible this way, and different client can choose how to parse the response. E.g. liveness check can pass when the server is not `BAD`, readiness check can pass when server is `GOOD`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jadami10 commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
jadami10 commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1192607314

   nope, those look great


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1191866665

   Checked other systems, and seems health check is mostly for liveness instead of readiness. Since we already use it as the readiness check, we can leave it as is for backward compatibility.
   
   Based on the requirement, I'd suggest making the api more explicit on the purpose:
   - `/health/liveness` for liveness
   - `/health/readiness` for readiness
   
   I feel using `instance` and `service` is not as clear


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #9023: Server HealthCheck StatusCallback should not include table/segment IS/EV check

Posted by GitBox <gi...@apache.org>.
walterddr commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1176854637

   @mcvsubbu @mayankshriv @npawar ^


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jadami10 commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
jadami10 commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1191501499

   I don't think your typical ELB parses that way. Any non-200 code is considered failing. And in this case we want a server that's up but downloading segments/catching up to be a 200 on one endpoint and non-200 on another.
   
   @walterddr, do you plan to add a healthcheck metrics for the server while you're in here
   ```
   HEALTHCHECK_OK_CALLS("healthcheck", true),
   HEALTHCHECK_BAD_CALLS("healthcheck", true),
   ```
   similar what broker and controller have?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jadami10 commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
jadami10 commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1188457347

   One concern I have with a more complex API is most load balancers only allow configuring the path and then look at the status code of the response.
   
   So having both /health/instance and /health/services like in the proposal makes most sense to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr closed issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
walterddr closed issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check
URL: https://github.com/apache/pinot/issues/9023


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jadami10 commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
jadami10 commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1184790620

   We have the same problem where servers take too long to go healthy and the ASG recycles them forever. This is for intentional restarts as well as unintentional ones. Rewording what you said to see if we agrees; there does seem to be some issue here where the `/health` endpoint is doing too much:
   - The infrastructure responsible for making sure the instance or container is healthy really just wants to know, "are you working correctly?". Which really as long as the server is up and doing something should return 200.
   - The infrastructure used to coordinate rolling restarts wants to know, "are you up and have loaded all your segments".
   
   I'm +1 for your counter proposal. We can use `/health/instance` for whatever instance/container manager is being used and leave `/health` as is for backwards compatibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #9023: Server HealthCheck StatusCallback should separate instance and table/segment health check

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #9023:
URL: https://github.com/apache/pinot/issues/9023#issuecomment-1189679015

   > One concern I have with a more complex API is most load balancers only allow configuring the path and then look at the status code of the response.
   > 
   > So having both /health/instance and /health/services like in the proposal makes most sense to me.
   
   @jadami10 What if we return different status code for different status? Current API only supports `200` and `503`, which cannot represent all the available status.
   One problem is that different client might have different logic on parsing the pinot service status, and fixing the return type can reduce the flexibility. We don't want to add an extra new API for each different status check logic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org