You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/12/03 23:31:38 UTC

[GitHub] [pinot] dang-stripe opened a new issue #7864: Broker /health endpoint returns 200 OK when process first starts up

dang-stripe opened a new issue #7864:
URL: https://github.com/apache/pinot/issues/7864


   After restarting a broker process, we noticed our brokers have their `/health` endpoint return 200s right after the process starts up. Queries sent to the broker during this time return with `BrokerResourceMissingError` in the exceptions field. This happens for 10-20s and then the `/health` endpoint will return 503 while I believe it builds its routing maps.
   
   We're concerned that our load balances might pick up brokers that are being restarted as healthy and were wondering if it's possible for them report 503 instead until it's ready to serve queries for all tables it's assigned to.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jadami10 commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
jadami10 commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-987564287


   I've doubly confirmed this. You get:
   - 200s back while the broker is draining. 
   - connection refused while it's down
   - connection refused while the service is starting
   - 200s after the `PinotServiceManager` starts
   - 503s once the broker starts building the routing table
   - 200s once the broker is done building the routing table
   
   I think the problem is that the `PinotServiceManager` manager registers its own `PinotServiceManagerStatusCallback` which returns `true`/`200` as soon as it's done. And then once the broker thread starts, it goes back to `503`. The `/health` endpoint just blindly looks at every single health callback registered, so there's a window where it's just the `PinotServiceManager`. 
   
   It looks like this behavior was added in #5266, so it's been around a while, but I'm not sure it's correct. The `PinotServiceManager` needs to either not healthcheck at all, or the broker healthcheck needs to make sure its own instanceId healthcheck is passing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-989221113


   > > For shutdown phase, both `PinotServiceManager` and Broker healthcheck should fail when you send the signal.
   > 
   > This is not the case as `PinotServiceManager` tries to stop the broker first, and the broker sleeps for default 10 seconds then just deregisters the handler. But #7880 seems to do all right here by just failing the PinotServiceManager instead. Are we comfortable rolling this out as the new default for everyone?
   > 
   > > This PR enables health check when all bootstrap services are ready. When sending the signal, PinotServiceManager disable healthcheck first.
   > 
   > Thank you for doing this. This was my other thought at how to approach it, and it does cover all the services at once. I just have concerns that this approach is extremely prone to random deadlocks.
   > 
   > ```
   > public static Status getServiceStatus() {
   >     return getServiceStatus(SERVICE_STATUS_CALLBACK);
   >   }
   > ```
   > 
   > really needs to be limited in usage to only the `/health` endpoint. That said, I guess if you make it deadlock, then the integration tests fail, so maybe this is ok.
   
   True, I'm making changes to not add PinotSM status check when the SM port < 0.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jadami10 commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
jadami10 commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-988518328


   @xiangfu0, curiosity got the better of me. I proposed a fix in #7879 based on my findings, but please take a further look as well since this code path is a bit complicated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-988539881


   Another thing wanna check is, have you tested with the instance level API `/health/services/{instanceName}`?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jadami10 commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
jadami10 commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-989128318


   > For shutdown phase, both `PinotServiceManager` and Broker healthcheck should fail when you send the signal.
   
   This is not the case as `PinotServiceManager` tries to stop the broker first, and the broker sleeps for default 10 seconds then just deregisters the handler. But #7880 seems to do all right here by just failing the PinotServiceManager instead. Are we comfortable rolling this out as the new default for everyone? 
   
   > This PR enables health check when all bootstrap services are ready. When sending the signal, PinotServiceManager disable healthcheck first.
   
   Thank you for doing this. This was my other thought at how to approach it, and it does cover all the services at once. I just have concerns that this approach is extremely prone to random deadlocks. 
   ```
   public static Status getServiceStatus() {
       return getServiceStatus(SERVICE_STATUS_CALLBACK);
     }
   ```
   really needs to be limited in usage to only the `/health` endpoint. That said, I guess if you make it deadlock, then the integration tests fail, so maybe this is ok.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-987453013


   Can you please check if the 200s are returned before the process is stopped or after the new one is started? IIRC, the service status should never change from 200 to 503 once it turns to 200. I suspect the 200s are returned before the process is stopped


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-988447949


   @xiangfu0 Can you please take a look at this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jadami10 commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
jadami10 commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-987564702


   My second question here is, can we make it configurable for service status to go to 503 once we send `TERM` or `HUP`. Otherwise even though the broker will go down after draining requests, the load balancer will still think it's safe to send requests that way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-988630865


   @jadami10 do you wanna try this one: https://github.com/apache/pinot/pull/7880
   This PR enables health check when all bootstrap services are ready. When sending the signal,  PinotServiceManager disable healthcheck first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on issue #7864: Broker /health endpoint returns 200 OK when process first starts up

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on issue #7864:
URL: https://github.com/apache/pinot/issues/7864#issuecomment-988538833


   Thanks for bringing this up. Your tests and assumptions make sense.
   
   `PinotServiceManager` itself can start without any role, so 200 is valid, however if it starts with other roles, then we should ensure the health check invalid.
   
   For shutdown phase, both `PinotServiceManager` and Broker healthcheck should fail when you send the signal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org