Posted to dev@solr.apache.org by Jan Høydahl <ja...@cominvent.com> on 2023/03/29 13:16:39 UTC

Draining a Solr node for traffic before shutting down

Hi,

I'm trying to prevent traffic from being sent to a Solr node that is going to shut down, to avoid interruption of service as seen from various clients.
The first part of the puzzle is signaling to any (external) load balancer to stop sending requests to the node.
The other part is having SolrJ understand that the node is being stopped, so it stops routing internal requests to cores on that node.

Does anyone have a good command of the shutdown logic in Solr?
My understanding is a bit sparse, but here's what I can see in the code:

1. bin/solr stop sends a STOP command to Jetty's STOP_PORT with the (not-so-secret) stop key
2. Jetty starts the shutdown process, destroying all servlets and filters, including Solr's dispatchFilter
3. Solr is notified of the shutdown through a callback in CoreContainerProvider
4. CoreContainerProvider#close() is called, which calls CC#shutdown
5. CC shuts down every core on the node and then calls ZkController#preClose
6. ZkController#preClose removes the ephemeral live_nodes/myNode znode and then publishes the down state in state.json
7. Solr waits for its executors to shut down and lets Jetty exit
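
If I read the scripts right, step 1 boils down to a tiny plaintext handshake with Jetty's ShutdownMonitor. A minimal Java sketch of what bin/solr stop does, assuming the default STOP_PORT (Solr port minus 1000, so 7983 for 8983) and the default stop key:

    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Sketch of the stop handshake, assuming the defaults from bin/solr:
    // STOP_PORT = SOLR_PORT - 1000 (7983 for 8983) and STOP_KEY "solrrocks".
    public class JettyStop {
        public static void main(String[] args) throws Exception {
            try (Socket socket = new Socket("127.0.0.1", 7983)) {
                OutputStream out = socket.getOutputStream();
                // ShutdownMonitor reads the key on one line, the command on the next
                out.write("solrrocks\r\nstop\r\n".getBytes(StandardCharsets.UTF_8));
                out.flush();
                socket.getInputStream().read(); // block until Jetty closes the socket
            }
        }
    }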

I could have got it wrong though.

I was hoping that a Solr node would first publish itself as "not ready" in ZK before rejecting requests, but it seems this is all reversed, since the shutdown is initiated by Jetty?
So could we instead register our own shutdown port in Solr, and let our bin/solr script trigger that one? There we could orchestrate the shutdown as we want:

1. Remove the live_nodes znode in ZK
2. Publish itself as not ready on the api/node/health handler (or a new api/node/ready?)
3. Sleep for a few seconds (or longer, with an optional &shutdownDelay argument to our shutdown endpoint)
4. Trigger server.stop() to take down Jetty and kill the servlet
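
A purely hypothetical sketch of that orchestration; the class, the ready flag, and the removeEphemeralLiveNode() hook are all made up for illustration:

    import org.apache.solr.core.CoreContainer;
    import org.eclipse.jetty.server.Server;

    // Hypothetical only -- no such endpoint or class exists in Solr today.
    public class GracefulShutdown {
        private final CoreContainer cc;
        private final Server jetty;
        private volatile boolean ready = true; // what api/node/ready would report

        public GracefulShutdown(CoreContainer cc, Server jetty) {
            this.cc = cc;
            this.jetty = jetty;
        }

        public boolean isReady() {
            return ready;
        }

        public void drainAndStop(long drainMillis) throws Exception {
            cc.getZkController().removeEphemeralLiveNode(); // 1. leave live_nodes (assumed hook)
            ready = false;                                  // 2. start answering "not ready"
            Thread.sleep(drainMillis);                      // 3. give LBs and SolrJ time to notice
            jetty.stop();                                   // 4. tear down Jetty and the dispatch filter
        }
    }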

I filed https://issues.apache.org/jira/browse/SOLR-16722 to discuss a technical solution.
The primary goal is to drain traffic right before shutting a node down, but it could also be designed as a generic Readiness Probe <https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes> modeled from Kubernetes?
I'm also aware that any Solr client should be prepared to hit a dead node due to network/power events, and retry. But it won't hurt to be graceful whenever we can.

Happy to hear your thoughts. Is this a made-up problem?

Jan

Re: Draining a Solr node for traffic before shutting down

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/29/23 07:16, Jan Høydahl wrote:
> Trying to prevent traffic being sent to a Solr node that is going to shut down, to avoid interruption of service as seen from various clients.
> First part of the puzzle is signaling to any (external) load balancer to stop sending requests to the node.
> The other part is having SolrJ understand that the node is being stopped, and not routing internal requests to cores on the node.

I would use the ping handler with a healthcheck file for this.  A load 
balancer can send a request to /solr/CORE_NAME/admin/ping (probably with 
distrib=false) as a healthcheck ... probably best to have a dedicated 
replica of an empty collection for that purpose.  Disable the ping 
handler before shutting the node down, and the load balancer should stop 
sending requests there pretty quickly.
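
Roughly what the disable step could look like from SolrJ, assuming the core's ping 
handler is configured with a healthcheckFile init param (the node URL and the 
"healthcheck" core name here are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.client.solrj.request.SolrPing;

    public class DrainViaPing {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new Http2SolrClient.Builder("http://node-to-drain:8983/solr").build()) {
                // Deletes the healthcheck file, so /admin/ping starts answering 503
                // and the load balancer takes the node out of rotation.
                SolrPing disable = new SolrPing();
                disable.setActionDisable();
                disable.process(client, "healthcheck"); // dedicated empty core/replica
            }
        }
    }

Re-enabling after a restart would be the same request with setActionEnable().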

I would hope that existing mechanisms in SolrCloud are robust enough to 
handle this transparently in the event of server failure or a deletion 
request via the collections API, but I do not know if that is the case.

Thanks,
Shawn



Re: Draining a Solr node for traffic before shutting down

Posted by Houston Putman <ho...@apache.org>.
>
> Looks like there's room for improvement.  I too would want the desired
> state to be reflected in ZK first before attempting to make it happen.
> Remove live_nodes first, then iterate the local replicas to be state=DOWN,
> then close down all the things.
>

I agree with this, but only with regard to Jan's comments on the shutdown
logic. I've been hitting issues with Solr 9.0 where, if run in a Docker
image, it can take a long time for Solr to start up (over 30 seconds). The
operator will then try to kill Solr via the STOP_PORT, and this hangs until
the process is stopped via a kill signal. So I think we have an issue in
recent versions with the stop logic and how it coordinates between Jetty
and Solr.

The primary goal is to drain traffic right before shutting a node down, but
> it could also be designed as a generic Readiness Probe <
> https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes>
> modeled from Kubernetes?
>

I think the idea of a "not ready" state is distinct and not entirely
related to shutdown behavior.
The Solr Operator has logic to evict all replicas on a node before
restarting it if the data is ephemeral (since we don't want the node to
come back up with lost data).
So it would be great for us to mark that node as "not ready" before
evicting the replicas, so that the eviction process goes as smoothly as
possible.
I think this "not ready" state could also be used with other commands, or
just set directly by the user.

Basically there are multiple ways that this "not ready" command could be
triggered:

   - An explicit command from the user to set/unset this state on the node
     (a hypothetical sketch follows this list)
   - An optional param on REPLACENODE and DELETENODE that sets this state
     before doing the replace/delete logic
   - On node startup, I would imagine this state is unset by default
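
For the explicit command, I'd imagine something like the following from SolrJ;
the /admin/node/ready path and its ready param are invented purely for
illustration:

    import java.util.Map;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.MapSolrParams;

    public class MarkNodeNotReady {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new Http2SolrClient.Builder("http://target-node:8983/solr").build()) {
                // Invented endpoint -- nothing like this exists in Solr yet.
                GenericSolrRequest req = new GenericSolrRequest(
                    SolrRequest.METHOD.POST,
                    "/admin/node/ready",
                    new MapSolrParams(Map.of("ready", "false")));
                client.request(req);
            }
        }
    }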

There are reasons why you want to keep these nodes "live", since other Solr
nodes might still be interacting with them, but you want to avoid sending
updates and queries there.

I agree that I'm not entirely sure that the benefit is there, since this
would be a pretty big change. But I think it's definitely worth a
discussion.

- Houston


Re: Draining a Solr node for traffic before shutting down

Posted by David Smiley <ds...@apache.org>.
Looks like there's room for improvement.  I too would want the desired
state to be reflected in ZK first before attempting to make it happen.
Remove live_nodes first, then iterate the local replicas to be state=DOWN,
then close down all the things.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

