Posted to users@solr.apache.org by Ixai Lanzagorta <ix...@rondhuit.com> on 2021/06/02 10:07:43 UTC

Balancing SolrCloud via LB vs CloudSolrClient

Hi, I'm trying to understand the difference between doing load balancing
via an HTTP proxy vs. using SolrJ's CloudSolrClient. I created a test
cluster (3x SolrCloud 8.8 nodes, 3x ZooKeeper nodes) and then tested a few
things:

1) Configure a proxy to do the load balancing. I figured that:
- I can delegate health checks to both the proxy and my container
orchestration.
- I can connect to the cluster with SolrJ using the HttpSolrClient with the
proxy URL.

My concern is that, since health checks are done on the Solr instance (e.g.
GET /solr/), and not a specific collection, the proxy could redirect a
request to a healthy node with a faulty collection. Is this a real concern?

2) Alternatively, I could use CloudSolrClient and configure either the list
of `solrBaseUrls` or `zkHosts`.

The constraint here is that, when using CloudSolrClient, the SolrJ client
gets back a list of resolved IP addresses from the SolrCloud cluster or
ZooKeeper ensemble. The client must be able to reach those resolved IP
addresses or the connection will fail. Therefore, either the client must
live in the same network as the servers (subnet, VPN, etc.), or the servers
must be publicly accessible.

I'm new to Solr, so I wonder if there are any other specifics or alternatives
that I'm not considering. Are there any particular reasons why you'd
recommend one setup over the other?

Any insight is appreciated,
Ixai

Re: Balancing SolrCloud via LB vs CloudSolrClient

Posted by Shawn Heisey <ap...@elyograg.org>.

On 6/2/2021 4:07 AM, Ixai Lanzagorta wrote:
> Hi, I'm trying to understand the difference between doing load balancing
> via an HTTP proxy vs. using SolrJ's CloudSolrClient. I created a test
> cluster (3x SolrCloud 8.8 nodes, 3x ZooKeeper nodes) and then tested a few
> things:
> 
> 1) Configure a proxy to do the load balancing. I figured that:
> - I can delegate health checks to both the proxy, and my container
> orchestration.
> - I can connect to the cluster with SolrJ using the HttpSolrClient with the
> proxy URL.

If you use a load balancer at the HTTP level, SolrCloud is *STILL* going 
to load balance requests across the cloud.  I don't know that this is a 
problem, just something to be aware of.

> My concern is that, since health checks are done on the Solr instance (e.g.
> GET /solr/), and not a specific collection, the proxy could redirect a
> request to a healthy node with a faulty collection. Is this a real concern?

SolrCloud is smart enough to properly handle queries even if the machine 
that receives the query does not contain any part of that collection. 
It will be load balanced between other machines in the cloud that DO 
have the collection.  If there is a problem with the local shards on the 
machine, chances are good that the cloud will know this and will not 
attempt to query them.
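
If a per-collection probe is still wanted on the proxy side, one option is to point the health check at the collection's ping handler rather than at `GET /solr/`. The following is a minimal sketch using only the JDK (Java 11+). It assumes the collection exposes the ping handler at `/solr/<collection>/admin/ping`; the class name, host name, and collection name are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Sketch: probe one collection through one node instead of probing /solr/ alone. */
final class CollectionHealthCheck {

    // Assumption: the ping handler is reachable at <base>/<collection>/admin/ping.
    static URI pingUri(String nodeBaseUrl, String collection) {
        String base = nodeBaseUrl.endsWith("/")
                ? nodeBaseUrl.substring(0, nodeBaseUrl.length() - 1)
                : nodeBaseUrl;
        String encoded = URLEncoder.encode(collection, StandardCharsets.UTF_8);
        return URI.create(base + "/" + encoded + "/admin/ping?wt=json");
    }

    // Only HTTP 200 counts as healthy; any other status is treated as a failure.
    static boolean isHealthy(int httpStatus) {
        return httpStatus == 200;
    }

    // Sends the probe; a node that cannot be reached or times out is unhealthy.
    static boolean probe(HttpClient client, URI uri) throws InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return isHealthy(response.statusCode());
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical node and collection, printed only to show the URL shape.
        System.out.println(pingUri("http://solr1.example.com:8983/solr", "products"));
    }
}
```

Because SolrCloud forwards requests between nodes, a successful ping through one node says the collection is reachable through that node, not necessarily that the node hosts a healthy local replica.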

> 2) Alternatively, I could use CloudSolrClient and configure either the list
> of `solrBaseUrls` or `zkHosts`.

If you use CloudSolrClient, then no load balancer of any kind is 
required.  The client itself is fully aware of the entire cloud, and 
watches for changes in the ZK database -- restarting the client is not 
necessary if you shut down a Solr instance or add new ones to the cloud.
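
For the `zkHosts` route, SolrJ 8.x offers a `CloudSolrClient.Builder` constructor that takes a list of ZooKeeper hosts plus an optional chroot. A common stumbling block is that deployments often hold a single connection string such as `zk1:2181,zk2:2181,zk3:2181/solr`. The sketch below, using only the JDK, splits such a string into those two pieces. The class name and host names are hypothetical, and the SolrJ call itself appears only in a comment because it needs SolrJ on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Sketch: split "host1:2181,host2:2181/chroot" into a host list and an optional chroot. */
final class ZkConnectString {
    final List<String> hosts;
    final Optional<String> chroot;

    private ZkConnectString(List<String> hosts, Optional<String> chroot) {
        this.hosts = hosts;
        this.chroot = chroot;
    }

    static ZkConnectString parse(String connectString) {
        // Everything from the first '/' onward is the chroot, if one is present.
        int slash = connectString.indexOf('/');
        String hostPart = slash < 0 ? connectString : connectString.substring(0, slash);
        Optional<String> chroot = slash < 0
                ? Optional.empty()
                : Optional.of(connectString.substring(slash));

        List<String> hosts = Arrays.stream(hostPart.split(","))
                .map(String::trim)
                .filter(h -> !h.isEmpty())
                .collect(Collectors.toList());
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("no ZooKeeper hosts in: " + connectString);
        }
        return new ZkConnectString(hosts, chroot);
    }

    public static void main(String[] args) {
        ZkConnectString zk = parse("zk1:2181,zk2:2181,zk3:2181/solr");
        System.out.println(zk.hosts + " chroot=" + zk.chroot);
        // With SolrJ 8.x on the classpath, the two values would be passed on as:
        //   new CloudSolrClient.Builder(zk.hosts, zk.chroot).build();
    }
}
```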

> The constraint here is that, when using CloudSolrClient, the SolrJ client
> gets back a list of resolved IP addresses from the SolrCloud cluster or
> ZooKeeper ensemble. The client must be able to reach those resolved IP
> addresses or the connection will fail. Therefore, either the client must
> live in the same network as the servers (subnet, VPN, etc.), or the servers
> must be publicly accessible.

Yes, you are correct that the client must be able to reach all the Solr 
servers.  This is a basic SolrCloud requirement, and it's not 
negotiable.  You would have to use HttpSolrClient instead of 
CloudSolrClient if only some of the machines would be reachable, which
might mean a reduction in functionality, since HttpSolrClient does not
track the cluster state in ZooKeeper.

> I'm new to Solr, so I wonder if there are any other specifics or alternatives
> that I'm not considering. Are there any particular reasons why you'd
> recommend one setup over the other?

If the software is Java, the recommendation would be SolrJ's 
CloudSolrClient.  No load balancer is required there.

If you need to use HttpSolrClient or something else that is not cloud 
aware, then you should use a load balancer that talks to at least two of 
the Solr machines.  And you should have some kind of redundancy on the 
load balancer itself.
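
To make the "at least two Solr machines" advice concrete, here is a minimal sketch of what a balancer does on each request: rotate through the configured nodes and skip any that fail a health check. It uses only the JDK; the class name is hypothetical and the health predicate is supplied by the caller. It is an illustration of the idea, not a replacement for a real proxy or for SolrJ's own `LBHttpSolrClient`.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Sketch: round-robin selection over Solr node URLs, skipping unhealthy ones. */
final class RoundRobinSelector {
    private final List<String> nodeUrls;
    private final AtomicInteger cursor = new AtomicInteger();

    RoundRobinSelector(List<String> nodeUrls) {
        // A single backend gives no failover, so require at least two.
        if (nodeUrls.size() < 2) {
            throw new IllegalArgumentException("configure at least two Solr nodes");
        }
        this.nodeUrls = List.copyOf(nodeUrls);
    }

    /** Returns the next healthy node, or empty if every node fails the check. */
    Optional<String> next(Predicate<String> isHealthy) {
        for (int attempt = 0; attempt < nodeUrls.size(); attempt++) {
            // floorMod keeps the index valid even after the counter overflows.
            int index = Math.floorMod(cursor.getAndIncrement(), nodeUrls.size());
            String candidate = nodeUrls.get(index);
            if (isHealthy.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical node URLs; every node is treated as healthy here.
        RoundRobinSelector selector = new RoundRobinSelector(List.of(
                "http://solr1.example.com:8983/solr",
                "http://solr2.example.com:8983/solr"));
        System.out.println(selector.next(url -> true).orElse("none"));
    }
}
```

The selector itself is still a single point of failure, which is why the balancer needs redundancy of its own.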

Thanks,
Shawn

Re: Balancing SolrCloud via LB vs CloudSolrClient

Posted by matthew sporleder <ms...@gmail.com>.

For a Java app, using SolrJ with a persistent connection pool is probably
best. I've used Varnish in front of Solr with PHP clients and had a good
experience, but for Java, connecting directly is likely best.

