Posted to users@solr.apache.org by Ixai Lanzagorta <ix...@rondhuit.com> on 2021/06/02 10:07:43 UTC
Balancing SolrCloud via LB vs CloudSolrClient
Hi, I'm trying to understand the difference between doing load balancing
via an HTTP proxy vs. using SolrJ's CloudSolrClient. I created a test
cluster (3x SolrCloud 8.8 nodes, 3x ZooKeeper nodes) and then tested a few
things:
1) Configure a proxy to do the load balancing. I figured that:
- I can delegate health checks to both the proxy, and my container
orchestration.
- I can connect to the cluster with SolrJ using the HttpSolrClient with the
proxy URL.
My concern is that, since health checks are done on the Solr instance (e.g.
GET /solr/), and not a specific collection, the proxy could redirect a
request to a healthy node with a faulty collection. Is this a real concern?
2) Alternatively, I could use CloudSolrClient and configure either the list
of `solrBaseUrls` or `zkHosts`.
The constraint here is that, when using CloudSolrClient, the SolrJ client
gets back a list of resolved IP addresses from the SolrCloud cluster or
ZooKeeper ensemble. The client must be able to reach those resolved IP
addresses or the connection will fail. Therefore, either the client must
live in the same network as the servers (subnet, VPN, etc.), or the servers
must be publicly accessible.
I'm new to Solr, so I wonder if there are any other specifics or alternatives
that I'm not considering. Are there any particular reasons why you'd
recommend one setup over the other?
Any insight is appreciated,
Ixai
Re: Balancing SolrCloud via LB vs CloudSolrClient
Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/2/2021 4:07 AM, Ixai Lanzagorta wrote:
> Hi, I'm trying to understand the difference between doing load balancing
> via an HTTP proxy vs. using SolrJ's CloudSolrClient. I created a test
> cluster (3x SolrCloud 8.8 nodes, 3x ZooKeeper nodes) and then tested a few
> things:
>
> 1) Configure a proxy to do the load balancing. I figured that:
> - I can delegate health checks to both the proxy, and my container
> orchestration.
> - I can connect to the cluster with SolrJ using the HttpSolrClient with the
> proxy URL.
If you use a load balancer at the HTTP level, SolrCloud is *STILL* going
to load balance requests across the cloud. I don't know that this is a
problem, just something to be aware of.
> My concern is that, since health checks are done on the Solr instance (e.g.
> GET /solr/), and not a specific collection, the proxy could redirect a
> request to a healthy node with a faulty collection. Is this a real concern?
SolrCloud is smart enough to properly handle queries even if the machine
that receives the query does not contain any part of that collection.
It will be load balanced between other machines in the cloud that DO
have the collection. If there is a problem with the local shards on the
machine, chances are good that the cloud will know this and will not
attempt to query them.
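For anyone who still wants the proxy to check a specific collection rather than only the instance, here is a minimal sketch of a collection-level check. It assumes the ping handler is reachable at /solr/<collection>/admin/ping on each node; the host name, the collection name, and the CollectionPing class with its methods are all hypothetical. Because a node without a local replica may forward the request, as described above, a successful ping likely shows that the collection is reachable through that node, not that a local replica is healthy.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class CollectionPing {
    // Builds the ping URL for one collection on one node (path is an assumption).
    static URI pingUri(String baseUrl, String collection) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return URI.create(base + "/" + collection + "/admin/ping?wt=json");
    }

    // Returns true only when the node answers the collection-level ping with HTTP 200.
    static boolean isHealthy(HttpClient http, String baseUrl, String collection) {
        HttpRequest request = HttpRequest.newBuilder(pingUri(baseUrl, collection))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    http.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() == 200;
        } catch (IOException e) {
            // Connection refused, timeout, and similar failures count as unhealthy.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        HttpClient http = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        // Hypothetical node and collection names.
        System.out.println(isHealthy(http, "http://solr1.example.com:8983/solr", "mycollection"));
    }
}
```

Most HTTP proxies can be pointed at the same URL directly in their health-check configuration, so the Java code is only needed if the check runs inside the application.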
> 2) Alternatively, I could use CloudSolrClient and configure either the list
> of `solrBaseUrls` or `zkHosts`.
If you use CloudSolrClient, then no load balancer of any kind is
required. The client itself is fully aware of the entire cloud, and
watches for changes in the ZK database -- restarting the client is not
necessary if you shut down a Solr instance or add new ones to the cloud.
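A minimal sketch of that setup, assuming SolrJ 8.x on the classpath and a three-node ZooKeeper ensemble without a chroot. The ZooKeeper addresses, the collection name, and the CloudClientSketch class are hypothetical; only the SolrJ classes and methods come from the library.

```java
import java.io.IOException;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

class CloudClientSketch {
    public static void main(String[] args) throws SolrServerException, IOException {
        // Hypothetical ZooKeeper addresses; Optional.empty() means no chroot is assumed.
        List<String> zkHosts = List.of(
                "zk1.example.com:2181",
                "zk2.example.com:2181",
                "zk3.example.com:2181");

        // The client reads the cluster state from ZooKeeper and talks to the Solr
        // nodes directly, so every node must be reachable from this process.
        try (CloudSolrClient client =
                     new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            client.setDefaultCollection("mycollection"); // hypothetical collection
            QueryResponse response = client.query(new SolrQuery("*:*"));
            System.out.println("numFound: " + response.getResults().getNumFound());
        }
    }
}
```

The builder also has a constructor that takes a list of Solr base URLs instead of ZooKeeper hosts, which corresponds to the `solrBaseUrls` option mentioned in the question.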
> The constraint here is that, when using CloudSolrClient, the SolrJ client
> gets back a list of resolved IP addresses from the SolrCloud cluster or
> ZooKeeper ensemble. The client must be able to reach those resolved IP
> addresses or the connection will fail. Therefore, either the client must
> live in the same network as the servers (subnet, VPN, etc.), or the servers
> must be publicly accessible.
Yes, you are correct that the client must be able to reach all the Solr
servers. This is a basic SolrCloud requirement, and it's not
negotiable. You would have to use HttpSolrClient instead of
CloudSolrClient if only some of the machines would be reachable, which
might mean a reduction in functionality.
> I'm new to Solr, so I wonder if there are any other specifics or alternatives
> that I'm not considering. Are there any particular reasons why you'd
> recommend one setup over the other?
If the software is written in Java, I would recommend SolrJ's
CloudSolrClient. No load balancer is required there.
If you need to use HttpSolrClient or something else that is not cloud
aware, then you should use a load balancer that talks to at least two of
the Solr machines. And you should have some kind of redundancy on the
load balancer itself.
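To make the "at least two machines" point concrete, here is a rough sketch of round-robin selection with failover across several Solr base URLs. It is an illustration of the idea only, not how any particular load balancer works; the RoundRobinSketch class, its pick method, and the health predicate are all hypothetical. SolrJ 8.x also ships LBHttpSolrClient, which may be a better fit than hand-rolled code when the client is Java.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

class RoundRobinSketch {
    private final List<String> baseUrls;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(List<String> baseUrls) {
        if (baseUrls.size() < 2) {
            throw new IllegalArgumentException("use at least two Solr nodes");
        }
        this.baseUrls = List.copyOf(baseUrls);
    }

    // Returns the next node in rotation that passes the health check,
    // or null when no node passes it.
    String pick(Predicate<String> isHealthy) {
        for (int attempt = 0; attempt < baseUrls.size(); attempt++) {
            int index = Math.floorMod(next.getAndIncrement(), baseUrls.size());
            String candidate = baseUrls.get(index);
            if (isHealthy.test(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

The selector itself is still a single point of failure if it lives in one process, which is why the load balancer needs its own redundancy.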
Thanks,
Shawn
Re: Balancing SolrCloud via LB vs CloudSolrClient
Posted by matthew sporleder <ms...@gmail.com>.
For a Java app, using SolrJ with a persistent pool is probably best.
I've used Varnish in front of Solr with PHP clients and had a good
experience, but for Java, connecting directly is likely best.