Posted to solr-user@lucene.apache.org by Shamik Bandopadhyay <sh...@gmail.com> on 2018/10/24 20:02:10 UTC

Does ConcurrentUpdateSolrClient apply for SolrCloud ?

Hi,

   I'm looking into the possibility of using ConcurrentUpdateSolrClient for
indexing a large volume of data instead of CloudSolrClient. Having an
async, batch API seems to be a better fit for us, since we tend to index a
lot of data periodically. As I'm looking into the API, I'm wondering if
this can be used for SolrCloud.

ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient.Builder(url)
    .withThreadCount(100)
    .withQueueSize(50)
    .build();

The Builder object only takes a single URL, and I'm not sure what that would
be in the case of SolrCloud. For example, if I have two shards with a couple
of replicas, what would the server URL be?

I was not able to find any relevant documentation or example to clarify my
doubt. Any pointers will be appreciated.

Thanks

Re: Does ConcurrentUpdateSolrClient apply for SolrCloud ?

Posted by Jason Gerlowski <ge...@gmail.com>.
One comment to complicate Erick's already-good advice.

> If a doc that needs to go to shard2 is received by a replica on shard1, it must be forwarded to the leader of shard2, introducing an extra hop.

Definitely true, but I don't think that's the only factor in the
relative performance of ConcurrentUpdateSolrClient (CUSC) vs
CloudSolrClient (CSC).  CUSC responds asynchronously when you're using
it for updates, which lets users continue on to prepare the next set
of docs while a CloudSolrClient might still be waiting to hear back
from Solr.  I benchmarked this recently and was surprised to see that
ConcurrentUpdateSolrClient actually came out ahead in some setups.
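
To illustrate the asynchronous hand-off described above, here is a minimal sketch that uses only JDK classes and no SolrJ at all. It is not how ConcurrentUpdateSolrClient is implemented; it only shows the general queue-plus-worker-threads pattern, where the caller returns as soon as a doc is queued. The class name AsyncHandoffSketch, its methods, and the queue and thread sizes are all hypothetical, and the "send" is just a counter where a real client would make an HTTP request.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an asynchronous update client (NOT SolrJ code):
// callers enqueue docs and return at once; background threads drain the queue.
final class AsyncHandoffSketch {
    // Unique sentinel object; compared by identity so no real doc can match it.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue;
    private final Thread[] workers;
    private final AtomicInteger sent = new AtomicInteger();

    AsyncHandoffSketch(int queueSize, int threadCount) {
        queue = new ArrayBlockingQueue<>(queueSize);
        workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(this::drain, "sender-" + i);
            workers[i].start();
        }
    }

    // Returns immediately unless the bounded queue is full, in which case it blocks.
    void add(String doc) {
        put(doc);
    }

    // Queues one sentinel per worker, waits for all workers to exit,
    // and returns how many docs were "sent".
    int closeAndCount() {
        for (int i = 0; i < workers.length; i++) {
            put(STOP);
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sent.get();
    }

    private void put(String item) {
        try {
            queue.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private void drain() {
        try {
            while (true) {
                String doc = queue.take();
                if (doc == STOP) {
                    return; // each worker consumes exactly one sentinel
                }
                sent.incrementAndGet(); // a real client would send the doc here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        AsyncHandoffSketch client = new AsyncHandoffSketch(50, 4);
        for (int i = 0; i < 1000; i++) {
            client.add("doc-" + i); // the producer keeps preparing docs meanwhile
        }
        System.out.println("sent=" + client.closeAndCount()); // prints sent=1000
    }
}
```

The bounded queue is the design choice that matters here: it lets the producer run ahead of the network, but only by a fixed amount, so a slow server eventually applies back-pressure to the code preparing docs.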

Now I'm not trying to say that CUSC performs better than CSC, just
that "It Depends" (Erick's TM) on the rest of your ETL code, on the
topology of your SolrCloud cluster, etc.

Good luck!

Jason




Re: Does ConcurrentUpdateSolrClient apply for SolrCloud ?

Posted by shamik <sh...@gmail.com>.
Thanks Erick, appreciate your help



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Does ConcurrentUpdateSolrClient apply for SolrCloud ?

Posted by Erick Erickson <er...@gmail.com>.
No best practices as such; "whatever works" about covers it. That's
not a huge query rate, especially if you have replicas per shard, so I
wouldn't worry too much about it. If you rack 100 clients all driving
Solr as hard as possible and people complain that query responses are
bad, you'll know where to look first.

About batching, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

YMMV of course. If I were going to give you a starting point for
batching it would be on the order of at least 100 per shard. So a 5
shard collection would have at least 500 Solr documents per call to
cloudSolrClient.add(doclist).
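
The arithmetic behind that starting point can be sketched as follows, using only JDK classes. This is an illustration of the rule of thumb above (at least 100 docs per shard per call), not a tuned recommendation; the class BatchSizingSketch and its method names are hypothetical, and each resulting batch is what would be handed to a single cloudSolrClient.add(doclist) call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (NOT part of SolrJ): sizes batches as docsPerShard * numShards
// and splits a document list into batches of that size.
final class BatchSizingSketch {

    // e.g. 5 shards * 100 docs per shard = 500 docs per add() call
    static int batchSize(int numShards, int docsPerShard) {
        if (numShards <= 0 || docsPerShard <= 0) {
            throw new IllegalArgumentException("numShards and docsPerShard must be positive");
        }
        return Math.multiplyExact(numShards, docsPerShard);
    }

    // Splits docs into consecutive batches; the last batch may be smaller.
    static <T> List<List<T>> partition(List<T> docs, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < docs.size(); from += size) {
            int to = Math.min(from + size, docs.size());
            batches.add(new ArrayList<>(docs.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1234; i++) {
            docs.add("doc-" + i);
        }
        int size = batchSize(5, 100); // 500
        List<List<String>> batches = partition(docs, size);
        // 1234 docs at 500 per batch -> batches of 500, 500 and 234
        System.out.println(batches.size() + " batches"); // prints 3 batches
    }
}
```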

Best,
Erick

Re: Does ConcurrentUpdateSolrClient apply for SolrCloud ?

Posted by shamik <sh...@gmail.com>.
Thanks Erick, that's extremely insightful. I'm not using batching and that's
the reason I was exploring ConcurrentUpdateSolrClient. Currently, N threads
are reusing the same CloudSolrClient to send data to Solr. Of course, the
single point of failure was my biggest concern with
ConcurrentUpdateSolrClient; thanks for clarifying my doubt.

"You also want to be a little careful how hard you drive Solr if you're also
serving queries at the same time, the more cycles you use for indexing the
fewer are available to serve queries."

Our Solr servers are also used to serve queries (50-100/minute). Our hard
commit is set to 10 minutes, while soft commit is disabled. Are there any best
practices (I know it's too generic, but specifically around indexing) that I
should follow?






Re: Does ConcurrentUpdateSolrClient apply for SolrCloud ?

Posted by Erick Erickson <er...@gmail.com>.
I wouldn't use ConcurrentUpdateSolrClient for the following reasons:

1> If a doc that needs to go to shard2 is received by a replica on
shard1, it must be forwarded to the leader of shard2, introducing an
extra hop. CloudSolrClient subdivides the batch and sends the docs to
the leader of the right shard automatically. You are batching, right?
You should.

2> CloudSolrClient does the above in parallel _already_.

3> You put the load for routing docs entirely on the single Solr node
you specify in the url.

4> You introduce a single point of failure (i.e. the node you specify
in the url).

5> If your indexing throughput is not what you need, you can string
together N SolrJ clients. Or you can create N threads in your indexing
client and still get the advantages of CloudSolrClient routing docs
correctly.
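
Points 1> through 3> can be illustrated with a small JDK-only sketch of client-side routing. It is a simplification under stated assumptions: the shardFor hash below is a stand-in (String.hashCode modulo the shard count) and is NOT the hash Solr's router uses, and the class ShardRoutingSketch and its methods are hypothetical. The sketch only shows the idea of subdividing a batch per shard, and counts how many docs would need an extra hop if the whole batch went to one node instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration (NOT SolrJ code) of splitting a batch by target shard.
final class ShardRoutingSketch {

    // Stand-in routing function; an assumption for illustration only.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // What a shard-aware client does conceptually: one sub-batch per shard,
    // each of which can go straight to that shard's leader.
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numShards) {
        Map<Integer, List<String>> byShard = new TreeMap<>();
        for (String id : docIds) {
            byShard.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>()).add(id);
        }
        return byShard;
    }

    // If every doc is sent to a node hosting entryShard, docs that belong to
    // any other shard must be forwarded (the extra hop).
    static long forwardedDocs(List<String> docIds, int numShards, int entryShard) {
        return docIds.stream()
                .filter(id -> shardFor(id, numShards) != entryShard)
                .count();
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add("doc-" + i);
        }
        groupByShard(ids, 2).forEach((shard, docs) ->
                System.out.println("shard" + (shard + 1) + ": " + docs.size() + " docs sent directly"));
        System.out.println("forwarded if all sent to shard1: " + forwardedDocs(ids, 2, 0));
    }
}
```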

You also want to be a little careful how hard you drive Solr if you're
also serving queries at the same time, the more cycles you use for
indexing the fewer are available to serve queries.
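
One way to combine point 5> with the caution above is to make the number of indexing threads an explicit, bounded knob. The following is a minimal JDK-only sketch, not SolrJ code: the class BoundedIndexerSketch and the sender callback are hypothetical, and the callback stands in for a call on one shared, thread-safe client (the thread describes N threads reusing the same CloudSolrClient).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical illustration (NOT SolrJ code): at most `threads` batches are in
// flight at once, so indexing pressure on the cluster is capped by one setting.
final class BoundedIndexerSketch {

    // Runs sender on every batch using a fixed pool and returns the number of
    // docs in batches whose sender call completed without throwing.
    static long indexAll(List<List<String>> batches, int threads,
                         Consumer<List<String>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong docs = new AtomicLong();
        for (List<String> batch : batches) {
            pool.execute(() -> {
                sender.accept(batch);        // assumed to be thread-safe
                docs.addAndGet(batch.size());
            });
        }
        pool.shutdown(); // no new tasks; queued batches still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return docs.get();
    }

    public static void main(String[] args) {
        List<List<String>> batches = new ArrayList<>();
        for (int b = 0; b < 10; b++) {
            List<String> batch = new ArrayList<>();
            for (int i = 0; i < 500; i++) {
                batch.add("doc-" + b + "-" + i);
            }
            batches.add(batch);
        }
        // The no-op sender is a placeholder for a real client call.
        long indexed = indexAll(batches, 4, batch -> { });
        System.out.println("indexed=" + indexed); // prints indexed=5000
    }
}
```

Lowering the thread count is then the first thing to try if query latency suffers while indexing runs.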

Best,
Erick

