Posted to solr-user@lucene.apache.org by Shivaji Dutta <sd...@hortonworks.com> on 2016/01/13 03:42:22 UTC

ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud

We have a customer that needs to update a few billion documents in SolrCloud. I know the suggested client is CloudSolrClient, for its load-balancing feature.

As per docs - CloudSolrClient

SolrJ client class to communicate with SolrCloud. Instances of this class communicate with Zookeeper to discover Solr endpoints for SolrCloud collections, and then use the LBHttpSolrClient<http://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/impl/LBHttpSolrClient.html> to issue requests. This class assumes the id field for your documents is called 'id' - if this is not the case, you must set the right name with setIdField(String)<http://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html#setIdField%28java.lang.String%29>.

As per the docs - ConcurrentUpdateSolrClient

ConcurrentUpdateSolrClient buffers all added documents and writes them into open HTTP connections. This class is thread safe. Params from UpdateRequest<http://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/request/UpdateRequest.html> are converted to http request parameters. When params change between UpdateRequests a new HTTP request is started. Although any SolrClient request can be made with this implementation, it is only recommended to use ConcurrentUpdateSolrClient with /update requests. The class HttpSolrClient<http://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrClient.html> is better suited for the query interface.

Now, with ConcurrentUpdateSolrClient I am able to use a queue and a pool of threads, which makes it more attractive than CloudSolrClient, which routes requests through an LBHttpSolrClient once it has discovered the set of nodes to update.

What is the recommended API for updating large numbers of documents at a higher throughput rate?
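
For reference, here is roughly how each client is constructed (a minimal sketch against the SolrJ 5.x API; the ZooKeeper hosts, Solr URL, queue size, thread count, and collection name below are placeholders for our setup):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ClientComparison {
    public static void main(String[] args) throws Exception {
        // CloudSolrClient: discovers endpoints via ZooKeeper and routes
        // each document to the correct shard leader.
        CloudSolrClient cloud = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        cloud.setDefaultCollection("mycollection");

        // ConcurrentUpdateSolrClient: buffers adds in a queue (10000 here)
        // drained by a pool of background threads (4 here), all writing to
        // the single URL it was given.
        ConcurrentUpdateSolrClient cusc = new ConcurrentUpdateSolrClient(
            "http://solr1:8983/solr/mycollection", 10000, 4);

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            batch.add(doc);
        }
        cloud.add(batch);   // or cusc.add(batch)
        cloud.commit();

        cloud.close();
        cusc.close();
    }
}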

Thanks,

Shivaji Dutta

Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud

Posted by Shivaji Dutta <sd...@hortonworks.com>.
Thanks Erick.



Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud

Posted by Erick Erickson <er...@gmail.com>.
My first thought is "yes, you're overthinking it" ;)....

Here's something to get you started for indexing
through a Java program:
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

Of course you _could_ use Lucene to build your indexes
and just copy them "to the right place", but there are
a number of ways that can go wrong; here are a few:
1> if you have shards, you'd have to mimic the automatic
document routing.
2> you'd have to mimic the analysis chain you've defined for
each field in Solr.
3> you'd have to copy the built Lucene indexes to the right shard
(assuming you got <1> right).

Depending on the docs in question, if they need Tika parsing
you can do that in SolrJ too, see:
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
(this is a bit outdated; in particular, a couple of class
names have changed).
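
The heart of it looks something like this (just a sketch using
Tika's AutoDetectParser; the field names are made up, so map
them to your own schema):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtract {
    // Parse one file locally with Tika and turn it into a Solr document.
    static SolrInputDocument toSolrDoc(String path) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            parser.parse(in, handler, metadata);
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path);                              // made-up fields
        doc.addField("text_t", handler.toString());
        doc.addField("content_type_s", metadata.get("Content-Type"));
        return doc;
    }
}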

SolrJ uses an efficient binary format to move the docs. I regularly
get 20K docs/second on my local setup, see:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
I was indexing 11M Wiki articles in about 10 minutes on some tests
recently. Solr can scale that close to linearly with more shards and
enough indexing clients. Is it really worth the effort of using Lucene?
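
The batching itself is trivial; a quick sketch (assume 'client' is
a CloudSolrClient pointed at your collection, and 'records' and
'MyRecord' stand in for whatever your document source looks like):

// Send documents in batches of 1000 rather than one at a time.
List<SolrInputDocument> batch = new ArrayList<>(1000);
for (MyRecord rec : records) {          // 'records'/'MyRecord' are hypothetical
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", rec.getId());
    doc.addField("text_t", rec.getText());
    batch.add(doc);
    if (batch.size() >= 1000) {         // tune the batch size to taste
        client.add(batch);
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    client.add(batch);                  // flush the remainder
}
client.commit();                        // or rely on autoCommit settings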

FWIW,
Erick




Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Shivaji Dutta <sd...@hortonworks.com> wrote:
> If I have a repository of millions of documents, would it not make sense
> to just index them locally and then copy the index file over to Solr and
> have it read from it?

It is certainly possible and for some scenarios it will work well.

We do it locally: we build a shard, optimize it, and add it to our SolrCloud, where it is never updated again. This works for us because our data are immutable, and one of the benefits is drastically lower hardware requirements for the search machines. There is a small write-up at https://sbdevel.wordpress.com/net-archive-search/
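
If you want to experiment with that step, one option (not the tooling we use; just a sketch of the stock CoreAdmin MERGEINDEXES call in SolrJ, with placeholder core and path names) is to merge the locally built index into a live core:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class MergeBuiltIndex {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient admin = new HttpSolrClient("http://solr1:8983/solr")) {
            // Merge the index directory built offline into the target core.
            // The core must be able to read the directory from its filesystem.
            CoreAdminRequest.mergeIndexes(
                "collection1_shard1_replica1",               // target core (placeholder)
                new String[] { "/data/built-shard/index" },  // built index dir (placeholder)
                new String[0],                               // no source cores
                admin);
        }
    }
}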

There was a talk at Lucene/Solr Revolution 2015 about using a similar workflow for indexing logfiles. I think it was this one: https://www.youtube.com/watch?v=u5_vzcYYWfc

Bear in mind that both Brett Hoerner (from the talk) and we are working with billions of documents and terabytes of index data. As you need to build your own logistics system, there is the usual trade-off of development & maintenance cost vs. just buying beefier hardware.

- Toke Eskildsen

Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud

Posted by Shivaji Dutta <sd...@hortonworks.com>.
Erick and Shawn,

Thanks for the input. In the process below we are posting the documents to
Solr over an HTTP connection in batches.

I am trying to solve the same problem in a different way:

I have used Lucene back in the day, when I would index documents locally
on disk and run search queries on them. Big fan of Lucene.

I was wondering if something like that is possible here.

If I have a repository of millions of documents, would it not make sense
to just index them locally and then copy the index file over to Solr and
have it read from it?

Any thoughts or blogs that could help me, or maybe I am overthinking this?

Thanks
Shivaji




Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud

Posted by Erick Erickson <er...@gmail.com>.
It's usually not all that difficult to write a multi-threaded
client that uses CloudSolrClient, or even fire up multiple
instances of the SolrJ client (assuming they can work
on discrete sections of the documents you need to index).

That avoids the problem Shawn alludes to, plus other
issues. If you do _not_ use CloudSolrClient, then all the
docs go to some node in the system (and you really should
update in batches, see:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/).
The node that receives the packet then sub-divides it
into groups based on what shard each doc belongs to
and forwards them to the leaders for those shards, very
significantly increasing the number of conversations
being carried on between Solr nodes, multiplied by the
number of threads you're specifying with CUSC. (I really
regret the renaming from ConcurrentUpdateSolrServer; I
liked writing CUSS.)

With CloudSolrClient, you can scale nearly linearly with
the number of shards. Not so with CUSC.
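
In skeleton form, that multi-threaded client is not much code
(a sketch only; the zkHost, collection name, thread count, and
the synthetic document generator are all placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class MultiThreadedIndexer {
    static final int THREADS = 8;
    static final int DOCS_PER_THREAD = 100000;

    public static void main(String[] args) throws Exception {
        final CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181/solr");
        client.setDefaultCollection("mycollection");

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            final int slice = t;   // each thread owns a discrete slice of the ids
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<>();
                        for (int i = slice * DOCS_PER_THREAD;
                                 i < (slice + 1) * DOCS_PER_THREAD; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", Integer.toString(i));
                            batch.add(doc);
                            if (batch.size() >= 1000) {   // batch the updates
                                client.add(batch);
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) client.add(batch);
                    } catch (Exception e) {
                        // unlike CUSC, failures actually surface here
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.commit();
        client.close();
    }
}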

FWIW,
Erick


Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/12/2016 7:42 PM, Shivaji Dutta wrote:
> Now, with ConcurrentUpdateSolrClient I am able to use a queue and a pool of threads, which makes it more attractive than CloudSolrClient, which routes requests through an LBHttpSolrClient once it has discovered the set of nodes to update.
>
> What is the recommended API for updating large numbers of documents at a higher throughput rate?

ConcurrentUpdateSolrClient has one flaw -- it swallows all exceptions
that happen during indexing.  Your application will never know about any
problems that occur during indexing.  The entire cluster could be down,
and your application would never know about it until you tried an
explicit commit operation.  Commit is an operation that is not handled
in the background by CUSC, so I would expect any exception to be
returned for that operation.

This flaw is inherent to its design; the behavior would be very
difficult to change.

If you don't care about your application getting error messages when
indexing requests fail, then CUSC is perfect.  This might be the case if
you are doing initial bulk loading.  For normal index updates after
initial loading, you would not want to use CUSC.

If you do care about getting error messages when bulk indexing requests
fail, then you'll want to build a program with CloudSolrClient where you
create multiple indexing threads that all use the same client object.
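
If you do end up using CUSC for a bulk load, one partial mitigation
(just a sketch; it tells you that something failed, not which
documents) is to subclass it and override handleError, which I
believe is the hook CUSC calls when a background request fails:

import java.util.concurrent.atomic.AtomicInteger;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;

// Subclass CUSC so background failures are at least counted and
// logged instead of disappearing silently.
public class NoisyCUSC extends ConcurrentUpdateSolrClient {
    public final AtomicInteger errorCount = new AtomicInteger();

    public NoisyCUSC(String solrUrl, int queueSize, int threadCount) {
        super(solrUrl, queueSize, threadCount);
    }

    @Override
    public void handleError(Throwable ex) {
        errorCount.incrementAndGet();
        System.err.println("Background update failed: " + ex);
    }
}

Check errorCount after the load finishes; if it is non-zero, you
know at least one request was lost.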

Thanks,
Shawn