Posted to solr-user@lucene.apache.org by Shawn Heisey <ap...@elyograg.org> on 2016/11/18 14:02:35 UTC

Re: SolrJ bulk indexing documents - HttpSolrClient vs. ConcurrentUpdateSolrClient

On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
> I am looking to improve indexing speed when loading many documents as part of an import. I am using the SolrJ client, and currently I add the documents one-by-one using HttpSolrClient and its method add(SolrInputDocument doc, int commitWithinMs).

If you batch them (probably around 500 to 1000 at a time), indexing
speed will go up.  Below you have described the add methods used for
batching.
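
To make the batching concrete, here is a minimal sketch. Only the list-partitioning helper is real, runnable code; the SolrJ usage in the comment is an assumption based on the add(Collection, int) method mentioned in this thread, and the core name and URL are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split documents into fixed-size batches before sending to Solr.
public class BatchIndexer {

    // Partition a list into batches of at most `size` elements
    // (500 to 1000 is the range suggested above).
    public static <T> List<List<T>> batches(List<T> items, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            out.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return out;
    }

    // With SolrJ 5.x, each batch would then be sent roughly like this
    // (URL and core name are placeholders):
    //
    //   HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
    //   for (List<SolrInputDocument> batch : batches(docs, 500)) {
    //       client.add(batch, 60000);  // one request per batch, commitWithin 60s
    //   }
}
```

Sending one HTTP request per batch instead of per document is where nearly all of the speedup comes from.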

> My first step would be to change that to use add(Collection<SolrInputDocument> docs, int commitWithinMs) instead, which I expect would already improve performance.
> Does it matter which method I use? Besides the method taking a Collection<SolrInputDocument> there is also one that takes an Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient? Should I use it for bulk indexing instead of HttpSolrClient?
>
> Currently we are on version 5.5.0 of Solr, and we don't run SolrCloud, i.e. only one instance etc.
> Indexing 39657 documents (which result in a core size of approx. 127MB) took about 10 minutes with the one-by-one approach.

The concurrent client will send updates in parallel, without any
threading code in your own program, but it has one glaring
disadvantage: indexing failures will be logged (via SLF4J), but your
program will NOT be informed about them.  The entire Solr cluster
could be down, and all your indexing requests would still appear to
succeed from your program's point of view.  Here's an issue I filed on
the problem.  It hasn't been fixed because there really isn't a good
solution.

https://issues.apache.org/jira/browse/SOLR-3284

The concurrent client swallows all exceptions that occur during add()
operations -- they are conducted in the background.  This might also
happen during delete operations, though I am unsure about that.  You
won't know about any problems unless those problems are still there when
your program tries an operation that can't happen in the background,
like commit or query.  If you're relying on automatic commits, your
indexing program might NEVER become aware of problems on the server end.
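
The failure mode described above can be illustrated with a small self-contained simulation. This is NOT real SolrJ code; the BackgroundClient class below merely stands in for ConcurrentUpdateSolrClient's behavior: add() swallows errors, and only a synchronous call like commit() surfaces them.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simulation of a client whose add() runs in the background and logs
// errors instead of throwing them, so the caller only learns of
// failures on a blocking call such as commit().
public class SwallowedErrorsDemo {

    static class BackgroundClient {
        final List<String> errors = new ArrayList<>();

        // add() never throws -- failures are recorded (in real life,
        // logged via SLF4J) and the caller sees apparent success.
        void add(String doc) {
            try {
                send(doc);
            } catch (IOException e) {
                errors.add(e.getMessage()); // swallowed, not propagated
            }
        }

        // commit() is synchronous, so earlier failures surface here.
        void commit() throws IOException {
            if (!errors.isEmpty()) {
                throw new IOException("earlier add() failures: " + errors);
            }
        }

        // Simulate a dead server: every send fails.
        void send(String doc) throws IOException {
            throw new IOException("server down");
        }
    }
}
```

Note that if the program relies on automatic commits and never calls commit() or query() itself, even this late signal never arrives.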

In a nutshell ... the concurrent client is great for initial bulk
loading (if and only if you don't need error detection), but not all
that useful for ongoing update activity that runs all the time.

If you set up multiple indexing threads in your own program, you can use
HttpSolrClient or CloudSolrClient with concurrency comparable to the
concurrent client's, without sacrificing the ability to detect errors
during indexing.
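
A sketch of that multi-threaded approach, assuming a thread pool and pluggable sender (with real SolrJ the sender would be something like batch -> client.add(batch, 60000), which is an assumption here): each batch's exception is rethrown from its Future, so the caller sees every failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Sketch: fan batches out over a fixed thread pool, then collect
// results so indexing errors are reported back to the caller --
// the error visibility that ConcurrentUpdateSolrClient gives up.
public class ParallelIndexer {

    // Returns the number of batches that failed.
    public static <T> int indexAll(List<List<T>> batches,
                                   Consumer<List<T>> sender,
                                   int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        for (List<T> batch : batches) {
            futures.add(pool.submit(() -> sender.accept(batch)));
        }
        pool.shutdown();

        int failed = 0;
        for (Future<?> f : futures) {
            try {
                f.get();            // rethrows any exception from the sender
            } catch (ExecutionException e) {
                failed++;           // the caller learns about each failure
            }
        }
        return failed;
    }
}
```

In real use you would inspect or log the ExecutionException causes rather than just counting them, but the key point is that nothing is silently dropped.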

Indexing 40K documents in batches should take very little time, and in
my opinion is not worth the disadvantages of the concurrent client, or
taking the time to write multi-threaded code.  If you reach the point
where you've got millions of documents, then you might want to consider
writing multi-threaded indexing code.

Thanks,
Shawn


Re: SolrJ bulk indexing documents - HttpSolrClient vs. ConcurrentUpdateSolrClient

Posted by Erick Erickson <er...@gmail.com>.
Here's some numbers for batching improvements:

https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

And I totally agree with Shawn that for 40K documents anything more
complex is probably overkill.

Best,
Erick
