Posted to solr-user@lucene.apache.org by Shawn Heisey <ap...@elyograg.org> on 2016/04/01 03:39:20 UTC

Re: Performance potential for updating (reindexing) documents

On 3/24/2016 11:57 AM, tedsolr wrote:
> My post was scant on details. The numbers I gave for collection sizes are
> projections for the future. I am in the midst of an upgrade that will be
> completed within a few weeks. My concern is that I may not be able to
> produce the throughput necessary to index an entire collection quickly
> enough (3 to 4 hours) for a large customer (100M docs).

I can fully rebuild one of my indexes, with 146 million docs, in 8-10
hours.  This is fairly inefficient indexing -- six large shards (not
cloud), each one running the dataimport handler, importing from MySQL. 
I suspect I could probably get two or three times this rate (and maybe
more) on the same hardware if I wrote a SolrJ application that uses
multiple threads for each Solr shard.

I know from experiments that the MySQL server can push over 100 million
rows to a SolrJ program in less than an hour, including constructing
SolrInputDocument objects.  That experiment just left out the
"client.add(docs);" line.  The bottleneck is definitely Solr.

Each machine holds three large shards (half the index), is running Solr
4.x (a 5.x upgrade is in the works), and has 64GB RAM with an 8GB heap.
Each shard is approximately 24.4 million docs and 28GB.  These machines
also hold another sharded index in the same Solr install, but it's quite
a lot smaller.

Thanks,
Shawn


Re: Performance potential for updating (reindexing) documents

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/1/2016 8:56 PM, Erick Erickson wrote:
> bq: The bottleneck is definitely Solr.
>
> Since you commented out the server.add(doclist), you're right to focus
> there. I've seen
> a few things that help.
>
> 1> Batch the documents, i.e. the doclist above should be on the order
> of 1,000 docs per add. Here are some numbers I worked up one time:
> https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

For that test, I was just seeing how fast MySQL could push data.  Based
on the results I saw from a small-scale test where I *did* add them,
letting the code run the add on the entire database with a single thread
would have taken forever.  I'm aware of the need to batch -- the code
did create batches, it just didn't send them.

I have a couple of ideas for the design of a multi-threaded indexing
program, but haven't worked out how to implement it.
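
One common shape for such a program -- offered here only as a sketch, not
the design referred to above -- is a single producer filling a bounded
queue while a fixed pool of workers drains it and sends batches to Solr.
The Solr URL, thread count, and field names below are placeholders:

// A sketch only: one producer fills a bounded queue, a fixed pool of
// workers drains it and sends batches of ~1,000 docs to Solr.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ThreadedIndexer {
    private static final int BATCH_SIZE = 1000;
    private static final SolrInputDocument POISON = new SolrInputDocument();

    public static void main(String[] args) throws Exception {
        final int workers = 6;  // tune against CPU headroom on the Solr side
        final String solrUrl = "http://solrhost:8983/solr/corename";
        BlockingQueue<SolrInputDocument> queue = new ArrayBlockingQueue<>(50000);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try (SolrClient client = new HttpSolrClient.Builder(solrUrl).build()) {
                    List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
                    while (true) {
                        SolrInputDocument doc = queue.take();
                        if (doc == POISON) {
                            break;
                        }
                        batch.add(doc);
                        if (batch.size() >= BATCH_SIZE) {
                            client.add(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);
                    }
                }
                return null;
            });
        }

        // The producer would normally read rows from JDBC; dummy docs here.
        for (long i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Long.toString(i));
            queue.put(doc);
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON);  // one shutdown marker per worker
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}

For a setup with several independent shards, there would typically be one
queue and worker group per shard so that each document lands on the shard
that owns it.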

> 3> Make sure you're using CloudSolrClient.

It's not SolrCloud, so that wouldn't really be helpful. :)

Thanks,
Shawn


Re: Performance potential for updating (reindexing) documents

Posted by Erick Erickson <er...@gmail.com>.
Shawn:

bq: The bottleneck is definitely Solr.

Since you commented out the server.add(doclist), you're right to focus
there. I've seen
a few things that help.

1> Batch the documents, i.e. the doclist above should be on the order of
1,000 docs per add (see the sketch after this list). Here are some numbers
I worked up one time:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

2> If your Solr CPUs aren't running flat out, then adding threads
until they are being pretty well hammered
is A Good Thing. Of course you have to balance that off against
anything else your servers are doing like
serving queries....

3> Make sure you're using CloudSolrClient.

4> If you still need more throughput, use more shards.....
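
A minimal SolrJ sketch of points 1> and 3> together, against a SolrCloud
collection -- the ZooKeeper addresses, collection name, and field names are
placeholders, and it uses a newer CloudSolrClient builder API than the Solr
versions mentioned above:

// Placeholders throughout: ZooKeeper hosts, collection name, field names.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudBatchIndexer {
    public static void main(String[] args) throws Exception {
        List<String> zkHosts = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            client.setDefaultCollection("mycollection");

            List<SolrInputDocument> batch = new ArrayList<>(1000);
            for (long i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Long.toString(i));
                doc.addField("title_t", "document " + i);
                batch.add(doc);

                if (batch.size() >= 1000) {   // point 1: ~1,000 docs per request
                    client.add(batch);        // point 3: routed to the right shard leader
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.commit();
        }
    }
}

As noted elsewhere in the thread, point 3> only applies when the target
really is SolrCloud; with a non-cloud, manually sharded index the same
batching advice still holds.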

Best,
Erick

On Thu, Mar 31, 2016 at 6:39 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 3/24/2016 11:57 AM, tedsolr wrote:
>> My post was scant on details. The numbers I gave for collection sizes are
>> projections for the future. I am in the midst of an upgrade that will be
>> completed within a few weeks. My concern is that I may not be able to
>> produce the throughput necessary to index an entire collection quickly
>> enough (3 to 4 hours) for a large customer (100M docs).
>
> I can fully rebuild one of my indexes, with 146 million docs, in 8-10
> hours.  This is fairly inefficient indexing -- six large shards (not
> cloud), each one running the dataimport handler, importing from MySQL.
> I suspect I could probably get two or three times this rate (and maybe
> more) on the same hardware if I wrote a SolrJ application that uses
> multiple threads for each Solr shard.
>
> I know from experiments that the MySQL server can push over 100 million
> rows to a SolrJ program in less than an hour, including constructing
> SolrInputDocument objects.  That experiment just left out the
> "client.add(docs);" line.  The bottleneck is definitely Solr.
>
> Each machine holds three large shards (half the index), is running Solr
> 4.x (a 5.x upgrade is in the works), and has 64GB RAM with an 8GB heap.
> Each shard is approximately 24.4 million docs and 28GB.  These machines
> also hold another sharded index in the same Solr install, but it's quite
> a lot smaller.
>
> Thanks,
> Shawn
>