Posted to solr-user@lucene.apache.org by "Kommu, Vinodh K." <vk...@dtcc.com> on 2020/05/20 11:25:56 UTC

Different indexing times for two different collections with different data sizes

Hi,

Recently we noticed that one of our largest collections (shards = 6; replication factor = 3), which holds up to 1TB of data and nearly 3.2 billion docs, is taking longer to index than it used to. To see the indexing time difference, we created another collection using the large collection's configs (schema.xml and solrconfig.xml) and loaded it with about 100 million docs, roughly 60G of data. We then indexed exactly the same 25 million doc data file into both collections, which showed a clear timing difference. BTW, we are running Solr 7.7.1.

The original large collection (3.2 billion docs) completed indexing in ~100 mins.
The newly created collection (100 million docs) completed in ~70 mins.

Is this indexing time difference due to the amount of data each collection holds? If so, how can we improve indexing performance on the larger collection? Would adding more shards help here?

Also, is there a threshold for how much a single shard can hold, in terms of index size and number of docs, before we should add a new shard?

Any answers would really help!!


Thanks & Regards,
Vinodh


Re: Different indexing times for two different collections with different data sizes

Posted by Erick Erickson <er...@gmail.com>.
The easy question first. There is an absolute limit of 2B docs per shard. Internally, Lucene assigns each document an integer internal ID that overflows after 2B. That count includes deleted docs, so the “maxDoc” figure on the admin page is what counts against the limit. Practically, as you are finding, you run into performance issues at significantly fewer than 2B docs. Note that when segments are merged, the internal IDs get reassigned...
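If you want to check where each shard stands, the Luke handler reports numDocs and maxDoc per core, so hit one replica of each shard. Something like this (untested; the core name below is just the usual SolrCloud naming, substitute your own host and collection):

  # maxDoc includes deleted docs -- it's the number that counts against the ~2B per-shard limit
  curl "http://localhost:8983/solr/yourCollection_shard1_replica_n1/admin/luke?numTerms=0&wt=json"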

Indexing scales pretty linearly with the number of shards, _assuming_ you’re adding more hardware. To really answer the question you need to look at what the bottleneck is on your current system. IOW, “It Depends(tm)”.

Let’s claim your current system is running all your CPUs flat out. Or I/O is maxed out. Adding more shards to the existing hardware won’t help. Perhaps you don’t even need more shards, you just need to move some of your replicas to new hardware.
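Moving a replica is just a Collections API call along these lines (untested; hostname and replica name are placeholders, check the actual node and core_node names in your cluster state first):

  curl "http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=yourCollection&replica=core_node12&targetNode=newhost:8983_solr"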

OTOH, let’s claim that your indexing isn’t straining your current hardware at all, then adding more shards to existing hardware should increase throughput.
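With the default compositeId router you can’t simply tack a new shard onto an existing collection; you split an existing one. Roughly (untested; shard name is a placeholder, and the two sub-shards initially land on the same node as the parent, so plan to move replicas afterwards):

  curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=yourCollection&shard=shard1"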

Probably the issue is merging. When segments are merged, they’re re-written. My guess is that your larger collection is doing more merging than your test collection, but that’s a guess. See Mike McCandless’ blog on visualizing segment merges; TieredMergePolicy is the default you’re probably using: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
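If you do want to experiment with merge settings, they live in the indexConfig section of solrconfig.xml. The values below are just the 7.x defaults (10/10) written out explicitly so you can see what you'd tune; don't take them as recommendations:

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicyFactory>
  </indexConfig>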

Best,
Erick
