Posted to solr-user@lucene.apache.org by Greenhorn Techie <gr...@gmail.com> on 2018/05/02 16:58:50 UTC

Indexing throughput

Hi,

The current hardware profile for our production cluster is 20 nodes, each
with 24 cores and 256GB memory. The data being indexed is very structured
in nature, about 30 columns or so, of which roughly half are categorical
with a defined list of values. The expected peak indexing throughput is
about *50000* documents per second (expected to be done at off-peak hours
so that search requests will be minimal during this time), with an average
throughput of around *10000* documents per second during normal business
hours.

Given the hardware profile, is it realistic and practical to achieve the
desired throughput? What factors affect indexing performance apart from
the above hardware characteristics? I understand that it's very difficult
to provide guidance without a prototype being done, but I am wondering
what considerations and dependencies we need to be aware of and whether
our throughput expectations are realistic.

Thanks

Re: Indexing throughput

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/2/2018 10:58 AM, Greenhorn Techie wrote:
> The current hardware profile for our production cluster is 20 nodes, each
> with 24cores and 256GB memory. Data being indexed is very structured in
> nature and is about 30 columns or so, out of which half of them are
> categorical with a defined list of values. The expected peak indexing
> throughput is to be about *50000* documents per second (expected to be done
> at off-peak hours so that search requests will be minimal during this time)
> and the average throughput around *10000* documents (normal business
> hours).
>
> Given the hardware profile, is it realistic and practical to achieve the
> desired throughput? What factors affect the performance of indexing apart
> from the above hardware characteristics? I understand that its very
> difficult to provide any guidance unless a prototype is done. But wondering
> what are the considerations and dependencies we need to be aware of and
> whether our throughput expectations are realistic or not.

50000 docs per second is not a slow indexing rate.  It has been
achieved, and as Erick noted, surpassed by a very large margin.  Whether
you can get there with your planned hardware on your index is not a
question that I can answer.  If I had to guess, I think that as long as
the source system can push the data that fast, it SHOULD be possible to
create an indexing system that can do it.

The important thing to do for fast indexing with Solr is to have a lot
of threads or processes indexing all at the same time.  Indexing with a
single thread will not achieve the fastest possible performance.
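As a rough illustration of that fan-out, a fixed thread pool draining batches works well. This is only a sketch: `sendBatch` below is a hypothetical placeholder for the real update call (e.g. a SolrJ `client.add(batch)`), not an actual Solr API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexer {
    static final int THREADS = 16;
    static final int BATCH_SIZE = 1000;
    static final AtomicInteger sent = new AtomicInteger();

    // Placeholder for the real update call, e.g. solrClient.add(batch).
    static void sendBatch(List<String> batch) {
        sent.addAndGet(batch.size());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        // One second's worth of documents at the 50k docs/sec target rate.
        for (int i = 0; i < 50_000; i++) {
            batch.add("doc-" + i);
            if (batch.size() == BATCH_SIZE) {
                final List<String> toSend = batch;
                pool.submit(() -> sendBatch(toSend));
                batch = new ArrayList<>(BATCH_SIZE);
            }
        }
        if (!batch.isEmpty()) {
            final List<String> toSend = batch;
            pool.submit(() -> sendBatch(toSend));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(sent.get());
    }
}
```

The thread count and batch size are starting points to tune, not recommendations; the point is simply that many batches are in flight concurrently.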

Since you're planning SolrCloud, you should put some effort into having
your indexing system be aware of your cluster state and the shard
routing so that it can send indexing requests directly to shard
leaders.  Indexing is faster if Solr doesn't need to forward requests. 
The SolrJ client named "CloudSolrClient" is always aware of the
clusterstate.  So if you can use that, updates can always be sent to the
leaders.
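Conceptually, that routing just hashes the document id and picks the shard owning that hash range. The sketch below is a deliberate simplification for illustration only: real SolrCloud compositeId routing uses MurmurHash3 over the route key and assigns contiguous hash ranges to shards, not a modulo of `String.hashCode`.

```java
import java.util.List;

public class ShardRouter {
    // Simplified stand-in for SolrCloud's compositeId routing:
    // bucket documents by a hash of their id, modulo the shard count.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        for (String id : List.of("doc-1", "doc-2", "doc-3")) {
            System.out.println(id + " -> shard " + shardFor(id, numShards));
        }
    }
}
```

An indexer that groups documents this way before sending can deliver each batch straight to the right leader, which is exactly what CloudSolrClient does for you.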

Thanks,
Shawn


Re: Indexing throughput

Posted by Greenhorn Techie <gr...@gmail.com>.
Thanks Walter and Erick for the valuable suggestions. We shall try out
various shard counts, as well as the other tuning options discussed in
earlier threads.

Kind Regards


On 2 May 2018 at 18:24:31, Erick Erickson (erickerickson@gmail.com) wrote:

I've seen 1.5 M docs/second. Basically the indexing throughput is gated
by two things:
1> the number of shards. Indexing throughput essentially scales up
reasonably linearly with the number of shards.
2> the indexing program that pushes data to Solr. Before thinking Solr
is the bottleneck, check how fast your ETL process is pushing docs.

This pre-supposes using SolrJ and CloudSolrClient for the final push
to Solr. This pre-buckets the updates and sends the updates for each
shard to the shard leader, thus reducing the amount of work Solr has
to do. If you use SolrJ, you can easily do <2> above by just
commenting out the single call that pushes the docs to Solr in your
program.

Speaking of which, it's definitely best to batch the updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Best,
Erick

On Wed, May 2, 2018 at 10:07 AM, Walter Underwood <wu...@wunderwood.org> wrote:
> We have a similar sized cluster, 32 nodes with 36 processors and 60 Gb RAM each
> (EC2 C4.8xlarge). The collection is 24 million documents with four shards. The cluster
> is Solr 6.6.2. All storage is SSD EBS.
>
> We built a simple batch loader in Java. We get about one million documents per minute
> with 64 threads. We do not use the cloud-smart SolrJ client. We just send all the
> batches to the load balancer and let Solr sort it out.
>
> You are looking for 3 million documents per minute. You will just have to test that.
>
> I haven’t tested it, but indexing should speed up linearly with the number of shards,
> because those are indexing in parallel.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On May 2, 2018, at 9:58 AM, Greenhorn Techie <gr...@gmail.com> wrote:
>>
>> Hi,
>>
>> The current hardware profile for our production cluster is 20 nodes, each
>> with 24cores and 256GB memory. Data being indexed is very structured in
>> nature and is about 30 columns or so, out of which half of them are
>> categorical with a defined list of values. The expected peak indexing
>> throughput is to be about *50000* documents per second (expected to be done
>> at off-peak hours so that search requests will be minimal during this time)
>> and the average throughput around *10000* documents (normal business
>> hours).
>>
>> Given the hardware profile, is it realistic and practical to achieve the
>> desired throughput? What factors affect the performance of indexing apart
>> from the above hardware characteristics? I understand that its very
>> difficult to provide any guidance unless a prototype is done. But wondering
>> what are the considerations and dependencies we need to be aware of and
>> whether our throughput expectations are realistic or not.
>>
>> Thanks
>

Re: Indexing throughput

Posted by Erick Erickson <er...@gmail.com>.
I've seen 1.5 M docs/second. Basically the indexing throughput is gated
by two things:
1> the number of shards. Indexing throughput essentially scales up
reasonably linearly with the number of shards.
2> the indexing program that pushes data to Solr. Before thinking Solr
is the bottleneck, check how fast your ETL process is pushing docs.

This pre-supposes using SolrJ and CloudSolrClient for the final push
to Solr. This pre-buckets the updates and sends the updates for each
shard to the shard leader, thus reducing the amount of work Solr has
to do. If you use SolrJ, you can easily do <2> above by just
commenting out the single call that pushes the docs to Solr in your
program.

Speaking of which, it's definitely best to batch the updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
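A cheap way to apply point <2> above is a dry run: time the indexing loop with the single Solr call commented out. If the ETL loop alone cannot hit the target rate, Solr tuning is beside the point. A hypothetical sketch (`buildDoc` stands in for whatever document construction the pipeline does):

```java
public class EtlDryRun {
    // Hypothetical stand-in for document construction in the ETL pipeline.
    static String buildDoc(int i) {
        return "id=" + i + ",field=value";
    }

    public static void main(String[] args) {
        int n = 500_000;
        long bytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            String doc = buildDoc(i);
            bytes += doc.length();
            // client.add(doc);   // <- the one call to comment out for the dry run
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.println(n + " docs built");
        System.out.println(bytes + " bytes in " + secs + " s");
    }
}
```

Comparing the dry-run rate with the full-run rate tells you how much of the wall-clock time Solr is actually responsible for.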

Best,
Erick

On Wed, May 2, 2018 at 10:07 AM, Walter Underwood <wu...@wunderwood.org> wrote:
> We have a similar sized cluster, 32 nodes with 36 processors and 60 Gb RAM each
> (EC2 C4.8xlarge). The collection is 24 million documents with four shards. The cluster
> is Solr 6.6.2. All storage is SSD EBS.
>
> We built a simple batch loader in Java. We get about one million documents per minute
> with 64 threads. We do not use the cloud-smart SolrJ client. We just send all the
> batches to the load balancer and let Solr sort it out.
>
> You are looking for 3 million documents per minute. You will just have to test that.
>
> I haven’t tested it, but indexing should speed up linearly with the number of shards,
> because those are indexing in parallel.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On May 2, 2018, at 9:58 AM, Greenhorn Techie <gr...@gmail.com> wrote:
>>
>> Hi,
>>
>> The current hardware profile for our production cluster is 20 nodes, each
>> with 24cores and 256GB memory. Data being indexed is very structured in
>> nature and is about 30 columns or so, out of which half of them are
>> categorical with a defined list of values. The expected peak indexing
>> throughput is to be about *50000* documents per second (expected to be done
>> at off-peak hours so that search requests will be minimal during this time)
>> and the average throughput around *10000* documents (normal business
>> hours).
>>
>> Given the hardware profile, is it realistic and practical to achieve the
>> desired throughput? What factors affect the performance of indexing apart
>> from the above hardware characteristics? I understand that its very
>> difficult to provide any guidance unless a prototype is done. But wondering
>> what are the considerations and dependencies we need to be aware of and
>> whether our throughput expectations are realistic or not.
>>
>> Thanks
>

Re: Indexing throughput

Posted by Walter Underwood <wu...@wunderwood.org>.
We have a similar-sized cluster, 32 nodes with 36 processors and 60 GB RAM each
(EC2 C4.8xlarge). The collection is 24 million documents with four shards. The cluster
is Solr 6.6.2. All storage is SSD EBS.

We built a simple batch loader in Java. We get about one million documents per minute
with 64 threads. We do not use the cloud-smart SolrJ client. We just send all the
batches to the load balancer and let Solr sort it out.

You are looking for 3 million documents per minute. You will just have to test that.

I haven’t tested it, but indexing should speed up linearly with the number of shards,
because those are indexing in parallel.
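Taking the linear-scaling assumption at face value gives a back-of-envelope sizing estimate from the numbers above (an estimate only; actual results depend on document size, analysis chains, and hardware):

```java
public class SizingEstimate {
    public static void main(String[] args) {
        double observedDocsPerMin = 1_000_000;   // measured: ~1M docs/min
        int observedShards = 4;                  // on a 4-shard collection
        double targetDocsPerMin = 50_000 * 60;   // 50k docs/sec = 3M docs/min

        // Assume throughput scales linearly with shard count.
        double perShard = observedDocsPerMin / observedShards;   // docs/min per shard
        int shardsNeeded = (int) Math.ceil(targetDocsPerMin / perShard);
        System.out.println(shardsNeeded);
    }
}
```

That works out to roughly 12 shards at the observed per-shard rate, which is why testing on your own data is the only way to know.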

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 2, 2018, at 9:58 AM, Greenhorn Techie <gr...@gmail.com> wrote:
> 
> Hi,
> 
> The current hardware profile for our production cluster is 20 nodes, each
> with 24cores and 256GB memory. Data being indexed is very structured in
> nature and is about 30 columns or so, out of which half of them are
> categorical with a defined list of values. The expected peak indexing
> throughput is to be about *50000* documents per second (expected to be done
> at off-peak hours so that search requests will be minimal during this time)
> and the average throughput around *10000* documents (normal business
> hours).
> 
> Given the hardware profile, is it realistic and practical to achieve the
> desired throughput? What factors affect the performance of indexing apart
> from the above hardware characteristics? I understand that its very
> difficult to provide any guidance unless a prototype is done. But wondering
> what are the considerations and dependencies we need to be aware of and
> whether our throughput expectations are realistic or not.
> 
> Thanks