Posted to solr-user@lucene.apache.org by Mark Robinson <ma...@gmail.com> on 2016/04/19 11:25:19 UTC

Indexing 700 docs per second

Hi,

I have a requirement to index (mainly update) 700 docs per second.
Suppose I have a machine with 128GB RAM and 32 CPUs, with each doc around 260
bytes (6 fields, of which only 2 will be updated at the above
rate). This collection has around 122 million docs, and that count is pretty
much constant.

1. Can I manage this update rate with a non-sharded, i.e. single, Solr
instance setup?
2. Also, is an atomic update or a full update (the whole doc) of the changed
records the better approach in this case?

Could someone please share their views/experience?

Thanks!
Mark.

Re: Indexing 700 docs per second

Posted by Jeff Wartes <jw...@whitepages.com>.
I have no numbers to back this up, but I’d expect Atomic Updates to be slightly slower than a full update, since the atomic approach has to retrieve the fields you didn't specify before it can write the new (updated) document.
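To make the difference concrete, here is a minimal sketch of the two JSON payloads for the same change. The field names and values are hypothetical (the thread only says 6 fields, 2 of which change). The `{"set": value}` modifier is Solr's atomic update syntax; for Solr to rebuild the untouched fields, they generally need to be stored or have docValues.

```python
import json

# Hypothetical 6-field document; names and values are illustrative only.
full_doc = {
    "id": "doc-1",
    "name": "widget",
    "category": "tools",
    "region": "eu",
    "price": 19.99,
    "stock": 42,
}

def full_update(doc):
    """Full update: resend every field; Solr replaces the whole document."""
    return dict(doc)

def atomic_update(doc_id, changes):
    """Atomic update: send the id plus a {"set": value} modifier per changed field."""
    payload = {"id": doc_id}
    for field, value in changes.items():
        payload[field] = {"set": value}
    return payload

# Only 2 of the 6 fields change, as in the original question.
changes = {"price": 17.49, "stock": 40}
full_payload = full_update({**full_doc, **changes})
atomic_payload = atomic_update("doc-1", changes)

# Either list could be sent as the body of a JSON update request.
print(json.dumps([full_payload]))
print(json.dumps([atomic_payload]))
```

The atomic payload is smaller on the wire, but the server does more work per document, which is the trade-off described above. Only a load test on your own schema will show which one wins.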





Re: Indexing 700 docs per second

Posted by Tim Robertson <ti...@gmail.com>.
Hi Mark,

We were inserting and updating docs with around 20-25 indexed fields (mainly
INTs, but some Strings and multivalued fields) at >1000/sec on far lesser
hardware, with a total of 600 million docs (batch updates, of course), while
also serving live queries for a website which had about 30 concurrent users
at steady state (not all hitting Solr, though).

It seems realistic with that kind of hardware in my experience, but you
didn't mention what else was going on that might affect it (e.g. reads).

HTH,
Tim



Re: Indexing 700 docs per second

Posted by Erick Erickson <er...@gmail.com>.
Make very sure you batch updates though.
Here's a benchmark I ran:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
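As a rough illustration of batching, the sketch below groups a stream of updates into fixed-size lists, each of which would be sent as a single update request. The batch size of 1000 and the `stock` field are assumptions for the example; the thread does not name a batch size.

```python
from itertools import islice

def batched(docs, batch_size):
    """Yield lists of up to batch_size docs from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical update stream: 2,500 atomic updates to a "stock" field.
updates = ({"id": f"doc-{i}", "stock": {"set": i % 100}} for i in range(2500))

# Each batch would go out as one request instead of 1000 separate ones.
batch_sizes = [len(b) for b in batched(updates, 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

At 700 docs per second, a batch of this size means roughly one request every second or so rather than 700.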

NOTE: it's not entirely clear that you want to
put 122M docs on a single shard. Depending on the queries
you'll run you may want 2 or more shards, but that depends
on the query pattern and your SLAs. Here's the long version
of "you really have to load test this":
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick


Re: Indexing 700 docs per second

Posted by Susheel Kumar <su...@gmail.com>.
It sounds achievable with your machine configuration, and I would suggest
trying atomic updates. Use SolrJ with multi-threaded indexing for a
higher indexing rate.
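The shape of multi-threaded indexing can be sketched as follows. This is written in Python rather than SolrJ so that it runs without a server, and `send_batch` is a hypothetical placeholder for whatever client call actually posts a batch to Solr. The worker count and batch size are likewise assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def send_batch(batch):
    """Placeholder for a real client call that posts one batch to Solr.

    It only returns the batch size, so the sketch runs without a server.
    """
    return len(batch)

def index_concurrently(batches, workers=4):
    """Send batches from a pool of worker threads; return total docs sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, batches))

# Hypothetical workload: 10 batches of 500 atomic updates each.
batches = [
    [{"id": f"doc-{b * 500 + i}", "stock": {"set": i}} for i in range(500)]
    for b in range(10)
]
total_sent = index_concurrently(batches, workers=4)
print(total_sent)  # 5000
```

Threads help here because each worker spends most of its time waiting on the network, so several requests can be in flight at once.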

Thanks,
Susheel




Re: Indexing 700 docs per second

Posted by Tom Evans <te...@googlemail.com>.
On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson <ma...@gmail.com> wrote:
> Hi,
>
> I have a requirement to index (mainly update) 700 docs per second.
> Suppose I have a machine with 128GB RAM and 32 CPUs, with each doc around 260
> bytes (6 fields, of which only 2 will be updated at the above
> rate). This collection has around 122 million docs, and that count is pretty
> much constant.
>
> 1. Can I manage this update rate with a non-sharded, i.e. single, Solr
> instance setup?
> 2. Also, is an atomic update or a full update (the whole doc) of the changed
> records the better approach in this case?
>
> Could someone please share their views/experience?

Try it and see - everyone's data/schemas are different and can affect
indexing speed. It certainly sounds achievable enough - presumably you
can at least produce the documents at that rate?

Cheers

Tom

Re: Indexing 700 docs per second

Posted by Mark Robinson <ma...@gmail.com>.
Thank you all for your very valuable suggestions.
I will try out the options shared once our setup is ready and will probably
report back on my experience once that is done.

Thanks!
Mark.


Re: Indexing 700 docs per second

Posted by Bram Van Dam <br...@intix.eu>.
> I have a requirement to index (mainly update) 700 docs per second.
> Suppose I have a machine with 128GB RAM and 32 CPUs, with each doc around 260
> bytes (6 fields, of which only 2 will be updated at the above
> rate). This collection has around 122 million docs, and that count is pretty
> much constant.

We've found that average index size per document is a good predictor of
performance. For instance, I've got a 150GB index lying around,
containing 400M documents. That's roughly 400 bytes per document in
index size. This was indexed @ 4500 documents/second.

If the average index size per document doubles, the throughput will go
down by about a third. Your mileage may vary.

But yeah, I would say that 700 docs per second on your machine won't be much
of a problem, especially considering your index will likely fit in memory.
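The arithmetic behind these figures can be worked through as below. Two things here are assumptions and not from the thread: decimal gigabytes, and turning the "doubling costs about a third" rule of thumb into a smooth curve. Note also that 260 bytes is the raw document size from the question, not a measured index size per document.

```python
import math

GB = 10**9  # decimal gigabytes, an assumption; the thread does not say

# Reference index from the post: 150GB, 400M docs, indexed at 4500 docs/second.
ref_bytes_per_doc = 150 * GB / 400_000_000
print(ref_bytes_per_doc)  # 375.0, i.e. "roughly 400 bytes"

def estimated_throughput(bytes_per_doc, ref_rate=4500.0, ref_size=ref_bytes_per_doc):
    """Extrapolate: each doubling of index size per doc cuts throughput by a third."""
    doublings = math.log2(bytes_per_doc / ref_size)
    return ref_rate * (2 / 3) ** doublings

# The collection in question: 122M docs at about 260 raw bytes each.
raw_collection_gb = 122_000_000 * 260 / GB
print(round(raw_collection_gb, 1))  # 31.7, well under 128GB of RAM

# One doubling of the reference size gives about two thirds of 4500.
print(round(estimated_throughput(750)))  # 3000
```

Even if the real index is a few times larger than the raw data, it would still be a fraction of the machine's RAM, which is consistent with the "likely fit in memory" remark.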

 - Bram