Posted to solr-user@lucene.apache.org by Dotan Cohen <do...@gmail.com> on 2013/05/29 13:00:24 UTC

Reindexing strategy

I see that I do need to reindex my Solr index. The index consists of
20 million documents with a few hundred new documents added per minute
(social media data). The documents are mostly smaller than 1KiB of
data, but some may go as large as 10 KiB. All the data is text, and
all indexed fields are stored.

To reindex, I am considering adding a 'last_indexed' field, and having
a Python or Java application pull out N results every T seconds when
sorting on "last_indexed asc". How might I determine good values for
N and T? I would like to know when the Solr index is 'overloaded', or
whatever happens to Solr when it is being pushed beyond the limits of
its hardware. What should I be looking at to know if Solr is
overstressed? Is looking at CPU and memory good enough? Is there a way
to measure I/O to the disk on which the Solr index is stored? Bear in
mind that while the reindex is happening, clients will be performing
searches and a few hundred documents will be written per minute. Note
that the machine running Solr is an EC2 instance running on Amazon Web
Services, and that the 'disk' on which the Solr index is stored is an
EBS volume.
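
For concreteness, here is a minimal sketch of the loop I have in mind
(Python with the requests library; the URL, core name, and the N and T
values are placeholders, and the details are untested):

import time

import requests

SOLR = "http://localhost:8983/solr/collection1"  # placeholder URL and core name
N = 100  # batch size, to be determined experimentally
T = 5    # seconds to pause between batches, likewise

# Re-added documents get a fresh last_indexed, so stop once the oldest
# remaining last_indexed is newer than the start of this pass.
cutoff = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

while True:
    resp = requests.get(SOLR + "/select", params={
        "q": "*:*",
        "sort": "last_indexed asc",
        "rows": N,
        "wt": "json",
    }).json()
    docs = resp["response"]["docs"]
    if not docs or docs[0].get("last_indexed", "") >= cutoff:
        break  # every document has been reindexed during this pass
    for doc in docs:
        # All fields are stored (see above), so each document can be
        # rebuilt from the search result itself.
        doc.pop("_version_", None)  # avoid optimistic-concurrency conflicts
        doc["last_indexed"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    # commitWithin avoids issuing a hard commit for every batch.
    requests.post(SOLR + "/update/json",
                  params={"commitWithin": "60000"}, json=docs)
    time.sleep(T)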

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: Reindexing strategy

Posted by Dotan Cohen <do...@gmail.com>.
On Fri, May 31, 2013 at 3:57 AM, Michael Sokolov
<ms...@safaribooksonline.com> wrote:
> On UNIX platforms, take a look at vmstat for basic I/O measurement, and
> iostat for more detailed stats.  One coarse measurement is the number of
> blocked/waiting processes - usually this is due to I/O contention, and you
> will want to look at the paging and swapping numbers - you don't want any
> swapping at all.  But the best single number to look at is overall disk
> activity, which is the I/O percentage utilized number Shawn was mentioning.
>
> -Mike

Great, thanks! I've got some terms to google. For those who follow in
my footsteps, on Ubuntu the package 'sysstat' needs to be installed to
use iostat. Here are my reference stats before starting to experiment,
both for my own later comparison and in case anybody sees anything
amiss here, in which case I would love to know about it. If there is
any particularly urgent fine manual I should read, please do mention
it. Thanks!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: Reindexing strategy

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 5/30/2013 8:30 AM, Dotan Cohen wrote:
> On Wed, May 29, 2013 at 5:37 PM, Shawn Heisey <so...@elyograg.org> wrote:
>> It's impossible for us to give you hard numbers.  You'll have to
>> experiment to know how fast you can reindex without killing your
>> servers.  A basic tenet for such experimentation, and something you
>> hopefully already know: You'll want to get baseline measurements before
>> you begin testing for comparison.
>>
> Thanks. I wasn't looking for hard numbers, but rather for the signs
> of problems. I know to keep my eye on memory and CPU, but I have no
> idea how to check disk I/O, and I'm not sure how I would even
> determine whether it has become saturated.
>
On UNIX platforms, take a look at vmstat for basic I/O measurement, and 
iostat for more detailed stats.  One coarse measurement is the number of 
blocked/waiting processes - usually this is due to I/O contention, and 
you will want to look at the paging and swapping numbers - you don't 
want any swapping at all.  But the best single number to look at is 
overall disk activity, which is the I/O percentage utilized number Shawn
was mentioning.
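
For reference, the basic invocations look something like this (flag and
package names as in standard sysstat/procps on Linux):

sudo apt-get install sysstat   # Ubuntu/Debian: provides iostat
iostat -x 5                    # extended device stats every 5s; watch %util and await
vmstat 5                       # 'b' column = blocked processes; si/so = swapping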

-Mike

Re: Reindexing strategy

Posted by Dotan Cohen <do...@gmail.com>.
On Wed, May 29, 2013 at 5:37 PM, Shawn Heisey <so...@elyograg.org> wrote:
> It's impossible for us to give you hard numbers.  You'll have to
> experiment to know how fast you can reindex without killing your
> servers.  A basic tenet for such experimentation, and something you
> hopefully already know: You'll want to get baseline measurements before
> you begin testing for comparison.
>

Thanks. I wasn't looking for hard numbers, but rather for the signs
of problems. I know to keep my eye on memory and CPU, but I have no
idea how to check disk I/O, and I'm not sure how I would even
determine whether it has become saturated.

> One of the most reliable Solr-specific indicators of pushing your
> hardware too hard is that the QTime on your queries will start to
> increase dramatically.  Solr 4.1 and later have more granular query-time
> statistics in the UI - the median and 95% numbers are much more
> important than the average.
>

Thank you, this will help. At least I now have a hard metric to see
when Solr is getting overburdened (QTime).
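
For anyone scripting this, here is a minimal sketch of polling those
numbers (assuming Solr 4.1+ exposes the percentile statistics through
the /admin/mbeans handler that the UI reads; the host and core name
'collection1' are placeholders, and the stat key names should be
checked against your version):

import requests

URL = "http://localhost:8983/solr/collection1/admin/mbeans"  # placeholder

resp = requests.get(URL, params={"stats": "true",
                                 "cat": "QUERYHANDLER",
                                 "wt": "json"}).json()

# "solr-mbeans" is a flat list alternating category names and bean maps.
beans = resp["solr-mbeans"]
handlers = beans[beans.index("QUERYHANDLER") + 1]
stats = handlers["/select"]["stats"]

# Median and 95th-percentile query times, in milliseconds.
print(stats["medianRequestTime"], stats["95thPcRequestTime"])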


> Outside of that, if your overall IOwait CPU percentage starts getting
> near (or above) 30-50%, your server is struggling.  If all of your CPU
> cores are staying near 100% usage, then it's REALLY struggling.
>

I see, thanks.


> Assuming you have plenty of CPU cores, using fast storage and having
> plenty of extra RAM will alleviate much of the I/O bottleneck.  The
> usual rule of thumb for good query performance is that you need enough
> RAM to put 50-100% of your index in the OS disk cache.  For blazing
> performance during a rebuild, that becomes 100-200%.  If you had 150%,
> that would probably keep most indexes well-cached even during a rebuild.
>
> A rebuild will always lower performance, even with lots of RAM.
>

Considering that the Solr index is the only place that the data is
stored, and that users are actively using the system, I was not
planning a full rebuild, but rather to iteratively reindex the extant
documents even as new documents are being pushed in.


> My earlier reply to your other message has some other ideas that will
> hopefully help.
>

Thank you Shawn!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: Reindexing strategy

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/29/2013 6:01 AM, Dotan Cohen wrote:
> I mean 'overload' Solr in the sense that it cannot read, process, and
> write data fast enough because too much data is being handled. I
> remind you that this system is writing hundreds of documents per
> minute. Certainly there is a limit to what Solr can handle. I ask how
> to know how close I am to this limit.

It's impossible for us to give you hard numbers.  You'll have to
experiment to know how fast you can reindex without killing your
servers.  A basic tenet for such experimentation, and something you
hopefully already know: You'll want to get baseline measurements before
you begin testing for comparison.

One of the most reliable Solr-specific indicators of pushing your
hardware too hard is that the QTime on your queries will start to
increase dramatically.  Solr 4.1 and later have more granular query-time
statistics in the UI - the median and 95% numbers are much more
important than the average.

Outside of that, if your overall IOwait CPU percentage starts getting
near (or above) 30-50%, your server is struggling.  If all of your CPU
cores are staying near 100% usage, then it's REALLY struggling.
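
A small sketch of watching those thresholds from a script (psutil is an
assumed third-party dependency; the 30% and 95% cutoffs just mirror the
rough numbers above):

import psutil

# Average CPU time shares over a 5-second window; iowait is Linux-specific.
cpu = psutil.cpu_times_percent(interval=5)
if cpu.iowait >= 30:
    print("iowait at %.1f%% - the server is struggling" % cpu.iowait)

# Per-core utilization over a 1-second window.
if all(c >= 95 for c in psutil.cpu_percent(interval=1, percpu=True)):
    print("all cores near 100% - it is REALLY struggling")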

Assuming you have plenty of CPU cores, using fast storage and having
plenty of extra RAM will alleviate much of the I/O bottleneck.  The
usual rule of thumb for good query performance is that you need enough
RAM to put 50-100% of your index in the OS disk cache.  For blazing
performance during a rebuild, that becomes 100-200%.  If you had 150%,
that would probably keep most indexes well-cached even during a rebuild.
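
For instance, with an assumed figure (the index size is not stated in
this thread): if the 20 million mostly-sub-1KiB documents produce an
index of roughly 30 GB on disk, that guideline suggests keeping on the
order of 30-60 GB of RAM free for the OS disk cache, over and above
the JVM heap, while the rebuild runs.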

A rebuild will always lower performance, even with lots of RAM.

My earlier reply to your other message has some other ideas that will
hopefully help.

Thanks,
Shawn


Re: Reindexing strategy

Posted by Dotan Cohen <do...@gmail.com>.
On Wed, May 29, 2013 at 2:41 PM, Upayavira <uv...@odoko.co.uk> wrote:
> I presume you are running Solr on a multi-core/CPU server. If you kept a
> single process hitting Solr to re-index, you'd be using just one of
> those cores. It would take as long as it takes; I can't see how you
> would 'overload' it that way.
>

I mean 'overload' Solr in the sense that it cannot read, process, and
write data fast enough because too much data is being handled. I
remind you that this system is writing hundreds of documents per
minute. Certainly there is a limit to what Solr can handle. I ask how
to know how close I am to this limit.


> I guess you could have a strategy that pulls 100 documents with an old
> last_indexed, and push them for re-indexing. If you get the full 100
> docs, you make a subsequent request immediately. If you get fewer than
> 100 back, you know you're up-to-date and can wait, say, 30s before
> making another request.
>

Actually, I would add a filter query for documents whose last_indexed
value is before the last schema change, and stop when fewer documents
were returned than were requested.
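
Roughly, that would be a variant of the loop sketched in the first
message (again Python with requests; the URL, batch size, and cutoff
date are placeholders):

import time

import requests

SOLR = "http://localhost:8983/solr/collection1"  # placeholder URL and core
BATCH = 100
SCHEMA_CHANGE = "2013-05-29T00:00:00Z"  # placeholder: time of last schema change

while True:
    resp = requests.get(SOLR + "/select", params={
        "q": "*:*",
        # Only documents not yet reindexed since the schema change.
        "fq": "last_indexed:[* TO %s]" % SCHEMA_CHANGE,
        "sort": "last_indexed asc",
        "rows": BATCH,
        "wt": "json",
    }).json()
    docs = resp["response"]["docs"]
    for doc in docs:
        doc.pop("_version_", None)
        doc["last_indexed"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    if docs:
        requests.post(SOLR + "/update/json",
                      params={"commitWithin": "60000"}, json=docs)
    if len(docs) < BATCH:
        break  # fewer returned than requested: reindexing is complete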

Thanks.


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: Reindexing strategy

Posted by Upayavira <uv...@odoko.co.uk>.
I presume you are running Solr on a multi-core/CPU server. If you kept a
single process hitting Solr to re-index, you'd be using just one of
those cores. It would take as long as it takes; I can't see how you
would 'overload' it that way. 

I guess you could have a strategy that pulls 100 documents with an old
last_indexed, and push them for re-indexing. If you get the full 100
docs, you make a subsequent request immediately. If you get fewer than
100 back, you know you're up-to-date and can wait, say, 30s before
making another request.

Upayavira

On Wed, May 29, 2013, at 12:00 PM, Dotan Cohen wrote:
> I see that I do need to reindex my Solr index. The index consists of
> 20 million documents with a few hundred new documents added per minute
> (social media data). The documents are mostly smaller than 1KiB of
> data, but some may go as large as 10 KiB. All the data is text, and
> all indexed fields are stored.
> 
> To reindex, I am considering adding a 'last_indexed' field, and having
> a Python or Java application pull out N results every T seconds when
> sorting on "last_indexed asc". How might I determine good values for
> N and T? I would like to know when the Solr index is 'overloaded', or
> whatever happens to Solr when it is being pushed beyond the limits of
> its hardware. What should I be looking at to know if Solr is
> overstressed? Is looking at CPU and memory good enough? Is there a way
> to measure I/O to the disk on which the Solr index is stored? Bear in
> mind that while the reindex is happening, clients will be performing
> searches and a few hundred documents will be written per minute. Note
> that the machine running Solr is an EC2 instance running on Amazon Web
> Services, and that the 'disk' on which the Solr index is stored is an
> EBS volume.
> 
> Thank you.
> 
> --
> Dotan Cohen
> 
> http://gibberish.co.il
> http://what-is-what.com