Posted to solr-user@lucene.apache.org by Luis Cappa Banda <lu...@gmail.com> on 2014/02/05 16:28:13 UTC

Optimize and replication: some questions battery.

Hello!

I've got a scenario where I index very frequently on master servers and
replicate to slave servers with one minute polling. Master indexes are
growing fast and I would like to optimize indexes to improve search
queries. However...

1. During an optimize operation, can master servers index new documents? I
suppose that is not possible.

2. The optimize operation can probably take minutes or hours, and it would
then affect the live/production environment because new documents won't be
indexed. Should I optimize each slave's index instead? What will happen
with replication? Will slave servers "lose" the index identifiers that allow
them to replicate delta documents from the master once they are optimized?
Will the next replication overwrite the optimized index on the slaves?

Thank you very much in advance.

Regards,

-- 
- Luis Cappa

Re: Optimize and replication: some questions battery.

Posted by Luis Cappa Banda <lu...@gmail.com>.
Hi Toke!

Thanks for answering. That's it: I mentioned index corruption just as a
precaution, not because I have already noticed it. During some tests in the
past I found that a mergeFactor of 2 improves search speed by more than a
little compared with common merge factors such as 10. Of course indexing
speed is penalized, but my production architecture is based on task queues
and workers that index into Solr, and I've developed a custom SolrCluster
module, a black box that acts as a single Solr server from the outside but
internally balances across N Solr master servers, deciding where to index,
checking Solr server status (alive, dead), executing sharded search
queries, etc. So that point is under control: if I need more indexing speed
I can add new Solr masters and/or new worker modules to dequeue, process
and execute index operations. My main concern was improving search speed as
much as possible through optimizing, mergeFactor tuning, cache setup, etc.

Thanks a lot!


2014-02-06 Toke Eskildsen <te...@statsbiblioteket.dk>:

> On Thu, 2014-02-06 at 10:22 +0100, Luis Cappa Banda wrote:
> > I knew some performance tips to improve search and I configured a very
> > low merge factor (<mergeFactor>2</mergeFactor>) to boost search
> > operations instead of indexation ones.
>
> That would give you a small search speed increase and a huge penalty on
> indexing speed (as it will perform large merges all the time) and
> replication speed (as all file data will be updated frequently instead
> of just a subset of it). Unless you are absolutely sure that you need
> the small search speed increase, you should set this to a higher number.
>
> > I haven't got a deep knowledge of internal Lucene behavior in this
> > case, but I thought that somehow an optimization operation may rebuild
> > the index checking and fixing corrupted segments,
>
> To my knowledge, there are no attempts to repair corrupted segments
> during merge. I hope you speak of corruption as a precaution and not
> because it is something that happens to your setup. If you have
> corrupted indexes at any time, you should investigate how that happens,
> instead of trying to repair them.
>
> > One last question: do you think that this kind of scenario where I
> > continuously index and replicate data will corrupt the index?
>
> Lucene is used in a lot of places with massive updates. Aside from
> JVM-related bugs, it has proven to be very stable under these
> conditions. So no, the indexing will not corrupt anything.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>


-- 
- Luis Cappa

Re: Optimize and replication: some questions battery.

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2014-02-06 at 10:22 +0100, Luis Cappa Banda wrote:
> I knew some performance tips to improve search and I configured a very
> low merge factor (<mergeFactor>2</mergeFactor>) to boost search
> operations instead of indexation ones.

That would give you a small search speed increase and a huge penalty on
indexing speed (as it will perform large merges all the time) and
replication speed (as all file data will be updated frequently instead
of just a subset of it). Unless you are absolutely sure that you need
the small search speed increase, you should set this to a higher number.
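
As a rough illustration of that trade-off, here is a minimal Python sketch
of a simplified merge model. It assumes every flush produces one segment of
equal size and that segments are merged as soon as mergeFactor of them sit
on the same level; real Lucene merge policies are more involved, and the
function name simulate_merges is made up for this example.

```python
def simulate_merges(num_flushes, merge_factor):
    """Return (segment_count, units_rewritten) after num_flushes flushes.

    Simplified model (an assumption, not Lucene's actual policy): each flush
    adds a segment of size 1 at level 0, and whenever merge_factor segments
    share a level they are merged into one segment on the next level.
    """
    levels = [[]]      # levels[i] holds the sizes of the segments at level i
    rewritten = 0      # total size written by merges (write amplification)
    for _ in range(num_flushes):
        levels[0].append(1)
        level = 0
        # Cascade merges upward while a level is full.
        while len(levels[level]) >= merge_factor:
            merged = sum(levels[level])
            rewritten += merged
            levels[level] = []
            if level + 1 == len(levels):
                levels.append([])
            levels[level + 1].append(merged)
            level += 1
    segment_count = sum(len(sizes) for sizes in levels)
    return segment_count, rewritten


for factor in (2, 10):
    segments, rewritten = simulate_merges(999, factor)
    print(f"mergeFactor={factor}: {segments} segments, {rewritten} units rewritten")
```

In this model a mergeFactor of 2 ends with fewer segments than 10 but
rewrites several times as much data, and every rewritten segment is a new
file that the slaves have to fetch again.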

> I haven't got a deep knowledge of internal Lucene behavior in this
> case, but I thought that somehow an optimization operation may rebuild
> the index checking and fixing corrupted segments,

To my knowledge, there are no attempts to repair corrupted segments
during merge. I hope you speak of corruption as a precaution and not
because it is something that happens to your setup. If you have
corrupted indexes at any time, you should investigate how that happens,
instead of trying to repair them.

> One last question: do you think that this kind of scenario where I
> continuously index and replicate data will corrupt the index?

Lucene is used in a lot of places with massive updates. Aside from
JVM-related bugs, it has proven to be very stable under these
conditions. So no, the indexing will not corrupt anything.

- Toke Eskildsen, State and University Library, Denmark


Re: Optimize and replication: some questions battery.

Posted by Luis Cappa Banda <lu...@gmail.com>.
Hi Chris,

Thank you very much for your response! It was very instructive. I knew some
performance tips to improve search and I configured a very low merge factor
(<mergeFactor>2</mergeFactor>) to boost search operations instead of
indexation ones. I haven't got a deep knowledge of internal Lucene behavior
in this case, but I thought that somehow an optimization operation may
rebuild the index checking and fixing corrupted segments, merging again
whatever needs merging, etc., so that the "new" master index would be
a better index for frequent inserts of new data.

One last question: do you think that this kind of scenario where I
continuously index and replicate data will corrupt the index? In the past I
developed a simple tool using a Lucene class to check the index and alert
me if it's corrupted, so if you think that this scenario is
dangerous maybe I can reuse that tool to prevent weird production
situations.

Best,


- Luis Cappa


2014-02-05 Chris Hostetter <ho...@fucit.org>:

>
> : I've got a scenario where I index very frequently on master servers and
> : replicate to slave servers with one minute polling. Master indexes are
> : growing fast and I would like to optimize indexes to improve search
> : queries. However...
>
> For a scenario where your index is changing that rapidly, you don't want
> to use the optimize command at all -- it's not going to improve the
> performance of anything...
>
>
> https://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations
>
> You may want to optimize an index in certain situations -- i.e., if you
> build your index once, and then never modify it.
>
> If you have a rapidly changing index, rather than optimizing, you likely
> simply want to use a lower merge factor. Optimizing is very expensive, and
> if the index is constantly changing, the slight performance boost will not
> last long. The tradeoff is not often worth it for a non-static index.
>
> In a master slave setup, sometimes you may also want to optimize on the
> master so that slaves serve from a single segment index. This can
> greatly increase the time to replicate the index though, so this is often
> not desirable either.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>



-- 
- Luis Cappa

Re: Optimize and replication: some questions battery.

Posted by Chris Hostetter <ho...@fucit.org>.
: I've got a scenario where I index very frequently on master servers and
: replicate to slave servers with one minute polling. Master indexes are
: growing fast and I would like to optimize indexes to improve search
: queries. However...

For a scenario where your index is changing that rapidly, you don't want
to use the optimize command at all -- it's not going to improve the 
performance of anything...

https://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

You may want to optimize an index in certain situations -- i.e., if you
build your index once, and then never modify it.

If you have a rapidly changing index, rather than optimizing, you likely 
simply want to use a lower merge factor. Optimizing is very expensive, and 
if the index is constantly changing, the slight performance boost will not 
last long. The tradeoff is not often worth it for a non-static index.

In a master slave setup, sometimes you may also want to optimize on the 
master so that slaves serve from a single segment index. This can
greatly increase the time to replicate the index though, so this is often 
not desirable either. 
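
To see why an optimize on the master can slow replication down, here is a
small Python sketch. It assumes a simplified model in which a slave only
fetches index files it does not already hold (compared by name and size);
the file names, sizes and the helper files_to_fetch are hypothetical and
are not part of Solr.

```python
def files_to_fetch(master_files, slave_files):
    """Return the master files the slave lacks or holds with a different size."""
    return {
        name: size
        for name, size in master_files.items()
        if slave_files.get(name) != size
    }


# Hypothetical segment files (name -> size in MB) currently on the slave.
slave = {"_a.cfs": 4000, "_b.cfs": 400, "_c.cfs": 40}

# Ordinary commit on the master: old segments stay, one small segment is added.
after_commit = {**slave, "_d.cfs": 4}

# Optimize on the master: everything is rewritten into one new segment file.
after_optimize = {"_e.cfs": 4444}

commit_mb = sum(files_to_fetch(after_commit, slave).values())
optimize_mb = sum(files_to_fetch(after_optimize, slave).values())
print(f"after commit: {commit_mb} MB to fetch")
print(f"after optimize: {optimize_mb} MB to fetch")
```

Under these assumptions an ordinary commit ships only the new small segment,
while an optimize forces the slave to download the whole index again.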



-Hoss
http://www.lucidworks.com/