Posted to solr-user@lucene.apache.org by "Shenghua(Daniel) Wan" <wa...@gmail.com> on 2015/06/16 01:56:29 UTC

solr/lucene index merge and optimize performance improvement

Hi,
Do you have any suggestions for improving the performance of merging and
optimizing an index?
I have been using an embedded Solr server to merge and optimize the index,
and I am looking for the right parameters to tune. My use case has about 300
fields plus 250 copyFields, and a moderate doc size (about 65K per doc on
average).

https://wiki.apache.org/solr/MergingSolrIndexes does not help much.
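
For concreteness, the merge-and-optimize step I am doing amounts, at the
Lucene level, to roughly this sketch (the paths are placeholders, and a null
analyzer is fine for a merge-only writer):

import java.nio.file.Paths;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeSketch {
  public static void main(String[] args) throws Exception {
    // Destination index; a null analyzer is fine for a merge-only writer.
    try (Directory dest = FSDirectory.open(Paths.get("/data/merged-index"));
         IndexWriter writer = new IndexWriter(dest, new IndexWriterConfig(null))) {
      // Source indexes produced earlier (placeholder paths).
      try (Directory shard1 = FSDirectory.open(Paths.get("/data/shard1"));
           Directory shard2 = FSDirectory.open(Paths.get("/data/shard2"))) {
        writer.addIndexes(shard1, shard2); // the "merge" step
      }
      writer.forceMerge(1);                // the "optimize" step: one segment
      writer.commit();
    }
  }
}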

Thanks a lot for any ideas and suggestions.

-- 

Regards,
Shenghua (Daniel) Wan

Re: solr/lucene index merge and optimize performance improvement

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2015-06-16 at 09:54 -0700, Shenghua(Daniel) Wan wrote:
> Hi, Toke,
> Did you try MapReduce with Solr? I think it should be a good fit for your
> use case.

Thanks for the suggestion. Improved logistics, such as starting the build of
a new shard while the previous shard is optimizing, would work for us.
Switching to a new controlling layer is not trivial, so the win from
better utilization during the optimization phase is not enough in itself
to pay the cost.

- Toke Eskildsen, State and University Library, Denmark


Re: solr/lucene index merge and optimize performance improvement

Posted by "Shenghua(Daniel) Wan" <wa...@gmail.com>.
Hi, Toke,
Did you try MapReduce with Solr? I think it should be a good fit for your
use case.

On Tue, Jun 16, 2015 at 5:02 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> Shenghua(Daniel) Wan <wa...@gmail.com> wrote:
> > Actually, I am currently interested in how to boost the merging/optimizing
> > performance of a single Solr instance.
>
> We have the same challenge (we build static 900GB shards one at a time and
> the final optimization takes 8 hours with only 1 CPU core at 100%). I know
> that there is code for detecting SSDs, which should make merging faster (by
> running more merges in parallel?), but I am afraid that optimize (a single
> merge) is always single-threaded.
>
> It seems to me that at least some of the different files making up a
> segment could be created in parallel, but I do not know how hard it would
> be to do so.
>
> - Toke Eskildsen
>



-- 

Regards,
Shenghua (Daniel) Wan

Re: solr/lucene index merge and optimize performance improvement

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Shenghua(Daniel) Wan <wa...@gmail.com> wrote:
> Actually, I am currently interested in how to boost the merging/optimizing
> performance of a single Solr instance.

We have the same challenge (we build static 900GB shards one at a time and the final optimization takes 8 hours with only 1 CPU core at 100%). I know that there is code for detecting SSDs, which should make merging faster (by running more merges in parallel?), but I am afraid that optimize (a single merge) is always single-threaded.
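
For illustration, the merge concurrency I am referring to is configured like
this (just a sketch; the counts are illustrative and we have not benchmarked
them):

import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

public class MergeConcurrency {
  // More merge threads let independent background merges overlap (useful on
  // SSDs), but they do not split one big merge such as forceMerge(1).
  static IndexWriterConfig configure() {
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxMergesAndThreads(6, 3); // max queued merges, max merge threads
    IndexWriterConfig iwc = new IndexWriterConfig(null); // merge-only writer
    iwc.setMergeScheduler(cms);
    return iwc;
  }
}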

It seems to me that at least some of the different files making up a segment could be created in parallel, but I do not know how hard it would be to do so.

- Toke Eskildsen

Re: solr/lucene index merge and optimize performance improvement

Posted by "Shenghua(Daniel) Wan" <wa...@gmail.com>.
​I think your advice on future incremental update is very useful. I will
keep eye on that.

Actually, I am currently interested in how to boost the merging/optimizing
performance of a single Solr instance.
Parallelism at the MapReduce level does not help merging/optimizing much,
unless Solr/Lucene internally parallelizes the work, e.g. across threads.

Specifically, I am talking about the parameters in

//  ((TieredMergePolicy) mergePolicy).setMaxMergeAtOnceExplicit(10000);
//  ((TieredMergePolicy) mergePolicy).setMaxMergeAtOnce(10000);
//  ((TieredMergePolicy) mergePolicy).setSegmentsPerTier(10000);

https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L119-121
Do you know how they affect merging/optimizing performance, or do you know
of any documentation about them?
I tried uncommenting them, and the performance improved. I am now
considering tuning the parameters further.
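
For clarity, here is how I am wiring them in now, with my rough understanding
in the comments (a sketch; the 10000 values come from the commented-out lines
above and are surely not optimal):

import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicySketch {
  static IndexWriterConfig configure() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Max segments merged at once during an explicit merge (forceMerge):
    tmp.setMaxMergeAtOnceExplicit(10000);
    // Max segments merged at once during natural, background merging:
    tmp.setMaxMergeAtOnce(10000);
    // How many same-tier segments may accumulate before a merge triggers:
    tmp.setSegmentsPerTier(10000);
    IndexWriterConfig iwc = new IndexWriterConfig(null);
    iwc.setMergePolicy(tmp);
    return iwc;
  }
}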

As you mentioned, IndexWriter.forceMerge is indeed there, at line 153 of
https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L153

I am very grateful for your advice. Thanks a lot.

On Mon, Jun 15, 2015 at 10:39 PM, Erick Erickson <er...@gmail.com>
wrote:

> Ah, OK. For very slowly changing indexes, optimize can make sense.
>
> Do note, though, that if you incrementally index after the full build, and
> especially if you update documents, you're laying a trap for the future.
> Let's
> say you optimize down to a single segment. The default TieredMergePolicy
> tries to merge "similar size segments". But now you have one huge segment
> and docs will be marked as deleted from that segment, but not cleaned up
> until that segment is merged, which won't happen for a long time since it
> is so much bigger (I'm assuming) than the segments the incremental indexing
> will create.
>
> Now, the percentage of deleted documents weighs quite heavily in the
> decision about which segments to merge, so it might not matter. It's just
> something to be aware of.
> Surely benchmarking is in order, as you indicated.
>
> The Lucene-level IndexWriter.forceMerge method seems to be what you need
> though, although if you're working over HDFS I'm in unfamiliar territory.
> But
> the constructors to IndexWriter take a Directory, and the HdfsDirectory
> extends BaseDirectory which extends Directory so if you can set up
> an HdfsDirectory it should "just work". I haven't personally tried it
> though.
>
> I saw something recently where optimization helped considerably in a
> sharded situation where the rows parameter was 400 (10 shards). My
> belief is that what was really happening was that the first-pass of a
> distributed search was getting slowed by disk seeks across multiple
> smaller segments. I'm waiting for SOLR-6810 which should impact that
> problem. Don't know if it applies to your situation or not though.
>
> HTH,
> Erick
>
>
> On Mon, Jun 15, 2015 at 8:30 PM, Shenghua(Daniel) Wan
> <wa...@gmail.com> wrote:
> > Hi, Erick,
> > First thanks for sharing the ideas. I am further giving more context here
> > accordingly.
> >
> > 1. Why optimize? I have done some experiments to compare the query
> > response time, and there is some difference. In addition, the searcher
> > will be customer-facing. I think any performance boost will be worthwhile
> > unless indexing becomes more frequent. However, more benchmarking will be
> > necessary to quantify the margin.
> >
> > 2. Why an embedded Solr server? I adopted the idea from Mark Miller's
> > map-reduce indexing and built on top of his original contribution to
> > Solr. It launches an embedded Solr server at the end of the reducer
> > stage: basically, a Solr "instance" is brought up and fed documents, so
> > an index is generated at each reducer. The indexes are then merged, and
> > optimized if desired.
> >
> > Thanks.
> >
> > On Mon, Jun 15, 2015 at 5:06 PM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> The first question is why you're optimizing at all. It's not recommended
> >> unless you can demonstrate that an optimized index is giving you enough
> >> of a performance boost to be worth the effort.
> >>
> >> And why are you using an embedded Solr server? That's kind of unusual,
> >> so I wonder if you've gone down a wrong path somewhere. In other words,
> >> this feels like an XY problem: you're asking about a specific task
> >> without explaining the problem you're trying to solve, and there may
> >> be better alternatives.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Jun 15, 2015 at 4:56 PM, Shenghua(Daniel) Wan
> >> <wa...@gmail.com> wrote:
> >> > Hi,
> >> > Do you have any suggestions for improving the performance of merging
> >> > and optimizing an index?
> >> > I have been using an embedded Solr server to merge and optimize the
> >> > index, and I am looking for the right parameters to tune. My use case
> >> > has about 300 fields plus 250 copyFields, and a moderate doc size
> >> > (about 65K per doc on average).
> >> >
> >> > https://wiki.apache.org/solr/MergingSolrIndexes does not help much.
> >> >
> >> > Thanks a lot for any ideas and suggestions.
> >> >
> >> > --
> >> >
> >> > Regards,
> >> > Shenghua (Daniel) Wan
> >>
> >
> >
> >
> > --
> >
> > Regards,
> > Shenghua (Daniel) Wan
>



-- 

Regards,
Shenghua (Daniel) Wan

Re: solr/lucene index merge and optimize performance improvement

Posted by Erick Erickson <er...@gmail.com>.
Ah, OK. For very slowly changing indexes, optimize can make sense.

Do note, though, that if you incrementally index after the full build, and
especially if you update documents, you're laying a trap for the future. Let's
say you optimize down to a single segment. The default TieredMergePolicy
tries to merge "similar size segments". But now you have one huge segment
and docs will be marked as deleted from that segment, but not cleaned up
until that segment is merged, which won't happen for a long time since it
is so much bigger (I'm assuming) than the segments the incremental indexing
will create.

Now, the percentage of deleted documents weighs quite heavily in the decision
about which segments to merge, so it might not matter. It's just something to
be aware of.
Surely benchmarking is in order, as you indicated.
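
For reference, the weighting I mean is exposed on TieredMergePolicy; a quick
sketch (the 5.0 is just an example, and I believe the default is 2.0):

import org.apache.lucene.index.TieredMergePolicy;

public class ReclaimDeletes {
  static TieredMergePolicy configure() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Bias merge selection toward segments with many deleted docs; higher
    // values reclaim deletes more aggressively (default is 2.0, IIRC).
    tmp.setReclaimDeletesWeight(5.0);
    return tmp;
  }
}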

The Lucene-level IndexWriter.forceMerge method seems to be what you need
though, although if you're working over HDFS I'm in unfamiliar territory. But
the constructors to IndexWriter take a Directory, and the HdfsDirectory
extends BaseDirectory which extends Directory so if you can set up
an HdfsDirectory it should "just work". I haven't personally tried it though.
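
Untested, but I'd expect it to look roughly like this (the HdfsDirectory
constructor varies by version; this follows the (Path, Configuration) form,
and the paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.solr.store.hdfs.HdfsDirectory;

public class HdfsForceMerge {
  public static void main(String[] args) throws Exception {
    // Open the index directly on HDFS and merge it down to one segment.
    HdfsDirectory dir = new HdfsDirectory(
        new Path("hdfs://namenode/solr/merged-index"), new Configuration());
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(null))) {
      writer.forceMerge(1); // single-segment "optimize"
      writer.commit();
    }
    dir.close();
  }
}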

I saw something recently where optimization helped considerably in a
sharded situation where the rows parameter was 400 (10 shards). My
belief is that what was really happening was that the first-pass of a
distributed search was getting slowed by disk seeks across multiple
smaller segments. I'm waiting for SOLR-6810 which should impact that
problem. Don't know if it applies to your situation or not though.

HTH,
Erick


On Mon, Jun 15, 2015 at 8:30 PM, Shenghua(Daniel) Wan
<wa...@gmail.com> wrote:
> Hi, Erick,
> First thanks for sharing the ideas. I am further giving more context here
> accordingly.
>
> 1. Why optimize? I have done some experiments to compare the query response
> time, and there is some difference. In addition, the searcher will be
> customer-facing. I think any performance boost will be worthwhile unless
> indexing becomes more frequent. However, more benchmarking will be
> necessary to quantify the margin.
>
> 2. Why an embedded Solr server? I adopted the idea from Mark Miller's
> map-reduce indexing and built on top of his original contribution to Solr.
> It launches an embedded Solr server at the end of the reducer stage:
> basically, a Solr "instance" is brought up and fed documents, so an index
> is generated at each reducer. The indexes are then merged, and optimized if
> desired.
>
> Thanks.
>
> On Mon, Jun 15, 2015 at 5:06 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> The first question is why you're optimizing at all. It's not recommended
>> unless you can demonstrate that an optimized index is giving you enough
>> of a performance boost to be worth the effort.
>>
>> And why are you using an embedded Solr server? That's kind of unusual,
>> so I wonder if you've gone down a wrong path somewhere. In other words,
>> this feels like an XY problem: you're asking about a specific task
>> without explaining the problem you're trying to solve, and there may
>> be better alternatives.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 15, 2015 at 4:56 PM, Shenghua(Daniel) Wan
>> <wa...@gmail.com> wrote:
>> > Hi,
>> > Do you have any suggestions for improving the performance of merging and
>> > optimizing an index?
>> > I have been using an embedded Solr server to merge and optimize the
>> > index, and I am looking for the right parameters to tune. My use case
>> > has about 300 fields plus 250 copyFields, and a moderate doc size
>> > (about 65K per doc on average).
>> >
>> > https://wiki.apache.org/solr/MergingSolrIndexes does not help much.
>> >
>> > Thanks a lot for any ideas and suggestions.
>> >
>> > --
>> >
>> > Regards,
>> > Shenghua (Daniel) Wan
>>
>
>
>
> --
>
> Regards,
> Shenghua (Daniel) Wan

Re: solr/lucene index merge and optimize performance improvement

Posted by "Shenghua(Daniel) Wan" <wa...@gmail.com>.
Hi, Erick,
First thanks for sharing the ideas. I am further giving more context here
accordingly.

1. Why optimize? I have done some experiments to compare the query response
time, and there is some difference. In addition, the searcher will be
customer-facing. I think any performance boost will be worthwhile unless
indexing becomes more frequent. However, more benchmarking will be
necessary to quantify the margin.

2. Why an embedded Solr server? I adopted the idea from Mark Miller's
map-reduce indexing and built on top of his original contribution to Solr.
It launches an embedded Solr server at the end of the reducer stage:
basically, a Solr "instance" is brought up and fed documents, so an index is
generated at each reducer. The indexes are then merged, and optimized if
desired.
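
Roughly, each reducer's indexing step looks like this simplified sketch (the
solr home, core name, and fields are placeholders, and the CoreContainer
bootstrap differs across Solr versions):

import java.nio.file.Paths;

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class ReducerIndexer {
  public static void main(String[] args) throws Exception {
    // Bring up an embedded Solr "instance" over a local core (placeholder
    // paths; the exact CoreContainer bootstrap varies by Solr version).
    CoreContainer container = CoreContainer.createAndLoad(Paths.get("/local/solr-home"));
    EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    server.add(doc);       // feed documents from the reducer's input
    server.commit();
    server.optimize();     // optional optimize of this reducer's index
    server.close();
  }
}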

Thanks.

On Mon, Jun 15, 2015 at 5:06 PM, Erick Erickson <er...@gmail.com>
wrote:

> The first question is why you're optimizing at all. It's not recommended
> unless you can demonstrate that an optimized index is giving you enough
> of a performance boost to be worth the effort.
>
> And why are you using an embedded Solr server? That's kind of unusual,
> so I wonder if you've gone down a wrong path somewhere. In other words,
> this feels like an XY problem: you're asking about a specific task
> without explaining the problem you're trying to solve, and there may
> be better alternatives.
>
> Best,
> Erick
>
> On Mon, Jun 15, 2015 at 4:56 PM, Shenghua(Daniel) Wan
> <wa...@gmail.com> wrote:
> > Hi,
> > Do you have any suggestions for improving the performance of merging and
> > optimizing an index?
> > I have been using an embedded Solr server to merge and optimize the
> > index, and I am looking for the right parameters to tune. My use case
> > has about 300 fields plus 250 copyFields, and a moderate doc size
> > (about 65K per doc on average).
> >
> > https://wiki.apache.org/solr/MergingSolrIndexes does not help much.
> >
> > Thanks a lot for any ideas and suggestions.
> >
> > --
> >
> > Regards,
> > Shenghua (Daniel) Wan
>



-- 

Regards,
Shenghua (Daniel) Wan

Re: solr/lucene index merge and optimize performance improvement

Posted by Erick Erickson <er...@gmail.com>.
The first question is why you're optimizing at all. It's not recommended
unless you can demonstrate that an optimized index is giving you enough
of a performance boost to be worth the effort.

And why are you using an embedded Solr server? That's kind of unusual,
so I wonder if you've gone down a wrong path somewhere. In other words,
this feels like an XY problem: you're asking about a specific task
without explaining the problem you're trying to solve, and there may
be better alternatives.

Best,
Erick

On Mon, Jun 15, 2015 at 4:56 PM, Shenghua(Daniel) Wan
<wa...@gmail.com> wrote:
> Hi,
> Do you have any suggestions for improving the performance of merging and
> optimizing an index?
> I have been using an embedded Solr server to merge and optimize the index,
> and I am looking for the right parameters to tune. My use case has about
> 300 fields plus 250 copyFields, and a moderate doc size (about 65K per doc
> on average).
>
> https://wiki.apache.org/solr/MergingSolrIndexes does not help much.
>
> Thanks a lot for any ideas and suggestions.
>
> --
>
> Regards,
> Shenghua (Daniel) Wan