Posted to solr-user@lucene.apache.org by Summer Shire <sh...@gmail.com> on 2015/06/29 09:08:14 UTC

Re: optimize status

Have to, because of performance issues.
Just want to know if there is a way to tap into the status. 
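
One way to tap into it that might be worth trying is to poll the core status and watch the reported segment count drop to the optimize target. A rough, dependency-free sketch in plain Java follows; the core name "core1", the port, and the target of 2 segments are only assumptions, and it relies on the CoreAdmin STATUS response carrying a "segmentCount" field (which recent 4.x/5.x releases appear to do):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

/** Sketch: poll CoreAdmin STATUS until the reported segment count
 *  drops to the optimize target, then resume indexing. */
public class WaitForOptimize {
    public static void main(String[] args) throws Exception {
        String statusUrl =
            "http://localhost:8983/solr/admin/cores?action=STATUS&core=core1&wt=json";
        int target = 2;                  // the maxSegments used by the optimize
        while (readSegmentCount(statusUrl) > target) {
            Thread.sleep(30000);         // poll every 30 seconds
        }
        System.out.println("Optimize appears finished; safe to resume adds.");
    }

    // Naive string scraping to keep the sketch free of JSON libraries.
    private static int readSegmentCount(String url) throws Exception {
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
            for (String line; (line = in.readLine()) != null; ) body.append(line);
        }
        int i = body.indexOf("\"segmentCount\"");
        if (i < 0) return Integer.MAX_VALUE;  // field missing: keep waiting, inspect by hand
        int j = i;
        while (j < body.length() && !Character.isDigit(body.charAt(j))) j++;
        int k = j;
        while (k < body.length() && Character.isDigit(body.charAt(k))) k++;
        return Integer.parseInt(body.substring(j, k));
    }
}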

> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
> 
> Bigger question, why are you optimizing? Since 3.6 or so, it generally
> hasn't been required; it can even be a bad thing.
> 
> Upayavira
> 
>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>> Hi All,
>> 
>> I have two indexers (Independent processes ) writing to a common solr
>> core.
>> If One indexer process issued an optimize on the core 
>> I want the second indexer to wait adding docs until the optimize has
>> finished.
>> 
>> Are there ways I can do this programmatically?
>> pinging the core when the optimize is happening is returning OK because
>> technically
>> solr allows you to update when an optimize is happening. 
>> 
>> any suggestions ?
>> 
>> thanks,
>> Summer

Re: optimize status

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/30/2015 6:23 AM, Erick Erickson wrote:
> I've actually seen this happen right in front of my eyes "in the
> field". However, that was a very high-performance environment. My
> assumption was that fragmented index files were causing more disk
> seeks especially for the first-pass query response in distributed
> mode. So, if the problem is similar, it should go away if you test
> requesting fewer docs. Note: This is not a cure for your problem, but
> would be useful for identifying if it's similar to what I saw.
> 
> NOTE: the symptom was a significant disparity between the QTime (which
> does not measure assembling the document) and the response time. _If_
> that's the case and _if_ my theory that disk access is the culprit,
> then SOLR-5478 and SOLR-6810 should be a big help as they remove the
> first-pass decompression for distributed searches.
> 
> If that hypothesis has any validity, I'd expect you're running on
> spinning-disks rather than SSDs, is that so?

If the index is small enough that all or most of it will fit into the OS
disk cache, then the problem of disk seeks would largely disappear,
because all or most data would be read from RAM, not the disk.

If the index is WAY too big to fit into available RAM (think about
terabyte scale indexes), then I think SSD is probably the only way to
get performance levels that could be called good.

Thanks,
Shawn


Re: optimize status

Posted by Summer Shire <sh...@gmail.com>.
Upayavira:
I am using solr 4.7 and yes I am using TieredMergePolicy

Erick:
All my boxes have SSDs and there isn’t a big disparity between qTime and response time.
The performance hit on my end is because of the fragmented index files causing more disk seeks, as you mentioned.
And I tried requesting fewer docs too, but that did not help.



> On Jun 30, 2015, at 5:23 AM, Erick Erickson <er...@gmail.com> wrote:
> 
> I've actually seen this happen right in front of my eyes "in the
> field". However, that was a very high-performance environment. My
> assumption was that fragmented index files were causing more disk
> seeks especially for the first-pass query response in distributed
> mode. So, if the problem is similar, it should go away if you test
> requesting fewer docs. Note: This is not a cure for your problem, but
> would be useful for identifying if it's similar to what I saw.
> 
> NOTE: the symptom was a significant disparity between the QTime (which
> does not measure assembling the document) and the response time. _If_
> that's the case and _if_ my theory that disk access is the culprit,
> then SOLR-5478 and SOLR-6810 should be a big help as they remove the
> first-pass decompression for distributed searches.
> 
> If that hypothesis has any validity, I'd expect you're running on
> spinning-disks rather than SSDs, is that so?
> 
> Best,
> Erick
> 
> On Tue, Jun 30, 2015 at 2:07 AM, Upayavira <uv...@odoko.co.uk> wrote:
>> We need to work out why your performance is bad without optimise. What
>> version of Solr are you using? Can you confirm that your config is using
>> the TieredMergePolicy?
>> 
>> Upayavira
>> 
>> On Jun 30, 2015, at 04:48 AM, Summer Shire wrote:
>>> Hi Upayavira and Erick,
>>> 
>>> There are two things we are talking about here.
>>> 
>>> First: Why am I optimizing? If I don’t, our SEARCH (NOT INDEXING)
>>> performance is 100% worse.
>>> The problem lies in the number of total segments. We have to have max
>>> segments 1 or 2.
>>> I have done intensive performance related tests around number of
>>> segments, merge factor or changing the Merge policy.
>>> 
>>> Second: Solr does not perform better for me without an optimize. So now
>>> that I have to optimize the second issue
>>> is updating concurrently during an optimize. If I update when an optimize
>>> is happening the optimize takes 5 times as long as
>>> the normal optimize.
>>> 
>>> So is there any way other than creating a postOptimize hook and writing
>>> the status in a file and somehow making it available to the indexer.
>>> All of this just sounds traumatic :)
>>> 
>>> Thanks
>>> Summer
>>> 
>>> 
>>>> On Jun 29, 2015, at 5:40 AM, Erick Erickson <er...@gmail.com> wrote:
>>>> 
>>>> Steven:
>>>> 
>>>> Yes, but....
>>>> 
>>>> First, here's Mike McCandless's excellent blog on segment merging:
>>>> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>>>> 
>>>> I think the third animation is the TieredMergePolicy. In short, yes an
>>>> optimize will reclaim disk space. But as you update, this is done for
>>>> you anyway. About the only time optimizing is at all beneficial is
>>>> when you have a relatively static index. If you're continually
>>>> updating documents, and by that I mean replacing some existing
>>>> documents, then you'll immediately start generating "holes" in your
>>>> index.
>>>> 
>>>> And if you _do_ optimize, you wind up with a huge segment. And since
>>>> the default policy tries to merge segments of roughly the same size,
>>>> it accumulates deletes for quite a while before they are merged away.
>>>> 
>>>> And if you don't update existing docs or delete docs, then there's no
>>>> wasted space anyway.
>>>> 
>>>> Summer:
>>>> 
>>>> First off, why do you care about not updating during optimizing?
>>>> There's no good reason you have to worry about that, you can freely
>>>> update while optimizing.
>>>> 
>>>> But frankly I have to agree with Upayavira that on the face of it
>>>> you're doing a lot of extra work. See above, but you optimize while
>>>> indexing, so immediately you're rather defeating the purpose.
>>>> Personally I'd only optimize relatively static indexes and, by
>>>> definition, your index isn't static since the second process is just
>>>> waiting to modify it.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Mon, Jun 29, 2015 at 8:15 AM, Steven White <sw...@gmail.com> wrote:
>>>>> Hi Upayavira,
>>>>> 
>>>>> This is news to me that we should not optimize an index.
>>>>> 
>>>>> What about disk space saving, isn't optimization to reclaim disk space or
>>>>> does Solr somehow do that?  Where can I read more about this?
>>>>> 
>>>>> I'm on Solr 5.1.0 (may switch to 5.2.1)
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Steve
>>>>> 
>>>>> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
>>>>> 
>>>>>> I'm afraid I don't understand. You're saying that optimising is causing
>>>>>> performance issues?
>>>>>> 
>>>>>> Simple solution: DO NOT OPTIMIZE!
>>>>>> 
>>>>>> Optimisation is very badly named. What it does is squashes all segments
>>>>>> in your index into one segment, removing all deleted documents. It is
>>>>>> good to get rid of deletes - in that sense the index is "optimized".
>>>>>> However, future merges become very expensive. The best way to handle
>>>>>> this topic is to leave it to Lucene/Solr to do it for you. Pretend the
>>>>>> "optimize" option never existed.
>>>>>> 
>>>>>> This is, of course, assuming you are using something like Solr 3.5+.
>>>>>> 
>>>>>> Upayavira
>>>>>> 
>>>>>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
>>>>>>> 
>>>>>>> Have to, because of performance issues.
>>>>>>> Just want to know if there is a way to tap into the status.
>>>>>>> 
>>>>>>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
>>>>>>>> 
>>>>>>>> Bigger question, why are you optimizing? Since 3.6 or so, it generally
>>>>>>>> hasn't been required; it can even be a bad thing.
>>>>>>>> 
>>>>>>>> Upayavira
>>>>>>>> 
>>>>>>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>>>>>>>>> Hi All,
>>>>>>>>> 
>>>>>>>>> I have two indexers (Independent processes ) writing to a common solr
>>>>>>>>> core.
>>>>>>>>> If One indexer process issued an optimize on the core
>>>>>>>>> I want the second indexer to wait adding docs until the optimize has
>>>>>>>>> finished.
>>>>>>>>> 
>>>>>>>>> Are there ways I can do this programmatically?
>>>>>>>>> pinging the core when the optimize is happening is returning OK
>>>>>> because
>>>>>>>>> technically
>>>>>>>>> solr allows you to update when an optimize is happening.
>>>>>>>>> 
>>>>>>>>> any suggestions ?
>>>>>>>>> 
>>>>>>>>> thanks,
>>>>>>>>> Summer
>>>>>> 
>>> 


Re: optimize status

Posted by Erick Erickson <er...@gmail.com>.
I've actually seen this happen right in front of my eyes "in the
field". However, that was a very high-performance environment. My
assumption was that fragmented index files were causing more disk
seeks especially for the first-pass query response in distributed
mode. So, if the problem is similar, it should go away if you test
requesting fewer docs. Note: This is not a cure for your problem, but
would be useful for identifying if it's similar to what I saw.

NOTE: the symptom was a significant disparity between the QTime (which
does not measure assembling the document) and the response time. _If_
that's the case and _if_ my theory that disk access is the culprit,
then SOLR-5478 and SOLR-6810 should be a big help as they remove the
first-pass decompression for distributed searches.

If that hypothesis has any validity, I'd expect you're running on
spinning-disks rather than SSDs, is that so?

Best,
Erick

On Tue, Jun 30, 2015 at 2:07 AM, Upayavira <uv...@odoko.co.uk> wrote:
> We need to work out why your performance is bad without optimise. What
> version of Solr are you using? Can you confirm that your config is using
> the TieredMergePolicy?
>
> Upayavira
>
> On Jun 30, 2015, at 04:48 AM, Summer Shire wrote:
>> Hi Upayavira and Erick,
>>
>> There are two things we are talking about here.
>>
>> First: Why am I optimizing? If I don’t, our SEARCH (NOT INDEXING)
>> performance is 100% worse.
>> The problem lies in the number of total segments. We have to have max
>> segments 1 or 2.
>> I have done intensive performance related tests around number of
>> segments, merge factor or changing the Merge policy.
>>
>> Second: Solr does not perform better for me without an optimize. So now
>> that I have to optimize the second issue
>> is updating concurrently during an optimize. If I update when an optimize
>> is happening the optimize takes 5 times as long as
>> the normal optimize.
>>
>> So is there any way other than creating a postOptimize hook and writing
>> the status in a file and somehow making it available to the indexer.
>> All of this just sounds traumatic :)
>>
>> Thanks
>> Summer
>>
>>
>> > On Jun 29, 2015, at 5:40 AM, Erick Erickson <er...@gmail.com> wrote:
>> >
>> > Steven:
>> >
>> > Yes, but....
>> >
>> > First, here's Mike McCandless's excellent blog on segment merging:
>> > http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>> >
>> > I think the third animation is the TieredMergePolicy. In short, yes an
>> > optimize will reclaim disk space. But as you update, this is done for
>> > you anyway. About the only time optimizing is at all beneficial is
>> > when you have a relatively static index. If you're continually
>> > updating documents, and by that I mean replacing some existing
>> > documents, then you'll immediately start generating "holes" in your
>> > index.
>> >
>> > And if you _do_ optimize, you wind up with a huge segment. And since
>> > the default policy tries to merge segments of roughly the same size,
>> > it accumulates deletes for quite a while before they are merged away.
>> >
>> > And if you don't update existing docs or delete docs, then there's no
>> > wasted space anyway.
>> >
>> > Summer:
>> >
>> > First off, why do you care about not updating during optimizing?
>> > There's no good reason you have to worry about that, you can freely
>> > update while optimizing.
>> >
>> > But frankly I have to agree with Upayavira that on the face of it
>> > you're doing a lot of extra work. See above, but you optimize while
>> > indexing, so immediately you're rather defeating the purpose.
>> > Personally I'd only optimize relatively static indexes and, by
>> > definition, your index isn't static since the second process is just
>> > waiting to modify it.
>> >
>> > Best,
>> > Erick
>> >
>> > On Mon, Jun 29, 2015 at 8:15 AM, Steven White <sw...@gmail.com> wrote:
>> >> Hi Upayavira,
>> >>
>> >> This is news to me that we should not optimize an index.
>> >>
>> >> What about disk space saving, isn't optimization to reclaim disk space or
>> >> does Solr somehow do that?  Where can I read more about this?
>> >>
>> >> I'm on Solr 5.1.0 (may switch to 5.2.1)
>> >>
>> >> Thanks
>> >>
>> >> Steve
>> >>
>> >> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
>> >>
>> >>> I'm afraid I don't understand. You're saying that optimising is causing
>> >>> performance issues?
>> >>>
>> >>> Simple solution: DO NOT OPTIMIZE!
>> >>>
>> >>> Optimisation is very badly named. What it does is squashes all segments
>> >>> in your index into one segment, removing all deleted documents. It is
>> >>> good to get rid of deletes - in that sense the index is "optimized".
>> >>> However, future merges become very expensive. The best way to handle
>> >>> this topic is to leave it to Lucene/Solr to do it for you. Pretend the
>> >>> "optimize" option never existed.
>> >>>
>> >>> This is, of course, assuming you are using something like Solr 3.5+.
>> >>>
>> >>> Upayavira
>> >>>
>> >>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
>> >>>>
>> >>>> Have to, because of performance issues.
>> >>>> Just want to know if there is a way to tap into the status.
>> >>>>
>> >>>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
>> >>>>>
>> >>>>> Bigger question, why are you optimizing? Since 3.6 or so, it generally
>> >>>>> hasn't been required; it can even be a bad thing.
>> >>>>>
>> >>>>> Upayavira
>> >>>>>
>> >>>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>> >>>>>> Hi All,
>> >>>>>>
>> >>>>>> I have two indexers (Independent processes ) writing to a common solr
>> >>>>>> core.
>> >>>>>> If One indexer process issued an optimize on the core
>> >>>>>> I want the second indexer to wait adding docs until the optimize has
>> >>>>>> finished.
>> >>>>>>
>> >>>>>> Are there ways I can do this programmatically?
>> >>>>>> pinging the core when the optimize is happening is returning OK
>> >>> because
>> >>>>>> technically
>> >>>>>> solr allows you to update when an optimize is happening.
>> >>>>>>
>> >>>>>> any suggestions ?
>> >>>>>>
>> >>>>>> thanks,
>> >>>>>> Summer
>> >>>
>>

Re: optimize status

Posted by Upayavira <uv...@odoko.co.uk>.
We need to work out why your performance is bad without optimise. What
version of Solr are you using? Can you confirm that your config is using
the TieredMergePolicy?

Upayavira 

On Jun 30, 2015, at 04:48 AM, Summer Shire wrote:
> Hi Upayavira and Erick,
> 
> There are two things we are talking about here.
> 
> First: Why am I optimizing? If I don’t, our SEARCH (NOT INDEXING)
> performance is 100% worse.
> The problem lies in the number of total segments. We have to have max
> segments 1 or 2. 
> I have done intensive performance related tests around number of
> segments, merge factor or changing the Merge policy.
> 
> Second: Solr does not perform better for me without an optimize. So now
> that I have to optimize the second issue
> is updating concurrently during an optimize. If I update when an optimize
> is happening the optimize takes 5 times as long as
> the normal optimize.
> 
> So is there any way other than creating a postOptimize hook and writing
> the status in a file and somehow making it available to the indexer. 
> All of this just sounds traumatic :) 
> 
> Thanks
> Summer
> 
> 
> > On Jun 29, 2015, at 5:40 AM, Erick Erickson <er...@gmail.com> wrote:
> > 
> > Steven:
> > 
> > Yes, but....
> > 
> > First, here's Mike McCandless's excellent blog on segment merging:
> > http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
> > 
> > I think the third animation is the TieredMergePolicy. In short, yes an
> > optimize will reclaim disk space. But as you update, this is done for
> > you anyway. About the only time optimizing is at all beneficial is
> > when you have a relatively static index. If you're continually
> > updating documents, and by that I mean replacing some existing
> > documents, then you'll immediately start generating "holes" in your
> > index.
> > 
> > And if you _do_ optimize, you wind up with a huge segment. And since
> > the default policy tries to merge segments of roughly the same size,
> > it accumulates deletes for quite a while before they are merged away.
> > 
> > And if you don't update existing docs or delete docs, then there's no
> > wasted space anyway.
> > 
> > Summer:
> > 
> > First off, why do you care about not updating during optimizing?
> > There's no good reason you have to worry about that, you can freely
> > update while optimizing.
> > 
> > But frankly I have to agree with Upayavira that on the face of it
> > you're doing a lot of extra work. See above, but you optimize while
> > indexing, so immediately you're rather defeating the purpose.
> > Personally I'd only optimize relatively static indexes and, by
> > definition, your index isn't static since the second process is just
> > waiting to modify it.
> > 
> > Best,
> > Erick
> > 
> > On Mon, Jun 29, 2015 at 8:15 AM, Steven White <sw...@gmail.com> wrote:
> >> Hi Upayavira,
> >> 
> >> This is news to me that we should not optimize an index.
> >> 
> >> What about disk space saving, isn't optimization to reclaim disk space or
> >> does Solr somehow do that?  Where can I read more about this?
> >> 
> >> I'm on Solr 5.1.0 (may switch to 5.2.1)
> >> 
> >> Thanks
> >> 
> >> Steve
> >> 
> >> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
> >> 
> >>> I'm afraid I don't understand. You're saying that optimising is causing
> >>> performance issues?
> >>> 
> >>> Simple solution: DO NOT OPTIMIZE!
> >>> 
> >>> Optimisation is very badly named. What it does is squashes all segments
> >>> in your index into one segment, removing all deleted documents. It is
> >>> good to get rid of deletes - in that sense the index is "optimized".
> >>> However, future merges become very expensive. The best way to handle
> >>> this topic is to leave it to Lucene/Solr to do it for you. Pretend the
> >>> "optimize" option never existed.
> >>> 
> >>> This is, of course, assuming you are using something like Solr 3.5+.
> >>> 
> >>> Upayavira
> >>> 
> >>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
> >>>> 
> >>>> Have to, because of performance issues.
> >>>> Just want to know if there is a way to tap into the status.
> >>>> 
> >>>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
> >>>>> 
> >>>>> Bigger question, why are you optimizing? Since 3.6 or so, it generally
> >>>>> hasn't been required; it can even be a bad thing.
> >>>>> 
> >>>>> Upayavira
> >>>>> 
> >>>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
> >>>>>> Hi All,
> >>>>>> 
> >>>>>> I have two indexers (Independent processes ) writing to a common solr
> >>>>>> core.
> >>>>>> If One indexer process issued an optimize on the core
> >>>>>> I want the second indexer to wait adding docs until the optimize has
> >>>>>> finished.
> >>>>>> 
> >>>>>> Are there ways I can do this programmatically?
> >>>>>> pinging the core when the optimize is happening is returning OK
> >>> because
> >>>>>> technically
> >>>>>> solr allows you to update when an optimize is happening.
> >>>>>> 
> >>>>>> any suggestions ?
> >>>>>> 
> >>>>>> thanks,
> >>>>>> Summer
> >>> 
> 

Re: optimize status

Posted by Summer Shire <sh...@gmail.com>.
Hi Upayavira and Erick,

There are two things we are talking about here.

First: Why am I optimizing? If I don’t, our SEARCH (NOT INDEXING) performance is 100% worse.
The problem lies in the number of total segments. We have to have max segments 1 or 2. 
I have done intensive performance related tests around number of segments, merge factor or changing the Merge policy.

Second: Solr does not perform better for me without an optimize. So now that I have to optimize, the second issue
is updating concurrently during an optimize. If I update when an optimize is happening the optimize takes 5 times as long as
the normal optimize.
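
For reference, a capped optimize like that can be issued straight against the update handler with the standard optimize/maxSegments/waitSearcher parameters. A minimal sketch in plain Java; the core name "core1" and the port are placeholders, and with waitSearcher=true the call should block until the merge finishes:

public class CappedOptimize {
    public static void main(String[] args) throws Exception {
        // Sketch: ask Solr to merge down to at most two segments.
        java.net.URL url = new java.net.URL(
            "http://localhost:8983/solr/core1/update"
            + "?optimize=true&maxSegments=2&waitSearcher=true");
        try (java.io.InputStream in = url.openStream()) {
            while (in.read() != -1) { /* drain the response */ }
        }
    }
}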

So is there any way other than creating a postOptimize hook, writing the status to a file, and somehow making it available to the indexer?
All of this just sounds traumatic :) 
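
In case it helps to picture it, here is a rough sketch of what such a hook could look like inside <updateHandler> in solrconfig.xml, using the stock RunExecutableListener; the script name and status-file path are hypothetical:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Sketch only: run an external script when an optimize completes;
       the script could write a marker file that the second indexer checks. -->
  <listener event="postOptimize" class="solr.RunExecutableListener">
    <str name="exe">mark-optimize-done.sh</str>
    <str name="dir">/opt/solr/scripts</str>
    <bool name="wait">true</bool>
    <arr name="args">
      <str>/opt/solr/status/optimize-finished</str>
    </arr>
  </listener>
</updateHandler>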

Thanks
Summer


> On Jun 29, 2015, at 5:40 AM, Erick Erickson <er...@gmail.com> wrote:
> 
> Steven:
> 
> Yes, but....
> 
> First, here's Mike McCandless's excellent blog on segment merging:
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
> 
> I think the third animation is the TieredMergePolicy. In short, yes an
> optimize will reclaim disk space. But as you update, this is done for
> you anyway. About the only time optimizing is at all beneficial is
> when you have a relatively static index. If you're continually
> updating documents, and by that I mean replacing some existing
> documents, then you'll immediately start generating "holes" in your
> index.
> 
> And if you _do_ optimize, you wind up with a huge segment. And since
> the default policy tries to merge segments of roughly the same size,
> it accumulates deletes for quite a while before they are merged away.
> 
> And if you don't update existing docs or delete docs, then there's no
> wasted space anyway.
> 
> Summer:
> 
> First off, why do you care about not updating during optimizing?
> There's no good reason you have to worry about that, you can freely
> update while optimizing.
> 
> But frankly I have to agree with Upayavira that on the face of it
> you're doing a lot of extra work. See above, but you optimize while
> indexing, so immediately you're rather defeating the purpose.
> Personally I'd only optimize relatively static indexes and, by
> definition, your index isn't static since the second process is just
> waiting to modify it.
> 
> Best,
> Erick
> 
> On Mon, Jun 29, 2015 at 8:15 AM, Steven White <sw...@gmail.com> wrote:
>> Hi Upayavira,
>> 
>> This is news to me that we should not optimize an index.
>> 
>> What about disk space saving, isn't optimization to reclaim disk space or
>> does Solr somehow do that?  Where can I read more about this?
>> 
>> I'm on Solr 5.1.0 (may switch to 5.2.1)
>> 
>> Thanks
>> 
>> Steve
>> 
>> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
>> 
>>> I'm afraid I don't understand. You're saying that optimising is causing
>>> performance issues?
>>> 
>>> Simple solution: DO NOT OPTIMIZE!
>>> 
>>> Optimisation is very badly named. What it does is squashes all segments
>>> in your index into one segment, removing all deleted documents. It is
>>> good to get rid of deletes - in that sense the index is "optimized".
>>> However, future merges become very expensive. The best way to handle
>>> this topic is to leave it to Lucene/Solr to do it for you. Pretend the
>>> "optimize" option never existed.
>>> 
>>> This is, of course, assuming you are using something like Solr 3.5+.
>>> 
>>> Upayavira
>>> 
>>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
>>>> 
>>>> Have to, because of performance issues.
>>>> Just want to know if there is a way to tap into the status.
>>>> 
>>>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
>>>>> 
>>>>> Bigger question, why are you optimizing? Since 3.6 or so, it generally
>>>>> hasn't been required; it can even be a bad thing.
>>>>> 
>>>>> Upayavira
>>>>> 
>>>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> I have two indexers (Independent processes ) writing to a common solr
>>>>>> core.
>>>>>> If One indexer process issued an optimize on the core
>>>>>> I want the second indexer to wait adding docs until the optimize has
>>>>>> finished.
>>>>>> 
>>>>>> Are there ways I can do this programmatically?
>>>>>> pinging the core when the optimize is happening is returning OK
>>> because
>>>>>> technically
>>>>>> solr allows you to update when an optimize is happening.
>>>>>> 
>>>>>> any suggestions ?
>>>>>> 
>>>>>> thanks,
>>>>>> Summer
>>> 


Re: optimize status

Posted by Erick Erickson <er...@gmail.com>.
Steven:

Yes, but....

First, here's Mike McCandless's excellent blog on segment merging:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

I think the third animation is the TieredMergePolicy. In short, yes an
optimize will reclaim disk space. But as you update, this is done for
you anyway. About the only time optimizing is at all beneficial is
when you have a relatively static index. If you're continually
updating documents, and by that I mean replacing some existing
documents, then you'll immediately start generating "holes" in your
index.

And if you _do_ optimize, you wind up with a huge segment. And since
the default policy tries to merge segments of roughly the same size,
it accumulates deletes for quite a while before they are merged away.

And if you don't update existing docs or delete docs, then there's no
wasted space anyway.

Summer:

First off, why do you care about not updating during optimizing?
There's no good reason you have to worry about that, you can freely
update while optimizing.

But frankly I have to agree with Upayavira that on the face of it
you're doing a lot of extra work. See above, but you optimize while
indexing, so immediately you're rather defeating the purpose.
Personally I'd only optimize relatively static indexes and, by
definition, your index isn't static since the second process is just
waiting to modify it.

Best,
Erick

On Mon, Jun 29, 2015 at 8:15 AM, Steven White <sw...@gmail.com> wrote:
> Hi Upayavira,
>
> This is news to me that we should not optimize an index.
>
> What about disk space saving, isn't optimization to reclaim disk space or
> does Solr somehow do that?  Where can I read more about this?
>
> I'm on Solr 5.1.0 (may switch to 5.2.1)
>
> Thanks
>
> Steve
>
> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
>
>> I'm afraid I don't understand. You're saying that optimising is causing
>> performance issues?
>>
>> Simple solution: DO NOT OPTIMIZE!
>>
>> Optimisation is very badly named. What it does is squashes all segments
>> in your index into one segment, removing all deleted documents. It is
>> good to get rid of deletes - in that sense the index is "optimized".
>> However, future merges become very expensive. The best way to handle
>> this topic is to leave it to Lucene/Solr to do it for you. Pretend the
>> "optimize" option never existed.
>>
>> This is, of course, assuming you are using something like Solr 3.5+.
>>
>> Upayavira
>>
>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
>> >
>> > Have to, because of performance issues.
>> > Just want to know if there is a way to tap into the status.
>> >
>> > > On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
>> > >
>> > > Bigger question, why are you optimizing? Since 3.6 or so, it generally
>> > > hasn't been required; it can even be a bad thing.
>> > >
>> > > Upayavira
>> > >
>> > >> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>> > >> Hi All,
>> > >>
>> > >> I have two indexers (Independent processes ) writing to a common solr
>> > >> core.
>> > >> If One indexer process issued an optimize on the core
>> > >> I want the second indexer to wait adding docs until the optimize has
>> > >> finished.
>> > >>
>> > >> Are there ways I can do this programmatically?
>> > >> pinging the core when the optimize is happening is returning OK
>> because
>> > >> technically
>> > >> solr allows you to update when an optimize is happening.
>> > >>
>> > >> any suggestions ?
>> > >>
>> > >> thanks,
>> > >> Summer
>>

RE: optimize status

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.
I see what you mean.   Many thanks for the details.   

-----Original Message-----
From: Toke Eskildsen [mailto:te@statsbiblioteket.dk] 
Sent: Monday, June 29, 2015 6:36 PM
To: solr-user@lucene.apache.org
Subject: Re: optimize status

Reitzel, Charles <Ch...@tiaa-cref.org> wrote:
> Question, Toke: in your "immutable" cases, don't the benefits of 
> optimizing come mostly from eliminating deleted records?

Not for us. We have about 1 deleted document for every 1000 or 10.000 standard documents.

> Is there any material difference in heap, CPU, etc. between 1, 5 or 10 segments?
> I.e. at how many segments/shard do you see a noticeable performance hit?

It really is either 1 or more than 1 segment, coupled with 0 deleted records or more than 0.

Having 1 segment means that String faceting benefits from not having to map between segment ordinals and global ordinals. That's a speed increase (just a null check instead of a memory lookup) as well as a heap requirement reduction: We save 2GB+ heap per shard on that account (our current heap size is 8GB). Granted, we facet on 600M values for one of the fields, which I don't think is very common.

0 deleted records is related as the usual bitmap of deleted documents is null, meaning faster checks.

Most of the performance benefit probably comes from the freed memory. We have 25 shards/machine, so sparing 2GB gives us an extra 50GB of disk cache. The performance increase for that is 20-40%, guesstimated from some previous tests where we varied the disk cache size.


I doubt that there is much difference between 2, 5, 10 or even 20 segments. The persons at UKWA are running some tests on different degrees of optimization of their 30 shard TB-class index. You'll have to dig a bit, but there might be relevant results: https://github.com/ukwa/shine/tree/master/python/test-logs

> Also, I'm curious if you have experimented much with the
> maxMergedSegmentMB and reclaimDeletesWeight  properties of the TieredMergePolicy?

I have zero experience with that: We build the shards one at a time and don't touch them after that. 90% of our building power goes to Tika analysis, so there hasn't been an apparent need for tuning Solr's indexing.

- Toke Eskildsen

*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************


Re: optimize status

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Reitzel, Charles <Ch...@tiaa-cref.org> wrote:
> Question, Toke: in your "immutable" cases, don't the benefits of
> optimizing come mostly from eliminating deleted records?

Not for us. We have about 1 deleted document for every 1000 or 10.000 standard documents.

> Is there any material difference in heap, CPU, etc. between 1, 5 or 10 segments?
> I.e. at how many segments/shard do you see a noticeable performance hit?

It really is either 1 or more than 1 segment, coupled with 0 deleted records or more than 0.

Having 1 segment means that String faceting benefits from not having to map between segment ordinals and global ordinals. That's a speed increase (just a null check instead of a memory lookup) as well as a heap requirement reduction: We save 2GB+ heap per shard on that account (our current heap size is 8GB). Granted, we facet on 600M values for one of the fields, which I don't think is very common.

0 deleted records is related as the usual bitmap of deleted documents is null, meaning faster checks.

Most of the performance benefit probably comes from the freed memory. We have 25 shards/machine, so sparing 2GB gives us an extra 50GB of disk cache. The performance increase for that is 20-40%, guesstimated from some previous tests where we varied the disk cache size.


I doubt that there is much difference between 2, 5, 10 or even 20 segments. The persons at UKWA are running some tests on different degrees of optimization of their 30 shard TB-class index. You'll have to dig a bit, but there might be relevant results: https://github.com/ukwa/shine/tree/master/python/test-logs

> Also, I'm curious if you have experimented much with the maxMergedSegmentMB
> and reclaimDeletesWeight  properties of the TieredMergePolicy?

I have zero experience with that: We build the shards one at a time and don't touch them after that. 90% of our building power goes to Tika analysis, so there hasn't been an apparent need for tuning Solr's indexing.

- Toke Eskildsen

RE: optimize status

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.
Question, Toke: in your "immutable" cases, don't the benefits of optimizing come mostly from eliminating deleted records?   Is there any material difference in heap, CPU, etc. between 1, 5 or 10 segments?   I.e. at how many segments/shard do you see a noticeable performance hit?

Also, I'm curious if you have experimented much with the maxMergedSegmentMB and reclaimDeletesWeight properties of the TieredMergePolicy?

For frequently updated indexes, would setting maxMergedSegmentMB lower (say 512 or 1024 MB, depending on total index size) and reclaimDeletesWeight higher (say 2.5?) be a good best practice?
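
For concreteness, a sketch of where those knobs would sit in a 4.x/5.x-era solrconfig.xml; the values are just the ones floated above, not recommendations:

<indexConfig>
  <!-- Sketch: tune the default TieredMergePolicy rather than optimizing. -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <double name="maxMergedSegmentMB">1024</double>
    <double name="reclaimDeletesWeight">2.5</double>
  </mergePolicy>
</indexConfig>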

-----Original Message-----
From: Toke Eskildsen [mailto:te@statsbiblioteket.dk] 
Sent: Monday, June 29, 2015 3:56 PM
To: solr-user@lucene.apache.org
Subject: Re: optimize status

Reitzel, Charles <Ch...@tiaa-cref.org> wrote:
> Is there really a good reason to consolidate down to a single segment?

In the scenario spawning this thread it does not seem to be the best choice. Speaking more broadly, there are Solr setups out there that deal with immutable data, often tied to a point in time, e.g. log data. We have such a setup (harvested web resources) and are able to lower heap requirements significantly and increase speed by building fully optimized and immutable shards.

> Any incremental query performance benefit is tiny compared to the loss of manageability.

True in many cases and I agree that the "Optimize"-wording is a bit of a trap. While technically correct, it implies that one should do it occasionally to keep any index fit. A different wording and maybe a tooltip saying something like "Only recommended for non-changing indexes" might be better.

Turning it around: To minimize the risk of occasional performance-degrading large merges, one might want an index where all the shards are below a certain size. Splitting larger shards into smaller ones would in that case also be an optimization, just towards a different goal.

- Toke Eskildsen

*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************


Re: optimize status

Posted by Upayavira <uv...@odoko.co.uk>.
For the sake of history, somewhere around Solr/Lucene 3.2 a new
"MergePolicy" was introduced. The old one merged simply based upon age,
or "index generation", meaning the older the segment, the less likely it
would get merged, hence needing optimize to clear out deletes from your
older segments.

The new MergePolicy, the TieredMergePolicy, uses a more intelligent
algorithm to decide which segments to merge, and is the single reason
why optimization isn't recommended anymore. According to the javadocs:

"For normal merging, this policy first computes a "budget" of how many
segments are allowed to be in the index. If the index is over-budget,
then the policy sorts segments by decreasing size (pro-rating by percent
deletes), and then finds the least-cost merge. Merge cost is measured by
a combination of the "skew" of the merge (size of largest segment
divided by smallest segment), total merge size and percent deletes
reclaimed, so that merges with lower skew, smaller size and those
reclaiming more deletes, are favored.

If a merge will produce a segment that's larger than
setMaxMergedSegmentMB(double), then the policy will merge fewer segments
(down to 1 at once, if that one has deletions) to keep the segment size
under budget."

Upayavira


On Mon, Jun 29, 2015, at 08:55 PM, Toke Eskildsen wrote:
> Reitzel, Charles <Ch...@tiaa-cref.org> wrote:
> > Is there really a good reason to consolidate down to a single segment?
> 
> In the scenario spawning this thread it does not seem to be the best
> choice. Speaking more broadly, there are Solr setups out there that deal
> with immutable data, often tied to a point in time, e.g. log data. We
> have such a setup (harvested web resources) and are able to lower heap
> requirements significantly and increase speed by building fully optimized
> and immutable shards.
> 
> > Any incremental query performance benefit is tiny compared to the loss of manageability.
> 
> True in many cases and I agree that the "Optimize"-wording is a bit of a
> trap. While technically correct, it implies that one should do it
> occasionally to keep any index fit. A different wording and maybe a
> tooltip saying something like "Only recommended for non-changing indexes"
> might be better.
> 
> Turning it around: To minimize the risk of occasional
> performance-degrading large merges, one might want an index where all the
> shards are below a certain size. Splitting larger shards into smaller
> ones would in that case also be an optimization, just towards a different
> goal.
> 
> - Toke Eskildsen

Re: optimize status

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Reitzel, Charles <Ch...@tiaa-cref.org> wrote:
> Is there really a good reason to consolidate down to a single segment?

In the scenario spawning this thread it does not seem to be the best choice. Speaking more broadly, there are Solr setups out there that deal with immutable data, often tied to a point in time, e.g. log data. We have such a setup (harvested web resources) and are able to lower heap requirements significantly and increase speed by building fully optimized and immutable shards.

Any incremental query performance benefit is tiny compared to the loss of manageability.

True in many cases and I agree that the "Optimize"-wording is a bit of a trap. While technically correct, it implies that one should do it occasionally to keep any index fit. A different wording and maybe a tooltip saying something like "Only recommended for non-changing indexes" might be better.

Turning it around: To minimize the risk of occasional performance-degrading large merges, one might want an index where all the shards are below a certain size. Splitting larger shards into smaller ones would in that case also be an optimization, just towards a different goal.

- Toke Eskildsen

RE: optimize status

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.
Thanks, Upayavira and Shawn, for sharing what is happening internally.   Your points (cluster state explosion, segment per commit, solr/lucene split) are well taken.

Wishful thinking aside, my gut instinct is that such a scheme would cause Solr's stellar indexing speed to drop dramatically to match MongoDB's (indexing speed is not its strong point) ...   :-)

-----Original Message-----
From: Upayavira [mailto:uv@odoko.co.uk] 
Sent: Tuesday, June 30, 2015 2:46 PM
To: solr-user@lucene.apache.org
Subject: Re: optimize status



On Tue, Jun 30, 2015, at 04:42 PM, Shawn Heisey wrote:
> On 6/29/2015 2:48 PM, Reitzel, Charles wrote:
> > I take your point about shards and segments being different things.  I understand that the hash ranges per segment are not kept in ZK.   I guess I wish they were.
> >
> > In this regard, I liked how MongoDB uses a 2-level sharding scheme.   Each shard manages a list of "chunks", each with its own hash range which is kept in the cluster state.   If data needs to be balanced across nodes, it works at the chunk level.  No record/doc level I/O is necessary.   Much more targeted, and only the data that needs to move is touched.  Solr does most things better than Mongo, imo.  But this is one area where Mongo got it right.
> 
> Segment detail would not only lead to a data explosion in the 
> clusterstate, it would be crossing abstraction boundaries, and would 
> potentially require updating the clusterstate just because a single 
> document was inserted into the index.  That one tiny update could (and 
> probably would) create a new segment on one shard.  Due to the way 
> SolrCloud replicates data during normal operation, every replica for a 
> given shard might have a different set of segments, which means 
> segments would need to be tracked at the replica level, not the shard level.
> 
> Also, Solr cannot control which hash ranges end up in each segment. 
> Solr only knows about the index as a whole ... implementation details 
> like segments are left entirely up to Lucene, and although I admit to 
> not knowing Lucene internals very well, I don't think Lucene offers 
> any way to control that either.  You mention that MongoDB dictates 
> which hash ranges end up in each chunk.  That implies that MongoDB can 
> control each chunk.  If we move the analogy to Solr, it breaks down 
> because Solr cannot control segments.  Although Solr does have several 
> configuration knobs that affect how segments are created, those 
> configurations are simply passed through to Lucene, Solr itself does 
> not use that information.

To put it more specifically - when a (hard) commit happens, all of the documents in that commit are written into a new segment. Thus, it has no bearing on what hash range is used. A segment can never be edited. When there are too many, segments are merged into a new one, and the originals deleted. So, there is no way for Solr/Lucene to insert a document into anything other than a brand new segment.

Hence, the idea of using a second level of sharding at the segment level does not fit with how a lucene index is structured.

Upayavira

*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************


Re: optimize status

Posted by Upayavira <uv...@odoko.co.uk>.

On Tue, Jun 30, 2015, at 04:42 PM, Shawn Heisey wrote:
> On 6/29/2015 2:48 PM, Reitzel, Charles wrote:
> > I take your point about shards and segments being different things.  I understand that the hash ranges per segment are not kept in ZK.   I guess I wish they were.
> >
> > In this regard, I liked how MongoDB uses a 2-level sharding scheme.   Each shard manages a list of "chunks", each with its own hash range which is kept in the cluster state.   If data needs to be balanced across nodes, it works at the chunk level.  No record/doc level I/O is necessary.   Much more targeted, and only the data that needs to move is touched.  Solr does most things better than Mongo, imo.  But this is one area where Mongo got it right.
> 
> Segment detail would not only lead to a data explosion in the
> clusterstate, it would be crossing abstraction boundaries, and would
> potentially require updating the clusterstate just because a single
> document was inserted into the index.  That one tiny update could (and
> probably would) create a new segment on one shard.  Due to the way
> SolrCloud replicates data during normal operation, every replica for a
> given shard might have a different set of segments, which means segments
> would need to be tracked at the replica level, not the shard level.
> 
> Also, Solr cannot control which hash ranges end up in each segment. 
> Solr only knows about the index as a whole ... implementation details
> like segments are left entirely up to Lucene, and although I admit to
> not knowing Lucene internals very well, I don't think Lucene offers any
> way to control that either.  You mention that MongoDB dictates which
> hash ranges end up in each chunk.  That implies that MongoDB can control
> each chunk.  If we move the analogy to Solr, it breaks down because Solr
> cannot control segments.  Although Solr does have several configuration
> knobs that affect how segments are created, those configurations are
> simply passed through to Lucene, Solr itself does not use that
> information.

To put it more specifically - when a (hard) commit happens, all of the
documents in that commit are written into a new segment. Thus, it has no
bearing on what hash range is used. A segment can never be edited. When
there are too many, segments are merged into a new one, and the
originals deleted. So, there is no way for Solr/Lucene to insert a
document into anything other than a brand new segment.

Hence, the idea of using a second level of sharding at the segment level
does not fit with how a lucene index is structured.

Upayavira

Re: optimize status

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/29/2015 2:48 PM, Reitzel, Charles wrote:
> I take your point about shards and segments being different things.  I understand that the hash ranges per segment are not kept in ZK.   I guess I wish they were.
>
> In this regard, I liked how MongoDB uses a 2-level sharding scheme.   Each shard manages a list of "chunks", each with its own hash range which is kept in the cluster state.   If data needs to be balanced across nodes, it works at the chunk level.  No record/doc level I/O is necessary.   Much more targeted, and only the data that needs to move is touched.  Solr does most things better than Mongo, imo.  But this is one area where Mongo got it right.

Segment detail would not only lead to a data explosion in the
clusterstate, it would be crossing abstraction boundaries, and would
potentially require updating the clusterstate just because a single
document was inserted into the index.  That one tiny update could (and
probably would) create a new segment on one shard.  Due to the way
SolrCloud replicates data during normal operation, every replica for a
given shard might have a different set of segments, which means segments
would need to be tracked at the replica level, not the shard level.

Also, Solr cannot control which hash ranges end up in each segment. 
Solr only knows about the index as a whole ... implementation details
like segments are left entirely up to Lucene, and although I admit to
not knowing Lucene internals very well, I don't think Lucene offers any
way to control that either.  You mention that MongoDB dictates which
hash ranges end up in each chunk.  That implies that MongoDB can control
each chunk.  If we move the analogy to Solr, it breaks down because Solr
cannot control segments.  Although Solr does have several configuration
knobs that affect how segments are created, those configurations are
simply passed through to Lucene, Solr itself does not use that information.

Thanks,
Shawn


RE: optimize status

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.
Hi Garth,

Yes, I'm straying from OP's question (I think Steve is all set).   But his question, quite naturally, comes up often and a similar discussion ensues each time.

I take your point about shards and segments being different things.  I understand that the hash ranges per segment are not kept in ZK.   I guess I wish they were.

In this regard, I liked how MongoDB uses a 2-level sharding scheme.   Each shard manages a list of "chunks", each with its own hash range which is kept in the cluster state.   If data needs to be balanced across nodes, it works at the chunk level.  No record/doc level I/O is necessary.   Much more targeted, and only the data that needs to move is touched.  Solr does most things better than Mongo, imo.  But this is one area where Mongo got it right.

As for your example, what benefit does an application gain by reducing 10 segments, say, down to 1?   Even if the index never changes?   The gain _might_ be measurable, but it will be small compared to performance gains that can be had by maintaining a good data balance across nodes.

Your example is based on implicit routing.  So dynamic management of shards is less applicable.  I just hope you get similar volumes of data every year.   Otherwise, some years will perform better than others due to unbalanced data distribution!

best,
Charlie


-----Original Message-----
From: Garth Grimm [mailto:gdgrimm@yahoo.com.INVALID] 
Sent: Monday, June 29, 2015 1:15 PM
To: solr-user@lucene.apache.org
Subject: RE: optimize status

" Is there really a good reason to consolidate down to a single segment?"

Archiving (as one example).  Come July 1, the collection for log entries/transactions in June will never be changed, so optimizing is actually a good thing to do.

Kind of getting away from OP's question on this, but I don't think the ability to move data between shards in SolrCloud (such as shard splitting) has much to do with the Lucene segments under the hood.  I'm just guessing, but I'd think the main issue with shard splitting would be to ensure that document route ranges are handled properly, and I don't think the value used for routing has anything to do with what segment they happen to be stored into.

-----Original Message-----
From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org]
Sent: Monday, June 29, 2015 11:38 AM
To: solr-user@lucene.apache.org
Subject: RE: optimize status

Is there really a good reason to consolidate down to a single segment?

Any incremental query performance benefit is tiny compared to the loss of
manageability.

I.e. shouldn't segments _always_ be kept small enough to facilitate
re-balancing data across shards?   Even in non-cloud instances this is true.
When a collection grows, you may want shard/split an existing index by
adding a node and moving some segments around.    Isn't this the direction
Solr is going?   With many, smaller segments, this is feasible.  With "one
big segment", the collection must always be reindexed.

Thus, "optimize" would mean, "get rid of all deleted records" and would, in
fact, optimize queries by eliminating wasted I/O.   Perhaps worth it for
slowly changing indexes.   Seems like the Tiered merge policy is 90% there
...    Or am I all wet (again)?

-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org]
Sent: Monday, June 29, 2015 10:39 AM
To: solr-user@lucene.apache.org
Subject: Re: optimize status

"Optimize" is a manual full merge.

Solr automatically merges segments as needed. This also expunges deleted documents.

We really need to rename "optimize" to "force merge". Is there a Jira for that?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Jun 29, 2015, at 5:15 AM, Steven White <sw...@gmail.com> wrote:

> Hi Upayavira,
> 
> This is news to me that we should not optimize an index.
> 
> What about disk space saving, isn't optimization to reclaim disk space 
> or does Solr somehow do that?  Where can I read more about this?
> 
> I'm on Solr 5.1.0 (may switch to 5.2.1)
> 
> Thanks
> 
> Steve
> 
> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
> 
>> I'm afraid I don't understand. You're saying that optimising is 
>> causing performance issues?
>> 
>> Simple solution: DO NOT OPTIMIZE!
>> 
>> Optimisation is very badly named. What it does is squashes all 
>> segments in your index into one segment, removing all deleted 
>> documents. It is good to get rid of deletes - in that sense the index 
>> is
"optimized".
>> However, future merges become very expensive. The best way to handle 
>> this topic is to leave it to Lucene/Solr to do it for you. Pretend 
>> the "optimize" option never existed.
>> 
>> This is, of course, assuming you are using something like Solr 3.5+.
>> 
>> Upayavira
>> 
>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
>>> 
>>> Have to, because of performance issues.
>>> Just want to know if there is a way to tap into the status.
>>> 
>>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
>>>> 
>>>> Bigger question, why are you optimizing? Since 3.6 or so, it 
>>>> generally hasn't been required; it can even be a bad thing.
>>>> 
>>>> Upayavira
>>>> 
>>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>>>>> Hi All,
>>>>> 
>>>>> I have two indexers (Independent processes ) writing to a common 
>>>>> solr core.
>>>>> If One indexer process issued an optimize on the core I want the 
>>>>> second indexer to wait adding docs until the optimize has 
>>>>> finished.
>>>>> 
>>>>> Are there ways I can do this programmatically?
>>>>> pinging the core when the optimize is happening is returning OK
>> because
>>>>> technically
>>>>> solr allows you to update when an optimize is happening.
>>>>> 
>>>>> any suggestions ?
>>>>> 
>>>>> thanks,
>>>>> Summer
>> 


*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************



*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************


RE: optimize status

Posted by Garth Grimm <gd...@yahoo.com.INVALID>.
" Is there really a good reason to consolidate down to a single segment?"

Archiving (as one example).  Come July 1, the collection for log
entries/transactions in June will never be changed, so optimizing is
actually a good thing to do.

Kind of getting away from OP's question on this, but I don't think the
ability to move data between shards in SolrCloud (such as shard splitting)
has much to do with the Lucene segments under the hood.  I'm just guessing,
but I'd think the main issue with shard splitting would be to ensure that
document route ranges are handled properly, and I don't think the value used
for routing has anything to do with what segment they happen to be stored
into.

-----Original Message-----
From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org] 
Sent: Monday, June 29, 2015 11:38 AM
To: solr-user@lucene.apache.org
Subject: RE: optimize status

Is there really a good reason to consolidate down to a single segment?

Any incremental query performance benefit is tiny compared to the loss of
manageability.

I.e. shouldn't segments _always_ be kept small enough to facilitate
re-balancing data across shards?   Even in non-cloud instances this is true.
When a collection grows, you may want shard/split an existing index by
adding a node and moving some segments around.    Isn't this the direction
Solr is going?   With many, smaller segments, this is feasible.  With "one
big segment", the collection must always be reindexed.

Thus, "optimize" would mean, "get rid of all deleted records" and would, in
fact, optimize queries by eliminating wasted I/O.   Perhaps worth it for
slowly changing indexes.   Seems like the Tiered merge policy is 90% there
...    Or am I all wet (again)?

-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org]
Sent: Monday, June 29, 2015 10:39 AM
To: solr-user@lucene.apache.org
Subject: Re: optimize status

"Optimize" is a manual full merge.

Solr automatically merges segments as needed. This also expunges deleted
documents.

We really need to rename "optimize" to "force merge". Is there a Jira for
that?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Jun 29, 2015, at 5:15 AM, Steven White <sw...@gmail.com> wrote:

> Hi Upayavira,
> 
> This is news to me that we should not optimize an index.
> 
> What about disk space saving, isn't optimization to reclaim disk space 
> or is Solr somehow does that?  Where can I read more about this?
> 
> I'm on Solr 5.1.0 (may switch to 5.2.1)
> 
> Thanks
> 
> Steve
> 
> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
> 
>> I'm afraid I don't understand. You're saying that optimising is 
>> causing performance issues?
>> 
>> Simple solution: DO NOT OPTIMIZE!
>> 
>> Optimisation is very badly named. What it does is squashes all 
>> segments in your index into one segment, removing all deleted 
>> documents. It is good to get rid of deletes - in that sense the index is
"optimized".
>> However, future merges become very expensive. The best way to handle 
>> this topic is to leave it to Lucene/Solr to do it for you. Pretend 
>> the "optimize" option never existed.
>> 
>> This is, of course, assuming you are using something like Solr 3.5+.
>> 
>> Upayavira
>> 
>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
>>> 
>>> Have to cause of performance issues.
>>> Just want to know if there is a way to tap into the status.
>>> 
>>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
>>>> 
>>>> Bigger question, why are you optimizing? Since 3.6 or so, it 
>>>> generally hasn't been requires, even, is a bad thing.
>>>> 
>>>> Upayavira
>>>> 
>>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>>>>> Hi All,
>>>>> 
>>>>> I have two indexers (Independent processes ) writing to a common 
>>>>> solr core.
>>>>> If One indexer process issued an optimize on the core I want the 
>>>>> second indexer to wait adding docs until the optimize has 
>>>>> finished.
>>>>> 
>>>>> Are there ways I can do this programmatically?
>>>>> pinging the core when the optimize is happening is returning OK
>> because
>>>>> technically
>>>>> solr allows you to update when an optimize is happening.
>>>>> 
>>>>> any suggestions ?
>>>>> 
>>>>> thanks,
>>>>> Summer
>> 


*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately
and then delete it.

TIAA-CREF
*************************************************************************



Re: optimize status

Posted by Steven White <sw...@gmail.com>.
Thank you guys, this was very helpful.  I was always under the impression
that the index needed to be optimized periodically to reclaim disk space,
otherwise it would just keep growing and growing (was that the case in
Lucene 2.x and prior days?).

I agree with Walter: renaming "optimize" to something else, even "force
merge", would be better.  However, make sure it has proper documentation
explaining what it does and why it's not worthwhile for live data.

Steve

On Mon, Jun 29, 2015 at 12:37 PM, Reitzel, Charles <
Charles.Reitzel@tiaa-cref.org> wrote:

> Is there really a good reason to consolidate down to a single segment?
>
> Any incremental query performance benefit is tiny compared to the loss of
> managability.
>
> I.e. shouldn't segments _always_ be kept small enough to facilitate
> re-balancing data across shards?   Even in non-cloud instances this is
> true.  When a collection grows, you may want shard/split an existing index
> by adding a node and moving some segments around.    Isn't this the
> direction Solr is going?   With many, smaller segments, this is feasible.
> With "one big segment", the collection must always be reindexed.
>
> Thus, "optimize" would mean, "get rid of all deleted records" and would,
> in fact, optimize queries by eliminating wasted I/O.   Perhaps worth it for
> slowly changing indexes.   Seems like the Tiered merge policy is 90% there
> ...    Or am I all wet (again)?
>
> -----Original Message-----
> From: Walter Underwood [mailto:wunder@wunderwood.org]
> Sent: Monday, June 29, 2015 10:39 AM
> To: solr-user@lucene.apache.org
> Subject: Re: optimize status
>
> "Optimize" is a manual full merge.
>
> Solr automatically merges segments as needed. This also expunges deleted
> documents.
>
> We really need to rename "optimize" to "force merge". Is there a Jira for
> that?
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Jun 29, 2015, at 5:15 AM, Steven White <sw...@gmail.com> wrote:
>
> > Hi Upayavira,
> >
> > This is news to me that we should not optimize and index.
> >
> > What about disk space saving, isn't optimization to reclaim disk space
> > or is Solr somehow does that?  Where can I read more about this?
> >
> > I'm on Solr 5.1.0 (may switch to 5.2.1)
> >
> > Thanks
> >
> > Steve
> >
> > On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
> >
> >> I'm afraid I don't understand. You're saying that optimising is
> >> causing performance issues?
> >>
> >> Simple solution: DO NOT OPTIMIZE!
> >>
> >> Optimisation is very badly named. What it does is squashes all
> >> segments in your index into one segment, removing all deleted
> >> documents. It is good to get rid of deletes - in that sense the index
> is "optimized".
> >> However, future merges become very expensive. The best way to handle
> >> this topic is to leave it to Lucene/Solr to do it for you. Pretend
> >> the "optimize" option never existed.
> >>
> >> This is, of course, assuming you are using something like Solr 3.5+.
> >>
> >> Upayavira
> >>
> >> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
> >>>
> >>> Have to cause of performance issues.
> >>> Just want to know if there is a way to tap into the status.
> >>>
> >>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
> >>>>
> >>>> Bigger question, why are you optimizing? Since 3.6 or so, it
> >>>> generally hasn't been requires, even, is a bad thing.
> >>>>
> >>>> Upayavira
> >>>>
> >>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
> >>>>> Hi All,
> >>>>>
> >>>>> I have two indexers (Independent processes ) writing to a common
> >>>>> solr core.
> >>>>> If One indexer process issued an optimize on the core I want the
> >>>>> second indexer to wait adding docs until the optimize has
> >>>>> finished.
> >>>>>
> >>>>> Are there ways I can do this programmatically?
> >>>>> pinging the core when the optimize is happening is returning OK
> >> because
> >>>>> technically
> >>>>> solr allows you to update when an optimize is happening.
> >>>>>
> >>>>> any suggestions ?
> >>>>>
> >>>>> thanks,
> >>>>> Summer
> >>
>
>
> *************************************************************************
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender
> immediately and then delete it.
>
> TIAA-CREF
> *************************************************************************
>
>

RE: optimize status

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.
Is there really a good reason to consolidate down to a single segment?

Any incremental query performance benefit is tiny compared to the loss of manageability.

I.e. shouldn't segments _always_ be kept small enough to facilitate re-balancing data across shards?  Even in non-cloud instances this is true.  When a collection grows, you may want to shard/split an existing index by adding a node and moving some segments around.  Isn't this the direction Solr is going?  With many smaller segments, this is feasible.  With "one big segment", the collection must always be reindexed.

Thus, "optimize" would mean "get rid of all deleted records" and would, in fact, optimize queries by eliminating wasted I/O.  Perhaps worth it for slowly changing indexes.  Seems like the Tiered merge policy is 90% there ...  Or am I all wet (again)?

-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Monday, June 29, 2015 10:39 AM
To: solr-user@lucene.apache.org
Subject: Re: optimize status

"Optimize" is a manual full merge.

Solr automatically merges segments as needed. This also expunges deleted documents.

We really need to rename "optimize" to "force merge". Is there a Jira for that?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Jun 29, 2015, at 5:15 AM, Steven White <sw...@gmail.com> wrote:

> Hi Upayavira,
> 
> This is news to me that we should not optimize and index.
> 
> What about disk space saving, isn't optimization to reclaim disk space 
> or is Solr somehow does that?  Where can I read more about this?
> 
> I'm on Solr 5.1.0 (may switch to 5.2.1)
> 
> Thanks
> 
> Steve
> 
> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
> 
>> I'm afraid I don't understand. You're saying that optimising is 
>> causing performance issues?
>> 
>> Simple solution: DO NOT OPTIMIZE!
>> 
>> Optimisation is very badly named. What it does is squashes all 
>> segments in your index into one segment, removing all deleted 
>> documents. It is good to get rid of deletes - in that sense the index is "optimized".
>> However, future merges become very expensive. The best way to handle 
>> this topic is to leave it to Lucene/Solr to do it for you. Pretend 
>> the "optimize" option never existed.
>> 
>> This is, of course, assuming you are using something like Solr 3.5+.
>> 
>> Upayavira
>> 
>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
>>> 
>>> Have to cause of performance issues.
>>> Just want to know if there is a way to tap into the status.
>>> 
>>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
>>>> 
>>>> Bigger question, why are you optimizing? Since 3.6 or so, it 
>>>> generally hasn't been requires, even, is a bad thing.
>>>> 
>>>> Upayavira
>>>> 
>>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>>>>> Hi All,
>>>>> 
>>>>> I have two indexers (Independent processes ) writing to a common 
>>>>> solr core.
>>>>> If One indexer process issued an optimize on the core I want the 
>>>>> second indexer to wait adding docs until the optimize has 
>>>>> finished.
>>>>> 
>>>>> Are there ways I can do this programmatically?
>>>>> pinging the core when the optimize is happening is returning OK
>> because
>>>>> technically
>>>>> solr allows you to update when an optimize is happening.
>>>>> 
>>>>> any suggestions ?
>>>>> 
>>>>> thanks,
>>>>> Summer
>> 


*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************


Re: optimize status

Posted by Walter Underwood <wu...@wunderwood.org>.
“Optimize” is a manual full merge.

Solr automatically merges segments as needed. This also expunges deleted documents.
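
You can watch that happen: the Luke handler reports segment and deletion
counts, so a quick check before and after a batch of updates shows the
background merges doing their job.  Rough, untested sketch; the response
field names are from memory and the core name is made up:

import requests

CORE = "http://localhost:8983/solr/mycore"   # hypothetical core URL

resp = requests.get(CORE + "/admin/luke",
                    params={"numTerms": 0, "wt": "json"},
                    timeout=30)
resp.raise_for_status()
info = resp.json()["index"]
print("segments:", info.get("segmentCount"),
      "numDocs:", info.get("numDocs"),
      "deletedDocs:", info.get("deletedDocs"))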

We really need to rename “optimize” to “force merge”. Is there a Jira for that?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Jun 29, 2015, at 5:15 AM, Steven White <sw...@gmail.com> wrote:

> Hi Upayavira,
> 
> This is news to me that we should not optimize and index.
> 
> What about disk space saving, isn't optimization to reclaim disk space or
> is Solr somehow does that?  Where can I read more about this?
> 
> I'm on Solr 5.1.0 (may switch to 5.2.1)
> 
> Thanks
> 
> Steve
> 
> On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:
> 
>> I'm afraid I don't understand. You're saying that optimising is causing
>> performance issues?
>> 
>> Simple solution: DO NOT OPTIMIZE!
>> 
>> Optimisation is very badly named. What it does is squashes all segments
>> in your index into one segment, removing all deleted documents. It is
>> good to get rid of deletes - in that sense the index is "optimized".
>> However, future merges become very expensive. The best way to handle
>> this topic is to leave it to Lucene/Solr to do it for you. Pretend the
>> "optimize" option never existed.
>> 
>> This is, of course, assuming you are using something like Solr 3.5+.
>> 
>> Upayavira
>> 
>> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
>>> 
>>> Have to cause of performance issues.
>>> Just want to know if there is a way to tap into the status.
>>> 
>>>> On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
>>>> 
>>>> Bigger question, why are you optimizing? Since 3.6 or so, it generally
>>>> hasn't been requires, even, is a bad thing.
>>>> 
>>>> Upayavira
>>>> 
>>>>> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
>>>>> Hi All,
>>>>> 
>>>>> I have two indexers (Independent processes ) writing to a common solr
>>>>> core.
>>>>> If One indexer process issued an optimize on the core
>>>>> I want the second indexer to wait adding docs until the optimize has
>>>>> finished.
>>>>> 
>>>>> Are there ways I can do this programmatically?
>>>>> pinging the core when the optimize is happening is returning OK
>> because
>>>>> technically
>>>>> solr allows you to update when an optimize is happening.
>>>>> 
>>>>> any suggestions ?
>>>>> 
>>>>> thanks,
>>>>> Summer
>> 


Re: optimize status

Posted by Steven White <sw...@gmail.com>.
Hi Upayavira,

This is news to me that we should not optimize an index.

What about disk space savings?  Isn't optimization needed to reclaim disk
space, or does Solr somehow handle that on its own?  Where can I read more
about this?
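
If it helps to check the numbers directly, the core STATUS call exposes
the on-disk size and the deleted-document overhead, so it should be
possible to watch whether space comes back on its own.  Rough, untested
sketch (field names assumed from the STATUS response, core name made up):

import requests

SOLR = "http://localhost:8983/solr"
CORE = "mycore"   # hypothetical core name

resp = requests.get(SOLR + "/admin/cores",
                    params={"action": "STATUS", "core": CORE, "wt": "json"},
                    timeout=30)
resp.raise_for_status()
idx = resp.json()["status"][CORE]["index"]
deleted = idx.get("maxDoc", 0) - idx.get("numDocs", 0)
print("on-disk size:", idx.get("size"), "docs awaiting purge:", deleted)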

I'm on Solr 5.1.0 (may switch to 5.2.1)

Thanks

Steve

On Mon, Jun 29, 2015 at 4:16 AM, Upayavira <uv...@odoko.co.uk> wrote:

> I'm afraid I don't understand. You're saying that optimising is causing
> performance issues?
>
> Simple solution: DO NOT OPTIMIZE!
>
> Optimisation is very badly named. What it does is squashes all segments
> in your index into one segment, removing all deleted documents. It is
> good to get rid of deletes - in that sense the index is "optimized".
> However, future merges become very expensive. The best way to handle
> this topic is to leave it to Lucene/Solr to do it for you. Pretend the
> "optimize" option never existed.
>
> This is, of course, assuming you are using something like Solr 3.5+.
>
> Upayavira
>
> On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
> >
> > Have to cause of performance issues.
> > Just want to know if there is a way to tap into the status.
> >
> > > On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
> > >
> > > Bigger question, why are you optimizing? Since 3.6 or so, it generally
> > > hasn't been requires, even, is a bad thing.
> > >
> > > Upayavira
> > >
> > >> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
> > >> Hi All,
> > >>
> > >> I have two indexers (Independent processes ) writing to a common solr
> > >> core.
> > >> If One indexer process issued an optimize on the core
> > >> I want the second indexer to wait adding docs until the optimize has
> > >> finished.
> > >>
> > >> Are there ways I can do this programmatically?
> > >> pinging the core when the optimize is happening is returning OK
> because
> > >> technically
> > >> solr allows you to update when an optimize is happening.
> > >>
> > >> any suggestions ?
> > >>
> > >> thanks,
> > >> Summer
>

Re: optimize status

Posted by Upayavira <uv...@odoko.co.uk>.
I'm afraid I don't understand. You're saying that optimising is causing
performance issues?

Simple solution: DO NOT OPTIMIZE!

Optimisation is very badly named. What it does is squash all segments
in your index into one segment, removing all deleted documents. It is
good to get rid of deletes - in that sense the index is "optimized".
However, future merges become very expensive. The best way to handle
merging is to leave it to Lucene/Solr to do it for you. Pretend the
"optimize" option never existed.

This is, of course, assuming you are using something like Solr 3.5+.

Upayavira

On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:
> 
> Have to cause of performance issues. 
> Just want to know if there is a way to tap into the status. 
> 
> > On Jun 28, 2015, at 11:37 PM, Upayavira <uv...@odoko.co.uk> wrote:
> > 
> > Bigger question, why are you optimizing? Since 3.6 or so, it generally
> > hasn't been requires, even, is a bad thing.
> > 
> > Upayavira
> > 
> >> On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:
> >> Hi All,
> >> 
> >> I have two indexers (Independent processes ) writing to a common solr
> >> core.
> >> If One indexer process issued an optimize on the core 
> >> I want the second indexer to wait adding docs until the optimize has
> >> finished.
> >> 
> >> Are there ways I can do this programmatically?
> >> pinging the core when the optimize is happening is returning OK because
> >> technically
> >> solr allows you to update when an optimize is happening. 
> >> 
> >> any suggestions ?
> >> 
> >> thanks,
> >> Summer