Posted to solr-user@lucene.apache.org by Phillip Farber <pf...@umich.edu> on 2009/10/07 18:51:47 UTC

How much disk space does optimize really take

In a separate thread, I've detailed how an optimize is taking > 2x disk
space. We don't use solr distribution/snapshooter.  We are using the
default deletion policy, which keeps only one commit point. We can't
optimize a 192G index in 400GB of space.
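
For reference, the deletionPolicy block in our solrconfig.xml is
essentially the stock one - something like this, if I remember the
names right:

<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- keep only the single most recent commit point -->
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>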

This thread in lucene/java-user

http://www.gossamer-threads.com/lists/lucene/java-user/43475

suggests that an optimize should not take > 2x unless perhaps an
IndexReader is holding on to segments. This could be our problem, since
when optimization runs out of space, stopping tomcat makes a number of
files go away and recovers the space.

But we are not searching the index so how could a Searcher/IndexReader 
have any segments open?

I notice in the logs that as part of routine commits or as part of 
optimize a Searcher is registered and autowarmed from a previous 
searcher (of course there's nothing in the caches -- this is just a 
build machine).

INFO: registering core:
Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher Searcher@2e097617 main

Does this mean that there's always a Lucene IndexReader holding segment
files open so they can't be deleted during an optimize, so we run out of
disk space at > 2x?

Phil
hathitrust.org




Re: How much disk space does optimize really take

Posted by Mark Miller <ma...@gmail.com>.
I think that argument requires auto commit to be on and opening readers
after the optimize starts? Otherwise, the optimized version is not put
into place until a commit is called, and a Reader won't see the newly
merged segments until then - so the original index is kept around in
either case - having a Reader open on it shouldn't affect the space
requirements?

Yonik Seeley wrote:
> On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber <pf...@umich.edu> wrote:
>   
>> In a separate thread, I've detailed how an optimize is taking > 2x disk
>> space. We don't use solr distribution/snapshooter.  We are using the default
>> deletion policy = 1. We can't optimize a 192G index in 400GB of space.
>>
>> This thread in lucene/java-user
>>
>> http://www.gossamer-threads.com/lists/lucene/java-user/43475
>>
>> suggests that an optimize should not take > 2x unless perhaps an IndexReader
>> is holding on to segments. This could be our problem since when optimization
>> runs out of space, if we stop tomcat, a number of files go away and space is
>> recovered.
>>
>> But we are not searching the index so how could a Searcher/IndexReader have
>> any segments open?
>>
>> I notice in the logs that as part of routine commits or as part of optimize
>> a Searcher is registered and autowarmed from a previous searcher (of course
>> there's nothing in the caches -- this is just a build machine).
>>
>> INFO: registering core:
>> Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
>> INFO: [] Registered new searcher Searcher@2e097617 main
>>
>> Does this mean that there's always a Lucene IndexReader holding segment
>> files open so they can't be deleted during an optimize, so we run out of
>> disk space at > 2x?
>>     
>
> Yes.
> A feature could probably be developed now that avoids opening a
> reader until it's requested.
> That wasn't really possible in the past - due to many issues such as
> Lucene autocommit.
>
> -Yonik
> http://www.lucidimagination.com
>   


-- 
- Mark

http://www.lucidimagination.com




Re: How much disk space does optimize really take

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Oct 7, 2009 at 3:56 PM, Mark Miller <ma...@gmail.com> wrote:
> I guess you can't guarantee 2x though, because if you have queries coming in
> that take a while, a commit opening a new Reader will not guarantee the
> old Reader is quite ready to go away. Might want to wait a short bit
> after the commit.

Right - and in a complete system, there are other things that can also
hold commit points open longer, like index replication.

-Yonik
http://www.lucidimagination.com

Re: How much disk space does optimize really take

Posted by Mark Miller <ma...@gmail.com>.
Yonik Seeley wrote:
> On Wed, Oct 7, 2009 at 3:31 PM, Mark Miller <ma...@gmail.com> wrote:
>   
>> I can't tell why calling a commit or restarting is going to help
>> anything
>>     
>
> Depends on what scenarios you consider, and what you are taking 2x of.
>
> 1) Open reader on index
> 2) Open writer and add two documents... the first causes a large
> merge, and the second is just to make it a non-optimized index.
>   At this point you're already at 2x of your original index size.
> 3) call optimize()... this will make a 3rd copy before deleting the 2nd.
>
> -Yonik
> http://www.lucidimagination.com
>   
Yup - finally hit me what you were talking about. Wasn't considering the
case of adding docs to an existing index, not committing, and then
trying to optimize.

I like trying to take an opposing side from you anyway - it means I know
where I will end up - but you're usually so darn terse, I never know how
long till I end up there.

Anyway, so all you generally *need* is 2x, you just have to make sure
you're not adding docs first without committing them - which I was taking
for granted. But that means your comment about calling commit makes perfect sense.

I guess you can't guarantee 2x though, because if you have queries coming in
that take a while, a commit opening a new Reader will not guarantee the
old Reader is quite ready to go away. Might want to wait a short bit
after the commit.

-- 
- Mark

http://www.lucidimagination.com




Re: How much disk space does optimize really take

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Oct 7, 2009 at 3:31 PM, Mark Miller <ma...@gmail.com> wrote:
> I can't tell why calling a commit or restarting is going to help
> anything

Depends on what scenarios you consider, and what you are taking 2x of.

1) Open reader on index
2) Open writer and add two documents... the first causes a large
merge, and the second is just to make it a non-optimized index.
  At this point you're already at 2x of your original index size.
3) call optimize()... this will make a 3rd copy before deleting the 2nd.
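
In code, roughly - a hand-wavy Lucene 2.9-era sketch with a made-up
path, just to illustrate the sequence:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OptimizeDiskSketch {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File("/path/to/index")); // made-up path

    // 1) The reader pins the segment files of the current commit point.
    IndexReader reader = IndexReader.open(dir, true);

    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);

    // 2) Adds trigger a large merge; the merged segments coexist with the
    //    originals the reader still pins -> roughly 2x on disk already.
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);

    // 3) optimize() writes a third copy before the second can be freed,
    //    because the reader still holds the first -> transient ~3x.
    writer.optimize();
    writer.close();
    reader.close();
  }
}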

-Yonik
http://www.lucidimagination.com

Re: How much disk space does optimize really take

Posted by Mark Miller <ma...@gmail.com>.
Okay - I think I've got you - you're talking about the case of adding a
bunch of docs, not calling commit, and then trying to optimize. I keep
coming at it from a cold optimize. Making sense to me now.

Mark Miller wrote:
> I can't tell why calling a commit or restarting is going to help
> anything - or why you need more than 2x in any case. The only reason I
> can see for this is if you have turned on auto-commit. Otherwise the
> Reader is *always* only referencing what would have to be around anyway.
>
> You're likely just too close to the edge. There are fragmentation
> issues and whatnot when you're dealing with such large files and so little
> space above what you need.
>
> Phillip Farber wrote:
>   
>> Wow, this is weird.  I commit before I optimize.  In fact, I bounce
>> tomcat before I optimize just in case. It makes sense, as you say,
>> that then "the open reader can only be holding references to segments
>> that wouldn't be deleted until the optimize is complete anyway".
>>
>> But we're still exceeding 2x. And after the optimize fails, if we then
>> do a commit or bounce tomcat, a bunch of segments disappear. I am
>> stumped.
>>
>> Yonik Seeley wrote:
>>     
>>> On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber <pf...@umich.edu>
>>> wrote:
>>>       
>>>> So this implies that for a "normal" optimize, in every case, due to the
>>>> Searcher holding open the existing segments prior to the optimize, we'd
>>>> always need 3x even in the normal case.
>>>>
>>>> This seems wrong, since it is repeatedly stated that in the normal case
>>>> only 2x is needed, and I have successfully optimized a similar sized
>>>> 192G index on identical hardware with a 400G capacity.
>>>>         
>>> 2x for the IndexWriter only.
>>> Having an open index reader can increase that somewhat... 3x is the
>>> absolute worst case I think and that can currently be avoided by first
>>> calling commit and then calling optimize I think.  This way the open
>>> reader will only be holding references to segments that wouldn't be
>>> deleted until the optimize is complete anyway.
>>>
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>       
>
>
>   


-- 
- Mark

http://www.lucidimagination.com




Re: How much disk space does optimize really take

Posted by Mark Miller <ma...@gmail.com>.
I can't tell why calling a commit or restarting is going to help
anything - or why you need more than 2x in any case. The only reason I
can see for this is if you have turned on auto-commit. Otherwise the
Reader is *always* only referencing what would have to be around anyway.

You're likely just too close to the edge. There are fragmentation
issues and whatnot when you're dealing with such large files and so little
space above what you need.

Phillip Farber wrote:
> Wow, this is weird.  I commit before I optimize.  In fact, I bounce
> tomcat before I optimize just in case. It makes sense, as you say,
> that then "the open reader can only be holding references to segments
> that wouldn't be deleted until the optimize is complete anyway".
>
> But we're still exceeding 2x. And after the optimize fails, if we then
> do a commit or bounce tomcat, a bunch of segments disappear. I am
> stumped.
>
> Yonik Seeley wrote:
>> On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber <pf...@umich.edu>
>> wrote:
>>> So this implies that for a "normal" optimize, in every case, due to the
>>> Searcher holding open the existing segments prior to the optimize, we'd
>>> always need 3x even in the normal case.
>>>
>>> This seems wrong, since it is repeatedly stated that in the normal case
>>> only 2x is needed, and I have successfully optimized a similar sized
>>> 192G index on identical hardware with a 400G capacity.
>>
>> 2x for the IndexWriter only.
>> Having an open index reader can increase that somewhat... 3x is the
>> absolute worst case I think and that can currently be avoided by first
>> calling commit and then calling optimize I think.  This way the open
>> reader will only be holding references to segments that wouldn't be
>> deleted until the optimize is complete anyway.
>>
>>
>> -Yonik
>> http://www.lucidimagination.com


-- 
- Mark

http://www.lucidimagination.com




Re: How much disk space does optimize really take

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Oct 7, 2009 at 3:16 PM, Phillip Farber <pf...@umich.edu> wrote:
> Wow, this is weird.  I commit before I optimize.  In fact, I bounce tomcat
> before I optimize just in case. It makes sense, as you say, that then "the
> open reader can only be holding references to segments that wouldn't be
> deleted until the optimize is complete anyway".
>
> But we're still exceeding 2x.

How much over 2x?
It is possible (though relatively rare) for an optimized index to be
larger than a non-optimized index.

-Yonik
http://www.lucidimagination.com

Re: How much disk space does optimize really take

Posted by Lance Norskog <go...@gmail.com>.
Oops, sent before finished.  "Partial Optimize" aka "maxSegments" is a
recent Solr 1.4/Lucene 2.9 feature.
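
If I recall the syntax right, it's triggered with an update message
along these lines (the segment count here is made up):

<optimize maxSegments="10"/>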

As to 2x vs. 3x, the general wisdom is that an optimize on a "simple"
index takes at most 2x disk space, and on a "compound" index takes at
most 3x. "Simple" is the default (*). At Divvio we had the same
problem and it never took up more than 2x.

If your index disks are really bursting at the seams, you could try
creating an empty index on a separate disk and merging your large
index into that index. The resulting index will be "mostly optimized".
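
Roughly like this - a sketch against the Lucene 2.9 API, with made-up
paths, not something I've tested:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeToFreshIndex {
  public static void main(String[] args) throws Exception {
    Directory src = FSDirectory.open(new File("/big-disk/index"));   // made-up path
    Directory dst = FSDirectory.open(new File("/other-disk/index")); // empty target

    // create=true starts a fresh index on the other disk
    IndexWriter writer = new IndexWriter(dst,
        new StandardAnalyzer(Version.LUCENE_29),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    // merge the source segments in without optimizing first, so the
    // source disk never needs the 2x-3x headroom an in-place optimize does
    writer.addIndexesNoOptimize(new Directory[] { src });
    writer.close();
  }
}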

Lance Norskog

* in solrconfig.xml:
<useCompoundFile>false</useCompoundFile>

On 10/7/09, Phillip Farber <pf...@umich.edu> wrote:
> Wow, this is weird.  I commit before I optimize.  In fact, I bounce
> tomcat before I optimize just in case. It makes sense, as you say, that
> then "the open reader can only be holding references to segments that
> wouldn't be deleted until the optimize is complete anyway".
>
> But we're still exceeding 2x. And after the optimize fails, if we then
> do a commit or bounce tomcat, a bunch of segments disappear. I am stumped.
>
> Yonik Seeley wrote:
>> On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber <pf...@umich.edu> wrote:
>>> So this implies that for a "normal" optimize, in every case, due to the
>>> Searcher holding open the existing segments prior to the optimize, we'd
>>> always need 3x even in the normal case.
>>>
>>> This seems wrong, since it is repeatedly stated that in the normal case
>>> only 2x is needed, and I have successfully optimized a similar sized
>>> 192G index on identical hardware with a 400G capacity.
>>
>> 2x for the IndexWriter only.
>> Having an open index reader can increase that somewhat... 3x is the
>> absolute worst case I think and that can currently be avoided by first
>> calling commit and then calling optimize I think.  This way the open
>> reader will only be holding references to segments that wouldn't be
>> deleted until the optimize is complete anyway.
>>
>>
>> -Yonik
>> http://www.lucidimagination.com
>


-- 
Lance Norskog
goksron@gmail.com

Re: How much disk space does optimize really take

Posted by Phillip Farber <pf...@umich.edu>.
Wow, this is weird.  I commit before I optimize.  In fact, I bounce 
tomcat before I optimize just in case. It makes sense, as you say, that 
then "the open reader can only be holding references to segments that 
wouldn't be deleted until the optimize is complete anyway".

But we're still exceeding 2x. And after the optimize fails, if we then 
do a commit or bounce tomcat, a bunch of segments disappear. I am stumped.

Yonik Seeley wrote:
> On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber <pf...@umich.edu> wrote:
>> So this implies that for a "normal" optimize, in every case, due to the
>> Searcher holding open the existing segments prior to the optimize, we'd
>> always need 3x even in the normal case.
>>
>> This seems wrong, since it is repeatedly stated that in the normal case
>> only 2x is needed, and I have successfully optimized a similar sized 192G
>> index on identical hardware with a 400G capacity.
> 
> 2x for the IndexWriter only.
> Having an open index reader can increase that somewhat... 3x is the
> absolute worst case I think and that can currently be avoided by first
> calling commit and then calling optimize I think.  This way the open
> reader will only be holding references to segments that wouldn't be
> deleted until the optimize is complete anyway.
> 
> 
> -Yonik
> http://www.lucidimagination.com

Re: How much disk space does optimize really take

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber <pf...@umich.edu> wrote:
> So this implies that for a "normal" optimize, in every case, due to the
> Searcher holding open the existing segments prior to the optimize, we'd
> always need 3x even in the normal case.
>
> This seems wrong, since it is repeatedly stated that in the normal case
> only 2x is needed, and I have successfully optimized a similar sized 192G
> index on identical hardware with a 400G capacity.

2x for the IndexWriter only.
Having an open index reader can increase that somewhat... 3x is the
absolute worst case I think and that can currently be avoided by first
calling commit and then calling optimize I think.  This way the open
reader will only be holding references to segments that wouldn't be
deleted until the optimize is complete anyway.
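
From Solr that's just a commit followed by an optimize - e.g. with
SolrJ (a rough sketch; the URL is made up):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CommitThenOptimize {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // commit first: the new searcher now references the current commit
    // point, so the open reader pins nothing the optimize must keep anyway
    server.commit();
    server.optimize();
  }
}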


-Yonik
http://www.lucidimagination.com

Re: How much disk space does optimize really take

Posted by Phillip Farber <pf...@umich.edu>.

Yonik Seeley wrote:

>>
>> Does this mean that there's always a Lucene IndexReader holding segment
>> files open so they can't be deleted during an optimize, so we run out of
>> disk space at > 2x?
> 
> Yes.
> A feature could probably be developed now that avoids opening a
> reader until it's requested.
> That wasn't really possible in the past - due to many issues such as
> Lucene autocommit.
> 

So this implies that for a "normal" optimize, in every case, due to the
Searcher holding open the existing segments prior to the optimize, we'd
always need 3x even in the normal case.

This seems wrong, since it is repeatedly stated that in the normal case
only 2x is needed, and I have successfully optimized a similar sized 192G
index on identical hardware with a 400G capacity.

Yonik, I'm uncertain then about what you're saying about the disk space
required for optimize.  Could you clarify?



> -Yonik
> http://www.lucidimagination.com

Re: How much disk space does optimize really take

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Oct 7, 2009 at 1:34 PM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:
> On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> It would be good to be able to commit without opening a new
>> reader; however, with Lucene 2.9 the segment readers for all
>> available segments are already created and available via
>> getReader, which manages the reference counting internally.
>>
>> Using reopen redundantly creates SRs that are already held
>> internally in IW.
>>
>>
> Jason, I think this is something we should consider changing. A user who is
> not using NRT features should not pay the price of keeping readers open.
> We are also interested in opening a searcher just-in-time for SOLR-1293. We
> have use-cases where a SolrCore is loaded only for indexing and then
> unloaded.

This is already true today.

If you don't use NRT then the readers are not held open by Lucene.

Mike

Re: How much disk space does optimize really take

Posted by Jason Rutherglen <ja...@gmail.com>.
To be clear, the SRs created by merges don't have the term index
loaded, which is the main cost.  One would need to use
IndexReaderWarmer to load the term index before the new SR becomes a
part of SegmentInfos.
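
Something along these lines - a rough Lucene 2.9 sketch, where the
warm() body is only illustrative:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class WarmerSketch {
  static void install(IndexWriter writer) {
    writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
      public void warm(IndexReader reader) throws IOException {
        // touch the term index so the merged SegmentReader is warmed
        // before the writer publishes it via SegmentInfos / getReader
        reader.terms().next();
      }
    });
  }
}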

On Wed, Oct 7, 2009 at 10:34 AM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:
> On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> It would be good to be able to commit without opening a new
>> reader; however, with Lucene 2.9 the segment readers for all
>> available segments are already created and available via
>> getReader, which manages the reference counting internally.
>>
>> Using reopen redundantly creates SRs that are already held
>> internally in IW.
>>
>>
> Jason, I think this is something we should consider changing. A user who is
> not using NRT features should not pay the price of keeping readers open.
> We are also interested in opening a searcher just-in-time for SOLR-1293. We
> have use-cases where a SolrCore is loaded only for indexing and then
> unloaded.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: How much disk space does optimize really take

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> It would be good to be able to commit without opening a new
> reader; however, with Lucene 2.9 the segment readers for all
> available segments are already created and available via
> getReader, which manages the reference counting internally.
>
> Using reopen redundantly creates SRs that are already held
> internally in IW.
>
>
Jason, I think this is something we should consider changing. A user who is
not using NRT features should not pay the price of keeping readers open.
We are also interested in opening a searcher just-in-time for SOLR-1293. We
have use-cases where a SolrCore is loaded only for indexing and then
unloaded.

-- 
Regards,
Shalin Shekhar Mangar.

Re: How much disk space does optimize really take

Posted by Jason Rutherglen <ja...@gmail.com>.
It would be good to be able to commit without opening a new
reader; however, with Lucene 2.9 the segment readers for all
available segments are already created and available via
getReader, which manages the reference counting internally.

Using reopen redundantly creates SRs that are already held
internally in IW.
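
Roughly (a Lucene 2.9 sketch, untested):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class NrtReaderSketch {
  static IndexReader openNrt(IndexWriter writer) throws Exception {
    // getReader() reuses the SegmentReaders the writer already holds
    // (reference-counted internally), instead of opening fresh ones the
    // way IndexReader.open()/reopen() on the Directory would.
    return writer.getReader();
  }
}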

On Wed, Oct 7, 2009 at 9:59 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber <pf...@umich.edu> wrote:
>>
>> In a separate thread, I've detailed how an optimize is taking > 2x disk
>> space. We don't use solr distribution/snapshooter.  We are using the default
>> deletion policy = 1. We can't optimize a 192G index in 400GB of space.
>>
>> This thread in lucene/java-user
>>
>> http://www.gossamer-threads.com/lists/lucene/java-user/43475
>>
>> suggests that an optimize should not take > 2x unless perhaps an IndexReader
>> is holding on to segments. This could be our problem since when optimization
>> runs out of space, if we stop tomcat, a number of files go away and space is
>> recovered.
>>
>> But we are not searching the index so how could a Searcher/IndexReader have
>> any segments open?
>>
>> I notice in the logs that as part of routine commits or as part of optimize
>> a Searcher is registered and autowarmed from a previous searcher (of course
>> there's nothing in the caches -- this is just a build machine).
>>
>> INFO: registering core:
>> Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
>> INFO: [] Registered new searcher Searcher@2e097617 main
>>
>> Does this mean that there's always a Lucene IndexReader holding segment
>> files open so they can't be deleted during an optimize, so we run out of
>> disk space at > 2x?
>
> Yes.
> A feature could probably be developed now that avoids opening a
> reader until it's requested.
> That wasn't really possible in the past - due to many issues such as
> Lucene autocommit.
>
> -Yonik
> http://www.lucidimagination.com
>

Re: How much disk space does optimize really take

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber <pf...@umich.edu> wrote:
>
> In a separate thread, I've detailed how an optimize is taking > 2x disk
> space. We don't use solr distribution/snapshooter.  We are using the default
> deletion policy = 1. We can't optimize a 192G index in 400GB of space.
>
> This thread in lucene/java-user
>
> http://www.gossamer-threads.com/lists/lucene/java-user/43475
>
> suggests that an optimize should not take > 2x unless perhaps an IndexReader
> is holding on to segments. This could be our problem since when optimization
> runs out of space, if we stop tomcat, a number of files go away and space is
> recovered.
>
> But we are not searching the index so how could a Searcher/IndexReader have
> any segments open?
>
> I notice in the logs that as part of routine commits or as part of optimize
> a Searcher is registered and autowarmed from a previous searcher (of course
> there's nothing in the caches -- this is just a build machine).
>
> INFO: registering core:
> Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
> INFO: [] Registered new searcher Searcher@2e097617 main
>
> Does this mean that there's always a Lucene IndexReader holding segment
> files open so they can't be deleted during an optimize, so we run out of
> disk space at > 2x?

Yes.
A feature could probably be developed now that avoids opening a
reader until it's requested.
That wasn't really possible in the past - due to many issues such as
Lucene autocommit.

-Yonik
http://www.lucidimagination.com