Posted to solr-user@lucene.apache.org by Phillip Farber <pf...@umich.edu> on 2009/10/01 18:18:59 UTC

best way to get the size of an index

Resuming this discussion in a new thread to focus only on this question:

What is the best way to get the size of an index so it does not get too 
big to be optimized (or to allow a very large segment merge) given space 
limits?

I already have the largest 15,000rpm SCSI direct attached storage so 
buying storage is not an option.  I don't do deletes.

From what I've read, I expect no more than a 2x increase during
optimization and have not seen more in practice.

I'm thinking: stop indexing, commit, do a du.

Will this give me the number I need for what I'm trying to do? Is there 
a better way?

Phil
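
For anyone following along, the "stop indexing, commit, do a du" check can be scripted. Below is a minimal Python sketch, not anything shipped with Solr or Lucene. The function names and the headroom parameter are made up for illustration. It reads "2x" as peak disk use relative to the current index size, so the extra free space needed is (headroom - 1) times the index. It sums apparent file sizes, which can differ slightly from the block-based figure du prints.

```python
import os


def index_size_bytes(index_dir):
    """Sum the sizes of all regular files under index_dir (roughly `du -sb`)."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks so a linked file is not counted as index data.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def can_optimize(index_bytes, free_bytes, headroom=2.0):
    """True if free space covers the extra (headroom - 1) x index size.

    headroom=2.0 encodes the "no more than 2x" expectation from this thread
    (an assumption about peak usage, not a guarantee); pass 3.0 to plan for
    the more conservative figure.
    """
    return free_bytes >= (headroom - 1.0) * index_bytes
```

To use it, point index_size_bytes at the index directory and pass shutil.disk_usage(index_dir).free as free_bytes. Run it only after indexing has stopped and a commit has completed, as described above.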

Re: best way to get the size of an index

Posted by Mark Miller <ma...@gmail.com>.
Mark Miller wrote:
> Phillip Farber wrote:
>   
>> Resuming this discussion in a new thread to focus only on this question:
>>
>> What is the best way to get the size of an index so it does not get
>> too big to be optimized (or to allow a very large segment merge) given
>> space limits?
>>
>> I already have the largest 15,000rpm SCSI direct attached storage so
>> buying storage is not an option.  I don't do deletes.
>>     
> Even if you did do deletes, it's not really a 3x problem - that's just
> theory - you'd have to work to get there. Deletes are merged out as you
> index additional docs as segments are merged over time. The 3x scenario
> brought up is more of a fun mind exercise than anything that would
> realistically happen.
>   
>
And for completeness for those following along:

Let's say you did do some crazy deleting, and deleted half the docs in
your index. Those docs stay around, and the ids are just added to a list
that keeps those docs from being "seen". Later, as natural merging
occurs, or if you force merges with an optimize, those deleted docs will
be physically removed. Let's then say you managed to re-add all of
those docs without any merging occurring while adding them (say
you wanted to see this effect so badly that you wrote and put in a custom
merge policy that doesn't find any segments to merge). Even if you do
all that, before you do the optimize, you're going to look at the size of
your index and see it's n GB. That's your current index size. Now say you
kick off the optimize. It's not even going to take 2x that n size to
optimize - this is because all those deletes will be removed as the
index is optimized down to one segment. It's going to take <2x.

This delete thing, as I said, is more of a fun mental exercise. It has
little relation to how much space you need to optimize in comparison to
how big your index is before optimizing. And it's really worse than a
worst-case scenario unless you write a custom merge policy, or crank
some settings insanely high and have enough RAM to do it (all the indexing
would have to take place in one huge segment in RAM that would then get
flushed).
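
The arithmetic behind the "<2x" claim can be worked through with a toy model. The Python sketch below is an illustration only, under stated assumptions that are not Lucene's real accounting: every document takes the same space, deleted documents keep taking space until merged away, and the optimize writes one new segment containing only live documents before the old segments are dropped. The function name is hypothetical.

```python
def optimize_peak_ratio(live_docs, deleted_docs):
    """Peak disk use during optimize divided by the pre-optimize index size.

    Toy model: sizes are proportional to document counts, and the old
    segments stay on disk until the new single segment is fully written.
    """
    before = live_docs + deleted_docs   # what du reports before optimizing
    if before <= 0:
        raise ValueError("index must contain at least one document")
    new_segment = live_docs             # deleted docs are dropped by the merge
    return (before + new_segment) / before


# The scenario described above: N docs, half deleted, then re-added with no
# merging. That leaves N live docs plus N/2 deleted docs still on disk.
ratio = optimize_peak_ratio(live_docs=1000, deleted_docs=500)
```

With no deletes the model gives exactly 2x. With any deleted documents present, the new segment is smaller than the current index, so the ratio falls below 2x. For the half-deleted-and-re-added case it comes out to 5/3 of the n GB seen before optimizing.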


-- 
- Mark

http://www.lucidimagination.com




Re: best way to get the size of an index

Posted by Phillip Farber <pf...@umich.edu>.
Thanks, Mark. I really appreciate your confirmation.

Phil

Mark Miller wrote:
> Phillip Farber wrote:
>> Resuming this discussion in a new thread to focus only on this question:
>>
>> What is the best way to get the size of an index so it does not get
>> too big to be optimized (or to allow a very large segment merge) given
>> space limits?
>>
>> I already have the largest 15,000rpm SCSI direct attached storage so
>> buying storage is not an option.  I don't do deletes.
> Even if you did do deletes, it's not really a 3x problem - that's just
> theory - you'd have to work to get there. Deletes are merged out as you
> index additional docs as segments are merged over time. The 3x scenario
> brought up is more of a fun mind exercise than anything that would
> realistically happen.
>> From what I've read, I expect no more than a 2x increase during
>> optimization and have not seen more in practice.
>>
>> I'm thinking: stop indexing, commit, do a du.
>>
>> Will this give me the number I need for what I'm trying to do? Is
>> there a better way?
> Should work fine. When you do the commit, onCommit will be called on the
> IndexDeletionPolicy, and all of the "snapshots" of the index other than
> the latest one will be removed. You should have a clean index to gauge
> the size with. Using something like Java Replication complicates this
> though - in that case, older commit points can be reserved while they
> are being copied.
>> Phil
> 
> 

Re: best way to get the size of an index

Posted by Mark Miller <ma...@gmail.com>.
Phillip Farber wrote:
>
> Resuming this discussion in a new thread to focus only on this question:
>
> What is the best way to get the size of an index so it does not get
> too big to be optimized (or to allow a very large segment merge) given
> space limits?
>
> I already have the largest 15,000rpm SCSI direct attached storage so
> buying storage is not an option.  I don't do deletes.
Even if you did do deletes, it's not really a 3x problem - that's just
theory - you'd have to work to get there. Deletes are merged out as you
index additional docs as segments are merged over time. The 3x scenario
brought up is more of a fun mind exercise than anything that would
realistically happen.
>
> From what I've read, I expect no more than a 2x increase during
> optimization and have not seen more in practice.
>
> I'm thinking: stop indexing, commit, do a du.
>
> Will this give me the number I need for what I'm trying to do? Is
> there a better way?
Should work fine. When you do the commit, onCommit will be called on the
IndexDeletionPolicy, and all of the "snapshots" of the index other than
the latest one will be removed. You should have a clean index to gauge
the size with. Using something like Java Replication complicates this
though - in that case, older commit points can be reserved while they
are being copied.
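
One way to sanity-check that the index really is "clean" before running du is to count the commit points on disk. The Python sketch below assumes the standard Lucene on-disk layout, where each commit point is a file named segments_N (N in base 36) and segments.gen is a helper file rather than a commit point. The function names are hypothetical, and this only inspects file names. It does not talk to Lucene or Solr.

```python
import os
import re

# Matches commit point files such as "segments_2" or "segments_1a".
# "segments.gen" does not match because it has a dot, not an underscore.
_COMMIT_FILE = re.compile(r"^segments_[0-9a-z]+$")


def commit_points(index_dir):
    """Return the sorted names of commit point files found in index_dir."""
    return sorted(
        name for name in os.listdir(index_dir) if _COMMIT_FILE.match(name)
    )


def is_clean_for_sizing(index_dir):
    """True when exactly one commit point is on disk.

    More than one suggests older commit points are still being kept, for
    example while replication copies them, so du would overstate the size
    of the latest commit.
    """
    return len(commit_points(index_dir)) == 1
```

If is_clean_for_sizing returns False, waiting for replication to finish and committing again before measuring is probably the simplest fix.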
>
> Phil


-- 
- Mark

http://www.lucidimagination.com




Re: best way to get the size of an index

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 1, 2009, at 12:18 PM, Phillip Farber wrote:

>
> Resuming this discussion in a new thread to focus only on this  
> question:
>
> What is the best way to get the size of an index so it does not get  
> too big to be optimized (or to allow a very large segment merge)  
> given space limits?
>
> I already have the largest 15,000rpm SCSI direct attached storage so  
> buying storage is not an option.  I don't do deletes.
>
> From what I've read, I expect no more than a 2x increase during  
> optimization and have not seen more in practice.
>
> I'm thinking: stop indexing, commit, do a du.

That sounds reasonable, but as discussed on the other thread, I'd still
plan for a 3x increase, even if you aren't doing deletes, just to be on
the safe side.


I wonder if there is a way to report it back via Java/Lucene in a  
Request Handler or in the Luke Request Handler?  May be worth taking  
the time to add.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
