Posted to java-user@lucene.apache.org by v....@lombardodier.com on 2011/10/27 13:44:29 UTC

index bigger than it should be?

Hi,

I have an application whose index holds 30 million docs. Every
day I add around 1 million docs and remove the oldest 1 million, to
keep it stable at 30 million.
For the most part, doc fields are indexed and stored. Each doc weighs
from a few Kb to 1 Mb (a few Mb in some cases).
I used to be able to maintain the index at around 60 Gb on disk, but
recently it has tended to keep growing (now 90 Gb). I can see
that the expunge is doing what it should, because after it executes
the size on disk does go down, but never as low as the previous day. From
the outside it looks like a leak, but since I do not remove the docs I
added during the day, it might be that the new docs are just bigger than
the old ones. Still, I am surprised by the increase.

Are there any tools to dig into the index structure and help account for
the space taken on disk?
I was thinking of something that would help identify the terms that take up
the most space, or some sort of dump that I could compare from one day to
the next.

any help appreciated,

thanks,

vince



************************ DISCLAIMER ************************
This message is intended only for use by the person to
whom it is addressed. It may contain information that is
privileged and confidential. Its content does not
constitute a formal commitment by Lombard Odier
Darier Hentsch & Cie or any of its branches or affiliates.
If you are not the intended recipient of this message,
kindly notify the sender immediately and destroy this
message. Thank You.
*****************************************************************

Re: index bigger than it should be?

Posted by Ian Lea <ia...@gmail.com>.
Do the individual docs get bigger after 28 million?  Can you try
loading the last few million docs, from when the size jumps, and see
what happens?  Or load them in reverse order or something, again to
see what happens?
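
One way to check whether growth per document changes past a given point is to look at the difference between consecutive size samples rather than at the running total. The sketch below is plain Java with no Lucene dependency. The class name, the sample figures, and the 1.5x threshold are all hypothetical; the 240 Mb baseline simply restates the 2400 Mb/million docs figure from this thread for chunks of 100,000 docs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class ChunkGrowth {

    // Growth in Mb between consecutive samples of the on-disk index size.
    static long[] deltas(long[] sizeMb) {
        long[] d = new long[Math.max(0, sizeMb.length - 1)];
        for (int i = 1; i < sizeMb.length; i++) {
            d[i - 1] = sizeMb[i] - sizeMb[i - 1];
        }
        return d;
    }

    // Positions in deltas whose growth exceeds factor * baselineMb.
    static List<Integer> outliers(long[] deltas, long baselineMb, double factor) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < deltas.length; i++) {
            if (deltas[i] > baselineMb * factor) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up samples, one per 100,000 docs added; replace with real measurements.
        long[] sizeMb = {0, 240, 485, 720, 1170, 1620};
        long[] d = deltas(sizeMb);
        // 240 Mb per 100,000 docs corresponds to 2400 Mb per million docs.
        for (int i : outliers(d, 240, 1.5)) {
            System.out.println("chunk " + (i + 1) + " grew by " + d[i] + " Mb");
        }
    }
}
```

Sizes sampled while indexing include segments that are still waiting to be merged away, so a single large delta may just be merge overhead. A sustained run of large deltas is more suggestive of bigger documents.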

I don't have indexes with that many docs, but I believe that plenty of
people do.


--
Ian.


On Sun, Oct 30, 2011 at 9:01 AM,  <v....@lombardodier.com> wrote:
> Hi,
>
> I did the following on the existing index:
>  - expunge deletes
>  - optimize(5)
>  - check index
>
> Then, from the existing index, I exported all docs into a new one, and on
> the new one I did:
>  - optimize(5)
>  - check index
>
> The entire log is in http://dl.dropbox.com/u/47469698/lucene/index.txt
>
> During the export, I also monitored the size on disk at each chunk of
> 100,000 docs added to the new index:
> http://dl.dropbox.com/u/47469698/lucene/index.xls
>
> What I found was that the index took around 2400 Mb/million docs
> almost all the time, and occasionally a little more (<3500) for a
> short period. This holds until around 28 million docs, where the size
> on disk increases a lot (4500 Mb/million docs = 135 Gb on disk) and
> stays there until the end of the export (my index contains 32 million
> docs). At the end, the space on disk went from 134 Gb to 91 Gb
> thanks to the optimize. But even at 91 Gb for 32 million docs, it is
> still around 3000 Mb/million docs, far more than the 2400 I was seeing
> most of the time.
>
> I understand that merges happen. What surprised me was that the
> growth between 28 and 32 million was a lot bigger in scale than with
> the earlier merges, and that even an optimize did not fully resolve it.
> Did I reach a limit? Should I keep the index at 25 million docs to avoid
> this behavior?
>
> I am using Lucene 3.4 with the tiered merge policy, and all the fields are
> stored.
>
> thanks,
>
>
> Vincent Sevel

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: index bigger than it should be?

Posted by v....@lombardodier.com.
Hi,

I did the following on the existing index:
 - expunge deletes
 - optimize(5)
 - check index

Then, from the existing index, I exported all docs into a new one, and on
the new one I did:
 - optimize(5)
 - check index

The entire log is in http://dl.dropbox.com/u/47469698/lucene/index.txt

During the export, I also monitored the size on disk at each chunk of
100,000 docs added to the new index:
http://dl.dropbox.com/u/47469698/lucene/index.xls

What I found was that the index took around 2400 Mb/million docs
almost all the time, and occasionally a little more (<3500) for a
short period. This holds until around 28 million docs, where the size
on disk increases a lot (4500 Mb/million docs = 135 Gb on disk) and
stays there until the end of the export (my index contains 32 million
docs). At the end, the space on disk went from 134 Gb to 91 Gb
thanks to the optimize. But even at 91 Gb for 32 million docs, it is
still around 3000 Mb/million docs, far more than the 2400 I was seeing
most of the time.
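
The per-million figures are simple ratios, and a quick calculation makes them easier to compare. This sketch only restates numbers quoted in this message. The class and method names are made up, and it assumes 1 Gb = 1000 Mb (with 1024 the results come out slightly higher).

```java
// Hypothetical helper for the size-per-million-docs arithmetic.
class IndexDensity {

    // Average index size in Mb per million documents; 1 Gb is taken as 1000 Mb.
    static double mbPerMillionDocs(double sizeGb, double docsMillions) {
        return sizeGb * 1000.0 / docsMillions;
    }

    // Size in Gb that a given density (Mb per million docs) predicts.
    static double expectedGb(double mbPerMillion, double docsMillions) {
        return mbPerMillion * docsMillions / 1000.0;
    }

    public static void main(String[] args) {
        // Figures from the export: 32 million docs, 134 Gb before and 91 Gb after optimize.
        System.out.printf("before optimize: %.0f Mb/million docs%n", mbPerMillionDocs(134, 32));
        System.out.printf("after optimize:  %.0f Mb/million docs%n", mbPerMillionDocs(91, 32));
        // What the usual 2400 Mb/million docs rate would predict for 32 million docs.
        System.out.printf("expected at 2400: %.1f Gb%n", expectedGb(2400, 32));
    }
}
```

At the usual 2400 Mb/million docs, 32 million docs would take about 77 Gb, so the optimized 91 Gb index is roughly 14 Gb larger than that rate predicts.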

I understand that merges happen. What surprised me was that the
growth between 28 and 32 million was a lot bigger in scale than with
the earlier merges, and that even an optimize did not fully resolve it.
Did I reach a limit? Should I keep the index at 25 million docs to avoid
this behavior?

I am using Lucene 3.4 with the tiered merge policy, and all the fields are
stored.

thanks,


Vincent Sevel

Ian Lea <ia...@gmail.com> wrote on 27.10.2011 15:28:

There's org.apache.lucene.index.CheckIndex which will report assorted
stats about the index, as well as checking it for correctness.  It can
fix it too but you don't need that.  I hope. Will take quite a while
to run on a large index.

What version of lucene?  Does a before/after (or large/small)
directory listing give any clues?


--
Ian.



Re: index bigger than it should be?

Posted by Ian Lea <ia...@gmail.com>.
There's org.apache.lucene.index.CheckIndex which will report assorted
stats about the index, as well as checking it for correctness.  It can
fix it too but you don't need that.  I hope. Will take quite a while
to run on a large index.

What version of lucene?  Does a before/after (or large/small)
directory listing give any clues?
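
A directory listing is easier to compare when sizes are grouped by file extension, because in Lucene 3.x the extension indicates what a file holds (for example .fdt and .fdx for stored fields, .tis and .tii for the term dictionary, .frq and .prx for postings). The following is a minimal sketch using only the JDK. The class name is hypothetical, and it assumes the index lives in a plain filesystem directory. If segments are written as compound files (.cfs), the per-type breakdown is hidden inside them.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: sums file sizes per extension.
class IndexSizeByExtension {

    // Text after the last '.', or "(none)" for names such as segments_2.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "(none)";
        }
        return fileName.substring(dot + 1);
    }

    // Total size in bytes of the regular files in indexDir, keyed by extension.
    static Map<String, Long> sizeByExtension(Path indexDir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    String ext = extensionOf(file.getFileName().toString());
                    totals.merge(ext, Files.size(file), Long::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // The index path is taken from the first argument; "." is only a fallback.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Long> e : sizeByExtension(indexDir).entrySet()) {
            System.out.printf("%-8s %,15d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

Running it against the index directory on two different days and comparing the output should show whether the growth is mostly in stored fields or in the postings.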


--
Ian.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org