Posted to java-user@lucene.apache.org by v....@lombardodier.com on 2011/10/27 13:44:29 UTC
index bigger than it should be?
Hi,
I have an application that has an index with 30 million docs in it. Every
day, I add around 1 million docs and remove the oldest 1 million, to
keep it stable at 30 million.
For the most part, doc fields are indexed and stored. Each doc weighs
from a few KB to 1 MB (a few MB in some cases).
I used to be able to maintain the index at around 60 GB on disk, but
recently the index has had a tendency to keep growing (90 GB). I can see
that the expunge is doing what it should, because after it executes
the size on disk does go down, but never as low as the previous day. From
the outside it looks like a leak, but since I do not remove the docs I
added during the day, it might be that the new docs are just bigger than
the old ones. Still, I am surprised by the increase.
Are there any tools to dig into the index structure and help account for
the space taken on disk?
I was thinking about something that would help identify the terms that take
up the most space, or some sort of dump that I could compare from one day
to the next.
any help appreciated,
thanks,
vince
************************ DISCLAIMER ************************
This message is intended only for use by the person to
whom it is addressed. It may contain information that is
privileged and confidential. Its content does not
constitute a formal commitment by Lombard Odier
Darier Hentsch & Cie or any of its branches or affiliates.
If you are not the intended recipient of this message,
kindly notify the sender immediately and destroy this
message. Thank You.
*****************************************************************
Re: index bigger than it should be?
Posted by Ian Lea <ia...@gmail.com>.
Do the individual docs get bigger after 28 million? Can you try
loading the last few million docs, from when the size jumps, and see
what happens? Or load them in reverse order or something, again to
see what happens?
I don't have indexes with that many docs, but I believe that plenty of
people do.
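
One way to see where the growth sets in, assuming the on-disk size is sampled after each fixed-size chunk of docs (as in the export described in the quoted message): a minimal plain-Java sketch that flags chunks whose growth exceeds a threshold. The class name, method name, threshold and sample figures are all made up for illustration; it does not use any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkGrowth {
    /**
     * Given cumulative on-disk sizes (in MB) sampled after each fixed-size
     * chunk of documents, returns the positions of the samples whose growth
     * over the previous sample exceeds thresholdMb.
     */
    static List<Integer> chunksAbove(long[] cumulativeMb, long thresholdMb) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 1; i < cumulativeMb.length; i++) {
            long growth = cumulativeMb[i] - cumulativeMb[i - 1];
            if (growth > thresholdMb) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical samples: size after 0, 1, 2, 3 and 4 chunks of docs.
        long[] samples = {0, 240, 480, 930, 1170};
        System.out.println("chunks growing by more than 300 MB: "
                + chunksAbove(samples, 300));
    }
}
```

Sizes sampled while indexing also include temporary merge space, so a single spike is not conclusive. A run of consecutive flagged chunks would be more suggestive of larger docs.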
--
Ian.
On Sun, Oct 30, 2011 at 9:01 AM, <v....@lombardodier.com> wrote:
> Hi,
>
> I did the following on the existing index:
> - expunge deletes
> - optimize(5)
> - check index
>
> Then, from the existing index, I exported all docs into a new one, and
> on the new one I did:
> - optimize(5)
> - check index
>
> The entire log is at http://dl.dropbox.com/u/47469698/lucene/index.txt
>
> During the export, I also monitored the size on disk after each chunk of
> 100000 docs added to the new index:
> http://dl.dropbox.com/u/47469698/lucene/index.xls
>
> What I found was that the index was taking around 2400 MB per million
> docs almost all the time, and from time to time a little more (<3500)
> for a short period. This holds until around 28 million docs, where the
> size on disk increases a lot (4500 MB per million docs = 135 GB on disk)
> until the end of the export (my index contains 32 million docs). At the
> end, the optimize brought the space on disk down from 134 GB to 91 GB.
> But even at 91 GB for 32 million docs, it is still about 3000 MB per
> million docs, far more than the 2400 I was seeing most of the time.
>
> I understand that merges happen. What surprised me was that the growth
> between 28 and 32 million docs was a lot bigger in scale than during the
> earlier merges, and even an optimize did not solve this entirely.
> Did I reach a limit? Should I maintain the index at 25 million docs to
> avoid this behavior?
>
> I am using Lucene 3.4 with the tiered merge policy, and all the fields
> are stored.
>
> thanks,
>
>
> Vincent Sevel
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: index bigger than it should be?
Posted by v....@lombardodier.com.
Hi,
I did the following on the existing index:
- expunge deletes
- optimize(5)
- check index
Then, from the existing index, I exported all docs into a new one, and on
the new one I did:
- optimize(5)
- check index
The entire log is at http://dl.dropbox.com/u/47469698/lucene/index.txt
During the export, I also monitored the size on disk after each chunk of
100000 docs added to the new index:
http://dl.dropbox.com/u/47469698/lucene/index.xls
What I found was that the index was taking around 2400 MB per million docs
almost all the time, and from time to time a little more (<3500) for a
short period. This holds until around 28 million docs, where the size on
disk increases a lot (4500 MB per million docs = 135 GB on disk) until the
end of the export (my index contains 32 million docs). At the end, the
optimize brought the space on disk down from 134 GB to 91 GB. But even at
91 GB for 32 million docs, it is still about 3000 MB per million docs, far
more than the 2400 I was seeing most of the time.
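
For reference, the arithmetic behind these figures can be checked with a small plain-Java sketch. It assumes 1 GB = 1000 MB, and the class and method names are made up for illustration. Under that assumption, 91 GB over 32 million docs works out to roughly 2840 MB per million docs, while 32 million docs at 2400 MB per million would be 76.8 GB.

```java
final class IndexDensity {
    /** Index size per million documents, in MB. */
    static double mbPerMillionDocs(double sizeMb, long docCount) {
        return sizeMb / (docCount / 1_000_000.0);
    }

    /** Size in GB of an index holding docCount docs at the given density (1 GB = 1000 MB assumed). */
    static double gbAtDensity(double mbPerMillion, long docCount) {
        return mbPerMillion * (docCount / 1_000_000.0) / 1000.0;
    }

    public static void main(String[] args) {
        // Figures taken from the message: 91 GB after optimize, 32 million docs.
        System.out.printf("after optimize: %.0f MB per million docs%n",
                mbPerMillionDocs(91_000, 32_000_000L));
        // Size the index would have if every doc followed the 2400 MB/million baseline.
        System.out.printf("at baseline density: %.1f GB%n",
                gbAtDensity(2400, 32_000_000L));
    }
}
```

The gap between the two results, about 14 GB, is the part of the optimized index that the 2400 MB baseline does not account for.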
I understand that merges happen. What surprised me was that the growth
between 28 and 32 million docs was a lot bigger in scale than during the
earlier merges, and even an optimize did not solve this entirely.
Did I reach a limit? Should I maintain the index at 25 million docs to
avoid this behavior?
I am using Lucene 3.4 with the tiered merge policy, and all the fields are
stored.
thanks,
Vincent Sevel
On 27.10.2011 15:28, Ian Lea <ia...@gmail.com> wrote:
> There's org.apache.lucene.index.CheckIndex which will report assorted
> stats about the index, as well as checking it for correctness. It can
> fix it too but you don't need that. I hope. Will take quite a while
> to run on a large index.
>
> What version of lucene? Does a before/after (or large/small)
> directory listing give any clues?
>
> --
> Ian.
Re: index bigger than it should be?
Posted by Ian Lea <ia...@gmail.com>.
There's org.apache.lucene.index.CheckIndex which will report assorted
stats about the index, as well as checking it for correctness. It can
fix it too but you don't need that. I hope. Will take quite a while
to run on a large index.
What version of lucene? Does a before/after (or large/small)
directory listing give any clues?
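
For the directory listing comparison, a minimal sketch in plain Java (no Lucene classes; the class and method names are made up for illustration) that totals file sizes per extension so two listings can be compared side by side. In Lucene 3.x, stored fields live in .fdt/.fdx, the term dictionary in .tis/.tii, and postings in .frq/.prx. If the index uses compound files, most of the space will show up under .cfs and the breakdown will be less informative.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

final class IndexDirListing {
    /** Sums the size in bytes of the regular files in dir, grouped by file extension. */
    static Map<String, Long> bytesByExtension(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Files without an extension (e.g. segments_N) are grouped together.
                String ext = dot < 0 ? "(none)" : name.substring(dot + 1);
                totals.merge(ext, Files.size(file), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // First argument: path to the index directory (defaults to the current directory).
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Long> e : bytesByExtension(dir).entrySet()) {
            System.out.printf("%-8s %,15d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

Running it against the index on two different days and comparing the per-extension totals should show whether the growth is mostly in stored fields or in the postings.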
--
Ian.