Posted to java-user@lucene.apache.org by Harini Raghavan <ha...@insideview.com> on 2006/12/21 17:15:11 UTC

Merge Index Filling up Disk Space

Hi All,

I am using Lucene 1.9.1 for search functionality in my J2EE application, 
using JBoss as the app server. The Lucene index directory is almost 20G 
right now. A Quartz job adds data to the index every minute, and around 
20,000 documents get added every day. When documents are added and the 
segments are merged, the index size increases and sometimes grows to 
more than double its original size. This fills up the disk. We have 
allotted a f/s (file system) size of 50G, and even that is not 
sufficient at times. Is there an optimum value for the f/s size to 
allot in such a scenario?
Any suggestions would be appreciated.
 
Thanks,
Harini

-- 
Harini Raghavan
Software Engineer
Office : +91-40-23556255
harini.raghavan@insideview.com
we think, you sell
www.InsideView.com
InsideView 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Modelling Relational Lucene Index

Posted by Erick Erickson <er...@gmail.com>.
One other note. If you do NOT store the article text, you can still search
it, but your index will be MUCH smaller without the stored text data. This
requires that you have access to the actual text somewhere in order to be
able to return it to the user, but it's a possibility. The scenario runs
something like this....

Index the text of the article WITHOUT storing it for each company. Have the
actual text out on disk someplace so you can fetch it to the user for
display.
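A minimal sketch of the unstored-text approach, assuming the Lucene 1.9-era Field API (the path, field names, and sample text are made up for illustration; it needs lucene-core 1.9.1 on the classpath):

```java
// assumes lucene-core 1.9.1 on the classpath
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexUnstored {
    public static void main(String[] args) throws Exception {
        // Create/overwrite an index; the path is hypothetical.
        IndexWriter writer = new IndexWriter("/data/article-index",
                new StandardAnalyzer(), true);

        Document doc = new Document();
        // Searchable but NOT stored: the tokens go into the inverted
        // index, but the raw text is not kept in the index files.
        doc.add(new Field("text", "Quarterly results for Acme Corp ...",
                Field.Store.NO, Field.Index.TOKENIZED));
        // Store only a small key used to fetch the full text from
        // disk or a database at display time.
        doc.add(new Field("artid", "12345",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}
```

At search time you get the `artid` back from the hit and fetch the article body from wherever it lives outside the index.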

Another alternative is to store the text in the index once (without
indexing) and index it (but not store it) for each company. Then you could
fetch the text out of the index when you needed but not pay the penalty for
storing the text for each company and wouldn't have to try coordinating the
disk storage, which can be a pain.

But I wouldn't try any of that until I generated some numbers. Take some
representative articles and index them, say, 10 different times, stored and
unstored, and see what the index size is. Then do it again storing them, say,
100 times and look at the difference. You simply can't make architectural
decisions like this without some data to back them up. What is the eventual
number of articles you intend to index? How many times do you expect each
article to be indexed (that is, how many different companies do you expect
to associate with each article)? Why do you fear doing two searches? Do you
have any evidence at all that this will be unacceptably slow? The reason
I keep asking these questions whenever someone starts talking about
efficiency is that I've spent far too much time making things complicated
for the sake of efficiency that, in the end, was wasted effort, meant that
the program was delivered later than it should have been, and had more bugs
in it than it needed to.
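The "generate some numbers" experiment could be sketched like this, assuming the Lucene 1.9-era API (paths and the sample text are made up; the index size is measured by summing the lengths of the files in the index directory):

```java
// assumes lucene-core 1.9.1 on the classpath
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class SizeExperiment {
    // Index `copies` copies of `text`, optimize, and return the total
    // on-disk size of the resulting index in bytes.
    static long indexSize(String path, String text, int copies,
                          Field.Store store) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        for (int i = 0; i < copies; i++) {
            Document doc = new Document();
            doc.add(new Field("text", text, store, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();

        // Sum the sizes of all files in the index directory.
        FSDirectory dir = FSDirectory.getDirectory(path, false);
        long total = 0;
        String[] files = dir.list();
        for (int i = 0; i < files.length; i++) {
            total += dir.fileLength(files[i]);
        }
        dir.close();
        return total;
    }

    public static void main(String[] args) throws Exception {
        String article = "...a representative article body...";
        long stored = indexSize("/tmp/idx-stored", article, 100, Field.Store.YES);
        long unstored = indexSize("/tmp/idx-unstored", article, 100, Field.Store.NO);
        System.out.println("stored=" + stored + " unstored=" + unstored);
    }
}
```

Comparing the two numbers for 10 vs. 100 copies shows how much of the growth comes from the stored text versus the inverted index.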

Best
Erick

On 12/27/06, Harini Raghavan <ha...@insideview.com> wrote:
>
> Hi Erick,
>
> Thank you for the detailed response.
>
> First I would like to mention that my application has an index with
> company id & name indexed for each article, for the following reasons:
> 1. A search interface where we search across articles and companies.
> 2. Paging - I need to page the results after loading the hits, so I
> don't want to separate the text search and the article-company
> matching logic. I want to load the articles using one single Lucene query.
>
> However I will try to evaluate your suggestions in detail.
>
> Thank you again,
> Harini
>

Modelling Relational Lucene Index

Posted by Harini Raghavan <ha...@insideview.com>.
Hi Erick,

Thank you for the detailed response.

First I would like to mention that my application has an index with 
company id & name indexed for each article, for the following reasons:
1. A search interface where we search across articles and companies.
2. Paging - I need to page the results after loading the hits, so I 
don't want to separate the text search and the article-company 
matching logic. I want to load the articles using one single Lucene query.

I am using a MySQL database to store the relations. But since I need to 
search across companies & keywords in articles, I am also storing the 
company name and id in the index. Option 3 looks good to me. But I 
am concerned about degrading the performance of the existing system if I 
make the search a two-step process.

However I will try to evaluate your suggestions in detail.

Thank you again,
Harini






Re: Merge Index Filling up Disk Space

Posted by Erick Erickson <er...@gmail.com>.
First, it probably would have been a good thing to start a new thread on
this topic, since it's only vaguely related to disk space <G>...

That said, sure. Note that there's no requirement in Lucene that all
documents in an index have the same fields. Also, there's no reason you
can't use two separate indexes. Finally, you have to think about how many
times you are going to add or update a given article when choosing your
approach. Here are several possibilities.

1> Add a field (tokenized) to each article in your index that contains IDs
of the companies you want to associate with that article. The downside here
is that you need to delete and re-add the document every time you want to
add a company to that article.

2> Create a separate index that contains that relationship.

3> Have two kinds of documents in your index: one that indexes articles and
one that relates those to companies. Something like this:

Articles are indexed with "text" and "artid" fields. (NOTE: artid is NOT the
Lucene document ID, those change)
Relations are indexed with "id" and "company id" fields.

id and artid are your relationship. You *don't* want to name the field the
same for both kinds of documents since they would be indexed together.

Now, given a search over some text, you get back a bunch of article IDs. You
then search on the id field of the relations documents to extract company id
fields.
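A minimal sketch of that two-step lookup, assuming the Lucene 1.9-era `Hits` API (the path and query string are made up, and "companyid" stands in for the "company id" field name above, since a field name without a space is easier to query):

```java
// assumes lucene-core 1.9.1 on the classpath
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class TwoStepSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/data/article-index");

        // Step 1: full-text search over the article documents.
        Hits articles = searcher.search(
                new QueryParser("text", new StandardAnalyzer())
                        .parse("acquisition"));

        for (int i = 0; i < articles.length(); i++) {
            String artid = articles.doc(i).get("artid");
            // Step 2: find relation documents whose "id" matches artid.
            Hits relations = searcher.search(
                    new TermQuery(new Term("id", artid)));
            for (int j = 0; j < relations.length(); j++) {
                String companyId = relations.doc(j).get("companyid");
                // ... collect companyId for display/paging ...
            }
        }
        searcher.close();
    }
}
```

Because the article documents have no "id" field and the relation documents have no "text" field, the two searches never mix the two kinds of documents.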

You may be able to do some interesting things with termdocs/termenums to
make this efficient, but don't go there unless you need to.

At this point, though, I've got to ask if you have access to a database in
your application. If you do, why not store the relations there? Lucene is a
text-search engine, not a relational database. This kind of relation may be
perfectly valid to implement in Lucene, but you want to be careful if you
find yourself trying to do any more RDBMS-like things.

Best
Erick

On 12/26/06, Harini Raghavan <ha...@insideview.com> wrote:
>
> Hi,
>
> I have another related problem. I am adding news articles for a company
> to the Lucene index. As of now, if an article is mapped to more than
> one company, it is added that many times to the index. As the number of
> companies mapped to each article increases, this will not be a scalable
> implementation, as documents will be duplicated in the index. Is there a
> way to model the Lucene index in a relational way, such that the articles
> can be stored in an index and the article-company mapping can be modelled
> separately?
>
> Thanks,
> Harini
>

Re: Merge Index Filling up Disk Space

Posted by Harini Raghavan <ha...@insideview.com>.
Hi,

I have another related problem. I am adding news articles for a company 
to the Lucene index. As of now, if an article is mapped to more than 
one company, it is added that many times to the index. As the number of 
companies mapped to each article increases, this will not be a scalable 
implementation, as documents will be duplicated in the index. Is there a 
way to model the Lucene index in a relational way, such that the articles 
can be stored in an index and the article-company mapping can be modelled 
separately?

Thanks,
Harini

Mark Miller wrote:

> A Searcher uses a Reader to read the index for searching.
>
> - Mark
>





Re: Merge Index Filling up Disk Space

Posted by Mark Miller <ma...@gmail.com>.
A Searcher uses a Reader to read the index for searching.

- Mark
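That relationship can be made explicit with the Lucene 1.9-era constructors (the path is made up for illustration; needs lucene-core 1.9.1 on the classpath):

```java
// assumes lucene-core 1.9.1 on the classpath
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherIsReader {
    public static void main(String[] args) throws Exception {
        // These two forms are equivalent: the Searcher opens (and holds)
        // an IndexReader under the covers.
        IndexSearcher s1 = new IndexSearcher("/data/article-index");

        IndexReader reader = IndexReader.open("/data/article-index");
        IndexSearcher s2 = new IndexSearcher(reader);

        // While either stays open, the files of already-merged segments
        // remain allocated on disk ("delete on last close" on Unix), so
        // close searchers/readers and re-open them after an optimize.
        s1.close();
        s2.close();
        reader.close();
    }
}
```

So "searching while optimizing" means there is an open reader against the index, even if the application never constructs an IndexReader directly.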

Harini Raghavan wrote:
> Hi Mike,
>
> Thank you for the response. I don't have readers open on the index, 
> but while the optimize/merge was running I was searching on the index. 
> Would that make any difference?
> Also, after optimizing the index I had some .tmp files which were > 10G 
> and did not get merged. Could that also be related to having searchers 
> open while running optimize?
>
> -Harini
>



Re: Merge Index Filling up Disk Space

Posted by Michael McCandless <lu...@mikemccandless.com>.
Harini Raghavan wrote:
> Yes, I think I hit an IOException. I assumed that the .tmp files are not 
> required and deleted them manually from the index directory, as they were 
> more than 10G. Is that ok?

Yes, they are indeed not necessary so deleting them is fine.  This
(deleting partially created files on an Exception) is what the trunk
version of Lucene now does if it hits a disk full exception while
merging.

So I think everything here is explained now?  I will update the
javadocs with that additional caveat about open readers against the
index while optimizing.
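That manual cleanup could be scripted with nothing but the JDK; a sketch, with a made-up index path (only safe to run when no IndexWriter is currently merging into the directory):

```java
import java.io.File;

public class TmpCleanup {
    // Delete leftover *.tmp files from an index directory. Returns the
    // number of files removed; 0 if the directory does not exist. Only
    // safe when no IndexWriter is merging into this directory.
    static int deleteTmpFiles(File indexDir) {
        int deleted = 0;
        File[] files = indexDir.listFiles();
        if (files == null) return 0;
        for (File f : files) {
            if (f.isFile() && f.getName().endsWith(".tmp") && f.delete()) {
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        int n = deleteTmpFiles(new File("/data/article-index"));
        System.out.println("removed " + n + " stray .tmp files");
    }
}
```

The filter deliberately touches only `.tmp` files, so live segment files (`.cfs`, `segments`, etc.) are left alone.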


Mike



Re: Merge Index Filling up Disk Space

Posted by Harini Raghavan <ha...@insideview.com>.
Yes, I think I hit an IOException. I assumed that the .tmp files are not 
required and deleted them manually from the index directory, as they were 
more than 10G. Is that ok?






Re: Merge Index Filling up Disk Space

Posted by Michael McCandless <lu...@mikemccandless.com>.
Harini Raghavan wrote:

> Thank you for the response. I don't have readers open on the index, but 
> while the optimize/merge was running I was searching on the index. Would 
> that make any difference?

You're welcome!  Right, a searcher opens an IndexReader.  So this
means you should see peak @ 3X the starting index size (assuming you
don't re-open some readers during the optimize in which case it could
be > 3X).

> Also, after optimizing the index I had some .tmp files which were > 10G
> and did not get merged. Could that also be related to having searchers
> open while running optimize?

If you hit a disk full, then, I would expect unused .tmp files to be
left around.  Did you actually hit an IOException due to disk full,
and, is the timestamp of these .tmp files around the time it occurred?

Note that the trunk version of Lucene has been fixed to not leave such
unreferenced files even on IOException, but that's not yet released.

Mike



Re: Merge Index Filling up Disk Space

Posted by Harini Raghavan <ha...@insideview.com>.
Hi Mike,

Thank you for the response. I don't have readers open on the index, but 
while the optimize/merge was running I was searching on the index. Would 
that make any difference?
Also, after optimizing the index I had some .tmp files which were > 10G 
and did not get merged. Could that also be related to having searchers 
open while running optimize?

-Harini

Michael McCandless wrote:

> Harini Raghavan wrote:
>
>> I am using Lucene 1.9.1 for search functionality in my J2EE 
>> application, using JBoss as the app server. The Lucene index directory 
>> is almost 20G right now. A Quartz job adds data to the index every 
>> minute, and around 20,000 documents get added every day. When documents 
>> are added and the segments are merged, the index size increases and 
>> sometimes grows to more than double its original size. This fills up 
>> the disk. We have allotted a f/s (file system) size of 50G, and even 
>> that is not sufficient at times. Is there an optimum value for the f/s 
>> size to allot in such a scenario?
>> Any suggestions would be appreciated.
>
>
> I believe optimize should use at most 2X the starting index size,
> transiently, if there are no readers open against the index.  And then
> when optimize is done the size should be around the starting size, or
> less.
>
> If there are open readers against the index when the optimize occurs,
> then, the segments that were merged cannot actually be deleted until
> those readers close.  Even on Unix, where it will look like the
> segments were deleted, they are still consuming disk space because
> open file handles keep them allocated ("delete on last close").
>
> This means if you have open readers you should see at most 3X the
> starting index size.  Worse, if some (but not all) readers are
> re-opening while the merge is underway it's possible to peak at even
> more than 3X the starting size.
>
> Do you have readers running against your index?
>
> I will call this out in the javadocs for optimize, addDocument,
> addIndexes...
>
> Mike





Re: Merge Index Filling up Disk Space

Posted by Yonik Seeley <yo...@apache.org>.
On 12/21/06, Michael McCandless <lu...@mikemccandless.com> wrote:
> I *think* it's really max 2X even with compound file (if no readers)?
>
> Because, in IndexWriter.mergeSegments we:
>
>    1. Create the newly merged segment in non-compound format (brings us
>       up to 2X, when it's the last merge).
>
>    2. Commit the new segments(_N) file referencing this new segment (in
>       non-compound format).
>
>    3. Remove all input segments so back to 1X.

Ah, ok. I hadn't realized that steps (2) and (3) were done.

>    4. Build the compound file (brings us up to 2X).
>
>    5. Commit the next segments(_N) file referencing the new segment in
>       compound format.
>
>    6. Delete the non-cfs segment files (back to 1X or less).

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: Merge Index Filling up Disk Space

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
> On 12/21/06, Michael McCandless <lu...@mikemccandless.com> wrote:
>> Harini Raghavan wrote:
>> > I am using lucene 1.9.1 for search functionality in my j2ee application
>> > using JBoss as app server. The lucene index directory size is almost 20G
>> > right now. There is a Quartz job that is adding data to the index every
>> > min and around 20000 documents get added to the index every day. When the
>> > documents are added and the segments are merged, the index size
>> > increases and sometimes grows to more than double its original size.
>> > This results in filling up the disk space. We have allotted a f/s size
>> > of 50G and even that is not sufficient at times. Is there an optimum
>> > value for the f/s size to be allotted in such a scenario.
>> > Any suggestions would be appreciated.
>>
>> I believe optimize should use at most 2X the starting index size,
>> transiently, if there are no readers open against the index.
> 
> Isn't it up to 3x with the compound index format? (and 4x with readers 
> opened)

I *think* it's really max 2X even with compound file (if no readers)?

Because, in IndexWriter.mergeSegments we:

   1. Create the newly merged segment in non-compound format (brings us
      up to 2X, when it's the last merge).

   2. Commit the new segments(_N) file referencing this new segment (in
      non-compound format).

   3. Remove all input segments so back to 1X.

   4. Build the compound file (brings us up to 2X).

   5. Commit the next segments(_N) file referencing the new segment in
      compound format.

   6. Delete the non-cfs segment files (back to 1X or less).

What's spooky is that if a reader re-opens, e.g. after step 2 and before
step 5, while another reader still holds the original index open, that
brings us to 4X (I think?).  More generally, since optimize may do a whole
series of merges (typical) leading up to the final merge, if readers
are aggressively re-opening then the held disk usage can be extremely
high (far more than 4X).  I think it's best not to recycle readers
during merge/optimize!
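
The worst-case arithmetic above can be sketched in a few lines of Java
(a rough sketch; the multipliers are just the assumptions discussed in
this thread, not guarantees from Lucene itself):

```java
// Back-of-the-envelope worst-case transient disk usage during optimize,
// following the reasoning in this thread. The multipliers are assumptions
// drawn from the discussion above, not guarantees from Lucene itself.
public class OptimizeDiskEstimate {
    /**
     * @param indexGb      current index size in GB
     * @param compoundFile true if the index uses the compound file format
     * @param openReaders  true if readers may hold old segments open
     */
    static double worstCaseGb(double indexGb, boolean compoundFile, boolean openReaders) {
        double multiplier = 2.0;      // merged copy exists alongside the original
        if (openReaders) {
            multiplier += 1.0;        // old segments pinned by open file handles
            if (compoundFile) {
                multiplier += 1.0;    // reader re-opened between steps 2 and 5
            }
        }
        return indexGb * multiplier;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseGb(20.0, true, false)); // 40.0
        System.out.println(worstCaseGb(20.0, true, true));  // 80.0
    }
}
```

So a 20G index like the one in this thread can transiently need 80G in
the worst case, which matches the poster's experience of 50G not being
enough.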

Mike


Re: Merge Index Filling up Disk Space

Posted by Yonik Seeley <yo...@apache.org>.
On 12/21/06, Michael McCandless <lu...@mikemccandless.com> wrote:
> Harini Raghavan wrote:
> > I am using lucene 1.9.1 for search functionality in my j2ee application
> > using JBoss as app server. The lucene index directory size is almost 20G
> > right now. There is a Quartz job that is adding data to the index every
> > min and around 20000 documents get added to the index every day. When the
> > documents are added and the segments are merged, the index size
> > increases and sometimes grows to more than double its original size.
> > This results in filling up the disk space. We have allotted a f/s size
> > of 50G and even that is not sufficient at times. Is there an optimum
> > value for the f/s size to be allotted in such a scenario.
> > Any suggestions would be appreciated.
>
> I believe optimize should use at most 2X the starting index size,
> transiently, if there are no readers open against the index.

Isn't it up to 3x with the compound index format? (and 4x with readers opened)

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: Merge Index Filling up Disk Space

Posted by Michael McCandless <lu...@mikemccandless.com>.
Harini Raghavan wrote:

> I am using lucene 1.9.1 for search functionality in my j2ee application 
> using JBoss as app server. The lucene index directory size is almost 20G 
> right now. There is a Quartz job that is adding data to the index every 
> min and around 20000 documents get added to the index every day. When the 
> documents are added and the segments are merged, the index size 
> increases and sometimes grows to more than double its original size. 
> This results in filling up the disk space. We have allotted a f/s size 
> of 50G and even that is not sufficient at times. Is there an optimum 
> value for the f/s size to be allotted in such a scenario.
> Any suggestions would be appreciated.

I believe optimize should use at most 2X the starting index size,
transiently, if there are no readers open against the index.  And then
when optimize is done the size should be around the starting size, or
less.

If there are open readers against the index when the optimize occurs,
then, the segments that were merged cannot actually be deleted until
those readers close.  Even on Unix, where it will look like the
segments were deleted, they are still consuming disk space because
open file handles keep them allocated ("delete on last close").
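
You can see this "delete on last close" behavior with plain Java I/O,
independent of Lucene's APIs (a sketch showing Unix semantics; on
Windows the delete itself would fail while the handle is open):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Demonstrates Unix "delete on last close": a file unlinked while a
// stream still holds it open remains readable (and keeps consuming
// disk space) until the last handle is closed.
public class DeleteOnLastClose {
    static String demo() {
        try {
            File f = File.createTempFile("segment", ".tmp");
            FileOutputStream out = new FileOutputStream(f);
            out.write(new byte[] {1, 2, 3});
            out.close();

            FileInputStream in = new FileInputStream(f); // open handle pins the data
            boolean deleted = f.delete();                // unlink succeeds on Unix
            int firstByte = in.read();                   // bytes still readable
            in.close();                                  // space reclaimed here
            return deleted + " " + firstByte;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // "true 1" on Unix
    }
}
```

An open IndexReader holds its segment files the same way, which is why
the space is not returned until the reader closes.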

This means if you have open readers you should see at most 3X the
starting index size.  Worse, if some (but not all) readers are
re-opening while the merge is underway it's possible to peak at even
more than 3X the starting size.

Do you have readers running against your index?

I will call this out in the javadocs for optimize, addDocument,
addIndexes...

Mike


Re: Merge Index Filling up Disk Space

Posted by Mark Miller <ma...@gmail.com>.
When Lucene optimizes the index (which it partially does naturally as the 
index grows) it creates a copy of the index, so you can expect the space 
requirements for an index to be double the index size at an absolute minimum. 
If you are adding 20,000 docs a day and working with an index that is 
already 20G then you're just playing with fire sitting on a 50G 
partition. With the price of disk space these days I would recommend 
throwing a lot more storage at your app server. I would hate to keep 
dancing around, giving it the few more gigs it needs to keep from 
crashing.
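
A minimal pre-flight check along those lines (a hypothetical sketch, not
part of Lucene; File.getUsableSpace() requires Java 6+, and the 2.25x
multiplier is only an illustrative safety margin):

```java
import java.io.File;

// Sketch of a pre-flight disk check before kicking off an optimize:
// refuse to start unless the partition holding the index has room for
// a full second copy (2x) plus a safety margin. getUsableSpace() is
// Java 6+; the 2.25 multiplier is only an illustrative margin.
public class DiskHeadroomCheck {
    static boolean safeToOptimize(long indexBytes, long usableBytes, double multiplier) {
        return usableBytes >= (long) (indexBytes * multiplier);
    }

    public static void main(String[] args) {
        File indexDir = new File(args.length > 0 ? args[0] : ".");
        long indexBytes = 20L * 1024 * 1024 * 1024; // e.g. a 20G index
        long usable = indexDir.getUsableSpace();
        System.out.println(safeToOptimize(indexBytes, usable, 2.25));
    }
}
```

A scheduled job could run such a check and skip or defer the optimize
rather than fill the partition mid-merge.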
- Mark

Harini Raghavan wrote:
> Hi All,
>
> I am using lucene 1.9.1 for search functionality in my j2ee 
> application using JBoss as app server. The lucene index directory size 
> is almost 20G right now. There is a Quartz job that is adding data to 
> the index every min and around 20000 documents get added to the index 
> every day. When the documents are added and the segments are merged, 
> the index size increases and sometimes grows to more than double its 
> original size. This results in filling up the disk space. We have 
> allotted a f/s size of 50G and even that is not sufficient at times. 
> Is there an optimum value for the f/s size to be allotted in such a 
> scenario.
> Any suggestions would be appreciated.
>
> Thanks,
> Harini
>


RE: Merge Index Filling up Disk Space

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.
I've found that merging a 20G directory into another 20G directory on
another disk required the target disk to have > 50G available during the
merge. I ran out of space on my ~70G disk for the merge and had to do it on
another system with ~170G available, but I'm not sure how much was used
transiently for the merge. The whole process took a long time, so I've done
limited experimenting.

[Incidentally, the two directories resulted in a 24G index after merging.]

-----Original Message-----
From: Harini Raghavan [mailto:harini.raghavan@insideview.com] 
Sent: 21 December 2006 16:15
To: java-user@lucene.apache.org
Subject: Merge Index Filling up Disk Space

Hi All,

I am using lucene 1.9.1 for search functionality in my j2ee application
using JBoss as app server. The lucene index directory size is almost 20G
right now. There is a Quartz job that is adding data to the index every min
and around 20000 documents get added to the index every day. When the
documents are added and the segments are merged, the index size increases
and sometimes grows to more than double its original size. 
This results in filling up the disk space. We have allotted a f/s size of
50G and even that is not sufficient at times. Is there an optimum value for
the f/s size to be allotted in such a scenario.
Any suggestions would be appreciated.
 
Thanks,
Harini


