Posted to solr-user@lucene.apache.org by "Jones, Graham" <g....@sthree.com> on 2011/11/28 11:26:34 UTC

Tidying files after optimize. Is a service restart mandatory?

Hello

Brief question: How can I clean up excess files after performing an optimize without restarting the Tomcat service?

Detail follows:

I've been running several SOLR cores for approx 12 months and have recently noticed the disk usage of one of them is growing considerably faster than the rate at which documents are being added.

- 1,200,000 docs 12 months ago used a 45 GB index
- 1,700,000 docs today use an 87 GB index
- There may have been _some_ deletions, almost certainly <100,000
- The documents are of a broadly uniform style, approx 1000 words

So, approximately 45% growth in documents has grown the disk usage by approx 100%.

I took a server out of production (I've 1 master & 7 slaves) and did the following.
I ran http://server/corename/update?stream.body=<optimize/> on this core.

- This added 49.4 GB to the index folder
- No previously existing files were deleted
- I restarted the Tomcat service
- ONLY the files generated by the optimize remained. All older files were deleted.

This is the result I want, but not quite the method I'd prefer. How can I get to this position without restarting the service?


Many thanks in advance for any advice you can give



Re: Tidying files after optimize. Is a service restart mandatory?

Posted by Shawn Heisey <so...@elyograg.org>.

Based on this description, it seems likely that you are running Solr on 
Windows.  On Windows, if you have a file open for any reason (even just 
reading) it's not possible to delete that file.  Solr keeps the old 
index files open to serve queries until the new index is fully committed 
and ready to take over, which can often be quite a while in software terms.

On Unix/Linux, deleting a file just removes the link to that file in the 
filesystem directory.  When the last link is gone, the space is 
reclaimed.  When a program opens a file, the OS creates an internal link 
to that file.  If you delete that file while it's still open, it is 
still there, but only accessible via the internal link.  This is what 
happens during an optimize - the files are removed from the directory, 
but part of Solr still has them open, until the newly created index is 
completely online and all queries to the old one are complete.  Once 
they are closed, the OS reclaims the space.  I'm fairly sure that there 
is little communication between the parts of Solr that serve queries and 
the parts that update and merge the index.
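
The Unix behaviour described above can be seen outside Solr with a few
lines of Python. This is only an illustrative sketch, not anything Solr
itself runs: the function name read_after_unlink is made up for the
example, and it assumes a Unix-like filesystem. On Windows the unlink
call would be expected to fail while the file is open, which is the
situation described earlier.

```python
import os
import tempfile


def read_after_unlink(payload=b"old segment data"):
    """Delete a file while it is still open, then read it via the open handle.

    Hypothetical helper for illustration only.
    Returns (name_still_in_directory, bytes_read_through_open_descriptor).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # Removes the directory entry only; the data stays on disk
        # for as long as some descriptor still refers to it.
        os.unlink(path)
        still_listed = os.path.exists(path)
        # The open descriptor can still reach the contents.
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))
    finally:
        # Closing the last descriptor is what lets the OS reclaim the space.
        os.close(fd)
    return still_listed, data


if __name__ == "__main__":
    print(read_after_unlink())
```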

I've checked previous messages on this.  If you can arrange to run the 
optimize a second time before any documents are added or deleted, it 
will complete instantaneously and the extra files will be deleted.  If 
the index is changed at all between the two optimizes, it won't really 
help, as you'll have a new set of old files that won't get deleted.
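
A rough sketch of the "optimize twice" idea, using only the update URL
from the original message. The helper names optimize_url and
optimize_twice are invented for this example, the host "server" and core
"corename" are the placeholders from that message, and it assumes
nothing writes to the index between the two calls. I have not verified
that this clears the old files on any particular Solr version.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def optimize_url(base_url, core):
    """Build the optimize URL in the form used in the original message."""
    # urlencode percent-escapes the <optimize/> body so it is safe in a URL.
    query = urlencode({"stream.body": "<optimize/>"})
    return f"{base_url.rstrip('/')}/{core}/update?{query}"


def optimize_twice(base_url, core, opener=urlopen):
    """Request an optimize two times in a row and return the HTTP statuses.

    'opener' is injectable so the function can be tried without a live server.
    """
    url = optimize_url(base_url, core)
    statuses = []
    for _ in range(2):
        with opener(url) as response:
            statuses.append(response.status)
    return statuses


if __name__ == "__main__":
    # Placeholder host and core name; replace with real values before running.
    print(optimize_url("http://server", "corename"))
```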

I am not in a position to test it, but it's possible that issuing a 
RELOAD command to the CoreAdmin might also take care of deleting the old 
files.  I'm pretty sure that such an action is potentially disruptive, 
but in my experience, the index is back online within a second or two, 
much much faster than a full restart.

http://wiki.apache.org/solr/CoreAdmin#RELOAD
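
For reference, a RELOAD request is an ordinary HTTP GET against the
CoreAdmin handler. The sketch below only builds the URL. It assumes the
handler is at the commonly used admin/cores path under the Solr root
(this is configurable in solr.xml, so check your own setup), and
reload_url is a made-up helper name. Whether the reload actually removes
the old files is still untested, as noted above.

```python
from urllib.parse import urlencode


def reload_url(solr_root, core):
    """Build a CoreAdmin RELOAD URL for the given core.

    Assumes the CoreAdmin handler lives at <solr_root>/admin/cores.
    """
    query = urlencode({"action": "RELOAD", "core": core})
    return f"{solr_root.rstrip('/')}/admin/cores?{query}"


if __name__ == "__main__":
    # Placeholder host and core name taken from the original message.
    print(reload_url("http://server", "corename"))
```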

This has been a known problem for quite a while, but I do not believe 
that it is a major priority for most Solr users.  Most people I've seen 
posting to this list do not run on Windows.  I found the following bug 
filed on Solr:

https://issues.apache.org/jira/browse/SOLR-1691

Thanks,
Shawn