Posted to dev@lucene.apache.org by gw...@connected.com on 2004/11/03 17:40:41 UTC

Finding unused segment files?

Hello all.

 

I recently ran into a problem where errors during indexing or optimization
(perhaps related to running out of disk space) left me with a working index
in a directory but with additional segment files (partial) that were
unneeded.  The solution for finding the ~40 files to keep out of the ~900
files in the directory amounted to dumping the segments file and noting that
only 5 segments were in fact "live".  The index is a non-compound index
using FSDirectory.

 

Is there some way to interrogate the index about which files belong to it?
If not, would it be possible to add one?  I'd be willing to submit code if
it made sense to people.  I looked first at FSDirectory itself, thinking that
its "list()" method should return the subset of index-related files, but
looking deeper it looks like Directory is at a lower level, abstracting
simple I/O, and thus wouldn't "know".

 

So any thoughts?  Would it make sense to have some form of cleanup on
IndexWriter?  I hesitate since nothing seems to guarantee that only
Lucene files exist in the directory, so what is ideal for my
application (since I know I won't mingle other files) might not be ideal for
all.  Would it be fair to look for known Lucene extensions and file naming
signatures to identify unused files that might be failed or dead segments?
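
To make the idea concrete, here is a minimal sketch of that kind of check.
It is not Lucene code: the class name UnusedSegmentFinder, the file name
pattern, and the extension list are all assumptions for illustration, and
the set of live segment names has to be supplied by the caller (for example,
copied by hand from a dump of the segments file).  It only reports
candidates and never deletes anything.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Lucene: flags files whose segment is not live. */
final class UnusedSegmentFinder {

    // Assumed per-segment file shape: "_" + lowercase alphanumeric id + "." + extension.
    private static final Pattern SEGMENT_FILE =
            Pattern.compile("(_[0-9a-z]+)\\.([a-z0-9]+)");

    // Assumed extension list for a 1.4-era index; adjust it for your version.
    private static final Set<String> KNOWN_EXTENSIONS = new HashSet<>(Arrays.asList(
            "cfs", "fnm", "fdx", "fdt", "tii", "tis", "frq", "prx", "del",
            "tvx", "tvd", "tvf"));

    /** True for a listed extension or a per-field norms file such as f0, f1, ... */
    static boolean isKnownExtension(String ext) {
        return KNOWN_EXTENSIONS.contains(ext) || ext.matches("f[0-9]+");
    }

    /**
     * Returns the names that look like Lucene segment files but whose segment
     * name is not in liveSegments.  Anything that does not match the pattern
     * (segments, deletable, foreign files) is skipped rather than flagged.
     */
    static List<String> findUnused(List<String> fileNames, Set<String> liveSegments) {
        List<String> unused = new ArrayList<>();
        for (String name : fileNames) {
            Matcher m = SEGMENT_FILE.matcher(name);
            if (!m.matches() || !isKnownExtension(m.group(2))) {
                continue; // not recognizably a segment file: leave it alone
            }
            if (!liveSegments.contains(m.group(1))) {
                unused.add(name);
            }
        }
        return unused;
    }

    /** Usage: java UnusedSegmentFinder <indexDir> <liveSegment> [<liveSegment> ...] */
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: UnusedSegmentFinder <indexDir> <liveSegment>...");
            return;
        }
        String[] names = new File(args[0]).list();
        if (names == null) {
            System.err.println("Not a readable directory: " + args[0]);
            return;
        }
        Set<String> live = new HashSet<>(Arrays.asList(args).subList(1, args.length));
        for (String name : findUnused(Arrays.asList(names), live)) {
            System.out.println("Candidate unused file: " + name);
        }
    }
}
```

Skipping anything that does not match the pattern is deliberate: since other
files may share the directory, the sketch errs on the side of leaving unknown
files untouched.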

 

Thanks,

-George


Re: Finding unused segment files?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello George,

Here is a quick hack (with a few TODOs).  I only tested it a bit, so
the actual delete calls are still commented out.  If this works for
you, and especially if you take care of TODOs, I may put this in the
Lucene Sandbox.

Otis
P.S.
Usage example showing how the tool found some unused segments (this was
caused by a bug in one of the earlier 1.4 versions of Lucene).

[otis@cosmo java]$ java org.apache.lucene.index.SegmentPurger /simpy/users/1/index
Candidate non-Lucene file found: _1b2.del
Candidate unused Lucene file found: _1b2.cfs
Candidate unused Lucene file found: _1bm.cfs
Candidate unused Lucene file found: _1c6.cfs
Candidate unused Lucene file found: _1cq.cfs
Candidate unused Lucene file found: _1da.cfs
Candidate unused Lucene file found: _1du.cfs
Candidate unused Lucene file found: _1ee.cfs
Candidate unused Lucene file found: _1ey.cfs
[otis@cosmo java]$
[otis@cosmo java]$ strings /simpy/users/1/index/segments
_3o0
[otis@cosmo java]$ ls -al /simpy/users/1/index/
total 647
drwxrwsr-x    2 otis     simpy        1024 Dec  7 14:39 .
drwxrwsr-x    3 otis     simpy        1024 Sep 16 20:39 ..
-rw-rw-r--    1 otis     simpy      212815 Nov 17 18:36 _1b2.cfs
-rw-rw-r--    1 otis     simpy         104 Nov 17 18:40 _1b2.del
-rw-rw-r--    1 otis     simpy        3380 Nov 17 18:40 _1bm.cfs
-rw-rw-r--    1 otis     simpy        3533 Nov 17 18:40 _1c6.cfs
-rw-rw-r--    1 otis     simpy        4774 Nov 17 18:40 _1cq.cfs
-rw-rw-r--    1 otis     simpy        3389 Nov 17 18:40 _1da.cfs
-rw-rw-r--    1 otis     simpy        3809 Nov 17 18:40 _1du.cfs
-rw-rw-r--    1 otis     simpy        3423 Nov 17 18:40 _1ee.cfs
-rw-rw-r--    1 otis     simpy        4016 Nov 17 18:40 _1ey.cfs
-rw-rw-r--    1 otis     simpy      410299 Dec  7 14:39 _3o0.cfs
-rw-rw-r--    1 otis     simpy           4 Dec  7 14:39 deletable
-rw-rw-r--    1 otis     simpy          29 Dec  7 14:39 segments

