Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2004/12/13 00:05:15 UTC

Re: Finding unused segment files?

Hello George,

Here is a quick hack (with a few TODOs).  I only tested it a bit, so
the actual delete calls are still commented out.  If this works for
you, and especially if you take care of TODOs, I may put this in the
Lucene Sandbox.

Otis
P.S.
Usage example showing how the tool found some unused segments (this was
caused by a bug in one of the earlier 1.4 versions of Lucene).

[otis@cosmo java]$ java org.apache.lucene.index.SegmentPurger /simpy/users/1/index
Candidate non-Lucene file found: _1b2.del
Candidate unused Lucene file found: _1b2.cfs
Candidate unused Lucene file found: _1bm.cfs
Candidate unused Lucene file found: _1c6.cfs
Candidate unused Lucene file found: _1cq.cfs
Candidate unused Lucene file found: _1da.cfs
Candidate unused Lucene file found: _1du.cfs
Candidate unused Lucene file found: _1ee.cfs
Candidate unused Lucene file found: _1ey.cfs
[otis@cosmo java]$
[otis@cosmo java]$ strings /simpy/users/1/index/segments
_3o0
[otis@cosmo java]$ ls -al /simpy/users/1/index/
total 647
drwxrwsr-x    2 otis     simpy        1024 Dec  7 14:39 .
drwxrwsr-x    3 otis     simpy        1024 Sep 16 20:39 ..
-rw-rw-r--    1 otis     simpy      212815 Nov 17 18:36 _1b2.cfs
-rw-rw-r--    1 otis     simpy         104 Nov 17 18:40 _1b2.del
-rw-rw-r--    1 otis     simpy        3380 Nov 17 18:40 _1bm.cfs
-rw-rw-r--    1 otis     simpy        3533 Nov 17 18:40 _1c6.cfs
-rw-rw-r--    1 otis     simpy        4774 Nov 17 18:40 _1cq.cfs
-rw-rw-r--    1 otis     simpy        3389 Nov 17 18:40 _1da.cfs
-rw-rw-r--    1 otis     simpy        3809 Nov 17 18:40 _1du.cfs
-rw-rw-r--    1 otis     simpy        3423 Nov 17 18:40 _1ee.cfs
-rw-rw-r--    1 otis     simpy        4016 Nov 17 18:40 _1ey.cfs
-rw-rw-r--    1 otis     simpy      410299 Dec  7 14:39 _3o0.cfs
-rw-rw-r--    1 otis     simpy           4 Dec  7 14:39 deletable
-rw-rw-r--    1 otis     simpy          29 Dec  7 14:39 segments
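
For anyone who wants to try the idea without the class above, here is a
minimal sketch of the same kind of check.  It is not the SegmentPurger
source.  The class name UnusedSegmentFinder, the extension list (which
is an assumption and probably incomplete), and the choice to pass the
live segment names on the command line instead of reading them from the
segments file are all made up for illustration.  It only reports
candidates and never deletes anything.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: flags files whose segment prefix is not in a
// caller-supplied set of live segment names.
class UnusedSegmentFinder {

    // Assumed (likely incomplete) list of per-segment extensions.
    private static final Set<String> LUCENE_EXTENSIONS = new HashSet<>(Arrays.asList(
            "cfs", "fnm", "fdx", "fdt", "tii", "tis", "frq", "prx",
            "del", "tvx", "tvd", "tvf"));

    // Index-level files that belong to no single segment.
    private static final Set<String> INDEX_LEVEL_FILES =
            new HashSet<>(Arrays.asList("segments", "deletable"));

    enum Kind { LIVE, UNUSED, NOT_LUCENE }

    static Kind classify(String fileName, Set<String> liveSegments) {
        if (INDEX_LEVEL_FILES.contains(fileName)) {
            return Kind.LIVE;
        }
        int dot = fileName.lastIndexOf('.');
        // Segment files are assumed to look like _<name>.<ext>.
        if (!fileName.startsWith("_") || dot < 2) {
            return Kind.NOT_LUCENE;
        }
        String segment = fileName.substring(0, dot);
        String ext = fileName.substring(dot + 1);
        // Norms files are assumed to use the extension f<fieldNumber>.
        boolean known = LUCENE_EXTENSIONS.contains(ext) || ext.matches("f\\d+");
        if (!known) {
            return Kind.NOT_LUCENE;
        }
        return liveSegments.contains(segment) ? Kind.LIVE : Kind.UNUSED;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: UnusedSegmentFinder <indexDir> [liveSegment ...]");
            return;
        }
        Path dir = Paths.get(args[0]);
        Set<String> live = new HashSet<>(Arrays.asList(args).subList(1, args.length));
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                Kind kind = classify(name, live);
                if (kind != Kind.LIVE) {
                    // Report only; deleting is left to the operator.
                    System.out.println(kind + ": " + name);
                }
            }
        }
    }
}
```

For the directory listed above it would be run with the index path
followed by _3o0, the one name that shows up in the segments file.
Note that this sketch counts .del as a Lucene extension, so it would
report _1b2.del as an unused segment file, not as a non-Lucene file.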


--- gwithers@connected.com wrote:

> Hello all.
>
> I recently ran into a problem where errors during indexing or
> optimization (perhaps related to running out of disk space) left me
> with a working index in a directory but with additional (partial)
> segment files that were unneeded.  The solution for finding the ~40
> files to keep out of the ~900 files in the directory amounted to
> dumping the segments file and noting that only 5 segments were in
> fact "live".  The index is a non-compound index using FSDirectory.
>
> Is there (or would it be possible to add, and I'd be willing to
> submit code if it made sense to people) some sort of interrogation
> of the index for which files belong to it?  I looked first at
> FSDirectory itself, thinking that its "list()" method should return
> the subset of index-related files, but looking deeper it seems that
> Directory is at a lower level, abstracting simple I/O, and thus
> wouldn't "know".
>
> So any thoughts?  Would it make sense to have a form of clean on
> IndexWriter?  I hesitate since it seems there isn't a charter that
> only Lucene files may exist in the directory, so what is ideal for
> my application (since I know I won't mingle other files) might not
> be ideal for all.  Would it be fair to look for known Lucene
> extensions and file naming signatures to identify unused files that
> might be failed or dead segments?
>
> Thanks,
>
> -George