Posted to java-user@lucene.apache.org by "Rob Staveley (Tom)" <rs...@seseit.com> on 2006/09/15 19:17:52 UTC

Merging "orphaned" segments into a composite index

I have had some badly behaved Lucene indexing software crash on me several
times and have been left with an index directory with lots of non-compound
files in it, when all I ought to be getting is the compound (.cfs) files
plus deletable and segments.

Re-indexing everything doesn't bear thinking about. I was wondering if I'd
be able to merge these non-compound files into the composite index, and if
so... how? [I appreciate that there is some risk in doing this, bearing in
mind software crashed when the orphaned index files were created.]

If you'll excuse the Perl gibber, this gives a sense of what's in the index
directory:

$ find . | perl -n -e 'if (/\..+.(\..+)/) {print "$1\n"}' | sort | uniq -c
     15 .cfs
      4 .f0
     40 .fdt
     36 .fdx
     40 .fnm
     16 .frq
      1 .log
     16 .prx
     15 .tii
     16 .tis
      5 .tmp

Here's my thinking:

(1) I stop my indexer
(2) I create a temp directory and move everything other than the .cfs files,
deletable and segments into it
(3) I open an IndexWriter to my composite index and use
IndexWriter.addIndexes(Directory[]) on the temp directory

Assuming the files aren't corrupt, should that do the job to create a nicely
merged composite index, or is this a foolish undertaking?
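
Step (2) of the plan above can be sketched in plain Java. This is only a
minimal sketch under stated assumptions: the class name OrphanMover and its
methods are hypothetical, it uses java.nio.file rather than any Lucene API,
and it treats every regular file other than *.cfs, "segments" and
"deletable" as an orphan (so stray .log and .tmp files get moved too). It
moves files and does nothing to make them readable as an index.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of Lucene.
class OrphanMover {

    // Files assumed to belong to the composite index and therefore left alone.
    static boolean staysInCompositeIndex(String fileName) {
        return fileName.endsWith(".cfs")
                || fileName.equals("segments")
                || fileName.equals("deletable");
    }

    // Moves every other regular file from indexDir to orphanDir and
    // returns the sorted names of the files that were moved.
    static List<String> moveOrphans(Path indexDir, Path orphanDir) throws IOException {
        // Collect first, move afterwards, so the directory listing is not
        // modified while it is being iterated.
        List<Path> orphans = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(indexDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && !staysInCompositeIndex(name)) {
                    orphans.add(entry);
                }
            }
        }
        Files.createDirectories(orphanDir);
        List<String> moved = new ArrayList<>();
        for (Path orphan : orphans) {
            String name = orphan.getFileName().toString();
            Files.move(orphan, orphanDir.resolve(name));
            moved.add(name);
        }
        Collections.sort(moved);
        return moved;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OrphanMover <indexDir> <orphanDir>");
            return;
        }
        for (String name : moveOrphans(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println("moved " + name);
        }
    }
}
```

Run it only with the indexer stopped and against a backed-up copy of the
index directory.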

RE: Merging "orphaned" segments into a composite index

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.

Thanks for the advice, Andrzej, including using BeanShell for this.
SegmentInfos and CompoundFileReader here I come.

Re: Merging "orphaned" segments into a composite index

Posted by Andrzej Bialecki <ab...@getopt.org>.

Rob Staveley (Tom) wrote:
> It looks like my segments file only contains information for the .cfs
> segments. So this approach doesn't work. I was wondering if I could use
> IndexWriter.addIndexes(IndexReader[]) instead. Can I open an IndexReader
> without a corresponding segments file? I notice that IndexReader.open(...)
> always operates on directories, which I guess means that it uses the
> segments file. 
>
> Is a segments file something that can be easily bodged for a bunch of index
> files which aren't referenced by the segments file?
>
> This probably all seems like a foolish errand, but my two indexes are > 300G
> each and regenerating them is something I'd like to avoid.
>   

1. Do a backup first!!!

2. Separate the segment data which is accounted for in the current
"segments" file from all other data, and move that to a "healthy" index.
Move the rest of the data to a "corrupt" index. Make sure that the
"healthy" index is still healthy after this operation .. ;)

3. Look at the source of
org.apache.lucene.index.SegmentInfos.read(Directory) and
write(Directory). You can see how the new "segments" file is created
based on the SegmentInfo information. So, the only challenge is to
create a bunch of SegmentInfo instances corresponding to your segment
names in the "corrupt" index, and write them out to a new "segments"
file according to this format.

4. You can easily discover the number of documents in each segment. This
is equal to the length (in bytes) of each <segment_name>.f<number>
file, each of which stores the lengthNorm info for one field, one entry
per document.
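
The document-count step above can be sketched in plain Java. This is a
minimal sketch, not a definitive tool: the class name NormsDocCounter and
the method docCounts are hypothetical, it uses java.nio.file rather than
any Lucene API, and it relies on the statement above that the byte length
of a <segment_name>.f<number> file equals the segment's document count.
Segments that have no such file simply do not show up in the result.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Lucene.
class NormsDocCounter {

    // Matches <segment_name>.f<number>, e.g. "_1a.f0".
    // Files such as "_1a.fdt" or "_1a.fnm" do not match.
    private static final Pattern NORMS_FILE = Pattern.compile("(.+)\\.f(\\d+)");

    // Returns segment name -> document count, taken from the size in
    // bytes of that segment's norms files.
    static Map<String, Long> docCounts(Path indexDir) throws IOException {
        Map<String, Long> counts = new TreeMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(indexDir)) {
            for (Path entry : entries) {
                Matcher m = NORMS_FILE.matcher(entry.getFileName().toString());
                if (!m.matches() || !Files.isRegularFile(entry)) {
                    continue;
                }
                String segment = m.group(1);
                long size = Files.size(entry);
                Long previous = counts.put(segment, size);
                // All norms files of one segment should have the same length;
                // a mismatch suggests a truncated or damaged file.
                if (previous != null && previous != size) {
                    throw new IllegalStateException(
                            "norms files of segment " + segment + " differ in length");
                }
            }
        }
        return counts;
    }
}
```

A mismatch between the norms files of one segment is reported as an error
rather than silently resolved, since that segment is probably not worth
writing into a new "segments" file.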

Once you have written the new "segments" file, try to open the index with
IndexReader and iterate over documents to see if the index is basically
"healthy". You could also do an IndexWriter.optimize() to make sure that
all parts of the index correctly fit together.

A note: if a cfs file is corrupted, you can try to explode it into 
individual files - you can use 
org.apache.lucene.index.CompoundFileReader, iterate over file names, 
open streams and save them to regular files; and then do the trick with 
re-generating the segment file.

Then you may consider this "corrupt" index to be restored. Now you can 
merge your "healthy" and "restored" indexes using IndexWriter.addIndexes().

BTW. I strongly recommend doing frequent backups at different stages. 
Also, I recommend using BeanShell - you will save a lot of time you 
would otherwise spend on editing and compilation of these steps.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Merging "orphaned" segments into a composite index

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.

It looks like my segments file only contains information for the .cfs
segments. So this approach doesn't work. I was wondering if I could use
IndexWriter.addIndexes(IndexReader[]) instead. Can I open an IndexReader
without a corresponding segments file? I notice that IndexReader.open(...)
always operates on directories, which I guess means that it uses the
segments file. 

Is a segments file something that can be easily bodged for a bunch of index
files which aren't referenced by the segments file?

This probably all seems like a foolish errand, but my two indexes are > 300G
each and regenerating them is something I'd like to avoid.
