Posted to java-user@lucene.apache.org by Stanislav Jordanov <st...@sirma.bg> on 2006/08/29 10:09:47 UTC
Reviving a dead index
What might cause an IndexReader to fail to open
because it cannot find a .fnm file that is expected to be there:
java.io.FileNotFoundException: E:\index4\_1j8s.fnm (The system cannot
find the file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at
org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
at
org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:127)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:42)
The only thing that comes to mind is that the last time it ran, the
indexing process was not shut down properly.
Is there a way to revive the index, or must everything be reindexed
from scratch?
Thanks,
Stanislav
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Installing a custom tokenizer
Posted by Mark Miller <ma...@gmail.com>.
Bill Taylor wrote:
>
> On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
>
>> I'm in a real rush here, so pardon my brevity, but..... one of the
>> constructors for IndexWriter takes an Analyzer as a parameter, which
>> can be
>> a PerFieldAnalyzerWrapper. That, if I understand your issue, should
>> fix you
>> right up.
>
> That almost worked. I can't use a per-field analyzer because I have
> to process the content fields of all documents. I built a custom
> analyzer which extended StandardAnalyzer and replaced the
> tokenStream method with a new one which used WhitespaceTokenizer
> instead of StandardTokenizer. This meant that my document IDs were
> not split, but I lost the conversion of acronyms such as w.o. to wo
> and the like.
>
> So what I need to do is to make a new Tokenizer based on the
> StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj
> should be
>
> | <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>)>
>
> so that a serial number need not have a digit in every other segment
> and a series of letters and digits without special characters such as
> a dash will be treated as a single word.
>
> Questions:
>
> 1) If I change the .jj file in this way, how do I run JavaCC to make a
> new tokenizer? The JavaCC documentation says that JavaCC generates a
> number of output files; I think that I only need the tokenizer code.
>
> 2) I suppose I have to tell the query parser to parse queries in the
> same way, is that right?
>
> The reason I think so is that Luke says I have words such as w.o. in
> the index which the query parser can't find. I suspect I have to use
> the same Analyzer on both, right?
>
Get JavaCC and run it on StandardTokenizer.jj. This should be as simple
as typing 'javacc StandardTokenizer.jj'... I believe that with no output
folder specified, all of the files will be built in the current
directory. Don't worry about trying to generate only the files you
need--JavaCC will handle everything for you. If you use Eclipse I
recommend the JavaCC plug-in. I find it very handy.
Generally you must run the same analyzer that you indexed with on your
search strings... if StandardAnalyzer parses oldman-83 to oldman
while indexing and you use WhitespaceAnalyzer while searching, then you
will attempt to find oldman-83 in the index instead of oldman (which is
what StandardAnalyzer stored).
- Mark
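[Editor's note: the mismatch Mark describes can be reproduced with plain Java string splitting. This is only a rough stand-in for the real analyzers; the regexes are simplifications of what StandardAnalyzer and WhitespaceAnalyzer actually do, and the class name is hypothetical.]

```java
import java.util.*;

public class TokenMismatchDemo {
    // Rough stand-in for StandardAnalyzer at index time:
    // split on anything that is not a letter or digit.
    static List<String> indexTokens(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z0-9]+"));
    }

    // Rough stand-in for WhitespaceAnalyzer at query time:
    // split on whitespace only, so "oldman-83" stays one token.
    static List<String> queryTokens(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(indexTokens("oldman-83 was here"));
        for (String q : queryTokens("oldman-83")) {
            // "oldman-83" is not among {oldman, 83, was, here} -> no hit
            System.out.println(q + " found: " + indexed.contains(q));
        }
    }
}
```

The query term "oldman-83" never matches the punctuation-split index terms, which is exactly why the same analyzer must be used on both sides.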
Re: Straight TF-IDF cosine similarity?
Posted by Jason Polites <ja...@gmail.com>.
Have you looked at the MoreLikeThis class in the similarity package?
On 8/30/06, Winton Davies <wd...@yahoo-inc.com> wrote:
>
> Hi All,
>
> I'm scratching my head - can someone tell me which class implements
> an efficient multiple term TF.IDF Cosine similarity scoring mechanism?
>
> There is clearly the single TermScorer - but I can't find the class
> that would do a bucketed TF.IDF cosine - i.e. fill an accumulator
> with the tf.idf^2 for each of the term posting lists, until the
> accumulator is full, and then compute the final score.
>
> I don't need a Boolean Query - at least this seems like overkill.
>
> Cheers,
> Winton
>
Straight TF-IDF cosine similarity?
Posted by Winton Davies <wd...@yahoo-inc.com>.
Hi All,
I'm scratching my head - can someone tell me which class implements
an efficient multiple term TF.IDF Cosine similarity scoring mechanism?
There is clearly the single TermScorer - but I can't find the class
that would do a bucketed TF.IDF cosine - i.e. fill an accumulator
with the tf.idf^2 for each of the term posting lists, until the
accumulator is full, and then compute the final score.
I don't need a Boolean Query - at least this seems like overkill.
Cheers,
Winton
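[Editor's note: the accumulator scheme Winton describes can be sketched in plain Java, independent of Lucene. This is a toy term-at-a-time scorer; the class, data layout (term -> docId -> term frequency), and method names are hypothetical, and the scores are left unnormalized.]

```java
import java.util.*;

public class TfIdfAccumulator {
    /**
     * Term-at-a-time scoring: walk each query term's posting list
     * (docId -> term frequency) and accumulate tf * idf^2 per document,
     * i.e. the tf.idf^2 contributions described above.
     */
    static Map<Integer, Double> score(Map<String, Map<Integer, Integer>> postings,
                                      int numDocs, List<String> queryTerms) {
        Map<Integer, Double> acc = new HashMap<>();
        for (String term : queryTerms) {
            Map<Integer, Integer> postingList = postings.get(term);
            if (postingList == null) continue;   // term not in the index
            double idf = Math.log((double) numDocs / postingList.size());
            for (Map.Entry<Integer, Integer> e : postingList.entrySet()) {
                double contribution = e.getValue() * idf * idf;  // tf * idf^2
                acc.merge(e.getKey(), contribution, Double::sum);
            }
        }
        // Unnormalized scores; divide by document norms for a true cosine.
        return acc;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        Map<Integer, Integer> pl = new HashMap<>();
        pl.put(1, 3);  // doc 1: tf = 3
        pl.put(2, 1);  // doc 2: tf = 1
        postings.put("serial", pl);
        System.out.println(score(postings, 4, Arrays.asList("serial")));
    }
}
```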
Re: Installing a custom tokenizer
Posted by Bill Taylor <wa...@as-st.com>.
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
> I'm in a real rush here, so pardon my brevity, but..... one of the
> constructors for IndexWriter takes an Analyzer as a parameter, which
> can be
> a PerFieldAnalyzerWrapper. That, if I understand your issue, should
> fix you
> right up.
That almost worked. I can't use a per-field analyzer because I have to
process the content fields of all documents. I built a custom analyzer
which extended StandardAnalyzer and replaced the tokenStream
method with a new one which used WhitespaceTokenizer instead of
StandardTokenizer. This meant that my document IDs were not split, but
I lost the conversion of acronyms such as w.o. to wo and the like.
So what I need to do is to make a new Tokenizer based on the
StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj
should be
| <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>)>
so that a serial number need not have a digit in every other segment
and a series of letters and digits without special characters such as a
dash will be treated as a single word.
Questions:
1) If I change the .jj file in this way, how do I run JavaCC to make a
new tokenizer? The JavaCC documentation says that JavaCC generates a
number of output files; I think that I only need the tokenizer code.
2) I suppose I have to tell the query parser to parse queries in the
same way, is that right?
The reason I think so is that Luke says I have words such as w.o. in
the index which the query parser can't find. I suspect I have to use
the same Analyzer on both, right?
> On 8/29/06, Bill Taylor <wa...@as-st.com> wrote:
>>
>> I am indexing documents which are filled with government jargon. As
>> one would expect, the standard tokenizer has problems with
>> governmenteese.
>>
>> In particular, the documents use words such as 310N-P-Q as references
>> to other documents. The standard tokenizer breaks this "word" at the
>> dashes so that I can find P or Q but not the entire token.
>>
>> I know how to write a new tokenizer. I would like hints on how to
>> install it and get my indexing system to use it. I don't want to
>> modify the standard .jar file. What I think I want to do is set up my
>> indexing operation to use the WhitespaceTokenizer instead of the
>> normal
>> one, but I am unsure how to do this.
>>
>> I know that the IndexTask has a setAnalyzer method. The document
>> formats are rather complicated and I need special code to isolate the
>> text strings which should be indexed. My file analyzer isolates the
>> string I want to index, then does
>>
>> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
>> Field.Store.YES, Field.Index.TOKENIZED));
>>
>> I suspect that my issue is getting the Field constructor to use a
>> different tokenizer. Can anyone help?
>>
>> Thanks.
>>
>> Bill Taylor
>>
>>
Re: Installing a custom tokenizer
Posted by Erick Erickson <er...@gmail.com>.
I'm in a real rush here, so pardon my brevity, but..... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.
Same kind of thing for a Query.
Erick
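[Editor's note: a minimal sketch of Erick's suggestion against the 2006-era Lucene API, untested here; the field name "contents", the index path, and the class name are placeholders.]

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerFieldExample {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer for every field, except whitespace-only
        // tokenization for the field holding IDs like 310N-P-Q.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("contents", new WhitespaceAnalyzer());

        IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
        // ... add documents, then:
        writer.close();

        // "Same kind of thing for a Query": parse with the same wrapper,
        // so 310N-P-Q is not split at query time either.
        Query q = new QueryParser("contents", analyzer).parse("310N-P-Q");
    }
}
```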
On 8/29/06, Bill Taylor <wa...@as-st.com> wrote:
>
> I am indexing documents which are filled with government jargon. As
> one would expect, the standard tokenizer has problems with
> governmenteese.
>
> In particular, the documents use words such as 310N-P-Q as references
> to other documents. The standard tokenizer breaks this "word" at the
> dashes so that I can find P or Q but not the entire token.
>
> I know how to write a new tokenizer. I would like hints on how to
> install it and get my indexing system to use it. I don't want to
> modify the standard .jar file. What I think I want to do is set up my
> indexing operation to use the WhitespaceTokenizer instead of the normal
> one, but I am unsure how to do this.
>
> I know that the IndexTask has a setAnalyzer method. The document
> formats are rather complicated and I need special code to isolate the
> text strings which should be indexed. My file analyzer isolates the
> string I want to index, then does
>
> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
> Field.Store.YES, Field.Index.TOKENIZED));
>
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?
>
> Thanks.
>
> Bill Taylor
>
>
Installing a custom tokenizer
Posted by Bill Taylor <wa...@as-st.com>.
I am indexing documents which are filled with government jargon. As
one would expect, the standard tokenizer has problems with
governmenteese.
In particular, the documents use words such as 310N-P-Q as references
to other documents. The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.
I know how to write a new tokenizer. I would like hints on how to
install it and get my indexing system to use it. I don't want to
modify the standard .jar file. What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.
I know that the IndexTask has a setAnalyzer method. The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed. My file analyzer isolates the
string I want to index, then does
doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.Index.TOKENIZED));
I suspect that my issue is getting the Field constructor to use a
different tokenizer. Can anyone help?
Thanks.
Bill Taylor
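[Editor's note: one way to get whitespace tokenization without touching the standard jar, sketched against the 2006-era Lucene analysis API. Untested; the class name is hypothetical. The key point, answering Bill's last question, is that the analyzer is chosen per IndexWriter, not per Field constructor.]

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class JargonAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Whitespace-only tokenization keeps "words" like 310N-P-Q intact;
        // LowerCaseFilter normalizes case so queries match either way.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

// Installation: nothing in the standard jar changes; you pass the
// analyzer when constructing the writer, e.g.
//   IndexWriter writer =
//       new IndexWriter("/path/to/index", new JargonAnalyzer(), true);
```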
Re: Reviving a dead index
Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
Works like a charm, Michael! (The only thing is that SegmentInfos / SegmentInfo
are final classes (which I didn't know), so I was poking around to really
find the classes :) heh.
I was able to remove the broken segment. I must now get the MAX(id) from
the clean remaining segments, then just regenerate from that point on.
(I've indexed the IDs, so that's a good pinpoint.)
Anyway, thanks! My log rotation was broken so I was unable to trace back the
root cause, but I will report if it happens again!
On Thu, 23 Nov 2006 12:39:19 +0100, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Aleksander M. Stensby wrote:
>> Hey, saw this old thread, and was just wondering if any of you solved
>> the problem? The same has happened to me now. I couldn't really trace back
>> to the origin of the problem, but the segments file references a segment
>> that is obviously corrupt/not complete...
>> I thought I might remove the incomplete segment, but then I guess this
>> would f* up my segments file. Any takers on how to remove the segment in
>> question and make the rest of the index work again?.. Since I guess
>> it's not as straightforward to just remove the segment name from the
>> segments file without changing some of the other cryptic bytes as well..?
>
> I don't think the root cause was ever uncovered on this thread.
>
> Do you have any copying process to move your index from one machine to
> another, or across mount points on the same machine, or anything? A
> copying step has been the cause of similar corruption in past issues.
> I would really like to get to the root cause of any and all
> corruption.
>
> To recover your index, it would be fairly simple to write a tool that
> reads the segments files, removes the known bad segments, and writes
> the segments file back out. Something like this (I haven't tested!):
>
> Directory dir = FSDirectory.getDirectory("/path/to/my/index", false);
> SegmentInfos sis = new SegmentInfos();
> sis.read(dir);
> for (int i = 0; i < sis.size(); i++) {
>   SegmentInfo si = (SegmentInfo) sis.elementAt(i);
>   if (si.name.equals("_XXXX")) {
>     sis.removeElementAt(i);
>     break;
>   }
> }
> sis.write(dir);
>
> Please make a backup copy of your index before running this in case
> it messes anything up!!
>
> And note that by removing an entire segment, many documents are now
> gone from your index. Determining which ones are gone and must be
> reindexed is not particularly easy unless you have a way to do so in
> your application...
>
> Also note that Lucene will not delete the now un-referenced (bad)
> segment file (ie, _XXXX.cfs, and other _XXXX files like _XXXX.del if
> it exists) so you will have to do that step manually. (The current
> "trunk" version of Lucene does in fact remove unreferenced index
> files correctly but this hasn't been released yet).
>
> Mike
>
>> - Aleksander
>> On Thu, 31 Aug 2006 03:42:28 +0200, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>
>>> Stanislav Jordanov wrote:
>>>
>>>> For a moment I wondered what exactly you meant by "compound file".
>>>> Then I read http://lucene.apache.org/java/docs/fileformats.html and
>>>> got the idea.
>>>> I do not have access to the specific machine that all this is
>>>> happening on.
>>>> It is an 80x86 machine running Win 2003 server.
>>>> Sorry, but they neglected my question about whether the index is stored
>>>> on a local FS or on NFS.
>>>> I was only able to obtain a directory listing of the index dir and
>>>> guess what - there's no _1j8s.cfs file at all!
>>>> Pity, I can't have a look at the segments file, but I guess it lists
>>>> _1j8s.
>>>> Given these scarce resources, can you give me some further advice
>>>> about what has happened and what can be done to prevent it from
>>>> happening again?
>>>
>>> I'm assuming this is easily repeated (question from my last email) and
>>> not a transient error? If it's transient, this could be explained by
>>> locking not working properly.
>>>
>>> If it's not transient (ie, happens every time you open this index),
>>> it sounds like indeed the segments file is referencing a segment that
>>> does not exist.
>>>
>>> But, how the index got into this state is a mystery. I don't know of
>>> any existing Lucene bugs that can do this. Furthermore, crashing
>>> an indexing process should not lead to this (it can lead to other
>>> things like only having a segments.new file and no segments file).
>>>
>>> Were there any earlier exceptions (before indexing hit an "improper
>>> shutdown") in your indexing process that could give a clue as to the root
>>> cause? Or, for example, was the machine rebooted, and did Windows run
>>> a "filesystem check" on rebooting this box (which can remove corrupt
>>> files)?
>>>
>>> Mike
>>>
>> --Aleksander M. Stensby
>> Software Developer
>> Integrasco A/S
>> aleksander.stensby@integrasco.no
>> Tlf.: +47 41 22 82 72
--
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no
Tlf.: +47 41 22 82 72
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Aleksander M. Stensby wrote:
> Works like a charm, Michael! (The only thing is that SegmentInfos /
> SegmentInfo are final classes (which I didn't know), so I was poking
> around to really find the classes :) heh.
>
> I was able to remove the broken segment. I must now get the MAX(id) from
> the clean remaining segments, then just regenerate from that point on.
> (I've indexed the IDs, so that's a good pinpoint.)
>
> Anyway, thanks! My log rotation was broken so I was unable to trace back
> the root cause, but I will report if it happens again!
Super! Phew. Please reply back if you make any progress on the root
cause. If it comes down to a bug in Lucene we've got to ferret it out
and get it fixed.
> One last thing... can I be sure that the latest inserted documents were
> in fact inserted into that broken segment? Or are they placed randomly
> in the different segments?
Well, when documents are added they are always added into the "latest"
segment (when the writer flushes its RAMDirectory buffer). Then, on
merging/optimizing, this segment will be merged with others before
it. If you are certain it is the "newest" (ie, biggest name in base
36) segment(s) that you've lost, and you add your docs in increasing ID
order, then I believe getting the max(ID) that's in the index and
re-indexing the docs above that will make your index complete
again. Best to do some serious testing to be sure though :)
Mike
Re: Reviving a dead index
Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
One last thing... can I be sure that the latest inserted documents were in
fact inserted into that broken segment? Or are they placed randomly in the
different segments?
- Aleks
On Thu, 23 Nov 2006 12:39:19 +0100, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Aleksander M. Stensby wrote:
>> Hey, saw this old thread, and was just wondering if any of you solved
>> the problem? The same has happened to me now. I couldn't really trace back
>> to the origin of the problem, but the segments file references a segment
>> that is obviously corrupt/not complete...
>> I thought I might remove the incomplete segment, but then I guess this
>> would f* up my segments file. Any takers on how to remove the segment in
>> question and make the rest of the index work again?.. Since I guess
>> it's not as straightforward to just remove the segment name from the
>> segments file without changing some of the other cryptic bytes as well..?
>
> I don't think the root cause was ever uncovered on this thread.
>
> Do you have any copying process to move your index from one machine to
> another, or across mount points on the same machine, or anything? A
> copying step has been the cause of similar corruption in past issues.
> I would really like to get to the root cause of any and all
> corruption.
>
> To recover your index, it would be fairly simple to write a tool that
> reads the segments files, removes the known bad segments, and writes
> the segments file back out. Something like this (I haven't tested!):
>
> Directory dir = FSDirectory.getDirectory("/path/to/my/index", false);
> SegmentInfos sis = new SegmentInfos();
> sis.read(dir);
> for (int i = 0; i < sis.size(); i++) {
>   SegmentInfo si = (SegmentInfo) sis.elementAt(i);
>   if (si.name.equals("_XXXX")) {
>     sis.removeElementAt(i);
>     break;
>   }
> }
> sis.write(dir);
>
> Please make a backup copy of your index before running this in case
> it messes anything up!!
>
> And note that by removing an entire segment, many documents are now
> gone from your index. Determining which ones are gone and must be
> reindexed is not particularly easy unless you have a way to do so in
> your application...
>
> Also note that Lucene will not delete the now un-referenced (bad)
> segment file (ie, _XXXX.cfs, and other _XXXX files like _XXXX.del if
> it exists) so you will have to do that step manually. (The current
> "trunk" version of Lucene does in fact remove unreferenced index
> files correctly but this hasn't been released yet).
>
> Mike
>
>> - Aleksander
>> On Thu, 31 Aug 2006 03:42:28 +0200, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>
>>> Stanislav Jordanov wrote:
>>>
>>>> For a moment I wondered what exactly you meant by "compound file".
>>>> Then I read http://lucene.apache.org/java/docs/fileformats.html and
>>>> got the idea.
>>>> I do not have access to the specific machine that all this is
>>>> happening on.
>>>> It is an 80x86 machine running Win 2003 server.
>>>> Sorry, but they neglected my question about whether the index is stored
>>>> on a local FS or on NFS.
>>>> I was only able to obtain a directory listing of the index dir and
>>>> guess what - there's no _1j8s.cfs file at all!
>>>> Pity, I can't have a look at the segments file, but I guess it lists
>>>> _1j8s.
>>>> Given these scarce resources, can you give me some further advice
>>>> about what has happened and what can be done to prevent it from
>>>> happening again?
>>>
>>> I'm assuming this is easily repeated (question from my last email) and
>>> not a transient error? If it's transient, this could be explained by
>>> locking not working properly.
>>>
>>> If it's not transient (ie, happens every time you open this index),
>>> it sounds like indeed the segments file is referencing a segment that
>>> does not exist.
>>>
>>> But, how the index got into this state is a mystery. I don't know of
>>> any existing Lucene bugs that can do this. Furthermore, crashing
>>> an indexing process should not lead to this (it can lead to other
>>> things like only having a segments.new file and no segments file).
>>>
>>> Were there any earlier exceptions (before indexing hit an "improper
>>> shutdown") in your indexing process that could give a clue as to the root
>>> cause? Or, for example, was the machine rebooted, and did Windows run
>>> a "filesystem check" on rebooting this box (which can remove corrupt
>>> files)?
>>>
>>> Mike
>>>
>> --Aleksander M. Stensby
>> Software Developer
>> Integrasco A/S
>> aleksander.stensby@integrasco.no
>> Tlf.: +47 41 22 82 72
--
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no
Tlf.: +47 41 22 82 72
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Aleksander M. Stensby wrote:
> Hey, saw this old thread, and was just wondering if any of you solved
> the problem? The same has happened to me now. I couldn't really trace back
> to the origin of the problem, but the segments file references a segment
> that is obviously corrupt/not complete...
>
> I thought I might remove the incomplete segment, but then I guess this
> would f* up my segments file. Any takers on how to remove the segment in
> question and make the rest of the index work again?.. Since I guess it's
> not as straightforward to just remove the segment name from the segments
> file without changing some of the other cryptic bytes as well..?
I don't think the root cause was ever uncovered on this thread.
Do you have any copying process to move your index from one machine to
another, or across mount points on the same machine, or anything? A
copying step has been the cause of similar corruption in past issues.
I would really like to get to the root cause of any and all
corruption.
To recover your index, it would be fairly simple to write a tool that
reads the segments files, removes the known bad segments, and writes
the segments file back out. Something like this (I haven't tested!):
Directory dir = FSDirectory.getDirectory("/path/to/my/index", false);
SegmentInfos sis = new SegmentInfos();
sis.read(dir);
for (int i = 0; i < sis.size(); i++) {
  SegmentInfo si = (SegmentInfo) sis.elementAt(i);
  if (si.name.equals("_XXXX")) {
    sis.removeElementAt(i);
    break;
  }
}
sis.write(dir);
Please make a backup copy of your index before running this in case
it messes anything up!!
And note that by removing an entire segment, many documents are now
gone from your index. Determining which ones are gone and must be
reindexed is not particularly easy unless you have a way to do so in
your application...
Also note that Lucene will not delete the now un-referenced (bad)
segment file (ie, _XXXX.cfs, and other _XXXX files like _XXXX.del if
it exists) so you will have to do that step manually. (The current
"trunk" version of Lucene does in fact remove unreferenced index
files correctly but this hasn't been released yet).
Mike
>
> - Aleksander
>
>
> On Thu, 31 Aug 2006 03:42:28 +0200, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>
>> Stanislav Jordanov wrote:
>>
>>> For a moment I wondered what exactly you meant by "compound file".
>>> Then I read http://lucene.apache.org/java/docs/fileformats.html and
>>> got the idea.
>>> I do not have access to the specific machine that all this is
>>> happening on.
>>> It is an 80x86 machine running Win 2003 server.
>>> Sorry, but they neglected my question about whether the index is stored
>>> on a local FS or on NFS.
>>> I was only able to obtain a directory listing of the index dir and
>>> guess what - there's no _1j8s.cfs file at all!
>>> Pity, I can't have a look at the segments file, but I guess it lists
>>> _1j8s.
>>> Given these scarce resources, can you give me some further advice
>>> about what has happened and what can be done to prevent it from
>>> happening again?
>>
>> I'm assuming this is easily repeated (question from my last email) and
>> not a transient error? If it's transient, this could be explained by
>> locking not working properly.
>>
>> If it's not transient (ie, happens every time you open this index),
>> it sounds like indeed the segments file is referencing a segment that
>> does not exist.
>>
>> But, how the index got into this state is a mystery. I don't know of
>> any existing Lucene bugs that can do this. Furthermore, crashing
>> an indexing process should not lead to this (it can lead to other things
>> like only having a segments.new file and no segments file).
>>
>> Were there any earlier exceptions (before indexing hit an "improper
>> shutdown") in your indexing process that could give a clue as to the root
>> cause? Or, for example, was the machine rebooted, and did Windows run
>> a "filesystem check" on rebooting this box (which can remove corrupt
>> files)?
>>
>> Mike
>>
> --Aleksander M. Stensby
> Software Developer
> Integrasco A/S
> aleksander.stensby@integrasco.no
> Tlf.: +47 41 22 82 72
>
Re: Reviving a dead index
Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
Hey, saw this old thread, and was just wondering if any of you solved the
problem? The same has happened to me now. I couldn't really trace back to the
origin of the problem, but the segments file references a segment that is
obviously corrupt/not complete...
I thought I might remove the incomplete segment, but then I guess this
would f* up my segments file. Any takers on how to remove the segment in
question and make the rest of the index work again?.. Since I guess it's
not as straightforward to just remove the segment name from the segments
file without changing some of the other cryptic bytes as well..?
- Aleksander
On Thu, 31 Aug 2006 03:42:28 +0200, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Stanislav Jordanov wrote:
>
>> For a moment I wondered what exactly you meant by "compound file".
>> Then I read http://lucene.apache.org/java/docs/fileformats.html and got
>> the idea.
>> I do not have access to the specific machine that all this is
>> happening on.
>> It is an 80x86 machine running Win 2003 server.
>> Sorry, but they neglected my question about whether the index is stored
>> on a local FS or on NFS.
>> I was only able to obtain a directory listing of the index dir and
>> guess what - there's no _1j8s.cfs file at all!
>> Pity, I can't have a look at the segments file, but I guess it lists
>> _1j8s.
>> Given these scarce resources, can you give me some further advice about
>> what has happened and what can be done to prevent it from happening
>> again?
>
> I'm assuming this is easily repeated (question from my last email) and
> not a transient error? If it's transient, this could be explained by
> locking not working properly.
> If it's not transient (i.e., it happens every time you open this index),
> it sounds like the segments file is indeed referencing a segment that
> does not exist.
> But how the index got into this state is a mystery. I don't know of
> any existing Lucene bugs that can do this. Furthermore, crashing
> an indexing process should not lead to this (it can lead to other things,
> like only having a segments.new file and no segments file).
> Were there any earlier exceptions (before indexing hit an "improper
> shutdown") in your indexing process that could give a clue as to the root
> cause? Or, for example, was the machine rebooted, and did Windows run
> a "filesystem check" on rebooting this box (which can remove corrupt
> files)?
>
> Mike
>
--
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no
Tlf.: +47 41 22 82 72
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Stanislav Jordanov wrote:
> For a moment I wondered what exactly you meant by "compound file"?
> Then I read http://lucene.apache.org/java/docs/fileformats.html and got
> the idea.
> I do not have access to the specific machine where all this is
> happening.
> It is an 80x86 machine running Win 2003 Server.
> Sorry, but they neglected my question about whether the index is stored
> on a local FS or on NFS.
> I was only able to obtain a directory listing of the index dir and guess
> what - there is no _1j8s.cfs file at all!
> Pity, I can't have a look at the segments file, but I guess it lists _1j8s.
> Given these scarce resources, can you give me some further advice about
> what has happened and what can be done to prevent it from happening again?
I'm assuming this is easily repeated (question from my last email) and
not a transient error? If it's transient, this could be explained by
locking not working properly.
If it's not transient (i.e., it happens every time you open this index),
it sounds like the segments file is indeed referencing a segment that
does not exist.
But how the index got into this state is a mystery. I don't know of
any existing Lucene bugs that can do this. Furthermore, crashing
an indexing process should not lead to this (it can lead to other things,
like only having a segments.new file and no segments file).
Were there any earlier exceptions (before indexing hit an "improper
shutdown") in your indexing process that could give a clue as to the root
cause? Or, for example, was the machine rebooted, and did Windows run
a "filesystem check" on rebooting this box (which can remove corrupt
files)?
Mike
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Stanislav Jordanov wrote:
> After all, Lucene's CFS format is an abstraction over the OS's native
> FS, and the app should not be trying to open a native FS file named *.fnm
> when it is supposed to open the corresponding *.cfs file and "manually"
> extract the *.fnm file from it.
> Right?
Yes, good catch :)
This always confuses people, but it's actually "normal" (when a segment's
files are missing) because Lucene first checks whether the compound file
exists, and if it does, it will use that. If it does not, it falls back
to opening the individual files directly against the filesystem.
So, when there is a problem and a given segment is referenced but does
not exist, you will see this [confusing] exception making it look like
Lucene "forgot" that it's using the compound file format.
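[That lookup order can be sketched in plain Java. This is illustrative code only, not Lucene's actual implementation; the class and method names are invented for the example:]

```java
import java.io.File;
import java.io.FileNotFoundException;

public class SegmentOpenSketch {
    // Mirrors the lookup order described above: if the segment's compound
    // file (_N.cfs) exists, every sub-file is read from inside it; only
    // when it does not exist is there a fallback to opening the individual
    // files (_N.fnm, _N.fdt, ...) directly. That is why a missing segment
    // surfaces as a FileNotFoundException on the .fnm file rather than on
    // the .cfs file.
    public static String resolveFieldInfos(File indexDir, String segment)
            throws FileNotFoundException {
        File cfs = new File(indexDir, segment + ".cfs");
        if (cfs.exists()) {
            return cfs.getName();   // would be read from inside the compound file
        }
        File fnm = new File(indexDir, segment + ".fnm");
        if (fnm.exists()) {
            return fnm.getName();   // direct, non-compound lookup
        }
        // Neither exists: the segment referenced by the segments file is gone,
        // and the error mentions the .fnm file, not the .cfs file.
        throw new FileNotFoundException(fnm.getPath());
    }
}
```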
[Still intending to respond to your previous email but a bit busy right
now...]
Mike
Re: Reviving a dead index
Posted by Stanislav Jordanov <st...@sirma.bg>.
I missed something that may be very important.
I find it really strange that the exception log reads:
java.io.FileNotFoundException: F:\Indexes\index1\_16f6.fnm (The system
cannot find the file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:127)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:42)
After all, Lucene's CFS format is an abstraction over the OS's native
FS, and the app should not be trying to open a native FS file named *.fnm
when it is supposed to open the corresponding *.cfs file and "manually"
extract the *.fnm file from it.
Right?
Re: Reviving a dead index
Posted by Stanislav Jordanov <st...@sirma.bg>.
Michael McCandless wrote:
This means the segments file is referencing a segment named _1j8s and
in trying to load that segment, the first thing Lucene does is load the
"field infos" (_1j8s.fnm). It tries to do so from a compound file (if
you have it turned on & it exists), else from the filesystem directly.
Michael,
For a moment I wondered what exactly you meant by "compound file"?
Then I read http://lucene.apache.org/java/docs/fileformats.html and got
the idea.
I do not have access to the specific machine where all this is happening.
It is an 80x86 machine running Win 2003 Server.
Sorry, but they neglected my question about whether the index is stored on
a local FS or on NFS.
I was only able to obtain a directory listing of the index dir and guess
what - there is no _1j8s.cfs file at all!
Pity, I can't have a look at the segments file, but I guess it lists _1j8s.
Given these scarce resources, can you give me some further advice about
what has happened and what can be done to prevent it from happening again?
Regards,
Stanislav
> Stanislav Jordanov wrote:
>> What might be the possible reason for an IndexReader failing to open
>> properly,
>> because it can not find a .fnm file that is expected to be there:
>
> This means the segments file is referencing a segment named _1j8s and
> in trying to load that segment, the first thing Lucene does is load
> the "field infos" (_1j8s.fnm). It tries to do so from a compound file
> (if you have it turned on & it exists), else from the filesystem
> directly.
>
> Which version of Lucene are you using? And which OS are you running on?
>
> Is this error easily repeated (not a transient error)? I.e.,
> instantiating an IndexSearcher against your index always causes this
> exception? Because, this sort of exception is certainly possible when
> Lucene's locking is not working correctly (for example over NFS), but
> in that case it's typically very intermittent.
>
> Could you send a list of the files in your index?
>
>> The only thing that comes to my mind is that last time the indexing
>> process was not shut down properly.
>> Is there a way to revive the index, or should everything be reindexed
>> from scratch?
>
> Hmmm. It's surprising that an improper shutdown caused this because
> when the IndexWriter commits its changes, it first writes all files for
> the new segment and only when that's successful does it write a new
> segments file referencing the newly written segment. Could you
> provide some more detail about your setup and how the improper
> shutdown happened?
>
> Mike
>
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Stanislav Jordanov wrote:
> What might be the possible reason for an IndexReader failing to open
> properly,
> because it can not find a .fnm file that is expected to be there:
This means the segments file is referencing a segment named _1j8s and
in trying to load that segment, the first thing Lucene does is load the
"field infos" (_1j8s.fnm). It tries to do so from a compound file (if
you have it turned on & it exists), else from the filesystem directly.
Which version of Lucene are you using? And which OS are you running on?
Is this error easily repeated (not a transient error)? I.e.,
instantiating an IndexSearcher against your index always causes this
exception? Because, this sort of exception is certainly possible when
Lucene's locking is not working correctly (for example over NFS), but in
that case it's typically very intermittent.
Could you send a list of the files in your index?
> The only thing that comes to my mind is that last time the indexing
> process was not shut down properly.
> Is there a way to revive the index, or should everything be reindexed
> from scratch?
Hmmm. It's surprising that an improper shutdown caused this because
when the IndexWriter commits its changes, it first writes all files for
the new segment and only when that's successful does it write a new
segments file referencing the newly written segment. Could you provide
some more detail about your setup and how the improper shutdown happened?
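[That commit order - fully write everything for the new segment first, then switch the segments file over in one step - can be sketched with a plain write-then-rename. This is illustrative only; the names are invented and Lucene's real commit logic is more involved, but a crash at any point before the rename leaves the old segments file, and hence the old index, intact:]

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class CommitSketch {
    // Sketch of a two-phase commit: the new "segments" contents are first
    // written in full to a side file (segments.new), and only then renamed
    // over the live "segments" file. Readers never see a half-written
    // segments file; a crash before the rename leaves the old one in place.
    public static void commit(File indexDir, String newSegmentsContent)
            throws IOException {
        File tmp = new File(indexDir, "segments.new");
        FileWriter w = new FileWriter(tmp);
        try {
            w.write(newSegmentsContent);
        } finally {
            w.close();
        }
        File segments = new File(indexDir, "segments");
        segments.delete();  // on Windows, renameTo won't overwrite an existing file
        if (!tmp.renameTo(segments)) {
            throw new IOException("could not rename " + tmp + " to " + segments);
        }
    }
}
```

[This also matches the failure mode Mike mentions above: a crash between writing segments.new and the rename leaves only a segments.new file behind, but never a segments file pointing at half-written segment data.]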
Mike