Posted to java-user@lucene.apache.org by Stanislav Jordanov <st...@sirma.bg> on 2006/08/29 10:09:47 UTC
Reviving a dead index
What might cause an IndexReader to fail to open
because it cannot find a .fnm file that is expected to be there:
java.io.FileNotFoundException: E:\index4\_1j8s.fnm (The system cannot
find the file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at
org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
at
org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:127)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:42)
The only thing that comes to mind is that the last time it ran, the
indexing process was not shut down properly.
Is there a way to revive the index, or must everything be reindexed
from scratch?
Thanks,
Stanislav
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Installing a custom tokenizer
Posted by Mark Miller <ma...@gmail.com>.
Bill Taylor wrote:
>
> On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
>
>> I'm in a real rush here, so pardon my brevity, but..... one of the
>> constructors for IndexWriter takes an Analyzer as a parameter, which
>> can be
>> a PerFieldAnalyzerWrapper. That, if I understand your issue, should
>> fix you
>> right up.
>
> That almost worked. I can't use a per-field analyzer because I have
> to process the content fields of all documents. I built a custom
> analyzer which extended StandardAnalyzer and replaced the
> tokenStream method with a new one which used WhitespaceTokenizer
> instead of StandardTokenizer. This meant that my document IDs were
> not split, but I lost the conversion of acronyms such as w.o. to wo
> and the like.
>
> So what I need to do is to make a new Tokenizer based on the
> StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj
> should be
>
> | <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>)>
>
> so that a serial number need not have a digit in every other segment
> and a series of letters and digits without special characters such as
> a dash will be treated as a single word.
>
> Questions:
>
> 1) If I change the .jj file in this way, how do I run JavaCC to make a
> new tokenizer? The JavaCC documentation says that JavaCC generates a
> number of output files; I think that I only need the tokenizer code.
>
> 2) I suppose I have to tell the query parser to parse queries in the
> same way, is that right?
>
> The reason I think so is that Luke says I have words such as w.o. in
> the index which the query parser can't find. I suspect I have to use
> the same Analyzer on both, right?
>
Get JavaCC and run it on StandardTokenizer.jj. This should be as simple
as typing 'javacc StandardTokenizer.jj'... I believe that with no output
folder specified, all of the files will be built in the current
directory. Don't worry about trying to generate only the files you
need--JavaCC will handle everything for you. If you use Eclipse I
recommend the JavaCC plug-in. I find it very handy.
Generally you must run the same analyzer that you indexed with on your
search strings... if StandardAnalyzer parses oldman-83 to oldman
while indexing and you use WhitespaceAnalyzer while searching, then you
will attempt to find oldman-83 in the index instead of oldman (which is
what StandardAnalyzer stored).
- Mark
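[Editor's note: the mismatch Mark describes can be reproduced with plain Java string splitting. This is only a rough stand-in for the real analyzers; the regexes are simplifications of what StandardAnalyzer and WhitespaceAnalyzer actually do, and the class name is hypothetical.]

```java
import java.util.*;

public class TokenMismatchDemo {
    // Rough stand-in for StandardAnalyzer at index time:
    // split on anything that is not a letter or digit.
    static List<String> indexTokens(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z0-9]+"));
    }

    // Rough stand-in for WhitespaceAnalyzer at query time:
    // split on whitespace only, so "oldman-83" stays one token.
    static List<String> queryTokens(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(indexTokens("oldman-83 was here"));
        for (String q : queryTokens("oldman-83")) {
            // "oldman-83" is not among {oldman, 83, was, here} -> no hit
            System.out.println(q + " found: " + indexed.contains(q));
        }
    }
}
```

The query term "oldman-83" never matches the punctuation-split index terms, which is exactly why the same analyzer must be used on both sides.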
Re: Straight TF-IDF cosine similarity?
Posted by Jason Polites <ja...@gmail.com>.
Have you looked at the MoreLikeThis class in the similarity package?
On 8/30/06, Winton Davies <wd...@yahoo-inc.com> wrote:
>
> Hi All,
>
> I'm scratching my head - can someone tell me which class implements
> an efficient multiple term TF.IDF Cosine similarity scoring mechanism?
>
> There is clearly the single TermScorer - but I can't find the class
> that would do a bucketed TF.IDF cosine - i.e. fill an accumulator
> with the tf.idf^2 for each of the term posting lists, until the
> accumulator is full, and then compute the final score.
>
> I don't need a Boolean Query - at least this seems like overkill.
>
> Cheers,
> Winton
>
Straight TF-IDF cosine similarity?
Posted by Winton Davies <wd...@yahoo-inc.com>.
Hi All,
I'm scratching my head - can someone tell me which class implements
an efficient multiple term TF.IDF Cosine similarity scoring mechanism?
There is clearly the single TermScorer - but I can't find the class
that would do a bucketed TF.IDF cosine - i.e. fill an accumulator
with the tf.idf^2 for each of the term posting lists, until the
accumulator is full, and then compute the final score.
I don't need a Boolean Query - at least this seems like overkill.
Cheers,
Winton
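[Editor's note: the accumulator scheme Winton describes can be sketched in plain Java, independent of Lucene. This is a toy term-at-a-time scorer; the class, data layout (term -> docId -> term frequency), and method names are hypothetical, and the scores are left unnormalized.]

```java
import java.util.*;

public class TfIdfAccumulator {
    /**
     * Term-at-a-time scoring: walk each query term's posting list
     * (docId -> term frequency) and accumulate tf * idf^2 per document,
     * i.e. the tf.idf^2 contributions described above.
     */
    static Map<Integer, Double> score(Map<String, Map<Integer, Integer>> postings,
                                      int numDocs, List<String> queryTerms) {
        Map<Integer, Double> acc = new HashMap<>();
        for (String term : queryTerms) {
            Map<Integer, Integer> postingList = postings.get(term);
            if (postingList == null) continue;   // term not in the index
            double idf = Math.log((double) numDocs / postingList.size());
            for (Map.Entry<Integer, Integer> e : postingList.entrySet()) {
                double contribution = e.getValue() * idf * idf;  // tf * idf^2
                acc.merge(e.getKey(), contribution, Double::sum);
            }
        }
        // Unnormalized scores; divide by document norms for a true cosine.
        return acc;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        Map<Integer, Integer> pl = new HashMap<>();
        pl.put(1, 3);  // doc 1: tf = 3
        pl.put(2, 1);  // doc 2: tf = 1
        postings.put("serial", pl);
        System.out.println(score(postings, 4, Arrays.asList("serial")));
    }
}
```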
Re: Installing a custom tokenizer
Posted by Bill Taylor <wa...@as-st.com>.
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
> I'm in a real rush here, so pardon my brevity, but..... one of the
> constructors for IndexWriter takes an Analyzer as a parameter, which
> can be
> a PerFieldAnalyzerWrapper. That, if I understand your issue, should
> fix you
> right up.
That almost worked. I can't use a per-field analyzer because I have to
process the content fields of all documents. I built a custom analyzer
which extended StandardAnalyzer and replaced the tokenStream
method with a new one which used WhitespaceTokenizer instead of
StandardTokenizer. This meant that my document IDs were not split, but
I lost the conversion of acronyms such as w.o. to wo and the like.
So what I need to do is to make a new Tokenizer based on the
StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj
should be
| <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>)>
so that a serial number need not have a digit in every other segment
and a series of letters and digits without special characters such as a
dash will be treated as a single word.
Questions:
1) If I change the .jj file in this way, how do I run JavaCC to make a
new tokenizer? The JavaCC documentation says that JavaCC generates a
number of output files; I think that I only need the tokenizer code.
2) I suppose I have to tell the query parser to parse queries in the
same way, is that right?
The reason I think so is that Luke says I have words such as w.o. in
the index which the query parser can't find. I suspect I have to use
the same Analyzer on both, right?
> On 8/29/06, Bill Taylor <wa...@as-st.com> wrote:
>>
>> I am indexing documents which are filled with government jargon. As
>> one would expect, the standard tokenizer has problems with
>> governmenteese.
>>
>> In particular, the documents use words such as 310N-P-Q as references
>> to other documents. The standard tokenizer breaks this "word" at the
>> dashes so that I can find P or Q but not the entire token.
>>
>> I know how to write a new tokenizer. I would like hints on how to
>> install it and get my indexing system to use it. I don't want to
>> modify the standard .jar file. What I think I want to do is set up my
>> indexing operation to use the WhitespaceTokenizer instead of the
>> normal
>> one, but I am unsure how to do this.
>>
>> I know that the IndexTask has a setAnalyzer method. The document
>> formats are rather complicated and I need special code to isolate the
>> text strings which should be indexed. My file analyzer isolates the
>> string I want to index, then does
>>
>> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
>> Field.Store.YES, Field.Index.TOKENIZED));
>>
>> I suspect that my issue is getting the Field constructor to use a
>> different tokenizer. Can anyone help?
>>
>> Thanks.
>>
>> Bill Taylor
>>
>>
Re: Installing a custom tokenizer
Posted by Erick Erickson <er...@gmail.com>.
I'm in a real rush here, so pardon my brevity, but..... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.
Same kind of thing for a Query.
Erick
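[Editor's note: a minimal sketch of Erick's suggestion against the 2006-era Lucene API, untested here; the field name "contents", the index path, and the class name are placeholders.]

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerFieldExample {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer for every field, except whitespace-only
        // tokenization for the field holding IDs like 310N-P-Q.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("contents", new WhitespaceAnalyzer());

        IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
        // ... add documents, then:
        writer.close();

        // "Same kind of thing for a Query": parse with the same wrapper,
        // so 310N-P-Q is not split at query time either.
        Query q = new QueryParser("contents", analyzer).parse("310N-P-Q");
    }
}
```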
On 8/29/06, Bill Taylor <wa...@as-st.com> wrote:
>
> I am indexing documents which are filled with government jargon. As
> one would expect, the standard tokenizer has problems with
> governmenteese.
>
> In particular, the documents use words such as 310N-P-Q as references
> to other documents. The standard tokenizer breaks this "word" at the
> dashes so that I can find P or Q but not the entire token.
>
> I know how to write a new tokenizer. I would like hints on how to
> install it and get my indexing system to use it. I don't want to
> modify the standard .jar file. What I think I want to do is set up my
> indexing operation to use the WhitespaceTokenizer instead of the normal
> one, but I am unsure how to do this.
>
> I know that the IndexTask has a setAnalyzer method. The document
> formats are rather complicated and I need special code to isolate the
> text strings which should be indexed. My file analyzer isolates the
> string I want to index, then does
>
> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
> Field.Store.YES, Field.Index.TOKENIZED));
>
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?
>
> Thanks.
>
> Bill Taylor
>
>
Installing a custom tokenizer
Posted by Bill Taylor <wa...@as-st.com>.
I am indexing documents which are filled with government jargon. As
one would expect, the standard tokenizer has problems with
governmenteese.
In particular, the documents use words such as 310N-P-Q as references
to other documents. The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.
I know how to write a new tokenizer. I would like hints on how to
install it and get my indexing system to use it. I don't want to
modify the standard .jar file. What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.
I know that the IndexTask has a setAnalyzer method. The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed. My file analyzer isolates the
string I want to index, then does
doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.Index.TOKENIZED));
I suspect that my issue is getting the Field constructor to use a
different tokenizer. Can anyone help?
Thanks.
Bill Taylor
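[Editor's note: one way to get whitespace tokenization without touching the standard jar, sketched against the 2006-era Lucene analysis API. Untested; the class name is hypothetical. The key point, answering Bill's last question, is that the analyzer is chosen per IndexWriter, not per Field constructor.]

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class JargonAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Whitespace-only tokenization keeps "words" like 310N-P-Q intact;
        // LowerCaseFilter normalizes case so queries match either way.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

// Installation: nothing in the standard jar changes; you pass the
// analyzer when constructing the writer, e.g.
//   IndexWriter writer =
//       new IndexWriter("/path/to/index", new JargonAnalyzer(), true);
```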
Re: Reviving a dead index
Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
Works like a charm, Michael! (The only thing is that SegmentInfos / SegmentInfo
are final classes (which I didn't know), so I was poking around to really
find the classes :) heh.
I was able to remove the broken segment. I must now get the MAX(id) from
the clean remaining segments, then just regenerate from that point on.
(I've indexed the IDs, so that's a good pinpoint.)
Anyway, thanks! My log rotation was broken so I was unable to trace back the
root cause, but I will report if it happens again!
On Thu, 23 Nov 2006 12:39:19 +0100, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Aleksander M. Stensby wrote:
>> Hey, saw this old thread, and was just wondering if any of you solved
>> the problem? The same has happened to me now. I couldn't really trace back
>> to the origin of the problem, but the segments file references a segment
>> that is obviously corrupt/not complete...
>> I thought I might remove the incomplete segment, but then I guess this
>> would f* up my segments file. Any takers on how to remove the segment in
>> question and make the rest of the index work again?.. Since I guess
>> it's not as straightforward to just remove the segment name from the
>> segments file without changing some of the other cryptic bytes as well..?
>
> I don't think the root cause was ever uncovered on this thread.
>
> Do you have any copying process to move your index from one machine to
> another, or across mount points on the same machine, or anything? A
> copying step has been the cause of similar corruption in past issues.
> I would really like to get to the root cause of any and all
> corruption.
>
> To recover your index, it would be fairly simple to write a tool that
> reads the segments files, removes the known bad segments, and writes
> the segments file back out. Something like this (I haven't tested!):
>
> Directory dir = FSDirectory.getDirectory("/path/to/my/index", false);
> SegmentInfos sis = new SegmentInfos();
> sis.read(dir);
> for (int i = 0; i < sis.size(); i++) {
>   SegmentInfo si = (SegmentInfo) sis.elementAt(i);
>   if (si.name.equals("_XXXX")) {
>     sis.removeElementAt(i);
>     break;
>   }
> }
> sis.write(dir);
>
> Please make a backup copy of your index before running this in case
> it messes anything up!!
>
> And note that by removing an entire segment, many documents are now
> gone from your index. Determining which ones are gone and must be
> reindexed is not particularly easy unless you have a way to do so in
> your application...
>
> Also note that Lucene will not delete the now un-referenced (bad)
> segment file (ie, _XXXX.cfs, and other _XXXX files like _XXXX.del if
> it exists) so you will have to do that step manually. (The current
> "trunk" version of Lucene does in fact remove unreferenced index
> files correctly but this hasn't been released yet).
>
> Mike
>
>> - Aleksander
>> On Thu, 31 Aug 2006 03:42:28 +0200, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>
>>> Stanislav Jordanov wrote:
>>>
>>>> For a moment I wondered what exactly you meant by "compound file".
>>>> Then I read http://lucene.apache.org/java/docs/fileformats.html and
>>>> got the idea.
>>>> I do not have access to the specific machine that all this is
>>>> happening on.
>>>> It is an 80x86 machine running Win 2003 server.
>>>> Sorry, but they neglected my question about whether the index is stored
>>>> on a local FS or on NFS.
>>>> I was only able to obtain a directory listing of the index dir and
>>>> guess what - there's no _1j8s.cfs file at all!
>>>> Pity, I can't have a look at the segments file, but I guess it lists
>>>> _1j8s.
>>>> Given these scarce resources, can you give me some further advice
>>>> about what has happened and what can be done to prevent it from
>>>> happening again?
>>>
>>> I'm assuming this is easily repeated (question from my last email) and
>>> not a transient error? If it's transient, this could be explained by
>>> locking not working properly.
>>>
>>> If it's not transient (ie, happens every time you open this index),
>>> it sounds like indeed the segments file is referencing a segment that
>>> does not exist.
>>>
>>> But, how the index got into this state is a mystery. I don't know of
>>> any existing Lucene bugs that can do this. Furthermore, crashing
>>> an indexing process should not lead to this (it can lead to other
>>> things like only having a segments.new file and no segments file).
>>>
>>> Were there any earlier exceptions (before indexing hit an "improper
>>> shutdown") in your indexing process that could give a clue as to the root
>>> cause? Or, for example, was the machine rebooted, and did Windows run
>>> a "filesystem check" on rebooting this box (which can remove corrupt
>>> files)?
>>>
>>> Mike
>>>
>> --Aleksander M. Stensby
>> Software Developer
>> Integrasco A/S
>> aleksander.stensby@integrasco.no
>> Tlf.: +47 41 22 82 72
--
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no
Tlf.: +47 41 22 82 72
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Aleksander M. Stensby wrote:
> Works like a charm, Michael! (The only thing is that SegmentInfos /
> SegmentInfo are final classes (which I didn't know), so I was poking
> around to really find the classes :) heh.
>
> I was able to remove the broken segment. I must now get the MAX(id) from
> the clean remaining segments, then just regenerate from that point on.
> (I've indexed the IDs, so that's a good pinpoint.)
>
> Anyway, thanks! My log rotation was broken so I was unable to trace back
> the root cause, but I will report if it happens again!
Super! Phew. Please reply back if you make any progress on the root
cause. If it comes down to a bug in Lucene we've got to ferret it out
and get it fixed.
> One last thing... can I be sure that the latest inserted documents were
> in fact inserted into that broken segment? Or are they placed randomly
> in the different segments?
Well, when documents are added they are always added into the "latest"
segment (when the writer flushes its RAMDirectory buffer). Then, on
merging/optimizing, this segment will be merged with others before
it. If you are certain it is the "newest" (ie, biggest name in base
36) segment(s) that you've lost, and you add your docs in increasing ID
order, then I believe getting the max(ID) that's in the index and
re-indexing the docs above that will make your index complete
again. Best to do some serious testing to be sure though :)
Mike
Re: Reviving a dead index
Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
One last thing... can I be sure that the latest inserted documents were in
fact inserted into that broken segment? Or are they placed randomly in the
different segments?
- Aleks
On Thu, 23 Nov 2006 12:39:19 +0100, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Aleksander M. Stensby wrote:
>> Hey, saw this old thread, and was just wondering if any of you solved
>> the problem? The same has happened to me now. I couldn't really trace back
>> to the origin of the problem, but the segments file references a segment
>> that is obviously corrupt/not complete...
>> I thought I might remove the incomplete segment, but then I guess this
>> would f* up my segments file. Any takers on how to remove the segment in
>> question and make the rest of the index work again?.. Since I guess
>> it's not as straightforward to just remove the segment name from the
>> segments file without changing some of the other cryptic bytes as well..?
>
> I don't think the root cause was ever uncovered on this thread.
>
> Do you have any copying process to move your index from one machine to
> another, or across mount points on the same machine, or anything? A
> copying step has been the cause of similar corruption in past issues.
> I would really like to get to the root cause of any and all
> corruption.
>
> To recover your index, it would be fairly simple to write a tool that
> reads the segments files, removes the known bad segments, and writes
> the segments file back out. Something like this (I haven't tested!):
>
> Directory dir = FSDirectory.getDirectory("/path/to/my/index", false);
> SegmentInfos sis = new SegmentInfos();
> sis.read(dir);
> for (int i = 0; i < sis.size(); i++) {
>   SegmentInfo si = (SegmentInfo) sis.elementAt(i);
>   if (si.name.equals("_XXXX")) {
>     sis.removeElementAt(i);
>     break;
>   }
> }
> sis.write(dir);
>
> Please make a backup copy of your index before running this in case
> it messes anything up!!
>
> And note that by removing an entire segment, many documents are now
> gone from your index. Determining which ones are gone and must be
> reindexed is not particularly easy unless you have a way to do so in
> your application...
>
> Also note that Lucene will not delete the now un-referenced (bad)
> segment file (ie, _XXXX.cfs, and other _XXXX files like _XXXX.del if
> it exists) so you will have to do that step manually. (The current
> "trunk" version of Lucene does in fact remove unreferenced index
> files correctly but this hasn't been released yet).
>
> Mike
>
>> - Aleksander
>> On Thu, 31 Aug 2006 03:42:28 +0200, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>
>>> Stanislav Jordanov wrote:
>>>
>>>> For a moment I wondered what exactly you meant by "compound file".
>>>> Then I read http://lucene.apache.org/java/docs/fileformats.html and
>>>> got the idea.
>>>> I do not have access to the specific machine that all this is
>>>> happening on.
>>>> It is an 80x86 machine running Win 2003 server.
>>>> Sorry, but they neglected my question about whether the index is stored
>>>> on a local FS or on NFS.
>>>> I was only able to obtain a directory listing of the index dir and
>>>> guess what - there's no _1j8s.cfs file at all!
>>>> Pity, I can't have a look at the segments file, but I guess it lists
>>>> _1j8s.
>>>> Given these scarce resources, can you give me some further advice
>>>> about what has happened and what can be done to prevent it from
>>>> happening again?
>>>
>>> I'm assuming this is easily repeated (question from my last email) and
>>> not a transient error? If it's transient, this could be explained by
>>> locking not working properly.
>>>
>>> If it's not transient (ie, happens every time you open this index),
>>> it sounds like indeed the segments file is referencing a segment that
>>> does not exist.
>>>
>>> But, how the index got into this state is a mystery. I don't know of
>>> any existing Lucene bugs that can do this. Furthermore, crashing
>>> an indexing process should not lead to this (it can lead to other
>>> things like only having a segments.new file and no segments file).
>>>
>>> Were there any earlier exceptions (before indexing hit an "improper
>>> shutdown") in your indexing process that could give a clue as to the root
>>> cause? Or, for example, was the machine rebooted, and did Windows run
>>> a "filesystem check" on rebooting this box (which can remove corrupt
>>> files)?
>>>
>>> Mike
>>>
>> --Aleksander M. Stensby
>> Software Developer
>> Integrasco A/S
>> aleksander.stensby@integrasco.no
>> Tlf.: +47 41 22 82 72
--
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no
Tlf.: +47 41 22 82 72
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Aleksander M. Stensby wrote:
> Hey, saw this old thread, and was just wondering if any of you solved
> the problem? The same has happened to me now. I couldn't really trace back
> to the origin of the problem, but the segments file references a segment
> that is obviously corrupt/not complete...
>
> I thought I might remove the incomplete segment, but then I guess this
> would f* up my segments file. Any takers on how to remove the segment in
> question and make the rest of the index work again?.. Since I guess it's
> not as straightforward to just remove the segment name from the segments
> file without changing some of the other cryptic bytes as well..?
I don't think the root cause was ever uncovered on this thread.
Do you have any copying process to move your index from one machine to
another, or across mount points on the same machine, or anything? A
copying step has been the cause of similar corruption in past issues.
I would really like to get to the root cause of any and all
corruption.
To recover your index, it would be fairly simple to write a tool that
reads the segments files, removes the known bad segments, and writes
the segments file back out. Something like this (I haven't tested!):
Directory dir = FSDirectory.getDirectory("/path/to/my/index", false);
SegmentInfos sis = new SegmentInfos();
sis.read(dir);
for (int i = 0; i < sis.size(); i++) {
  SegmentInfo si = (SegmentInfo) sis.elementAt(i);
  if (si.name.equals("_XXXX")) {
    sis.removeElementAt(i);
    break;
  }
}
sis.write(dir);
Please make a backup copy of your index before running this in case
it messes anything up!!
And note that by removing an entire segment, many documents are now
gone from your index. Determining which ones are gone and must be
reindexed is not particularly easy unless you have a way to do so in
your application...
Also note that Lucene will not delete the now un-referenced (bad)
segment file (ie, _XXXX.cfs, and other _XXXX files like _XXXX.del if
it exists) so you will have to do that step manually. (The current
"trunk" version of Lucene does in fact remove unreferenced index
files correctly but this hasn't been released yet).
Mike
>
> - Aleksander
>
>
> On Thu, 31 Aug 2006 03:42:28 +0200, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>
>> Stanislav Jordanov wrote:
>>
>>> For a moment I wondered what exactly you meant by "compound file".
>>> Then I read http://lucene.apache.org/java/docs/fileformats.html and
>>> got the idea.
>>> I do not have access to the specific machine that all this is
>>> happening on.
>>> It is an 80x86 machine running Win 2003 server.
>>> Sorry, but they neglected my question about whether the index is stored
>>> on a local FS or on NFS.
>>> I was only able to obtain a directory listing of the index dir and
>>> guess what - there's no _1j8s.cfs file at all!
>>> Pity, I can't have a look at the segments file, but I guess it lists
>>> _1j8s.
>>> Given these scarce resources, can you give me some further advice
>>> about what has happened and what can be done to prevent it from
>>> happening again?
>>
>> I'm assuming this is easily repeated (question from my last email) and
>> not a transient error? If it's transient, this could be explained by
>> locking not working properly.
>>
>> If it's not transient (ie, happens every time you open this index),
>> it sounds like indeed the segments file is referencing a segment that
>> does not exist.
>>
>> But, how the index got into this state is a mystery. I don't know of
>> any existing Lucene bugs that can do this. Furthermore, crashing
>> an indexing process should not lead to this (it can lead to other things
>> like only having a segments.new file and no segments file).
>>
>> Were there any earlier exceptions (before indexing hit an "improper
>> shutdown") in your indexing process that could give a clue as to the root
>> cause? Or, for example, was the machine rebooted, and did Windows run
>> a "filesystem check" on rebooting this box (which can remove corrupt
>> files)?
>>
>> Mike
>>
> --Aleksander M. Stensby
> Software Developer
> Integrasco A/S
> aleksander.stensby@integrasco.no
> Tlf.: +47 41 22 82 72
>
Re: Reviving a dead index
Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
Hey, saw this old thread, and was just wondering if any of you solved the
problem? The same has happened to me now. I couldn't really trace back to the
origin of the problem, but the segments file references a segment that is
obviously corrupt/not complete...
I thought I might remove the incomplete segment, but then I guess this
would f* up my segments file. Any takers on how to remove the segment in
question and make the rest of the index work again?.. Since I guess it's
not as straightforward to just remove the segment name from the segments
file without changing some of the other cryptic bytes as well..?
- Aleksander
On Thu, 31 Aug 2006 03:42:28 +0200, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Stanislav Jordanov wrote:
>
>> For a moment I wondered what exactly you meant by "compound file".
>> Then I read http://lucene.apache.org/java/docs/fileformats.html and got
>> the idea.
>> I do not have access to the specific machine that all this is
>> happening on.
>> It is an 80x86 machine running Win 2003 server.
>> Sorry, but they neglected my question about whether the index is stored
>> on a local FS or on NFS.
>> I was only able to obtain a directory listing of the index dir and
>> guess what - there's no _1j8s.cfs file at all!
>> Pity, I can't have a look at the segments file, but I guess it lists
>> _1j8s.
>> Given these scarce resources, can you give me some further advice about
>> what has happened and what can be done to prevent it from happening
>> again?
>
> I'm assuming this is easily repeated (question from my last email) and
> not a transient error? If it's transient, this could be explained by
> locking not working properly.
> If it's not transient (i.e., it happens every time you open this index),
> it sounds like the segments file is indeed referencing a segment that
> does not exist.
> But how the index got into this state is a mystery. I don't know of
> any existing Lucene bugs that can do this. Furthermore, crashing
> an indexing process should not lead to this (it can lead to other things,
> like only having a segments.new file and no segments file).
> Were there any earlier exceptions (before indexing hit an "improper
> shutdown") in your indexing process that could give a clue as to the root
> cause? Or, for example, was the machine rebooted, and did Windows run
> a "filesystem check" on rebooting this box (which can remove corrupt
> files)?
>
> Mike
>
--
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no
Tlf.: +47 41 22 82 72
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Stanislav Jordanov wrote:
> For a moment I wondered what exactly you meant by "compound file"?
> Then I read http://lucene.apache.org/java/docs/fileformats.html and got
> the idea.
> I do not have access to the specific machine where all this is
> happening.
> It is an 80x86 machine running Win 2003 Server.
> Sorry, but they neglected my question about whether the index is stored
> on a local FS or on NFS.
> I was only able to obtain a directory listing of the index dir and guess
> what - there is no _1j8s.cfs file at all!
> Pity, I can't have a look at the segments file, but I guess it lists _1j8s.
> Given these scarce resources, can you give me some further advice about
> what has happened and what can be done to prevent it from happening again?
I'm assuming this is easily repeated (question from my last email) and
not a transient error? If it's transient, this could be explained by
locking not working properly.
If it's not transient (i.e., it happens every time you open this index),
it sounds like the segments file is indeed referencing a segment that
does not exist.
But how the index got into this state is a mystery. I don't know of
any existing Lucene bugs that can do this. Furthermore, crashing
an indexing process should not lead to this (it can lead to other things,
like only having a segments.new file and no segments file).
Were there any earlier exceptions (before indexing hit an "improper
shutdown") in your indexing process that could give a clue as to the root
cause? Or, for example, was the machine rebooted, and did Windows run
a "filesystem check" on rebooting this box (which can remove corrupt
files)?
Mike
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Stanislav Jordanov wrote:
> After all, Lucene's CFS format is an abstraction over the OS's native
> FS, and the app should not be trying to open a native FS file named *.fnm
> when it is supposed to open the corresponding *.cfs file and "manually"
> extract the *.fnm file from it.
> Right?
Yes, good catch :)
This always confuses people, but it's actually "normal" (when a segment's
files are missing) because Lucene first checks whether the compound file
exists, and if it does, it will use that. If it does not, it falls back
to opening the individual files directly against the filesystem.
So, when there is a problem and a given segment is referenced but does
not exist, you will see this [confusing] exception making it look like
Lucene "forgot" that it's using the compound file format.
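[That lookup order can be sketched in plain Java. This is illustrative code only, not Lucene's actual implementation; the class and method names are invented for the example:]

```java
import java.io.File;
import java.io.FileNotFoundException;

public class SegmentOpenSketch {
    // Mirrors the lookup order described above: if the segment's compound
    // file (_N.cfs) exists, every sub-file is read from inside it; only
    // when it does not exist is there a fallback to opening the individual
    // files (_N.fnm, _N.fdt, ...) directly. That is why a missing segment
    // surfaces as a FileNotFoundException on the .fnm file rather than on
    // the .cfs file.
    public static String resolveFieldInfos(File indexDir, String segment)
            throws FileNotFoundException {
        File cfs = new File(indexDir, segment + ".cfs");
        if (cfs.exists()) {
            return cfs.getName();   // would be read from inside the compound file
        }
        File fnm = new File(indexDir, segment + ".fnm");
        if (fnm.exists()) {
            return fnm.getName();   // direct, non-compound lookup
        }
        // Neither exists: the segment referenced by the segments file is gone,
        // and the error mentions the .fnm file, not the .cfs file.
        throw new FileNotFoundException(fnm.getPath());
    }
}
```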
[Still intending to respond to your previous email but a bit busy right
now...]
Mike
Re: Reviving a dead index
Posted by Stanislav Jordanov <st...@sirma.bg>.
I missed something that may be very important.
I find it really strange that the exception log reads:
java.io.FileNotFoundException: F:\Indexes\index1\_16f6.fnm (The system
cannot find the file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:127)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:42)
After all, Lucene's CFS format is an abstraction over the OS's native
FS, and the app should not be trying to open a native FS file named *.fnm
when it is supposed to open the corresponding *.cfs file and "manually"
extract the *.fnm file from it.
Right?
Re: Reviving a dead index
Posted by Stanislav Jordanov <st...@sirma.bg>.
Michael McCandless wrote:
This means the segments file is referencing a segment named _1j8s and
in trying to load that segment, the first thing Lucene does is load the
"field infos" (_1j8s.fnm). It tries to do so from a compound file (if
you have it turned on & it exists), else from the filesystem directly.
Michael,
For a moment I wondered what exactly you meant by "compound file"?
Then I read http://lucene.apache.org/java/docs/fileformats.html and got
the idea.
I do not have access to the specific machine where all this is happening.
It is an 80x86 machine running Win 2003 Server.
Sorry, but they neglected my question about whether the index is stored on
a local FS or on NFS.
I was only able to obtain a directory listing of the index dir and guess
what - there is no _1j8s.cfs file at all!
Pity, I can't have a look at the segments file, but I guess it lists _1j8s.
Given these scarce resources, can you give me some further advice about
what has happened and what can be done to prevent it from happening again?
Regards,
Stanislav
> Stanislav Jordanov wrote:
>> What might be the possible reason for an IndexReader failing to open
>> properly,
>> because it can not find a .fnm file that is expected to be there:
>
> This means the segments file is referencing a segment named _1j8s and
> in trying to load that segment, the first thing Lucene does is load
> the "field infos" (_1j8s.fnm). It tries to do so from a compound file
> (if you have it turned on & it exists), else from the filesystem
> directly.
>
> Which version of Lucene are you using? And which OS are you running on?
>
> Is this error easily repeated (not a transient error)? I.e.,
> instantiating an IndexSearcher against your index always causes this
> exception? Because, this sort of exception is certainly possible when
> Lucene's locking is not working correctly (for example over NFS), but
> in that case it's typically very intermittent.
>
> Could you send a list of the files in your index?
>
>> The only thing that comes to my mind is that last time the indexing
>> process was not shut down properly.
>> Is there a way to revive the index, or should everything be reindexed
>> from scratch?
>
> Hmmm. It's surprising that an improper shutdown caused this because
> when the IndexWriter commits its changes, it first writes all files for
> the new segment and only when that's successful does it write a new
> segments file referencing the newly written segment. Could you
> provide some more detail about your setup and how the improper
> shutdown happened?
>
> Mike
>
Re: Reviving a dead index
Posted by Michael McCandless <lu...@mikemccandless.com>.
Stanislav Jordanov wrote:
> What might be the possible reason for an IndexReader failing to open
> properly,
> because it can not find a .fnm file that is expected to be there:
This means the segments file is referencing a segment named _1j8s and
in trying to load that segment, the first thing Lucene does is load the
"field infos" (_1j8s.fnm). It tries to do so from a compound file (if
you have it turned on & it exists), else from the filesystem directly.
Which version of Lucene are you using? And which OS are you running on?
Is this error easily repeated (not a transient error)? I.e.,
instantiating an IndexSearcher against your index always causes this
exception? Because, this sort of exception is certainly possible when
Lucene's locking is not working correctly (for example over NFS), but in
that case it's typically very intermittent.
Could you send a list of the files in your index?
> The only thing that comes to my mind is that last time the indexing
> process was not shut down properly.
> Is there a way to revive the index, or should everything be reindexed
> from scratch?
Hmmm. It's surprising that an improper shutdown caused this because
when the IndexWriter commits its changes, it first writes all files for
the new segment and only when that's successful does it write a new
segments file referencing the newly written segment. Could you provide
some more detail about your setup and how the improper shutdown happened?
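[That commit order - fully write everything for the new segment first, then switch the segments file over in one step - can be sketched with a plain write-then-rename. This is illustrative only; the names are invented and Lucene's real commit logic is more involved, but a crash at any point before the rename leaves the old segments file, and hence the old index, intact:]

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class CommitSketch {
    // Sketch of a two-phase commit: the new "segments" contents are first
    // written in full to a side file (segments.new), and only then renamed
    // over the live "segments" file. Readers never see a half-written
    // segments file; a crash before the rename leaves the old one in place.
    public static void commit(File indexDir, String newSegmentsContent)
            throws IOException {
        File tmp = new File(indexDir, "segments.new");
        FileWriter w = new FileWriter(tmp);
        try {
            w.write(newSegmentsContent);
        } finally {
            w.close();
        }
        File segments = new File(indexDir, "segments");
        segments.delete();  // on Windows, renameTo won't overwrite an existing file
        if (!tmp.renameTo(segments)) {
            throw new IOException("could not rename " + tmp + " to " + segments);
        }
    }
}
```

[This also matches the failure mode Mike mentions above: a crash between writing segments.new and the rename leaves only a segments.new file behind, but never a segments file pointing at half-written segment data.]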
Mike