Posted to dev@lucene.apache.org by Arvind Srinivasan <lu...@ziplip.com> on 2005/06/15 04:49:07 UTC

Data Integrity Rules

Hi,


In an earlier article, Doug Cutting described a method
to verify a segment's integrity by simply merging the segment
to a NullDirectory.

We have found several instances where a segment becomes corrupt
even though it passes the NullDirectory test. Merging to a
NullDirectory only protects us against disk errors. There are
structural errors that make segments corrupt only after a
few iterations of merges.

I would like to define a simple rule:

"A segment has data integrity if and only if 
the segment is readable and successively mergeable
 without any errors."


For example, in the current version, you can add an empty string
as a term through the DocumentWriter. This would not be a problem as
long as the segment stayed readable and successively mergeable. But after a
few merge iterations, the segment merge fails with a
"term out of order" exception in TermInfosWriter. Now you have an
inoperable search engine. Granted, the tokenizer is at fault, but a
simple issue like that should not bring the search engine down.

Similarly, we have found instances of term postings that have zero
frequency (not sure how they got into that state) and document ids
greater than the max doc of the segment. See the earlier posting or
bug (a).
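The anomalies described above (empty term text, zero frequency, document ids
beyond the max doc) can be expressed as a small sanity check. The following is
only a sketch under stated assumptions: PostingEntry and PostingSanityCheck are
hypothetical names, not Lucene classes, and the check assumes valid document
ids lie in the range [0, maxDoc).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sanity check; not part of Lucene.
class PostingSanityCheck {

    // Returns a description of every rule the posting violates;
    // an empty list means the posting looks sane.
    static List<String> violations(PostingEntry p, int maxDoc) {
        List<String> problems = new ArrayList<>();
        if (p.termText == null || p.termText.isEmpty()) {
            problems.add("empty term text");
        }
        if (p.freq <= 0) {
            problems.add("non-positive frequency: " + p.freq);
        }
        // Assumption: valid document ids are 0 <= docId < maxDoc.
        if (p.docId < 0 || p.docId >= maxDoc) {
            problems.add("doc id " + p.docId + " outside [0, " + maxDoc + ")");
        }
        return problems;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        System.out.println(violations(new PostingEntry("lucene", 3, 2), maxDoc));
        System.out.println(violations(new PostingEntry("lucene", 12, 0), maxDoc));
    }
}

// Hypothetical, simplified stand-in for one posting entry.
final class PostingEntry {
    final String termText;
    final int docId;
    final int freq;

    PostingEntry(String termText, int docId, int freq) {
        this.termText = termText;
        this.docId = docId;
        this.freq = freq;
    }
}
```

A check like this could be run over a segment's postings before a merge, so
that a bad entry is reported early instead of surfacing several merges later.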



Therefore I suggest adding a few more checks to DocumentWriter, right after
line 283 in DocumentWriter.java.


        // skip terms whose text is empty
        if (posting.term.text.length() == 0) {
            continue;
        }

        // add an entry to the freq file
        int postingFreq = posting.freq;
        // skip postings with a zero or negative frequency
        if (postingFreq <= 0) {
            continue;
        }
---
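To show the effect of the two guards above in isolation, here is a minimal,
self-contained sketch. It does not use the real DocumentWriter; SimplePosting
and PostingFilterSketch are hypothetical names introduced only for this
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not Lucene code.
class PostingFilterSketch {

    // Simplified posting: term text plus its frequency in the document.
    static final class SimplePosting {
        final String text;
        final int freq;

        SimplePosting(String text, int freq) {
            this.text = text;
            this.freq = freq;
        }
    }

    // Mirrors the two proposed guards: postings with empty term text or a
    // non-positive frequency are skipped instead of being written.
    static List<SimplePosting> writable(List<SimplePosting> postings) {
        List<SimplePosting> kept = new ArrayList<>();
        for (SimplePosting posting : postings) {
            if (posting.text.length() == 0) {
                continue;
            }
            if (posting.freq <= 0) {
                continue;
            }
            kept.add(posting);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimplePosting> input = Arrays.asList(
                new SimplePosting("", 1),        // empty term text: skipped
                new SimplePosting("lucene", 0),  // zero frequency: skipped
                new SimplePosting("index", 2));  // kept
        System.out.println(writable(input).size()); // prints 1
    }
}
```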
Also, please apply the changes to SegmentMerger suggested in bug 23650.

I also think we should create test cases that ensure segments stay robust and
are not derailed by edge cases.



See also:
(a) http://issues.apache.org/bugzilla/show_bug.cgi?id=23650
(b) http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200505.mbox/%3cN0H4L0B1EFP3JVL4EBFRBUMHLSJ3MAAGKCN2OMDT@ziplip.com%3e
(c) http://issues.apache.org/bugzilla/show_bug.cgi?id=35029

Thanks,
Arvind.


