Posted to dev@lucene.apache.org by Christoph Goller <go...@detego-software.de> on 2003/09/03 15:58:55 UTC

PATCH: SegmentsReader/SegmentsTermEnum

Hi Lucene Developers,

first let me thank you all for this excellent piece of software
that you created. I am using Lucene in several projects and I
am currently also building more advanced text mining applications
on top of it. Because of that I have spent a lot of time studying
the Lucene sources, and I will come up with a couple of proposals
for bug fixes over the next few days. Here is the first one:

I think I can fix a bug in SegmentsTermEnum.
One can create a TermEnum from an IndexReader in two ways:

indexReader.terms()
indexReader.terms(t)

If one gets a TermEnum starting at a specified term t, one does not
have to call enum.next() before using it. The enum is valid from the
beginning. Calling enum.next() switches to the next term. However, this
behaviour only holds if the index consists of a single segment. If the
index consists of several segments, term t is delivered twice: the
first time by enum.term() right after calling indexReader.terms(t),
and the second time after calling enum.next(). Furthermore, the initial
document frequency may be wrong (if t occurs in more than one segment).
The problem can be fixed by calling next() in the constructor of
SegmentsTermEnum. I attach a test that demonstrates the problem and a
patch that fixes it.
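
To make the effect easier to see without a Lucene checkout, here is a
minimal, self-contained sketch of the merging logic. It is not the
SegmentsTermEnum source: the class MergedTermEnumSketch, the
SegmentCursor helper, the callNext flag and the two-segment layout are
all hypothetical stand-ins, and the assumption is simply that each
segment delivers its terms in sorted order together with a per-segment
document frequency.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Simplified model of a term enumeration merged over several segments (not Lucene code). */
class MergedTermEnumSketch {

    /** Cursor over one segment's sorted terms and their per-segment document frequencies. */
    static final class SegmentCursor {
        final String[] terms;
        final int[] docFreqs;
        int pos;

        SegmentCursor(String[] terms, int[] docFreqs, int pos) {
            this.terms = terms;
            this.docFreqs = docFreqs;
            this.pos = pos;
        }
        String term() { return terms[pos]; }
        int docFreq() { return docFreqs[pos]; }
        boolean advance() { return ++pos < terms.length; }
    }

    private final PriorityQueue<SegmentCursor> queue =
        new PriorityQueue<>(Comparator.comparing(SegmentCursor::term));
    private String term;
    private int docFreq;

    MergedTermEnumSketch(String[][] segTerms, int[][] segFreqs, String start, boolean callNext) {
        for (int s = 0; s < segTerms.length; s++) {
            int pos = 0; // seek to the first term >= start in this segment
            while (pos < segTerms[s].length && segTerms[s][pos].compareTo(start) < 0) pos++;
            if (pos < segTerms[s].length) queue.add(new SegmentCursor(segTerms[s], segFreqs[s], pos));
        }
        if (queue.isEmpty()) return;
        if (callNext) {
            next(); // proposed behaviour: merge all segments positioned at the first term
        } else {
            SegmentCursor top = queue.peek(); // old behaviour: copy from one segment only
            term = top.term();
            docFreq = top.docFreq();
        }
    }

    String term() { return term; }
    int docFreq() { return docFreq; }

    /** Moves to the next distinct term, summing docFreq over all segments that contain it. */
    boolean next() {
        SegmentCursor top = queue.peek();
        if (top == null) { term = null; return false; }
        term = top.term();
        docFreq = 0;
        while (top != null && top.term().equals(term)) {
            docFreq += top.docFreq();
            queue.poll();                       // remove before mutating the cursor
            if (top.advance()) queue.add(top);  // re-insert if the segment has more terms
            top = queue.peek();
        }
        return true;
    }

    /** Lists "term docFreq" starting at "aaa" over two assumed segments. */
    static List<String> collect(boolean callNext) {
        String[][] terms = { {"aaa"}, {"aaa", "bbb"} };
        int[][] freqs = { {100}, {100, 100} };
        MergedTermEnumSketch e = new MergedTermEnumSketch(terms, freqs, "aaa", callNext);
        List<String> out = new ArrayList<>();
        out.add(e.term() + " " + e.docFreq()); // valid without calling next() first
        while (e.next()) out.add(e.term() + " " + e.docFreq());
        return out;
    }

    public static void main(String[] args) {
        System.out.println("without next() in constructor: " + collect(false));
        System.out.println("with next() in constructor:    " + collect(true));
    }
}
```

In this model, collect(false) yields [aaa 100, aaa 200, bbb 100], i.e.
the start term shows up twice and its first document frequency covers
only one segment, while collect(true) yields [aaa 200, bbb 100].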

kind regards,
Christoph

-- 
*****************************************************************
* Dr. Christoph Goller       Tel.:   +49 89 203 45734           *
* Detego Software GmbH       Mobile: +49 179 1128469            *
* Keuslinstr. 13             Fax.:   +49 721 151516176          *
* 80798 München, Germany     Email:  goller@detego-software.de  *
*****************************************************************

Re: PATCH: SegmentsReader/SegmentsTermEnum

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Christoph,

Thanks for the patch and the test.
I refactored your test a bit, converted it to a JUnit-based unit test
and will commit it shortly, following it with your patch.

Thank you,
Otis

--- Christoph Goller <go...@detego-software.de> wrote:
> import java.io.IOException;
> 
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermEnum;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> 
> /*
>  * Created on 23.04.2003
>  *
>  * To change the template for this generated file go to
>  * Window>Preferences>Java>Code Generation>Code and Comments
>  */
> 
> /**
>  * @author goller
>  *
>  * To change the template for this generated type comment go to
>  * Window>Preferences>Java>Code Generation>Code and Comments
>  */
> public class SegmentsTermEnumTest {
> 
>   int docCount = 0;
>   
>   void addDoc1(IndexWriter writer)
>   {
>     Document doc = new Document();
>     
>     doc.add(Field.Keyword("id","id" + docCount));
>     doc.add(Field.UnStored("content","aaa"));
>     
>     try {
>       writer.addDocument(doc);
>     }
>     catch (IOException e) {
>       // TODO Auto-generated catch block
>       e.printStackTrace();
>     }
>     docCount++;
>   }
>   
>   void addDoc2(IndexWriter writer)
>   {
>     Document doc = new Document();
>     
>     doc.add(Field.Keyword("id","id" + docCount));
>     doc.add(Field.UnStored("content","aaa bbb"));
>     
>     try {
>       writer.addDocument(doc);
>     }
>     catch (IOException e) {
>       // TODO Auto-generated catch block
>       e.printStackTrace();
>     }
>     docCount++;
>   }
>   
>   
>   
>   public static void main(String[] args) 
>   {
>     //System.out.println(System.getProperty("java.version"));
>     
>     Directory dir = new RAMDirectory();
>     SegmentsTermEnumTest test = new SegmentsTermEnumTest();
>     
>     IndexWriter writer = null;
>     IndexReader reader = null;
>     TermEnum enum = null;
>     int i;
>     
>     try {
>       writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
>       
>       for (i = 0; i < 100; i++)
>         test.addDoc1(writer);
>       
>       for (i = 0; i < 100; i++)
>         test.addDoc2(writer);
>       
>       writer.close();
>     }
>     catch (IOException e) {
>       // TODO Auto-generated catch block
>       e.printStackTrace();
>     }
>     
>     
>     try {
>       reader = IndexReader.open(dir);
>   
>       System.out.println("terms():");
>       enum = reader.terms();
>       for(i = 0; i < 5 && enum.next(); i++)
>         System.out.println(enum.term() + " " + enum.docFreq());
>       
>       enum.close();
>       
>       System.out.println();
>       System.out.println("terms(\"aaa\")");
>       enum = reader.terms(new Term("content", "aaa"));
>       System.out.println(enum.term() + " " + enum.docFreq());
>       for(i = 0; i < 5 && enum.next(); i++)
>         System.out.println(enum.term() + " " + enum.docFreq());
>         
>       enum.close();
>       reader.close();
>       
>       writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
>       writer.optimize();
>       writer.close();
>    
>       System.out.println();
>       System.out.println("optimize");
>       
>       reader = IndexReader.open(dir);
>   
>       System.out.println();
>       System.out.println("terms():");
>       enum = reader.terms();
>       for(i = 0; i < 5 && enum.next(); i++)
>         System.out.println(enum.term() + " " + enum.docFreq());
>       
>       enum.close();
>       
>       System.out.println();
>       System.out.println("terms(\"aaa\")");
>       enum = reader.terms(new Term("content", "aaa"));
>       System.out.println(enum.term() + " " + enum.docFreq());
>       for(i = 0; i < 5 && enum.next(); i++)
>         System.out.println(enum.term() + " " + enum.docFreq());
>         
>       enum.close();
>       reader.close();
>       
>     }
>     catch (IOException e2) {
>       // TODO Auto-generated catch block
>       e2.printStackTrace();
>     }
>     
> 
>     
>     
>     
>     
>   }
> }
> Index: SegmentsReader.java
> ===================================================================
> RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentsReader.java,v
> retrieving revision 1.11
> diff -u -r1.11 SegmentsReader.java
> --- SegmentsReader.java	1 May 2003 01:09:15 -0000	1.11
> +++ SegmentsReader.java	3 Sep 2003 13:03:27 -0000
> @@ -238,9 +238,7 @@
>      }
>  
>      if (t != null && queue.size() > 0) {
> -      SegmentMergeInfo top = (SegmentMergeInfo)queue.top();
> -      term = top.termEnum.term();
> -      docFreq = top.termEnum.docFreq();
> +      next();
>      }
>    }
>  
> 

