Posted to dev@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2003/09/18 22:09:37 UTC

Revival of Dmitry's Term Vector patches

Dmitry and others,

One of the more frequently requested features is 'conceptual
search', or 'search by similarity'.  Lucene does not store term
vectors in its index, so such searches cannot currently be supported.
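
To make the idea concrete: a term vector is the list of terms in a
document together with their frequencies, and 'search by similarity'
compares such vectors.  Below is a minimal sketch, assuming plain
in-memory maps and cosine similarity as the measure.  It does not use
any Lucene API, and the class and method names (TermVectorSketch,
termVector, cosine) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene or of Dmitry's patches.
class TermVectorSketch {

    /** Builds a term -> frequency map from already-tokenized text. */
    static Map<String, Integer> termVector(String[] tokens) {
        Map<String, Integer> tv = new HashMap<>();
        for (String t : tokens) {
            tv.merge(t, 1, Integer::sum);   // count each occurrence
        }
        return tv;
    }

    /** Cosine similarity of two term vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;          // shared term
            }
            normA += e.getValue() * e.getValue();
        }
        for (int f : b.values()) {
            normB += f * f;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Without stored term vectors, the frequencies for a given document
would have to be rebuilt by re-analyzing its text, which is the gap
the patches are meant to close.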

However, almost two years ago, Dmitry provided a large set of patches
that added term vector support to Lucene.  We never applied those
patches for some reason, even though they looked really good.
The other day I looked at Dmitry's two-year-old email again.
I applied a few of the diffs to my copy of Lucene and added the new
classes that Dmitry wrote for term vector support to the source tree.
Unfortunately, lots of classes have changed over the last two years,
and not all of the patches apply.

I was wondering, Dmitry, if you have your term vector changes
integrated with the current version of Lucene.  If you do, would it be
possible for you to send the patches again?

Also, I noticed that a large portion of those patches contained a good
amount of documentation (code comments, Javadocs).  Dmitry obviously
studied the code in depth :)  I will try extracting at least the
documentation from that contribution.

Finally, Dmitry, if you have term vector support in your local copy of
the current Lucene sources, how are you going to make patches
containing only the changes that you outlined in the recent email?
Are term vector changes gone or....?

Thanks,
Otis


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Revival of Dmitry's Term Vector patches

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello Damian,

Could you please do the following:

- Use diff -uN from the top directory instead of using --recursive?
- Make sure CVS directories are excluded
- Create a new bug report in Bugzilla (link on Lucene's home page) and
attach the output of that diff -uN?  You can capture the output using >
(diff -uN > patch.diff)

Thanks,
Otis


--- Damian Gajda <dg...@caltha.pl> wrote:
> In a message of Fri, 14-11-2003, 16:47, Otis Gospodnetic writes:
> > Yes, yes, please - documentation patches by themselves are valuable,
> > too!
> > 
> > Thanks,
> > Otis
> 
> These are only 5 files and just a few comments from Dmitry.
> 
> Have fun.
> -- 
> Damian Gajda
> Caltha Sp. j.
> Warszawa 02-807
> ul. Kukułki 2
> tel. +48 22 643 20 20
> mobile: +48 501 032 506
> http://www.caltha.pl/
> > diff --recursive
>
/home/damian/java_packages/jakarta-lucene/src/java/org/apache/lucene/analysis/CVS/Entries
>
/home/damian/eclipse/plus/jakarta-lucene/src/java/org/apache/lucene/analysis/CVS/Entries
> 20,21c20,21
> < /CharTokenizer.java/1.4/Mon Nov 17 20:50:28 2003//
> < /PorterStemFilter.java/1.4/Mon Nov 17 20:50:28 2003//
> ---
> > /CharTokenizer.java/1.4/Mon Nov 17 20:09:57 2003//
> > /PorterStemFilter.java/1.4/Mon Nov 17 20:09:57 2003//
> diff --recursive
>
/home/damian/java_packages/jakarta-lucene/src/java/org/apache/lucene/analysis/standard/CVS/Entries
>
/home/damian/eclipse/plus/jakarta-lucene/src/java/org/apache/lucene/analysis/standard/CVS/Entries
> 12c12
> < /StandardAnalyzer.java/1.6/Mon Nov 17 20:50:28 2003//
> ---
> > /StandardAnalyzer.java/1.6/Mon Nov 17 20:09:57 2003//
> diff --recursive
>
/home/damian/java_packages/jakarta-lucene/src/java/org/apache/lucene/document/Field.java
>
/home/damian/eclipse/plus/jakarta-lucene/src/java/org/apache/lucene/document/Field.java
> 164a165,166
> >   /** Create a field by specifying all parameters.
> >    */
> diff --recursive
>
/home/damian/java_packages/jakarta-lucene/src/java/org/apache/lucene/index/FieldInfos.java
>
/home/damian/eclipse/plus/jakarta-lucene/src/java/org/apache/lucene/index/FieldInfos.java
> 70a71,76
> > /** Access to the Field Info file that describes document fields
> and whether or
> >  *  not they are indexed. Each segment has a separate Field Info
> file. Objects
> >  *  of this class is thread-safe for multiple readers, but only one
> thread can
> >  *  be adding documents at a time, with no other reader or writer
> threads
> >  *  accessing this object.
> >  */
> 96a103,106
> >   /** Adds in information for a set of FieldInfos.
> >    *  Returns an array mapping each field number in the
> <code>names</code>
> >    *  collection to the field numbers in this one.
> >    */
> 103a114,117
> >   /** If the field is not yet known, adds it. If it is known,
> checks
> > 	*  to make sure that the isIndexed flag is the same as was given
> > 	*  previously for this field. If not - throws
> IllegalStateException.
> > 	*/
> diff --recursive
>
/home/damian/java_packages/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMergeInfo.java
>
/home/damian/eclipse/plus/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMergeInfo.java
> 59a60,63
> > /** Data container to work with SegmentMergeQueue. Represents a
> single segment
> >  *  to be merged. Maintains the segment reader, TermEnum, and
> TermPositions
> >  *  for this segment.
> >  */
> 60a65
> >   /** The current term of this segment, or null if none. */
> 61a67,68
> > 
> >   /** Index of the 0th document from this segment in the merged
> document numbering. */
> 62a70,71
> > 
> >   /** This segment's term enum. Do not use directly. */
> 63a73,74
> > 
> >   /** This segment's reader. Do not use directly. */
> 64a76,77
> > 
> >   /** Postings for the current term. */
> 65a79,85
> > 
> > 
> >   /** Maps around deleted docs. Contains a slot for each document
> in the
> >    *  reader. Slots corresponding to deleted docs have the value of
> -1. The
> >    *  rest have their new document numbers that start at 0. This
> value
> >    *  added to <code>base</code> is the document number in the
> merged numbering.
> >    */
> 67a88,91
> >   /** Create a new merge info. Base <code>b</code> is a starting
> >    *  number for documents from this segment in the merged document
> >    *  numbering.
> >    */
> 89a114,119
> > 
> >   /** Shift to the next term on this segment's TermEnum. The new
> >    *  term becomes the current term for this segment, effecting the
> >    *  ordering of the SegmentMergeQueue. If no more terms remain
> >    *  in this segment, returns false and resets the current term to
> null.
> >    */
> diff --recursive
>
/home/damian/java_packages/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMergeQueue.java
>
/home/damian/eclipse/plus/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMergeQueue.java
> 59a60,63
> > /** Priority queue of SegmentMergeInfo objects. The queue sorts the
> >  *  info objects by their current term, and if the terms are equal,
> >  *  by their base offset.
> >  */
> diff --recursive
>
/home/damian/java_packages/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
>
/home/damian/eclipse/plus/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
> 79a80,83
> >   /** Create a segment merger that will merge a number of segments
> (specified
> >    *  as SegmentReaders added to this object with calls to
> <code>add</code>) into a
> >    *  single segment with the specified <code>name</code>.
> >    */
> 85a90,92
> >   /** Add segment reader to be merged.
> >    *
> >    */
> 89a97,99
> >   /** Return one of the segment readers being merged.
> >    *
> >    */
> 93a104,106
> >   /** Start the merge. All segment readers to be merged must have
> been added
> >    *  prior to this call.
> >    */
> 150a164,166
> >   /** Merge the field information from the segment readers.
> >    *  Called from <code>merge</code>.
> >    */
> 183a200,202
> >   /** Merge the term index, frequency and proximity information
> >    *  from specified segment readers. Called from
> <code>merge</code>.
> >    */
> 200a220,221
> >   /** Merge the term index information. Called from
> <code>mergeTerms</code>.
> >    */
> 201a223,224
> > 	// Create and populate a priority queue of segments to be merged.
> > 	// Segments are sorted by their top term and the base doc number
> in the merged segment.
> 222a246,247
> >       // pop off the queue and put into match[] all segments
> >       // that have the same term at the top
> 227a253,254
> >       // perform the merge for all segments that are positioned on
> >       // the same term
> 229a257,258
> >       // advance the matched segments to the next term and, if one
> exists, put
> >       // the segment back onto the queue (priority queue takes care
> of sorting them)
> 241a271,278
> > 
> >   /** Merge one term found in one or more segments. The array
> <code>smis</code>
> >    *  contains segments that are positioned at the same term.
> <code>N</code>
> >    *  is the number of cells in the array actually occupied.
> >    *
> >    * @param smis array of segments
> >    * @param n number of cells in the array actually occupied
> >    */
> 255a293,300
> >   /** Process postings from multiple segments all positioned on the
> >    *  same term. Writes out merged entries into freqOutput and
> >    *  the proxOutput streams.
> >    *
> >    * @param smis array of segments
> >    * @param n number of cells in the array actually occupied
> >    * @return number of documents across all segments where this
> term was found
> >    */
> 297a343,346
> > 
> >   /** Merge field normalization factors for the specified segment
> readers.
> >    *  Called from <code>merge</code>.
> >    */
> diff --recursive
>
/home/damian/java_packages/jakarta-lucene/src/java/org/apache/lucene/queryParser/CVS/Entries
>
/home/damian/eclipse/plus/jakarta-lucene/src/java/org/apache/lucene/queryParser/CVS/Entries
> 10,11c10,11
> < /QueryParser.java/1.7/Mon Nov 17 20:50:28 2003//
> < /QueryParser.jj/1.37/Mon Nov 17 20:50:28 2003//
> ---
> > /QueryParser.java/1.7/Mon Nov 17 20:09:57 2003//
> > /QueryParser.jj/1.37/Mon Nov 17 20:09:57 2003//
> 
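
The docMap comment in the SegmentMergeInfo part of the diff above
describes a renumbering scheme that is easy to try in isolation.
Below is a minimal sketch, assuming a boolean array in place of
Lucene's deleted-documents bit vector.  The class and method names
(DocMapSketch, buildDocMap, mergedDocNumber) are hypothetical and not
taken from the Lucene sources.

```java
// Hypothetical illustration of "mapping around deleted docs"; not Lucene code.
class DocMapSketch {

    /**
     * deleted[i] is true if document i of the segment is deleted.
     * Returns one slot per document: -1 for deleted documents, otherwise
     * the document's new number, counted from 0 over the surviving docs.
     */
    static int[] buildDocMap(boolean[] deleted) {
        int[] docMap = new int[deleted.length];
        int next = 0;
        for (int i = 0; i < deleted.length; i++) {
            docMap[i] = deleted[i] ? -1 : next++;
        }
        return docMap;
    }

    /** Document number in the merged numbering, or -1 if the doc is deleted. */
    static int mergedDocNumber(int base, int[] docMap, int doc) {
        int mapped = docMap[doc];
        return mapped < 0 ? -1 : base + mapped;
    }
}
```

For a segment whose base is 100 and whose second of three documents
is deleted, the third document ends up as number 101 in the merged
segment.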


Re: Revival of Dmitry's Term Vector patches

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

Thanks Damian!

This came inlined for me (i.e. wrapped/broken lines, etc.)
It could be just my email client (Yahoo's web mail) inlining this.

If anyone received this as a real attachment, could you apply this
patch?

Thanks,
Otis


--- Damian Gajda <dg...@caltha.pl> wrote:
> Hello Otis,
> 
> Here is a patch with documentation from Dmitry.
> 
> I used
> cvs diff -uN
> 
> Hope it is OK now.
> 
> -- 
> Damian
> 
> > Index: src/java/org/apache/lucene/document/Field.java
> ===================================================================
> RCS file:
>
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/document/Field.java,v
> retrieving revision 1.11
> diff -u -r1.11 Field.java
> --- src/java/org/apache/lucene/document/Field.java	20 Mar 2003
> 18:28:13 -0000	1.11
> +++ src/java/org/apache/lucene/document/Field.java	9 Dec 2003
> 19:39:05 -0000
> @@ -162,6 +162,8 @@
>      is used.  Exactly one of stringValue() and readerValue() must be
> set. */
>    public Reader readerValue()	{ return readerValue; }
>  
> +  /** Create a field by specifying all parameters.
> +   */
>    public Field(String name, String string,
>  	       boolean store, boolean index, boolean token) {
>      if (name == null)
> Index: src/java/org/apache/lucene/index/FieldInfos.java
> ===================================================================
> RCS file:
>
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/FieldInfos.java,v
> retrieving revision 1.4
> diff -u -r1.4 FieldInfos.java
> --- src/java/org/apache/lucene/index/FieldInfos.java	21 Oct 2003
> 17:59:16 -0000	1.4
> +++ src/java/org/apache/lucene/index/FieldInfos.java	9 Dec 2003
> 19:39:05 -0000
> @@ -68,6 +68,12 @@
>  import org.apache.lucene.store.OutputStream;
>  import org.apache.lucene.store.InputStream;
>  
> +/** Access to the Field Info file that describes document fields and
> whether or
> + *  not they are indexed. Each segment has a separate Field Info
> file. Objects
> + *  of this class is thread-safe for multiple readers, but only one
> thread can
> + *  be adding documents at a time, with no other reader or writer
> threads
> + *  accessing this object.
> + */
>  final class FieldInfos {
>    private Vector byNumber = new Vector();
>    private Hashtable byName = new Hashtable();
> @@ -94,6 +100,10 @@
>      }
>    }
>  
> +  /** Adds in information for a set of FieldInfos.
> +   *  Returns an array mapping each field number in the
> <code>names</code>
> +   *  collection to the field numbers in this one.
> +   */
>    final void add(Collection names, boolean isIndexed) {
>      Iterator i = names.iterator();
>      while (i.hasNext()) {
> @@ -101,6 +111,10 @@
>      }
>    }
>  
> +  /** If the field is not yet known, adds it. If it is known, checks
> +	*  to make sure that the isIndexed flag is the same as was given
> +	*  previously for this field. If not - throws
> IllegalStateException.
> +	*/
>    final void add(String name, boolean isIndexed) {
>      FieldInfo fi = fieldInfo(name);
>      if (fi == null)
> Index: src/java/org/apache/lucene/index/SegmentMergeInfo.java
> ===================================================================
> RCS file:
>
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMergeInfo.java,v
> retrieving revision 1.2
> diff -u -r1.2 SegmentMergeInfo.java
> --- src/java/org/apache/lucene/index/SegmentMergeInfo.java	21 Oct
> 2003 17:59:16 -0000	1.2
> +++ src/java/org/apache/lucene/index/SegmentMergeInfo.java	9 Dec 2003
> 19:39:06 -0000
> @@ -57,14 +57,38 @@
>  import java.io.IOException;
>  import org.apache.lucene.util.BitVector;
>  
> +/** Data container to work with SegmentMergeQueue. Represents a
> single segment
> + *  to be merged. Maintains the segment reader, TermEnum, and
> TermPositions
> + *  for this segment.
> + */
>  final class SegmentMergeInfo {
> +  /** The current term of this segment, or null if none. */
>    Term term;
> +
> +  /** Index of the 0th document from this segment in the merged
> document numbering. */
>    int base;
> +
> +  /** This segment's term enum. Do not use directly. */
>    TermEnum termEnum;
> +
> +  /** This segment's reader. Do not use directly. */
>    IndexReader reader;
> +
> +  /** Postings for the current term. */
>    TermPositions postings;
> +
> +
> +  /** Maps around deleted docs. Contains a slot for each document in
> the
> +   *  reader. Slots corresponding to deleted docs have the value of
> -1. The
> +   *  rest have their new document numbers that start at 0. This
> value
> +   *  added to <code>base</code> is the document number in the
> merged numbering.
> +   */
>    int[] docMap = null;				  // maps around deleted docs
>  
> +  /** Create a new merge info. Base <code>b</code> is a starting
> +   *  number for documents from this segment in the merged document
> +   *  numbering.
> +   */
>    SegmentMergeInfo(int b, TermEnum te, IndexReader r)
>      throws IOException {
>      base = b;
> @@ -87,6 +111,12 @@
>      }
>    }
>  
> +
> +  /** Shift to the next term on this segment's TermEnum. The new
> +   *  term becomes the current term for this segment, effecting the
> +   *  ordering of the SegmentMergeQueue. If no more terms remain
> +   *  in this segment, returns false and resets the current term to
> null.
> +   */
>    final boolean next() throws IOException {
>      if (termEnum.next()) {
>        term = termEnum.term();
> Index: src/java/org/apache/lucene/index/SegmentMergeQueue.java
> ===================================================================
> RCS file:
>
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMergeQueue.java,v
> retrieving revision 1.1.1.1
> diff -u -r1.1.1.1 SegmentMergeQueue.java
> --- src/java/org/apache/lucene/index/SegmentMergeQueue.java	18 Sep
> 2001 16:29:53 -0000	1.1.1.1
> +++ src/java/org/apache/lucene/index/SegmentMergeQueue.java	9 Dec
> 2003 19:39:06 -0000
> @@ -57,6 +57,10 @@
>  import java.io.IOException;
>  import org.apache.lucene.util.PriorityQueue;
>  
> +/** Priority queue of SegmentMergeInfo objects. The queue sorts the
> + *  info objects by their current term, and if the terms are equal,
> + *  by their base offset.
> + */
>  final class SegmentMergeQueue extends PriorityQueue {
>    SegmentMergeQueue(int size) {
>      initialize(size);
> Index: src/java/org/apache/lucene/index/SegmentMerger.java
> ===================================================================
> RCS file:
>
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java,v
> retrieving revision 1.6
> diff -u -r1.6 SegmentMerger.java
> --- src/java/org/apache/lucene/index/SegmentMerger.java	31 Oct 2003
> 09:28:44 -0000	1.6
> +++ src/java/org/apache/lucene/index/SegmentMerger.java	9 Dec 2003
> 19:39:07 -0000
> @@ -77,20 +77,33 @@
>      "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis"
>    };
>    
> +  /** Create a segment merger that will merge a number of segments
> (specified
> +   *  as SegmentReaders added to this object with calls to
> <code>add</code>) into a
> +   *  single segment with the specified <code>name</code>.
> +   */
>    SegmentMerger(Directory dir, String name, boolean compoundFile) {
>      directory = dir;
>      segment = name;
>      useCompoundFile = compoundFile;
>    }
>  
> +  /** Add segment reader to be merged.
> +   *
> +   */
>    final void add(IndexReader reader) {
>      readers.addElement(reader);
>    }
>  
> +  /** Return one of the segment readers being merged.
> +   *
> +   */
>    final IndexReader segmentReader(int i) {
>      return (IndexReader)readers.elementAt(i);
>    }
>  
> +  /** Start the merge. All segment readers to be merged must have
> been added
> +   *  prior to this call.
> +   */
>    final int merge() throws IOException {
>      int value;
>      try {
> @@ -148,6 +161,9 @@
>    }
>    
>    
> +  /** Merge the field information from the segment readers.
> +   *  Called from <code>merge</code>.
> +   */
>    private final int mergeFields() throws IOException {
>      fieldInfos = new FieldInfos();		  // merge field names
>      int docCount = 0;
> @@ -181,6 +197,9 @@
>    private TermInfosWriter termInfosWriter = null;
>    private SegmentMergeQueue queue = null;
>  
> +  /** Merge the term index, frequency and proximity information
> +   *  from specified segment readers. Called from
> <code>merge</code>.
> +   */
>    private final void mergeTerms() throws IOException {
>      try {
>        freqOutput = directory.createFile(segment + ".frq");
> @@ -198,7 +217,11 @@
>      }
>    }
>  
> +  /** Merge the term index information. Called from
> <code>mergeTerms</code>.
> +   */
>    private final void mergeTermInfos() throws IOException {
> +	// Create and populate a priority queue of segments to be merged.
> +	// Segments are sorted by their top term and the base doc number in
> the merged segment.
>      queue = new SegmentMergeQueue(readers.size());
>      int base = 0;
>      for (int i = 0; i < readers.size(); i++) {
> @@ -220,13 +243,19 @@
>        Term term = match[0].term;
>        SegmentMergeInfo top = (SegmentMergeInfo)queue.top();
>        
> +      // pop off the queue and put into match[] all segments
> +      // that have the same term at the top
>        while (top != null && term.compareTo(top.term) == 0) {
>          match[matchSize++] = (SegmentMergeInfo)queue.pop();
>          top = (SegmentMergeInfo)queue.top();
>        }
>  
> +      // perform the merge for all segments that are positioned on
> +      // the same term
>        mergeTermInfo(match, matchSize);		  // add new TermInfo
>        
> +      // advance the matched segments to the next term and, if one
> exists, put
> +      // the segment back onto the queue (priority queue takes care
> of sorting them)
>        while (matchSize > 0) {
>          SegmentMergeInfo smi = match[--matchSize];
>          if (smi.next())
> @@ -239,6 +268,14 @@
>  
>    private final TermInfo termInfo = new TermInfo(); // minimize
> consing
>  
> +
> +  /** Merge one term found in one or more segments. The array
> <code>smis</code>
> +   *  contains segments that are positioned at the same term.
> <code>N</code>
> +   *  is the number of cells in the array actually occupied.
> +   *
> +   * @param smis array of segments
> +   * @param n number of cells in the array actually occupied
> +   */
>    private final void mergeTermInfo(SegmentMergeInfo[] smis, int n)
>         throws IOException {
>      long freqPointer = freqOutput.getFilePointer();
> @@ -253,6 +290,14 @@
>      }
>    }
>  
> +  /** Process postings from multiple segments all positioned on the
> +   *  same term. Writes out merged entries into freqOutput and
> +   *  the proxOutput streams.
> +   *
> +   * @param smis array of segments
> +   * @param n number of cells in the array actually occupied
> +   * @return number of documents across all segments where this term
> was found
> +   */
>    private final int appendPostings(SegmentMergeInfo[] smis, int n)
>         throws IOException {
>      int lastDoc = 0;
> @@ -295,6 +340,10 @@
>      }
>      return df;
>    }
> +
> +  /** Merge field normalization factors for the specified segment
> readers.
> +   *  Called from <code>merge</code>.
> +   */
>    private final void mergeNorms() throws IOException {
>      for (int i = 0; i < fieldInfos.size(); i++) {
>        FieldInfo fi = fieldInfos.fieldInfo(i);
> 
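
The loop that the new mergeTermInfos comments describe (pop every
segment positioned on the same term, merge them, advance each one and
push it back) is an ordinary k-way merge.  Below is a minimal sketch,
assuming java.util.PriorityQueue in place of SegmentMergeQueue and
sorted string lists in place of TermEnum.  The Cursor class and every
other name here are hypothetical stand-ins, not Lucene's classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of the merge loop; not Lucene code.
class TermMergeSketch {

    /** Stand-in for SegmentMergeInfo: one segment's sorted terms and its doc base. */
    static final class Cursor implements Comparable<Cursor> {
        final int base;
        final Iterator<String> terms;
        String term;                      // current term, or null if none

        Cursor(int base, List<String> sortedTerms) {
            this.base = base;
            this.terms = sortedTerms.iterator();
        }

        /** Moves to the next term; returns false and clears term when exhausted. */
        boolean next() {
            if (terms.hasNext()) {
                term = terms.next();
                return true;
            }
            term = null;
            return false;
        }

        /** Orders by current term, then by base, as the queue Javadoc describes. */
        public int compareTo(Cursor o) {
            int c = term.compareTo(o.term);
            return c != 0 ? c : Integer.compare(base, o.base);
        }
    }

    /** Returns one line per distinct term, listing the bases of the segments holding it. */
    static List<String> merge(List<Cursor> segments) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (Cursor c : segments) {
            if (c.next()) queue.add(c);   // only segments that have a first term
        }
        List<String> out = new ArrayList<>();
        List<Cursor> match = new ArrayList<>();
        while (!queue.isEmpty()) {
            // pop all segments that have the same term at the top
            match.clear();
            match.add(queue.poll());
            String term = match.get(0).term;
            while (!queue.isEmpty() && queue.peek().term.equals(term)) {
                match.add(queue.poll());
            }
            // "merge" the matched segments (here: just record their bases)
            List<Integer> bases = new ArrayList<>();
            for (Cursor c : match) bases.add(c.base);
            out.add(term + " <- " + bases);
            // advance each matched segment and re-queue it if terms remain
            for (Cursor c : match) {
                if (c.next()) queue.add(c);
            }
        }
        return out;
    }
}
```

The tie-break on base means that segments holding the same term come
off the queue in the order of their position in the merged document
numbering.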


Re: Revival of Dmitry's Term Vector patches

Posted by Damian Gajda <dg...@caltha.pl>.
Hello Otis,

Here is a patch with documentation from Dmitry.

I used
cvs diff -uN

Hope it is OK now.

-- 
Damian


Re: Revival of Dmitry's Term Vector patches

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Damian,

One more thing - I think the CVS diff will be better if you make the
code changes directly in your checked-out CVS working copy.  Judging
from your diff, it looks like you made changes in a separate directory
hierarchy.

Thanks!
Otis



Re: Revival of Dmitry's Term Vector patches

Posted by Damian Gajda <dg...@caltha.pl>.
In a message of Fri, 14-11-2003, 16:47, Otis Gospodnetic writes:
> Yes, yes, please - documentation patches by themselves are valuable,
> too!
> 
> Thanks,
> Otis

These are only 5 files and just a few comments from Dmitry.

Have fun.
-- 
Damian Gajda
Caltha Sp. j.
Warszawa 02-807
ul. Kukułki 2
tel. +48 22 643 20 20
mobile: +48 501 032 506
http://www.caltha.pl/

Re: Revival of Dmitry's Term Vector patches

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yes, yes, please - documentation patches by themselves are valuable,
too!

Thanks,
Otis

--- Damian Gajda <dg...@caltha.pl> wrote:
> > >Also, I noticed that a large portion of those patches contained a good
> > >amount of documentation (code comments, Javadocs).  Dmitry obviously
> > >studied the code in depth :)  I will try extracting at least the
> > >documentation from that contribution.
> > >
> > Yes, I did read it end to end - boy, was that a learning experience! :)
> 
> I'm working on merging Dmitry's code into recent CVS Lucene (from
> about two weeks ago). Because of some architectural changes, it's not
> fully compilable and needs some work - especially in the segment
> merging code.
> 
> Concerning Dmitry's comments - I might send you patches including
> them. If you want, I will do that during the weekend.
> 
> Regards,
> -- 
> Damian Gajda
> Caltha Sp. j.
> Warszawa 02-807
> ul. Kukułki 2
> tel. +48 22 643 20 20
> mobile: +48 501 032 506
> http://www.caltha.pl/
> 
> 


Re: Revival of Dmitry's Term Vector patches

Posted by Damian Gajda <dg...@caltha.pl>.
> >Also, I noticed that a large portion of those patches contained a good
> >amount of documentation (code comments, Javadocs).  Dmitry obviously
> >studied the code in depth :)  I will try extracting at least the
> >documentation from that contribution.
> >
> Yes, I did read it end to end - boy, was that a learning experience! :)

I'm working on merging Dmitry's code into recent CVS Lucene (from about
two weeks ago). Because of some architectural changes, it's not fully
compilable and needs some work - especially in the segment merging code.

Concerning Dmitry's comments - I might send you patches including them.
If you want, I will do that during the weekend.

Regards,
-- 
Damian Gajda
Caltha Sp. j.
Warszawa 02-807
ul. Kukułki 2
tel. +48 22 643 20 20
mobile: +48 501 032 506
http://www.caltha.pl/




Re: Revival of Dmitry's Term Vector patches

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Otis Gospodnetic wrote:

>Dmitry and others,
>
>One of the relatively frequently asked for features is 'conceptual
>search', or 'search by similarity', etc.  Lucene does not store term
>vectors in its index, so such searches cannot be supported.
>
>However, almost two years ago, Dmitry provided a large set of patches
>that added term vector support to Lucene.  We never applied those
>patches for some reason, even though the patches looked really good.
>The other day I looked at Dmitry's two year old email again.
>I applied a few diffs to my copy of Lucene and added new classes that
>Dmitry wrote in order to add term vector support, to the source tree.
>Unfortunately, lots of classes changed over the last two years, and not
>all patches will apply.
>
>I was wondering, Dmitry, if you have your term vector changes
>integrated with the current version of Lucene.  If you do, would it be
>possible for you to send the patches again?
>
Well, it's actually not that simple. The code of Lucene that we use is 
pretty heavily modified (by the term vector patch and by a few later 
additions, such as the TermEnum patch from 6 months ago or so). What I'd 
like to do with the file handles is to make changes in the current 
Lucene sources, do the testing and all, and then port the changes into 
our version of Lucene. This way the contribution will be readily usable. 
The term vector patches that I sent before are out there, so feel free 
to incorporate them into Lucene, but I can't really spend time on them 
right now. Plus, I think that from an IP point of view, those changes allow 
the company I work for to do things with Lucene that our competitors 
can't readily do, and these things happen to be very much key to our 
value proposition, so I really can't publish any more of those changes 
yet. Now, if Lucene acquired a similar capability from what I already 
published or from some other source, perhaps we could contribute to that 
effort later in smaller ways.

A great thing about the Apache license is that it allows this kind of 
flexibility (IANAL). This is just where I'm comfortable drawing the line 
right now. Sorry if this comes across as ungrateful... We are really 
very appreciative of the Lucene project and of the community, and we'll 
try to contribute in other ways, but this one is not available any 
more/yet. :)

>Also, I noticed that a large portion of those patches contained a good
>amount of documentation (code comments, Javadocs).  Dmitry obviously
>studied the code in depth :)  I will try extracting at least the
>documentation from that contribution.
>
Yes, I did read it end to end - boy, was that a learning experience! :)

>
>Finally, Dmitry, if you have term vector support in your local copy of
>the current Lucene sources, how are you going to make patches
>containing only the changes that you outlined in the recent email?
>Are term vector changes gone or....?
>
Like I said above, I'll be working with the current Lucene from CVS up 
until the changes are final, then I will port them to my copy of 
Lucene.
Perhaps later we can get back to the TermEnum changes as well. Those I 
could contribute (well, actually I already did :) ). The gist there is 
that I was able to reduce garbage collection on certain operations 
substantially, but I think someone reported that the code did not work 
correctly in some cases (must be uses of Lucene that we do not 
experience in our environment).

Thanks for digging the term vectors back out, Otis.
Dmitry.


