Posted to dev@lucene.apache.org by bu...@apache.org on 2004/02/06 17:55:14 UTC
DO NOT REPLY [Bug 18927] - [PATCH] Term Vector support

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

[PATCH] Term Vector support

grant_ingersoll@yahoo.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Term Vector support         |[PATCH] Term Vector support



------- Additional Comments From grant_ingersoll@yahoo.com  2004-02-06 16:55 -------
Attached is Dmitry's code updated for 1.3.  Here are my notes on the 
implementation (which are also included in the attachment)

The patch is in the zip, is named termVector1.3Patch.txt, and
was generated using cvs diff -Nu at the root of the tree.

If there are any questions, I would be more than happy to help via the mailing
list.

-----------------------------------------------
Notes on the re-implementation of Dmitry's Term Vector enhancements for Lucene 
1.3.

Please see
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114748
for the original patch.

General Notes
-----------------------

I used Dmitry's code as a template by getting it working against 1.2 and then
going through by hand and applying it against the HEAD.  Thanks to Dmitry's
great notes, it was relatively painless.  All of the tests against HEAD pass.

Differences from 1.2 Version
----------------------------  

The most significant change I had to make is in the TermFreqVector interface:
the getTermNumbers() method has been replaced by a getTerms() method, which
returns an array of Strings.  These strings are the equivalent of Term.text()
and store the unique string that has been indexed.  While the numbering scheme
worked to save space, it presented a problem in 1.3 when it comes to merging,
because the 1.3 code can support up to Long.MAX_VALUE positions (see TermEnum
and SegmentTermEnum) versus Integer.MAX_VALUE in 1.2 (at least in my
understanding).  This prevented me from using the termMaps array technique used
in 1.2 for remapping the term numbers from the old segment to the new segment,
since a Java array is indexed by int.  To solve this, we needed some globally
unique identifier for a term.  For this, I use the term text plus the number of
the field that the term came from (which is why there are new accessor methods
on TermFreqVector called get/setFieldNum).
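
To illustrate the identifier idea, here is a minimal sketch of a key built from
the field number plus the term text.  The class name FieldTermKey and its
methods are hypothetical and are not part of the patch; the sketch only shows
that two such keys compare equal regardless of which segment they came from,
which is what makes per-segment term numbers unnecessary.

```java
import java.util.Objects;

/**
 * Hypothetical sketch (not from the patch): identifies a term by
 * (field number, term text) instead of a per-segment term number.
 */
final class FieldTermKey {
    private final int fieldNum;  // number of the field the term came from
    private final String text;   // equivalent of Term.text()

    FieldTermKey(int fieldNum, String text) {
        this.fieldNum = fieldNum;
        this.text = Objects.requireNonNull(text, "text");
    }

    int fieldNum() { return fieldNum; }

    String text() { return text; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FieldTermKey)) return false;
        FieldTermKey other = (FieldTermKey) o;
        return fieldNum == other.fieldNum && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        // Combine both parts so keys can be used in hash-based collections.
        return 31 * fieldNum + text.hashCode();
    }
}
```

The trade-off is the one noted in these notes: the key carries the full term
text, so it is larger than a number.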

The side benefit of this is that merging is much simpler: we can just iterate
over the readers and vectors and add the terms from the old TermVector to the
new TermVectorWriter, without having to do any remapping.  The downside is that
the term vector files are going to take up more space on the disk.
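
The merge loop described above can be sketched roughly as follows.  This is an
in-memory illustration only: the Vector class and the nested lists standing in
for readers and documents are hypothetical, deleted documents are ignored, and
none of this is the patch's actual mergeVectors() code.

```java
import java.util.ArrayList;
import java.util.List;

final class VectorMergeSketch {
    /** Hypothetical stand-in for one stored term vector (one field of one document). */
    static final class Vector {
        final int fieldNum;
        final String[] terms;  // term text, not term numbers
        final int[] freqs;     // freqs[i] is the frequency of terms[i]

        Vector(int fieldNum, String[] terms, int[] freqs) {
            this.fieldNum = fieldNum;
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    /**
     * Each "reader" is modelled as a list of documents, and each document as a
     * list of vectors.  The result plays the role of the new segment.
     */
    static List<List<Vector>> merge(List<List<List<Vector>>> readers) {
        List<List<Vector>> merged = new ArrayList<>();
        for (List<List<Vector>> reader : readers) {
            for (List<Vector> doc : reader) {
                // Vectors carry term text, so they are copied as-is;
                // no old-to-new term number remapping is needed.
                merged.add(new ArrayList<>(doc));
            }
        }
        return merged;
    }
}
```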

I believe I have overcome the limitation that you can only retrieve term
vectors on optimized indices.  The SegmentsReader, which previously threw
runtime exceptions from the getTermVector methods, now properly implements
them.
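
For readers unfamiliar with how a multi-segment reader can answer a
per-document request, the usual approach is to map the global document number
to a segment and a segment-local document number, then delegate.  The sketch
below shows only that mapping; the class MultiSegmentLookup and its methods are
hypothetical and I am not claiming this is how the patch's SegmentsReader is
written.

```java
/** Hypothetical sketch: maps a global document number to (segment, local doc). */
final class MultiSegmentLookup {
    // starts[i] is the first global doc number of segment i;
    // the final entry is the total number of documents.
    private final int[] starts;

    MultiSegmentLookup(int[] segmentSizes) {
        starts = new int[segmentSizes.length + 1];
        for (int i = 0; i < segmentSizes.length; i++) {
            starts[i + 1] = starts[i] + segmentSizes[i];
        }
    }

    /** Returns the index of the segment that contains global document n. */
    int segmentIndex(int n) {
        if (n < 0 || n >= starts[starts.length - 1]) {
            throw new IllegalArgumentException("document number out of range: " + n);
        }
        int lo = 0;
        int hi = starts.length - 2;
        // Binary search for the last segment whose start is <= n.
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= n) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    /** Returns the document number relative to its own segment. */
    int localDoc(int n) {
        return n - starts[segmentIndex(n)];
    }
}
```

A term vector request for document n would then be forwarded to the reader at
segmentIndex(n) with localDoc(n) as the document number.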

Compatibility
----------------------
Similar to Dmitry's, I believe the index files should be backward compatible.

Performance
----------------------
Have not run thorough performance tests, but I did do the following runs, one 
with term vectors and one 
without term vectors:

Index Size: 12598 documents with 88362 terms. The documents in question are XML 
files where all of the TEXT
was extracted and indexed.

Without TVs: 
Drive Space Used: 42 MB
Time to index: 5 minutes, 30 seconds

With TVs:
Drive Space Used: 71.3 MB
Time to index: 6 minutes, 2 seconds

Your mileage may vary.
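
For reference, the relative overhead implied by the figures above can be worked
out as below: disk usage grows by roughly 70% and indexing time by roughly 10%.
The class and method names are only for illustration.

```java
import java.util.Locale;

final class TermVectorOverhead {
    /** Percentage increase from 'before' to 'after'. */
    static double percentIncrease(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Figures taken from the two runs above.
        double sizePct = percentIncrease(42.0, 71.3);               // MB on disk
        double timePct = percentIncrease(5 * 60 + 30, 6 * 60 + 2);  // seconds to index
        System.out.printf(Locale.ROOT, "Disk: +%.1f%%, time: +%.1f%%%n", sizePct, timePct);
    }
}
```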

Limitations
------------------------
Not sure what they are yet.  I am sure there are places that could be
optimized.  The numbering scheme could probably be reinstituted by using some
type of paging array or array-of-arrays scheme that allows you to store a
really large number of values.
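
As a rough idea of what such an array-of-arrays scheme could look like, here is
a minimal sketch of a long-indexed array split into fixed-size pages that are
allocated on first write.  The class PagedLongArray is hypothetical and untested
against Lucene itself; the page size is an arbitrary choice.

```java
/** Hypothetical sketch: a long-indexed array of longs, stored as pages. */
final class PagedLongArray {
    private static final int PAGE_BITS = 16;            // 65536 entries per page
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final long[][] pages;  // pages[i] stays null until first written
    private final long length;

    PagedLongArray(long length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        long pageCount = (length + PAGE_SIZE - 1) >>> PAGE_BITS;
        if (pageCount > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("too many pages for length: " + length);
        }
        this.pages = new long[(int) pageCount][];
        this.length = length;
    }

    long length() { return length; }

    /** Returns the stored value, or 0 if the slot was never written. */
    long get(long index) {
        checkIndex(index);
        long[] page = pages[(int) (index >>> PAGE_BITS)];
        return page == null ? 0L : page[(int) (index & PAGE_MASK)];
    }

    void set(long index, long value) {
        checkIndex(index);
        int p = (int) (index >>> PAGE_BITS);
        if (pages[p] == null) {
            pages[p] = new long[PAGE_SIZE];  // allocate the page lazily
        }
        pages[p][(int) (index & PAGE_MASK)] = value;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}
```

A structure like this could stand in for the termMaps array when the number of
terms exceeds what a single Java array can hold, at the cost of an extra
indirection on every access.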

FilterIndexReader throws an UnsupportedOperationException for the new Term 
Vector methods.

I did not test with compound files.  Do not know if they are compatible.

Other limitations are probably those of omission.  That is, are the new methods 
sufficient for doing what
people need to do?  I can think of a few:
1. Since only terms and frequencies are stored, it would help to have something
that quickly calculates the actual weight of a term as it was scored for the
query.  I looked into this, but, frankly, I am fairly confused by the whole
Scorer/Similarity interaction, especially when it comes to nested queries.

2. Perhaps the Document object itself should have a method similar to those on 
IndexReader.

New File Notes
----------------------------------
src/java/org/apache/lucene/index/SegmentTermVector.java
  Implementation of TermFreqVector and TermPositionVector.
  
src/java/org/apache/lucene/index/TermFreqVector.java
  Interface for describing a Document term vector.  See notes above for what 
was changed from 1.2
  
src/java/org/apache/lucene/index/TermPositionVector.java
  No change from 1.2 version.
  
src/java/org/apache/lucene/index/TermVectorsReader.java
  Changed get methods to return TermFreqVector interface instead of explicit 
SegmentTermVector.
  Added getTermPositions method to retrieve TermPositionVector(s).
  Changed the reading slightly to match the writing of the term text instead 
of the term number.
  
src/java/org/apache/lucene/index/TermVectorsWriter.java
  Added documentation
  Changed the writing to write the term string instead of the term number
  Would be nice if there was a way to turn on or off the writing of positional 
information.  
  See the TODO comment.
    
src/test/org/apache/lucene/index/DocHelper.java
  Package local Class to help setup documents for testing.
  
src/test/org/apache/lucene/index/TestDocumentWriter.java
  New test class for the DocumentWriter object.  Probably needs to be fleshed 
out more to fully test.
  
src/test/org/apache/lucene/index/TestFieldInfos.java
  Test for the new FieldInfos return values, etc.
  
src/test/org/apache/lucene/index/TestFieldsReader.java
  Basic test for FieldsReader.  Needs to be expanded to fully test 
functionality.
  
src/test/org/apache/lucene/index/TestSegmentMerger.java
  Sets up two segments, including term vectors, then merges them and asserts 
that items were properly
  merged.
  
src/test/org/apache/lucene/index/TestSegmentReader.java
  Various tests for the SegmentReader.  Tests retrieving a document, deleting a 
document, 
  retrieving field names and retrieving terms.  Has a placeholder for 
retrieving norms,
  but I did not implement, as I didn't fully understand how norms worked.
  
src/test/org/apache/lucene/index/TestSegmentsReader.java
  Sets up a SegmentsReader made up of two Segments and does various tests on 
them.  Needs
  to be filled in more completely.
  
src/test/org/apache/lucene/index/TestSegmentTermDocs.java
  Has positive and negative tests for the SegmentTermDocs.
    
src/test/org/apache/lucene/index/TestTermVectorsReader.java
  Writes out some term vectors and then asserts that they can be read back in
  
src/test/org/apache/lucene/index/TestTermVectorsWriter.java
  Writes out some term vectors and then asserts that the proper files were 
created w/ the proper
  information in them.
  
src/test/org/apache/lucene/search/TestTermVectors.java
  Searches over an indexed set of documents and then retrieves the term vectors 
for the documents.
  Also sets up a small collection of documents and maps containing term and 
frequency information
  and verifies that the term vectors are properly constructed.  This is a 
fairly decent example
  of end-to-end use of the vectors.

Existing File Changes:
----------------------------------
org/apache/lucene/analysis/PorterStemmer.java:
  Made public.
  Please, please, please apply this patch!  I think several people have 
submitted this one and I vote for it
  as well!  I use the implementation in other parts of my code and it is 
annoying to have to change it in
  my local copy every time there is a new release.
  
org/apache/lucene/document/Document.java
  Added a getNumFields() method that will return the number of fields that a 
document has.
  
org/apache/lucene/document/Field.java
  Same as 1.2 patch.

org/apache/lucene/index/DocumentWriter.java
  Same as 1.2 patch.  Updated some formatting.
  
org/apache/lucene/index/FieldInfo.java
  Added constructor for indicating the term vector is stored.
  
org/apache/lucene/index/FieldInfos.java
  Added support for term vector storage.  Similar to 1.2 patch  
  The add methods now return a Map of <field name, field number> pairs.

org/apache/lucene/index/FieldsReader.java
  Added comment.  Now constructs the Field object with the termVector 
information

org/apache/lucene/index/FilterIndexReader.java
  Formatted code.  Added the Term Vector methods as stubs; they are not 
implemented and throw an UnsupportedOperationException.

org/apache/lucene/index/IndexReader.java
  Same as 1.2 patch, plus added a getTermVectorReader method which returns the 
TermVectorReader
  for the IndexReader.  Added new getIndexedFieldNames(boolean) methods which 
retrieve
  all indexed field names based on whether the field stores term vectors or not.
  Added a package local method named getFieldInfos which returns the field 
infos object
  for the reader.  This is needed in merging. 
  Formatted code.
  
org/apache/lucene/index/SegmentMerger.java
  Added comments and a mergeVectors() method that merges the terms in from the 
various
  readers into the new segment.  Formatted code.
  
org/apache/lucene/index/SegmentReader.java
  Added new TV files to the list of segments.  Implemented new IndexReader 
methods for TVs.
  
org/apache/lucene/index/SegmentTermDocs.java
  Formatted.  Added the isValid() method, but it is commented out, as I am not 
sure it is needed. 
  It was in the 1.2 version.
  
org/apache/lucene/index/SegmentTermEnum.java
  Same as 1.2 patch.  Formatted.
  
org/apache/lucene/index/SegmentTermPositions.java
  Same as 1.2 patch.
  
org/apache/lucene/index/SegmentsReader.java
  Added a fieldInfos variable that is the summation of all of the fieldInfos 
from the other segments.
  This is used to implement the getFieldInfos() method, but is probably not all 
that useful.
  Implements the new term vector methods.
  
org/apache/lucene/index/TermDocs.java
  Added isValid method per 1.2, but it is commented out as I am not sure we 
need it.  Formatted code.
  
org/apache/lucene/index/TermEnum.java
  Same as 1.2 patch.

org/apache/lucene/index/TermInfosWriter.java
  Same as 1.2 patch.
  
org/apache/lucene/search/FilteredTermEnum.java
  Implements size() method, but throws UnsupportedOperationException.
  
org/apache/lucene/search/FuzzyTermEnum.java
  Implements termNumber() and isValid() but both throw 
UnsupportedOperationException.          

org/apache/lucene/search/MultiSearcher.java
  Implements new count() methods as per 1.2 patch.
  
org/apache/lucene/search/RemoteSearchable.java
  Same as MultiSearcher.
  
org/apache/lucene/search/Searchable.java
  Added count() methods onto the interface.
  
org/apache/lucene/search/Searcher.java
  Added count() methods support.                        
  
org/apache/lucene/search/WildcardTermEnum.java
  Implements termNumber() and isValid() but both throw 
UnsupportedOperationException.
  
org/apache/lucene/index/TestFilterIndexReader.java
  Implements the necessary TV methods

org/apache/lucene/search/TestBasics.java
  Tests the count methods for the searcher.
