Posted to dev@lucene.apache.org by bu...@apache.org on 2004/02/06 17:55:14 UTC
DO NOT REPLY [Bug 18927] -
[PATCH] Term Vector support
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927
[PATCH] Term Vector support
grant_ingersoll@yahoo.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|Term Vector support |[PATCH] Term Vector support
------- Additional Comments From grant_ingersoll@yahoo.com 2004-02-06 16:55 -------
Attached is Dmitry's code updated for 1.3. Here are my notes on the
implementation (which are also included in the attachment).
The patch is in the zip, is named termVector1.3Patch.txt, and
was generated using cvs diff -Nu at the root of the tree.
If there are any questions, I would be more than happy to help via the mailing
list.
-----------------------------------------------
Notes on the re-implementation of Dmitry's Term Vector enhancements for Lucene
1.3.
Please see
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114748
for the original patch.
General Notes
-----------------------
I used Dmitry's code as a template by getting it working against 1.2 and then
going through by hand
and applying it against the HEAD. Thanks to Dmitry's great notes, it was
relatively painless. All of
the tests against HEAD pass.
Differences from 1.2 Version
----------------------------
The most significant change I had to make is that in the TermFreqVector
interface the getTermNumbers() method has been replaced by a getTerms() method,
which returns an array of Strings. These strings are the equivalent of
Term.text() and store the unique string that has been indexed. While the
numbering scheme worked to save space, it presented a problem in 1.3 when it
comes to merging, because the 1.3 code can support up to Long.MAX_VALUE
positions (see TermEnum and SegmentTermEnum) versus Integer.MAX_VALUE in 1.2
(at least in my understanding). This prevented me from using the termMaps array
technique used in 1.2 for remapping the term numbers from the old segment to
the new segment (a Java array is indexed by int, so it cannot cover that
range). To solve this, we needed some globally unique identifier for a term.
For this, I use the term text plus the field number that the term came from
(which is why there are new accessor methods on TermFreqVector called
get/setFieldNum).
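The identification scheme described above can be sketched roughly as follows.
This is only an illustration, not code from the patch: TermKey and its members
are hypothetical names. The patch itself exposes the field number through
get/setFieldNum on TermFreqVector.

```java
import java.util.Objects;

// Hypothetical illustration only: TermKey is NOT a class in the patch.
// It shows how (field number, term text) can serve as a term identifier
// that does not depend on per-segment term numbers.
final class TermKey {
    private final int fieldNum;  // number of the field the term came from
    private final String text;   // equivalent of Term.text()

    TermKey(int fieldNum, String text) {
        this.fieldNum = fieldNum;
        this.text = Objects.requireNonNull(text, "text");
    }

    int fieldNum() { return fieldNum; }

    String text() { return text; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TermKey)) return false;
        TermKey other = (TermKey) o;
        return fieldNum == other.fieldNum && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return 31 * fieldNum + text.hashCode();
    }
}
```

Two keys are equal only when both the field number and the text match, so the
same word indexed in two different fields yields two distinct identifiers.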
The side benefit of this is that merging is much simpler: we can just iterate
over the readers and vectors and add the terms from the old TermVector to the
new TermVectorWriter, without doing any remapping. The downside is that the
term vector files are going to take up more space on the disk.
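A minimal sketch of why no remapping is needed, assuming simplified stand-in
types (DocVector, Writer and merge are hypothetical and do not appear in the
patch): because each vector carries its terms as text, merging reduces to
copying vectors through in reader order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these types exist in the patch.
final class VectorMergeSketch {

    /** One field's term vector for one document, with terms stored as text. */
    static final class DocVector {
        final int fieldNum;
        final String[] terms;
        final int[] freqs;

        DocVector(int fieldNum, String[] terms, int[] freqs) {
            this.fieldNum = fieldNum;
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    /** Collects vectors in the order they are added, like a writer for the new segment. */
    static final class Writer {
        final List<DocVector> written = new ArrayList<>();

        void add(DocVector v) {
            // Terms are self-describing strings, so they are copied unchanged.
            written.add(new DocVector(v.fieldNum, v.terms.clone(), v.freqs.clone()));
        }
    }

    /** Each inner list plays the role of the vectors exposed by one segment reader. */
    static Writer merge(List<List<DocVector>> segments) {
        Writer out = new Writer();
        for (List<DocVector> segment : segments) {
            for (DocVector v : segment) {
                out.add(v);
            }
        }
        return out;
    }
}
```

With term numbers instead of text, the inner loop would first have to translate
every number from the old segment's numbering to the new one.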
I believe I have overcome the limitation that you can only retrieve term
vectors on optimized indices. The SegmentsReader, which previously threw
runtime exceptions for the getTermVector methods, now properly implements them.
Compatibility
----------------------
Similar to Dmitry's, I believe the index files should be backward compatible.
Performance
----------------------
Have not run thorough performance tests, but I did do the following runs, one
with term vectors and one without term vectors:
Index Size: 12598 documents with 88362 terms. The documents in question are XML
files where all of the TEXT was extracted and indexed.
Without TVs:
Drive Space Used: 42 MB
Time to index: 5 minutes, 30 seconds
With TVs:
Drive Space Used: 71.3 MB
Time to index: 6 minutes, 2 seconds
Your mileage may vary.
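For a rough sense of scale, the two runs above work out to about 70% more disk
space and about 10% more indexing time with term vectors. The arithmetic uses
only the figures reported above; the class and method names are made up for
this illustration.

```java
// Illustrative arithmetic only, based on the two runs reported above.
final class TermVectorOverhead {

    /** Relative increase of withTv over withoutTv, as a percentage. */
    static double percentIncrease(double withoutTv, double withTv) {
        return (withTv - withoutTv) / withoutTv * 100.0;
    }

    public static void main(String[] args) {
        double disk = percentIncrease(42.0, 71.3);               // MB
        double time = percentIncrease(5 * 60 + 30, 6 * 60 + 2);  // seconds
        System.out.printf("disk: +%.1f%%, time: +%.1f%%%n", disk, time);
    }
}
```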
Limitations
------------------------
Not sure what they are yet. I am sure there are places that could be
optimized. The numbering scheme could probably be reinstituted by using some
type of paging array or array-of-arrays scheme that allows you to store a
really large number of values.
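One possible shape for such an array-of-arrays, as a sketch only (PagedLongArray
is a hypothetical class, not part of the patch): a long index is split into a
page number and an offset within the page, and pages are allocated on first
write. With the page size chosen here it addresses far more than
Integer.MAX_VALUE entries, though not the full long range.

```java
// Hypothetical sketch of an "array of arrays"; not part of the patch.
final class PagedLongArray {
    private static final int PAGE_BITS = 16;               // 65,536 entries per page
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final long[][] pages;
    private final long length;

    PagedLongArray(long length) {
        if (length < 0) throw new IllegalArgumentException("negative length");
        long pageCount = (length + PAGE_SIZE - 1) >>> PAGE_BITS;
        if (pageCount > Integer.MAX_VALUE) throw new IllegalArgumentException("too large");
        this.length = length;
        this.pages = new long[(int) pageCount][];          // pages are allocated lazily
    }

    long length() { return length; }

    /** Returns the stored value, or 0 if nothing was written at this index. */
    long get(long index) {
        checkIndex(index);
        long[] page = pages[(int) (index >>> PAGE_BITS)];
        return page == null ? 0L : page[(int) (index & PAGE_MASK)];
    }

    void set(long index, long value) {
        checkIndex(index);
        int p = (int) (index >>> PAGE_BITS);
        if (pages[p] == null) pages[p] = new long[PAGE_SIZE];
        pages[p][(int) (index & PAGE_MASK)] = value;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

A term-number remapping table built this way could map an old term number at
index i to its new number via set(i, newNumber), at the cost of one extra
indirection per lookup.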
FilterIndexReader throws an UnsupportedOperationException for the new Term
Vector methods.
I did not test with compound files. Do not know if they are compatible.
Other limitations are probably those of omission. That is, are the new methods
sufficient for doing what
people need to do? I can think of a few:
1. Since only terms and frequencies are stored, something to quickly calculate
the actual weight of the term
as it was scored for the query. I looked into this, but, frankly, I am fairly
confused by the whole
Scorer/Similarity interactions, especially when it comes to nested queries.
2. Perhaps the Document object itself should have a method similar to those on
IndexReader.
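As a starting point for item 1, a rough per-term weight can be computed from
the stored frequency plus index-level statistics. The sketch below uses a
classic tf-idf shape (square root of the term frequency times a log-scaled
inverse document frequency), which resembles the formulas in Lucene's default
Similarity. It deliberately ignores norms, boosts, coordination factors and
query nesting, so it is an assumption-laden approximation and will not
reproduce the score of an actual query. All names are hypothetical.

```java
// Hypothetical approximation; not part of the patch and not a real query score.
final class TermWeightSketch {

    /** Term-frequency factor: grows with the square root of the raw count. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse-document-frequency factor: rarer terms weigh more. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Combined weight for one term of one document's term vector. */
    static double weight(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }
}
```

Here freq would come from the term vector, while docFreq and numDocs would have
to be looked up from the index separately.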
New File Notes
----------------------------------
src/java/org/apache/lucene/index/SegmentTermVector.java
Implementation of TermFreqVector and TermPositionVector.
src/java/org/apache/lucene/index/TermFreqVector.java
Interface for describing a Document term vector. See notes above for what
was changed from 1.2
src/java/org/apache/lucene/index/TermPositionVector.java
No change from 1.2 version.
src/java/org/apache/lucene/index/TermVectorsReader.java
Changed get methods to return TermFreqVector interface instead of explicit
SegmentTermVector.
Added getTermPositions method to retrieve TermPositionVector(s).
Changed reading in slightly to match the writing of the term text instead
of the term number.
src/java/org/apache/lucene/index/TermVectorsWriter.java
Added documentation
Changed the writing to write the term string instead of the term number
Would be nice if there was a way to turn on or off the writing of positional
information.
See the TODO comment.
src/test/org/apache/lucene/index/DocHelper.java
Package local Class to help setup documents for testing.
src/test/org/apache/lucene/index/TestDocumentWriter.java
New test class for the DocumentWriter object. Probably needs to be fleshed
out more to fully test.
src/test/org/apache/lucene/index/TestFieldInfos.java
Test for the new FieldInfos return values, etc.
src/test/org/apache/lucene/index/TestFieldsReader.java
Basic test for FieldsReader. Needs to be expanded to fully test
functionality.
src/test/org/apache/lucene/index/TestSegmentMerger.java
Sets up two segments, including term vectors, then merges them and asserts
that items were properly merged.
src/test/org/apache/lucene/index/TestSegmentReader.java
Various tests for the SegmentReader. Tests retrieving a document, deleting a
document, retrieving field names and retrieving terms. Has a placeholder for
retrieving norms, but I did not implement it, as I didn't fully understand how
norms worked.
src/test/org/apache/lucene/index/TestSegmentsReader.java
Sets up a SegmentsReader made up of two Segments and does various tests on
them. Needs to be filled in more completely.
src/test/org/apache/lucene/index/TestSegmentTermDocs.java
Has positive and negative tests for the SegmentTermDocs.
src/test/org/apache/lucene/index/TestTermVectorsReader.java
Writes out some term vectors and then asserts that they can be read back in
src/test/org/apache/lucene/index/TestTermVectorsWriter.java
Writes out some term vectors and then asserts that the proper files were
created w/ the proper information in them.
src/test/org/apache/lucene/search/TestTermVectors.java
Searches over an indexed set of documents and then retrieves the term vectors
for the documents. Also sets up a small collection of documents and maps
containing term and frequency information and verifies that the term vectors
are properly constructed. This is a fairly decent example of end-to-end use of
the vectors.
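The map-based check described for TestTermVectors might look roughly like the
sketch below. It is not taken from the test: FreqVector is a hypothetical
stand-in for TermFreqVector, of which only getTerms() is named in these notes,
and getTermFrequencies() is an assumed accessor name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real test in the patch may be structured differently.
final class TermVectorCheckSketch {

    /** Stand-in for TermFreqVector. getTermFrequencies() is an assumed name. */
    interface FreqVector {
        String[] getTerms();
        int[] getTermFrequencies();
    }

    /** Simple array-backed vector used to drive the check. */
    static final class ArrayVector implements FreqVector {
        private final String[] terms;
        private final int[] freqs;

        ArrayVector(String[] terms, int[] freqs) {
            this.terms = terms;
            this.freqs = freqs;
        }

        public String[] getTerms() { return terms; }

        public int[] getTermFrequencies() { return freqs; }
    }

    /** True when the vector holds exactly the expected terms and frequencies. */
    static boolean matches(FreqVector vector, Map<String, Integer> expected) {
        String[] terms = vector.getTerms();
        int[] freqs = vector.getTermFrequencies();
        if (terms.length != freqs.length) return false;
        // Work on a copy so duplicate or missing terms are both detected.
        Map<String, Integer> remaining = new HashMap<>(expected);
        for (int i = 0; i < terms.length; i++) {
            Integer want = remaining.remove(terms[i]);
            if (want == null || want.intValue() != freqs[i]) return false;
        }
        return remaining.isEmpty();
    }
}
```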
Existing File Changes:
----------------------------------
org/apache/lucene/analysis/PorterStemmer.java:
Made public.
Please, please, please apply this patch! I think several people have
submitted this one and I vote for it
as well! I use the implementation in other parts of my code and it is
annoying to have to change it in
my local copy every time there is a new release.
org/apache/lucene/document/Document.java
Added a getNumFields() method that will return the number of fields that a
document has.
org/apache/lucene/document/Field.java
Same as 1.2 patch.
org/apache/lucene/index/DocumentWriter.java
Same as 1.2 patch. Updated some formatting.
org/apache/lucene/index/FieldInfo.java
Added constructor for indicating the term vector is stored.
org/apache/lucene/index/FieldInfos.java
Added support for term vector storage. Similar to 1.2 patch
The add methods now return a Map of <field name, field number> pairs.
org/apache/lucene/index/FieldsReader.java
Added comment. Now constructs the Field object with the termVector
information
org/apache/lucene/index/FilterIndexReader.java
Formatted code. Added stubs for the Term Vector methods, but they are not
implemented (they throw UnsupportedOperationException).
org/apache/lucene/index/IndexReader.java
Same as 1.2 patch, plus added a getTermVectorReader method which returns the
TermVectorReader
for the IndexReader. Added new getIndexedFieldNames(boolean) methods which
retrieve
all indexed field names based on whether the field stores term vectors or not.
Added a package local method named getFieldInfos which returns the field
infos object
for the reader. This is needed in merging.
Formatted code.
org/apache/lucene/index/SegmentMerger.java
Added comments and a mergeVectors() method that merges the terms in from the
various
readers into the new segment. Formatted code.
org/apache/lucene/index/SegmentReader.java
Added new TV files to the list of segments. Implemented new IndexReader
methods for TVs.
org/apache/lucene/index/SegmentTermDocs.java
Formatted. Added in the isValid() method, but it is commented out, as I am not
sure it is needed. It was in the 1.2 version.
org/apache/lucene/index/SegmentTermEnum.java
Same as 1.2 patch. Formatted.
org/apache/lucene/index/SegmentTermPositions.java
Same as 1.2 patch.
org/apache/lucene/index/SegmentsReader.java
Added a fieldInfos variable that is the summation of all of the fieldInfos
from the other segments.
This is used to implement the getFieldInfos() method, but is probably not all
that useful.
Implements the new term vector methods.
org/apache/lucene/index/TermDocs.java
Added isValid method per 1.2, but it is commented out as I am not sure we
need it. Formatted code.
org/apache/lucene/index/TermEnum.java
Same as 1.2 patch.
org/apache/lucene/index/TermInfosWriter.java
Same as 1.2 patch.
org/apache/lucene/search/FilteredTermEnum.java
Implements size() method, but throws UnsupportedOperationException.
org/apache/lucene/search/FuzzyTermEnum.java
Implements termNumber() and isValid() but both throw
UnsupportedOperationException.
org/apache/lucene/search/MultiSearcher.java
Implements new count() methods as per 1.2 patch.
org/apache/lucene/search/RemoteSearchable.java
Same as MultiSearcher.
org/apache/lucene/search/Searchable.java
Added count() methods onto the interface.
org/apache/lucene/search/Searcher.java
Added count() methods support.
org/apache/lucene/search/WildcardTermEnum.java
Implements termNumber() and isValid() but both throw
UnsupportedOperationException.
org/apache/lucene/index/TestFilterIndexReader.java
Implements the necessary TV methods
org/apache/lucene/search/TestBasics.java
Tests the count methods for the searcher.