You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Martin Bayly <Ma...@taglocity.com> on 2008/02/29 21:15:31 UTC
Most efficient way to find related terms
I'm wondering what the most efficient approach is to finding all terms
which are related to a specific term X.
By related I mean terms which appear in a specific document field that
also contains the target term X.
e.g. Document has a keyword field, field1 that can contain multiple
keywords.
document1 - field1 HAS key1, key2, key3
document2 - field1 HAS key2, key4
document3 - field1 HAS key5
If I want to find terms related to key2, I need to return key1, key3,
key4
Obviously I can do a search for key2, iterate all the docs and collect
there field1 terms manually.
But presumably a more efficient way is to use TermDocs:
1. TermDocs termdocs = IndexReader.termDocs(new Term("field1", "key2")
2. Iterate term docs to get documents containing that term
3. Now this is the bit I'm not sure of:
a. I could call Document doc = IndexReader.document(n), but that will
load all fields and I only want the field1
b. Presumably better to call Document doc = IndexReader.document(n,
fieldSelector)
c. Or would I be better to turn on TermFrequencyVectors for this field
so I can call IndexReader.getTermFrequencyVector(n, "field1") - don't
particularly care about the frequencies as it will always be 1 for a
particular doc.
Other approaches?
I'm going to perf test to see how (b) and (c) compare but would be glad
if anyone has any insights.
Thanks
Martin