Posted to dev@lucene.apache.org by Doug Cutting <cu...@lucene.com> on 2003/04/07 20:28:36 UTC
Re: Proposal: Statistical Stopword elimination
Karsten Konrad wrote:
> For this, I have introduced a frequency limit factor into
> Similarity and test for excessively high document frequencies
> in the TermQuery.
>
> My questions:
>
> (1) Is there some more elegant way of doing this?
I think you could do this more simply by creating a subclass of
TermQuery and overriding createWeight, with something like:
  protected Weight createWeight(Searcher searcher) {
    float maxDoc = searcher.maxDoc();
    float ratio = searcher.docFreq(getTerm()) / maxDoc;
    float threshold =
      ((ThresholdSimilarity)getSimilarity()).getThreshold();
    if (ratio < threshold)
      return super.createWeight(searcher);  // term is rare enough to score
    else
      return new NullWeight();  // a no-op weight implementation
  }
You'd also need to define ThresholdSimilarity as a subclass of
Similarity or DefaultSimilarity that has a threshold, and define
NullWeight as a Weight implementation whose Scorer does nothing.
Note that, with a MultiSearcher, your implementation computed thresholds
independently for each index, whereas this computes them globally over
all indexes, which is probably what you want.
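To make the per-index versus global difference concrete, here is a minimal, Lucene-free sketch. The class DocFreqRatio, its methods, and the sample counts are all hypothetical, and it assumes the combined searcher reports document frequency and document count summed over its sub-indexes.

```java
// Hypothetical model of the threshold test; none of these names are Lucene APIs.
final class DocFreqRatio {
    /** Fraction of documents containing the term; 0 when the index is empty. */
    static float ratio(int docFreq, int maxDoc) {
        return maxDoc == 0 ? 0f : (float) docFreq / maxDoc;
    }

    /** Global ratio: sum the counts across all indexes before dividing. */
    static float globalRatio(int[] docFreqs, int[] maxDocs) {
        int df = 0;
        int max = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            df += docFreqs[i];
            max += maxDocs[i];
        }
        return ratio(df, max);
    }

    public static void main(String[] args) {
        int[] docFreqs = {90, 10};   // term is in 90 docs of index A, 10 of index B
        int[] maxDocs = {100, 900};  // index A holds 100 docs, index B holds 900
        float threshold = 0.5f;      // assumed cutoff, for illustration only

        // Per index, the term exceeds the cutoff in A (90/100) but not in B (10/900).
        System.out.println(ratio(docFreqs[0], maxDocs[0]) >= threshold);
        System.out.println(ratio(docFreqs[1], maxDocs[1]) >= threshold);
        // Globally (100/1000) it stays under the cutoff for every index.
        System.out.println(globalRatio(docFreqs, maxDocs) >= threshold);
    }
}
```

With per-index thresholds the same term would be dropped in one index and scored in another; the global ratio gives one consistent decision for the whole query.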
Note also that this is all done with public APIs and requires no changes
to the Lucene core.
> E.g., access to the docFreq is done again in the TermScorer
> and I would like to remove this redundancy.
I doubt that will substantially impact performance. If it does, it
would be easy to add a small cache to the IndexReader. However,
someone tried this once and found that it didn't make much difference.
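For illustration only, a small cache of that kind might look like the sketch below. DocFreqCache and its lookup function are hypothetical stand-ins, not part of IndexReader; the sketch is unbounded, not thread-safe, and would need to be discarded whenever the index changes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical memoizing wrapper around an expensive docFreq lookup.
final class DocFreqCache {
    private final Map<String, Integer> cache = new HashMap<>();
    private final ToIntFunction<String> lookup; // stands in for the real index lookup
    int misses = 0;                             // counts calls that reached the lookup

    DocFreqCache(ToIntFunction<String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the cached document frequency, computing it on first use. */
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached == null) {
            misses++;
            cached = lookup.applyAsInt(term);
            cache.put(term, cached);
        }
        return cached;
    }
}
```

A second request for the same term is then answered from the map, so the query-time check and the scorer would share one lookup.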
> (2) Is this a worthwhile contribution to Lucene's features in your opinion?
Please post the code. If folks use it, then it's worthwhile and we
should probably include it with Lucene. Ideally it should be simple to
implement such things with the public APIs without having to build
more features into the core.
Doug