Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/06/20 14:51:23 UTC
[Lucene-java Wiki] Update of "SummerOfCode2011ProjectRankingTerrier" by DavidNemeskey
The "SummerOfCode2011ProjectRankingTerrier" page has been changed by DavidNemeskey:
http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRankingTerrier?action=diff&rev1=3&rev2=4
As far as ''df'' goes, there is also `IndexReader.docFreq()`.
Collection-level statistics seem to be harder to come by.
- * ''number of fields'': `IndexReader.fields()`;
+ * ''number of fields'': `IndexReader.fields()`, '''BUT''' this statistic is only for normalization, which is performed outside of the `Similarity` in Lucene; hence, we don't need it;
* ''no. of tokens in a field'': `IndexReader.getSumOfNorms()`; it differs slightly from the real length; it may be worth having both, since more options mean more possibilities to experiment with;
* ''avg. field length'': has to be computed as in `MockBM25Similarity.avgDocumentLength()` from the no. of tokens in each field;
* ''no. of documents'': `IndexReader.numDocs()` (for some reason, `maxDoc()` is used in `MockBM25Similarity`) from the context;
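
A minimal sketch of the ''avg. field length'' item above, assuming the per-document token counts for one field have already been collected into an array. The class `AvgFieldLength` and its `average` method are hypothetical names used only for illustration; the sketch does not call Lucene and is not necessarily how `MockBM25Similarity.avgDocumentLength()` is written.

```java
/** Hypothetical helper: average field length from per-document token counts. */
public final class AvgFieldLength {
    private AvgFieldLength() {}

    /**
     * Sums the token counts of one field over all documents and divides by
     * the number of documents. Returns 0 for an empty collection.
     */
    public static float average(long[] tokensPerDoc) {
        if (tokensPerDoc.length == 0) {
            return 0f;
        }
        long sum = 0;
        for (long n : tokensPerDoc) {
            sum += n;
        }
        return (float) sum / tokensPerDoc.length;
    }

    public static void main(String[] args) {
        // Assumed sample data: three documents with 3, 5 and 10 tokens.
        long[] lengths = {3, 5, 10};
        System.out.println(average(lengths)); // prints 6.0
    }
}
```

The divisor is where the choice between `numDocs()` and `maxDoc()` matters: `numDocs()` counts only live documents, while `maxDoc()` also covers slots held by deleted documents, so the two give different averages on an index with deletions.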