Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/06/20 14:51:23 UTC

[Lucene-java Wiki] Update of "SummerOfCode2011ProjectRankingTerrier" by DavidNemeskey

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "SummerOfCode2011ProjectRankingTerrier" page has been changed by DavidNemeskey:
http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRankingTerrier?action=diff&rev1=3&rev2=4

  As far as ''df'' goes, there is also `IndexReader.docFreq()`.
  
  Collection-level statistics seem to be harder to come by.
-  * ''number of fields'': `IndexReader.fields()`;
+  * ''number of fields'': `IndexReader.fields()`, '''BUT''' this statistic is only for normalization, which is performed outside of the `Similarity` in Lucene; hence, we don't need it;
   * ''no. of tokens in a field'': `IndexReader.getSumOfNorms()`; it differs slightly from the real length; it may be worth having both, since the more options we have, the more possibilities there are to experiment with;
   * ''avg. field length'': has to be computed as in `MockBM25Similarity.avgDocumentLength()` from the no. of tokens in each field;
   * ''no. of documents'': `IndexReader.numDocs()`, obtained from the context (for some reason, `maxDoc()` is used in `MockBM25Similarity`);
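
The average field length item above, and the `numDocs()` versus `maxDoc()` question, can be illustrated with a minimal, Lucene-independent sketch. The class `FieldLengthStats` and its methods are hypothetical names introduced only for this illustration; they are not part of Lucene or of `MockBM25Similarity`. The sketch assumes the per-document token counts for a field are already available as plain numbers, and relies on the fact that `maxDoc()` also counts deleted documents while `numDocs()` counts only live ones.

```java
// Hypothetical helper, not a Lucene class: plain arithmetic only.
final class FieldLengthStats {
    private FieldLengthStats() {}

    /** Sums the token counts of all (live) documents for one field. */
    static long totalTokens(long[] tokensPerDoc) {
        long total = 0;
        for (long tokens : tokensPerDoc) {
            total += tokens;
        }
        return total;
    }

    /**
     * Average field length = total tokens / document count.
     * Returns 0.0 for an empty collection to avoid dividing by zero.
     */
    static double avgFieldLength(long totalTokens, int docCount) {
        return docCount == 0 ? 0.0 : (double) totalTokens / docCount;
    }

    public static void main(String[] args) {
        // Assumed example data: three live documents; a fourth was deleted.
        long[] liveDocTokens = {3, 5, 4};
        long total = totalTokens(liveDocTokens);

        int numDocs = liveDocTokens.length; // live documents only
        int maxDoc = numDocs + 1;           // also counts the deleted document

        System.out.println("avg over numDocs: " + avgFieldLength(total, numDocs));
        System.out.println("avg over maxDoc:  " + avgFieldLength(total, maxDoc));
    }
}
```

With deletions present, dividing by the `maxDoc()`-style count yields a smaller average than dividing by the `numDocs()`-style count, which is one reason the choice between the two matters for length normalization.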