Posted to dev@lucene.apache.org by "David Mark Nemeskey (JIRA)" <ji...@apache.org> on 2011/06/07 21:24:58 UTC

[jira] [Commented] (LUCENE-3174) Similarity.Stats class for term & collection statistics

    [ https://issues.apache.org/jira/browse/LUCENE-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045598#comment-13045598 ] 

David Mark Nemeskey commented on LUCENE-3174:
---------------------------------------------

Here's what the patch does:
- it introduces the Similarity.Stats class and its subclasses
- renames computeWeight() to computeStats()
- fixes methods that call computeStats()

What remains to be done:
- rewrite the javadoc
- Stats will be used inside other Similarity methods, so its availability should be ensured somehow. The current solution in MockBM25Similarity is not satisfactory because there is only one Similarity object at a time.
- MultiPhraseWeight, PhraseWeight, SpanWeight and TermWeight call computeStats() and extract the IDFExplain object. This level of coupling is not desirable and should be eliminated, all the more so because not all Similarity subclasses will have an idf.
- It might not even make sense to expose computeStats()?

To consider:
- it might be better if the Stats classes were static, because then they could inherit fields from each other

> Similarity.Stats class for term & collection statistics
> -------------------------------------------------------
>
>                 Key: LUCENE-3174
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3174
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>             Fix For: flexscoring branch
>
>         Attachments: LUCENE-3174.patch
>
>
> In order to support ranking methods besides TF-IDF, we need to make the statistics they need available. These statistics could be computed in computeWeight (soon to become computeStats) and stored in a separate object for easy access. Since this object will be used solely by subclasses of Similarity, it should be implemented as a static inner class, i.e. Similarity.Stats.
> There are two ways this could be implemented:
> - as a single Similarity.Stats class, reused by all ranking algorithms. In this case, this class would have a member field for all statistics;
> - as a hierarchy of Stats classes, one for each ranking algorithm. Each subclass would define only the statistics needed for the ranking algorithm.
> In the second case, the Stats class in DefaultSimilarity would have a single field, idf, while the one in e.g. BM25Similarity would have idf and average field/document length.
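
For illustration, the second option (a hierarchy of Stats classes) could look roughly like the sketch below. This is not the code in the attached patch: the wrapper class StatsSketch, the names IdfStats and Bm25Stats, the computeStats() parameters and the idf formula are all assumptions made for the example. It only shows how static nested Stats classes can inherit fields from each other.

```java
// Hypothetical sketch, not the patch's API: all names and signatures are illustrative.
final class StatsSketch {

    /** Base type for per-term/collection statistics. */
    static class Stats {
    }

    /** Statistics for a TF-IDF style ranking: a single idf value. */
    static class IdfStats extends Stats {
        final float idf;

        IdfStats(float idf) {
            this.idf = idf;
        }
    }

    /** Statistics for a BM25 style ranking: inherits idf, adds average field length. */
    static class Bm25Stats extends IdfStats {
        final float avgFieldLength;

        Bm25Stats(float idf, float avgFieldLength) {
            super(idf);
            this.avgFieldLength = avgFieldLength;
        }
    }

    /** Stand-in for Similarity; the parameter list is an assumption. */
    abstract static class Similarity {
        abstract Stats computeStats(long docFreq, long numDocs, long totalFieldLength);
    }

    static class DefaultSimilarity extends Similarity {
        @Override
        IdfStats computeStats(long docFreq, long numDocs, long totalFieldLength) {
            return new IdfStats(idf(docFreq, numDocs));
        }
    }

    static class BM25Similarity extends Similarity {
        @Override
        Bm25Stats computeStats(long docFreq, long numDocs, long totalFieldLength) {
            // Average field length = total number of field tokens / number of documents.
            float avgFieldLength = (float) totalFieldLength / numDocs;
            return new Bm25Stats(idf(docFreq, numDocs), avgFieldLength);
        }
    }

    /** Placeholder idf, log(numDocs / (docFreq + 1)) + 1; a real BM25 would likely use its own. */
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```

Because each Similarity returns its own Stats subtype, a scorer for BM25 can read both idf and avgFieldLength from one object, while code that only needs idf can accept an IdfStats. With the first option (a single Stats class) all of these fields would live in one class regardless of which ranking method uses them.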

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org