Posted to dev@lucene.apache.org by Kelvin Tan <ke...@relevanz.com> on 2005/02/06 10:14:16 UTC

Study Group (WAS Re: Normalized Scoring)

Wouldn't it be great if we could form a study-group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene specific questions to dev or user even if it's related to IR.

Feels to me like there could be a couple like us, who didn't do a dissertation in IR, but would like a more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the Expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.

What do you think?

k

On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
> Exactly.  Luckily, since then I've learned a bit from lucene-dev
> discussions and side IR readings, so some of the topics are making
> more sense now.
>
> Otis
>
> --- Kelvin Tan <ke...@relevanz.com> wrote:
>
>> Hi Otis, I was re-reading this whole theoretical thread about
>> idf, scoring, normalization, etc. from last Oct and couldn't help
>> laughing out loud when I read your post, coz it summed up what I
>> was thinking the whole time. I think it's really great to have
>> people like Chuck and Paul (Elschot) around. I'm learning so much.
>>
>> k
>>
>> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
>>
>>> Hi Chuck,
>>>
>>> The relative lack of responses is not because there is no
>>> interest, but probably because there are only a few people on
>>> lucene-dev who can follow/understand every detail of your
>>> proposal.  I understand and hear you, but I have a hard time
>>> 'visualizing' some of the formulas in your proposal.  What you
>>> are saying sounds right to me, but I don't have enough
>>> theoretical knowledge to go one way or the other.
>>>
>>> Otis
>>>
>>>
>>> --- Chuck Williams <ch...@manawiz.com> wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> Although there doesn't seem to be much interest in this, I have
>>>> one significant improvement to the below and a couple of
>>>> observations that clarify the situation.
>>>>
>>>> To illustrate the problem better normalization is intended to
>>>> address: in my current application, for BooleanQuery's of two
>>>> terms I frequently get a top score of 1.0 when only one of the
>>>> terms is matched.  1.0 should indicate a "perfect match".  I'd
>>>> like to set my UI up to present the results differently depending
>>>> on how good the different results are (e.g., showing a visual
>>>> indication of result quality, dropping the really bad results
>>>> entirely, and segregating the good results from other only
>>>> vaguely relevant results).  To build this kind of "intelligence"
>>>> into the UI, I certainly need to know whether my top result
>>>> matched all query terms, or only half of them.  I'd like to have
>>>> the score tell me directly how good the matches are.  The current
>>>> normalization does not achieve this.
>>>>
>>>> The intent of a new normalization scheme is to preserve the
>>>> current scoring in the following sense: the ratio of the nth
>>>> result's score to the best result's score remains the same.
>>>> Therefore, the only question is what normalization factor to
>>>> apply to all scores.  This reduces to a very specific question
>>>> that determines the entire normalization -- what should the score
>>>> of the best matching result be?
>>>>
>>>> The mechanism below has this property, i.e. it keeps the current
>>>> score ratios, except that I removed one idf term for reasons
>>>> covered earlier (this better reflects the current empirically
>>>> best scoring algorithms).  However, removing an idf is a
>>>> completely separate issue.  The improved normalization is
>>>> independent of whether or not that change is made.
>>>>
>>>> For the central question of what the top score should be, upon
>>>> reflection, I don't like the definition below.  It defined the
>>>> top score as (approximately) the percentage of query terms
>>>> matched by the top scoring result.  A better conceptual
>>>> definition is to use a weighted average based on the boosts,
>>>> i.e., propagate all boosts down to the underlying terms (or
>>>> phrases).  Specifically, the "net boost" of a term is the product
>>>> of the direct boost of the term and all boosts applied to
>>>> encompassing clauses.  Then the score of the top result becomes
>>>> the sum of the net boosts of its matching terms divided by the
>>>> sum of the net boosts of all query terms.
>>>>
>>>> This definition is a refinement of the original proposal below,
>>>> and it could probably be further refined if some impact of the
>>>> tf, idf and/or lengthNorm was desired in determining the top
>>>> score.  These other factors seem to be harder to normalize for,
>>>> although I've thought of some simple approaches; e.g., assume the
>>>> unmatched terms in the top result have values for these three
>>>> factors that are the average of the values of the matched terms,
>>>> then apply exactly the same concept of actual score divided by
>>>> theoretical maximum score.  That would eliminate any need to
>>>> maintain maximum value statistics in the index.
>>>>
>>>> As an example of the simple boost-based normalization, for the
>>>> query ((a^2 b)^3 (c d^2)) the net boosts are:
>>>>
>>>>     a --> 6, b --> 3, c --> 1, d --> 2
>>>>
>>>> So if a and b matched, but not c and d, in the top scoring
>>>> result, its score would be 0.75.  The normalizer would be
>>>> 0.75/(current score except for the current normalization).  This
>>>> normalizer would be applied to all current scores (minus
>>>> normalization) to create the normalized scores.
>>>>
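A minimal sketch of this net-boost computation in Java may make it concrete. The tree walk assumes 1.4-era accessors (Query.getBoost(), TermQuery.getTerm(), BooleanQuery.getClauses(), and the public BooleanClause.query field); treat it as an illustration of the idea, not a drop-in patch:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class NetBoosts {

        /** Propagate boosts down the query tree: each term's net boost
         *  is its direct boost times the boosts of all enclosing
         *  clauses.  Accumulates Term -> Float in netBoosts. */
        static void collect(Query q, float enclosing, Map netBoosts) {
            float boost = enclosing * q.getBoost();
            if (q instanceof TermQuery) {
                netBoosts.put(((TermQuery) q).getTerm(), new Float(boost));
            } else if (q instanceof BooleanQuery) {
                BooleanClause[] cs = ((BooleanQuery) q).getClauses();
                for (int i = 0; i < cs.length; i++)
                    collect(cs[i].query, boost, netBoosts);
            }
        }

        /** Top score = (sum of net boosts of matched terms) /
         *              (sum of net boosts of all query terms).
         *  For ((a^2 b)^3 (c d^2)) with a and b matched:
         *  (6 + 3) / (6 + 3 + 1 + 2) = 0.75, as in the example above. */
        static float topScore(Map netBoosts, Set matchedTerms) {
            float matched = 0f, total = 0f;
            for (Iterator it = netBoosts.entrySet().iterator(); it.hasNext();) {
                Map.Entry e = (Map.Entry) it.next();
                float b = ((Float) e.getValue()).floatValue();
                total += b;
                if (matchedTerms.contains(e.getKey())) matched += b;
            }
            return total == 0f ? 0f : matched / total;
        }
    }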
>>>> For the simple query (a b), if only one of the terms matched in
>>>> the top result, then its score would be 0.5, vs. 1.0 or many
>>>> other possible scores today.
>>>>
>>>> In addition to enabling more "intelligent" UI's that communicate
>>>> the quality of results to end-users, the proposal below also
>>>> extends the explain() mechanism to fully explain the final
>>>> normalized score.  However, that change is also independent -- it
>>>> could be done with the current scoring.
>>>>
>>>> Am I the only one who would like to see better normalization in
>>>> Lucene?  Does anybody have a better approach?
>>>>
>>>> If you've read this far, thanks for indulging me on this.  I
>>>> would love to see this or something with similar properties in
>>>> Lucene.  I really just want to build my app, but as stated below
>>>> would write and contribute this if there is interest in putting
>>>> it in, and nobody else wants to write it.  Please let me know
>>>> what you think one way or the other.
>>>>
>>>> Thanks,
>>>>
>>>> Chuck
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Chuck Williams
>>>>> Sent: Monday, October 18, 2004 7:04 PM
>>>>> To: 'Lucene Developers List'
>>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>>>
>>>>> Doug Cutting wrote:
>>>>>> If this is a big issue for you, as it seems it is, please
>>>>>> submit a patch to optionally disable score normalization in
>>>>>> Hits.java.
>>>>>
>>>>> and:
>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
>>>>>> document each time a document is added to the collection, since
>>>>>> 'weight(t,d)' is dependent on global term statistics.  This is
>>>>>> prohibitively expensive.  Research has also demonstrated that
>>>>>> such cosine normalization gives somewhat inferior results
>>>>>> (e.g., Singhal's pivoted length normalization).
>>>>>
>>>>> I'm willing to write, test and contribute code to address the
>>>>> normalization issue, i.e. to yield scores in [0, 1] that are
>>>>> meaningful across searches.  Unfortunately, this is considerably
>>>>> more involved than just optionally eliminating the current
>>>>> normalization in Hits.  Before undertaking this, I'd like to see
>>>>> if there is agreement in principle that this is a good idea, and
>>>>> that my specific proposal below is the right way to go.  Also,
>>>>> I'd like to make sure I've correctly inferred the constraints on
>>>>> writing code to be incorporated into Lucene.
>>>>>
>>>>> After looking at this in more detail I agree that the cosine
>>>>> normalization is not the way to go, because of both efficiency
>>>>> and effectiveness considerations.  A conceptual approach that
>>>>> would be efficient, relatively easy to implement, and seems to
>>>>> have at least reasonable behavior would be to define the top
>>>>> scoring match to have a score roughly equal to the percentage of
>>>>> query terms it matches (its "netCoord").  Scores below the top
>>>>> hit would be reduced based on the ratio of their raw scores to
>>>>> the raw score of the top hit (considering all of the current
>>>>> score factors, except that I'd like to remove one of the idf
>>>>> factors, as discussed earlier).
>>>>>
>>>>> For a couple of simple cases:
>>>>> a) the top match for a single term query would always have a
>>>>>    score of 1.0,
>>>>> b) the top scoring match for a BooleanQuery using
>>>>>    DefaultSimilarity with all non-prohibited TermQuery clauses
>>>>>    would have a score of m/n, where the hit matches m of the n
>>>>>    terms.
>>>>>
>>>>> This isn't optimal, but seems much better than the current
>>>>> situation.  Consider two single-term queries, s and t.  If s
>>>>> matches more strongly than t in its top hit (e.g., a higher tf
>>>>> in a shorter field), it would be best if the top score of s was
>>>>> greater than the top score of t.  But this kind of normalization
>>>>> would require keeping additional statistics that so far as I
>>>>> know are not currently in the index, like the maximum tf for
>>>>> terms and the minimum length for fields.  These could be
>>>>> expensive to update on deletes.  Also, normalizing by such
>>>>> factors could yield lower than subjectively reasonable scores in
>>>>> most cases, so it's not clear it would be better.
>>>>>
>>>>> The semantics above are at least clean, easy to understand, and
>>>>> support what seems to me is the most important motivation to do
>>>>> this: allowing an application to use simple thresholding to
>>>>> segregate likely-to-be-relevant hits from likely-to-be-irrelevant
>>>>> hits.
>>>>>
>>>>> More specifically, for a BooleanQuery of TermQuery's the
>>>>> resulting score functions would be:
>>>>>
>>>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
>>>>>
>>>>> baseScore(sbq, doc) =
>>>>>     sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc)
>>>>>              * lengthNorm(tqi.term.field, doc)
>>>>>
>>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore
>>>>>
>>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>>>
>>>>> score(sbq, doc) = rawScore * norm
>>>>>
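For anyone who, like Otis, finds the formulas hard to 'visualize', here is the same thing transcribed literally into Java. The idf/tf/lengthNorm/coord methods are stubs standing in for the corresponding Similarity quantities, not real Lucene API:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    public class ProposedScoring {

        // Stubs for the corresponding Similarity quantities.
        static float idf(Term t)                     { return 1f; }
        static float tf(Term t, int doc)             { return 1f; }
        static float lengthNorm(String f, int doc)   { return 1f; }
        /** Fraction of the n terms matched by doc (m/n). */
        static float coord(TermQuery[] q, int doc)   { return 1f; }

        /** baseScore(sbq, doc): boost * idf * tf * lengthNorm summed
         *  over the query's terms. */
        static float baseScore(TermQuery[] tqs, int doc) {
            float sum = 0f;
            for (int i = 0; i < tqs.length; i++) {
                Term t = tqs[i].getTerm();
                sum += tqs[i].getBoost() * idf(t) * tf(t, doc)
                     * lengthNorm(t.field(), doc);
            }
            return sum;
        }

        /** rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc). */
        static float rawScore(TermQuery[] tqs, int doc) {
            return coord(tqs, doc) * baseScore(tqs, doc);
        }

        /** norm(sbq, hits): reciprocal of the largest baseScore in the
         *  result set -- one normalizer per query, not per document. */
        static float norm(TermQuery[] tqs, int[] hits) {
            float max = 0f;
            for (int i = 0; i < hits.length; i++)
                max = Math.max(max, baseScore(tqs, hits[i]));
            return max == 0f ? 0f : 1f / max;
        }

        /** score(sbq, doc) = rawScore * norm. */
        static float score(TermQuery[] tqs, int doc, float norm) {
            return rawScore(tqs, doc) * norm;
        }
    }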
>>>>> rawScore's can be computed in the Scorer.score() methods and
>>>>> therefore used to sort the hits (e.g., in the instance method
>>>>> for collect() in the HitCollector in IndexSearcher.search()).
>>>>> If the top scoring hit does not have the highest baseScore, then
>>>>> its score could be less than its coord; this seems desirable.
>>>>> These formulas imply that no result score can be larger than its
>>>>> coord, so if coord is well-defined (always between 0 and 1) then
>>>>> all results will be normalized between 0 and 1.
>>>>>
>>>>> In general, the netCoord, which takes the place of coord in the
>>>>> simple case above, needs to be defined for any query.
>>>>> Conceptually, this should be the total percentage of query terms
>>>>> matched by the document.  It must be recursively computable from
>>>>> the subquery structure and overridable for specific Query types
>>>>> (e.g., to support specialized coords, like one that is always
>>>>> 1.0 as is useful in multi-field searching).  Suitable default
>>>>> definitions for TermQuery and BooleanQuery are:
>>>>>
>>>>> TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
>>>>>
>>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
>>>>>
>>>>> This is not quite percentage of terms matched; e.g., consider a
>>>>> BooleanQuery with two clauses, one of which is a BooleanQuery of
>>>>> 3 terms and the other which is just a term.  However, it doesn't
>>>>> seem to be unreasonable for a BooleanQuery to state that its
>>>>> clauses are equally important, and this is consistent with the
>>>>> current coord behavior.  BooleanQuery.netCoord could be
>>>>> overridden for special cases, like the pure disjunction I use in
>>>>> my app for field expansions:
>>>>>
>>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
>>>>>
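The recursion just mirrors the subquery structure and bottoms out at the leaves; a sketch, where matches() is a hypothetical predicate for whether a (sub)query matched the document and MaxDisjunctionQuery stands for the custom class mentioned above:

    /** Default netCoord, recursing over the subquery structure. */
    static float netCoord(Query q, int doc) {
        if (q instanceof BooleanQuery) {
            // Clauses count equally: coord(1, n) = 1/n in DefaultSimilarity.
            BooleanClause[] cs = ((BooleanQuery) q).getClauses();
            float sum = 0f;
            for (int i = 0; i < cs.length; i++)
                sum += netCoord(cs[i].query, doc) / cs.length;
            return sum;
        }
        if (q instanceof MaxDisjunctionQuery) {
            // Pure disjunction: the best clause carries the whole coord.
            Query[] cs = ((MaxDisjunctionQuery) q).getClauses();
            float max = 0f;
            for (int i = 0; i < cs.length; i++)
                max = Math.max(max, netCoord(cs[i], doc));
            return max;
        }
        // TermQuery, and the default for any other leaf Query type.
        return matches(q, doc) ? 1.0f : 0.0f;
    }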
>>>>> Implementing this would proceed along these lines:
>>>>> 1.  For backwards compatibility, add some kind of newScoring
>>>>>     boolean setting.
>>>>> 2.  Update all of these places to behave as indicated if
>>>>>     newScoring:
>>>>>     a.  Change Query.weight() to not do any normalization (no
>>>>>         call to sumOfSquaredWeights(), Similarity.queryNorm() or
>>>>>         normalize()).
>>>>>     b.  Update all Query.weight classes to set their value
>>>>>         according to the terms in the score formula above that
>>>>>         don't involve the document (e.g., boost*idf for
>>>>>         TermQuery).
>>>>>     c.  Add suitable netCoord definitions to all Scorer classes.
>>>>>     d.  Update all Scorer.score() methods to compute the
>>>>>         rawScore as above.
>>>>>     e.  Add the maximum baseScore as a field kept on TopDocs,
>>>>>         computed in the HitCollector's.
>>>>>     f.  Change the normalization in Hits to always divide every
>>>>>         raw score by the maximum baseScore.
>>>>>     g.  Update all of the current explain() methods to be
>>>>>         consistent with this scoring, and to either report the
>>>>>         rawScore, or to report the final score if the
>>>>>         normalization factor is provided.
>>>>>     h.  Add Hits.explain() (or better, extend Searcher so that
>>>>>         it keeps the Hits for use in Searcher.explain()) to call
>>>>>         the new explain variation with the normalization factor
>>>>>         so that final scores are fully explained.
>>>>>
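On the collecting side, steps (d)-(f) might look roughly like the following. The baseScoreOf() hook is hypothetical (in the proposal the Scorer would supply it); only the shape of the two-pass normalization matters here:

    import java.util.ArrayList;
    import org.apache.lucene.search.HitCollector;

    /** Schematic collector: remember each raw score and the maximum
     *  baseScore, then divide in a second pass (step f). */
    class NormalizingCollector extends HitCollector {
        float maxBaseScore = 0f;
        ArrayList docs = new ArrayList();   // Integer doc ids
        ArrayList raw  = new ArrayList();   // Float raw scores

        public void collect(int doc, float rawScore) {
            docs.add(new Integer(doc));
            raw.add(new Float(rawScore));
            // Hypothetical hook; per step (d) the Scorer would expose
            // the per-document baseScore alongside the rawScore.
            maxBaseScore = Math.max(maxBaseScore, baseScoreOf(doc));
        }

        /** Final score of the i-th collected hit (step f). */
        float normalizedScore(int i) {
            float r = ((Float) raw.get(i)).floatValue();
            return maxBaseScore == 0f ? 0f : r / maxBaseScore;
        }

        float baseScoreOf(int doc) { return 1f; }  // stub for the sketch
    }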
>>>>> If this seems like a good idea, please let me know.  I'm sure
>>>>> there are details I've missed that would come out during coding
>>>>> and testing.  Also, the value of this is dependent on how
>>>>> reasonable the final scores look, which is hard to tell for sure
>>>>> until it is working.
>>>>>
>>>>> The coding standards for Lucene seem reasonably clear from the
>>>>> source code I've read.  I could use just simple Java so there
>>>>> shouldn't be any significant JVM dependencies.  The above should
>>>>> be fully backward compatible due to the newScoring flag.
>>>>>
>>>>> On another note, I had to remove the German analyzer in my
>>>>> current 1.4.2 source configuration because GermanStemmer failed
>>>>> to compile due to what are apparently Unicode character
>>>>> constants that I've now got as illegal two-character character
>>>>> constants.  This is presumably an encoding problem somewhere
>>>>> that I could track down.  It's not important, but if the answer
>>>>> is obvious to any of you, I'd appreciate the quick tip.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chuck
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Doug Cutting [mailto:cutting@apache.org]
>>>>>> Sent: Monday, October 18, 2004 9:44 AM
>>>>>> To: Lucene Developers List
>>>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
>>>>>>
>>>>>> Chuck Williams wrote:
>>>>>>> That's a good point on how the standard vector space inner
>>>>>>> product similarity measure does imply that the idf is squared
>>>>>>> relative to the document tf.  Even having been aware of this
>>>>>>> formula for a long time, this particular implication never
>>>>>>> occurred to me.  Do you know if anybody has done
>>>>>>> precision/recall or other relevancy empirical measurements
>>>>>>> comparing this vs. a model that does not square idf?
>>>>>>
>>>>>> No, not that I know of.
>>>>>>
>>>>>>> Regarding normalization, the normalization in Hits does not
>>>>>>> have very nice properties.  Due to the > 1.0 threshold check,
>>>>>>> it loses information, and it arbitrarily defines the highest
>>>>>>> scoring result in any list that generates scores above 1.0 as
>>>>>>> a perfect match.  It would be nice if score values were
>>>>>>> meaningful independent of searches, e.g., if "0.6" meant the
>>>>>>> same quality of retrieval independent of what search was done.
>>>>>>> This would allow, for example, sites to use a simple quality
>>>>>>> threshold to only show results that were "good enough".  At my
>>>>>>> last company (I was President and head of engineering for
>>>>>>> InQuira), we found this to be important to many customers.
>>>>>>
>>>>>> If this is a big issue for you, as it seems it is, please
>>>>>> submit a patch to optionally disable score normalization in
>>>>>> Hits.java.
>>>>>>
>>>>>>> The standard vector space similarity measure includes
>>>>>>> normalization by the product of the norms of the vectors,
>>>>>>> i.e.:
>>>>>>>
>>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
>>>>>>>     sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
>>>>>>>
>>>>>>> This makes the score a cosine, which since the values are all
>>>>>>> positive, forces it to be in [0, 1].  The sumOfSquares()
>>>>>>> normalization in Lucene does not fully implement this.  Is
>>>>>>> there a specific reason for that?
>>>>>
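As a self-contained illustration of the quoted measure (independent of any Lucene API), the cosine over two sparse term-weight vectors:

    import java.util.Iterator;
    import java.util.Map;

    public class Cosine {
        /** Cosine of two term->Double weight maps:
         *  dot(q,d) / (|q| * |d|).  With all weights non-negative the
         *  result lies in [0, 1]. */
        static double cosine(Map q, Map d) {
            double dot = 0, qq = 0, dd = 0;
            for (Iterator i = q.entrySet().iterator(); i.hasNext();) {
                Map.Entry e = (Map.Entry) i.next();
                double wq = ((Double) e.getValue()).doubleValue();
                qq += wq * wq;
                Double wd = (Double) d.get(e.getKey());
                if (wd != null) dot += wq * wd.doubleValue();
            }
            for (Iterator i = d.values().iterator(); i.hasNext();) {
                double wd = ((Double) i.next()).doubleValue();
                dd += wd * wd;
            }
            return (qq == 0 || dd == 0) ? 0 : dot / Math.sqrt(qq * dd);
        }
    }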
>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
>>>>>> document each time a document is added to the collection, since
>>>>>> 'weight(t,d)' is dependent on global term statistics.  This is
>>>>>> prohibitively expensive.  Research has also demonstrated that
>>>>>> such cosine normalization gives somewhat inferior results
>>>>>> (e.g., Singhal's pivoted length normalization).
>>>>>>
>>>>>>> Re. explain(), I don't see a downside to extending it to show
>>>>>>> the final normalization in Hits.  It could still show the raw
>>>>>>> score just prior to that normalization.
>>>>>>
>>>>>> In order to normalize scores to 1.0 one must know the maximum
>>>>>> score.  Explain only computes the score for a single document,
>>>>>> and the maximum score is not known.
>>>>>>
>>>>>>> Although I think it would be best to have a normalization that
>>>>>>> would render scores comparable across searches.
>>>>>>
>>>>>> Then please submit a patch.  Lucene doesn't change on its own.
>>>>>>
>>>>>> Doug





Re: Study Group (WAS Re: Normalized Scoring)

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> Paul Elschot wrote:
> 
>> I learned a lot by adding some javadocs to such classes. I suppose Doug
>> added the Expert markings, but I don't know their precise purpose.
> 
> 
> The "Expert" declaration is meant to indicate that most users should not 
> need to understand the feature.  Lucene's API seeks to be both simple 
> and flexible, but this is not always possible.  When flexibility is 
> added that is not part of the simple API, it is deemed "expert".  For 
> example, we don't expect most users to need to write new Query 
> implementations.  So Query methods that are only used internally in 
> query processing are marked "Expert", since they only need to be 
> understood by those implementing a Query.

I think this is an excellent way of giving guidance to users - you want 
to indicate what they'd normally use, and what they should initially 
stay away from.

> 
> Doug
> 




Re: Study Group (WAS Re: Normalized Scoring)

Posted by Doug Cutting <cu...@apache.org>.
Paul Elschot wrote:
> I learned a lot by adding some javadocs to such classes. I suppose Doug
> added the Expert markings, but I don't know their precise purpose.

The "Expert" declaration is meant to indicate that most users should not 
need to understand the feature.  Lucene's API seeks to be both simple 
and flexible, but this is not always possible.  When flexibility is 
added that is not part of the simple API, it is deemed "expert".  For 
example, we don't expect most users to need to write new Query 
implementations.  So Query methods that are only used internally in 
query processing are marked "Expert", since they only need to be 
understood by those implementing a Query.

Doug
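In the javadocs this is purely a textual convention -- the comment simply starts with "Expert:". For illustration only (the class and methods below are invented, not Lucene's):

    public class ExampleScorer {

        /** Returns the score of the current document.  Part of the
         *  simple API that every consumer of search results uses. */
        public float score() { return 1.0f; }

        /** Expert: Advances to the next matching document.  Most
         *  applications never call this directly; it only needs to be
         *  understood by those implementing their own Query/Scorer. */
        public boolean next() { return false; }
    }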



Re: Study Group (WAS Re: Normalized Scoring)

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Paul, thanks for responding.

On Sun, 6 Feb 2005 13:26:24 +0100, Paul Elschot wrote:
>
> Tuning the scoring is difficult because one needs to avoid the trap
> of optimizing for the test collection and test queries at hand. The
> interplays between query structure, coord(), idf() and tf() add to
> the complexity.
>
> As long as the discussion is on possible additions/improvements
> /tunings/extensions to Lucene, I think lucene-dev is a good
> platform. For example, there is some code in bugzilla for
> variations on idf():
> http://issues.apache.org/bugzilla/show_bug.cgi?id=32942 and tf():
> http://issues.apache.org/bugzilla/show_bug.cgi?id=31784 and the
> MultiFieldQuery things are here:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=32674
>

Thanks for the references, and for updating the Wiki. I'll plough through the list archives and see what else I can dig up, like related papers/publications, etc., and update the wiki.

k





Re: Study Group (WAS Re: Normalized Scoring)

Posted by Paul Elschot <pa...@xs4all.nl>.
On Sunday 06 February 2005 10:14, Kelvin Tan wrote:
> Wouldn't it be great if we could form a study-group of Lucene folks
> who want to take the "next step"? I feel uneasy posting non-Lucene
> specific questions to dev or user even if it's related to IR.
>
> Feels to me like there could be a couple like us, who didn't do a
> dissertation in IR, but would like a more in-depth knowledge for
> practical purposes. Basically, the end result is that we are able to
> tune or extend Lucene by using the Expert API (classes marked as
> Expert). Perhaps a possible outcome is a tuning tutorial for advanced
> users who already know how to use Lucene.
>
> What do you think?
>

I learned a lot by adding some javadocs to such classes. I suppose Doug
added the Expert markings, but I don't know their precise purpose.

Tuning the scoring is difficult because one needs to avoid the trap of
optimizing for the test collection and test queries at hand.
The interplays between query structure, coord(), idf() and tf() 
add to the complexity.

As long as the discussion is on possible additions/improvements
/tunings/extensions to Lucene, I think lucene-dev is a good platform.
For example, there is some code in bugzilla for variations
on idf(): http://issues.apache.org/bugzilla/show_bug.cgi?id=32942
and tf(): http://issues.apache.org/bugzilla/show_bug.cgi?id=31784
and the MultiFieldQuery things are here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674

> k
> 
> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
> > Exactly.  Luckily, since then I've learned a bit from lucene-dev
> > discussions and side IR readings, so some of the topics are making
> > more sense now.

One could collect and annotate more references here:
http://wiki.apache.org/jakarta-lucene/InformationRetrieval

Regards,
Paul Elschot




Re: Study Group (WAS Re: Normalized Scoring)

Posted by David Spencer <da...@tropo.com>.
You might want to see a post I just made to the thread with this long 
subject:

"single field code ready - Re: URL to compare 2 Similarity's ready-- Re: 
Scoring benchmark evaluation.  Was RE: How to proceed with Bug 31841 - 
MultiSearcher problems with Similarity.docFreq() ?"


I've done an example page that compares results of searching with 
different query parsers and Similarities.




Kelvin Tan wrote:

> Wouldn't it be great if we could form a study-group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene specific questions to dev or user even if it's related to IR.
> 
> Feels to me like there could be a couple like us, who didn't do a dissertation in IR, but would like a more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the Expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.
> 
> What do you think?
> 
> k


Re: Study Group (WAS Re: Normalized Scoring)

Posted by Roxana Angheluta <ro...@attentio.com>.
>I think I see what you are after.  I'm after the same knowledge. :)
>
>The only things that I can recommend are books:
>  Modern Information Retrieval
>  Managing Gigabytes
>
>And online resources like:
>  http://finance.groups.yahoo.com/group/mg/ (note the weird host name)
>  http://www.sims.berkeley.edu/~hearst/irbook/
>
>There is a pile of stuff in Citeseer, but those papers never really dig
>into the details and always require solid previous knowledge of the
>field.  They are no replacement for a class or a textbook.
>
>If you find a good forum for IR, please share.
>
>Otis
>  
>
I don't know about IR forums, but maybe the following link helps to
give an introduction for those not familiar with the field of IR.
It gives an overview of possible weighting schemas used with the
vector space model:

http://www.dcs.gla.ac.uk/~tombrosa/AIS/SMART-tutorial/weights.html

These weights have been implemented in SMART, a famous retrieval
system developed at Cornell University by Gerald Salton, one of the
big names in the history of IR (see
http://www.cs.cornell.edu/Info/Department/Annual96/Beginning/salton.html).

The weighting methods used in SMART are coded with 3 characters:

    First char  gives the term-freq procedure to be used.
    Second char gives the inverted-doc-freq procedure to be used.
    Third char  gives the normalization procedure to be used.

Any combination of 3 letters is in theory acceptable. The system
accounts for the boolean model (by using e.g. the bnn schema), as
well as for more sophisticated weights.
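As a small illustration, here is a sketch of two schemas in this notation: "bnn" (binary tf, no idf, no normalization, i.e. the boolean model mentioned above) and "ltc" (logarithmic tf, log idf, cosine normalization). The letter meanings follow the usual SMART conventions; see the tutorial linked above for the full table:

    public class SmartWeights {

        /** w(t,d) for the "bnn" schema: 1 if the term occurs, else 0. */
        static double bnn(int tf) {
            return tf > 0 ? 1.0 : 0.0;
        }

        /** Unnormalized "lt" weight: (1 + ln tf) * ln(N / df). */
        static double lt(int tf, int df, int numDocs) {
            if (tf == 0 || df == 0) return 0.0;
            return (1.0 + Math.log(tf)) * Math.log((double) numDocs / df);
        }

        /** The trailing "c": divide a document's weights by their
         *  Euclidean length, so document vectors have unit length. */
        static double[] cosineNormalize(double[] weights) {
            double sumSq = 0.0;
            for (int i = 0; i < weights.length; i++)
                sumSq += weights[i] * weights[i];
            double len = Math.sqrt(sumSq);
            if (len == 0.0) return weights;
            double[] out = new double[weights.length];
            for (int i = 0; i < weights.length; i++)
                out[i] = weights[i] / len;
            return out;
        }
    }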

While these schemas are theoretically attractive, it seems that
empirically other weightings have proven to be more useful (e.g. not
squaring the idf term).

Hope this helps,
roxana





Re: Study Group (WAS Re: Normalized Scoring)

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I think I see what you are after.  I'm after the same knowledge. :)

The only things that I can recommend are books:
  Modern Information Retrieval
  Managing Gigabytes

And online resources like:
  http://finance.groups.yahoo.com/group/mg/ (note the weird host name)
  http://www.sims.berkeley.edu/~hearst/irbook/

There is a pile of stuff in Citeseer, but those papers never really dig
into the details and always require solid previous knowledge of the
field.  They are no replacement for a class or a textbook.

If you find a good forum for IR, please share.

Otis


--- Kelvin Tan <ke...@relevanz.com> wrote:

> Wouldn't it be great if we could form a study-group of Lucene folks
> who want to take the "next step"? I feel uneasy posting non-Lucene
> specific questions to dev or user even if it's related to IR.
> 
> Feels to me like there could be a couple like us, who didn't do a
> dissertation in IR, but would like a more in-depth knowledge for
> practical purposes. Basically, the end result is that we are able to
> tune or extend Lucene by using the Expert API (classes marked as
> Expert). Perhaps a possible outcome is a tuning tutorial for advanced
> users who already know how to use Lucene.
> 
> What do you think?
> 
> k
> 
> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
> > Exactly.  Luckily, since then I've learned a bit from lucene-dev
> > discussions and side IR readings, so some of the topics are making
> > more sense now.
> >
> > Otis
> >
> > --- Kelvin Tan <ke...@relevanz.com> wrote:
> >
> >> Hi Otis, I was re-reading this whole theoretical thread about
> >> idf, scoring, normalization, etc from last Oct and couldn't help
> >> laughing out loud when I read your post, coz it summed up what I
> >> was thinking the whole time. I think its really great to have
> >> people like Chuck and Paul (Eshlot) around. I'm learning so much.
> >>
> >> k
> >>
> >> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
> >>
> >>> Hi Chuck,
> >>>
> >>> The relative lack of responses is not because there is no
> >>> interest, but probably because there are only a few people on
> >>> lucene-dev who can follow/understand every detail of your
> >>> proposal.  I understand and hear you, but I have a hard time
> >>> 'visualizing' some of the formulas in your proposal.  What you
> >>> are saying sounds right to me, but I don't have enough
> >>> theoretical knowledge to go one way or the other.
> >>>
> >>> Otis
> >>>
> >>>
> >>> --- Chuck Williams <ch...@manawiz.com> wrote:
> >>>
> >>>> Hi everybody,
> >>>>
> >>>> Although there doesn't seem to be much interest in this I
> >>>> have one significant improvement to the below and a couple
> >>>> observations that clarify the situation.
> >>>>
> >>>> To illustrate the problem better normalization is intended to
> >>>> address,
> >>>> in my current application for BooleanQuery's of two terms, I
> >>>> frequently
> >>>> get a top score of 1.0 when only one of the terms is matched.
> >>>> 1.0 should indicate a "perfect match".  I'd like set my UI up
> >>>> to present the
> >>>> results differently depending on how good the different
> >>>> results are (e.g., showing a visual indication of result
> >>>> quality, dropping the really bad results entirely, and
> >>>> segregating the good results from other
> >>>> only vaguely relevant results).  To build this kind of
> >>>> "intelligence" into the UI, I certainly need to know whether
> >>>> my top result matched all
> >>>> query terms, or only half of them.  I'd like to have the
> >>>> score tell me
> >>>> directly how good the matches are.  The current normalization
> >>>> does not achieve this.
> >>>>
> >>>> The intent of a new normalization scheme is to preserve the
> >>>> current scoring in the following sense:  the ratio of the nth
> >>>> result's score to
> >>>> the best result's score remains the same.  Therefore, the
> >>>> only question
> >>>> is what normalization factor to apply to all scores.  This
> >>>> reduces to a
> >>>> very specific question that determines the entire
> >>>> normalization -- what should the score of the best matching
> >>>> result be?
> >>>>
> >>>> The mechanism below has this property, i.e. it keeps the
> >>>> current score
> >>>> ratios, except that I removed one idf term for reasons
> >>>> covered earlier
> >>>> (this better reflects the current empirically best scoring
> >>>> algorithms).
> >>>> However, removing an idf is a completely separate issue.  The
> >>>> improved
> >>>> normalization is independent of whether or not that change is
> >>>> made.
> >>>>
> >>>> For the central question of what the top score should be,
> >>>> upon reflection, I don't like the definition below.  It
> >>>> defined the top score
> >>>> as (approximately) the percentage of query terms matched by
> >>>> the top scoring result.  A better conceptual definition is to
> >>>> use a weighted average based on the boosts.  I.e., downward
> >>>> propagate all boosts to the
> >>>> underlying terms (or phrases).  Specifically, the "net boost"
> >>>> of a term
> >>>> is the product of the direct boost of the term and all boosts
> >>>> applied to
> >>>> encompassing clauses.  Then the score of the top result
> >>>> becomes the sum
> >>>> of the net boosts of its matching terms divided by the sum of
> >>>> the net boosts of all query terms.
> >>>>
> >>>> This definition is a refinement of the original proposal
> >>>> below, and it
> >>>> could probably be further refined if some impact of the tf,
> >>>> idf and/or
> >>>> lengthNorm was desired in determining the top score.  These
> >>>> other factors seem to be harder to normalize for, although
> >>>> I've thought of some simple approaches; e.g., assume the
> >>>> unmatched terms in the top result have values for these three
> >>>> factors that are the average of the
> >>>> values of the matched terms, then apply exactly the same
> >>>> concept of actual score divided by theoretical maximum
> >>>> score.  That would eliminate any need to maintain maximum
> >>>> value statistics in the index.
> >>>>
> >>>> As an example of the simple boost-based normalization, for the
> >>>> query ((a^2 b)^3 (c d^2)) the net boosts are:
> >>>> a --> 6, b --> 3, c --> 1, d --> 2.
> >>>> So if a and b matched, but not c and d, in the top scoring
> >>>> result, its score would be (6+3)/(6+3+1+2) = 0.75.  The
> >>>> normalizer would be 0.75/(current score except for the current
> >>>> normalization).  This normalizer would be applied to all current
> >>>> scores (minus normalization) to create the normalized scores.
> >>>>
> >>>> For simple query (a b), if only one of the terms matched in
> >>>> the top result, then its score would be 0.5, vs. 1.0 or many
> >>>> other possible scores today.
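
Worked out in code, the boost propagation and top-score computation for
that example might look like this rough sketch (the query tree is
hard-coded and all names are made up, not Lucene API):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    // Net boosts for the query ((a^2 b)^3 (c d^2)): each term's net
    // boost is its own boost times the boosts of all enclosing clauses.
    public class NetBoostExample {
        static Map<String, Double> netBoosts() {
            Map<String, Double> nb = new LinkedHashMap<>();
            nb.put("a", 2.0 * 3.0); // 6
            nb.put("b", 1.0 * 3.0); // 3
            nb.put("c", 1.0);       // 1
            nb.put("d", 2.0);       // 2
            return nb;
        }

        // Top score = sum of net boosts of matched terms divided by the
        // sum of net boosts of all query terms.
        static double topScore(Set<String> matched) {
            double matchedSum = 0.0, totalSum = 0.0;
            for (Map.Entry<String, Double> e : netBoosts().entrySet()) {
                totalSum += e.getValue();
                if (matched.contains(e.getKey())) matchedSum += e.getValue();
            }
            return matchedSum / totalSum;
        }

        public static void main(String[] args) {
            System.out.println(topScore(Set.of("a", "b"))); // (6+3)/12 = 0.75
        }
    }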
> >>>>
> >>>> In addition to enabling more "intelligent" UI's that
> >>>> communicate the quality of results to end-users, the proposal
> >>>> below also extends the explain() mechanism to fully explain
> >>>> the final normalized score. However, that change is also
> >>>> independent -- it could be done with the current scoring.
> >>>>
> >>>> Am I the only one who would like to see better normalization
> >>>> in Lucene? Does anybody have a better approach?
> >>>>
> >>>> If you've read this far, thanks for indulging me on this.  I
> >>>> would love
> >>>> to see this or something with similar properties in Lucene.
> >>>> I really just want to build my app, but as stated below would
> >>>> write and contribute this if there is interest in putting it
> >>>> in, and nobody else
> >>>> wants to write it.  Please let me know what you think one way
> >>>> or the other.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Chuck
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Chuck Williams
> >>>>> Sent: Monday, October 18, 2004 7:04 PM
> >>>>> To: 'Lucene Developers List'
> >>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
> >>>>>
> >>>>> Doug Cutting wrote:
> >>>>>> If this is a big issue for you, as it seems it is, please
> >>>>>> submit a patch to optionally disable score normalization in
> >>>>>> Hits.java.
> >>>>>>
> >>>>> and:
> >>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
> >>>>>> document each time a document is added to the collection, since
> >>>>>> 'weight(t,d)' is dependent on global term statistics.  This is
> >>>>>> prohibitively expensive.  Research has also demonstrated that
> >>>>>> such cosine normalization gives somewhat inferior results
> >>>>>> (e.g., Singhal's pivoted length normalization).
> >>>>>
> >>>>> I'm willing to write, test and contribute code to address the
> >>>>> normalization issue, i.e. to yield scores in [0, 1] that are
> >>>>> meaningful across searches.  Unfortunately, this is considerably
> >>>>> more involved than just optionally eliminating the current
> >>>>> normalization in Hits.  Before undertaking this, I'd like to see
> >>>>> if there is agreement in principle that this is a good idea, and
> >>>>> that my specific proposal below is the right way to go.  Also,
> >>>>> I'd like to make sure I've correctly inferred the constraints on
> >>>>> writing code to be incorporated into Lucene.
> >>>>>
> >>>>> After looking at this in more detail I agree that the cosine
> >>>>> normalization is not the way to go, because of both efficiency
> >>>>> and effectiveness considerations.  A conceptual approach that
> >>>>> would be efficient, relatively easy to implement, and seems to
> >>>>> have at least reasonable behavior would be to define the top
> >>>>> scoring match to have a score roughly equal to the percentage of
> >>>>> query terms it matches (its "netCoord").  Scores below the top
> >>>>> hit would be reduced based on the ratio of their raw scores to
> >>>>> the raw score of the top hit (considering all of the current
> >>>>> score factors, except that I'd like to remove one of the idf
> >>>>> factors, as discussed earlier).
> >>>>>
> >>>>> For a couple simple cases:
> >>>>> a) the top match for a single term query would always have a
> >>>>>    score of 1.0,
> >>>>> b) the top scoring match for a BooleanQuery using
> >>>>>    DefaultSimilarity with all non-prohibited TermQuery clauses
> >>>>>    would have a score of m/n, where the hit matches m of the n
> >>>>>    terms.
> >>>>>
> >>>>> This isn't optimal, but seems much better than the current
> >>>>> situation.  Consider two single-term queries, s and t.  If s
> >>>>> matches more strongly than t in its top hit (e.g., a higher tf
> >>>>> in a shorter field), it would be best if the top score of s was
> >>>>> greater than the top score of t.  But this kind of normalization
> >>>>> would require keeping additional statistics that so far as I
> >>>>> know are not currently in the index, like the maximum tf for
> >>>>> terms and the minimum length for fields.  These could be
> >>>>> expensive to update on deletes.  Also, normalizing by such
> >>>>> factors could yield lower than subjectively reasonable scores in
> >>>>> most cases, so it's not clear it would be better.
> >>>>>
> >>>>> The semantics above are at least clean, easy to understand, and
> >>>>> support what seems to me to be the most important motivation to
> >>>>> do this: allowing an application to use simple thresholding to
> >>>>> segregate likely-to-be-relevant hits from
> >>>>> likely-to-be-irrelevant hits.
> >>>>>
> >>>>> More specifically, for a BooleanQuery of TermQuery's the
> >>>>> resulting score functions would be:
> >>>>>
> >>>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
> >>>>>
> >>>>> baseScore(sbq, doc) =
> >>>>> sum(tqi) boost(tqi)*idf(tqi.term)*tf(tqi.term, doc)*
> >>>>> lengthNorm(tqi.term.field, doc)
> >>>>>
> >>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore
> >>>>>
> >>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> >>>>>
> >>>>> score(sbq, doc) = rawScore * norm
> >>>>>
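
To make those formulas concrete, here is a rough sketch of what they
compute for one document (the per-term arrays stand in for the
Similarity calls; none of this is actual Lucene API):

    // Proposed scoring for a BooleanQuery of n TermQuery's; boosts[i],
    // idfs[i], tfs[i] and lengthNorms[i] are per-term values for one doc.
    public class ProposedScoring {
        static double baseScore(double[] boosts, double[] idfs,
                                double[] tfs, double[] lengthNorms) {
            double sum = 0.0;
            for (int i = 0; i < boosts.length; i++) {
                sum += boosts[i] * idfs[i] * tfs[i] * lengthNorms[i];
            }
            return sum;
        }

        // coord = fraction of query terms matched (tf > 0).
        static double coord(double[] tfs) {
            int matched = 0;
            for (double tf : tfs) if (tf > 0.0) matched++;
            return (double) matched / tfs.length;
        }

        static double rawScore(double[] boosts, double[] idfs,
                               double[] tfs, double[] lengthNorms) {
            return coord(tfs) * baseScore(boosts, idfs, tfs, lengthNorms);
        }

        // Final score: rawScore divided by the maximum baseScore over
        // all hits, so the top hit's score is bounded by its coord.
        static double score(double rawScore, double maxBaseScore) {
            return rawScore / maxBaseScore;
        }
    }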
> >>>>> rawScore's can be computed in the Scorer.score() methods and
> >>>>> therefore used to sort the hits (e.g., in the instance method
> >>>>> for collect() in the HitCollector in IndexSearcher.search()).
> >>>>> If the top scoring hit does not have the highest baseScore, then
> >>>>> its score could be less than its coord; this seems desirable.
> >>>>> These formulas imply that no result score can be larger than its
> >>>>> coord, so if coord is well-defined (always between 0 and 1) then
> >>>>> all results will be normalized between 0 and 1.
> >>>>>
> >>>>> In general, the netCoord, which takes the place of coord in the
> >>>>> simple case above, needs to be defined for any query.
> >>>>> Conceptually, this should be the total percentage of query terms
> >>>>> matched by the document.  It must be recursively computable from
> >>>>> the subquery structure and overridable for specific Query types
> >>>>> (e.g., to support specialized coords, like one that is always
> >>>>> 1.0 as is useful in multi-field searching).  Suitable default
> >>>>> definitions for TermQuery and BooleanQuery are:
> >>>>>
> >>>>> TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
> >>>>>
> >>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
> >>>>>
> >>>>> This is not quite the percentage of terms matched; e.g.,
> >>>>> consider a BooleanQuery with two clauses, one of which is a
> >>>>> BooleanQuery of 3 terms and the other of which is just a term.
> >>>>> However, it doesn't seem to be unreasonable for a BooleanQuery
> >>>>> to state that its clauses are equally important, and this is
> >>>>> consistent with the current coord behavior.
> >>>>> BooleanQuery.netCoord could be overridden for special cases,
> >>>>> like the pure disjunction I use in my app for field expansions:
> >>>>>
> >>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
> >>>>>
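
A recursive netCoord along those lines could be sketched like this (the
Node hierarchy is a stand-in for the Query/Scorer classes, not actual
Lucene code):

    import java.util.List;

    // Fraction of a query subtree's terms matched by one document.
    abstract class Node {
        abstract double netCoord();
    }

    class TermNode extends Node {
        final boolean matches;
        TermNode(boolean matches) { this.matches = matches; }
        double netCoord() { return matches ? 1.0 : 0.0; }
    }

    class BooleanNode extends Node {
        final List<Node> clauses;
        BooleanNode(List<Node> clauses) { this.clauses = clauses; }
        // Each clause contributes equally: coord(1, n) = 1/n.
        double netCoord() {
            double sum = 0.0;
            for (Node c : clauses) sum += c.netCoord();
            return sum / clauses.size();
        }
    }

    // Pure disjunction: take the best clause rather than the average.
    class MaxDisjunctionNode extends Node {
        final List<Node> clauses;
        MaxDisjunctionNode(List<Node> clauses) { this.clauses = clauses; }
        double netCoord() {
            double max = 0.0;
            for (Node c : clauses) max = Math.max(max, c.netCoord());
            return max;
        }
    }

    class NetCoordDemo {
        public static void main(String[] args) {
            // Query (a (b c)): a and b match, c does not.
            Node q = new BooleanNode(List.of(
                new TermNode(true),
                new BooleanNode(List.of(new TermNode(true),
                                        new TermNode(false)))));
            System.out.println(q.netCoord()); // (1 + 1/2) / 2 = 0.75
        }
    }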
> >>>>> Implementing this would proceed along these lines:
> >>>>> 1.  For backwards compatibility, add some kind of newScoring
> >>>>>     boolean setting.
> >>>>> 2.  Update all of these places to behave as indicated if
> >>>>>     newScoring:
> >>>>>     a.  Change Query.weight() to not do any normalization (no
> >>>>>         call to sumOfSquaredWeights(), Similarity.queryNorm() or
> >>>>>         normalize()).
> >>>>>     b.  Update all Query.weight classes to set their value
> >>>>>         according to the terms in the score formula above that
> >>>>>         don't involve the document (e.g., boost*idf for
> >>>>>         TermQuery).
> >>>>>     c.  Add suitable netCoord definitions to all Scorer classes.
> >>>>>     d.  Update all Scorer.score() methods to compute the
> >>>>>         rawScore as above.
> >>>>>     e.  Add the maximum baseScore as a field kept on TopDocs,
> >>>>>         computed in the HitCollector's.
> >>>>>     f.  Change the normalization in Hits to always divide every
> >>>>>         raw score by the maximum baseScore.
> >>>>>     g.  Update all of the current explain() methods to be
> >>>>>         consistent with this scoring, and to either report the
> >>>>>         rawScore, or to report the final score if the
> >>>>>         normalization factor is provided.
> >>>>>     h.  Add Hits.explain() (or better extend Searcher so that it
> >>>>>         keeps the Hits for use in Searcher.explain()) to call
> >>>>>         the new explain variation with the normalization factor
> >>>>>         so that final scores are fully explained.
> >>>>>
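
Steps e and f might come out roughly like this (a sketch only; the
collector and hit types here are simplified stand-ins for Lucene's
HitCollector and Hits):

    import java.util.ArrayList;
    import java.util.List;

    // Remember the maximum baseScore while collecting, then divide
    // every raw score by it.
    class NormalizingCollector {
        static class Hit { int doc; double rawScore; double score; }

        private final List<Hit> hits = new ArrayList<>();
        private double maxBaseScore = 0.0;

        // Called once per matching document.
        void collect(int doc, double rawScore, double baseScore) {
            Hit h = new Hit();
            h.doc = doc;
            h.rawScore = rawScore;
            hits.add(h);
            if (baseScore > maxBaseScore) maxBaseScore = baseScore;
        }

        // Final normalization, applied after all hits are collected.
        List<Hit> normalizedHits() {
            for (Hit h : hits) {
                h.score = maxBaseScore > 0.0 ? h.rawScore / maxBaseScore : 0.0;
            }
            return hits;
        }
    }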
> >>>>> If this seems like a good idea, please let me know.  I'm sure
> >>>>> there are details I've missed that would come out during coding
> >>>>> and testing.  Also, the value of this is dependent on how
> >>>>> reasonable the final scores look, which is hard to tell for sure
> >>>>> until it is working.
> >>>>>
> >>>>> The coding standards for Lucene seem reasonably clear from the
> >>>>> source code I've read.  I could use just simple Java so there
> >>>>> shouldn't be any significant JVM dependencies.  The above should
> >>>>> be fully backward compatible due to the newScoring flag.
> >>>>>
> >>>>> On another note, I had to remove the German analyzer in my
> >>>>> current 1.4.2 source configuration because GermanStemmer failed
> >>>>> to compile due to what are apparently Unicode character
> >>>>> constants that I've now got as illegal two-character character
> >>>>> constants.  This is presumably an encoding problem somewhere
> >>>>> that I could track down.  It's not important, but if the answer
> >>>>> is obvious to any of you, I'd appreciate the quick tip.
> >>>>>
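A guess: javac is assuming a different source encoding than the file
actually uses.  If the stemmer source is, say, ISO-8859-1, passing the
encoding explicitly usually fixes it (file name shown for illustration):

    javac -encoding ISO-8859-1 GermanStemmer.java
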
> >>>>> Thanks,
> >>>>>
> >>>>> Chuck
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Doug Cutting [mailto:cutting@apache.org]
> >>>>>> Sent: Monday, October 18, 2004 9:44 AM
> >>>>>> To: Lucene Developers List
> >>>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
> >>>>>>
> >>>>>> Chuck Williams wrote:
> >>>>>>> That's a good point on how the standard vector space inner
> >>>>>>> product similarity measure does imply that the idf is squared
> >>>>>>> relative to the document tf.  Even having been aware of this
> >>>>>>> formula for a long time, this particular implication never
> >>>>>>> occurred to me.  Do you know if anybody has done
> >>>>>>> precision/recall or other relevancy empirical measurements
> >>>>>>> comparing this vs. a model that does not square idf?
> >>>>>>
> >>>>>> No, not that I know of.
> >>>>>>
> >>>>>>> Regarding normalization, the normalization in Hits does not
> >>>>>>> have very nice properties.  Due to the > 1.0 threshold check,
> >>>>>>> it loses information, and it arbitrarily defines the highest
> >>>>>>> scoring result in any list that generates scores above 1.0 as
> >>>>>>> a perfect match.  It would be nice if score values were
> >>>>>>> meaningful independent of searches, e.g., if "0.6" meant the
> >>>>>>> same quality of retrieval independent of what search was
> >>>>>>> done.  This would allow, for example, sites to use a simple
> >>>>>>> quality threshold to only show results that were "good
> >>>>>>> enough".  At my last company (I was President and head of
> >>>>>>> engineering for InQuira), we found this to be important to
> >>>>>>> many customers.
> >>>>>>>
> >>>>>> If this is a big issue for you, as it seems it is, please
> >>>>>> submit a patch to optionally disable score normalization in
> >>>>>> Hits.java.
> >>>>>>
> >>>>>>> The standard vector space similarity measure includes
> >>>>>>> normalization by the product of the norms of the vectors,
> >>>>>>> i.e.:
> >>>>>>>
> >>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
> >>>>>>>     sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> >>>>>>
> >>>>>>> This makes the score a cosine, which, since the values are
> >>>>>>> all positive, forces it to be in [0, 1].  The sumOfSquares()
> >>>>>>> normalization in Lucene does not fully implement this.  Is
> >>>>>>> there a specific reason for that?
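
In code, the full cosine form reads roughly like this sketch (qWeights
and dWeights are precomputed weights over a shared term vocabulary,
made up for illustration):

    public class CosineSketch {
        static double cosineScore(double[] qWeights, double[] dWeights) {
            double dot = 0.0, qNormSq = 0.0, dNormSq = 0.0;
            for (int t = 0; t < qWeights.length; t++) {
                dot += qWeights[t] * dWeights[t];
                qNormSq += qWeights[t] * qWeights[t];
                dNormSq += dWeights[t] * dWeights[t];
            }
            // All weights are non-negative, so the result lies in [0, 1].
            return dot / Math.sqrt(qNormSq * dNormSq);
        }
    }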
> >>>>>
> >>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
> >>>>>> document each time a document is added to the collection, since
> >>>>>> 'weight(t,d)' is dependent on global term statistics.  This is
> >>>>>> prohibitively expensive.  Research has also demonstrated that
> >>>>>> such cosine normalization gives somewhat inferior results
> >>>>>> (e.g., Singhal's pivoted length normalization).
> >>>>>
> >>>>>>> Re. explain(), I don't see a downside to extending it to show
> >>>>>>> the final normalization in Hits.  It could still show the raw
> >>>>>>> score just prior to that normalization.
> >>>>>>>
> >>>>>> In order to normalize scores to 1.0 one must know the maximum
> >>>>>> score.  Explain only computes the score for a single document,
> >>>>>> and the maximum score is not known.
> >>>>>>
> >>>>>>> Although I think it would be best to have a normalization
> >>>>>>> that would render scores comparable across searches.
> >>>>>>
> >>>>>> Then please submit a patch.  Lucene doesn't change on its own.
> >>>>>>
> >>>>>> Doug
> >>>>>>
> >>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org