Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/11/20 16:56:14 UTC

Whither Query Norm?

For a long time now, we've been telling people not to compare scores across queries, yet we maintain the queryNorm() code as an attempt to do this, and the javadocs even promote it.  I'm in the process of researching this some more (references welcome), but wanted to hear what people think about it here.  I haven't profiled it just yet, but it seems like a good chunk of wasted computation to me (loops, divisions and square roots).  At a minimum, I think we might be able to refactor the callback mechanism for it just as we did for the collectors, such that we push the actual calculation of the sum of squares into Similarity, instead of just doing 1/sqrt(sumSqs).  That way, when people want to override queryNorm() to return 1, they save more than just the 1/sqrt calculation.  I haven't tested it yet, but wanted to find out what others think.
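
For concreteness, the override I'm talking about is roughly this - just a
sketch against the current Similarity API, not a proposed patch:

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: turn off the query norm entirely.  Relative ranking within a
    // single query is unchanged, since queryNorm scales every hit by the same
    // constant.  Note that the Weight still computes sumOfSquaredWeights()
    // before handing it to us - that's the wasted work described above.
    public class NoQueryNormSimilarity extends DefaultSimilarity {
      public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f;
      }
    }

    // e.g.: searcher.setSimilarity(new NoQueryNormSimilarity());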

Thoughts?

-Grant


Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Jake Mannix wrote:
>
>
> On Fri, Nov 20, 2009 at 4:09 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>  
>
>     But cosine has two norms? The query norm and the document norm -
>     taking
>     the two vectors to the unit space - it looks expensive to me to do
>     both
>     of them properly. The IR lit I've seen fudges them down to Root(L)
>     even
>     in the non-incremental case for the document vector normalization.
>
>
> Lucene has two norms too: lengthNorm (of the field), produced at index
> time,
> and queryNorm.  The expense of queryNorm is done once per query, not
> once per document, and so as you yourself say, is not a performance
> problem.  The lengthNorm is factored into the document at index time, and
> so is not a hit at *query time* at all
Right - Lucene's two norms come from the cosine similarity. The
queryNorm is our version of the Euclidean length of the query vector.
The lengthNorm is our version of the Euclidean length of the document
vector. I don't think tf/idf systems that use cosine similarity, even
less-fudged cosine similarity, generally produce scores that are
actually the cosine - the cosine normalization is just a factor to help
remove document length from the score, not actually produce the score.
And there are a few different options for how the document length
normalization is done. I've never seen it done exactly in the lit though -
it's always an approximation.

I know lengthNorm is not a hit at query time - but it can be a
performance issue at index time as well. And it can be even worse with
incremental indexing, as you said - but impossible?
>  
>
>     > (the norm Lucene does
>     > use on the query side) of fields - 1/sqrt(sum_{i}(tf_(term_i)^2 *
>     > idf(term_i)^2)) -
>     > is impossible to calculate.
>     But isn't that *not* the query side norm? Looks like the document
>     vector
>     norm to me.
>
>
> This is exactly the query norm - 1/sqrt(sumOfSquaredWeights), and
> sum of squared weights is just sum of tf^2 * idf^2, except queries have
> "boost" in place of "tf".  The document vector norm that *Lucene* uses is
> independent of idf of the terms in the document (it's 1/sqrt(doc_length)
> for DefaultSimilarity).
Right - but that's not the real doc norm. That's a fake. You won't get the
real cosine with it. You'd have to do the real document norm. I was
talking not in terms of Lucene, but about cosine similarity in general.
>  
>
>     Where are you doing the correct document vector normalization to
>     get the
>     true cosine? I'm not following. It looks like the above formula is the
>     doc vector norm - so what about the query vector norm then?
>
>
> Ok, the technique I use to do this (I should really just refactor it
> to a point
> where I can post it for contrib/queries so that other people can use this
> if it's not that commonly known), is to, at index time, calculate the true
> document vector normalizations for the fields, and then feed them into
> their
> boosts, and make sure that Similarity does *not* do field
> normalization itself.
>
> The formula for query vector normalization and document vector
> normalization
> is *exactly* the same if you translate boosts of terms into
> termFrequency of
> said term in the query "document".  It's just applied to a different
> bag of
> words.
>  
>
>     >   a) angles,
>     >   b) numbers which are less than one, and
>     >   c) when they're equal to 1, you have a perfect match.
>
>  
>
>     When they think of cosine yes - but IR lit seems to fudge this -
>     none of
>     the common document length normalizations are real cosine - they are
>     just square root variations - with Lucene's being the most simple
>     in the
>     family.
>
>
> This is true, which is why I'm not really giving a better wording than
> what
> Doron already wrote - I doubt I could do better.  I'm concerned myself
> because I've seen lots of questions over the years on the lists of people
> asking about how to compare scores across queries, no matter how much
> we tell them not to.  And now here again we're telling them "they're not
> comparable, but they're more comparable than if we didn't have this
> queryNorm in there!".  What exactly does that mean?  I do search, and
> I'm not really sure what it means; what is someone who is just learning
> this stuff going to think?
Who knows? Most of us working on this don't know - good luck to those
just learning :)
>  
>
>
>     >
>     > I think it's way better than it used to be (although it's even more
>     > dense, which
>     > I guess is required, given that it's a dense subject), but I really
>     > don't think
>     > we want to let people think it's a cosine - it's "like" a
>     cosine, but
>     > with a
>     > rather different normalization.
>
>  
>
>     We can amp that - but like I said, I think that's standard practice -
>     in the IR books I've seen, they say, here is the cosine formula
>     for term
>     vector scoring, and then they go on to fudge in a manner very
>     similar to
>     how Lucene does it.
>
>
> Well just because it's standard practice, doesn't mean it's a good idea.
> I came to search from a math background, not CS, so when I see "vector
> space" and "cosine", I maybe think different things than the average coder
> learning IR.  So maybe I'm not the right person to ask how such things
> should be explained!
As good as anyone else here :) I'm taking my view of it from the
IR books I'm looking at though. They don't appear to consider cosine
similarity to require that the score actually be the cosine - it's just a
factor in the score. And they fudge that factor with shortcuts.
>  
>
>     > Doron's got it stated like that, already, but
>     > it needs to be crystal clear that scores are sometimes higher
>     than 1,
>     > and in
>     > fact the "best" score for one query may be rather different than the
>     > "best"
>     > for another query.
>
>  
>
>     I think this is general knowledge with Lucene, but that may be a
>     stretch
>     - perhaps it's not as widely known as I now assume for new users. Hits
>     used
>     to normalize to a 0-1 score actually, but it's no longer around.
>
>
> It was definitely a good idea to get rid of normalizing by the top
> score, but
> what could replace it is normalizing by the "best possible score". 
> This tells
> you how close to a perfect match you could have gotten.  For cosine,
> this would
> just be the cosine of the angle from the document vector to the query
> vector,
> for Lucene, it would be something similar, but would have absolute
> meaning
> (maybe... maybe not, because the theoretical best score is likely to
> be far
> from the actual best score in a given corpus).
>
>   -jake




Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Nov 20, 2009 at 4:09 PM, Mark Miller <ma...@gmail.com> wrote:


> But cosine has two norms? The query norm and the document norm - taking
> the two vectors to the unit space - it looks expensive to me to do both
> of them properly. The IR lit I've seen fudges them down to Root(L) even
> in the non-incremental case for the document vector normalization.
>

Lucene has two norms too: lengthNorm (of the field), produced at index time,
and queryNorm.  The expense of queryNorm is done once per query, not
once per document, and so as you yourself say, is not a performance
problem.  The lengthNorm is factored into the document at index time, and
so is not a hit at *query time* at all


> > (the norm Lucene does
> > use on the query side) of fields - 1/sqrt(sum_{i}(tf_(term_i)^2 *
> > idf(term_i)^2)) -
> > is impossible to calculate.
> But isn't that *not* the query side norm? Looks like the document vector
> norm to me.
>

This is exactly the query norm - 1/sqrt(sumOfSquaredWeights), and
sum of squared weights is just sum of tf^2 * idf^2, except queries have
"boost" in place of "tf".  The document vector norm that *Lucene* uses is
independent of idf of the terms in the document (it's 1/sqrt(doc_length)
for DefaultSimilarity).
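
(Spelled out as standalone arithmetic - an illustration of the formula only,
not Lucene's actual Weight code:)

    public class QueryNormMath {
      // queryNorm = 1/sqrt(sum_i (idf_i * boost_i)^2), where the boost plays
      // the role on the query side that tf plays on the document side.
      public static float queryNorm(float[] idfs, float[] boosts) {
        float sumOfSquaredWeights = 0f;
        for (int i = 0; i < idfs.length; i++) {
          float queryWeight = idfs[i] * boosts[i];
          sumOfSquaredWeights += queryWeight * queryWeight;
        }
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
      }
    }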


> Where are you doing the correct document vector normalization to get the
> true cosine? I'm not following. It looks like the above formula is the
> doc vector norm - so what about the query vector norm then?
>

Ok, the technique I use to do this (I should really just refactor it to a
point where I can post it for contrib/queries so that other people can use
this if it's not that commonly known), is to, at index time, calculate the
true document vector normalizations for the fields, and then feed them into
their boosts, and make sure that Similarity does *not* do field
normalization itself.

The formula for query vector normalization and document vector normalization
is *exactly* the same if you translate boosts of terms into termFrequency of
said term in the query "document".  It's just applied to a different bag of
words.
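
(A sketch of what I mean - these helpers and their inputs are made up for
illustration, not the contrib code; it assumes you've already tokenized the
field into term counts, and that you have the externalized idf map I
describe below:)

    import java.util.Map;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.DefaultSimilarity;

    public class CosineNormalizer {

      // With lengthNorm flattened to 1, the boost set below becomes the only
      // per-field normalization folded into the norms (still subject to the
      // one-byte norm precision caveat).
      public static class FlatLengthNormSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
          return 1.0f;
        }
      }

      // Sets boost = 1/sqrt(sum_i (tf_i^2 * idf_i^2)), the true Euclidean
      // norm of the field's term vector.
      public static void normalize(Field field, Map<String, Integer> termFreqs,
                                   Map<String, Float> idfMap) {
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
          Float idf = idfMap.get(e.getKey());
          double w = e.getValue() * (idf == null ? 1.0 : idf); // tf * idf
          sumSq += w * w;
        }
        if (sumSq > 0.0) {
          field.setBoost((float) (1.0 / Math.sqrt(sumSq)));
        }
      }
    }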


> >   a) angles,
> >   b) numbers which are less than one, and
> >   c) when they're equal to 1, you have a perfect match.
>


> When they think of cosine yes - but IR lit seems to fudge this - none of
> the common document length normalizations are real cosine - they are
> just square root variations - with Lucene's being the most simple in the
> family.
>

This is true, which is why I'm not really giving a better wording than what
Doron already wrote - I doubt I could do better.  I'm concerned myself
because I've seen lots of questions over the years on the lists of people
asking about how to compare scores across queries, no matter how much
we tell them not to.  And now here again we're telling them "they're not
comparable, but they're more comparable than if we didn't have this
queryNorm in there!".  What exactly does that mean?  I do search, and
I'm not really sure what it means; what is someone who is just learning
this stuff going to think?


>
> >
> > I think it's way better than it used to be (although it's even more
> > dense, which
> > I guess is required, given that it's a dense subject), but I really
> > don't think
> > we want to let people think it's a cosine - it's "like" a cosine, but
> > with a
> > rather different normalization.
>


> We can amp that - but like I said, I think that's standard practice -
> in the IR books I've seen, they say, here is the cosine formula for term
> vector scoring, and then they go on to fudge in a manner very similar to
> how Lucene does it.
>

Well just because it's standard practice, doesn't mean it's a good idea.
I came to search from a math background, not CS, so when I see "vector
space" and "cosine", I maybe think different things than the average coder
learning IR.  So maybe I'm not the right person to ask how such things
should be explained!


> > Doron's got it stated like that, already, but
> > it needs to be crystal clear that scores are sometimes higher than 1,
> > and in
> > fact the "best" score for one query may be rather different than the
> > "best"
> > for another query.
>


> I think this is general knowledge with Lucene, but that may be a stretch
> - perhaps it's not as widely known as I now assume for new users. Hits used
> to normalize to a 0-1 score actually, but it's no longer around.
>

It was definitely a good idea to get rid of normalizing by the top score,
but what could replace it is normalizing by the "best possible score".
This tells you how close to a perfect match you could have gotten.  For
cosine, this would just be the cosine of the angle from the document vector
to the query vector; for Lucene, it would be something similar, but would
have absolute meaning (maybe... maybe not, because the theoretical best
score is likely to be far from the actual best score in a given corpus).
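
(One rough way to estimate a "best possible score", with made-up names:
index the query text itself as a one-document index and score the query
against it.  The idf in a one-doc index won't match the real corpus - which
is exactly the theoretical-vs-actual gap above - so treat this purely as a
sketch:)

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class BestScoreEstimator {
      // Score the query against a synthetic "perfect" document (a doc that
      // is exactly the query text); dividing real scores by this gives a
      // rough "how close to perfect" number.
      public static float bestPossibleScore(Query query, String field,
                                            String queryText) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field(field, queryText, Field.Store.NO,
                          Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
        IndexSearcher searcher = new IndexSearcher(dir, true);
        try {
          TopDocs hits = searcher.search(query, 1);
          return hits.totalHits > 0 ? hits.scoreDocs[0].score : 1.0f;
        } finally {
          searcher.close();
        }
      }
    }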

  -jake

Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Nov 24, 2009 at 9:31 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hello,
>
> Regarding that monstrous term->idf map.
> Is this something that one could use to adjust the scores in
> http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations scenario?  Say a map like that was created periodically for each shard and
> distributed to all other nodes (so in the end each node has all maps
> locally).  Couldn't the local scorer in the Solr instance (and in
> distributed Lucene setup) consult idfs for relevant terms in all those maps
> and adjust the scores of local scores before returning results?
>
>
Why would you want all nodes to have all maps?  Why not merge them into one
map, then redistribute that out to all nodes - it would be far smaller than
many maps anyway?  Then yes, the scoring can be done locally using this big
idfMap to produce scores, instead of using reader.docFreq() for idf; that's
what I do.  But then what are you implying should be done?  Just rescale the
top scores based on the idfs before returning your top results?  You'd need
to know exactly which terms hit those top-scoring documents, right?  Which
implies the cost of basically explain(), doesn't it?

Although with the per-field scoring (the thing I do to be able to train on
sub-query field match scores), this gets easier, because then you can try
to hang onto this information if the query isn't too big, but this isn't
something normal BooleanQueries will handle for you naturally.

  -jake

Re: Whither Query Norm?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

Regarding that monstrous term->idf map.
Is this something that one could use to adjust the scores in http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations scenario?  Say a map like that was created periodically for each shard and distributed to all other nodes (so in the end each node has all maps locally).  Couldn't the local scorer in the Solr instance (and in distributed Lucene setup) consult idfs for relevant terms in all those maps and adjust the scores of local scores before returning results?

Otis

>From: Jake Mannix <ja...@gmail.com>
>To: java-dev@lucene.apache.org
>Sent: Fri, November 20, 2009 7:49:34 PM
>Subject: Re: Whither Query Norm?
>
>On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <ma...@gmail.com> wrote:
>
>>Mark Miller wrote:
>>Okay - I guess that somewhat makes sense - you can calculate the
>>magnitude of the doc vectors at index time. How is that impossible with
>>incremental indexing though? Isn't it just expensive? Seems somewhat
>>expensive in the non-incremental case as well - you're just eating it at
>>index time rather than query time - though the same could be done for
>>incremental? The information is all there in either case.
>
>Ok, I think I see what you were imagining I was doing: you take the current
>state of the index as gospel for idf (when the index is already large, this
>is a good approximation), and look up these factors at index time - this
>means grabbing docFreq(Term) for each term in my document, and yes,
>this would be very expensive, I'd imagine.  I've done it by pulling a
>monstrous (the most common 1-million terms, say) Map<String, Float>
>(effectively) outside of lucene entirely, which gives term idfs, and housing
>this in memory so that computing field norms for cosine is a very fast
>operation at index time.
>
>Doing it like this is hard from scratch, but is fine incrementally, because
>I've basically fixed idf using some previous corpus (and update the idfMap
>every once in a while, in cases where it doesn't change much).  This has
>the effect of also providing a global notion of idf in a distributed corpus.
>
>  -jake



Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <ma...@gmail.com> wrote:

> Mark Miller wrote:
> Okay - I guess that somewhat makes sense - you can calculate the
> magnitude of the doc vectors at index time. How is that impossible with
> incremental indexing though? Isn't it just expensive? Seems somewhat
> expensive in the non-incremental case as well - you're just eating it at
> index time rather than query time - though the same could be done for
> incremental? The information is all there in either case.
>
>
Ok, I think I see what you were imagining I was doing: you take the current
state of the index as gospel for idf (when the index is already large, this
is a good approximation), and look up these factors at index time - this
means grabbing docFreq(Term) for each term in my document, and yes,
this would be very expensive, I'd imagine.  I've done it by pulling a
monstrous (the most common 1-million terms, say) Map<String, Float>
(effectively) outside of lucene entirely, which gives term idfs, and housing
this in memory so that computing field norms for cosine is a very fast
operation at index time.

Doing it like this is hard from scratch, but is fine incrementally, because
I've basically fixed idf using some previous corpus (and update the idfMap
every once in a while, in cases where it doesn't change much).  This has
the effect of also providing a global notion of idf in a distributed corpus.
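
(For the curious, building such a map from an existing index is
straightforward - a sketch using DefaultSimilarity's idf formula, ignoring
the per-field distinction and the keep-only-the-top-N pruning a real
version would want:)

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class IdfMapBuilder {
      // Walk the term dictionary once and snapshot an idf per term text.
      public static Map<String, Float> buildIdfMap(IndexReader reader)
          throws Exception {
        Map<String, Float> idfMap = new HashMap<String, Float>();
        int numDocs = reader.numDocs();
        TermEnum terms = reader.terms();
        try {
          while (terms.next()) {
            // Same formula as DefaultSimilarity.idf().
            float idf = (float)
                (Math.log(numDocs / (double) (terms.docFreq() + 1)) + 1.0);
            idfMap.put(terms.term().text(), idf);
          }
        } finally {
          terms.close();
        }
        return idfMap;
      }
    }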

  -jake



Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Nov 20, 2009 at 5:02 PM, Mark Miller <ma...@gmail.com> wrote:

> Go back and put it in after you have all the documents for that commit
> point. Or on reader load, calculate it.
>

Ah!  Now I see what you mean by "expensive". :)  Basically run through all
your documents you've indexed all over again, fixing their norms.  That's
pretty expensive.  Not as bad as doing it at reader load, but still pretty
harsh, but yes, not impossible!


 -jake

Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Go back and put it in after you have all the documents for that commit  
point. Or on reader load, calculate it.

- Mark

http://www.lucidimagination.com (mobile)

On Nov 20, 2009, at 7:56 PM, Jake Mannix <ja...@gmail.com> wrote:

>
>
> On Fri, Nov 20, 2009 at 4:51 PM, Mark Miller <ma...@gmail.com>  
> wrote:
> Okay - my fault - I'm not really talking in terms of Lucene. Though  
> even
> there I consider it possible. You'd just have to like, rewrite it :)  
> And
> it would likely be pretty slow.
>
> Rewrite it how?  When you index the very first document, the docFreq  
> of all
> terms is 1, out of numDocs = 1 docs in the corpus.  Everybody's idf  
> is the same.
> No matter how you normalize this, it'll be wrong, once you've  
> indexed a million
> documents.  This isn't a matter of Lucene architecture, it's a matter
> of idf being a value that is only exactly available at query time (you
> can approximate it partway through indexing, but you don't know it at
> all when you start).
>
>   -jake

Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Nov 20, 2009 at 4:51 PM, Mark Miller <ma...@gmail.com> wrote:

> Okay - my fault - I'm not really talking in terms of Lucene. Though even
> there I consider it possible. You'd just have to like, rewrite it :) And
> it would likely be pretty slow.
>

Rewrite it how?  When you index the very first document, the docFreq of all
terms is 1, out of numDocs = 1 docs in the corpus.  Everybody's idf is the
same.
No matter how you normalize this, it'll be wrong once you've indexed a
million documents.  This isn't a matter of Lucene architecture; it's a
matter of idf being a value that is only exactly available at query time
(you can approximate it partway through indexing, but you don't know it at
all when you start).

  -jake

Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Okay - my fault - I'm not really talking in terms of Lucene. Though even
there I consider it possible. You'd just have to like, rewrite it :) And
it would likely be pretty slow.


Jake Mannix wrote:
>
>
> On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Mark Miller wrote:
>     >
>     > it looks expensive to me to do both
>     > of them properly.
>     Okay - I guess that somewhat makes sense - you can calculate the
>     magnitude of the doc vectors at index time. How is that impossible
>     with
>     incremental indexing though? Isn't it just expensive? Seems somewhat
>     expensive in the non-incremental case as well - you're just eating it at
>     index time rather than query time - though the same could be done for
>     incremental? The information is all there in either case.
>
>
> The expense, if you have the idfs of all terms in the vocabulary (keep
> them
> in the form of idf^2 for efficiency at index time), is pretty trivial,
> isn't it?  If
> you have a document with 1000 terms, it's maybe 3000 floating point
> operations, all CPU actions, in memory, no disk seeks. 
>
> What it does require is knowing, even when you have no documents yet
> on disk, what the idf of terms in the first few documents are.  Where do
> you know this, in Lucene, if you haven't externalized some notion of idf?
>
>   -jake


Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <ma...@gmail.com> wrote:

> Mark Miller wrote:
> >
> > it looks expensive to me to do both
> > of them properly.
> Okay - I guess that somewhat makes sense - you can calculate the
> magnitude of the doc vectors at index time. How is that impossible with
> incremental indexing though? Isn't it just expensive? Seems somewhat
> expensive in the non-incremental case as well - you're just eating it at
> index time rather than query time - though the same could be done for
> incremental? The information is all there in either case.
>

The expense, if you have the idfs of all terms in the vocabulary (keep them
in the form of idf^2 for efficiency at index time), is pretty trivial, isn't
it?  If
you have a document with 1000 terms, it's maybe 3000 floating point
operations, all CPU actions, in memory, no disk seeks.

What it does require is knowing, even when you have no documents yet
on disk, what the idf of terms in the first few documents are.  Where do
you know this, in Lucene, if you haven't externalized some notion of idf?

  -jake



Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Mark Miller wrote:
>
> it looks expensive to me to do both
> of them properly. 
Okay - I guess that somewhat makes sense - you can calculate the
magnitude of the doc vectors at index time. How is that impossible with
incremental indexing though? Isn't it just expensive? Seems somewhat
expensive in the non-incremental case as well - you're just eating it at
index time rather than query time - though the same could be done for
incremental? The information is all there in either case.




Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Jake Mannix wrote:
>
>
> On Fri, Nov 20, 2009 at 2:50 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Jake Mannix wrote:
>     > Remember: we're not really doing cosine at all here.
>     This, I think, is fuzzy right? It seems to be common to still call
>     this
>     cosine scoring loosely - pretty much every practical impl fudges
>     things
>     somewhat when doing the normalization (though we are on the heavy side
>     of fudgers) - I think it's pretty rare to do the true cosine
>     because it's
>     so expensive. It can be somewhat misleading though.
>
>
> Cosine isn't expensive - it's just *impossible* to do when you're doing
> incremental indexing: to normalize to actually do cosine similarity
> requires
> knowing the idf at *index time*, so that the cosine norm
But cosine has two norms? The query norm and the document norm - taking
the two vectors to the unit space - it looks expensive to me to do both
of them properly. The IR lit I've seen fudges them down to Root(L) even
in the non-incremental case for the document vector normalization.
> (the norm Lucene does
> use on the query side) of fields - 1/sqrt(sum_{i}(tf_(term_i)^2 *
> idf(term_i)^2)) -
> is impossible to calculate.
But isn't that *not* the query side norm? Looks like the document vector
norm to me.
>
> Incremental indexing has its costs!
>  
> In cases where you *do* have either the exact idf values for your corpus's
> term-vocabulary (or a good way to approximate it, since it only varies
> logarithmically, this isn't too hard), then if you override lengthNorm
> to always
> return 1, and then take your document's Fields, and boost() them by
> the factor
> I listed above, then you can get exact (well, as exact as you can with
> only
> one-byte field-norms :\ [are we really that starved for RAM that we
> have to
> have that little precision?] ) cosine similarity, at really no
> additional *query-time*
> performance hit.  Just a little math done at indexing time.
Where are you doing the correct document vector normalization to get the
true cosine? I'm not following. It looks like the above formula is the
doc vector norm - so what about the query vector norm then?
>
>     Have you looked at the Similarity scoring explanation page that was
>     recently improved? Have any suggestions on changes to it? Doron put a
>     fair amount of work into improving it recently, but I think it could
>     always get better. It's currently leaning towards presenting this as
>     cosine - that seems in line with the few text books I've seen, but I'm
>     admittedly not that deep into any of this.
>
>
> Yeah, that page is a lot better than it used to be, but I'd really try
> not to
> call this cosine - when people think of cosines, they think of:
>
>   a) angles,
>   b) numbers which are less than one, and
>   c) when they're equal to 1, you have a perfect match. 
When they think of cosine yes - but IR lit seems to fudge this - none of
the common document length normalizations are real cosine - they are
just square root variations - with Lucene's being the most simple in the
family.
>
> We have a) - kinda.  We do not have b) or c), and we don't have scores
> which are really comparable, and maybe Grant's question should be
> answered with the statement: they shouldn't be:
Right - we have always said they are not really comparable. But there
are degrees of that. I think they have a degree of comparability - they
must if we can make them less comparable, and we can - drop queryNorm.
>
> If I query the google with "web" (getting a little less than three
> billion hits,
> so it maybe has an idf of 3 or 4), and find a document with 10 terms, one
> of which is "web" (and the others are equally common in terms of idf),
> then lucene says we should score that as:
>
>    idf(web) / sqrt(10) =~ 0.9,
>
> while cosine would say:
>
>   idf(web) / sqrt(1^2 * idf(web)^2 + tf_common_term2^2 * idf(term2)^2 + ...)
>
>     =~ 1/sqrt(10) = 0.31
>  
> If I instead query with "floccinaucinilipilification" (which comes up
> with
> about 45K hits, so maybe has an idf of 20), and find a document with
> 10 terms, one of which is floccinaucinilipilification, and the other terms
> are also similarly rare (also idf 20), lucene says the score is
>
>   idf(floccinaucinilipilification) / sqrt(10) =~ 6.5
>
> while cosine would say:
>
>   idf(floccinaucinilipilification) / sqrt(1^2 * idf(flocci...)^2 +
> tf_rareTerm2^2 * idf(rareTerm2)^2 + ...)
>
>     =~ 1/sqrt(10) = 0.31
>
> What is nice about *real* cosine is that it captures a sense of absolute
> scoring - the hit on the really rare word in a document which does not
> have a lot of other rare words (and is in general pretty short) is indeed
> measured by the fact that the score is close to 1.0 (perfect match),
> whereas the Lucene default scoring does not give a good measure -
> web scores close to 1.0 on its hit, and you only notice that this
> isn't as
> good as the rare word hit because that one scores way *higher* than 1.0.
>
> On the other hand, the thing which Lucene scoring does which cosine
> does *not* is measure the fact that hitting the rare word is *way* better
> than hit on a common word, from a user perspective, so giving them
> equal scores because their cosines are the same is maybe not the
> right thing to do (and so "yay Doug!", heh).
>
>     Have you looked at the Similarity scoring explanation page that was
>     recently improved? Have any suggestions on changes to it? Doron put a
>     fair amount of work into improving it recently, but I think it could
>     always get better. It's currently leaning towards presenting this as
>     cosine - that seems in line with the few text books I've seen, but I'm
>     admittedly not that deep into any of this.
>
>
> I think it's way better than it used to be (although it's even more
> dense, which
> I guess is required, given that it's a dense subject), but I really
> don't think
> we want to let people think it's a cosine - it's "like" a cosine, but
> with a
> rather different normalization. 
We can amp that - but like I said, I think that's standard practice -
in the IR books I've seen, they say, here is the cosine formula for term
vector scoring, and then they go on to fudge in a manner very similar to
how Lucene does it.

> Doron's got it stated like that, already, but
> it needs to be crystal clear that scores are sometimes higher than 1,
> and in
> fact the "best" score for one query may be rather different than the
> "best"
> for another query.
I think this is general knowledge with Lucene, but that may be a stretch
- perhaps it's not as widely known as I now assume for new users. Hits used
to normalize to a 0-1 score actually, but it's no longer around.
>
>   -jake




Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Nov 20, 2009 at 2:50 PM, Mark Miller <ma...@gmail.com> wrote:

> Jake Mannix wrote:
> > Remember: we're not really doing cosine at all here.
> This, I think, is fuzzy right? It seems to be common to still call this
> cosine scoring loosely - pretty much every practical impl fudges things
> somewhat when doing the normalization (though we are on the heavy side
> of fudgers) - I think it's pretty rare to do the true cosine because it's
> so expensive. It can be somewhat misleading though.
>

Cosine isn't expensive - it's just *impossible* to do when you're doing
incremental indexing: to normalize to actually do cosine similarity requires
knowing the idf at *index time*, so that the cosine norm (the norm Lucene
does use on the query side) of fields - 1/sqrt(sum_{i}(tf_(term_i)^2 *
idf(term_i)^2)) - is impossible to calculate.

Incremental indexing has its costs!

In cases where you *do* have either the exact idf values for your corpus's
term-vocabulary (or a good way to approximate it, since it only varies
logarithmically, this isn't too hard), then if you override lengthNorm to
always return 1, and then take your document's Fields, and boost() them by
the factor I listed above, then you can get exact (well, as exact as you can
with only one-byte field-norms :\ [are we really that starved for RAM that
we have to have that little precision?]) cosine similarity, at really no
additional *query-time* performance hit.  Just a little math done at
indexing time.

> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. Its currently leaning towards presenting this as
> cosine - that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.
>

Yeah, that page is a lot better than it used to be, but I'd really try not
to call this cosine - when people think of cosines, they think of:

  a) angles,
  b) numbers which are less than one, and
  c) when they're equal to 1, you have a perfect match.

We have a) - kinda.  We do not have b) or c), and we don't have scores
which are really comparable, and maybe Grant's question should be
answered with the statement: they shouldn't be:

If I query the google with "web" (getting a little less than three billion
hits, so it maybe has an idf of 3 or 4), and find a document with 10 terms, one
of which is "web" (and the others are equally common in terms of idf),
then lucene says we should score that as:

   idf(web) / sqrt(10) =~ 0.9,

while cosine would say:

  idf(web) / sqrt(1^2 * idf(web)^2 + tf_common_term2^2 * idf(term2)^2 + ...)

    =~ 1/sqrt(10) = 0.31

If I instead query with "floccinaucinilipilification" (which comes up with
about 45K hits, so maybe has an idf of 20), and find a document with
10 terms, one of which is floccinaucinilipilification, and the other terms
are also similarly rare (also idf 20), lucene says the score is

  idf(floccinaucinilipilification) / sqrt(10) =~ 6.5

while cosine would say:

  idf(floccinaucinilipilification) / sqrt(1^2 * idf(flocci...)^2 +
tf_rareTerm2^2 * idf(rareTerm2)^2 + ...)

    =~ 1/sqrt(10) = 0.31

What is nice about *real* cosine is that it captures a sense of absolute
scoring - the hit on the really rare word in a document which does not
have a lot of other rare words (and is in general pretty short) is indeed
measured by the fact that the score is close to 1.0 (perfect match),
whereas the Lucene default scoring does not give a good measure -
web scores close to 1.0 on its hit, and you only notice that this isn't as
good as the rare word hit because that one scores way *higher* than 1.0.

On the other hand, the thing which Lucene scoring does which cosine
does *not* is measure the fact that hitting the rare word is *way* better
than hit on a common word, from a user perspective, so giving them
equal scores because their cosines are the same is maybe not the
right thing to do (and so "yay Doug!", heh).

> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. It's currently leaning towards presenting this as
> cosine - that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.


I think it's way better than it used to be (although it's even more dense,
which I guess is required, given that it's a dense subject), but I really
don't think we want to let people think it's a cosine - it's "like" a
cosine, but with a rather different normalization.  Doron's got it stated
like that, already, but it needs to be crystal clear that scores are
sometimes higher than 1, and in fact the "best" score for one query may be
rather different than the "best" for another query.

  -jake

Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Jake Mannix wrote:
> Remember: we're not really doing cosine at all here.
This, I think, is fuzzy right? It seems to be common to still call this
cosine scoring loosely - pretty much every practical impl fudges things
somewhat when doing the normalization (though we are on the heavy side
of fudgers) - I think it's pretty rare to do the true cosine because it's
so expensive. It can be somewhat misleading though.

Have you looked at the Similarity scoring explanation page that was
recently improved? Have any suggestions on changes to it? Doron put a
fair amount of work into improving it recently, but I think it could
always get better. It's currently leaning towards presenting this as
cosine - that seems in line with the few text books I've seen, but I'm
admittedly not that deep into any of this.



Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Yes, it's a good point. I'm coming at it from a more pure angle. And I'm
not so elegant in my thought patterns :)

Right though - our document vector normalization is - uh - quick and
dirty :) It's about the cheapest one I've seen other than root(length).

I don't think that scores between queries are very comparable in general
in Lucene either - but they would be even less so if we dropped the
query norm. As I've argued in the past - if it had any real perf hit,
I'd be on the side of dropping it - but from what I can see, it really
doesn't, so I don't see why we should further skew the scores.

Jake Mannix wrote:
> Remember: we're not really doing cosine at all here.  The factor of
> IDF^2 on
> the top, with the factor of 1/sqrt(numTermsInDocument) on the bottom
> couples
> together to end up with the following effect:
>
>  q1 = "TERM1"
>  q2 = "TERM2"
>
> doc1 = "TERM1"
> doc2 = "TERM2"
>
> score(q1, doc1) = idf(TERM1)
> score(q2, doc2) = idf(TERM2)
>
> Both are perfect matches, but one scores higher (possibly much higher)
> than
> the other.
>
> Boosts work just fine with cosine (it's just a way of putting "tf"
> into the query side
> as well as in the document side), but normalizing documents without
> taking the
> idf of terms in the document into consideration blows away the ability to
> compare scores in default Lucene scoring, even *with* the queryNorm()
> factored
> in.
>
> I know you probably know this Mark, but it's important to make sure
> we're stating
> that in Lucene as it is currently structured, scores can be *wildly*
> different between
> queries, even with queryNorm() factored in, for the sake of people
> reading this
> who haven't worked through the scoring in detail.
>
>   -jake
>  
>
> On Fri, Nov 20, 2009 at 2:24 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Grant Ingersoll wrote:
>     >
>     >  What I would like to get at is why anyone thinks scores are
>     > comparable across queries to begin with.
>     >
>     They are somewhat comparable because we are using the approximate
>     cosine between the document/query vectors for the score - plus boosts
>     n stuff. How close the vectors are to each other. If q1 has a smaller
>     angle diff with d1 than q2 does with d2, then you can do a comparison.
>     It's just vector similarities. It's approximate because we fudge the
>     normalization. Why do you think the scores within a query search are
>     comparable? What's the difference when you try another query? The
>     query is the difference, and the query norm is what makes it more
>     comparable. It's just a different query vector with another query.
>     It's still going to just be a given "angle" from the doc vectors.
>     Closer is considered a better match. We don't do it to improve
>     anything, or because someone discovered something - it's just part of
>     the formula for calculating the cosine. It's the dot product formula.
>     You can lose it and keep the same relative rankings, but then you are
>     further from the cosine for the score - you start scaling by the
>     magnitude of the query vector. When you do that, they are not so
>     comparable.
>
>     If you take out the queryNorm, it's much less comparable. You are
>     effectively multiplying the cosine by the magnitude of the query
>     vector - so different queries will scale the score differently - and
>     not in a helpful way - a term vector and a query vector can have very
>     different magnitudes, but very similar term distributions. That's why
>     we are using the cosine rather than Euclidean distance in the first
>     place. Pretty sure it's more linear algebra than IR - or the vector
>     stuff from calc 3 (or wherever else different schools put it).


Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
Remember: we're not really doing cosine at all here.  The factor of IDF^2 on
the top, with the factor of 1/sqrt(numTermsInDocument) on the bottom couples
together to end up with the following effect:

 q1 = "TERM1"
 q2 = "TERM2"

doc1 = "TERM1"
doc2 = "TERM2"

score(q1, doc1) = idf(TERM1)
score(q2, doc2) = idf(TERM2)

Both are perfect matches, but one scores higher (possibly much higher) than
the other.

Boosts work just fine with cosine (it's just a way of putting "tf" into the
query side as well as in the document side), but normalizing documents
without taking the idf of terms in the document into consideration blows
away the ability to compare scores in default Lucene scoring, even *with*
the queryNorm() factored in.

I know you probably know this Mark, but it's important to make sure we're
stating that in Lucene as it is currently structured, scores can be *wildly*
different between queries, even with queryNorm() factored in, for the sake
of people reading this who haven't worked through the scoring in detail.

  -jake


On Fri, Nov 20, 2009 at 2:24 PM, Mark Miller <ma...@gmail.com> wrote:

> Grant Ingersoll wrote:
> >
> >  What I would like to get at is why anyone thinks scores are
> > comparable across queries to begin with.
> >
> They are somewhat comparable because we are using the approximate cosine
> between the document/query vectors for the score - plus boosts n stuff.
> How close the vectors are to each other. If q1 has a smaller angle diff
> with d1 than q2 does with d2, then you can do a comparison. It's just
> vector similarities. It's approximate because we fudge the normalization.
> Why do you think the scores within a query search are comparable? What's
> the difference when you try another query? The query is the difference,
> and the query norm is what makes it more comparable. It's just a
> different query vector with another query. It's still going to just be a
> given "angle" from the doc vectors. Closer is considered a better match.
> We don't do it to improve anything, or because someone discovered
> something - it's just part of the formula for calculating the cosine. It's
> the dot product formula. You can lose it and keep the same relative
> rankings, but then you are further from the cosine for the score - you
> start scaling by the magnitude of the query vector. When you do that,
> they are not so comparable.
>
> If you take out the queryNorm, it's much less comparable. You are
> effectively multiplying the cosine by the magnitude of the query vector
> - so different queries will scale the score differently - and not in a
> helpful way - a term vector and a query vector can have very different
> magnitudes, but very similar term distributions. That's why we are using
> the cosine rather than Euclidean distance in the first place. Pretty
> sure it's more linear algebra than IR - or the vector stuff from calc 3
> (or wherever else different schools put it).

Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Grant Ingersoll wrote:
>
>  What I would like to get at is why anyone thinks scores are
> comparable across queries to begin with.
>
They are somewhat comparable because we are using the approximate cosine
between the document/query vectors for the score - plus boosts n stuff.
How close the vectors are to each other. If q1 has a smaller angle diff
with d1 than q2 does with d2, then you can do a comparison. It's just
vector similarities. It's approximate because we fudge the normalization.
Why do you think the scores within a query search are comparable? What's
the difference when you try another query? The query is the difference,
and the query norm is what makes it more comparable. It's just a
different query vector with another query. It's still going to just be a
given "angle" from the doc vectors. Closer is considered a better match.
We don't do it to improve anything, or because someone discovered
something - it's just part of the formula for calculating the cosine. It's
the dot product formula. You can lose it and keep the same relative
rankings, but then you are further from the cosine for the score - you
start scaling by the magnitude of the query vector. When you do that,
they are not so comparable.

If you take out the queryNorm, it's much less comparable. You are
effectively multiplying the cosine by the magnitude of the query vector
- so different queries will scale the score differently - and not in a
helpful way - a term vector and a query vector can have very different
magnitudes, but very similar term distributions. That's why we are using
the cosine rather than Euclidean distance in the first place. Pretty
sure it's more linear algebra than IR - or the vector stuff from calc 3
(or wherever else different schools put it).



Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
Back to Grant's original question, for a second...

On Fri, Nov 20, 2009 at 1:59 PM, Grant Ingersoll <gs...@apache.org> wrote:


> This makes sense from a mathematical standpoint, assuming scores are comparable.
>  What I would like to get at is why anyone thinks scores are comparable
> across queries to begin with.  I agree it is beneficial in some cases (as
> you described) if they are.   Probably a question suited for an academic IR
> list...
>

Well, without getting into the academic IR which I'm not really qualified to
argue about, what is wrong with comparing two queries by saying that a
document which "perfectly" matches a query should score 1.0, and scale with
respect to that?

Maybe it's a better question to turn it around: can you give examples of two
queries where you can see that it *doesn't* make sense to compare scores?
Let's imagine we're doing pure, properly normalized tf-idf cosine scoring
(not default Lucene scoring) on a couple of different fields at once.  Then
whenever a sub-query is exactly equal to the field it's hitting (or else the
field is the repetition of that query some multiple number of times), the
score for that sub-query will be 1.0.  When the match isn't perfect, the
score will go down, ok.  Sub-queries hitting longer fields (which aren't
just pathologically made up of just repetitions of a smaller set of terms)
will in general have even the best scores be very low compared to the best
scores on the small fields (this is true for Lucene as well, of course), but
this makes sense: if you query with a very small set of terms (as is usually
done, unless you're doing a MoreLikeThis kind of query), and you find a
match in the "title" field which is exactly what you were looking for, that
field match is far and away better than anything else you could get in a
body match.

To put it more simply - if you do really have cosine similarity (or
Jaccard/Tanimoto or something like that, if you don't care about idf for
some reason), then queries scores are normalized relative to "how close did
I find documents to *perfectly* matching my query" - 1.0 means you found
your query in the corpus, and less than that means some fractional
proximity.  This is an absolute measure, across queries.

Of course, then you ask, well, in reality, in web and enterprise search,
documents are big, queries are small, you never really find documents which
are perfect matches, so if the best match for q1, out of your whole corpus,
is 0.1 for doc1, and the best match for q2 is 0.25 for doc2, is it really
true that the best match for the second query is "better" than the best
match for the first query?  I've typically tried to remain agnostic on that
front, and instead as the related question: if the user (or really, a
sampling of many users) queried for (q1 OR q2) and assuming for simplicity
that q1 didn't match any of the good hits for q2, and vice-versa, then does
the user (ie. your gold-standard training set) say that the best result is
doc1, or doc2?  If it's doc1, then you'd better have found a way to boost
q1's score contribution higher than q2's, right?  Is this wrong?  (in the
theoretical sense?)

  -jake

Re: Whither Query Norm?

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 20, 2009, at 1:24 PM, Jake Mannix wrote:

> 
> On Fri, Nov 20, 2009 at 10:08 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I should add in my $0.02 on whether to just get rid of queryNorm() altogether: 
>> 
>>   -1 from me, even though it's confusing, because having that call there (somewhere, at least) allows you to actually do compare scores across queries if you do the extra work of properly normalizing the documents as well (at index time).
> 
> Do you have some references on this?  I'm interested in reading more on the subject.  I've never quite been sold on how it is meaningful to compare scores and would like to read more opinions.
>  
> References on how people do this *with Lucene*, or just how this is done in general? 

in general.  Academic references, etc.

> There are lots of papers on fancy things which can be done, but I'm not sure where to point you to start out.  The technique I'm referring to is really just the simplest possible thing beyond setting your weights "by hand": let's assume you have a boolean OR query, Q, built up out of sub-queries q_i (hitting, for starters, different fields, although you can overlap as well with some more work), each with a set of weights (boosts) b_i, then if you have a training corpus (good matches, bad matches, or ranked lists of matches in order of relevance for the queries at hand), *and* scores (at the q_i level) which are comparable, then you can do a simple regression (linear or logistic, depending on whether you map your final scores to a logit or not) on the b_i to fit for the best boosts to use.  What is critical here is that scores from different queries are comparable.  If they're not, then queries where the best document for a query scores 2.0 overly affect the training in comparison to the queries where the best possible score is 0.5 (actually, wait, it's the reverse: you're training to increase scores of matching documents, so the system tries to make that 0.5 scoring document score much higher by raising boosts higher and higher, while the good matches already scoring 2.0 don't need any more boosting, if that makes sense).
> 

This makes sense from a mathematical standpoint, assuming scores are comparable.  What I would like to get at is why anyone thinks scores are comparable across queries to begin with.  I agree it is beneficial in some cases (as you described) if they are.   Probably a question suited for an academic IR list...

> There are of course far more complex "state of the art" training techniques, but probably someone like Ted would be able to give a better list of references on where is easiest to read those from.  But I can try to dredge up some places where I've read about doing this, and post again later if I can find any.
> 



Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Nov 24, 2009 at 9:18 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> I'm late to the thread, and although it looks like the discussion is over,
> I'll inline a Q for Jake.
>
> >
> >References on how people do this *with Lucene*, or just how this is done
> in general?  There are lots of papers on fancy things which can be done, but
> I'm not sure where to point you to start out.  The technique I'm referring
> to is really just the simplest possible thing beyond setting your weights
> "by hand": let's assume you have a boolean OR query, Q, built up out of
> sub-queries q_i (hitting, for starters, different fields, although you can
> overlap as well with some more work), each with a set of weights (boosts)
> b_i, then if you have a training corpus (good matches, bad matches, or
> ranked lists of matches in order of relevance for the queries at hand),
> *and* scores (at the q_i level) which are comparable,
>
> You mentioned this about 3 times in this thread (contrib/queries wants
> you!)
> It sounds like you've done this before (with Lucene?).  But how, if the
> scores are not comparable, and that's required for this "field boost
> learning/training" to work?
>

Well that's the point, right?  You need to make the scores comparable,
somehow.  The most general thing you can do is figure out what the maximum
possible score for a query is (if there is a maximum, which for most scoring
systems there will be, given strictly positive doc norms) and normalize with
respect to that.  For Lucene, the simplest possible way to do this (I
think?) is to swap in a true cosine (or something like Tanimoto) similarity
instead of the doc-length normalized one (which may require externalizing
the IDF).

When the score for a tf-idf weighted document and a boost-idf weighted query
(with both are normalized on the same scale) is exactly just the cosine of
the angle between them, then scores become fairly comparable - they're all
on a 0 to 1 scale.  Now, longer fields/documents are still going to score
way lower than shorter documents, for typical user-generated queries, but at
least now "way lower" has more meaning than before (because it's "way lower
*relative to 1.0*").

Frankly, I've done this kind of logistic regression training of weights
even when you use raw Lucene scoring, and while it doesn't work completely
(because scores are so incomparable), it's remarkable how well it ends up
working (in comparison, I guess, to setting your boosts by hand with
simple A/B tests...)

  -jake



> Thanks,
> Otis
>
> > then you can do a simple regression (linear or logistic, depending on
> whether you map your final scores to a logit or not) on the b_i to fit for
> the best boosts to use.  What is critical here is that scores from different
> queries are comparable.  If they're not, then queries where the best
> document for a query scores 2.0 overly affect the training in comparison to
> the queries where the best possible score is 0.5 (actually, wait, it's the
> reverse: you're training to increase scores of matching documents, so the
> system tries to make that 0.5 scoring document score much higher by raising
> boosts higher and higher, while the good matches already scoring 2.0 don't
> need any more boosting, if that makes sense).
> >
> >There are of course far more complex "state of the art" training
> techniques, but probably someone like Ted would be able to give a better
> list of references on where it's easiest to read about those.  But I can try to
> dredge up some places where I've read about doing this, and post again later
> if I can find any.
> >
> >  -jake
> >
>

Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Nov 24, 2009 at 9:18 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> I'm late to the thread, and although it looks like the discussion is over,
> I'll inline a Q for Jake.
>


> You mentioned this about 3 times in this thread (contrib/queries wants
> you!)
>

Yeah, I've got to get some of the stuff I've written pulled out of all of
our own code - now that Lucene is on Java 1.5, it'll be much easier for me
to contribute it.  Soon!

  -jake

Re: Whither Query Norm?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I'm late to the thread, and although it looks like the discussion is over, I'll inline a Q for Jake.

>I should add in my $0.02 on whether to just get rid of queryNorm() altogether: 
>>>
>>>  -1 from me, even though it's confusing, because having that call there (somewhere, at least) allows you to actually compare scores across queries if you do the extra work of properly normalizing the documents as well (at index time).
>>
>>
>>Do you have some references on this?  I'm interested in reading more on the subject.  I've never quite been sold on how it is meaningful to compare scores and would like to read more opinions.
> 
>References on how people do this *with Lucene*, or just how this is done in general?  There are lots of papers on fancy things which can be done, but I'm not sure where to point you to start out.  The technique I'm referring to is really just the simplest possible thing beyond setting your weights "by hand": let's assume you have a boolean OR query, Q, built up out of sub-queries q_i (hitting, for starters, different fields, although you can overlap as well with some more work), each with a set of weights (boosts) b_i, then if you have a training corpus (good matches, bad matches, or ranked lists of matches in order of relevance for the queries at hand), *and* scores (at the q_i level) which are comparable,

You mentioned this about 3 times in this thread (contrib/queries wants you!)
It sounds like you've done this before (with Lucene?).  But how, if the scores are not comparable, and that's required for this "field boost learning/training" to work?

Thanks,
Otis

> then you can do a simple regression (linear or logistic, depending on whether you map your final scores to a logit or not) on the b_i to fit for the best boosts to use.  What is critical here is that scores from different queries are comparable.  If they're not, then queries where the best document for a query scores 2.0 overly affect the training in comparison to the queries where the best possible score is 0.5 (actually, wait, it's the reverse: you're training to increase scores of matching documents, so the system tries to make that 0.5 scoring document score much higher by raising boosts higher and higher, while the good matches already scoring 2.0 don't need any more boosting, if that makes sense).
>
>There are of course far more complex "state of the art" training techniques, but probably someone like Ted would be able to give a better list of references on where it's easiest to read about those.  But I can try to dredge up some places where I've read about doing this, and post again later if I can find any.
>
>  -jake
>



Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Nov 20, 2009 at 10:08 AM, Grant Ingersoll <gs...@apache.org> wrote:

> I should add in my $0.02 on whether to just get rid of queryNorm()
> altogether:
>
>   -1 from me, even though it's confusing, because having that call there
> (somewhere, at least) allows you to actually compare scores across
> queries if you do the extra work of properly normalizing the documents as
> well (at index time).
>
>
> Do you have some references on this?  I'm interested in reading more on the
> subject.  I've never quite been sold on how it is meaningful to compare
> scores and would like to read more opinions.
>

References on how people do this *with Lucene*, or just how this is done in
general?  There are lots of papers on fancy things which can be done, but
I'm not sure where to point you to start out.  The technique I'm referring
to is really just the simplest possible thing beyond setting your weights
"by hand": let's assume you have a boolean OR query, Q, built up out of
sub-queries q_i (hitting, for starters, different fields, although you can
overlap as well with some more work), each with a set of weights (boosts)
b_i, then if you have a training corpus (good matches, bad matches, or
ranked lists of matches in order of relevance for the queries at hand),
*and* scores (at the q_i level) which are comparable, then you can do a
simple regression (linear or logistic, depending on whether you map your
final scores to a logit or not) on the b_i to fit for the best boosts to
use.  What is critical here is that scores from different queries are
comparable.  If they're not, then queries where the best document for a
query scores 2.0 overly affect the training in comparison to the queries
where the best possible score is 0.5 (actually, wait, it's the reverse:
you're training to increase scores of matching documents, so the system
tries to make that 0.5 scoring document score much higher by raising boosts
higher and higher, while the good matches already scoring 2.0 don't need any
more boosting, if that makes sense).
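
(If it helps to see the shape of that fit: below is a bare-bones sketch,
plain stochastic gradient ascent on the logistic log-likelihood.  All the
names are made up; s[n][i] is whatever your comparable score of sub-query
q_i on training example n is, and label[n] is 1 for a good match, 0 for a
bad one:)

  public class BoostFit {
    // learns boosts b_i for score(d) = sum_i b_i * s_i(d)
    public static double[] fit(double[][] s, int[] label, int iters, double lr) {
      double[] b = new double[s[0].length];  // one boost per sub-query/field
      for (int it = 0; it < iters; it++) {
        for (int n = 0; n < s.length; n++) {
          double z = 0.0;
          for (int i = 0; i < b.length; i++) z += b[i] * s[n][i];
          double p = 1.0 / (1.0 + Math.exp(-z));  // squash combined score to (0,1)
          for (int i = 0; i < b.length; i++) {
            // gradient step; if the s values aren't on a common scale across
            // queries, the few queries with outsized score ranges dominate here
            b[i] += lr * (label[n] - p) * s[n][i];
          }
        }
      }
      return b;
    }
  }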

There are of course far more complex "state of the art" training techniques,
but probably someone like Ted would be able to give a better list of
references on where it's easiest to read about those.  But I can try to dredge
up some places where I've read about doing this, and post again later if I
can find any.

  -jake

Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
Now that Otis reminded me that this thread existed (I've got a brain like a
sieve these days, I tell you)...

On Fri, Nov 20, 2009 at 10:08 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
>   -1 from me, even though it's confusing, because having that call there
> (somewhere, at least) allows you to actually compare scores across
> queries if you do the extra work of properly normalizing the documents as
> well (at index time).
>
>
> Do you have some references on this?  I'm interested in reading more on the
> subject.  I've never quite been sold on how it is meaningful to compare
> scores and would like to read more opinions.
>

So I couldn't find any really good papers on this specifically, but I seem
to remember seeing this stuff done a lot in Manning and Schütze's IR book -
they go over training field boosts with logistic regression and all that, but
they don't specifically look at the Lucene case (although they consider
similar scoring functions).   They must talk about the necessity of
comparable scores to do this, I'm sure.

  -jake

Re: Whither Query Norm?

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 20, 2009, at 11:19 AM, Jake Mannix wrote:

> I should add in my $0.02 on whether to just get rid of queryNorm() altogether: 
> 
>   -1 from me, even though it's confusing, because having that call there (somewhere, at least) allows you to actually compare scores across queries if you do the extra work of properly normalizing the documents as well (at index time).

Do you have some references on this?  I'm interested in reading more on the subject.  I've never quite been sold on how it is meaningful to compare scores and would like to read more opinions.

>   And for people who actually do machine-learning training of their per-field query boosts, this is pretty critical.
> 
>   -jake
> 
> On Fri, Nov 20, 2009 at 8:15 AM, Jake Mannix <ja...@gmail.com> wrote:
> The fact that Lucene Similarity is most decidedly *not* cosine similarity, but strongly resembles it with the queryNorm() in there, means that I personally would certainly like to see this called out, at least in the documentation.
> 
> As for performance, is queryNorm() ever called in any loops?  It's all set up in the construction of the Weight, right?  Which means that by the time you're doing scoring, all the weighting factors are already factored into one?  What's the performance cost that would be saved here?
> 
>   -jake
> 
> 
> On Fri, Nov 20, 2009 at 7:56 AM, Grant Ingersoll <gs...@apache.org> wrote:
> For a long time now, we've been telling people not to compare scores across queries, yet we maintain the queryNorm() code as an attempt to do this and the javadocs even promote it.  I'm in the process of researching this some more (references welcomed), but wanted to hear what people think about it here.  I haven't profiled it just yet, but it seems like a good chunk of wasted computation to me (loops, divisions and square roots).  At a minimum, I think we might be able to refactor the callback mechanism for it just as we did for the collectors, such that we push off the actual calculation of the sum of squares into Similarity, instead of just doing 1/sqrt(sumSqs).  That way, when people want to override queryNorm() to return 1, they are saving more than just the 1/sqrt calculation.  I haven't tested it yet, but wanted to find out what others think.
> 
> Thoughts?
> 
> -Grant



Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
I should add in my $0.02 on whether to just get rid of queryNorm()
altogether:

  -1 from me, even though it's confusing, because having that call there
(somewhere, at least) allows you to actually compare scores across
queries if you do the extra work of properly normalizing the documents as
well (at index time).  And for people who actually do machine-learning
training of their per-field query boosts, this is pretty critical.

  -jake

On Fri, Nov 20, 2009 at 8:15 AM, Jake Mannix <ja...@gmail.com> wrote:

> The fact that Lucene Similarity is most decidedly *not* cosine similarity, but
> strongly resembles it with the queryNorm() in there, means that I personally
> would certainly like to see this called out, at least in the documentation.
>
> As for performance, is queryNorm() ever called in any loops?  It's all
> set up in the construction of the Weight, right?  Which means that by the
> time you're doing scoring, all the weighting factors are already factored
> into one?  What's the performance cost that would be saved here?
>
>   -jake
>
>
> On Fri, Nov 20, 2009 at 7:56 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> For a long time now, we've been telling people not to compare scores
>> across queries, yet we maintain the queryNorm() code as an attempt to do
>> this and the javadocs even promote it.  I'm in the process of researching
>> this some more (references welcomed), but wanted to hear what people think
>> about it here.  I haven't profiled it just yet, but it seems like a good
>> chunk of wasted computation to me (loops, divisions and square roots).  At a
>> minimum, I think we might be able to refactor the callback mechanism for it
>> just as we did for the collectors, such that we push off the actual
>> calculation of the sum of squares into Similarity, instead of just doing
>> 1/sqrt(sumSqs).  That way, when people want to override queryNorm() to
>> return 1, they are saving more than just the 1/sqrt calculation.  I haven't
>> tested it yet, but wanted to find out what others think.
>>
>> Thoughts?
>>
>> -Grant
>>
>>
>

Re: Whither Query Norm?

Posted by Jake Mannix <ja...@gmail.com>.
The fact that Lucene Similarity is most decidedly *not* cosine similarity, but
strongly resembles it with the queryNorm() in there, means that I personally
would certainly like to see this called out, at least in the documentation.

As for performance, is queryNorm() ever called in any loops?  It's all
set up in the construction of the Weight, right?  Which means that by the
time you're doing scoring, all the weighting factors are already factored
into one?  What's the performance cost that would be saved here?
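
(Spelling out my understanding - roughly the once-per-query sequence
inside Query.weight(Searcher) today:)

  Weight weight = query.createWeight(searcher);
  float sum = weight.sumOfSquaredWeights();   // one loop over the query's clauses
  float norm = searcher.getSimilarity().queryNorm(sum);  // 1/sqrt(sum) by default
  weight.normalize(norm);  // folded into term weights once, before any doc is scored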

  -jake

On Fri, Nov 20, 2009 at 7:56 AM, Grant Ingersoll <gs...@apache.org> wrote:

> For a long time now, we've been telling people not to compare scores across
> queries, yet we maintain the queryNorm() code as an attempt to do this and
> the javadocs even promote it.  I'm in the process of researching this some
> more (references welcomed), but wanted to hear what people think about it
> here.  I haven't profiled it just yet, but it seems like a good chunk of
> wasted computation to me (loops, divisions and square roots).  At a minimum,
> I think we might be able to refactor the callback mechanism for it just as
> we did for the collectors, such that we push off the actual calculation of
> the sum of squares into Similarity, instead of just doing 1/sqrt(sumSqs).
>  That way, when people want to override queryNorm() to return 1, they are
> saving more than just the 1/sqrt calculation.  I haven't tested it yet, but
> wanted to find out what others think.
>
> Thoughts?
>
> -Grant
>
>

Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Mark Miller wrote:
> Grant Ingersoll wrote:
>   
>>  At a minimum, I think we might be able to refactor the callback mechanism for it just as we did for the collectors, such that we push off the actual calculation of the sum of squares into Similarity, instead of just doing 1/sqrt(sumSqs).  That way, when people want to override queryNorm() to return 1, they are saving more than just the 1/sqrt calculation.
By the way - this issue does just that:
http://issues.apache.org/jira/browse/LUCENE-1907. I don't think it's
really any measurable perf savings, but it certainly makes more sense.
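
(For anyone following along, the override itself is a one-liner either
way - a sketch against DefaultSimilarity, class name made up; the point of
the refactor is to let an override like this skip the sum-of-squares
computation as well, not just the 1/sqrt:)

  import org.apache.lucene.search.DefaultSimilarity;

  public class NoQueryNormSimilarity extends DefaultSimilarity {
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
      return 1.0f;  // leave scores un-normalized across queries
    }
  }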



Re: Whither Query Norm?

Posted by Mark Miller <ma...@gmail.com>.
Grant Ingersoll wrote:
> For a long time now, we've been telling people not to compare scores across queries, yet we maintain the queryNorm() code as an attempt to do this and the javadocs even promote it.  I'm in the process of researching this some more (references welcomed), but wanted to hear what people think about it here.  I haven't profiled it just yet, but it seems like a good chunk of wasted computation to me (loops, divisions and square roots).  At a minimum, I think we might be able to refactor the callback mechanism for it just as we did for the collectors, such that we push off the actual calculation of the sum of squares into Similarity, instead of just doing 1/sqrt(sumSqs).  That way, when people want to override queryNorm() to return 1, they are saving more than just the 1/sqrt calculation.  I haven't tested it yet, but wanted to find out what others think.
>
> Thoughts?
>
> -Grant
>
>   
Here is the old discussion: http://issues.apache.org/jira/browse/LUCENE-1896.

It's essentially no cost and has minor benefits - I'm still +1 for
keeping it.
