Posted to dev@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2008/10/07 22:13:11 UTC

Similarity.lengthNorm and positionIncrement=0

Hi all,

I'm using analyzers that insert several tokens at the same position 
(positionIncrement=0), and I noticed that the calculation of lengthNorm 
takes into account all tokens, regardless of their position.

Example:
	- input string: "tree houses"
	- analyzed:	tree, houses|house
	- lengthNorm(field, 3)

	- input string: "tree house"
	- analyzed:	tree, house
	- lengthNorm(field, 2)


This, however, leads to counter-intuitive results: for the query "tree" 
the second document will have a higher score, i.e. the first document 
will be penalized for the additional terms at the same positions.
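To make the penalty concrete, here is a minimal sketch (not Lucene code), 
assuming the default formula lengthNorm = 1/sqrt(numTerms) as in 
DefaultSimilarity:

```java
// Norms for the two example documents, assuming the DefaultSimilarity
// formula lengthNorm = 1 / sqrt(numTerms).
public class LengthNormSketch {
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        float norm1 = lengthNorm(3); // "tree houses" -> tree, houses|house
        float norm2 = lengthNorm(2); // "tree house"  -> tree, house
        System.out.println(norm1);   // ~0.577
        System.out.println(norm2);   // ~0.707
        // All else being equal, the shorter analyzed field wins for "tree":
        System.out.println(norm2 > norm1); // true
    }
}
```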

Arguably this should not happen: additional terms inserted at the same 
position are an artificial construct, equivalent in length to a single 
token. They are not intended to increase the length of the field, but 
rather to increase the probability of a successful match.

[Side-note: The actual use case is more complicated, because it involves 
accent-stripping filters that insert additional pure-ASCII tokens, and 
different analyzers at index and query time. Users are allowed to make 
queries using either accented or ASCII input, and they should get 
comparable scores from documents with pure-ASCII fields (no additional 
tokens) and from accented fields (many additional tokens with 
ascii|accented|stemmed variants).]

On the other hand, if someone were to submit the query 'house OR houses' 
using an analyzer that doesn't perform stemming, the first document 
should have a higher score than the second (and this is already ensured 
by the fact that two terms match instead of one), but this score should 
be mitigated by the increased length, to reflect the fact that this 
field contains more terms in total ...

Current behavior can be changed by modifying DocInverterPerField so that 
it increments fieldState.length only for tokens with positionIncrement > 
0. This could be controlled by an option - IMHO this option conceptually 
belongs to Similarity, and should be specific to a field, so perhaps a 
new method in Similarity like this would do:

	public float lengthNorm(String fieldName,
		 int numTokens, int numOverlappingTokens) {

		return lengthNorm(fieldName, numTokens);
	}
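If this overload existed, a subclass could discount the overlapping 
tokens. A self-contained sketch (no Lucene dependency; class names here 
are hypothetical, and the 1/sqrt norm mirrors DefaultSimilarity):

```java
// Sketch of the proposal: a two-arg lengthNorm with a backwards-compatible
// default, plus a hypothetical subclass that counts each stack of
// same-position tokens as a single token.
public class OverlapNormSketch {
    static class BaseSimilarity {
        float lengthNorm(String fieldName, int numTokens) {
            return (float) (1.0 / Math.sqrt(numTokens)); // DefaultSimilarity-style
        }
        // Proposed overload: default impl falls back to current behavior.
        float lengthNorm(String fieldName, int numTokens, int numOverlappingTokens) {
            return lengthNorm(fieldName, numTokens);
        }
    }

    static class DiscountOverlapsSimilarity extends BaseSimilarity {
        @Override
        float lengthNorm(String fieldName, int numTokens, int numOverlappingTokens) {
            return lengthNorm(fieldName, numTokens - numOverlappingTokens);
        }
    }

    public static void main(String[] args) {
        // "tree houses" analyzed as tree, houses|house: 3 tokens, 1 overlapping.
        System.out.println(new BaseSimilarity().lengthNorm("f", 3, 1));             // ~0.577
        System.out.println(new DiscountOverlapsSimilarity().lengthNorm("f", 3, 1)); // ~0.707
    }
}
```

With the discounting subclass, the "tree houses" document gets the same 
norm as the two-token "tree house" document.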

What do you think?

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Similarity.lengthNorm and positionIncrement=0

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael McCandless wrote:
> 
> I agree we should make this possible.  A field should not be "penalized" 
> just because many of its terms had synonyms.
> 
> In your proposed method addition to Similarity, below, 
> numOverlappingTokens would count the number of tokens that had 
> positionIncrement==0?  And then that default impl is fully backwards 
> compatible since it falls back to the current approach of counting the 
> overlapping tokens when computing lengthNorm?

Yes, and yes.


> Maybe in 3.0 we should then switch it to not count overlapping tokens by 
> default.

I'm not sure. There are good arguments for and against it, that's why I 
suggested adding it as an option.

If a typical use case is to submit queries with multiple synonyms, then 
the current method works better, because it prevents excessive score 
boosting from multiple matching synonyms. OTOH, if a typical use case is 
that users submit queries consisting of a single synonym, then the 
proposed method works better.
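A back-of-the-envelope sketch of the trade-off (numbers illustrative, 
assuming score ~ matches * lengthNorm with the 1/sqrt default):

```java
// Rough trade-off sketch: score ~ matches * lengthNorm, with
// lengthNorm = 1/sqrt(tokens counted). Expanded doc: "tree houses"
// -> tree, houses|house (3 tokens, 1 overlapping); plain doc:
// "tree house" (2 tokens).
public class TradeoffSketch {
    public static void main(String[] args) {
        double currentNorm  = 1.0 / Math.sqrt(3); // counts overlapping tokens
        double proposedNorm = 1.0 / Math.sqrt(2); // ignores overlapping tokens
        double plainNorm    = 1.0 / Math.sqrt(2); // plain "tree house" doc

        // Multi-synonym query 'house OR houses': two matches in the expanded doc.
        System.out.println(2 * currentNorm);  // ~1.155 (boost kept in check)
        System.out.println(2 * proposedNorm); // ~1.414 (larger boost)

        // Single-synonym query 'house': one match in each doc.
        System.out.println(currentNorm);  // ~0.577 (expanded doc penalized)
        System.out.println(proposedNorm); // ~0.707 (comparable to plain doc)
        System.out.println(plainNorm);    // ~0.707
    }
}
```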

I'll create a JIRA issue and prepare a patch.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Similarity.lengthNorm and positionIncrement=0

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK, this & Andrzej's logic makes sense -- let's add it as an option,  
but leave the default to the current approach of counting all tokens  
towards length norm.

Mike

Nadav Har'El wrote:

> On Sun, Oct 12, 2008, Michael McCandless wrote about "Re: Similarity.lengthNorm and positionIncrement=0":
>>
>> I agree we should make this possible.  A field should not be
>> "penalized" just because many of its terms had synonyms.
>
> I guess it won't do any harm to make this an option, but we need to do some
> careful thinking before making this the default, or even encouraging it.
>
> If we recall the rationale of length normalization, it is not to "penalize"
> long documents, in the sense that users are less likely to want to see long
> documents. Rather, the idea is that a long document contains more words -
> more unique words and more repetitions of each word - so long documents are
> more likely to match any query, and more likely to have higher scores for
> each query. If you don't do length normalization, (almost) no matter what
> search you perform, you'll get the longest documents back, rather than the
> really best-matching documents. This is why length normalization is necessary.
>
> Now, if we do synonym expansion during indexing, the document *really*
> becomes longer - it now (possibly) contains more unique words and more
> repetitions thereof. So it actually makes sense, I think, to count also
> these synonyms, and not try to avoid it.
>
> But you're right - if we're not talking about real synonyms, but rather
> variants which will *never* be used in the same query (ASCII vs. accented
> in your case), it does make sense not to count them twice, so it might
> indeed be useful to have this proposed behavior as an option.
>
> Anyway, this is just my opinion (not backed by any hard research or
> experimentation), so it might be wrong.
>
> -- 
> Nadav Har'El                        |      Monday, Oct 13 2008, 14 Tishri 5769
> IBM Haifa Research Lab              |-----------------------------------------
>                                     |Windows-2000/Professional isn't.
> http://nadav.harel.org.il           |




Re: Similarity.lengthNorm and positionIncrement=0

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Sun, Oct 12, 2008, Michael McCandless wrote about "Re: Similarity.lengthNorm and positionIncrement=0":
> 
> I agree we should make this possible.  A field should not be  
> "penalized" just because many of its terms had synonyms.

I guess it won't do any harm to make this an option, but we need to do some
careful thinking before making this the default, or even encouraging it.

If we recall the rationale of length normalization, it is not to "penalize"
long documents, in the sense that users are less likely to want to see long
documents. Rather, the idea is that a long document contains more words -
more unique words and more repetitions of each word - so long documents are
more likely to match any query, and more likely to have higher scores for
each query. If you don't do length normalization, (almost) no matter what
search you perform, you'll get the longest documents back, rather than the
really best-matching documents. This is why length normalization is necessary.
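To see this concretely, a toy score illustrates the point (not Lucene's 
actual formula, though its sqrt-based defaults behave similarly):

```java
// Toy illustration: a query term occurs once in a focused 10-token
// document and three times in a rambling 300-token document. Raw term
// frequency favors the long document; a sqrt(tf)/sqrt(length) style
// score favors the focused one.
public class WhyNormalize {
    public static void main(String[] args) {
        double rawShort = 1, rawLong = 3;
        System.out.println(rawLong > rawShort); // true: long doc wins unnormalized

        double normShort = Math.sqrt(1) / Math.sqrt(10);  // ~0.316
        double normLong  = Math.sqrt(3) / Math.sqrt(300); // ~0.100
        System.out.println(normShort > normLong); // true: focused doc wins
    }
}
```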

Now, if we do synonym expansion during indexing, the document *really*
becomes longer - it now (possibly) contains more unique words and more
repetitions thereof. So it actually makes sense, I think, to count also
these synonyms, and not try to avoid it.

But you're right - if we're not talking about real synonyms, but rather
variants which will *never* be used in the same query (ASCII vs. accented
in your case), it does make sense not to count them twice, so it might
indeed be useful to have this proposed behavior as an option.

Anyway, this is just my opinion (not backed by any hard research or
experimentation), so it might be wrong.


-- 
Nadav Har'El                        |      Monday, Oct 13 2008, 14 Tishri 5769
IBM Haifa Research Lab              |-----------------------------------------
                                    |Windows-2000/Professional isn't.
http://nadav.harel.org.il           |



Re: Similarity.lengthNorm and positionIncrement=0

Posted by Michael McCandless <lu...@mikemccandless.com>.
I agree we should make this possible.  A field should not be  
"penalized" just because many of its terms had synonyms.

In your proposed method addition to Similarity, below,  
numOverlappingTokens would count the number of tokens that had  
positionIncrement==0?  And then that default impl is fully backwards  
compatible since it falls back to the current approach of counting the  
overlapping tokens when computing lengthNorm?

Maybe in 3.0 we should then switch it to not count overlapping tokens  
by default.

Mike

Andrzej Bialecki wrote:

> Hi all,
>
> I'm using analyzers that insert several tokens at the same position  
> (positionIncrement=0), and I noticed that the calculation of  
> lengthNorm takes into account all tokens, no matter what is their  
> position.
>
> Example:
> 	- input string: "tree houses"
> 	- analyzed:	tree, houses|house
> 	- lengthNorm(field, 3)
>
> 	- input string: "tree house"
> 	- analyzed:	tree, house
> 	- lengthNorm(field, 2)
>
>
> This however leads to some counter-intuitive results: for a query  
> "tree" the second document will have a higher score, i.e. the first  
> document will be penalized for the additional terms at the same  
> positions.
>
> Arguably this should not happen, i.e. additional terms inserted at  
> the same positions should be treated as an artificial construct  
> equivalent in length to a single token, and not intended to increase  
> the length of the field, but rather to increase the probability of a  
> successful match.
>
> [Side-note: The actual use case is more complicated, because it  
> involves using accent-stripping filters that insert additional
> pure-ASCII tokens, and using different analyzers at index and query time.
> Users are allowed to make queries using either accented or ASCII  
> input, and they should get comparable scores from documents with  
> pure ascii field (no additional tokens) and from accented fields  
> (many additional tokens with ascii|accented|stemmed variants).]
>
> On the other hand, if someone were to submit a query 'house OR  
> houses', using an analyzer that doesn't perform stemming, the first
> document should have a higher score than the second (and this is  
> already ensured by the fact that two terms match instead of one),  
> but this score should be mitigated by the increased length to  
> reflect the fact that there are more terms in total in this field ...
>
> Current behavior can be changed by changing DocInverterPerField so  
> that it increments fieldState.length only for tokens with  
> positionIncrement > 0. This could be controlled by an option - IMHO  
> conceptually this option belongs to Similarity, and should be  
> specific to a field, so perhaps a new method in Similarity like this  
> would do:
>
> 	public float lengthNorm(String fieldName,
> 		 int numTokens, int numOverlappingTokens) {
>
> 		return lengthNorm(fieldName, numTokens);
> 	}
>
> What do you think?
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

