Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2010/01/22 13:40:23 UTC
Can tf() access the field it is being used for?
Hi, I'm trying to override the Similarity lengthNorm() and tf() methods,
but only for particular index fields. lengthNorm() is fine, but tf()
doesn't take the field name as a parameter, so I'm a bit stuck. Is there
any way round this?
Here is my code, which doesn't compile because fieldName doesn't exist
in the tf() method:
package org.musicbrainz.search.analysis;

import org.apache.lucene.search.DefaultSimilarity;

public class MusicbrainzSimilarity extends DefaultSimilarity {

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        // This will match both artist and label aliases and is applicable
        // to both; didn't use the constant ArtistIndexField.ALIAS because
        // that would be confusing
        if (fieldName.equals("alias")) {
            // Same result as the normal calculation if the field had two
            // terms, the most common scenario
            return 0.71f;
        } else {
            return super.lengthNorm(fieldName, numTerms);
        }
    }

    @Override
    public float tf(float freq) {
        // DOES NOT COMPILE: fieldName is not a parameter of tf()
        if (fieldName.equals("alias")) {
            if (freq > 1.0f) {
                return 1.0f; // Same result as if the term matched once
            }
        }
        return (float) Math.sqrt(freq);
    }
}
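To make the intent of the tf() override concrete, here is a minimal plain-Java sketch of the behaviour being asked for: a tf function chosen by field name, falling back to the square-root default. It has no Lucene dependency and is not a Lucene API; PerFieldTfSketch, TfFunction, register() and the two-argument tf() are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldTfSketch {

    /** Hypothetical stand-in for the tf() part of a similarity. */
    public interface TfFunction {
        float tf(float freq);
    }

    /** Square root of the frequency, as in the tf() override above. */
    public static final TfFunction DEFAULT_TF = freq -> (float) Math.sqrt(freq);

    /** Never rewards more than one match in the field. */
    public static final TfFunction CAPPED_TF =
            freq -> freq > 1.0f ? 1.0f : (float) Math.sqrt(freq);

    private final Map<String, TfFunction> perField = new HashMap<>();

    /** Registers a field-specific tf function (hypothetical helper). */
    public void register(String fieldName, TfFunction fn) {
        perField.put(fieldName, fn);
    }

    /** The signature the question wishes tf() had: field name plus frequency. */
    public float tf(String fieldName, float freq) {
        return perField.getOrDefault(fieldName, DEFAULT_TF).tf(freq);
    }

    public static void main(String[] args) {
        PerFieldTfSketch scoring = new PerFieldTfSketch();
        scoring.register("alias", CAPPED_TF);
        System.out.println("alias  tf(9) = " + scoring.tf("alias", 9f));
        System.out.println("artist tf(9) = " + scoring.tf("artist", 9f));
    }
}
```

Whether something equivalent can be wired into Lucene depends on where the field name is available at scoring time, which is exactly the open question here.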
FYI:
Each document represents a recording artist (e.g. Madonna, U2).
An artist has one artist name and may have many artist aliases. With
the DefaultSimilarity implementation I hit two problems:
1. lengthNorm() sees all the aliases of one artist as one field, so an
artist with many aliases, only one of which matches, gets a much lower
value for an alias match than an artist with few aliases. I wanted
to remove this bias, so I override it to treat every alias field as if
it had two terms. (Originally I just disabled norms for the alias field,
but the default value of 1.0f gave aliases an advantage over the artist
field.)
2. tf(): when searching for an artist by artist name or alias (e.g.
artist:bach OR alias:bach), an artist with many aliases that match the
search term gets a large tf() value, easily beating another artist
that matches exactly on artist name but doesn't happen to have
any aliases. I want to remove this bias by returning a tf() of
1.0f for a matching alias, so that having multiple aliases isn't an
advantage.
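The two biases can be seen with a little arithmetic. The sketch below reproduces the DefaultSimilarity formulas (lengthNorm = 1/sqrt(numTerms), tf = sqrt(freq)) in plain Java, without Lucene on the classpath. The class and method names are made up for this illustration, and the alias counts in main() are invented example values, not real data.

```java
public class AliasBiasDemo {

    /** Same formula DefaultSimilarity uses for lengthNorm: 1 / sqrt(numTerms). */
    public static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Same formula DefaultSimilarity uses for tf: sqrt(freq). */
    public static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** The desired alias behaviour: several matches score like one match. */
    public static float cappedTf(float freq) {
        return freq > 1.0f ? 1.0f : (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Problem 1: more alias terms in the field means a smaller norm.
        System.out.println("norm, 2 alias terms:  " + defaultLengthNorm(2));
        System.out.println("norm, 20 alias terms: " + defaultLengthNorm(20));

        // Problem 2: more matching aliases means a larger tf.
        System.out.println("default tf, 1 match:   " + defaultTf(1f));
        System.out.println("default tf, 4 matches: " + defaultTf(4f));
        System.out.println("capped tf, 4 matches:  " + cappedTf(4f));
    }
}
```

The fixed 0.71f returned for the alias field in the code above is simply defaultLengthNorm(2) rounded to two decimal places.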
Thanks, Paul