Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2010/01/22 13:40:23 UTC

Can tf() access the field it is being used for?

Hi, I'm trying to override the Similarity lengthNorm() and tf() methods,
but I only want to override them for particular index fields. lengthNorm()
is fine, but tf() doesn't take the field name as a parameter, so I'm a bit
stuck. Is there any way round this?

Here is my code, which doesn't compile because the fieldName variable
doesn't exist in the tf() method:

package org.musicbrainz.search.analysis;

import org.apache.lucene.search.DefaultSimilarity;

public class MusicbrainzSimilarity extends DefaultSimilarity {


    @Override
    public float lengthNorm(String fieldName, int numTerms) {

        // This will match both artist and label aliases and is applicable
        // to both; didn't use the constant ArtistIndexField.ALIAS because
        // that would be confusing
        if (fieldName.equals("alias")) {
            // Same result as the normal calculation if the field had two
            // terms, the most common scenario
            return 0.71f;
        } else {
            return super.lengthNorm(fieldName, numTerms);
        }
    }

   
    @Override
    public float tf(float freq) {

        // DOES NOT COMPILE: fieldName is not a parameter of tf()
        if (fieldName.equals("alias")) {
            if (freq > 1.0f) {
                return 1.0f; // Same result as if the term matched once
            }
            return (float) Math.sqrt(freq); // freq <= 1: normal calculation
        } else {
            return (float) Math.sqrt(freq);
        }
    }
}

FYI:
Each document represents a recording artist (e.g. Madonna, U2).

An artist has one artist name and may have many artist aliases. With
the DefaultSimilarity implementation I hit two problems:

1. lengthNorm() sees all the aliases of one artist as a single field, so
a match on one alias of an artist with many aliases scores much lower
than a match on an alias of an artist with few aliases. I wanted to
remove this bias, so I override it to treat every alias field as if it
had two terms. (Originally I just disabled norms for the alias field, but
the default value of 1.0f gave aliases an advantage over the artist field.)

2. tf(): when searching for an artist by artist name or alias (e.g.
artist:bach OR alias:bach), an artist with many aliases that match the
search term gets a large tf() value, easily beating another artist that
matches exactly on artist name but doesn't happen to have any aliases.
So I want to remove this bias by returning a tf() of 1.0f for a matching
alias, so that having multiple aliases isn't an advantage.
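
To make the two biases concrete, here is a small standalone sketch of the
arithmetic. It does not use Lucene at all. It assumes DefaultSimilarity
computes lengthNorm as 1/sqrt(numTerms) and tf as sqrt(freq), and it
ignores the precision lost when norms are stored in the index. The class
and method names are made up for illustration.

```java
// Standalone arithmetic sketch; hypothetical names, no Lucene dependency.
class SimilarityArithmetic {

    // Assumed default length norm: 1 / sqrt(number of terms in the field).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumed default term-frequency factor: sqrt(freq).
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Behaviour wanted for the alias field: repeated matches add nothing.
    static float cappedTf(float freq) {
        return freq > 1.0f ? 1.0f : (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // More alias terms in the field means a smaller default norm.
        int[] termCounts = {2, 10, 50};
        for (int n : termCounts) {
            System.out.printf("numTerms=%d lengthNorm=%.3f%n",
                    n, defaultLengthNorm(n));
        }

        // More matching aliases means a larger default tf, unless capped.
        float[] freqs = {1.0f, 4.0f, 9.0f};
        for (float f : freqs) {
            System.out.printf("freq=%.0f defaultTf=%.3f cappedTf=%.3f%n",
                    f, defaultTf(f), cappedTf(f));
        }
    }
}
```

With two terms the default norm comes out at roughly 0.71, which is where
the constant in my lengthNorm() override comes from.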

thanks  Paul
