Posted to java-user@lucene.apache.org by Damian Birchler <Da...@bsiag.com> on 2012/11/05 11:26:52 UTC

Overriding DefaultSimilarity to not consider tf/idf and friends

Hi everyone

We are using Lucene to search for possible duplicates in an address database. We create an index with a document for each person in the database. Each document has a field with one term for the first name, a field with one term for the last name and so on. I think in this setting it doesn't make sense to let term frequency, inverse document frequency and friends influence the document score (or does it?). For this reason I'm thinking of overriding DefaultSimilarity to not take tf/idf into account when scoring.

Do you think that's a reasonable thing to do? If so, how should I proceed? I'm looking for implementation details here: should I, e.g., override the method that calculates the term frequency to just return a constant (although, off the top of my head, I wouldn't know what a sensible constant would be)?

Thanks a lot,
Damian
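
To make the "return a constant" idea concrete, here is a minimal stand-alone sketch. It is not Lucene code and does not subclass DefaultSimilarity: the class name FlatScoringSketch and its methods are hypothetical, and query normalization, coord, boosts and norms are left out. It only models the per-term product tf * idf^2, using the classic formulas tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and compares that with a variant where both factors are the constant 1.

```java
// Hypothetical stand-alone model; none of these names are Lucene API.
final class FlatScoringSketch {

    // Classic term-frequency factor: square root of the raw frequency.
    static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    // Classic inverse-document-frequency factor.
    static double classicIdf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Flat variants: every matching term counts exactly once.
    static double flatTf(double freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    static double flatIdf(long docFreq, long numDocs) {
        return 1.0;
    }

    // Sums tf * idf^2 over the matched terms. Each array entry is the
    // document frequency of one matched term; every term is assumed to
    // occur once per field, as in the one-term-per-field index described.
    static double score(long[] matchedDocFreqs, long numDocs, boolean flat) {
        double sum = 0.0;
        for (long docFreq : matchedDocFreqs) {
            double tf = flat ? flatTf(1) : classicTf(1);
            double idf = flat ? flatIdf(docFreq, numDocs) : classicIdf(docFreq, numDocs);
            sum += tf * idf * idf;
        }
        return sum;
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        long[] rareLastName = {1};     // last name shared by 1 person
        long[] commonLastName = {500}; // last name shared by 500 people
        System.out.println("classic, rare:   " + score(rareLastName, numDocs, false));
        System.out.println("classic, common: " + score(commonLastName, numDocs, false));
        System.out.println("flat, rare:      " + score(rareLastName, numDocs, true));
        System.out.println("flat, common:    " + score(commonLastName, numDocs, true));
    }
}
```

With both factors fixed at 1, the score in this model is simply the number of matching fields, so the choice of constant only rescales all scores and does not change their order. With the classic factors, a match on a rare name scores higher than a match on a common one.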


Re: Overriding DefaultSimilarity to not consider tf/idf and friends

Posted by Erick Erickson <er...@gmail.com>.
First I'd see whether omitting term frequencies, positions and norms does what
you need; these are all things you can disable out of the box (OOB).

Best
Erick
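
As a rough illustration of what omitting term frequencies and norms changes for one-term fields, here is a small stand-alone model. It is not Lucene code: the class SingleTermFieldSketch and its methods are hypothetical, boosts are ignored, and so is the lossy one-byte encoding of stored norms. It assumes the classic factors tf = sqrt(freq), lengthNorm = 1 / sqrt(numTerms) and idf = 1 + ln(numDocs / (docFreq + 1)).

```java
// Hypothetical stand-alone model; none of these names are Lucene API.
final class SingleTermFieldSketch {

    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Per-term contribution tf * idf^2 * lengthNorm. Omitting term
    // frequencies or norms is modelled by replacing that factor with 1.
    static double termScore(double freq, int fieldLength, long docFreq, long numDocs,
                            boolean omitTf, boolean omitNorms) {
        double tfFactor = omitTf ? 1.0 : tf(freq);
        double normFactor = omitNorms ? 1.0 : lengthNorm(fieldLength);
        double idfFactor = idf(docFreq, numDocs);
        return tfFactor * idfFactor * idfFactor * normFactor;
    }

    public static void main(String[] args) {
        // A field holding exactly one term that occurs once.
        System.out.println("with tf and norms:    " + termScore(1, 1, 10, 1000, false, false));
        System.out.println("tf and norms omitted: " + termScore(1, 1, 10, 1000, true, true));
        // Same settings, but a much more common term.
        System.out.println("omitted, common term: " + termScore(1, 1, 500, 1000, true, true));
    }
}
```

In this model a field with a single term already has tf = 1 and lengthNorm = 1, so omitting them leaves the term's contribution unchanged, while the idf factor still varies with how common the term is.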


On Mon, Nov 5, 2012 at 5:26 AM, Damian Birchler
<Da...@bsiag.com> wrote:

> Hi everyone
>
> We are using Lucene to search for possible duplicates in an address
> database. We create an index with a document for each person in the
> database. Each document has a field with one term for the first name, a
> field with one term for the last name and so on. I think in this setting it
> doesn't make sense to let term frequency, inverse document frequency and
> friends influence the document score (or does it?). For this reason I'm
> thinking of overriding DefaultSimilarity to not take tf/idf into account
> when scoring.
>
> Do you think that's a reasonable thing to do? If so, how should I proceed
> (I'm looking for implementation details here; should I, e.g., override the
> method that calculates the term frequency to just return a constant
> [although, off the top of my head, I wouldn't know what would be a sensible
> constant.]).
>
> Thanks a lot,
> Damian