Posted to java-user@lucene.apache.org by Doug Cutting <cu...@lucene.com> on 2003/02/07 20:37:50 UTC

Re: Computing Relevancy Differently

Terry Steichen wrote:
> I read all the relevant references I could find in the Users (not
> Developers) list, and I still don't exactly know what to do.
> 
> What I'd like to do is get a relevancy-based order in which (a) longer
> documents tend to get more weight than shorter ones, (b) a document body
> with 'X' instances of a query term gets a higher ranking than one with fewer
> than 'X' instances, and (c) a term found in the headline (usually in
> addition to finding the same term in the body) is more highly ranked than
> one with the term only in the body.

In the latest sources this can all be done by defining your own 
Similarity implementation.  You can make longer documents score higher 
by overriding the lengthNorm() method.  You can boost headlines there, 
or with Field.setBoost(), or at query time with Query.setBoost().

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Computing Relevancy Differently

Posted by Terry Steichen <te...@net-frame.com>.

Doug,

I'll put a test case together shortly.  In the meantime, here's the code from
the attachment that didn't get through (and BTW, is there some special way
to get attachments through?):

public class WESimilarity extends DefaultSimilarity {

 public float lengthNorm(String fieldName, int numTerms) {
  if (fieldName.equals("headline") || fieldName.equals("summary") ){
   System.out.println("WES - special");
   return 4.0f * super.lengthNorm(fieldName, numTerms);
  } else {
   System.out.println("WES - normal");
   return super.lengthNorm(fieldName, Math.max(numTerms, 300));
  }
 }
}

I just ran a test indexing, but neither of the debug statements was
displayed.  I again verified that if I renamed WESimilarity.class, I got an
exception (just to ensure it was being picked up).

Regards,

Terry

----- Original Message -----
From: "Doug Cutting" <cu...@lucene.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Friday, February 28, 2003 5:52 PM
Subject: Re: Computing Relevancy Differently



Re: Computing Relevancy Differently

Posted by Doug Cutting <cu...@lucene.com>.

Your attachment did not make it, so I cannot see your code.

If you think there's a bug, could you please provide a complete, 
self-contained test case?  You could, for example, model this after the 
TestSimilarity class in the test code hierarchy.

The lengthNorm(String,int) method is called when you index the document.

Doug
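
Since lengthNorm() runs while a document is indexed, a norm that has
already been stored would presumably stay as it is until the document is
indexed again.  The toy model below is only a sketch of that idea in plain
Java: ToyIndex, LengthNorm and IndexTimeNormDemo are hypothetical names
invented for this illustration and are not Lucene classes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only (not Lucene): a norm is computed once when a document is
// added, stored, and merely read back afterwards.
class IndexTimeNormDemo {

    // Hypothetical stand-in for a length-normalization function.
    interface LengthNorm {
        float of(int numTerms);
    }

    static class ToyIndex {
        private final Map<String, Float> storedNorms = new HashMap<>();

        // "Indexing": the norm function is called here, once per added document.
        void add(String docId, int numTerms, LengthNorm norm) {
            storedNorms.put(docId, norm.of(numTerms));
        }

        // "Searching": the stored value is read back; no norm function runs.
        float normOf(String docId) {
            return storedNorms.get(docId);
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc1", 25, n -> (float) (1.0 / Math.sqrt(n)));
        System.out.println("stored norm:       " + index.normOf("doc1"));

        // In this model the stored value changes only when the document is re-added.
        index.add("doc1", 25, n -> (float) (1.0 / Math.sqrt(Math.max(n, 100))));
        System.out.println("after re-indexing: " + index.normOf("doc1"));
    }
}
```

In this model a 25-term document is stored with a norm of about 0.2, and
only re-adding it with the clamped function brings the value down to about
0.1.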


[ANN] Sample code to index an IMAP message store

Posted by David Spencer <Da...@micromuse.com>.

I've written what I'd like to donate as example code to the project.
I'm not on the list to have CVS write permissions, so if one of the power
users agrees then please put this into the sandbox.

This code indexes the mail in an IMAP message store.
By default it reads all email from an IMAP server and forms an index.
Yes, I know IMAP supports searching: this is for those who want uniform
Lucene indexes for all data.
The code is a main() driver + the indexing code.
It is not a full app like Zoe -- the intent is to help build out the
indexing of data sources in Lucene.

The source file, sample output, syntax, and a description of the fields
in the index are here:

http://www.tropo.com/techno/java/lucene/imap.html

Enjoy - oh - and I'd like to know if anyone actually uses this.
It works fine for me but no one else has tested it.

 Dave


My previous contribution is here:
http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/WordNet/src/java/org/apache/lucene/wordnet/






Re: Computing Relevancy Differently

Posted by Terry Steichen <te...@net-frame.com>.

Doug,

I've implemented a subclass of DefaultSimilarity (called WESimilarity.java,
copy attached) which defines a new lengthNorm() method more or less as you
suggested.  I then added a line prior to using my IndexWriter:
writer.setSimilarity(new WESimilarity()), and a similar line prior to using
my IndexSearcher: searcher.setSimilarity(new WESimilarity()).

The result:
1) There's no change whatsoever in the computed scores, and
2) The debugging messages never get printed out.

I know the WESimilarity is being used (because if I rename it I get an
exception), but it does not appear that the new lengthNorm() method is being
called.

It's probably some silly goof, but I can't figure out where it is.

If you (or anyone else, of course) have any ideas/suggestions, I'd
appreciate them.

Regards,

Terry

----- Original Message -----
From: "Terry Steichen" <te...@net-frame.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, February 10, 2003 2:28 PM
Subject: Re: Computing Relevancy Differently



Re: Computing Relevancy Differently

Posted by Terry Steichen <te...@net-frame.com>.

Doug,

That's excellent.  Just what I've been looking for.  I'll start
experimenting shortly.

Regards,

Terry

----- Original Message -----
From: "Doug Cutting" <cu...@lucene.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, February 10, 2003 1:57 PM
Subject: Re: Computing Relevancy Differently



Re: Computing Relevancy Differently

Posted by Doug Cutting <cu...@lucene.com>.

Terry Steichen wrote:
> Can you give me an idea of what to replace the lengthNorm() method with to,
> for example, remove any special weight given to shorter matching documents?

The goal of the default implementation is not to give any special weight 
to shorter documents, but rather to remove the advantage longer 
documents have.  Longer documents are likely to have more matches simply 
because they contain more terms.  Also, for the query "foo", a document 
containing just "foo" is a better match than a longer one containing 
"foo bar baz", since the match is more exact.

However, one problem with this approach can be that very short documents 
are in fact not very informative.  Thus a bias against very short 
documents is sometimes useful.

> I can certainly go through a bunch of trial-and-error efforts, but it would
> help if I had some grasp of the logic initially.
> 
> For example, from DefaultSimilarity, here's the lengthNorm() method:
> 
>   public float lengthNorm(String fieldName, int numTerms) {
>     return (float)(1.0 / Math.sqrt(numTerms));
>   }
> 
> Should I (for the purpose of eliminating any size bias) override it to
> always return a 1?

That's something to try, although, as mentioned above, I suspect your 
top hits will be dominated by long documents.  Try it.  It's really not 
a difficult experiment!
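
To get a feel for the experiment before running it, here is a minimal
standalone sketch in plain Java with no Lucene dependency.  It assumes a
deliberately simplified score of sqrt(termFrequency) * lengthNorm and
ignores every other scoring factor; ConstantNormDemo and its methods are
hypothetical names made up for this illustration.

```java
// Simplified model, not Lucene's full scoring formula: score = sqrt(freq) * norm.
class ConstantNormDemo {

    // The default length norm quoted above: 1 / sqrt(numTerms).
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // The experiment under discussion: no length normalization at all.
    static float constantNorm(int numTerms) {
        return 1.0f;
    }

    // Toy score for a single query term that occurs freq times in the field.
    static float score(int freq, float norm) {
        return (float) Math.sqrt(freq) * norm;
    }

    public static void main(String[] args) {
        // A 10-term document with 1 match versus a 1000-term document with 10 matches.
        System.out.println("default,  short: " + score(1, defaultNorm(10)));
        System.out.println("default,  long:  " + score(10, defaultNorm(1000)));
        System.out.println("constant, short: " + score(1, constantNorm(10)));
        System.out.println("constant, long:  " + score(10, constantNorm(1000)));
    }
}
```

Under this toy model the short document outranks the long one with the
default norm, and the order flips once the norm is a constant 1.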

One trick I've used to keep very short documents, which may be good 
matches but are not informative, from dominating results is to 
override this with something like:

    public float lengthNorm(String fieldName, int numTerms) {
      return super.lengthNorm(fieldName, Math.max(numTerms, 100));
    }

This way all fields shorter than 100 terms are scored like fields 
containing 100 terms.  Long documents are still normalized, but search 
is biased a bit against very short documents.
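
The effect of the clamp can be checked with a few numbers.  The sketch
below is plain Java and does not use Lucene; ClampedNormDemo and
clampedNorm are hypothetical names, and the formula is the 1 / sqrt(n)
default quoted earlier with the term count raised to a minimum.

```java
// Standalone illustration of the clamping trick; not tied to any Lucene class.
class ClampedNormDemo {

    // Treat every field as if it had at least minTerms terms, then apply 1 / sqrt(n).
    static float clampedNorm(int numTerms, int minTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, minTerms)));
    }

    public static void main(String[] args) {
        int[] lengths = {5, 50, 100, 400};
        for (int n : lengths) {
            float plain = (float) (1.0 / Math.sqrt(n));
            System.out.println(n + " terms: plain=" + plain
                    + " clamped=" + clampedNorm(n, 100));
        }
    }
}
```

Fields of 5, 50 and 100 terms all end up with the same clamped norm, while
a 400-term field is still normalized down to half of that.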

> How would I boost the headline field here? Is that how you are supposed to
> use the (presently unused) fieldName parameter?  If that's the case, I
> assume I would logically (to do what I'm trying to do) make this factor
> greater than 1 for the 'headline' field, and 1 for all other fields?

You could do that here too.  So, for example, you could do something like:

    public float lengthNorm(String fieldName, int numTerms) {
      float n = super.lengthNorm(fieldName, Math.max(numTerms, 100));
      if (fieldName.equals("headline"))
        n *= 4.0f;
      return n;
    }

Equivalently, you could create your documents with something like:

   Document d = new Document();
   Field f = Field.Text("headline", headline);
   f.setBoost(4.0f);
   ...
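
The equivalence presumably rests on the field boost being multiplied into
the length norm when the document is indexed.  Here is a small standalone
sketch of that arithmetic under that assumption; it is plain Java, and
BoostEquivalenceDemo and its methods are hypothetical names rather than
Lucene API.

```java
// Arithmetic sketch only; assumes stored norm = fieldBoost * lengthNorm.
class BoostEquivalenceDemo {

    // Clamped length norm, as in the override above.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, 100)));
    }

    // Option 1: bake the factor of 4 into the norm for the headline field.
    static float normWithFieldCheck(String fieldName, int numTerms) {
        float n = lengthNorm(numTerms);
        if (fieldName.equals("headline")) {
            n *= 4.0f;
        }
        return n;
    }

    // Option 2: leave the norm alone and multiply by a per-field boost.
    // ASSUMPTION: the value stored for the field is boost * lengthNorm.
    static float normWithBoost(float fieldBoost, int numTerms) {
        return fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        System.out.println("field check: " + normWithFieldCheck("headline", 8));
        System.out.println("field boost: " + normWithBoost(4.0f, 8));
    }
}
```

Both routes give the same number for the headline field, which is the sense
in which the two approaches are interchangeable in this model.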

But headlines tend to be short, and naturally benefit from the default 
lengthNorm implementation.  So what you really might want is something like:

    public float lengthNorm(String fieldName, int numTerms) {
      if (fieldName.equals("headline"))
        return 4.0f * super.lengthNorm(fieldName, numTerms);
      else
        return super.lengthNorm(fieldName, Math.max(numTerms, 100));
    }

This is probably what I'd try first.

Doug
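
For anyone who wants to try the last variant without a Lucene build at
hand, the following is a self-contained sketch.  BaseSimilarity is a
stand-in written for this example that only mirrors the 1 / sqrt(numTerms)
default quoted above; it is not Lucene's DefaultSimilarity, and
HeadlineSimilarity and HeadlineSimilarityDemo are likewise hypothetical
names.

```java
// BaseSimilarity is a stand-in written for this example; it is NOT Lucene's class.
class BaseSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}

// Per-field override: boost headlines, clamp every other field at 100 terms.
class HeadlineSimilarity extends BaseSimilarity {
    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        if (fieldName.equals("headline")) {
            // Headlines keep the plain norm and get an extra factor of 4.
            return 4.0f * super.lengthNorm(fieldName, numTerms);
        }
        // Other fields are scored as if they had at least 100 terms.
        return super.lengthNorm(fieldName, Math.max(numTerms, 100));
    }
}

class HeadlineSimilarityDemo {
    public static void main(String[] args) {
        BaseSimilarity sim = new HeadlineSimilarity();
        System.out.println("headline,  4 terms: " + sim.lengthNorm("headline", 4));
        System.out.println("body,     20 terms: " + sim.lengthNorm("body", 20));
        System.out.println("body,    400 terms: " + sim.lengthNorm("body", 400));
    }
}
```

With these inputs a 4-term headline gets a norm of 2.0, while body fields
of 20 and 400 terms get roughly 0.1 and 0.05.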


Re: Computing Relevancy Differently

Posted by Terry Steichen <te...@net-frame.com>.

Doug,

Can you give me an idea of what to replace the lengthNorm() method with to,
for example, remove any special weight given to shorter matching documents?
I can certainly go through a bunch of trial-and-error efforts, but it would
help if I had some grasp of the logic initially.

For example, from DefaultSimilarity, here's the lengthNorm() method:

  public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
  }

Should I (for the purpose of eliminating any size bias) override it to
always return a 1?

How would I boost the headline field here? Is that how you are supposed to
use the (presently unused) fieldName parameter?  If that's the case, I
assume I would logically (to do what I'm trying to do) make this factor
greater than 1 for the 'headline' field, and 1 for all other fields?

Regards,

Terry

----- Original Message -----
From: "Doug Cutting" <cu...@lucene.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Friday, February 07, 2003 2:37 PM
Subject: Re: Computing Relevancy Differently
