Posted to dev@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2003/01/26 06:30:17 UTC

Re: Question: using boost for sorting

I think I'll try to find a place for your lucene_ext code somewhere in
Lucene Sandbox, what do you think?

Otis


--- Che Dong <ch...@hotmail.com> wrote:
> How about adding sortType to IndexSearcher first?
> Users can specify IndexSearcher.sortType (by score: default, by docID,
> by docID desc) before searching.
> 
> Che, Dong
> 
> diff IndexSearcher.java
> ~/lucene-1.2-src/src/java/org/apache/lucene/search/IndexSearcher.java
> 
> 66,81c66
> < /**
> <  * Implements search over a single IndexReader.
> <  *
> <  * user can customize search result sort behavior via <code>sortType</code>:
> <  * if data source sorted by some field before indexing docID can be take
> <  * as the alias to the sort field, so
> <  * search result sort by docID(or desc) equals to sort by field
> <  *
> <  * search results sort method:
> <  *  0:  sort by score (default)
> <  *  1:  sort by docID
> <  *  -1: sort by docID desc
> <  *
> <  * @author Che, Dong <ch...@bigfoot.com>
> <  * $Header: /home/cvsroot/lucene_ext/src/org/apache/lucene/search/IndexSearcher.java,v 1.1.1.1 2002/09/22 19:36:08 chedong Exp $
> <  */
> ---
> > /** Implements search over a single IndexReader. */
> 83,89d67
> <   /**
> < 
> <    */
> <   public static final int ORDER_BY_SCORE = 0;
> <   public static final int ORDER_BY_DOCID = 1;
> <   public static final int ORDER_BY_DOCID_DESC = -1;
> <   public int sortType = ORDER_BY_SCORE;
> 96c74
> < 
> ---
> >     
> 101c79
> < 
> ---
> >     
> 106c84
> < 
> ---
> >     
> 134,162c112,127
> <     final int md = reader.maxDoc();
> < 
> <     scorer.score(new HitCollector()
> <       {
> <               private float minScore = 0.0f;
> <               public final void collect(int doc, float score) {
> <                 if (score > 0.0f &&                     // ignore zeroed buckets
> <                     (bits==null || bits.get(doc))) {    // skip docs not in bits
> <                   totalHits[0]++;
> <                   if (score >= minScore) {
> <                     // update hit queue
> <                     switch (sortType) {
> <                           case ORDER_BY_SCORE:   //sort results by score
> <                             hq.put(new ScoreDoc(doc, score)); break;
> <                           case ORDER_BY_DOCID:   //sort results by docID
> <                             hq.put(new ScoreDoc(doc, doc)); break;
> <                           case ORDER_BY_DOCID_DESC:  //sort results by docID desc
> <                             hq.put(new ScoreDoc(doc, (md - doc) )); break;
> <                           default:  //sort results by score(default)
> <                             hq.put(new ScoreDoc(doc, score));
> <                         }
> <                     if (hq.size() > nDocs) {            // if hit queue overfull
> <                               hq.pop();                         // remove lowest in hit queue
> <                               minScore = ((ScoreDoc)hq.top()).score; // reset minScore
> <                     }
> <                   }
> <                 }
> <               }
> <       }, md);
> ---
> >     scorer.score(new HitCollector() {
> >       private float minScore = 0.0f;
> >       public final void collect(int doc, float score) {
> >         if (score > 0.0f &&                     // ignore zeroed buckets
> >             (bits==null || bits.get(doc))) {    // skip docs not in bits
> >           totalHits[0]++;
> >           if (score >= minScore) {
> >             hq.put(new ScoreDoc(doc, score));   // update hit queue
> >             if (hq.size() > nDocs) {            // if hit queue overfull
> >               hq.pop();                         // remove lowest in hit queue
> >               minScore = ((ScoreDoc)hq.top()).score; // reset minScore
> >             }
> >           }
> >         }
> >       }
> >       }, reader.maxDoc());
> 167c132
> < 
> ---
> >     
> 
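A Java switch falls through from one case to the next unless each case ends in break or return, so the key selection in a patch like this is easiest to get right as a small function that returns from every case. The following is only a sketch: SortKeys and sortKey are hypothetical names, not part of Lucene, and the constants simply mirror the ones in the patch.

```java
/** Hypothetical helper; the constants mirror the ones in the patch. */
final class SortKeys {
    static final int ORDER_BY_SCORE = 0;
    static final int ORDER_BY_DOCID = 1;
    static final int ORDER_BY_DOCID_DESC = -1;

    /**
     * Returns the value to store in the "score" slot of a hit so that a
     * queue ordered by that slot yields the requested order.
     */
    static float sortKey(int sortType, int doc, float score, int maxDoc) {
        switch (sortType) {
            case ORDER_BY_DOCID:
                return doc;            // key grows with docID
            case ORDER_BY_DOCID_DESC:
                return maxDoc - doc;   // key shrinks as docID grows
            case ORDER_BY_SCORE:
            default:
                return score;          // plain relevance order
        }
    }
}
```

With such a helper the collector would make a single hq.put(new ScoreDoc(doc, key)) call per hit. Note also that the patch compares the relevance score against minScore, which holds a sort key once the queue has overflowed; a real implementation would presumably compare the key instead.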
> 
> ----- Original Message ----- 
> From: "Doug Cutting" <cu...@lucene.com>
> To: "Lucene Developers List" <lu...@jakarta.apache.org>
> Sent: Thursday, October 17, 2002 5:21 AM
> Subject: Re: Question: using boost for sorting
> 
> 
> > Please submit diffs before committing anything, as this is delicate
> > code.  Small changes here can affect performance in a big way.
> > 
> > Also, we must be extra-careful when making a new public API: once a
> > method is public it's very hard to remove it.  The Similarity methods
> > also need to be well documented.
> > 
> > Doug
> > 
> > Otis Gospodnetic wrote:
> > > This sounds good to me, as it would lead us to pluggable similarity
> > > computation...mmmm.
> > > I can refactor some of this tonight.
> > > 
> > > Otis
> > > 
> > > 
> > > --- Doug Cutting <cu...@lucene.com> wrote:
> > > 
> > > >>This looks like a good approach.  When I get a chance, I'd like to
> > > >>make Similarity an interface or an abstract class, whose default
> > > >>implementation would do what the current class does, but whose
> > > >>methods can be overridden.  Then I'd add methods like:
> > > >>
> > > >>   public static void Similarity.setDefaultSimilarity(Similarity sim);
> > > >>   public void IndexWriter.setSimilarity(Similarity sim);
> > > >>   public void Searcher.setSimilarity(Similarity sim);
> > > >>
> > > >>So to override Similarity methods you'd define a subclass of the
> > > >>standard implementation, then either install yours globally via
> > > >>setDefaultSimilarity, or set it in your IndexWriter before adding
> > > >>documents and in your Searcher before searching.  Does that sound
> > > >>reasonable?
> > > >>
> > > >>This would let you do what you describe below without changing
> > > >>Lucene's sources.  However I'm very short on time right now and don't
> > > >>know how soon I'll get to this.
> > >>
> > >>Doug
> > >>
> > >>David Birtwell wrote:
> > >>
> > >>>Hi Dmitry,
> > >>>
> > > >>>I was faced with a similar problem.  We wanted to have a numeric rank
> > > >>>field in each document influence the order in which the documents were
> > > >>>returned by lucene.  While investigating a solution for this, I wanted
> > > >>>to see if I could implement strict sorting based on this numeric value.
> > > >>>I was able to accomplish this using document boosting, but not without
> > > >>>modifying the lucene source.  Our "ranking" field is an integer value
> > > >>>from one to one hundred.  I'm not sure if this will help you, but I'll
> > > >>>include a summary of what I did.
> > >>>
> > >>>In DocumentWriter remove the normalization by field length:
> > > >>>   float norm = fieldBoosts[n] * Similarity.normalizeLength(fieldLengths[n]);
> > >>>to
> > >>>   float norm = fieldBoosts[n];
> > >>>
> > > >>>In TermScorer and PhraseScorer, modify the score() method to ignore the
> > > >>>lucene base score:
> > >>>   score *= Similarity.decodeNorm(norms[d]);
> > >>>to
> > >>>   score = Similarity.decodeNorm(norms[d]);
> > >>>
> > >>>In Similarity.java, make byteToFloat() public.
> > >>>
> > > >>>At index time, use Similarity.byteToFloat() to determine your boost
> > > >>>value as in the following pseudocode:
> > > >>>   Document d = new Document();
> > > >>>   ... add your fields ...
> > > >>>   int rank = d.getField("RANK");   // range of rank can be 0 to 255
> > > >>>   float sortVal = Similarity.byteToFloat(rank);
> > > >>>   d.setBoost(sortVal);
> > >>>
> > > >>>If you'd like the reasoning behind any or all of these items, let
> > > >>>me know.
> > >>
> > >>>DaveB
> > >>>
> > >>>
> > >>>
> > >>>Dmitry Serebrennikov wrote:
> > >>>
> > >>>
> > >>>>Greetings Everyone,
> > >>>>
> > > >>>>I'm thinking of trying to build something that manipulates a query
> > > >>>>score in order to achieve a sort order other than the default
> > > >>>>relevance sort. The idea is to create a new type of query:
> > > >>>>SortingQuery( Query query, String sortByField )
> > > >>>>
> > > >>>>It would run the sub-query and return results in an order of the
> > > >>>>values found in the "sortByField" for those documents. Now, I've
> > > >>>>looked at all of the sorting discussion prior to this, and the best
> > > >>>>approach (recommended by Doug among others) is to provide some sort of
> > > >>>>a fast access to the field values inside the HitCollector. Reading
> > > >>>>documents at search time is too slow, so people access the data
> > > >>>>elsewhere or build an in-memory index of that data (such as is done in
> > > >>>>the SearchBean's SortField).
> > >>>>
> > >>>>My idea is different. I want to try to do the following:
> > > >>>>- compose a query that consists of the original sub-query followed by
> > > >>>>  a special "sorting query"
> > > >>>>- "boost" the score of the original sub-query to 0
> > > >>>>- compute the score of the sorting query such that it would reflect
> > > >>>>  the desired sort order
> > >>>>
> > >>>>Has anyone tried to do something like this?
> > >>>>Would this work?
> > >>>>Is this worth doing?
> > > >>>>If it would, would I then have to do something at indexing
> > > >>>>time to set normalization / scoring factors for that field to
> > > >>>>something or other?
> > >>>>
> > >>>>Thanks.
> > >>>>Dmitry.
> > >>>>
> > >>>>
> > >>>>
> > >>>>-- 
> > >>>>To unsubscribe, e-mail:   
> > >>>><ma...@jakarta.apache.org>
> > >>>>For additional commands, e-mail: 
> > >>>><ma...@jakarta.apache.org>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>>-- 
> > >>>To unsubscribe, e-mail:   
> > >>><ma...@jakarta.apache.org>
> > >>>For additional commands, e-mail: 
> > >>><ma...@jakarta.apache.org>
> > >>>
> > >>
> > >>
> > >>--
> > >>To unsubscribe, e-mail:  
> > >><ma...@jakarta.apache.org>
> > >>For additional commands, e-mail:
> > >><ma...@jakarta.apache.org>
> > >>
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 

> ATTACHMENT part 2 application/octet-stream name=IndexSearcher.java



Re: Question: using boost for sorting

Posted by Che Dong <ch...@hotmail.com>.

Thank you. Is it possible to create a subproject to hold users' implementations of the basic Lucene interfaces (Tokenizer, Filter) and some other indexing approaches?

Regards

Che, Dong
----- Original Message ----- 
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Cc: "Che Dong" <ch...@hotmail.com>
Sent: Sunday, January 26, 2003 1:30 PM
Subject: Re: Question: using boost for sorting


> I think I'll try to find a place for your lucene_ext code somewhere in
> Lucene Sandbox, what do you think?
> 
> Otis