Posted to java-user@lucene.apache.org by Dmitry Serebrennikov <dm...@earthlink.net> on 2002/10/15 04:18:26 UTC

Question: using boost for sorting

Greetings Everyone,

I'm thinking of trying to build something that manipulates a query score 
in order to achieve a sort order other than the default relevance sort. 
The idea is to create a new type of query:
SortingQuery( Query query, String sortByField )

It would run the sub-query and return results in the order of the values 
found in the "sortByField" for those documents. Now, I've looked at all 
of the sorting discussion prior to this, and the best approach 
(recommended by Doug among others) is to provide some sort of fast 
access to the field values inside the HitCollector. Reading documents at 
search time is too slow, so people access the data elsewhere or build an 
in-memory index of that data (such as is done in the SearchBean's 
SortField).

My idea is different. I want to try to do the following:
- compose a query that consists of the original sub-query followed by a 
special "sorting query"
- "boost" the score of the original sub-query to 0
- compute the score of the sorting query such that it would reflect the 
desired sort order

Has anyone tried to do something like this?
Would this work?
Is this worth doing?
If it would, would I then have to do something at indexing time 
to set normalization / scoring factors for that field to something or 
other?
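
The idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene code: Hit, combinedScore and rank are hypothetical names, and the sort field value is assumed to be already available as a positive number. Real Lucene scoring applies further factors (normalization among them), which is what the last question above is about.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortingQuerySketch {
    /** Hypothetical hit: a doc id, its relevance score, and its sort-field value. */
    static final class Hit {
        final int doc;
        final float relevance;
        final float fieldValue;

        Hit(int doc, float relevance, float fieldValue) {
            this.doc = doc;
            this.relevance = relevance;
            this.fieldValue = fieldValue;
        }
    }

    /** Combined score: with the sub-query boosted to 0, only the sort part counts. */
    static float combinedScore(Hit h, float subQueryBoost) {
        return subQueryBoost * h.relevance + h.fieldValue;
    }

    /** Returns the doc ids ordered by combined score, highest first. */
    static List<Integer> rank(List<Hit> hits, float subQueryBoost) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble(
            (Hit h) -> combinedScore(h, subQueryBoost)).reversed());
        List<Integer> docs = new ArrayList<>();
        for (Hit h : sorted) {
            docs.add(h.doc);
        }
        return docs;
    }
}
```

With a boost of 0 the sub-query only decides which documents match; the order comes entirely from the field value.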

Thanks.
Dmitry.



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: Question: using boost for sorting

Posted by Che Dong <ch...@hotmail.com>.

Thank you. Is it possible to create a sub-project to hold users' implementations of the basic Lucene interfaces (Tokenizer, Filter) and some other indexing approaches?

Regards

Che, Dong
----- Original Message ----- 
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Cc: "Che Dong" <ch...@hotmail.com>
Sent: Sunday, January 26, 2003 1:30 PM
Subject: Re: Question: using boost for sorting


> I think I'll try to find a place for your lucene_ext code somewhere in
> Lucene Sandbox, what do you think?
> 
> Otis

Re: Question: using boost for sorting

Posted by Otis Gospodnetic <ot...@yahoo.com>.

I think I'll try to find a place for your lucene_ext code somewhere in
Lucene Sandbox, what do you think?

Otis



Re: Question: using boost for sorting

Posted by Che Dong <ch...@hotmail.com>.

How about adding a sortType to IndexSearcher first?
Users can set IndexSearcher.sortType (by score, the default; by docID; or by docID descending) before searching.

Che, Dong

diff IndexSearcher.java ~/lucene-1.2-src/src/java/org/apache/lucene/search/IndexSearcher.java 
66,81c66
< /**
<  * Implements search over a single IndexReader.
<  *
<  * Users can customize how search results are sorted via <code>sortType</code>.
<  * If the data source is sorted by some field before indexing, the docID can
<  * serve as an alias for that field, so sorting search results by docID
<  * (ascending or descending) is equivalent to sorting by that field.
<  *
<  * search results sort method:
<  *  0:  sort by score (default)
<  *  1:  sort by docID
<  *  -1: sort by docID desc
<  *
<  * @author Che, Dong <ch...@bigfoot.com>
<  * $Header: /home/cvsroot/lucene_ext/src/org/apache/lucene/search/IndexSearcher.java,v 1.1.1.1 2002/09/22 19:36:08 chedong Exp $
<  */
---
> /** Implements search over a single IndexReader. */
83,89d67
<   /**
< 
<    */
<   public static final int ORDER_BY_SCORE = 0;
<   public static final int ORDER_BY_DOCID = 1;
<   public static final int ORDER_BY_DOCID_DESC = -1;
<   public int sortType = ORDER_BY_SCORE;
96c74
< 
---
>     
101c79
< 
---
>     
106c84
< 
---
>     
134,162c112,127
<     final int md = reader.maxDoc();
< 
<     scorer.score(new HitCollector()
<       {
<               private float minScore = 0.0f;
<               public final void collect(int doc, float score) {
<                 if (score > 0.0f &&                     // ignore zeroed buckets
<                     (bits==null || bits.get(doc))) {    // skip docs not in bits
<                   totalHits[0]++;
<                   if (score >= minScore) {
<                     // update hit queue
<                     switch (sortType) {
<                           case ORDER_BY_SCORE:   //sort results by score
<                             hq.put(new ScoreDoc(doc, score)); break;
<                           case ORDER_BY_DOCID:   //sort results by docID
<                             hq.put(new ScoreDoc(doc, doc)); break;
<                           case ORDER_BY_DOCID_DESC:  //sort results by docID desc
<                             hq.put(new ScoreDoc(doc, (md - doc) ) ); break;
<                           default:  //sort results by score(default)
<                             hq.put(new ScoreDoc(doc, score));
<                         }
<                     if (hq.size() > nDocs) {            // if hit queue overfull
<                               hq.pop();                         // remove lowest in hit queue
<                               minScore = ((ScoreDoc)hq.top()).score; // reset minScore
<                     }
<                   }
<                 }
<               }
<       }, md);
---
>     scorer.score(new HitCollector() {
>       private float minScore = 0.0f;
>       public final void collect(int doc, float score) {
>         if (score > 0.0f &&                     // ignore zeroed buckets
>             (bits==null || bits.get(doc))) {    // skip docs not in bits
>           totalHits[0]++;
>           if (score >= minScore) {
>             hq.put(new ScoreDoc(doc, score));   // update hit queue
>             if (hq.size() > nDocs) {            // if hit queue overfull
>               hq.pop();                         // remove lowest in hit queue
>               minScore = ((ScoreDoc)hq.top()).score; // reset minScore
>             }
>           }
>         }
>       }
>       }, reader.maxDoc());
167c132
< 
---
>     
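
The sort-key selection in the patch above can also be sketched outside Lucene. This is a minimal stand-alone illustration, not the patched IndexSearcher: sortKey and topDocs are hypothetical helpers, and a java.util.PriorityQueue stands in for the hit queue. The three constants and the key formulas mirror the patch; whichever key is selected, the larger key ranks first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class SortTypeSketch {
    static final int ORDER_BY_SCORE = 0;
    static final int ORDER_BY_DOCID = 1;
    static final int ORDER_BY_DOCID_DESC = -1;

    /** Hypothetical helper: the value a hit is ranked by. Each case returns directly. */
    static float sortKey(int sortType, int doc, float score, int maxDoc) {
        switch (sortType) {
            case ORDER_BY_DOCID:      return doc;
            case ORDER_BY_DOCID_DESC: return maxDoc - doc;
            default:                  return score;
        }
    }

    /** Returns the doc ids of the n matching hits with the largest key, best first. */
    static List<Integer> topDocs(int sortType, float[] scores, int n) {
        int maxDoc = scores.length;
        // Min-heap on the key, so the weakest kept hit is always on top.
        PriorityQueue<float[]> hq =
            new PriorityQueue<>((a, b) -> Float.compare(a[1], b[1]));
        for (int doc = 0; doc < maxDoc; doc++) {
            if (scores[doc] <= 0.0f) {
                continue;                       // ignore non-matching docs
            }
            float key = sortKey(sortType, doc, scores[doc], maxDoc);
            hq.add(new float[] { doc, key });
            if (hq.size() > n) {
                hq.poll();                      // drop the weakest hit
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!hq.isEmpty()) {
            result.add((int) hq.poll()[0]);
        }
        Collections.reverse(result);            // best first
        return result;
    }
}
```

The sketch compares hits by the selected key throughout, so the queue's cut-off is always in the same units as the key.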


----- Original Message ----- 
From: "Doug Cutting" <cu...@lucene.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Thursday, October 17, 2002 5:21 AM
Subject: Re: Question: using boost for sorting


> Please submit diffs before committing anything, as this is delicate 
> code.  Small changes here can affect performance in a big way.
> 
> Also, we must be extra-careful when making a new public API: once a 
> method is public it's very hard to remove it.  The Similarity methods 
> also need to be well documented.
> 
> Doug
> 
> Otis Gospodnetic wrote:
> > This sounds good to me, as it would lead us to pluggable similarity
> > computation...mmmm.
> > I can refactor some of this tonight.
> > 
> > Otis
> > 
> > 
> > --- Doug Cutting <cu...@lucene.com> wrote:
> > 
> >>This looks like a good approach.  When I get a chance, I'd like to make
> >>Similarity an interface or an abstract class, whose default
> >>implementation would do what the current class does, but whose methods
> >>can be overridden.  Then I'd add methods like:
> >>
> >>   public static void Similarity.setDefaultSimilarity(Similarity sim);
> >>   public void IndexWriter.setSimilarity(Similarity sim);
> >>   public void Searcher.setSimilarity(Similarity sim);
> >>
> >>So to override Similarity methods you'd define a subclass of the
> >>standard implementation, then either install yours globally via
> >>setDefaultSimilarity, or set it in your IndexWriter before adding
> >>documents and in your Searcher before searching.  Does that sound
> >>reasonable?
> >>
> >>This would let you do what you describe below without changing Lucene's
> >>sources.  However I'm very short on time right now and don't know how
> >>soon I'll get to this.
> >>
> >>Doug
> >>
> >>David Birtwell wrote:
> >>
> >>>Hi Dmitry,
> >>>
> >>>I was faced with a similar problem.  We wanted to have a numeric rank
> >>>field in each document influence the order in which the documents were
> >>>returned by Lucene.  While investigating a solution for this, I wanted
> >>>to see if I could implement strict sorting based on this numeric value.
> >>>I was able to accomplish this using document boosting, but not without
> >>>modifying the Lucene source.  Our "ranking" field is an integer value
> >>>from one to one hundred.  I'm not sure if this will help you, but I'll
> >>>include a summary of what I did.
> >>>
> >>>In DocumentWriter remove the normalization by field length:
> >>>   float norm = fieldBoosts[n] * Similarity.normalizeLength(fieldLengths[n]);
> >>>to
> >>>   float norm = fieldBoosts[n];
> >>>
> >>>In TermScorer and PhraseScorer, modify the score() method to ignore the
> >>>Lucene base score:
> >>>   score *= Similarity.decodeNorm(norms[d]);
> >>>to
> >>>   score = Similarity.decodeNorm(norms[d]);
> >>>
> >>>In Similarity.java, make byteToFloat() public.
> >>>
> >>>At index time, use Similarity.byteToFloat() to determine your boost
> >>>value as in the following pseudocode:
> >>>   Document d = new Document();
> >>>   ... add your fields ...
> >>>   int rank = Integer.parseInt(d.get("RANK"));  // rank can range from 0 to 255
> >>>   float sortVal = Similarity.byteToFloat((byte) rank);
> >>>   d.setBoost(sortVal);
> >>>
> >>>If you'd like the reasoning behind any or all of these items, let me know.
> >>
> >>>DaveB
>
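
David Birtwell's recipe, quoted above, only gives a strict sort if the decoded one-byte norm preserves the order of the rank values. The sketch below checks that property for one particular decoding: a 3-bit mantissa and a 5-bit exponent, which is an assumption about how Similarity.byteToFloat() works and should be checked against your own Lucene sources. NormOrderSketch and isStrictlyIncreasing are hypothetical names.

```java
class NormOrderSketch {
    /** Assumed decoding: 3-bit mantissa, 5-bit exponent, zero maps to 0.0f. */
    static float byteToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    /** True if every larger rank in 0..255 decodes to a strictly larger float. */
    static boolean isStrictlyIncreasing() {
        float previous = byteToFloat((byte) 0);
        for (int rank = 1; rank <= 255; rank++) {
            float current = byteToFloat((byte) rank);
            if (!(current > previous)) {
                return false;
            }
            previous = current;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("strictly increasing: " + isStrictlyIncreasing());
    }
}
```

Under that assumed layout, a larger rank always yields a larger boost, so ordering by the decoded norm is the same as ordering by rank.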

Re: Question: using boost for sorting

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.

Also, please consider that some applications may require multiple 
Similarity implementations in the same index. For example, I would like 
to be able to sort by relevance on most searches, but sometimes allow 
users to request that results be ordered by price or by some other 
field. I think it would be ok to dedicate a special field for a given 
sort order, but requiring a whole new index is too much.
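
One way to picture this requirement is a scoring strategy chosen per search rather than fixed for the whole index. The sketch below uses hypothetical stand-ins (Scoring, search, plain arrays for relevance and price) and is not Lucene's Similarity API.

```java
import java.util.ArrayList;
import java.util.List;

class PerSearchScoringSketch {
    /** Hypothetical strategy: maps a matching document to the score it is ranked by. */
    interface Scoring {
        float score(int doc, float relevance);
    }

    /** Ranks all documents with a positive relevance using the given strategy, best first. */
    static List<Integer> search(float[] relevance, Scoring scoring) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = 0; doc < relevance.length; doc++) {
            if (relevance[doc] > 0.0f) {
                docs.add(doc);
            }
        }
        docs.sort((a, b) -> Float.compare(
            scoring.score(b, relevance[b]), scoring.score(a, relevance[a])));
        return docs;
    }
}
```

The same data can then be searched with `(doc, r) -> r` for relevance order, or with `(doc, r) -> price[doc]` for price order, without building a second index.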

Dmitry.

Doug Cutting wrote:

> Please submit diffs before committing anything, as this is delicate 
> code.  Small changes here can affect performance in a big way.
>
> Also, we must be extra-careful when making a new public API: once a 
> method is public it's very hard to remove it.  The Similarity methods 
> also need to be well documented.
>
> Doug

Re: Question: using boost for sorting

Posted by Otis Gospodnetic <ot...@yahoo.com>.

It just occurred to me that this diff is really pretty useless.
The methods I added don't do anything by themselves...
I just added new methods to those 3 classes, but I don't see where
IndexWriter and Searcher use Similarity, and Similarity currently
doesn't use the instance that was set by setDefaultSimilarity.

And Similarity's public methods are static.  In order for the new
Similarity instance to be used (the one specified in
setDefaultSimilarity(Similarity)) we would/could make Similarity a
singleton, make its methods non-static, add a
Similarity.getDefaultSimilarity() method, and then replace calls like
this:

idf = Similarity.idf(term, searcher);

with

idf = Similarity.getDefaultSimilarity().idf(term, searcher);
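
A minimal sketch of that change, using simplified stand-in classes rather than the real Lucene sources. The idf signature here takes two ints instead of (term, searcher), its formula is only a placeholder, and FlatSimilarity is a hypothetical subclass.

```java
class DefaultSimilaritySketch {
    /** Simplified stand-in for Similarity with instance (non-static) methods. */
    static class Similarity {
        private static Similarity defaultImpl = new Similarity();

        static void setDefaultSimilarity(Similarity sim) {
            defaultImpl = sim;
        }

        static Similarity getDefaultSimilarity() {
            return defaultImpl;
        }

        /** Placeholder formula: rarer terms get a larger weight. */
        float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
    }

    /** A subclass that overrides one method, here to switch idf off. */
    static class FlatSimilarity extends Similarity {
        @Override
        float idf(int docFreq, int numDocs) {
            return 1.0f;
        }
    }

    public static void main(String[] args) {
        // Callers always go through the installed default instance.
        System.out.println(Similarity.getDefaultSimilarity().idf(9, 100));
        Similarity.setDefaultSimilarity(new FlatSimilarity());
        System.out.println(Similarity.getDefaultSimilarity().idf(9, 100));
    }
}
```

Because every call site goes through getDefaultSimilarity(), installing a subclass changes scoring without touching the callers.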


Is this what you had in mind, Doug?

Otis


--- Otis Gospodnetic <ot...@yahoo.com> wrote:
> Here are the diffs for:
>   Similarity.java
>   IndexWriter.java
>   Searcher.java
> 
> The changes were minimal, everything should still work the same way as
> before.  Similarity's public methods are all static, so making this
> class abstract makes no difference to the outside callers of its public
> methods.
> 
> Otis
> 
> 

> ATTACHMENT part 2 application/octet-stream name=Similarity.diff


> ATTACHMENT part 3 application/octet-stream name=IndexWriter.diff


> ATTACHMENT part 4 application/octet-stream name=Searcher.diff

Re: Question: using boost for sorting

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Here are the diffs for:
  Similarity.java
  IndexWriter.java
  Searcher.java

The changes were minimal; everything should still work the same way as
before.  Similarity's public methods are all static, so making this
class abstract makes no difference to the outside callers of its public
methods.

Otis


--- Doug Cutting <cu...@lucene.com> wrote:
> Please submit diffs before committing anything, as this is delicate 
> code.  Small changes here can affect performance in a big way.
> 
> Also, we must be extra-careful when making a new public API: once a 
> method is public it's very hard to remove it.  The Similarity methods
> also need to be well documented.
> 
> Doug
> 

Re: Question: using boost for sorting

Posted by Doug Cutting <cu...@lucene.com>.

Please submit diffs before committing anything, as this is delicate 
code.  Small changes here can affect performance in a big way.

Also, we must be extra-careful when making a new public API: once a 
method is public it's very hard to remove it.  The Similarity methods 
also need to be well documented.

Doug

Otis Gospodnetic wrote:
> This sounds good to me, as it would lead us to pluggable similarity
> computation...mmmm.
> I can refactor some of this tonight.
> 
> Otis

Re: Question: using boost for sorting

Posted by David Birtwell <Da...@vwr.com>.

Doug Cutting wrote:

> David Birtwell wrote:
>
>> To enable this, Similarity could have a method like:
>>    float applyNorm(float baseScore)
>> which could optionally ignore baseScore and modify the scorer classes 
>> to do:
>>    score = applyNorm(score)
>> instead of the
>>    score *= Similarity.decodeNorm(norms[d])
>
> That would add a method call in the innermost search loop, which would 
> probably have a noticeable performance impact.  
> (Similarity.decodeNorm() is a simple static method that JITs can 
> trivially inline.)  Couldn't you achieve the same effect by overriding 
> the normalizeLength() method and/or use Field.setBoost() to impact the 
> value that is stored in the norm file?  That way this computation is 
> performed  at index time rather than at search time.

Hmmm... you know what, I hadn't considered performance when making the 
above suggestion.  Still, though, I don't see how to accomplish strict 
ordering of results without making a modification to the score() 
methods.

I may be missing something, but my understanding at this point is that 
the ordering of results is determined by the score, and the score is a 
combination of the relevance of the hit (frequency/density of terms, 
etc....) and the norm values.  To predefine the order of results at 
index time, we have to be able to throw out the "hit relevance" portion 
of the score at search time.

Could we make applyNorm() a static method of Similarity and achieve 
acceptable performance?  The default implementation could be something like:

static float applyNorm(float hitRelevance, byte norm)
{
    // default: scale the hit relevance by the decoded norm
    return hitRelevance * decodeNorm(norm);
}

For strict ordering:

static float applyNorm(float hitRelevance, byte norm)
{
    // ignore hit relevance
    return decodeNorm(norm);
}
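
To make the difference between the two variants concrete, here is a minimal, self-contained sketch in modern Java. It does not use Lucene at all: ScoredHit, NormPolicy and NormPolicies are hypothetical names invented for the illustration, and the norm is assumed to be already decoded to a float. It also uses an interface call per hit, which is exactly the kind of overhead Doug warns about, so it only illustrates the resulting order, not an acceptable inner-loop implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Lucene classes.
final class ScoredHit {
    final String id;
    final float hitRelevance; // frequency/density part of the score
    final float norm;         // value fixed at index time (assumed already decoded)

    ScoredHit(String id, float hitRelevance, float norm) {
        this.id = id;
        this.hitRelevance = hitRelevance;
        this.norm = norm;
    }
}

interface NormPolicy {
    float applyNorm(float hitRelevance, float norm);
}

final class NormPolicies {
    // Default behaviour: hit relevance scaled by the norm.
    static final NormPolicy DEFAULT = (hitRelevance, norm) -> hitRelevance * norm;

    // Strict ordering: ignore hit relevance entirely.
    static final NormPolicy STRICT = (hitRelevance, norm) -> norm;

    // Returns hit ids ordered by descending score under the given policy.
    static List<String> rank(List<ScoredHit> hits, NormPolicy policy) {
        List<ScoredHit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble(
                (ScoredHit h) -> policy.applyNorm(h.hitRelevance, h.norm)).reversed());
        List<String> ids = new ArrayList<>();
        for (ScoredHit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }
}
```

With the default policy a highly relevant hit can outrank a hit with a larger norm; with the strict policy the order depends on the norm alone, which is the "predefined at index time" ordering described above.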

> If you need to be able to dynamically change the scoring method at 
> search time then there will probably be a performance impact.  Ideally 
> this should still be an option, however this would require opening up 
> the scorer API, so that folks could define different scorer 
> implementations for each Query class.  I'm not sure I yet want to take 
> on that task, but if you have a proposal, I'd love to hear it.

Heh, no proposals here... yet.  This topic directly affects the 
application development work I'm doing and I'd love to propose a 
solution (or otherwise contribute).  Though, I would want to gain a 
deeper understanding of Lucene before doing so.  I'm going to try to 
make some time to do so in the coming weeks which will hopefully enable 
me to make an intelligent contribution a little later on.

DaveB


Re: Question: using boost for sorting

Posted by Doug Cutting <cu...@lucene.com>.

[I've moved this discussion to the developer list.]

David Birtwell wrote:
> I think this sounds like a great idea, too.
> 
> It would be nice to modify the PhraseScorer and TermScorer to enable the 
> strict ordering.  I had to have the score() methods of these classes 
> essentially ignore the base lucene scoring and just use the decoded norm 
> from Similarity.
> 
> To enable this, Similarity could have a method like:
>    float applyNorm(float baseScore)
> which could optionally ignore baseScore and modify the scorer classes to 
> do:
>    score = applyNorm(score)
> instead of the
>    score *= Similarity.decodeNorm(norms[d])

That would add a method call in the innermost search loop, which would 
probably have a noticeable performance impact.  (Similarity.decodeNorm() 
is a simple static method that JITs can trivially inline.)  Couldn't you 
achieve the same effect by overriding the normalizeLength() method 
and/or using Field.setBoost() to affect the value that is stored in the 
norm file?  That way this computation is performed at index time rather 
than at search time.
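
As an illustration of doing the work at index time, here is a minimal sketch that assumes a made-up one-byte encoding. It is not Lucene's actual norm format; RankNormCodec, encodeRank and DECODE_TABLE are hypothetical names. The point is only that the value is computed once when the document is written, and that decoding at search time can be a plain array lookup in a small static method.

```java
// Hypothetical one-byte norm codec, for illustration only; this is NOT
// Lucene's actual norm encoding.
final class RankNormCodec {
    // Decoding table built once, so search-time decoding is an array lookup.
    private static final float[] DECODE_TABLE = new float[256];

    static {
        for (int i = 0; i < 256; i++) {
            DECODE_TABLE[i] = i / 255.0f;
        }
    }

    // Index time: store the rank (0..255) as a single byte.
    static byte encodeRank(int rank) {
        if (rank < 0 || rank > 255) {
            throw new IllegalArgumentException("rank must be in 0..255: " + rank);
        }
        return (byte) rank;
    }

    // Search time: map the stored byte back to a float score factor.
    // The mask treats the byte as unsigned, so ranks above 127 decode correctly.
    static float decodeNorm(byte b) {
        return DECODE_TABLE[b & 0xFF];
    }
}
```

Because the mapping is monotonic, a higher rank always decodes to a higher factor, so sorting by the decoded value reproduces the rank order.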

If you need to be able to dynamically change the scoring method at 
search time then there will probably be a performance impact.  Ideally 
this should still be an option; however, this would require opening up 
the scorer API, so that folks could define different scorer 
implementations for each Query class.  I'm not sure I yet want to take 
on that task, but if you have a proposal, I'd love to hear it.

> I'd be happy to contribute in this area if it would be helpful.

Please send diffs if you are interested in contributing.

Doug


Re: Question: using boost for sorting

Posted by David Birtwell <Da...@vwr.com>.

I think this sounds like a great idea, too.

It would be nice to modify the PhraseScorer and TermScorer to enable the 
strict ordering.  I had to have the score() methods of these classes 
essentially ignore the base lucene scoring and just use the decoded norm 
from Similarity.

To enable this, Similarity could have a method like:
    float applyNorm(float baseScore)
which could optionally ignore baseScore and modify the scorer classes to do:
    score = applyNorm(score)
instead of the
    score *= Similarity.decodeNorm(norms[d])

I'd be happy to contribute in this area if it would be helpful.

DaveB


Otis Gospodnetic wrote:

>This sounds good to me, as it would lead us to pluggable similarity
>computation...mmmm.
>I can refactor some of this tonight.
>
>Otis

Re: Question: using boost for sorting

Posted by Otis Gospodnetic <ot...@yahoo.com>.

This sounds good to me, as it would lead us to pluggable similarity
computation...mmmm.
I can refactor some of this tonight.

Otis


--- Doug Cutting <cu...@lucene.com> wrote:
> This looks like a good approach.  When I get a chance, I'd like to make
> Similarity an interface or an abstract class, whose default
> implementation would do what the current class does, but whose methods
> can be overridden.  Then I'd add methods like:
> 
>    public static void Similarity.setDefaultSimilarity(Similarity sim);
>    public void IndexWriter.setSimilarity(Similarity sim);
>    public void Searcher.setSimilarity(Similarity sim);
> 
> So to override Similarity methods you'd define a subclass of the
> standard implementation, then either install yours globally via
> setDefaultSimilarity, or set it in your IndexWriter before adding
> documents and in your Searcher before searching.  Does that sound
> reasonable?
> 
> This would let you do what you describe below without changing Lucene's
> sources.  However I'm very short on time right now and don't know how
> soon I'll get to this.
> 
> Doug

Re: Question: using boost for sorting

Posted by Doug Cutting <cu...@lucene.com>.

This looks like a good approach.  When I get a chance, I'd like to make 
Similarity an interface or an abstract class, whose default 
implementation would do what the current class does, but whose methods 
can be overridden.  Then I'd add methods like:

   public static void Similarity.setDefaultSimilarity(Similarity sim);
   public void IndexWriter.setSimilarity(Similarity sim);
   public void Searcher.setSimilarity(Similarity sim);

So to override Similarity methods you'd define a subclass of the 
standard implementation, then either install yours globally via 
setDefaultSimilarity, or set it in your IndexWriter before adding 
documents and in your Searcher before searching.  Does that sound 
reasonable?

This would let you do what you describe below without changing Lucene's 
sources.  However I'm very short on time right now and don't know how 
soon I'll get to this.
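
A rough sketch of how this proposal might look, written as self-contained Java with no dependency on Lucene. Only the names setDefaultSimilarity, setSimilarity and normalizeLength come from this thread; PluggableSimilarity, StandardSimilarity, NoLengthNormSimilarity, SketchIndexWriter, getDefault and computeNorm are hypothetical, and the 1/sqrt(length) formula is just a stand-in for "what the current class does".

```java
// Hypothetical sketch of the proposed design, not the eventual Lucene API.
abstract class PluggableSimilarity {
    private static PluggableSimilarity defaultImpl = new StandardSimilarity();

    // Installs an implementation globally.
    static void setDefaultSimilarity(PluggableSimilarity sim) {
        defaultImpl = sim;
    }

    static PluggableSimilarity getDefault() {
        return defaultImpl;
    }

    // Length normalization factor applied at index time.
    abstract float normalizeLength(int numTerms);
}

// Stand-in for the standard behaviour: shorter fields get a larger factor.
class StandardSimilarity extends PluggableSimilarity {
    @Override
    float normalizeLength(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}

// Override that disables length normalization, so only boosts reach the norm.
class NoLengthNormSimilarity extends StandardSimilarity {
    @Override
    float normalizeLength(int numTerms) {
        return 1.0f;
    }
}

// Stand-in for IndexWriter: uses its own similarity if set, else the default.
class SketchIndexWriter {
    private PluggableSimilarity similarity;

    void setSimilarity(PluggableSimilarity sim) {
        this.similarity = sim;
    }

    float computeNorm(float fieldBoost, int fieldLength) {
        PluggableSimilarity sim =
                (similarity != null) ? similarity : PluggableSimilarity.getDefault();
        return fieldBoost * sim.normalizeLength(fieldLength);
    }
}
```

Under this sketch, the DocumentWriter change described below would become a subclass that returns 1.0f from normalizeLength(), installed on the writer or globally, rather than a patch to the sources.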

Doug

David Birtwell wrote:
> Hi Dmitry,
> 
> I was faced with a similar problem.  We wanted to have a numeric rank 
> field in each document influence the order in which the documents were 
> returned by lucene.  While investigating a solution for this, I wanted 
> to see if I could implement strict sorting based on this numeric value. 
> I was able to accomplish this using document boosting, but not without 
> modifying the lucene source.  Our "ranking" field is an integer value 
> from one to one hundred.  I'm not sure if this will help you, but I'll 
> include a summary of what I did.
> 
> In DocumentWriter remove the normalization by field length:
>    float norm = fieldBoosts[n] * 
> Similarity.normalizeLength(fieldLengths[n]);
> to
>    float norm = fieldBoosts[n];
> 
> In TermScorer and PhraseScorer, modify the score() method to ignore the 
> lucene base score:
>    score *= Similarity.decodeNorm(norms[d]);
> to
>    score = Similarity.decodeNorm(norms[d]);
> 
> In Similarity.java, make byteToFloat() public.
> 
> At index time, use Similarity.byteToFloat() to determine your boost 
> value as in the following pseudocode:
>    Document d = new Document();
>    // ... add your fields ...
>    int rank = Integer.parseInt(d.get("RANK")); // rank can range from 0 to 255
>    float sortVal = Similarity.byteToFloat((byte) rank);
>    d.setBoost(sortVal);
> 
> If you'd like the reasoning behind any or all of these items, let me know.
> 
> DaveB
> 
> 
> 
> Dmitry Serebrennikov wrote:
> 
>> Greetings Everyone,
>>
>> I'm thinking of trying to build something that manipulates a query 
>> score in order to achieve a sort order other then the default 
>> relevance sort. The idea is to create a new type of query:
>> SortingQuery( Query query, String sortByField )
>>
>> It would run the sub-query and return results in an order of the 
>> values found in the "sortByField" for those documents. Now, I've 
>> looked at all of the sorting discussion prior to this, and the best 
>> approach (recommended by Doug among others) is to provide some sort of 
>> a fast access to the field values inside the HitCollector. Reading 
>> documents at search time is too slow, so people access the data 
>> elsewhere or build an in-memory index of that data (such as is done in 
>> the SearchBean's SortField).
>>
>> My idea is different. I want to try to do the following:
>> - compose a query that consists of the original sub-query followed by 
>> a special "sorting query"
>> - "boost" the score of the original sub-query to 0
>> - compute the score of the sorting query such that it would reflect 
>> the desired sort order
>>
>> Has anyone tried to do something like this?
>> Would this work?
>> Is this worth doing?
>> If it would, would then I have to do something during the indexing 
>> time to set normalization / scoring factors for that field to 
>> something or other?
>>
>> Thanks.
>> Dmitry.




Re: Question: using boost for sorting

Posted by David Birtwell <Da...@vwr.com>.

Hi Dmitry,

I was faced with a similar problem.  We wanted to have a numeric rank 
field in each document influence the order in which the documents were 
returned by Lucene.  While investigating a solution for this, I wanted 
to see if I could implement strict sorting based on this numeric value. 
I was able to accomplish this using document boosting, but not without 
modifying the Lucene source.  Our "ranking" field is an integer value 
from one to one hundred.  I'm not sure if this will help you, but I'll 
include a summary of what I did.

In DocumentWriter remove the normalization by field length:
    float norm = fieldBoosts[n] * Similarity.normalizeLength(fieldLengths[n]);
to
    float norm = fieldBoosts[n];

In TermScorer and PhraseScorer, modify the score() method to ignore the 
Lucene base score:
    score *= Similarity.decodeNorm(norms[d]);
to
    score = Similarity.decodeNorm(norms[d]);
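
To illustrate why replacing the score (rather than multiplying it) gives a 
strict sort, here is a small self-contained sketch.  It does not use Lucene 
at all: the Hit class, the sample values and the rankedIds() helper are 
hypothetical stand-ins for a matched document, its relevance score and its 
decoded norm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StrictSortSketch {
    // Hypothetical stand-in for one matched document.
    static final class Hit {
        final String id;
        final float baseScore; // relevance score computed by the scorer
        final float norm;      // decoded norm, here carrying only the boost

        Hit(String id, float baseScore, float norm) {
            this.id = id;
            this.baseScore = baseScore;
            this.norm = norm;
        }
    }

    // replace == false mimics "score *= norm"; replace == true mimics "score = norm".
    static float score(Hit h, boolean replace) {
        return replace ? h.norm : h.baseScore * h.norm;
    }

    // Returns document ids ordered by descending score.
    static List<String> rankedIds(List<Hit> hits, boolean replace) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Float.compare(score(b, replace), score(a, replace)));
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    // Made-up sample data: B has the highest boost but the lowest relevance.
    static List<Hit> sampleHits() {
        return Arrays.asList(
            new Hit("A", 0.9f, 1.0f),
            new Hit("B", 0.2f, 3.0f),
            new Hit("C", 0.5f, 2.0f));
    }

    public static void main(String[] args) {
        System.out.println(rankedIds(sampleHits(), false)); // prints [C, A, B]
        System.out.println(rankedIds(sampleHits(), true));  // prints [B, C, A]
    }
}
```

With multiplication the relevance score still reorders the hits; with 
replacement the order depends only on the boost stored in the norm.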

In Similarity.java, make byteToFloat() public.

At index time, use Similarity.byteToFloat() to determine your boost 
value as in the following pseudocode:
    Document d = new Document();
    ... add your fields ...
    // assumes RANK was added as a text field; rank must fit in one byte (0 to 255)
    int rank = Integer.parseInt(d.get("RANK"));
    float sortVal = Similarity.byteToFloat((byte) rank);
    d.setBoost(sortVal);
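
As a rough illustration of why the boost is chosen through byteToFloat(), 
the sketch below uses a stand-alone 8-bit float encoding (3-bit mantissa, 
5-bit exponent).  The bit layout is written from memory of how norms were 
packed into a byte at the time, so treat it as an assumption rather than 
the actual Similarity code; the class and the boostForRank() helper are 
hypothetical.  The point is that a boost picked this way survives the 
float-to-byte round trip exactly, so distinct ranks stay distinct and keep 
their order.

```java
class RankBoostSketch {
    // Decode one byte into a float (assumed layout: eeeeemmm, zero means 0.0f).
    static float byteToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    // Encode a float into one byte; values out of range are clamped.
    static byte floatToByte(float f) {
        if (f <= 0.0f) {
            return 0;
        }
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;
        int exponent = (((bits >> 24) & 0x7f) - 63) + 15;
        if (exponent > 31) {
            exponent = 31;
            mantissa = 7;
        }
        if (exponent < 0) {
            exponent = 0;
            mantissa = 1;
        }
        return (byte) ((exponent << 3) | mantissa);
    }

    // Hypothetical helper: the boost to pass to Document.setBoost() for a rank.
    static float boostForRank(int rank) {
        if (rank < 0 || rank > 255) {
            throw new IllegalArgumentException("rank must be in 0..255: " + rank);
        }
        return byteToFloat((byte) rank);
    }

    public static void main(String[] args) {
        // Show the boost for a few ranks and the byte it encodes back to.
        for (int rank : new int[] {1, 50, 100}) {
            float boost = boostForRank(rank);
            System.out.println(rank + " -> " + boost + " -> " + (floatToByte(boost) & 0xFF));
        }
    }
}
```

Under this layout the boost grows strictly with the rank, so sorting by the 
decoded norm is the same as sorting by rank.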

If you'd like the reasoning behind any or all of these items, let me know.

DaveB



Dmitry Serebrennikov wrote:

> Greetings Everyone,
>
> I'm thinking of trying to build something that manipulates a query 
> score in order to achieve a sort order other than the default 
> relevance sort. The idea is to create a new type of query:
> SortingQuery( Query query, String sortByField )
>
> It would run the sub-query and return results in an order of the 
> values found in the "sortByField" for those documents. Now, I've 
> looked at all of the sorting discussion prior to this, and the best 
> approach (recommended by Doug among others) is to provide some sort of 
> a fast access to the field values inside the HitCollector. Reading 
> documents at search time is too slow, so people access the data 
> elsewhere or build an in-memory index of that data (such as is done in 
> the SearchBean's SortField).
>
> My idea is different. I want to try to do the following:
> - compose a query that consists of the original sub-query followed by 
> a special "sorting query"
> - "boost" the score of the original sub-query to 0
> - compute the score of the sorting query such that it would reflect 
> the desired sort order
>
> Has anyone tried to do something like this?
> Would this work?
> Is this worth doing?
> If it would, would then I have to do something during the indexing 
> time to set normalization / scoring factors for that field to 
> something or other?
>
> Thanks.
> Dmitry.



