Posted to java-user@lucene.apache.org by Martin Braun <mb...@uni-hd.de> on 2006/12/20 17:32:09 UTC

to boost or not to boost

Hello all,

I am trying to boost more recent docs, i.e. docs with a greater year
value, like this:

    if (title.getEJ() != null) {
        // Encode the year (EJ) as the fractional part of the boost, e.g. 1973 -> 1.1973
        titleDocument.setBoost(Float.parseFloat("1." + title.getEJ()));
    }
So a doc from 1973 should get a boost of 1.1973 and a doc from 1975 should
get a boost of 1.1975.

I have indexed these two docs:


DOC 1:
Document<stored/uncompressed,indexed<katkey:1042362>
stored/uncompressed,indexed,termVector<EJ:1973>
indexed,tokenized<AU:Palandt, Otto
Danckelmann, Bernhard>
[...]

DOC 2:
Document<stored/uncompressed,indexed<katkey:1043960>
stored/uncompressed,indexed,termVector<EJ:1975>
indexed,tokenized<AU:Palandt, Otto
Danckelmann, Bernhard>
[...]

If I search for AU:palandt, I get this:

Explain for 1042362: 1.6931472 = fieldWeight(AU:palandt in 0), product of:
  1.0 = tf(termFreq(AU:palandt)=1)
  1.6931472 = idf(docFreq=2)
  1.0 = fieldNorm(field=AU, doc=0)

Explain for 1043960: 1.6931472 = fieldWeight(AU:palandt in 1), product of:
  1.0 = tf(termFreq(AU:palandt)=1)
  1.6931472 = idf(docFreq=2)
  1.0 = fieldNorm(field=AU, doc=1)


So the "older" doc is rated the same as (or even better than) the newer one?

Any ideas?

tia,
martin

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: boosting instead of sorting WAS: to boost or not to boost

Posted by Daniel Naber <lu...@danielnaber.de>.

On Thursday 21 December 2006 10:55, Martin Braun wrote:

> and in my case I have some documents
> which have same values in many fields (=>same score) and the only
> difference is the year.

Andrzej's response sounds like a good solution, so just for completeness: 
you can sort by more than one criterion, e.g. first by score, then by 
date.
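
To make the "first by score, then by date" ordering concrete, here is a minimal sketch in plain Java. It does not call Lucene at all; in Lucene itself the same idea would be expressed with a Sort built from several SortFields. The Hit class and its field names are hypothetical, and the two sample hits reuse the katkey, score and year values from the original post.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ScoreThenYearSketch {
    // Hypothetical stand-in for a search hit: document key, relevance score, publication year.
    static final class Hit {
        final String katkey;
        final float score;
        final int year;

        Hit(String katkey, float score, int year) {
            this.katkey = katkey;
            this.score = score;
            this.year = year;
        }
    }

    // Primary criterion: higher score first. Tie-breaker: newer year first.
    static final Comparator<Hit> BY_SCORE_THEN_YEAR =
            Comparator.comparingDouble((Hit h) -> h.score)
                    .reversed()
                    .thenComparing(Comparator.comparingInt((Hit h) -> h.year).reversed());

    // Returns a sorted copy and leaves the input list untouched.
    static List<Hit> sorted(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(BY_SCORE_THEN_YEAR);
        return copy;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("1042362", 1.6931472f, 1973));
        hits.add(new Hit("1043960", 1.6931472f, 1975));
        for (Hit h : sorted(hits)) {
            System.out.println(h.katkey + " score=" + h.score + " year=" + h.year);
        }
    }
}
```

Because the year only breaks ties, it can never outweigh a genuine difference in score, which addresses the worry about overboosting.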

regards
 Daniel

-- 
http://www.danielnaber.de

Re: boosting instead of sorting WAS: to boost or not to boost

Posted by Andrzej Bialecki <ab...@getopt.org>.

Suman Ghosh wrote:
> Andrzej,
>
> I have been trying to solve a similar problem where I need to boost
> the score based on the document type. Your approach is very interesting
> and I want to give it a try.
>
> I have an implementation-specific question. When you say to put in as
> many "1"s as the boost needs to be, do you mean that the resulting field
> should look like "1 1 1 1 1" or "1,1,1,1,1" so that the content is
> tokenized and indexed?

Yes.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: boosting instead of sorting WAS: to boost or not to boost

Posted by Suman Ghosh <su...@gmail.com>.

Andrzej,

I have been trying to solve a similar problem where I need to boost
the score based on the document type. Your approach is very interesting
and I want to give it a try.

I have an implementation-specific question. When you say to put in as
many "1"s as the boost needs to be, do you mean that the resulting field
should look like "1 1 1 1 1" or "1,1,1,1,1" so that the content is
tokenized and indexed?

Suman

On 12/21/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Martin Braun wrote:
> > Hi Daniel,
> >
> >
> >>> so a doc from 1973 should get a boost of 1.1973 and a doc of 1975 should
> >>> get a boost of 1.1975 .
> >>>
> >> The boost is stored with a limited resolution. Try boosting one doc by 10,
> >> the other one by 20 or something like that.
> >>
> >
> > You're right. I thought that float values would give good enough
> > resolution!
> > But the score only differs once the boosts differ by 0.2
> > (e.g. 1.7 and 1.9).
> >
> > I know there have been many questions on the list about scoring
> > newer documents higher.
> > But I want to avoid any overhead like "FunctionQuery" at query time,
> > and in my case I have some documents
> > which have the same values in many fields (=> same score), where the
> > only difference is the year.
> >
> > However, I don't want to overboost the score to the point that the
> > other scoring criteria no longer count.
> >
> > In short: as the result of a search I have a list of book titles, and
> > I want them sorted by score AND by year of publication.
> >
> > But for performance reasons I want to avoid sorting at query time
> > and boost at index time instead.
> >
> > Is that possible?
> >
>
> Here's the trick that works for me, without the issues of boost
> resolution or FunctionQuery.
>
> Add a separate field, say "days", in which you put as many "1" tokens
> as days have elapsed since the epoch (not necessarily since 1 Jan 1970 -
> pick a date that makes sense for you). Then, if you want to prioritize
> newer documents, just add "+days:1" to your query. Voila - the final
> score is the sum of the other score factors plus a factor that is
> higher for more recent documents, which contain more 1s.
>
> If you are dealing with large time spans, you can split this into years
> and days-in-a-year, and apply query boosts, like "+years:1^10.0
> +days:1^0.02". Do some experiments and find what works best for you.
>
>

Re: boosting instead of sorting WAS: to boost or not to boost

Posted by Andrzej Bialecki <ab...@getopt.org>.

Martin Braun wrote:
> Hi Daniel,
>
>   
>>> so a doc from 1973 should get a boost of 1.1973 and a doc of 1975 should
>>> get a boost of 1.1975 .
>>>       
>> The boost is stored with a limited resolution. Try boosting one doc by 10, 
>> the other one by 20 or something like that.
>>     
>
> You're right. I thought that float values would give good enough
> resolution!
> But the score only differs once the boosts differ by 0.2
> (e.g. 1.7 and 1.9).
>
> I know there have been many questions on the list about scoring
> newer documents higher.
> But I want to avoid any overhead like "FunctionQuery" at query time,
> and in my case I have some documents
> which have the same values in many fields (=> same score), where the
> only difference is the year.
>
> However, I don't want to overboost the score to the point that the
> other scoring criteria no longer count.
>
> In short: as the result of a search I have a list of book titles, and
> I want them sorted by score AND by year of publication.
>
> But for performance reasons I want to avoid sorting at query time
> and boost at index time instead.
>
> Is that possible?
>   

Here's the trick that works for me, without the issues of boost 
resolution or FunctionQuery.

Add a separate field, say "days", in which you put as many "1" tokens
as days have elapsed since the epoch (not necessarily since 1 Jan 1970 -
pick a date that makes sense for you). Then, if you want to prioritize
newer documents, just add "+days:1" to your query. Voila - the final
score is the sum of the other score factors plus a factor that is
higher for more recent documents, which contain more 1s.

If you are dealing with large time spans, you can split this into years 
and days-in-a-year, and apply query boosts, like "+years:1^10.0 
+days:1^0.02". Do some experiments and find what works best for you.
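
A minimal sketch of how such a field value could be built, in plain Java with no Lucene calls. The class and method names are hypothetical. It assumes the field is tokenized on whitespace, and that the term-frequency factor follows the default tf = sqrt(freq) formula. It also assumes length normalization is switched off for this field (for example by omitting norms), since a default 1/sqrt(length) norm would otherwise offset the gain from the extra tokens.

```java
import java.util.Collections;

class DaysFieldSketch {
    // Build a whitespace-separated run of "1" tokens, e.g. 3 -> "1 1 1".
    // Non-positive counts yield an empty string.
    static String repeatedOnes(int count) {
        return String.join(" ", Collections.nCopies(Math.max(count, 0), "1"));
    }

    // Term-frequency factor under the assumed default formula tf = sqrt(freq).
    static double tfFactor(int freq) {
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        int olderDoc = 3; // hypothetical: 3 time units since the chosen epoch
        int newerDoc = 5; // hypothetical: 5 time units since the chosen epoch
        System.out.println("older: \"" + repeatedOnes(olderDoc) + "\" tf=" + tfFactor(olderDoc));
        System.out.println("newer: \"" + repeatedOnes(newerDoc) + "\" tf=" + tfFactor(newerDoc));
    }
}
```

The newer document carries more "1" tokens, so a query clause on that field contributes a larger term-frequency factor for it. Because the factor grows with the square root of the count, the split into years and days with separate query boosts gives finer control over how much recency matters.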


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



boosting instead of sorting WAS: to boost or not to boost

Posted by Martin Braun <mb...@uni-hd.de>.

Hi Daniel,

>> so a doc from 1973 should get a boost of 1.1973 and a doc of 1975 should
>> get a boost of 1.1975 .
> 
> The boost is stored with a limited resolution. Try boosting one doc by 10, 
> the other one by 20 or something like that.

You're right. I thought that float values would give good enough
resolution!
But the score only differs once the boosts differ by 0.2
(e.g. 1.7 and 1.9).

I know there have been many questions on the list about scoring
newer documents higher.
But I want to avoid any overhead like "FunctionQuery" at query time,
and in my case I have some documents
which have the same values in many fields (=> same score), where the
only difference is the year.

However, I don't want to overboost the score to the point that the
other scoring criteria no longer count.

In short: as the result of a search I have a list of book titles, and
I want them sorted by score AND by year of publication.

But for performance reasons I want to avoid sorting at query time
and boost at index time instead.

Is that possible?

thanks,
Martin

-- 
Universitaetsbibliothek Heidelberg   Tel: +49 6221 54-2580
Ploeck 107-109, D-69117 Heidelberg   Fax: +49 6221 54-2623

Re: to boost or not to boost

Posted by Daniel Naber <lu...@danielnaber.de>.

On Wednesday 20 December 2006 17:32, Martin Braun wrote:

> so a doc from 1973 should get a boost of 1.1973 and a doc of 1975 should
> get a boost of 1.1975 .

The boost is stored with a limited resolution. Try boosting one doc by 10, 
the other one by 20 or something like that.
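
To illustrate the limited resolution, here is a self-contained sketch of a one-byte float encoding. It is modeled on my understanding of the norm encoding Lucene uses (SmallFloat.floatToByte315 in later versions); treat the exact bit layout, and the assumption that the boost ends up in this single-byte norm, as assumptions rather than a statement about any particular release. The class and method names are hypothetical.

```java
class BoostResolutionSketch {
    // Offset that maps the smallest representable exponent to byte value 0.
    private static final int ZERO = (63 - 15) << 3;

    // Squeeze a float into one byte by keeping only its top bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3); // drop the 21 low-order fraction bits
        if (small <= ZERO) {
            // zero or negative -> 0, positive underflow -> smallest positive value
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO + 0x100) {
            return (byte) -1; // overflow -> largest representable value
        }
        return (byte) (small - ZERO);
    }

    // Expand the byte back into the float it actually represents.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] boosts = {1.1973f, 1.1975f, 1.7f, 1.9f, 10f, 20f};
        for (float boost : boosts) {
            System.out.println(boost + " is stored as " + decode(encode(boost)));
        }
    }
}
```

Under this encoding, 1.1973 and 1.1975 collapse to the same stored value, while 1.7 and 1.9, or 10 and 20, land on different values.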

Regards
 Daniel

-- 
http://www.danielnaber.de
