Posted to java-user@lucene.apache.org by HAIDUC SONIA <ha...@yahoo.com> on 2011/04/16 17:43:34 UTC

Choosing boosting in Lucene

Hello,

I have a few questions about boosting in Lucene. I am running a research
project where I have, for each document, 4 fields: f1, f2, f3, f4. I also
have a set of queries for my corpus, and I know the relevant documents for
each of these queries. What I want to study is how boosting affects the
search results of these queries. Basically, I want to show that by boosting
some of these fields the results are better (I hope).
I have, though, a few essential questions that I cannot figure out and I
would really appreciate some help...

1. Is there any difference between boosting the fields at index time and
boosting the terms in the queries which appear in these fields at search
time? 
Again, I know beforehand the set of queries and also the terms in these
queries which appear in the documents in the corpus in each of the fields.

2. In what range are boosting values usually chosen? I.e., should I choose
boosts in a 0.5-2 range (say 0.5, 1, 1.5, 2), like I have seen in some
examples, or is it the same if I choose boosts in a range like 50-200
(respectively 50, 100, 150, 200)? 

3. How sensitive is boosting in Lucene? For example, if I know approximately
the importance of each field, and I want to assign boosting values
accordingly, what would be good differences between the values of the
boosting factor for the different fields? More precisely, if the importance
order is f1<f2<f3<f4, will it matter if I choose the boosts as (1,2,3,4), or
(1, 5, 10, 15)?

4. Is there any method besides trial and error for finding the boosts for
each field that work the best for a particular corpus? 

Thank you very much,
Cristina

Re: Choosing boosting in Lucene

Posted by Yiannis Gkoufas <jo...@gmail.com>.
On Sat, Apr 16, 2011 at 6:43 PM, HAIDUC SONIA <ha...@yahoo.com> wrote:

> Hello,
>
> I have a few questions about boosting in Lucene. I am running a research
> project where I have, for each document, 4 fields: f1, f2, f3, f4. I also
> have a set of queries for my corpus, and I know the relevant documents for
> each of these queries. What I want to study is how boosting affects the
> search results of these queries. Basically, I want to show that by boosting
> some of these fields the results are better (I hope).
> I have, though, a few essential questions that I cannot figure out and I
> would really appreciate some help...
>
> 1. Is there any difference between boosting the fields at index time and
> boosting the terms in the queries which appear in these fields at search
> time?
>

I have noticed that for some queries the results are the same and for others
they are not. I would expect them to be the same; if someone knows why this
happens, an explanation would be very useful.


> Again, I know beforehand the set of queries and also the terms in these
> queries which appear in the documents in the corpus in each of the fields.
>
> 2. In what range are boosting values usually chosen? I.e., should I choose
> boosts in a 0.5-2 range (say 0.5, 1, 1.5, 2), like I have seen in some
> examples, or is it the same if I choose boosts in a range like 50-200
> (respectively 50, 100, 150, 200)?
>

After running numerous tests I have concluded that a boost factor up to 20
makes sense.


>
> 3. How sensitive is boosting in Lucene? For example, if I know
> approximately
> the importance of each field, and I want to assign boosting values
> accordingly, what would be good differences between the values of the
> boosting factor for the different fields? More precisely, if the importance
> order is f1<f2<f3<f4, will it matter if I choose the boosts as (1,2,3,4),
> or
> (1, 5, 10, 15)?
>

Yes, it will certainly matter.
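
To make questions 2 and 3 concrete, here is a minimal Java sketch. It assumes
a toy model in which a document's score is simply the boost-weighted sum of
per-field match scores. That is a simplification for illustration, not
Lucene's actual scoring formula, and the class name and all numbers are made
up. Under that assumption, multiplying every boost by the same constant leaves
the ranking unchanged, while changing the ratios between boosts can reorder
documents.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy model (an assumption): score = sum over fields of boost * per-field match score. */
final class BoostRankingDemo {

    /** Boost-weighted sum of per-field scores; not Lucene's full scoring formula. */
    static double score(double[] fieldScores, double[] boosts) {
        double total = 0.0;
        for (int i = 0; i < fieldScores.length; i++) {
            total += boosts[i] * fieldScores[i];
        }
        return total;
    }

    /** Returns document indices ordered from highest to lowest score. */
    static int[] rank(double[][] docs, double[] boosts) {
        Integer[] order = new Integer[docs.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Negate the score so that the natural ascending sort yields descending scores.
        Arrays.sort(order, Comparator.comparingDouble((Integer d) -> -score(docs[d], boosts)));
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical per-field match scores (f1..f4) for two documents.
        double[][] docs = {
            {0.0, 1.0, 0.0, 0.0},   // doc 0 matches only in f2
            {3.0, 0.0, 0.0, 0.0},   // doc 1 matches strongly in f1
        };
        System.out.println(Arrays.toString(rank(docs, new double[] {1, 2, 3, 4})));
        System.out.println(Arrays.toString(rank(docs, new double[] {1, 5, 10, 15})));
        System.out.println(Arrays.toString(rank(docs, new double[] {100, 200, 300, 400})));
    }
}
```

With these made-up numbers, main prints [1, 0], then [0, 1], then [1, 0]:
(1,2,3,4) and (1,5,10,15) rank the two documents differently, while
(100,200,300,400) ranks them exactly like (1,2,3,4).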


>
> 4. Is there any method besides trial and error for finding the boosts for
> each field that work the best for a particular corpus?
>

In order to get the best boosts for my tests, I tried brute-forcing it.
In theory, you could also experiment with vector space models or a modified
perceptron.
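
A brute-force search of this kind could look roughly like the Java sketch
below. It is only an illustration under stated assumptions: the scoring is the
same toy boost-weighted sum of per-field scores (not Lucene's formula), the
quality measure is mean average precision over the queries with known relevant
documents, and all class and method names are hypothetical. In a real
experiment the toy score() would be replaced by actually running each query
against the index with the candidate boosts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Brute-force search for per-field boosts that maximize mean average precision (MAP). */
final class BoostGridSearch {

    /** Toy scoring model (an assumption): boost-weighted sum of per-field scores. */
    static double score(double[] fieldScores, double[] boosts) {
        double total = 0.0;
        for (int f = 0; f < fieldScores.length; f++) {
            total += boosts[f] * fieldScores[f];
        }
        return total;
    }

    /** Average precision for one query; docs[d][f] is the score of doc d in field f. */
    static double averagePrecision(double[][] docs, double[] boosts, Set<Integer> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        List<Integer> order = new ArrayList<>();
        for (int d = 0; d < docs.length; d++) {
            order.add(d);
        }
        // Sort document ids by descending score (stable, so ties keep id order).
        order.sort((a, b) -> Double.compare(score(docs[b], boosts), score(docs[a], boosts)));
        int hits = 0;
        double precisionSum = 0.0;
        for (int rank = 0; rank < order.size(); rank++) {
            if (relevant.contains(order.get(rank))) {
                hits++;
                precisionSum += (double) hits / (rank + 1);
            }
        }
        return precisionSum / relevant.size();
    }

    /** Mean of the per-query average precision values. */
    static double meanAveragePrecision(double[][][] queries, List<Set<Integer>> relevant,
                                       double[] boosts) {
        double sum = 0.0;
        for (int q = 0; q < queries.length; q++) {
            sum += averagePrecision(queries[q], boosts, relevant.get(q));
        }
        return sum / queries.length;
    }

    /** Tries every combination of candidate boosts; returns the first one with the best MAP. */
    static double[] search(double[][][] queries, List<Set<Integer>> relevant,
                           double[] candidates, int numFields) {
        double[] best = null;
        double bestMap = -1.0;
        int combos = (int) Math.pow(candidates.length, numFields);
        for (int c = 0; c < combos; c++) {
            double[] boosts = new double[numFields];
            int code = c;
            for (int f = 0; f < numFields; f++) {
                // Decode c as a base-(candidates.length) number, one digit per field.
                boosts[f] = candidates[code % candidates.length];
                code /= candidates.length;
            }
            double map = meanAveragePrecision(queries, relevant, boosts);
            if (map > bestMap) {
                bestMap = map;
                best = boosts;
            }
        }
        return best;
    }
}
```

The number of combinations grows as (number of candidate values) to the power
of (number of fields), so with 4 fields a coarse grid is advisable. It is also
worth tuning on one part of the query set and reporting results on the rest,
since boosts chosen this way can overfit the queries they were tuned on.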


>
> Thank you very much,
> Cristina
>

Best Regards,
Yiannis

Re: Choosing boosting in Lucene

Posted by Anshum <an...@gmail.com>.
Hi Cristina,
Lucene scores each doc per search based on its scoring formula. As there is
a lot of query-related normalization and other components, the scores for
docs change as the query changes.
To understand in detail how boosting affects the score, you may read
about *lucene scoring* at
http://lucene.apache.org/java/3_1_0/scoring.html
And the *scoring formula* at:
http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/search/Similarity.html

Regarding the difference between index-time and search-time boost: a
search-time boost is applied at the term level and, generally speaking, an
index-time boost is applied at the field/doc level.
Also, if you have a look at the scoring formula in the Similarity class (link
provided above), you'll be in a better position to understand the difference
(and there is some).
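
For reference, the practical scoring function described in the Similarity
Javadoc for the 3.x line has roughly the following shape. This is a
transcription written out for convenience, so the linked page remains the
authoritative version:

```latex
\[
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot
  t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
\]
\[
\mathrm{norm}(t,d) = d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(\mathit{field}) \cdot
  \prod_{f \in d,\ f \text{ named as } t} f.\mathrm{getBoost}()
\]
```

A search-time boost enters through t.getBoost() (and through queryNorm), while
index-time document and field boosts are folded into norm(t,d). In Lucene 3.x
that norm is stored as a single byte per field per document, so index-time
boosts lose precision. This is one likely reason why the two kinds of boost do
not always produce identical rankings.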
You should also use the *IndexSearcher's explain method*:
http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int)

Choosing the boost again depends on what you want to achieve; these are
subjective questions. You should try different sets of values and look at the
scores using the explain function to figure out what fits you best.
Suitable boost values for relevance can, again, be found by varying the
boosts *via* *trial and error*. That is pretty much the general practice.

Hope this helps you figure out a reasonable solution and boost values.

--
Anshum Gupta
http://ai-cafe.blogspot.com

