
Posted to java-user@lucene.apache.org by Akos Tajti <ak...@gmail.com> on 2012/04/26 21:12:46 UTC

two fields, the first important than the second

Dear List,

we've been struggling with the following problem for a while:
we have two fields: title and description. Title is generated from short
summaries while description is generated from long texts. We want to search
on both fields at the same time but we'd like to get all documents in which
the title matches the search term before all others. For multi-term queries
we want to achieve the following: all documents that contain all terms in
their title must come before every other document, no matter how many times
the description matches the query. Is there a simple way to achieve this?

Thanks in advance,
Ákos Tajti

Re: two fields, the first important than the second

Posted by Ian Lea <ia...@gmail.com>.

If you really mean "must" and "always", you'll probably have to
execute 2 searches.  First on title alone then on description, or
title and description, merging the hit lists as appropriate.


--
Ian.
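
One way to read the two-search suggestion above, as a minimal sketch in
plain Java with no Lucene dependency. It assumes each search has already
produced a list of document ids, best hit first: one from a title-only
query that requires every term (for example +title:hello +title:world) and
one from the broader title-and-description query. The class and method
names (TitleFirstMerge, merge) are hypothetical and only illustrate the
merge step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Standalone sketch: merges two ranked hit lists so title hits always come first. */
final class TitleFirstMerge {

    /**
     * @param titleHits   document ids from the title-only search, best first
     * @param generalHits document ids from the broader search, best first
     * @return title hits in their original order, then the remaining general
     *         hits in their original order, without duplicates
     */
    static List<Integer> merge(List<Integer> titleHits, List<Integer> generalHits) {
        // LinkedHashSet keeps first-insertion order and ignores repeated ids,
        // so a document found by both searches keeps its title-list position.
        Set<Integer> merged = new LinkedHashSet<>(titleHits);
        merged.addAll(generalHits);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<Integer> titleHits = Arrays.asList(7, 3);
        List<Integer> generalHits = Arrays.asList(9, 7, 4, 3);
        System.out.println(merge(titleHits, generalHits)); // prints [7, 3, 9, 4]
    }
}
```

Because the ordering comes from list position rather than from scores, no
term frequency in the description can push a document ahead of a title hit.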



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: two fields, the first important than the second

Posted by Akos Tajti <ak...@gmail.com>.

Jake,

we're already using index-time boosts and tried query-time boosts earlier.
Neither of them helped. The problem was that if the description contained a
part of a multi-term query many times, it got a higher score than the
documents that contained the terms in their title. So it is hard to set the
boosts in such a way that they always work as expected. Or is there an easy
solution for that? Am I doing something wrong?

Ákos


Re: two fields, the first important than the second

Posted by jake dsouza <ja...@gmail.com>.

Hi,

I think what you are looking for is a boost factor that you can use in your
score. Take a look at
http://lucene.apache.org/core/3_6_0/scoring.html#Score Boosting

- Jake


Re: two fields, the first important than the second

Posted by Li Li <fa...@gmail.com>.

+(title:hello title:world desc:hello desc:world)
(+title:hello +title:world)^100
(+desc:hello +desc:world)^50
(+title:hello +desc:world)^10
(+desc:hello +title:world)^10

The boost values (100, 50, 10, 10) should be carefully adjusted.
If the tf of a document is very large, 10 may not be enough.
You can subclass DefaultSimilarity and override methods such as tf() and
idf() to constrain them to a controllable range.
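
A minimal sketch of the capping idea in plain Java, with no Lucene
dependency. It assumes the default tf() is the square root of the raw
frequency, as in the DefaultSimilarity of Lucene 3.x; the names CappedTf,
cappedTf, and maxTf are hypothetical. In a real index the same logic would
presumably live in an overridden tf() of a DefaultSimilarity subclass.

```java
/** Standalone sketch: how capping term frequency bounds a field's score contribution. */
final class CappedTf {

    /** Square root of the raw frequency, the curve assumed for the default tf(). */
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Same curve, but never above maxTf (a hypothetical tuning parameter). */
    static float cappedTf(float freq, float maxTf) {
        return Math.min(defaultTf(freq), maxTf);
    }

    public static void main(String[] args) {
        float[] freqs = {1f, 4f, 100f, 10000f};
        for (float freq : freqs) {
            // With a cap of 2, a term repeated 10000 times counts no more
            // than a term repeated 4 times.
            System.out.println(freq + " -> default " + defaultTf(freq)
                    + ", capped " + cappedTf(freq, 2f));
        }
    }
}
```

With the tf factor bounded, the ratio between the boost values has a fixed
worst case to beat, which makes them easier to choose.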



Re: two fields, the first important than the second

Posted by Akos Tajti <ak...@gmail.com>.

Thanks for the detailed explanation. But as I understand it, this query will
still match only documents that contain both terms (either in the same
field or in different ones). What if there's a document that contains only
"hello"? This query will not find it, am I right? But finding it is what we
want to achieve. So in the results, the documents that contain both terms
have to come first, then those that contain only one of them.

Ákos




Re: two fields, the first important than the second

Posted by Li Li <fa...@gmail.com>.

Sorry for some typos.
original query  +(title:hello desc:hello) +(title:world desc:world)
boosted one     +(title:hello^2 desc:hello) +(title:world^2 desc:world)
last one        +(title:hello desc:hello) +(title:world desc:world)
                (+title:hello +title:world)^10 (+desc:hello +desc:world)^5

The example has two terms. If it has more terms, the query will become too
complicated.
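
For the "last one" shape specifically, the clause count grows only linearly
with the number of terms, so the string can be generated rather than
written by hand. The following is a minimal sketch in plain Java under
several assumptions: the class and method names (TieredQueryBuilder,
allInField, build) are hypothetical, the term list is non-empty, and the
terms are already analyzed single tokens that need no query-parser
escaping.

```java
import java.util.Arrays;
import java.util.List;

/** Standalone sketch: builds the boosted query string for any number of terms. */
final class TieredQueryBuilder {

    /** Joins required clauses for one field, e.g. "+title:hello +title:world". */
    static String allInField(String field, List<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(field).append(':').append(term);
        }
        return sb.toString();
    }

    /** Builds the required per-term clauses followed by the two optional boosted clauses. */
    static String build(List<String> terms, int titleBoost, int descBoost) {
        StringBuilder sb = new StringBuilder();
        // Required part: every term must occur in title or desc.
        for (String term : terms) {
            sb.append("+(title:").append(term)
              .append(" desc:").append(term).append(") ");
        }
        // Optional part: a bonus when a single field contains all terms.
        sb.append('(').append(allInField("title", terms)).append(")^")
          .append(titleBoost).append(' ');
        sb.append('(').append(allInField("desc", terms)).append(")^")
          .append(descBoost);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("hello", "world"), 10, 5));
    }
}
```

For the terms hello and world with boosts 10 and 5, this reproduces the
two-line "last one" query above as a single string.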


Re: two fields, the first important than the second

Posted by Li Li <fa...@gmail.com>.

You should describe your ranking strategy more precisely.
Suppose the query has 2 terms, "hello" and "world", and your search
fields are title and description. There are many possible combinations.
Here is my understanding.
Both terms should occur in title or desc, so the query may be
    +(title:hello desc:hello) +(title:world desc:world)
The problem is that we need title to weigh more than desc, so maybe we
rewrite it to
    +(title:hello^2 desc:hello) +(title:world^2 desc:world)
But consider these two scenarios:
    1. hello hits only in title, world hits only in desc
    2. hello and world both hit in desc
Because title is boosted, 1 scores higher than 2.
But we may think 2 is better than 1 because "hello world" is a phrase.
We don't want to use a phrase query, though, because it is so strict that
the recall can't meet our needs.
Our solution is to modify Lucene so the boolean scorer can tell us which
terms matched; then we use our own collector to boost scenario 1. This
solution requires modifying Lucene (I have posted a mail, and you can patch
your DisjunctionSumScorer with
https://issues.apache.org/jira/browse/LUCENE-2686).
Another solution I can come up with is a more complicated query:
    +(title:hello desc:hello) +(title:world desc:world)
    (+title:hello +title:world)^10 (+desc:hello +desc:world)^5
The must-occur condition is the same as before, but if hello and world are
both in title, we give the document a boost. Similarly, if hello and world
are both in desc, we also boost it.

