Posted to java-user@lucene.apache.org by rama44ster <ra...@gmail.com> on 2015/01/15 10:34:31 UTC

Multi-valued field and numTerms

Hi,
I am using Lucene to index documents that have a multivalued text field
named ‘city’.
Each document might have multiple values for this field, e.g. 'la' and
'los angeles'.

Assuming
document d1 contains city = la ; city = los angeles
document d2 contains city = la mirada
document d3 contains city = la quinta

Now when I search for 'la', I would prefer to get d1 first, as it has the
exact match, i.e. a field value with no terms beyond those in the query.
I read that Lucene already prefers documents with fewer terms, because
DefaultSimilarity.computeNorm does

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

The problem is that I am not sure how numTerms is calculated for a
multivalued field like city. Would numTerms for d1 be 1 or 3? Would it
be the sum of the term counts of all the field values?
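
To make the stakes concrete, here is a minimal sketch that evaluates the
quoted formula for the candidate term counts. It is plain Java, not
Lucene's own class: LengthNormSketch and lengthNorm are hypothetical
names, and the boost is assumed to be 1.0. If numTerms were the sum over
all values and the analyzer split on whitespace (both assumptions), d1
would have 3 terms while d2 and d3 would have 2 each.

```java
public class LengthNormSketch {

    // Mirrors the quoted formula with state.getBoost() assumed to be 1.0.
    // Hypothetical helper for illustration, not part of Lucene.
    public static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Candidate term counts from the question: 1, 2 and 3.
        for (int numTerms = 1; numTerms <= 3; numTerms++) {
            System.out.println("numTerms=" + numTerms
                    + " -> norm=" + lengthNorm(numTerms));
        }
    }
}
```

The norm is 1.0 for one term, about 0.71 for two and about 0.58 for
three, so under the "sum" reading d1 would get a smaller norm than d2
and d3.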

Any idea on how to make the document d1 rank higher than d2 and d3?

Thanks in advance,
Prasad.

Re: Multi-valued field and numTerms

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
One thing we have done to prefer "exact" matches is to index magic
anchoring terms at the start and end of every field value and then use
phrase queries to boost exact matches. E.g. you would index

document 1 city = __anchor__ la __anchor__ ; city = __anchor__ los 
angeles __anchor__

then you can query for:

la "__anchor__ la __anchor__"^2

This won't do exactly what you asked for, but it might be what you want.
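
The anchoring idea can be sketched in plain Java as string handling
only. AnchorSketch, anchorValue and boostedQuery are hypothetical names;
the sketch assumes the analyzer keeps __anchor__ as a single token and
does no escaping of query-syntax characters in the user's input.

```java
public class AnchorSketch {

    // The magic term from the example above.
    static final String ANCHOR = "__anchor__";

    // Wraps one field value in anchors before it is indexed.
    public static String anchorValue(String value) {
        return ANCHOR + " " + value.trim() + " " + ANCHOR;
    }

    // Builds the query text: the plain user query plus a boosted phrase
    // that only matches a value consisting of exactly the query terms.
    public static String boostedQuery(String userQuery, int boost) {
        return userQuery + " \"" + anchorValue(userQuery) + "\"^" + boost;
    }

    public static void main(String[] args) {
        System.out.println(anchorValue("la"));
        System.out.println(anchorValue("los angeles"));
        System.out.println(boostedQuery("la", 2));
    }
}
```

With these values the phrase clause matches the anchored 'la' value of
d1 but not 'la mirada' or 'la quinta', since another term sits between
'la' and the closing anchor there.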

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi-valued field and numTerms

Posted by Michael McCandless <lu...@mikemccandless.com>.
Normally Lucene will count your d1 as having length=2.

However, if "la" was added as a synonym for "los angeles", such that
it "overlaps" its position, then the default similarity discounts that
and will count it as length=1.

But for that to work, the second token must have the same position as
the previous token (a position increment of 0).
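
As a rough illustration of the counting described above, the sketch
below counts tokens from their position increments and optionally skips
overlapping ones. It is plain Java, not Lucene code; OverlapLengthSketch
and fieldLength are hypothetical names, and each field value is assumed
to be analyzed to a single token. An analyzer that splits values into
words would produce higher counts.

```java
public class OverlapLengthSketch {

    // Counts the tokens of a field. When discountOverlaps is true, a
    // token with position increment 0 (stacked on the previous token's
    // position) is not counted.
    public static int fieldLength(int[] positionIncrements,
                                  boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            if (discountOverlaps && increment == 0) {
                continue; // overlapping token, e.g. a synonym
            }
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        // Two values indexed one after the other: each advances the position.
        int[] separateValues = {1, 1};
        // Second token injected as a synonym at the same position.
        int[] synonymOverlap = {1, 0};

        System.out.println(fieldLength(separateValues, true));
        System.out.println(fieldLength(synonymOverlap, true));
        System.out.println(fieldLength(synonymOverlap, false));
    }
}
```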

Mike McCandless

http://blog.mikemccandless.com


