Posted to solr-user@lucene.apache.org by Jeff Wartes <jw...@whitepages.com> on 2013/08/07 22:05:16 UTC

TermFrequency in a multi-valued field

This might end up being more of a Lucene question, but anyway...

For a multivalued field, it appears that term frequency is calculated as
something a little like:

sum(tf(value1), ..., tf(valueN))

I'd rather my score not give preference based on how *many* of the values
in the multivalued field matched; I want it to give preference based on
the value that matched *best*. In other words, something more like:

max(tf(value1), ..., tf(valueN))


Put another way, I want a search like q=mvf:foo against a document with a
multivalued field: 
mvf: [ "foo" ]
to get scored the exact same as a document with a multivalued field:
mvf: [ "foo", "foo" ]
but worse than a document with a multivalued field:
mvf: [ "foo foo" ]
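
To make the difference concrete, here is a minimal sketch of the two aggregations applied to the three example documents. It is not Lucene code: the whitespace tokenizer and the function names (term_count, summed_tf, max_tf) are assumptions made for illustration, and it only counts raw occurrences, ignoring length normalization and any other scoring factor.

```python
def term_count(value, term):
    """Count occurrences of `term` in one field value (naive whitespace split)."""
    return value.lower().split().count(term)


def summed_tf(values, term):
    """Frequency as described in the post: occurrences summed over all values."""
    return sum(term_count(v, term) for v in values)


def max_tf(values, term):
    """Desired behaviour: only the best-matching single value counts."""
    return max((term_count(v, term) for v in values), default=0)


# The three example documents from the post (hypothetical labels).
docs = {
    "one value": ["foo"],
    "two values": ["foo", "foo"],
    "one value, two terms": ["foo foo"],
}

for label, values in docs.items():
    print(label, "sum:", summed_tf(values, "foo"), "max:", max_tf(values, "foo"))
```

With the summed count, the second and third documents are indistinguishable; with the max, the first and second are tied and the third ranks higher, which is the ordering asked for.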


I'm guessing this'd require a custom Similarity implementation, but I'm
beginning to wonder if even that is low enough level.
Other thoughts? This seems like a pretty obvious desire.

Thanks.

Re: TermFrequency in a multi-valued field

Posted by Jeff Wartes <jw...@whitepages.com>.


>A multivalued text field is directly equivalent to concatenating the
>values, with a possible position gap between the last and first terms of
>adjacent values.


That, in a nutshell, would be the problem. Maybe the discussion is over at
this point. 


It could be I dumbed down the problem a bit too much for illustration
purposes. I'm actually doing phrase query matches with slop. As such, the
search phrase I'm interested in could easily be in more than one of the
(unique) values, and the score for each value-match could be very
different when considered alone.

For document scoring purposes, I don't care that (for example) I got a
sloppy match on one value if I got a nice phrase out of another value in
the same document. In fact, I explicitly want to ignore the fact that
there was also a sloppy match. I also don't care if the exact phrase
occurred in more than one value, and I don't want a match in more than
one value to influence that document's score.

Re: TermFrequency in a multi-valued field

Posted by Jack Krupansky <ja...@basetechnology.com>.

A multivalued text field is directly equivalent to concatenating the values, 
with a possible position gap between the last and first terms of adjacent 
values.
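
A rough sketch of what that concatenation looks like, assuming whitespace tokenization and a gap of 100 (the function name index_positions and the position_gap parameter are hypothetical; in Solr the gap is set by the positionIncrementGap attribute on the field type). This only models token positions, not Lucene's actual indexing code.

```python
def index_positions(values, position_gap=100):
    """Return (token, position) pairs as if all values formed one token stream.

    Tokens within a value get consecutive positions; an extra gap is
    inserted between the last token of one value and the first of the next.
    """
    positions = []
    next_pos = 0
    for i, value in enumerate(values):
        if i > 0:
            next_pos += position_gap  # gap between adjacent values
        for token in value.split():
            positions.append((token, next_pos))
            next_pos += 1
    return positions


def term_frequency(values, term):
    """Frequency of `term` over the whole concatenated stream."""
    return sum(1 for token, _ in index_positions(values) if token == term)


print(index_positions(["foo bar", "baz"]))
print(term_frequency(["foo", "foo"], "foo"))
```

The gap keeps a phrase query from matching across value boundaries (unless the slop is large enough to span it), but the frequency count still sees a single stream, so per-value boundaries are not available at scoring time.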

Term frequency is driven by the terms from the query, not the terms from the 
field (tf(query-term), not tf(field-term)). Your "max" formula doesn't quite 
make sense in that light.

Why do you have two "foo" in the same field if you don't mean them to be... 
two "foo"??

You can use the Uniq update processor to eliminate duplicate values in 
multivalued fields (where the whole value matches, not individual terms 
within values).

You need to clarify your use case.

-- Jack Krupansky
