Posted to user@lucenenet.apache.org by "Allan, Brad (Bracknell)" <Br...@Fiserv.com> on 2013/05/09 18:08:41 UTC
Minimize document hits based on number of matching terms between source text terms and document field terms
I'd like to get any comments about how I might do this. I have listed some options below, which of course I'll investigate.
Example first:
Name Field
--------------
Mr. Youness Rokven
Mr. Joe Paul Harry Arnold
Mr. Paul B. Mitchell
Mrs. Fernanda Joe Mitchell
Ms. Jade Paula Victoria Muir
Mr. Joe Harvey Pope
If I search the above with text such as "Joe P.H. Arnold", it is turned into the query:
((Joe) or (P) or (H) or (Arnold))
I get hits:
Mr. Joe Paul Harry Arnold
Mrs. Fernanda Joe Mitchell
Mr. Joe Harvey Pope
And the scores are great! The top hit has a higher relative score.
What I'd like to do is exclude hits where, say, fewer than 2 terms matched the document field terms.
Options I think:
1.) Override DefaultSimilarity?
2.) Construct awkward searches, example:
((Joe) and (P)) or ((Joe) and (H)) or ((Joe) and (Arnold)) etc ... all the possible combinations
3.) Use TermVector information? I don't know much about this, but my thought is that if highlighting knows the matching terms, perhaps I can use that?
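To show why option 2 gets awkward, here is a minimal C# sketch that generates the pairwise query text for a list of terms. The class and method names (PairwiseQuery, Build) are hypothetical, and the output only mirrors the "and"/"or" notation used above, not any particular query parser syntax. The number of clauses grows as n*(n-1)/2, so 4 terms already produce 6 clauses.

```csharp
using System;
using System.Collections.Generic;

public static class PairwiseQuery
{
    // Hypothetical helper: builds "((a) and (b)) or ((a) and (c)) ..."
    // with one clause for every unordered pair of terms.
    public static string Build(IList<string> terms)
    {
        var clauses = new List<string>();
        for (int i = 0; i < terms.Count; i++)
        {
            for (int j = i + 1; j < terms.Count; j++)
            {
                clauses.Add("((" + terms[i] + ") and (" + terms[j] + "))");
            }
        }
        return string.Join(" or ", clauses);
    }

    public static void Main()
    {
        // Terms taken from the example search text "Joe P.H. Arnold".
        Console.WriteLine(Build(new[] { "Joe", "P", "H", "Arnold" }));
    }
}
```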
Would be grateful for comments.
Thanks!
________________________________
CheckFree Solutions Limited (trading as Fiserv)
Registered Office: Eversheds House, 70 Great Bridgewater Street, Manchester, M15 ES
Registered in England: No. 2694333
Re: Minimize document hits based on number of matching terms between source text terms and document field terms
Posted by Simon Svensson <si...@devhost.se>.
Hi,
QueryParser.Parse will return a BooleanQuery when you've given it
several terms. You can set MinimumNumberShouldMatch to get the behavior
you want.
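    var query = queryParser.Parse(...);
    // Parse returns a Query; it is only a BooleanQuery when there are several clauses.
    var boolQuery = query as BooleanQuery;
    if (boolQuery != null) {
        // Require at least two of the optional (SHOULD) clauses to match.
        boolQuery.MinimumNumberShouldMatch = 2;
    }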
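To illustrate what a minimum-match threshold of 2 does to the example names, here is a plain C# sketch that does not use Lucene.Net at all. It counts how many distinct search terms appear in each name and keeps only the names that reach the threshold. The tokenization (lowercase, split on anything that is not a letter or digit) is an assumption that only roughly mimics what an analyzer would do, and the names MinMatchDemo, Tokenize and Filter are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MinMatchDemo
{
    // Assumption: lowercase the text and split on any run of non-alphanumeric characters.
    public static HashSet<string> Tokenize(string text)
    {
        var parts = Regex.Split(text.ToLowerInvariant(), "[^a-z0-9]+");
        return new HashSet<string>(parts.Where(p => p.Length > 0));
    }

    // Keeps the names that contain at least minimumMatches distinct search terms.
    public static List<string> Filter(IEnumerable<string> names, string searchText, int minimumMatches)
    {
        var queryTerms = Tokenize(searchText);
        return names
            .Where(name => Tokenize(name).Count(t => queryTerms.Contains(t)) >= minimumMatches)
            .ToList();
    }

    public static void Main()
    {
        var names = new[]
        {
            "Mr. Youness Rokven",
            "Mr. Joe Paul Harry Arnold",
            "Mr. Paul B. Mitchell",
            "Mrs. Fernanda Joe Mitchell",
            "Ms. Jade Paula Victoria Muir",
            "Mr. Joe Harvey Pope"
        };

        foreach (var hit in Filter(names, "Joe P.H. Arnold", 2))
        {
            Console.WriteLine(hit);
        }
    }
}
```

With a threshold of 1 this sketch keeps the same three names as the original "or" query; with a threshold of 2 only the name matching both "joe" and "arnold" remains. Real Lucene.Net matching also depends on the analyzer used at index and query time, so treat this only as an illustration of the counting rule.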
// Simon