
Posted to user@lucenenet.apache.org by "Allan, Brad (Bracknell)" <Br...@Fiserv.com> on 2013/05/09 18:08:41 UTC

Minimize document hits based on number of matching terms between source text terms and document field terms

I'd like to get any comments on how I might do this. I have listed some options below, which of course I'll investigate.

Example first:
Name Field
--------------
Mr. Youness Rokven

Mr. Joe Paul Harry Arnold

Mr. Paul B. Mitchell

Mrs. Fernanda Joe Mitchell

Ms. Jade Paula Victoria Muir

Mr. Joe Harvey Pope


If I search the above with text such as "Joe P.H. Arnold", which is turned into the query:
((Joe) or (P) or (H) or (Arnold))

I get hits:
Mr. Joe Paul Harry Arnold

Mrs. Fernanda Joe Mitchell

Mr. Joe Harvey Pope


And the scores are great! The top hit has a higher relative score.

What I'd like to do is exclude hits where, say, fewer than 2 of the search terms matched the document field terms.

Options I think:

1.)    Override DefaultSimilarity?

2.)    Construct awkward searches, for example:

((Joe) and (P)) or ((Joe) and (H)) or ((Joe) and (Arnold))   etc ... all the possible combinations

3.)    Use TermVector information? I don't know much about this, but my thought is that if highlighting knows the matching terms, perhaps I could use that?
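
To make the cost of option 2 concrete, here is a minimal sketch that generates every two-term combination in the notation used above. The class and method names (PairwiseQuery, Build, PairCount) are hypothetical and exist only for this illustration. The lowercase "and"/"or" follow the notation of this post; the Lucene query parser expects uppercase AND/OR.

```csharp
using System.Collections.Generic;

// Hypothetical helper, for illustration only: it is not part of Lucene.NET.
public static class PairwiseQuery
{
    // Builds "((a) and (b)) or ((a) and (c)) ..." for every unordered pair of terms.
    public static string Build(IList<string> terms)
    {
        var pairs = new List<string>();
        for (int i = 0; i < terms.Count; i++)
        {
            for (int j = i + 1; j < terms.Count; j++)
            {
                pairs.Add("((" + terms[i] + ") and (" + terms[j] + "))");
            }
        }
        return string.Join(" or ", pairs);
    }

    // Number of unordered pairs for n terms: n * (n - 1) / 2.
    public static int PairCount(int n)
    {
        return n * (n - 1) / 2;
    }
}
```

For the four terms Joe, P, H and Arnold this produces 6 pairs, and the count grows quadratically with the number of terms (10 terms give 45 pairs), which is why this approach gets awkward quickly.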

Would be grateful for comments.
Thanks!



________________________________

CheckFree Solutions Limited (trading as Fiserv)
Registered Office: Eversheds House, 70 Great Bridgewater Street, Manchester, M15 ES
Registered in England: No. 2694333

Re: Minimize document hits based on number of matching terms between source text terms and document field terms

Posted by Simon Svensson <si...@devhost.se>.

Hi,

QueryParser.Parse will return a BooleanQuery when you've given it 
several terms. You can set MinimumNumberShouldMatch to get the behavior 
you want.

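var query = queryParser.Parse(...);
// Parse returns a Query; the cast yields null if it is not a BooleanQuery.
var boolQuery = query as BooleanQuery;
if (boolQuery != null) {
     // Require at least two of the optional (SHOULD) clauses to match.
     boolQuery.MinimumNumberShouldMatch = 2;
}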

// Simon
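
As a rough illustration of what a minimum of 2 does to the example data from the original post, the sketch below emulates the filtering in plain C#, without Lucene.NET. It is an assumption-laden stand-in, not how Lucene implements it: the class MinShouldMatchSketch and its methods are hypothetical, the tokenizer is a crude substitute for a real analyzer (lowercase, split on spaces and periods), and scoring is ignored entirely. Note also that the setting only counts optional (SHOULD) clauses, so it has an effect when the terms are combined with OR, as in the query shown in the post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical emulation, for illustration only: not Lucene.NET code.
public static class MinShouldMatchSketch
{
    // Crude stand-in for an analyzer: lowercase, then split on spaces and periods.
    public static HashSet<string> Tokenize(string text)
    {
        var tokens = text.ToLowerInvariant()
            .Split(new[] { ' ', '.' }, StringSplitOptions.RemoveEmptyEntries);
        return new HashSet<string>(tokens);
    }

    // Counts how many distinct query terms occur as whole tokens in the document.
    public static int CountMatches(IEnumerable<string> queryTerms, string document)
    {
        var docTokens = Tokenize(document);
        return queryTerms
            .Select(t => t.ToLowerInvariant())
            .Distinct()
            .Count(t => docTokens.Contains(t));
    }

    // Keeps only the documents that match at least `minimum` query terms.
    public static List<string> Filter(
        IEnumerable<string> queryTerms, IEnumerable<string> documents, int minimum)
    {
        var terms = queryTerms.ToList();
        return documents.Where(d => CountMatches(terms, d) >= minimum).ToList();
    }
}
```

Run against the six names from the post with the terms Joe, P, H and Arnold, a minimum of 1 keeps the same three hits listed in the post, while a minimum of 2 keeps only "Mr. Joe Paul Harry Arnold" (it matches Joe and Arnold). P and H do not count as matches for Paul and Harry, because terms are compared as whole tokens.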
