Posted to java-user@lucene.apache.org by Preetam Rao <bl...@gmail.com> on 2008/07/14 19:09:29 UTC

matching sub phrases from user entered query...

Hi,

Is there a query in Lucene that matches sub phrases?

For example, if the document text is "new york existing homes *3 bed 2
bath* homes 3 miles from city center 2 rooms" and the user enters "Brooklyn
homes with *3 bed* rooms and swimming pools", I would like to recognize that
the document contains a sub phrase of the user query and score it higher
than a document that contains all the terms but not as a phrase, for
example "new york *2 bed 3 bath*" (should not match). Also note that I do
not want the *3* and *2* in "*3* miles" and "*2* rooms" to affect the score
when I have a sub phrase match.

This is something of a middle ground between a pure 'boolean OR' query and
an 'exact phrase' query. Here, the more sub phrases match and the longer
they are, the higher the score should be compared to individual term matches.
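
To make the intended scoring concrete, here is a minimal sketch in plain Java. It does not use any Lucene API. The class name `SubPhraseScorer`, the whitespace tokenization, and the squared-length weighting are all assumptions chosen for illustration, not an existing Lucene query. It walks the query left to right, finds the longest run of query terms that appears contiguously in the document, and rewards longer runs more than single-term matches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: scores a document by the
// contiguous sub phrases it shares with the query.
class SubPhraseScorer {

    // Assumption: lowercase and split on whitespace; a real analyzer would differ.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Length of the longest run query[start..] that occurs contiguously in doc.
    static int longestRun(List<String> query, int start, List<String> doc) {
        int best = 0;
        for (int j = 0; j < doc.size(); j++) {
            int len = 0;
            while (start + len < query.size()
                    && j + len < doc.size()
                    && query.get(start + len).equals(doc.get(j + len))) {
                len++;
            }
            best = Math.max(best, len);
        }
        return best;
    }

    // Greedy scan: each matched run of length n adds n * n, so one long
    // sub phrase outweighs the same terms matched individually. A query term
    // consumed by a run is counted once, however often it recurs in the document.
    static int score(String queryText, String docText) {
        List<String> query = tokenize(queryText);
        List<String> doc = tokenize(docText);
        int score = 0;
        int i = 0;
        while (i < query.size()) {
            int run = longestRun(query, i, doc);
            if (run == 0) {
                i++;            // term absent from the document
            } else {
                score += run * run;
                i += run;       // skip the terms covered by this run
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String query = "Brooklyn homes with 3 bed rooms and swimming pools";
        String withPhrase = "new york existing homes 3 bed 2 bath homes "
                + "3 miles from city center 2 rooms";
        String termsOnly = "new york 2 bed 3 bath";
        System.out.println(score(query, withPhrase));
        System.out.println(score(query, termsOnly));
    }
}
```

With these inputs the first document scores 6 ("homes" and "rooms" as single terms, plus "3 bed" as a two-term run) and the second scores 2, since "3" and "bed" occur there only as separate terms.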

I was pointed to the shingle filter, a token filter that emits word
n-grams. It does not seem to be the best solution, though, since one does
not know in advance what n should be. It also means one has to generate all
these bigrams and then construct a boolean OR query from them, which is not
very efficient either.
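
The cost of the shingle approach can be seen in a small sketch, again in plain Java rather than through Lucene's analysis chain. The `shingles` helper and its `minSize`/`maxSize` parameters are hypothetical names for illustration. Because n is not known in advance, the sketch emits every size in a range, and each emitted shingle would become one clause of the boolean OR query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Lucene's ShingleFilter: builds word n-grams
// for every size from minSize to maxSize.
class ShingleSketch {

    static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            // Slide a window of n tokens across the input.
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("homes", "with", "3", "bed", "rooms");
        // Each printed shingle would be one OR clause in the final query.
        for (String s : shingles(query, 2, 3)) {
            System.out.println(s);
        }
    }
}
```

For a query of k tokens, each size n contributes k - n + 1 shingles, so the number of OR clauses grows with both the query length and the range of sizes covered.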

Has anyone tried any alternative approaches? I would appreciate your
thoughts. The use case is user-entered queries that one does not interpret
or parse, while still ensuring that the better the sub phrases the user
enters, the better the results they get.

Thanks
Preetam