Posted to solr-user@lucene.apache.org by Jeff Newburn <jn...@zappos.com> on 2009/11/30 15:50:23 UTC

Word Concat 0 Results

All,

I have a quick question for anyone with an idea how to solve this.  We have
times when our users don't put spaces between words.  So for instance
"airmax" returns 0 results but "air max" has at least 100 results.  Other
than adding to the synonyms file every time, is there a more programmatic
way we could possibly understand this scenario and return correct results?

-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewburn@zappos.com


Re: Word Concat 0 Results

Posted by AHMET ARSLAN <io...@yahoo.com>.
> I have a quick question for anyone with an idea how to solve this.  We have
> times when our users don't put spaces between words.  So for instance
> "airmax" returns 0 results but "air max" has at least 100 results.  Other
> than adding to the synonyms file every time, is there a more programmatic
> way we could possibly understand this scenario and return correct results?


Without a manual synonym table lookup, it would be very hard to recognize airmax at query time and split it into air max.

But at index time you can do it using a modified version of ShingleFilterFactory. Put simply, it will concatenate all token n-grams, with no space between the joined tokens.

In org.apache.lucene.analysis.shingle.ShingleFilter, change
public static final String TOKEN_SEPARATOR = " ";
to
public static final String TOKEN_SEPARATOR = "";

You also need a Factory class for the modified filter to integrate it into Solr.

The input document ("but air max has") will be tokenized at index time into:

but => word
butair => shingle
air => word
airmax => shingle
max => word
maxhas => shingle
has => word
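
To make the token list above concrete, here is a minimal plain-Java sketch that only emulates the effect of a shingle filter with an empty token separator (unigrams plus two-word shingles). It does not use Lucene or Solr classes; the class name ConcatShingleSketch and the method tokensWithConcatenatedBigrams are hypothetical names chosen for illustration, and splitting on whitespace is an assumption standing in for whatever tokenizer your field type uses.

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatShingleSketch {

    // Hypothetical helper (not a Lucene API): emits each word followed by
    // the concatenation of that word with its right-hand neighbour,
    // mimicking two-word shingles joined with an empty separator.
    static List<String> tokensWithConcatenatedBigrams(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // nothing to tokenize
        }
        // Assumption: simple whitespace tokenization.
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);                    // the word itself
            if (i + 1 < words.length) {
                out.add(words[i] + words[i + 1]); // the shingle, no space
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> indexed = tokensWithConcatenatedBigrams("but air max has");
        System.out.println(indexed);
        // A query for the unspaced form finds a matching indexed token.
        System.out.println(indexed.contains("airmax"));
    }
}
```

The sketch also shows where the extra index size comes from: a field with n words produces n - 1 additional tokens.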

The query airmax will then match that document. But this solution increases your index size. It is better to write all possible words into the synonyms.txt file manually. A similar discussion in the lucene-java-users group suggests the same:
http://old.nabble.com/splitting-words-to26573829.html#a26573829

Hope this helps.