Posted to solr-user@lucene.apache.org by Eric Caron <er...@gmail.com> on 2010/11/22 21:01:30 UTC

Using WhitespaceTokenizer but still wanting to match when all fields are concatenated

Problem:
Indexed phrase: JetBlue Airlines
Ideal matching queries: jetblue, "jet blue", "jetblue airway", "jetblue
company"

I'd like to be able to use synonyms (to convert airway to airline),
stopwords (to drop "company"), strip periods and use ASCII folding, and
split on case.

I'm close with the following:
***
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="\."
replacement="" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="0" catenateWords="1" catenateNumbers="0"
catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
ignoreCase="true" expand="true"/>
***
The problem is that synonyms and stopwords don't work, because
KeywordTokenizerFactory emits the whole field as a single token. There's
also the problem that an exact-match query with a trailing wildcard
returns nothing.
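
For comparison, here is a rough, untested sketch of the same chain with
only the tokenizer swapped for solr.WhitespaceTokenizerFactory. With one
token per word, the stopword and synonym filters get individual words to
work on. As far as I can tell, though, WordDelimiterFilterFactory then
only catenates within each word, so nothing in the index would represent
the whole name run together (for example "jetblueairlines", which is my
own illustrative query, not one from the list above).
***
<!-- Sketch only: the tokenizer is the single change from the chain above -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="\."
replacement="" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="0" catenateWords="1" catenateNumbers="0"
catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- Stop and synonym filters now see one token per word -->
<filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
ignoreCase="true" expand="true"/>
***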

Does anyone have suggestions on how this could be accomplished? The dataset
is under 100k entries and none of the docs are more than 200 characters.