Posted to solr-user@lucene.apache.org by Damon Zwolinski <Da...@ocp.org> on 2011/07/08 01:35:03 UTC

Appropriate Tokenizer/Filter to Handle Punctuation Variation

This may be a simple question; well, I hope so anyway. We have songs whose titles contain punctuation and quoting, and the trick is to get all variations of a query to return the correct result. Please see the following example.

From the database we index a song with the title "Damon's Radical Song?". We want the user to find this song based on a few different types of queries. The most common are:
1) "Damon's Radical Song?"
2) Damon's Radical Song?
3) Damons Radical Song?
4) Damons Radical Song
5) Damons radical song
6) Damon's radical song?
7) Damons radical song

We have created a few fieldTypes:

    <!-- Removes apostrophes and question marks. -->
    <fieldType name="text_prc" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter
          blockDelimiters="|"
          class="solr.PatternReplaceCharFilterFactory"
          maxBlockChars="10000"
          pattern="([’'\?])"
          replacement=""
        />
        <tokenizer class="solr.ClassicTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
      </analyzer>
    </fieldType>
**** NOTE: For the previous one, we thought of using WordDelimiterFilterFactory, but its stemming option removes the possessive, so Damon Radical Song? produces search results but Damons Radical Song? does not.
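
To make the intent of text_prc concrete, here is a rough Python sketch of what that chain is meant to do to the query variations listed earlier. It is only an approximation and not Solr's actual analysis: the regex split is an assumed stand-in for ClassicTokenizerFactory, the NFKD-based fold is an assumed stand-in for ASCIIFoldingFilterFactory, and the function names are made up for illustration. Only the character class is taken from the charFilter pattern.

```python
import re
import unicodedata

# Same character class as the charFilter pattern in text_prc.
STRIP_PATTERN = re.compile(r"[’'\?]")


def fold_to_ascii(token):
    """Assumed stand-in for ASCIIFoldingFilterFactory: drop combining accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def approx_text_prc(text):
    """Hypothetical approximation of the text_prc analyzer chain."""
    # 1. PatternReplaceCharFilterFactory: delete apostrophes and question marks.
    stripped = STRIP_PATTERN.sub("", text)
    # 2. Tokenizer stand-in (assumption): keep runs of word characters.
    tokens = re.findall(r"\w+", stripped)
    # 3. LowerCaseFilterFactory, then 4. ASCII folding.
    return [fold_to_ascii(t.lower()) for t in tokens]


# The seven query variations from the list above.
QUERIES = [
    '"Damon\'s Radical Song?"',
    "Damon's Radical Song?",
    "Damons Radical Song?",
    "Damons Radical Song",
    "Damons radical song",
    "Damon's radical song?",
    "Damons radical song",
]

for query in QUERIES:
    print(query, "->", approx_text_prc(query))
```

Under this approximation every variation reduces to the tokens damons, radical, song, which is the behavior we are after. It says nothing about how the query parser treats quotes or the trailing ? before analysis.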


    <!-- Results in an exact match search, i.e. "Damon's Radical Song?" -->
    <fieldType name="text_ktf" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory" />
      </analyzer>
    </fieldType>

The previous two field types give us results for most of our desired query variations, but not for queries like #3 above (Damons Radical Song?). We need help figuring out which tokenizer and filter combination will fit our needs.

Thanks for any and all responses.
--Damon