Posted to solr-user@lucene.apache.org by Arkadi Colson <ar...@smartbit.be> on 2012/06/29 11:17:20 UTC

NGram and full word

Hi

I have a question regarding the NGram filter and full word search.

When I insert "arkadicolson" into Solr and search for "arkadic", Solr 
finds a match.
When searching for "arkadicols", Solr does not find a match because 
maxGramSize is set to 8.
However, when searching for the full word "arkadicolson", Solr also 
does not match.

Is there a way to also match the full word in combination with NGram?

Thanks!

     <fieldType name="text" class="solr.TextField" 
positionIncrementGap="100">
       <analyzer type="index">
         <charFilter class="solr.HTMLStripCharFilterFactory"/>
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_en.txt,stopwords_du.txt" enablePositionIncrements="true"/>
         <filter class="solr.WordDelimiterFilterFactory" 
generateWordParts="1" generateNumberParts="1" catenateWords="1" 
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SnowballPorterFilterFactory" 
language="Dutch" />
         <filter class="solr.NGramFilterFactory" minGramSize="3" 
maxGramSize="8"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <!--<filter class="solr.SynonymFilterFactory" 
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
         <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_en.txt,stopwords_du.txt" enablePositionIncrements="true"/>
         <filter class="solr.WordDelimiterFilterFactory" 
generateWordParts="1" generateNumberParts="1" catenateWords="0" 
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SnowballPorterFilterFactory" 
language="Dutch" />
       </analyzer>
     </fieldType>

-- 
Smartbit bvba
Hoogstraat 13
B-3670 Meeuwen
T: +32 11 64 08 80
F: +32 89 46 81 10
W: http://www.smartbit.be
E: arkadi@smartbit.be

RE: NGram and full word

Posted by "Klostermeyer, Michael" <mk...@riskexchange.com>.

With the help of this list, I solved a similar issue by altering my query as follows:

Before (did not return full word matches): q=searchTerm*
After (returned full-word matches and wildcard searches as you would expect): q=searchTerm OR searchTerm*

You can also boost the exact match by doing the following: q=searchTerm^2 OR searchTerm*

Not sure if the NGram changes things or not, but it might be a starting point.

Mike


Re: NGram and full word

Posted by Lan <du...@gmail.com>.

The full word arkadicolson is 12 characters long, which exceeds the
maxGramSize of 8, so none of the indexed grams equals the whole term.
That's why the search is not matching.

The fix is to add another field that tokenizes into full words.

The query would then look like this:

some_field_ngram:arkadicolson OR some_field_whole_word:arkadicolson
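
A minimal sketch of what such a two-field setup might look like in schema.xml. The field names are the placeholders used in the query above, and the "text_whole_word" type name is hypothetical. It assumes the existing "text" type from the original post is kept for the NGram field, and it shortens the whole-word filter chain for illustration; in practice you would probably mirror your existing chain minus the NGram filter.

```xml
<!-- Hypothetical whole-word type: same idea as the "text" type, but without
     NGramFilterFactory, so complete (stemmed) terms end up in the index. -->
<fieldType name="text_whole_word" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute is used for both index and query. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
  </analyzer>
</fieldType>

<!-- Placeholder field names; adapt to your own schema. -->
<field name="some_field_ngram" type="text" indexed="true" stored="true"/>
<field name="some_field_whole_word" type="text_whole_word" indexed="true" stored="false"/>

<!-- copyField copies the raw input value before analysis, so the whole-word
     field receives the same text and runs it through its own analyzer. -->
<copyField source="some_field_ngram" dest="some_field_whole_word"/>
```

After reindexing, a full-word term such as arkadicolson can match on some_field_whole_word, while shorter fragments such as arkadic continue to match on some_field_ngram.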
