Posted to solr-user@lucene.apache.org by "Husain, Yavar" <yh...@firstam.com> on 2012/07/18 16:13:38 UTC

NGram for misspelt words



I have configured NGram indexing for some fields.

Say I search for the city Ludlow: I get the results (normal search).

If I search for Ludlo (with the w omitted), I get the results.

If I search for Ludl (with ow omitted), I still get the results.

I know that these are all partial strings of the main string, hence NGram works perfectly.

But when I type in Ludlwo (misspelt, with the characters o and w interchanged), I don't get any results. Ideally it should match "Ludl" and return the results.

I am not looking for edit-distance-based spell correctors. How can I make the above NGram-based search work?

Here is my schema.xml (NGramFieldType):

<fieldType name="nGram" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- potentially word delimiter, synonym filter, stop words, NOT stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- potentially word delimiter, synonym filter, stop words, NOT stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>



Re: NGram for misspelt words

Posted by Dikchant Sahi <co...@gmail.com>.
Have you tried the analysis window (the Analysis page in the Solr admin UI) to debug this?

I believe something is wrong in the fieldType.

On Wed, Jul 18, 2012 at 8:07 PM, Husain, Yavar <yh...@firstam.com> wrote:

> Thanks Sahi. I have replaced my EdgeNGramFilterFactory with
> NGramFilterFactory, as I need substrings anywhere in the word, not just
> at the front or back.
> You are right. I put the same NGramFilterFactory in both the query and
> index analyzers; however, it now returns no results at all, not even for
> the basic search.

RE: NGram for misspelt words

Posted by "Husain, Yavar" <yh...@firstam.com>.
Thanks Sahi. I have replaced my EdgeNGramFilterFactory with NGramFilterFactory, as I need substrings anywhere in the word, not just at the front or back.
You are right. I put the same NGramFilterFactory in both the query and index analyzers; however, it now returns no results at all, not even for the basic search.

-----Original Message-----
From: Dikchant Sahi [mailto:contactsahi@gmail.com] 
Sent: Wednesday, July 18, 2012 7:54 PM
To: solr-user@lucene.apache.org
Subject: Re: NGram for misspelt words

You are creating grams only while indexing and not while querying, hence 'ludlwo' does not match. For 'ludlow', your index analyzer creates the grams lu, lud, ludl, ludlo and ludlow. The query 'ludlwo' is searched as a single token, and it is not one of those grams.

Either create grams while querying as well, or use edit distance.


Re: NGram for misspelt words

Posted by Dikchant Sahi <co...@gmail.com>.
You are creating grams only while indexing and not while querying, hence
'ludlwo' does not match. For 'ludlow', your index analyzer creates the grams
lu, lud, ludl, ludlo and ludlow. The query 'ludlwo' is searched as a single
token, and it is not one of those grams.

Either create grams while querying as well, or use edit distance.
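
A minimal sketch of the first option, assuming plain (non-edge) n-grams on both the index and the query side. The type name "nGramBoth" is hypothetical and the gram sizes are simply carried over from the schema above. Documents have to be reindexed after the index analyzer changes.

```xml
<!-- Sketch only: the name "nGramBoth" and the gram sizes are assumptions, not a tested configuration -->
<fieldType name="nGramBoth" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- every substring of length 2..15 is indexed, not only prefixes -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- the query term is split into grams as well, so 'ludlwo' becomes lu, ud, dl, lw, wo, lud, ... -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>
```

With this, 'ludlow' and 'ludlwo' share the grams lu, ud, dl, lud, udl and ludl. Those shared grams only produce a hit if the query grams are OR-ed together; if every gram is required, 'lw' and 'wo' can never match. The analysis window should show which grams each side actually produces.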
