Posted to solr-user@lucene.apache.org by Sohail Aboobaker <sa...@gmail.com> on 2012/11/06 14:08:25 UTC

Searching for Partial Words

Hi,

Given the following values in the documents:

Doc1: Engine
Doc2: Engineer
Doc3: ResidentEngineer

We need to return all three documents when someone searches for "engi".

Basically, we need to implement partial word search. Currently, we have a
wildcard on the right side of the search term (term*). Is it possible to
have a wildcard on both sides of a search term?

Regards,
Sohail Aboobaker.

Re: Searching for Partial Words

Posted by Sohail Aboobaker <sa...@gmail.com>.

Yes, that is true. We are looking for partial word matches. It seems like
we can achieve this by using edge ngram for prefixes and adding a wildcard
at the end to ignore the suffix. If we set the edge ngram size to 3, "eng"
will match ResidentEng but not ResidentEngineer, but a search for "eng*"
will match ResidentEngineer, ResidentEngine, Engine, etc.

Thank you for your responses. We were able to convince the business users
that partial word searching is not expected behavior and will generate more
results than needed, so the partial word search requirement was dropped :)

Thanks again.

Re: Searching for Partial Words

Posted by Amit Nithian <an...@gmail.com>.

Look at the normal ngram tokenizer. "Engine" with an ngram size of 3 would
yield "eng", "ngi", "gin", "ine", so a search for "engi" should match. You
can play around with the min/max values. Edge ngram is useful for prefix
matching, but it sounds like you want intra-word matching too? ("eng"
should match "ResidentEngineer")
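
As a rough illustration of the n-gram idea above, here is a minimal Python sketch (not Solr code). The helper names ngrams and matches are hypothetical. It assumes the values are lowercased, that the same gram size is used at index and query time, and that every gram of the query must be present for a document to match. It does not check that the grams are adjacent, so it can report false positives that a phrase-style match would reject.

```python
def ngrams(text, min_size=3, max_size=3):
    """Return every substring of `text` whose length is in [min_size, max_size]."""
    text = text.lower()  # assumption: a lowercase filter runs at index and query time
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


def matches(query, indexed_value, size=3):
    """True if every n-gram of the query is among the n-grams of the indexed value."""
    query_grams = ngrams(query, size, size)
    if not query_grams:  # query shorter than the gram size: nothing to look up
        return False
    indexed = set(ngrams(indexed_value, size, size))
    return all(gram in indexed for gram in query_grams)


print(ngrams("Engine"))  # ['eng', 'ngi', 'gin', 'ine']
for value in ("Engine", "Engineer", "ResidentEngineer"):
    print(value, matches("engi", value))  # True for all three values
```

The query "engi" breaks into "eng" and "ngi", and both grams occur in all three values, which is why intra-word matches such as ResidentEngineer are found.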

Re: Searching for Partial Words

Posted by Jack Krupansky <ja...@basetechnology.com>.

The "side" attribute must be "front" or "back". Sorry, no "both", although 
that sounds like a reasonable feature request.

"front" is the default side.

-- Jack Krupansky
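
To make the difference between the two "side" values concrete, here is a small Python sketch (not the Lucene implementation). The function edge_ngrams and its parameters are hypothetical names that mirror minGramSize, maxGramSize and side; lowercasing is an added assumption.

```python
def edge_ngrams(token, min_size, max_size, side="front"):
    """Return the grams anchored at one edge of `token`, shortest first."""
    if side not in ("front", "back"):
        raise ValueError('side must be "front" or "back"')
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    token = token.lower()  # assumption: values are lowercased before n-gramming
    longest = min(max_size, len(token))
    if side == "front":
        return [token[:n] for n in range(min_size, longest + 1)]
    return [token[-n:] for n in range(min_size, longest + 1)]


print(edge_ngrams("Engine", 1, 3, side="front"))  # ['e', 'en', 'eng']
print(edge_ngrams("Engine", 1, 3, side="back"))   # ['e', 'ne', 'ine']
print(edge_ngrams("Engine", 3, 5, side="front"))  # ['eng', 'engi', 'engin']
```

With a minimum size of 3 and side "front", "eng" is produced but "en" is not, which agrees with the reading of minGramSize given in the question.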

Re: Searching for Partial Words

Posted by Sohail Aboobaker <sa...@gmail.com>.

Thanks Jack.
In the configuration below:

<fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.EdgeNGramTokenizerFactory" side="front"
               minGramSize="1" maxGramSize="1"/>
  </analyzer>
</fieldType>

What are the possible values for "side"?

If I understand it correctly, minGramSize=3 with side=front will
include eng* but not en*. Is this correct? So minGramSize is the minimum
number of characters taken from the specified side.

Does it allow side=both :) or something similar?

Regards,
Sohail

Re: Searching for Partial Words

Posted by Jack Krupansky <ja...@basetechnology.com>.

Add an "edge" n-gram filter (EdgeNGramFilterFactory) to your "index"
analyzer. This will add all the prefixes of words to the index, so that a
query of "engi" will be equivalent to, but much faster than, the wildcard
query engi*. You can specify a minimum size, such as 3 or 4, to eliminate
tons of too-short prefixes, if you want.

See:
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramFilterFactory.html
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html
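
The effect of indexing prefixes can be sketched in a few lines of Python (an illustration only, not how Solr stores its index). The names edge_prefixes, index and search are hypothetical, and the sketch assumes lowercasing, whitespace tokenization, a minimum prefix length of 3, and a maximum equal to the token length.

```python
from collections import defaultdict


def edge_prefixes(token, min_size=3):
    """Return every prefix of `token` that has at least `min_size` characters."""
    token = token.lower()  # assumption: a lowercase filter is part of the analyzer
    return [token[:n] for n in range(min_size, len(token) + 1)]


docs = {"Doc1": "Engine", "Doc2": "Engineer", "Doc3": "ResidentEngineer"}

# Index time: store every prefix of every token, pointing back to its document.
index = defaultdict(set)
for doc_id, value in docs.items():
    for token in value.split():
        for prefix in edge_prefixes(token):
            index[prefix].add(doc_id)


def search(term):
    """Query time: a single exact lookup, no wildcard expansion."""
    return sorted(index.get(term.lower(), set()))


print(search("engi"))  # ['Doc1', 'Doc2']
print(search("resi"))  # ['Doc3']
```

Under these assumptions "engi" finds Engine and Engineer with one lookup, but not ResidentEngineer, because that value is a single token whose prefixes all start with "res". Matching inside a word needs something other than front-edge prefixes.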

-- Jack Krupansky
