
Posted to solr-user@lucene.apache.org by Sachin <sa...@aim.com> on 2010/02/24 07:17:09 UTC

Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams


 Hi All,

I am trying to set up autosuggest using Solr 1.4 for my site and need some pointers. Basically, we provide autosuggest for the characters a user types into the search box. The autosuggest index is built from past user search queries that returned > 0 results. We do some lazy writing to store this information in the db and then export it to Solr on a nightly basis. As far as I know, there are 3 ways (apart from wildcard search) of achieving autosuggest using Solr 1.4:

1. Use EdgeNGrams
2. Use shingles and prefix query.
3. Use the new Terms component.

For now I am more inclined towards using EdgeNGrams (no method to the madness) and just wanted to know whether any of the 3 approaches is recommended in terms of performance, since the user expects the suggestions to be almost instantaneous. We do some heavy caching at our end to avoid hitting Solr every time, but is any of these 3 approaches faster than the others?

I would also like to return a suggestion even if what the user typed matches in the middle of it: for instance, if I have the query "chicken pasta" in my index and the user types "pasta", I would like this query to be returned as a suggestion (à la Yahoo!). Below is my field definition:

        <fieldType name="suggest" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>


I tried replacing the KeywordTokenizerFactory with LetterTokenizerFactory, and though it works great for the above scenario (it does an in-between match), it has the side effect of removing everything that is not a letter, so if the user types "123" he gets absolutely no suggestions. Is there anything I'm missing in my configuration? Is this even achievable using EdgeNGrams, or should I look at using the TermsComponent after applying the regex patch from 1.5 and do something like ".*user-typed-in-chars.*"?

Thanks!

Re: Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams

Posted by "Smiley, David W." <ds...@mitre.org>.

On Feb 24, 2010, at 1:17 AM, Sachin wrote:

> Hi All,
> 
> I am trying to set up autosuggest using Solr 1.4 for my site and need some pointers. Basically, we provide autosuggest for the characters a user types into the search box. The autosuggest index is built from past user search queries that returned > 0 results. We do some lazy writing to store this information in the db and then export it to Solr on a nightly basis. As far as I know, there are 3 ways (apart from wildcard search) of achieving autosuggest using Solr 1.4:
> 
> 1. Use EdgeNGrams
> 2. Use shingles and prefix query.
> 3. Use the new Terms component.

Another scenario you did not consider is the approach I recommend in my book (p. 156).  There's a poor example of this on the wiki: http://wiki.apache.org/solr/SimpleFacetParameters#Facet_prefix_.28term_suggest.29

> For now I am more inclined towards using EdgeNGrams (no method to the madness) and just wanted to know whether any of the 3 approaches is recommended in terms of performance, since the user expects the suggestions to be almost instantaneous. We do some heavy caching at our end to avoid hitting Solr every time, but is any of these 3 approaches faster than the others?

The Terms component should be the fastest since it has the most direct access to the underlying data. But I don't understand why people use it for auto-suggest, because it fails to consider the context of the query, i.e. the words before the right-most term. However, if you use KeywordTokenizer with EdgeNGram with Terms, then this addresses that somewhat... You don't seem interested in the case where someone once queried "a b c" and a user now types "b c"; apparently you don't want that to match. Personally that would bug me. I like the faceting approach, but admittedly I have not used it at scale.
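
A minimal sketch of wiring up the Terms component for prefix lookups, assuming a stock Solr 1.4 solrconfig.xml. The component name "termsComponent", the handler path "/terms", and the field name "suggest" used in the request below are illustrative choices, not something taken from this thread:

```xml
<!-- Sketch only: the names "termsComponent" and "/terms" are illustrative. -->
<searchComponent name="termsComponent"
                 class="org.apache.solr.handler.component.TermsComponent"/>

<requestHandler name="/terms"
                class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
        <!-- Turn the component on for every request to this handler. -->
        <bool name="terms">true</bool>
    </lst>
    <arr name="components">
        <str>termsComponent</str>
    </arr>
</requestHandler>
```

With something like this in place, a request such as /terms?terms.fl=suggest&terms.prefix=chi&terms.limit=10 should return up to ten indexed terms of that field that start with "chi". Note that it looks only at the terms in the index and knows nothing about the rest of the query.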

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/



Re: Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams

Posted by Sachin <sa...@aim.com>.

 

Hello Joe,

The WhitespaceTokenizerFactory seems to have done the trick. For now I will keep it like this and monitor closely to see whether using EdgeNGrams has any performance implications, but so far this works like a charm. Thanks!


 

 

-----Original Message-----
From: Joe Calderon <ca...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Wed, Feb 24, 2010 10:22 pm
Subject: Re: Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams



Re: Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams

Posted by Joe Calderon <ca...@gmail.com>.

I had to create an autosuggest implementation not too long ago.
Originally I was using faceting, where I would match wildcards on a
tokenized field and facet on an unaltered field. This had the
advantage that I could do everything from one index, though it was
also limited by the fact that suggestions came through facets, so scoring
and highlighting went out the window.

What I settled on was to create a separate core for suggest to use. I
analyze the fields I want to match against with the whitespace tokenizer
and the EdgeNGram filter. This has multiple advantages:

1. The query is run through text analysis, whereas wildcarded terms are not.
2. The highlighter will highlight only the text matched, not the expanded word.
3. Scoring and boosts can be used to rank suggest results.

I tokenize on whitespace so I can match out-of-order tokens, e.g.
q=family guy stewie and q=stewie family guy. This is something
that prefix-based solutions won't be able to do.

One small gotcha is that I recently submitted a patch to the EdgeNGram
filter to fix highlighting behaviour. It has been committed to Lucene's
trunk, but it's only available in versions 2.9.2 and up unless you patch
it yourself.
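
A rough sketch of such an analysis chain, assuming the gram sizes from Sachin's field definition and changing only the tokenizer. The field type name "suggest_ws" is hypothetical:

```xml
<!-- Sketch only: "suggest_ws" is a made-up name; gram sizes are copied from the original post. -->
<fieldType name="suggest_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- Split on whitespace so each word gets its own set of edge n-grams. -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50"/>
    </analyzer>
    <analyzer type="query">
        <!-- No n-gramming at query time: the typed prefix is matched against the indexed grams. -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

Because each whitespace-separated word is n-grammed on its own, typing "pas" can match the indexed query "chicken pasta", and a token like "123" survives because the whitespace tokenizer does not drop non-letters.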

On Wed, Feb 24, 2010 at 7:35 AM, Grant Ingersoll <gs...@apache.org> wrote:
> You might also look at http://issues.apache.org/jira/browse/SOLR-1316

Re: Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams

Posted by Grant Ingersoll <gs...@apache.org>.

You might also look at http://issues.apache.org/jira/browse/SOLR-1316
