Posted to solr-user@lucene.apache.org by "Feak, Todd" <To...@smss.sony.com> on 2009/01/05 17:16:47 UTC

RE: Ngram Repeats

To get the unique brand names, you are wandering into the facet query territory that I mentioned.
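
As a rough illustration of the facet approach, here is a minimal Python sketch that only builds the request URL. The host, the `brand` field name, and the helper name `brand_suggest_url` are assumptions made for the example, not something taken from the configuration in this thread. The request parameters (`facet`, `facet.field`, `facet.prefix`, `facet.mincount`, `facet.limit`, `rows`) are standard Solr parameters. Note that `facet.prefix` matches indexed terms, so the faceted field would most likely need to be an untokenized (and, if desired, lowercased) copy of the brand rather than the ngram field itself.

```python
from urllib.parse import urlencode

def brand_suggest_url(prefix, base_url="http://localhost:8983/solr/select",
                      field="brand", limit=10):
    """Build a Solr URL that asks for distinct field values starting with `prefix`.

    base_url and field are hypothetical; adjust them to the real deployment.
    """
    params = [
        ("q", "*:*"),               # match every product
        ("rows", "0"),              # no product documents, only facet counts
        ("facet", "true"),
        ("facet.field", field),     # each distinct value is returned once
        ("facet.prefix", prefix),   # keep only values starting with the typed text
        ("facet.mincount", "1"),    # skip values with no matching documents
        ("facet.limit", str(limit)),
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)

print(brand_suggest_url("as"))
```

Each brand comes back once with a count, no matter how many products carry it, which is the de-duplication being asked about.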

You could consider a separate index, which will probably provide the best performance, especially if you are hitting it on a per-keystroke basis to update that auto-complete box. Creating a separate index also allows you to scale this section of your search infrastructure separately, if necessary.
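
One way to picture the separate index: it holds one document per brand instead of one per product, so the existing ngram field can never return the same brand twice. The sketch below is only an assumption about how the feed for such an index might be prepared. The record layout and the helper name `brand_documents` are hypothetical.

```python
def brand_documents(products):
    """Collapse product records into one document per distinct brand.

    Brands are compared case-insensitively; the first spelling seen is kept.
    """
    seen = {}
    for product in products:
        brand = product["brand"].strip()
        key = brand.lower()
        if key and key not in seen:
            # The lowercased brand doubles as the unique id (an assumption).
            seen[key] = {"id": key, "brand": brand}
    return list(seen.values())

# Hypothetical product records, as they might come out of the main index feed.
products = [
    {"id": "p1", "brand": "Asus"},
    {"id": "p2", "brand": "asus"},
    {"id": "p3", "brand": "Asics"},
]
print(brand_documents(products))
```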

You *can* put the separate index within the same Tomcat instance if you need to. Tomcat context fragments can be used to provide a different URL for those queries.

-Todd Feak

-----Original Message-----
From: Jeff Newburn [mailto:jnewburn@zappos.com] 
Sent: Wednesday, December 24, 2008 2:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Ngram Repeats

You are correct on the layout.  The reason we are trying to do the ngrams is
that we want to do a drop-down box for autocomplete.  The ngrams are extremely
fast and are the recommended way to do this according to the user group.  They
work wonderfully except for this one issue.  So do we basically have to do a
separate index for this, or is there a dedup setting to only return unique
brand names?


On 12/24/08 7:51 AM, "Feak, Todd" <To...@smss.sony.com> wrote:

> It sounds like you want to get a list of "brands" that start with a particular
> string, out of your index. But your index is based on products, not brands. Is
> that correct?
> 
> If so, that has nothing to do with NGrams (or even tokenizing, for that matter).
> I think you should be doing a Facet query instead of a standard query. Take a
> look at Facets on the Solr Wiki.
> 
> http://wiki.apache.org/solr/SolrFacetingOverview
> 
> -Todd Feak
> -----Original Message-----
> From: Jeff Newburn [mailto:jnewburn@zappos.com]
> Sent: Wednesday, December 24, 2008 7:39 AM
> To: solr-user@lucene.apache.org
> Subject: Ngram Repeats
> 
> I have set up an ngram filter and have run into a problem.  Our index is
> basically composed of products as the unique id.  Each product also has a
> brand name assigned to it.  There are far fewer unique brand names than
> products in the index.  I tried to set up an ngram based on the brand name
> but it is returning the same brand name over and over for each product.
> Essentially, if you try for the brand name starting with "as" you will get
> the brand "asus" 15 times.  Is there a way to make the ngram only return
> unique brand names?  I have attached the configuration below.
> 
>         <fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
>                 <analyzer type="index">
>                         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                         <filter class="solr.LowerCaseFilterFactory" />
>                         <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
>                 </analyzer>
>                 <analyzer type="query">
>                         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                         <filter class="solr.LowerCaseFilterFactory" />
>                 </analyzer>
>         </fieldType>
> -Jeff