Posted to solr-user@lucene.apache.org by Andy <an...@yahoo.com> on 2010/10/02 11:32:55 UTC

NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

I'm working on a user-generated tagging feature. Some of the tags could be multi-lingual, mixing languages such as English, Chinese, and Japanese.

I'd like to add auto-complete to help users enter the tags, and I want it to match in the middle of the tags as well.

For example, if a user types "guit" I want to suggest:
"guitar"
"electric guitar"
"电动guitar"
"guitar英雄"

And if a user types "吉他" I want to suggest:
"吉他Hero"
"electric吉他"
"古典吉他"


I'm thinking about using:

<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>

Would the above setup do what I want to do?
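
One way to reason about this offline is to simulate the two analyzer chains. The sketch below is a plain-Python approximation, not Solr code: `ngrams`, `index_terms`, `query_term`, and `suggest` are hypothetical helper names, and it assumes the filter emits every substring whose length lies between minGramSize and maxGramSize.

```python
def ngrams(token, min_size=1, max_size=15):
    """Return every substring of `token` with min_size <= length <= max_size."""
    grams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


def index_terms(tag):
    # Index side: the whole tag is one token (keyword tokenizer),
    # lowercased, then expanded into n-grams.
    return ngrams(tag.lower())


def query_term(text):
    # Query side: the whole input is one lowercased token, with no n-gramming.
    return text.lower()


def suggest(user_input, tags):
    """Return the tags whose indexed n-grams contain the query term."""
    q = query_term(user_input)
    return [tag for tag in tags if q in index_terms(tag)]


tags = ["guitar", "electric guitar", "电动guitar", "guitar英雄",
        "吉他Hero", "electric吉他", "古典吉他"]

print(suggest("guit", tags))
print(suggest("吉他", tags))
```

Under that assumption, both example inputs find all of the tags listed for them. One consequence of the same model: an input longer than maxGramSize (15 here) matches nothing, because no gram that long is ever indexed.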

Also, how would I deal with hyphens? For example, I want an input of either "wi-f" or "wif" to match the tag "wi-fi".

Would adding WordDelimiterFilterFactory to both "index" and "query" accomplish that?
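
To make the hyphen requirement concrete, here is a small Python sketch. It does not model WordDelimiterFilterFactory; it swaps in a simpler technique, deleting hyphens on both the index and query side before n-gramming (roughly what a pattern-replace step would do). The `normalize` and `matches` helpers are hypothetical names used only for illustration.

```python
def ngrams(token, min_size=1, max_size=15):
    """Return every substring of `token` with min_size <= length <= max_size."""
    return {token[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(token) - n + 1)}


def normalize(text):
    # Assumed normalization, applied identically at index and query time:
    # lowercase, then delete hyphens.
    return text.lower().replace("-", "")


def matches(user_input, tag):
    """True if the normalized input is one of the tag's indexed n-grams."""
    return normalize(user_input) in ngrams(normalize(tag))


print(matches("wi-f", "wi-fi"))           # True
print(matches("wif", "wi-fi"))            # True
print("wif" in ngrams("wi-fi".lower()))   # False: without normalization "wif" is not a substring
```

The last line shows why some normalization is needed: with the hyphen left in the indexed token, "wif" is not among its n-grams.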


Thanks.

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by pravin <pr...@gmail.com>.

Hello,
Andy, did you get a final answer to your question?
I am also trying to do something similar. Please give me pointers if you
have any.
Basically, I also need to use NGram with WhitespaceTokenizer. Any help will be
appreciated.

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Gert Brinkmann <g1...@netcologne.de>.

On 03.10.2010 09:20, Andy wrote:
> NGramFilterFactory would then take that one token ("electric guitar")
> and generate N-grams out of it. One of the ngrams would be "guit"
> because "guit" is a substring of "electric guitar".

AFAIK it only produces prefix-strings like

gui
guit
guita
guitar

etc.
That way you can do a prefix search without a wildcard: it is enough
to search for "guit", and you do not need to search for "guit*". The
wildcard form can cause trouble with stopword filtering and (at
least in Solr 1.3) with text snippet generation.

Greetings,
Gert

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Robert Muir <rc...@gmail.com>.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=newaccount

On Sun, Oct 3, 2010 at 2:40 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:

> Huh, the NGramFilterFactory itself isn't listed on the analyzers wiki
> at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> That wiki page seems to be restricted to certain users only. Anyone know if
> there's a way to send a 'patch' to the maintainers for the wiki, or if
> there's a process for getting editing privileges on that page?  I'd like to
> help out by adding documentation when I come across it.
>
> Jonathan
> ________________________________________
> From: Ahmet Arslan [iorixxx@yahoo.com]
> Sent: Sunday, October 03, 2010 6:26 AM
> To: solr-user@lucene.apache.org
> Subject: Re: NGramFilterFactory for auto-complete that matches the middle
> of multi-lingual tags?
>
> > But I thought NGramFilterFactory would generate substrings
> > that start in the "middle", hence ensuring autocomplete
> > matching in the middle.
> >
> > So in the case of "electric guitar", keywordtokenizer would
> > create one token - "electric guitar"
> >
> > NGramFilterFactory would then take that one token ("electric
> > guitar") and generate N-grams out of it. One of the ngrams
> > would be "guit" because "guit" is a substring of "electric
> > guitar".
> >
>
> Oops. You are correct, I am sorry. I mixed it up with *Edge*NGramFilterFactory.
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

RE: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Jonathan Rochkind <ro...@jhu.edu>.

Huh, the NGramFilterFactory itself isn't listed on the analyzers wiki at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

That wiki page seems to be restricted to certain users only. Anyone know if there's a way to send a 'patch' to the maintainers for the wiki, or if there's a process for getting editing privileges on that page?  I'd like to help out by adding documentation when I come across it.

Jonathan
________________________________________
From: Ahmet Arslan [iorixxx@yahoo.com]
Sent: Sunday, October 03, 2010 6:26 AM
To: solr-user@lucene.apache.org
Subject: Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

> But I thought NGramFilterFactory would generate substrings
> that start in the "middle", hence ensuring autocomplete
> matching in the middle.
>
> So in the case of "electric guitar", keywordtokenizer would
> create one token - "electric guitar"
>
> NGramFilterFactory would then take that one token ("electric
> guitar") and generate N-grams out of it. One of the ngrams
> would be "guit" because "guit" is a substring of "electric
> guitar".
>

Oops. You are correct, I am sorry. I mixed it up with *Edge*NGramFilterFactory.

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Lance Norskog <go...@gmail.com>.

Start a new thread.

Dennis Gearon wrote:
> What's the difference between the filters/analyzers that have 'factory' in their name, and the ones that don't?
>
>
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
>    otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php