Posted to solr-user@lucene.apache.org by Andy <an...@yahoo.com> on 2010/10/02 11:32:55 UTC

NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

I'm working on a user-generated tagging feature. Some of the tags could be multi-lingual, mixing languages like English, Chinese, and Japanese.

I'd like to add auto-complete to help users to enter the tags. And I'd want to match in the middle of the tags as well.

For example, if a user types "guit" I want to suggest:
"guitar"
"electric guitar"
"电动guitar"
"guitar英雄"

And if a user types "吉他" I want to suggest:
"吉他Hero"
"electric吉他"
"古典吉他"


I'm thinking about using:

<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>

Would the above setup do what I want to do?

Also, how would I deal with hyphens? For example, I want an input of either "wi-f" or "wif" to match the tag "wi-fi".

Would adding WordDelimiterFilterFactory to both "index" and "query" accomplish that?


Thanks.


      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by pravin <pr...@gmail.com>.
Hello,
Andy, did you get a final answer to your question?
I am also trying to do something similar. Please give me pointers if you
have any.
Basically, I also need to use NGram with WhitespaceTokenizer. Any help will be
appreciated.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/NGramFilterFactory-for-auto-complete-that-matches-the-middle-of-multi-lingual-tags-tp1619234p2459466.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Gert Brinkmann <g1...@netcologne.de>.
On 03.10.2010 09:20, Andy wrote:
> NGramFilterFactory would then take that one token ("electric guitar")
> and generate N-grams out of it. One of the ngrams would be "guit"
> because "guit" is a substring of "electric guitar".

AFAIK it only produces prefix-strings like

gui
guit
guita
guitar

etc.
This lets you do a prefix search without a wildcard: it is enough
to search for "guit", and you do not need to search for "guit*". The
latter wildcard string can cause trouble with stopword filtering and (at
least in Solr 1.3) with text snippet generation.
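
For reference, the prefix-only output listed above is what an edge n-gram filter produces; as the rest of the thread establishes, the plain n-gram filter also emits substrings that start in the middle of the token. A minimal Python sketch of the difference follows. It only emulates the two behaviours and does not call Solr; the function names are made up for illustration.

```python
def edge_ngrams(token, min_size, max_size):
    """Prefixes of token with lengths min_size..max_size (edge n-gram behaviour)."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]


def ngrams(token, min_size, max_size):
    """All substrings of token with lengths min_size..max_size (plain n-gram behaviour)."""
    grams = []
    for n in range(min_size, min(max_size, len(token)) + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


print(edge_ngrams("guitar", 3, 6))                      # ['gui', 'guit', 'guita', 'guitar']
print("guit" in edge_ngrams("electric guitar", 1, 15))  # False: not a prefix of the whole tag
print("guit" in ngrams("electric guitar", 1, 15))       # True: a substring in the middle
```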

Greetings,
Gert

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Robert Muir <rc...@gmail.com>.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=newaccount

On Sun, Oct 3, 2010 at 2:40 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:

> Huh, the NGramFilterFactory itself isn't listed on the analyzers wiki
> at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> That wiki page seems to be restricted to certain users only. Anyone know if
> there's a way to send a 'patch' to the maintainers for the wiki, or if
> there's a process for getting editing privileges on that page?  I'd like to
> help out by adding documentation when I come across it.
>
> Jonathan


-- 
Robert Muir
rcmuir@gmail.com

RE: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Huh, the NGramFilterFactory itself isn't listed on the analyzers wiki at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

That wiki page seems to be restricted to certain users only. Anyone know if there's a way to send a 'patch' to the maintainers for the wiki, or if there's a process for getting editing privileges on that page?  I'd like to help out by adding documentation when I come across it.

Jonathan




Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Lance Norskog <go...@gmail.com>.
Start a new thread.

Dennis Gearon wrote:
> What's the difference between the filters/analyzers that have 'factory' in their name, and the ones that don't?
>
>
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
>    otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Dennis Gearon <ge...@sbcglobal.net>.
What's the difference between the filters/analyzers that have 'factory' in their name, and the ones that don't?


Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php



Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Ahmet Arslan <io...@yahoo.com>.
> What TokenFilters would split "electric吉他" into
> "electric" & "吉他"?

Is it possible to write a regex to capture Chinese text? (A Unicode range?)

If yes, you can use PatternReplaceFilterFactory to transform electric吉他 into electric_吉他.

<filter class="solr.PatternReplaceFilterFactory"
pattern="(latin)(chinese)" replacement="$1_$2"/>

(Here "latin" and "chinese" are placeholders for the actual character classes.)

After that, WordDelimiterFilterFactory can produce two adjacent tokens.

But maybe using a custom filter would be easier.
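
The pattern-replace idea above can be tried outside Solr first. The following Python sketch mirrors the "$1_$2" replacement with two substitutions, one for each direction of the script change. It rests on an assumption that is not from the thread: "Chinese" is approximated by the CJK Unified Ideographs block (U+4E00 to U+9FFF) and "Latin" by ASCII letters, so Japanese kana and accented letters would need additional ranges.

```python
import re

# Assumed character classes (not from the thread): ASCII letters for "latin",
# the CJK Unified Ideographs block for "chinese".
LATIN = r"[A-Za-z]"
CJK = r"[\u4e00-\u9fff]"

LATIN_THEN_CJK = re.compile(rf"({LATIN})({CJK})")
CJK_THEN_LATIN = re.compile(rf"({CJK})({LATIN})")


def mark_script_boundaries(tag):
    """Insert "_" wherever the text switches between Latin and CJK characters."""
    tag = LATIN_THEN_CJK.sub(r"\1_\2", tag)
    return CJK_THEN_LATIN.sub(r"\1_\2", tag)


print(mark_script_boundaries("electric吉他"))  # electric_吉他
print(mark_script_boundaries("电动guitar"))    # 电动_guitar
print(mark_script_boundaries("古典吉他"))      # unchanged: no script boundary
```

A later delimiter-based splitting step could then treat "_" as the place to break the tag into tokens.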


      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Andy <an...@yahoo.com>.
 
> I got your point. You want to retrieve "electric吉他"
> with the query 吉他. That's why you don't want EdgeNGram.
> If this is the only reason for NGram, I think you can
> transform "electric吉他" into two tokens "electric"
> "吉他" in TokenFilter(s) and apply EdgeNGram approach.
> 

What TokenFilters would split "electric吉他" into "electric" & "吉他"?


      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Ahmet Arslan <io...@yahoo.com>.
> I agree with the issues with NGramFilterFactory you pointed
> out and I really want to avoid using it. But the problem is
> that I have Chinese tags like "电吉他" and multi-lingual
> tags like "electric吉他".

I got your point. You want to retrieve "electric吉他" with the query "吉他". That's why you don't want EdgeNGram.
If this is the only reason for NGram, I think you can transform "electric吉他" into two tokens, "electric" and "吉他", in TokenFilter(s) and apply the EdgeNGram approach.
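
A rough Python emulation of this split-then-edge-n-gram idea is sketched below. It is not Solr code, and it assumes, for illustration only, that a token is a run of ASCII letters and digits or a run of CJK ideographs (U+4E00 to U+9FFF).

```python
import re

# Assumed token definition: runs of ASCII letters/digits, or runs of CJK ideographs.
RUNS = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]+")


def split_runs(tag):
    """Lowercase the tag and split it into same-script runs."""
    return RUNS.findall(tag.lower())


def edge_ngrams(token, min_size=1, max_size=15):
    """Prefixes of token with lengths min_size..max_size."""
    return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]


def index_terms(tag):
    """Edge n-grams of every run in the tag."""
    terms = set()
    for token in split_runs(tag):
        terms.update(edge_ngrams(token))
    return terms


print(split_runs("electric吉他"))             # ['electric', '吉他']
print("吉他" in index_terms("electric吉他"))  # True
print("吉他" in index_terms("古典吉他"))      # False
```

The last line shows a limit of this approach under these assumptions: a purely Chinese tag such as "古典吉他" forms a single run, so a query for "吉他" would only match it if the Chinese text were segmented further or indexed with plain n-grams.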



      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Andy <an...@yahoo.com>.
> > 1) hyphens - if user types "ema" or "e-ma" I want to
> > suggest "email"
> > 
> > 2) accents - if user types "herme" I want to suggest
> > "Hermès"
> 
> Accents can be removed by using MappingCharFilterFactory
> before the tokenizer (at both index and query time).
> 
> <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
> 
> I am not sure if this is the most elegant solution, but you can
> replace "-" with "" using MappingCharFilterFactory too. It
> satisfies what you describe in 1.
> 
> But generally NGramFilterFactory produces a lot of tokens.
> For example, the query "er" can return "hermes". Maybe
> EdgeNGramFilterFactory is more suitable for the
> auto-complete task. At least it guarantees that some word
> starts with that character sequence.

Thanks.

I agree with the issues with NGramFilterFactory you pointed out and I really want to avoid using it. But the problem is that I have Chinese tags like "电吉他" and multi-lingual tags like "electric吉他".

For tags like that WhitespaceTokenizerFactory wouldn't work. And if I use ChineseFilterFactory would it recognize that the "electric" in "electric吉他" isn't Chinese and shouldn't be split into individual characters?

Any ideas here are greatly appreciated.

In a related matter, I checked out http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-tree.html and saw that there are:

EdgeNGramFilterFactory & EdgeNGramTokenizerFactory
NGramFilterFactory & NGramTokenizerFactory

What are the differences between *FilterFactory and *TokenizerFactory? In my case which one should I be using?

Thanks.


      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Ahmet Arslan <io...@yahoo.com>.
> Does anyone know how to deal with these 2 issues when using
> NGramFilterFactory for autocomplete?
> 
> 1) hyphens - if user types "ema" or "e-ma" I want to
> suggest "email"
> 
> 2) accents - if user types "herme" I want to suggest
> "Hermès"

Accents can be removed by using MappingCharFilterFactory before the tokenizer (at both index and query time).

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

I am not sure if this is the most elegant solution, but you can replace "-" with "" using MappingCharFilterFactory too. It satisfies what you describe in 1.

But generally NGramFilterFactory produces a lot of tokens. For example, the query "er" can return "hermes". Maybe EdgeNGramFilterFactory is more suitable for the auto-complete task. At least it guarantees that some word starts with that character sequence.
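
The effect of such character-level normalization can be sketched in Python. Note the swapped-in technique: this sketch strips accents through Unicode decomposition rather than through a mapping file, so it only approximates what mapping-ISOLatin1Accent.txt would do, and the normalize helper is hypothetical.

```python
import unicodedata


def normalize(text):
    """Strip combining accents, drop hyphens, and lowercase the text."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.replace("-", "").lower()


print(normalize("Hermès"))                                # hermes
print(normalize("email").startswith(normalize("e-ma")))   # True: "e-ma" finds "email"
print(normalize("wi-fi").startswith(normalize("wif")))    # True: "wif" finds "wi-fi"
print("er" in normalize("Hermès"))                        # True: the noisy mid-word match
```

Applying the same function to both the indexed tag and the typed query corresponds to using the char filter at both index and query time. The last line illustrates the caveat about plain n-grams: a short query such as "er" matches in the middle of "hermes".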



      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Andy <an...@yahoo.com>.
Ah, thanks for clearing that up.

Does anyone know how to deal with these 2 issues when using NGramFilterFactory for autocomplete?

1) hyphens - if user types "ema" or "e-ma" I want to suggest "email"

2) accents - if user types "herme" I want to suggest "Hermès"

Thanks.



      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Ahmet Arslan <io...@yahoo.com>.
> But I thought NGramFilterFactory would generate substrings
> that start in the "middle", hence ensuring autocomplete
> matching in the middle.
> 
> So in the case of "electric guitar", keywordtokenizer would
> create one token - "electric guitar"
> 
> NGramFilterFactory would then take that one token ("electric
> guitar") and generate N-grams out of it. One of the ngrams
> would be "guit" because "guit" is a substring of "electric
> guitar".
> 

Oops. You are correct, I am sorry. I mixed it up with *Edge*NGramFilterFactory.


      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Andy <an...@yahoo.com>.

--- On Sat, 10/2/10, Ahmet Arslan <io...@yahoo.com> wrote:

> > I don't understand. Many tags like "electric吉他" or
> > "古典吉他" have no whitespace at all, so how does
> > WhitespaceTokenizer help?
> 
> It makes sense for tags having more than one word, e.g.
> "electric guitar"
> 
> If you tokenize this using whitespacetokenizer, you obtain
> two tokens.
> If you use keywordtokenizer, you obtain only one token,
> always.
> 
> In other words, if you want the query "gui" to return "electric
> guitar" you need whitespacetokenizer.


But I thought NGramFilterFactory would generate substrings that start in the "middle", hence ensuring autocomplete matching in the middle.

So in the case of "electric guitar", keywordtokenizer would create one token - "electric guitar"

NGramFilterFactory would then take that one token ("electric guitar") and generate N-grams out of it. One of the ngrams would be "guit" because "guit" is a substring of "electric guitar".

Or did I misunderstand how NGramFilterFactory works?
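
The reasoning above can be checked with a small Python emulation of the proposed field type: one keyword token, lowercased, n-grams of size 1 to 15 at index time, and no n-grams at query time. This only imitates the analysis chain and is not Solr itself; the helper names and the extra "piano" tag are made up for illustration.

```python
def index_terms(tag, min_gram=1, max_gram=15):
    """Emulate the index analyzer: one keyword token, lowercased, then all n-grams."""
    token = tag.lower()
    terms = set()
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            terms.add(token[start:start + size])
    return terms


def query_term(text):
    """Emulate the query analyzer: one keyword token, lowercased, no n-grams."""
    return text.lower()


def suggest(text, tags):
    """Return the tags whose indexed n-grams contain the query term."""
    term = query_term(text)
    return [tag for tag in tags if term in index_terms(tag)]


TAGS = ["guitar", "electric guitar", "电动guitar", "guitar英雄",
        "吉他Hero", "electric吉他", "古典吉他", "piano"]

print(suggest("guit", TAGS))  # the four tags containing "guit"
print(suggest("吉他", TAGS))  # the three tags containing "吉他"
```

In this emulation a query longer than maxGramSize (15 characters) can never match, because no indexed gram is that long and the query itself is not split into n-grams.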





      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Ahmet Arslan <io...@yahoo.com>.
> I don't understand. Many tags like "electric吉他" or
> "古典吉他" have no whitespace at all, so how does
> WhitespaceTokenizer help?

It makes sense for tags having more than one word, e.g. "electric guitar"

If you tokenize this using whitespacetokenizer, you obtain two tokens.
If you use keywordtokenizer, you obtain only one token, always.

In other words, if you want the query "gui" to return "electric guitar" you need whitespacetokenizer.

analysis.jsp visualizes the analysis process step by step. You can observe it there.
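
Assuming prefix-only (edge) n-grams are applied after the tokenizer, the difference between the two tokenizers can be sketched in Python as follows. This is an emulation for illustration, not Solr code, and the function names are hypothetical.

```python
def edge_ngrams(token, min_size=1, max_size=15):
    """Prefixes of token with lengths min_size..max_size."""
    return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]


def keyword_terms(tag):
    """Keyword tokenizer: the whole tag is one token, so only tag prefixes are indexed."""
    return set(edge_ngrams(tag.lower()))


def whitespace_terms(tag):
    """Whitespace tokenizer: each word is a token, so prefixes of every word are indexed."""
    terms = set()
    for word in tag.lower().split():
        terms.update(edge_ngrams(word))
    return terms


print("gui" in keyword_terms("electric guitar"))     # False
print("gui" in whitespace_terms("electric guitar"))  # True
print("吉他" in whitespace_terms("electric吉他"))    # False: no whitespace to split on
```

The last line reflects the objection raised in the question being answered here: tags without whitespace stay a single token, so splitting on whitespace alone does not help them.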


      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Andy <an...@yahoo.com>.

--- On Sat, 10/2/10, Ahmet Arslan <io...@yahoo.com> wrote:

> From: Ahmet Arslan <io...@yahoo.com>

> The autocomplete fieldType will bring you only startsWith tags
> since it uses KeywordTokenizerFactory. You need
> WhitespaceTokenizer for your use case. 
> 
> Or you can use two different fields and types (using
> keywordtokenizer and whitespacetokenizer). That way,
> beginsWith matches come first.
> 

I don't understand. Many tags like "electric吉他" or "古典吉他" have no whitespace at all, so how does WhitespaceTokenizer help?


      

Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Posted by Ahmet Arslan <io...@yahoo.com>.
> For example, if a user types "guit" I want to suggest:
> "guitar"
> "electric guitar"
> "电动guitar"
> "guitar英雄"
> 
> And if a user types "吉他" I want to suggest:
> "吉他Hero"
> "electric吉他"
> "古典吉他"
> 
> 
> I'm thinking about using:
> 
> <fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
>  <analyzer type="index">
>    <tokenizer class="solr.KeywordTokenizerFactory"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15" />
>  </analyzer>
>  <analyzer type="query">
>    <tokenizer class="solr.KeywordTokenizerFactory"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>  </analyzer>
> </fieldType>
> 
> Would the above setup do what I want to do?

The autocomplete fieldType will bring you only startsWith tags since it uses KeywordTokenizerFactory. You need WhitespaceTokenizer for your use case. 

Or you can use two different fields and types (using keywordtokenizer and whitespacetokenizer). That way, beginsWith matches come first.
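
The intended ordering of the two-field idea can be illustrated with a short Python sketch: tags that begin with the query come first, followed by tags in which some later word begins with it. This only emulates the desired result; how the two fields would be weighted against each other in Solr is not covered in this thread, and the function name is hypothetical.

```python
def rank_suggestions(query, tags):
    """Whole-tag prefix matches first, then matches on the start of any word."""
    q = query.lower()
    tag_prefix = [tag for tag in tags if tag.lower().startswith(q)]
    word_prefix = [
        tag for tag in tags
        if tag not in tag_prefix
        and any(word.startswith(q) for word in tag.lower().split())
    ]
    return tag_prefix + word_prefix


tags = ["electric guitar", "guitar", "bass guitar", "piano"]
print(rank_suggestions("gui", tags))  # ['guitar', 'electric guitar', 'bass guitar']
```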