Posted to java-user@lucene.apache.org by Chris Bamford <ch...@scalix.com> on 2008/08/08 14:07:25 UTC

SnowballAnalyzer question

Hi.

I am using the SnowballAnalyzer because of its multi-language stemming 
capabilities - and am very happy with that.
There is one small glitch which I'm hoping to overcome - can I get it to 
split up internet domain names in the same way that StopAnalyzer does?
For example, for the sentence "This is a URL: www.google.de / this is a company 
name: XY&Z Corporation", here is the default output from the two analysers:

 StopAnalyzer:
    [url] [www] [google] [de] [company] [name] [xy] [z] [corporation]

 SnowballAnalyzer:
    [this] [is] [a] [url] [www.google.d] [this] [is] [a] [compani] [name] [xy&z] [corpor]

Ideally I would like "www.google.de" to be split into [www] [google] 
[de] (rather than [www.google.d]), but retain the rest of the  
SnowballAnalyzer's capabilities.
Can I perhaps extend  SnowballAnalyzer to allow me to achieve this?

Thanks for any tips / pointers,

- Chris


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: SnowballAnalyzer question

Posted by Chris Hostetter <ho...@fucit.org>.

: I am using the SnowballAnalyzer because of its multi-language stemming
: capabilities - and am very happy with that.
: There is one small glitch which I'm hoping to overcome - can I get it to split
: up internet domain names in the same way that StopAnalyzer does?

90% of the Lucene Analyzers that exist are simple wrappers around 
Tokenizers and TokenFilters -- this is true for SnowballAnalyzer and 
StopAnalyzer as well -- all those classes do is set up some initialization 
work, and then delegate to various Tokenizers and TokenFilters.  If you 
poke around in the code for SnowballAnalyzer you'll see that you can 
write your own analyzer that uses SnowballFilter along with whatever 
tokenizer you want.  (If you like StopAnalyzer's tokenization, that would 
be LowerCaseTokenizer.)
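
To make the "tokenizer, then filters" idea concrete, here is a rough 
sketch in plain Java.  It deliberately does NOT use the Lucene classes: 
letterTokens() only imitates the letter-splitting, lower-casing behaviour 
of LowerCaseTokenizer, the three-word stop list is an assumption made for 
this example, and the stemmer argument is a hypothetical placeholder for 
the slot where SnowballFilter would sit in a real analyzer.  All names in 
it (TokenChainSketch, letterTokens, analyze) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

class TokenChainSketch {
    // Tiny illustrative stop list (an assumption, not Lucene's full list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("this", "is", "a"));

    // Stage 1: emit maximal runs of letters, lower-cased.  Anything that
    // is not a letter ('.', '&', ':', spaces) ends the current token.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Stages 2 and 3: drop stop words, then pass each surviving token
    // through the supplied stemmer (placeholder for a real stemming filter).
    static List<String> analyze(String text, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String token : letterTokens(text)) {
            if (!STOP_WORDS.contains(token)) {
                out.add(stemmer.apply(token));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is a URL: www.google.de / this is a company "
                + "name: XY&Z Corporation";
        // Identity "stemmer": shows the tokenization and stop filtering only.
        System.out.println(analyze(text, token -> token));
    }
}
```

With the identity stemmer this splits "www.google.de" into www / google / 
de and "XY&Z" into xy / z, matching the StopAnalyzer output quoted in the 
original question.  In a real analyzer the last stage would be the 
SnowballFilter rather than a lambda.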




-Hoss

