Posted to java-user@lucene.apache.org by Chris Bamford <ch...@scalix.com> on 2008/08/08 14:07:25 UTC
SnowballAnalyzer question
Hi.
I am using the SnowballAnalyzer because of its multi-language stemming
capabilities - and am very happy with that.
There is one small glitch which I'm hoping to overcome - can I get it to
split up internet domain names in the same way that StopAnalyzer does?
For example, for the sentence "This is a URL: www.google.de / this is a company
name: XY&Z Corporation", here is the default output from the two analysers:
StopAnalyzer:
[url] [www] [google] [de] [company] [name] [xy] [z] [corporation]
SnowballAnalyzer:
[this] [is] [a] [url] [www.google.d] [this] [is] [a] [compani]
[name] [xy&z] [corpor]
Ideally I would like "www.google.de" to be split into [www] [google]
[de] (rather than [www.google.d]), but retain the rest of the
SnowballAnalyzer's capabilities.
Can I perhaps extend SnowballAnalyzer to allow me to achieve this?
Thanks for any tips / pointers,
- Chris
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: SnowballAnalyzer question
Posted by Chris Hostetter <ho...@fucit.org>.
: I am using the SnowballAnalyzer because of its multi-language stemming
: capabilities - and am very happy with that.
: There is one small glitch which I'm hoping to overcome - can I get it to split
: up internet domain names in the same way that StopAnalyzer does?
Most Lucene Analyzers are simple wrappers around Tokenizers and
TokenFilters, and that is true of SnowballAnalyzer and StopAnalyzer as
well. All those classes do is set up some initialization and then
delegate to various Tokenizers and TokenFilters. If you poke around in
the code for SnowballAnalyzer, you'll see that you can write your own
analyzer that uses SnowballFilter along with whatever tokenizer you
want. (If you like StopAnalyzer's tokenization, that would be
LowerCaseTokenizer.)
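
To see why swapping the tokenizer changes the result for "www.google.de"
and "XY&Z", here is a minimal sketch of letter-based tokenization in
plain Java. It does not use Lucene at all: the class name
LetterTokenizerSketch and the tokenize method are hypothetical, and the
code only imitates the split-at-non-letters-and-lowercase behaviour that
StopAnalyzer's output above suggests. Stop word removal and stemming are
left out; in a real analyzer those would be separate filters, with
SnowballFilter applied after the tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a stand-alone imitation of letter-based tokenization.
// This is NOT Lucene code; the names here are hypothetical.
public class LetterTokenizerSketch {

    // Splits the text at every non-letter character and lowercases each token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ('.', '&', ' ', ...) ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [www, google, de, xy, z, corporation]
        System.out.println(tokenize("www.google.de / XY&Z Corporation"));
    }
}
```

Because '.' and '&' are not letters, the domain name and "XY&Z" fall
apart into separate tokens before any stemmer sees them, which is the
behaviour asked for in the original question.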
-Hoss