Posted to dev@lucene.apache.org by bu...@apache.org on 2004/08/24 01:26:13 UTC

DO NOT REPLY [Bug 28182] - [PATCH] Never write an Analyzer again

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=28182>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=28182

------- Additional Comments From otis@apache.org  2004-08-23 23:26 -------
Moving a patch from email to Bugzilla so that it does not get lost and to
bring it to the attention of others.


From:	"Rasik Pandey" <ra...@ajlsm.com>
To:	"'Lucene Developers List'"
Subject:	analyzer refactoring
Date:	Mon, 21 Jun 2004 19:48:58 +0200

Hello,

As mentioned in previous exchanges, notably with Grant Ingersoll, I 
added some new classes to the "analysis" package to meet the requirements 
of the feature request in Bugzilla 
(http://issues.apache.org/bugzilla/show_bug.cgi?id=28182) and did some 
refactoring while I was under the hood. Here is an overview of the 
class hierarchies after my changes:

-Analyzer
--CustomAnalyzer (new abstract class largely based on Grant's 
BaseAnalyzer)
--AbstractAnalyzer (new abstract class)
---RussianAnalyzer
---GermanAnalyzer
---etc.

-Tokenizer
--CloneableTokenizer (new abstract class)
---StandardTokenizer
---CharTokenizer
---CJKTokenizer
---etc.

-TokenFilter
--CloneableTokenFilter (new abstract class)
---AbstractStemFilter (new abstract class)
----RussianStemFilter
----GermanStemFilter
----etc.

-Stemmer (very simple new interface used in AbstractStemFilter)
--PorterStemmer
--RussianStemmer
--etc.
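
For illustration only, here is a minimal sketch of how a Stemmer 
interface and an AbstractStemFilter-style base class could fit together. 
It is not code from the patch: the stem(String) signature, the 
TermStream stand-in (used in place of Lucene's TokenStream so the 
example compiles on its own), and all Toy* and *Sketch names are 
assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical shape of the "very simple" Stemmer interface; the real
// method name and signature in the patch are not shown in this message.
interface Stemmer {
    String stem(String term);
}

// Stand-in for a pull-style token stream: returns the next term, or null
// when the stream is exhausted. This is NOT Lucene's TokenStream.
interface TermStream {
    String next();
}

// Sketch of an AbstractStemFilter-like base class: it wraps an input
// stream and applies whichever Stemmer the concrete subclass supplies.
abstract class StemFilterSketch implements TermStream {
    private final TermStream input;
    private final Stemmer stemmer;

    protected StemFilterSketch(TermStream input, Stemmer stemmer) {
        this.input = input;
        this.stemmer = stemmer;
    }

    @Override
    public String next() {
        String term = input.next();
        return term == null ? null : stemmer.stem(term);
    }
}

// Toy stemmer for demonstration only: strips a single trailing "s".
class ToyPluralStemmer implements Stemmer {
    @Override
    public String stem(String term) {
        boolean plural = term.length() > 1 && term.endsWith("s");
        return plural ? term.substring(0, term.length() - 1) : term;
    }
}

// A language-specific filter then only has to choose its stemmer.
class ToyStemFilter extends StemFilterSketch {
    ToyStemFilter(TermStream input) {
        super(input, new ToyPluralStemmer());
    }
}

// Simple in-memory source of terms used to drive the example.
class ListTermStream implements TermStream {
    private final Iterator<String> terms;

    ListTermStream(List<String> terms) {
        this.terms = terms.iterator();
    }

    @Override
    public String next() {
        return terms.hasNext() ? terms.next() : null;
    }
}
```

Under these assumptions, a subclass such as a Russian or German stem 
filter would reduce to a constructor that passes in its own Stemmer, 
which is the point of pulling the shared logic into one base class.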

In the attached zip file there are 3 diff files (core.analysis, 
sandbox.analysis, and sandbox.analysis.snowball) and a zip containing the new 
classes for org.apache.lucene.analysis in the Lucene core. I tried to 
minimize the irrelevant code changes (e.g. style, spaces, etc.) in the 
diffs while conforming to the code formatting guidelines outlined by 
Otis. I think a number of classes in the "analysis" package did not 
conform, so these diffs may contain a lot of noise because I 
reformatted those classes with my IDE, sorry :( . If the diffs are too painful 
then let me know and I'll try to prune them. 

If there is a TODO list specific to Analyzers, are the items below on 
that list?

1) move German and Russian packages to sandbox (I think this is on the 
Lucene TODO list)
2) Analyzer class renaming such that dynamic configuration could return 
classes like Analyzer_ru, Analyzer_de, Analyzer_fr, etc. based on the 
class naming scheme "Analyzer_{Locale.toString}"
3) Documentation
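
To make item 2 concrete, here is a rough sketch of how a lookup based 
on the "Analyzer_{Locale.toString}" naming scheme might work. Nothing 
here is from the patch: the LocaleAnalyzerNames class, the package 
name passed in, and the fallback from language plus country to 
language only are all assumptions for illustration. The loader returns 
a plain Object so that the example does not depend on Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or of the patch.
class LocaleAnalyzerNames {
    // Prefix taken from the naming scheme proposed in item 2.
    static final String PREFIX = "Analyzer_";

    // Builds candidate class names from most to least specific, e.g.
    // "pkg.Analyzer_fr_CA" then "pkg.Analyzer_fr". The fallback order
    // is an assumption; the original proposal only names the scheme.
    static List<String> candidates(String packageName, Locale locale) {
        List<String> names = new ArrayList<>();
        String language = locale.getLanguage();
        String country = locale.getCountry();
        if (language.isEmpty()) {
            return names;
        }
        if (!country.isEmpty()) {
            names.add(packageName + "." + PREFIX + language + "_" + country);
        }
        names.add(packageName + "." + PREFIX + language);
        return names;
    }

    // Tries each candidate via reflection and returns the first class
    // that can be instantiated with a no-argument constructor, or null
    // if none of the candidates can be loaded.
    static Object load(String packageName, Locale locale) {
        for (String name : candidates(packageName, locale)) {
            try {
                return Class.forName(name).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                // Class missing or not instantiable: try the next candidate.
            }
        }
        return null;
    }
}
```

With this sketch, a call such as 
LocaleAnalyzerNames.load("org.example.analysis", Locale.GERMAN) would 
look for a class named org.example.analysis.Analyzer_de, where 
"org.example.analysis" is a made-up package name.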

Questions, comments, feedback, and criticisms are all welcome.

Regards,
RBP

PS - Thanks Grant!
