Posted to dev@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2002/09/09 03:41:53 UTC

Re: RussianAnalyzer

I started integrating this in the rest of Lucene today, but I haven't
finished it yet.

I like the fact that it comes with unit tests, bravo, bravo, but does
anyone have a good suggestion for dealing with external data files that
are required by those tests?  Where would be the best place to put
them?  The same directory as the test class?
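
One possible arrangement, sketched here under the assumption that the data
files are copied into the same package directory as the compiled test
class, is to resolve them through the class itself so the tests do not
depend on the current working directory. TestData and open are
hypothetical names used for illustration only; they are not part of
Lucene or of the contributed zip.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/** Hypothetical helper: opens a data file that sits next to a test class. */
public class TestData {

    /**
     * Opens the resource called name, resolved relative to the package of
     * owner, and decodes it with the given encoding.
     */
    public static BufferedReader open(Class<?> owner, String name, String encoding)
            throws IOException {
        // A name without a leading slash is looked up in owner's package.
        InputStream in = owner.getResourceAsStream(name);
        if (in == null) {
            // Fail loudly so a missing data file is not mistaken for a test bug.
            throw new FileNotFoundException(
                name + " not found next to " + owner.getName());
        }
        return new BufferedReader(new InputStreamReader(in, encoding));
    }
}
```

A test could then call something like
TestData.open(RussianAnalyzerTest.class, "testKOI8.txt", "KOI8-R"),
provided the build copies the .txt and .htm files alongside the classes.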

Also, a question for Boris: why the .htm extension?  Just curious.  Why
not just .txt?

Thanks,
Otis

--- Boris Okner <b....@rogers.com> wrote:
> This is my contribution to the Lucene project.
> 
> RussianAnalyzer v. 1.0 (attachment:russianLucene.zip)
> 
> RussianAnalyzer implements org.apache.lucene.analysis.Analyzer and is
> designed to support indexing/search capabilities for Cyrillic in
> Lucene. Currently, 3 encoding schemes can be used out of the box:
> Unicode, KOI8 and CP1251. For those who want to use other encoding
> schemes, even custom ones, please look at the RussianCharsets class; it
> should be very straightforward to add any encoding.
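
To illustrate why the encoding matters, here is a small sketch showing the
same Russian word stored as different bytes in KOI8 and CP1251 and decoding
to the same Unicode string. It uses the JDK's built-in charsets ("KOI8-R"
and "windows-1251"), which is an assumption made for this example and not
necessarily how the charsets class mentioned above does its mapping.
CyrillicDecodeDemo is a hypothetical name.

```java
import java.nio.charset.Charset;

/** Hypothetical demo: one word, two legacy byte encodings, one Unicode string. */
public class CyrillicDecodeDemo {

    /** Decodes raw bytes using the named charset. */
    public static String decode(byte[] bytes, String charsetName) {
        // Charset.forName throws if the running JDK does not ship the charset.
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The Russian word "mir", written with Unicode escapes so that the
        // encoding of this source file does not matter.
        String word = "\u043c\u0438\u0440";

        byte[] koi8 = {(byte) 0xCD, (byte) 0xC9, (byte) 0xD2};   // KOI8-R bytes
        byte[] cp1251 = {(byte) 0xEC, (byte) 0xE8, (byte) 0xF0}; // CP1251 bytes

        // Each comparison checks that the legacy bytes decode to the same word.
        System.out.println(word.equals(decode(koi8, "KOI8-R")));
        System.out.println(word.equals(decode(cp1251, "windows-1251")));
    }
}
```

Decoding the bytes with the wrong charset name yields different Cyrillic
letters instead of an error, which is why an analyzer has to be told which
encoding its input uses.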
> 
> RussianAnalyzer uses RussianStemFilter, based on the algorithm described
> at Snowball's site (http://snowball.sourceforge.net), and also
> StopFilter with Russian stop words. I was never able to find a
> comprehensive list of stop words, so please feel free to add whatever
> stop words you find missing.
> 
> There are 2 JUnit test cases:
> 
> 1) RussianStemTest, designed to test stemming. It takes a sample
> Russian vocabulary (wordsUnicode), produces a stem for each word, and
> then compares it to the stem from the stemmed version of the
> vocabulary (stemsUnicode.txt). The vocabulary and its stemmed version
> were taken from Snowball's site (they contain more than 49000 words
> and stems), so passing the test means that the implementation of the
> stemming algorithm is consistent with Snowball's description.
> 
> 2) RussianAnalyzerTest contains 3 tests to check RussianAnalyzer on
> Unicode, KOI8 and CP1251. For each test it takes the appropriate input
> (testUnicode.txt, testKOI8.txt and test1251.txt) and produces tokens
> that are then verified one by one against the expected results (placed
> respectively in resUnicode.htm, resKOI8.htm and res1251.htm).
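
Both tests described above follow the same pattern: run each input item
through the code under test and compare the result with the matching entry
in an expected-output list. A minimal sketch of that pattern follows.
ExpectedOutputCheck, mismatches and the toy stemmer are hypothetical and
only stand in for the real classes; the toy stemmer is not the Snowball
Russian algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of an "input list vs. expected list" comparison. */
public class ExpectedOutputCheck {

    /** Returns one message for every position where the output differs. */
    public static List<String> mismatches(List<String> inputs,
                                          List<String> expected,
                                          Function<String, String> transform) {
        if (inputs.size() != expected.size()) {
            throw new IllegalArgumentException(
                "input and expected lists differ in length");
        }
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < inputs.size(); i++) {
            String actual = transform.apply(inputs.get(i));
            if (!actual.equals(expected.get(i))) {
                problems.add("line " + (i + 1) + ": " + inputs.get(i)
                    + " -> " + actual + ", expected " + expected.get(i));
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Toy stand-in for a stemmer: strips one trailing "s".
        Function<String, String> toyStemmer =
            w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;

        List<String> words = Arrays.asList("cats", "dog", "buses");
        List<String> stems = Arrays.asList("cat", "dog", "bus");

        // An empty list would mean every produced stem matched.
        System.out.println(mismatches(words, stems, toyStemmer));
    }
}
```

Collecting all mismatches, instead of stopping at the first one, makes a
failure over a vocabulary of tens of thousands of words easier to diagnose.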
> 
> 
> To run the test cases:
> 
> 1) Unzip russianLucene.zip to any directory.
> 
> 2) From the command line, cd to the directory from 1) and run (adjusting
> the path to your junit.jar and lucene.jar):
> 
>  java -cp .;junit_37.jar;lucene-1.2.jar ca.oksphere.lucene.RussianAnalyzerTest
> 
>  java -cp .;junit_37.jar ca.oksphere.lucene.RussianStemTest
> 
> 
> 
> That's pretty much it. I hope you'll enjoy it. If you have any
> questions, comments, etc., please send me a message at:
> 
> b.okner@rogers.com
> 
> Boris Okner
> 
> ____________________________________________________________________
> 
> LEGAL STUFF:
> 
> *
> * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
> * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
> * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
> * DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR ITS CONTRIBUTORS BE
> * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
> * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
> * OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
> * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
> * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
> * IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> *
> ====================================================================
> 
> 
> 

> ATTACHMENT part 2 application/x-zip-compressed name=russianLucene.zip
> --
> To unsubscribe, e-mail: <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>

