You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2008/05/14 07:39:55 UTC

[jira] Resolved: (LUCENE-1003) [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,

     [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-1003.
--------------------------------------

    Resolution: Fixed

> [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,
> ------------------------------------------------------------------
>
>                 Key: LUCENE-1003
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1003
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: TUSUR OpenTeam
>            Assignee: Otis Gospodnetic
>         Attachments: RussianCharsets.java.patch, TestRussianAnalyzer.java.patch
>
>
> RussianAnalyzer's tokenizer skips numbers from input text, so that resulting token stream miss numbers. Problem can be solved by adding numbers to RussianCharsets.UnicodeRussian. See test case below  for details.
> {code:title=TestRussianAnalyzer.java|borderStyle=solid}
> public class TestRussianAnalyzer extends TestCase {
>   Reader reader = new StringReader("text 1000");
>   // test FAILS
>   public void testStemmer() {
>     testAnalyzer(new RussianAnalyzer());
>   }
>   // test PASSES
>   public void testFixedRussianAnalyzer() {
>     testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
>   }
>   private void testAnalyzer(RussianAnalyzer analyzer) {
>     try {
>       TokenStream stream = analyzer.tokenStream("text", reader);
>       assertEquals("text", stream.next().termText());
>       assertNotNull(stream.next());
>     } catch (IOException e) {
>       fail(e.getMessage());
>     }
>   }
>   private char[] getRussianCharSet() {
>     int length = RussianCharsets.UnicodeRussian.length;
>     final char[] russianChars = new char[length + 10];
>     System
>         .arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
>     russianChars[length++] = '0';
>     russianChars[length++] = '1';
>     russianChars[length++] = '2';
>     russianChars[length++] = '3';
>     russianChars[length++] = '4';
>     russianChars[length++] = '5';
>     russianChars[length++] = '6';
>     russianChars[length++] = '7';
>     russianChars[length++] = '8';
>     russianChars[length] = '9';
>     return russianChars;
>   }
> }
> {code} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org