
Posted to java-user@lucene.apache.org by Vladimir Gubarkov <xo...@gmail.com> on 2012/04/19 13:26:11 UTC

Two questions on RussianAnalyzer

Hi,

After upgrading to Lucene 3.6, I noticed that the new RussianAnalyzer
no longer analyzes text the same way as before.

Please see example:

    private List<String> getTokens(Analyzer theAnalyzer, String str) throws IOException {
        final TokenStream tokenStream =
                theAnalyzer.tokenStream(MessageFields.BODY, new StringReader(str));
        final CharTermAttribute termAttribute =
                tokenStream.getAttribute(CharTermAttribute.class);
        final List<String> tokens = new ArrayList<String>();

        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            // Copy the term text out; the attribute's buffer is reused for each token
            tokens.add(termAttribute.toString());
        }
        // Finish and release the stream once all tokens are consumed
        tokenStream.end();
        tokenStream.close();
        return tokens;
    }

    @Test
    public void testDots() throws IOException {
        final String str = "aaa.bbb.com:8888 " +
                "a,b;c/d'e$f&g*h+i-j%k/l_m#n@o!p?q>r\"s~t(u`v|z}y\\z";

        System.out.println("New analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_36), str));

        System.out.println("Old analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_30), str));
    }

This shows:

New analyzer:
[aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
r, s, t, u, v, z, y, z]
Old analyzer:
[aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
q, r, s, t, u, v, z, y, z]

Please note the differences.
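
For illustration, the "Old analyzer" token list above can be roughly
reproduced in plain Java by treating every run of letters or digits as a
token and everything else as a separator. This is only a sketch of the
splitting rule, not Lucene code: the class and method names are made up
for the example, and it ignores lowercasing, stop words and stemming.

```java
import java.util.ArrayList;
import java.util.List;

final class LetterDigitSplitSketch {
    // Hypothetical helper: every run of letters or digits becomes a token,
    // any other character (dot, colon, apostrophe, underscore, ...) separates tokens.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) { // skip the empty piece produced by a leading separator
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [aaa, bbb, com, 8888, l, m, d, e]
        System.out.println(split("aaa.bbb.com:8888 l_m d'e"));
    }
}
```

Under this rule "aaa.bbb.com" falls apart into three tokens, whereas the
new analyzer keeps it as one.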

What bothers me most about the new behaviour is that I used to be able
to search by subdomain, e.g.
bbb.com:8888
and get results containing www.bbb.com:8888, aaa.bbb.com:8888 and
so on. Now I get 0 results.

My questions are: 1) is this change by design (not a mistake), and
2) is using Version.LUCENE_30 when creating the analyzer the only way
to get the old behaviour?

The other problem with RussianAnalyzer concerns the letter Yo
http://en.wikipedia.org/wiki/Yo_(Cyrillic), which in Russian is often
replaced by the letter Ye http://en.wikipedia.org/wiki/Ye_(Cyrillic);
words that differ only in this letter are considered the same.
What I want is for a search for a word with yo to also match words
where this letter has been replaced by ye (and vice versa).

What I'm currently doing is roughly the following:

// NOTE: I have to define my class in this package, because the method
// russianAnalyzer.createComponents is protected
package org.apache.lucene.analysis.ru;

public class RussianAnalyzerImproved extends ReusableAnalyzerBase {
    private final RussianAnalyzer russianAnalyzer =
            new RussianAnalyzer(LuceneVersion.VERSION);

    @Override
    protected Reader initReader(Reader reader) {
        return new YoCharFilter(CharReader.get(reader));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        return russianAnalyzer.createComponents(fieldName, reader);
    }
}

public class YoCharFilter extends CharFilter {
    public YoCharFilter(CharStream in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        final int charsRead = super.read(cbuf, off, len);
        if (charsRead > 0) {
            final int end = off + charsRead;
            while (off < end) {
                if (cbuf[off] == 'ё' || cbuf[off] == 'Ё')
                    cbuf[off] = 'е';
                off++;
            }
        }
        return charsRead;
    }
}

But I'm not sure this is the correct approach.
What do you think?
Would it make sense to add a configuration option to
RussianAnalyzer itself (whether or not to distinguish yo and ye)?


Sincerely yours,
Vladimir

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Two questions on RussianAnalyzer

Posted by Vladimir Gubarkov <xo...@gmail.com>.

Thank you, Steven,

I'll look into this.

On Fri, Apr 20, 2012 at 12:43 AM, Steven A Rowe <sa...@syr.edu> wrote:
> Hi Vladimir,
>
>> What bothers me most about the new behaviour is that I used to be
>> able to search by subdomain, e.g. bbb.com:8888, and get results
>> containing www.bbb.com:8888, aaa.bbb.com:8888 and so on. Now I get
>> 0 results.
>
> About domain names, see my response to a similar question today on the Solr users list: <http://markmail.org/message/3ddxwc7dunblthyt>.
>
> Steve
>


RE: Two questions on RussianAnalyzer

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Vladimir,

> What bothers me most about the new behaviour is that I used to be
> able to search by subdomain, e.g. bbb.com:8888, and get results
> containing www.bbb.com:8888, aaa.bbb.com:8888 and so on. Now I get
> 0 results.

About domain names, see my response to a similar question today on the Solr users list: <http://markmail.org/message/3ddxwc7dunblthyt>. 

Steve

Re: Two questions on RussianAnalyzer

Posted by Vladimir Gubarkov <xo...@gmail.com>.

On Thu, Apr 19, 2012 at 7:57 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>> My questions are: 1) is this change by design (not a mistake), and
>> 2) is using Version.LUCENE_30 when creating the analyzer the only
>> way to get the old behaviour?
>
> This is why this option is there!

Right, and it's great, but this doesn't actually answer my questions =).


RE: Two questions on RussianAnalyzer

Posted by Uwe Schindler <uw...@thetaphi.de>.

> My questions are: 1) is this change by design (not a mistake), and
> 2) is using Version.LUCENE_30 when creating the analyzer the only
> way to get the old behaviour?

This is why this option is there!



Re: Two questions on RussianAnalyzer

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Apr 19, 2012 at 4:51 PM, Vladimir Gubarkov <xo...@gmail.com> wrote:

> Hmmm... I know this, and I reindexed!
> I'll try to explain the problem (fortunately, already solved by using
> LUCENE_30) once again:
> When indexing with the new analyzer, the whole lexeme "some.cool.site.com"
> goes into the index, not the 4 lexemes "some", "cool", "site", "com".
> So it's now impossible to find this document with the query "site.com".
> I have an RSS subscription for that search, and now it's broken.
>

You are right: to search for prefixes of that (assuming it's a URL,
which it may or may not be; it depends on the domain and use case),
you need something else. So I think Steven Rowe's advice is best here.

-- 
lucidimagination.com


Re: Two questions on RussianAnalyzer

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Apr 19, 2012 at 4:51 PM, Vladimir Gubarkov <xo...@gmail.com> wrote:
> So it's now impossible to find this document with the query "site.com".
> I have an RSS subscription for that search, and now it's broken.
>

Just to point out, it's not impossible. As I suggested before, if you
were happy with the old tokenizer and you don't like passing LUCENE_30,
then just make your own analyzer using ClassicTokenizer + <your list
of filters> instead.

-- 
lucidimagination.com


Re: Two questions on RussianAnalyzer

Posted by Vladimir Gubarkov <xo...@gmail.com>.

Thank you, Robert, for the detailed reply.

On Fri, Apr 20, 2012 at 12:37 AM, Robert Muir <rc...@gmail.com> wrote:
> On Thu, Apr 19, 2012 at 7:26 AM, Vladimir Gubarkov <xo...@gmail.com> wrote:
>> New analyzer:
>> [aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
>> r, s, t, u, v, z, y, z]
>> Old analyzer:
>> [aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
>> q, r, s, t, u, v, z, y, z]
>>
>> Please note the differences.
>
> Right, the tokenizer has changed. This is mentioned in the javadocs:
> http://lucene.apache.org/core/3_6_0/api/contrib-analyzers/org/apache/lucene/analysis/ru/RussianAnalyzer.html
>
>>
>> What bothers me most about the new behaviour is that I used to be able
>> to search by subdomain, e.g.
>> bbb.com:8888
>> and get results containing www.bbb.com:8888, aaa.bbb.com:8888 and
>> so on. Now I get 0 results.
>
> Don't simply set your version parameter to 3.6 without reindexing.
> This is really important!!!!!!!!!!!
> Otherwise it defeats the whole purpose.
>

Hmmm... I know this, and I reindexed!
I'll try to explain the problem (fortunately, already solved by using
LUCENE_30) once again:
When indexing with the new analyzer, the whole lexeme "some.cool.site.com"
goes into the index, not the 4 lexemes "some", "cool", "site", "com".
So it's now impossible to find this document with the query "site.com".
I have an RSS subscription for that search, and now it's broken.
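
To make the mismatch concrete, here is a small plain-Java sketch. It
assumes the query is analyzed the same way as the document and that its
terms must then appear consecutively in the document's token list. The
token lists are written out by hand from the description above, and the
class and method names are hypothetical, not Lucene API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SubdomainMatchSketch {
    // True when the query tokens occur as a consecutive run in the document tokens.
    static boolean matches(List<String> docTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }

    public static void main(String[] args) {
        // Old tokenization: the host name is split at every dot.
        List<String> oldDoc = Arrays.asList("some", "cool", "site", "com");
        List<String> oldQuery = Arrays.asList("site", "com");

        // New tokenization: the host name stays one token, and so does the query.
        List<String> newDoc = Arrays.asList("some.cool.site.com");
        List<String> newQuery = Arrays.asList("site.com");

        System.out.println("old tokenization matches: " + matches(oldDoc, oldQuery)); // true
        System.out.println("new tokenization matches: " + matches(newDoc, newQuery)); // false
    }
}
```

With whole-host tokens, "site.com" and "some.cool.site.com" are simply two
different terms, so an exact term lookup cannot connect them.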


Re: Two questions on RussianAnalyzer

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Apr 19, 2012 at 7:26 AM, Vladimir Gubarkov <xo...@gmail.com> wrote:
> New analyzer:
> [aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
> r, s, t, u, v, z, y, z]
> Old analyzer:
> [aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
> q, r, s, t, u, v, z, y, z]
>
> Please note the differences.

Right, the tokenizer has changed. This is mentioned in the javadocs:
http://lucene.apache.org/core/3_6_0/api/contrib-analyzers/org/apache/lucene/analysis/ru/RussianAnalyzer.html

>
> What bothers me most about the new behaviour is that I used to be able
> to search by subdomain, e.g.
> bbb.com:8888
> and get results containing www.bbb.com:8888, aaa.bbb.com:8888 and
> so on. Now I get 0 results.

Don't simply set your version parameter to 3.6 without reindexing.
This is really important!!!!!!!!!!!
Otherwise it defeats the whole purpose.

>
> My questions are: 1) is this change by design (not a mistake), and
> 2) is using Version.LUCENE_30 when creating the analyzer the only
> way to get the old behaviour?

No, this analyzer is just an example. You can always easily make your
own analyzer (just extend ReusableAnalyzerBase)
with maybe 6 or 7 lines of code, combining whatever Tokenizer
(ClassicTokenizer, UAX29URLEmailTokenizer, StandardTokenizer, whatever)
with any combination of filters (such as stemmers, whatever)
that you want.

>
> The other problem with RussianAnalyzer concerns the letter Yo
> http://en.wikipedia.org/wiki/Yo_(Cyrillic), which in Russian is often
> replaced by the letter Ye http://en.wikipedia.org/wiki/Ye_(Cyrillic);
> words that differ only in this letter are considered the same.
> What I want is for a search for a word with yo to also match words
> where this letter has been replaced by ye (and vice versa).
>
> What I'm currently doing is roughly the following:
>
> // NOTE: I have to define my class in this package, because method
> russianAnalyzer.createComponents is protected
> package org.apache.lucene.analysis.ru;

Don't try so hard to extend existing analyzers. Just make your own.
They are just examples.

>
> public class YoCharFilter extends CharFilter {

CharFilter is really mostly for cases where you also need to correct
offsets (for highlighting and such) before the tokenizer.

But you don't: this is a simple 1-1 mapping that won't affect
tokenization. It's just a trivial normalization. I would use a
TokenFilter instead.
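
As a rough illustration of why no offset correction is needed, the
mapping can be done in place on a term's character buffer, one character
for one character. The sketch below is plain Java with made-up names. In
an actual TokenFilter the same loop would presumably run over
CharTermAttribute.buffer() up to length() for each token, but that wiring
is an assumption and is not shown here.

```java
final class YoFoldingSketch {
    // Replaces yo with ye in place. The length never changes, so character
    // offsets computed before or after the mapping stay the same.
    static void foldYo(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            if (buffer[i] == '\u0451') {        // lowercase yo
                buffer[i] = '\u0435';           // lowercase ye
            } else if (buffer[i] == '\u0401') { // uppercase yo
                buffer[i] = '\u0415';           // uppercase ye
            }
        }
    }

    // Convenience wrapper for whole strings, used only for the demo.
    static String foldYo(String term) {
        char[] buffer = term.toCharArray();
        foldYo(buffer, buffer.length);
        return new String(buffer);
    }

    public static void main(String[] args) {
        String withYo = "\u0451\u043B\u043A\u0430"; // "yolka" spelled with yo
        String withYe = "\u0435\u043B\u043A\u0430"; // the same word spelled with ye
        System.out.println(foldYo(withYo).equals(foldYo(withYe))); // true
    }
}
```

Applying the same folding at index time and at query time is what makes
both spellings end up as the same term.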

> Would it make sense to add a configuration option to
> RussianAnalyzer itself (whether or not to distinguish yo and ye)?

I don't think so. There are tons of choices here (for example, we
provide 2 stemming options for Russian, and more exist), and
even this normalization is complicated: for example, some people might
have documents where Russian ye is actually Latin 'e', and so on.
I've seen it, so I know it exists.

So we generally just provide XYZ_Analyzer as an example mostly, it
would be a lot for us to add every possible use case as an option
to every possible Analyzer. Instead just make your own!

-- 
lucidimagination.com
