Posted to dev@lucene.apache.org by Michal Diamantstein <mi...@genesys.com> on 2015/03/30 18:14:04 UTC

Question concerning Arabic analyzer

Hi,
I'm a software developer at Genesys and we use Lucene in our product.
We recently added support for Arabic, which includes indexing (writing and reading) data in this language.
While using ArabicLetterTokenizer from http://lucenenet.apache.org/docs/3.0.3/dc/d1c/_arabic_letter_tokenizer_8cs_source.html
I ran into an issue:
the function IsTokenChar(char c) does not accept digits as token characters, so numbers are not tokenized.

/**
 * Allows for Letter category or NonspacingMark category
 * @see org.apache.lucene.analysis.LetterTokenizer#isTokenChar(char)
 */
protected internal override bool IsTokenChar(char c)
{
    return base.IsTokenChar(c) || char.GetUnicodeCategory(c) == System.Globalization.UnicodeCategory.NonSpacingMark;
}


What is the reason for not allowing numbers?

Our process uses the analyzer to get all the tokens
and then builds a TermQuery, a PhraseQuery, or nothing, depending on the term count.
While iterating over the tokens, we see that numbers have been dropped.

Thanks in advance.


Michal Diamantstein
Software Engineer
T:  +972 72 220 1866
M: +972 50 424 5533
Michal.Diamantstein@genesys.com


Re: Question concerning Arabic analyzer

Posted by Robert Muir <rc...@gmail.com>.

On Mon, Mar 30, 2015 at 12:14 PM, Michal Diamantstein <
michal.diamantstein@genesys.com> wrote:

> What is the reason for not allowing numbers?

No reason, it was just a simple tokenizer that worked for Arabic.

Since Lucene 3.1, StandardTokenizer can tokenize Arabic (and work with
numbers and other stuff), and ArabicAnalyzer uses that instead.
ArabicLetterTokenizer was deprecated at that point. See
https://issues.apache.org/jira/browse/LUCENE-2747
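
To illustrate why the digits disappear, here is a minimal Java sketch that uses only the JDK and no Lucene classes. It mirrors the logic of the predicate quoted in the question (letters plus nonspacing marks) and compares it with a hypothetical variant that also accepts decimal digits. The class name TokenCharSketch, both predicate names, and the tokenize helper are assumptions made up for this illustration; they are not part of Lucene or Lucene.NET, and the splitting loop is only a rough stand-in for what a character-based tokenizer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustration only: a JDK-only stand-in, not Lucene code.
class TokenCharSketch {

    // Same rule as the quoted IsTokenChar: a letter or a nonspacing mark.
    // Digits fall in neither category, so they are rejected.
    static boolean isLetterOrMark(int codePoint) {
        return Character.isLetter(codePoint)
                || Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    // Hypothetical variant (an assumption, not a Lucene API): also accept decimal digits.
    static boolean isLetterMarkOrDigit(int codePoint) {
        return isLetterOrMark(codePoint) || Character.isDigit(codePoint);
    }

    // Split text into maximal runs of code points accepted by the predicate.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isTokenChar.test(codePoint)) {
                current.appendCodePoint(codePoint);
            } else if (current.length() > 0) {
                // A rejected character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // An Arabic word, ASCII digits, and Arabic-Indic digits (U+0664..U+0666).
        String text = "\u0643\u062A\u0627\u0628 123 \u0664\u0665\u0666";
        System.out.println(tokenize(text, TokenCharSketch::isLetterOrMark));
        System.out.println(tokenize(text, TokenCharSketch::isLetterMarkOrDigit));
    }
}
```

With the letter-or-mark rule, both digit runs are treated as separators and only the word survives; with the digit-accepting variant, all three runs come back as tokens. In a real index the simpler route is the one described above: use StandardTokenizer (via ArabicAnalyzer in Lucene 3.1 or later) rather than patching the deprecated tokenizer.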