
Posted to java-user@lucene.apache.org by Yusuf Aaji <yu...@gmail.com> on 2009/02/20 12:22:27 UTC

Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

Hi Everyone,


My question is related to the Arabic analysis package under: 
org.apache.lucene.analysis.ar


It is cool and it is doing a great job, but it uses a special tokenizer: 
ArabicLetterTokenizer


The problem with this tokenizer is that it fails to handle email addresses, 
URLs and acronyms the way the StandardTokenizer does.


The StandardTokenizer, for its part, fails to handle Arabic diacritics 
correctly, so it splits words that shouldn't be split.


Arabic diacritics are: (as mentioned in the class: 
org.apache.lucene.analysis.ar.ArabicNormalizer)


FATHATAN = '\u064B';
DAMMATAN = '\u064C';
KASRATAN = '\u064D';
FATHA = '\u064E';
DAMMA = '\u064F';
KASRA = '\u0650';
SHADDA = '\u0651';
SUKUN = '\u0652';


so it is the range [\u064B-\u0652]
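
A quick, hedged check of why these eight code points get in the way: the sketch below is not part of Lucene and uses only java.lang.Character to report how Java classifies each one. The class name DiacriticTypeCheck is made up for illustration. The code points are nonspacing marks rather than letters, which is consistent with a letter-based tokenizer rule splitting on them.

```java
final class DiacriticTypeCheck {
    // First and last code points of the diacritic range listed above (U+064B to U+0652).
    static final char FIRST = (char) 0x064B;
    static final char LAST = (char) 0x0652;

    // True if Java classifies the character as a nonspacing combining mark (category Mn).
    static boolean isNonSpacingMark(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        for (char c = FIRST; c <= LAST; c++) {
            System.out.printf("U+%04X letter=%b mark=%b%n",
                    (int) c, Character.isLetter(c), isNonSpacingMark(c));
        }
    }
}
```

Every line printed reports letter=false and mark=true.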


Is it possible to modify the StandardTokenizerImpl to treat these 
diacritics as normal letters?


I guess it should be done the same way it is done for Chinese and 
Japanese in this line in the file StandardTokenizerImpl.jflex:


// Chinese and Japanese (but NOT Korean, which is included in [:letter:])

CJ         = [\u3100-\u312f\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF\u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff65-\uff9f]


so it can be something like:

AR = [\u064B-\u0652]


then modify this line also to include our new group of characters:


// From the JFlex manual: "the expression that matches everything of <a>
// not matched by <b> is !(!<a>|<b>)"
LETTER     = !(![:letter:]|{CJ}|{AR})
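
One thing worth checking before patching the grammar: by the identity quoted from the JFlex manual, everything listed after ![:letter:] inside the outer negation is subtracted from LETTER, so {AR} placed there would be excluded rather than added. The sketch below models that with plain boolean predicates. It is not JFlex or Lucene code, Character.isLetter is only an assumed stand-in for [:letter:], and the names LetterMacroSketch, proposedLetter and unionLetter are hypothetical.

```java
final class LetterMacroSketch {
    // Ranges copied from the CJ macro in the post, as inclusive [low, high] pairs.
    private static final int[][] CJ = {
        {0x3100, 0x312F}, {0x3040, 0x309F}, {0x30A0, 0x30FF}, {0x31F0, 0x31FF},
        {0x3300, 0x337F}, {0x3400, 0x4DBF}, {0x4E00, 0x9FFF}, {0xF900, 0xFAFF},
        {0xFF65, 0xFF9F}
    };

    static boolean inCJ(char c) {
        for (int[] r : CJ) {
            if (c >= r[0] && c <= r[1]) return true;
        }
        return false;
    }

    // The proposed AR macro: U+064B to U+0652.
    static boolean inAR(char c) {
        return c >= 0x064B && c <= 0x0652;
    }

    // Models LETTER = !(![:letter:]|{CJ}|{AR}): letters minus CJ minus AR.
    static boolean proposedLetter(char c) {
        return !(!Character.isLetter(c) || inCJ(c) || inAR(c));
    }

    // Models LETTER = !(![:letter:]|{CJ}) | {AR}: letters minus CJ, plus AR.
    static boolean unionLetter(char c) {
        return !(!Character.isLetter(c) || inCJ(c)) || inAR(c);
    }

    public static void main(String[] args) {
        char fatha = (char) 0x064E;
        System.out.println("proposed form accepts FATHA: " + proposedLetter(fatha));
        System.out.println("union form accepts FATHA:    " + unionLetter(fatha));
    }
}
```

If that reading holds, a form such as LETTER = !(![:letter:]|{CJ}) | {AR} may be closer to the intent. This is untested against the real grammar file.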



Am I right? Am I going in the right direction? Comments are very 
welcome.


Regards..


Yusuf



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

Posted by Robert Muir <rc...@gmail.com>.

Yusuf,

You are 100% correct: it is bad that this uses a custom tokenizer.

This was my motivation for attacking it from this angle:
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

(unfinished)

Otherwise, at some point the jflex rules will become insanely complex and
essentially try to reimplement things that Unicode has already done.

Even using real Unicode properties for break iteration still leaves
complexities, e.g. single quotes in Hebrew and Ukrainian (I see them using
` as well).
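
For comparison, here is a minimal sketch of segmentation driven by built-in, Unicode-aware break rules instead of hand-written character classes. It uses the JDK's java.text.BreakIterator purely as a stand-in. That choice is an assumption for illustration, not what LUCENE-1488 implements, and the class name WordBreakSketch is hypothetical. The exact boundaries depend on the JDK's rules, so no output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakSketch {
    // Splits text at every word boundary reported by the JDK's BreakIterator.
    // Whitespace and punctuation runs are returned as segments too.
    static List<String> segments(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // An Arabic word carrying KASRA and FATHA, followed by an email address.
        String text = "\u0643\u0650\u062A\u064E\u0627\u0628 user@example.com";
        for (String s : segments(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

A real tokenizer would still need to filter the returned segments and decide how to treat things like email addresses, which is part of the remaining complexity.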

On Fri, Feb 20, 2009 at 7:03 AM, Grant Ingersoll <gs...@apache.org> wrote:

> It's been a few years since I've worked on Arabic, but it sounds
> reasonable.  Care to submit a patch with unit tests showing the
> StandardTokenizer properly handling all Arabic characters?
> http://wiki.apache.org/lucene-java/HowToContribute


-- 
Robert Muir
rcmuir@gmail.com

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

Posted by Grant Ingersoll <gs...@apache.org>.

It's been a few years since I've worked on Arabic, but it sounds
reasonable.  Care to submit a patch with unit tests showing the
StandardTokenizer properly handling all Arabic characters?
http://wiki.apache.org/lucene-java/HowToContribute


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

