Posted to java-user@lucene.apache.org by walid <wa...@elementn.com> on 2009/08/02 02:17:19 UTC

Re: arabic analyzer

I guess in that case, my users will be angry :)
the fact is, plural (as an example) is not supported, and that is one of
the most common things that a person doing some search will expect to
not have to worry about.
anyway, will roll it out and see the users' reaction :)

thank you.

-walid 

On Fri, 2009-07-24 at 08:39 -0400, Robert Muir wrote:
> walid, it is true that some of what you mentioned (from aramorph) works
> in the light stemming version and some does not.
> The problem is that it's not clear to me that what aramorph is doing is
> really the best approach.
> 
> From the paper I sent you:
> 
> The best stemmer in our experiments, light8-s was very simple and did
> not try to find roots or take into account most of Arabic morphology.
> It is probably not essential for the stemmer to yield the correct
> forms, whether stems or roots.
> It is sufficient for it to group most of the forms that belong together.
> 
> This is what is being used in Lucene: light8-s. If you read section
> 5.2.1 of the paper, you will see this method outperforms the
> morphological analysis method you speak of (using the same Buckwalter
> dictionary).
> 
> But I also understand this is just a general text IR relevance
> measurement (your specific text might vary), and it does not take into
> account some human factors (it can be better on average, but make
> users angry, that type of thing).
> 
> Another problem I have with this situation is that I'm not sure the
> morphological analysis method is really wrong; perhaps aramorph (and
> that paper) are just indexing the wrong thing. For example, aramorph
> indexes Arabic stems, but the latest Buckwalter dictionary has
> lemmaID, so why not index that?
> 
> Anyway, I hope in the future there will be more options; that would be
> a good thing!
> 
> On Fri, Jul 24, 2009 at 4:06 AM, walid<wa...@elementn.com> wrote:
> > We were using the aramorph library for some time, so we mapped out
> > the set of features it provides. They are as follows:
> >
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------
> > The ء and ~ are considered unique characters.
> >
> >              * أ , آ, ا, and إ are distinct
> >
> >              * و and ؤ are distinct
> >
> >              * ى and ئ are distinct
> >
> >      * The ا and ة (denoting the feminine adjective) at the end of a
> >        word are optional.
> >
> >      * The ال, ب, ل, ك, بال, كال, لل at the beginning of a word are
> >        optional
> >
> >      * All حركات (short-vowel diacritics) as well as the ّ (شدّة, shadda) are ignored.
> >
> >      * The ي , و , ات , ون denoting the plural form of a word are
> >        optional. If the indexed word ends with a ة its plural, which
> >        replaces the ة with ات , is recognized.
> >
> > The following examples illustrate these rules:
> >
> > Indexed word: الحياة
> >
> >      Search term     Success
> >      للحياة          True
> >      حياة            True
> >      حيا             False
> >      ألحياة          False
> >      إلحياة          False
> >      كالحياة         True
> >      بالحياة         True
> >      بحياة           True
> >      لحياة           True
> >
> > Indexed word: دولارا
> >
> >      Search term     Success
> >      دولار           True
> >      بدولار          True
> >      بالدولار        True
> >      الدولار         True
> >      دؤلارا          False
> >      دولأرا          False
> >      دولارأ          False
> >
> > Indexed word: الكاتب
> >
> >      Search term     Success
> >      كاتب            True
> >      لكاتب           True
> >      كاتبة           True
> >      الكاتبة         True
> >      الكاتبات        True
> >      كاتبون          True
> >      كاتبو \ كاتبي   True
> >      كتب             False
> >
> > Indexed word: جميلة
> >
> >      Search term     Success
> >      جميلات          True
> >      جميل            True
> >      الجمال          False
> >
> > Indexed word: بنت
> >
> >      Search term     Success
> >      ابنة            False
> >      بن              True
> >      ابن             True
> >      ابنت            False
> >
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > while with the new one, we only got matches for:
> > فّ فُ فٌ فف فِِ فٍ ف
> > and the likes of that.
> >
> > -walid
> >
> > On Thu, 2009-07-23 at 09:33 -0400, Robert Muir wrote:
> >
> >> walid, can you provide any more information other than "very poor result"?
> >>
> >> Others have not measured much difference between morphological
> >> analysis and light stemming:
> >> http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> >>
> >>
> >> On Thu, Jul 23, 2009 at 7:34 AM, walid<wa...@elementn.com> wrote:
> >> > http://issues.apache.org/jira/browse/LUCENE-1406
> >> > http://issues.apache.org/jira/browse/LUCENE-153
> >> >
> >> > based on this, there are two options:
> >> > 1- using the aramorph library
> >> > 2- moving the code from trunk to the current release and using the
> >> > provided arabic analyzer
> >> >
> >> > 1- the library works very well in indexing, tokenizing, stemming and
> >> > everything, but causes memory leaks
> >> > 2- the provided library has a very poor result compared to the aramorph
> >> > library.
> >> >
> >> > Is there a plan to have better arabic support with full morphological
> >> > analysis support?
> >> >
> >> > walid
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >
> >> >
> >>
> >>
> >>
> >
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: arabic analyzer

Posted by Robert Muir <rc...@gmail.com>.

Walid, thanks for your feedback.

FYI, I created an issue with some minor improvements (such as the lam-lam
prefix) to the Arabic analyzer:
http://issues.apache.org/jira/browse/LUCENE-1758

I also tried to improve the stopword list, but your Arabic is surely
much better than mine. If you are interested, have a look; perhaps you
could double-check it :)


On Mon, Aug 3, 2009 at 12:05 PM, walid<wa...@elementn.com> wrote:
> Hello Robert,
>
> you are so right, plurals based on prefixes and suffixes are working.
> Plurals based on an inserted "و" do not (باب and ابواب).
>
> The few words I had tested were all of the "insert" type and not the
> prefix/suffix type.
>
> thank you :)
>
> -walid
>
> On Sun, 2009-08-02 at 15:08 -0400, Robert Muir wrote:
>> > the fact is, plural (as an example) is not supported, and that is one of
>> > the most common things that a person doing some search will expect to
>>
>> Walid, I'm not sure this is true. Many plurals are supported
>> (though certainly not exceptional cases or broken plurals).
>> This is no different from the other language analyzers in Lucene, even
>> the English stemmers: the most common forms are grouped together, and
>> that's about all you can say :)
>>
>> Maybe in the future we can improve it for your particular concern,
>> though: add simple dictionary mappings for at least the most common
>> broken plurals, something like that.
>>
>
>



-- 
Robert Muir
rcmuir@gmail.com


Re: arabic analyzer

Posted by walid <wa...@elementn.com>.

Hello Robert,

you are so right, plurals based on prefixes and suffixes are working.
Plurals based on an inserted "و" do not (باب and ابواب).

The few words I had tested were all of the "insert" type and not the
prefix/suffix type.
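
To make the difference concrete, here is a rough sketch of why affix
stripping groups suffix-based plurals but misses the "insert" type. It
is a toy stemmer written for this illustration only, not Lucene's
ArabicStemmer and not light8; the affix lists, the minimum stem length
and the name light_stem are all assumptions. The example uses باب and
its plural written without the hamza, ابواب.

```python
# Illustrative only: a toy light stemmer, not Lucene's ArabicStemmer.
# The affix lists are assumptions, ordered longest first.
PREFIXES = ("بال", "كال", "لل", "ال")
SUFFIXES = ("ات", "ون", "ة", "ي", "و")
MIN_STEM_LEN = 3  # assumption: never strip a word below three letters


def light_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[: -len(suffix)]
            break
    return word


# Suffix-based plurals collapse onto the same stem:
print(light_stem("كاتبون") == light_stem("الكاتبات") == light_stem("كاتب"))  # True
# A broken plural changes the inside of the word, so stripping affixes
# cannot bring the two forms together:
print(light_stem("باب") == light_stem("ابواب"))  # False
```

Nothing here looks inside the word, which is exactly the case my test
words fell into.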

thank you :)

-walid

On Sun, 2009-08-02 at 15:08 -0400, Robert Muir wrote:
> > the fact is, plural (as an example) is not supported, and that is one of
> > the most common things that a person doing some search will expect to
> 
> Walid, I'm not sure this is true. Many plurals are supported
> (though certainly not exceptional cases or broken plurals).
> This is no different from the other language analyzers in Lucene, even
> the English stemmers: the most common forms are grouped together, and
> that's about all you can say :)
> 
> Maybe in the future we can improve it for your particular concern,
> though: add simple dictionary mappings for at least the most common
> broken plurals, something like that.
> 



Re: arabic analyzer

Posted by Robert Muir <rc...@gmail.com>.

> the fact is, plural (as an example) is not supported, and that is one of
> the most common things that a person doing some search will expect to

Walid, I'm not sure this is true. Many plurals are supported
(though certainly not exceptional cases or broken plurals).
This is no different from the other language analyzers in Lucene, even
the English stemmers: the most common forms are grouped together, and
that's about all you can say :)

Maybe in the future we can improve it for your particular concern,
though: add simple dictionary mappings for at least the most common
broken plurals, something like that.
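
As a rough idea of what such a mapping could look like (purely a
sketch, nothing like this exists in the analyzer today): a small
exception table is consulted first, and anything not in it falls
through to the regular stemmer. The table entries, the names
BROKEN_PLURALS and normalize_plural, and the pass-through default
stemmer are all hypothetical.

```python
from typing import Callable

# Hypothetical exception table: broken plural -> singular.
# The entries are examples only; a real list would have to be curated.
BROKEN_PLURALS = {
    "ابواب": "باب",   # doors -> door
    "رجال": "رجل",    # men -> man
    "اطفال": "طفل",   # children -> child
}


def normalize_plural(token: str, stem: Callable[[str], str] = lambda w: w) -> str:
    """Map a known broken plural to its singular, then apply the stemmer.

    `stem` stands in for whatever stemmer runs afterwards; by default it
    returns the word unchanged.
    """
    singular = BROKEN_PLURALS.get(token, token)
    return stem(singular)


# The plural and the singular now end up as the same index term:
print(normalize_plural("ابواب") == normalize_plural("باب"))  # True
```

The lookup is exact-match only, so it would have to run after
normalization and prefix removal to catch forms such as الابواب.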

-- 
Robert Muir
rcmuir@gmail.com
