Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2008/09/26 13:57:44 UTC

[jira] Created: (LUCENE-1406) new Arabic Analyzer (Apache license)

new Arabic Analyzer (Apache license)
------------------------------------

                 Key: LUCENE-1406
                 URL: https://issues.apache.org/jira/browse/LUCENE-1406
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Analysis
            Reporter: Robert Muir
            Priority: Minor
         Attachments: arabic.zip

I've noticed there is no Arabic analyzer for Lucene, most likely because Tim Buckwalter's morphological dictionary is GPL.

However, it is not necessary to have a full morphological analysis engine for quality Arabic search.
This implementation implements the light8 algorithm presented in the following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf

As you can see from the paper, the improvement of this method over searching surface forms (as Lucene currently does) is significant, with almost 100% improvement in average precision.

While I personally don't think all the choices were the best, and some easy improvements are still possible, the major motivation for implementing it exactly as presented in the paper is that the algorithm is TREC-tested, so the precision/recall improvements to Lucene are already documented.

For a stopword list, I used the list available at http://members.unine.ch/jacques.savoy/clef/index.html, simply because its creator documents the data as BSD-licensed.

This implementation (Analyzer) consists of the above-mentioned stopword list plus two filters:
 ArabicNormalizationFilter: performs orthographic normalization (normalizing hamza seated on alef, alef maksura, and teh marbuta; removing harakat and tatweel; etc.)
 ArabicStemFilter: performs Arabic light stemming

Both filters operate directly on the term buffer for maximum performance. There is no object creation in this Analyzer.

There are no external dependencies. I've indexed about half a billion words of Arabic text and tested against that.
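
For readers unfamiliar with how Lucene analyzers are put together, here is a minimal sketch of the filter chain described above, written against the Lucene 2.x Analyzer API. The tokenizer choice, the empty stopword placeholder, and the filter constructor shapes are illustrative assumptions, not the contents of the attached archive.

    import java.io.Reader;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class ArabicAnalyzerSketch extends Analyzer {

        // Hypothetical stopword set; the real analyzer would load the
        // BSD-licensed list mentioned above instead of this placeholder.
        private final Set stopSet = StopFilter.makeStopSet(new String[] {});

        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream result = new StandardTokenizer(reader); // tokenizer choice is an assumption
            result = new StopFilter(result, stopSet);           // drop Arabic stopwords
            result = new ArabicNormalizationFilter(result);     // orthographic normalization
            result = new ArabicStemFilter(result);              // light stemming
            return result;
        }
    }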

If there are any issues with this implementation, I am willing to fix them. I use Lucene on a daily basis and would like to give something back. Thanks.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by Robert Muir <rc...@gmail.com>.
Thanks for your feedback. Below is a description of an idea for a Biblical
Hebrew stemmer that would work somewhat differently than a Modern Hebrew
stemmer.

With regard to pointing, I can imagine a user might be frustrated if a word
is stemmed too aggressively when niqqud is present in the query, the text, or both.
It would be nice to make use of the niqqud information, when available, for
higher-precision stemming before it is normalized away.

It is still necessary to stem both forms consistently, because user input will
sometimes not contain niqqud. One trick would be to index multiple tokens as
synonyms for undotted text (i.e. both without ha- and with ha-) but not for
dotted text (since there is less ambiguity with the niqqud present). This would
ensure that recall does not suffer, would increase precision, and would not
increase index size for dotted text.

The downside is that the index size for your undotted biblical text would
increase. This is also why it would have to be a separate analyzer from the
Modern Hebrew stemmer, because niqqud is rare in MH.
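
A minimal sketch of the synonym-stacking mechanism this describes, against the older Lucene 2.x token API. The filter name, the bare "leading he" test, and the length guard are illustrative assumptions; the point is only that a token emitted with a position increment of 0 lands at the same position as the original, which is how the with-/without-prefix variants could be indexed as synonyms for undotted text.

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class UndottedSynonymFilterSketch extends TokenFilter {

        private Token pending; // stripped variant waiting to be emitted

        protected UndottedSynonymFilterSketch(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            if (pending != null) {        // emit the stacked synonym first
                Token synonym = pending;
                pending = null;
                return synonym;
            }
            Token token = input.next();
            if (token == null) return null;
            String text = token.termText();
            // Hypothetical heuristic: a leading he ("ha-") on an undotted token.
            if (text.length() > 2 && text.charAt(0) == '\u05D4') {
                Token stripped = new Token(text.substring(1), token.startOffset(), token.endOffset());
                stripped.setPositionIncrement(0); // same position => indexed as a synonym
                pending = stripped;
            }
            return token;
        }
    }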

With regard to your comment about Unicode normalization, I am unaware of
any characters in the Hebrew block that are encoded differently in NFC
versus NFD. The only thing this would affect is the Hebrew presentation
forms (in the 'Alphabetic Presentation Forms' block). The analyzer would not
work with presentation-forms text, just as the Arabic analyzer doesn't, since
you need Unicode normalization (Java 6 or ICU) to fix such text.
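
A minimal sketch of fixing such text with Java 6's java.text.Normalizer (ICU provides the same on older JVMs); the sample string is an assumption for illustration. Compatibility normalization (NFKC) folds the presentation-form codepoints back onto the ordinary letters in the Arabic and Hebrew blocks, which is what the analyzers expect.

    import java.text.Normalizer;

    public class PresentationFormsExample {
        public static void main(String[] args) {
            // Arabic Presentation Forms-B codepoints, as often produced by PDF extraction:
            // meem (initial), hah (medial), meem (medial), dal (final).
            String extracted = "\uFEE3\uFEA4\uFEE4\uFEAA";
            // NFKC maps them back to the plain Arabic-block letters \u0645\u062D\u0645\u062F.
            // Hebrew presentation forms decompose onto Hebrew-block letters the same way.
            String plain = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
            System.out.println(plain);
        }
    }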

Thanks,
Robert

On Tue, Sep 30, 2008 at 10:15 AM, DM Smith <dm...@gmail.com> wrote:

> Robert Muir wrote:
>
>> can you provide any more information on your use case? I had originally
>> imagined MH, ktiv male spelling only, but your use case is interesting.
>>
>> Are you currently indexing biblical hebrew text? dotted or undotted?
>>
> Biblical Hebrew. Variety of texts. Some unpointed. Others w/ points and
> cantillation. All are NFC.
>
> IMHO, I think it is important to document whether an analyzer works with
> NFC, NFD or whatever. And leave it to the program to normalize to that form.
>


-- 
Robert Muir
rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by DM Smith <dm...@gmail.com>.
Robert Muir wrote:
> can you provide any more information on your use case? I had 
> originally imagined MH, ktiv male spelling only, but your use case is 
> interesting.
>
> Are you currently indexing biblical hebrew text? dotted or undotted?
Biblical Hebrew. Variety of texts. Some unpointed. Others w/ points and 
cantillation. All are NFC.

IMHO, I think it is important to document whether an analyzer works with 
NFC, NFD or whatever. And leave it to the program to normalize to that form.
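
A tiny sketch of the "let the program normalize" side of that contract, assuming the analyzer documents NFC as its expected form and a Java 6 runtime; the class name is hypothetical.

    import java.text.Normalizer;

    public final class NfcPreparer {

        // Normalize document and query text to NFC before handing it to the analyzer.
        public static String toNfc(String raw) {
            return Normalizer.isNormalized(raw, Normalizer.Form.NFC)
                    ? raw
                    : Normalizer.normalize(raw, Normalizer.Form.NFC);
        }
    }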



Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by Robert Muir <rc...@gmail.com>.
Can you provide any more information on your use case? I had originally
imagined MH, ktiv male spelling only, but your use case is interesting.

Are you currently indexing Biblical Hebrew text? Dotted or undotted?



-- 
Robert Muir
rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by DM Smith <dm...@gmail.com>.
On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:

> cool. is there interest in similar basic functionality for Hebrew?

I'm interested, as I use Lucene for biblical research.



Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by Robert Muir <rc...@gmail.com>.
Cool. Is there interest in similar basic functionality for Hebrew?

The same rules apply: without using GPL data (i.e. Hspell data) you can't do it
right, but you can do a lot of the common stuff just like Arabic.
Tokenization is a tad more complex, and out-of-the-box Western behavior is
probably annoying at the least (splitting words on punctuation where it
shouldn't, etc.).

Robert



-- 
Robert Muir
rcmuir@gmail.com

[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634838#action_12634838 ] 

Grant Ingersoll commented on LUCENE-1406:
-----------------------------------------

Very cool.  I've used a modified version of light8 before, and it is indeed pretty good.

Can you provide it as a patch?  Also, unit tests would be good.  See the How To Contribute section on the Lucene wiki: http://wiki.apache.org/lucene-java/HowToContribute
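
For reference, a minimal sketch of what such a unit test might look like, using JUnit 3 and the Lucene 2.x analysis API; the ArabicNormalizationFilter constructor shape and the expected output (hamza seated on alef folded to a bare alef, per the normalization described in the issue) are assumptions rather than assertions taken from the actual patch.

    import java.io.StringReader;

    import junit.framework.TestCase;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class TestArabicNormalizationSketch extends TestCase {

        public void testHamzaSeatedOnAlefIsNormalized() throws Exception {
            // "\u0623\u062D\u0645\u062F" (Ahmad) spelled with alef-hamza-above.
            TokenStream ts = new ArabicNormalizationFilter(
                    new WhitespaceTokenizer(new StringReader("\u0623\u062D\u0645\u062F")));
            Token token = ts.next();
            assertNotNull(token);
            // Expect the hamza seat to be normalized to a bare alef.
            assertEquals("\u0627\u062D\u0645\u062F", token.termText());
            assertNull(ts.next());
        }
    }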





[jira] Assigned: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-1406:
---------------------------------------

    Assignee: Grant Ingersoll



[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634944#action_12634944 ] 

Robert Muir commented on LUCENE-1406:
-------------------------------------

Thought I would add the following comments.

I tried to stick to basics to start. Some things that kept bugging me, just for the record:

1) The stemming rules in many places only require the stemmed token to keep 2 characters. This seems incorrect... triliteral root, anyone? It seems too aggressive. Yet at the same time, many common "prefix"/suffix combinations are not stemmed by the light8 algorithm... But it's TREC-tested...

2) There is no decomposition of Unicode presentation forms. These characters show up (typically when text is extracted from PDFs). The easiest way to deal with this is Unicode normalization, but that requires Java 6 or ICU.

3) There is no enhanced parsing. Academics typically index high-quality news text, but in less perfect text you often see runs of text without spaces between words when the characters do not join (to the human reader there is a space!). To really solve this you need a lot of special machinery, including morphological data, but you can partially solve some of the common cases by splitting words when you see 100%-certain cases such as a medial teh marbuta, medial alef maksura, double alef, etc. (a rough sketch of this kind of splitting follows below). I didn't do this because I wanted to keep it simple, but it's important; see here: http://papers.ldc.upenn.edu/COLING2004/Buckwalter_Arabic-orthography-morphology.pdf

4) It is simply a stemmer, but I read in the Lucene docs that it is possible to inject synonym-like information (multiple tokens for one word) and boost the score for certain ones. It seems like this would be better than simply stemming, at least indexing and boosting the normalized surface form for better precision. I'd want to set up TREC tests to actually measure this, though.
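
A rough sketch of the kind of heuristic splitting point 3 describes, in plain Java so it stays independent of any particular Lucene version. Teh marbuta and alef maksura can only end a word in standard orthography, so finding one in the middle of a token signals a missing space; the exact rule set here (and the omission of the double-alef and other cases) is an illustrative assumption.

    import java.util.ArrayList;
    import java.util.List;

    public final class RunOnWordSplitterSketch {

        private static final char TEH_MARBUTA = '\u0629';
        private static final char ALEF_MAKSURA = '\u0649';

        // Split a run-on token after word-final-only letters when more text follows.
        public static List<String> split(String token) {
            List<String> parts = new ArrayList<String>();
            int start = 0;
            for (int i = 0; i < token.length() - 1; i++) {
                char c = token.charAt(i);
                if (c == TEH_MARBUTA || c == ALEF_MAKSURA) {
                    parts.add(token.substring(start, i + 1));
                    start = i + 1;
                }
            }
            parts.add(token.substring(start));
            return parts;
        }
    }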




[jira] Updated: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1406:
--------------------------------

    Attachment:     (was: arabic.zip)



[jira] Resolved: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved LUCENE-1406.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.9
    Lucene Fields: [Patch Available]  (was: [Patch Available, New])



[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641075#action_12641075 ] 

Grant Ingersoll commented on LUCENE-1406:
-----------------------------------------

Committed revision 706342.

I made some small changes to reuse Tokens, added some comments to the stopwords list, and added to WordListLoader to accommodate this.

Thanks Robert!
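
For illustration, a plain-Java sketch of the comment-aware word list loading described here; the class and method names are hypothetical, and this is not the actual WordListLoader change.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.HashSet;
    import java.util.Set;

    public final class CommentAwareWordlist {

        // Read one word per line, skipping blanks and lines starting with the
        // comment marker (e.g. "#"), so the stopword file can carry notes.
        public static Set<String> load(Reader reader, String comment) throws IOException {
            Set<String> words = new HashSet<String>();
            BufferedReader in = new BufferedReader(reader);
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.length() == 0 || line.startsWith(comment)) {
                        continue;
                    }
                    words.add(line);
                }
            } finally {
                in.close();
            }
            return words;
        }
    }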



[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723 ] 

Grant Ingersoll commented on LUCENE-1406:
-----------------------------------------

I'll commit once 2.4 is released.



[jira] Updated: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1406:
--------------------------------

    Attachment: LUCENE-1406.patch

Attached is the patch.



[jira] Updated: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1406:
--------------------------------

    Attachment: arabic.zip

Attached is my implementation.
