Posted to dev@lucene.apache.org by Nadav Har'El <ny...@math.technion.ac.il> on 2008/10/01 22:02:40 UTC

Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

On Tue, Sep 30, 2008, Robert Muir wrote about "Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)":
> Thanks for the clarification. With this method the Arabic analyzer could
> lemmatize, not stem, using the Buckwalter dictionary, and things like broken
> plurals will work correctly.
> 
> I'm not sure yet if Hspell has this type of information, but it would at
> least be a better stem for Hebrew as well.

Indeed Hspell also has this information. You can see for example
http://www.cs.technion.ac.il/~danken/cgi-bin/hspell.cgi?text=%E4%F8%EB%E1%FA&ling=on
(but you'll need to be able to read Hebrew to understand what this means).

But one thing to remember is that if you use Hspell, or basically any other
dictionary, you are committing yourself to a particular vocabulary and a
particular spelling of it. If your stemmer comes across a word outside your
vocabulary, or spelled a bit differently, it won't know what to do with it.
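
To illustrate the point, here is a minimal sketch in Java of a purely
dictionary-driven lemmatizer. Everything in it is an assumption made for
illustration: the class name, the method names, and the tiny English word
list standing in for a real lexicon (it is not Hspell or Buckwalter data,
and it is not how either tool is implemented). It only shows that a lookup
either finds the exact spelling or has nothing to say about the word.

```java
import java.util.Map;
import java.util.Optional;

public class DictionaryLemmatizer {
    // Hypothetical toy dictionary: surface form -> lemma.
    // English stand-ins only; a real lexicon would be far larger.
    private static final Map<String, String> LEMMAS = Map.of(
        "children", "child",   // irregular plural, loosely analogous to a broken plural
        "mice", "mouse",
        "colour", "colour");   // the dictionary commits to one spelling

    // Returns the lemma if the exact spelling is in the dictionary.
    static Optional<String> lookup(String token) {
        return Optional.ofNullable(LEMMAS.get(token));
    }

    // One possible fallback (an assumption, not a recommendation):
    // keep the token unchanged when the dictionary does not know it.
    static String lemmatizeOrKeep(String token) {
        return lookup(token).orElse(token);
    }

    public static void main(String[] args) {
        for (String t : new String[] {"children", "colour", "color", "blogs"}) {
            String note = lookup(t).isPresent() ? "" : " (not in dictionary)";
            System.out.println(t + " -> " + lemmatizeOrKeep(t) + note);
        }
    }
}
```

In this sketch "color" and "blogs" come back unchanged: the first is a
spelling variant the dictionary does not list, the second is simply outside
its vocabulary. Whether to keep such tokens as-is or hand them to a
rule-based stemmer is a design choice the dictionary itself cannot make.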

This problem is particularly visible in Hebrew, because its unvowelled
spelling standard (defined by the Academy of the Hebrew Language) is
not very well known. When I was in school, twenty years ago, it wasn't
even mentioned, let alone taught! As a result, some words have a few spelling
variants in the wild, with each dictionary typically considering one correct
and the others misspellings.

-- 
Nadav Har'El                        |    Wednesday, Oct  1 2008, 3 Tishri 5769
IBM Haifa Research Lab              |-----------------------------------------
                                    |The two most common elements in the
http://nadav.harel.org.il           |universe are hydrogen and stupidity.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Posted by Grant Ingersoll <gs...@apache.org>.
Can we have the Hebrew discussion on another thread? FWIW, I do agree
it would be a good thing to add.

Thanks,
Grant



