Posted to java-user@lucene.apache.org by Mohamed Idrees <id...@planet-connections.net> on 2002/06/28 06:11:59 UTC

Internationalization - Arabic Language Support

Greetings to Everyone !

I understand from the Lucene FAQ and the forum discussion that Lucene supports
internationalization.  I am now curious to know how this is
implemented for different languages, in my case Arabic.

Hope someone can clarify and perhaps provide me with a sample demo.

Thanking you in advance.

iDriZ

Re: Internationalization - Arabic Language Support

Posted by Juan Diego Hernández Fonseca <jh...@pereira.gov.co>.

When you have any news about internationalization, please let me know.
I'm looking for a way to use Lucene with Spanish. Thanks.

Juan Diego Hernández

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

RE: Internationalization - Arabic Language Support

Posted by "Nader S. Henein" <ns...@bayt.net>.

I'm indexing Arabic in my index, and to make it searchable I had to switch
character sets (not fun). The problem lies in the weak standards surrounding
Arabic character sets: between ISO 8859-6, Windows-1256 and UTF-8 you can
have three different representations of the exact same thing. UTF-8 stores
Arabic in numeric form (the code that represents each letter), and the
Lucene analyzer isn't too friendly with numbers, especially if you use a
stemmer. The other two encodings differ from each other, but both come back
to the same results: Lucene views them as if they were European character
sets and tries to apply the same rules to them.

So take care when you're indexing Arabic. I only figured it out when I
started experimenting with different Unix charset settings while encoding,
because I have an Oracle DB that spits out the XML files on a Solaris OS
and then Lucene picks them up for encoding. Since my core application isn't
in Java, I also have to contend with two web servers: the main application
(AOLserver) and the search application (Lucene on Resin).

When trying to figure out encoding issues, you need to convert everything to
its simplest form and compare and contrast it as it passes through your
application.

Nader
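
A minimal sketch of the "convert everything to its simplest form" advice
above, assuming plain JDK classes only. It dumps the raw bytes of one Arabic
word under each of the three charsets mentioned, and shows that decoding
with the wrong charset does not give the original text back. The class and
method names (ArabicEncodingDemo, hex, decode) are made up for illustration,
and whether ISO-8859-6 and windows-1256 are installed depends on the Java
runtime, so the code checks before using them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ArabicEncodingDemo {

    // Hex dump of the bytes a string becomes under the given charset.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    // Decode raw bytes using the charset the caller believes they are in.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Sample Arabic word, written with escapes so the source file's own
        // encoding cannot interfere with the experiment.
        String word = "\u0637\u0627\u0644\u0628";

        for (String name : new String[] {"UTF-8", "ISO-8859-6", "windows-1256"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            Charset cs = Charset.forName(name);
            byte[] bytes = word.getBytes(cs);
            boolean roundTrips = decode(bytes, cs).equals(word);
            System.out.println(name + ": " + hex(word, cs)
                    + "  round-trips=" + roundTrips);
        }

        // Bytes written as UTF-8 but read as a European charset come back
        // as different characters, which is what the analyzer then sees.
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        String misread = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("misread equals original: " + misread.equals(word));
    }
}
```

Running the same dump at each hop (database export, XML file, indexer
input) is one way to see where the bytes stop matching.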

-----Original Message-----
From: W. Eliot Kimber [mailto:eliot@isogen.com]
Sent: Friday, June 28, 2002 6:59 PM
To: Lucene Users List
Subject: Re: Internationalization - Arabic Language Support


Re: Internationalization - Arabic Language Support

Posted by "W. Eliot Kimber" <el...@isogen.com>.

Peter Carlson wrote:

> The biggest part that is usually changed per language is the analyzer. This
> is the part of Lucene which transforms and breaks up a string into distinct
> terms. 

I have only the smallest understanding of Arabic as a language, but I
have done some work to implement back-of-the-book indexing of Arabic
(and other languages) for XSL/XSLT. Based on that experience, I think
that the main challenges in implementing an Arabic analyzer would be:

1. Understanding the stemming rules for Arabic. Our research into Arabic
collation revealed that the rules for how Arabic words are formed are not
nearly as simple as in English and other Western languages. At this
point we haven't stepped up to trying to implement (or find an
implementation for) Arabic stemming for collation (words are collated
first by their roots, which are not necessarily at the start of the
words, so simple lexical collation won't work for Arabic and I'm
assuming that full-text indexing by word roots would have the same
problem). So I don't know more than that the problem is hard, even for
native speakers of Arabic.

2. Handling different letter forms in queries--Semitic languages often
have different forms for the same abstract character for different
positions in a word: initial forms, final forms, and base forms. These
different forms have different Unicode code points (although initial and
final forms are identified as such in the Unicode database). Often a
word will be stored with the base forms but the presented word will be
transformed to use the appropriate initial or final form. This means,
for example, that cutting and pasting a word from, say, a PDF document
into a query might require rationalization of variant forms to base
forms before performing the search (assuming that the analyzer also
reduces all letters to their base forms for indexing).

3. Right-to-left entry of queries and presentation of results. Mixing
right-to-left data with left-to-right data can get pretty tricky at the
user interface level (it's not an issue at the data storage level, where
all characters are stored in order of occurrence regardless of
presentation direction). Good support for bidirectional input and
presentation is hit and miss at best. For example, we could not figure
out how to get Internet Explorer to correctly present mixed English and
Arabic where there were lots of special characters (as opposed to simple
flowed prose, which seems to work OK).  I would expect Arabic localized
Web browsers to handle input OK, but it might be hard to find GUI
toolkits that do it well.
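
For point 2 above, one possible way to reduce positional letter forms to
base letters is Unicode compatibility normalization (NFKC), which the JDK
exposes through java.text.Normalizer. This is only a sketch of that idea,
not something taken from the thread: NFKC is a substitute technique chosen
for the illustration, it also rewrites other compatibility characters
(ligatures, full-width forms), and the class and method names are
hypothetical. The same folding would have to be applied both when indexing
and when parsing the query.

```java
import java.text.Normalizer;

final class ArabicFormFolding {

    // Fold Arabic presentation forms (initial, final, isolated, medial)
    // to their base letters using NFKC compatibility normalization.
    static String foldPresentationForms(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // BEH initial form, ALEF final form, BEH isolated form, as they
        // might arrive when text is pasted from a rendered document.
        String pasted = "\uFE91\uFE8E\uFE8F";
        // The same word stored with base letters BEH, ALEF, BEH.
        String stored = "\u0628\u0627\u0628";

        String folded = foldPresentationForms(pasted);
        System.out.println("raw match:    " + pasted.equals(stored));
        System.out.println("folded match: " + folded.equals(stored));
    }
}
```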

IBM's ICU4J package, a collection of national language support utilities
and libraries, might offer some solutions to this problem but I have not
yet investigated its support for Arabic and similar languages (we used
it for its Thai word breaker, which would be needed to implement a Thai
analyzer for Lucene).

Cheers,

Eliot
-- 
W. Eliot Kimber, eliot@isogen.com
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139


Re: Internationalization - Arabic Language Support

Posted by Peter Carlson <ca...@bookandhammer.com>.

Lucene is an extensible API that allows for multiple language support.
However, some languages or styles of language are not yet supported
by the current code.

The biggest part that is usually changed per language is the analyzer. This
is the part of Lucene which transforms and breaks up a string into distinct
terms. These terms are then added to the index (as well as which documents
they are in). Currently there are analyzers for English and German style
languages (see org.apache.lucene.analysis.standard and
org.apache.lucene.analysis.de) as part of the source.
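
To make the analyzer's role concrete, here is a toy sketch in plain Java. It
deliberately does not use Lucene's classes: ToyIndex, analyze, add and
search are hypothetical names, and the splitting rule (runs of letters and
digits, lower-cased) is an assumption chosen for the illustration. It shows
a string being broken into terms and each term being recorded against the
documents that contain it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class ToyIndex {

    // term -> ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Stand-in for an analyzer: split on anything that is not a letter
    // or digit and lower-case what remains.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // Index a document: every term it yields points back to its id.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Analyze the query the same way and return the documents that
    // contain all of its terms.
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> hits = null;
        for (String term : analyze(query)) {
            SortedSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? new TreeSet<>() : hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "Hello World");
        index.add(2, "hello Lucene");
        System.out.println(index.search("HELLO"));
        System.out.println(index.search("hello lucene"));
    }
}
```

A rule this simple is where language differences start to matter: Arabic
vowel marks, for example, are combining marks rather than letters, so this
splitter would cut a vocalized word into pieces.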

Once you have created an analyzer (or found an existing one) that meets your
needs, Lucene will index the content and provide the search results.

If you do create another analyzer, please contribute it back to the project
so others may take advantage of your work.

I hope this helps.

--Peter
