Posted to java-user@lucene.apache.org by John Haxby <jc...@scalix.com> on 2006/01/26 12:01:16 UTC

Re: encoding

arnaudbuffet wrote:

>For text files, the data could be in different languages and therefore
>different encodings. If the data are in Turkish, for example, the special
>characters and accents are not recognized in my Lucene index. Is there a
>way to resolve the problem? How do I work with the encoding?
>  
>
I've been looking at a similar problem recently. There's 
org.apache.lucene.analysis.ISOLatin1AccentFilter on the svn trunk which 
may be quite close to what you want. I have a perl script here that I 
used to generate a downgrading table for a C program. I can let you have 
the perl script as is, but if there's enough interest(*) I'll use it to 
generate, say, CompoundAsciiFilter since it converts compound characters 
like á, æ, ﬃ (ffi-ligature, in case it doesn't display) to a, ae and 
ffi. It's actually built from 
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt so it winds up 
having nearly 1200 entries. An earlier version converted all compound 
characters to their constituent parts, but this version just converts 
characters that are made up entirely of ASCII and modifiers.

jch

(*) Any interest, actually. Might be enough for me to be interested.
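
A minimal sketch of the same downgrading idea in Java, assuming 
java.text.Normalizer (Java 6) rather than the perl-generated table; the 
class and method names here are made up, and characters with no Unicode 
decomposition at all -- æ and the dotless ı, for instance -- are exactly 
the ones this misses and the name-derived table catches:

    import java.text.Normalizer;

    public class CompoundAsciiSketch {
        // NFKD splits á into a + combining acute and the ffi ligature
        // into f + f + i; keeping only ASCII code units then discards
        // the combining marks.
        public static String fold(String s) {
            String d = Normalizer.normalize(s, Normalizer.Form.NFKD);
            StringBuilder out = new StringBuilder(d.length());
            for (int i = 0; i < d.length(); i++) {
                char c = d.charAt(i);
                // æ, ı and friends have no decomposition and are
                // silently dropped here; a real filter would fall
                // back to a name-derived table for them
                if (c < 128) out.append(c);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(fold("düzenlediğimiz")); // duzenledigimiz
        }
    }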



Re: encoding

Posted by petite_abeille <pe...@mac.com>.
Hello,

On Jan 27, 2006, at 11:44, John Haxby wrote:

> I've attached the perl script -- feed 
> http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it.

Thanks! Works great!

>   It's based on a slightly different principle to yours.   You seem to 
> look for things like "mumble mumble LETTER X mumble" and take "X" as 
> the base letter.

Yes, here is the mumbling algorithm in its full glory: aLetter = 
aLine:match( ".+%s(%u)%U.*" )
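
A rough Java equivalent of that one-liner, for anyone following along 
without Lua -- the class and method names are invented, and the input is 
assumed to be the name field of a UnicodeData.txt line:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MumbleMatch {
        // Like the Lua pattern ".+%s(%u)%U.*": capture a lone upper-case
        // letter (whitespace before it, a non-upper-case character after
        // it); the greedy ".+" makes it the last such letter in the line.
        private static final Pattern BASE =
            Pattern.compile(".+\\s(\\p{Lu})\\P{Lu}.*");

        public static String baseLetter(String name) {
            Matcher m = BASE.matcher(name);
            return m.matches() ? m.group(1) : null;
        }

        public static void main(String[] args) {
            System.out.println(baseLetter("LATIN SMALL LETTER D WITH HOOK")); // D
            // "LATIN CAPITAL LETTER AE WITH MACRON" has no lone letter, so
            // this prints null -- the case the decomposition approach covers
            System.out.println(baseLetter("LATIN CAPITAL LETTER AE WITH MACRON"));
        }
    }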

>   That means that, for example, ɖ (a "d" with a hook) gets converted 
> to "d".   My script, on the other hand, deals with things like "Ǣ" 
> (LATIN CAPITAL LETTER AE WITH MACRON) and converts it to AE.   There 
> are some differences of opinion, though: you have ß mapped to "s" 
> whereas I have "ss" ("straße" to "strasse" instead of "strase" seems 
> right).  I think I'm also over-enthusiastic when it comes to mapping 
> characters to spaces: I know that there are some arabic characters 
> that get mapped to spaces.   For the purposes of converting to an 
> ASCII approximation, though, I suspect a combination of your approach 
> and mine would be best.   What do you think?

Overall, I much prefer your approach. Here is the updated Lua table 
derived from your handy perl script:

http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt

You also mentioned a full Unicode to ASCII 
transliteration/transcription module of some sort. Is it something you 
would like to share as well? :))

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/




Re: encoding

Posted by John Haxby <jc...@scalix.com>.
petite_abeille wrote:

> I would love to see this. I presently have a somewhat unwieldy 
> conversion table [1] that I would love to get rid of :))
> [snip]
> [1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt

I've attached the perl script -- feed 
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it.   It's 
based on a slightly different principle to yours.   You seem to look for 
things like "mumble mumble LETTER X mumble" and take "X" as the base 
letter.   That means that, for example, ɖ (a "d" with a hook) gets 
converted to "d".   My script, on the other hand, deals with things like 
"Ǣ" (LATIN CAPITAL LETTER AE WITH MACRON) and converts it to AE.   There 
are some differences of opinion, though: you have ß mapped to "s" whereas 
I have "ss" ("straße" to "strasse" instead of "strase" seems right).  I 
think I'm also over-enthusiastic when it comes to mapping characters to 
spaces: I know that there are some arabic characters that get mapped to 
spaces.   For the purposes of converting to an ASCII approximation, 
though, I suspect a combination of your approach and mine would be 
best.   What do you think?

Of course, it's still unwieldy -- the code uses a huge great switch 
statement.   It would be more aesthetically pleasing to have a class 
representing UnicodeData.txt and work out the mapping on the fly.   IBM's 
ICU library has some Unicode code that deals with decomposition and uses 
a similar algorithm (I think) to the one I use.   The standard 
java.lang.Character class has everything except the decompositions needed 
to implement in Java what I do in perl; generating a map of 
decompositions isn't difficult, though.   However, I doubt whether the 
reduction in code size would make it run faster, and looking at the name 
of the letter to determine the nearest ASCII equivalent is certainly 
going to be slow.

jch
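
Generating that map is indeed straightforward. A sketch, assuming 
UnicodeData.txt arrives on stdin and ignoring the recursion needed when a 
decomposition itself decomposes further:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    public class DecompositionMap {
        public static void main(String[] args) throws Exception {
            Map<Integer, int[]> decomp = new HashMap<Integer, int[]>();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, "US-ASCII"));
            String line;
            while ((line = in.readLine()) != null) {
                // UnicodeData.txt fields are ';'-separated: field 0 is
                // the code point in hex, field 5 the decomposition,
                // e.g. "<compat> 0066 0066 0069" for the ffi ligature
                String[] f = line.split(";", -1);
                if (f.length < 6 || f[5].length() == 0) continue;
                String d = f[5].replaceFirst("^<[^>]*>\\s*", ""); // drop tag
                String[] parts = d.split("\\s+");
                int[] cps = new int[parts.length];
                for (int i = 0; i < parts.length; i++)
                    cps[i] = Integer.parseInt(parts[i], 16);
                decomp.put(Integer.parseInt(f[0], 16), cps);
            }
            System.out.println(decomp.size() + " decompositions");
        }
    }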


Re: encoding

Posted by petite_abeille <pe...@mac.com>.
Hello,

On Jan 26, 2006, at 12:01, John Haxby wrote:

> I have a perl script here that I used to generate a downgrading table 
> for a C program. I can let you have the perl script as is, but if 
> there's enough interest(*) I'll use it to generate, say, 
> CompoundAsciiFilter since it converts compound characters like á, æ, ﬃ 
> (ffi-ligature, in case it doesn't display) to a, ae and ffi. It's 
> actually built from 
> http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt so it winds up 
> having nearly 1200 entries. An earlier version converted all compound 
> characters to their constituent parts, but this version just converts 
> characters that are made up entirely of ASCII and modifiers.

I would love to see this. I presently have a somewhat unwieldy 
conversion table [1] that I would love to get rid of :))

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/

[1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt







Re: encoding

Posted by John Haxby <jc...@scalix.com>.
arnaudbuffet wrote:

>if I try to index a text file encoded in Western 1252, for example, with the Turkish text "düzenlediğimiz kampanyamıza", the Lucene index will contain re-encoded data like &#0;&#17;k&#0;&#0; ....
>  
>
ISOLatin1AccentFilter.removeAccents() converts that string to
"duzenlediğimiz kampanyamıza"; the g-breve and the dotless-i are
untouched. My AsciiDecomposeFilter.decompose() converts the string to
"duzenledigimiz kampanyamiza".

However, since you're seeing those rather odd entities, it looks as
though you're not actually indexing what you think you're indexing. As
Erik says, you need to make sure that you're reading files with the
proper encoding; removing accents and adding dots won't help.

jch
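
For anyone without either filter to hand, a rough stand-in -- this is 
java.text.Normalizer (Java 6) plus one hand-written mapping, not John's 
AsciiDecomposeFilter; the dotless ı needs the special case because it has 
no Unicode decomposition:

    import java.text.Normalizer;

    public class TurkishFoldDemo {
        public static void main(String[] args) {
            String s = "düzenlediğimiz kampanyamıza";
            // NFD splits ü and ğ into base letter + combining mark,
            // which \p{M} then strips; ı is mapped by hand first
            String folded = Normalizer.normalize(s.replace('ı', 'i'),
                                                 Normalizer.Form.NFD)
                                      .replaceAll("\\p{M}+", "");
            System.out.println(folded); // duzenledigimiz kampanyamiza
        }
    }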





Re: RE : encoding

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 26, 2006, at 7:26 PM, arnaudbuffet wrote:
> I do not find the ISOLatin1AccentFilter class in my lucene jar, but  
> I found one on Google, attached to this mail; could you tell me if  
> it is the right one?

This used to be in contrib/analyzers but has been moved into the core  
(Subversion only for now):

	http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/

> I do not see anything in this class that can help me. This class  
> will replace some accented characters, but my problem is:
>
> if I try to index a text file encoded in Western 1252, for example,  
> with the Turkish text "düzenlediğimiz kampanyamıza", the Lucene  
> index will contain re-encoded data like &#0;&#17;k&#0;&#0; ....

Reading encoded files is your application's responsibility.  You need  
to be sure to read the files in using the proper encoding.  Once the  
text is read properly into Java, all will be well as far as Lucene  
indexing the characters goes.

	Erik
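
A minimal sketch of that responsibility -- the file name is hypothetical, 
and note that Turkish text containing ğ and ı cannot really be stored in 
Windows-1252 at all; such a file is more likely Windows-1254, the Turkish 
code page, which is what the sketch names:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public class ReadWithCharset {
        public static void main(String[] args) throws Exception {
            // Name the encoding the file was actually written in rather
            // than relying on the platform default; the Reader then hands
            // correctly decoded Unicode to whatever indexes it.
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("turkish.txt"), "windows-1254"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line); // ğ, ı and ü survive intact
            }
            in.close();
        }
    }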




RE : encoding

Posted by arnaudbuffet <ar...@yahoo.fr>.
Hello and thanks for your answer.

I do not find the ISOLatin1AccentFilter class in my lucene jar, but I found one on Google, attached to this mail; could you tell me if it is the right one?

I do not see anything in this class that can help me. This class will replace some accented characters, but my problem is:

if I try to index a text file encoded in Western 1252, for example, with the Turkish text "düzenlediğimiz kampanyamıza", the Lucene index will contain re-encoded data like &#0;&#17;k&#0;&#0; ....

Thanks & regards

A.

-----Original Message-----
From: John Haxby [mailto:jch@scalix.com] 
Sent: Thursday, 26 January 2006 03:01
To: java-user@lucene.apache.org
Subject: Re: encoding

arnaudbuffet wrote:

>For text files, the data could be in different languages and therefore
>different encodings. If the data are in Turkish, for example, the special
>characters and accents are not recognized in my Lucene index. Is there a
>way to resolve the problem? How do I work with the encoding?
>  
>
I've been looking at a similar problem recently. There's 
org.apache.lucene.analysis.ISOLatin1AccentFilter on the svn trunk which 
may be quite close to what you want. I have a perl script here that I 
used to generate a downgrading table for a C program. I can let you have 
the perl script as is, but if there's enough interest(*) I'll use it to 
generate, say, CompoundAsciiFilter since it converts compound characters 
like á, æ, ﬃ (ffi-ligature, in case it doesn't display) to a, ae and 
ffi. It's actually built from 
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt so it winds up 
having nearly 1200 entries. An earlier version converted all compound 
characters to their constituent parts, but this version just converts 
characters that are made up entirely of ASCII and modifiers.

jch

(*) Any interest, actually. Might be enough for me to be interested.
