Posted to java-user@lucene.apache.org by Peter Pimley <pp...@semantico.com> on 2005/02/01 11:25:28 UTC
Source code for an accent-removal filter
Hi.
In December I made some posts concerning a filter that could work by
getting the Unicode name of a character and trying to figure out the
closest Latin equivalent. For example, if it encountered character U+00C1
LATIN CAPITAL LETTER A WITH ACUTE, it would be clever enough to replace
that with a regular 'A'.
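To make the idea concrete, here is a minimal sketch of the name-based lookup done at runtime rather than from a generated table. It assumes Java 7 or later for Character.getName; the class and method names (NameBasedFolding, baseLetter) are made up for illustration and are not part of the filter posted below.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameBasedFolding {
    // Case word, single base letter, then a trailing qualifier such as " WITH ACUTE".
    private static final Pattern LATIN =
        Pattern.compile("LATIN (.*) LETTER (.)( .*)$");

    // Returns the plain letter suggested by the character's Unicode name,
    // or the character itself when the name does not fit the pattern.
    static char baseLetter(char c) {
        String name = Character.getName(c); // null for unassigned code points
        if (name == null) return c;
        Matcher m = LATIN.matcher(name);
        if (!m.find()) return c;
        char letter = m.group(2).charAt(0);
        return m.group(1).equals("SMALL") ? Character.toLowerCase(letter) : letter;
    }

    public static void main(String[] args) {
        System.out.println(baseLetter('\u00C1'));
        System.out.println(baseLetter('\u00E9'));
    }
}
```

Plain letters such as 'A' pass through unchanged, because their names have no trailing qualifier for the pattern to match.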
I got moved onto another project for a while so I've not looked at the
problem much since then. I'm back on it for a few days now though :)
The following Perl program generates Java source for a filter that
carries out the above task.
Get 'UnicodeData.txt' from www.unicode.org, and then run:
    perl make_accent_filter.pl make.this.java.Class < UnicodeData.txt
to generate make/this/java/Class.java
This comes with no license and no warranty ;)
Don't treat this as the full solution to your Unicode-mangling
problems. I'm using it as a last-resort catch-all after some other
filters that use the IBM ICU4J library to do all sorts of decomposition
and character-category magic. Once I get it all working I should be
able to post some pointers and code snippets up here.
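As a rough illustration of why a catch-all is still useful after decomposition, here is a sketch that uses the JDK's java.text.Normalizer (Java 6+) as a stand-in for ICU4J. It is not the ICU4J-based filter chain described above, and the class and method names are hypothetical.

```java
import java.text.Normalizer;

class DecomposeThenStrip {
    // Decompose to NFD, then drop combining marks (Unicode general category M).
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("caf\u00E9"));
        System.out.println(stripMarks("S\u00F8ren"));
    }
}
```

Characters like U+00F8 LATIN SMALL LETTER O WITH STROKE have no canonical decomposition, so they survive this step untouched; those are the cases a name-based table can still catch.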
Peter
---8<-------
# usage: perl make_accent_filter.pl my.full.ClassName < UnicodeData.txt
#
# creates my/full/ClassName.java
use strict;
use warnings;
use File::Path;
use File::Basename;
# decompose the classname that they gave us.
#
# TODO: this doesn't work if the classname has no dots (i.e. it's not in a
# package)
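my $full_class = shift
    or die "usage: perl make_accent_filter.pl my.full.ClassName < UnicodeData.txt\n";
# fail loudly instead of generating a broken file when there is no package
my ($package, $class) = $full_class =~ /^(.*)\.(.*)$/
    or die "Class name must include a package, e.g. my.full.ClassName\n";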
# print to the correct place
my $path = $full_class;
$path =~ s/\./\//g;
$path = "$path.java";
mkpath dirname $path;
open STDOUT, '>', $path or die "Could not redirect stdout to $path: $!";
print <<END_JAVA;
// THIS FILE WAS AUTOGENERATED BY make_accent_filter.pl, DO NOT EDIT BY HAND.

package $package;

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;

public class $class extends TokenFilter {

    public $class (TokenStream input) {
        super (input);
        createHash();
    }

    // The replacement character, indexed by Unicode value.
    // (i.e. Character objects indexed by Integer objects)
    private static Hashtable values = null;

    // Creates a Hashtable from the array at the bottom of this file.
    private void createHash () {
        // only run this for the first object of this class
        if (values != null) return;

        values = new Hashtable ();
        int i = 0;
        while (true) {
            if (array[i] == null) break; // 'array' is null terminated.
            Object number = array[i++];
            Object replacement = array[i++];
            values.put (number, replacement);
        }

        // we're done with 'array', it can be garbage collected
        array = null;
    }
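
    public Token next () throws IOException {
        Token t = input.next ();
        if (t == null) return null; // eof

        String s = t.termText();
        s = substituteAZString (s);

        // keep the original token's type and position increment
        Token result = new Token (s, t.startOffset(), t.endOffset(), t.type());
        result.setPositionIncrement (t.getPositionIncrement());
        return result;
    }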

    // Replaces every character of s that has an entry in 'values'.
    private String substituteAZString (String s) {
        char [] current = s.toCharArray ();
        char [] AZ = new char [current.length];
        int AZi = 0;
        for (int i=0; i<current.length; i++) {
            AZ[AZi++] = substituteAZChar (current[i]);
        }
        s = new String (AZ);
        return s;
    }

    // Returns the replacement for c, or c itself if there is none.
    private char substituteAZChar (char c) {
        Integer key = new Integer ((int) c);
        if (values.containsKey(key)) {
            c = ((Character)values.get(key)).charValue();
        }
        return c;
    }

    private static Object [] array = {
END_JAVA
# we only care about characters whose names are of the form:
my $latin_pattern = 'LATIN (.*) LETTER (.)( .*)$';

while (<STDIN>) {
    my @parts = split ";";
    my $num = shift @parts;
    my $name = shift @parts;

    my @matches;
    if (@matches = ($name =~ $latin_pattern)) {
        my $case = shift @matches;
        my $convert_to_lc = $case eq "SMALL";
        my $letter = shift @matches;
        $letter = lc $letter if $convert_to_lc;
        printf "        new Integer (0x%s), new Character ('%s'), // %s\n",
            $num, $letter, $name;
    }
}

print <<END_JAVA;
        null };
}
END_JAVA
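
For anyone who wants to try the substitution step without Lucene on the classpath, here is a small standalone sketch of the same table lookup. The three entries are a hand-written excerpt in the style of the generated table, and the class name TableLookupSketch is invented for this example. It uses a generic HashMap where the generated code uses Hashtable.

```java
import java.util.HashMap;
import java.util.Map;

class TableLookupSketch {
    // Hypothetical excerpt; the real table comes from running the Perl script.
    private static final Map<Character, Character> VALUES = new HashMap<>();
    static {
        VALUES.put('\u00C1', 'A'); // LATIN CAPITAL LETTER A WITH ACUTE
        VALUES.put('\u00E9', 'e'); // LATIN SMALL LETTER E WITH ACUTE
        VALUES.put('\u00F8', 'o'); // LATIN SMALL LETTER O WITH STROKE
    }

    // Replaces each character that has a table entry and leaves the rest alone.
    static String substitute(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            Character replacement = VALUES.get(out[i]);
            if (replacement != null) out[i] = replacement;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(substitute("S\u00F8ren"));
    }
}
```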