Posted to java-user@lucene.apache.org by Peter Pimley <pp...@semantico.com> on 2005/02/01 11:25:28 UTC

Source code for an accent-removal filter

Hi.

In December I made some posts concerning a filter that could work by
getting the Unicode name of a character and trying to figure out the
closest Latin equivalent.  For example, if it encountered U+00C1
LATIN CAPITAL LETTER A WITH ACUTE, it would be clever enough to replace
that with a regular 'A'.
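
To make the idea concrete, here is a minimal Java sketch of the name-based
lookup.  It is not the generated filter from this post: it assumes a Java 7
or later runtime, where Character.getName() returns the Unicode name
directly, and the names NameBasedFolding and fold are made up for
illustration.  The regular expression is modelled on the pattern in the Perl
script further down.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the generated filter.
final class NameBasedFolding {

    // Case word, a single letter, then at least one more word
    // (for example " WITH ACUTE").
    private static final Pattern LATIN =
        Pattern.compile("LATIN (\\S+) LETTER (.)( .*)");

    // Returns the plain letter suggested by the character's Unicode name,
    // or the character itself when the name does not fit the pattern.
    public static char fold(char c) {
        String name = Character.getName(c);  // null for unassigned code points
        if (name == null) return c;

        Matcher m = LATIN.matcher(name);
        if (!m.matches()) return c;          // e.g. plain "LATIN SMALL LETTER X"

        char letter = m.group(2).charAt(0);
        boolean small = m.group(1).equals("SMALL");
        return small ? Character.toLowerCase(letter) : letter;
    }

    public static void main(String[] args) {
        System.out.println(fold('\u00C1'));
        System.out.println(fold('\u00E9'));
        System.out.println(fold('x'));
    }
}
```

Unaccented letters fall through unchanged because their names end right
after the letter, so there is nothing for the last group to match.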

I got moved onto another project for a while so I've not looked at the 
problem much since then.  I'm back on it for a few days now though :)

The following Perl program generates some Java source for a filter that
carries out the above task.

Get 'UnicodeData.txt' from www.unicode.org, and then do the following:
    perl make_accent_filter.pl make.this.java.Class < UnicodeData.txt
to generate make/this/java/Class.java

This comes with no license and no warranty  ;)

Don't treat this as the full solution to your Unicode-mangling
problems.  I'm using it as a last-resort catch-all after some other
filters that use the IBM ICU4J library to do all sorts of decomposition
and character-category magic.  Once I get it all working I should be
able to post some pointers and code snippets up here.
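
For readers unfamiliar with the decomposition step, here is a rough Java
sketch of the general technique.  It is an assumption-laden stand-in, not
the ICU4J filters mentioned above: it uses the JDK's own
java.text.Normalizer (Java 6 and later), and the names DecomposeFolding and
stripMarks are invented for the example.

```java
import java.text.Normalizer;

// Hypothetical illustration class; uses the JDK Normalizer, not ICU4J.
final class DecomposeFolding {

    // Decompose to NFD so accents become separate combining marks,
    // then drop every character in Unicode category M (marks).
    public static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("caf\u00E9"));
        System.out.println(stripMarks("S\u00F8ren"));
    }
}
```

Decomposition alone does not cover everything.  U+00F8 LATIN SMALL LETTER O
WITH STROKE has no canonical decomposition, so it passes through unchanged,
which is the kind of case a name-based catch-all can pick up.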

Peter

---8<-------

# usage:  perl make_accent_filter.pl my.full.ClassName < UnicodeData.txt
#
# creates my/full/ClassName.java

use strict;
use warnings;

use File::Path;
use File::Basename;



# decompose the classname that they gave us.
#
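# TODO: support class names with no dots (i.e. not in a package); for now
# we stop with an error instead of generating broken Java.
my $full_class = shift
    or die "usage: perl make_accent_filter.pl my.full.ClassName < UnicodeData.txt\n";
my ($package, $class) = $full_class =~ /^(.*)\.(.*)$/
    or die "Class name '$full_class' must include a package\n";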


# print to the correct place
my $path = $full_class;
$path =~ s/\./\//g;
$path = "$path.java";
mkpath dirname $path;
open STDOUT, '>', $path or die "Could not redirect stdout to $path: $!";




print <<END_JAVA;


// THIS FILE WAS AUTOGENERATED BY make_accent_filter.pl, DO NOT EDIT BY HAND.


package $package;

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;


public class $class extends TokenFilter {


    public $class (TokenStream input) {
        super (input);
        createHash();
    }


    // The replacement character, indexed by unicode value.
    // (i.e. Character objects indexed by Integer objects)
    private static Hashtable values = null;


    // Creates a Hashtable from the array at the bottom of this file.
    private void createHash () {
        // only run this for the first object of this class
        if (values != null) return;
        values = new Hashtable ();

        int i = 0;
        while (true) {
            if (array[i] == null) break; // 'array' is null terminated.

            Object number = array[i++];
            Object replacement = array[i++];

            values.put (number, replacement);
        }

        // we're done with 'array', it can be garbage collected
        array = null;
    }


    public Token next () throws IOException {
        Token t = input.next ();
        if (t == null) return null; // eof

        String s = t.termText();
        s = substituteAZString (s);

        return new Token (s, t.startOffset(), t.endOffset());
    }


    private String substituteAZString (String s) {

        char [] current = s.toCharArray ();
        char [] AZ = new char [current.length];
        int AZi = 0;

        for (int i=0; i<current.length; i++) {
            AZ[AZi++] = substituteAZChar (current[i]);
        }

        s = new String (AZ);
        return s;
    }



    private char substituteAZChar (char c) {
        Integer key = new Integer ((int) c);
        if (values.containsKey(key)) {
            c = ((Character)values.get(key)).charValue();
        }
        return c;
    }


    private static Object [] array = {
END_JAVA




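# we only care about characters whose names are of the form
# "LATIN <case> LETTER <single letter> <modifiers>".  The pattern is anchored
# and <case> is limited to one word, so a name such as
# LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON maps to 'D'.
my $latin_pattern = '^LATIN (\S+) LETTER (.)( .*)$';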

while (<STDIN>) {
    my @parts = split ";";

    my $num  = shift @parts;
    my $name = shift @parts;

    my @matches;

    if (@matches = ($name =~ $latin_pattern)) {

        my $case = shift @matches;
        my $convert_to_lc = $case eq "SMALL";

        my $letter = shift @matches;
        $letter = lc $letter if $convert_to_lc;

        printf "    new Integer (0x%s), new Character ('%s'), // %s\n",
            $num, $letter, $name;
    }
}


print <<END_JAVA;
    null };
}
END_JAVA

