Posted to user@uima.apache.org by Fabien POULARD <fa...@fabienpoulard.info> on 2010/09/08 15:00:34 UTC

Word tokenizer for French

Hi all,

In case some of you are interested, I've implemented a UIMA component for
word tokenization. It handles French texts better than the
WhitespaceTokenizer does.
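
To give a rough idea of what such an annotator can look like (this is only
an illustrative sketch, not the code from the repository below; the class
name, the Token annotation type and the regex are all assumptions), one can
extend JCasAnnotator_ImplBase and keep French elided clitics (l', d', qu',
...) as tokens of their own instead of gluing them to the following word,
which is exactly where whitespace splitting falls short:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

// Illustrative sketch only: the Token type is assumed to be declared in the
// component's type system (a subtype of uima.tcas.Annotation).
public class FrenchWordTokenizer extends JCasAnnotator_ImplBase {

  // Elided clitics (l', d', j', qu', ...) become tokens of their own,
  // then words (possibly hyphenated), numbers, and single symbols.
  private static final Pattern TOKEN = Pattern.compile(
      "(?i)(?:qu|[cdjlmnst])['\u2019]"   // elision: l', d', qu', ...
      + "|\\p{L}+(?:-\\p{L}+)*"          // words, possibly hyphenated
      + "|\\d+(?:[.,]\\d+)?"             // numbers like 3,14
      + "|\\S");                         // any other non-space character

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    Matcher m = TOKEN.matcher(jcas.getDocumentText());
    while (m.find()) {
      Token token = new Token(jcas, m.start(), m.end());
      token.addToIndexes();
    }
  }
}

With this kind of rule, "l'homme" comes out as two tokens ("l'" and "homme")
rather than one, and punctuation is separated from the words it sticks to.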

The details of the implementation are described on my blog [1] (in
French only, sorry), and I've opened a GitHub repository [2] for anyone
who would like to contribute or just use it.

[1] http://www.fabienpoulard.info/dotclear.php?post/2010/09/06/Un-rapide-tokeniseur-en-mots-pour-le-fran%C3%A7ais
[2] http://github.com/grdscarabe/uima-word-tokenizer

--
Fabien Poulard
LINA (UMR CNRS 6241) / Université de Nantes