Posted to java-user@lucene.apache.org by Dawid Weiss <da...@cs.put.poznan.pl> on 2011/06/14 11:30:53 UTC

German compound decomposition (native speakers: help needed).

First of all I should probably congratulate the Germans among us --
Dirk Nowitzki's outstanding performance during this year's NBA finals
will become part of the history of basketball. As a Pole, I admit I'm
really freaking jealous.

Now... back to the subject.

A number of people have expressed an interest in a decompounding
engine for German recently (we talked about it during Berlin
Buzzwords, among other occasions). I did some research on the subject
(even though I don't know the language):

- a few commercial products exist (usually paired with morphological
analyzers) and their quality seems to be very good; http://tagh.com is
one example;
- research papers on the subject also exist, including a project by
Torsten Marek that is readily available and uses FSTs to model the
probabilities of word links; unfortunately the evaluation data set
seems to be skewed and is not usable;
- Daniel Naber maintains the jWordSplitter project on SourceForge;
this is a greedy heuristic backed by a morphological (static)
dictionary and it works surprisingly well in practice (we cannot
measure the quality due to the lack of a proper evaluation data set
-- see below).

In the past few days I've played with a number of German word and
n-gram resources (Google n-grams, the dictionaries in languagetool
and jWordSplitter, the dewac corpus) and my gut feeling is that a
"perfect" solution is not possible, but something that works in the
large majority of cases is achievable with a heuristic much like the
one implemented in jWordSplitter. The advantage of this approach is
that we don't need a full-blown POS dictionary or deep contextual
disambiguation (and we can handle unknown words to some degree).
Disadvantage: there will be errors resulting from ambiguities and
improper assumptions.
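
To make the idea concrete, here is a minimal sketch of such a greedy,
dictionary-backed splitter with a handful of glue morphemes. This is
not the jWordSplitter algorithm or the code in the repository linked
below; the dictionary contents, the glue-morpheme set and the minimum
part length are all illustrative assumptions.

import java.util.*;

// Minimal sketch of a greedy, dictionary-backed compound splitter.
// Dictionary contents, glue morphemes and the minimum part length are
// illustrative assumptions, not the actual project code.
public class NaiveSplitter {
    // A few common German linking elements ("Fugenelemente").
    private static final String[] GLUE = {"", "s", "es", "n", "en", "er"};
    private static final int MIN_PART = 3;

    private final Set<String> dictionary;

    public NaiveSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    /** Returns one possible split, or the whole word if none is found. */
    public List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        if (split(word.toLowerCase(Locale.GERMAN), parts)) {
            return parts;
        }
        return Collections.singletonList(word);
    }

    private boolean split(String rest, List<String> parts) {
        if (rest.isEmpty()) {
            return true;
        }
        if (dictionary.contains(rest)) {
            parts.add(rest);
            return true;
        }
        // Greedy: try the longest dictionary prefix first, optionally
        // followed by a glue morpheme, then recurse on the remainder.
        for (int end = rest.length() - MIN_PART; end >= MIN_PART; end--) {
            String head = rest.substring(0, end);
            if (!dictionary.contains(head)) {
                continue;
            }
            String tail = rest.substring(end);
            for (String glue : GLUE) {
                if (tail.startsWith(glue)) {
                    List<String> sub = new ArrayList<>();
                    if (split(tail.substring(glue.length()), sub)) {
                        parts.add(head);
                        parts.addAll(sub);
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(
            Arrays.asList("bund", "verfassung", "gericht"));
        // -> [bund, verfassung, gericht] ("es" and "s" consumed as glue)
        System.out.println(new NaiveSplitter(dict).split("Bundesverfassungsgericht"));
    }
}

A real implementation would rank competing splits, normalize casing
and umlauts more carefully, and so on; this only shows the overall
shape of the heuristic.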

As a start I have (re)implemented a naive heuristic that splits
compounds based on a dictionary of surface forms and a predefined set
of glue morphemes (the dictionary is under CC BY-SA:
http://creativecommons.org/licenses/by-sa/3.0/, which seems to be
acceptable to Apache according to this page:
http://www.apache.org/legal/resolved.html#cc-sa). But in order to
develop and improve it further, we REALLY need a "gold standard"
file: something that includes known compound splits and serves as the
benchmark we refer to when trying new algorithms or ideas. And here
comes your part: if you are a speaker of German and would like to
help, you are more than welcome to. The project is currently hosted
on GitHub, here:

https://github.com/dweiss/compound-splitter

The 'test file' is in src/test/resources/test-compounds.utf8 and the
README contains instructions on adding new test cases. You can either
fork the project on GitHub or e-mail your compounds back to me,
whichever you prefer. I don't expect full consensus among humans as
to which splits are legitimate and which are invalid, so you can also
review or comment on the existing test cases. If you're looking for
inspiration on where to find compounds to tag/split, Google n-grams
is your friend. I added a google-ngrams.bycount file that lists
surface words with aggregated counts from roughly 1980 to 2008. Pick
a spot on that list and decompound, decompound :)
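
Once a gold standard exists, scoring a splitter against it is only a
few lines of code. The sketch below reuses the NaiveSplitter sketch
from above and assumes a hypothetical tab-separated line format
("compound<TAB>part1+part2+..."); the actual format of
test-compounds.utf8 is whatever the README describes, so adapt
accordingly. The dictionary file name is also just a placeholder.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Rough sketch of scoring a splitter against a gold-standard file.
// Assumes a *hypothetical* line format: "compound<TAB>part1+part2+...";
// the real test-compounds.utf8 format is described in the project README.
public class GoldStandardEval {
    public static void main(String[] args) throws IOException {
        // "dictionary.utf8" is a placeholder for whatever surface-form
        // dictionary the splitter is built from.
        NaiveSplitter splitter = new NaiveSplitter(new HashSet<>(
            Files.readAllLines(Paths.get("dictionary.utf8"), StandardCharsets.UTF_8)));
        int correct = 0, total = 0;
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;  // skip blanks and comments
            }
            String[] columns = line.split("\t");
            List<String> expected = Arrays.asList(columns[1].split("\\+"));
            List<String> actual = splitter.split(columns[0]);
            total++;
            if (expected.equals(actual)) {
                correct++;
            }
        }
        System.out.printf("exact-match accuracy: %.2f%% (%d/%d)%n",
            100.0 * correct / total, correct, total);
    }
}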

If you wish to do something else, there is another file called:

morphy-google-intersect.20000

and this one contains words from the Google n-grams that are not
present in morphy (the German dictionary we use for decompounding).
Lots of these are foreign words, but there is a fair share of German
words (and, hint, hint, compounds) that are simply newer or inflected
in weird ways.
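
In case anyone wants to reproduce or extend that list, the underlying
idea is just a set difference between two word lists. A sketch (the
file names are made up; the actual inputs behind
morphy-google-intersect.20000 may have been different):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Sketch: print words that appear in a word list but not in the dictionary.
// File names are made up; the actual inputs behind
// morphy-google-intersect.20000 may have been different.
public class UnknownWords {
    public static void main(String[] args) throws IOException {
        Set<String> morphy = new HashSet<>(
            Files.readAllLines(Paths.get("morphy-surface-forms.utf8"),
                StandardCharsets.UTF_8));
        for (String word : Files.readAllLines(
                Paths.get("google-ngrams.words"), StandardCharsets.UTF_8)) {
            if (!morphy.contains(word)) {
                System.out.println(word);  // candidate new word / compound
            }
        }
    }
}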

Let's see where we can take this.
Dawid
