Posted to users@opennlp.apache.org by Jörn Kottmann <ko...@gmail.com> on 2012/03/27 17:13:13 UTC

Coreference Training on MUC 6/7 data

Hi all,

I would like to figure out how the coref component can be trained
on MUC 6 and 7 data. Does anybody know how to do that?

After searching for information in the forum and doing quite a bit of
reverse engineering, I think the process is something like this:

1. Load the data via the MUC plugin into WordFreak.
     Getting WordFreak to work is a bit tricky: there seem to be a few
jar files that all have "2.2" in the name but are quite different. I
now use a self-compiled head version.
2. Perform named entity recognition via the OpenNLP plugin (I use it
with OpenNLP 1.4.3).
3. Do chunking or parsing (parsing still causes a stack overflow in my
setup, so I only did chunking).
4. Save the file to disk (make sure it is named correctly; WordFreak
appends a ".txt" which must be removed).
5. Do the training with the OpenNLP coref WordFreak plugin via its main method.
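For step 4, the ".txt" stripping can be scripted instead of done by hand. A minimal sketch, assuming only that WordFreak appends a literal ".txt" to the filename you chose on save:

```python
import os

def strip_txt_suffix(path):
    """Remove the trailing '.txt' that WordFreak appends on save,
    renaming the file on disk and returning the corrected path."""
    if path.endswith(".txt"):
        fixed = path[: -len(".txt")]
        os.rename(path, fixed)
        return fixed
    # Already correctly named; nothing to do.
    return path
```

Run it over the saved annotation files before pointing the coref trainer at them.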

But I still have a couple of issues.

WordFreak saves the linked mentions as "mention" annotations, which the
coref code then cannot retrieve (it only looks for noun phrases, and a
mention is not a noun phrase).
I am not sure how this is supposed to work: do I have to write some code
to merge the mentions with the added noun phrases, or is there some kind
of trick I don't know yet?
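In case merging really is necessary, the idea I have in mind looks roughly like this. This is only a sketch of the merge logic on plain (start, end) character-offset spans; the function name and span representation are my own assumptions, not WordFreak's or OpenNLP's actual annotation API:

```python
def merge_mentions_into_nps(mentions, noun_phrases):
    """For every 'mention' span that has no noun phrase with the exact
    same extent, add a synthetic noun-phrase span, so that code which
    only looks for noun phrases can still see the mention.
    Spans are (start, end) offset tuples."""
    np_spans = set(noun_phrases)
    merged = list(noun_phrases)
    for span in mentions:
        if span not in np_spans:
            merged.append(span)
            np_spans.add(span)
    return sorted(merged)
```

A real version would of course have to create proper annotation objects and preserve the coref link IDs, not just the spans.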

Parsing in wordfreak does not work because of a stack overflow.

It also looks like there is no utility to do the actual coref resolution
when only a shallow parse was used for training.

Any hints are very welcome.

Jörn