Posted to users@opennlp.apache.org by Andreas Niekler <an...@informatik.uni-leipzig.de> on 2014/03/24 13:02:56 UTC

Parser Training

Hello,

I noticed in some recent discussions that parser training has to be
customized for different languages with regard to head rules and
punctuation markers.

My question, before I open a Jira issue: does all this per-language
customisation make sense, given that the real differences come from
the different POS models? If I understand the code correctly, I have to
provide the punctuation types, but those depend entirely on the POS
model. In my case I use an STTS tagger, and punctuation is tagged with
$., $( or $, . Furthermore, the ( causes problems within the constituent
stack, so I need to encode these tags. Now the questions:

Would it be easier to just replace the punctuation tags where they are
hardcoded in the head rules class?
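
To illustrate the encoding problem mentioned above: a tag such as $(
contains a literal parenthesis, which clashes with the bracketed tree
format. A minimal sketch of one possible workaround follows, assuming the
Penn Treebank style escapes -LRB- and -RRB- are acceptable inside tag
names. SttsTagEncoder is a hypothetical helper, not an existing OpenNLP
class.

```java
/**
 * Hypothetical helper (not part of OpenNLP): escapes parentheses in
 * POS tags so that a tag like "$(" cannot be confused with the
 * brackets of a bracketed parse tree.
 */
public final class SttsTagEncoder {

    private SttsTagEncoder() {
        // static utility class, no instances
    }

    /** Replaces "(" and ")" with the Penn Treebank style escapes. */
    public static String encode(String tag) {
        return tag.replace("(", "-LRB-").replace(")", "-RRB-");
    }

    /** Reverses encode(), restoring the original tag. */
    public static String decode(String tag) {
        return tag.replace("-LRB-", "(").replace("-RRB-", ")");
    }
}
```

With this sketch the tags $. and $, pass through unchanged, and only $(
is rewritten before training and restored afterwards.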

Would it be better to refactor the head rules class so that we can
use two external files (one for the rules and one for the tagset, or
for the punctuation tags within the tagset)?
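
The second option could look roughly like the sketch below, assuming the
punctuation tags live in a plain text file with one tag per line.
PunctuationTagSet and the file layout are assumptions made for
illustration; nothing like this exists in OpenNLP as far as this mail is
concerned.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical holder for tagset-specific punctuation tags, filled
 * from external lines instead of constants hardcoded in the head
 * rules class.
 */
public final class PunctuationTagSet {

    private final Set<String> tags;

    private PunctuationTagSet(Set<String> tags) {
        this.tags = Collections.unmodifiableSet(tags);
    }

    /** Builds the set from lines of text, one tag per line; blank lines are skipped. */
    public static PunctuationTagSet fromLines(List<String> lines) {
        Set<String> tags = new LinkedHashSet<>();
        for (String line : lines) {
            String tag = line.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return new PunctuationTagSet(tags);
    }

    /** Returns true if the given POS tag is listed as punctuation. */
    public boolean isPunctuation(String posTag) {
        return tags.contains(posTag);
    }
}
```

The lines could come from java.nio.file.Files.readAllLines on a file
shipped next to the head rules, so an STTS file would list $., $( and $,
while an English file would list the Penn Treebank tags.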

Thanks for any kind of advice

Andreas

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de