Posted to dev@opennlp.apache.org by Gabriele Vaccari <ga...@dedalus.eu> on 2017/12/05 08:45:20 UTC

FW: openNLP best practices - sentence detector

Hi all,
I sent this message to the users mailing list but have had no response so far, so I'm reposting it to the dev mailing list.

Also: I'm trying to make some modifications to the code for issue 1163<https://issues.apache.org/jira/browse/OPENNLP-1163>, mentioned below, but I'm having trouble with the style checker. I keep getting a lot of NewlineAtEndOfFile errors, even though the files do end with a newline. I've also made sure to replace every \r\n with \n, to no avail. I'm using Maven 3.3.9 and Eclipse Neon.2.
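One way to double-check what the style checker actually sees is to inspect the raw bytes of an offending file. The sketch below is a hypothetical standalone helper (the class and method names are made up for illustration and are not part of OpenNLP or Checkstyle). It reports how many CRLF and bare LF line endings a file contains and how the file ends. It also prints the platform line separator, on the assumption (not a confirmed diagnosis) that the rule may be configured to expect the platform's own separator.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of OpenNLP or Checkstyle.
final class LineEndingCheck {

    /** Counts CRLF pairs ("\r\n") in the raw bytes. */
    static int countCrLf(byte[] data) {
        int n = 0;
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == '\r' && data[i + 1] == '\n') {
                n++;
            }
        }
        return n;
    }

    /** Counts LF bytes that are not preceded by CR. */
    static int countBareLf(byte[] data) {
        int n = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n' && (i == 0 || data[i - 1] != '\r')) {
                n++;
            }
        }
        return n;
    }

    /** Describes how the content ends: "CRLF", "LF", or "none". */
    static String trailingNewline(byte[] data) {
        int len = data.length;
        if (len >= 2 && data[len - 2] == '\r' && data[len - 1] == '\n') {
            return "CRLF";
        }
        if (len >= 1 && data[len - 1] == '\n') {
            return "LF";
        }
        return "none";
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of a file to inspect.
        for (String name : args) {
            byte[] data = Files.readAllBytes(Paths.get(name));
            System.out.printf("%s: CRLF=%d, bare LF=%d, ends with %s%n",
                    name, countCrLf(data), countBareLf(data), trailingNewline(data));
        }
        String platform = System.lineSeparator().equals("\r\n") ? "CRLF" : "LF";
        System.out.println("Platform line separator: " + platform);
    }
}
```

If a file ends with LF while the platform separator is CRLF, a mismatch between the two would be one thing to rule out before looking elsewhere.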

Thank you

Gabriele

From: Gabriele Vaccari
Sent: Friday, December 1, 2017 13:02
To: 'users@opennlp.apache.org' <us...@opennlp.apache.org>
Subject: openNLP best practices - sentence detector

Hi all,

I'm trying to use openNLP to train some models for Italian, basically to get some familiarity with the API. To provide some background, I'm familiar with machine learning concepts and understand what an NLP pipeline looks like; however, this is the first time I have actually had to put together an application with all of this.

So I started with the sentence detector. I was able to train an Italian SD with a corpus of sentences from http://www.corpusitaliano.it/en/. However, the performance of the detector is somewhat below my expectations. It makes pretty obvious mistakes, like failing to recognize an end-of-sentence full stop (example below*), or failing to spot an abbreviation preceded by punctuation (I've filed issue 1163 on Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).

Even though the documentation is very good, I feel it lacks some best practices and suggestions. For instance:

* Is my sentence detection training set supposed to consist of coherent documents, or will a bunch of random sentences with a blank line every 20-30 sentences work?
* Do my training examples in openNLP native format need to be formatted in a special way? Will the algorithm ignore things like extra whitespace or tabs between words? Do examples with a lot of punctuation, such as quotes or parentheses, somehow affect the outcome?
* How many training examples (or events) are recommended?
* Is it better to provide a case-sensitive abbreviation dictionary or a case-insensitive one?
* Is issue 1163 a known problem? I think other languages, such as French, might be affected in the same way.
* Are there examples of complete production-grade data sets in Italian or other languages that have been successfully used to train openNLP tools?
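To make the formatting questions above concrete, here is a minimal sketch of the kind of preprocessing they have in mind. It relies on the native sentence detector training format as described in the OpenNLP manual: one sentence per line, with an empty line marking a document boundary. The class name, method names, and the docSize parameter are hypothetical, and collapsing whitespace is an assumption made for illustration; whether the trainer needs it is exactly one of the open questions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical preprocessing helper; not part of the OpenNLP API.
final class SentenceTrainingData {

    /** Collapses tabs and repeated whitespace into single spaces and trims the ends. */
    static String normalize(String sentence) {
        return sentence.replaceAll("\\s+", " ").trim();
    }

    /**
     * Returns the training lines: one sentence per line, with an empty line
     * inserted after every docSize sentences as an artificial document boundary.
     * Sentences that are empty after normalization are skipped.
     */
    static List<String> toLines(List<String> sentences, int docSize) {
        List<String> lines = new ArrayList<>();
        int inDoc = 0;
        for (String s : sentences) {
            String clean = normalize(s);
            if (clean.isEmpty()) {
                continue;
            }
            if (inDoc == docSize) {
                lines.add("");   // artificial document boundary
                inDoc = 0;
            }
            lines.add(clean);
            inDoc++;
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Prima   frase.", "Seconda\tfrase.", "Terza frase.");
        // With docSize = 2, a blank line separates the second and third sentences.
        for (String line : toLines(sentences, 2)) {
            System.out.println(line);
        }
    }
}
```

In practice docSize would be something like the 20-30 sentences mentioned above; the value 2 in main only keeps the demonstration short.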

I believe I could find answers to most of these questions by just looking at the code, but someone who has already been through it could perhaps point me in the right direction.
Basically, I'm asking for best practices and pro tips.

Thank you

* failure to recognize EOS full stop:
SENT_1: Molteplici furono i passi che portarono alla nascita di questa disciplina.
SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel 1623, grazie a Willhelm Sickhart<https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>, si arrivò a creare macchine in grado di effettuare calcoli matematici con numeri fino a sei cifre, anche se non in maniera autonoma.

Gabriele Vaccari
Dedalus SpA