Posted to users@opennlp.apache.org by Frederic Baroz <ba...@gmail.com> on 2016/09/10 23:48:58 UTC

Training sentence detector with custom sample

Hello,

I've been through most of the pages I could find about the OpenNLP sentence detector and I still can't answer my question. I'd like to build a sentence detector model from my own data. I can't just use the shipped models (even the French ones) since I work with clinical narratives, which are a very specific type of document.

These documents contain very diverse types of text: some (more or less) well-formed paragraphs, but also lists of diagnoses, to-do lists, lab results, etc. Moreover, extracting PDF files that have some level of page formatting sometimes entangles the text and introduces stray bits of text into sentences.

The end result is that text extracted from clinical narratives has a lot of « pseudo-sentences » which sometimes don't end with a period (or other punctuation) or don't start with a capital letter. Because of lists, a significant portion of sentences start with a « bullet » character or a hyphen (which are not technically part of the sentence). Finally, there is a lot of text representing lab results in the form of a 2-dimensional table. This type of text ends up being just one line with a label (e.g. pCO2) and its value (e.g. 7.8 kPa).

Consequently, I have trouble figuring out exactly how to transform my Tika-extracted text into example sentences in order to train a sentence detector model. I have tried, intuitively, inserting « new lines » wherever I would consider a chunk of text to be a sentence, even though it's sometimes far from the grammatical definition (keeping bullets in front of list elements, keeping extra spaces before a sentence because they were there, not inserting a period because none was there).
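
For illustration, the manual procedure described above could be sketched roughly as follows in Python. This is only a sketch under assumptions: the function name to_training_lines and the sample chunks are hypothetical, and the one-chunk-per-line layout is assumed to be what the trainer expects. Each chunk is kept as it was found (bullet, leading spaces, no added period); only line breaks inside a chunk are collapsed.

```python
def to_training_lines(chunks):
    """Turn hand-identified chunks into one-chunk-per-line training text.

    Hypothetical helper: chunks are kept verbatim apart from line breaks
    inside a chunk, which are collapsed so each chunk sits on one line.
    """
    lines = []
    for chunk in chunks:
        parts = chunk.splitlines()
        if not parts:
            continue  # empty chunk, nothing to write
        # Keep the first physical line as-is (leading spaces, bullets);
        # strip continuation lines so the join adds exactly one space.
        continuation = [p.strip() for p in parts[1:] if p.strip()]
        one_line = " ".join([parts[0].rstrip()] + continuation)
        if one_line.strip():
            lines.append(one_line)
    # One chunk per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical sample: a list item, a lab result, and a wrapped sentence.
sample = [
    "- Hypertension",
    "  pCO2 7.8 kPa",
    "Patient admitted for\nchest pain.",
]
training_text = to_training_lines(sample)
```

Whether bullets and leading spaces should really stay in the training lines is exactly the open question here; the sketch only mirrors what was tried by hand.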

I find that there is very little information about the format of the training data. So the question is: how should I edit the sentences in the training file, considering that I'm starting with a rather « dirty » extracted document which, for the most part, is not made of real sentences?

Thank you in advance


FB