Posted to users@opennlp.apache.org by Juan Miguel Cejuela <ju...@jmcejuela.com> on 2012/05/02 18:23:11 UTC

Training data for English sentence segmentation

Hi,

the models list page says that the EN sentence detector was trained on
OpenNLP training data. Is it possible to access this training data? Besides
this, which other training corpora are available for EN sentence segmentation?


Much appreciated

-- 
Juan Miguel Cejuela

Re: Training data for English sentence segmentation

Posted by Jason Baldridge <ja...@gmail.com>.

It comes from the Penn Treebank, so it is not accessible to those who don't
already have that data. We're making a push to switch to models trained on
open data, such as the Open American National Corpus. More to come on that in
the coming weeks.

On Wed, May 2, 2012 at 11:23 AM, Juan Miguel Cejuela
<ju...@jmcejuela.com> wrote:

> Hi,
>
> the models list page says that the EN sentence detector was trained on
> OpenNLP training data. Is it possible to access this training data? Besides
> this, which other training corpora are available for EN sentence segmentation?
>
>
> Much appreciated
>
> --
> Juan Miguel Cejuela
>

-- 
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge