Posted to users@opennlp.apache.org by Mariya Koleva <ko...@gmail.com> on 2012/06/19 16:03:38 UTC

Model for sentence detection not being created

Hi,
I apologise if the question is trivial, but I'm not experienced with OpenNLP
(and not too confident in my Java skills either).

I'm trying to train a sentence detection model for Zulu. No matter whether
I'm using the command line interface or the API, it appears to be training
but a model file is not created. I'm getting the following exception [1]:
java.lang.IllegalArgumentException: The maxent model is not compatible with
the sentence detector!

The original data comes from the Ukwabelana corpus [2] in a text file
(US-ASCII), one sentence per line. It is completely stripped of
capitalisation and any kind of punctuation. I automatically added a "." at
the end of every sentence, so that there is some EOS token for the program
to pick up.
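
For reference, the command-line call I'm using is along these lines (file
names shortened here):

    opennlp SentenceDetectorTrainer -model zu-sent.bin -lang zu -data zu-sent.train -encoding US-ASCII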

I would appreciate any insight as to what should be done!

Mariya

[1] The whole output is:

Indexing events using cutoff of 5

    Computing event counts… done. 29424 events
    Indexing… done.
    Sorting and merging events… done. Reduced 29424 events to 7830.
    Done indexing.
    Incorporating indexed data for training…
    done.

    Number of Event Tokens: 7830
    Number of Outcomes: 1
    Number of Predicates: 1673

    …done.

    Computing model parameters …
    Performing 100 iterations.
    1: … loglikelihood=0.0 1.0
    2: … loglikelihood=0.0 1.0

    Exception in thread “main” java.lang.IllegalArgumentException: The
maxent model is not compatible with the sentence detector!

    at
opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:275)
    at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:64)
    at
opennlp.tools.sentdetect.SentenceDetectorME.train(SentenceDetectorME.java:285)
    at
opennlp.tools.sentdetect.SentenceDetectorME.train(SentenceDetectorME.java:296)
    at
opennlp.tools.cmdline.sentdetect.SentenceDetectorTrainerTool.run(SentenceDetectorTrainerTool.java:111)
    at opennlp.tools.cmdline.CLI.main(CLI.java:191)


[2]
http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/resources.jsp#corpus

Re: Model for sentence detection not being created

Posted by Jörn Kottmann <ko...@gmail.com>.
On 06/19/2012 04:47 PM, Mariya Koleva wrote:
> And yes, what I'm ultimately planning to do is to train POS models for Zulu
> and other related languages, and hopefully have them out for the community.

The POS data on this website is already in the OpenNLP format.
It should work if you follow the instructions here:
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.postagger.training

You can use the command given there and just pass in the data from the
website; you might need to set a different encoding.
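
For example, something along these lines should do it (the "zu" language code
and the file names below are just placeholders, and you may need a different
encoding):

    opennlp POSTaggerTrainer -model zu-pos-maxent.bin -lang zu -data zu-pos.train -encoding UTF-8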

Jörn

Re: Model for sentence detection not being created

Posted by Mariya Koleva <ko...@gmail.com>.
Hello Jörn,

Thank you for the quick response!
The data from the corpus I'm using already came with the punctuation
removed. I'll see what I can do about it.

And yes, what I'm ultimately planning to do is to train POS models for Zulu
and other related languages, and hopefully have them out for the community.

Mariya


Re: Model for sentence detection not being created

Posted by Jörn Kottmann <ko...@gmail.com>.
BTW, the POS data can easily be used to train an OpenNLP POS model.

Jörn


Re: Model for sentence detection not being created

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

The sentence detector does end-of-sentence character
disambiguation. In your case all end-of-sentence characters
are proper sentence ends.

So it only sees one outcome in your entire corpus. To train
a sentence detector model you need both cases, so it can learn
which are valid sentence ends, and which are not.

The training fails on some internal validation; that should be reported
with a nicer error message.

I suggest not removing the punctuation from your training sentences;
then it should work.
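
With punctuation kept, training and saving the model through the 1.5 API
should look roughly like the sketch below (untested; the "zu" language code,
the class name and the file names are just placeholders):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class ZuluSentenceDetectorTrainer {

  public static void main(String[] args) throws IOException {
    // Training data: one sentence per line, with the original punctuation kept.
    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("zu-sent.train"), charset);
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

    SentenceModel model;
    try {
      // "zu" is a placeholder language code; no abbreviation dictionary is
      // passed, cutoff 5 and 100 iterations match the trainer defaults.
      model = SentenceDetectorME.train("zu", sampleStream, true, null, 5, 100);
    } finally {
      sampleStream.close();
    }

    // Write the model file; this is what the CLI trainer would have produced.
    OutputStream modelOut = null;
    try {
      modelOut = new BufferedOutputStream(new FileOutputStream("zu-sent.bin"));
      model.serialize(modelOut);
    } finally {
      if (modelOut != null) {
        modelOut.close();
      }
    }
  }
}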

HTH,
Jörn
