Posted to dev@opennlp.apache.org by Svetoslav Marinov <sv...@findwise.com> on 2012/01/27 16:04:01 UTC

How to train a new model

Hi all,

I asked this question earlier in another thread but did not get an answer.

I would like to train new Swedish models for sentence detection, tokenization, POS tagging, and NER, since the existing models seem to perform poorly on my data.

I read the documentation and followed the steps for training a model from the command line or the API, but I am still not very happy with the results. The existing Swedish models are trained on a small Swedish corpus (Talbanken), while I have access to a much larger training set (the SUC corpus).

Here are the problems I face:

1. The current Swedish model fails when sentences start with lower-case letters. Also, when there is no space between the full stop and the next sentence, the model splits the sentence but keeps the first word of the second sentence as part of the first sentence. My current model cannot split at all if there is no space after the full stop.

Questions: Can one influence training by creating examples where there is no space between two sentences? Can one add features to the model? If so, how? What if the training data contains sentences without end-of-sentence markers (full stops, etc.)? Should such sentences exist in the training set? Is there an example or documentation about this?
At least I cannot seem to find any, but correct me if I am wrong.

2. The NER task suffers from similar problems. My current model has a hard time recognizing single-word names but does OK if the name consists of several words. In addition, I would like to include POS tag information as part of the training features. Is this possible? If so, how? Are there any examples or documentation about it?

Thanks in advance!

Best,
Svetoslav

Re: How to train a new model

Posted by Svetoslav Marinov <sv...@findwise.com>.

Thanks for the answer, William!

I have been pretty busy with other things, but I will soon try to find time
for the suggestions you gave me. I hope these will improve the performance.

Best,
Svetoslav

On 2012-01-30 19:12, "william.colen@gmail.com" <wi...@gmail.com>
wrote:

Re: How to train a new model

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

Hi, Svetoslav,

On Fri, Jan 27, 2012 at 1:04 PM, Svetoslav Marinov <
svetoslav.marinov@findwise.com> wrote:

> Hi all,
>
> I asked this question earlier in another thread but did not get an
> answer.
>
> I would like to train new Swedish models for sentence detection,
> tokenization, POS tagging, and NER, since the existing models seem to
> perform poorly on my data.
>
> I read the documentation and followed the steps for training a model
> from the command line or the API, but I am still not very happy with
> the results. The existing Swedish models are trained on a small
> Swedish corpus (Talbanken), while I have access to a much larger training
> set (the SUC corpus).
>
> Here are the problems I face:
>
>  1. The current Swedish model fails when sentences start with lower-case
> letters. Also, when there is no space between the full stop and the next
> sentence, the model splits the sentence but keeps the first word of the
> second sentence as part of the first sentence. My current model cannot
> split at all if there is no space after the full stop.
>

Your production data should have the same characteristics as your training
data. The model will only handle cases such as sentences that start with
lower-case letters, or a missing space between the full stop and the next
sentence, if the training data covers them. You can add sentences with
these characteristics to your training data.
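
As a rough illustration of adding such sentences, here is a minimal Java
sketch that takes existing training sentences (assuming the
one-sentence-per-line training format described in the manual) and appends a
copy of each with a lower-cased first letter. SentenceTrainingAugmenter and
its methods are hypothetical helpers, not part of OpenNLP. The sketch only
covers the lower-case case; how to represent a missing space between two
sentences depends on how the training stream joins lines, which is not shown
here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: derives extra training lines
// from existing one-sentence-per-line sentence detector data.
final class SentenceTrainingAugmenter {

    // Returns the sentence with its first character lower-cased.
    static String lowerCaseStart(String sentence) {
        if (sentence.isEmpty()) {
            return sentence;
        }
        return Character.toLowerCase(sentence.charAt(0)) + sentence.substring(1);
    }

    // Keeps every original sentence and appends a lower-cased variant
    // whenever that variant differs from the original.
    static List<String> augment(List<String> sentences) {
        List<String> out = new ArrayList<>(sentences);
        for (String s : sentences) {
            String variant = lowerCaseStart(s);
            if (!variant.equals(s)) {
                out.add(variant);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two made-up example sentences; real input would come from the corpus.
        List<String> corpus = Arrays.asList("Det regnar i dag.", "3 personer kom.");
        for (String line : augment(corpus)) {
            System.out.println(line);
        }
    }
}
```

The augmented lines would then be written back to the training file before
running the trainer again.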

>
> Questions: Can one influence training by creating examples where
> there is no space between two sentences? Can one add features to the model?
> If so, how? What if the training data contains sentences without
> end-of-sentence markers (full stops, etc.)? Should such sentences exist in
> the training set? Is there an example or documentation about this?
> At least I cannot seem to find any, but correct me if I am wrong.
>

Yes, simply append such examples to your training data. I don't think the
documentation covers how to create a training corpus; it only explains the
format. Any help to improve it is highly appreciated.

> 2. The NER task suffers from similar problems. My current model has a hard
> time recognizing single-word names but does OK if the name consists of
> several words. In addition, I would like to include POS tag information as
> part of the training features. Is this possible? If so, how? Are there any
> examples or documentation about it?
>

Does your corpus include examples of single-word names? Are they easy to
distinguish from other tokens? Maybe you should consider using Custom
Feature Generation <http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>
to add a "dictionary" element. You can build the dictionary with the
DictionaryBuilder tool: "bin/opennlp DictionaryBuilder".

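One way to answer the first question is to count how many annotated names in
the training data consist of a single token. The sketch below assumes the
name finder training format from the manual, where names are marked as
<START:type> ... <END> in whitespace-tokenized text. NameLengthCounter is a
hypothetical helper, not an OpenNLP class, and the sample lines are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: counts single-token and
// multi-token names in name finder training lines.
final class NameLengthCounter {

    // Matches "<START> name <END>" and "<START:type> name <END>" spans.
    private static final Pattern NAME =
        Pattern.compile("<START(?::[^>]+)?>\\s+(.+?)\\s+<END>");

    // Returns {number of single-token names, number of multi-token names}.
    static int[] count(List<String> lines) {
        int single = 0;
        int multi = 0;
        for (String line : lines) {
            Matcher m = NAME.matcher(line);
            while (m.find()) {
                int tokens = m.group(1).trim().split("\\s+").length;
                if (tokens == 1) {
                    single++;
                } else {
                    multi++;
                }
            }
        }
        return new int[] {single, multi};
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<START:person> Anna <END> bor i <START:location> Stockholm <END> .",
            "<START:person> Carl Bildt <END> talade .");
        int[] counts = count(lines);
        System.out.println("single-token names: " + counts[0]);
        System.out.println("multi-token names: " + counts[1]);
    }
}
```

If single-token names turn out to be rare in the corpus, that alone could
explain the weak results for them.
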
I think you can add POS tag information, but I don't know exactly how to do
it. I would investigate whether it is possible using the "custom" element
of Custom Feature Generation <http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>,
where you can pass a class that implements "AdaptiveFeatureGenerator".

Another possible way, maybe even simpler, is to use the additionalContext
argument of the NameFinder to pass the POS tag info both when training and
when running the name finder. It should work, but I have never tried it.
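
A minimal sketch of preparing that argument, assuming additionalContext is a
String[][] with one row of extra feature strings per token. The "pos="
prefix and the PosContextBuilder class are made up for illustration, and the
actual call into the name finder is left out because its exact signature
should be checked against the OpenNLP version in use.

```java
// Hypothetical helper, not part of OpenNLP: turns one POS tag per token
// into a per-token array of extra feature strings.
final class PosContextBuilder {

    // Builds one row per token; each row holds a single "pos=<tag>" feature.
    // The "pos=" prefix is an arbitrary choice for this sketch.
    static String[][] build(String[] posTags) {
        String[][] context = new String[posTags.length][];
        for (int i = 0; i < posTags.length; i++) {
            context[i] = new String[] {"pos=" + posTags[i]};
        }
        return context;
    }

    public static void main(String[] args) {
        // Example tokens with illustrative tags; real tags would come from
        // a POS tagger run over the same tokens.
        String[] tokens = {"Anna", "bor", "i", "Stockholm", "."};
        String[] tags = {"PM", "VB", "PP", "PM", "MAD"};
        String[][] additionalContext = build(tags);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + additionalContext[i][0]);
        }
        // The array would then be passed along with the tokens, both in the
        // training samples and when tagging (call not shown; assumption).
    }
}
```

The same tagger and the same feature strings would have to be used at
training time and at run time, otherwise the model sees features it was
never trained on.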

Regards,
William