Posted to dev@opennlp.apache.org by Eugen Ignat <eu...@gmail.com> on 2011/08/22 11:38:45 UTC

NE

Hello,
I want to use the "Name Finder" from OpenNLP, but for Romanian.
I downloaded all the models for the Name Finder: date, location, money,
organization, percentage, person and time name for English.
I presume for location, organization and person, in the model there should
be some sort of list/lists.
And now to my problem: can I open the .model files in some way that does not
violate the license (moral or written), so that I can find these
lists? Of course, after I "make" the models for Romanian, I will send them
back to you if you want them.

All the best,
Eugen Ignat.

Re: NE

Posted by Jörn Kottmann <ko...@gmail.com>.

On 8/22/11 11:41 PM, Jörn Kottmann wrote:
> No, these models are statistical. That means they can learn with training
> data what is an entity and what is not.
>

Oops, something was missing here.

During the training, each token is "transformed" into a set of features.
This set of features is combined with an outcome which describes how a
token should be labeled. These features are generated by all kinds of
rules, e.g. the token, capitalization of the token, the token before,
the token after, etc. These features cannot be adjusted to work with
Romanian by hand.
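
As a rough illustration of the feature idea described above, here is a minimal
sketch in Java. The class name, the feature names ("w=", "cap=", "prev=",
"next=") and the example sentence are all hypothetical; this is not the feature
set OpenNLP actually generates, only the general shape of the technique.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: the feature names below are made up for this sketch and
// are NOT the features OpenNLP's name finder really produces.
class TokenFeatureSketch {

    // Builds string features for the token at position i from the token
    // itself, its capitalization, and its left and right neighbours.
    // Assumes every token is a non-empty string.
    static List<String> features(String[] tokens, int i) {
        List<String> f = new ArrayList<>();
        String tok = tokens[i];
        f.add("w=" + tok.toLowerCase(Locale.ROOT));
        f.add("cap=" + Character.isUpperCase(tok.charAt(0)));
        f.add("prev=" + (i > 0 ? tokens[i - 1].toLowerCase(Locale.ROOT) : "<bos>"));
        f.add("next=" + (i < tokens.length - 1
                ? tokens[i + 1].toLowerCase(Locale.ROOT) : "<eos>"));
        return f;
    }

    public static void main(String[] args) {
        String[] tokens = {"Maria", "Popescu", "lives", "in", "Cluj"};
        // During training, each feature list would be paired with an outcome
        // label for that token; here we only print the features.
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + features(tokens, i));
        }
    }
}
```

The learned model only stores weights for such features, which is why there is
no name list to extract from it.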

Jörn

Re: NE

Posted by Jörn Kottmann <ko...@gmail.com>.

On 8/22/11 11:38 AM, Eugen Ignat wrote:
> Hello,
> I want to use the "Name Finder" from OpenNLP, but for Romanian.
> I downloaded all the models for the Name Finder: date, location, money,
> organization, percentage, person and time name for English.
> I presume for location, organization and person, in the model there should
> be some sort of list/lists.

No, these models are statistical. That means they can learn with training
data what is an entity and what is not.

These features are generated by all kinds of rules, e.g. the token,
capitalization of the token. These features cannot be adjusted to work
with Romanian by hand.

Indeed, you need to create new training data which contains Romanian texts.
You will get the best performance if you choose training data from within
your domain; e.g. to process medical texts, you shouldn't use newswire
for training.
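
For orientation, the name finder training format described in the manual is one
tokenized sentence per line, with entity spans marked by <START:type> and <END>
tags. The following is a minimal sketch that renders such a line; the class
name, the helper and the Romanian example sentence are hypothetical, and the
linked documentation remains the authority on the exact format.

```java
import java.util.StringJoiner;

// Hypothetical helper for this sketch; not part of OpenNLP.
class NameSampleLine {

    // Wraps tokens[begin .. end-1] in <START:type> ... <END> markers and
    // joins all tokens with single spaces, one sentence per line.
    static String annotate(String[] tokens, int begin, int end, String type) {
        StringJoiner line = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i == begin) {
                line.add("<START:" + type + ">");
            }
            line.add(tokens[i]);
            if (i == end - 1) {
                line.add("<END>");
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Example sentence invented for illustration.
        String[] tokens = {"Maria", "Popescu", "a", "vizitat", "Cluj", "."};
        System.out.println(annotate(tokens, 0, 2, "person"));
    }
}
```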

Have a look at our documentation:
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.namefind.training

> And now to my problem: can I open the .model files in some way that does not
> violate the license (moral or written), so that I can find these
> lists? Of course, after I "make" the models for Romanian, I will send them
> back to you if you want them.

Well, we have a model package, which is simply a zip file. You can unzip it,
and it contains a model file. The model file is the binary representation
of our statistical maxent (or perceptron) model.
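
Since the package is an ordinary zip, its contents can be listed with the
standard java.util.zip classes. This is a minimal sketch, assuming the path to
the downloaded package is passed as the first argument; the class name is
hypothetical and nothing here is OpenNLP API.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for this sketch; uses only the JDK.
class ModelPackageLister {

    // Reads the stream as a zip archive and returns the entry names in
    // archive order. The entries themselves are not decompressed to disk.
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ModelPackageLister <model-package>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (String name : entryNames(in)) {
                System.out.println(name);
            }
        }
    }
}
```

This only shows which files are inside the package; the model file itself stays
in its binary form.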

There is no license issue, or other reason to keep you from looking at it.

At OpenNLP we currently simply lack the tools to inspect an existing model.
It would be interesting to see the features and their associated weights.

Jörn