Posted to users@opennlp.apache.org by Tomasz Sobczak <so...@gmail.com> on 2013/08/20 09:47:29 UTC

OpenNLP NER for Polish

Hello!

I have been playing with OpenNLP NER a little, but my results are not
satisfactory. I'm trying to train a Polish model for finding person entities.

Apart from the names of famous people, it also picks up names of some
organizations, geographical names and others.

What I have done so far:

- crawled the text of 10k articles from the most popular Polish daily

- prepared a list of 18k famous people

- stemmed the article texts (Morfologik, a Polish stemmer)

- tagged sentences containing famous people's names with <START:person> … <END>

- put the tagged sentences into a file (7k lines)

I then fed the prepared corpus to the OpenNLP training tool and produced a
model (*.bin).
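
For reference, the name finder training format is one whitespace-tokenized sentence per line, with each entity wrapped in <START:person> … <END>. A minimal sketch of producing such a line from tokens and token-index spans follows. The helper name to_opennlp_line and the example sentence are made up for illustration and are not part of OpenNLP.

```python
def to_opennlp_line(tokens, person_spans):
    """Render one tokenized sentence in the name finder training format.

    tokens       -- list of token strings for a single sentence
    person_spans -- list of (start, end) token index pairs, end exclusive
    """
    starts = {start for start, _ in person_spans}
    ends = {end for _, end in person_spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")           # close a span that ended before this token
        if i in starts:
            out.append("<START:person>")  # open a span that begins at this token
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")               # span that runs to the end of the sentence
    return " ".join(out)


# Hypothetical example sentence; every person mention is marked, famous or not.
tokens = ["Wczoraj", "Anna", "Nowak", "spotkała", "Jana", "Kowalskiego", "."]
spans = [(1, 3), (4, 6)]
print(to_opennlp_line(tokens, spans))
# Wczoraj <START:person> Anna Nowak <END> spotkała <START:person> Jana Kowalskiego <END> .
```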

Could you suggest what I have missed, or what I could do better in my input
text file to improve entity recognition?

Thanks,
Tomek

Re: OpenNLP NER for Polish

Posted by Tomasz Sobczak <so...@gmail.com>.

OK, I will pay attention to untagged persons in my corpus.

I handle different forms of a first name with regular expressions, e.g.
(Tomasz|Tomek), where the second form is a diminutive. I prepared these
expressions from the Wikipedia list of Polish names.

I stem the articles in the corpus because first names and surnames are
inflected. But I don't stem the test data; thanks for the remark.
I will try your suggestion of not using the stemmer, but the inflection
problem can be serious. I need automatic person tagging; that's why I use
the stemmer and then a regular expression to find the entity.
In Polish, name inflection is mostly realized by adding a suffix, but not
always, and that is where the problems arise.
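
To make the suffix problem concrete, here is a rough sketch of matching inflected forms directly on unstemmed text by appending a few case endings to each base form. The suffix list is illustrative and incomplete, and the function name is hypothetical. It finds forms of Tomasz but misses Tomkiem, because the diminutive Tomek changes its stem (Tomek, Tomka, Tomkiem) instead of just taking a suffix.

```python
import re

# Illustrative, incomplete list of case endings for a masculine name ending
# in a consonant (Tomasz -> Tomasza, Tomaszowi, Tomaszem, Tomaszu).
SUFFIXES = ["", "a", "owi", "em", "u"]


def inflected_name_pattern(base_forms, suffixes=SUFFIXES):
    """Build a regex matching any base form followed by one of the suffixes."""
    stems = "|".join(re.escape(b) for b in base_forms)
    # Longest endings first, so the empty ending is tried last.
    endings = "|".join(
        re.escape(s) for s in sorted(suffixes, key=len, reverse=True)
    )
    return re.compile(r"\b(?:%s)(?:%s)\b" % (stems, endings))


pattern = inflected_name_pattern(["Tomasz", "Tomek"])
text = "Widziałem Tomasza , rozmawiałem z Tomaszem i z Tomkiem ."
print(pattern.findall(text))
# ['Tomasza', 'Tomaszem']   (Tomkiem is missed: the stem itself changes)
```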

How can tokenization help me with inflection?

One last thing: what kind of valuable information do I lose by stemming?
Does it make a difference to a NER tool whether it sees the original word or
its base (stemmed) form?
If the explanation is too complicated, could you recommend some reading on
the subject?

Thanks,
Tomek

Re: OpenNLP NER for Polish

Posted by Svetoslav Marinov <sv...@findwise.com>.

As Jörn wrote, you should tag ALL person names in your corpus, not just the famous ones.

Then, Polish is a highly inflected language. How do you deal with all the case forms of a person name? Do you have them in the list? If you don't, that's one of the problems as well. Why do you need to stem the articles? Is it to account for the inflections? But then you should do exactly the same with your test data. However, I would strongly advise you not to use the stemmer. You lose a lot of valuable information which can help distinguish whether a word is a name or not. Just tag the texts as they are (maybe with some proper tokenization and sentence splitting) - this should improve the results.

Svetoslav
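
One way to picture the information that stemming removes: a statistical tagger can only learn from the surface cues it is shown. The sketch below computes two hypothetical surface features (capitalization and the last three characters) for an inflected phrase and for its base forms. The features are illustrative only and are not OpenNLP's actual feature set, and the base forms are typed in by hand, not produced by Morfologik.

```python
def surface_features(token):
    """Two simple surface cues of the kind a tagger might use."""
    return {
        "capitalized": token[:1].isupper(),
        "suffix3": token[-3:].lower(),
    }


original = ["prezydenta", "Kowalskiego"]  # inflected forms as they appear in text
base_forms = ["prezydent", "Kowalski"]    # hand-written base forms (assumption)

for surface, base in zip(original, base_forms):
    print(surface, surface_features(surface))
    print(base, surface_features(base))
```

In the inflected text both words carry a case ending that ties them together as one phrase; after reduction to base forms that cue is gone. If the model is trained on base forms but applied to raw text, the features it sees at run time no longer match the ones it was trained on.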

Re: OpenNLP NER for Polish

Posted by Jörn Kottmann <ko...@gmail.com>.

On 08/20/2013 08:57 PM, Tomasz Sobczak wrote:
> Could you suggest a linguistic annotation tool?

I have worked a lot with the Apache UIMA Cas Editor, and we recently switched
to Brat (http://brat.nlplab.org/).

The OpenNLP Tagging Server can be called by brat to pre-annotate a
document with the OpenNLP Name Finder.

Jörn
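
Once documents are annotated in brat, the annotations still have to be turned into name finder training lines. A rough sketch of that conversion follows, under several assumptions: the document is a single sentence already tokenized with spaces, annotation boundaries coincide with token boundaries, and the entity type is called "Person" (brat entity types are user-configured, so that name is hypothetical). Both function names are made up for illustration.

```python
import re


def parse_ann(ann_text, wanted_type="Person"):
    """Return sorted (start, end) character spans of the wanted entity type."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # only text-bound annotations (IDs T1, T2, ...) carry offsets
        _id, type_and_offsets, _text = line.split("\t", 2)
        if ";" in type_and_offsets:
            continue  # discontinuous spans are skipped in this sketch
        etype, start, end = type_and_offsets.split()
        if etype == wanted_type:
            spans.append((int(start), int(end)))
    return sorted(spans)


def to_training_line(sentence, spans):
    """Wrap tokens covered by the character spans in <START:person> ... <END>."""
    out = []
    for match in re.finditer(r"\S+", sentence):
        if any(match.start() == a for a, _ in spans):
            out.append("<START:person>")
        out.append(match.group())
        if any(match.end() == b for _, b in spans):
            out.append("<END>")
    return " ".join(out)


sentence = "Anna Nowak spotkała Jana Kowalskiego ."
ann = "T1\tPerson 0 10\tAnna Nowak\nT2\tPerson 20 36\tJana Kowalskiego\n"
print(to_training_line(sentence, parse_ann(ann)))
# <START:person> Anna Nowak <END> spotkała <START:person> Jana Kowalskiego <END> .
```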

Re: OpenNLP NER for Polish

Posted by Tomasz Sobczak <so...@gmail.com>.

Could you suggest a linguistic annotation tool?

I will check whether there are untagged names in my training data and,
if so, how many.
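
A quick way to run that check is to collect every name that is tagged somewhere in the training file and then count where the same string also appears outside a tag. This is a minimal sketch with a hypothetical function name. It only catches exact string matches, so inflected variants and names that were never tagged anywhere will not show up.

```python
import re

TAGGED = re.compile(r"<START:person>\s+(.+?)\s+<END>")


def untagged_mentions(lines):
    """Count untagged occurrences of names that are tagged elsewhere."""
    names = set()
    for line in lines:
        names.update(TAGGED.findall(line))

    counts = {}
    for line in lines:
        plain = TAGGED.sub(" ", line)  # drop the spans that are already tagged
        for name in names:
            # Match the name only as whole whitespace-delimited tokens.
            hits = re.findall(r"(?<!\S)" + re.escape(name) + r"(?!\S)", plain)
            if hits:
                counts[name] = counts.get(name, 0) + len(hits)
    return counts


# Hypothetical two-line training file.
lines = [
    "<START:person> Jan Kowalski <END> przyjechał do Warszawy .",
    "Potem Jan Kowalski spotkał się z <START:person> Anną Nowak <END> .",
]
print(untagged_mentions(lines))
# {'Jan Kowalski': 1}
```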

The training data should contain only sentences with tagged persons, right?
Are sentences without tags useless?

Tomek

Re: OpenNLP NER for Polish

Posted by Jörn Kottmann <ko...@gmail.com>.

On 08/20/2013 09:47 AM, Tomasz Sobczak wrote:
> Could you suggest what I have missed, or what I could do better in my input
> text file to improve entity recognition?

It's hard to tell without seeing your training data, but I suspect your
tagging is too inconsistent, e.g. many person names are not tagged.

Try using a linguistic annotation tool to annotate at least a few
hundred articles with all the person names mentioned in them.

Jörn