You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@opennlp.apache.org by mark meiklejohn <ma...@yahoo.co.uk> on 2012/01/18 20:35:26 UTC

Re: Case Insensitive Name Finder - any ideas? - sorry missed the update - another ? though

James,

I agree the correct way is to ensure upper-case. But when you have no 
control over input it makes things a little more difficult.

So, I may look at a training set. What is the recommended size of a 
training set?

Thanks

Mark

On 18/01/2012 19:29, mark meiklejohn wrote:
> On 16/01/2012 19:31, mark meiklejohn wrote:
>> Hi,
>>
>> I've been having a look through the API and I can't see it. However, I
>> wonder if there would be a case insensitive setting for the Name Finders?
>>
>> The reason being for some input which maybe in lower-case i.e. 'friday'
>> it does not detect it. However, if it is in upper-case, it is not an
>> issue.
>>
>> TIA
>>
>> Mark
>>
>>
>>
>
>
>
>

status of opennlp.tools.coref component

Posted by Boris Galitsky <bg...@hotmail.com>.

Hi guys

  As a next step for machine learning of parse trees I would need to include coreference information.
  Right now when I compute similarity between two paragraphs, I just do pair-wise generalization of each sentence from first paragraph against each sentence of sentence paragraph, asif they are independent.
  As the next step to develop the theory (and practice) of syntactic generalization is to include coreference info. Now instead of finding a set of maximum common sub-trees of a pair of parse trees for two sentences, I try to find a set of maximum common sub-graphs (sub-forest) for a forest of parse trees for a first paragraph and for a second paragraph.
  I suggest we call it 'coreference forest'.
  Any code samples on how can I use opennlp.tools.coref for that? There are no tests although code is decently commented.
RegardsBoris

Re: [Name Finder] Wikipedia training set and parameters tuning

Posted by Olivier Grisel <ol...@ensta.org>.

2012/1/27 Riccardo Tasso <ri...@gmail.com>:
>
> That's exactly what I mean. The fact is that in our interpretation of
> Wikipedia, not all the sentences are annotated. That is because not all the
> sentences containing an entity requires linking. So I'm thinking of using
> only a better subset of my sentences (since they are so much). From this the
> idea of sampling only featured pages: stubs or poor pages may have a greater
> probability of being poorly annotated.

I my case I only keep sentences that have at least one annotation
(e.g. a link that maps to one of the types I am interested in).

> The idea may also be extended with the other proposal, which I'll try to
> explain with an example. Imagine a page about a vegetable. If a city appears
> in a sentence inside this page, it could be possible that it will appear not
> linked (i.e. not annotated) since the topics of the article aren't as much
> related. Otherwise I suspect that in a page talking about Geography, places
> are tagged more frequently. This is obviously an hypothesis, which shoul be
> better verify.
>
> Another idea is to use only sentences containing links regarding the
> entities which may be interesting. For example:
> * "[[Milan|Milan]] is an industrial city" becomes: "<place>Milan</place> is
> an industrial city"
> * "[[Paris|Paris Hilton]] was drunk last Friday." becomes: "Paris was drunk
> last Friday" (this sentence is kept because the link text is in the list of
> candidates to be tagged as places, but in this case the anchor suggest us it
> isn't so, hence is a good negative example)
> "Paris is a very touristic city." is discarded because it doesn't contain
> any interesting link

I am not sure that "link richness" is related to the "entity
relatedness" of the topic of the article. That hypothesis would
require some data-driven validation.

Another fact to consider: on the page
http://en.wikipedia.org/wiki/Paris_Hilton , most occurrences of the
"Paris Hilton" as a name of a person are not linked (because that
would be confusing for user to link to the same page). So it would be
possible to pre-process the markup by adding those recursive link on
pages that refer to entities with interesting types.

Yet another bias: if you take a page like:
http://en.wikipedia.org/wiki/The_Simple_Life that mentions Paris
Hilton many times, only the first few occurrences are links. The
remaining occurrences of the firstname "Paris" are never linked: that
a huge false negative bias. Again a dedicate preprocessing heuristic
to propagate recurring name annotations inside a given page
automatically might help.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: [Name Finder] Wikipedia training set and parameters tuning

Posted by Riccardo Tasso <ri...@gmail.com>.

On 26/01/2012 20:39, Olivier Grisel wrote:
> You should use the DBpedia NTriples dumps instead of parsing the
> wikipedia template as done in https://github.com/ogrisel/pignlproc .
> The type information for person, places and organization is very good.
Ok, it will be my next step.

> I don't think it's a huge problem for training but it's indeed a
> problem the performance evaluation: if you use this some held out
> folds from this dataset for performance evaluation (precision, recall,
> f1-score of the trained NameFinder model) then the fact that dataset
> itself is missing annotation will artificially increase the false
> positive rate estimate which will have an potentially great impact on
> the evaluation of the precision. The actually precision should be
> higher that what's measured.
My sentiment is that if I train model with sentences missing annotations 
these will worse the performance of my model. Isn't it?

> I think the only way to fix this issue is to manually fix the
> annotations of a small portion of the automatically generated dataset
> to add the missing annotations. I think we probably need 1000
> sentences per type to get a non ridiculous validation set.
>
> Besides performance evaluation, the missing annotation issue will also
> bias the model towards negative response, hence increasing the false
> negatives rate and decreasing the true model recall.

That's exactly what I mean. The fact is that in our interpretation of 
Wikipedia, not all the sentences are annotated. That is because not all 
the sentences containing an entity requires linking. So I'm thinking of 
using only a better subset of my sentences (since they are so much). 
 From this the idea of sampling only featured pages: stubs or poor pages 
may have a greater probability of being poorly annotated.

The idea may also be extended with the other proposal, which I'll try to 
explain with an example. Imagine a page about a vegetable. If a city 
appears in a sentence inside this page, it could be possible that it 
will appear not linked (i.e. not annotated) since the topics of the 
article aren't as much related. Otherwise I suspect that in a page 
talking about Geography, places are tagged more frequently. This is 
obviously an hypothesis, which shoul be better verify.

Another idea is to use only sentences containing links regarding the 
entities which may be interesting. For example:
* "[[Milan|Milan]] is an industrial city" becomes: "<place>Milan</place> 
is an industrial city"
* "[[Paris|Paris Hilton]] was drunk last Friday." becomes: "Paris was 
drunk last Friday" (this sentence is kept because the link text is in 
the list of candidates to be tagged as places, but in this case the 
anchor suggest us it isn't so, hence is a good negative example)
"Paris is a very touristic city." is discarded because it doesn't 
contain any interesting link

> In my first experiment reported in [1] I had not taken the wikipedia
> redirect links into account which did probably aggravate this problem
> even further. The current version of the pig script has been fixed
> w.r.t redirect handling [2] but I have not found the time to rerun a
> complete performance evaluation. This will solve frequent
> classification errors such as "China" which is redirected to "People's
> Republic of China" in Wikipedia. So just handling the redirect my
> improve the quality of the data and hence the trained model by quite a
> bit.
>
> [1] http://dev.blogs.nuxeo.com/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
> [2] https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig#L22
>
> Also note that the perceptron model was not available when I ran this
> experiment. It's probably more scalable, e.s.p. memory wise and would
> be very worth trying again.

I my case I can handle redirects too, and surely I'll try to test also 
the perceptron model.

> In my experience the DBpedia type links for Person, Place and 
> Organization are very good quality. No false positives, there might be 
> some missing links though. It might be interesting to do some manual 
> checking of the top 100 recurring false positive names after a first 
> round of DBpedia extraction => model training => model evaluation on 
> held out data. Then if a significant portion of those false positive 
> names are actually missing type info in DBpedia or in the redirect 
> links, add them manually and iterate. 

Ok, now I have a lot of ideas for customizing my experiments. I will 
publish of course my results as soon as I execute my tests. However I'd 
like to get more in depth also with the training parameters, so the 
discussion goes on :)

> Anyway if you are interested in reviving the annotation sub-project, 
> please feel free to do so: 
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html We need a 
> database of annotated open data text (wikipedia, wikinews, project 
> Gutemberg...) with human validation metadata and a nice Web UI to 
> maintain it. 

I think it would be a great think, and also a work which requires a good 
project phase (a mistake here could bring to a lot of problems in the 
future). I'll think about contributing to the project, but for sure it 
won't be immediate.

Thanks
     Riccardo

Re: [Name Finder] Wikipedia training set and parameters tuning

Posted by Olivier Grisel <ol...@ensta.org>.

2012/2/14 Riccardo Tasso <ri...@gmail.com>:
> On 26/01/2012 20:39, Olivier Grisel wrote:
>>
>> In my experience the DBpedia type links for Person, Place and
>> Organization are very good quality. No false positives, there might be
>> some missing links though.
>
>
> Probably you're rigth if you speak of english dataset. I'm using the italian
> one, found at:
> http://downloads.dbpedia.org/3.7-i18n/it/instance_types_it.nt.bz2
>
> It is good for Person and Place classes, but it doesn't contain any Company!
> For them I'll use my standard technique, just reading the wikipedia template
> dump.

Indeed even in http://downloads.dbpedia.org/3.7/it/instance_types_it.nt.bz2
there is no entity with type http://dbpedia.org/ontology/Organisation
while there are many both for English, German and French for instance.

This looks like a bug in the template mappings. This is probably
fixable here: http://mappings.dbpedia.org/index.php/Main_Page

However, for entities that have it <-> en linked wikipedia pages you
can combine:

http://downloads.dbpedia.org/3.7/it/wikipedia_links_it.nt.bz2

along with:

http://downloads.dbpedia.org/3.7/en/instance_types_en.nt.bz2

to find the type of italian wikilinks in the Italian wikipedia dumps.
This is how I do it with pignlproc to support the French language
corpus.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: [Name Finder] Wikipedia training set and parameters tuning

Posted by Riccardo Tasso <ri...@gmail.com>.

On 26/01/2012 20:39, Olivier Grisel wrote:
> In my experience the DBpedia type links for Person, Place and
> Organization are very good quality. No false positives, there might be
> some missing links though.

Probably you're rigth if you speak of english dataset. I'm using the 
italian one, found at:
http://downloads.dbpedia.org/3.7-i18n/it/instance_types_it.nt.bz2

It is good for Person and Place classes, but it doesn't contain any 
Company! For them I'll use my standard technique, just reading the 
wikipedia template dump.

Cheers,
     Riccardo

Re: [Name Finder] Wikipedia training set and parameters tuning

Posted by Olivier Grisel <ol...@ensta.org>.

2012/1/26 Riccardo Tasso <ri...@gmail.com>:
> Hi all,
>    I'm looking for using Wikipedia as a source to train my own NameFinder.
>
> The main idea is based on two assumptions:
> 1) Almost each Wikipedia Article has a template which make easy to classify
> it as a Person, Place or some kind of entity

You should use the DBpedia NTriples dumps instead of parsing the
wikipedia template as done in https://github.com/ogrisel/pignlproc .
The type information for person, places and organization is very good.

> 2) Each Wikipedia Article contains hyper text to other Wikipedia Articles
>
> Given that is possible to translate links to typed annotations to train the
> Name Finder.
>
> I know that Olivier has already tried this approach, but I wanted to work on
> my own implementation and I think this is the right place to discuss about
> it. There are some general questions and some more specific, regarding the
> Name Finder.
>
> The general question regards the fact that Wikipedia isn't the "perfect"
> training set, because not all the entities are linked / tagged. The good
> thing is that as dataset it is very large, which means a lot of tagged
> examples and a lot of untagged ones. Do you think this is a huge problem?

I don't think it's a huge problem for training but it's indeed a
problem the performance evaluation: if you use this some held out
folds from this dataset for performance evaluation (precision, recall,
f1-score of the trained NameFinder model) then the fact that dataset
itself is missing annotation will artificially increase the false
positive rate estimate which will have an potentially great impact on
the evaluation of the precision. The actually precision should be
higher that what's measured.

I think the only way to fix this issue is to manually fix the
annotations of a small portion of the automatically generated dataset
to add the missing annotations. I think we probably need 1000
sentences per type to get a non ridiculous validation set.

Besides performance evaluation, the missing annotation issue will also
bias the model towards negative response, hence increasing the false
negatives rate and decreasing the true model recall.

In my first experiment reported in [1] I had not taken the wikipedia
redirect links into account which did probably aggravate this problem
even further. The current version of the pig script has been fixed
w.r.t redirect handling [2] but I have not found the time to rerun a
complete performance evaluation. This will solve frequent
classification errors such as "China" which is redirected to "People's
Republic of China" in Wikipedia. So just handling the redirect my
improve the quality of the data and hence the trained model by quite a
bit.

[1] http://dev.blogs.nuxeo.com/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
[2] https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig#L22

Also note that the perceptron model was not available when I ran this
experiment. It's probably more scalable, e.s.p. memory wise and would
be very worth trying again.

> What do you think about selecting as training set a subset of pages with
> high precision? I have some ideas about which strategy to implement:
> * select only featured pages (which somehow is a guarantee that linking is
> done properly)

In my experience the DBpedia type links for Person, Place and
Organization are very good quality. No false positives, there might be
some missing links though. It might be interesting to do some manual
checking of the top 100 recurring false positive names after a first
round of DBpedia extraction => model training => model evaluation on
held out data. Then if a significant portion of those false positive
names are actually missing type info in DBpedia or in the redirect
links, add them manually and iterate.

> * selecting only pages regarding the Name Finder entity I'm trying to train
> (e.g. only People pages for People Name Finder)

I am not sure that is such a good idea. I think having 50% positive
examples and 50% negative examples would be better. However because I
a have no clue on how bad the missing annotation issue is
quantitatively I would abstain to comment on this :)

Anyway if you are interested in reviving the annotation sub-project,
please feel free to do so:

  https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

We need a database of annotated open data text (wikipedia, wikinews,
project Gutemberg...) with human validation metadata and a nice Web UI
to maintain it.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

[Name Finder] Wikipedia training set and parameters tuning

Posted by Riccardo Tasso <ri...@gmail.com>.

Hi all,
     I'm looking for using Wikipedia as a source to train my own NameFinder.

The main idea is based on two assumptions:
1) Almost each Wikipedia Article has a template which make easy to 
classify it as a Person, Place or some kind of entity
2) Each Wikipedia Article contains hyper text to other Wikipedia Articles

Given that is possible to translate links to typed annotations to train 
the Name Finder.

I know that Olivier has already tried this approach, but I wanted to 
work on my own implementation and I think this is the right place to 
discuss about it. There are some general questions and some more 
specific, regarding the Name Finder.

The general question regards the fact that Wikipedia isn't the "perfect" 
training set, because not all the entities are linked / tagged. The good 
thing is that as dataset it is very large, which means a lot of tagged 
examples and a lot of untagged ones. Do you think this is a huge problem?

What do you think about selecting as training set a subset of pages with 
high precision? I have some ideas about which strategy to implement:
* select only featured pages (which somehow is a guarantee that linking 
is done properly)
* selecting only pages regarding the Name Finder entity I'm trying to 
train (e.g. only People pages for People Name Finder)

The specific questions regard the right tuning of training parameters, 
which I think is a frequent question. I hope this discussion may bring 
to the creation of new material to improve documentation, I advice you I 
won't be brief. For this I'm starting from some hints given by Jörn:

On 19/01/2012 14:16, Jörn Kottmann wrote:
>
> When I am doing training I always take our defaults as a base line and 
> then modify the parameters
> to see how it changes the performance. When you are working with a 
> training set which grows over
> time I suggest to once in a while start again from the default and 
> verify if the modifications are still
> giving an improvement.
>
> A few hints:
> - Using more iterations on the maxent model helps especially when your 
> data set is small,
>    e.g. try 300 to 500 instead of 100.

My dataset is huge, but I will try to test also adding more iterations.

final Map<String, Object> resources

>
> - Depending on domain and language feature generation should be 
> adapted, try to use
>    our xml feature generation (for this use trunk version, there was a 
> severe bug in 1.5.2).
>

For feature generation, I admit I haven't the idea on how use it. I'm 
using the CachedFeatureGenerator exactly as instantiated in the 
documentation. Can you help me in explaining them?

new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2):
this one means that the two previous and next tokens are used as 
features to train the model: the window size probably depends on the 
language and shouldn't be too big to avoid loosing generalization.

new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2):
this one is similar to the former, but it uses the class of token 
instead of the token itself. Let's say I can do POS-tagging on my 
sentences, which is a classification for them. I think this may be an 
interesting property to detect named entities (e.g. a Place is often 
introduced by a token whith the pos IN). How can I exploit this idea?

new OutcomePriorFeatureGenerator(),
new PreviousMapFeatureGenerator(),
new BigramNameFeatureGenerator():
these FeatureGenerator aren't very clear to me and I would like to get 
more in depth. I can only understand that they aren't used by default.

new SentenceFeatureGenerator(true, false):
used to keep or skip the first and the last word of a sentence as 
feature (depending on the boolean parameters given in input). Which is 
the rationale to keep the first word and skip the last word? How can I 
decide of this setting? Which are the possible customizations of this 
FeatureGenerator?

> - Try the perceptron, usually has a higher recall, train it with a 
> cutoff of 0.
>
> - Use our build-in evaluation to test how a model performs, it can 
> output performance numbers
>    and print out misclassified samples.
>
> - Look carefully at misclassified samples maybe there is are patterns 
> which do not really work
>    with your model.
>
> - Add training data which contains cases which should work but do not.
>
> Hope this helps,
> Jörn

Thank you for these hints, I will try each one carefully.

Regards,
     Riccardo