Posted to users@opennlp.apache.org by Olivier Grisel <ol...@ensta.org> on 2011/01/04 19:04:43 UTC

pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Hi all,

I have lately been working on a utility to automatically extract
annotated multilingual corpora for the Named Entity Recognition task
out of Wikipedia dumps.

The tool is named pignlproc; it is licensed under ASL2, available at
https://github.com/ogrisel/pignlproc and uses Apache Hadoop, Apache
Pig and Apache Whirr to perform the processing on a cluster of tens of
virtual machines on the Amazon EC2 cloud infrastructure (you can also
run it locally on a single machine, of course).

Here is a sample of the output on the French Wikipedia dump:

  http://pignlproc.s3.amazonaws.com/corpus/fr/opennlp_location/part-r-00000

You can replace "location" by "person" or "organization" in the
previous URL for more examples. You can also replace "part-r-00000" by
"part-r-000XX" to download larger chunks of the corpus.

And here are some trained models (50 iterations on the first 3 chunks
of each corpus, i.e. ~100k annotated sentences for each type):

http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-location.bin
http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-person.bin
http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-organization.bin
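
In case it helps, loading one of these models from Java looks roughly
like this (untested sketch; the local path and the example sentence are
made up):

  import java.io.FileInputStream;
  import java.io.InputStream;

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.util.Span;

  public class FrNerDemo {
      public static void main(String[] args) throws Exception {
          // Load one of the models linked above (local path is an assumption).
          InputStream in = new FileInputStream("fr-ner-location.bin");
          TokenNameFinderModel model = new TokenNameFinderModel(in);
          in.close();

          NameFinderME finder = new NameFinderME(model);

          // The name finder expects one pre-tokenized sentence at a time.
          String[] tokens = {"Je", "suis", "né", "à", "Paris", "en", "France", "."};
          for (Span span : finder.find(tokens)) {
              System.out.println(span.getType() + " -> tokens "
                      + span.getStart() + ".." + span.getEnd());
          }

          // Clear the document-level adaptive features between documents.
          finder.clearAdaptiveData();
      }
  }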

It is possible to retrain those models on a larger subset of chunks by
allocating more than 2GB of heap space to the OpenNLP CLI tool (I used
version 1.5.0).
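
For reference, the equivalent retraining through the Java API rather
than the CLI would be something along these lines (untested sketch from
memory of the 1.5 API; the file names are made up and the cutoff of 5
is just the usual default). You still need to give the JVM a large
heap, e.g. -Xmx4g, for bigger subsets:

  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.InputStreamReader;
  import java.util.Collections;

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.NameSample;
  import opennlp.tools.namefind.NameSampleDataStream;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.PlainTextByLineStream;

  public class TrainFrLocation {
      public static void main(String[] args) throws Exception {
          // Corpus chunks concatenated into one file in the OpenNLP
          // name finder training format (one annotated sentence per line).
          ObjectStream<String> lines = new PlainTextByLineStream(
                  new InputStreamReader(
                          new FileInputStream("fr-location.train"), "UTF-8"));
          ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

          // 50 iterations as above, cutoff 5 (the default).
          TokenNameFinderModel model = NameFinderME.train(
                  "fr", "location", samples,
                  Collections.<String, Object>emptyMap(), 50, 5);
          samples.close();

          model.serialize(new FileOutputStream("fr-ner-location.bin"));
      }
  }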

The corpus is quite noisy so the performance of the trained models is
not optimal (but better than nothing anyway). Here are the results of
evaluations on held-out chunks of the corpus (+/- 0.02); a sketch of
how to reproduce this kind of evaluation follows the numbers:

- Location:

Precision: 0.87
Recall: 0.74
F-Measure: 0.80

- Person:

Precision: 0.80
Recall: 0.68
F-Measure: 0.74

- Organization:

Precision: 0.80
Recall: 0.65
F-Measure: 0.72
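
Something like the following reproduces this kind of evaluation with
the Java API (sketch only; the held-out file name is made up):

  import java.io.FileInputStream;
  import java.io.InputStreamReader;

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.NameSample;
  import opennlp.tools.namefind.NameSampleDataStream;
  import opennlp.tools.namefind.TokenNameFinderEvaluator;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.PlainTextByLineStream;
  import opennlp.tools.util.eval.FMeasure;

  public class EvalFrLocation {
      public static void main(String[] args) throws Exception {
          TokenNameFinderModel model = new TokenNameFinderModel(
                  new FileInputStream("fr-ner-location.bin"));

          // A held-out chunk in the same format as the training data.
          ObjectStream<String> lines = new PlainTextByLineStream(
                  new InputStreamReader(
                          new FileInputStream("fr-location-heldout.txt"), "UTF-8"));
          ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

          TokenNameFinderEvaluator evaluator =
                  new TokenNameFinderEvaluator(new NameFinderME(model));
          evaluator.evaluate(samples);

          FMeasure fm = evaluator.getFMeasure();
          System.out.println("Precision: " + fm.getPrecisionScore());
          System.out.println("Recall:    " + fm.getRecallScore());
          System.out.println("F-Measure: " + fm.getFMeasure());
      }
  }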

If you would like to build new models for new entity types (based on
the DBpedia ontology) or for other languages, you can find some
documentation on how to fetch the data and set up a Hadoop / EC2
cluster here:

  https://github.com/ogrisel/pignlproc/blob/master/README.md
  https://github.com/ogrisel/pignlproc/wiki

The Pig scripts used to build these models are rather short and simple to understand:

  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/01_extract_sentences_with_links.pig
  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig
  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/03_join_by_type_and_convert.pig

I plan to give more details in a blog post soon (tm).

As always, any feedback is warmly welcome. If you think those Pig
utilities would blend into the OpenNLP project, both my employer
(Nuxeo) and I would be glad to contribute them to the project at the
ASF.

Cheers,

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Julien Nioche <li...@gmail.com>.
> > Interesting! I'll definitely have a closer look at this and see if / how
> > pignlproc could be a good match with Behemoth (
> > https://github.com/jnioche/behemoth). Speaking of which, I'll probably
> write
> > an openNLP wrapper for Behemoth at some point. Feel free to get in touch
> if
> > this is of interest.
>
> OpenNLP already features UIMA wrappers that could probably be used to
> run it on a Behemoth setup.


Or the GATE plugin, but it will not be up to date with the Apache version
of OpenNLP.


> However I really like the simplicity of
> pig w.r.t. a UIMA runtime that adds an intermediate java object (i.e.
> the CAS and its type system) that will further add pressure on the JVM
> garbage collector.


That's exactly the reason why I was considering embedding OpenNLP directly
into Behemoth and avoiding the overhead of the GATE / UIMA objects.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Olivier Grisel <ol...@ensta.org>.
2011/1/4 Julien Nioche <li...@gmail.com>:
> Hi,
>
> Interesting! I'll definitely have a closer look at this and see if / how
> pignlproc could be a good match with Behemoth (
> https://github.com/jnioche/behemoth). Speaking of which, I'll probably write
> an openNLP wrapper for Behemoth at some point. Feel free to get in touch if
> this is of interest.

OpenNLP already features UIMA wrappers that could probably be used to
run it on a Behemoth setup. However, I really like the simplicity of
Pig compared to a UIMA runtime, which adds an intermediate Java object
(i.e. the CAS and its type system) that puts further pressure on the
JVM garbage collector. Pig already supports optional type declarations
to optimize the processing when needed; otherwise data is just treated
as byte[]: no wrapping overhead nor useless memory allocations that
can ruin the GC.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Julien Nioche <li...@gmail.com>.
Hi,

Interesting! I'll definitely have a closer look at this and see if / how
pignlproc could be a good match with Behemoth (
https://github.com/jnioche/behemoth). Speaking of which, I'll probably write
an openNLP wrapper for Behemoth at some point. Feel free to get in touch if
this is of interest.

Oh, and congrats on the Apache Incubation!

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 13, 2011, at 10:55 AM, Jörn Kottmann wrote:

> On 1/11/11 2:21 PM, Olivier Grisel wrote:
>> 2011/1/4 Olivier Grisel<ol...@ensta.org>:
>>> I plan to give more details in a blog post soon (tm).
>> Here it is:
>> 
>>   http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
>> 
>> It gives a bit more context and some additional results and clues for
>> improvements and potential new usages.
>> 
> Now I read this post too, sounds very interesting.
> 
> What is the biggest training file for the name finder you can generate with this method?
> 
> I think we need MapReduce training support for OpenNLP. Actually that is already on my
> todo list, but currently I am still busy with the Apache migration and the next release.
> Anyway I hope we can get that done at least partially for the name finder this year.
> 

One of the things that I mentioned earlier is that it might make sense to just build on Mahout for this stuff.  We'd love to do MaxEnt, but we also have a lot of other classifiers (Bayes, SGD, random forests).  To me, if OpenNLP were abstracted a little bit from the classification algorithm, that would make it easier for people to plug in / try out their own, including the Pig stuff Olivier is suggesting.

-Grant


Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Jörn Kottmann <ko...@gmail.com>.
On 1/19/11 11:17 PM, Olivier Grisel wrote:
>> In that annotation project we could introduce the concept of
>> "atomic" annotations. That are annotations which are only considered as
>> correct in a part of the article. Some named entity annotations could maybe
>> directly
>> created from the wiki markup with an approach similar to the one
>> you used. And more could be produced by the community.
>> I guess it is possible to give these partial available named entities to our
>> name finder
>> to automatically label the rest of the article with a higher precision than
>> usual.
> It's worth a try but needs careful manual validation and evaluation of
> the quality.
>
Having these atomic annotations is, I think, very important for a
community labeling project, because it allows people to add information
only to the parts of the article where they are really sure it is
correct. There may be a few cases where they are unsure; with atomic
annotations they are not forced to label the whole article.
We have to see how exactly that could be done; it also depends on the
component. For the name finder it would be easy to do it at the
sentence level, or maybe even a mixture of document-level,
sentence-level and individual annotations is possible.

If the overall quality is good enough, training on semi-automatically
labeled articles could also be an option.

>> After we manually labeled a few hundred articles with entities we could even
>> go a step further and try to create new features for the name finder
>> which take the wiki markup into account (such a name finder could also help
>> your
>> project to process the whole wikipedia).
> Yes, it would be great to add new gazetteer features (names and
> alternative spelling for famous entities such as persons, places,
> organizations and so on) maybe in a compressed form using bloom
> filters:
>
>    http://en.wikipedia.org/wiki/Bloom_filter
Yes, having something like that would be really nice. There are other
interesting applications of bloom filters in NLP. Jason once pointed me
to a paper where they used bloom filters for language models.

+1 to work on that
>> If we start something like that it might be only useful for the tokenizer,
>> sentence
>> detector and name finder in a short term. Maybe over time it is even
>> possible to
>> add annotations for all the components we have in OpenNLP into this corpus.
>>
>> What do others think ?
> +1 overall
>
> We also need user-friendly tooling to quickly review / validate / fix an
> annotated corpus (rather than using vim or emacs).

Yes, this tooling should actually exceed the capabilities you could get
with a text editor, in the sense that the annotations in the text are
updated as soon as the user adds one. That way labeling will be sped up
dramatically. Articles often contain the same name over and over again,
and it is really boring to label a name 5-6 times when doing it once
should be enough to get the rest labeled automatically.
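
Just to make the propagation idea concrete, a toy sketch (not tied to
the Cas Editor or any existing tool; the helper is made up): once the
user marks one span, copy its label to every other literal occurrence
of the same token sequence and only ask for confirmation where needed.

  import java.util.ArrayList;
  import java.util.List;

  import opennlp.tools.util.Span;

  public class PropagateAnnotations {

      // Return a span for every literal occurrence of the labeled token
      // sequence in the article, carrying over the label type.
      static List<Span> propagate(String[] tokens, Span labeled) {
          int len = labeled.length();
          List<Span> result = new ArrayList<Span>();
          for (int i = 0; i + len <= tokens.length; i++) {
              boolean match = true;
              for (int j = 0; j < len; j++) {
                  if (!tokens[i + j].equals(tokens[labeled.getStart() + j])) {
                      match = false;
                      break;
                  }
              }
              if (match) {
                  result.add(new Span(i, i + len, labeled.getType()));
              }
          }
          return result;
      }

      public static void main(String[] args) {
          String[] tokens = {"Paris", "is", "nice", ",", "I", "love", "Paris", "."};
          // The user labeled the first occurrence only.
          for (Span s : propagate(tokens, new Span(0, 1, "location"))) {
              System.out.println(s);
          }
      }
  }

A real tool would of course have to handle case, token boundaries and
overlapping candidates, but that is the kind of assistance I mean.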

I am actually the author of the Cas Editor; maybe we could write a plugin
for it or start some completely new web-based tooling.

We also need annotation guidelines which explain what should be labeled
and what not.

Jörn



Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Olivier Grisel <ol...@ensta.org>.
2011/1/19 Jörn Kottmann <ko...@gmail.com>:
> A while back I started thinking about if wikinews could be
> used as a training source as part of a community annotation
> project over at OpenNLP. I guess your experience and your code
> would be really helpful to transform that data into a format
> we could use for such a project. Over time we would pull in the
> new articles to keep up with new topics.

+1

Using Wikinews instead of the Wikipedia dumps should require very little
(or even no) change to the existing sample scripts.

> In that annotation project we could introduce the concept of
> "atomic" annotations. That are annotations which are only considered as
> correct in a part of the article. Some named entity annotations could maybe
> directly
> created from the wiki markup with an approach similar to the one
> you used. And more could be produced by the community.
> I guess it is possible to give these partial available named entities to our
> name finder
> to automatically label the rest of the article with a higher precision than
> usual.

It's worth a try but needs careful manual validation and evaluation of
the quality.

> After we manually labeled a few hundred articles with entities we could even
> go a step further and try to create new features for the name finder
> which take the wiki markup into account (such a name finder could also help
> your
> project to process the whole wikipedia).

Yes, it would be great to add new gazetteer features (names and
alternative spellings for famous entities such as persons, places,
organizations and so on), maybe in a compressed form using bloom
filters:

  http://en.wikipedia.org/wiki/Bloom_filter

AFAIK there are already existing implementations of bloom filters in
Lucene, Hadoop and Cassandra.
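
For instance, a quick sketch with the Hadoop implementation (the sizing
numbers are just a guess, and in practice the names would come from the
DBpedia labels / redirects rather than being hard-coded):

  import java.nio.charset.StandardCharsets;

  import org.apache.hadoop.util.bloom.BloomFilter;
  import org.apache.hadoop.util.bloom.Key;
  import org.apache.hadoop.util.hash.Hash;

  public class BloomGazetteerSketch {
      public static void main(String[] args) {
          // ~10M bits and 7 hash functions: rough sizing for ~1M entries.
          BloomFilter gazetteer = new BloomFilter(10000000, 7, Hash.MURMUR_HASH);

          // Normalized surface forms for one entity type.
          for (String name : new String[] {"paris", "france", "lyon"}) {
              gazetteer.add(new Key(name.getBytes(StandardCharsets.UTF_8)));
          }

          // Candidate feature for the name finder: token is (probably) known.
          boolean hit = gazetteer.membershipTest(
                  new Key("paris".getBytes(StandardCharsets.UTF_8)));
          System.out.println("in gazetteer (false positives possible): " + hit);
      }
  }

The nice property is that the filter stays a few MB in memory whatever
the number of distinct surface forms, at the cost of a tunable false
positive rate.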

As for building NameFinders that take the wiki markup into account, I am
not sure how it could help. Better to get rid of it as soon as possible
IMHO :)

> If we start something like that it might be only useful for the tokenizer,
> sentence
> detector and name finder in a short term. Maybe over time it is even
> possible to
> add annotations for all the components we have in OpenNLP into this corpus.
>
> What do others think ?

+1 overall

We also need user-friendly tooling to quickly review / validate / fix an
annotated corpus (rather than using vim or emacs).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Jörn Kottmann <ko...@gmail.com>.
A while back I started thinking about whether Wikinews could be
used as a training source as part of a community annotation
project over at OpenNLP. I guess your experience and your code
would be really helpful to transform that data into a format
we could use for such a project. Over time we would pull in
new articles to keep up with new topics.

In that annotation project we could introduce the concept of
"atomic" annotations: annotations which are only considered correct in
a part of the article. Some named entity annotations could maybe be
created directly from the wiki markup with an approach similar to the
one you used, and more could be produced by the community.
I guess it is possible to give these partially available named entities
to our name finder to automatically label the rest of the article with
a higher precision than usual.

After we have manually labeled a few hundred articles with entities we
could even go a step further and try to create new features for the
name finder which take the wiki markup into account (such a name finder
could also help your project to process the whole of Wikipedia).

If we start something like that, it might only be useful for the
tokenizer, sentence detector and name finder in the short term. Maybe
over time it is even possible to add annotations for all the components
we have in OpenNLP into this corpus.

What do others think ?

Jörn

On 1/13/11 6:06 PM, Olivier Grisel wrote:
> 2011/1/13 Jörn Kottmann<ko...@gmail.com>:
>> On 1/11/11 2:21 PM, Olivier Grisel wrote:
>>> 2011/1/4 Olivier Grisel<ol...@ensta.org>:
>>>> I plan to give more details in a blog post soon (tm).
>>> Here it is:
>>>
>>> http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
>>>
>>> It gives a bit more context and some additional results and clues for
>>> improvements and potential new usages.
>>>
>> Now I read this post too, sounds very interesting.
>>
>> What is the biggest training file for the name finder you can generate with
>> this method?
> It depends on the class of the entity you are interested in and the
> language of the dump. For instance for the pair (person / French) I
> have more than 600k sentences. For English it is gonna be much bigger.
> For entity class such as "Drug" or "Protein" this is much lower (I
> would say a couple of thousands of sentences).
>
> I trained my French models on my laptop with limited memory (2GB
> allocated to the heapspace) hence I stopped at ~100k sentences in the
> training file to avoid GC trashing. On Amazon EC2 instances with more
> 10GB RAM I guess you could train a model on 500k sentences and test it
> on the remaining 100k sentences for instance. For such scales average
> perceptron learners or SGD-based logistic regression model as
> implemented in Apache Mahout would probably be faster to train than
> the current MaxEnt impl.
>
>> I think we need MapReduce training support for OpenNLP. Actually that is
>> already on my todo list, but currently I am still busy with the Apache migration and the
>> next release.
> Alright no hurry. Please ping me as soon as you are ready to discuss this.
>
>> Anyway I hope we can get that done at least partially for the name finder
>> this year.
> Great :)
>


Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Olivier Grisel <ol...@ensta.org>.
2011/1/13 Jörn Kottmann <ko...@gmail.com>:
> On 1/11/11 2:21 PM, Olivier Grisel wrote:
>>
>> 2011/1/4 Olivier Grisel<ol...@ensta.org>:
>>>
>>> I plan to give more details in a blog post soon (tm).
>>
>> Here it is:
>>
>> http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
>>
>> It gives a bit more context and some additional results and clues for
>> improvements and potential new usages.
>>
> Now I read this post too, sounds very interesting.
>
> What is the biggest training file for the name finder you can generate with
> this method?

It depends on the class of the entity you are interested in and the
language of the dump. For instance, for the pair (person / French) I
have more than 600k sentences. For English it is going to be much
bigger. For entity classes such as "Drug" or "Protein" it is much lower
(I would say a couple of thousand sentences).

I trained my French models on my laptop with limited memory (2GB
allocated to the heap space), hence I stopped at ~100k sentences in the
training file to avoid GC thrashing. On Amazon EC2 instances with more
than 10GB RAM I guess you could train a model on 500k sentences and
test it on the remaining 100k sentences, for instance. At such scales,
averaged perceptron learners or SGD-based logistic regression models as
implemented in Apache Mahout would probably be faster to train than the
current MaxEnt implementation.

> I think we need MapReduce training support for OpenNLP. Actually that is
> already on my todo list, but currently I am still busy with the Apache migration and the
> next release.

Alright no hurry. Please ping me as soon as you are ready to discuss this.

> Anyway I hope we can get that done at least partially for the name finder
> this year.

Great :)

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Jörn Kottmann <ko...@gmail.com>.
On 1/11/11 2:21 PM, Olivier Grisel wrote:
> 2011/1/4 Olivier Grisel<ol...@ensta.org>:
>> I plan to give more details in a blog post soon (tm).
> Here it is:
>
>    http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
>
> It gives a bit more context and some additional results and clues for
> improvements and potential new usages.
>
Now I read this post too, sounds very interesting.

What is the biggest training file for the name finder you can generate 
with this method?

I think we need MapReduce training support for OpenNLP. Actually that is 
already on my
todo list, but currently I am still busy with the Apache migration and 
the next release.
Anyway I hope we can get that done at least partially for the name 
finder this year.

Jörn

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Olivier Grisel <ol...@ensta.org>.
2011/1/4 Olivier Grisel <ol...@ensta.org>:
>
> I plan to give more details in a blog post soon (tm).

Here it is:

  http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html

It gives a bit more context and some additional results and clues for
improvements and potential new usages.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 5, 2011, at 12:22 PM, Jörn Kottmann wrote:

> On 1/5/11 4:44 PM, Olivier Grisel wrote:
>> 2011/1/5 Jason Baldridge<jb...@mail.utexas.edu>:
>>> This looks great, and it aligns with my own recent interest in large scale
>>> NLP with Hadoop, including working with Wikipedia. I'll look at it more
>>> closely later, but in principle I would be interested in having this brought
>>> into the OpenNLP project in some way!
>> Thanks for your interest. Don't hesitate to fork the repo on github to
>> experiment with your own design ideas. OpenNLP methods often handle
>> String[][] and Span[] data-structures where span start and end index
>> either refer to char positions or token indices. It might be
>> interesting make some generic wrappers for those data-structures from
>> / to pig tuples by taking care of not reallocating memory when not
>> necessary.
>> 
> 
> Making OpenNLP faster is always nice, I believe we should one day
> go away from String and use CharSequence instead, because that usually

In Lucene, we have recently converted to an all byte-based approach and it has been pretty significant in terms of speedup.

-Grant


Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Jörn Kottmann <ko...@gmail.com>.
On 1/5/11 4:44 PM, Olivier Grisel wrote:
> 2011/1/5 Jason Baldridge<jb...@mail.utexas.edu>:
>> This looks great, and it aligns with my own recent interest in large scale
>> NLP with Hadoop, including working with Wikipedia. I'll look at it more
>> closely later, but in principle I would be interested in having this brought
>> into the OpenNLP project in some way!
> Thanks for your interest. Don't hesitate to fork the repo on github to
> experiment with your own design ideas. OpenNLP methods often handle
> String[][] and Span[] data-structures where span start and end index
> either refer to char positions or token indices. It might be
> interesting make some generic wrappers for those data-structures from
> / to pig tuples by taking care of not reallocating memory when not
> necessary.
>

Making OpenNLP faster is always nice. I believe we should one day
move away from String and use CharSequence instead, because that usually
avoids a memory copy. It might also be easy to integrate with Pig (I have
never used Pig myself).
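
To make the CharSequence point concrete, a tiny sketch (this is not the
current OpenNLP signature, just a hypothetical CharSequence-based
method): a CharBuffer view over a big char[] implements CharSequence
without copying, while building a String always copies the characters.

  import java.nio.CharBuffer;

  public class CharSequenceSketch {

      // Hypothetical CharSequence-based API.
      static int countWhitespace(CharSequence text) {
          int n = 0;
          for (int i = 0; i < text.length(); i++) {
              if (Character.isWhitespace(text.charAt(i))) {
                  n++;
              }
          }
          return n;
      }

      public static void main(String[] args) {
          char[] article = "Mining Wikipedia with Hadoop and Pig".toCharArray();

          CharSequence view = CharBuffer.wrap(article, 7, 9);  // "Wikipedia", no copy
          String copy = new String(article, 7, 9);             // copies 9 chars

          System.out.println(countWhitespace(view) + " / " + countWhitespace(copy));
      }
  }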

Jörn

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Olivier Grisel <ol...@ensta.org>.
2011/1/5 Jason Baldridge <jb...@mail.utexas.edu>:
> This looks great, and it aligns with my own recent interest in large scale
> NLP with Hadoop, including working with Wikipedia. I'll look at it more
> closely later, but in principle I would be interested in having this brought
> into the OpenNLP project in some way!

Thanks for your interest. Don't hesitate to fork the repo on GitHub to
experiment with your own design ideas. OpenNLP methods often handle
String[][] and Span[] data structures, where the span start and end
indices refer either to char positions or to token indices. It might be
interesting to make some generic wrappers for those data structures
from / to Pig tuples, taking care not to reallocate memory when not
necessary.
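
To make that a bit more concrete, such a wrapper could be a plain Pig
UDF along these lines (untested sketch; the class name, the model path
and the way the model file gets shipped to the workers are all
assumptions), which could then be called from grunt on a bag of tokens
per sentence:

  import java.io.FileInputStream;
  import java.io.IOException;

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.util.Span;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.BagFactory;
  import org.apache.pig.data.DataBag;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.data.TupleFactory;

  // Takes a bag of tokens for one sentence, returns a bag of
  // (start, end, type) spans found by the name finder.
  public class NameFinderUdf extends EvalFunc<DataBag> {

      private NameFinderME finder;

      private NameFinderME getFinder() throws IOException {
          if (finder == null) {
              // Assumes the model file has been shipped to the workers
              // (e.g. via the distributed cache) into the working directory.
              finder = new NameFinderME(new TokenNameFinderModel(
                      new FileInputStream("fr-ner-location.bin")));
          }
          return finder;
      }

      @Override
      public DataBag exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0) {
              return null;
          }
          DataBag tokenBag = (DataBag) input.get(0);
          String[] tokens = new String[(int) tokenBag.size()];
          int i = 0;
          for (Tuple t : tokenBag) {
              tokens[i++] = (String) t.get(0);
          }

          DataBag out = BagFactory.getInstance().newDefaultBag();
          for (Span span : getFinder().find(tokens)) {
              Tuple t = TupleFactory.getInstance().newTuple(3);
              t.set(0, span.getStart());
              t.set(1, span.getEnd());
              t.set(2, span.getType());
              out.add(t);
          }
          return out;
      }
  }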

Mining a medium / large scale corpus in an almost interactive way
with the Pig shell (grunt) is a great way to quickly test ideas and
prototypes that tap into the unreasonable effectiveness of data.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Posted by Jason Baldridge <jb...@mail.utexas.edu>.
This looks great, and it aligns with my own recent interest in large scale
NLP with Hadoop, including working with Wikipedia. I'll look at it more
closely later, but in principle I would be interested in having this brought
into the OpenNLP project in some way!




-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://comp.ling.utexas.edu/people/jason_baldridge