Posted to users@opennlp.apache.org by Lance Norskog <go...@gmail.com> on 2012/08/27 05:41:25 UTC

Named Entity resource, with duplication database

This crew has 1/2 million names from European media and Wikipedia,
along with a concordance of variant spellings for the same
person/place. 233k unique names in total, including 497 for the late
Moammar Ghaddafi and 324 for Mahmoud Achmedinajad. 40k organizations,
472k people, and 16 "other". (Apparently a dodgy political party is
its own kind of thing.)

The license is: no redistribution of the list or derivative works.
Meaning, OpenNLP could include software to download the list, tag a
corpus with it, and process the result into an NER model. The software
would be rather difficult: it has to find the longest unique string
sequences from a huge input database of strings.
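
To make the matching problem concrete, here is a minimal sketch of greedy
longest-match tagging against a table of name variants. Everything in it is
an assumption for illustration: the function names, the variant-to-id
dictionary, and the whitespace tokenization are hypothetical, and it does not
reflect the actual JRC-Names file format or any OpenNLP API.

```python
from typing import Dict, List, Tuple

Index = Dict[Tuple[str, ...], str]


def build_index(variants: Dict[str, str]) -> Tuple[Index, int]:
    """Map each lowercased, whitespace-tokenized variant to its entity id.

    Returns the index and the token length of the longest variant.
    The {variant: entity_id} input shape is a hypothetical stand-in
    for whatever the downloaded list really looks like.
    """
    index: Index = {}
    max_len = 0
    for name, entity_id in variants.items():
        key = tuple(name.lower().split())
        if not key:
            continue  # skip empty entries
        index[key] = entity_id
        max_len = max(max_len, len(key))
    return index, max_len


def tag_longest_match(
    tokens: List[str], index: Index, max_len: int
) -> List[Tuple[int, int, str]]:
    """Scan left to right, preferring the longest variant at each position.

    Returns (start, end, entity_id) spans with end exclusive.
    """
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in index:
                spans.append((i, i + n, index[key]))
                i += n  # continue after the matched span
                matched = True
                break
        if not matched:
            i += 1
    return spans


# Made-up sample data: two spellings sharing one hypothetical id.
sample_variants = {
    "Moammar Ghaddafi": "E1",
    "Muammar Gaddafi": "E1",
    "Gaddafi": "E1",
    "Mahmoud Achmedinajad": "E2",
}
sample_index, sample_max = build_index(sample_variants)
sample_tokens = "Reports said Muammar Gaddafi met Mahmoud Achmedinajad".split()
sample_spans = tag_longest_match(sample_tokens, sample_index, sample_max)
```

The spans could then be turned into training annotations for a name finder.
A dictionary of token tuples is the simplest thing that works; with 1/2
million names a trie or an Aho-Corasick automaton would likely be a better
fit, and fuzzy spelling variation is not handled here at all.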

http://langtech.jrc.it/JRC-Names.html

They've got a bunch of cool NLP tools also.

-- 
Lance Norskog
goksron@gmail.com