Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2018/12/10 14:34:00 UTC

[jira] [Commented] (OPENNLP-1223) Add NameFinder model based on Tiger

    [ https://issues.apache.org/jira/browse/OPENNLP-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714792#comment-16714792 ] 

Joern Kottmann commented on OPENNLP-1223:
-----------------------------------------

The Apache License allows commercial use, but the license of the Tiger corpus says: "Use of the corpus or use of data derived from the corpus for any commercial purposes requires explicit written agreement of Licenser."

A model trained on Tiger would be data derived from the corpus, so I think we would need such a written agreement.

> Add NameFinder model based on Tiger
> -----------------------------------
>
>                 Key: OPENNLP-1223
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1223
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: language model
>            Reporter: J. Fiala
>            Priority: Major
>         Attachments: tiger_2.2_namefinder.bin.7z, tiger_2.2_namefinder.testdata.txt, tiger_2.2_namefinder_all.bin_20181014.bin.7z, tiger_2.2_namefinder_eval.txt
>
>
> Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart - www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
>  
> 1.) add model based on tiger (/)
> >>> generated from 6,271 sentences with tagged names (always given name + surname).
> 2.) add a few test sentences (/)
> 3.) add small evaluation file (/)
>  
> h3. Input data
>  * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
>  www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
>  * yagoLabels.tsv.7z (Max Planck Institute)
>  [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]
> h3. Basic workflow
> 1.) Extract sentences from the Tiger treebank that contain possible names (two consecutive words tagged as NE)
> 2.) Check whether each possible name includes a given name, based on the YAGO labels database (the given name is assumed to be the first word)
> 3.) If the first word is listed in the YAGO labels as givenName, tag the two words as a person name
> 4.) Train with the full data set (50,472 sentences, including sentences without names)
> 5.) Evaluate with the person data set (6,271 sentences)
> >>> JF 14.10.: see updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z
> h3. Open questions
> I first extracted the 6,271 sentences that mention names and trained on that (filtered) data only. Would it be better to train on the complete data set (including the sentences without names)? (/)
> >>> JF 14.10.: added steps 4 + 5
> h3. Results
> Results from step 5 above:
> Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct: 7644.
>         TOTAL: precision:   99,77%;  recall:   99,80%; F1:   99,78%.
>        person: precision:   99,77%;  recall:   99,80%; F1:   99,78%. [target: 7659; tp: 7644; fp:  18]
>  
> h3. Further Improvements:
> 1.) Some tagged names actually refer to locations and need to be refined (e.g. San Juan):
> Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint <START:person> Salvador Lopez <END> Gonzalez , das Oberhaupt von <START:person> San Juan <END> <START:person> Juan Chamula <END> , einem pittoresken Ort hoch in den Bergen von .).
> 2.) Add support for names with more than two words (e.g. Salvador Lopez Gonzalez above).
> 3.) Check for context-sensitive non-name matches (e.g. "General")
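
For readers following the quoted "Basic workflow", steps 1 to 3 can be sketched roughly as below. This is only an illustration, not the reporter's actual script: the function name `tag_person_names`, the (word, POS tag) input layout, the toy sentence, and the one-entry given-name set are all assumptions. It scans left to right without overlapping windows, which may differ from how the attached model's training data was produced.

```python
def tag_person_names(tokens, given_names):
    """Wrap two consecutive NE-tagged words in <START:person> ... <END>
    when the first word is a known given name.

    tokens: list of (word, pos_tag) pairs for one sentence (assumed layout).
    given_names: set of given names, e.g. loaded from the YAGO labels file.
    """
    out = []
    i = 0
    while i < len(tokens):
        word, pos = tokens[i]
        # Step 1: a candidate is two words in sequence tagged as NE.
        is_pair = (
            pos == "NE"
            and i + 1 < len(tokens)
            and tokens[i + 1][1] == "NE"
        )
        # Steps 2 + 3: the first word must be a known given name.
        if is_pair and word in given_names:
            out.append(f"<START:person> {word} {tokens[i + 1][0]} <END>")
            i += 2
        else:
            out.append(word)
            i += 1
    return " ".join(out)


# Toy input (made up for illustration, not taken from the corpus).
sentence = [
    ("Salvador", "NE"), ("Lopez", "NE"), ("Gonzalez", "NE"),
    ("lebt", "VVFIN"), ("in", "APPR"), ("Chamula", "NE"), (".", "$."),
]
tagged = tag_person_names(sentence, given_names={"Salvador"})
print(tagged)
```

On the toy sentence only the first two words are wrapped and "Gonzalez" is left outside the span, which is the two-word limitation listed under "Further Improvements".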

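The figures in the quoted "Results" section follow from the three reported counts (target, found, correct). A minimal check, assuming the standard definitions of precision, recall and F1; the helper name `prf` is hypothetical:

```python
def prf(target, found, correct):
    """Return (precision, recall, F1) from entity counts."""
    precision = correct / found   # share of predicted entities that are right
    recall = correct / target     # share of reference entities that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported in the issue: target 7659, found 7662, correct 7644.
p, r, f1 = prf(target=7659, found=7662, correct=7644)
false_positives = 7662 - 7644
print(f"precision: {p:.2%}  recall: {r:.2%}  F1: {f1:.2%}  fp: {false_positives}")
```

Since the evaluation set is the filtered subset of sentences that contain tagged names and is itself part of the full training data (steps 4 and 5), these numbers describe fit on seen data, not performance on held-out text.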


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)