Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/11/03 09:26:12 UTC

[jira] [Updated] (STANBOL-792) Extend the NamedEntityExtraction engine to support custom NameFinder Models

     [ https://issues.apache.org/jira/browse/STANBOL-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler updated STANBOL-792:
----------------------------------------

    Description: 
This adds a new NER engine that allows custom NER models (OpenNLP TokenNameFinderModels) to be configured.

The configuration uses two properties:

* __Name Finder Models__ _(stanbol.engines.opennlp-ner.nameFinderModels)_: The list of custom NameFinderModels used by this engine. The engine supports arrays, Vectors, and comma-separated strings for this property. Values are the file names of the NameFinderModel files. Configured files are loaded by using the DataFileProvider service. That means that files copied into the 'datafiles' folder (by default located at '{stanbol-working-dir}/stanbol/datafiles') can be loaded by the engine.
* __Named Entity to 'dc:type' Mappings__ _(stanbol.engines.opennlp-ner.typeMappings)_: This configuration uses the syntax "{named-entity-type} > {uri}": {named-entity-type} matches the name string used for the named entity type in the OpenNLP NameFinder model. {uri} MUST be a valid URI and is used as the dc:type value for fise:TextAnnotations created by the engine for extracted Named Entities. NOTE: TextAnnotations for unmapped Named Entity Types will have no dc:type information.
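
To illustrate how the two property values described above could be interpreted, here is a minimal Java sketch. It is not the engine's actual code: the class `NerConfigSketch` and its methods are hypothetical names, and the decision to reject relative URIs is an assumption made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Stanbol.
final class NerConfigSketch {

    /** Normalizes an array, a Collection (e.g. a Vector) or a comma-separated string to a list. */
    static List<String> toList(Object value) {
        List<String> result = new ArrayList<>();
        if (value instanceof Object[]) {
            for (Object o : (Object[]) value) addTrimmed(result, o);
        } else if (value instanceof Collection<?>) {
            for (Object o : (Collection<?>) value) addTrimmed(result, o);
        } else if (value != null) {
            for (String s : value.toString().split(",")) addTrimmed(result, s);
        }
        return result;
    }

    private static void addTrimmed(List<String> target, Object o) {
        if (o != null) {
            String s = o.toString().trim();
            if (!s.isEmpty()) target.add(s);
        }
    }

    /** Parses "{named-entity-type} > {uri}" entries into a type-to-URI map. */
    static Map<String, URI> parseTypeMappings(Collection<String> mappings) {
        Map<String, URI> result = new LinkedHashMap<>();
        for (String mapping : mappings) {
            int sep = mapping.indexOf('>');
            if (sep < 1) {
                throw new IllegalArgumentException("Missing '>' separator: " + mapping);
            }
            String type = mapping.substring(0, sep).trim();
            String uriString = mapping.substring(sep + 1).trim();
            if (type.isEmpty() || uriString.isEmpty()) {
                throw new IllegalArgumentException("Empty type or URI: " + mapping);
            }
            URI uri;
            try {
                uri = new URI(uriString);
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException("Invalid URI: " + uriString, e);
            }
            // Assumption of this sketch: only absolute URIs are accepted.
            if (!uri.isAbsolute()) {
                throw new IllegalArgumentException("URI is not absolute: " + uriString);
            }
            result.put(type, uri);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toList("bionlp2004-DNA-en.bin, bionlp2004-RNA-en.bin"));
        System.out.println(parseTypeMappings(
                Arrays.asList("DNA > http://www.bootstrep.eu/ontology/GRO#DNA")));
    }
}
```

Named entity types that do not appear in the resulting map would simply have no dc:type value, which matches the note above.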


Example:

The following configuration uses the '.config' format and needs to be provided to the Sling FileInstaller (by default {stanbol-working-dir}/stanbol/fileinstall) in a file with a name similar to 'org.apache.stanbol.enhancer.engines.opennlp.impl.CustomNERModelEnhancementEngine-{component-instance-name}.config':

    stanbol.enhancer.engine.name="ehealth-ner"
    stanbol.engines.opennlp-ner.nameFinderModels=["bionlp2004-DNA-en.bin","bionlp2004-protein-en.bin","bionlp2004-cell_type-en.bin","bionlp2004-cell_line-en.bin","bionlp2004-RNA-en.bin"]
    stanbol.engines.opennlp-ner.typeMappings=["DNA\ >\ http://www.bootstrep.eu/ontology/GRO#DNA","RNA\ >\ http://www.bootstrep.eu/ontology/GRO#RNA","protein\ >\ http://www.bootstrep.eu/ontology/GRO#Protein","cell_type\ >\ http://purl.bioontology.org/ontology/CL","cell_line\ >\ http://purl.bioontology.org/ontology/MCCL"]

NOTE: The '.config' format requires spaces to be escaped with '\'.
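
As a small illustration of the escaping rule, the following Java sketch builds one array-valued '.config' line from unescaped values. The class `ConfigLineSketch` and its methods are hypothetical. The sketch handles only the space escaping mentioned in the note and assumes the values contain no quotes or backslashes, which may need additional escaping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of Stanbol or Sling.
final class ConfigLineSketch {

    /** Escapes every space with a backslash, as the note above requires. */
    static String escapeSpaces(String value) {
        return value.replace(" ", "\\ ");
    }

    /** Builds a line of the form property=["v1","v2"] with spaces escaped in each value. */
    static String toConfigLine(String property, List<String> values) {
        StringJoiner joiner = new StringJoiner(",", property + "=[", "]");
        for (String value : values) {
            joiner.add("\"" + escapeSpaces(value) + "\"");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toConfigLine(
                "stanbol.engines.opennlp-ner.typeMappings",
                Arrays.asList("DNA > http://www.bootstrep.eu/ontology/GRO#DNA")));
    }
}
```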

Documentation of the Engine is available at http://stanbol.apache.org/docs/trunk/components/enhancer/engines/customnermodelengine.html

  was:
The NER engine needs the mappings for an {ner-model} to its {language}
and the extracted {entity-type}. Currently this works by a constant
defining the mappings for persons, organizations and places. NLP
models are loaded by using the OpenNLP service (defined by the
o.a.stanbol.commons.opennlp module).

To configure additional models and types, I would suggest adding an
additional configuration property that uses the following syntax:

    {model-file-name};lang={language};type={entity-type}

The OpenNLP TokenNameFinderModel would be loaded from the configured
"{model-file-name}" via the Stanbol DataFileProvider service.
Practically, this means that users would need to copy their custom
models to the "{stanbol.home}/datafiles" directory.

The language parameter "lang={language}" would specify the language
supported by this model. The "type={entity-type}" parameter would
specify the dc-type value set for fise:TextAnnotations created for
named entities extracted by the model.

This new feature should also allow overriding the currently used defaults, e.g. by specifying

    myCustomPersonModel_en.zip;lang=en;type=dbp-ont:Person

This would override the default configuration for loading the Person NameFinder model for English.
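
For reference, the syntax proposed in this earlier description could be parsed roughly as in the Java sketch below. The class `ModelSpecSketch` is a hypothetical name used only for illustration, and the sketch assumes that parameters are plain key=value pairs separated by ';'.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating "{model-file-name};lang={language};type={entity-type}".
final class ModelSpecSketch {
    final String fileName;
    final Map<String, String> params;

    private ModelSpecSketch(String fileName, Map<String, String> params) {
        this.fileName = fileName;
        this.params = params;
    }

    /** Splits the line at ';' and reads each following part as a key=value parameter. */
    static ModelSpecSketch parse(String line) {
        String[] parts = line.split(";");
        String fileName = parts[0].trim();
        if (fileName.isEmpty()) {
            throw new IllegalArgumentException("Missing model file name: " + line);
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            String key = eq < 0 ? "" : parts[i].substring(0, eq).trim();
            if (key.isEmpty()) {
                throw new IllegalArgumentException("Expected key=value but found: " + parts[i]);
            }
            params.put(key, parts[i].substring(eq + 1).trim());
        }
        return new ModelSpecSketch(fileName, params);
    }

    public static void main(String[] args) {
        ModelSpecSketch spec = parse("myCustomPersonModel_en.zip;lang=en;type=dbp-ont:Person");
        System.out.println(spec.fileName + " " + spec.params);
    }
}
```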


    
> Extend the NamedEntityExtraction engine to support custom NameFinder Models
> ---------------------------------------------------------------------------
>
>                 Key: STANBOL-792
>                 URL: https://issues.apache.org/jira/browse/STANBOL-792
>             Project: Stanbol
>          Issue Type: Sub-task
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
