Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/11/03 11:59:01 UTC

Re: Opennlp NER ...

Hi

The implementation of the CustomNERModelEnhancementEngine
(STANBOL-792) is now available. The documentation can be found at [1].

I also updated the eHealth demo ("{stanbol-trunk}/demo/ehealth") to
use the new engine with 5 custom NER models for DNA, RNA, Proteins,
Cell Type and Cell Line, based on the BioNLP2004 dataset [2]. When you
build (mvn clean install) and install the eHealth demo bundle
(org.apache.stanbol.demo.ehealth-0.10.1-SNAPSHOT.jar) to the Stanbol
Launcher (revision > 1405306), then you can test the engine with the
chain http://localhost:8080/enhancer/chain/ehealth-ner
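
For illustration, a minimal Java sketch of sending text to that chain. It
assumes a locally running launcher with the eHealth bundle installed, and
that the chain accepts a plain-text POST and can answer in Turtle, as the
Stanbol Enhancer REST endpoint generally does. The class name, the sample
sentence and the chosen Accept header are assumptions made for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper for this example; not part of Stanbol. */
final class EhealthNerRequest {

    // Chain URL taken from the mail above.
    static final URI CHAIN =
            URI.create("http://localhost:8080/enhancer/chain/ehealth-ner");

    /** Builds (but does not send) a POST request carrying the text to enhance. */
    static HttpRequest build(String text) {
        return HttpRequest.newBuilder(CHAIN)
                .header("Content-Type", "text/plain")
                .header("Accept", "text/turtle") // assumed response format
                .POST(HttpRequest.BodyPublishers.ofString(text))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample sentence invented for this example.
        HttpRequest request = build("The p53 protein is expressed in T cells.");
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP " + response.statusCode());
            System.out.println(response.body());
        } catch (IOException e) {
            // Typically means that no launcher is listening on localhost:8080.
            System.err.println("Could not reach the Stanbol launcher: " + e);
        }
    }
}
```

The response body should contain the enhancement results as RDF, including
the fise:TextAnnotations created for the named entities that were found.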

@Andrea: I was not able to test the engine with NER models that
extract multiple entity types, as I was not able to find/build such a
model for testing. So if you find any issues regarding that, please
report them.

I don't think I will have time to work on STANBOL-793 in the coming days,
as ApacheCon is around the corner.

best
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/customnermodelengine.html
[2] http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html

On Wed, Oct 31, 2012 at 5:22 PM, Rupert Westenthaler
<ru...@gmail.com> wrote:
> Hi
>
> Just to let you know that I can confirm that the type of the Named
> Entity is indeed provided by the Span#getType() method. So models for
> multiple Named Entity types are also supported by the Java API.
>
> best
> Rupert
>
> On Wed, Oct 31, 2012 at 3:45 PM, Rupert Westenthaler
> <ru...@gmail.com> wrote:
>> On Wed, Oct 31, 2012 at 3:31 PM, Andrea Taurchini <at...@gmail.com> wrote:
>>> Dear Rupert,
>>> thanks again.
>>> Uhmmm ... using tokennamefinder from the opennlp command line, if you use a
>>> multitype trained model then you get a multitype tagged output ... as for the
>>> API .find method, I suppose it works the way you told me (one type per model ??).
>>>
>>
>> Maybe Span#getType() returns the type of the found entity. I will
>> try this out. If this really provides the different types, then the
>> configuration will look like
>>
>>     {model-file-name};language={language};{type}={type-uri};{type2}={type-uri2};...
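
As an illustration of the syntax proposed above, here is a minimal Java
sketch that parses one such configuration line. The class NerModelConfig,
the model file name and the type URIs are hypothetical and not taken from
the Stanbol code base. The sketch assumes that the first ";"-separated
token is the model file name, that "language" is a reserved key, and that
every other key=value pair maps a type name (as Span#getType() would
return it) to a type URI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for one parsed configuration line; not part of Stanbol. */
final class NerModelConfig {
    final String modelFile;
    final String language;               // null if no language parameter was given
    final Map<String, String> typeUris;  // type name -> type URI

    private NerModelConfig(String modelFile, String language,
                           Map<String, String> typeUris) {
        this.modelFile = modelFile;
        this.language = language;
        this.typeUris = typeUris;
    }

    /** Parses "{model-file-name};language={language};{type}={type-uri};...". */
    static NerModelConfig parse(String line) {
        String[] parts = line.split(";");
        String modelFile = parts[0].trim();
        if (modelFile.isEmpty()) {
            throw new IllegalArgumentException("Missing model file name: " + line);
        }
        String language = null;
        Map<String, String> typeUris = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.isEmpty()) {
                continue; // tolerate empty segments such as ";;"
            }
            // Split at the first '=' only, so a URI may itself contain '='.
            int eq = part.indexOf('=');
            if (eq <= 0 || eq == part.length() - 1) {
                throw new IllegalArgumentException("Expected key=value but got: " + part);
            }
            String key = part.substring(0, eq).trim();
            String value = part.substring(eq + 1).trim();
            if (key.equals("language")) {
                language = value;
            } else {
                typeUris.put(key, value);
            }
        }
        return new NerModelConfig(modelFile, language, typeUris);
    }

    public static void main(String[] args) {
        NerModelConfig config = parse(
                "bio-ner.bin;language=en;DNA=http://example.org/DNA;RNA=http://example.org/RNA");
        System.out.println(config.modelFile + " [" + config.language + "] " + config.typeUris);
    }
}
```

With such a map, the type of each found Span could be looked up to pick
the URI for the created annotation. One limitation of this simple split:
a URI that itself contains ";" would need escaping or a different
separator.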
>>
>> BTW I already created
>> https://issues.apache.org/jira/browse/STANBOL-792 for this feature.
>>
>>> Forgive me if I'm being silly, but I can't see how I can add a configuration
>>> property under the configuration tab of the Felix WC.
>>>
>>
>> The form you see in the configuration tab is generated from an XML file in
>> the bundle, and this XML file is generated from the @Property annotations
>> in the implementation of the engine. So as soon as these new
>> configuration options are implemented, you will see the corresponding
>> options in the form.
>>
>>
>>> Thanks and best regards,
>>> Andrea
>>>
>>>
>>>
>>>
>>>
>>> 2012/10/31 Rupert Westenthaler <ru...@gmail.com>
>>>
>>>> Hi
>>>>
>>>> On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini <at...@gmail.com>
>>>> wrote:
>>>> > Dear Rupert,
>>>> > as always thanks for your support.
>>>> > Is it possible to use a single model file to detect multiple dc-types ... or
>>>> > should I add more than one configuration property, each with the same model
>>>> > file but a different dc-type ... or else should I produce different model
>>>> > files?
>>>>
>>>> If this is possible with OpenNLP, then for sure, but AFAIK the
>>>> "opennlp.tools.namefind.NameFinderME#find(..)" method only provides the
>>>> token spans and probability. So it tells you only that you have found
>>>> a Named Entity from tokenA to tokenB and not the type of the Named
>>>> Entity.
>>>>
>>>> While I can imagine that one can train a model that detects different
>>>> types of entities, you will not know the specific type of a found
>>>> named entity. So found Entities may have any of the trained types.
>>>>
>>>> So if you want to distinguish between NamedEntities of the different
>>>> types you will need to train separate models.
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> > However ... where do I have to set this configuration property (^_^) ?
>>>> > Through the OSGi admin ?
>>>>
>>>> Using the configuration tab of the Felix Web Console is only one
>>>> option. There are also other possibilities to provide configurations.
>>>> You can also provide configuration files to the Sling FileInstaller as
>>>> described at [1] and soon also under the new "Production" section on
>>>> the Stanbol webpage (currently only available on the staging server
>>>> [2])
>>>>
>>>>
>>>>
>>>> [1] http://markmail.org/message/jpxpl6x4nkmz6kda
>>>> [2] http://stanbol.staging.apache.org/production/partial-updates.html
>>>>
>>>> >
>>>> > Thanks a lot.
>>>> >
>>>> > Kindest regards,
>>>> > Andrea
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > 2012/10/31 Rupert Westenthaler <ru...@gmail.com>
>>>> >
>>>> >> Hi Andrea,
>>>> >>
>>>> >> On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini <ataurchini@gmail.com> wrote:
>>>> >> > Dear All,
>>>> >> > I developed my own models for NER based on OpenNLP.
>>>> >> > Within these models I have more entities than person, organization and
>>>> >> > places ... will Stanbol enhance text using these added entities ?
>>>> >> >
>>>> >>
>>>> >> Currently both the OpenNLP NER engine and the
>>>> >> NamedEntityLinkingEngine can only handle Persons, Organizations and
>>>> >> Places. In their current form you will not be able to use them to link
>>>> >> other types.
>>>> >>
>>>> >> For both engines this is mainly because of the configuration. So
>>>> >> extending those engines to support other (or better, arbitrary
>>>> >> configurable) types would require extending the engines' configuration
>>>> >> options. In the following I will try to describe the necessary
>>>> >> extensions.
>>>> >>
>>>> >> ## OpenNLP NER engine
>>>> >>
>>>> >> The NER engine needs the mappings for an {ner-model} to its {language}
>>>> >> and the extracted {entity-type}. Currently this works by a constant
>>>> >> defining the mappings for persons, organizations and places. NLP
>>>> >> models are loaded by using the OpenNLP service (defined by the
>>>> >> o.a.stanbol.commons.opennlp module).
>>>> >>
>>>> >> To configure additional models and types I would suggest adding an
>>>> >> additional configuration property that uses the following syntax:
>>>> >>
>>>> >>     {model-file-name};lang={language};type={entity-type}
>>>> >>
>>>> >> The OpenNLP TokenNameFinderModel would be loaded from the configured
>>>> >> "{model-file-name}" via the Stanbol DataFileProvider service.
>>>> >> Practically, this means that users would need to copy their custom
>>>> >> models to the "{stanbol.home}/datafiles" directory.
>>>> >>
>>>> >> The language parameter "lang={language}" would specify the language
>>>> >> supported by this model. The "type={entity-type}" parameter would
>>>> >> specify the dc-type value set for fise:TextAnnotations created for
>>>> >> named entities extracted by the model.
>>>> >>
>>>> >>
>>>> >> ## NamedEntityLinkingEngine
>>>> >>
>>>> >> For this engine the main problem with the current implementation is
>>>> >> that the current way to configure mappings does not allow configuring
>>>> >> arbitrary mappings. Because of that, one would need to implement a
>>>> >> different approach to configure the mappings for the dc:type values
>>>> >> of linked fise:TextAnnotations.
>>>> >>
>>>> >> I would suggest using a configuration similar to the "type mapping"
>>>> >> [1] as already used by the KeywordLinkingEngine. The syntax would be
>>>> >> like
>>>> >>
>>>> >>      {dc-type} > {vocabulary-type}; {vocabulary-type}; ...
>>>> >>      {dc-type} > *
>>>> >>      {dc-type}
>>>> >>
>>>> >> where the {dc-type} would be the value of the dc-type property of the
>>>> >> TextAnnotation and {vocabulary-type} is the rdf:type value required
>>>> >> for linked Entities in the vocabulary linked against. * represents the
>>>> >> wild-card (any type) and {dc-type} is a shorthand for {dc-type} >
>>>> >> {dc-type}
>>>> >>
>>>> >> The current default mappings would be represented in this syntax by
>>>> >>
>>>> >>     dbp-ont:Place
>>>> >>     dbp-ont:Person
>>>> >>     dbp-ont:Organisation
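
To make the three forms of the proposed mapping syntax concrete, here is a
minimal Java sketch of a parser for a single mapping line. The class
TypeMapping and its methods are hypothetical names chosen for this
example; this is not the KeywordLinkingEngine or NamedEntityLinkingEngine
code. Types are kept as the plain strings found in the configuration, and
no prefix expansion is attempted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical representation of one "{dc-type} > ..." line; not part of Stanbol. */
final class TypeMapping {
    final String dcType;
    final List<String> vocabularyTypes; // empty when anyType is true
    final boolean anyType;              // true for "{dc-type} > *"

    private TypeMapping(String dcType, List<String> vocabularyTypes, boolean anyType) {
        this.dcType = dcType;
        this.vocabularyTypes = vocabularyTypes;
        this.anyType = anyType;
    }

    static TypeMapping parse(String line) {
        String trimmed = line.trim();
        int gt = trimmed.indexOf('>');
        if (gt < 0) {
            // Shorthand: "{dc-type}" means "{dc-type} > {dc-type}".
            String dcType = requireNonEmpty(trimmed, line);
            return new TypeMapping(dcType, List.of(dcType), false);
        }
        String dcType = requireNonEmpty(trimmed.substring(0, gt).trim(), line);
        String target = trimmed.substring(gt + 1).trim();
        if (target.equals("*")) {
            // Wildcard: linked entities of any rdf:type are accepted.
            return new TypeMapping(dcType, List.of(), true);
        }
        List<String> types = new ArrayList<>();
        for (String token : target.split(";")) {
            String type = token.trim();
            if (!type.isEmpty()) {
                types.add(type);
            }
        }
        if (types.isEmpty()) {
            throw new IllegalArgumentException("No vocabulary type in: " + line);
        }
        return new TypeMapping(dcType, Collections.unmodifiableList(types), false);
    }

    /** True if an entity with the given rdf:type may be linked for this dc-type. */
    boolean accepts(String rdfType) {
        return anyType || vocabularyTypes.contains(rdfType);
    }

    private static String requireNonEmpty(String value, String line) {
        if (value.isEmpty()) {
            throw new IllegalArgumentException("Missing dc-type in: " + line);
        }
        return value;
    }

    public static void main(String[] args) {
        // The three default mappings listed above, written in shorthand form.
        for (String line : List.of("dbp-ont:Place", "dbp-ont:Person", "dbp-ont:Organisation")) {
            TypeMapping mapping = parse(line);
            System.out.println(mapping.dcType + " > " + mapping.vocabularyTypes);
        }
    }
}
```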
>>>> >>
>>>> >> I would suggest keeping support for the current properties to avoid
>>>> >> breaking backward compatibility.
>>>> >>
>>>> >> If this extension is sufficient, I suggest creating the corresponding JIRA issues.
>>>> >>
>>>> >> best
>>>> >> Rupert
>>>> >>
>>>> >> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax
>>>> >>
>>>> >> > Thanks and best regards,
>>>> >> > Andrea
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> >> | A-5500 Bischofshofen
>>>> >>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Opennlp NER ...

Posted by Andrea Taurchini <at...@gmail.com>.
Dear Rupert,
thanks it works like a charm even with multiple entities tagged in a single
model file.

Thanks again.

Kindest regards,
Andrea



