Posted to dev@stanbol.apache.org by Stefan Bunk <st...@student.hpi.uni-potsdam.de> on 2014/05/14 16:08:19 UTC

[user-question] Building a custom NER-Model for geonames.org

Hi,

I have problems using the Custom NER Model Extraction Engine [1].
Basically, no entities are found, even though the underlying model is
correct.

Here's what I did:
1. I built a custom NER model for places from geonames.org according to
the OpenNLP website [2]. I tested my model with the OpenNLP command line
tool, and it worked (i.e. I gave my model a text and the entities were found
correctly).
2. I copied the model to both ./launchers/stanbol/datafiles/geonames.bin
and ./enhancement-engines/topic/engine/sling/datafiles/geonames.bin.
3. In the Apache Felix Web Console Configuration, I created a new "Custom
NER Model" with the following settings:
                - name: Geonames NER
                - Name Finder Model: geonames.bin
                - Type Mappings: place > http://dbpedia.org/ontology/Place
                - Ranking: -100
4. I built a new enhancement chain with: tika, langdetect,
opennlp-sentence, opennlp-token, opennlp-pos, opennlp-ner, geonames-ner,
geonames
5. Server restart
6. I sent exactly the same string as in 1., when I tested the model, but no
entities were found.

Any hint would be useful!
How can I check that Stanbol correctly finds my geonames.bin file? If I
intentionally specify a file that does not exist, no error occurs.

Thanks in advance
Stefan




[1] https://stanbol.apache.org/docs/trunk/components/enhancer/engines/opennlpcustomner
[2] http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Name_Finder

Re: [user-question] Building a custom NER-Model for geonames.org

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Stefan,

STANBOL-660 [1] is now resolved (in both 0.12.1 and 1.0.0), so you can
now explicitly pass the language of the submitted content by using the
Content-Language header.
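
As a minimal sketch of what such a request could look like, the snippet below builds (but does not send) a POST request for an enhancement chain with an explicit Content-Language header. The base URL, the chain name "geonames-ner" and the Accept type are assumptions for illustration; adjust them to your own instance.

```python
import urllib.request


def build_enhance_request(text, chain="geonames-ner", language="en",
                          base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for a Stanbol enhancement chain.

    base_url and chain are assumptions for illustration, not values taken
    from this thread's actual setup.
    """
    url = f"{base_url}/enhancer/chain/{chain}"
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        # Explicit language of the submitted text (see STANBOL-660).
        "Content-Language": language,
        "Accept": "application/rdf+xml",
    }
    return urllib.request.Request(
        url, data=text.encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    req = build_enhance_request("Costa de Xurius")
    print(req.get_method(), req.full_url)
    print(dict(req.header_items()))
    # To actually send it against a running instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the headers before pointing the code at a real server.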

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-660

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/

Re: [user-question] Building a custom NER-Model for geonames.org

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Stefan

On Sat, May 17, 2014 at 3:49 PM, Stefan Bunk
<st...@student.hpi.uni-potsdam.de> wrote:
> The problem is that the texts I send to the chain are quite short, usually
> only one sentence, and they often contain some obviously non-English name
> like "Costa de Xurius". This confuses the language detection, which in this
> example no longer outputs English but Spanish. Afterwards,
> the geonames-ner engine does not even bother to run because the text is not
> in a language it was trained for.
>
> So, what's the right way to do it now? Can I somehow force the chain to
> emit English as the language of the text? Removing the langdetect engine
> does not work, as it is needed by the Custom NER Model engine.
>

This reminds me of STANBOL-660, which is about exactly this problem.
I had not been affected by it for some time, so I totally forgot about it.
I have scheduled this issue to be fixed in 0.12.1 and 1.0.0 and will try to
implement it later today.

Once this is implemented you can pass the language via the
Content-Language header and remove the LanguageDetection engine from
your chain.

> ----
> Furthermore, I am not satisfied with the geonames.org entity linking.
> Even when the text is correctly classified as English and the location
> entity is found, the geonames linking fails to link many entities.
> Example:
> The text snippet is "University of Buenos Aires". This is the exact name of
> the entity on geonames.org. Still, I had to lower the confidence score to
> 20% to have the geonames engine find the link (confidence: 24%). Many
> entities are not found at all, even when I use the exact name as on
> geonames.org and it is correctly identified as a location.
>
> Where can I look to improve the linking performance?
>

I think STANBOL-1303 is the reason for the unexpected confidence values.

You can try using the Entityhub Indexing Tool for Geonames
(entityhub/indexing/geonames) to generate your own local index for
Geonames. After installing this index in the Stanbol Entityhub you can
use the Named Entity Linking Engine [1] for entity linking. This
would also have the advantage that you do not depend on an external
service for linking.

You can use one of the Geonames indexes available at [2] for testing.
Those are based on a geonames.org dump that is about 1 year old.

best
Rupert



[1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/namedentitytaggingengine
[2] http://dev.iks-project.eu/downloads/stanbol-indices/geonames/


Re: [user-question] Building a custom NER-Model for geonames.org

Posted by Stefan Bunk <st...@student.hpi.uni-potsdam.de>.
Hi Rupert, hi all,

thanks to your hints I was able to track down the problem. First, I
checked the engine name and the file location and both were correct (yes, in
the original post I did not write the name I actually used, I am sorry for
that). The file was found correctly. Still, it wasn't working.

What got me on the right track was:


> 15.05.2014 10:38:28.739 *INFO* [DataFileTrackingDaemon]
> org.apache.stanbol.enhancer.engines.opennlp.impl.CustomNERModelEnhancementEngine
> register custom NameFinderModel from resource: geonames-ner.bin for
> language: en to NamedModelFileListener (name:opennlp-ner)
>
> in the logs.

and the fact that the geonames-ner engine always ran for only 1 ms (which is
really fast, given the 5 megabyte model it has to work through).

The problem is that the texts I send to the chain are quite short, usually
only one sentence, and they often contain some obviously non-English name
like "Costa de Xurius". This confuses the language detection, which in this
example no longer outputs English but Spanish. Afterwards,
the geonames-ner engine does not even bother to run because the text is not
in a language it was trained for.

So, what's the right way to do it now? Can I somehow force the chain to
emit English as the language of the text? Removing the langdetect engine
does not work, as it is needed by the Custom NER Model engine.

----
Furthermore, I am not satisfied with the geonames.org entity linking.
Even when the text is correctly classified as English and the location
entity is found, the geonames linking fails to link many entities.
Example:
The text snippet is "University of Buenos Aires". This is the exact name of
the entity on geonames.org. Still, I had to lower the confidence score to
20% to have the geonames engine find the link (confidence: 24%). Many
entities are not found at all, even when I use the exact name as on
geonames.org and it is correctly identified as a location.

Where can I look to improve the linking performance?

Best,
Stefan

Re: [user-question] Building a custom NER-Model for geonames.org

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Stefan



On Wed, May 14, 2014 at 4:08 PM, Stefan Bunk
<st...@student.hpi.uni-potsdam.de> wrote:
> Hi,
>
> I have problems using the Custom NER Model Extraction Engine [1].
> Basically, no entities are found, even though the underlying model is
> correct.
>
> Here's what I did:
> 1. I built a custom NER model for places from geonames.org according to
> the OpenNLP website [2]. I tested my model with the OpenNLP command line
> tool, and it worked (i.e. I gave my model a text and the entities were found
> correctly).
> 2. I copied the model to both ./launchers/stanbol/datafiles/geonames.bin
> and ./enhancement-engines/topic/engine/sling/datafiles/geonames.bin.

You need to copy the model to the datafiles folder of your Stanbol
instance. By default this is "./stanbol/datafiles". So if you run
Stanbol in "/foo/bar" the model needs to be available under
"/foo/bar/stanbol/datafiles/geonames.bin".

> 3. In the Apache Felix Web Console Configuration, I created a new "Custom
> NER Model" with the following settings:
>                 - name: Geonames NER

This is the name of the engine. Typically, lower-case names with '-' as
word separator or CamelCase names are used. So I suggest using
"geonames-ner" as the name of your engine.

>                 - Name Finder Model: geonames.bin
>                 - Type Mappings: place > http://dbpedia.org/ontology/Place
>                 - Ranking: -100
> 4. I built a new enhancement chain with: tika, langdetect,
> opennlp-sentence, opennlp-token, opennlp-pos, opennlp-ner, geonames-ner,
> geonames

Based on the information provided, you used "Geonames NER" as the name of
your engine. This chain, however, refers to "geonames-ner". I would expect
the chain to be unsatisfied, as no "geonames-ner" engine is around.
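
The mismatch can be illustrated with a small sketch that compares the names listed in a chain against the names of the configured engines. It assumes names are matched exactly (so "Geonames NER" does not satisfy "geonames-ner"); the list of active engines is a hypothetical example, not output from a real instance.

```python
def missing_engines(chain, active_engine_names):
    """Return the chain entries that have no engine registered under that exact name."""
    active = set(active_engine_names)
    return [name for name in chain if name not in active]


if __name__ == "__main__":
    # Chain as listed in step 4 of the original post.
    chain = ["tika", "langdetect", "opennlp-sentence", "opennlp-token",
             "opennlp-pos", "opennlp-ner", "geonames-ner", "geonames"]
    # Hypothetical list of active engine names; the custom engine is
    # registered as "Geonames NER", as in step 3 of the original post.
    active = ["tika", "langdetect", "opennlp-sentence", "opennlp-token",
              "opennlp-pos", "opennlp-ner", "Geonames NER", "geonames"]
    print(missing_engines(chain, active))  # prints ['geonames-ner']
```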

> 5. Server restart

A server restart is not needed. If you update the model you might need
to start/stop the OpenNLP component as it keeps a SoftReference to the
loaded models.

> 6. I sent exactly the same string as in 1., when I tested the model, but no
> entities were found.

I would expect a ChainException, as your chain refers to "geonames-ner"
and the name of the configured engine is "Geonames NER".

>
> Any hint would be useful!
> How can I check that Stanbol correctly finds my geonames.bin file? If I
> intentionally specify a file that does not exist, no error occurs.

The "Stanbol Data File Provider" tab of the Felix Web Console provides
information about requested data files. The Custom NER Model Engine also
logs at INFO level.
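
Independently of the web console, a quick filesystem check against the default datafiles location described earlier in this reply can rule out a misplaced file. This is only a sketch: it assumes the default "./stanbol/datafiles" layout relative to the directory Stanbol is started from, and the function names are made up for illustration.

```python
from pathlib import Path


def expected_model_path(stanbol_working_dir, model_name="geonames.bin"):
    """Path where the model is expected, assuming the default datafiles folder."""
    return Path(stanbol_working_dir) / "stanbol" / "datafiles" / model_name


def model_is_installed(stanbol_working_dir, model_name="geonames.bin"):
    """True if the model file exists at the default datafiles location."""
    return expected_model_path(stanbol_working_dir, model_name).is_file()


if __name__ == "__main__":
    # "/foo/bar" is the example working directory used in this reply.
    path = expected_model_path("/foo/bar")
    print(path.as_posix(), "found" if path.is_file() else "missing")
```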


As I had not used the Custom NER engine for a long time, I tested it
with 0.12.1-SNAPSHOT [4], successfully,

* by using [3], the default English place model
* renaming it to geonames-ner.bin
* copying it to the ./stanbol/datafiles folder of my test instance
* configuring a Custom NER engine with

    stanbol.engines.opennlp-ner.typeMappings=["location\ >\ http://dbpedia.org/ontology/Place"]
    stanbol.enhancer.engine.name="geonames-ner"
    stanbol.engines.opennlp-ner.nameFinderModels=["geonames-ner.bin"]

* configuring a Weighted Chain with

    stanbol.enhancer.chain.weighted.chain=["langdetect","opennlp-sentence","opennlp-token","geonames-ner"]
    stanbol.enhancer.chain.name="geonames-ner"

This setting provided the expected results - meaning the exact same
list of locations as when using the "opennlp-ner" engine


As you do not get a ChainException, the most likely reason for your
problem is that the "geonames.bin" model is not in the correct folder.
As soon as the model is available you should see a message like

15.05.2014 10:38:28.739 *INFO* [DataFileTrackingDaemon]
org.apache.stanbol.enhancer.engines.opennlp.impl.CustomNERModelEnhancementEngine
register custom NameFinderModel from resource: geonames-ner.bin for
language: en to NamedModelFileListener (name:opennlp-ner)

in the logs.
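
To look for this message without reading the whole log by hand, something like the following sketch could be used. It only assumes the message wording quoted above; how you obtain the log text (file location, rotation) depends on your installation and is not covered here.

```python
import re

# Matches the registration message quoted above; whitespace is matched
# loosely so that a line-wrapped copy of the message is still found.
REGISTRATION = re.compile(
    r"register custom NameFinderModel from resource:\s+(\S+)\s+for\s+language:\s+(\S+)"
)


def registered_models(log_text):
    """Return (resource, language) pairs for every registration message found."""
    return REGISTRATION.findall(log_text)


if __name__ == "__main__":
    sample = (
        "15.05.2014 10:38:28.739 *INFO* [DataFileTrackingDaemon]\n"
        "org.apache.stanbol.enhancer.engines.opennlp.impl.CustomNERModelEnhancementEngine\n"
        "register custom NameFinderModel from resource: geonames-ner.bin for\n"
        "language: en to NamedModelFileListener (name:opennlp-ner)\n"
    )
    print(registered_models(sample))  # prints [('geonames-ner.bin', 'en')]
```

An empty result for your model name would suggest that the file was never picked up from the datafiles folder.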

hope this helps
best
Rupert

>
> Thanks in advance
> Stefan
>
>
>
>
> [1]
> https://stanbol.apache.org/docs/trunk/components/enhancer/engines/opennlpcustomner
> [2]
> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Name_Finder

[3] http://dev.iks-project.eu/downloads/opennlp/models-1.5/en-ner-location.bin
[4] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/


