Posted to dev@stanbol.apache.org by Danny Ayers <da...@gmail.com> on 2014/05/07 08:54:16 UTC

normalization, STANBOL-1303

noticed this via Jenkins:
[[
The Geonames.org service changed the value range of provided scores from
[0..100] to [0..inf]. Because of that the engine does no longer report
fise:confidence values in the range of [0..1].
]]
https://issues.apache.org/jira/browse/STANBOL-1303

Two possible normalization strategies are listed alongside the issue. I'd
like to suggest another: I used it a while back on some messy numerics, and
it is simple & robust:

https://en.wikipedia.org/wiki/Sigmoid_function

essentially

out = 1/(1+exp(-in))

for
-inf < in < inf
gives
0 < out < 1
as required.
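
A minimal sketch of that mapping, in Python purely for illustration (the
function name `sigmoid` is my own choice, not anything from the Stanbol
code base). It branches on the sign of the input so that exp() never sees
a large positive argument:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x)); the result lies between 0 and 1."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the equivalent form exp(x) / (1 + exp(x)),
    # which avoids calling exp() on a large positive number.
    z = math.exp(x)
    return z / (1.0 + z)
```

An input of 0 maps to exactly 0.5; negative inputs land below 0.5 and
positive inputs above it.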

Cheers,
Danny.

Re: normalization, STANBOL-1303

Posted by Rupert Westenthaler <ru...@gmail.com>.

Hi Danny,

In short: IMO using a sigmoid function is not a good way to get [0..1]
bounded scores for the Geonames LocationEnhancementEngine. Using first
the Levenshtein distance and second an ordering based on the Geonames
service scores results in confidence values consistent with those of
other linking engines.

See comments below for details:

On Wed, May 7, 2014 at 8:57 AM, Danny Ayers <da...@gmail.com> wrote:
> [oops, old list address]
>
> noticed this via Jenkins:
> [[
> The Geonames.org service changed the value range of provided scores from
> [0..100] to [0..inf]. Because of that the engine does no longer report
> fise:confidence values in the range of [0..1].
> ]]
> https://issues.apache.org/jira/browse/STANBOL-1303
>
> Two possible normalization strategies are listed alongside the issue. I'd
> like to suggest another: I used it a while back on some messy numerics, and
> it is simple & robust:
>
> https://en.wikipedia.org/wiki/Sigmoid_function
>
> essentially
>
> out = 1/(1+exp(-in))
>
> for
> -inf < in < inf
> gives
> 0 < out < 1
> as required.
>

As the scores returned by the Geonames web service are in the range
[0..inf], the sigmoid function would provide scores in the range
[0.5..1]. In addition, most of the returned scores are big, so a lot of
results of the sigmoid function would be rounded to 1.0.
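
A quick way to see both effects, sketched in Python. The sample scores
below are made up for illustration and are not real Geonames output:

```python
import math


def sigmoid(x: float) -> float:
    """Plain logistic function; fine here because all inputs are >= 0."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical non-negative scores, chosen only to show the behaviour.
sample_scores = [0.0, 1.0, 10.0, 40.0, 100.0]
squashed = [sigmoid(score) for score in sample_scores]

for score, value in zip(sample_scores, squashed):
    print(f"{score:6.1f} -> {value!r}")

# The lowest possible score (0.0) already maps to 0.5, so the lower half
# of the [0..1] range is never used. From roughly 40 upwards exp(-score)
# is smaller than double precision can resolve next to 1, so the result
# is exactly 1.0 and such candidates can no longer be told apart.
```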

The score of EntityLinking engines is expected to represent how well
the mention in the text matches a label of the suggested Entity.
The relevance (e.g. the popularity, page rank ...) of an Entity can be
used to adapt this score to provide a more intuitive sorting of
entities.

E.g. both the Entityhub Linking and the FST linking engine calculate
the confidence based on the similarity of the mention with the best
matching label of an Entity. In addition they have an option that
allows modifying the score by at most 0.1 based on the relevance of
the Entity.
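
To make the idea concrete, here is a rough Python sketch. It is not the
actual code of the Entityhub or FST linking engines: the function names,
the length-normalized Levenshtein similarity, the rank-based relevance,
and the way the 0.1 bound is applied are all assumptions made for
illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def label_similarity(mention: str, label: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    mention, label = mention.lower(), label.lower()
    longest = max(len(mention), len(label))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(mention, label) / longest


def rank_relevance(raw_scores):
    """Use only the ordering of unbounded scores: the best one gets 1.0."""
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i], reverse=True)
    relevance = [0.0] * n
    for rank, index in enumerate(order):
        relevance[index] = 1.0 - rank / n
    return relevance


def confidence(mention, label, relevance, max_penalty=0.1):
    """Label similarity, lowered by at most max_penalty for low relevance."""
    similarity = label_similarity(mention, label)
    return similarity * (1.0 - max_penalty * (1.0 - relevance))
```

With this shape the result always stays in [0..1], the label match
dominates, and the service score only influences the order of candidates
whose labels match about equally well.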

In cases where users want to preserve the score as returned by the
Geonames web service, we could add it to the fise:EntityAnnotations
(using an engine-specific property).

best
Rupert

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/

Fwd: normalization, STANBOL-1303

Posted by Danny Ayers <da...@gmail.com>.

[oops, old list address]

-- 
http://dannyayers.com

http://webbeep.it  - text to tones and back again