Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2014/01/21 10:24:19 UTC

[jira] [Created] (STANBOL-1262) Change/Improve processing of Chunks by EntityLinking

Rupert Westenthaler created STANBOL-1262:
--------------------------------------------

             Summary: Change/Improve processing of Chunks by EntityLinking 
                 Key: STANBOL-1262
                 URL: https://issues.apache.org/jira/browse/STANBOL-1262
             Project: Stanbol
          Issue Type: Improvement
    Affects Versions: 0.12.0
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


The first step of EntityLinking (this applies to all EntityLinkingEngines incl. the Lucene FST Linking Engine) is to classify Tokens as "linkable", "matchable" and "others". In addition, it determines the "processible" chunks the Tokens are contained in.

This issue is about changing how "processible" chunks are determined if the AnalyzedText contains multiple overlapping chunks.

A typical case where this can happen is if both a Noun Phrase Detection and a Named Entity Recognition are contained in the Chain. The chunks selected by Named Entities will typically be smaller than the corresponding Noun Phrases. There are even situations where the Named Entity does not include all Nouns contained in a Noun Phrase.

Here is an example taken from [1]:

    After a disappointing start against an Everton side who led through Kevin Mirallas's first-half goal ...

While "Everton" is detected as an Organization by NER, the Noun Phrase "an Everton side" also includes 'side' as a 2nd noun. Therefore 'Everton' is not considered for linking, as it only matches 1 of the 2 matchable tokens within the 'processible phrase'.

This is because EntityLinking currently merges overlapping processible phrases together, a semantic that is no longer optimal for EntityLinking.

To avoid recall problems like the one described, the intersection instead of the union of multiple processible chunks needs to be used.

For the given example this would result in

 - an [other]: an Everton side
 - Everton [linkable]: Everton
 - side [matchable]: an Everton side

So 'Everton' would get correctly linked to an Entity with the label Everton, but 'side' would not get linked to an Entity with the label Side, as it is in a Phrase with another linkable/matchable token.
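
The difference between the two semantics can be illustrated with a small sketch. This is not the Stanbol implementation or its API: the function names, the token-index spans and the half-open range convention are all assumptions made for illustration only. It just shows how the processible chunk of each Token from the example changes when the intersection rather than the union of the overlapping chunks is used.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a chunk is a half-open range [start, end)
# of token indexes. This is NOT the Stanbol AnalyzedText API.
Span = Tuple[int, int]


def merge_overlapping(chunks: List[Span]) -> List[Span]:
    """Union semantic: merge chunks that overlap into one larger chunk."""
    merged: List[Span] = []
    for start, end in sorted(chunks):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union_chunk(token: int, chunks: List[Span]) -> Optional[Span]:
    """Processible chunk of a token when overlapping chunks are merged."""
    for start, end in merge_overlapping(chunks):
        if start <= token < end:
            return (start, end)
    return None


def intersection_chunk(token: int, chunks: List[Span]) -> Optional[Span]:
    """Processible chunk of a token as the intersection of all chunks
    that contain this token."""
    containing = [c for c in chunks if c[0] <= token < c[1]]
    if not containing:
        return None
    return (max(s for s, _ in containing), min(e for _, e in containing))


# The example from the issue: Noun Phrase "an Everton side" and the
# Named Entity "Everton" overlap.
tokens = ["an", "Everton", "side"]
chunks = [(0, 3), (1, 2)]  # noun phrase, named entity

for index, token in enumerate(tokens):
    start, end = intersection_chunk(index, chunks)
    print(token, "->", " ".join(tokens[start:end]))
```

With the union semantic all three Tokens end up in the chunk (0, 3), while with the intersection semantic 'Everton' gets the chunk (1, 2) of its own.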


[1] http://www.theguardian.com/football/2014/jan/20/west-bromwich-albion-everton-premier-league-match-report



