Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2013/06/10 15:05:19 UTC

[jira] [Created] (STANBOL-1102) EntityLinking MUST only accept single token matches for the currently active Token

Rupert Westenthaler created STANBOL-1102:
--------------------------------------------

             Summary: EntityLinking MUST only accept single token matches for the currently active Token
                 Key: STANBOL-1102
                 URL: https://issues.apache.org/jira/browse/STANBOL-1102
             Project: Stanbol
          Issue Type: Bug
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


With the "Max Search Tokens (enhancer.engines.linking.maxSearchTokens)" configuration, the EntityLinking Engine supports OR queries against the controlled vocabulary for multiple linkable/matchable tokens (default=2).

This feature ensures that Entities matching longer sections of the text are ranked higher. This is especially important for bigger vocabularies and/or tokens that are common within the vocabulary, as the EntityLinking only considers the top 10 (or 3 * max suggestions) query results.
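
The way the search strings in the log output are formed can be sketched as follows. This is a minimal illustration, not the Stanbol implementation: it assumes every token is linkable/matchable and that the search strings are the active token followed by the next tokens, up to the configured maximum. The function names and the quoted OR syntax are hypothetical.

```python
def build_search_strings(tokens, active_index, max_search_tokens=2):
    """Return the active token plus the tokens that follow it,
    limited to max_search_tokens entries (assumed default: 2)."""
    if not 0 <= active_index < len(tokens):
        raise IndexError("active_index out of range")
    return tokens[active_index:active_index + max_search_tokens]


def build_or_query(search_strings):
    """Join the search strings into a single OR query.
    The quoting and the OR keyword are a hypothetical syntax."""
    return " OR ".join(f'"{s}"' for s in search_strings)


# Simplified token list; indices here are local to this sketch.
tokens = ["FPÖ", "Bundesparteivorsitzenden", "Heinz", "Christian", "Strache"]
search = build_search_strings(tokens, 1)
print(search)                  # ['Bundesparteivorsitzenden', 'Heinz']
print(build_or_query(search))  # "Bundesparteivorsitzenden" OR "Heinz"
```

Because the query is an OR over all search strings, a result may match only the second string, which is the situation this issue is about.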

However, in cases where no Entity matches several tokens of the search, this feature currently causes an unwanted side effect: it may match single tokens that are not the currently active one.

E.g. the text section "Bei einer gmeinsamen Pressekonferenz mit FPÖ-Bundesparteivorsitzenden Heinz-Christian Strache in Langenlois" generates the following queries:

(1) process Token 5: FPÖ
  >> searchStrings [FPÖ, Bundesparteivorsitzenden]
  << 0: FPÖ[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://rdf.freebase.com/ns/m.013vy8

(2) process Token 5: Bundesparteivorsitzenden
  >> searchStrings [Bundesparteivorsitzenden, Heinz]
  << 0: Heinz[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://rdf.freebase.com/ns/m.0c5y96

(3) process Token 7: Christian
  >> searchStrings [Christian, Strache]
  << 0: Heinz-Christian Strache[m=FULL,s=2,c=2(1.0)/3] score=0.6666666666666666[l=0.6666666666666666,t=1.0] for http://rdf.freebase.com/ns/m.08lfdk

resulting in a situation where "Heinz" is linked to another Entity, while "Heinz-Christian Strache", although it completely matches the text, is only linked via "Christian Strache" AND with a lower confidence!

The issue is that search (2), issued for the Token "Bundesparteivorsitzenden", MUST NOT suggest an Entity that does not match the currently active Token. Because it does so in the given example, "Heinz" is already consumed and cannot be linked with the expected Entity mention "Heinz-Christian Strache".

This issue will add a rule to EntityLinking that the currently active Token needs to be included in suggestions.
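
A minimal sketch of such a rule is given below. It is an illustration only and does not reflect the actual Stanbol classes: the Suggestion type, its fields, and filter_suggestions are hypothetical names, and it assumes each suggestion records which consecutive span of text tokens it matched. The token indices are local to the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical suggestion: an Entity plus the token span it matched."""
    entity_uri: str
    matched_start: int   # index of the first matched token in the text
    matched_count: int   # number of consecutive matched tokens

    def covers(self, token_index):
        """True if the matched span includes the given token index."""
        end = self.matched_start + self.matched_count
        return self.matched_start <= token_index < end


def filter_suggestions(suggestions, active_index):
    """Keep only suggestions whose matched span includes the active token."""
    return [s for s in suggestions if s.covers(active_index)]


# Search (2) from the example: the active token is "Bundesparteivorsitzenden"
# (index 0 in this sketch), but the only result matches "Heinz" (index 1).
heinz = Suggestion("http://rdf.freebase.com/ns/m.0c5y96", matched_start=1, matched_count=1)
print(filter_suggestions([heinz], active_index=0))  # []
```

With the suggestion for "Heinz" rejected while "Bundesparteivorsitzenden" is active, "Heinz" stays unconsumed and remains available when it becomes the active token itself.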


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira