Posted to commits@stanbol.apache.org by "Olivier Grisel (JIRA)" <ji...@apache.org> on 2011/09/02 15:08:10 UTC

[jira] [Created] (STANBOL-320) Named Entity detection engine should filter out some obviously wrong text annotations

Named Entity detection engine should filter out some obviously wrong text annotations
-------------------------------------------------------------------------------------

                 Key: STANBOL-320
                 URL: https://issues.apache.org/jira/browse/STANBOL-320
             Project: Stanbol
          Issue Type: Bug
            Reporter: Olivier Grisel
            Assignee: Olivier Grisel


OpenNLP tends to return really weird results from time to time. For instance:

"The researchers found the liver expresses higher levels of the gene encoding "selenoprotein P" (SEPP1) in people with type 2 diabetes - those with more insulin resistance." outputs a Person TextAnnotation for the mention 'P "' => note the double quote that is included as part of the mention, and the additional whitespace separator that was probably inserted by a confused detokenizer.

Here is another example:

"We are all very excited for Rahm as he takes on a new challenge for which he is extraordinarily well qualified," said the president. Obama appointed political consultant and senior advisor Pete Rouse as interim chief, calling Rouse "a skillful problem-solver" and a "wise, skillful and long-time counselor." => outputs 'Rouse "' as a Person annotation as well. This is again a confusion caused by bad handling of quotation marks.

I would like to use this JIRA issue to collect the most common annotation mistakes that could be filtered using ad-hoc Java code directly inside the enhancement engine.

For the two previous cases, removing the quotation marks and filtering out single-letter names should be enough. There might be other cases that don't match this simple pattern, though.
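
A minimal sketch of the cleanup described above, assuming the mention is available as a plain String. The class MentionCleaner, its method names and the set of quote characters are hypothetical and are not part of the Stanbol or OpenNLP APIs:

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Stanbol or OpenNLP. */
final class MentionCleaner {

    // Quote characters stripped from either end of a mention (an assumed set).
    private static final String QUOTES = "\"'\u201C\u201D\u2018\u2019";

    private MentionCleaner() {}

    /** Strips whitespace and quote characters from both ends of the mention. */
    static String stripQuotes(String mention) {
        int start = 0;
        int end = mention.length();
        while (start < end && isStrippable(mention.charAt(start))) {
            start++;
        }
        while (end > start && isStrippable(mention.charAt(end - 1))) {
            end--;
        }
        return mention.substring(start, end);
    }

    /**
     * Returns the cleaned mention, or an empty Optional when nothing or only
     * a single character is left, in which case the annotation is dropped.
     */
    static Optional<String> clean(String mention) {
        String cleaned = stripQuotes(mention);
        if (cleaned.length() <= 1) {
            return Optional.empty();
        }
        return Optional.of(cleaned);
    }

    private static boolean isStrippable(char c) {
        return Character.isWhitespace(c) || QUOTES.indexOf(c) >= 0;
    }
}
```

With this sketch, the mention 'P "' is dropped and 'Rouse "' is kept as 'Rouse'. The start and end offsets of the annotation would have to be adjusted as well, which is not shown here.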


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (STANBOL-320) Named Entity detection engine should filter out some obviously wrong text annotations

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096102#comment-13096102 ] 

Olivier Grisel commented on STANBOL-320:
----------------------------------------

Another example, this time the title page of a scientific paper:

Semantic Relation Extraction With Kernels Over Typed
Dependency Trees
Frank Reichartz
Hannes Korte
Gerhard Paass
Fraunhofer IAIS
Schloss Birlinghoven
St. Augustin, Germany

=> OpenNLP outputs a single annotation of type person: "Frank Reichartz Hannes Korte Gerhard Paass Fraunhofer IAIS". In this case, we could avoid such false positives with a single rule that discards person names of more than 4 or 5 words, or more than 50 characters, for instance.
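
Such a rule could be sketched as follows. The class name PersonMentionLengthFilter is hypothetical, and the thresholds (5 words, 50 characters) are only the tentative values mentioned above, not tuned ones:

```java
/** Hypothetical length-based filter for person mentions; illustration only. */
final class PersonMentionLengthFilter {

    // Tentative limits taken from the comment above; they would need tuning.
    static final int MAX_WORDS = 5;
    static final int MAX_CHARS = 50;

    private PersonMentionLengthFilter() {}

    /** Returns false for empty, overly long or overly wordy mentions. */
    static boolean isPlausible(String mention) {
        String trimmed = mention.trim();
        if (trimmed.isEmpty() || trimmed.length() > MAX_CHARS) {
            return false;
        }
        // Words are runs of characters separated by any whitespace.
        return trimmed.split("\\s+").length <= MAX_WORDS;
    }
}
```

The eight-word mention from the title page is rejected by both limits, while a regular name such as "Pete Rouse" passes.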


[jira] [Updated] (STANBOL-320) Named Entity detection engine should filter out some obviously wrong text annotations

Posted by "Fabian Christ (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/STANBOL-320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Christ updated STANBOL-320:
----------------------------------

    Issue Type: Improvement  (was: Bug)
    

[jira] [Commented] (STANBOL-320) Named Entity detection engine should filter out some obviously wrong text annotations

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096071#comment-13096071 ] 

Olivier Grisel commented on STANBOL-320:
----------------------------------------

Here is another related example of a bad TextAnnotation for a person:

This polarity splitting has always improved perfor-
mance in our experiments, and can be thought of as a more
flexible form of the absolute value rectification in (Jarrett
et al., 2009), or as non-negative sparse coding with the
dictionary [−D D].

"D D]" is considered a mention of a Person annotation (which is then resolved to "Dwight D. Eisenhower"...).

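No rule is proposed in this thread for that case. One possible heuristic, given here only as an assumption, is to discard person mentions in which no token contains at least two letters. The class InitialsOnlyFilter below is hypothetical and is not part of Stanbol or OpenNLP:

```java
/** Hypothetical filter for mentions made only of single letters and symbols. */
final class InitialsOnlyFilter {

    private InitialsOnlyFilter() {}

    /** Returns true if at least one whitespace-separated token has two or more letters. */
    static boolean hasNamePart(String mention) {
        for (String token : mention.trim().split("\\s+")) {
            int letters = 0;
            for (int i = 0; i < token.length(); i++) {
                if (Character.isLetter(token.charAt(i))) {
                    letters++;
                }
            }
            if (letters >= 2) {
                return true;
            }
        }
        return false;
    }
}
```

Under this assumption both "D D]" and 'P "' are rejected, while "Dwight D. Eisenhower" is still accepted because of its longer tokens.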