Posted to commits@stanbol.apache.org by "Nosiert Batiste (JIRA)" <ji...@apache.org> on 2012/05/15 18:08:44 UTC

[jira] [Created] (STANBOL-614) Enhancer returns inconsistent results

Nosiert Batiste created STANBOL-614:
---------------------------------------

             Summary: Enhancer returns inconsistent results
                 Key: STANBOL-614
                 URL: https://issues.apache.org/jira/browse/STANBOL-614
             Project: Stanbol
          Issue Type: Bug
          Components: Enhancer
    Affects Versions: 0.10.0-incubating
         Environment: Debian squeeze Linux 2.6.32-5-amd64 SMP x86_64
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
            Reporter: Nosiert Batiste


I'm trying to implement a tag suggestion feature in a document editing application. I'm using the stanbol enhancer to get EntityAnnotations for a piece of HTML.

This works great most of the time, but sometimes no results are returned. The difference between text that yields results and text that yields none is sometimes only a single character.

I was able to reduce one case down to an additional &nbsp;.

With the following text, the enhancer returns an EntityAnnotation for Syria, but not for CNN:

&nbsp; &nbsp; So, where does the Syria conflict stand now? CNN 

With the following text, the enhancer returns EntityAnnotations for both Syria and CNN:

&nbsp; So, where does the Syria conflict stand now? CNN 

I post the text with the following command (where @test refers to the file that contains the text):

curl -v -X POST -H "Accept: application/json" -H "Content-Type: text/html;charset=utf-8" --data-binary @test "http://localhost:8086/enhancer"

I checked out stanbol from svn
$ svnversion .
1337074

and started it with the following command line
java -Xmx1g -jar launchers/full/target/org.apache.stanbol.launchers.full-0.10.0-incubating-SNAPSHOT.jar -p 8086

I will try to work around this problem by simply converting everything to plain text.
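
A minimal sketch of that workaround, assuming the conversion is done on the client before the text is posted. It uses only Python's standard library; the names html_to_text and _TextExtractor are made up for this illustration and are not part of Stanbol. It decodes entities such as &nbsp; and collapses whitespace, so both sample inputs above reduce to the same plain text. It does not skip script or style content, which a real converter would have to handle.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # str.split() without arguments also splits on the no-break space U+00A0,
    # so leading "&nbsp;" runs disappear and inner runs become single spaces.
    return " ".join(parser.text().split())


if __name__ == "__main__":
    print(html_to_text("&nbsp; &nbsp; So, where does the Syria conflict stand now? CNN "))
```

The result can then be posted with Content-Type: text/plain instead of text/html.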


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Created] (STANBOL-614) Enhancer returns inconsistent results

Posted by Fabian Christ <ch...@googlemail.com>.
Hi Nosiert,

2012/5/15 Nosiert Batiste (JIRA) <ji...@apache.org>:
> I will try to work around this problem by simply converting everything to plain text.

Yes, that's the best way to solve this for the moment. Apache Stanbol
currently has no (good) support for annotating HTML sources. Maybe you
would like to implement an enhancement engine that converts your HTML
into plain text. This engine could run before the entity extraction
engines come into play.

Best,
 - Fabian

-- 
Fabian
http://twitter.com/fctwitt

[jira] [Resolved] (STANBOL-614) Enhancer returns inconsistent results

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/STANBOL-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler resolved STANBOL-614.
-----------------------------------------

    Resolution: Not A Problem
      Assignee: Rupert Westenthaler

Closing this. Hopefully my comment has provided some help in investigating similar issues.
                
> Enhancer returns inconsistent results
> -------------------------------------
>
>                 Key: STANBOL-614
>                 URL: https://issues.apache.org/jira/browse/STANBOL-614
>             Project: Stanbol
>          Issue Type: Bug
>          Components: Enhancer
>    Affects Versions: 0.9.0-incubating
>         Environment: Debian squeeze Linux 2.6.32-5-amd64 SMP x86_64
> java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>            Reporter: Nosiert Batiste
>            Assignee: Rupert Westenthaler
>              Labels: newbie


[jira] [Commented] (STANBOL-614) Enhancer returns inconsistent results

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276729#comment-13276729 ] 

Rupert Westenthaler commented on STANBOL-614:
---------------------------------------------

There are several reasons why this could happen:

1) Correct detection of the language: For short texts, the correct language sometimes cannot be detected. In those cases, Enhancement Engines that depend on that information (e.g. openNLP-NER) will not work.
2) NER (Named Entity Recognition): Entities mentioned in parts of a text that are not full sentences are more likely to be overlooked.

If you send HTML to Apache Stanbol, it uses Apache Tika to convert the HTML to plain text. You can ask Stanbol to return the converted text, e.g. with a request such as

    curl -v -X POST -H "Accept: text/plain" -H "Content-Type: text/html;charset=utf-8" --data-binary @test "http://dev.iks-project.eu:8081/enhancer?omitMetadata=true"

You can also request the metadata and all content elements (parsed and converted) with a request like

    curl -v -X POST -H "Accept: multipart/form-data" -H "Content-Type: text/html;charset=utf-8" --data-binary @test "http://dev.iks-project.eu:8081/enhancer?outputContent=*/*&rdfFormat=application/json"

This can help a lot with debugging.
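
The converted-text request can also be built from a script, which makes it easier to compare what Tika produces for two nearly identical inputs. A minimal Python sketch using only the standard library follows. The function names are hypothetical; the headers and the omitMetadata parameter are the ones from the curl command above, and the endpoint is whatever enhancer URL you are running.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_enhancer_request(endpoint, html, accept, params=None):
    """Build (but do not send) a POST request submitting HTML to an enhancer endpoint."""
    url = endpoint
    if params:
        # urlencode percent-encodes parameter values where needed.
        url = endpoint + "?" + urlencode(params)
    return Request(
        url,
        data=html.encode("utf-8"),
        headers={
            "Accept": accept,
            "Content-Type": "text/html;charset=utf-8",
        },
        method="POST",
    )


def converted_text_request(endpoint, html):
    """Request only the plain text that the HTML was converted to."""
    return build_enhancer_request(endpoint, html, "text/plain", {"omitMetadata": "true"})
```

Passing the returned object to urllib.request.urlopen sends it; reading the response body for both sample inputs and comparing the two strings shows whether the extra &nbsp; changes the text the engines actually see.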

In general: If you send very short documents to the Enhancer, I would advise using the "KeywordLinkingEngine" instead of the combination of "NamedEntityExtractionEnhancementEngine" and "NamedEntityTaggingEngine".

However, if you use the "KeywordLinkingEngine" in combination with dbpedia as a Vocabulary, you might need to filter results based on the "fise:entity-type" (e.g. ignoring all "fise:EntityAnnotations" that do not have a value for "fise:entity-type").
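
Such a filter could be sketched in Python as follows. This assumes the enhancer response has already been parsed into a list of dictionaries, one per EntityAnnotation, and that the type is stored under the key "fise:entity-type". Both are assumptions of this sketch: the actual key names depend on the RDF serialization you request. The sample data is made up for illustration.

```python
def has_entity_type(annotation, key="fise:entity-type"):
    """Return True if the annotation carries at least one non-empty type value."""
    value = annotation.get(key)
    if value is None:
        return False
    if isinstance(value, (list, tuple, set)):
        return len(value) > 0
    return bool(str(value).strip())


def filter_typed(annotations, key="fise:entity-type"):
    """Keep only annotations that have a value for the entity type."""
    return [a for a in annotations if has_entity_type(a, key)]


if __name__ == "__main__":
    # Illustrative input only, not real enhancer output.
    sample = [
        {"fise:entity-label": "Syria", "fise:entity-type": ["dbp-ont:Country"]},
        {"fise:entity-label": "Stand", "fise:entity-type": []},
        {"fise:entity-label": "Now"},
    ]
    for annotation in filter_typed(sample):
        print(annotation["fise:entity-label"])
```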

An example of an Enhancement Chain configured to use the KeywordLinkingEngine with dbpedia can be found at [1].

best
Rupert


[1] http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword

                
> Enhancer returns inconsistent results
> -------------------------------------
>
>                 Key: STANBOL-614
>                 URL: https://issues.apache.org/jira/browse/STANBOL-614
>             Project: Stanbol
>          Issue Type: Bug
>          Components: Enhancer
>    Affects Versions: 0.10.0-incubating
>         Environment: Debian squeeze Linux 2.6.32-5-amd64 SMP x86_64
> java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>            Reporter: Nosiert Batiste
>              Labels: newbie
