Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/05/16 15:13:02 UTC

[jira] [Commented] (STANBOL-614) Enhancer returns inconsistent results

    [ https://issues.apache.org/jira/browse/STANBOL-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276729#comment-13276729 ] 

Rupert Westenthaler commented on STANBOL-614:
---------------------------------------------

There are several reasons why this could happen:

1) Correct detection of the language: For short texts, the correct language sometimes cannot be detected. In those cases, Enhancement Engines that depend on that information (e.g. openNLP-NER) will not work.
2) NER (Named Entity Recognition): Entities mentioned in parts of a text that are not full sentences are more likely to be overlooked.

If you send HTML to Apache Stanbol, it uses Apache Tika to convert the HTML to plain text. You can ask Stanbol to return the converted text, e.g. by making a request such as

    curl -v -X POST -H "Accept: text/plain" -H "Content-Type: text/html;charset=utf-8" --data-binary @test "http://dev.iks-project.eu:8081/enhancer?omitMetadata=true"

You can also request the metadata and all content elements (parsed and converted) with a request like

    curl -v -X POST -H "Accept: multipart/form-data" -H "Content-Type: text/html;charset=utf-8" --data-binary @test "http://dev.iks-project.eu:8081/enhancer?outputContent=*/*&rdfFormat=application/json"

This can help a lot with debugging.

In general: If you send very short documents to the Enhancer, I would advise using the "KeywordLinkingEngine" instead of the combination of "NamedEntityExtractionEnhancementEngine" and "NamedEntityTaggingEngine".

However, if you use the "KeywordLinkingEngine" in combination with dbpedia as a vocabulary, you might need to filter results based on the "fise:entity-type" (e.g. ignoring all "fise:EntityAnnotations" that do not have a value for "fise:entity-type").
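
A minimal sketch of such a filter, in Python. It assumes the enhancement results have already been parsed into a list of dicts keyed by full property URIs; that flat shape and the function names are assumptions made for illustration, not the Enhancer's actual JSON serialization, which needs to be mapped into this form first.

```python
# Assumed namespace URIs for the "fise:" and "rdf:" prefixes.
FISE = "http://fise.iks-project.eu/ontology/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def as_list(value):
    """Normalize a missing, single, or multi-valued property to a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple, set)):
        return list(value)
    return [value]


def keep_typed_entity_annotations(annotations):
    """Return only the fise:EntityAnnotations that carry a fise:entity-type.

    `annotations` is a hypothetical, already-parsed list of dicts, one per
    enhancement, keyed by full property URIs.
    """
    kept = []
    for annotation in annotations:
        types = as_list(annotation.get(RDF_TYPE))
        if FISE + "EntityAnnotation" not in types:
            continue  # not an EntityAnnotation (e.g. a TextAnnotation)
        # Ignore empty values so that "" does not count as a type.
        entity_types = [t for t in as_list(annotation.get(FISE + "entity-type")) if t]
        if entity_types:
            kept.append(annotation)
    return kept
```

Calling keep_typed_entity_annotations() on the parsed results drops the untyped suggestions and leaves only entities for which the vocabulary provides a type.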

An example of an Enhancement Chain configured to use the KeywordLinkingEngine with dbpedia can be found at [1].

best
Rupert


[1] http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword

                
> Enhancer returns inconsistent results
> -------------------------------------
>
>                 Key: STANBOL-614
>                 URL: https://issues.apache.org/jira/browse/STANBOL-614
>             Project: Stanbol
>          Issue Type: Bug
>          Components: Enhancer
>    Affects Versions: 0.10.0-incubating
>         Environment: Debian squeeze Linux 2.6.32-5-amd64 SMP x86_64
> java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>            Reporter: Nosiert Batiste
>              Labels: newbie
>
> I'm trying to implement a tag suggestion feature in a document editing application. I'm using the stanbol enhancer to get EntityAnnotations for a piece of HTML.
> This works great most of the time, but sometimes no results are returned. The difference between the text for which results are returned and the text for which no results are returned is sometimes only a single character.
> I was able to reduce one case down to an additional &nbsp;.
> With the following text, the enhancer returns an EntityAnnotation for Syria, but not for CNN:
> &nbsp; &nbsp; So, where does the Syria conflict stand now? CNN 
> With the following text, the enhancer returns EntityAnnotations for both Syria and CNN:
> &nbsp; So, where does the Syria conflict stand now? CNN 
> I post the text with the following command (where @test refers to the file that contains the text):
> curl -v -X POST -H "Accept: application/json" -H "Content-Type: text/html;charset=utf-8" --data-binary @test "http://localhost:8086/enhancer"
> I checked out stanbol from svn
> $ svnversion .
> 1337074
> and started it with the following command line
> java -Xmx1g -jar launchers/full/target/org.apache.stanbol.launchers.full-0.10.0-incubating-SNAPSHOT.jar -p 8086
> I will try to work around this problem by simply converting everything to plain text.
