Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/11/21 15:11:58 UTC
[jira] [Comment Edited] (STANBOL-796) OpenNLP Sentence Detection Engine

    [ https://issues.apache.org/jira/browse/STANBOL-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501968#comment-13501968 ] 

Rupert Westenthaler edited comment on STANBOL-796 at 11/21/12 2:10 PM:
-----------------------------------------------------------------------

Documentation for this Engine

OpenNLP Sentence Detection Engine
==============

The OpenNLP Sentence Detection Engine adds _Sentence_s to the _[AnalyzedText](../nlp/analyzedtext)_ content part. If the _AnalyzedText_ content part is not yet present, it is created by this engine.

## Consumed information

* __Language__ (required): The language of the text needs to be available. It is read from the metadata of the ContentItem as specified by [STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613). Effectively this means that a Stanbol Language Detection engine needs to be executed before the OpenNLP Sentence Detection Engine.

## Configuration

The OpenNLP Sentence Detection Engine provides a default service instance (configuration policy is optional). This instance processes all languages and adds _Sentence_s for all languages where an OpenNLP sentence detection model is available. This Engine instance uses the name 'opennlp-sentence' and has a service ranking of '-100'.

This engine supports the default configuration for Enhancement Engines, including the __name__ _(stanbol.enhancer.engine.name)_ and the __ranking__ _(service.ranking)_. In addition it is possible to configure the __processed languages__ _(org.apache.stanbol.enhancer.sentence.languages)_ and a parameter to specify the name of the sentence detection model used for a language.

__1. Processed Language Configuration:__

For the configuration of the processed languages the following syntax is used:

    de
    en
    
This would configure the Engine to only process German and English texts. It is also possible to explicitly exclude languages:

    !fr
    !it
    *

This specifies that all languages other than French and Italian are processed.

Values can be passed as an Array or Vector, using the ["elem1","elem2",...] syntax defined for OSGi ".config" files. As a fallback, ','-separated Strings are also supported.

The following example shows the two examples above combined into a single configuration.

    org.apache.stanbol.enhancer.sentence.languages=["!fr","!it","de","en","*"]
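To illustrate the two accepted value forms, here is a minimal Python sketch that normalizes either an array-like value or a ','-separated String into a list of entries. The function name `normalize_languages_value` is hypothetical and this is not the engine's actual (Java) implementation. Trimming whitespace and dropping empty entries are assumptions.

```python
def normalize_languages_value(value):
    """Return the configured entries as a list of trimmed, non-empty strings.

    Accepts either a list/tuple of entries (the Array or Vector form) or a
    single ','-separated string (the fallback form).
    Trimming whitespace and dropping empty entries are assumptions.
    """
    if isinstance(value, str):
        entries = value.split(",")
    else:
        entries = list(value)
    return [entry.strip() for entry in entries if entry.strip()]


if __name__ == "__main__":
    # Both forms yield the same list of entries.
    print(normalize_languages_value(["!fr", "!it", "de", "en", "*"]))
    print(normalize_languages_value("!fr,!it,de,en,*"))
```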

NOTE that the "processed language" configuration only specifies which languages are considered for processing. If "de" is enabled but no sentence detection model is available for that language, then German text will still not be processed. Conversely, if there is a sentence detection model for "it" but the "processed language" configuration does not include Italian, then Italian text will NOT be processed.
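The interplay between the "processed language" configuration and model availability can be sketched as follows. This Python sketch only illustrates the rules described above and is not the engine's code. The function `is_processed` and the `available_models` set are hypothetical, and giving explicit exclusions precedence over the '*' wildcard is an assumption consistent with the examples.

```python
def is_processed(entries, lang, available_models):
    """Decide whether text in `lang` would be processed.

    entries          -- configured entries such as ["!fr", "de", "*"]
    lang             -- language code of the text
    available_models -- languages with a sentence detection model
                        (hypothetical stand-in for the model lookup)
    """
    # Ignore any parameters after ';' (e.g. "de;model=...").
    names = [entry.split(";", 1)[0].strip() for entry in entries]
    excluded = {name[1:] for name in names if name.startswith("!")}
    included = {name for name in names
                if not name.startswith("!") and name != "*"}
    wildcard = "*" in names

    # Assumption: an explicit exclusion always wins over the wildcard.
    configured = lang not in excluded and (lang in included or wildcard)
    # A language is only processed if it is configured AND a model exists.
    return configured and lang in available_models


if __name__ == "__main__":
    config = ["!fr", "!it", "de", "en", "*"]
    print(is_processed(config, "de", {"en", "it"}))  # enabled, but no model
    print(is_processed(config, "it", {"en", "it"}))  # model, but excluded
    print(is_processed(config, "en", {"en", "it"}))  # enabled and model
```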

__2. Sentence detection model parameter__

The OpenNLP Sentence Detection engine supports the 'model' parameter to explicitly specify the name of the sentence detection model used for a language. Models are loaded via the Stanbol DataFile provider infrastructure. That means that models can be loaded from the {stanbol-working-dir}/stanbol/datafiles folder.

The syntax for parameters is as follows:

    {language};{param-name}={param-value}

So to use "my-de-sentence-model.zip" for detecting sentences in German texts, one can use the following configuration:

    de;model=my-de-sentence-model.zip
    *

By default OpenNLP sentence detection models are loaded from '{lang}-sent.bin'. To use models with other names users need to use the 'model' parameter as described above.
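The model name resolution described above can be sketched in Python as follows. The helpers `parse_entry` and `model_name` are hypothetical and do not reflect the engine's actual implementation. Support for several ';'-separated parameters on one entry is an assumption, since the text only shows a single 'model' parameter.

```python
def parse_entry(entry):
    """Split "{language};{param-name}={param-value}" into (language, params).

    Assumption: further parameters, if any, are also separated by ';'.
    """
    parts = [part.strip() for part in entry.split(";")]
    language, params = parts[0], {}
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if sep:  # skip fragments that contain no '='
            params[name.strip()] = value.strip()
    return language, params


def model_name(entries, lang):
    """Return the configured model file name for `lang`.

    Falls back to the default '{lang}-sent.bin' if no 'model' parameter
    is configured for that language.
    """
    for entry in entries:
        language, params = parse_entry(entry)
        if language == lang and "model" in params:
            return params["model"]
    return f"{lang}-sent.bin"


if __name__ == "__main__":
    config = ["de;model=my-de-sentence-model.zip", "*"]
    print(model_name(config, "de"))
    print(model_name(config, "en"))
```

With the configuration from the example, German resolves to the explicitly configured file, while any other language falls back to the '{lang}-sent.bin' default.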

                
                  
> OpenNLP Sentence Detection Engine
> ---------------------------------
>
>                 Key: STANBOL-796
>                 URL: https://issues.apache.org/jira/browse/STANBOL-796
>             Project: Stanbol
>          Issue Type: Sub-task
>          Components: Enhancer
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Implement an OpenNLP based Sentence Detection Engine that supports language specific configurations.
> e.g. 
>     !zh
>     !jp
>     de;model=mySentDecModel.bin
>     *
> The 'model' parameter configures the name of the sentence detection model used for a given language. If nothing is specified, "{lang}-sent.bin" is assumed as default.
> Texts in languages where no model is present are not processed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira