Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/11/21 14:25:57 UTC

[jira] [Comment Edited] (STANBOL-795) OpenNLP Tokenizer Engine

    [ https://issues.apache.org/jira/browse/STANBOL-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501935#comment-13501935 ] 

Rupert Westenthaler edited comment on STANBOL-795 at 11/21/12 1:24 PM:
-----------------------------------------------------------------------

Documentation for this engine

OpenNLP Tokenizer Engine
===========

The OpenNLP Tokenizer Engine adds _Token_s to the _AnalyzedText_ content part of the ContentItem. If this content part is not yet present, the engine creates it.

## Consumed information

* __Language__ (required): The language of the text needs to be available. It is read from the metadata of the ContentItem as specified by [STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613). Effectively this means that a Stanbol Language Detection engine needs to be executed before the OpenNLP Tokenizer Engine.
* __Sentences__ (optional): If _Sentence_s are available in the _AnalyzedText_ content part, the text is tokenized sentence by sentence; otherwise the whole text is tokenized at once (see the sketch after this list).
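
The following sketch illustrates what sentence-wise tokenization means in terms of character offsets. It uses the plain OpenNLP API rather than Stanbol's _AnalyzedText_ API; the example text, the hand-derived sentence spans and the use of the SimpleTokenizer are assumptions made for this illustration only.

    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.util.Span;

    public class SentenceWiseTokenization {

        public static void main(String[] args) {
            String text = "Paris is the capital of France. It is located on the Seine.";

            // Sentence offsets as a preceding sentence detection engine might
            // provide them (derived here by hand for illustration only).
            int second = text.indexOf("It ");
            Span[] sentences = new Span[]{
                    new Span(0, second - 1),         // "Paris is ... France."
                    new Span(second, text.length())  // "It is ... Seine."
            };

            // SimpleTokenizer is used for simplicity; a language-specific
            // TokenizerME would be preferred where a model is available.
            Tokenizer tokenizer = SimpleTokenizer.INSTANCE;

            for (Span sentence : sentences) {
                String sentText = sentence.getCoveredText(text).toString();
                // tokenizePos(..) returns offsets relative to the sentence; add
                // the sentence start to get offsets relative to the whole text.
                for (Span token : tokenizer.tokenizePos(sentText)) {
                    int start = sentence.getStart() + token.getStart();
                    int end = sentence.getStart() + token.getEnd();
                    System.out.println(text.substring(start, end) + " [" + start + "," + end + ")");
                }
            }
        }
    }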

## Configuration

The OpenNLP Tokenizer Engine provides a default service instance (the configuration policy is optional). This instance processes all languages: language-specific tokenizer models are used where available, and the OpenNLP SIMPLE_TOKENIZER is used for all other languages. The instance uses the name 'opennlp-token' and has a service ranking of '-100'.

In addition to the default configuration properties __name__ _(stanbol.enhancer.engine.name)_ and __ranking__ _(service.ranking)_, the engine allows configuring the __processed languages__ _(org.apache.stanbol.enhancer.token.languages)_ as well as a parameter that specifies the name of the tokenizer model used for a language.

__1. Processed Language Configuration:__

The processed languages are configured using the following syntax:

    de
    en
    
This would configure the engine to only process German and English texts. It is also possible to explicitly exclude languages:

    !fr
    !it
    *

This specifies that texts in all languages other than French and Italian are tokenized.

Values can be provided as an Array or Vector by using the ["elem1","elem2",...] syntax defined for OSGi ".config" files. As a fallback, comma-separated Strings are also supported.

The following example combines the two examples above into a single configuration:

    org.apache.stanbol.enhancer.token.languages=["!fr","!it","de","en","*"]
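
For illustration, the following sketch shows one way such an include/exclude configuration with a '*' wildcard can be interpreted. It is not the engine's actual implementation; the class and method names are made up for this example.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    /**
     * Minimal sketch of interpreting an include/exclude language configuration
     * such as ["!fr","!it","de","en","*"]. Hypothetical helper, not the
     * engine's actual implementation.
     */
    public class LanguageFilter {

        private final Set<String> included = new HashSet<>();
        private final Set<String> excluded = new HashSet<>();
        private boolean wildcard = false;

        public LanguageFilter(List<String> config) {
            for (String entry : config) {
                entry = entry.trim();
                if (entry.isEmpty()) {
                    continue;
                } else if ("*".equals(entry)) {
                    wildcard = true;                  // process all languages ...
                } else if (entry.startsWith("!")) {
                    excluded.add(entry.substring(1)); // ... except explicitly excluded ones
                } else {
                    // entries may carry parameters, e.g. "de;model=x.bin"
                    included.add(entry.split(";", 2)[0]);
                }
            }
        }

        public boolean isProcessed(String lang) {
            if (excluded.contains(lang)) {
                return false;                         // explicit exclusions always win
            }
            return wildcard || included.contains(lang);
        }

        public static void main(String[] args) {
            LanguageFilter filter = new LanguageFilter(
                    Arrays.asList("!fr", "!it", "de", "en", "*"));
            System.out.println(filter.isProcessed("de")); // true
            System.out.println(filter.isProcessed("fr")); // false
            System.out.println(filter.isProcessed("nl")); // true (wildcard)
        }
    }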

__2. Tokenizer Model Parameter:__

The OpenNLP Tokenizer Engine supports the 'model' parameter to explicitly specify the name of the tokenizer model used for a language. Tokenizer models are loaded via the Stanbol DataFile provider infrastructure, which means that models can be loaded from the {stanbol-working-dir}/stanbol/datafiles folder.

The syntax for parameters is as follows:

    {language};{param-name}={param-value}

So to use "my-de-pos-model.zip" as the tokenizer model for German texts, one can use a configuration like the following:

    de;model=my-de-pos-model.zip
    *
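
The following sketch shows how such an entry can be split into the language code and its parameters. It is a hypothetical helper for illustration only, not the engine's actual parsing code.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * Sketch of parsing a "{language};{param-name}={param-value}" entry such
     * as "de;model=my-de-pos-model.zip". Hypothetical helper for illustration.
     */
    public class LanguageParamParser {

        public static void main(String[] args) {
            String entry = "de;model=my-de-pos-model.zip";

            String[] parts = entry.trim().split(";");
            String lang = parts[0];                       // "de"
            Map<String, String> params = new LinkedHashMap<>();
            for (int i = 1; i < parts.length; i++) {
                String[] kv = parts[i].split("=", 2);     // e.g. "model=my-de-pos-model.zip"
                params.put(kv[0], kv.length > 1 ? kv[1] : null);
            }

            System.out.println(lang);                     // de
            System.out.println(params.get("model"));      // my-de-pos-model.zip
        }
    }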

To configure that the SIMPLE_TOKENIZER should be used for a given language, the 'model' parameter needs to be set to 'SIMPLE', as shown in the following example:

    de;model=SIMPLE
    *

By default, OpenNLP tokenizer models are loaded for the name '{lang}-token.bin'. To use models with other names, users need to use the 'model' parameter as described above.
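
The following sketch illustrates this fallback behaviour using the plain OpenNLP API: a language-specific TokenizerME is used if a '{lang}-token.bin' model file can be found, otherwise the SimpleTokenizer is used. It loads the model with plain java.io instead of the Stanbol DataFile provider, and the datafiles directory path is an assumption.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    /**
     * Sketch of the model-loading fallback: use a language-specific
     * TokenizerME if a "{lang}-token.bin" model is present, otherwise fall
     * back to the SimpleTokenizer. Uses plain java.io instead of the Stanbol
     * DataFile provider; the datafiles directory is an assumption.
     */
    public class TokenizerFactory {

        private final File datafileDir;

        public TokenizerFactory(File datafileDir) {
            this.datafileDir = datafileDir;
        }

        public Tokenizer getTokenizer(String lang) throws IOException {
            File modelFile = new File(datafileDir, lang + "-token.bin");
            if (modelFile.isFile()) {
                try (InputStream in = new FileInputStream(modelFile)) {
                    return new TokenizerME(new TokenizerModel(in));
                }
            }
            // no language-specific model available -> simple tokenizer
            return SimpleTokenizer.INSTANCE;
        }

        public static void main(String[] args) throws IOException {
            TokenizerFactory factory = new TokenizerFactory(
                    new File("/path/to/stanbol/datafiles"));   // assumed location
            Tokenizer tokenizer = factory.getTokenizer("de");
            System.out.println(String.join("|", tokenizer.tokenize("Das ist ein Test.")));
        }
    }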
                
> OpenNLP Tokenizer Engine
> ------------------------
>
>                 Key: STANBOL-795
>                 URL: https://issues.apache.org/jira/browse/STANBOL-795
>             Project: Stanbol
>          Issue Type: Sub-task
>          Components: Enhancer
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Implement a separate OpenNLP Tokenizer Engine.
> While some engines like the OpenNLP POS or the CELI Lemmatizer engine support tokenizing (if tokens do not already exist in the AnalyzedText), it is important to implement an engine explicitly for this task.
> This engine also supports the language configuration (see the following example):
>     en;model=SIMPLE
>     de;model=mySpecificTokenizerModel_de.bin
>     !jp
>     !zh
>     *
> The 'model' parameter can be used to load specific tokenizer models. "SIMPLE" forces the use of the OpenNLP SimpleTokenizer. If no model configuration is present, the default tokenizer for the language is loaded ("{lang}-token.bin", or the SimpleTokenizer if no such language model is present).
