Posted to commits@stanbol.apache.org by rw...@apache.org on 2013/03/20 16:25:11 UTC

svn commit: r1458884 - in /stanbol/site/trunk/content/docs/trunk/components/enhancer/engines: kuromojinlp.mdtext textannotationnewmodel.mdtext

Author: rwesten
Date: Wed Mar 20 15:25:11 2013
New Revision: 1458884

URL: http://svn.apache.org/r1458884
Log:
Added documentation for the TextAnnotation new Model Engine (STANBOL-953) as well as the Kuromoji NLP engine for Japanese (STANBOL-980)

Added:
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext

Added: stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext?rev=1458884&view=auto
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext (added)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext Wed Mar 20 15:25:11 2013
@@ -0,0 +1,25 @@
+title: Kuromoji NLP Engine for Japanese
+
+[Kuromoji](http://www.atilika.org/) is an NLP framework contributed to [Apache Lucene](http://lucene.apache.org). It is available starting with versions 3.6.2 and 4.1 of Solr/Lucene. In Stanbol it requires a version newer than [revision 1458703](http://svn.apache.org/r1458703), as it only works with the stanbol.commons.solr modules compatible with Solr 4.1.
+
+
+## Consumed information
+
+* __Language__ (required): The language of the text needs to be available. It is read as specified by [STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613) from the metadata of the ContentItem. Effectively this means that a Stanbol Language Detection engine needs to be executed before the Kuromoji NLP Engine.
+
+## Supported modules
+
+* __Sentences__: Kuromoji itself does not provide sentence detection, so sentences are detected from the POS tagging results. The POS tag '記号-句点' is used for splitting sentences. It is further assumed that each text starts and ends with a complete sentence.
+* __Tokens__: Kuromoji is configured to provide tokens for all words and punctuation. This is done by configuring an empty stop tag list and setting the 'discardPunctuation' property to <code>false</code>.
+* __POS tagging__: The POS tag set used by Kuromoji was mapped to the LexicalCategories and POS types as defined by the Stanbol NLP processing module. The Japanese name is used as the string tag (e.g. '名詞-代名詞-縮約' := Pos.Pronoun, Pos.Participle; description: noun-pronoun-contraction, a spoken-language contraction made by combining a pronoun and the particle 'wa', e.g. ありゃ, こりゃ, こりゃあ, そりゃ, そりゃあ).
+    POS tags are represented by adding _NlpAnnotations#POS_ANNOTATION_'s to the _Tokens_ of the _AnalyzedText_ content part. Kuromoji provides only a single POS tag per Token.
+* __NER detection__: The POS tag set used by Kuromoji defines POS tags describing named entities. Those POS tags are then combined into chunks and interpreted as named entities (e.g. '名詞-固有名詞-人名-姓' noun-proper-person-surname; '名詞-固有名詞-人名-名' noun-proper-person-given_name).
+    Named Entities are represented by adding _NlpAnnotations#NER_ANNOTATION_'s to the _Tokens_ of the _AnalyzedText_ content part. In addition, 'fise:TextAnnotations' are added to the metadata of the ContentItem.
+
+### Confidence
+
+Kuromoji does not provide confidence values for results.
+
+## Configuration
+
+The engine does not provide any custom configuration. However, it supports the configuration of the engine name.
\ No newline at end of file
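
Editorial note on the sentence detection described in kuromojinlp.mdtext: the splitting rule can be pictured with a minimal sketch. This is not the engine's code (the engine is part of Stanbol and works on the AnalyzedText content part). The token shape, the function name split_sentences and the sample POS tags other than '記号-句点' are assumptions made for illustration only.

```python
from typing import List, Tuple

# POS tag named in the documentation as the sentence delimiter.
SENTENCE_END_TAG = "記号-句点"

# Hypothetical token shape for this sketch: (surface form, POS tag).
Token = Tuple[str, str]


def split_sentences(tokens: List[Token]) -> List[str]:
    """Group tokens into sentences, closing one at each sentence-end tag."""
    sentences: List[str] = []
    current: List[str] = []
    for surface, pos in tokens:
        current.append(surface)
        if pos == SENTENCE_END_TAG:
            sentences.append("".join(current))
            current = []
    if current:
        # The text is assumed to end with a complete sentence, so trailing
        # tokens without a closing full stop still form a sentence.
        sentences.append("".join(current))
    return sentences


# Sample input; the POS tags on the words are illustrative.
tokens = [
    ("これ", "名詞-代名詞-一般"),
    ("は", "助詞-係助詞"),
    ("本", "名詞-一般"),
    ("です", "助動詞"),
    ("。", SENTENCE_END_TAG),
    ("ありがとう", "感動詞"),
    ("。", SENTENCE_END_TAG),
]

for sentence in split_sentences(tokens):
    print(sentence)
```

Because the first sentence is assumed to start at the beginning of the text, no explicit start marker is needed: every sentence begins right after the previous delimiter.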

Added: stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext?rev=1458884&view=auto
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext (added)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext Wed Mar 20 15:25:11 2013
@@ -0,0 +1,11 @@
+Title: TextAnnotation new Model Converter Engine
+
+This Engine converts '[fise:TextAnnotation](../enhancementstructure#fisetextannotation)' to include the 'fise:selection-prefix' and 'fise:selection-suffix' properties as introduced by [STANBOL-987](https://issues.apache.org/jira/browse/STANBOL-987).
+
+It processes all 'fise:TextAnnotation's that select a specific part of the text, meaning that they define both a 'fise:start' and a 'fise:end' property. 'fise:TextAnnotation's that already define 'fise:selection-prefix' or 'fise:selection-suffix' properties are skipped.
+
+## Configuration:
+
+Other than the configuration of the engine's name and ranking, this engine supports the following custom property:
+
+* __Prefix Suffix Length__ _(enhancer.engines.textannotationnewmodel.prefixSuffixSize)_: Sets the length, in characters, of the prefixes and suffixes. The default is <code>10</code>. The minimum allowed value is <code>3</code>.
\ No newline at end of file
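
Editorial note on the prefix/suffix length described in textannotationnewmodel.mdtext: the sketch below shows how a configured size could translate into context strings around a selection. It is not the engine's code. It assumes that the prefix is the text immediately before 'fise:start' and the suffix the text immediately after 'fise:end', and that a size below the minimum is rejected; the function name selection_context and the error handling are hypothetical.

```python
from typing import Tuple

DEFAULT_SIZE = 10  # documented default of prefixSuffixSize
MIN_SIZE = 3       # documented minimum of prefixSuffixSize


def selection_context(text: str, start: int, end: int,
                      size: int = DEFAULT_SIZE) -> Tuple[str, str]:
    """Return (prefix, suffix) of up to `size` chars around text[start:end]."""
    if size < MIN_SIZE:
        # Assumption: how the real engine reacts to too-small values
        # is not stated in the documentation.
        raise ValueError(f"size must be at least {MIN_SIZE}")
    if not 0 <= start <= end <= len(text):
        raise ValueError("start/end outside of the text")
    # Near the text boundaries fewer than `size` chars are available.
    prefix = text[max(0, start - size):start]
    suffix = text[end:end + size]
    return prefix, suffix


text = "Apache Stanbol extracts entities from text."
start, end = 7, 14  # selects "Stanbol"
prefix, suffix = selection_context(text, start, end)
print(repr(prefix), repr(text[start:end]), repr(suffix))
```

Near the start or end of the text the returned strings are shorter than the configured size, which is why the sketch clamps the slice rather than padding it.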