You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/11/21 15:43:58 UTC

[jira] [Resolved] (STANBOL-734) ContentPart for NLP data - AnalyzedText

     [ https://issues.apache.org/jira/browse/STANBOL-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler resolved STANBOL-734.
-----------------------------------------

    Resolution: Fixed

considered to be implemented with http://svn.apache.org/viewvc?rev=1412121&view=rev. Further changes adaptions should be implemented in own (more focused) issues
                
> ContentPart for NLP data - AnalyzedText
> ---------------------------------------
>
>                 Key: STANBOL-734
>                 URL: https://issues.apache.org/jira/browse/STANBOL-734
>             Project: Stanbol
>          Issue Type: Sub-task
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Because the management of NLP metadata - that is usually available on a word granularity - is not feasible using the RDF metadata this describes the addition of a special ContentPart Stanbol. This ContentPart will have the name AnalysedText.
> AnalysedText
> =====
> * It wraps the text/plain ContentPart of a ContentItem
> * It allows the definition of Spans (type, start, end, spanText). Type
> is an Enum: Text, TextSection, Sentence, Chunk, Span
> * Spans are sorted naturally by type, start and end. This allows to
> use a NavigateableSet (e.g. TreeSet) and the #subSet() functionality
> to work with contained Tokens. The #higher and #lower methods of
> NavigateableSet even allow to build Iterators that allow concurrent
> modifications (e.g adding Chunks while iterating over the Tokens of a
> Sentence).
> * One can attach Annotations to Spans. Basically a multi-valued Map
> with Object keys and Value<valueType> value(s) that support a type
> save view by using generically typed Annotation<key,valueType>
> * The Value<valueType> object natively supports confidence. This
> allows (e.g. for POS tags) to use the same instance ( e.g. of the POS
> tag for Noun) to be used for all noun annotations.
> * Note that the AnalysedText does NOT use RDF as representing those
> kind of data as RDF is not scaleable enough. This also means that the
> data of the AnalysedText are NOT available in the Enhancement Metadata
> of the ContentItem. However EnhancementEngines are free to write
> all/some results to the AnalysedText AND the RDF metadata of the
> ContentItem.
> Here is a sample code
>     AnalysedText at; //the contentPart
>     Iterator<Sentence> sentences = at.getSentences;
>     while(sentences.hasNext){
>         Sentence sentence = sentences.next();
>         String sentText = sentence.getSpan();
>         Iterator<SentenceToken> tokens = sentence.getTokens();
>         while(tokens.hasNext()){
>             Token token = tokens.next();
>             String tokenText = token.getSpan();
>             Value<PosTag> pos = token.getAnnotation(
>                 NlpAnnotations.posAnnotation);
>             String tag = pos.value().getTag();
>             double confidence = pos.probability();
>         }
>     }
> NLP annotations
> =====
> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
> contains Tags of a specific generic type. The Tag only defines a
> String "tag" property
> * Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
> defined. Both define also an optional LexicalCategory. This is a enum
> with the 12 top level concepts defined by the
> [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
> Adjective, Adposition, Adverb ...)
> * TagSets (including mapped LexicalCategories) are defined for all
> languages where POS taggers are available for OpenNLP. This includes
> also the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
> OLIA. The other TagSets used by OpenNLP are currently not available by
> Olia.
> * Note that the LexicalCategory can be used to process POS annotations
> of different languages
> TagSet:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
> POS:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
> A code sample:
>     TagSet<PosTag> tagSet; //the used TagSet
>     Map<String,PosTag> unknown; //missing tags in the TagSet
>     Token token; //the token
>     String tag; //the detected tag
>     double prob; //the probability
>     PosTag pos = tagset.getTag(tag);
>     if(pos == null){ //unkonw tag
>         pos = unknown.get(tag);
>     }
>     if(pos == null) {
>         pos = new PosTag(tag);
>         //this tag will not have a LexicalCategory
>         unknown.add(pos); //only one instance
>     }
>     token.addAnnotation(
>         NlpAnnotations.POSAnnotation,
>         new Value<PosTag>(pos, prob));

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira