Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/11/21 15:43:58 UTC
[jira] [Resolved] (STANBOL-734) ContentPart for NLP data - AnalyzedText
[ https://issues.apache.org/jira/browse/STANBOL-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rupert Westenthaler resolved STANBOL-734.
-----------------------------------------
Resolution: Fixed
Considered implemented with http://svn.apache.org/viewvc?rev=1412121&view=rev. Further changes and adaptations should be implemented in their own (more focused) issues.
> ContentPart for NLP data - AnalyzedText
> ---------------------------------------
>
> Key: STANBOL-734
> URL: https://issues.apache.org/jira/browse/STANBOL-734
> Project: Stanbol
> Issue Type: Sub-task
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> Managing NLP metadata, which is usually available at word granularity, is not feasible using the RDF metadata. This issue therefore describes the addition of a special ContentPart to Stanbol. This ContentPart will have the name AnalysedText.
> AnalysedText
> =====
> * It wraps the text/plain ContentPart of a ContentItem
> * It allows the definition of Spans (type, start, end, spanText). Type
> is an Enum: Text, TextSection, Sentence, Chunk, Span
> * Spans are sorted naturally by type, start and end. This makes it
> possible to use a NavigableSet (e.g. TreeSet) and its #subSet()
> functionality to work with contained Tokens. The #higher() and #lower()
> methods of NavigableSet even make it possible to build Iterators that
> tolerate concurrent modifications (e.g. adding Chunks while iterating
> over the Tokens of a Sentence).
> * One can attach Annotations to Spans. This is basically a multi-valued
> Map with Object keys and Value<valueType> value(s) that supports a
> type-safe view by using generically typed Annotation<key,valueType>
> objects.
> * The Value<valueType> object natively supports confidence. This
> allows the same instance (e.g. of the POS tag for Noun) to be used for
> all noun annotations.
> * Note that the AnalysedText does NOT use RDF to represent this kind
> of data, as RDF is not scalable enough. This also means that the
> data of the AnalysedText are NOT available in the Enhancement Metadata
> of the ContentItem. However, EnhancementEngines are free to write
> all/some results to the AnalysedText AND the RDF metadata of the
> ContentItem.
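The sorting and NavigableSet points above can be illustrated with a minimal, self-contained sketch. It is not the Stanbol implementation: the `Span` class, the `Type` enum (with a `TOKEN` constant), and the helper names `tokensIn` and `addPairChunks` are hypothetical stand-ins that only assume the ordering described above (type, then start, then end). Because the loop uses `higher()` instead of a live `Iterator`, adding Chunks while walking the Tokens does not throw a `ConcurrentModificationException`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-ins for illustration; NOT the Stanbol AnalysedText API.
class SpanSetSketch {

    // Declaration order is the primary sort key (assumed ordering).
    enum Type { SENTENCE, CHUNK, TOKEN }

    static final class Span implements Comparable<Span> {
        private static final Comparator<Span> ORDER = Comparator
                .comparing((Span s) -> s.type)
                .thenComparingInt(s -> s.start)
                .thenComparingInt(s -> s.end);

        final Type type;
        final int start;
        final int end;

        Span(Type type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public int compareTo(Span other) {
            return ORDER.compare(this, other);
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + "]";
        }
    }

    // All tokens that start inside the sentence, as a live view of the set.
    static NavigableSet<Span> tokensIn(NavigableSet<Span> spans, Span sentence) {
        Span from = new Span(Type.TOKEN, sentence.start, sentence.start);
        Span to = new Span(Type.TOKEN, sentence.end, sentence.end);
        return spans.subSet(from, true, to, false);
    }

    // Walks the tokens of a sentence with higher() and adds one chunk per
    // pair of consecutive tokens while the walk is still in progress.
    static List<Span> addPairChunks(NavigableSet<Span> spans, Span sentence) {
        List<Span> added = new ArrayList<>();
        Span upper = new Span(Type.TOKEN, sentence.end, sentence.end);
        Span current = spans.ceiling(new Span(Type.TOKEN, sentence.start, sentence.start));
        while (current != null && current.compareTo(upper) < 0) {
            Span next = spans.higher(current);
            if (next != null && next.compareTo(upper) < 0) {
                Span chunk = new Span(Type.CHUNK, current.start, next.end);
                spans.add(chunk); // safe: no live iterator is open
                added.add(chunk);
                current = spans.higher(next);
            } else {
                current = next;
            }
        }
        return added;
    }

    // Spans for the text "The cat sat. It slept."
    static NavigableSet<Span> sample() {
        NavigableSet<Span> spans = new TreeSet<>();
        spans.add(new Span(Type.SENTENCE, 0, 12));
        spans.add(new Span(Type.SENTENCE, 13, 22));
        int[][] tokens = {{0, 3}, {4, 7}, {8, 11}, {11, 12}, {13, 15}, {16, 21}, {21, 22}};
        for (int[] t : tokens) {
            spans.add(new Span(Type.TOKEN, t[0], t[1]));
        }
        return spans;
    }

    public static void main(String[] args) {
        NavigableSet<Span> spans = sample();
        Span firstSentence = spans.first();
        System.out.println("Tokens: " + tokensIn(spans, firstSentence));
        System.out.println("Added chunks: " + addPairChunks(spans, firstSentence));
    }
}
```

Sorting by type first keeps all spans of one type contiguous, so a single `subSet()` call bounded by two probe spans is enough to select the Tokens of a Sentence.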
> Here is some sample code:
>     AnalysedText at; //the contentPart
>     Iterator<Sentence> sentences = at.getSentences();
>     while(sentences.hasNext()){
>         Sentence sentence = sentences.next();
>         String sentText = sentence.getSpan();
>         Iterator<Token> tokens = sentence.getTokens();
>         while(tokens.hasNext()){
>             Token token = tokens.next();
>             String tokenText = token.getSpan();
>             Value<PosTag> pos = token.getAnnotation(
>                 NlpAnnotations.posAnnotation);
>             String tag = pos.value().getTag();
>             double confidence = pos.probability();
>         }
>     }
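How a multi-valued map with Object keys can still offer a type-safe view may be easier to see in a small sketch. The classes below (`Value`, `Annotation`, `Token`, `PosTag`, and the `POS` constant) are simplified, hypothetical stand-ins written for this illustration, not the Stanbol classes. The sketch also shows the confidence living on the `Value`, so that a single tag instance can be shared by many annotations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; NOT the Stanbol API.
class AnnotationSketch {

    // A value together with the confidence of whatever produced it.
    static final class Value<T> {
        private final T value;
        private final double probability;

        Value(T value, double probability) {
            this.value = value;
            this.probability = probability;
        }

        T value() { return value; }

        double probability() { return probability; }
    }

    // Typed key: binds a map key to the type of the values stored under it.
    static final class Annotation<K, V> {
        final K key;
        final Class<V> valueType;

        Annotation(K key, Class<V> valueType) {
            this.key = key;
            this.valueType = valueType;
        }
    }

    static final class PosTag {
        final String tag;

        PosTag(String tag) { this.tag = tag; }
    }

    static final class Token {
        // Untyped storage: Object keys, one or more values per key.
        private final Map<Object, List<Value<?>>> annotations = new HashMap<>();

        <K, V> void addAnnotation(Annotation<K, V> annotation, Value<V> value) {
            annotations.computeIfAbsent(annotation.key, k -> new ArrayList<>()).add(value);
        }

        // Typed view: every stored value is checked against the key's type.
        <K, V> List<Value<V>> getAnnotations(Annotation<K, V> annotation) {
            List<Value<?>> raw = annotations.getOrDefault(annotation.key, Collections.emptyList());
            List<Value<V>> typed = new ArrayList<>();
            for (Value<?> v : raw) {
                V checked = annotation.valueType.cast(v.value());
                typed.add(new Value<>(checked, v.probability()));
            }
            return typed;
        }

        // First value for the key, or null if there is none.
        <K, V> Value<V> getAnnotation(Annotation<K, V> annotation) {
            List<Value<V>> all = getAnnotations(annotation);
            return all.isEmpty() ? null : all.get(0);
        }
    }

    static final Annotation<String, PosTag> POS = new Annotation<>("pos", PosTag.class);

    public static void main(String[] args) {
        PosTag noun = new PosTag("NN"); // one shared instance
        Token cat = new Token();
        Token dog = new Token();
        cat.addAnnotation(POS, new Value<>(noun, 0.93));
        dog.addAnnotation(POS, new Value<>(noun, 0.71));
        System.out.println(cat.getAnnotation(POS).value() == dog.getAnnotation(POS).value());
        System.out.println(cat.getAnnotation(POS).probability());
    }
}
```

Keeping the probability out of the tag object is what allows the same `PosTag` instance to be reused for every noun, whatever confidence the tagger reports for each token.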
> NLP annotations
> =====
> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
> contains Tags of a specific generic type. The Tag only defines a
> String "tag" property
> * Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
> defined. Both also define an optional LexicalCategory. This is an enum
> with the 12 top level concepts defined by the
> [OLIA](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
> Adjective, Adposition, Adverb ...)
> * TagSets (including mapped LexicalCategories) are defined for all
> languages where POS taggers are available for OpenNLP. This also
> includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
> OLIA. The other TagSets used by OpenNLP are currently not provided by
> OLIA.
> * Note that the LexicalCategory can be used to process POS annotations
> of different languages
> TagSet:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
> POS:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
> A code sample:
>     TagSet<PosTag> tagSet; //the used TagSet
>     Map<String,PosTag> unknown; //missing tags in the TagSet
>     Token token; //the token
>     String tag; //the detected tag
>     double prob; //the probability
>     PosTag pos = tagSet.getTag(tag);
>     if(pos == null){ //unknown tag
>         pos = unknown.get(tag);
>     }
>     if(pos == null) {
>         pos = new PosTag(tag);
>         //this tag will not have a LexicalCategory
>         unknown.put(tag, pos); //only one instance
>     }
>     token.addAnnotation(
>         NlpAnnotations.POSAnnotation,
>         new Value<PosTag>(pos, prob));
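To illustrate how a LexicalCategory lets one piece of code process POS annotations of different languages, here is a minimal sketch. The `Tag`, `PosTag`, `TagSet` and `LexicalCategory` types below are simplified, hypothetical versions written for this example (the enum lists only three of the categories), and the two tag sets contain only a handful of tags: NN, VB and JJ from the Penn Treebank tag set, and NN, VVFIN and ADJA from STTS.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration; NOT the Stanbol classes.
class TagSetSketch {

    // Only a subset of the top level categories, for brevity.
    enum LexicalCategory { Noun, Verb, Adjective }

    // A Tag only defines the String "tag" property.
    static class Tag {
        final String tag;

        Tag(String tag) { this.tag = tag; }
    }

    static final class PosTag extends Tag {
        final LexicalCategory category; // null if the tag is not mapped

        PosTag(String tag) { this(tag, null); }

        PosTag(String tag, LexicalCategory category) {
            super(tag);
            this.category = category;
        }
    }

    // A TagSet is valid for 1..n languages and holds tags of one type.
    static final class TagSet<T extends Tag> {
        final String name;
        final Set<String> languages;
        private final Map<String, T> tags = new HashMap<>();

        TagSet(String name, String... languages) {
            this.name = name;
            this.languages = new HashSet<>(Arrays.asList(languages));
        }

        TagSet<T> addTag(T tag) {
            tags.put(tag.tag, tag);
            return this;
        }

        T getTag(String tag) { return tags.get(tag); }
    }

    static TagSet<PosTag> penn() {
        return new TagSet<PosTag>("penn", "en")
                .addTag(new PosTag("NN", LexicalCategory.Noun))
                .addTag(new PosTag("VB", LexicalCategory.Verb))
                .addTag(new PosTag("JJ", LexicalCategory.Adjective));
    }

    static TagSet<PosTag> stts() {
        return new TagSet<PosTag>("stts", "de")
                .addTag(new PosTag("NN", LexicalCategory.Noun))
                .addTag(new PosTag("VVFIN", LexicalCategory.Verb))
                .addTag(new PosTag("ADJA", LexicalCategory.Adjective));
    }

    // Language-independent: looks only at the mapped LexicalCategory.
    static long countNouns(TagSet<PosTag> tagSet, List<String> tags) {
        long nouns = 0;
        for (String t : tags) {
            PosTag pos = tagSet.getTag(t);
            if (pos != null && pos.category == LexicalCategory.Noun) {
                nouns++;
            }
        }
        return nouns;
    }

    public static void main(String[] args) {
        System.out.println(countNouns(penn(), Arrays.asList("NN", "VB", "NN")));
        System.out.println(countNouns(stts(), Arrays.asList("ADJA", "NN", "VVFIN")));
    }
}
```

Tags that are missing from a TagSet simply have no category in this sketch, which mirrors the note in the sample above that a newly created PosTag will not have a LexicalCategory.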