Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/09/10 12:53:43 UTC

Re: Stanbol Enhancement Modules for NLP

Hi all,

In preparation of the upcoming code contribution I created a set of
Issues describing the features discussed in this thread.

   https://issues.apache.org/jira/browse/STANBOL-733

As soon as a patch with the current state of the development is
attached to the issue (hopefully in the coming days) I will create a
separate branch for further development.

best
Rupert

On Tue, Aug 7, 2012 at 1:27 PM, Fabian Christ
<ch...@googlemail.com> wrote:
> Hi,
>
> thanks for sharing these ideas. It totally fits into Stanbol as an
> important part of the content enhancement process. This would enable
> people to dig into the art of programming a high quality engine.
>
> I never had the time to play with UIMA, but have you guys checked out
> how this task is performed there? Is there something good or bad we
> can learn from them?
>
> Best,
>  - Fabian
>
> 2012/8/6 harish suvarna <hs...@gmail.com>:
>> Really, really exciting to hear all this. Quality enhancement engines are
>> key to this great platform.
>> Count me in as developer, tester, reviewer and contributor or in any other
>> role.
>> More on this subject later this week when I have some free time.
>>
>> Thanks,
>> Harish
>>
>>
>> On Mon, Aug 6, 2012 at 4:52 AM, Rupert Westenthaler <
>> rupert.westenthaler@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> First, thanks to Sebastian for writing this mail. I will try to add
>>> some additional information to it.
>>>
>>> First, let me provide an overview of the AnalysedText API.
>>>
>>> AnalysedText ContentPart
>>> =====
>>>
>>> You can find the source discussed in this part at
>>>
>>>
>>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/model
>>>
>>> * It wraps the text/plain ContentPart of a ContentItem
>>> * It allows the definition of Spans (type, start, end, spanText). The
>>> type is an Enum: Text, TextSection, Sentence, Chunk, Token
>>> * Spans are sorted naturally by type, start and end. This makes it
>>> possible to use a NavigableSet (e.g. TreeSet) and its #subSet()
>>> functionality to work with contained Tokens. The #higher and #lower
>>> methods of NavigableSet even allow building Iterators that tolerate
>>> concurrent modifications (e.g. adding Chunks while iterating over
>>> the Tokens of a Sentence); see the sketch after this list.
>>> * One can attach Annotations to Spans: basically a multi-valued Map
>>> with Object keys and Value<valueType> value(s) that supports a
>>> type-safe view via generically typed Annotation<key,valueType>
>>> * The Value<valueType> object natively supports confidence. This
>>> allows the same instance (e.g. the PosTag for Noun) to be reused for
>>> all noun annotations.
>>>
>>> * Note that the AnalysedText does NOT use RDF, as representing this
>>> kind of data in RDF does not scale well enough. This also means that
>>> the data of the AnalysedText is NOT available in the enhancement
>>> metadata of the ContentItem. However, EnhancementEngines are free to
>>> write all or some results to both the AnalysedText AND the RDF
>>> metadata of the ContentItem.
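>>>
>>> Before the actual sample code, here is a minimal self-contained
>>> sketch of the span ordering idea (a hypothetical Span class, NOT the
>>> actual classes from the repository): a natural ordering lets a
>>> NavigableSet answer "all Spans enclosed by this Sentence" via
>>> #subSet():
>>>
>>>     import java.util.NavigableSet;
>>>     import java.util.TreeSet;
>>>
>>>     enum SpanType { Text, TextSection, Sentence, Chunk, Token }
>>>
>>>     class Span implements Comparable<Span> {
>>>         final SpanType type;
>>>         final int start, end;
>>>         Span(SpanType type, int start, int end) {
>>>             this.type = type; this.start = start; this.end = end;
>>>         }
>>>         public int compareTo(Span o) {
>>>             //earlier spans first; at the same start, enclosing
>>>             //(longer) spans before enclosed (shorter) ones
>>>             if (start != o.start) return Integer.compare(start, o.start);
>>>             if (end != o.end) return Integer.compare(o.end, end);
>>>             return type.compareTo(o.type);
>>>         }
>>>         public String toString() {
>>>             return type + "[" + start + "," + end + "]";
>>>         }
>>>     }
>>>
>>>     public class SpanSetDemo {
>>>         public static void main(String[] args) {
>>>             NavigableSet<Span> spans = new TreeSet<Span>();
>>>             Span sentence = new Span(SpanType.Sentence, 0, 20);
>>>             spans.add(sentence);
>>>             spans.add(new Span(SpanType.Token, 0, 4));
>>>             spans.add(new Span(SpanType.Token, 5, 10));
>>>             spans.add(new Span(SpanType.Token, 11, 20));
>>>             spans.add(new Span(SpanType.Token, 21, 25)); //next sentence
>>>             //all spans that follow the Sentence in the ordering but
>>>             //start before the Sentence ends: its three Tokens
>>>             for (Span s : spans.subSet(sentence, false,
>>>                     new Span(SpanType.Token, sentence.end, sentence.end),
>>>                     false)) {
>>>                 System.out.println(s);
>>>             }
>>>         }
>>>     }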
>>>
>>> Here is some sample code:
>>>
>>>     AnalysedText at; //the ContentPart
>>>     Iterator<Sentence> sentences = at.getSentences();
>>>     while(sentences.hasNext()){
>>>         Sentence sentence = sentences.next();
>>>         String sentText = sentence.getSpan();
>>>         Iterator<Token> tokens = sentence.getTokens();
>>>         while(tokens.hasNext()){
>>>             Token token = tokens.next();
>>>             String tokenText = token.getSpan();
>>>             //read the POS annotation of the Token
>>>             Value<PosTag> pos = token.getAnnotation(
>>>                 NlpAnnotations.posAnnotation);
>>>             String tag = pos.value().getTag();
>>>             double confidence = pos.probability();
>>>         }
>>>     }
>>>
>>> NLP annotations
>>> =====
>>>
>>> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages
>>> and contains Tags of a specific generic type. The Tag itself only
>>> defines a String "tag" property
>>> * Currently, Tags for POS (PosTag) and chunking (PhraseTag) are
>>> defined. Both also define an optional LexicalCategory. This is an
>>> enum with the 12 top-level concepts defined by the
>>> [OLiA](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
>>> Adjective, Adposition, Adverb ...)
>>> * TagSets (including mapped LexicalCategories) are defined for all
>>> languages for which OpenNLP provides POS taggers. This includes the
>>> "penn.owl", "stts.owl" and "parole_es_cat.owl" models provided by
>>> OLiA. The other TagSets used by OpenNLP are not yet covered by OLiA.
>>> * Note that the LexicalCategory can be used to process POS
>>> annotations across different languages (see the sketch after the
>>> code sample below)
>>>
>>> TagSet:
>>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
>>> POS:
>>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
>>>
>>>
>>> A code sample:
>>>
>>>     TagSet<PosTag> tagSet; //the used TagSet
>>>     Map<String,PosTag> unknown; //tags missing from the TagSet
>>>
>>>     Token token; //the token
>>>     String tag; //the detected tag
>>>     double prob; //the probability
>>>
>>>     PosTag pos = tagSet.getTag(tag);
>>>     if(pos == null){ //unknown tag
>>>         pos = unknown.get(tag);
>>>     }
>>>     if(pos == null) {
>>>         pos = new PosTag(tag);
>>>         //this tag will not have a LexicalCategory
>>>         unknown.put(tag, pos); //keep a single instance per tag
>>>     }
>>>     token.addAnnotation(
>>>         NlpAnnotations.posAnnotation,
>>>         new Value<PosTag>(pos, prob));
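>>>
>>> To illustrate the point about cross-language processing, here is a
>>> tiny hypothetical sketch (simplified stand-ins, NOT the actual
>>> PosTag/LexicalCategory types): two language-specific tags from
>>> different TagSets map to the same LexicalCategory, so downstream
>>> code never needs to know the raw tags:
>>>
>>>     enum LexicalCategory { Noun, Verb, Adjective /* ...12 in total */ }
>>>
>>>     class SimplePosTag {
>>>         final String tag;               //e.g. "NN" (Penn), "NE" (STTS)
>>>         final LexicalCategory category; //null for unmapped tags
>>>         SimplePosTag(String tag, LexicalCategory category) {
>>>             this.tag = tag; this.category = category;
>>>         }
>>>     }
>>>
>>>     public class CategoryDemo {
>>>         public static void main(String[] args) {
>>>             SimplePosTag[] tags = {
>>>                 new SimplePosTag("NN", LexicalCategory.Noun), //English
>>>                 new SimplePosTag("NE", LexicalCategory.Noun)  //German
>>>             };
>>>             for (SimplePosTag pos : tags) {
>>>                 //language-independent check on the category only
>>>                 if (pos.category == LexicalCategory.Noun) {
>>>                     System.out.println(pos.tag + " -> Noun");
>>>                 }
>>>             }
>>>         }
>>>     }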
>>>
>>>
>>> In the second part I will try to lay out future plans and TODOs
>>>
>>> 1. Next Steps:
>>>
>>>     * The most important thing was already started by this mail thread
>>> - discussing this within the Stanbol Community. I am on vacation for
>>> the next two weeks, but I will have time to participate in such a
>>> discussion.
>>>
>>>     * Migrate the sentiment engine to the recent API changes of the
>>> AnalysedText ContentPart? Does anyone know of a Sentiment Ontology?
>>>
>>>     * AnalyzedText and Annotations currently do not keep
>>> creator/contributor and creation/modification date information. This
>>> might be needed to convert them to fise:Enhancements - are there any
>>> use cases for which one would want to add this memory-consuming
>>> information?
>>>
>>> 2. near-term TODOs: things I would like to start in August
>>>
>>>     * contribute this work to Apache Stanbol: Based on the
>>> feedback/discussion we plan to do this as one of the first things
>>> after vacation. Having this feature within Stanbol is important as it
>>> opens up a lot of opportunities for existing components (see 3.).
>>>
>>>     * adapt the KeywordLinkingEngine to use the AnalyzedText: This
>>> would allow any NLP framework to be used for preprocessing the text
>>> before linking its Tokens with a vocabulary. It would also solve the
>>> issue that the text needs to be processed n times for n configured
>>> KeywordLinkingEngines. In addition, this would allow lemma
>>> information (if available) to be used for linking.
>>>
>>> 3. mid-term improvements and opportunities:
>>>
>>>     * nlp2rdf (NIF): I am confident that one could implement an
>>> EnhancementEngine that converts the data of the AnalyzedText to RDF
>>> data compatible with NIF, as suggested by Sebastian Hellmann here on
>>> the list (see [1]). While converting all NLP-related information to
>>> RDF is not something one would want to do in a typical text
>>> enhancement chain, this is an important feature for some use cases
>>> AND it might also help during development/configuration and debugging
>>> (a tiny illustration follows at the end of this list).
>>>
>>>     * CELI lemmatizer: Currently this engine can provide POS tags and
>>> lemmas as RDF in the metadata. Migrating this engine to the
>>> AnalyzedText would, for example, allow its results to be used by the
>>> KeywordLinkingEngine. In addition, the AnalysedText ContentPart would
>>> also make it much simpler to add the discussed CELI sentiment engine
>>> [2].
>>>
>>>     * Addition of new kinds of EnhancementEngines (as mentioned in
>>> Sebastian's mail)
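>>>
>>> As a tiny illustration of the NIF idea (the URI scheme and property
>>> names below are assumptions for this sketch only; see [1] for the
>>> actual proposal): each span gets a fragment URI derived from its
>>> character offsets, and its annotations become triples about that URI:
>>>
>>>     public class NifSketch {
>>>         public static void main(String[] args) {
>>>             String doc = "http://example.org/doc";
>>>             int begin = 0, end = 7;   //character offsets of a Token
>>>             String tag = "NNP";       //its POS tag
>>>             //one URI per span, derived from the offsets
>>>             String span = "<" + doc + "#char=" + begin + "," + end + ">";
>>>             System.out.println(span + " nif:beginIndex " + begin + " .");
>>>             System.out.println(span + " nif:endIndex " + end + " .");
>>>             System.out.println(span + " nif:posTag \"" + tag + "\" .");
>>>         }
>>>     }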
>>>
>>> best
>>> Rupert
>>>
>>> [1] http://markmail.org/message/oq3y4ae2rhtbmpri
>>> [2] http://markmail.org/message/m3m6vox46vewgomi
>>>
>>> On Mon, Aug 6, 2012 at 11:10 AM, Sebastian Schaffert
>>> <se...@salzburgresearch.at> wrote:
>>> > Dear all,
>>> >
>>> > Rupert and I have been working on porting some of our OpenNLP based
>>> natural language processing to Apache Stanbol. While not yet completely
>>> finished, we decided it might be worthwhile for you all to have a look at
>>> it and maybe even contribute. I will try to briefly summarise the goals
>>> and current state of implementation:
>>> >
>>> > Goals
>>> > =====
>>> >
>>> > 1. provide a modular infrastructure for NLP-related things
>>> >
>>> > Many tasks in NLP can be computationally intensive, and there is no
>>> "one size fits all" NLP approach when analysing text. Therefore, we wanted
>>> to have an NLP infrastructure that can be configured and wired together as
>>> needed for the specific use case, with several specialised modules that
>>> can build upon each other but many of which are optional.
>>> >
>>> > 2. provide a unified data model for representing NLP text annotations
>>> >
>>> > In many scenarios, it will be necessary to implement custom engines
>>> building on the results of a previous "generic" analysis of the text (e.g.
>>> POS tagging and chunking). For example, in one project we identify
>>> so-called "noun phrases", use a lemmatizer to build the base form, then
>>> convert this to the nominative singular form to get a grammatically
>>> correct label to use in a tag cloud. Most of this builds on generic NLP
>>> functionality, but the last step is very specific to the use case.
>>> >
>>> > Therefore, we also wanted to implement a generic NLP data model that
>>> allows representing text annotations attached to individual words or to
>>> spans of words.
>>> >
>>> >
>>> > Current State
>>> > =============
>>> >
>>> > Currently, a first version of the unified data model has been
>>> implemented by Rupert. He has tested it thoroughly and it is reliable and
>>> useful for the scenarios we had in mind. The current enhancement engines
>>> use OpenNLP for analysis, but the model can in general be used by any NLP
>>> engine that associates tags with tokens or spans of tokens.
>>> >
>>> > I have in the meantime concentrated on implementing modules for
>>> different NLP tasks. The following modules are already finished:
>>> >
>>> > - POS Tagger: takes text/plain from a content item and stores an
>>> AnalyzedText content part in the content item where each token is assigned
>>> its grammatical POS tag (see the standalone OpenNLP sketch after this list)
>>> > - Chunker (Noun Phrase Detector): takes a content item with AnalyzedText
>>> content part (from POS tagger) and applies noun phrase chunking on the
>>> token stream; results are annotated token spans that are stored in the
>>> AnalyzedText
>>> > - Sentiment Analyzer (English/German): takes a content item with
>>> AnalyzedText content part (from POS tagger) and assigns sentiment values to
>>> each token in the stream; results are annotated tokens that are stored in
>>> the AnalyzedText
>>> >
>>> > In progress:
>>> > - Lemmatizer (English/German): takes a token stream (POS tagged
>>> AnalyzedText) and adds the lemma for each token to the AnalyzedText content
>>> part
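>>> >
>>> > For reference, this is roughly how the underlying OpenNLP tagger is
>>> > invoked (a minimal standalone sketch: the actual engine reads the
>>> > text/plain content part and writes the tags into the AnalyzedText
>>> > instead of printing them, and the model path is just an example):
>>> >
>>> >     import java.io.FileInputStream;
>>> >     import java.io.InputStream;
>>> >     import opennlp.tools.postag.POSModel;
>>> >     import opennlp.tools.postag.POSTaggerME;
>>> >
>>> >     public class PosTagDemo {
>>> >         public static void main(String[] args) throws Exception {
>>> >             //load a pre-trained OpenNLP POS model (example path)
>>> >             InputStream in = new FileInputStream("en-pos-maxent.bin");
>>> >             try {
>>> >                 POSTaggerME tagger = new POSTaggerME(new POSModel(in));
>>> >                 String[] tokens = {"Stanbol", "enhances", "content", "."};
>>> >                 String[] tags = tagger.tag(tokens); //one tag per token
>>> >                 for (int i = 0; i < tokens.length; i++) {
>>> >                     System.out.println(tokens[i] + "/" + tags[i]);
>>> >                 }
>>> >             } finally {
>>> >                 in.close();
>>> >             }
>>> >         }
>>> >     }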
>>> >
>>> >
>>> > Future work
>>> > ===========
>>> >
>>> > Based on these generic modules, we intend to implement a number of
>>> "NLP result summarizers" that take the results in an AnalyzedText and
>>> perform some post-processing on them, storing them as RDF in the metadata
>>> associated with the content item. Some ideas:
>>> > - Average Sentiment: compute the average sentiment value for the text
>>> by summing all sentiment values and dividing the sum by the number of
>>> annotated tokens (a small sketch follows after this list)
>>> > - Improved Sentiment: take into account negations in a sentence before a
>>> sentiment value and invert the values in this case; otherwise like average
>>> sentiment.
>>> > - Per-Noun Sentiment: associate sentiment values with each noun
>>> occurring in the text by taking into account the sentiment values of
>>> adjectives associated with the noun in a noun phrase and negations before
>>> them; the results are text annotations where each noun is associated with
>>> a sentiment value, so you could say "Product XYZ is typically mentioned
>>> with an average sentiment of 0.N"
>>> > - Noun Adjectives: collect the adjectives that are commonly used in
>>> association with a noun by looking at the noun phrases and extracting the
>>> adjectives
>>> > - Simple Tag Cloud: take nouns, build lemmatized form, generate a tag
>>> cloud in the metadata
>>> > - Noun Phrase Cloud: take noun phrases, build lemmatized form, build
>>> nominative singular form, generate tag cloud; this is useful when you want
>>> to provide more context for the tags, e.g. in faceted search ("red car",
>>> "blue car").
>>> >
>>> > The possibilities are literally endless… feel free to think about other
>>> options :)
>>> >
>>> >
>>> > Availability
>>> > ============
>>> >
>>> > Since this is still experimental code, we have for the time being set up
>>> a separate (public) repository:
>>> >
>>> > https://bitbucket.org/srfgkmt/stanbol-nlp
>>> >
>>> > When it is more-or-less finished, we would however like to include it
>>> in the main Stanbol code base so others can more easily benefit from it.
>>> Feel free to look at what we have implemented there!
>>> >
>>> > ;-)
>>> >
>>> > Sebastian
>>> > --
>>> > | Dr. Sebastian Schaffert
>>> sebastian.schaffert@salzburgresearch.at
>>> > | Salzburg Research Forschungsgesellschaft
>>> http://www.salzburgresearch.at
>>> > | Head of Knowledge and Media Technologies Group          +43 662 2288
>>> 423
>>> > | Jakob-Haringer Strasse 5/II
>>> > | A-5020 Salzburg
>>> >
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>
>
>
> --
> Fabian
> http://twitter.com/fctwitt



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen