Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/09/21 09:16:45 UTC

Update CELI engines to use Stanbol NLP processing

Hi Alessio, all

I have started migrating the CELI lemmatizer engine
to the new Stanbol NLP processing module (STANBOL-733, STANBOL-738).
The idea is to adapt the Lemmatizer Engine to use the
AnalysedText ContentPart (STANBOL-734) to store its results. The goal
is to make the word-level NLP analysis results of
CELI usable in Apache Stanbol (e.g. CELI POS tags and lemma information for
looking up terms with the KeywordLinkingEngine). This would
open up many additional possibilities for Stanbol users who want
to use the CELI services.

While working on this I came across the following things:

(1) I noticed that the Lemmatizer Service does not provide
information for all words (LexicalEntry). For example, in the
sentence

    Lo scandalo dei fondi pubblici sperperati in allegria dalla Regione
    Lazio ha dato i primi frutti: ieri il capogruppo Pdl Francesco Battistoni
    si è dimesso e la sede del Consiglio è stata invasa dalla Guardia
    di Finanza.

the LexicalEntries for "Pdl Francesco Battistoni si" do not have any
metadata (no <Reading>). Do you know why this is the case? Is there a
way to obtain LexicalFeatures for all words?

(2) The Stanbol NLP processing module maps the POS tag sets used by NLP
processing frameworks to morphosyntactic categories defined by the
OLIA ontology [1]. The categories in use are defined by the LexicalCategory
enumeration [2]. Actual POS tags are represented by the PosTag class
[3], which provides (a) the tag as a string and, optionally, (b) the
LexicalCategory. While LexicalCategories are optional, they are
important because they allow other components to determine the type of a
word in a language-independent way. For that reason it would be
important to map the POS tag sets used by CELI to the
LexicalCategories used by the Stanbol NLP processing module. Can you
point me to documentation of the POS tag sets used by CELI for the
different languages?

The following code snippet shows what such a mapping could look like for Italian:

    public static final TagSet<PosTag> ITALIAN =
        new TagSet<PosTag>("CELI Italian", "it");

    static {
        // Map each CELI POS tag to a LexicalCategory where one is known
        ITALIAN.addTag(new PosTag("ADJ", LexicalCategory.Adjective));
        ITALIAN.addTag(new PosTag("ADV", LexicalCategory.Adverb));
        ITALIAN.addTag(new PosTag("ART", LexicalCategory.PronounOrDeterminer));
        ITALIAN.addTag(new PosTag("CLI")); // mapping ??
        ITALIAN.addTag(new PosTag("CONJ", LexicalCategory.Conjuction));
        ITALIAN.addTag(new PosTag("PREP", LexicalCategory.Adposition));
        ITALIAN.addTag(new PosTag("NF", LexicalCategory.Noun));
        ITALIAN.addTag(new PosTag("NM", LexicalCategory.Noun));
        ITALIAN.addTag(new PosTag("V", LexicalCategory.Verb));
        // Register the tag set so it can be looked up by language
        getInstance().add(ITALIAN);
    }

BTW, I would also be interested in mappings of the other
LexicalFeatures extracted by CELI to the OLIA ontology (e.g. GENDER ->
olia:GenderFeature, NUMBER -> olia:NumberFeature, VERB_TENSE ->
olia:TenseFeature, ...).
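
As a rough sketch, the three feature mappings named above could be kept in a
simple lookup table. The class name CeliOliaFeatures and the method
oliaClassFor are made up for illustration, and the namespace is assumed to be
the ontology URL [1] followed by '#'. None of this is existing Stanbol or CELI
API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup from CELI feature names to OLIA feature class URIs. */
final class CeliOliaFeatures {
    // Assumption: the OLIA namespace is the ontology URL [1] plus '#'.
    static final String OLIA_NS = "http://purl.org/olia/olia.owl#";

    private static final Map<String, String> FEATURES = new LinkedHashMap<>();
    static {
        // Only the three mappings mentioned in the mail are listed here.
        FEATURES.put("GENDER", OLIA_NS + "GenderFeature");
        FEATURES.put("NUMBER", OLIA_NS + "NumberFeature");
        FEATURES.put("VERB_TENSE", OLIA_NS + "TenseFeature");
    }

    private CeliOliaFeatures() {}

    /** Returns the OLIA class URI for a CELI feature, or empty if unmapped. */
    static Optional<String> oliaClassFor(String celiFeature) {
        return Optional.ofNullable(FEATURES.get(celiFeature));
    }

    public static void main(String[] args) {
        System.out.println(oliaClassFor("GENDER").orElse("unmapped"));
        // A placeholder name that is not in the table
        System.out.println(oliaClassFor("SOME_OTHER_FEATURE").orElse("unmapped"));
    }
}
```

Returning an Optional keeps unmapped features explicit, in the same spirit as
the optional LexicalCategory on a PosTag.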

(3) The Lemmatizer Engine does not provide confidence values (probabilities)
for the extracted features. If this information is available, it
would be great to expose it. Otherwise, can I assume that the
entries mentioned first in the XML have a higher probability than the
additional options (e.g. a <LexicalEntry> with multiple <Reading> elements)?

The code related to STANBOL-733 is developed in the
"stanbol-nlp-processing" branch

    svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/

best
Rupert Westenthaler



[1] http://purl.org/olia/olia.owl
[2] http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/LexicalCategory.java
[3] http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/PosTag.java
-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Update CELI engines to use Stanbol NLP processing

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi,

I forgot to include the dev list in my last response to Alessio, hence this forward.

On Fri, Sep 21, 2012 at 3:22 PM, Rupert Westenthaler
<ru...@gmail.com> wrote:
> Hi Alessio,
>
> On Fri, Sep 21, 2012 at 12:51 PM, Alessio Bosca <al...@celi.it> wrote:
>> We are certainly willing to contribute to the development of the engines, and I
>> will work on the requested modifications to support the AnalysedText
>> content part.
>
> That's cool to hear. I have already started on some things. I will commit those
> later today so that you can continue from there.
>
>> We will also provide you a mapping for the POS tagset and the other lexical
>> features.
>
> If documentation of the POS tag sets is available, it would
> be cool if you could link to it. When I commit my local changes there
> will be a "PosTagSetRegistry" in
> "org.apache.stanbol.enhancer.engines.celi" where you can add the
> mappings.
>
>> I will check with the team responsible for the morphological
>> analyzer about the confidence level or the ranking of multiple readings, as
>> I'm not sure about that.
>>
>> Concerning the missing readings for some lexical entries: the
>> unrecognized terms are not present in the lexicon of the morphological
>> analyzer; they are "unknown" words, so to speak.
>> This happens with misspelled words or unknown named entities. It is possible to
>> explicitly set a POS "Unknown" lexical feature for them, if you wish, but
>> no lexical features are retrieved by the morphological analyzer itself.
>> Let me know if you want this update as well.
>> Calling the named entity engine for Italian may be an alternative way of
>> getting more information on those text fragments.
>>
>
> OK, that explains a lot. I had the impression that a POS tagger runs
> first and that a morphological analyzer then uses its results to provide
> the lemmas and other information. If the morphological analyzer adds
> possible lemmas based on the words alone, I would expect that there are no
> results for some words and multiple readings for
> others.
>
> Does linguagrid also have a POS tagging service?
>
>> I will send you an update next week, as soon as I have finished integrating
>> the updates.
>>
>
> I am in Leipzig next week, so I might not be as responsive as usual.
>
> best
> Rupert
>
>>
>> Bests
>>     Alessio
>>
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Update CELI engines to use Stanbol NLP processing

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi,

Yesterday I committed some changes/additions to the
stanbol.enhancer.nlp module. Please make sure you are on the most
current version.

On Fri, Sep 28, 2012 at 10:44 AM, Alessio Bosca <al...@celi.it> wrote:
> Hi Rupert,
>
> I completed the POS mappings in the PosTagSetRegistry class and I'm starting
> to add mappings for other morphological features (like gender, number, case)
> using the same approach (i.e. creating a GenderTagsetRegistry).
> I need to create a few classes for the mappings (GenderTag,
> GenderValuesEnum, etc.). Should I create them in the CELI engine project, or should
> I create a proper subpackage (like morphology) on the same level as nlp.pos?
> I'll send you a patch as soon as I finish.

Regarding "morphology":

Please have a look at the o.a.s.enhancer.nlp.morpho package. Yesterday I
defined enumerations for tenses and cases (based on the OLIA
ontology). There is also a MorphoAnnotation class. If you need to
change/extend those, feel free to do so. The current state is only a
first proposal (by myself) and clearly needs to be improved/changed.

Regarding "GenderTagsetRegistry":

I would rather opt for a single CeliTagsetRegistry class that can be
used for everything (e.g. getPosTagSet(), getGenderTagSet(), ...) but
you can also create multiple specific registries if you like.
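
A single registry along those lines might look roughly like the following.
This is only a sketch: CeliTagsetRegistrySketch, FeatureType and the plain
Map-based tag sets are placeholders invented here, standing in for the real
TagSet classes. Only the accessor names getPosTagSet() and getGenderTagSet()
come from the suggestion above.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single registry for all CELI tag sets, keyed by feature type. */
final class CeliTagsetRegistrySketch {
    /** Illustrative feature types; more could be added (tense, case, ...). */
    enum FeatureType { POS, GENDER }

    private static final CeliTagsetRegistrySketch INSTANCE = new CeliTagsetRegistrySketch();

    // feature type -> language code -> (tag string -> category name)
    private final Map<FeatureType, Map<String, Map<String, String>>> tagSets =
            new EnumMap<>(FeatureType.class);

    private CeliTagsetRegistrySketch() {}

    static CeliTagsetRegistrySketch getInstance() {
        return INSTANCE;
    }

    /** Registers a tag set for the given feature type and language. */
    void add(FeatureType type, String language, Map<String, String> tagSet) {
        tagSets.computeIfAbsent(type, t -> new HashMap<>()).put(language, tagSet);
    }

    Map<String, String> getPosTagSet(String language) {
        return get(FeatureType.POS, language);
    }

    Map<String, String> getGenderTagSet(String language) {
        return get(FeatureType.GENDER, language);
    }

    // Returns an empty map when nothing is registered for type and language.
    private Map<String, String> get(FeatureType type, String language) {
        Map<String, Map<String, String>> byLanguage = tagSets.get(type);
        if (byLanguage == null) {
            return Collections.<String, String>emptyMap();
        }
        Map<String, String> tagSet = byLanguage.get(language);
        return tagSet != null ? tagSet : Collections.<String, String>emptyMap();
    }
}
```

One class with typed accessors keeps all CELI mappings in one place, while
still allowing per-feature registries later if that turns out to be cleaner.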

Regarding "GenderTag":

Would you like to introduce "{type}Tag" classes (similar to PosTag)
that hold a String tag and a category that is a member of the
corresponding enumeration?
Examples would be GenderTag, TenseTag, CaseTag, ...
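
To make the "{type}Tag" idea concrete, here is a minimal sketch of a generic
tag that pairs the raw tag string with an optional enum category. The names
FeatureTag and Gender, the enum members, and the tag strings used in main are
all invented for illustration. They are not the classes in
o.a.s.enhancer.nlp.morpho and not actual CELI tags.

```java
/** Hypothetical generic tag: a raw tag string plus an optional category. */
class FeatureTag<C extends Enum<C>> {
    private final String tag;
    private final C category; // null when no mapping is known

    FeatureTag(String tag) {
        this(tag, null);
    }

    FeatureTag(String tag, C category) {
        if (tag == null || tag.isEmpty()) {
            throw new IllegalArgumentException("The tag must not be null or empty");
        }
        this.tag = tag;
        this.category = category;
    }

    String getTag() {
        return tag;
    }

    /** Returns the mapped category, or null if the tag is unmapped. */
    C getCategory() {
        return category;
    }
}

/** Illustrative gender values; the real enumeration may differ. */
enum Gender { Masculine, Feminine, Neuter }

/** A concrete tag type, analogous in spirit to PosTag. */
final class GenderTag extends FeatureTag<Gender> {
    GenderTag(String tag) {
        super(tag);
    }

    GenderTag(String tag, Gender category) {
        super(tag, category);
    }
}

final class FeatureTagDemo {
    public static void main(String[] args) {
        // "F" is a made-up tag string, not a documented CELI value
        GenderTag feminine = new GenderTag("F", Gender.Feminine);
        System.out.println(feminine.getTag() + " -> " + feminine.getCategory());
    }
}
```

TenseTag and CaseTag would follow the same pattern with their own
enumerations.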

best
Rupert

>
> Bests
>     Alessio
>
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Update CELI engines to use Stanbol NLP processing

Posted by Alessio Bosca <al...@celi.it>.
Hi Rupert,

sorry for the late reply; I was out of the
office for a project meeting over the past few days.
We are certainly willing to contribute to the development of the engines,
and I will work on the requested modifications to support the
AnalysedText content part.
We will also provide you with a mapping for the POS tag set and the other
lexical features. I will check with the team responsible for the
morphological analyzer about the confidence level or the ranking of
multiple readings, as I'm not sure about that.

Concerning the missing readings for some lexical entries: the
unrecognized terms are not present in the lexicon of the
morphological analyzer; they are "unknown" words, so to speak.
This happens with misspelled words or unknown named entities. It is
possible to explicitly set a POS "Unknown" lexical feature for them, if
you wish, but no lexical features are retrieved by the
morphological analyzer itself. Let me know if you want this update as well.
Calling the named entity engine for Italian may be an alternative way
of getting more information on those text fragments.
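
On the consumer side, the explicit "Unknown" option described above could be
approximated as follows. This is a minimal sketch: the class
UnknownWordFallback, the method posTagsOrUnknown, and the representation of
readings as a list of POS tag strings are assumptions made for illustration,
not the actual service response format or engine code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper that falls back to an explicit "Unknown" POS tag. */
final class UnknownWordFallback {
    static final String UNKNOWN_POS = "Unknown";

    private UnknownWordFallback() {}

    /**
     * Returns the POS tags of a word's readings, or a single "Unknown" tag
     * when the analyzer returned no reading for the word.
     */
    static List<String> posTagsOrUnknown(List<String> readingPosTags) {
        if (readingPosTags == null || readingPosTags.isEmpty()) {
            return Collections.singletonList(UNKNOWN_POS);
        }
        return readingPosTags;
    }

    public static void main(String[] args) {
        // A word without readings, e.g. an unknown named entity
        System.out.println(posTagsOrUnknown(Collections.<String>emptyList()));
        // A word with one reading
        System.out.println(posTagsOrUnknown(Arrays.asList("NM")));
    }
}
```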

I will send you an update next week, as soon as I have finished
integrating the updates.


Bests
     Alessio



-- 
*************************************
Alessio Bosca, Ph.D.
CELI s.r.l.
Via San Quintino 31
10121 Torino
Tel. +39 011.562.71.15
Fax +39 011.506.40.86
http://www.celi.it
*************************************