
Posted to dev@stanbol.apache.org by Grzegorz Trzeciak <gt...@gmail.com> on 2019/04/14 18:52:35 UTC

Steps required for adding support for another language

I need to provide a proof of concept for a customer using the Stanbol enhancer,
but the POC needs to be in Polish. Only now have I realised that there is no
support for Polish in Stanbol (other than language recognition). At the moment,
running the enhancer on a text only returns the recognized language, so my
question is twofold:

1. Is there a quick and dirty way of making Stanbol work with the Polish
language (for the POC only)?
2. What steps are necessary to implement a proper solution for
supporting another language?

Thanks

Grzegorz Trzeciak

Re: Steps required for adding support for another language

Posted by Grzegorz Trzeciak <gt...@gmail.com>.

That would explain why dbpedia-fst-linking worked. Here is the list of
engines in that chain (it uses opennlp-chunker where the default chain uses
opennlp-ner):

   - *tika* (optional, TikaEngine)
   - *langdetect* (required, LanguageDetectionEnhancementEngine)
   - *opennlp-sentence* (required, OpenNlpSentenceDetectionEngine)
   - *opennlp-token* (required, OpenNlpTokenizerEngine)
   - *opennlp-pos* (required, OpenNlpPosTaggingEngine)
   - *opennlp-chunker* (required, OpenNlpChunkingEngine)
   - *dbpedia-fst* (required, FstLinkingEngine)
   - *dbpedia-dereference* (required, EntityDereferenceEngine)
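
For anyone wanting to try a chain like this from a script, here is a minimal
sketch of posting text to a named enhancement chain over Stanbol's REST
interface. It assumes a Stanbol launcher running at http://localhost:8080 and
the /enhancer/chain/<name> endpoint; the helper names build_enhance_request and
enhance are made up for this illustration and are not part of Stanbol.

```python
import urllib.parse
import urllib.request

# Assumption: a local Stanbol launcher on its usual port. Adjust as needed.
STANBOL_BASE = "http://localhost:8080"


def build_enhance_request(text, chain="dbpedia-fst-linking", base=STANBOL_BASE):
    """Build (but do not send) a POST request for one enhancement chain."""
    url = f"{base}/enhancer/chain/{urllib.parse.quote(chain, safe='')}"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            # The content is sent as plain UTF-8 text.
            "Content-Type": "text/plain; charset=UTF-8",
            # Ask for the enhancement results as Turtle.
            "Accept": "text/turtle",
        },
        method="POST",
    )


def enhance(text, chain="dbpedia-fst-linking", base=STANBOL_BASE):
    """Send the text to the chain and return the raw response body."""
    with urllib.request.urlopen(build_enhance_request(text, chain, base)) as resp:
        return resp.read().decode("utf-8")
```

Calling enhance("Warszawa jest stolicą Polski.") against a running instance
should return the enhancements as Turtle. Passing chain="default" would send
the same text through the default chain for comparison.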



Re: Steps required for adding support for another language

Posted by Rafa Haro <rh...@apache.org>.

By the way, in your case you shouldn't be using the OpenNLP NER engine. You
should use OpenNLP chunking directly, together with the EntityLinking engine
(not Named Entity Linking).


Re: Steps required for adding support for another language

Posted by Rafa Haro <rh...@apache.org>.

Yeah, ideally you will have to train OpenNLP models for Polish. But for
testing, you can force the OpenNLP engines to use the models for a specific
language (normally English). I would swear you can do that directly in the
engines' configuration through the Felix console. The content will be processed
as English and OpenNLP will do its best, but for languages with a
similar syntax that is sometimes enough for, at least, getting chunks with
candidate tokens.

Hope that helps

PS: just out of curiosity, because I don't remember it right now and I won't
have a laptop at hand for a few days... which engines are involved in the
fst-linking chain?
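
To make the "use the English models for Polish" idea above concrete, here is a
small sketch that prints the kind of per-language entries one might enter in
each OpenNLP engine's language configuration in the Felix console. It rests on
assumptions that should be checked against the Stanbol engine documentation
before use: that the engines accept entries of the form language;model=file,
and that the stock English OpenNLP model files carry the names used below. The
function names are made up for this illustration.

```python
# Assumption: file names of the stock English OpenNLP models, keyed by the
# engine names that appear in the chain listings in this thread.
ENGLISH_MODELS = {
    "opennlp-sentence": "en-sent.bin",
    "opennlp-token": "en-token.bin",
    "opennlp-pos": "en-pos-maxent.bin",
    "opennlp-chunker": "en-chunker.bin",
}


def language_model_entry(language, model_file):
    """Format one entry in the assumed 'language;model=file' syntax."""
    if not language or ";" in language or "=" in language:
        raise ValueError(f"not a plain language code: {language!r}")
    return f"{language};model={model_file}"


def fallback_entries(language="pl", models=ENGLISH_MODELS):
    """Map each engine name to the entry that reuses an English model."""
    return {
        engine: language_model_entry(language, model_file)
        for engine, model_file in models.items()
    }


if __name__ == "__main__":
    for engine, entry in fallback_entries().items():
        print(f"{engine}: {entry}")
```

This only produces strings to paste into the configuration; it does not talk to
Stanbol. As noted above, results from English models on Polish text are for
testing only.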


Re: Steps required for adding support for another language

Posted by Grzegorz Trzeciak <gt...@gmail.com>.

OK, I've found a chain that at least captures some DBpedia entities:
dbpedia-fst-linking.
I will be playing with various engine combinations to see what gets me
through the POC best, which leaves me with the question about the more
permanent solution.

My understanding is that this would require building a language model for
OpenNLP, is that correct? Are there other requirements for adding language
support? I am trying to estimate the work effort required for such a task, so
any advice will be helpful.

Also, if you are aware of any resources that could be helpful, that would be
great.

Thank you

G.


Re: Steps required for adding support for another language

Posted by Grzegorz Trzeciak <gt...@gmail.com>.

I am using the default chain:

   - *tika* (optional, TikaEngine)
   - *langdetect* (required, LanguageDetectionEnhancementEngine)
   - *opennlp-sentence* (required, OpenNlpSentenceDetectionEngine)
   - *opennlp-token* (required, OpenNlpTokenizerEngine)
   - *opennlp-pos* (required, OpenNlpPosTaggingEngine)
   - *opennlp-ner* (required, NamedEntityExtractionEnhancementEngine)
   - *dbpediaLinking* (required, NamedEntityTaggingEngine)
   - *entityhubExtraction* (required, EntityLinkingEngine)
   - *dbpedia-dereference* (required, EntityDereferenceEngine)


I will try disabling langdetect then.


Re: Steps required for adding support for another language

Posted by Rafa Haro <rh...@apache.org>.

Hi Grzegorz,

Can you provide details about your enhancement chain? You can probably try
disabling language detection and forcing English as the language for the
whole chain.
