Posted to dev@opennlp.apache.org by "Soubhik (সৌভিক)" <so...@gmail.com> on 2012/05/30 19:52:34 UTC

Unicode danda in sentence detector

Hi,

I'm trying to use OpenNLP to train a sentence detector for the Bengali language
("bn"). I would like to add support for the Unicode danda character (U+0964, "।")
to the opennlp.tools.sentdetect.lang.Factory class. This character is a sentence
break in Bengali, Hindi and several other Indian languages. The code change
should be small (< 10 lines).

Is it correct to think that a change of this size will not require a CLA?

Ref: en.wikipedia.org/wiki/Danda

Regards,
Soubhik.
--

Re: Unicode danda in sentence detector

Posted by "Soubhik (সৌভিক)" <so...@gmail.com>.

Thanks!

I didn't see this support in 1.5.2-incubating. I'll build from trunk and try it.

On Thu, May 31, 2012 at 7:05 AM, William Colen <wi...@gmail.com> wrote:

> As far as I know, you don't need a CLA for a patch. Simply open a Jira issue
> and attach your patch to it.
>
> Besides what James pointed out, you may also want to change the EOS
> characters. There are two related new features that are already implemented
> in trunk:
>
> https://issues.apache.org/jira/browse/OPENNLP-428
> This one added an optional command line argument where you set the
> end-of-sentence characters. This setting will be persisted to the model. If
> you are using the API, you can create a SentenceDetectorFactory and use it
> to set the EOS chars.
>
> https://issues.apache.org/jira/browse/OPENNLP-434
> This is a new feature that allows customizing the SentenceDetector. You can
> extend the SentenceDetectorFactory and override methods as needed. You can
> pass in the customized factory using either the command line or the API.

-- 
Soubhik Bhattacharya

Re: Unicode danda in sentence detector

Posted by "Soubhik (সৌভিক)" <so...@gmail.com>.

On Thu, May 31, 2012 at 1:36 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>
> The Wikipedia reference says it's commonly used for
> Indian languages, so maybe we should just include them,
> e.g. like we did for Portuguese.
>
> On the other hand, we might also need custom feature
> generation to get good results.
> How are words delimited in Indian languages? With spaces?

Words are delimited by spaces in Bengali, Hindi and most other Indian
languages.

>
> I suggest first testing with the danda char passed in,
> measuring how it performs, and then deciding whether we also
> need to adapt the feature generation for Indian languages.

I started with a very small data set (about 1500 sentences from
news/blogs downloaded from the internet) and no abbreviations, no
custom features. I used -eosChars '।?!' and got the following
result:

Precision: 0.8967468175388967
Recall: 0.8386243386243386
F-Measure: 0.8667122351332877
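
For reference, the F-measure above is the harmonic mean of precision and
recall. A minimal Java sketch that recomputes it from the two figures (the
class and method names are illustrative only and are not part of OpenNLP):

```java
// Illustrative helper, not part of OpenNLP: recomputes the F-measure
// (harmonic mean of precision and recall) from the figures quoted above.
final class FMeasureCheck {

    /** Returns 2PR / (P + R), or 0 when both inputs are 0. */
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        double precision = 0.8967468175388967;
        double recall = 0.8386243386243386;
        // Print the recomputed value next to the reported one for comparison.
        System.out.println("recomputed: " + fMeasure(precision, recall));
        System.out.println("reported:   0.8667122351332877");
    }
}
```

The recomputed value should agree with the reported one up to floating-point
rounding.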

As you've mentioned, the danda is a sentence break in multiple Indian
languages. So does it make sense to add it to the Factory?

>
> Do you have training data you can train it on? If there is a publicly
> available data set, we would appreciate having format support for it
> directly in OpenNLP.
>

I'll refine the model using a larger data set and possibly an
abbreviations dictionary. I believe it should be possible to do this with
openly available data.

Cheers!
Soubhik.

> What do you think?
>
> Jörn



--
Soubhik Bhattacharya

Re: Unicode danda in sentence detector

Posted by Jörn Kottmann <ko...@gmail.com>.

The Wikipedia reference says it's commonly used for
Indian languages, so maybe we should just include them,
e.g. like we did for Portuguese.

On the other hand, we might also need custom feature
generation to get good results.
How are words delimited in Indian languages? With spaces?

I suggest first testing with the danda char passed in,
measuring how it performs, and then deciding whether we also
need to adapt the feature generation for Indian languages.

Do you have training data you can train it on? If there is a publicly
available data set, we would appreciate having format support for it
directly in OpenNLP.

What do you think?

Jörn

Re: Unicode danda in sentence detector

Posted by William Colen <wi...@gmail.com>.

As far as I know, you don't need a CLA for a patch. Simply open a Jira issue
and attach your patch to it.

Besides what James pointed out, you may also want to change the EOS
characters. There are two related new features that are already implemented
in trunk:

https://issues.apache.org/jira/browse/OPENNLP-428
This one added an optional command line argument where you set the
end-of-sentence characters. This setting will be persisted to the model. If
you are using the API, you can create a SentenceDetectorFactory and use it
to set the EOS chars.

https://issues.apache.org/jira/browse/OPENNLP-434
This is a new feature that allows customizing the SentenceDetector. You can
extend the SentenceDetectorFactory and override methods as needed. You can
pass in the customized factory using either the command line or the API.
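
To make the idea of configurable end-of-sentence characters concrete, here is
a minimal, stdlib-only Java sketch. It is not OpenNLP's implementation: the
class name, the method and the sample text are made up for illustration, and
it splits at every EOS character, whereas a trained detector decides for each
candidate whether it really ends a sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only, NOT OpenNLP code: splits text at every occurrence of a
// configurable set of end-of-sentence (EOS) characters.
final class NaiveEosSplitter {

    static List<String> split(String text, char[] eosChars) {
        String eos = new String(eosChars);
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            // Every EOS character is treated as a sentence end here.
            if (eos.indexOf(text.charAt(i)) >= 0) {
                String sentence = text.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no closing EOS character.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Assumed example set: danda (U+0964) plus '?' and '!'.
        char[] eosChars = {'\u0964', '?', '!'};
        // Placeholder words; only the punctuation matters for this sketch.
        String text = "first\u0964 second? third!";
        for (String sentence : split(text, eosChars)) {
            System.out.println(sentence);
        }
    }
}
```

With the default characters alone ('.', '?', '!' is a common choice), the danda
would never be considered as a candidate, which is why the EOS set has to
include it for Bengali or Hindi text.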

On Wed, May 30, 2012 at 7:19 PM, James Kosin <ja...@gmail.com> wrote:

> Hi Soubhik,
>
> It should already be supported.
> You have to pass -encoding utf8 to the command line interface.
>
> James

Re: Unicode danda in sentence detector

Posted by James Kosin <ja...@gmail.com>.

Hi Soubhik,

It should already be supported.
You have to pass -encoding utf8 to the command line interface.

James
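
A small stdlib-only Java sketch of why the encoding matters here (illustrative
names, not OpenNLP code): the danda takes three bytes in UTF-8, so text that
is decoded with a single-byte charset no longer contains the danda character
at all, and it could not be matched as a sentence break.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows what happens to the danda (U+0964) when UTF-8
// bytes are decoded with the wrong charset.
final class DandaEncodingDemo {

    static final char DANDA = '\u0964';

    /** Encodes the text as UTF-8, then decodes those bytes with the given charset. */
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        String text = "abc" + DANDA;
        String right = decodeUtf8BytesAs(text, StandardCharsets.UTF_8);
        String wrong = decodeUtf8BytesAs(text, StandardCharsets.ISO_8859_1);
        System.out.println(right.indexOf(DANDA) >= 0); // true
        System.out.println(wrong.indexOf(DANDA) >= 0); // false
        // The three UTF-8 bytes of the danda become three separate characters.
        System.out.println(wrong.length()); // 6
    }
}
```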
