Posted to users@opennlp.apache.org by Muhammad Dhito <mu...@gmail.com> on 2011/05/03 13:24:38 UTC

OpenNLP for Indonesian Language Processing

Hi,

I have been working with OpenNLP recently for my final project. I'm
trying to adapt OpenNLP for Indonesian language processing, but I'm
adapting only four components: sentence detector, tokenizer,
part-of-speech tagger, and chunker.

Is it enough to provide Indonesian models in order to use
OpenNLP to process Indonesian text? Or should I also change
OpenNLP's source code to reflect Indonesian grammar by adding
language-specific features?

Thanks for any help.

Sincerely,

M. Dhito
Informatics Engineering, Bandung Institute of Technology, Indonesia

Re: OpenNLP for Indonesian Language Processing

Posted by Muhammad Dhito <mu...@gmail.com>.
Okay. I get it. Thank you very much, Jörn.


Re: OpenNLP for Indonesian Language Processing

Posted by Jörn Kottmann <ko...@gmail.com>.
On 5/3/11 3:51 PM, Muhammad Dhito wrote:
> Thanks for the answers.
>
> Unfortunately, there is still no publicly available Indonesian
> corpus. My lecturer and I are trying to create our own.
>
> As for language-specific features, where can I implement them in
> OpenNLP? I mean, in which class exactly?
>
We do not really have language-dependent features at the moment, so the
best way forward is to open one Jira issue per component describing
your proposed feature generators. We can then see how to define
language-dependent default feature generation.

In the meantime, you can define your own feature generators and
pass them into the components. You need to pass them twice:
once during training and once when loading the model.

Jörn
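
A minimal sketch of the "define once, pass twice" idea above. It is not
OpenNLP's actual API: the TokenFeatureGenerator interface, the
IndonesianAffixFeatures class, and the short affix lists are hypothetical
names made up for illustration. It shows one generator object that emits
Indonesian-oriented features (common prefixes and suffixes, and hyphenated
reduplication such as "anak-anak") and is shared by the training pass and
the prediction pass, since a model trained with one feature set and loaded
with another would see features it never learned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical interface for this sketch only; it is NOT an OpenNLP type.
interface TokenFeatureGenerator {
    List<String> features(String[] tokens, int index);
}

// Illustrative Indonesian features: a few common affixes and reduplication.
// The affix lists are assumptions and far from complete.
final class IndonesianAffixFeatures implements TokenFeatureGenerator {
    // Longer variants of the "me-" prefix come first so they match first.
    private static final String[] PREFIXES =
        {"meng", "mem", "men", "me", "ber", "ter", "di", "pe", "ke", "se"};
    // "kan" is listed before "an" because every "-kan" word also ends in "an".
    private static final String[] SUFFIXES = {"kan", "nya", "an", "i"};

    @Override
    public List<String> features(String[] tokens, int index) {
        String w = tokens[index].toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (String p : PREFIXES) {
            // Require a stem of at least three characters after the prefix.
            if (w.length() > p.length() + 2 && w.startsWith(p)) {
                out.add("pre=" + p);
                break;
            }
        }
        for (String s : SUFFIXES) {
            // Require a stem of at least three characters before the suffix.
            if (w.length() > s.length() + 2 && w.endsWith(s)) {
                out.add("suf=" + s);
                break;
            }
        }
        // Full reduplication written with a hyphen, e.g. "anak-anak".
        int dash = w.indexOf('-');
        if (dash > 0 && w.substring(0, dash).equals(w.substring(dash + 1))) {
            out.add("redup");
        }
        return out;
    }
}

final class IndonesianFeatureSketch {
    // Stand-in for both the trainer and the loaded component: each one
    // receives the generator and applies it to every token.
    static List<List<String>> extract(TokenFeatureGenerator gen, String[] tokens) {
        List<List<String>> all = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            all.add(gen.features(tokens, i));
        }
        return all;
    }

    public static void main(String[] args) {
        TokenFeatureGenerator gen = new IndonesianAffixFeatures();
        String[] sentence = {"Anak-anak", "membaca", "bukunya"};
        // First pass: features seen while training.
        System.out.println("train:   " + extract(gen, sentence));
        // Second pass: the same generator is handed to the loaded component.
        System.out.println("predict: " + extract(gen, sentence));
    }
}
```

Using a single generator instance (or at least a single class with identical
configuration) for both passes is the design point; which affixes are actually
useful would have to be checked against a real Indonesian corpus.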

Re: OpenNLP for Indonesian Language Processing

Posted by Muhammad Dhito <mu...@gmail.com>.
Thanks for the answers.

Unfortunately, there is still no publicly available Indonesian
corpus. My lecturer and I are trying to create our own.

As for language-specific features, where can I implement them in
OpenNLP? I mean, in which class exactly?

Thanks,
Dhito



Re: OpenNLP for Indonesian Language Processing

Posted by Jörn Kottmann <ko...@gmail.com>.
On 5/3/11 1:24 PM, Muhammad Dhito wrote:
> Hi,
>
> I have been working with OpenNLP recently for my final project. I'm
> trying to adapt OpenNLP for Indonesian language processing, but I'm
> adapting only four components: sentence detector, tokenizer,
> part-of-speech tagger, and chunker.
>
> Is it enough to provide Indonesian models in order to use
> OpenNLP to process Indonesian text?

It would of course be nice if you made the models available to others.
We might not be able to redistribute them here, but maybe you can
host them somewhere.

Which corpus do you train on? If it is publicly available, it would be
nice to add support for parsing it directly in OpenNLP, as we have
already done for a couple of corpora. Your contribution here would be
very welcome.

> Should I make some changes in
> OpenNLP's source code according to Indonesian grammar by adding some
> language-specific features?
>

Maybe you will get better results with language-specific features. We
should support that, and we have already taken first steps to make it
easier, e.g. the language is stored inside our models.

Please feel free to propose new features that are specific to
Indonesian; we will see how they could be integrated.

Thanks,
Jörn