Posted to users@opennlp.apache.org by Toshiya TSURU <tu...@gmail.com> on 2011/04/05 08:25:53 UTC

Japanese Tokenizer Model

Hi.

I'm a software developer in Tokyo, Japan.
I found that RapidMiner uses OpenNLP for its tokenization process.

However, the tokens RapidMiner produces for Japanese are strange,
because there is no tokenizer model for Japanese.

I checked the page below, but no Japanese model is listed there:
http://opennlp.sourceforge.net/models-1.5/

How can I get a Japanese model? Or can I create one?

-- 
Toshiya TSURU <tu...@gmail.com>
http://twitter.com/turutosiya

Re: Japanese Tokenizer Model

Posted by Benson Margulies <bi...@gmail.com>.
Toshiya,

I can't answer your questions with certainty, but I'm very doubtful
that the parameters from the mecab training process (Sen is just a
Java port of mecab) will be usable with the opennlp tokenizer. The
mecab training data might be your best bet if you don't want to write
code but rather just train a model.
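
If you go the training-data route, the output of an existing segmenter such as Sen or mecab has to be converted into the format the OpenNLP tokenizer trainer expects: one sentence per line, tokens separated by whitespace, and a <SPLIT> tag at every token boundary that carries no whitespace in the raw text (as described in the OpenNLP 1.5 manual). A minimal conversion sketch, plain JDK only; the sentence and token lists below are made-up examples:

```java
import java.util.Arrays;
import java.util.List;

public class SplitFormat {
    // Build one line of OpenNLP tokenizer training data from a raw
    // sentence and its segmented token list. Adjacent tokens with no
    // whitespace between them in the raw text are joined with the
    // <SPLIT> tag; all others are joined with a single space.
    static String toTrainingLine(String raw, List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            int start = raw.indexOf(tok, pos);
            if (i > 0) {
                // Was there whitespace between the previous token and this one?
                boolean spaced = raw.substring(pos, start).chars()
                        .anyMatch(Character::isWhitespace);
                out.append(spaced ? " " : "<SPLIT>");
            }
            out.append(tok);
            pos = start + tok.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Japanese has no spaces, so every boundary becomes <SPLIT>.
        System.out.println(toTrainingLine("東京都に住む。",
                Arrays.asList("東京", "都", "に", "住む", "。")));
        // prints: 東京<SPLIT>都<SPLIT>に<SPLIT>住む<SPLIT>。

        // English keeps its spaces; only punctuation is split off.
        System.out.println(toTrainingLine("He said, hi.",
                Arrays.asList("He", "said", ",", "hi", ".")));
        // prints: He said<SPLIT>, hi<SPLIT>.
    }
}
```

A file of such lines could then be fed to the OpenNLP tokenizer trainer; whether the resulting model is any good for Japanese is exactly the open question discussed in this thread.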

--benson


On Tue, Apr 5, 2011 at 11:47 PM, Toshiya TSURU <tu...@gmail.com> wrote:
> Can mecab generate a model for OpenNLP?

Re: Japanese Tokenizer Model

Posted by Toshiya TSURU <tu...@gmail.com>.
Thanks Benson.

> I am presuming that by 'tokenization' for Japanese you are talking
> about segmentation into words. I appreciate that there has to be a
> tokenizer in the pipeline somewhere

Yes.

> However, it seems to me that it
> should be possible to write a bit of code and incorporate an existing
> segmentation component as an alternative to training a model for the
> opennlp tokenizer

Yes.
Because RapidMiner is written in Java, I've been looking for
alternatives that are also written in Java, and I found one: "Sen".

http://www.mlab.im.dendai.ac.jp/~yamada/ir/MorphologicalAnalyzer/Sen.html

But at this time I would rather not write code (it might introduce
unexpected bugs) if there is another way to do it. That is why I asked
whether there is a model for the Japanese language.


Can mecab generate a model for OpenNLP?




-- 
Toshiya TSURU <tu...@gmail.com>
http://twitter.com/turutosiya

Re: Japanese Tokenizer Model

Posted by Benson Margulies <bi...@gmail.com>.
Toshiya,

While I'm a mentor of opennlp, I'm not that deep in the code. I'm
mostly here to help with process. However, I have done a good deal of
work on statistical segmentation of Japanese text.

I am presuming that by 'tokenization' for Japanese you are talking
about segmentation into words. I appreciate that there has to be a
tokenizer in the pipeline somewhere. However, it seems to me that it
should be possible to write a bit of code and incorporate an existing
segmentation component as an alternative to training a model for the
opennlp tokenizer. I also have to wonder whether a component used for
languages with whitespace will do a very good job at tokenizing
Japanese or Chinese just by training a different model. Perhaps Jörn
can shed some light on that; maybe others have used their own data to
experiment with that.
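
The concern is easy to see concretely: Japanese text contains no whitespace at all, so whitespace-driven features have nothing to work with. Even a crude heuristic that splits wherever the Unicode block changes (kanji to hiragana, hiragana to Latin, and so on) finds more boundaries than whitespace splitting, while still being far from real word segmentation. A toy illustration, not a real segmenter:

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;

public class ScriptBoundary {
    // Toy heuristic: start a new "token" whenever the Unicode block
    // changes. This is NOT real word segmentation, just an illustration
    // of why Japanese needs more than whitespace splitting.
    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        UnicodeBlock prev = null;
        for (char c : text.toCharArray()) {
            UnicodeBlock b = UnicodeBlock.of(c);
            if (prev != null && b != prev) {
                out.add(cur.toString());
                cur.setLength(0);
            }
            cur.append(c);
            prev = b;
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        String ja = "私はTokyoに住む";
        // Whitespace splitting returns the whole sentence as one token.
        System.out.println(ja.split("\\s+").length);  // prints: 1
        // The script heuristic at least finds some boundaries...
        System.out.println(split(ja));  // prints: [私, は, Tokyo, に, 住, む]
        // ...but wrongly splits inside the word 住む (kanji + hiragana),
        // which is exactly where a statistical model is needed.
    }
}
```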

--benson



Re: Japanese Tokenizer Model

Posted by Toshiya TSURU <tu...@gmail.com>.
Thanks Benson.

The reason I'm looking for a Japanese model is to implement a
practical tokenizer in RapidMiner.

RapidMiner is data-mining software that bundles OpenNLP.

In RapidMiner, OpenNLP is used for tokenizing document data. It works
well for English content, but not for Japanese, because the only
models bundled with RapidMiner are English and German.

So I'm looking for a model for Japanese tokenization.



-- 
Toshiya TSURU <tu...@gmail.com>
http://twitter.com/turutosiya

Re: Japanese Tokenizer Model

Posted by Benson Margulies <bi...@gmail.com>.
First of all, do you really need to train your own tokenizer? You
could use http://www.chasen.org/~taku/software/TinySegmenter/, or
Chasen, or http://mecab.sourceforge.net/.

I believe that there are corpora available that were used to train
mecab, but I'm rusty on the subject.

The '1982' Mainichi might be available, but a model trained from it
will work well for newspapers and not well at all for hiragana-heavy
informal text.

If you have a special reason to want to train a model, you can create
training data by using one of the tokenizers above. Of course, your
accuracy will be somewhat less than what you start with. In our
experience, however, not so much less.
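
The accuracy gap described above can be quantified: segment some held-out text with both the source tokenizer and the trained model, then compare the word-boundary positions they produce. A small boundary-level F1 sketch; the two token lists below are made-up examples standing in for reference and predicted segmentations of the same string:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BoundaryF1 {
    // Character offsets where one token ends and the next begins,
    // for a segmentation of a single raw string.
    static Set<Integer> boundaries(List<String> tokens) {
        Set<Integer> b = new HashSet<>();
        int pos = 0;
        for (int i = 0; i < tokens.size() - 1; i++) {
            pos += tokens.get(i).length();
            b.add(pos);
        }
        return b;
    }

    // F1 of predicted word boundaries against reference boundaries.
    static double f1(List<String> reference, List<String> predicted) {
        Set<Integer> ref = boundaries(reference);
        Set<Integer> pred = boundaries(predicted);
        if (ref.isEmpty() || pred.isEmpty()) return 0.0;
        Set<Integer> hit = new HashSet<>(ref);
        hit.retainAll(pred);
        double p = (double) hit.size() / pred.size();
        double r = (double) hit.size() / ref.size();
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Reference boundaries {2,3,4}; prediction merges 東京+都, so {3,4}.
        List<String> ref = List.of("東京", "都", "に", "住む");
        List<String> pred = List.of("東京都", "に", "住む");
        System.out.printf("%.2f%n", f1(ref, pred));  // prints: 0.80
    }
}
```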



Re: Japanese Tokenizer Model

Posted by Toshiya TSURU <tu...@gmail.com>.
Hi.

I'm a newbie at language processing, so I'm wondering what kind of
data is suitable as a training corpus.

For English, what is the best training corpus?

On Tue, Apr 5, 2011 at 4:16 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> Currently we do not have support for Japanese, but
> we would be happy to add it.
>
> Do you know a training corpus we could use?
>
> Jörn



-- 
Toshiya TSURU <tu...@gmail.com>
http://twitter.com/turutosiya

Re: Japanese Tokenizer Model

Posted by Jörn Kottmann <ko...@gmail.com>.
On 4/5/11 8:25 AM, Toshiya TSURU wrote:
> Although I've checked the page below,
> The models For Japanese is not found.
> http://opennlp.sourceforge.net/models-1.5/
>
> How can I get Japanese model?
> Or Can I create one?
>
Currently we do not have support for Japanese, but
we would be happy to add it.

Do you know a training corpus we could use?

Jörn