Posted to users@opennlp.apache.org by 王春华 <ig...@icloud.com> on 2017/09/01 09:15:52 UTC
How to get a tokenizer model for Chinese
Hello everyone,
I wonder whether there is a tokenizer model for Chinese text, or where I can find guidelines on how to train one myself.
thanks!
Aaron
Re: How to get a tokenizer model for Chinese
Posted by 王春华 <ig...@icloud.com>.
Hi Jörn,
I found it works if I replace the spaces with <SPLIT> markers in the corpus file.
Please ignore my last post.
Thanks!
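In case it helps anyone who hits the same problem: the trainer expects one sentence per line, with <SPLIT> marking token boundaries that carry no whitespace. A minimal sketch of the conversion I did (hypothetical helper, plain Java, not part of OpenNLP):

```java
// Converts a whitespace-segmented line (e.g. from a pre-segmented
// Chinese corpus) into OpenNLP's <SPLIT> tokenizer training format,
// where boundaries without whitespace are marked explicitly.
public class SplitMarker {

    public static String toSplitFormat(String segmentedLine) {
        // Collapse any run of whitespace into a single <SPLIT> marker.
        return String.join("<SPLIT>", segmentedLine.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(toSplitFormat("我 喜欢 自然 语言 处理"));
        // prints: 我<SPLIT>喜欢<SPLIT>自然<SPLIT>语言<SPLIT>处理
    }
}
```

Running every line of the corpus through something like this before training is what made it work for me.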
> On 4 Sep 2017, at 7:52 AM, 王春华 <ig...@icloud.com> wrote:
Re: How to get a tokenizer model for Chinese
Posted by 王春华 <ig...@icloud.com>.
Hi Jörn,
I am trying to train the tokenizer on a Chinese corpus and got the following exception on the console:
Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 4476143 events
Indexing... done.
Sorting and merging events... done. Reduced 4476143 events to 358244.
Done indexing in 30.55 s.
opennlp.tools.util.InsufficientTrainingDataException: Training data must contain more than one outcome
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:247)
at com.mzdee.nlp.Tokenizer.main(Tokenizer.java:207)
I am new to NLP and don't quite understand what's going on. The code snippet is below:
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
        new File("/Users/aaron/resume-corpus/corpus_一_20140804162433.txt"));
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

TokenizerModel model;
try {
    // The call from the documentation no longer compiles:
    // model = TokenizerME.train("zh", sampleStream, true, TrainingParameters.defaultParams());
    boolean useAlphaNumericOptimization = false;
    String languageCode = "zh";
    model = TokenizerME.train(sampleStream,
            TokenizerFactory.create(null, languageCode, null, useAlphaNumericOptimization, null),
            TrainingParameters.defaultParams());
} finally {
    sampleStream.close();
}

OutputStream modelOut = null;
try {
    modelOut = new BufferedOutputStream(
            new FileOutputStream("/Users/aaron/resume-corpus/zh-token.bin"));
    model.serialize(modelOut);
} finally {
    if (modelOut != null)
        modelOut.close();
}
The line I commented out above seems to be out of date with the latest version of the API.
Help !!!
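(Following up for anyone diagnosing the same error: my understanding is that "more than one outcome" means every training event the indexer generated carried the same label, e.g. the data never yields a SPLIT decision. A rough, stdlib-only sanity check under that assumption, with a hypothetical helper that is not part of OpenNLP:)

```java
// Heuristic check on a training line in <SPLIT> format: for the
// trainer to see both outcome labels, a line should (a) contain at
// least one <SPLIT> boundary and (b) contain at least one token of
// more than one character, so some positions are NOT split points.
public class CorpusCheck {

    public static boolean hasBothOutcomes(String line) {
        String[] tokens = line.split("<SPLIT>", -1);
        boolean hasSplit = tokens.length > 1;
        boolean hasInside = false;
        for (String t : tokens) {
            if (t.codePointCount(0, t.length()) > 1) {
                hasInside = true;
            }
        }
        return hasSplit && hasInside;
    }

    public static void main(String[] args) {
        System.out.println(hasBothOutcomes("深度<SPLIT>学习")); // true
        System.out.println(hasBothOutcomes("一 二 三"));       // false: no <SPLIT> markers at all
    }
}
```

If no line in the corpus passes a check like this, the trainer has only one outcome to learn from, which would explain the InsufficientTrainingDataException above.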
> On 1 Sep 2017, at 7:18 PM, Joern Kottmann <ko...@gmail.com> wrote:
Re: How to get a tokenizer model for Chinese
Posted by Joern Kottmann <ko...@gmail.com>.
Our current tokenizer can be trained to segment Chinese just by following the user documentation, but it might not work very well; we have never tried it.
Do you have a corpus you can train on?
OntoNotes has some Chinese text and could probably be used.
Jörn
On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <ig...@icloud.com> wrote: