Posted to users@opennlp.apache.org by 王春华 <ig...@icloud.com> on 2017/09/01 09:15:52 UTC

How to get a tokenizer model for Chinese

Hello everyone,

I wonder if there is a tokenizer model for Chinese text, or where I can find guidelines on how to train one myself.

thanks!
Aaron

Re: How to get a tokenizer model for Chinese

Posted by 王春华 <ig...@icloud.com>.
Hi Jörn,

I found it works if I replace the spaces with <SPLIT> tags in the corpus file.
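
For example, a training line then looks like this (one sentence per line, with <SPLIT> marking each token boundary; the segmentation itself is just illustrative):

	自然<SPLIT>语言<SPLIT>处理<SPLIT>很<SPLIT>有趣<SPLIT>。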

Please ignore my last post.

Thanks!
> On 4 Sep 2017, at 7:52 AM, 王春华 <ig...@icloud.com> wrote:
> 
> Hi Jörn,
> 
> I am trying to train the tokenizer on a Chinese corpus and got the exception below on the console:
> 
> 
> Indexing events with TwoPass using cutoff of 5
> 
> 	Computing event counts...  done. 4476143 events
> 	Indexing...  done.
> Sorting and merging events... done. Reduced 4476143 events to 358244.
> Done indexing in 30.55 s.
> opennlp.tools.util.InsufficientTrainingDataException: Training data must contain more than one outcome
> 	at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
> 	at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
> 	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:247)
> 	at com.mzdee.nlp.Tokenizer.main(Tokenizer.java:207)
> 
> I am new to NLP and don't quite understand what's going on. The code snippet is below:
> 
> // Read the training corpus: one sentence per line, token boundaries marked with <SPLIT>.
> InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
> 		new File("/Users/aaron/resume-corpus/corpus_一_20140804162433.txt"));
> Charset charset = Charset.forName("UTF-8");
> ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, charset);
> ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
> 
> TokenizerModel model;
> 
> try {
> //	model = TokenizerME.train("zh", sampleStream, true, TrainingParameters.defaultParams());
> 	boolean useAlphaNumericOptimization = false;
> 	String languageCode = "zh";
> 	model = TokenizerME.train(sampleStream,
> 			TokenizerFactory.create(null, languageCode, null, useAlphaNumericOptimization, null),
> 			TrainingParameters.defaultParams());
> } finally {
> 	sampleStream.close();
> }
> 
> // Serialize the trained model to disk.
> OutputStream modelOut = null;
> try {
> 	modelOut = new BufferedOutputStream(new FileOutputStream("/Users/aaron/resume-corpus/zh-token.bin"));
> 	model.serialize(modelOut);
> } finally {
> 	if (modelOut != null)
> 		modelOut.close();
> }
> The line I commented out above seems to be out of date with the latest version.
> 
> Help !!!
> 
> 
>> On 1 Sep 2017, at 7:18 PM, Joern Kottmann <kottmann@gmail.com> wrote:
>> 
>> Our current tokenizer can be trained to segment Chinese just by
>> following the user documentation,
>> but it might not work very well. We never tried this.
>> 
>> Do you have a corpus you can train on?
>> 
>> OntoNotes has some Chinese text and could probably be used.
>> 
>> Jörn
>> 
>> On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <igor.wong@icloud.com> wrote:
>>> Hello everyone,
>>> 
>>> I wonder if there is a tokenizer model for Chinese text, or where I can find guidelines on how to train one myself.
>>> 
>>> thanks!
>>> Aaron
> 


Re: How to get a tokenizer model for Chinese

Posted by 王春华 <ig...@icloud.com>.
Hi Jörn,

I am trying to train the tokenizer on a Chinese corpus and got the exception below on the console:


Indexing events with TwoPass using cutoff of 5

	Computing event counts...  done. 4476143 events
	Indexing...  done.
Sorting and merging events... done. Reduced 4476143 events to 358244.
Done indexing in 30.55 s.
opennlp.tools.util.InsufficientTrainingDataException: Training data must contain more than one outcome
	at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
	at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:247)
	at com.mzdee.nlp.Tokenizer.main(Tokenizer.java:207)

I am new to NLP and don't quite understand what's going on. The code snippet is below:

// Read the training corpus: one sentence per line, token boundaries marked with <SPLIT>.
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
		new File("/Users/aaron/resume-corpus/corpus_一_20140804162433.txt"));
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

TokenizerModel model;

try {
//	model = TokenizerME.train("zh", sampleStream, true, TrainingParameters.defaultParams());
	boolean useAlphaNumericOptimization = false;
	String languageCode = "zh";
	model = TokenizerME.train(sampleStream,
			TokenizerFactory.create(null, languageCode, null, useAlphaNumericOptimization, null),
			TrainingParameters.defaultParams());
} finally {
	sampleStream.close();
}

// Serialize the trained model to disk.
OutputStream modelOut = null;
try {
	modelOut = new BufferedOutputStream(new FileOutputStream("/Users/aaron/resume-corpus/zh-token.bin"));
	model.serialize(modelOut);
} finally {
	if (modelOut != null)
		modelOut.close();
}
The line I commented out above seems to be out of date with the latest version.
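
It looks like the old static overload TokenizerME.train(String, ObjectStream, boolean, TrainingParameters) was removed, and the current method takes a TokenizerFactory instead. If I read the 1.7+ API right, an equivalent call would be something like:

	// Same call expressed with the TokenizerFactory constructor; the nulls are
	// the optional abbreviation dictionary and alpha-numeric pattern.
	TokenizerFactory factory = new TokenizerFactory("zh", null, false, null);
	TokenizerModel model = TokenizerME.train(sampleStream, factory,
			TrainingParameters.defaultParams());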

Help !!!


> On 1 Sep 2017, at 7:18 PM, Joern Kottmann <ko...@gmail.com> wrote:
> 
> Our current tokenizer can be trained to segment Chinese just by
> following the user documentation,
> but it might not work very well. We never tried this.
> 
> Do you have a corpus you can train on?
> 
> OntoNotes has some Chinese text and could probably be used.
> 
> Jörn
> 
> On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <ig...@icloud.com> wrote:
>> Hello everyone,
>> 
>> I wonder if there is a tokenizer model for Chinese text, or where I can find guidelines on how to train one myself.
>> 
>> thanks!
>> Aaron


Re: How to get a tokenizer model for Chinese

Posted by Joern Kottmann <ko...@gmail.com>.
Our current tokenizer can be trained to segment Chinese just by
following the user documentation,
but it might not work very well. We never tried this.
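
The user documentation shows the command-line invocation; for Chinese it would look roughly like this (file names here are placeholders, and -encoding matters for non-ASCII text):

	$ opennlp TokenizerTrainer -model zh-token.bin -lang zh -data zh-token.train -encoding UTF-8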

Do you have a corpus you can train on?

OntoNotes has some Chinese text and could probably be used.

Jörn

On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <ig...@icloud.com> wrote:
> Hello everyone,
>
> I wonder if there is a tokenizer model for Chinese text, or where I can find guidelines on how to train one myself.
>
> thanks!
> Aaron