Posted to issues@opennlp.apache.org by "Hayri Volkan Agun (JIRA)" <ji...@apache.org> on 2013/07/26 15:55:48 UTC
[jira] [Created] (OPENNLP-590) Tokenizer is not getting trained...
Hayri Volkan Agun created OPENNLP-590:
-----------------------------------------
Summary: Tokenizer is not getting trained...
Key: OPENNLP-590
URL: https://issues.apache.org/jira/browse/OPENNLP-590
Project: OpenNLP
Issue Type: Bug
Components: Tokenizer
Affects Versions: tools-1.5.3
Environment: Ubuntu 12.04 - JVM 1.7
Reporter: Hayri Volkan Agun
Priority: Minor
I am trying to train a tokenizer for Turkish through the API, but it doesn't learn an obvious pattern. No abbreviation dictionary is used, nor should one be necessary for learning. The sample stream is in UTF-8.
The code sample I used is below:
// Read the UTF-8 training file line by line.
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream =
    new PlainTextByLineStream(new FileInputStream(trainFilename), charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
TokenizerModel model;
// Language "tr", no abbreviation dictionary, no alphanumeric optimization or pattern.
TokenizerFactory factory = new TokenizerFactory("tr", null, false, null);
String tr = factory.getLanguageCode();
model = TokenizerME.train(sampleStream, factory, TrainingParameters.defaultParams());
// try-with-resources closes modelOut, so no explicit close() is needed.
try (OutputStream modelOut = new FileOutputStream(WordOptions.OPENNLPTOKENMODELFILENAME)) {
    model.serialize(modelOut);
}
sampleStream.close();
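One thing that may be worth ruling out on the training data side (a rough sketch, not part of the reported code): the OpenNLP tokenizer training format marks token boundaries that have no surrounding whitespace with a <SPLIT> tag, so a file with few or no tags gives the trainer few such boundaries to learn from. The class name SplitTagCheck and the two sample lines are made up for illustration; the check uses only the JDK and assumes one sentence per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: counts training lines that carry a <SPLIT> tag.
class SplitTagCheck {
    // Tag used by the OpenNLP tokenizer training format for a boundary without whitespace.
    static final String SPLIT_TAG = "<SPLIT>";

    // Returns the number of lines that contain at least one <SPLIT> tag.
    static long countLinesWithSplit(List<String> lines) {
        long count = 0;
        for (String line : lines) {
            if (line.contains(SPLIT_TAG)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, read that file as UTF-8; otherwise use a made-up sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                : Arrays.asList("Merhaba<SPLIT>, bu bir test<SPLIT>.", "Bu bir test .");
        System.out.println(countLinesWithSplit(lines) + " of " + lines.size()
                + " lines contain " + SPLIT_TAG);
    }
}
```

Running it with the training file path as the only argument prints how many lines are annotated; a count near zero would suggest looking at the data before the trainer.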