You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@opennlp.apache.org by "Hayri Volkan Agun (JIRA)" <ji...@apache.org> on 2013/07/26 15:55:48 UTC

[jira] [Created] (OPENNLP-590) Tokenizer is not getting trained...

Hayri Volkan Agun created OPENNLP-590:
-----------------------------------------

             Summary: Tokenizer is not getting trained...
                 Key: OPENNLP-590
                 URL: https://issues.apache.org/jira/browse/OPENNLP-590
             Project: OpenNLP
          Issue Type: Bug
          Components: Tokenizer
    Affects Versions: tools-1.5.3
         Environment: Ubuntu 12.04 - JVM 1.7
            Reporter: Hayri Volkan Agun
            Priority: Minor


Trying to train a tokenizer for Turkish from API, which doesn't learn an obvious pattern. No abbreviation dictionary is used and is either necessary for learning. The sample stream is in UTF-8. 

The code sample I used is below:

Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream(trainFilename),
                      charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
TokenizerModel model;
TokenizerFactory factory = new TokenizerFactory("tr",null,false, null);
String tr = factory.getLanguageCode();
model = TokenizerME.train(sampleStream, factory ,TrainingParameters.defaultParams());
try (OutputStream modelOut = new    FileOutputStream(WordOptions.OPENNLPTOKENMODELFILENAME)) {
   model.serialize(modelOut);
   modelOut.close();
}

sampleStream.close();

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira