Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2014/01/06 17:51:53 UTC

[jira] [Commented] (OPENNLP-590) Tokenizer is not getting trained...

    [ https://issues.apache.org/jira/browse/OPENNLP-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863102#comment-13863102 ] 

Joern Kottmann commented on OPENNLP-590:
----------------------------------------

Please ask questions on the user list.

> Tokenizer is not getting trained...
> -----------------------------------
>
>                 Key: OPENNLP-590
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-590
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: tools-1.5.3
>         Environment: Ubuntu 12.04 - JVM 1.7
>            Reporter: Hayri Volkan Agun
>            Priority: Minor
>              Labels: tokenizer, training, turkish
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> Trying to train a tokenizer for Turkish from the API, but it doesn't learn an obvious pattern. No abbreviation dictionary is used, nor is one necessary for learning. The sample stream is in UTF-8.
> The code sample I used is below:
> Charset charset = Charset.forName("UTF-8");
> ObjectStream<String> lineStream =
>     new PlainTextByLineStream(new FileInputStream(trainFilename), charset);
> ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
> TokenizerModel model;
> TokenizerFactory factory = new TokenizerFactory("tr", null, false, null);
> String tr = factory.getLanguageCode();
> model = TokenizerME.train(sampleStream, factory, TrainingParameters.defaultParams());
> try (OutputStream modelOut = new FileOutputStream(WordOptions.OPENNLPTOKENMODELFILENAME)) {
>    model.serialize(modelOut);
> }
> sampleStream.close();
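
For context on the quoted code: the OpenNLP tokenizer training format expects one sentence per line, with tokens separated either by whitespace or by an explicit <SPLIT> marker where two tokens touch. A training file without any <SPLIT> markers gives the learner no examples of splits inside whitespace-delimited chunks. The sketch below is a hypothetical stdlib-only helper (SplitMarkerCheck is not part of OpenNLP, and the two Turkish lines are invented sample data) that counts those markers as a quick sanity check. It may or may not be the cause in this report.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: counts explicit <SPLIT> markers
// in tokenizer training lines as a sanity check on the training data.
final class SplitMarkerCheck {

    // Marker used by the OpenNLP tokenizer training format between two
    // tokens that are not separated by whitespace.
    static final String SPLIT = "<SPLIT>";

    // Number of <SPLIT> markers in a single training line.
    static int countSplitMarkers(String line) {
        int count = 0;
        int from = 0;
        while ((from = line.indexOf(SPLIT, from)) >= 0) {
            count++;
            from += SPLIT.length();
        }
        return count;
    }

    // Total number of <SPLIT> markers across all training lines.
    static int totalSplitMarkers(List<String> lines) {
        int total = 0;
        for (String line : lines) {
            total += countSplitMarkers(line);
        }
        return total;
    }

    public static void main(String[] args) {
        // Invented sample lines, for illustration only.
        List<String> sample = Arrays.asList(
            "Ankara'ya gittim<SPLIT>.",
            "Bugün hava güzel<SPLIT>, değil mi<SPLIT>?");
        System.out.println("split markers: " + totalSplitMarkers(sample));
    }
}
```

A count of zero over the whole training file would suggest the data only demonstrates whitespace tokenization.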



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)