Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2017/01/20 14:25:26 UTC

[jira] [Comment Edited] (OPENNLP-697) Tokenizer class is hardcoded in the DocumentSampleStream class.

    [ https://issues.apache.org/jira/browse/OPENNLP-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831812#comment-15831812 ] 

Joern Kottmann edited comment on OPENNLP-697 at 1/20/17 2:25 PM:
-----------------------------------------------------------------

That is how it should be. The tokenizer should be removed from the factory; we will address this in OPENNLP-950.


was (Author: joern):
That is how it should be. The tokenizer should be removed from the factory, we will 

> Tokenizer class is hardcoded in the DocumentSampleStream class. 
> ----------------------------------------------------------------
>
>                 Key: OPENNLP-697
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-697
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Doccat, Tokenizer
>    Affects Versions: 1.6.0
>            Reporter: Praveena B
>             Fix For: 1.7.1
>
>
> While training the DocumentCategorizerME, it is possible to set the type of Tokenizer that the categorizer should use,
> i.e. doccatFactory.setTokenizer(SemicolonTokenizer.INSTANCE);
> But the Tokenizer class is hardcoded to WhitespaceTokenizer in the DocumentSampleStream class. 
> So it is not possible to modify the default tokenizing behaviour even after setting it in the doccatFactory.
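
The mismatch described in the report can be illustrated with a small, self-contained sketch. It does not use OpenNLP's classes: the `Tokenizer` interface, the `WHITESPACE` and `SEMICOLON` instances, and the `readHardcoded` / `readWith` methods are hypothetical stand-ins chosen only to show the difference between a reader that fixes its tokenizer internally and one that accepts the tokenizer from the caller. The actual DocumentSampleStream code may be structured differently.

```java
import java.util.Arrays;

public class TokenizerInjectionSketch {

    /** Hypothetical stand-in for a tokenizer interface; not OpenNLP's actual type. */
    interface Tokenizer {
        String[] tokenize(String text);
    }

    // Illustrative tokenizers (assumptions, not library classes).
    static final Tokenizer WHITESPACE = text -> text.trim().split("\\s+");
    static final Tokenizer SEMICOLON = text -> text.split(";");

    /** Mirrors the reported behaviour: the tokenizer is fixed inside the reader. */
    static String[] readHardcoded(String line) {
        return WHITESPACE.tokenize(line);
    }

    /** Alternative: the caller decides which tokenizer is applied. */
    static String[] readWith(Tokenizer tokenizer, String line) {
        return tokenizer.tokenize(line);
    }

    public static void main(String[] args) {
        String text = "free offer;click now";
        // Splits on whitespace regardless of what the caller wanted.
        System.out.println(Arrays.toString(readHardcoded(text)));
        // Splits on semicolons because the tokenizer was passed in.
        System.out.println(Arrays.toString(readWith(SEMICOLON, text)));
    }
}
```

With the hardcoded reader, a tokenizer configured elsewhere never reaches the training samples, which is the behaviour the report describes.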



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)