Posted to issues@opennlp.apache.org by "Praveena B (JIRA)" <ji...@apache.org> on 2014/05/16 13:19:08 UTC

[jira] [Created] (OPENNLP-697) Tokenizer class is hardcoded in the DocumentSampleStream class.

Praveena B created OPENNLP-697:
----------------------------------

             Summary: Tokenizer class is hardcoded in the DocumentSampleStream class. 
                 Key: OPENNLP-697
                 URL: https://issues.apache.org/jira/browse/OPENNLP-697
             Project: OpenNLP
          Issue Type: Bug
          Components: Doccat, Tokenizer
    Affects Versions: 1.6.0
            Reporter: Praveena B


When training the DocumentCategorizerME, it is possible to set the Tokenizer that the categorizer should use,
e.g. doccatFactory.setTokenizer(SemicolonTokenizer.INSTANCE);

However, the Tokenizer is hardcoded to WhitespaceTokenizer in the DocumentSampleStream class.
As a result, the default tokenizing behaviour cannot be changed, even after a different Tokenizer is set on the doccatFactory.
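
To illustrate the mismatch, here is a minimal, self-contained sketch. It does not use the OpenNLP API: SimpleTokenizer, WHITESPACE, SEMICOLON, Sample, parseHardcoded and parse are all hypothetical names invented for this example, and the "category, then text" line layout is an assumption about the training data. It contrasts a sample parser that always splits on whitespace with one that accepts the tokenizer as a parameter, which is one possible direction for a fix.

```java
import java.util.Arrays;

public class ConfigurableTokenizerSketch {

    /** Hypothetical stand-in for a tokenizer; NOT the OpenNLP Tokenizer interface. */
    interface SimpleTokenizer {
        String[] tokenize(String text);
    }

    /** Splits on runs of whitespace (assumed default behaviour). */
    static final SimpleTokenizer WHITESPACE =
        text -> text.trim().isEmpty() ? new String[0] : text.trim().split("\\s+");

    /** Splits on semicolons, ignoring whitespace around them (hypothetical). */
    static final SimpleTokenizer SEMICOLON =
        text -> text.trim().isEmpty() ? new String[0] : text.trim().split("\\s*;\\s*");

    /** One training sample: a category label plus the tokens of the document. */
    static final class Sample {
        final String category;
        final String[] tokens;

        Sample(String category, String[] tokens) {
            this.category = category;
            this.tokens = tokens;
        }
    }

    /** Mirrors the reported problem: the tokenizer cannot be chosen by the caller. */
    static Sample parseHardcoded(String line) {
        return parse(line, WHITESPACE);
    }

    /**
     * Possible remedy: the caller passes the tokenizer in.
     * Assumes each line is "category", whitespace, then the document text.
     */
    static Sample parse(String line, SimpleTokenizer tokenizer) {
        String[] parts = line.trim().split("\\s+", 2);
        String text = parts.length > 1 ? parts[1] : "";
        return new Sample(parts[0], tokenizer.tokenize(text));
    }

    public static void main(String[] args) {
        String line = "sports goal scored;late winner";

        // Hardcoded whitespace splitting breaks the phrases apart.
        System.out.println(Arrays.toString(parseHardcoded(line).tokens));

        // With an injected tokenizer, the semicolon-delimited phrases survive.
        System.out.println(Arrays.toString(parse(line, SEMICOLON).tokens));
    }
}
```

With the hardcoded variant, a tokenizer configured elsewhere has no effect on how the training samples are split, which is the behaviour described in this issue.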




--
This message was sent by Atlassian JIRA
(v6.2#6252)