Posted to issues@opennlp.apache.org by "Praveena B (JIRA)" <ji...@apache.org> on 2014/05/16 13:19:08 UTC
[jira] [Created] (OPENNLP-697) Tokenizer class is hardcoded in the DocumentSampleStream class.
Praveena B created OPENNLP-697:
----------------------------------
Summary: Tokenizer class is hardcoded in the DocumentSampleStream class.
Key: OPENNLP-697
URL: https://issues.apache.org/jira/browse/OPENNLP-697
Project: OpenNLP
Issue Type: Bug
Components: Doccat, Tokenizer
Affects Versions: 1.6.0
Reporter: Praveena B
While training the DocumentCategorizerME, it is possible to set the Tokenizer that the categorizer should use. For example:
doccatFactory.setTokenizer(SemicolonTokenizer.INSTANCE);
However, the Tokenizer is hardcoded to WhitespaceTokenizer in the DocumentSampleStream class.
As a result, the default tokenizing behaviour cannot be changed, even after a different Tokenizer has been set on the doccatFactory.
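
To illustrate the mismatch, here is a minimal, self-contained sketch. It does not use the OpenNLP classes: the Tokenizer interface, the two tokenizer instances, and both parse methods are hypothetical stand-ins, and the sample line assumes a layout where the category comes first and the document text follows. It only shows the difference between a stream that fixes the tokenizer internally and one that accepts it from the caller.

```java
import java.util.Arrays;

// Hypothetical stand-ins for illustration only; these are NOT the OpenNLP classes.
public class TokenizerSketch {

    /** Minimal tokenizer contract assumed for this sketch. */
    interface Tokenizer {
        String[] tokenize(String text);
    }

    // Splits on runs of whitespace.
    static final Tokenizer WHITESPACE = text -> text.trim().split("\\s+");

    // Splits on semicolons (stand-in for the reporter's SemicolonTokenizer).
    static final Tokenizer SEMICOLON = text -> text.split(";");

    /** Mirrors the reported behaviour: the tokenizer is fixed inside the stream. */
    static String[] parseHardcoded(String line) {
        String[] parts = WHITESPACE.tokenize(line);
        // Drop the leading category field and keep the document tokens.
        return Arrays.copyOfRange(parts, 1, parts.length);
    }

    /** One possible remedy: the caller supplies the tokenizer for the document text. */
    static String[] parseConfigurable(String line, Tokenizer tokenizer) {
        // The category is assumed to end at the first space; if there is no
        // space, indexOf returns -1 and the whole line is tokenized.
        int split = line.indexOf(' ');
        return tokenizer.tokenize(line.substring(split + 1));
    }

    public static void main(String[] args) {
        String line = "sports goal;penalty kick;red card";
        // Whitespace is always used, so the semicolon-delimited phrases are broken up.
        System.out.println(Arrays.toString(parseHardcoded(line)));
        // The caller's tokenizer is honoured.
        System.out.println(Arrays.toString(parseConfigurable(line, SEMICOLON)));
    }
}
```

With the hardcoded variant, the sample text is split into "goal;penalty", "kick;red" and "card"; with the configurable variant and the semicolon tokenizer, it is split into "goal", "penalty kick" and "red card". Whether the fix belongs in the stream's constructor or elsewhere is a design choice for the project.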
--
This message was sent by Atlassian JIRA
(v6.2#6252)