Posted to issues@opennlp.apache.org by "Jörn Kottmann (JIRA)" <ji...@apache.org> on 2011/02/02 22:45:29 UTC

[jira] Commented: (OPENNLP-33) Write documentation for the document categorizer component

    [ https://issues.apache.org/jira/browse/OPENNLP-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989813#comment-12989813 ] 

Jörn Kottmann commented on OPENNLP-33:
--------------------------------------

There are a few questions inside the attached document.

1. The maxent jar is still necessary, since it contains the maxent classes. They are used mainly by DoccatModel to serialize the embedded binary maxent model, and by DocumentCategorizerME to perform training and categorization.

2. The training format is one document per line: the first token is the category, and all remaining whitespace-separated tokens are the document tokens. The DocumentSample constructor also expects whitespace-tokenized input text.

3. The parsing code you describe is mostly already in DocumentSampleStream, which can parse the format described above.
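
To make the format in point 2 concrete, here is a minimal sketch of parsing a single training line. It assumes only what is stated above (one document per line, first token is the category, the rest are whitespace-separated document tokens). The class DoccatLineSketch, its Sample type, and the parseLine method are hypothetical names made up for illustration; they are not OpenNLP classes, and in practice DocumentSampleStream should be used instead. Skipping blank lines and lines without document tokens is also an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only, not part of OpenNLP.
public class DoccatLineSketch {

    // Holds the category (first token) and the remaining document tokens.
    public static final class Sample {
        public final String category;
        public final List<String> tokens;

        Sample(String category, List<String> tokens) {
            this.category = category;
            this.tokens = tokens;
        }
    }

    // Parses one line of the training format.
    // Assumption of this sketch: returns null for a blank line or for a
    // line that has a category but no document tokens.
    public static Sample parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] parts = trimmed.split("\\s+");
        if (parts.length < 2) {
            return null;
        }
        List<String> tokens = Arrays.asList(parts).subList(1, parts.length);
        return new Sample(parts[0], tokens);
    }

    public static void main(String[] args) {
        // Made-up example line: category "sports" followed by the document text.
        Sample s = parseLine("sports the team won the final match");
        System.out.println(s.category + " -> " + s.tokens);
    }
}
```

The token list produced this way is already whitespace tokenized, which is the form of input the DocumentSample constructor is said to expect in point 2.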

> Write documentation for the document categorizer component
> ----------------------------------------------------------
>
>                 Key: OPENNLP-33
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-33
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Jörn Kottmann
>         Attachments: doccat_documentation.rtf
>
>
> Write initial documentation for the document categorizer component.
> The issue is migrated from SourceForge:
> https://sourceforge.net/tracker/?func=detail&aid=3028436&group_id=3368&atid=103368

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira