Posted to issues@opennlp.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2018/01/11 20:59:00 UTC

[jira] [Created] (OPENNLP-1182) LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise

Steve Rowe created OPENNLP-1182:
-----------------------------------

             Summary: LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
                 Key: OPENNLP-1182
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1182
             Project: OpenNLP
          Issue Type: Bug
            Reporter: Steve Rowe


Contrary to the docs (see below), LanguageDetectorConverterTool doesn't actually do anything at all; the class is empty.

{quote}
The following sequence of commands shows how to convert the Leipzig Corpora collection at folder leipzig-train/ to the default Language Detector format, by creating groups of 5 sentences as documents and limiting to 10000 documents per language. Then, it shuffles the result and selects the first 100000 lines as train corpus and the last 20000 as evaluation corpus:

{noformat}
$ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
$ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
$ head -100000 < leipzig_shuf.txt > leipzig.train
$ tail -20000 < leipzig_shuf.txt > leipzig.eval
{noformat}
{quote}
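
One way to confirm the symptom is to check whether the converter step wrote anything before running the shuffle and split. The following is only a minimal sketch: check_nonempty is a hypothetical helper written for this report, not part of OpenNLP, and it assumes the converter output was redirected to a file as in the quoted commands.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenNLP): report whether a file
# produced by an earlier pipeline step actually contains data.
check_nonempty() {
    # -s is true only if the file exists and has a size greater than zero
    if [ -s "$1" ]; then
        echo "ok: $1 has $(wc -l < "$1" | tr -d ' ') lines"
    else
        echo "empty: $1 is missing or has no content"
        return 1
    fi
}
```

After the first quoted command, running "check_nonempty leipzig.txt" would show whether any samples were written. If the tool is a no-op as described above, the file would presumably be reported as empty, and the later head/tail steps would yield empty train and eval files.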



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)