Posted to dev@opennlp.apache.org by Nicolas Hernandez <ni...@gmail.com> on 2013/03/28 15:10:44 UTC

Doccat : Different tokenizers for training and categorizing?

Dear all

I have not yet traced the whole process, but because of some unexpected
doccat results I took a look at the code.

Can you confirm that the DoccatTrainerTool tokenizes on whitespace (when
creating DocumentSample objects) while the DoccatTool uses the
SimpleTokenizer?

This should not be the case. Both should use the same tokenizer; in
particular, the whitespace tokenizer.

If not, which one is used?

Best regards

/Nicolas

Re: Doccat : Different tokenizers for training and categorizing?

Posted by Jörn Kottmann <ko...@gmail.com>.
Yes, that's a bug in the DoccatTool. Both tools should process the
OpenNLP default format, which is one document per line, whitespace
tokenized.

The trainer seems to work fine, so the DoccatTool needs to use the
WhitespaceTokenizer instead of the SimpleTokenizer. Thanks for figuring
that out!
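To make the mismatch concrete, here is a minimal, self-contained sketch of the two tokenization behaviors (these are illustrative stand-ins, not the actual OpenNLP WhitespaceTokenizer and SimpleTokenizer classes): a plain whitespace split versus a character-class split that starts a new token whenever the character type changes, which is roughly what the SimpleTokenizer does. A model trained on the first sees different features than a categorizer applying the second to the same text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizerMismatch {

    // Whitespace tokenization: split on runs of whitespace.
    static String[] whitespaceTokenize(String text) {
        return text.trim().split("\\s+");
    }

    // Character-class tokenization (roughly what SimpleTokenizer does):
    // a new token starts whenever the character class changes.
    static String[] simpleTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prevClass = -1;
        for (char c : text.toCharArray()) {
            int cls = Character.isWhitespace(c) ? 0
                    : Character.isLetter(c) ? 1
                    : Character.isDigit(c) ? 2
                    : 3; // punctuation and everything else
            if (cls != prevClass && current.length() > 0) {
                tokens.add(current.toString());   // class changed: flush token
                current.setLength(0);
            }
            if (cls != 0) {
                current.append(c);                // whitespace itself is dropped
            }
            prevClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String doc = "OpenNLP 1.5.3 doesn't match!";
        // [OpenNLP, 1.5.3, doesn't, match!]
        System.out.println(Arrays.toString(whitespaceTokenize(doc)));
        // [OpenNLP, 1, ., 5, ., 3, doesn, ', t, match, !]
        System.out.println(Arrays.toString(simpleTokenize(doc)));
    }
}
```

With whitespace tokenization the model's features include "doesn't" and "1.5.3"; with the character-class split those become "doesn", "'", "t", "1", ".", "5", ".", "3" — none of which the trained model has ever seen, which explains the unexpected categorization results.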

Nicolas, do you mind opening a JIRA issue?

Jörn

On 03/29/2013 03:55 AM, William Colen wrote:
> In my opinion you are right. It would be safer to use the WhitespaceTokenizer
> than the SimpleTokenizer.
>
> But I could not check whether the DoccatTrainerTool is using the whitespace
> tokenizer. Actually, the only DocumentSample provider we have today is the
> one that reads the Leipzig corpus, and as far as I know it uses the
> SimpleTokenizer because the entries are not tokenized.
>
>
>
> On Thu, Mar 28, 2013 at 11:10 AM, Nicolas Hernandez <
> nicolas.hernandez@gmail.com> wrote:
>
>> Dear all
>>
>> I have not yet traced the whole process, but because of some unexpected
>> doccat results I took a look at the code.
>>
>> Can you confirm that the DoccatTrainerTool tokenizes on whitespace (when
>> creating DocumentSample objects) while the DoccatTool uses the
>> SimpleTokenizer?
>>
>> This should not be the case. Both should use the same tokenizer; in
>> particular, the whitespace tokenizer.
>>
>> If not, which one is used?
>>
>> Best regards
>>
>> /Nicolas
>>


Re: Doccat : Different tokenizers for training and categorizing?

Posted by William Colen <wi...@gmail.com>.
In my opinion you are right. It would be safer to use the WhitespaceTokenizer
than the SimpleTokenizer.

But I could not check whether the DoccatTrainerTool is using the whitespace
tokenizer. Actually, the only DocumentSample provider we have today is the
one that reads the Leipzig corpus, and as far as I know it uses the
SimpleTokenizer because the entries are not tokenized.
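For reference, the OpenNLP default doccat training format mentioned above is one document per line, with the category as the first whitespace-separated token and the document text (already whitespace tokenized) following it. A minimal sketch of how such a line splits — the category name and sample text here are made up, and this is not the actual OpenNLP reader:

```java
import java.util.Arrays;

public class DefaultFormatLine {

    // Splits one line of the default doccat format on whitespace:
    // result[0] is the category, result[1..] are the document tokens.
    static String[] parse(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] t = parse("Sport the match ended in a 2-1 win");
        System.out.println("category = " + t[0]);
        System.out.println("tokens   = "
                + Arrays.toString(Arrays.copyOfRange(t, 1, t.length)));
    }
}
```

Because the document part is assumed to be pre-tokenized on whitespace, a corpus like Leipzig, whose entries are raw text, needs its own tokenization step — which is why its sample provider applies a tokenizer at all.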



On Thu, Mar 28, 2013 at 11:10 AM, Nicolas Hernandez <
nicolas.hernandez@gmail.com> wrote:

> Dear all
>
> I have not yet traced the whole process, but because of some unexpected
> doccat results I took a look at the code.
>
> Can you confirm that the DoccatTrainerTool tokenizes on whitespace (when
> creating DocumentSample objects) while the DoccatTool uses the
> SimpleTokenizer?
>
> This should not be the case. Both should use the same tokenizer; in
> particular, the whitespace tokenizer.
>
> If not, which one is used?
>
> Best regards
>
> /Nicolas
>

Re: Doccat : Different tokenizers for training and categorizing?

Posted by Nicolas Hernandez <ni...@gmail.com>.
Hi

I have tested using the OpenNLP 1.5.3 RC 3 binary. My results have
changed and seem more consistent with what I expected.

Thanks

On Tue, Apr 2, 2013 at 10:47 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> The DoccatTool now uses the WhitespaceTokenizer to tokenize
> the input text, see the issue here:
> https://issues.apache.org/jira/browse/OPENNLP-568
>
> It's fixed in trunk and will go into our next release candidate;
> please test whether it fixes your issue.
>
> Jörn
>
>
> On 03/28/2013 03:10 PM, Nicolas Hernandez wrote:
>>
>> Dear all
>>
>> I have not yet traced the whole process, but because of some unexpected
>> doccat results I took a look at the code.
>>
>> Can you confirm that the DoccatTrainerTool tokenizes on whitespace (when
>> creating DocumentSample objects) while the DoccatTool uses the
>> SimpleTokenizer?
>>
>> This should not be the case. Both should use the same tokenizer; in
>> particular, the whitespace tokenizer.
>>
>> If not, which one is used?
>>
>> Best regards
>>
>> /Nicolas
>
>



-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67

Re: Doccat : Different tokenizers for training and categorizing?

Posted by Jörn Kottmann <ko...@gmail.com>.
The DoccatTool now uses the WhitespaceTokenizer to tokenize
the input text, see the issue here:
https://issues.apache.org/jira/browse/OPENNLP-568

It's fixed in trunk and will go into our next release candidate;
please test whether it fixes your issue.

Jörn

On 03/28/2013 03:10 PM, Nicolas Hernandez wrote:
> Dear all
>
> I have not yet traced the whole process, but because of some unexpected
> doccat results I took a look at the code.
>
> Can you confirm that the DoccatTrainerTool tokenizes on whitespace (when
> creating DocumentSample objects) while the DoccatTool uses the
> SimpleTokenizer?
>
> This should not be the case. Both should use the same tokenizer; in
> particular, the whitespace tokenizer.
>
> If not, which one is used?
>
> Best regards
>
> /Nicolas