Posted to issues@opennlp.apache.org by "Bhupesh Chawda (JIRA)" <ji...@apache.org> on 2014/11/17 13:15:33 UTC
[jira] [Comment Edited] (OPENNLP-702) DictionaryNameFinder Not Finding Longest Match When Name Ends in a Number
[ https://issues.apache.org/jira/browse/OPENNLP-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214589#comment-14214589 ]
Bhupesh Chawda edited comment on OPENNLP-702 at 11/17/14 12:15 PM:
-------------------------------------------------------------------
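The behavior here seems to be caused by the tokenizer used (most probably SimpleTokenizer), which starts a new token whenever the character type changes (for example, from alphabetic to numeric). As a result, "b12" is split into "b" and "12", so the dictionary entry "vitamin b12" cannot match and only "vitamin b" is found.
If alphanumeric tokens are needed, a different tokenizer such as TokenizerME may be used.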
This does not seem to be a bug.
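To illustrate the splitting rule described above, here is a minimal sketch that uses only the JDK. It is not the SimpleTokenizer source: the class name `CharClassTokenizerSketch`, the method `tokenize`, and the four character classes (letter, digit, whitespace, other) are hypothetical and only approximate how a character-class tokenizer might behave.

```java
import java.util.ArrayList;
import java.util.List;

public class CharClassTokenizerSketch {

    // Hypothetical character classes, assumed for illustration only.
    enum CharClass { LETTER, DIGIT, WHITESPACE, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Starts a new token whenever the character class changes; whitespace is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        CharClass previous = CharClass.WHITESPACE;
        for (char c : text.toCharArray()) {
            CharClass cls = classify(c);
            if (cls != previous && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != CharClass.WHITESPACE) current.append(c);
            previous = cls;
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [I, like, vitamin, b, 12, .]
        System.out.println(tokenize("I like vitamin b12."));
    }
}
```

Under these assumptions "b12" never survives as a single token, which would explain why only the shorter dictionary entry can match.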
was (Author: bhupeshchawda):
The behavior here seems to be due to the tokenizer used (most probably, SimpleTokenizer). This created a new token when the character type changes (alphabetic to numeric).
If an alphanumeric token is needed, some different tokenizer like TokenizerME may be used.
This does not seem to be a bug.
> DictionaryNameFinder Not Finding Longest Match When Name Ends in a Number
> -------------------------------------------------------------------------
>
> Key: OPENNLP-702
> URL: https://issues.apache.org/jira/browse/OPENNLP-702
> Project: OpenNLP
> Issue Type: Bug
> Components: Name Finder, Tokenizer
> Environment: Darwin Kernel Version 12.5.0
> Reporter: rhead
>
> Here's my dictionary:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8"?>
> <dictionary case_sensitive="false">
> <entry>
> <token>vitamin</token>
> <token>b12</token>
> </entry>
> <entry>
> <token>vitamin</token>
> <token>b</token>
> </entry>
> <entry>
> <token>john</token>
> <token>doe</token>
> </entry>
> <entry>
> <token>john</token>
> <token>d</token>
> </entry>
> </dictionary>
> {code}
> When run on this sentence using a DictionaryNameFinder: {quote}My name is john doe, aka john d. I like vitamin b12.{quote}
> The following tokens are found: {quote}john doe, john d, vitamin b{quote}
> As you can see, when the 2nd token ends in a number, the longest match is discarded.
> (Originally from: http://mail-archives.apache.org/mod_mbox/opennlp-users/201406.mbox/%3C1402268906.31205.YahooMailNeo%40web121102.mail.ne1.yahoo.com%3E)
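The reported result can be reproduced with a small stand-alone sketch of longest-match dictionary lookup over a token array. This is not the DictionaryNameFinder implementation: `LongestMatchSketch`, `findLongest`, and the space-joined dictionary representation are hypothetical. It assumes the sentence reaches the matcher either as `vitamin`, `b`, `12` (split at the letter/digit boundary) or as `vitamin`, `b12`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LongestMatchSketch {

    // Entries from the reported dictionary, stored as space-joined tokens (an assumption of this sketch).
    static final Set<String> ENTRIES = new HashSet<>(Arrays.asList(
            "vitamin b12", "vitamin b", "john doe", "john d"));
    static final int MAX_ENTRY_TOKENS = 2;

    // At each start index, keeps the longest token span that is a dictionary entry.
    static List<String> findLongest(String[] tokens) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matchedLength = 0;
            for (int len = 1; len <= MAX_ENTRY_TOKENS && i + len <= tokens.length; len++) {
                // Lower-casing mirrors case_sensitive="false" in the dictionary.
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len))
                        .toLowerCase(Locale.ROOT);
                if (ENTRIES.contains(candidate)) matchedLength = len;
            }
            if (matchedLength > 0) {
                found.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + matchedLength)));
                i += matchedLength;
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Split at the letter/digit boundary: prints [vitamin b]
        System.out.println(findLongest(new String[] {"I", "like", "vitamin", "b", "12", "."}));
        // Alphanumeric token kept whole: prints [vitamin b12]
        System.out.println(findLongest(new String[] {"I", "like", "vitamin", "b12", "."}));
    }
}
```

In this sketch the matcher itself does prefer the longest entry; the shorter result appears only when the token array no longer contains "b12" as one token.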