Posted to issues@opennlp.apache.org by "James Kosin (Commented) (JIRA)" <ji...@apache.org> on 2012/03/15 04:57:49 UTC

[jira] [Commented] (OPENNLP-471) DictionaryNameFinder has HASHing issues

    [ https://issues.apache.org/jira/browse/OPENNLP-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229866#comment-13229866 ] 

James Kosin commented on OPENNLP-471:
-------------------------------------

Another possibility is to store an integer value that holds the length (in tokens) of the longest entry in the dictionary.

find() could use this value to a) know how many more tokens it may add when expanding the search in the dictionary, and b) stop trying to find longer matches right away when every entry is only 1 token long.

It may be the best compromise under the circumstances.
I'll let the group decide....
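
A minimal sketch of the idea above, for discussion only. MaxLengthDictionary, put(), getMaxTokenCount() and this find() are hypothetical names made up for illustration; this is not the OpenNLP Dictionary or DictionaryNameFinder code, and the lookup here is case-sensitive. The dictionary remembers the token count of its longest entry, and find() tries candidates from that length down to 1 at each position, so {"folic", "acid"} is found without an entry for {"folic"}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token dictionary (NOT the OpenNLP class).
class MaxLengthDictionary {
    private final Set<List<String>> entries = new HashSet<>();
    private int maxTokenCount = 0; // length, in tokens, of the longest entry

    void put(String... tokens) {
        entries.add(new ArrayList<>(Arrays.asList(tokens)));
        maxTokenCount = Math.max(maxTokenCount, tokens.length);
    }

    int getMaxTokenCount() {
        return maxTokenCount;
    }

    // Returns {start, endExclusive} token spans, preferring the longest match.
    List<int[]> find(String[] sentence) {
        List<String> tokens = Arrays.asList(sentence);
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        while (start < tokens.size()) {
            // a) never look further ahead than the longest dictionary entry;
            // b) if every entry is 1 token long, only one lookup happens here.
            int longest = Math.min(maxTokenCount, tokens.size() - start);
            int matched = 0;
            for (int len = longest; len >= 1; len--) {
                if (entries.contains(tokens.subList(start, start + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                spans.add(new int[] {start, start + matched});
                start += matched; // continue after the matched name
            } else {
                start++;
            }
        }
        return spans;
    }
}
```

The cost is at most getMaxTokenCount() lookups per token position, which bounds the "try longer matches" approach instead of leaving it open-ended.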

                
> DictionaryNameFinder has HASHing issues
> ---------------------------------------
>
>                 Key: OPENNLP-471
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-471
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>            Reporter: James Kosin
>              Labels: dictionary, namefinder
>
> The DictionaryNameFinder has issues finding multi-token names when the dictionary is searched a token at a time by the find() method and the dictionary doesn't have a single-token (or shorter) match available for the start of the name.
> Having a dictionary with {"folic", "acid"} without an entry for {"folic"} will cause the find() method to miss entirely that a longer match is possible.
> Thanks to Jim for pushing on this, and to my debugging skills for finding it.
> Three possibilities come to mind:
> 1)  One I don't really like: we turn it into a larger problem by trying longer matches when shorter ones don't match.  Unfortunately, this quickly turns into a race to see who can wait longer.
> 2)  Add a way of returning a possible match that may need exploring, or a look-ahead type system to say we don't match "folic", but if you have "acid" after "folic" we have a match for that in the dictionary.
> 3)  Leave find() as is and add the shorter terms to the dictionary, maybe marking them as not-a-valid entry so we know we need a longer match.
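
One way options 2 and 3 could be combined is sketched here under stated assumptions. PrefixMarkedDictionary and its methods are hypothetical names made up for illustration, not OpenNLP API. Each multi-token entry also registers its shorter prefixes, marked as not-a-valid entry, so a token-at-a-time search knows to keep looking ahead without ever reporting the placeholder itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary (NOT the OpenNLP class) that stores prefix markers.
class PrefixMarkedDictionary {
    // TRUE = real entry, FALSE = placeholder prefix of some longer entry.
    private final Map<List<String>, Boolean> entries = new HashMap<>();

    void put(String... tokens) {
        List<String> entry = new ArrayList<>(Arrays.asList(tokens));
        // Register every proper prefix without overwriting a real entry.
        for (int len = 1; len < entry.size(); len++) {
            entries.putIfAbsent(new ArrayList<>(entry.subList(0, len)), Boolean.FALSE);
        }
        entries.put(entry, Boolean.TRUE);
    }

    // Returns {start, endExclusive} token spans for real entries only.
    List<int[]> find(String[] sentence) {
        List<String> tokens = Arrays.asList(sentence);
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        while (start < tokens.size()) {
            int lastValidEnd = -1;
            // Grow the candidate one token at a time while it is still
            // a known entry or a known prefix of one.
            for (int end = start + 1; end <= tokens.size(); end++) {
                Boolean complete = entries.get(tokens.subList(start, end));
                if (complete == null) {
                    break; // neither an entry nor a prefix: stop looking ahead
                }
                if (complete) {
                    lastValidEnd = end; // remember the longest real match so far
                }
            }
            if (lastValidEnd != -1) {
                spans.add(new int[] {start, lastValidEnd});
                start = lastValidEnd;
            } else {
                start++;
            }
        }
        return spans;
    }
}
```

With only {"folic", "acid"} added, the lone token "folic" is not reported as a name, but "folic acid" is. The trade-off is a larger dictionary, since every prefix of every entry is stored.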

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira