Posted to issues@opennlp.apache.org by "Jörn Kottmann (JIRA)" <ji...@apache.org> on 2011/05/17 13:17:47 UTC
[jira] [Closed] (OPENNLP-172) Replace the regex token class feature
generation with the Character class/unicode based implementation
[ https://issues.apache.org/jira/browse/OPENNLP-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jörn Kottmann closed OPENNLP-172.
---------------------------------
Resolution: Fixed
> Replace the regex token class feature generation with the Character class/unicode based implementation
> ------------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-172
> URL: https://issues.apache.org/jira/browse/OPENNLP-172
> Project: OpenNLP
> Issue Type: Improvement
> Components: Name Finder
> Affects Versions: tools-1.5.1-incubating
> Reporter: Jörn Kottmann
> Assignee: Jörn Kottmann
> Priority: Minor
> Fix For: tools-1.5.2-incubating
>
>
> The token class feature is computed with regular expressions, which do not detect all-letter sequences correctly when the sequences contain letters other than A to Z. The new token class feature method uses Unicode (via the Character class) to detect letters; it handles such tokens correctly and is faster.
> The old regular-expression-based token class feature computation should be replaced with the new, faster token class method.
> An evaluation on our Spanish data showed that this change reduces the recall of the Spanish person model by 2%, while precision is identical. When the model is retrained with the fix applied, recall increases by 6% and precision is still identical.
> Recall and precision are identical on my English test data, because it usually does not contain "special" characters.
> The name finder is expected to speed up by roughly 10%.
> In a measurement on the Leipzig corpus with 300K sentences, throughput increased from 556 sent/s to 618 sent/s.
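
The difference between the two approaches described above can be illustrated with a minimal sketch. This is not the OpenNLP implementation: the class name TokenClassSketch, both method names, and the ASCII-only pattern are assumptions made for illustration. It only shows why a pattern limited to A to Z misses tokens such as "José", while a check based on Character.isLetter does not.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: names and pattern are illustrative, not OpenNLP's API.
class TokenClassSketch {

    // Assumed ASCII-only pattern of the kind the issue describes (A to Z only).
    private static final Pattern ASCII_LETTERS = Pattern.compile("[A-Za-z]+");

    // Regex-based check: true only if every character is an ASCII letter.
    static boolean isAllLettersRegex(String token) {
        return ASCII_LETTERS.matcher(token).matches();
    }

    // Unicode-based check: walks the token by code point and asks
    // Character.isLetter, so accented and non-Latin letters count as letters.
    static boolean isAllLettersUnicode(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int codePoint = token.codePointAt(i);
            if (!Character.isLetter(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        // "Jos\u00e9" is "José"; "A4" mixes a letter and a digit.
        String[] tokens = {"Madrid", "Jos\u00e9", "A4"};
        for (String token : tokens) {
            System.out.println(token
                + " regex=" + isAllLettersRegex(token)
                + " unicode=" + isAllLettersUnicode(token));
        }
    }
}
```

In this sketch both checks agree on "Madrid" and "A4", and disagree only on "José", which the ASCII pattern rejects and the Unicode check accepts. That matches the observation above that English data is largely unaffected while Spanish data is not.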
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira