Posted to issues@opennlp.apache.org by "Jörn Kottmann (JIRA)" <ji...@apache.org> on 2011/05/17 13:17:47 UTC

[jira] [Closed] (OPENNLP-172) Replace the regex token class feature generation with the Character class/unicode based implementation

     [ https://issues.apache.org/jira/browse/OPENNLP-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann closed OPENNLP-172.
---------------------------------

    Resolution: Fixed

> Replace the regex token class feature generation with the Character class/unicode based implementation
> ------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-172
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-172
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Name Finder
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: Jörn Kottmann
>            Assignee: Jörn Kottmann
>            Priority: Minor
>             Fix For: tools-1.5.2-incubating
>
>
> The token class feature is computed with the help of regular expressions. These regular expressions do not detect all-letter sequences correctly when the sequences contain letters other than A to Z. The new token class feature method uses Unicode character properties to detect letters, which works better and is faster.
> The old regular-expression-based token class feature computation should be replaced with the new, faster token class method.
> An evaluation on our Spanish data showed that this change reduces the recall of the existing Spanish person model by 2%, while precision is identical. When the model is retrained with this fix applied, recall increases by 6% and precision is still identical.
> Recall and precision are identical on my English test data, because it usually does not contain "special" characters.
> The speed-up of the name finder will be roughly 10%.
> A measurement on the Leipzig corpus with 300K sentences showed throughput increasing from 556 sent/s to 618 sent/s.
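
The difference described above can be illustrated with a minimal Java sketch. It assumes the old check behaved like an ASCII-only pattern such as [A-Za-z]+; the issue does not quote the actual regular expressions. The class and method names below are hypothetical and are not OpenNLP's code. Non-ASCII letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

public class TokenClassSketch {

    // Assumption: an ASCII-only pattern standing in for the old regex-based check.
    private static final Pattern ASCII_LETTERS = Pattern.compile("[A-Za-z]+");

    // Regex variant: true only if every character is in A-Z or a-z.
    public static boolean isAllLetterRegex(String token) {
        return ASCII_LETTERS.matcher(token).matches();
    }

    // Unicode variant: true if the token is non-empty and every code point
    // is a letter according to java.lang.Character.
    public static boolean isAllLetterUnicode(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int codePoint = token.codePointAt(i);
            if (!Character.isLetter(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        // "J\u00f6rn" is "Jörn", "Espa\u00f1a" is "España".
        String[] tokens = {"Madrid", "J\u00f6rn", "Espa\u00f1a", "A4"};
        for (String token : tokens) {
            System.out.println(token
                    + " regex=" + isAllLetterRegex(token)
                    + " unicode=" + isAllLetterUnicode(token));
        }
    }
}
```

In this sketch the two variants agree on "Madrid" and "A4" but disagree on tokens containing letters outside A to Z. That is consistent with the report that English results are unchanged, and with the need to retrain the Spanish model, since the feature values for such tokens change.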
