
Posted to issues@opennlp.apache.org by "Jörn Kottmann (JIRA)" <ji...@apache.org> on 2011/05/16 14:48:47 UTC

[jira] [Created] (OPENNLP-172) Replace the regex token class feature generation with the fast string pattern implementation

Replace the regex token class feature generation with the fast string pattern implementation
--------------------------------------------------------------------------------------------

                 Key: OPENNLP-172
                 URL: https://issues.apache.org/jira/browse/OPENNLP-172
             Project: OpenNLP
          Issue Type: Improvement
          Components: Name Finder
    Affects Versions: tools-1.5.1-incubating
            Reporter: Jörn Kottmann
            Assignee: Jörn Kottmann
            Priority: Minor
             Fix For: tools-1.5.2-incubating


The token class feature is currently computed with regular expressions, which are slower than the new fast token class feature method that uses the Character class to compute the token class.

The old regular-expression-based token class feature computation should be replaced with the new fast token class method.
The output of both methods is identical, so this change will not break backward compatibility, but it will increase the throughput of the name finder by roughly 10%.

In a measurement on the Leipzig corpus with 300K sentences, throughput increased from 556 sent/s to 618 sent/s.
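
To illustrate what a Character-class-based computation can look like, here is a minimal sketch. It is not the OpenNLP code: the class name TokenClassSketch, the method tokenClass, and the reduced label set ("lc", "ac", "ic", "num", "other") are assumptions made for this example only. The point is that a single pass over the characters can replace several regular expression matches.

```java
// Illustrative sketch only: class name, method name and labels are
// hypothetical and do not reproduce OpenNLP's actual implementation.
final class TokenClassSketch {

    /** Classifies a token in one pass using java.lang.Character, without regex. */
    static String tokenClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        boolean allLetters = true;
        boolean allUpper = true;
        boolean allLower = true;
        boolean allDigits = true;

        // One scan updates every flag; no pattern is compiled or matched.
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean letter = Character.isLetter(c);
            allLetters &= letter;
            allUpper &= letter && Character.isUpperCase(c);
            allLower &= letter && Character.isLowerCase(c);
            allDigits &= Character.isDigit(c);
        }

        if (allDigits) return "num";   // digits only
        if (allLower) return "lc";     // lowercase letters only
        if (allUpper) return "ac";     // uppercase letters only
        if (allLetters && Character.isUpperCase(token.charAt(0))) {
            return "ic";               // letters only, starting with a capital
        }
        return "other";
    }

    public static void main(String[] args) {
        for (String t : new String[] {"madrid", "Madrid", "NATO", "2011", "x-ray"}) {
            System.out.println(t + " -> " + tokenClass(t));
        }
    }
}
```

The sketch works on char values, so characters outside the Basic Multilingual Plane would need a code point based loop instead.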

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Closed] (OPENNLP-172) Replace the regex token class feature generation with the fast string pattern implementation

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/OPENNLP-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann closed OPENNLP-172.
---------------------------------

    Resolution: Fixed



[jira] [Closed] (OPENNLP-172) Replace the regex token class feature generation with the Character class/unicode based implementation

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/OPENNLP-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann closed OPENNLP-172.
---------------------------------

    Resolution: Fixed



[jira] [Updated] (OPENNLP-172) Replace the regex token class feature generation with the Character class/unicode based implementation

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/OPENNLP-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann updated OPENNLP-172:
----------------------------------

    Description: 
The token class feature is computed with regular expressions, which do not detect all-letter sequences correctly when they contain letters other than A to Z. The new token class feature method uses Unicode to detect letters, which works better and is faster.

The old regular-expression-based token class feature computation should be replaced with the new fast token class method.

An evaluation on our Spanish data showed that this change reduces the recall of the existing Spanish person model by 2%, while precision stays identical. When the model is retrained with this fix applied, recall increases by 6% and precision is still identical.

Recall and precision are identical on my test data for English, because it usually does not contain "special" characters.

The speedup of the name finder will be roughly 10%.
In a measurement on the Leipzig corpus with 300K sentences, throughput increased from 556 sent/s to 618 sent/s.

  was:
The token class feature is currently computed with regular expressions, which are slower than the new fast token class feature method that uses the Character class to compute the token class.

The old regular-expression-based token class feature computation should be replaced with the new fast token class method.
The output of both methods is identical, so this change will not break backward compatibility, but it will increase the throughput of the name finder by roughly 10%.

In a measurement on the Leipzig corpus with 300K sentences, throughput increased from 556 sent/s to 618 sent/s.

        Summary: Replace the regex token class feature generation with the Character class/unicode based implementation  (was: Replace the regex token class feature generation with the fast string pattern implementation)

Updated to describe the changes in feature generation and their effect on the existing Spanish NER model.
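
The difference in letter detection described in the new description can be reproduced with a small sketch. The pattern [a-zA-Z]+ is only an assumed stand-in for an ASCII-only regular expression (the pattern actually used may differ), and the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: names and the regex are assumptions,
// not taken from the OpenNLP sources.
final class LetterDetectionDemo {

    // ASCII-only "all letters" check, standing in for the regex approach.
    private static final Pattern ASCII_LETTERS = Pattern.compile("[a-zA-Z]+");

    static boolean allLettersRegex(String token) {
        return ASCII_LETTERS.matcher(token).matches();
    }

    // Unicode-aware check based on java.lang.Character.
    static boolean allLettersUnicode(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // \u00e9 is e with acute accent, \u00f1 is n with tilde.
        for (String t : new String[] {"Jose", "Jos\u00e9", "Mu\u00f1oz"}) {
            System.out.println(t + " regex=" + allLettersRegex(t)
                    + " unicode=" + allLettersUnicode(t));
        }
    }
}
```

For "Jose" both checks agree, while for the two accented names only the Character-based check reports an all-letter token. A model trained on features from one method therefore sees different features from the other, which would be consistent with the recall change reported for the Spanish model before retraining.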



[jira] [Reopened] (OPENNLP-172) Replace the regex token class feature generation with the fast string pattern implementation

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/OPENNLP-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann reopened OPENNLP-172:
-----------------------------------


It looks like the new implementation handles non-English content better, but that might cause regressions; more testing is needed before the issue can be closed.

