Posted to issues@opennlp.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/09/27 01:59:00 UTC
[jira] [Commented] (OPENNLP-1221) FeatureGeneratorUtil.tokenFeature() is too specific for some languages

    [ https://issues.apache.org/jira/browse/OPENNLP-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629651#comment-16629651 ] 

ASF GitHub Bot commented on OPENNLP-1221:
-----------------------------------------

GitHub user kojisekig opened a pull request:

    https://github.com/apache/opennlp-addons/pull/3

    OPENNLP-1221: FeatureGeneratorUtil.tokenFeature() is too specific for…

    … some languages

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kojisekig/opennlp-addons OPENNLP-1221

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/opennlp-addons/pull/3.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3
    
----
commit 066b5a1f1c2ba972c7fdd025cb4d4689a0e04e97
Author: koji <ko...@...>
Date:   2018-09-27T01:56:13Z

    OPENNLP-1221: FeatureGeneratorUtil.tokenFeature() is too specific for some languages

----


> FeatureGeneratorUtil.tokenFeature() is too specific for some languages
> ----------------------------------------------------------------------
>
>                 Key: OPENNLP-1221
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1221
>             Project: OpenNLP
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>
> As I described in OPENNLP-1197, for Japanese NER we usually use only DIGIT, HIRA (あ, い, う, え, お, etc.), KATA (ア, イ, ウ, エ, オ, etc.), ALPHA and OTHER as token classes. The classes that FeatureGeneratorUtil.tokenFeature() currently provides are too specific. For example, I don't need to distinguish among lc (all lowercase letters), ac (all capital letters) and ic (initial capital letter).
> As an experiment, I applied the following patch to avoid generating overly specific token classes:
> {code}
> diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> index e6b8af95..405938d1 100644
> --- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> +++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> @@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
>    private static final String TOKEN_AND_CLASS_PREFIX = "w&c";
>  
>    private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
> +  private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
> +  private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
>  
>    /**
>     * Generates a class name for the specified token.
> @@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
>      else if (pattern.isAllKatakana()) {
>        feat = "jak";
>      }
> -    else if (pattern.isAllLowerCaseLetter()) {
> -      feat = "lc";
> +    else if (pDigit.matcher(token).find()) {
> +      feat = "digit";
>      }
> -    else if (pattern.digits() == 2) {
> -      feat = "2d";
> -    }
> -    else if (pattern.digits() == 4) {
> -      feat = "4d";
> -    }
> -    else if (pattern.containsDigit()) {
> -      if (pattern.containsLetters()) {
> -        feat = "an";
> -      }
> -      else if (pattern.containsHyphen()) {
> -        feat = "dd";
> -      }
> -      else if (pattern.containsSlash()) {
> -        feat = "ds";
> -      }
> -      else if (pattern.containsComma()) {
> -        feat = "dc";
> -      }
> -      else if (pattern.containsPeriod()) {
> -        feat = "dp";
> -      }
> -      else {
> -        feat = "num";
> -      }
> -    }
> -    else if (pattern.isAllCapitalLetter()) {
> -      if (token.length() == 1) {
> -        feat = "sc";
> -      }
> -      else {
> -        feat = "ac";
> -      }
> -    }
> -    else if (capPeriod.matcher(token).find()) {
> -      feat = "cp";
> -    }
> -    else if (pattern.isInitialCapitalLetter()) {
> -      feat = "ic";
> +    else if (pAlpha.matcher(token).find()) {
> +      feat = "alpha";
>      }
>      else {
>        feat = "other";
> {code}
> With this patch applied, total F1 increased from 82.00% to 82.13%. The gain is small, but I think there is still a lot of room to tune and improve performance.
> Fortunately, I was able to add the japanese-addon project to opennlp-addons in the previous ticket, so I'd like to add some programs to japanese-addon that generate simpler token classes.
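
For illustration only, the five coarse token classes named in the description (DIGIT, HIRA, KATA, ALPHA, OTHER) could be sketched as a small standalone classifier like the one below. This is a minimal sketch under stated assumptions, not the code in the pull request: the class name SimpleTokenClassSketch, the method tokenClass() and the returned label strings are hypothetical, and the Hiragana and Katakana checks use java.util.regex Unicode script properties rather than OpenNLP's StringPattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of OpenNLP or opennlp-addons.
class SimpleTokenClassSketch {

  // Same style of whole-token patterns as pDigit / pAlpha in the patch.
  private static final Pattern DIGIT = Pattern.compile("\\p{IsDigit}+");
  private static final Pattern ALPHA = Pattern.compile("\\p{IsAlphabetic}+");
  // Unicode script properties (assumption: script-based matching is acceptable here).
  private static final Pattern HIRA = Pattern.compile("\\p{IsHiragana}+");
  private static final Pattern KATA = Pattern.compile("\\p{IsKatakana}+");

  // Returns one of five coarse class labels (label strings are assumed).
  static String tokenClass(String token) {
    if (DIGIT.matcher(token).matches()) {
      return "digit";
    }
    // Kana must be tested before ALPHA: kana characters are also alphabetic.
    if (HIRA.matcher(token).matches()) {
      return "hira";
    }
    if (KATA.matcher(token).matches()) {
      return "kata";
    }
    if (ALPHA.matcher(token).matches()) {
      return "alpha";
    }
    return "other";
  }

  public static void main(String[] args) {
    // Unicode escapes keep the source ASCII-only.
    String[] tokens = {"2018", "\u3042\u3044\u3046", "\u30A2\u30A4\u30A6", "OpenNLP", "a-1"};
    for (String token : tokens) {
      System.out.println(token + " -> " + tokenClass(token));
    }
  }
}
```

The order of the checks matters in this sketch, because a pattern based on the Alphabetic property also matches kana; testing the script-specific patterns first keeps HIRA and KATA distinct from ALPHA.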



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)