Posted to issues@opennlp.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2018/09/27 02:00:00 UTC

[jira] [Resolved] (OPENNLP-1221) FeatureGeneratorUtil.tokenFeature() is too specific for some languages

     [ https://issues.apache.org/jira/browse/OPENNLP-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1221.
-------------------------------------
    Resolution: Fixed
      Assignee: Koji Sekiguchi

The fix was added to opennlp-addons/japanese-addon.

> FeatureGeneratorUtil.tokenFeature() is too specific for some languages
> ----------------------------------------------------------------------
>
>                 Key: OPENNLP-1221
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1221
>             Project: OpenNLP
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>
> As I described in OPENNLP-1197, for Japanese NER we usually use only DIGIT, HIRA (あ, い, う, え, お, etc.), KATA (ア, イ, ウ, エ, オ, etc.), ALPHA and OTHER as token classes. The classes that FeatureGeneratorUtil.tokenFeature() currently provides are too specific. For example, I don't need to distinguish among lc (all lowercase letters), ac (all capital letters) and ic (initial capital letter).
> As a trial, when I applied the following patch to avoid generating overly specific token classes:
> {code}
> diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> index e6b8af95..405938d1 100644
> --- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> +++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> @@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
>    private static final String TOKEN_AND_CLASS_PREFIX = "w&c";
>  
>    private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
> +  private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
> +  private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
>  
>    /**
>     * Generates a class name for the specified token.
> @@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
>      else if (pattern.isAllKatakana()) {
>        feat = "jak";
>      }
> -    else if (pattern.isAllLowerCaseLetter()) {
> -      feat = "lc";
> +    else if (pDigit.matcher(token).find()) {
> +      feat = "digit";
>      }
> -    else if (pattern.digits() == 2) {
> -      feat = "2d";
> -    }
> -    else if (pattern.digits() == 4) {
> -      feat = "4d";
> -    }
> -    else if (pattern.containsDigit()) {
> -      if (pattern.containsLetters()) {
> -        feat = "an";
> -      }
> -      else if (pattern.containsHyphen()) {
> -        feat = "dd";
> -      }
> -      else if (pattern.containsSlash()) {
> -        feat = "ds";
> -      }
> -      else if (pattern.containsComma()) {
> -        feat = "dc";
> -      }
> -      else if (pattern.containsPeriod()) {
> -        feat = "dp";
> -      }
> -      else {
> -        feat = "num";
> -      }
> -    }
> -    else if (pattern.isAllCapitalLetter()) {
> -      if (token.length() == 1) {
> -        feat = "sc";
> -      }
> -      else {
> -        feat = "ac";
> -      }
> -    }
> -    else if (capPeriod.matcher(token).find()) {
> -      feat = "cp";
> -    }
> -    else if (pattern.isInitialCapitalLetter()) {
> -      feat = "ic";
> +    else if (pAlpha.matcher(token).find()) {
> +      feat = "alpha";
>      }
>      else {
>        feat = "other";
> {code}
> total F1 increased from 82.00% to 82.13%. The gain may be trivial, but I think there is still a lot of room to tune and improve performance.
> Fortunately, I was able to add the japanese-addon project to opennlp-addons in the previous ticket, so I'd like to add some programs that generate simpler token classes to japanese-addon.
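
The coarse token classes described above (DIGIT, HIRA, KATA, ALPHA, OTHER) could be sketched as a standalone helper along the following lines. This is only an illustration, not the code that went into japanese-addon: the class name `SimpleTokenClass`, the method `tokenClass`, and the label strings are hypothetical. It assumes that the Unicode Hiragana and Katakana blocks are a good enough approximation of the two kana classes, and it reuses the `IsDigit` and `IsAlphabetic` properties from the trial patch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP or japanese-addon.
final class SimpleTokenClass {

  // Same Unicode properties as the trial patch.
  private static final Pattern DIGIT = Pattern.compile("^\\p{IsDigit}+$");
  private static final Pattern ALPHA = Pattern.compile("^\\p{IsAlphabetic}+$");

  // Assumption: the Unicode blocks are used rather than the scripts, so that
  // the prolonged sound mark (U+30FC) is treated as part of a katakana token.
  private static final Pattern HIRA = Pattern.compile("^\\p{InHiragana}+$");
  private static final Pattern KATA = Pattern.compile("^\\p{InKatakana}+$");

  private SimpleTokenClass() {
  }

  /** Returns one of "digit", "hira", "kata", "alpha" or "other". */
  static String tokenClass(String token) {
    if (DIGIT.matcher(token).find()) {
      return "digit";
    }
    // The kana checks must come before ALPHA: kana are alphabetic too.
    if (HIRA.matcher(token).find()) {
      return "hira";
    }
    if (KATA.matcher(token).find()) {
      return "kata";
    }
    // Note: kanji are also alphabetic, so all-kanji tokens end up here.
    if (ALPHA.matcher(token).find()) {
      return "alpha";
    }
    return "other";
  }

  public static void main(String[] args) {
    for (String token : new String[] {"2018", "あい", "コーヒー", "Tokyo", "A-1"}) {
      System.out.println(token + " -> " + tokenClass(token));
    }
  }
}
```

The order of the checks matters, because hiragana and katakana characters also carry the Alphabetic property; testing for alphabetic tokens first would label every kana token as "alpha".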


