Posted to issues@opennlp.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2018/09/27 02:00:00 UTC
[jira] [Resolved] (OPENNLP-1221)
FeatureGeneratorUtil.tokenFeature() is too specific for some languages
[ https://issues.apache.org/jira/browse/OPENNLP-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Sekiguchi resolved OPENNLP-1221.
-------------------------------------
Resolution: Fixed
Assignee: Koji Sekiguchi
Added the fix to opennlp-addons/japanese-addon.
> FeatureGeneratorUtil.tokenFeature() is too specific for some languages
> ----------------------------------------------------------------------
>
> Key: OPENNLP-1221
> URL: https://issues.apache.org/jira/browse/OPENNLP-1221
> Project: OpenNLP
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
>
> As I described in OPENNLP-1197, for Japanese NER we usually use only DIGIT, HIRA (あ, い, う, え, お, etc.), KATA (ア, イ, ウ, エ, オ, etc.), ALPHA and OTHER as token classes. What FeatureGeneratorUtil.tokenFeature() currently provides is too specific. For example, I don't need to distinguish among lc (all lowercase letters), ac (all capital letters) and ic (initial capital letter).
> As a trial, when I applied the following patch to avoid generating overly specific token classes:
> {code}
> diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> index e6b8af95..405938d1 100644
> --- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> +++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> @@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
> private static final String TOKEN_AND_CLASS_PREFIX = "w&c";
>
> private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
> + private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
> + private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
>
> /**
> * Generates a class name for the specified token.
> @@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
> else if (pattern.isAllKatakana()) {
> feat = "jak";
> }
> - else if (pattern.isAllLowerCaseLetter()) {
> - feat = "lc";
> + else if (pDigit.matcher(token).find()) {
> + feat = "digit";
> }
> - else if (pattern.digits() == 2) {
> - feat = "2d";
> - }
> - else if (pattern.digits() == 4) {
> - feat = "4d";
> - }
> - else if (pattern.containsDigit()) {
> - if (pattern.containsLetters()) {
> - feat = "an";
> - }
> - else if (pattern.containsHyphen()) {
> - feat = "dd";
> - }
> - else if (pattern.containsSlash()) {
> - feat = "ds";
> - }
> - else if (pattern.containsComma()) {
> - feat = "dc";
> - }
> - else if (pattern.containsPeriod()) {
> - feat = "dp";
> - }
> - else {
> - feat = "num";
> - }
> - }
> - else if (pattern.isAllCapitalLetter()) {
> - if (token.length() == 1) {
> - feat = "sc";
> - }
> - else {
> - feat = "ac";
> - }
> - }
> - else if (capPeriod.matcher(token).find()) {
> - feat = "cp";
> - }
> - else if (pattern.isInitialCapitalLetter()) {
> - feat = "ic";
> + else if (pAlpha.matcher(token).find()) {
> + feat = "alpha";
> }
> else {
> feat = "other";
> {code}
> the total F1 increased from 82.00% to 82.13%. The gain may be small, but I think there is still plenty of room to tune and improve performance.
> Fortunately, I was able to add the japanese-addon project to opennlp-addons in the previous ticket, so I'd like to add some programs that generate simpler token classes to japanese-addon.
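For illustration only, the five coarse token classes named in the description (DIGIT, HIRA, KATA, ALPHA, OTHER) could be produced by a standalone helper along the following lines. This is a minimal sketch, not the code that was added to japanese-addon: the class name SimpleTokenClass, the method name tokenClass and the lowercase labels are hypothetical, and the use of Unicode blocks for hiragana and katakana is an assumption of this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP or japanese-addon.
class SimpleTokenClass {

  // Same Unicode properties as the trial patch in the description.
  private static final Pattern DIGIT = Pattern.compile("\\p{IsDigit}+");
  private static final Pattern ALPHA = Pattern.compile("\\p{IsAlphabetic}+");

  // Assumption: Unicode blocks are used here so that the prolonged
  // sound mark (U+30FC) counts as katakana.
  private static final Pattern HIRA = Pattern.compile("\\p{InHiragana}+");
  private static final Pattern KATA = Pattern.compile("\\p{InKatakana}+");

  // Returns one of "digit", "hira", "kata", "alpha" or "other".
  static String tokenClass(String token) {
    if (DIGIT.matcher(token).matches()) {
      return "digit";
    }
    // Kana must be tested before ALPHA, because kana characters
    // also carry the Alphabetic property.
    if (HIRA.matcher(token).matches()) {
      return "hira";
    }
    if (KATA.matcher(token).matches()) {
      return "kata";
    }
    if (ALPHA.matcher(token).matches()) {
      return "alpha";
    }
    return "other";
  }
}
```

With this sketch, tokens such as "Tokyo", "tokyo" and "TOKYO" all fall into the single class "alpha", which is the kind of coarsening the description asks for in place of lc, ac and ic.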