Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2014/01/29 18:22:09 UTC

[jira] [Commented] (OPENNLP-643) Provide default rule based (regex) name finders (phone num, url, email, coords)

    [ https://issues.apache.org/jira/browse/OPENNLP-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885550#comment-13885550 ] 

Joern Kottmann commented on OPENNLP-643:
----------------------------------------

I think it is a really good idea to offer default regexes for the mentioned types.

The existing RegexNameFinder could be extended to support multiple types. Or a user could use an ensemble of them to detect multiple types. In my opinion we should support both.

You are right, we should add support for instantiating the RegexNameFinder from some kind of file that contains the patterns, instead of forcing the user to build the patterns in code (again, both should be supported).
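To make the file idea concrete, here is a rough sketch of a loader for the "type, then regex" line format described in the issue below. It uses only the JDK; the class name PatternFileSketch, the load method, and the handling of blank lines and "#" comment lines are my assumptions for illustration, not existing OpenNLP code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class PatternFileSketch {

  // Reads lines of the form "<type> <regex>" and groups the compiled
  // patterns by type. A LinkedHashMap keeps the types in file order, and
  // each list keeps the regexes of one type in file order.
  static Map<String, List<Pattern>> load(Reader in) throws IOException {
    Map<String, List<Pattern>> byType = new LinkedHashMap<>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      // Assumption: blank lines and lines starting with '#' are skipped.
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      // Split into the type (no spaces) and the rest of the line (the regex).
      String[] parts = line.split("\\s+", 2);
      if (parts.length < 2) {
        throw new IOException("Expected '<type> <regex>' but got: " + line);
      }
      byType.computeIfAbsent(parts[0], k -> new ArrayList<>())
            .add(Pattern.compile(parts[1]));
    }
    return byType;
  }
}
```

Each entry of the resulting map could then be turned into a Pattern[] and passed, together with its type, to a RegexNameFinder in the same way the find method in the issue description does.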

To offer defaults we could create a factory which configures the desired RegexNameFinder.

For example:
RegexNameFinder.createDefaultNameFinder(DefaultPatterns.EMAIL, DefaultPatterns.URL, DefaultPatterns.PHONE)
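
A rough, self-contained sketch of what such a factory might look like. To keep it runnable without OpenNLP it works on raw text and character offsets rather than on tokens and Span objects, and it lives in a stand-in class instead of RegexNameFinder. The DefaultPatterns enum, the Match class, and the three regexes are placeholders I made up for illustration; real defaults would need far more careful patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a RegexNameFinder configured with defaults.
class DefaultRegexFinderSketch {

  // Placeholder defaults: deliberately simplistic regexes, for illustration only.
  enum DefaultPatterns {
    EMAIL("email", "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"),
    URL("url", "https?://\\S+"),
    PHONE("phone number", "\\d{3}-\\d{3}-\\d{4}");

    final String type;
    final Pattern pattern;

    DefaultPatterns(String type, String regex) {
      this.type = type;
      this.pattern = Pattern.compile(regex);
    }
  }

  // A typed match, given as character offsets [start, end) into the input.
  static final class Match {
    final String type;
    final int start;
    final int end;

    Match(String type, int start, int end) {
      this.type = type;
      this.start = start;
      this.end = end;
    }
  }

  private final DefaultPatterns[] patterns;

  private DefaultRegexFinderSketch(DefaultPatterns[] patterns) {
    this.patterns = patterns.clone();
  }

  // The factory: the caller only names the default types they want.
  static DefaultRegexFinderSketch createDefaultNameFinder(DefaultPatterns... patterns) {
    return new DefaultRegexFinderSketch(patterns);
  }

  // Runs every selected pattern over the text, in the order the types were given.
  List<Match> find(String text) {
    List<Match> matches = new ArrayList<>();
    for (DefaultPatterns p : patterns) {
      Matcher m = p.pattern.matcher(text);
      while (m.find()) {
        matches.add(new Match(p.type, m.start(), m.end()));
      }
    }
    return matches;
  }
}
```

With this shape the user never touches a regex unless they want to, and the file-based configuration could feed the same finder through a second factory method.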

What do you think about that?

> Provide default rule based (regex) name finders (phone num, url, email, coords)
> -------------------------------------------------------------------------------
>
>                 Key: OPENNLP-643
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-643
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Name Finder
>    Affects Versions: 1.6.0
>            Reporter: Mark Giaconia
>            Assignee: Mark Giaconia
>            Priority: Minor
>
> It would be nice if OpenNLP came with some basic rule-based name finders (RegexNameFinders) for basic types. Initially I would like to create an engine that runs phone number, email, URL, MGRS, and DD Lat Lon.
> Also, we need a framework for loading additional regexes other than the defaults mentioned above.
> Here is my initial thought... a class that has a set of default types and patterns in a map that runs the RegexNameFinder, with optional constructors to override the map, or read from a config file.
> Let me know what you think...
> /**
>  *
>  * Constructs a set of RegexNameFinders from configuration or from a provided Map
>  */
> public class RuleBasedEntityFinderEngine {
>   private static final String PHONE_REGEX = "";
>   private static final String EMAIL_REGEX = "";
>   private static final String URL_REGEX = "";
>   private static final String MGRS_REGEX = "";
>   private static final String DDLATLON_REGEX = "";
>   private static final String PHONE_REGEX_TYPE = "phone number";
>   private static final String EMAIL_REGEX_TYPE = "email";
>   private static final String URL_REGEX_TYPE = "url";
>   private static final String MGRS_REGEX_TYPE = "MGRS coord";
>   private static final String DDLATLON_REGEX_TYPE = "DD coord";
>   private Map<String, Pattern[]> typePatternMap = new HashMap<>();
>   Properties properties;
>   /**
>    * Loads a set of patterns via configuration. Each line of the file should hold
>    * the entity type (no spaces), followed by the regex. For types that have
>    * multiple regexes, repeat the type on each line. For example:
>    *   phone_num <phonenum regex1>
>    *   phone_num <phonenum regex2>
>    *   email <regex1>
>    * Entries are loaded in order from top to bottom of the file, so if order
>    * matters, list the regexes accordingly.
>    *
>    * @param properties      the InputStream of properties from which to load the
>    *                        regexes
>    * @param includeDefaults when true, adds the defaults to the map. If there is a
>    *                        key collision in the map, the default will override.
>    * @throws IOException if the properties stream cannot be read
>    */
>   public RuleBasedEntityFinderEngine(InputStream properties, boolean includeDefaults) throws IOException {
>     this.properties = new Properties();
>     this.properties.load(properties);
>     init();
>   }
>   /**
>    *
>    * @param typePatternMap  a map of name types (e.g. phone number, email...) to
>    *                        an array of regex Patterns. This map is the basis
>    *                        for instantiating RegexNameFinders
>    * @param includeDefaults when true, adds the defaults to the map. If there is a
>    *                        key collision in the map, the default will override.
>    */
>   public RuleBasedEntityFinderEngine(Map<String, Pattern[]> typePatternMap, boolean includeDefaults) {
>     this.typePatternMap = typePatternMap;
>     if (includeDefaults) {
>       init();
>     }
>   }
>   /**
>    * Loads the default regexes and types into the map
>    */
>   private void init() {
>     if (properties != null) {
>       // TODO: get the regexes from config somewhere
>     } else {
>       // load the default regexes
>       typePatternMap.put(PHONE_REGEX_TYPE, new Pattern[]{Pattern.compile(PHONE_REGEX)});
>       typePatternMap.put(EMAIL_REGEX_TYPE, new Pattern[]{Pattern.compile(EMAIL_REGEX)});
>       typePatternMap.put(URL_REGEX_TYPE, new Pattern[]{Pattern.compile(URL_REGEX)});
>       typePatternMap.put(MGRS_REGEX_TYPE, new Pattern[]{Pattern.compile(MGRS_REGEX)});
>       typePatternMap.put(DDLATLON_REGEX_TYPE, new Pattern[]{Pattern.compile(DDLATLON_REGEX)});
>     }
>   }
>   public Map<String, Span[]> find(String[] tokens) {
>     Map<String, Span[]> outSpans = new HashMap<>();
>     if (typePatternMap != null) {
>       for (Map.Entry<String, Pattern[]> finder : typePatternMap.entrySet()) {
>         RegexNameFinder nf = new RegexNameFinder(finder.getValue(), finder.getKey());
>         Span[] spans = nf.find(tokens);
>         outSpans.put(finder.getKey(), spans);
>       }
>     }
>     return outSpans;
>   }
>   public Map<String, Pattern[]> getTypePatternMap() {
>     // init() is not called here: it would re-add the defaults even when
>     // includeDefaults was false
>     return typePatternMap;
>   }
> }


