Posted to users@opennlp.apache.org by Jake Dodd <ja...@ontopic.io> on 2014/11/19 23:08:35 UTC

RegexNameFinder when entity spans multiple tokens

Hi all,

I’m trying to implement a RegexNameFinder for money entities (to supplement results from the default OpenNLP statistical model).

The money entities will span multiple tokens (for example, “$120 billion” is tokenized as ‘$’, ‘120’, ‘billion’). I’ve verified that my regex pattern will match the phrase “$120 billion”, but when used as a pattern in RegexNameFinder, the name finder returns no results.

Do RegexNameFinders match named entities that span multiple tokens? Or are they designed to find single-token named entities?

Cheers

Jake

Re: RegexNameFinder when entity spans multiple tokens

Posted by Jake Dodd <ja...@ontopic.io>.
Found the solution to this. For future reference:

The RegexNameFinder class takes a String[] of tokens, and joins the array of tokens with a single space, reconstructing the sentence. It also maps the indices of the tokens to locations in the reconstructed sentence. Then, it matches the patterns against the reconstructed sentence. If matches are found, it uses the start and end locations (in the sentence) to pull the token indices from the map.
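To make that concrete, here is a rough sketch of the mechanism as I understand it. This is not the OpenNLP source; the class name TokenRegexSketch and its find method are made up for illustration, and it assumes a match only counts when it starts and ends exactly on token boundaries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the behaviour described above, not OpenNLP code.
public class TokenRegexSketch {

    // Returns {startToken, endTokenExclusive} pairs for each match that
    // lines up with token boundaries in the reconstructed sentence.
    public static List<int[]> find(String[] tokens, Pattern pattern) {
        Map<Integer, Integer> startAt = new HashMap<>(); // char offset -> token index
        Map<Integer, Integer> endAt = new HashMap<>();   // char offset -> token index + 1
        StringBuilder sentence = new StringBuilder();

        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sentence.append(' '); // tokens are joined with a single space
            }
            startAt.put(sentence.length(), i);
            sentence.append(tokens[i]);
            endAt.put(sentence.length(), i + 1);
        }

        List<int[]> spans = new ArrayList<>();
        Matcher m = pattern.matcher(sentence);
        while (m.find()) {
            Integer start = startAt.get(m.start());
            Integer end = endAt.get(m.end());
            // Assumption: matches that begin or end mid-token are dropped.
            if (start != null && end != null) {
                spans.add(new int[] {start, end});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = {"It", "cost", "$", "120", "billion", "."};
        Pattern spaced = Pattern.compile("\\$ \\d+ billion");
        Pattern unspaced = Pattern.compile("\\$\\d+ billion");

        for (int[] span : find(tokens, spaced)) {
            System.out.println("spaced pattern: tokens " + span[0] + " to " + span[1]);
        }
        System.out.println("unspaced pattern matches: " + find(tokens, unspaced).size());
    }
}
```

The pattern written against the original text ("$120") never sees that text, because the matcher only ever runs over the space-joined tokens.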

So, if the sentence “$120 billion” is tokenized as [“$”, “120”, “billion”], the reconstructed sentence will be “$ 120 billion”, with a space between “$” and “120”. The patterns in your RegexNameFinder need to account for this additional whitespace. YMMV.
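For example, a money pattern that tolerates the inserted space might look like the sketch below. The class name MoneyPatternSketch, the MONEY constant, and the list of magnitude words are my own choices for illustration, not anything from OpenNLP; adjust them to your tokenizer's output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example pattern; only java.util.regex is used here.
public class MoneyPatternSketch {

    // "\\s?" after the dollar sign allows for the space that appears once
    // the tokens "$" and "120" are joined back together.
    public static final Pattern MONEY = Pattern.compile(
            "\\$\\s?\\d+(?:\\.\\d+)?(?:\\s(?:million|billion|trillion))?");

    // Returns the first money-like substring, or null if there is none.
    public static String firstMatch(String text) {
        Matcher m = MONEY.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Token-joined form, as the name finder would see it.
        System.out.println(firstMatch("It cost $ 120 billion ."));
        // Original untokenized form still matches because the space is optional.
        System.out.println(firstMatch("It cost $120 billion."));
    }
}
```

Making the space optional keeps one pattern usable both for testing against raw text and as one of the patterns handed to RegexNameFinder.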

Cheers

Jake
