Posted to user@uima.apache.org by Tommaso Teofili <to...@gmail.com> on 2011/08/01 15:49:52 UTC

Re: New features for the Apache UIMA Regular Expression Annotator

Hello Nicolas,

2011/7/29 Nicolas Hernandez <ni...@gmail.com>

> Hi Everyone
>
> I tested the Apache UIMA Regular Expression Annotator to see how well
> it can express recognition rules, using named entity recognition as a
> test case.
> Given that it only works on text characters, I mainly encountered two
> limitations. I'd like to know what you think of them, and whether you
> think future versions of the annotator could address them.
>
> Roughly speaking, my problems started when I tried to handle several
> concepts and when my rules reached a high level of complexity.
>
> First, since the regex variables are themselves regexes, I used them as a
> dictionary of elements for my rules (e.g. <variable name="weekdays"
>   value="Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"/>).
> The elements are also regexes, which has some advantages (e.g. <variable
>   name="weekdays"
>   value="[mM]onday|[tT]uesday|[wW]ednesday|[tT]hursday|[fF]riday|[sS]aturday|[sS]unday"/>).
> The major drawback appears when your dictionary has several hundred or
> several thousand lexical entries. It is tedious to keep the dictionary
> up to date, or even to handle and edit the file.
> It would be great if the variable values could also be defined in
> external files (one entry per line).
> This solution would also allow variables to be defined once and used
> as many times as you want in distinct rule files (which also makes
> the rules easier to keep up to date).
>

I think that would be nice to have.
It may be helpful to define them as external resources in the Annotator
descriptor so that they could be reused even more broadly.
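
Until something like that exists, one possible workaround is to generate
the variable from a word list as a preprocessing step. The following is
only a sketch in Python, not a feature of the annotator: the function names
are hypothetical, and it assumes a file with one literal entry per line
whose content should end up in the value attribute of a <variable> element
like the ones quoted above.

```python
import re
from xml.sax.saxutils import quoteattr


def alternation_from_lines(lines):
    """Build a regex alternation from literal entries, one per line."""
    entries = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        # Entries are treated as literal text, so escape regex metacharacters.
        entries.append(re.escape(entry))
    # Longest entries first, so a short entry cannot shadow a longer one.
    entries.sort(key=len, reverse=True)
    return "|".join(entries)


def variable_element(name, value):
    """Render a <variable> element; quoteattr handles XML attribute quoting."""
    return "<variable name=%s value=%s/>" % (quoteattr(name), quoteattr(value))


# In practice the lines would come from open("weekdays.txt"); a list is
# used here so the sketch is self-contained.
weekdays = alternation_from_lines(["Monday", "Tuesday", "", "Sunday"])
element = variable_element("weekdays", weekdays)
pattern = re.compile(r"\b(?:" + weekdays + r")\b")
```

The generated element can then be pasted or templated into each rule file
that needs it, so the word list itself is maintained in one place.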


> Second, it is possible to set a priority order between rules of the same
> concept but not between concepts. In practice some distinct concepts
> may have similar rules (e.g. person entity and location entity), and you
> may wish to set a priority between them to avoid ambiguity that must be
> handled outside of the annotator (currently, to avoid this situation, you
> have to define the recognition rules for the person and the location
> entities in the same concept, which is not conceptually acceptable).
> Offering a way to set priority between concepts raises the
> problem of how to do it when the concepts are defined in distinct
> files.
> I agree the ambiguity problem may be handled in subsequent annotators.
>

Right, it seems to me that resolving such ambiguity within a single annotator
would assign too many responsibilities to it. In my opinion, defining a
useful ambiguity resolution policy language would not be trivial and could
still fall short in certain situations. So I'd say the best solution is to
give similar concepts for the same entity slightly different names
(e.g. PersonConceptApproach1/PersonConceptApproach2) and create the
PersonConcept in a 'subsequent' annotator that 'evaluates' which one is
best.
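
To make the idea concrete, here is a rough sketch of what such a
'subsequent' step could do. It is written in Python for brevity and is not
UIMA code: a real implementation would be an annotator reading the candidate
annotations from the CAS. The Candidate type, the resolve function and the
"priority list" policy are all assumptions made for illustration; other
evaluation policies are equally possible.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    begin: int    # start offset in the text
    end: int      # end offset (exclusive)
    concept: str  # name of the concept that produced the match


def resolve(candidates, priority):
    """Keep, among overlapping candidates, the one from the preferred concept."""
    rank = {name: i for i, name in enumerate(priority)}
    # Best candidates first: preferred concept, then longer span, then position.
    ordered = sorted(
        candidates,
        key=lambda c: (rank[c.concept], -(c.end - c.begin), c.begin),
    )
    kept = []
    for cand in ordered:
        # Accept a candidate only if it overlaps nothing already accepted.
        if all(cand.end <= k.begin or cand.begin >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.begin)


candidates = [
    Candidate(0, 5, "PersonConceptApproach2"),
    Candidate(0, 5, "PersonConceptApproach1"),
    Candidate(10, 15, "PersonConceptApproach2"),
]
result = resolve(
    candidates, ["PersonConceptApproach1", "PersonConceptApproach2"]
)
```

Each surviving candidate would then be turned into the final PersonConcept
annotation, which keeps the regex rules themselves free of any
cross-concept policy.
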
Just my 2 cents,
Tommaso


>
> Regards
>
> /Nicolas
>