Posted to user@uima.apache.org by David Dearing <dd...@stottlerhenke.com> on 2009/08/18 21:15:30 UTC

How to tokenize during Annotator initialization?

Hi everyone,

I'm just getting started with UIMA and have poked through the docs and
the sandbox, but still have some questions on best/recommended practices.

A simple example of my question is with stop word processing of text.
Processing is broken up into Tokenizer -> Stemmer -> StopWordAnnotator.

The tokenizer and stemmer are straightforward.  We can create our own or
swap in modules such as the sandbox WhitespaceTokenizer or
SnowballAnnotator (stemming).

My concern is with the StopWordAnnotator: during its initialize(...), I
load a resource file that contains the list of stop words.  These stop
words need to be tokenized and stemmed as well (probably in the same
manner as the previous steps, though perhaps configurably).
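
To make that concrete, here is roughly what my initialize(...) looks like
at the moment (the "StopWordList" resource key and the one-word-per-line
file format are just my own choices, not something taken from the
examples):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

public class StopWordAnnotator extends JCasAnnotator_ImplBase {

  private Set<String> stopWords;

  @Override
  public void initialize(UimaContext context)
      throws ResourceInitializationException {
    super.initialize(context);
    stopWords = new HashSet<String>();
    try {
      // "StopWordList" is an external resource key bound in my descriptor
      BufferedReader reader = new BufferedReader(new InputStreamReader(
          context.getResourceAsStream("StopWordList"), "UTF-8"));
      String line;
      while ((line = reader.readLine()) != null) {
        // This is the gap: each entry really ought to go through the same
        // tokenizer and stemmer that the document text goes through.
        stopWords.add(line.trim().toLowerCase());
      }
      reader.close();
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // ... mark/filter tokens whose (stemmed) text is in stopWords ...
  }
}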

What is the best practice for doing this?  Specifying an aggregate
analysis engine and running it over the stop word list within the
initialize() method?  That seems a bit strange (and could get quite
complicated once later annotators have more complex processing), but I
haven't yet seen examples of this kind of complex, resource-based
annotator.
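
In other words, I'm imagining something like the helper below sitting in
the same annotator class as above, where the descriptor path is made up
and Token/getStem() stand in for whatever type and feature the plugged-in
tokenizer and stemmer actually produce:

// Hypothetical helper, called from initialize(): build a tokenizer+stemmer
// aggregate from a descriptor and run it over the raw stop word file
// contents, collecting the stemmed forms.
private Set<String> stemStopWords(String rawStopWordText)
    throws ResourceInitializationException {
  Set<String> result = new HashSet<String>();
  try {
    XMLInputSource in = new XMLInputSource("desc/TokenizerAndStemmer.xml");
    ResourceSpecifier spec =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);

    JCas listCas = ae.newJCas();
    listCas.setDocumentText(rawStopWordText);  // whole file as one "document"
    ae.process(listCas);

    // Token is a placeholder for the token type the tokenizer produces
    FSIterator tokens = listCas.getAnnotationIndex(Token.type).iterator();
    while (tokens.hasNext()) {
      Token t = (Token) tokens.next();
      result.add(t.getStem());  // stand-in for however the stem is exposed
    }
    ae.destroy();
  } catch (Exception e) {
    throw new ResourceInitializationException(e);
  }
  return result;
}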

Thanks for taking the time to read/help!
Dave

Re: How to tokenize during Annotator initialization?

Posted by Michael Tanenblatt <sl...@park-slope.net>.
You can look at the way ConceptMapper tokenizes its dictionaries, which
are external resources that get tokenized when they are loaded.  The
source is in the sandbox.
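
The general shape of it is something like the sketch below; this is not
the actual ConceptMapper source, just the external-resource pattern it
builds on, with illustrative class and method names:

// Declare the stop word list (or dictionary) as an external resource in
// the descriptor, with this class as its implementation.  The framework
// calls load() once when the resource is bound, so the tokenizing/stemming
// of the entries can happen there rather than in every annotator's
// initialize().
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.uima.resource.DataResource;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.SharedResourceObject;

public class StopWordListResource implements SharedResourceObject {

  private final Set<String> entries = new HashSet<String>();

  public void load(DataResource data) throws ResourceInitializationException {
    try {
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(data.getInputStream(), "UTF-8"));
      String line;
      while ((line = reader.readLine()) != null) {
        // Tokenize/stem each entry here -- e.g. by running the same small
        // tokenizer+stemmer aggregate used for documents.  ConceptMapper
        // does its dictionary tokenization at this load stage.
        entries.add(line.trim());
      }
      reader.close();
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    }
  }

  public Set<String> getEntries() {
    return entries;
  }
}

The annotator then declares a dependency on that resource in its
descriptor and picks it up in initialize() with something like
(StopWordListResource) aContext.getResourceObject("StopWordList"), so the
list only has to be processed once and can be shared.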
