Posted to user@uima.apache.org by Michael A Tanenblatt <mt...@us.ibm.com> on 2008/05/08 17:19:37 UTC

Any interest in this as an open source project?

My group would like to offer the following UIMA component, ConceptMapper,
as an open source offering into the UIMA sandbox, assuming there is
interest from the community:

ConceptMapper is a token-based dictionary lookup UIMA component. It was
designed specifically to allow any external tokenizer that is a UIMA
component to be used to tokenize its dictionary. Using the same tokenizer
for both the dictionary and subsequent text processing prevents
situations where a dictionary entry is missed, even though it exists,
because it was tokenized differently than the text being processed.

ConceptMapper is highly configurable, in terms of:
 * the way dictionary entries are mapped to resultant annotations
 * the way input documents are processed
 * the availability of multiple lookup strategies
 * its various output options.

Additionally, a set of post-processing filters is supplied, as well as an
interface for easily creating new filters. This allows for overgenerating
results during the lookup phase, if so desired, and then reducing the result
set according to particular rules.
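
To give a feel for the filter mechanism, here is a minimal sketch of what
such a filter interface could look like (the names below are illustrative
only, not necessarily the actual ConceptMapper API):

import java.util.List;
import org.apache.uima.jcas.tcas.Annotation;

public interface ResultFilter {
    // Given the candidate matches overgenerated during the lookup
    // phase, return the subset to keep as final results.
    List<Annotation> filter(List<Annotation> candidates);
}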

More details:

The structure of the dictionary itself is quite flexible. Entries can have
any number of variants (synonyms), and arbitrary features can be associated
with dictionary entries. Individual variants inherit features from the
parent token (i.e., the canonical form), but can override them or add
additional features. In the following sample dictionary entry, there are 5 variants of
the canonical form, and as described earlier, each inherits the SemClass
and POS attributes from the canonical form, with the exception of the
variant "mesenteric fibromatosis (c48.1)", which overrides the value of the
SemClass attribute (this is somewhat of a contrived example, just to make
that point):

<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
   <variant base="abdominal fibromatosis" />
   <variant base="abdominal desmoid" />
   <variant base="mesenteric fibromatosis (c48.1)"
SemClass="Diagnosis-Site" />
   <variant base="mesenteric fibromatosis" />
   <variant base="retroperitoneal fibromatosis" />
</token>

Input tokens are processed one span at a time, where both the token and
span (usually a sentence) annotation types are configurable. Additionally,
the particular feature of the token annotation to use for lookups can be
specified; otherwise its covered text is used. Other input configuration
settings include whether to use case-sensitive matching, an optional class
name of a stemmer to apply to the tokens, and a list of stop words to ignore
during lookup. One additional input control mechanism is the ability to
skip tokens during lookups based on particular feature values. In this way,
it is easy to skip, for example, all tokens with particular part of speech
tags, or with some previously computed semantic class.
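
For example, with "of" and "the" in the stop word list, a (hypothetical)
dictionary entry tokenized as

    carcinoma breast

would also match the text

    carcinoma of the breast

since stop word tokens are simply skipped during lookup.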

Output is in the form of new annotations, and the type of resulting
annotations can be specified in a descriptor file. The mapping from
dictionary entry attributes to the result annotation features can also be
specified. Additionally, a string containing the matched text, a list of
matched tokens, and the span enclosing the match can be specified to be set
in the result annotations. It is also possible to indicate dictionary
attributes to write back into each of the matched tokens.

Dictionary lookup is controlled by three parameters in the descriptor: one
allows for order-independent lookup (i.e., A B == B A), another
toggles between finding only the longest match vs. finding all possible
matches, and the third specifies the search strategy, of which there
are three. The default search strategy only considers contiguous tokens
(not including tokens from the stop word list or otherwise skipped tokens),
and then begins the subsequent search after the longest match. The second
strategy ignores non-matching tokens, allowing for disjoint
matches, so that a dictionary entry of

    A C

would match against the text

    A B C

As with the default search strategy, the subsequent search begins after the
longest match. The final search strategy is identical to the previous,
except that subsequent searches begin one token ahead, instead of after the
previous match. This enables overlapped matching.
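
As a concrete illustration of the difference between the strategies: given
the dictionary entries

    A B
    B C

and the input text

    A B C

the first two strategies find only "A B" (the subsequent search resumes
after that match, at "C"), while the final strategy also finds "B C",
since its subsequent search resumes just one token ahead, at "B".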


--
Michael Tanenblatt
IBM T.J. Watson Research Center
19 Skyline Drive
P.O. Box 704
Hawthorne, NY 10532
USA
Tel: +1 (914) 784 7030 t/l 863 7030
Fax: +1 (914) 784 6054
mtan@us.ibm.com


Re: Any interest in this as an open source project?

Posted by David Buttler <bu...@llnl.gov>.
I wrote a tool similar to this, but with a bit less functionality, so I 
think this type of tool is very useful and I would be interested in 
contributing.  The key features that I would look for are:
1) it is fast
2) it can handle very large dictionaries without slowing down. For
example, you might want to load UMLS into a dictionary (assuming you had
sufficient memory).
You mentioned that you support 10K entries -- is the runtime dependent 
on the number of entries in the dictionary or on the number of token 
matches?  Is the internal data structure some type of state machine?

It wasn't clear to me if you supported boolean operators but perhaps 
this is the type of functionality that you would put in a post filter?  
e.g. you match 'colon' and 'rectum' separately and only produce results 
when both matches are made, but not when 'colonoscopy' is present.

So, if you could skip tokens, would it be possible for an entire 
document to match assuming the dictionary contained 'A B' and the first 
token in the document is 'A' and the last token is 'B'?  Or do you limit 
the match to a window of some type?  If it is a window, is the window
defined by the data (e.g. paragraph markers) or by the dictionary (e.g.
N tokens)?

Another feature that seems useful is token-based regular expressions 
(e.g. matching 'run*' or '199?').  This feature really killed 
performance when I added it to my tool; perhaps you have a better way of 
approaching that requirement.

In any case, it seems very interesting.
Dave



Re: Any interest in this as an open source project?

Posted by Marshall Schor <ms...@schor.com>.
Looks right.  Minor comments below

-Marshall

Thilo Goetz wrote:
> Roberto Franchini wrote:
>> On Tue, May 13, 2008 at 3:55 PM, Michael Baessler 
>> <mb...@apache.org> wrote:
>>> Altogether sounds good to me, I'm interested :-)
>>>
>>
>> I showed this thread to my linguistics colleagues and we are
>> interested too :)
>
> Ok, it seems there is general interest in accepting
> this code donation.  We should move this discussion
> to the developers' list.  Basically, here are the
> steps we need to follow:
>
> 1) Michael (Tanenblatt) creates a JIRA issue and attaches
> the source code (as a zip file). 
Also attach a checksum key for the submission.  See 
http://issues.apache.org/jira/browse/UIMA-689 for an example.
>
> 2) UIMA committers and other interested parties review
> the source code and VOTE to accept the donation.
>
> 3) Michael creates a software grant for ConceptMapper and
> sends it to the ASF (we'll help).  Of course you need
> your company's permission to do this (and for step 1).
>
> 4) We (UIMA PMC) do the first part of the
> IP clearance due diligence process.
An "officer of the foundation" does the final part.
>
> 5) We import the code into SVN and Apache development can
> start.
This step is dependent on the software grant being acknowledged as 
recorded by the ASF secretary or an appropriate surrogate.
>
> Steps 4 and 5 can proceed in parallel.
>
> It is not necessary to do any code clean-up before you
> submit the code, other than clean-up you may want to do
> from your side.  In particular, it is not necessary to
> rename packages and insert Apache license headers.  This
> can be done when the code is already in the Apache source
> repository.
>


Re: Any interest in this as an open source project?

Posted by Michael Baessler <mb...@apache.org>.
I spent some time looking at the ConceptMapper code.
I think some of the base concepts are similar to the Sandbox DictionaryAnnotator :-)

The current DictionaryAnnotator has defined interfaces for working with different kinds of dictionaries;
maybe these can also be used to support the dictionaries provided by ConceptMapper, even though the
matching concepts of the two annotators are different. DictionaryAnnotator implements the matching
behind the Dictionary interface to support different dictionary types, while
ConceptMapper does the matching in the annotator. But if we plan to create one dictionary annotator
component, I think this can be cleaned up.

-- Michael



Re: Any interest in this as an open source project?

Posted by Thilo Goetz <tw...@gmx.de>.
Roberto Franchini wrote:
> On Tue, May 13, 2008 at 3:55 PM, Michael Baessler <mb...@apache.org> wrote:
>> Altogether sounds good to me, I'm interested :-)
>>
> 
> I showed this thread to my linguistics colleagues and we are interested too :)

Ok, it seems there is general interest in accepting
this code donation.  We should move this discussion
to the developers' list.  Basically, here are the
steps we need to follow:

1) Michael (Tanenblatt) creates a JIRA issue and attaches
the source code (as a zip file).

2) UIMA committers and other interested parties review
the source code and VOTE to accept the donation.

3) Michael creates a software grant for ConceptMapper and
sends it to the ASF (we'll help).  Of course you need
your company's permission to do this (and for step 1).

4) We (UIMA PMC) do the IP clearance due diligence process.

5) We import the code into SVN and Apache development can
start.

Steps 4 and 5 can proceed in parallel.

It is not necessary to do any code clean-up before you
submit the code, other than clean-up you may want to do
from your side.  In particular, it is not necessary to
rename packages and insert Apache license headers.  This
can be done when the code is already in the Apache source
repository.

Marshall, you've recently done this for UIMA-AS.  Let me
know if anything is missing/wrong.

Follow-ups to uima-dev, please.

--Thilo


Re: Any interest in this as an open source project?

Posted by Roberto Franchini <ro...@gmail.com>.
On Tue, May 13, 2008 at 3:55 PM, Michael Baessler <mb...@apache.org> wrote:
> Altogether sounds good to me, I'm interested :-)
>

I showed this thread to my linguistics colleagues and we are interested too :)
-- 
Roberto Franchini
CELI s.r.l. (http://www.celi.it) - C.so Moncalieri 21 - 10131 Torino - ITALY
Tel +39-011-6600814 - Fax +39-011-6600687
jabber:ro.franchini@gmail.com skype:ro.franchini

Re: Any interest in this as an open source project?

Posted by Michael Baessler <mb...@apache.org>.
Altogether sounds good to me, I'm interested :-)

-- Michael



Re: Any interest in this as an open source project?

Posted by Michael Tanenblatt <sl...@park-slope.net>.
On May 13, 2008, at 2:31 AM, Michael Baessler wrote:

> Michael Tanenblatt wrote:
>> My comments inline, below:
>>
>> On May 10, 2008, at 2:56 AM, Michael Baessler wrote:
>>
>>> Hi Michael,
>>>
>>> thanks for the detailed comparison. ConceptMapper seems to be very
>>> interesting
>>> but I have some additional questions. Please see my comments below:
>>>
>>> Michael Tanenblatt wrote:
>>>> OK, good question. I have never used the project that is in the  
>>>> sandbox
>>>> as ConceptMapper has been in development and production for a  
>>>> long time,
>>>> so my comparisons are based solely on what I gleaned from the
>>>> documentation. From this cursory knowledge of the  
>>>> DictionaryAnnotator
>>>> that is already in the sandbox, I think that ConceptMapper provides
>>>> significantly more functionality and customizability, while  
>>>> seemingly
>>>> providing all of the functionality of the current  
>>>> DictionaryAnnotator.
>>>> Here is a comparison, with the caveats claimed earlier regarding my
>>>> level of familiarity with the current DictionaryAnnotator:
>>>>
>>>> Both annotators allow for the use of the same tokenizer in  
>>>> dictionary
>>>> tokenization as is used in the processing pipeline, though in  
>>>> slightly
>>>> different ways (descriptor vs. Pear file). ConceptMapper has no  
>>>> default
>>>> tokenizer, though there is a simple one included in the package.
>>>
>>> I think having a default tokenizer is important for the "ease of  
>>> use"
>>> of the
>>> dictionary component. If users just want to use a simple list of
>>> words (multi-words) for processing,
>>> they don't want to set up a separate tokenizer to create the
>>> dictionary. Can you explain
>>> in more detail what a user has to do to tokenize the content?
>>
>> I am not sure I agree with you on this point. Since the default setup
>> for ConceptMapper is to tokenize the dictionary at load time, which  
>> is
>> when the processing pipeline is set up, and since there will need  
>> to be
>> a tokenizer in the pipeline for processing the input text, I
>> think
>> that it is perfectly reasonable to require the specification of a
>> tokenizer in the ConceptMapper AE descriptor for use as the tokenizer
>> for the dictionary. This enforces the point that the same tokenizer  
>> be
>> used for tokenizing the dictionary as the input data, which is  
>> actually
>> something that I believe to be more than reasonable. In fact, I  
>> think it
>> *should* be a requirement, otherwise entries that are in the  
>> dictionary
>> might not be found, due to a tokenization mismatch.
>>
>> To simplify setup for naïve users, as I said, there is a simple
>> tokenizer annotator included in the ConceptMapper package, and that
>> could be used for both the dictionary and text processing. It  
>> breaks on
>> whitespace, plus any other character specified in a parameter.
>
> It was not my intention to say that I disagree with the way
> ConceptMapper does the tokenization of the dictionary content. I was
> not aware that you use one of the tokenizers of the processing
> pipeline. Maybe I missed that in one of your previous mails.
>
> The way ConceptMapper does the tokenization is fine with me.
> Are there any special requirements for the tokenizer (multi-threading,
> resources ...)? I think you create your own instance of the tokenizer
> during initialization of ConceptMapper.

The tokenizer is created using UIMAFramework.produceAnalysisEngine().
The resulting analysis engine is run against each dictionary entry's
text using its process() method; the tokens created in the CAS as a
result are then collected, and the CAS is reset for the next entry to
be processed.
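
In code, that setup looks roughly like the following sketch (the
descriptor path and the entry list are placeholders, exception handling
is omitted; the classes come from the standard org.apache.uima packages):

// imports: org.apache.uima.UIMAFramework,
//          org.apache.uima.util.XMLInputSource,
//          org.apache.uima.resource.ResourceSpecifier,
//          org.apache.uima.analysis_engine.AnalysisEngine,
//          org.apache.uima.cas.CAS
XMLInputSource in = new XMLInputSource("TokenizerDescriptor.xml");
ResourceSpecifier spec =
    UIMAFramework.getXMLParser().parseResourceSpecifier(in);
AnalysisEngine tokenizer = UIMAFramework.produceAnalysisEngine(spec);
CAS cas = tokenizer.newCAS();
for (String entryText : dictionaryEntryTexts) {  // the dictionary's entry strings
    cas.setDocumentText(entryText);
    tokenizer.process(cas);  // tokenizer adds its token annotations to the CAS
    // ... collect the token annotations from the CAS here ...
    cas.reset();             // ready for the next dictionary entry
}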

>
>
> Some tokenizers produce different results based on the document  
> language. Is there a setting
> to specify the language that should be used to tokenize the  
> dictionary content?

There is an AE descriptor parameter that is passed to
setDocumentLanguage() for the tokenizer processing of the dictionary  
entries.

>
>>
>>>
>>>
>>>>
>>>> One clear difference is that there is no dictionary creator for
>>>> ConceptMapper; instead, you must build the XML file by hand. This  
>>>> is
>>>> due, in part, to the fact that dictionary entries can have  
>>>> arbitrary
>>>> attributes associated with them. This leads to what I think is a  
>>>> serious
>>>> advantage of ConceptMapper: these attributes associated with  
>>>> dictionary
>>>> entries can be copied to the annotations that are created in  
>>>> response to
>>>> a successful lookup. This is very useful for attaching a code  
>>>> from some
>>>> coding scheme (e.g., from a medical lexicon or ontology) or a  
>>>> reference
>>>> to a document in which the term was originally extracted, or any  
>>>> number
>>>> of other features. There is no limit to the number of attributes
>>>> attached to the dictionary entries, and the mapping from them to  
>>>> the
>>>> resultant annotations is configurable in the AE descriptor.
>>>
>>> So if I understand you correctly, the dictionary XML format is not
>>> predefined. The XML tags
>>> used to specify the dictionary content are related to the used UIMA
>>> type system. How do you
>>> check for errors in the dictionary definition?
>>
>> The predefined portion of the XML is:
>>
>> <token>
>>    <variant base="text string" />
>>    <variant base="text string2" />
>>    ...
>> </token>
>>
>> which defines an entry with two variants. It is any additional
>> attributes that you might want to add (POS tag, code, etc.) that is  
>> not
>> predefined, but also not required. The only error checking would be  
>> that
>> the SAX parser would throw an exception if the above structure is not
>> intact.
>>
>>>
>>>
>>> The resulting annotations are specified in the AE descriptor. So I
>>> think you have a mapping from
>>> dictionary XML elements/features to UIMA types/features? Is there a
>>> default mapping?
>>>
>>
>> There is no default mapping. Any identifying attributes that you  
>> might
>> want transferred to the resultant annotations can be put into the
>> token
>> element or the individual variant elements, but as I said, that is
>> optional.
> OK so when adding a feature "email" to the token as shown in the  
> example below, there has to be an
> "email" feature (String valued) in the type system for the created  
> result annotation.
>
> <token email="john.doe@sample-mail.com">
>   <variant base="John Doe" />
> </token>

Let's say you specify that the resultant annotations are going to be
of the type "DictTerm", and each has a feature "EMailAddress". You
could then specify the mapping from "email" to "EMailAddress" in the
descriptor, and then when "John Doe" was found in the text, it would
be annotated with a DictTerm with an "EMailAddress" of
"john.doe@sample-mail.com".



>
>>
>>
>>> Can the dictionaries also be language specific?
>>
>> Well, I am not sure what this means. If you mean: will ConceptMapper
>> load different dictionaries depending on the language setting, then  
>> the
>> answer is no. It currently allows only one dictionary resource to be
>> specified, and it will be loaded if necessary. I would agree that  
>> this
>> would be a nice feature to incorporate, though.
>
> I just want to know if the dictionary can have a language setting.  
> In some cases the dictionary
> content is language specific, so I want to add a setting to the
> dictionary saying that the content
> should be used for English only.
>
> So your reply answers my question :-)

Right, whichever dictionary you set as the DictionaryFile resource is  
the one that is used.

>
>>
>>>
>>>
>>>>
>>>> ConceptMapper only has provisions for using one dictionary per  
>>>> instance,
>>>> though this is probably a relatively simple thing to augment.
>>>>
>>>> ConceptMapper dictionaries are implemented as shared resources.  
>>>> It is
>>>> not clear if this is the case for the DictionaryAnnotator in the
>>>> sandbox. One could also create a new implementation of the
>>>> DictionaryResource interface. This was done in the case of the
>>>> CompiledDictionaryResource_impl, which operates on a dictionary  
>>>> that has
>>>> been parsed and then serialized, to allow for quick loading.
>>>
>>> The DictionaryAnnotator cannot share dictionaries, since the
>>> dictionaries are compiled
>>> to internal data structures during initialization of the annotator.
>>
>> The same is true of ConceptMapper, so I am not sure how useful it is
>> that it is a UIMA resource. Nevertheless, it is one, and other
>> instantiations of ConceptMapper could attach to that resource, if  
>> needed.
>>
>>>
>>>
>>>>
>>>> In addition to the ability to do case-normalized matching, which  
>>>> both
>>>> provide, ConceptMapper provides a mechanism to use a stemmer,  
>>>> which is
>>>> applied to both the dictionary and the input documents.
>>>
>>> Is the stemmer provided with the ConceptMapper package?
>>> If not, how is it integrated?
>>
>> None is provided. To adapt one for use, it needs to adhere to a  
>> simple
>> interface:
>>
>> public interface Stemmer {
>>    public String stem(String token);
>>    public void initialize(String dictionary) throws
>> FileNotFoundException, ParseException;
>> }
>>
>> The only method that has to do anything is stem(), which takes a  
>> string
>> in and returns a string. Using this, it was quite simple to integrate
>> the open source Snowball stemmer.
>>
>>
>>>
>>>>
>>>> Both systems provide the ability to specify the particular type of
>>>> annotation to consider in lookups (e.g.,  
>>>> uima.tt.TokenAnnotation), as
>>>> well as an optional feature within that annotation, with both  
>>>> defaulting
>>>> to the covered text. ConceptMapper also allows an annotation type  
>>>> to be
>>>> used to bound lookups (e.g. a sentence at a time, or an NP at a  
>>>> time,
>>>> etc.). Perhaps this was an oversight on my part, but I did not  
>>>> see this
>>>> in the existing sandbox annotator.
>>> Sorry, I don't understand what you mean by "ConceptMapper also
>>> allows an
>>> annotation type to be used to bound lookups". Can you give an  
>>> example?
>>
>> What I mean is that ConceptMapper works span by span, and that span  
>> is
>> specified in the descriptor. Typically, that span is a sentence, but
>> could be an NP or even the whole document. Dictionary lookups are
>> limited to tokens that appear within a single span--no crossing of  
>> span
>> boundaries are allowed. Does this make sense?
>
> Yes, thanks!
>
>>
>>>
>>>>
>>>> Token skipping is an option in both systems, though it is  
>>>> implemented
>>>> differently. ConceptMapper has two methods available: the
>>>> ability to use a stop-word list to handle the simple case of
>>>> omitting
>>>> tokens based on lexical equality, and feature-based include/exclude
>>>> lists. The latter is not as general as I'd like in its  
>>>> implementation.
>>>> Perhaps the filter conditions of the current DictionaryAnnotator is
>>>> better.
>>>>
>>>> Finally, and again this may be due an oversight on my part in  
>>>> reading
>>>> the documentation, it is not clear what the search strategy is  
>>>> for the
>>>> current DictionaryAnnotator, but I would assume it finds non- 
>>>> overlapping
>>>> longest matches. While ConceptMapper supports this as a default,  
>>>> there
>>>> are three parameters in the AE descriptor to control the way the  
>>>> search
>>>> is done.
>>>
>>> Right, you cannot configure the matching strategy for the
>>> DictionaryAnnotator.
>>> Currently the matching strategy is "first longest match" and no
>>> "overlapping"
>>> annotations are created. So you are right non-overlapping longest
>>> matches.
>>>
>>>
>>> Altogether, I see advantages for both systems. I'm not sure if
>>> there is
>>> a way to
>>> create one Dictionary component with the advantages of both since  
>>> some
>>> of the
>>> base concepts are different e.g. dictionary content object. But  
>>> maybe
>>> we can try :-)
>>>
>>> -- Michael
>>
>
> -- Michael


Re: Any interest in this as an open source project?

Posted by Michael Baessler <mb...@michael-baessler.de>.
Michael Tanenblatt wrote:
> My comments inline, below:
> 
> On May 10, 2008, at 2:56 AM, Michael Baessler wrote:
> 
>> Hi Michael,
>>
>> thanks for the detailed comparison. ConceptMapper seems to be very
>> interesting
>> but I have some additional questions. Please see my comments below:
>>
>> Michael Tanenblatt wrote:
>>> OK, good question. I have never used the project that is in the sandbox
>>> as ConceptMapper has been in development and production for a long time,
>>> so my comparisons are based solely on what I gleaned from the
>>> documentation. From this cursory knowledge of the DictionaryAnnotator
>>> that is already in the sandbox, I think that ConceptMapper provides
>>> significantly more functionality and customizability, while seemingly
>>> providing all of the functionality of the current DictionaryAnnotator.
>>> Here is a comparison, with the caveats claimed earlier regarding my
>>> level of familiarity with the current DictionaryAnnotator:
>>>
>>> Both annotators allow for the use of the same tokenizer in dictionary
>>> tokenization as is used in the processing pipeline, though in slightly
>>> different ways (descriptor vs. Pear file). ConceptMapper has no default
>>> tokenizer, though there is a simple one included in the package.
>>
>> I think having a default tokenizer is important for the "ease of use"
>> of the
>> dictionary component. If users just want to use a simple list of
>> words (multi-words) for processing,
>> they don't want to set up a separate tokenizer to create the
>> dictionary. Can you explain
>> in more detail what a user has to do to tokenize the content?
> 
> I am not sure I agree with you on this point. Since the default setup
> for ConceptMapper is to tokenize the dictionary at load time, which is
> when the processing pipeline is set up, and since there will need to be
> a tokenizer in the pipeline for processing the input text, I think
> that it is perfectly reasonable to require the specification of a
> tokenizer in the ConceptMapper AE descriptor for use as the tokenizer
> for the dictionary. This enforces the point that the same tokenizer be
> used for tokenizing the dictionary as the input data, which is actually
> something that I believe to be more than reasonable. In fact, I think it
> *should* be a requirement, otherwise entries that are in the dictionary
> might not be found, due to a tokenization mismatch.
> 
> To simplify setup for naïve users, as I said, there is a simple
> tokenizer annotator included in the ConceptMapper package, and that
> could be used for both the dictionary and text processing. It breaks on
> whitespace, plus any other character specified in a parameter.

It was not my intention to say that I disagree with the way ConceptMapper
does the tokenization of the dictionary content. I was not aware that you use one of the tokenizers
of the processing pipeline. Maybe I missed that in one of your previous mails.

The way ConceptMapper does the tokenization is fine with me.
Are there any special requirements for the tokenizer (multi-threading, resources ...)? I think you
create your own instance of the tokenizer during initialization of ConceptMapper.

Some tokenizers produce different results based on the document language. Is there a setting
to specify the language that should be used to tokenize the dictionary content?
> 
>>
>>
>>>
>>> One clear difference is that there is no dictionary creator for
>>> ConceptMapper; instead, you must build the XML file by hand. This is
>>> due, in part, to the fact that dictionary entries can have arbitrary
>>> attributes associated with them. This leads to what I think is a serious
>>> advantage of ConceptMapper: these attributes associated with dictionary
>>> entries can be copied to the annotations that are created in response to
>>> a successful lookup. This is very useful for attaching a code from some
>>> coding scheme (e.g., from a medical lexicon or ontology) or a reference
>>> to a document in which the term was originally extracted, or any number
>>> of other features. There is no limit to the number of attributes
>>> attached to the dictionary entries, and the mapping from them to the
>>> resultant annotations is configurable in the AE descriptor.
>>
>> So if I understand you correctly, the dictionary XML format is not
>> predefined. The XML tags
>> used to specify the dictionary content are related to the used UIMA
>> type system. How do you
>> check for errors in the dictionary definition?
> 
> The predefined portion of the XML is:
> 
> <token>
>     <variant base="text string" />
>     <variant base="text string2" />
>     ...
> </token>
> 
> which defines an entry with two variants. It is any additional
> attributes that you might want to add (POS tag, code, etc.) that is not
> predefined, but also not required. The only error checking would be that
> the SAX parser would throw an exception if the above structure is not
> intact.
> 
>>
>>
>> The resulting annotations are specified in the AE descriptor. So I
>> think you have a mapping from
>> dictionary XML elements/features to UIMA types/features? Is there a
>> default mapping?
>>
> 
> There is no default mapping. Any identifying attributes that you might
> want transferred to the resultant annotations can be put into the token
> element or the individual variant elements, but as I said, that is
> optional.
OK so when adding a feature "email" to the token as shown in the example below, there has to be an
"email" feature (String valued) in the type system for the created result annotation.

<token email="john.doe@sample-mail.com">
   <variant base="John Doe" />
</token>
> 
> 
>> Can the dictionaries also be language specific?
> 
> Well, I am not sure what this means. If you mean: will ConceptMapper
> load different dictionaries depending on the language setting, then the
> answer is no. It currently allows only one dictionary resource to be
> specified, and it will be loaded if necessary. I would agree that this
> would be a nice feature to incorporate, though.

I just want to know if the dictionary can have a language setting. In some cases the dictionary
content is language specific, so I want to add a setting to the dictionary saying that the content
should be used for English only.

So your reply answers my question :-)
> 
>>
>>
>>>
>>> ConceptMapper only has provisions for using one dictionary per instance,
>>> though this is probably a relatively simple thing to augment.
>>>
>>> ConceptMapper dictionaries are implemented as shared resources. It is
>>> not clear if this is the case for the DictionaryAnnotator in the
>>> sandbox. One could also create a new implementation of the
>>> DictionaryResource interface. This was done in the case of the
>>> CompiledDictionaryResource_impl, which operates on a dictionary that has
>>> been parsed and then serialized, to allow for quick loading.
>>
>> The DictionaryAnnotator cannot share dictionaries, since the
>> dictionaries are compiled
>> to internal data structures during initialization of the annotator.
> 
> The same is true of ConceptMapper, so I am not sure how useful it is
> that it is a UIMA resource. Nevertheless, it is one, and other
> instantiations of ConceptMapper could attach to that resource, if needed.
> 
>>
>>
>>>
>>> In addition to the ability to do case-normalized matching, which both
>>> provide, ConceptMapper provides a mechanism to use a stemmer, which is
>>> applied to both the dictionary and the input documents.
>>
>> Is the stemmer provided with the ConceptMapper package?
>> If not, how is it integrated?
> 
> None is provided. To adapt one for use, it needs to adhere to a simple
> interface:
> 
> public interface Stemmer {
>     public String stem(String token);
>     public void initialize(String dictionary) throws
> FileNotFoundException, ParseException;
> }
> 
> The only method that has to do anything is stem(), which takes a string
> in and returns a string. Using this, it was quite simple to integrate
> the open source Snowball stemmer.
> 
> 
>>
>>>
>>> Both systems provide the ability to specify the particular type of
>>> annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as
>>> well as an optional feature within that annotation, with both defaulting
>>> to the covered text. ConceptMapper also allows an annotation type to be
>>> used to bound lookups (e.g. a sentence at a time, or an NP at a time,
>>> etc.). Perhaps this was an oversight on my part, but I did not see this
>>> in the existing sandbox annotator.
>> Sorry, I don't understand what you mean by "ConceptMapper also
>> allows an
>> annotation type to be used to bound lookups". Can you give an example?
> 
> What I mean is that ConceptMapper works span by span, and that span is
> specified in the descriptor. Typically, that span is a sentence, but
> could be an NP or even the whole document. Dictionary lookups are
> limited to tokens that appear within a single span--no crossing of span
> boundaries are allowed. Does this make sense?

Yes, thanks!

> 
>>
>>>
>>> Token skipping is an option in both systems, though it is implemented
>>> differently. ConceptMapper has two methods available: the
>>> ability to use a stop-word list to handle the simple case of omitting
>>> tokens based on lexical equality, and feature-based include/exclude
>>> lists. The latter is not as general as I'd like in its implementation.
>>> Perhaps the filter conditions of the current DictionaryAnnotator is
>>> better.
>>>
>>> Finally, and again this may be due an oversight on my part in reading
>>> the documentation, it is not clear what the search strategy is for the
>>> current DictionaryAnnotator, but I would assume it finds non-overlapping
>>> longest matches. While ConceptMapper supports this as a default, there
>>> are three parameters in the AE descriptor to control the way the search
>>> is done.
>>
>> Right, you cannot configure the matching strategy for the
>> DictionaryAnnotator.
>> Currently the matching strategy is "first longest match" and no
>> "overlapping"
>> annotations are created. So you are right non-overlapping longest
>> matches.
>>
>>
>> Altogether, I see advantages for both systems. I'm not sure if there is
>> a way to
>> create one Dictionary component with the advantages of both since some
>> of the
>> base concepts are different e.g. dictionary content object. But maybe
>> we can try :-)
>>
>> -- Michael
> 

-- Michael

Re: Any interest in this as an open source project?

Posted by Michael Tanenblatt <sl...@park-slope.net>.
My comments inline, below:

On May 10, 2008, at 2:56 AM, Michael Baessler wrote:

> Hi Michael,
>
> thanks for the detailed comparison. ConceptMapper seems to be very  
> interesting
> but I have some additional questions. Please see my comments below:
>
> Michael Tanenblatt wrote:
>> OK, good question. I have never used the project that is in the  
>> sandbox
>> as ConceptMapper has been in development and production for a long  
>> time,
>> so my comparisons are based solely on what I gleaned from the
>> documentation. From this cursory knowledge of the DictionaryAnnotator
>> that is already in the sandbox, I think that ConceptMapper provides
>> significantly more functionality and customizability, while seemingly
>> providing all of the functionality of the current  
>> DictionaryAnnotator.
>> Here is a comparison, with the caveats claimed earlier regarding my
>> level of familiarity with the current DictionaryAnnotator:
>>
>> Both annotators allow for the use of the same tokenizer in dictionary
>> tokenization as is used in the processing pipeline, though in  
>> slightly
>> different ways (descriptor vs. Pear file). ConceptMapper has no  
>> default
>> tokenizer, though there is a simple one included in the package.
>
> I think having a default tokenizer is important for the "ease of  
> use" of the
> dictionary component. If users just want to use a simple list of  
> words (multi-words) for processing,
> they don't want to set up a separate tokenizer to create the
> dictionary. Can you explain
> in more detail what a user has to do to tokenize the content?

I am not sure I agree with you on this point. Since the default setup  
for ConceptMapper is to tokenize the dictionary at load time, which is  
when the processing pipeline is set up, and since there will need to  
be a tokenizer in the pipeline for processing the input text, I
think that it is perfectly reasonable to require the specification of  
a tokenizer in the ConceptMapper AE descriptor for use as the  
tokenizer for the dictionary. This enforces the point that the same  
tokenizer be used for tokenizing the dictionary as the input data,  
which is actually something that I believe to be more than reasonable.  
In fact, I think it *should* be a requirement, otherwise entries that  
are in the dictionary might not be found, due to a tokenization  
mismatch.

To simplify setup for naïve users, as I said, there is a simple  
tokenizer annotator included in the ConceptMapper package, and that  
could be used for both the dictionary and text processing. It breaks  
on whitespace, plus any other character specified in a parameter.

>
>
>>
>> One clear difference is that there is no dictionary creator for
>> ConceptMapper; instead, you must build the XML file by hand. This is
>> due, in part, to the fact that dictionary entries can have arbitrary
>> attributes associated with them. This leads to what I think is a  
>> serious
>> advantage of ConceptMapper: these attributes associated with  
>> dictionary
>> entries can be copied to the annotations that are created in  
>> response to
>> a successful lookup. This is very useful for attaching a code from  
>> some
>> coding scheme (e.g., from a medical lexicon or ontology) or a  
>> reference
>> to a document in which the term was originally extracted, or any  
>> number
>> of other features. There is no limit to the number of attributes
>> attached to the dictionary entries, and the mapping from them to the
>> resultant annotations is configurable in the AE descriptor.
>
> So if I understand you correctly, the dictionary XML format is not  
> predefined. The XML tags
> used to specify the dictionary content are related to the used UIMA  
> type system. How do you
> check for errors in the dictionary definition?

The predefined portion of the XML is:

<token>
	<variant base="text string" />
	<variant base="text string2" />
	...
</token>

which defines an entry with two variants. Any additional  
attributes that you might want to add (POS tag, code, etc.) are  
not predefined, but they are also not required. The only error checking  
would be that the SAX parser throws an exception if the above structure  
is not intact.
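
To make that concrete, loading could be sketched with a plain SAX
handler along these lines (illustrative only; the handler name and the
storage step are stand-ins, and the real loader also runs the
tokenizer over each variant):

import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DictionaryHandler extends DefaultHandler {
	private Map<String, String> entryAttrs;

	public void startElement(String uri, String local, String qName,
			Attributes attrs) {
		if ("token".equals(qName)) {
			// Nothing is predefined: copy whatever attributes the entry has.
			entryAttrs = new HashMap<String, String>();
			for (int i = 0; i < attrs.getLength(); i++) {
				entryAttrs.put(attrs.getQName(i), attrs.getValue(i));
			}
		} else if ("variant".equals(qName)) {
			// Variants inherit the entry's attributes and may override them.
			Map<String, String> variantAttrs = new HashMap<String, String>(entryAttrs);
			for (int i = 0; i < attrs.getLength(); i++) {
				variantAttrs.put(attrs.getQName(i), attrs.getValue(i));
			}
			// ... tokenize variantAttrs.get("base") and store the entry ...
		}
	}

	public static void main(String[] args) throws Exception {
		// Malformed XML surfaces here as a SAXException from the parser.
		SAXParserFactory.newInstance().newSAXParser()
				.parse(args[0], new DictionaryHandler());
	}
}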

>
>
> The resulting annotations are specified in the AE descriptor. So I  
> think you have a mapping from
> dictionary XML elements/features to UIMA types/features? Is there a  
> default mapping?
>

There is no default mapping. Any identifying attributes that you might  
want transferred to the resultant annotations can be put into the token  
element or the individual variant elements, but as I said, that is  
optional.


> Can the dictionaries also be language specific?

Well, I am not sure what this means. If you mean: will ConceptMapper  
load different dictionaries depending on the language setting, then  
the answer is no. It currently allows only one dictionary resource to  
be specified, and it will be loaded if necessary. I would agree that  
this would be a nice feature to incorporate, though.

>
>
>>
>> ConceptMapper only has provisions for using one dictionary per  
>> instance,
>> though this is probably a relatively simple thing to augment.
>>
>> ConceptMapper dictionaries are implemented as shared resources. It is
>> not clear if this is the case for the DictionaryAnnotator in the
>> sandbox. One could also create a new implementation of the
>> DictionaryResource interface. This was done in the case of the
>> CompiledDictionaryResource_impl, which operates on a dictionary  
>> that has
>> been parsed and then serialized, to allow for quick loading.
>
> The DictionaryAnnotator cannot share dictionaries, since the  
> dictionaries are compiled
> to internal data structures during initialization of the annotator.

The same is true of ConceptMapper, so I am not sure how useful it is  
that it is a UIMA resource. Nevertheless, it is one, and other  
instantiations of ConceptMapper could attach to that resource, if  
needed.

>
>
>>
>> In addition to the ability to do case-normalized matching, which both
>> provide, ConceptMapper provides a mechanism to use a stemmer, which  
>> is
>> applied to both the dictionary and the input documents.
>
> Is the stemmer provided with the ConceptMapper package?
> If not, how is it integrated?

None is provided. To adapt one for use, it needs to adhere to a simple  
interface:

public interface Stemmer {
	public String stem(String token);
	public void initialize(String dictionary)
		throws FileNotFoundException, ParseException;
}

The only method that has to do anything is stem(), which takes a  
string in and returns a string. Using this, it was quite simple to  
integrate the open source Snowball stemmer.
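
For instance, a Snowball adapter comes down to something like this (a
sketch against the classic org.tartarus.snowball API; treat the exact
class and method names as assumptions rather than gospel):

import org.tartarus.snowball.ext.englishStemmer;

public class SnowballStemmer implements Stemmer {
	private final englishStemmer delegate = new englishStemmer();

	public String stem(String token) {
		delegate.setCurrent(token);
		delegate.stem();
		return delegate.getCurrent();
	}

	public void initialize(String dictionary) {
		// Snowball needs no dictionary file, so there is nothing to do here.
	}
}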


>
>>
>> Both systems provide the ability to specify the particular type of
>> annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as
>> well as an optional feature within that annotation, with both  
>> defaulting
>> to the covered text. ConceptMapper also allows an annotation type  
>> to be
>> used to bound lookups (e.g. a sentence at a time, or an NP at a time,
>> etc.). Perhaps this was an oversight on my part, but I did not see  
>> this
>> in the existing sandbox annotator.
> Sorry, I don't understand what you mean by "ConceptMapper also  
> allows an
> annotation type to be used to bound lookups". Can you give an example?

What I mean is that ConceptMapper works span by span, and that span is  
specified in the descriptor. Typically, that span is a sentence, but  
could be an NP or even the whole document. Dictionary lookups are  
limited to tokens that appear within a single span; no crossing of  
span boundaries is allowed. Does this make sense?
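
In UIMA terms, the effect is roughly the following sketch
(illustrative 2.x-style code; Sentence and Token stand in for whatever
span and token types the descriptor names):

import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.text.AnnotationIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public void processSpans(JCas jcas) {
	AnnotationIndex tokenIndex = jcas.getAnnotationIndex(Token.type);
	FSIterator spans = jcas.getAnnotationIndex(Sentence.type).iterator();
	while (spans.hasNext()) {
		Annotation span = (Annotation) spans.next();
		// subiterator() yields only the tokens covered by this span, so a
		// candidate match can never extend past the span boundary.
		FSIterator tokens = tokenIndex.subiterator(span);
		while (tokens.hasNext()) {
			Annotation token = (Annotation) tokens.next();
			// ... feed this token into the dictionary matcher ...
		}
	}
}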

>
>>
>> Token skipping is an option in both systems, though it is implemented
>> differently. ConceptMapper has two methods available: the
>> ability to use a stop-word list to handle the simple case of omitting
>> tokens based on lexical equality, and feature-based include/exclude
>> lists. The latter is not as general as I'd like in its
>> implementation. Perhaps the filter conditions of the current
>> DictionaryAnnotator are better.
>>
>> Finally, and again this may be due to an oversight on my part in reading
>> the documentation, it is not clear what the search strategy is for  
>> the
>> current DictionaryAnnotator, but I would assume it finds non- 
>> overlapping
>> longest matches. While ConceptMapper supports this as a default,  
>> there
>> are three parameters in the AE descriptor to control the way the  
>> search
>> is done.
>
> Right, you cannot configure the matching strategy for the  
> DictionaryAnnotator.
> Currently the matching strategy is "first longest match" and no  
> "overlapping"
> annotations are created. So you are right: non-overlapping longest  
> matches.
>
>
> Altogether, I see advantages for both systems. I'm not sure if there  
> is a way to
> create one Dictionary component with the advantages of both since  
> some of the
> base concepts are different e.g. dictionary content object. But  
> maybe we can try :-)
>
> -- Michael


Re: Any interest in this as an open source project?

Posted by Michael Baessler <mb...@michael-baessler.de>.
Hi Michael,

thanks for the detailed comparison. ConceptMapper seems to be very interesting
but I have some additional questions. Please see my comments below:

Michael Tanenblatt wrote:
> OK, good question. I have never used the project that is in the sandbox
> as ConceptMapper has been in development and production for a long time,
> so my comparisons are based solely on what I gleaned from the
> documentation. From this cursory knowledge of the DictionaryAnnotator
> that is already in the sandbox, I think that ConceptMapper provides
> significantly more functionality and customizability, while seemingly
> providing all of the functionality of the current DictionaryAnnotator.
> Here is a comparison, with the caveats claimed earlier regarding my
> level of familiarity with the current DictionaryAnnotator:
> 
> Both annotators allow for the use of the same tokenizer in dictionary
> tokenization as is used in the processing pipeline, though in slightly
> different ways (descriptor vs. Pear file). ConceptMapper has no default
> tokenizer, though there is a simple one included in the package.

I think having a default tokenizer is important for the "ease of use" of the
dictionary component. If users just want to use a simple list of words (multi-words) for processing,
they don't want to set up a separate tokenizer to create the dictionary. Can you explain
in more detail what a user has to do to tokenize the content?

> 
> One clear difference is that there is no dictionary creator for
> ConceptMapper; instead, you must build the XML file by hand. This is
> due, in part, to the fact that dictionary entries can have arbitrary
> attributes associated with them. This leads to what I think is a serious
> advantage of ConceptMapper: these attributes associated with dictionary
> entries can be copied to the annotations that are created in response to
> a successful lookup. This is very useful for attaching a code from some
> coding scheme (e.g., from a medical lexicon or ontology) or a reference
> to a document in which the term was originally extracted, or any number
> of other features. There is no limit to the number of attributes
> attached to the dictionary entries, and the mapping from them to the
> resultant annotations is configurable in the AE descriptor.

So if I understand you correctly, the dictionary XML format is not predefined. The XML tags
used to specify the dictionary content are related to the used UIMA type system. How do you
check for errors in the dictionary definition?

The resulting annotations are specified in the AE descriptor. So I think you have a mapping from
dictionary XML elements/features to UIMA types/features? Is there a default mapping?

Can the dictionaries also be language specific?

> 
> ConceptMapper only has provisions for using one dictionary per instance,
> though this is probably a relatively simple thing to augment.
> 
> ConceptMapper dictionaries are implemented as shared resources. It is
> not clear if this is the case for the DictionaryAnnotator in the
> sandbox. One could also create a new implementation of the
> DictionaryResource interface. This was done in the case of the
> CompiledDictionaryResource_impl, which operates on a dictionary that has
> been parsed and then serialized, to allow for quick loading.

The DictionaryAnnotator cannot share dictionaries, since the dictionaries are compiled
to internal data structures during initialization of the annotator.

> 
> In addition to the ability to do case-normalized matching, which both
> provide, ConceptMapper provides a mechanism to use a stemmer, which is
> applied to both the dictionary and the input documents.

Is the stemmer provided with the ConceptMapper package?
If not, how is it integrated?
> 
> Both systems provide the ability to specify the particular type of
> annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as
> well as an optional feature within that annotation, with both defaulting
> to the covered text. ConceptMapper also allows an annotation type to be
> used to bound lookups (e.g. a sentence at a time, or an NP at a time,
> etc.). Perhaps this was an oversight on my part, but I did not see this
> in the existing sandbox annotator.
Sorry, I don't understand what you mean by "ConceptMapper also allows an
annotation type to be used to bound lookups". Can you give an example?
> 
> Token skipping is an option in both systems, though it is implemented
> differently. ConceptMapper has two methods available: the
> ability to use a stop-word list to handle the simple case of omitting
> tokens based on lexical equality, and feature-based include/exclude
> lists. The latter is not as general as I'd like in its implementation.
> Perhaps the filter conditions of the current DictionaryAnnotator are better.
> 
> Finally, and again this may be due to an oversight on my part in reading
> the documentation, it is not clear what the search strategy is for the
> current DictionaryAnnotator, but I would assume it finds non-overlapping
> longest matches. While ConceptMapper supports this as a default, there
> are three parameters in the AE descriptor to control the way the search
> is done. 

Right, you cannot configure the matching strategy for the DictionaryAnnotator.
Currently the matching strategy is "first longest match" and no "overlapping"
annotations are created. So you are right: non-overlapping longest matches.


Altogether, I see advantages for both systems. I'm not sure if there is a way to
create one Dictionary component with the advantages of both since some of the
base concepts are different e.g. dictionary content object. But maybe we can try :-)

-- Michael

Re: Any interest in this as an open source project?

Posted by Michael Tanenblatt <sl...@park-slope.net>.
OK, good question. I have never used the project that is in the  
sandbox as ConceptMapper has been in development and production for a  
long time, so my comparisons are based solely on what I gleaned from  
the documentation. From this cursory knowledge of the  
DictionaryAnnotator that is already in the sandbox, I think that  
ConceptMapper provides significantly more functionality and  
customizability, while seemingly providing all of the functionality of  
the current DictionaryAnnotator. Here is a comparison, with the  
caveats claimed earlier regarding my level of familiarity with the  
current DictionaryAnnotator:

Both annotators allow for the use of the same tokenizer in dictionary  
tokenization as is used in the processing pipeline, though in slightly  
different ways (descriptor vs. Pear file). ConceptMapper has no  
default tokenizer, though there is a simple one included in the package.

One clear difference is that there is no dictionary creator for  
ConceptMapper; instead, you must build the XML file by hand. This is  
due, in part, to the fact that dictionary entries can have arbitrary  
attributes associated with them. This leads to what I think is a  
serious advantage of ConceptMapper: these attributes associated with  
dictionary entries can be copied to the annotations that are created  
in response to a successful lookup. This is very useful for attaching  
a code from some coding scheme (e.g., from a medical lexicon or  
ontology) or a reference to a document in which the term was  
originally extracted, or any number of other features. There is no  
limit to the number of attributes attached to the dictionary entries,  
and the mapping from them to the resultant annotations is configurable  
in the AE descriptor.

ConceptMapper only has provisions for using one dictionary per  
instance, though this is probably a relatively simple thing to augment.

ConceptMapper dictionaries are implemented as shared resources. It is  
not clear if this is the case for the DictionaryAnnotator in the  
sandbox. One could also create a new implementation of the  
DictionaryResource interface. This was done in the case of the  
CompiledDictionaryResource_impl, which operates on a dictionary that  
has been parsed and then serialized, to allow for quick loading.

In addition to the ability to do case-normalized matching, which both  
provide, ConceptMapper provides a mechanism to use a stemmer, which is  
applied to both the dictionary and the input documents.

Both systems provide the ability to specify the particular type of  
annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as  
well as an optional feature within that annotation, with both  
defaulting to the covered text. ConceptMapper also allows an  
annotation type to be used to bound lookups (e.g. a sentence at a  
time, or an NP at a time, etc.). Perhaps this was an oversight on my  
part, but I did not see this in the existing sandbox annotator.

Token skipping is an option in both systems, though it is implemented  
differently. ConceptMapper has two methods available: the  
ability to use a stop-word list to handle the simple case of omitting  
tokens based on lexical equality, and feature-based include/exclude  
lists. The latter is not as general as I'd like in its implementation.  
Perhaps the filter conditions of the current DictionaryAnnotator are  
better.

Finally, and again this may be due to an oversight on my part in reading  
the documentation, it is not clear what the search strategy is for the  
current DictionaryAnnotator, but I would assume it finds non- 
overlapping longest matches. While ConceptMapper supports this as a  
default, there are three parameters in the AE descriptor to control  
the way the search is done. From my original email:

> Dictionary lookup is controlled by three parameters in the  
> descriptor, one
> of which allows for order-independent lookup (i.e., A B == B A),  
> another
> toggles between finding only the longest match vs. finding all possible
> matches. The final parameter specifies the search strategy, of which  
> there
> are three. The default search strategy only considers contiguous  
> tokens
> (not including tokens from the stop word list or otherwise skipped  
> tokens),
> and then begins the subsequent search after the longest match. The  
> second
> strategy allows for ignoring non-matching tokens, allowing for  
> disjoint
> matches, so that a dictionary entry of
>
>     A C
>
> would match against the text
>
>     A B C
>
> As with the default search strategy, the subsequent search begins  
> after the
> longest match. The final search strategy is identical to the previous,
> except that subsequent searches begin one token ahead, instead of  
> after the
> previous match. This enables overlapped matching.
>
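
For concreteness, those settings would appear as ordinary
configuration parameters in the AE descriptor, along these lines (the
parameter names and values here are illustrative, not a promise about
the shipped descriptor):

<configurationParameterSettings>
  <nameValuePair>
    <name>OrderIndependentLookup</name>
    <value><boolean>false</boolean></value>
  </nameValuePair>
  <nameValuePair>
    <name>FindAllMatches</name>
    <value><boolean>false</boolean></value>
  </nameValuePair>
  <nameValuePair>
    <name>SearchStrategy</name>
    <value><string>ContiguousMatch</string></value>
  </nameValuePair>
</configurationParameterSettings>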


On May 9, 2008, at 5:13 AM, Thilo Goetz wrote:

> Michael A Tanenblatt wrote:
>> My group would like to offer the following UIMA component,  
>> ConceptMapper,
>> as an open source offering into the UIMA sandbox, assuming there is
>> interest from the community:
> ...
>
> Michael,
>
> we already have a dictionary project in the sandbox.  Can you
> comment on what the differences are and why you think we need
> another one?  Another option would be for you to help extend
> the existing dictionary implementation to satisfy your needs.
>
> --Thilo
>


Re: Any interest in this as an open source project?

Posted by Thilo Goetz <tw...@gmx.de>.
Michael A Tanenblatt wrote:
> My group would like to offer the following UIMA component, ConceptMapper,
> as an open source offering into the UIMA sandbox, assuming there is
> interest from the community:
...

Michael,

we already have a dictionary project in the sandbox.  Can you
comment on what the differences are and why you think we need
another one?  Another option would be for you to help extend
the existing dictionary implementation to satisfy your needs.

--Thilo


Re: Any interest in this as an open source project?

Posted by Michael A Tanenblatt <mt...@us.ibm.com>.
OK, I tried to answer your questions inline, below. If not, I am sure I
can try again:

Marshall Schor <ms...@schor.com> wrote on 05/08/2008 03:18:39 PM:

>
> Sounds interesting; see below for some questions:
>
> Michael A Tanenblatt wrote:
> > My group would like to offer the following UIMA component,
ConceptMapper,
> > as an open source offering into the UIMA sandbox, assuming there is
> > interest from the community:
> > ConceptMapper is a token-based dictionary lookup UIMA component. It was
> > designed specifically to allow any external tokenizer that is a UIMA
> > component to be used to tokenize its dictionary. Using the same
tokenizer
> > on both the dictionary and for subsequent text processing prevents
> > situations where a particular dictionary entry is not found, though it
> > exists, because it was tokenized differently than the text being
processed.
> >
> Is the idea that the tokenizer for the dictionary is run during some
> kind of "build" process which occurs, maybe once, before the "run"
process?

It depends on the size of the dictionary and how patient you are. The
dictionary is loaded as a UIMA resource, and the loading/tokenization can
be done at resource loading time, or it could be precompiled (Java object
serialization) and then loaded in that form. For dictionaries on the order
of 10K entries and running on a modern laptop, the loading doesn't take
more than a couple of seconds at most.
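
The precompilation itself is plain Java object serialization, roughly
like this (a sketch; parseAndTokenize() is a stand-in for the real
dictionary loader, and here it just returns a serializable placeholder):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class DictionaryCompiler {
	public static void main(String[] args) throws Exception {
		// One-time "build" step: parse and tokenize the XML dictionary,
		// then serialize the resulting object for fast loading later.
		Object dict = parseAndTokenize(args[0]);
		ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(args[1]));
		out.writeObject(dict);
		out.close();

		// At pipeline initialization, the precompiled form loads directly:
		ObjectInputStream in = new ObjectInputStream(new FileInputStream(args[1]));
		Object loaded = in.readObject();
		in.close();
	}

	private static Object parseAndTokenize(String xmlFile) {
		// Stand-in: the real code would run the tokenizer over every variant.
		return new java.util.HashMap<String, String>();
	}
}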


> > ConceptMapper is highly configurable, in terms of:
> >  * the way dictionary entries are mapped to resultant annotations
> >  * the way input documents are processed
> >  * the availability of multiple lookup strategies
> >  * its various output options.
> >
> > Additionally, a set of post-processing filters are supplied, as well as
an
> > interface to easily create new filters. This allows for overgenerating
> > results during the lookup phase, if so desired, then reducing the
result
> > set according to particular rules.
> >
> Can you give some examples of "overgenerating" and "filtering" in the
> context of looking things up?

Here is an example from the domain of colon pathology. Given the text:

      colon, rectum

and dictionary entries of

      colon
      rectum
      rectum colon

one could argue that one, two, or all three entries should be found. In
fact, finding all three was required for a recent project in which I was
involved. It is easy to configure ConceptMapper to find all three, but
perhaps one would want to then eliminate one of the results based on some
particular domain-specific rules. Another example from the same domain is
the text:

      carcinoma in adenomatous polyp

and the dictionary contains both:

      carcinoma
      carcinoma in adenomatous polyp

If you were only looking for longest matches, the second item would be
found and you would be done. But if you allow for overlapping results, both
would be identified. Again, in the same recent project, this was a
requirement. The only exceptions were for "generic" terms like "carcinoma".
So we would generate both, then filter out the generic terms that are
subsumed by longer entries. But if they are not subsumed by a longer
entry, they would not be filtered out.
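
That particular filter boils down to something like the following
sketch (illustrative, not the shipped filter code; it drops a generic
term only when a strictly longer match covers it):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.apache.uima.jcas.tcas.Annotation;

public class GenericTermFilter {
	// Keep every match except generic terms covered by a longer match.
	public List<Annotation> filter(List<Annotation> matches, Set<String> generics) {
		List<Annotation> kept = new ArrayList<Annotation>();
		for (Annotation m : matches) {
			if (!isSubsumedGeneric(m, matches, generics)) {
				kept.add(m);
			}
		}
		return kept;
	}

	private boolean isSubsumedGeneric(Annotation m, List<Annotation> matches,
			Set<String> generics) {
		if (!generics.contains(m.getCoveredText().toLowerCase())) {
			return false;
		}
		for (Annotation other : matches) {
			boolean covers = other.getBegin() <= m.getBegin()
					&& other.getEnd() >= m.getEnd();
			boolean longer = (other.getEnd() - other.getBegin())
					> (m.getEnd() - m.getBegin());
			if (covers && longer) {
				return true;
			}
		}
		return false;
	}
}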

> > More details:
> >
> >
> Is the dictionary an external xml file?  Do you pre-process this into
> some run-time form, or load and tokenize the external dictionary every
> time this component is initialized?
>
> What does this component presume regarding memory footprint - does it
> work with large, external dictionaries without taking up very much
> "in-Ram" storage, or does it load the whole dictionary into memory in
> some internal format, for the duration of the run?

All in memory. Of course, this makes lookups pretty fast...



> > The structure of the dictionary itself is quite flexible. Entries can
have
> > any number of variants (synonyms), and arbitrary features can be
associated
> > with dictionary entries. Individual variants inherit features from
parent
> > token (i.e., the canonical form), but can override them or add
additional
> > features. In the following sample dictionary entry, there are 5
variants of
> > the canonical form, and as described earlier, each inherits the
SemClass
> > and POS attributes from the canonical form, with the exception of the
> > variant "mesenteric fibromatosis (c48.1)", which overrides the value of
the
> > SemClass attribute (this is somewhat of a contrived example, just to
make
> > that point):
> >
> >
> Is this the format of the external form of the dictionary?  Are the xml
> tags and attributes predefined, or is it up to the user to define them?

This is indeed the format of a single entry in the dictionary. The set of
attributes is not predefined, and the mapping from attributes to resultant
annotations is configurable in the AE descriptor.


> > <token canonical="abdominal fibromatosis" SemClass="Diagnosis"
POS="NN">
> >    <variant base="abdominal fibromatosis" />
> >    <variant base="abdominal desmoid" />
> >    <variant base="mesenteric fibromatosis (c48.1)"
> > SemClass="Diagnosis-Site" />
> >    <variant base="mesenteric fibromatosis" />
> >    <variant base="retroperitoneal fibromatosis" />
> > </token>
> >
> > Input tokens are processed one span at a time, where both the token and
> > span (usually a sentence) annotation type are configurable.
Additionally,
> > the particular feature of the token annotation to use for lookups can
be
> > specified, otherwise its covered text is used. Other input
configuration
> > settings are whether to use case sensitive matching, an optional class
name
> > of a stemmer to apply to the tokens, and a list of stop words to
ignore
> > during lookup. One additional input control mechanism is the ability to
> > skip tokens during lookups based on particular feature values. In this
way,
> > it is easy to skip, for example, all tokens with particular part of
speech
> > tags, or with some previously computed semantic class.
> >
> > Output is in the form of new annotations, and the type of resulting
> > annotations can be specified in a descriptor file. The mapping from
> > dictionary entry attributes to the result annotation features can also
be
> > specified. Additionally, a string containing the matched text, a list
of
> > matched tokens, and the span enclosing the match can be specified to be
set
> > in the result annotations. It is also possible to indicate dictionary
> > attributes to write back into each of the matched tokens.
> >
> > Dictionary lookup is controlled by three parameters in the descriptor,
one
> > of which allows for order-independent lookup (i.e., A B == B A),
another
> > toggles between finding only the longest match vs. finding all possible
> > matches.
> This seems to imply that the dictionary items can be multi-token things
> (as opposed to just single token lookups), with different kinds of
> matching of the input against these; is that right?


If I understand your question, the answer is yes. In the example dictionary
entry above, the "base" attributes of the variant elements contain the text
to match against, and they are all made up of multiple tokens (assuming the
tokenizer breaks on whitespace).


> > The final parameter specifies the search strategy, of which there
> > are three. The default search strategy only considers contiguous tokens
> > (not including tokens from the stop word list or otherwise skipped
tokens),
> > and then begins the subsequent search after the longest match. The
second
> > strategy allows for ignoring non-matching tokens, allowing for disjoint
> > matches, so that a dictionary entry of
> >
> >     A C
> >
> > would match against the text
> >
> >     A B C
> >
> > As with the default search strategy, the subsequent search begins after
the
> > longest match. The final search strategy is identical to the previous,
> > except that subsequent searches begin one token ahead, instead of after
the
> > previous match. This enables overlapped matching.
> >
> >
> > --
> > Michael Tanenblatt
> > IBM T.J. Watson Research Center
> > 19 Skyline Drive
> > P.O. Box 704
> > Hawthorne, NY 10532
> > USA
> > Tel: +1 (914) 784 7030 t/l 863 7030
> > Fax: +1 (914) 784 6054
> > mtan@us.ibm.com
> >
> Thanks.  -Marshall


Re: Any interest in this as an open source project?

Posted by Marshall Schor <ms...@schor.com>.
Sounds interesting; see below for some questions:

Michael A Tanenblatt wrote:
> My group would like to offer the following UIMA component, ConceptMapper,
> as an open source offering into the UIMA sandbox, assuming there is
> interest from the community:
> ConceptMapper is a token-based dictionary lookup UIMA component. It was
> designed specifically to allow any external tokenizer that is a UIMA
> component to be used to tokenize its dictionary. Using the same tokenizer
> on both the dictionary and for subsequent text processing prevents
> situations where a particular dictionary entry is not found, though it
> exists, because it was tokenized differently than the text being processed.
>   
Is the idea that the tokenizer for the dictionary is run during some 
kind of "build" process which occurs, maybe once, before the "run" process?
> ConceptMapper is highly configurable, in terms of:
>  * the way dictionary entries are mapped to resultant annotations
>  * the way input documents are processed
>  * the availability of multiple lookup strategies
>  * its various output options.
>
> Additionally, a set of post-processing filters are supplied, as well as an
> interface to easily create new filters. This allows for overgenerating
> results during the lookup phase, if so desired, then reducing the result
> set according to particular rules.
>   
Can you give some examples of "overgenerating" and "filtering" in the 
context of looking things up?
> More details:
>
>   
Is the dictionary an external xml file?  Do you pre-process this into 
some run-time form, or load and tokenize the external dictionary every 
time this component is initialized?

What does this component presume regarding memory footprint - does it 
work with large, external dictionaries without taking up very much 
"in-Ram" storage, or does it load the whole dictionary into memory in 
some internal format, for the duration of the run?
> The structure of the dictionary itself is quite flexible. Entries can have
> any number of variants (synonyms), and arbitrary features can be associated
> with dictionary entries. Individual variants inherit features from parent
> token (i.e., the canonical form), but can override them or add additional
> features. In the following sample dictionary entry, there are 5 variants of
> the canonical form, and as described earlier, each inherits the SemClass
> and POS attributes from the canonical form, with the exception of the
> variant "mesenteric fibromatosis (c48.1)", which overrides the value of the
> SemClass attribute (this is somewhat of a contrived example, just to make
> that point):
>
>   
Is this the format of the external form of the dictionary?  Are the xml 
tags and attributes predefined, or is it up to the user to define them?
> <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
>    <variant base="abdominal fibromatosis" />
>    <variant base="abdominal desmoid" />
>    <variant base="mesenteric fibromatosis (c48.1)"
> SemClass="Diagnosis-Site" />
>    <variant base="mesenteric fibromatosis" />
>    <variant base="retroperitoneal fibromatosis" />
> </token>
>
> Input tokens are processed one span at a time, where both the token and
> span (usually a sentence) annotation type are configurable. Additionally,
> the particular feature of the token annotation to use for lookups can be
> specified, otherwise its covered text is used. Other input configuration
> settings are whether to use case sensitive matching, an optional class name
> of a stemmer to apply to the tokens, and a list of stop words to ignore
> during lookup. One additional input control mechanism is the ability to
> skip tokens during lookups based on particular feature values. In this way,
> it is easy to skip, for example, all tokens with particular part of speech
> tags, or with some previously computed semantic class.
>
> Output is in the form of new annotations, and the type of resulting
> annotations can be specified in a descriptor file. The mapping from
> dictionary entry attributes to the result annotation features can also be
> specified. Additionally, a string containing the matched text, a list of
> matched tokens, and the span enclosing the match can be specified to be set
> in the result annotations. It is also possible to indicate dictionary
> attributes to write back into each of the matched tokens.
>
> Dictionary lookup is controlled by three parameters in the descriptor, one
> of which allows for order-independent lookup (i.e., A B == B A), another
> toggles between finding only the longest match vs. finding all possible
> matches. 
This seems to imply that the dictionary items can be multi-token things 
(as opposed to just single token lookups), with different kinds of 
matching of the input against these; is that right?
> The final parameter specifies the search strategy, of which there
> are three. The default search strategy only considers contiguous tokens
> (not including tokens from the stop word list or otherwise skipped tokens),
> and then begins the subsequent search after the longest match. The second
> strategy allows for ignoring non-matching tokens, allowing for disjoint
> matches, so that a dictionary entry of
>
>     A C
>
> would match against the text
>
>     A B C
>
> As with the default search strategy, the subsequent search begins after the
> longest match. The final search strategy is identical to the previous,
> except that subsequent searches begin one token ahead, instead of after the
> previous match. This enables overlapped matching.
>
>
> --
> Michael Tanenblatt
> IBM T.J. Watson Research Center
> 19 Skyline Drive
> P.O. Box 704
> Hawthorne, NY 10532
> USA
> Tel: +1 (914) 784 7030 t/l 863 7030
> Fax: +1 (914) 784 6054
> mtan@us.ibm.com
>   
Thanks.  -Marshall