Posted to dev@ctakes.apache.org by "Finan, Sean" <Se...@childrens.harvard.edu> on 2020/08/02 13:25:04 UTC

Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL]

Hi Peter,

I would guess that you are seeing things like "SOFT" because your new dictionary has a vocabulary that was not included in sno_rx_16ab.
I don't remember if OMIM (which has the 'SOFT' synonym) was included in sno_rx_16ab.  Probably not; OMIM is a more specialized vocabulary for genetics.

The term is only in the omim (and mth) vocabularies in the 2016AB umls release. 
   https://uts.nlm.nih.gov/metathesaurus.html#C3542022;0;1;CUI;2016AB;WORD;CUI;*;  

The term is in snomed in umls 2020AA, but only with the expanded full-text synonym.  It still has the abbreviation from omim.  
 https://uts.nlm.nih.gov/metathesaurus.html#SHORT%20STATURE,%20ONYCHODYSPLASIA,%20FACIAL%20DYSMORPHISM,%20AND%20HYPOTRICHOSIS;0;1;TERM;2020AA;WORD;TERM;*;

As for finding terms in adjectives, the default parts of speech (pos) that are excluded from term lookup are:
VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB

You can see what these are here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

You can override this list.  In your piper file, set the variable "exclusionTags" before adding the annotator:

// Default excluded parts of speech, plus various forms of adjective.
set exclusionTags="ADJ,JJ,JJR,JJS,VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"

//  Annotate concepts based upon default algorithms.
add DefaultJCasTermAnnotator


You'll notice that I threw in 'ADJ' for good measure.  It should not break anything.  
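
To make the effect of the list concrete, here is a minimal Python sketch of the idea.  It is not cTAKES code; the function names are hypothetical and it assumes only that a token whose pos tag is in the exclusion list is skipped for lookup.

```python
# Hypothetical illustration of an exclusion-tag filter; not the cTAKES implementation.
DEFAULT_EXCLUSION_TAGS = (
    "VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,"
    "PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"
)


def parse_tags(value):
    """Split a comma-separated tag string into a set; "" gives an empty set."""
    return {tag.strip().upper() for tag in value.split(",") if tag.strip()}


def lookup_candidates(tagged_tokens, exclusion_tags):
    """Return the words whose part of speech is not in the exclusion list."""
    excluded = parse_tags(exclusion_tags)
    return [word for word, pos in tagged_tokens if pos.upper() not in excluded]


tagged = [("The", "DT"), ("soft", "JJ"), ("tissue", "NN"), ("was", "VBD"), ("intact", "JJ")]

# Default list: adjectives are still looked up, so "soft" is a candidate.
default_hits = lookup_candidates(tagged, DEFAULT_EXCLUSION_TAGS)
# Default list plus adjective tags: "soft" is no longer a candidate.
strict_hits = lookup_candidates(tagged, "ADJ,JJ,JJR,JJS," + DEFAULT_EXCLUSION_TAGS)
# Empty list: every token is a candidate.
open_hits = lookup_candidates(tagged, "")
```

With the default list "soft" survives as a candidate, which matches the behavior Peter reported; adding the JJ tags removes it.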

I have modified this list many times for various projects.  In one I allow verbs for lookup.  For those notes the value of the additional true positives outweighed the increased false positives.  In another I actually empty the entire list to allow everything (set exclusionTags="").  I did this because there is a lot of structured text in lists and tables, but the pos tagger is trying to resolve prose text.  The pos assigned on the structured text is all over the place, and terms are missed left and right.

So ... last but definitely not least, case-sensitivity.
I started working on this a while ago, but right now it sits unfinished.

The design adds a second table to the dictionary database, in which all synonyms are upper-case.
This second table is created from synonyms that exist in the UMLS as all upper-case.
The first, "classic" table is created using ONLY synonyms from the UMLS that are lower and/or mixed case.

When the annotator engine iterates over the text, it checks one table (classic) or the other (caps) depending upon the case of the text in the note.
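
A rough Python sketch of that two-table idea, purely for illustration.  The table contents, the placeholder identifier, and the function name are assumptions; this is not the unfinished engine itself.

```python
# Hypothetical two-table lookup; table contents are illustrative only.
# "caps" holds synonyms that are all upper-case in the UMLS (acronyms such as SOFT).
CAPS_TABLE = {"SOFT": ["C3542022"]}
# "classic" holds lower/mixed-case synonyms, keyed in lower case.
# PLACEHOLDER_BED_CUI is a made-up stand-in, not a real identifier.
CLASSIC_TABLE = {"bed": ["PLACEHOLDER_BED_CUI"]}


def lookup(token):
    """Check the caps table for all-upper-case text, the classic table otherwise."""
    if token.isupper():
        return CAPS_TABLE.get(token, [])
    return CLASSIC_TABLE.get(token.lower(), [])


acronym_hit = lookup("SOFT")   # upper-case in the note: caps table is checked
adjective_hit = lookup("soft") # lower-case in the note: classic table, no acronym match
noun_hit = lookup("Bed")       # mixed case: classic table
```

Under this scheme the lower-case "soft" in "soft tissue" never reaches the acronym entry, which is the whole point of the split.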

It sounds like minor work, but it requires a new engine, new dictionary, and new dictionary creator.  None of this is difficult, but it requires time.

Anyway, I hope that some of this helps.

Sean


________________________________________
From: Peter Abramowitsch <pa...@gmail.com>
Sent: Saturday, August 1, 2020 11:35 PM
To: dev@ctakes.apache.org
Subject: Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL]

Hi Jeff thanks for your suggestions,

I spent some time in the script file and sure enough,  my 2020 UMLS
extraction actually has these two entries:

INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')
INSERT INTO PREFTERM VALUES(3542022,'SHORT STATURE, ONYCHODYSPLASIA, FACIAL
DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME')

It's unbelievable.  The UMLS entry has got to be wrong, or I'm missing
something that says it only applies (as an acronym) if it's capitalized.

In sno_rx  there is neither a CUI 3542022 nor the definition of "soft" as a
solitary word, nor even a mention of ONYCHODYSPLASIA or HYPOTRICHOSIS
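
For anyone who wants to repeat this check without grep, here is a small Python sketch that scans .script lines for CUI_TERMS rows whose text equals a given word.  It assumes the row layout seen in the entry above (CUI first, the term text as the fourth value); the function name is hypothetical.

```python
import re

# Matches rows shaped like: INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')
# The meaning of the two middle integers is not used here.
CUI_TERMS_RE = re.compile(
    r"INSERT INTO CUI_TERMS VALUES\((\d+),(\d+),(\d+),'((?:[^']|'')*)','((?:[^']|'')*)'\)"
)


def find_cuis_for_text(lines, word):
    """Return the CUIs of every CUI_TERMS row whose term text equals word (case-insensitive)."""
    hits = []
    for line in lines:
        match = CUI_TERMS_RE.search(line)
        if match is None:
            continue
        cui, text = match.group(1), match.group(4)
        if text.lower() == word.lower():
            hits.append(int(cui))
    return hits


sample = [
    "INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')",
    "INSERT INTO PREFTERM VALUES(3542022,'SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME')",
]
soft_cuis = find_cuis_for_text(sample, "soft")
short_cuis = find_cuis_for_text(sample, "short")
```

To run it on a real dictionary, pass an open file object for the .script file in place of `sample`.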

In any case, I would have thought that cTAKES would only create an event
mention from a term tagged as NN or in an NP slot, not an ADJ as in "soft tissue"

Anyway  Thanks!  Now I will keep poking around.


Peter

On Sat, Aug 1, 2020 at 5:06 PM Jeffrey Miller <je...@gmail.com> wrote:

> Sorry, I meant suggest to search for 'soft' in the dictionary file not
> 'short'
>
> grep -i ,\'soft\', *.script
>
> On Sat, Aug 1, 2020 at 7:47 PM Jeffrey Miller <je...@gmail.com> wrote:
>
> > Hi Peter,
> >
> > To my knowledge, there isn't any drastic difference in the behavior of
> the
> > dictionary gui creator and the way the sno_rx dictionary was created. I
> > originally thought there was, but I realized the difference was that I
> had
> > not installed all of UMLS to my machine (just the vocabularies I was
> > interested in) and I was missing synonyms. The first thing I would check,
> > are you able to find a matching entry in the .script file for your ctakes
> > dictionary when you do this:
> >
> > grep -i ,\'short\', *.script
> >
> > That would confirm whether or not you have a term in your dictionary made
> > up only of 'short' and whether it mapped to the CUI equal to "SHORT
> > STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS
> SYNDROME".
> > If it's not in there, something else is going on. You could do the same
> for
> > 'bed'.
> >
> > If not, another thing I might check is that I noticed you are using
> > the OverlapJCasTermAnnotator in your prior e-mail. I don't have much
> > experience with it, and I don't think it should cause this behavior, but
> I
> > wonder if that could be making the difference (as compared
> > to DefaultJCasTermAnnotator).
> >
> > Jeff
> >
> > On Sat, Aug 1, 2020 at 5:27 PM Peter Abramowitsch <
> pabramowitsch@gmail.com>
> > wrote:
> >
> >>
> >> Hi All
> >>
> >> Having created a new dictionary from the 2020AA UMLS and added Genes and
> >> Receptors to the dictionary-creator's default selections, I have a
> curious
> >> problem where cTakes now assigns the most bizarre acronyms to ordinary
> >> words used in POS contexts where it shouldn't  find <XXX>Mentions.
> >>
> >> Here are two examples:
> >>
> >> 1.   soft (in "soft tissue...")
> >> becomes   "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND
> >> HYPOTRICHOSIS SYNDROME",
> >>
> >> 2.   bed in ("The wound bed was...")
> >> becomes  "BORNHOLM EYE DISEASE"
> >>
> >> I have not changed the TermConsumer type in the descriptor XML.
> >>
> >> Are the DictionaryCreator's defaults, the equivalent to the default
> >> sno_rx that's delivered with the app?
> >>
> >> Attached is the vocab subsets list I used
> >>
> >>
> >> Peter
> >>
> >>
> >>
>

Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL] [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.
> It would have two completely different applications:  a superior way of
> finding the values of findings and a way of validating/pruning the polarity
> status of concepts that are in a semi-grammatical or improperly punctuated
> sentence
> -- Cool.  I expect to see it by end of business tomorrow.

... if only.

Peter


Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL] [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
For Peter and Jeff:

> are the vocabulary & tui selections that one finds as defaults in the dictionary creator something set by the creator as a ctakes optimization
-- Good question, and the answer is "no."  Those vocabularies and semantic types were chosen simply because they contain clinical terms of interest to earlier national studies.  The other semantic types and vocabulary terms, while present in notes, are often not of interest to "standard" clinical studies.  Adding more terms from other vocabularies and semantic types should not slow down processing to any noticeable degree.
> are the defaults governed by information the creator reads from the UMLS release
-- As far as I know there are no recommendations of this sort made by the NLM.

> It would have two completely different applications:  a superior way of finding the values of findings and a way of validating/pruning the polarity status of concepts that are in a semi-grammatical or improperly punctuated sentence
-- Cool.  I expect to see it by end of business tomorrow.

> I recently created a dictionary based off of UMLS 2020AA and did not see
> 'bed' or 'soft' mapped as synonyms to those terms in my .script file. They
> are there, but mapped to other CUIs (for example, the CUI for an actual bed
> from SNOMED). I think the difference is that I select all of the available
> TUIs on the right and when I do that 'bed' and 'soft' get assigned to
> different CUIs (with TUIs of "manufactured object" and "quantitative
> concept" respectively) and the CUI synonyms for the more clinical TUIs are
> skipped. I selected all the TUIs because the defaults seemed to be missing
> some things people might be interested in, but I did not expect the
> behavior where it would change how identical terms from other TUIs get
> included (maybe this is some kind of WSD?)
-- Yes, there is some horribly simple "WSD" being done before the dictionary is written.
What you are seeing is that SOFT only exists as two synonym entries under "Short Stature ...", while it exists as 2++ synonym entries for "bed" and/or it is the preferred text for "bed" (probably not), or something like that.
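
The exact rule is not spelled out here, so the following Python sketch is only a toy version of one such "most synonym entries wins" rule.  The function name and the CUI_A / CUI_B labels are made up for illustration; the real dictionary creator may weigh preferred text or other factors.

```python
from collections import Counter, defaultdict


def pick_cui_per_term(rows):
    """For each term text, keep the CUI that has the most synonym rows for that text.

    rows is an iterable of (text, cui) pairs.  Ties go to the CUI that sorts first.
    """
    counts = defaultdict(Counter)
    for text, cui in rows:
        counts[text.lower()][cui] += 1
    return {
        text: max(sorted(per_cui), key=lambda cui: per_cui[cui])
        for text, per_cui in counts.items()
    }


# CUI_A and CUI_B are placeholders, not real UMLS identifiers.
rows = [
    ("soft", "CUI_A"),
    ("soft", "CUI_B"),
    ("Soft", "CUI_B"),
    ("SOFT", "CUI_B"),
]
winners = pick_cui_per_term(rows)
```

A rule like this would explain why selecting more TUIs can change which CUI a short word such as 'soft' ends up under.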

>but I imagine it could cause other misses.
-- True.  It is really difficult to make the perfect dictionary for any purpose.  So, we just go for the best coverage and fewest extraneous entries - or fewest frequently discovered extraneous entries.  "Bed" may not be a problem for notes on outpatient visits.  For inpatient notes it would be a different story.

And of course, once you get a great set of terms, you get to play with the valid parts of speech.  You decide on grabbing every term or only the longest overlapping terms.  Allow discontinuous spans or require continuous spans.
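
As an illustration of the "only the longest overlapping terms" choice, here is a short Python sketch.  It is a generic span filter under my own assumptions (spans as begin, end, label tuples with placeholder labels), not one of the cTAKES term consumers.

```python
def keep_longest(spans):
    """Drop every span that lies entirely inside a strictly longer span.

    spans is a list of (begin, end, label) tuples with end exclusive.
    """
    kept = []
    for begin, end, label in spans:
        covered = any(
            other_begin <= begin and end <= other_end
            and (other_end - other_begin) > (end - begin)
            for other_begin, other_end, _ in spans
        )
        if not covered:
            kept.append((begin, end, label))
    return kept


# Offsets into the text "soft tissue"; labels are placeholders.
all_terms = [
    (0, 4, "TERM_SOFT"),
    (0, 11, "TERM_SOFT_TISSUE"),
    (5, 11, "TERM_TISSUE"),
]
longest_only = keep_longest(all_terms)
```

Keeping every term would return all three spans; keeping only the longest drops the stray "soft" match in favor of "soft tissue".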

Fun.




Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.
Many thanks Sean and Jeff.  You guys must be both on the East Coast, because my coffee has only just kicked in enough to digest your lucid replies.   Super helpful information.  It sounds like the quick and dirty solution is to rebuild the dictionary without the OMIM and MTH vocabularies.  So it’s not a case of a CUI being remapped - but that it’s being layered onto by a particular vocabulary adding a synonym (which in this case is probably very rarely used) 
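
A rough sketch of what leaving out those vocabularies amounts to, for anyone filtering UMLS rows by hand rather than through the dictionary creator's selection list.  It assumes the standard pipe-delimited MRCONSO.RRF layout, where the source abbreviation (SAB) is the 12th field and the term string (STR) the 15th; the sample rows use made-up LUI/SUI/AUI/CODE values.

```python
EXCLUDED_SABS = {"OMIM", "MTH"}
SAB_FIELD = 11  # zero-based index of SAB in MRCONSO.RRF
STR_FIELD = 14  # zero-based index of STR in MRCONSO.RRF


def keep_row(line, excluded=EXCLUDED_SABS):
    """Return True when the row is well-formed and its SAB is not excluded."""
    fields = line.rstrip("\n").split("|")
    return len(fields) > STR_FIELD and fields[SAB_FIELD] not in excluded


def term_of(line):
    """Return the STR field of a MRCONSO row."""
    return line.rstrip("\n").split("|")[STR_FIELD]


# Synthetic rows; only the CUI, SAB, and STR values reflect this thread.
rows = [
    "C3542022|ENG|P|L0000001|PF|S0000001|Y|A0000001||||OMIM|PT|X|SOFT|0|N||",
    "C3542022|ENG|S|L0000002|PF|S0000002|Y|A0000002||||SNOMEDCT_US|PT|X|"
    "Short stature, onychodysplasia, facial dysmorphism, and hypotrichosis syndrome|0|N||",
]
kept_terms = [term_of(row) for row in rows if keep_row(row)]
```

Only the full-text SNOMED synonym survives, so the bare 'soft' entry would no longer be written to the dictionary.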

One question related to that: are the vocabulary and TUI selections that one finds as defaults in the dictionary creator something chosen by its author as a cTAKES optimization, or are the defaults governed by information the tool reads from the UMLS release?

And thanks for mentioning the capitalization project.  I had been looking in vain for that functionality which I had assumed was already there.  You can tell that these are still my first experiences with dictionary building.
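
For my own understanding, here is a rough Python sketch of the two-table idea as Sean described it: synonyms that are all upper-case in the UMLS go into a "caps" table, everything else goes into the "classic" table, and the case of the text in the note decides which table is consulted.  This is only an illustration, not how cTAKES is implemented.  The function names and the two placeholder CUI strings for "bed" are made up; only C3542022 comes from this thread.

```python
def build_tables(synonyms):
    """Split (text, cui) pairs into a classic table and an all-caps table."""
    classic, caps = {}, {}
    for text, cui in synonyms:
        if text.isupper():
            # Synonym exists in the source as all upper-case: keep it verbatim.
            caps.setdefault(text, set()).add(cui)
        else:
            # Lower or mixed case: store a lower-cased key for loose matching.
            classic.setdefault(text.lower(), set()).add(cui)
    return classic, caps


def lookup(span, classic, caps):
    """Pick the table from the case of the note text, then look the span up."""
    if span.isupper():
        return caps.get(span, set())
    return classic.get(span.lower(), set())


SYNONYMS = [
    ("SOFT", "C3542022"),                    # abbreviation discussed in this thread
    ("BED", "PLACEHOLDER_CUI_EYE_DISEASE"),  # hypothetical identifier
    ("bed", "PLACEHOLDER_CUI_FURNITURE"),    # hypothetical identifier
]

classic, caps = build_tables(SYNONYMS)
print(lookup("soft", classic, caps))  # lower-case note text never reaches the caps table
print(lookup("SOFT", classic, caps))
print(lookup("bed", classic, caps))
```

With a split like this, "soft" in "soft tissue" would find nothing, while a capitalized "SOFT" would still resolve to the syndrome.  The obvious catch is a note typed entirely in capitals, which would only ever see the caps table.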

I appreciate how difficult it is to find the time to build enhancements to the product when one is so busy just using it.  There’s an enhancement I’ve been prototyping for months which brings in some functionality from the Stanford NLP project, but I just don’t have the time or energy to productize it.  It would have two completely different applications: a superior way of finding the values of findings, and a way of validating/pruning the polarity status of concepts that are in a semi-grammatical or improperly punctuated sentence, such as “Denies headache, abdominal pain, temperature normal”.

Maybe one day....

Thanks again
Peter



Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL]

Posted by Jeffrey Miller <je...@gmail.com>.
Hi Peter and Sean,

I recently created a dictionary based off of UMLS 2020AA and did not see
'bed' or 'soft' mapped as synonyms to those terms in my .script file. They
are there, but mapped to other CUIs (for example, the CUI for an actual bed
from SNOMED). I think the difference is that I select all of the available
TUIs on the right, and when I do that 'bed' and 'soft' get assigned to
different CUIs (with TUIs of "manufactured object" and "quantitative
concept" respectively) and the CUI synonyms for the more clinical TUIs are
skipped. I selected all the TUIs because the defaults seemed to be missing
some things people might be interested in, but I did not expect that doing
so would change how identical terms from other TUIs get included (maybe
this is some kind of WSD?). In this case the behavior seems preferable,
since it prevents inclusion of unlikely synonyms (and mentions of "bed"
probably get mapped to the fallback of IdentityMention by cTAKES), but I
imagine it could cause other misses.
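
A quicker way than grep to compare two builds might be a small script that
lists which CUIs each single term ends up under. The following is only a
sketch: it assumes the .script file contains rows shaped like the one Peter
posted (INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')), with the
numeric CUI as the first value and the synonym text as the fourth. The
meaning of the other columns is not relied on, and the file name in the
usage note is hypothetical.

```python
import re
from collections import defaultdict

# Matches rows such as: INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')
# Assumption: value 1 is the numeric CUI and value 4 is the synonym text.
CUI_TERMS_ROW = re.compile(
    r"INSERT INTO CUI_TERMS VALUES\((\d+),\d+,\d+,'((?:[^']|'')*)','(?:[^']|'')*'\)"
)


def cuis_by_term(script_lines):
    """Map each synonym text to the set of numeric CUIs it is listed under."""
    table = defaultdict(set)
    for line in script_lines:
        match = CUI_TERMS_ROW.search(line)
        if match:
            cui = int(match.group(1))
            text = match.group(2).replace("''", "'")  # undo SQL quote doubling
            table[text].add(cui)
    return table


def report(script_path, terms):
    """Print the CUIs found for each term in one dictionary .script file."""
    with open(script_path, encoding="utf-8", errors="replace") as handle:
        table = cuis_by_term(handle)
    for term in terms:
        print(term, sorted(table.get(term, ())))
```

Calling report("my_dictionary.script", ["soft", "bed"]) against a
default-TUI build and an all-TUI build should show whether the two terms
really move to different CUIs between them.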

Jeff
