Posted to user@ctakes.apache.org by "Lee, Richard A. [USA]" <le...@bah.com> on 2014/01/03 00:00:30 UTC

How to augment/modify UMLS resources?

Howdy, all. I’ve got a lot of experience with various commercial extraction tools, but I’m new to cTAKES and UIMA, so please bear with me.

I am able to use my UMLS credentials to process documents, and the results are good. But there are a few things I wish to change in the medfacts.types.Concept and AnatomicalSiteMention areas, for starters. For example, while it annotates “orbicularis oculi” as a concept, it does not annotate “musculus orbicularis oculi”, “septum orbital”, or “oculi medialis”. It annotates “ulceration”, “perforation”, and “corneal perforation” but not “corneal ulceration”. It annotates “men” (as in “Chinese men”) as a “problem”. It annotates “ER” (i.e., Emergency Room) as an AnatomicalSiteReference.

So, the question becomes, how do I address these? Do I need to somehow re-generate (with changes) the UMLS data files, probably using Luke or some such? That seems a bit crude. Is there a clean way to supplement those data files instead to achieve the desired changes?

Thanks in advance.

------------------------------------------------------------------------------------------------------------
Richard A Lee || Lead Associate / Senior Ontologist || lee_richard@bah.com || 571-482-7809

Re: How to augment/modify UMLS resources?

Posted by vijay garla <vn...@gmail.com>.

With YTEX (which is close to being ready to be merged into trunk) we have a
utility that simplifies creating custom database dictionaries from UMLS.

What we do is:
* we create a table that has the AUI, tokenized string, and first word of
every term from MRCONSO
* we then join that table with MRCONSO and other tables to filter by SAB
(source vocabulary), TUI (semantic type, via a join with MRSTY), and any
other UMLS attribute you like

This approach obviates the need to write code for every subset of the
UMLS you want to create a dictionary for (you just have to write a
SELECT ... INTO SQL statement).
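
The two steps above can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module, not the actual YTEX utility: the table name term_tokens, the cut-down column lists, and the CUI/AUI values are all made up for the example, and SQLite spells "select into" as CREATE TABLE ... AS SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny stand-ins for the UMLS tables; the real MRCONSO and MRSTY have many more columns.
# term_tokens is a hypothetical helper table: AUI, tokenized string, first word.
cur.executescript("""
CREATE TABLE mrconso (cui TEXT, aui TEXT, sab TEXT, str TEXT);
CREATE TABLE mrsty (cui TEXT, tui TEXT);
CREATE TABLE term_tokens (aui TEXT, tokenized_str TEXT, first_word TEXT);
""")

# Made-up identifiers, for illustration only.
cur.executemany("INSERT INTO mrconso VALUES (?, ?, ?, ?)", [
    ("C0000001", "A0000001", "SNOMEDCT", "Corneal ulceration"),
    ("C0000002", "A0000002", "MSH", "Orbicularis oculi"),
    ("C0000003", "A0000003", "SNOMEDCT", "Men"),
])
cur.executemany("INSERT INTO mrsty VALUES (?, ?)", [
    ("C0000001", "T047"),
    ("C0000002", "T023"),
    ("C0000003", "T098"),
])

# Step 1: build the helper table from every term in MRCONSO
# (whitespace tokenization here; a real tokenizer would do more).
for aui, term in cur.execute("SELECT aui, str FROM mrconso").fetchall():
    tokens = term.lower().split()
    cur.execute("INSERT INTO term_tokens VALUES (?, ?, ?)",
                (aui, " ".join(tokens), tokens[0]))

# Step 2: join and filter by source vocabulary (SAB) and semantic type (TUI).
cur.execute("""
CREATE TABLE custom_dict AS
SELECT c.cui, m.tui, t.first_word, t.tokenized_str
FROM term_tokens t
JOIN mrconso c ON c.aui = t.aui
JOIN mrsty m ON m.cui = c.cui
WHERE c.sab = 'SNOMEDCT' AND m.tui IN ('T047')
""")

dictionary = cur.execute(
    "SELECT cui, tui, first_word, tokenized_str FROM custom_dict").fetchall()
print(dictionary)
```

Changing the WHERE clause is all it takes to produce a different subset, which is the point of the approach.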

-vj



On Fri, Jan 3, 2014 at 4:45 PM, Masanz, James J. <Ma...@mayo.edu> wrote:

> The separately downloadable UMLS dictionary formatted for cTAKES [1], not
> counting medication names (RxNorm), is in a database [2]. So you could add
> to that database whatever terms you want.
>
> The RxNorm dictionary is in a Lucene index (though there is a related Jira
> ticket open, so it may end up in the same database). Adding to the
> currently used medications list would therefore probably best be done
> programmatically using the Lucene API (someone with more Lucene end-user
> experience, please chime in).
>
> cTAKES provides a way to look up terms in a flatfile dictionary that you
> would provide. See the files that end with .csv within
>
> ctakes-dictionary-lookup-res\src\main\resources\org\apache\ctakes\dictionary\lookup
>
> The flatfile is not used directly in conjunction with the database file of
> terms from UMLS – to use the two together, you would have one annotator
> configured to use that flatfile for the dictionary, and have a second
> annotator configured to use the database file.
>
> Some things to be aware of if you went that route:
>  - each note would be processed by both, and if you had terms in your
> flatfile that duplicated what was in the database, you would end up with
> double annotations
>  - each note would be processed in effect twice (not by the entire
> pipeline, thankfully), so it would be slower than just using one.
>
> As for something being annotated that you don't want annotated, the
> LookupDesc*.xml file being used can contain an excludeList, so that
> "men" is no longer annotated.  See LookupDesc_DrugNER.xml for an example
> of using excludeList.
>
> Any improvements or even written steps on any of the above would be a
> great contribution.
>
> -- James
>
> [1] http://sourceforge.net/projects/ctakesresources/files/
> [2] the relative path to the hsql db is
> resources\org\apache\ctakes\dictionary\lookup\umls2011ab

RE: How to augment/modify UMLS resources?

Posted by "Masanz, James J." <Ma...@mayo.edu>.

The separately downloadable UMLS dictionary formatted for cTAKES [1], not counting medication names (RxNorm), is in a database [2]. So you could add to that database whatever terms you want.

The RxNorm dictionary is in a Lucene index (though there is a related Jira ticket open, so it may end up in the same database). Adding to the currently used medications list would therefore probably best be done programmatically using the Lucene API (someone with more Lucene end-user experience, please chime in).

cTAKES provides a way to look up terms in a flatfile dictionary that you would provide. See the files that end with .csv within
ctakes-dictionary-lookup-res\src\main\resources\org\apache\ctakes\dictionary\lookup
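
As a rough illustration of preparing such a flatfile, here is a small Python sketch that writes the missing terms from the original question to a delimited text stream. The pipe-delimited code|term layout and the LOCAL001-style codes are assumptions made for this example, not a documented cTAKES format, so compare against the bundled .csv files before relying on it.

```python
import csv
import io


def write_flatfile(entries, out, delimiter="|"):
    """Write (code, term) pairs to a file-like object, one entry per line.

    The column order and delimiter are assumptions; adjust them to match
    the .csv files shipped with cTAKES.
    """
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for code, term in entries:
        writer.writerow([code, term.strip()])


# Hypothetical local codes paired with terms that were not being annotated.
entries = [
    ("LOCAL001", "musculus orbicularis oculi"),
    ("LOCAL002", "septum orbital"),
    ("LOCAL003", "oculi medialis"),
    ("LOCAL004", "corneal ulceration"),
]

buf = io.StringIO()
write_flatfile(entries, buf)
flatfile_text = buf.getvalue()
print(flatfile_text)
```

To write an actual file, pass an open file handle (opened with newline="") instead of the StringIO buffer.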

The flatfile is not used directly in conjunction with the database file of terms from UMLS – to use the two together, you would have one annotator configured to use that flatfile for the dictionary, and have a second annotator configured to use the database file.
 
Some things to be aware of if you went that route:
 - each note would be processed by both, and if you had terms in your flatfile that duplicated what was in the database, you would end up with double annotations
 - each note would be processed in effect twice (not by the entire pipeline, thankfully), so it would be slower than just using one.
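
One way to avoid the double annotations is to check the flatfile terms against the terms already in the database before deploying the flatfile. The Python sketch below assumes you have already exported both term lists as plain strings (how you export them from the hsql db is not shown), and the function names are made up for the example.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())


def split_new_and_duplicate(flatfile_terms, database_terms):
    """Separate flatfile terms into those absent from and present in the database."""
    known = {normalize(t) for t in database_terms}
    new_terms, duplicate_terms = [], []
    for term in flatfile_terms:
        if normalize(term) in known:
            duplicate_terms.append(term)
        else:
            new_terms.append(term)
    return new_terms, duplicate_terms


# Sample data taken from the terms mentioned in the original question.
database_terms = ["orbicularis oculi", "ulceration", "perforation", "corneal perforation"]
flatfile_terms = ["Corneal  Ulceration", "corneal perforation", "septum orbital"]

new_terms, duplicate_terms = split_new_and_duplicate(flatfile_terms, database_terms)
print(new_terms)
print(duplicate_terms)
```

Only the terms in new_terms would then go into the flatfile; this does not help with the second point, since each note is still processed by both annotators.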

As for something being annotated that you don't want annotated, the LookupDesc*.xml file being used can contain an excludeList, so that "men" is no longer annotated.  See LookupDesc_DrugNER.xml for an example of using excludeList.

Any improvements or even written steps on any of the above would be a great contribution.

-- James

[1] http://sourceforge.net/projects/ctakesresources/files/
[2] the relative path to the hsql db is 
resources\org\apache\ctakes\dictionary\lookup\umls2011ab
