Posted to dev@ctakes.apache.org by "Geise, Brandon D." <bd...@geisinger.edu> on 2015/10/19 14:50:22 UTC

RE: Fast Dictionary Update

Hi Sean,

I finally had a chance to look further at the SNOMEDCT issue, where the codingScheme is populated with the default value.  What I found is that in the dictionary tool, when CodeMapCreator calls the CuiCodesDbWriter, the collection uses the name passed into the method, which is SNOMEDCT.  However, if you are using SNOMEDCT_US, the collection name is SNOMEDCT_US instead of SNOMEDCT, so the hsqldb is never populated.  Obviously an easy change to make, but I thought it might be helpful feedback.
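
The mismatch described above can be illustrated with a small sketch. This is not the cTAKES code: the dictionary layout, the alias table and the function name are all hypothetical, and the CUI and code values are placeholders. It only shows why a lookup under "SNOMEDCT" comes back empty when the data was collected under "SNOMEDCT_US", and one possible way to tolerate both names.

```python
# Hypothetical illustration only; none of these names come from the cTAKES source.
# Maps a requested source name to the names it may appear under in a UMLS release.
SOURCE_ALIASES = {"SNOMEDCT": ("SNOMEDCT", "SNOMEDCT_US")}


def codes_for_source(codes_by_source, requested):
    """Return the CUI-to-codes map for `requested`, trying known aliases of the name."""
    for name in SOURCE_ALIASES.get(requested, (requested,)):
        if name in codes_by_source:
            return codes_by_source[name]
    return {}  # nothing collected under any known name


# Placeholder data: codes were collected under the newer source name.
collected = {"SNOMEDCT_US": {"C0000001": ["111"]}}

# A direct lookup under the old name finds nothing, so nothing would be written.
direct = collected.get("SNOMEDCT", {})
# The alias-aware lookup finds the same data under either name.
tolerant = codes_for_source(collected, "SNOMEDCT")
```

Whether an alias table or a simple rename is the better fix depends on how the tool is used; the sketch only makes the failure mode concrete.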

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu] 
Sent: Monday, September 21, 2015 10:39 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

Sorry for the late reply - I've been out for an extended weekend.

The coding scheme change is fairly simply explained (imo).  The plain old CUI is not a snomed code.  If the snomed codes are reported by ctakes (uncomment the snomed line in cTakesHsql.xml) then their UmlsConcept entries in the ontology array have the coding scheme name "SNOMEDCT".
            <!-- Optional tables for optional term info.
            Uncommenting these lines alone may not persist term information;
            persistence depends upon the TermConsumer.  -->
            <property key="snomedTable" value="snomedct"/>

Basically, the "CTAKES" name indicates that the scheme only contains Umls Cuis that have TUIs of the default ctakes configuration.  ctakes does not use all umls tuis, therefore I did not name the scheme "UMLS".  If you make a custom scheme (etc.) you can change the name in cTakesHsql.xml or in a custom .xml
          <!-- Depending upon the consumer, the value of codingScheme may or may not be used.  With the packaged consumers,
          codingScheme is a default value used only for cuis that do not have secondary codes (snomed, rxnorm, etc.)  -->
         <property key="codingScheme" value="CTAKES"/>


The "RelationsExtractor" in the dictionary creator tool is completely experimental and unfinished - but perhaps some day it will throw umls relations into a format that ctakes can directly use.  For the time being it should be avoided.

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu] 
Sent: Thursday, September 17, 2015 10:23 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

You can disregard my question about the relation extraction as I fixed this by building the new dictionary with the default data files in the dictionarytool.  I am curious about the SNOMED change still though.

Thanks,
Brandon

-----Original Message-----
From: Geise, Brandon D. 
Sent: Thursday, September 17, 2015 9:40 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: RE: Fast Dictionary Update

Thanks Dmitriy.  I was referring to the RelationsExtractor class found in the dictionarytool.  On a similar note, the coding scheme for all SNOMEDCT codes for the new dictionary is CTAKES compared to SNOMED with the UMLS version packaged with cTakes.  Is there something else I need to run for the dictionary creation that I'm missing?

Thanks,
Brandon

-----Original Message-----
From: Dligach, Dmitriy [mailto:Dmitriy.Dligach@childrens.harvard.edu] 
Sent: Thursday, September 17, 2015 8:42 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: Re: Fast Dictionary Update

Hi Brandon,

Relation extraction at the moment only handles two specific relation types: LocationOf and DegreeOf. You are welcome to run it if you need these specific relations.


Dima

--
Dmitriy (Dima) Dligach, Ph.D.
Boston Children's Hospital and Harvard Medical School
(617) 651-0397



On Sep 17, 2015, at 17:08, Geise, Brandon D. <bd...@geisinger.edu> wrote:

Does the RelationsExtractor need to be run in order to generate information on relationships from cTakes?  When running with 2011 UMLS dictionary I'm able to get relationships for BodyLocationMentions but with the dictionary I created I am not getting this information.  Any advice?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 1:18 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

It claims that the database is connected, and the preceding line of dots was printed during loading, which took ~3-4 seconds (so something was there):
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

Strange.  I don't really know what to tell you right now.  Perhaps something will click with me later ...


Did you also run org.apache.ctakes.dictionarytool.CodeMapCreator ?  It isn't strictly necessary but it stores the tuis in the database so that cTakes can identify the semantic group of a mention.




-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 1:02 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Not specifically loaded.  Here's what I see when loading the pipeline:

17 Sep 2015 12:58:54  INFO JdbcConnectionFactory - Connecting to jdbc:hsqldb:file:path/to/ctakes/ctakes-dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/UMLS2015/snorx2015:
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 12:57 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Making an alternate copy of cTakesHsql.xml and pointing to the new dictionary is all that is necessary.  Do you see a message in the initialization output indicating that the dictionary db has been loaded?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 12:54 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Great, thanks.  Both seemed to work for populating the script table.

Besides the path to the new dictionary needing to be changed in cTakesHsql.xml, does anything else need to be modified to use the new dictionary?  My pipeline runs; however, there aren't any annotations related to the UMLS concepts.  The only annotations I'm seeing are date, roman numeral, or modifier related. (My pipeline is UMLSFastProcessor with additions for modifiers and templatefiller.)  Any suggestions would be appreciated.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 10:40 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Correct, Hsql should automatically read the .log file upon first use, and then perform the inserts into the .script file.

In case you want to play it safe, check the README in the resource/ directory (where you got the hsqldb template).  The last paragraph indicates how you can launch a simple sql tool to play with the db.  You will need to change the name of the db accordingly.  Upon first launch of the sql tool everything should be moved from the .log to the .script file.   It is a strange setup/workflow, but it seems to work.
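
A quick way to confirm that the inserts actually ended up in the .script file is to count them. The sketch below assumes that HSQLDB stores each row of an in-memory table as one `INSERT INTO ...` statement per line in the .script (and .log) file; the function names are hypothetical and the file path is whatever your database is called.

```python
def count_inserts(lines):
    """Count lines that look like SQL INSERT statements (case-insensitive)."""
    return sum(
        1 for line in lines
        if line.lstrip().upper().startswith("INSERT INTO ")
    )


def count_inserts_in_file(path):
    """Count INSERT statements in an HSQLDB .script or .log file at `path`."""
    # errors="replace" keeps the count going even if a term has odd bytes in it.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_inserts(handle)


# Small in-memory example with made-up statements.
sample = [
    "CREATE MEMORY TABLE X(A INTEGER)\n",
    "INSERT INTO X VALUES(1)\n",
    "insert into X values(2)\n",
]
sample_count = count_inserts(sample)
```

If the count for the .script file is zero while the .log file has many inserts, the database has probably not been opened and closed yet since the tool ran.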

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 10:31 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

When I run the tool it outputs a file with a .log extension that has all the insert statements.  Do I copy this to the .script template from memcachedb in the dictionarytool project or should the inserts be put into the .script file by default on the program execution?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:59 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Excellent!

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 9:55 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

No, I had changed it on the Tiny source file.  I just changed the default file and it looks to be running as expected now.

Thank you for all your help and patience,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:35 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Did you add it to data/default/CtakesSources.txt ?

If not then you need to specify -src ./data/tiny/CtakesSources.txt

Sorry for any confusion.

As soon as my inet isn't overloaded I'll download 2015AA and see if I can build a dictionary.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 8:14 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Sean,

I added that and still had the same issue.

Thanks,
Brandon
_____________________________
From: Finan, Sean <se...@childrens.harvard.edu>
Sent: Wednesday, September 16, 2015 7:56 PM
Subject: RE: Fast Dictionary Update
To: <de...@ctakes.apache.org>


And you added "SNOMEDCT_US" to data/tiny/CtakesSources.txt ?

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu]
Sent: Wednesday, September 16, 2015 7:13 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I have exactly the same problem with the tool.

A grep on MRCONSO.RRF for "SNOMEDCT" or for "SNOMEDCT_US" shows many lines.
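
Note that a plain grep for "SNOMEDCT" also matches every line whose source is "SNOMEDCT_US", so it cannot tell the two names apart. A minimal sketch that counts rows per exact source name is below. It assumes the standard UMLS layout in which MRCONSO.RRF is pipe-delimited and the source abbreviation (SAB) is the 12th field; the function names are hypothetical and the sample rows are made up.

```python
from collections import Counter

SAB_COLUMN = 11  # zero-based index of the SAB (source abbreviation) field


def count_sources(lines, prefix="SNOMEDCT"):
    """Count rows per exact SAB value, keeping only SABs that start with `prefix`."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= SAB_COLUMN:
            continue  # skip malformed or empty rows
        sab = fields[SAB_COLUMN]
        if sab.startswith(prefix):
            counts[sab] += 1
    return counts


def count_sources_in_file(path, prefix="SNOMEDCT"):
    """Same as count_sources, reading rows from the MRCONSO.RRF file at `path`."""
    with open(path, encoding="utf-8") as rrf:
        return count_sources(rrf, prefix)


# Made-up rows in MRCONSO column order, for illustration only.
sample_rows = [
    "C0000001|ENG|P|L1|PF|S1|Y|A1||111||SNOMEDCT_US|PT|111|term one|0|N||\n",
    "C0000001|ENG|S|L2|PF|S2|Y|A2||111||SNOMEDCT_US|SY|111|term two|0|N||\n",
    "C0000002|ENG|P|L3|PF|S3|Y|A3||222||RXNORM|IN|222|term three|0|N||\n",
]
sample_counts = dict(count_sources(sample_rows))
```

If the result contains only "SNOMEDCT_US" and no "SNOMEDCT", then a sources file that lists only "SNOMEDCT" will select nothing from that release.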

________________________________________
From: Geise, Brandon D. [bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 5:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, it finds "SNOMEDCT_US".

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 5:17 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ah, now I see what you mean. Can you do a grep on your MRCONSO.RRF for "SNOMEDCT" ?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 4:04 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I tried changing as suggested.

Below is what I see for the snomed piece, but for RXNorm it writes terms at the end.

Reading list of Source Types from ./data/default/CtakesSources.txt
File Lines 1 list of Source Types 1
Reading list of Tuis from ./data/tiny/CtakesSnomedTuis.txt
File Lines 24 list of Tuis 24
Compiling list of Cuis with wanted Tuis using /patto/UMLS_Current_Version/META/MRSTY.RRF
File Line 200000 Cuis 60895
File Line 300000 Cuis 85750
File Line 400000 Cuis 135098
File Line 600000 Cuis 183925
File Line 1700000 Cuis 376338
File Line 1800000 Cuis 471009
File Line 1900000 Cuis 568375
File Line 2100000 Cuis 674715
File Line 2800000 Cuis 903583
File Line 3300000 Cuis 973791
File Lines 3370173 Cuis 999451
..................................................File Line 100000 Valid Cuis 0
..................................................File Line 200000 Valid Cuis 0
..................................................File Line 300000 Valid Cuis 0
..................................................File Line 400000 Valid Cuis 0
..................................................File Line 500000 Valid Cuis 0
..................................................File Line 600000 Valid Cuis 0
..................................................File Line 700000 Valid Cuis 0
..................................................File Line 800000 Valid Cuis 0
..................................................File Line 900000 Valid Cuis 0
..................................................File Line 1000000 Valid Cuis 0
..................................................File Line 1100000 Valid Cuis 0
..................................................File Line 1200000 Valid Cuis 0
..................................................File Line 1300000 Valid Cuis 0
..................................................File Line 1400000 Valid Cuis 0
..................................................File Line 1500000 Valid Cuis 0
..................................................File Line 1600000 Valid Cuis 0
..................................................File Line 1700000 Valid Cuis 0
..................................................File Line 1800000 Valid Cuis 0
..................................................File Line 1900000 Valid Cuis 0
..................................................File Line 2000000 Valid Cuis 0
..................................................File Line 2100000 Valid Cuis 0
..................................................File Line 2200000 Valid Cuis 0
..................................................File Line 2300000 Valid Cuis 0
..................................................File Line 2400000 Valid Cuis 0
..................................................File Line 2500000 Valid Cuis 0
..................................................File Line 2600000 Valid Cuis 0
..................................................File Line 2700000 Valid Cuis 0
..................................................File Line 2800000 Valid Cuis 0
..................................................File Line 2900000 Valid Cuis 0
..................................................File Line 3000000 Valid Cuis 0
..................................................File Line 3100000 Valid Cuis 0
..................................................File Line 3200000 Valid Cuis 0
..................................................File Line 3300000 Valid Cuis 0
..................................................File Line 3400000 Valid Cuis 0
..................................................File Line 3500000 Valid Cuis 0
..................................................File Line 3600000 Valid Cuis 0
..................................................File Line 3700000 Valid Cuis 0
..................................................File Line 3800000 Valid Cuis 0
..................................................File Line 3900000 Valid Cuis 0
..................................................File Line 4000000 Valid Cuis 0
..................................................File Line 4100000 Valid Cuis 0
..................................................File Line 4200000 Valid Cuis 0
..................................................File Line 4300000 Valid Cuis 0
..................................................File Line 4400000 Valid Cuis 0
..................................................File Line 4500000 Valid Cuis 0
..................................................File Line 4600000 Valid Cuis 0
..................................................File Line 4700000 Valid Cuis 0
..................................................File Line 4800000 Valid Cuis 0
..................................................File Line 4900000 Valid Cuis 0
..................................................File Line 5000000 Valid Cuis 0
..................................................File Line 5100000 Valid Cuis 0
..................................................File Line 5200000 Valid Cuis 0
..................................................File Line 5300000 Valid Cuis 0
..................................................File Line 5400000 Valid Cuis 0
..................................................File Line 5500000 Valid Cuis 0
..................................................File Line 5600000 Valid Cuis 0
..................................................File Line 5700000 Valid Cuis 0
..................................................File Line 5800000 Valid Cuis 0
..................................................File Line 5900000 Valid Cuis 0
..................................................File Line 6000000 Valid Cuis 0
..................................................File Line 6100000 Valid Cuis 0
..................................................File Line 6200000 Valid Cuis 0
..................................................File Line 6300000 Valid Cuis 0
..................................................File Line 6400000 Valid Cuis 0
..................................................File Line 6500000 Valid Cuis 0
..................................................File Line 6600000 Valid Cuis 0
..................................................File Line 6700000 Valid Cuis 0
..................................................File Line 6800000 Valid Cuis 0
..................................................File Line 6900000 Valid Cuis 0
..................................................File Line 7000000 Valid Cuis 0
..................................................File Line 7100000 Valid Cuis 0
..................................................File Line 7200000 Valid Cuis 0
..................................................File Line 7300000 Valid Cuis 0
..................................................File Line 7400000 Valid Cuis 0
..................................................File Line 7500000 Valid Cuis 0
..................................................File Line 7600000 Valid Cuis 0
..................................................File Line 7700000 Valid Cuis 0
..................................................File Line 7800000 Valid Cuis 0
..................................................File Line 7900000 Valid Cuis 0
..................................................File Line 8000000 Valid Cuis 0
..................................................File Line 8100000 Valid Cuis 0
..................................................File Line 8200000 Valid Cuis 0
..................................................File Line 8300000 Valid Cuis 0
..................................................File Line 8400000 Valid Cuis 0
..................................................File Line 8500000 Valid Cuis 0
..................................................File Line 8600000 Valid Cuis 0
..................................................File Line 8700000 Valid Cuis 0
..................................................File Line 8800000 Valid Cuis 0
.............File Lines 8827152 Valid Cuis 0
Compiling map of Umls Cuis and Texts
..................................................File Line 100000 Terms 0
..................................................File Line 200000 Terms 0
..................................................File Line 300000 Terms 0
..................................................File Line 400000 Terms 0
..................................................File Line 500000 Terms 0
..................................................File Line 600000 Terms 0
..................................................File Line 700000 Terms 0
..................................................File Line 800000 Terms 0
..................................................File Line 900000 Terms 0
..................................................File Line 1000000 Terms 0
..................................................File Line 1100000 Terms 0
..................................................File Line 1200000 Terms 0
..................................................File Line 1300000 Terms 0
..................................................File Line 1400000 Terms 0
..................................................File Line 1500000 Terms 0
..................................................File Line 1600000 Terms 0
..................................................File Line 1700000 Terms 0
..................................................File Line 1800000 Terms 0
..................................................File Line 1900000 Terms 0
..................................................File Line 2000000 Terms 0
..................................................File Line 2100000 Terms 0
..................................................File Line 2200000 Terms 0
..................................................File Line 2300000 Terms 0
..................................................File Line 2400000 Terms 0
..................................................File Line 2500000 Terms 0
..................................................File Line 2600000 Terms 0
..................................................File Line 2700000 Terms 0
..................................................File Line 2800000 Terms 0
..................................................File Line 2900000 Terms 0
..................................................File Line 3000000 Terms 0
..................................................File Line 3100000 Terms 0
..................................................File Line 3200000 Terms 0
..................................................File Line 3300000 Terms 0
..................................................File Line 3400000 Terms 0
..................................................File Line 3500000 Terms 0
..................................................File Line 3600000 Terms 0
..................................................File Line 3700000 Terms 0
..................................................File Line 3800000 Terms 0
..................................................File Line 3900000 Terms 0
..................................................File Line 4000000 Terms 0
..................................................File Line 4100000 Terms 0
..................................................File Line 4200000 Terms 0
..................................................File Line 4300000 Terms 0
..................................................File Line 4400000 Terms 0
..................................................File Line 4500000 Terms 0
..................................................File Line 4600000 Terms 0
..................................................File Line 4700000 Terms 0
..................................................File Line 4800000 Terms 0
..................................................File Line 4900000 Terms 0
..................................................File Line 5000000 Terms 0
..................................................File Line 5100000 Terms 0
..................................................File Line 5200000 Terms 0
..................................................File Line 5300000 Terms 0
..................................................File Line 5400000 Terms 0
..................................................File Line 5500000 Terms 0
..................................................File Line 5600000 Terms 0
..................................................File Line 5700000 Terms 0
..................................................File Line 5800000 Terms 0
..................................................File Line 5900000 Terms 0
..................................................File Line 6000000 Terms 0
..................................................File Line 6100000 Terms 0
..................................................File Line 6200000 Terms 0
..................................................File Line 6300000 Terms 0
..................................................File Line 6400000 Terms 0
..................................................File Line 6500000 Terms 0
..................................................File Line 6600000 Terms 0
..................................................File Line 6700000 Terms 0
..................................................File Line 6800000 Terms 0
..................................................File Line 6900000 Terms 0
..................................................File Line 7000000 Terms 0
..................................................File Line 7100000 Terms 0
..................................................File Line 7200000 Terms 0
..................................................File Line 7300000 Terms 0
..................................................File Line 7400000 Terms 0
..................................................File Line 7500000 Terms 0
..................................................File Line 7600000 Terms 0
..................................................File Line 7700000 Terms 0
..................................................File Line 7800000 Terms 0
..................................................File Line 7900000 Terms 0
..................................................File Line 8000000 Terms 0
..................................................File Line 8100000 Terms 0
..................................................File Line 8200000 Terms 0
..................................................File Line 8300000 Terms 0
..................................................File Line 8400000 Terms 0
..................................................File Line 8500000 Terms 0
..................................................File Line 8600000 Terms 0
..................................................File Line 8700000 Terms 0
..................................................File Line 8800000 Terms 0
.............File Line 8827152 Terms 0
Writing map of Cuis and Texts to pathtoUmls2015.bsv

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 4:00 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thank you! I believe that was a change post 2011! You should actually be ok with both SNOMEDCT and SNOMEDCT_US in CtakesSources.txt

Cheers,
Sean

-----Original Message-----
From: Maite Meseure Hugues [mailto:meseure.maite@gmail.com]
Sent: Wednesday, September 16, 2015 3:43 PM
To: dev@ctakes.apache.org
Subject: Re: Fast Dictionary Update

If this helps, I had to replace 'SNOMEDCT' with 'SNOMEDCT_US' in CtakesSources.txt.

On Wed, Sep 16, 2015 at 2:33 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:

I'm not sure that I understand your question. As I sent it, the anat, snomed and rxnorm are not separate runs. The args line I sent earlier is for a single run that will create a dictionary with snomed and rxnorm terms. The anatomy tui list has a special use in correctly processing snomed codes.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 3:27 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ok, hopefully one last question.

Based on your example everything runs, however the Anat and Snomed runs don't produce any valid CUIs but RXNorm does. I'm not sure if this has anything to do with it but every UMLS source read is against MRSTY.

Here's my command

java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 -umls /path/to/UMLS/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -ol path\to\fileUmls2015.bsv

Any suggestions?

Thanks again,
Brandon


-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 3:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, that will make the rare word dictionary in a memory-based hsql database - the same as the default for the dictionary-lookup-fast module.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 2:42 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thanks Sean, much appreciated. To clarify the example below would create the dictionary for use for the rare word approach?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 2:16 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I just checked in a bin/dictionarytool.zip.  It should have everything that you need (.jar, lib/, data/).  Running
java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 [args]
should do the trick.

To recreate a 2015 version of the current ctakes dictionary, the arguments are:
-umls my/path/to/2015AA/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -db jdbc:hsqldb:file:my/path/to/snorx2015 -tbl CUI_TERMS

Create my/path/to/snorx2015 by copying
resources/memdbtemplate/ctakesumls.properties to my/path/to/snorx2015.properties - there is a resources/README about this.

Before populating a DB, I usually do a trial run first, writing to a flat file. Replace "-db ... -tbl ..." with "-ol my/path/to/testout.bsv"
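
A trial-run file can be sanity-checked before touching the database. The sketch below assumes only that the .bsv output is bar-separated with one term per line and the CUI in the first field; the function names are hypothetical. It reports how many rows and how many distinct CUIs were written, so an empty dictionary is caught immediately.

```python
def summarize_bsv(lines):
    """Return (row_count, distinct_first_field_count) for bar-separated lines."""
    rows = 0
    cuis = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        rows += 1
        cuis.add(line.split("|", 1)[0])  # first field, assumed to be the CUI
    return rows, len(cuis)


def summarize_bsv_file(path):
    """Summarize the .bsv file at `path`."""
    with open(path, encoding="utf-8") as handle:
        return summarize_bsv(handle)


# Made-up rows for illustration.
sample_summary = summarize_bsv(["C01|a\n", "C01|b\n", "\n", "C02|c\n"])
```

A result of (0, 0) means the run selected no terms at all, which points at the source or TUI lists rather than at the database step.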


Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:49 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

That'd be great.

I think I'm building it incorrectly because after I build the jar and try to run specifying DictionaryCreator2 as the main class it says it can't find it. I'm not too familiar with Java and building projects/jars so it could be my ignorance causing the problem.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 1:45 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I can send you a jar or commit one pre-built. What goes wrong when you try to build the tool?

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:23 PM
To: 'dev@ctakes.apache.org'
Subject: Fast Dictionary Update

Does someone have the DictionaryTool jar available? I'm having trouble creating the jar file from the project and would like to be able to create an updated UMLS fast dictionary for 2015.

Thanks,
Brandon


RE: Fast Dictionary Update

Posted by Tomasz Oliwa <ol...@uchicago.edu>.
Sean,

Thanks for the detailed answer. I appreciate it and am looking into it.

Regards,
Tomasz

________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: Tuesday, November 10, 2015 11:55 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Tomasz,

I am not at all surprised that many cuis have more synonyms in the 2015 release ... progress!  I apologize, I don't have much time right now to spend on this.  In a nutshell:

1) inclusion/exclusion triggers
Within the data/ directory there are several subdirectories containing .txt lists.  In the lists that start with "Removal" there are triggers that cause the dictionary tool to completely disregard a synonym.  For instance, in RemovalColonTriggers.txt we have ": 24 hours :".  Any time a synonym contains the substring ": 24 hours :" the entire synonym will be disregarded.  In RemovalPrefixTriggers.txt we have "deprecated".  If a synonym starts with "deprecated" then it will be disregarded.  The triggers in these lists were created manually.  Most of the time there is an equivalent synonym that does not contain the trigger.  In the different subdirectories (default/ optional/ small/ tiny/) you will find different versions of these lists containing more or fewer triggers.  The directories are named (generally) after the size of the created dictionary.  For instance, 1016 tiny/RemovalSuffixTriggers, 239 small/RemovalSuffixTriggers, 33 default/RemovalSuffixTriggers.  More triggers = fewer included synonyms.  A list directory is pointed to with the "-fd" (format data directory) parameter.  Run DictionaryCreator2 without parameters (or with "-?" or "-h") to list the available parameters.
2) inclusion tuis
Also within the data/ subdirectories are lists of tuis that should be included in the dictionary.  By default only a subset of specific ctakes  tuis are used, but check the optional/ subdirectory for lists of options.  You can specify these with "-tui" (complete list), "-atui" (anatomy tuis) and  "-mtui" (medication tuis).
3) inclusion source types
By default the tool will include any synonym that exists for a term in snomedct or rxnorm.  That is to say that if cui 1 exists in snomedct then all synonyms from all sources will be added to the dictionary.  If cui 2 only exists in mesh, then that cui and its synonyms will not exist in the dictionary.  You can add whatever sources you like to include their cuis in the dictionary.  "-src" lets you point to a custom source list.  The data/optional/UmlsAllSources.txt lists the sources from 2011ab, but you should find more in the 2015 release.
4) unwanted substrings
These are basically substrings that are removed from a synonym - but the synonym itself is not disregarded.  For instance, default/UnwantedPrefixes.txt contains "activities involving".  This will be removed from a synonym but the remainder will not be disregarded.
5) abbreviation extraction
In default/RightAbbreviations.txt there is a list of suffix substrings that trigger a "tripling" of synonyms.  For instance, "( eds )" is in the list, which will cause an umls entry "ehlers - danlos ( eds )" to be put into the dictionary as "ehlers - danlos ( eds )" AND "ehlers - danlos" AND "eds".  The list small/RightAbbreviations.txt is empty to keep the dictionary a little more compact.
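
The synonym-level rules in items 1, 4 and 5 can be pictured roughly as below. This is a sketch of the described behaviour, not the code in UmlsTermUtil: the function name, the parameter names, the order in which the rules are applied, and every example synonym except "ehlers - danlos ( eds )" are assumptions made for illustration.

```python
def apply_rules(synonym, removal_substrings, removal_prefixes,
                unwanted_prefixes, right_abbreviations):
    """Return the dictionary texts produced for one synonym.

    An empty list means the synonym is disregarded entirely.
    """
    text = synonym.strip()
    # Item 1: removal triggers discard the whole synonym.
    if any(trigger in text for trigger in removal_substrings):
        return []
    if any(text.startswith(prefix) for prefix in removal_prefixes):
        return []
    # Item 4: unwanted prefixes are stripped, the remainder is kept.
    for prefix in unwanted_prefixes:
        if text.startswith(prefix + " "):
            text = text[len(prefix):].strip()
            break
    # Item 5: a right abbreviation yields full text, long form and abbreviation.
    for suffix in right_abbreviations:
        if text.endswith(suffix):
            long_form = text[:-len(suffix)].strip()
            abbreviation = suffix.strip("() ")
            return [text, long_form, abbreviation]
    return [text]


rules = dict(
    removal_substrings=[": 24 hours :"],
    removal_prefixes=["deprecated"],
    unwanted_prefixes=["activities involving"],
    right_abbreviations=["( eds )"],
)
for synonym in ["ehlers - danlos ( eds )",
                "deprecated made-up term",
                "made-up analyte : 24 hours : urine",
                "activities involving swimming"]:
    print(repr(synonym), "->", apply_rules(synonym, **rules))
```

Longer trigger lists make the first two checks fire more often, which is why the tiny/ lists give a smaller dictionary than default/.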

I think that is about it for all things easy wrt sizing the dictionary.  Outside of these items you would need to do something custom within the tool code.  org.apache.ctakes.dictionarytool.util class UmlsTermUtil contains the code currently used for trimming, etc.  UmlsTermUtil and RxNormTermUtil and DoseUtil and DeliveryUtil have some experimental code that is not used by default, but one could always play around.
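
The source-type rule in item 3 can be sketched the same way. Again this is only an illustration under assumptions: the function name, the (cui, source, text) row shape and the sample rows are hypothetical and do not come from the tool.

```python
def select_cuis(rows, wanted_sources):
    """Map each qualifying CUI to the set of all its synonym texts.

    rows is a list of (cui, source, text) tuples. A CUI qualifies if at
    least one of its rows comes from a wanted source; once it qualifies,
    its synonyms from every source are kept.
    """
    valid = {cui for cui, source, _ in rows if source in wanted_sources}
    result = {}
    for cui, _, text in rows:
        if cui in valid:
            result.setdefault(cui, set()).add(text)
    return result


# Made-up rows: the first CUI exists in a wanted source, the second only in MSH.
rows = [
    ("C0000001", "SNOMEDCT_US", "knee pain"),
    ("C0000001", "MSH", "pain in knee"),
    ("C0000002", "MSH", "mesh only term"),
]
selected = select_cuis(rows, {"SNOMEDCT_US", "RXNORM"})
print(selected)
```

Adding a source name to the wanted set (what "-src" does via a custom list) lets that source's cuis qualify as well.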

I know that -official- documentation outside of doc/howto.txt would be great.  However I would rather spend free time throwing a gui around the tool to make things more intuitive.

Sean

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu]
Sent: Tuesday, November 10, 2015 11:53 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi,

The built-in cTAKES Fast Dictionary (UMLS 2011) contains about 490,000 rows (each synonym of the same CUI counted as a row), while the 2015 UMLS Fast Dictionary created via the dictionarytool contains about 660,000 rows.

I noticed that CUIs tend to be expressed with more synonyms in the 2015 UMLS Fast Dictionary, which I suppose is what leads to the increase in rows. For instance, the CUI C0231749 "knee pain" has 15 rows in the default cTAKES UMLS but 24 rows in the 2015 one.

How can I control which subset of synonyms is taken by the dictionarytool per CUI when the Fast Dictionary is created?

The UMLS metathesaurus itself has (many) more synonyms than 24 for C0231749, so I imagine the subset can be set up somewhere in the dictionarytool?

Thanks,
Tomasz


________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: Monday, October 19, 2015 9:02 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

Good catch, and thanks for letting me know.  Feel free to check in a fix, otherwise it will probably be a while before I get to it.

Thanks,
Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Monday, October 19, 2015 8:50 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

I finally had a chance to look further at the SNOMEDCT issue, where the codingScheme is populated with the default value.  What I found is that when the CodeMapCreator runs in the dictionary tool and the CuiCodesDbWriter is called, the collection is looked up by the name passed into the method, which is SNOMEDCT.  However, if you are using SNOMEDCT_US, the collection name is SNOMEDCT_US instead of SNOMEDCT, so the hsqldb is never populated.  Obviously an easy change to make, but I thought it might be helpful feedback.
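
To illustrate the name mismatch (this is not the actual CuiCodesDbWriter code; the function, the dictionary layout and the codes are hypothetical), an exact lookup by "SNOMEDCT" finds nothing when the collection is stored under "SNOMEDCT_US", while a lookup that tolerates the suffix does:

```python
def codes_for_source(code_maps, source_name):
    """Look up a source's code map, tolerating a suffix such as '_US'."""
    if source_name in code_maps:
        return code_maps[source_name]
    # Fall back to a collection whose name starts with the requested name.
    for name, codes in code_maps.items():
        if name.startswith(source_name + "_"):
            return codes
    return {}


# Made-up collection keyed the way a 2015 UMLS release names the source.
code_maps = {"SNOMEDCT_US": {"C0000001": ["12345"]}}

exact = code_maps.get("SNOMEDCT", {})            # the mismatch: nothing found
tolerant = codes_for_source(code_maps, "SNOMEDCT")
print(exact, tolerant)
```

A prefix match could also pick up other sources that share the prefix, so an explicit list of accepted names may be the safer fix.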

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Monday, September 21, 2015 10:39 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

Sorry for the late reply - I've been out for an extended weekend.

The coding scheme change is fairly simply explained (imo).  The plain old CUI is not a snomed code.  If the snomed codes are reported by ctakes (uncomment the snomed line in cTakesHsql.xml) then their UmlsConcept entries in the ontology array have the coding scheme name "SNOMEDCT".
            <!-- Optional tables for optional term info.
            Uncommenting these lines alone may not persist term information;
            persistence depends upon the TermConsumer.  -->
            <property key="snomedTable" value="snomedct"/>

Basically, the "CTAKES" name indicates that the scheme only contains Umls Cuis that have TUIs of the default ctakes configuration.  ctakes does not use all umls tuis, therefore I did not name the scheme "UMLS".  If you make a custom scheme (etc.) you can change the name in cTakesHsql.xml or in a custom .xml
          <!-- Depending upon the consumer, the value of codingScheme may or may not be used.  With the packaged consumers,
          codingScheme is a default value used only for cuis that do not have secondary codes (snomed, rxnorm, etc.)  -->
         <property key="codingScheme" value="CTAKES"/>


The "RelationsExtractor" in the dictionary creator tool is completely experimental and unfinished - but perhaps some day it will throw umls relations into a format that ctakes can directly use.  For the time being it should be avoided.

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 10:23 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

You can disregard my question about the relation extraction as I fixed this by building the new dictionary with the default data files in the dictionarytool.  I am curious about the SNOMED change still though.

Thanks,
Brandon

-----Original Message-----
From: Geise, Brandon D.
Sent: Thursday, September 17, 2015 9:40 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: RE: Fast Dictionary Update

Thanks Dmitriy.  I was referring to the RelationsExtractor class found in the dictionarytool.  On a similar note, the coding scheme for all SNOMEDCT codes for the new dictionary is CTAKES compared to SNOMED with the UMLS version packaged with cTakes.  Is there something else I need to run for the dictionary creation that I'm missing?

Thanks,
Brandon

-----Original Message-----
From: Dligach, Dmitriy [mailto:Dmitriy.Dligach@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 8:42 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: Re: Fast Dictionary Update

Hi Brandon,

Relation extraction at the moment only handles two specific relation types: LocationOf and DegreeOf. You are welcome to run it if you need these specific relations.


Dima

--
Dmitriy (Dima) Dligach, Ph.D.
Boston Children's Hospital and Harvard Medical School
(617) 651-0397



On Sep 17, 2015, at 17:08, Geise, Brandon D. <bd...@geisinger.edu> wrote:

Does the RelationsExtractor need to be run in order to generate information on relationships from cTakes?  When running with 2011 UMLS dictionary I'm able to get relationships for BodyLocationMentions but with the dictionary I created I am not getting this information.  Any advice?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 1:18 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

It claims that the database is connected, and the preceding line of dots is spat out during loading, which took ~3-4 seconds (so something was there):
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

Strange.  I don't really know what to tell you right now.  Perhaps something will click with me later ...


Did you also run org.apache.ctakes.dictionarytool.CodeMapCreator ?  It isn't strictly necessary but it stores the tuis in the database so that cTakes can identify the semantic group of a mention.




-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 1:02 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Not specifically loaded.  Here's what I see when loading the pipeline:

17 Sep 2015 12:58:54  INFO JdbcConnectionFactory - Connecting to jdbc:hsqldb:file:path/to/ctakes/ctakes-dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/UMLS2015/snorx2015:
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 12:57 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Making an alternate copy of cTakesHsql.xml and pointing to the new dictionary is all that is necessary.  Do you see a message in the initialization output indicating that the dictionary db has been loaded?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 12:54 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Great, thanks. Both seemed to work for populating the script table.

Besides the path to the new dictionary needing to be changed in cTakesHsql.xml, does anything else need to be modified to use the new dictionary?  My pipeline runs; however, there aren't any annotations related to the UMLS concepts.  The only annotations I'm seeing are date, roman numeral, or modifier related. (My pipeline is UMLSFastProcessor with additions for modifiers and templatefiller.)  Any suggestions would be appreciated.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 10:40 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Correct, Hsql should automatically read the .log file upon first use, and then perform the inserts into the .script file.

In case you want to play it safe, check the README in the resource/ directory (where you got the hsqldb template).  The last paragraph indicates how you can launch a simple sql tool to play with the db.  You will need to change the name of the db accordingly.  Upon first launch of the sql tool everything should be moved from the .log to the .script file.   It is a strange setup/workflow, but it seems to work.

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 10:31 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

When I run the tool it outputs a file with a .log extension that has all the insert statements.  Do I copy this to the .script template from memcachedb in the dictionarytool project or should the inserts be put into the .script file by default on the program execution?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:59 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Excellent!

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 9:55 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

No, I had changed it on the Tiny source file.  I just changed the default file and it looks to be running as expected now.

Thank you for all your help and patience,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:35 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Did you add it to data/default/CtakesSources.txt?

If not then you need to specify -src ./data/tiny/CtakesSources.txt

Sorry for any confusion.

As soon as my inet isn't overloaded I'll download 2015AA and see if I can build a dictionary.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 8:14 PM
To: dev@ctakes.apache.org; dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Sean,

I added that and still had the same issue.

Thanks,
Brandon
_____________________________
From: Finan, Sean <se...@childrens.harvard.edu>
Sent: Wednesday, September 16, 2015 7:56 PM
Subject: RE: Fast Dictionary Update
To: <de...@ctakes.apache.org>


And you added "SNOMEDCT_US" to data/tiny/CtakesSources.txt?

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu]
Sent: Wednesday, September 16, 2015 7:13 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I have exactly the same problem with the tool.

A grep on MRCONSO.RRF for "SNOMEDCT" or for "SNOMEDCT_US" shows many lines.

________________________________________
From: Geise, Brandon D. [bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 5:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, it finds "SNOMEDCT_US".

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 5:17 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ah, now I see what you mean. Can you do a grep on your MRCONSO.RRF for "SNOMEDCT" ?
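
A plain grep also matches "SNOMEDCT" anywhere in a line, so counting the exact source abbreviations can be more telling. The sketch below assumes the standard pipe-delimited MRCONSO.RRF layout, in which SAB is the twelfth field; the helper name and the sample rows are made up for illustration.

```python
from collections import Counter
import io

# MRCONSO.RRF columns start: CUI|LAT|TS|LUI|STT|SUI|ISPREF|AUI|SAUI|SCUI|SDUI|SAB|...
SAB_FIELD = 11


def count_sources(lines, prefix="SNOMEDCT"):
    """Count rows per source abbreviation (SAB) starting with the prefix."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > SAB_FIELD and fields[SAB_FIELD].startswith(prefix):
            counts[fields[SAB_FIELD]] += 1
    return counts


# Made-up rows; in practice pass open("my/path/to/2015AA/META/MRCONSO.RRF").
sample = io.StringIO(
    "C0000001|ENG|P|L1|PF|S1|Y|A1||111||SNOMEDCT_US|PT|111|made-up term|0|N||\n"
    "C0000002|ENG|P|L2|PF|S2|Y|A2|||D1|MSH|MH|D1|another made-up term|0|N||\n"
)
source_counts = count_sources(sample)
print(source_counts)
```

The names that show up here are the ones that need to appear in CtakesSources.txt.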

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 4:04 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I tried changing as suggested.

Below is what I see for the snomed piece, but for RXNorm it writes terms at the end.

Reading list of Source Types from ./data/default/CtakesSources.txt
File Lines 1 list of Source Types 1
Reading list of Tuis from ./data/tiny/CtakesSnomedTuis.txt
File Lines 24 list of Tuis 24
Compiling list of Cuis with wanted Tuis using /patto/UMLS_Current_Version/META/MRSTY.RRF
File Line 200000 Cuis 60895
File Line 300000 Cuis 85750
File Line 400000 Cuis 135098
File Line 600000 Cuis 183925
File Line 1700000 Cuis 376338
File Line 1800000 Cuis 471009
File Line 1900000 Cuis 568375
File Line 2100000 Cuis 674715
File Line 2800000 Cuis 903583
File Line 3300000 Cuis 973791
File Lines 3370173 Cuis 999451
..................................................File Line 100000 Valid Cuis 0
..................................................File Line 200000 Valid Cuis 0
..................................................File Line 300000 Valid Cuis 0
..................................................File Line 400000 Valid Cuis 0
..................................................File Line 500000 Valid Cuis 0
..................................................File Line 600000 Valid Cuis 0
..................................................File Line 700000 Valid Cuis 0
..................................................File Line 800000 Valid Cuis 0
..................................................File Line 900000 Valid Cuis 0
..................................................File Line 1000000 Valid Cuis 0
..................................................File Line 1100000 Valid Cuis 0
..................................................File Line 1200000 Valid Cuis 0
..................................................File Line 1300000 Valid Cuis 0
..................................................File Line 1400000 Valid Cuis 0
..................................................File Line 1500000 Valid Cuis 0
..................................................File Line 1600000 Valid Cuis 0
..................................................File Line 1700000 Valid Cuis 0
..................................................File Line 1800000 Valid Cuis 0
..................................................File Line 1900000 Valid Cuis 0
..................................................File Line 2000000 Valid Cuis 0
..................................................File Line 2100000 Valid Cuis 0
..................................................File Line 2200000 Valid Cuis 0
..................................................File Line 2300000 Valid Cuis 0
..................................................File Line 2400000 Valid Cuis 0
..................................................File Line 2500000 Valid Cuis 0
..................................................File Line 2600000 Valid Cuis 0
..................................................File Line 2700000 Valid Cuis 0
..................................................File Line 2800000 Valid Cuis 0
..................................................File Line 2900000 Valid Cuis 0
..................................................File Line 3000000 Valid Cuis 0
..................................................File Line 3100000 Valid Cuis 0
..................................................File Line 3200000 Valid Cuis 0
..................................................File Line 3300000 Valid Cuis 0
..................................................File Line 3400000 Valid Cuis 0
..................................................File Line 3500000 Valid Cuis 0
..................................................File Line 3600000 Valid Cuis 0
..................................................File Line 3700000 Valid Cuis 0
..................................................File Line 3800000 Valid Cuis 0
..................................................File Line 3900000 Valid Cuis 0
..................................................File Line 4000000 Valid Cuis 0
..................................................File Line 4100000 Valid Cuis 0
..................................................File Line 4200000 Valid Cuis 0
..................................................File Line 4300000 Valid Cuis 0
..................................................File Line 4400000 Valid Cuis 0
..................................................File Line 4500000 Valid Cuis 0
..................................................File Line 4600000 Valid Cuis 0
..................................................File Line 4700000 Valid Cuis 0
..................................................File Line 4800000 Valid Cuis 0
..................................................File Line 4900000 Valid Cuis 0
..................................................File Line 5000000 Valid Cuis 0
..................................................File Line 5100000 Valid Cuis 0
..................................................File Line 5200000 Valid Cuis 0
..................................................File Line 5300000 Valid Cuis 0
..................................................File Line 5400000 Valid Cuis 0
..................................................File Line 5500000 Valid Cuis 0
..................................................File Line 5600000 Valid Cuis 0
..................................................File Line 5700000 Valid Cuis 0
..................................................File Line 5800000 Valid Cuis 0
..................................................File Line 5900000 Valid Cuis 0
..................................................File Line 6000000 Valid Cuis 0
..................................................File Line 6100000 Valid Cuis 0
..................................................File Line 6200000 Valid Cuis 0
..................................................File Line 6300000 Valid Cuis 0
..................................................File Line 6400000 Valid Cuis 0
..................................................File Line 6500000 Valid Cuis 0
..................................................File Line 6600000 Valid Cuis 0
..................................................File Line 6700000 Valid Cuis 0
..................................................File Line 6800000 Valid Cuis 0
..................................................File Line 6900000 Valid Cuis 0
..................................................File Line 7000000 Valid Cuis 0
..................................................File Line 7100000 Valid Cuis 0
..................................................File Line 7200000 Valid Cuis 0
..................................................File Line 7300000 Valid Cuis 0
..................................................File Line 7400000 Valid Cuis 0
..................................................File Line 7500000 Valid Cuis 0
..................................................File Line 7600000 Valid Cuis 0
..................................................File Line 7700000 Valid Cuis 0
..................................................File Line 7800000 Valid Cuis 0
..................................................File Line 7900000 Valid Cuis 0
..................................................File Line 8000000 Valid Cuis 0
..................................................File Line 8100000 Valid Cuis 0
..................................................File Line 8200000 Valid Cuis 0
..................................................File Line 8300000 Valid Cuis 0
..................................................File Line 8400000 Valid Cuis 0
..................................................File Line 8500000 Valid Cuis 0
..................................................File Line 8600000 Valid Cuis 0
..................................................File Line 8700000 Valid Cuis 0
..................................................File Line 8800000 Valid Cuis 0
.............File Lines 8827152 Valid Cuis 0
Compiling map of Umls Cuis and Texts
..................................................File Line 100000 Terms 0
..................................................File Line 200000 Terms 0
..................................................File Line 300000 Terms 0
..................................................File Line 400000 Terms 0
..................................................File Line 500000 Terms 0
..................................................File Line 600000 Terms 0
..................................................File Line 700000 Terms 0
..................................................File Line 800000 Terms 0
..................................................File Line 900000 Terms 0
..................................................File Line 1000000 Terms 0
..................................................File Line 1100000 Terms 0
..................................................File Line 1200000 Terms 0
..................................................File Line 1300000 Terms 0
..................................................File Line 1400000 Terms 0
..................................................File Line 1500000 Terms 0
..................................................File Line 1600000 Terms 0
..................................................File Line 1700000 Terms 0
..................................................File Line 1800000 Terms 0
..................................................File Line 1900000 Terms 0
..................................................File Line 2000000 Terms 0
..................................................File Line 2100000 Terms 0
..................................................File Line 2200000 Terms 0
..................................................File Line 2300000 Terms 0
..................................................File Line 2400000 Terms 0
..................................................File Line 2500000 Terms 0
..................................................File Line 2600000 Terms 0
..................................................File Line 2700000 Terms 0
..................................................File Line 2800000 Terms 0
..................................................File Line 2900000 Terms 0
..................................................File Line 3000000 Terms 0
..................................................File Line 3100000 Terms 0
..................................................File Line 3200000 Terms 0
..................................................File Line 3300000 Terms 0
..................................................File Line 3400000 Terms 0
..................................................File Line 3500000 Terms 0
..................................................File Line 3600000 Terms 0
..................................................File Line 3700000 Terms 0
..................................................File Line 3800000 Terms 0
..................................................File Line 3900000 Terms 0
..................................................File Line 4000000 Terms 0
..................................................File Line 4100000 Terms 0
..................................................File Line 4200000 Terms 0
..................................................File Line 4300000 Terms 0
..................................................File Line 4400000 Terms 0
..................................................File Line 4500000 Terms 0
..................................................File Line 4600000 Terms 0
..................................................File Line 4700000 Terms 0
..................................................File Line 4800000 Terms 0
..................................................File Line 4900000 Terms 0
..................................................File Line 5000000 Terms 0
..................................................File Line 5100000 Terms 0
..................................................File Line 5200000 Terms 0
..................................................File Line 5300000 Terms 0
..................................................File Line 5400000 Terms 0
..................................................File Line 5500000 Terms 0
..................................................File Line 5600000 Terms 0
..................................................File Line 5700000 Terms 0
..................................................File Line 5800000 Terms 0
..................................................File Line 5900000 Terms 0
..................................................File Line 6000000 Terms 0
..................................................File Line 6100000 Terms 0
..................................................File Line 6200000 Terms 0
..................................................File Line 6300000 Terms 0
..................................................File Line 6400000 Terms 0
..................................................File Line 6500000 Terms 0
..................................................File Line 6600000 Terms 0
..................................................File Line 6700000 Terms 0
..................................................File Line 6800000 Terms 0
..................................................File Line 6900000 Terms 0
..................................................File Line 7000000 Terms 0
..................................................File Line 7100000 Terms 0
..................................................File Line 7200000 Terms 0
..................................................File Line 7300000 Terms 0
..................................................File Line 7400000 Terms 0
..................................................File Line 7500000 Terms 0
..................................................File Line 7600000 Terms 0
..................................................File Line 7700000 Terms 0
..................................................File Line 7800000 Terms 0
..................................................File Line 7900000 Terms 0
..................................................File Line 8000000 Terms 0
..................................................File Line 8100000 Terms 0
..................................................File Line 8200000 Terms 0
..................................................File Line 8300000 Terms 0
..................................................File Line 8400000 Terms 0
..................................................File Line 8500000 Terms 0
..................................................File Line 8600000 Terms 0
..................................................File Line 8700000 Terms 0
..................................................File Line 8800000 Terms 0
.............File Line 8827152 Terms 0
Writing map of Cuis and Texts to pathtoUmls2015.bsv

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 4:00 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thank you! I believe that was a change post 2011! You should actually be ok with both SNOMEDCT and SNOMEDCT_US in CtakesSources.txt

Cheers,
Sean

-----Original Message-----
From: Maite Meseure Hugues [mailto:meseure.maite@gmail.com]
Sent: Wednesday, September 16, 2015 3:43 PM
To: dev@ctakes.apache.org
Subject: Re: Fast Dictionary Update

If this helps, I had to replace 'SNOMEDCT' with 'SNOMEDCT_US' in CtakesSources.txt.

On Wed, Sep 16, 2015 at 2:33 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:

I'm not sure that I understand your question. As I sent it, the anat, snomed and rxnorm are not separate runs. The args line I sent earlier is for a single run that will create a dictionary with snomed and rxnorm terms. The anatomy tui list has a special use in correctly processing snomed codes.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 3:27 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ok, hopefully one last question.

Based on your example everything runs, however the Anat and Snomed runs don't produce any valid CUIs but RXNorm does. I'm not sure if this has anything to do with it but every UMLS source read is against MRSTY.

Here's my command

java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 -umls /path/to/UMLS/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -ol path o ileUmls2015.bsv

Any suggestions?

Thanks again,
Brandon


-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 3:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, that will make the rare word dictionary in a memory-based hsql database - the same as the default for the dictionary-lookup-fast module.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 2:42 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thanks Sean, much appreciated. To clarify, would the example below create the dictionary for use with the rare word approach?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 2:16 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I just checked in a bin/dictionarytool.zip. It should have everything that you need (.jar, lib/, data/). This should do the trick:

java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 [args]

To recreate a 2015 version of the current ctakes dictionary, the arguments are:

-umls my/path/to/2015AA/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -db jdbc:hsqldb:file:my/path/to/snorx2015 -tbl CUI_TERMS

Create my/path/to/snorx2015 by copying resources/memdbtemplate/ctakesumls.properties to my/path/to/snorx2015.properties - there is a resources/README about this.

Before populating a DB, I usually do a trial run first, writing to a flat file. Replace "-db ... -tbl ..." with "-ol my/path/to/testout.bsv"
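
The trial-run-then-populate workflow can be sketched as a small helper that assembles the command line. This is only an illustration: the flags (-umls, -fd, -atui, -tui, -db, -tbl, -ol) and the main class are the ones quoted in this thread, while the function name build_command and its parameters are made up here. The classpath separator is ';' on Windows and ':' on Linux/macOS, which Python exposes as os.pathsep.

```python
import os

MAIN_CLASS = "org.apache.ctakes.dictionarytool.DictionaryCreator2"


def build_command(umls_meta, format_dir, out_file=None, db_url=None, table=None,
                  sep=os.pathsep):
    """Assemble a DictionaryCreator2 command line (hypothetical helper).

    Pass out_file for a flat-file trial run, or db_url and table to
    populate an hsqldb. `sep` is the Java classpath separator.
    """
    if (out_file is None) == (db_url is None):
        raise ValueError("give exactly one of out_file or db_url")
    if db_url is not None and table is None:
        raise ValueError("db_url needs a table name")
    cmd = [
        "java", "-cp", f"dictionarytool.jar{sep}lib/*", MAIN_CLASS,
        "-umls", umls_meta,
        "-fd", format_dir,
        "-atui", f"{format_dir}/CtakesAnatTuis.txt",
        "-tui", f"{format_dir}/CtakesSnomedTuis.txt",
    ]
    if out_file is not None:
        cmd += ["-ol", out_file]            # trial run: write a flat file
    else:
        cmd += ["-db", db_url, "-tbl", table]  # real run: populate the db
    return cmd


trial = build_command("my/path/to/2015AA/META", "./data/tiny",
                      out_file="my/path/to/testout.bsv")
real = build_command("my/path/to/2015AA/META", "./data/tiny",
                     db_url="jdbc:hsqldb:file:my/path/to/snorx2015",
                     table="CUI_TERMS")
print(" ".join(trial))
```

If the printed command is pasted into a Unix shell, the classpath should be quoted so that the shell does not expand lib/* itself; Java expands the wildcard on its own.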


Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:49 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

That'd be great.

I think I'm building it incorrectly because after I build the jar and try to run specifying DictionaryCreator2 as the main class it says it can't find it. I'm not too familiar with Java and building projects/jars so it could be my ignorance causing the problem.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 1:45 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I can send you a jar or commit one pre-built. What goes wrong when you try to build the tool?

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:23 PM
To: 'dev@ctakes.apache.org'
Subject: Fast Dictionary Update

Does someone have the DictionaryTool jar available? I'm having trouble creating the jar file from the project and would like to be able to create an updated UMLS fast dictionary for 2015.

Thanks,
Brandon


RE: Fast Dictionary Update

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Tomasz, 

I am not at all surprised that many cuis have more synonyms in the 2015 release ... progress!  I apologize, I don't have much time right now to spend on this.  In a nutshell:

1) inclusion/exclusion triggers
Within the data/ directory there are several subdirectories containing .txt lists.  In the lists that start with "Removal" there are triggers that cause the dictionary tool to completely disregard a synonym.  For instance, in RemovalColonTriggers.txt we have ": 24 hours :".  Any time a synonym contains the substring ": 24 hours :" the entire synonym will be disregarded.  In RemovalPrefixTriggers.txt we have "deprecated".  If a synonym starts with "deprecated" then it will be disregarded.  The triggers in these lists were created manually.  Most of the time there is an equivalent synonym that does not contain the trigger.  In the different subdirectories (default/ optional/ small/ tiny/) you will find different versions of these lists containing more or fewer triggers.  The directories are named (generally) after the size of the created dictionary.  For instance, the trigger counts are 1016 in tiny/RemovalSuffixTriggers, 239 in small/RemovalSuffixTriggers and 33 in default/RemovalSuffixTriggers.  More triggers = fewer included synonyms.  A list directory is pointed to with the "-fd" (format data directory) parameter.  Run DictionaryCreator2 without parameters (or with "-?" or "-h") to list the available parameters.
2) inclusion tuis
Also within the data/ subdirectories are lists of tuis that should be included in the dictionary.  By default only a subset of specific ctakes  tuis are used, but check the optional/ subdirectory for lists of options.  You can specify these with "-tui" (complete list), "-atui" (anatomy tuis) and  "-mtui" (medication tuis).
3) inclusion source types
By default the tool will include any synonym that exists for a term in snomedct or rxnorm.  That is to say that if cui 1 exists in snomedct then all synonyms from all sources will be added to the dictionary.  If cui 2 only exists in mesh, then that cui and its synonyms will not exist in the dictionary.  You can add whatever sources you like to include their cuis in the dictionary.  "-src" lets you point to a custom source list.  The data/optional/UmlsAllSources.txt lists the sources from 2011ab, but you should find more in the 2015 release.
4) unwanted substrings
These are basically substrings that are removed from a synonym - but the synonym itself is not disregarded.  For instance, default/UnwantedPrefixes.txt contains "activities involving".  This will be removed from a synonym but the remainder will not be disregarded.
5) abbreviation extraction
In default/RightAbbreviations.txt there is a list of suffix substrings that trigger a "tripling" of synonyms.  For instance, "( eds )" is in the list, which will cause a umls entry "ehlers - danlos ( eds )" to be put into the dictionary as "ehlers - danlos ( eds )" AND "ehlers - danlos" AND "eds".  The list small/RightAbbreviations.txt is empty to keep the dictionary a little more compact.
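
As a rough illustration of items 1, 4 and 5, the sketch below applies removal triggers, unwanted-prefix stripping and abbreviation "tripling" to a single synonym. It is not the tool's code (that lives in UmlsTermUtil): the function name, the list variables and the order of the steps are assumptions, and only the trigger strings quoted above are used.

```python
REMOVAL_SUBSTRINGS = [": 24 hours :"]         # as in RemovalColonTriggers.txt
REMOVAL_PREFIXES = ["deprecated"]             # as in RemovalPrefixTriggers.txt
UNWANTED_PREFIXES = ["activities involving"]  # as in UnwantedPrefixes.txt
RIGHT_ABBREVIATIONS = ["( eds )"]             # as in RightAbbreviations.txt


def expand_synonym(text):
    """Return the dictionary entries produced by one synonym (sketch)."""
    # Item 1: a removal trigger discards the whole synonym.
    if any(trigger in text for trigger in REMOVAL_SUBSTRINGS):
        return []
    if any(text.startswith(prefix) for prefix in REMOVAL_PREFIXES):
        return []
    # Item 4: an unwanted prefix is stripped, the remainder is kept.
    for prefix in UNWANTED_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    if not text:
        return []
    # Item 5: an abbreviation suffix yields the full text, the text
    # without the abbreviation, and the abbreviation on its own.
    for suffix in RIGHT_ABBREVIATIONS:
        if text.endswith(suffix):
            stem = text[: -len(suffix)].strip()
            abbreviation = suffix.strip("() ")
            return [text, stem, abbreviation]
    return [text]


print(expand_synonym("ehlers - danlos ( eds )"))
# ['ehlers - danlos ( eds )', 'ehlers - danlos', 'eds']
```

With an empty RIGHT_ABBREVIATIONS list, as in small/, the same entry would produce one row instead of three, which is how these lists affect dictionary size.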

I think that is about it for all things easy wrt sizing the dictionary.  Outside of these items you would need to do something custom within the tool code.  org.apache.ctakes.dictionarytool.util class UmlsTermUtil contains the code currently used for trimming, etc.  UmlsTermUtil and RxNormTermUtil and DoseUtil and DeliveryUtil have some experimental code that is not used by default, but one could always play around.
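
The source-type rule from item 3 can be pictured as a two-pass filter: first collect the cuis that occur in a wanted source, then keep every synonym of those cuis regardless of source. The sketch below uses invented rows and function names; it does not read UMLS files or reproduce the tool's implementation.

```python
def cuis_in_wanted_sources(rows, wanted_sources):
    """Pass 1: collect cuis that occur in at least one wanted source."""
    return {cui for cui, source, _ in rows if source in wanted_sources}


def dictionary_synonyms(rows, wanted_sources):
    """Pass 2: keep every synonym, from any source, of a wanted cui."""
    wanted = cuis_in_wanted_sources(rows, wanted_sources)
    result = {}
    for cui, _, text in rows:
        if cui in wanted:
            result.setdefault(cui, set()).add(text)
    return result


# Invented (cui, source, text) rows, not real UMLS content.
rows = [
    ("cui1", "SNOMEDCT_US", "knee pain"),
    ("cui1", "MSH", "pain in knee"),
    ("cui2", "MSH", "mesh only term"),
]
found = dictionary_synonyms(rows, {"SNOMEDCT_US", "RXNORM"})
print(found)
```

Because the source names are compared as exact strings in this sketch, a source list containing only "SNOMEDCT" would select nothing from data that uses "SNOMEDCT_US".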

I know that -official- documentation outside of doc/howto.txt would be great.  However I would rather spend free time throwing a gui around the tool to make things more intuitive.

Sean

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu] 
Sent: Tuesday, November 10, 2015 11:53 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi,

The built-in cTAKES Fast Dictionary (UMLS 2011) contains about 490,000 rows (each synonym of the same CUI counted as a row), while the 2015 UMLS Fast Dictionary created via the dictionarytool has about 660,000 rows.

I noticed that CUIs tend to be expressed with more synonyms in the 2015 UMLS Fast Dictionary, which I suppose is what leads to the increase in rows. For instance, the CUI C0231749 "knee pain" has 15 rows in the default cTAKES UMLS dictionary but 24 rows in the 2015 one.

How can I control which subset of synonyms is taken by the dictionarytool per CUI when the Fast Dictionary is created?

The UMLS metathesaurus itself has (much) more synonyms than 24 for C0231749, so I imagine the subset can be set up somewhere in the dictionarytool?

Thanks,
Tomasz 


________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: Monday, October 19, 2015 9:02 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

Good catch, and thanks for letting me know.  Feel free to check in a fix, otherwise it will probably be a while before I get to it.

Thanks,
Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Monday, October 19, 2015 8:50 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

I finally had a chance to look at the SNOMEDCT issue further regarding the codingScheme populating using the default value.  What I found was in the dictionary tool when running the CodeMapCreator, when the CuiCodesDbWriter is called, the collection uses the name passed into the method, which is SNOMEDCT.  However, if you are using SNOMEDCT_US the collection name is SNOMEDCT_US instead of SNOMEDCT, so it never populates the hsqldb.  Obviously an easy change to make, but thought it might be helpful feedback.
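
The mismatch described above can be pictured as a lookup in a map keyed by source name. This is a sketch only: it does not reproduce CodeMapCreator or CuiCodesDbWriter, and the function name and data are hypothetical.

```python
def codes_to_write(codes_by_source, collection_name):
    """Return the (cui, code) pairs stored under collection_name."""
    return codes_by_source.get(collection_name, [])


# Invented data: codes are collected under the source name found in the UMLS files.
codes_by_source = {
    "SNOMEDCT_US": [("cui1", "code1")],
    "RXNORM": [("cui2", "code2")],
}

missed = codes_to_write(codes_by_source, "SNOMEDCT")      # name passed to the writer
written = codes_to_write(codes_by_source, "SNOMEDCT_US")  # name used in the data
print(len(missed), len(written))
```

Asking for "SNOMEDCT" returns nothing when the data is keyed by "SNOMEDCT_US", so nothing is written to the table.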

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Monday, September 21, 2015 10:39 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

Sorry for the late reply - I've been out for an extended weekend.

The coding scheme change is fairly simply explained (imo).  The plain old CUI is not a snomed code.  If the snomed codes are reported by ctakes (uncomment the snomed line in ctakesHsql.xml ) then their UmlsConcept entries in the ontology array have the coding scheme name "SNOMEDCT".
            <!-- Optional tables for optional term info.
            Uncommenting these lines alone may not persist term information;
            persistence depends upon the TermConsumer.  -->
            <property key="snomedTable" value="snomedct"/>

Basically, the "CTAKES" name indicates that the scheme only contains Umls Cuis that have TUIs of the default ctakes configuration.  ctakes does not use all umls tuis, therefore I did not name the scheme "UMLS".  If you make a custom scheme (etc.) you can change the name in cTakesHsql.xml or in a custom .xml
          <!-- Depending upon the consumer, the value of codingScheme may or may not be used.  With the packaged consumers,
          codingScheme is a default value used only for cuis that do not have secondary codes (snomed, rxnorm, etc.)  -->
         <property key="codingScheme" value="CTAKES"/>
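
The fallback described in the comment above can be sketched as follows. The function name and the shape of secondary_codes are assumptions made for illustration; the packaged consumers may be organized differently.

```python
def coding_schemes(secondary_codes, default_scheme="CTAKES"):
    """Return the scheme names reported for one cui (sketch).

    secondary_codes maps a scheme name such as "SNOMEDCT" to the list
    of codes found for the cui in the optional tables.
    """
    schemes = sorted(name for name, codes in secondary_codes.items() if codes)
    # The default scheme is used only when no secondary code exists.
    return schemes or [default_scheme]


print(coding_schemes({"SNOMEDCT": ["code1"]}))  # ['SNOMEDCT']
print(coding_schemes({}))                       # ['CTAKES']
```

Under this reading, a snomed table that was never populated leaves every cui with the default scheme.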


The "RelationsExtractor" in the dictionary creator tool is completely experimental and unfinished - but perhaps some day it will throw umls relations into a format that ctakes can directly use.  For the time being it should be avoided.

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 10:23 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

You can disregard my question about the relation extraction as I fixed this by building the new dictionary with the default data files in the dictionarytool.  I am curious about the SNOMED change still though.

Thanks,
Brandon

-----Original Message-----
From: Geise, Brandon D.
Sent: Thursday, September 17, 2015 9:40 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: RE: Fast Dictionary Update

Thanks Dmitriy.  I was referring to the RelationsExtractor class found in the dictionarytool.  On a similar note, the coding scheme for all SNOMEDCT codes for the new dictionary is CTAKES compared to SNOMED with the UMLS version packaged with cTakes.  Is there something else I need to run for the dictionary creation that I'm missing?

Thanks,
Brandon

-----Original Message-----
From: Dligach, Dmitriy [mailto:Dmitriy.Dligach@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 8:42 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: Re: Fast Dictionary Update

Hi Brandon,

Relation extraction at the moment only handles two specific relation types: LocationOf and DegreeOf. You are welcome to run it if you need these specific relations.


Dima

--
Dmitriy (Dima) Dligach, Ph.D.
Boston Children's Hospital and Harvard Medical School
(617) 651-0397



On Sep 17, 2015, at 17:08, Geise, Brandon D. <bd...@geisinger.edu> wrote:

Does the RelationsExtractor need to be run in order to generate information on relationships from cTakes?  When running with 2011 UMLS dictionary I'm able to get relationships for BodyLocationMentions but with the dictionary I created I am not getting this information.  Any advice?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 1:18 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

It claims that the database is connected, and the preceding line of dots is printed during loading, which took ~3-4 seconds (so something was there):
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

Strange.  I don't really know what to tell you right now.  Perhaps something will click with me later ...


Did you also run org.apache.ctakes.dictionarytool.CodeMapCreator ?  It isn't strictly necessary but it stores the tuis in the database so that cTakes can identify the semantic group of a mention.




-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 1:02 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Not specifically loaded.  Here's what I see when loading the pipeline:

17 Sep 2015 12:58:54  INFO JdbcConnectionFactory - Connecting to jdbc:hsqldb:file:path/to/ctakes/ctakes-dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/UMLS2015/snorx2015:
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 12:57 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Making an alternate copy of cTakesHsql.xml and pointing to the new dictionary is all that is necessary.  Do you see a message in the initialization output indicating that the dictionary db has been loaded?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 12:54 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Great, thanks.  Both seemed to work for populating the script table.

Besides the path to the new dictionary needing to be changed in cTakesHsql.xml, does anything else need to be modified to use the new dictionary?  My pipeline runs; however, there aren't any annotations related to the UMLS concepts.  The only annotations I'm seeing are date, roman numeral, or modifier related.  (My pipeline is UMLSFastProcessor with additions for modifiers and templatefiller.)  Any suggestions would be appreciated.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 10:40 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Correct, Hsql should automatically read the .log file upon first use, and then perform the inserts into the .script file.

In case you want to play it safe, check the README in the resource/ directory (where you got the hsqldb template).  The last paragraph indicates how you can launch a simple sql tool to play with the db.  You will need to change the name of the db accordingly.  Upon first launch of the sql tool everything should be moved from the .log to the .script file.   It is a strange setup/workflow, but it seems to work.

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 10:31 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

When I run the tool it outputs a file with a .log extension that has all the insert statements.  Do I copy this to the .script template from memcachedb in the dictionarytool project or should the inserts be put into the .script file by default on the program execution?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:59 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Excellent!

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 9:55 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

No, I had changed it on the Tiny source file.  I just changed the default file and it looks to be running as expected now.

Thank you for all your help and patience, Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:35 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Did you add it to data/default/CtakesSources.txt ?

If not then you need to specify -src ./data/tiny/CtakesSources.txt

Sorry for any confusion.

As soon as my inet isn't overloaded I'll download 2015AA and see if I can build a dictionary.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 8:14 PM
To: dev@ctakes.apache.org; dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Sean,

I added that and still had the same issue.

Thanks,
Brandon
_____________________________
From: Finan, Sean <se...@childrens.harvard.edu>
Sent: Wednesday, September 16, 2015 7:56 PM
Subject: RE: Fast Dictionary Update
To: <de...@ctakes.apache.org>


And you added "SNOMEDCT_US" to data/tiny/CtakesSources.txt ?

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu]
Sent: Wednesday, September 16, 2015 7:13 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I have exactly the same problem with the tool.

A grep on MRCONSO.RRF for "SNOMEDCT" or for "SNOMEDCT_US" shows many lines.
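
One caveat with the grep: a search for "SNOMEDCT" also matches every "SNOMEDCT_US" line, so it cannot tell the two names apart. Counting the exact values of the source column separates them. The sketch below assumes the standard pipe-delimited MRCONSO.RRF layout, in which the source abbreviation (SAB) is the 12th field; the function name, the placeholder path and the sample rows are made up for illustration.

```python
from collections import Counter

SAB_INDEX = 11  # SAB is the 12th pipe-delimited field in MRCONSO.RRF


def count_sources(lines, prefix="SNOMEDCT"):
    """Count exact SAB values that start with prefix."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > SAB_INDEX and fields[SAB_INDEX].startswith(prefix):
            counts[fields[SAB_INDEX]] += 1
    return counts


# Usage with a real file (placeholder path):
# with open("/path/to/UMLS/META/MRCONSO.RRF", encoding="utf-8") as handle:
#     print(count_sources(handle))

# Tiny invented rows in the MRCONSO column order, for illustration only.
sample = [
    "C0000001|ENG|P|L1|PF|S1|Y|A1||||SNOMEDCT_US|PT|111|example term|0|N||",
    "C0000002|ENG|P|L2|PF|S2|Y|A2||||RXNORM|IN|222|example drug|0|N||",
]
print(count_sources(sample))
```

If the result contains only "SNOMEDCT_US" and no plain "SNOMEDCT", then a source list that names only "SNOMEDCT" will not match anything in that release.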

________________________________________
From: Geise, Brandon D. [bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 5:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, it finds "SNOMEDCT_US".

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 5:17 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ah, now I see what you mean. Can you do a grep on your MRCONSO.RRF for "SNOMEDCT" ?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 4:04 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I tried changing as suggested.

Below is what I see for the snomed piece, but for RXNorm it writes terms at the end.

Reading list of Source Types from ./data/default/CtakesSources.txt
File Lines 1 list of Source Types 1
Reading list of Tuis from ./data/tiny/CtakesSnomedTuis.txt
File Lines 24 list of Tuis 24
Compiling list of Cuis with wanted Tuis using /patto/UMLS_Current_Version/META/MRSTY.RRF
File Line 200000 Cuis 60895
File Line 300000 Cuis 85750
File Line 400000 Cuis 135098
File Line 600000 Cuis 183925
File Line 1700000 Cuis 376338
File Line 1800000 Cuis 471009
File Line 1900000 Cuis 568375
File Line 2100000 Cuis 674715
File Line 2800000 Cuis 903583
File Line 3300000 Cuis 973791
File Lines 3370173 Cuis 999451
..................................................File Line 100000 Valid Cuis 0
..................................................File Line 200000 Valid Cuis 0
..................................................File Line 300000 Valid Cuis 0
..................................................File Line 400000 Valid Cuis 0
..................................................File Line 500000 Valid Cuis 0
..................................................File Line 600000 Valid Cuis 0
..................................................File Line 700000 Valid Cuis 0
..................................................File Line 800000 Valid Cuis 0
..................................................File Line 900000 Valid Cuis 0
..................................................File Line 1000000 Valid Cuis 0
..................................................File Line 1100000 Valid Cuis 0
..................................................File Line 1200000 Valid Cuis 0
..................................................File Line 1300000 Valid Cuis 0
..................................................File Line 1400000 Valid Cuis 0
..................................................File Line 1500000 Valid Cuis 0
..................................................File Line 1600000 Valid Cuis 0
..................................................File Line 1700000 Valid Cuis 0
..................................................File Line 1800000 Valid Cuis 0
..................................................File Line 1900000 Valid Cuis 0
..................................................File Line 2000000 Valid Cuis 0
..................................................File Line 2100000 Valid Cuis 0
..................................................File Line 2200000 Valid Cuis 0
..................................................File Line 2300000 Valid Cuis 0
..................................................File Line 2400000 Valid Cuis 0
..................................................File Line 2500000 Valid Cuis 0
..................................................File Line 2600000 Valid Cuis 0
..................................................File Line 2700000 Valid Cuis 0
..................................................File Line 2800000 Valid Cuis 0
..................................................File Line 2900000 Valid Cuis 0
..................................................File Line 3000000 Valid Cuis 0
..................................................File Line 3100000 Valid Cuis 0
..................................................File Line 3200000 Valid Cuis 0
..................................................File Line 3300000 Valid Cuis 0
..................................................File Line 3400000 Valid Cuis 0
..................................................File Line 3500000 Valid Cuis 0
..................................................File Line 3600000 Valid Cuis 0
..................................................File Line 3700000 Valid Cuis 0
..................................................File Line 3800000 Valid Cuis 0
..................................................File Line 3900000 Valid Cuis 0
..................................................File Line 4000000 Valid Cuis 0
..................................................File Line 4100000 Valid Cuis 0
..................................................File Line 4200000 Valid Cuis 0
..................................................File Line 4300000 Valid Cuis 0
..................................................File Line 4400000 Valid Cuis 0
..................................................File Line 4500000 Valid Cuis 0
..................................................File Line 4600000 Valid Cuis 0
..................................................File Line 4700000 Valid Cuis 0
..................................................File Line 4800000 Valid Cuis 0
..................................................File Line 4900000 Valid Cuis 0
..................................................File Line 5000000 Valid Cuis 0
..................................................File Line 5100000 Valid Cuis 0
..................................................File Line 5200000 Valid Cuis 0
..................................................File Line 5300000 Valid Cuis 0
..................................................File Line 5400000 Valid Cuis 0
..................................................File Line 5500000 Valid Cuis 0
..................................................File Line 5600000 Valid Cuis 0
..................................................File Line 5700000 Valid Cuis 0
..................................................File Line 5800000 Valid Cuis 0
..................................................File Line 5900000 Valid Cuis 0
..................................................File Line 6000000 Valid Cuis 0
..................................................File Line 6100000 Valid Cuis 0
..................................................File Line 6200000 Valid Cuis 0
..................................................File Line 6300000 Valid Cuis 0
..................................................File Line 6400000 Valid Cuis 0
..................................................File Line 6500000 Valid Cuis 0
..................................................File Line 6600000 Valid Cuis 0
..................................................File Line 6700000 Valid Cuis 0
..................................................File Line 6800000 Valid Cuis 0
..................................................File Line 6900000 Valid Cuis 0
..................................................File Line 7000000 Valid Cuis 0
..................................................File Line 7100000 Valid Cuis 0
..................................................File Line 7200000 Valid Cuis 0
..................................................File Line 7300000 Valid Cuis 0
..................................................File Line 7400000 Valid Cuis 0
..................................................File Line 7500000 Valid Cuis 0
..................................................File Line 7600000 Valid Cuis 0
..................................................File Line 7700000 Valid Cuis 0
..................................................File Line 7800000 Valid Cuis 0
..................................................File Line 7900000 Valid Cuis 0
..................................................File Line 8000000 Valid Cuis 0
..................................................File Line 8100000 Valid Cuis 0
..................................................File Line 8200000 Valid Cuis 0
..................................................File Line 8300000 Valid Cuis 0
..................................................File Line 8400000 Valid Cuis 0
..................................................File Line 8500000 Valid Cuis 0
..................................................File Line 8600000 Valid Cuis 0
..................................................File Line 8700000 Valid Cuis 0
..................................................File Line 8800000 Valid Cuis 0
.............File Lines 8827152 Valid Cuis 0
Compiling map of Umls Cuis and Texts
..................................................File Line 100000 Terms 0
..................................................File Line 200000 Terms 0
..................................................File Line 300000 Terms 0
..................................................File Line 400000 Terms 0
..................................................File Line 500000 Terms 0
..................................................File Line 600000 Terms 0
..................................................File Line 700000 Terms 0
..................................................File Line 800000 Terms 0
..................................................File Line 900000 Terms 0
..................................................File Line 1000000 Terms 0
..................................................File Line 1100000 Terms 0
..................................................File Line 1200000 Terms 0
..................................................File Line 1300000 Terms 0
..................................................File Line 1400000 Terms 0
..................................................File Line 1500000 Terms 0
..................................................File Line 1600000 Terms 0
..................................................File Line 1700000 Terms 0
..................................................File Line 1800000 Terms 0
..................................................File Line 1900000 Terms 0
..................................................File Line 2000000 Terms 0
..................................................File Line 2100000 Terms 0
..................................................File Line 2200000 Terms 0
..................................................File Line 2300000 Terms 0
..................................................File Line 2400000 Terms 0
..................................................File Line 2500000 Terms 0
..................................................File Line 2600000 Terms 0
..................................................File Line 2700000 Terms 0
..................................................File Line 2800000 Terms 0
..................................................File Line 2900000 Terms 0
..................................................File Line 3000000 Terms 0
..................................................File Line 3100000 Terms 0
..................................................File Line 3200000 Terms 0
..................................................File Line 3300000 Terms 0
..................................................File Line 3400000 Terms 0
..................................................File Line 3500000 Terms 0
..................................................File Line 3600000 Terms 0
..................................................File Line 3700000 Terms 0
..................................................File Line 3800000 Terms 0
..................................................File Line 3900000 Terms 0
..................................................File Line 4000000 Terms 0
..................................................File Line 4100000 Terms 0
..................................................File Line 4200000 Terms 0
..................................................File Line 4300000 Terms 0
..................................................File Line 4400000 Terms 0
..................................................File Line 4500000 Terms 0
..................................................File Line 4600000 Terms 0
..................................................File Line 4700000 Terms 0
..................................................File Line 4800000 Terms 0
..................................................File Line 4900000 Terms 0
..................................................File Line 5000000 Terms 0
..................................................File Line 5100000 Terms 0
..................................................File Line 5200000 Terms 0
..................................................File Line 5300000 Terms 0
..................................................File Line 5400000 Terms 0
..................................................File Line 5500000 Terms 0
..................................................File Line 5600000 Terms 0
..................................................File Line 5700000 Terms 0
..................................................File Line 5800000 Terms 0
..................................................File Line 5900000 Terms 0
..................................................File Line 6000000 Terms 0
..................................................File Line 6100000<tel:6100000> Terms 0 ..................................................File Line 6200000<tel:6200000> Terms 0 ..................................................File Line 6300000<tel:6300000> Terms 0 ..................................................File Line 6400000<tel:6400000> Terms 0 ..................................................File Line 6500000<tel:6500000> Terms 0 ..................................................File Line 6600000<tel:6600000> Terms 0 ..................................................File Line 6700000<tel:6700000> Terms 0 ..................................................File Line 6800000<tel:6800000> Terms 0 ..................................................File Line 6900000<tel:6900000> Terms 0 ..................................................File Line 7000000<tel:7000000> Terms 0 ..................................................File Line 7100000<tel:7100000> Terms 0 ..................................................File Line 7200000<tel:7200000> Terms 0 ..................................................File Line 7300000<tel:7300000> Terms 0 ..................................................File Line 7400000<tel:7400000> Terms 0 ..................................................File Line 7500000<tel:7500000> Terms 0 ..................................................File Line 7600000<tel:7600000> Terms 0 ..................................................File Line 7700000<tel:7700000> Terms 0 ..................................................File Line 7800000<tel:7800000> Terms 0 ..................................................File Line 7900000<tel:7900000> Terms 0 ..................................................File Line 8000000<tel:8000000> Terms 0 ..................................................File Line 8100000<tel:8100000> Terms 0 ..................................................File Line 8200000<tel:8200000> Terms 0 
..................................................File Line 8300000<tel:8300000> Terms 0 ..................................................File Line 8400000<tel:8400000> Terms 0 ..................................................File Line 8500000<tel:8500000> Terms 0 ..................................................File Line 8600000<tel:8600000> Terms 0 ..................................................File Line 8700000<tel:8700000> Terms 0 ..................................................File Line 8800000<tel:8800000> Terms 0 .............File Line 8827152<tel:8827152> Terms 0 Writing map of Cuis and Texts to pathtoUmls2015.bsv

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 4:00 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thank you! I believe that was a change post 2011! You should actually be ok with both SNOMEDCT and SNOMEDCT_US in CtakesSources.txt

Cheers,
Sean

-----Original Message-----
From: Maite Meseure Hugues [mailto:meseure.maite@gmail.com]
Sent: Wednesday, September 16, 2015 3:43 PM
To: dev@ctakes.apache.org
Subject: Re: Fast Dictionary Update

If this helps, I had to replace 'SNOMEDCT' with 'SNOMEDCT_US' in CtakesSources.txt.
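
Why the source name matters can be sketched as a two-stage filter. This is only a guess at the logic, inferred from the tool's console output ("Cuis with wanted Tuis" from MRSTY.RRF, then "Valid Cuis"); it is not the dictionarytool's actual code. The function names and sample rows are invented for illustration; the column positions (TUI in the 2nd field of MRSTY.RRF, SAB in the 12th field of MRCONSO.RRF) follow the standard UMLS layout.

```python
def cuis_with_tuis(mrsty_lines, wanted_tuis):
    """Stage 1: collect CUIs whose semantic type (TUI) is wanted.
    MRSTY.RRF rows start with CUI|TUI|..."""
    cuis = set()
    for line in mrsty_lines:
        fields = line.split("|")
        if len(fields) > 1 and fields[1] in wanted_tuis:
            cuis.add(fields[0])
    return cuis


def valid_cuis(mrconso_lines, candidate_cuis, wanted_sources):
    """Stage 2: keep candidates with at least one MRCONSO row whose
    SAB (12th field, index 11) exactly matches a wanted source name."""
    valid = set()
    for line in mrconso_lines:
        fields = line.split("|")
        if (len(fields) > 11
                and fields[0] in candidate_cuis
                and fields[11] in wanted_sources):
            valid.add(fields[0])
    return valid


# Invented sample rows; only the CUI and the column layout are real.
mrsty = ["C0231749|T184|A0.0|Sign or Symptom|AT00000000||"]
mrconso = ["C0231749|ENG|P|L0000001|PF|S0000001|Y|A0000001||1234||SNOMEDCT_US|PT|1234|Knee pain|9|N||"]

candidates = cuis_with_tuis(mrsty, {"T184"})
print(len(valid_cuis(mrconso, candidates, {"SNOMEDCT"})))     # old source name
print(len(valid_cuis(mrconso, candidates, {"SNOMEDCT_US"})))  # name used in newer releases
```

With an exact comparison like this, stage 1 can find plenty of CUIs while stage 2 still ends with zero valid CUIs, which matches the symptom reported in this thread.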

On Wed, Sep 16, 2015 at 2:33 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:

I'm not sure that I understand your question. As I sent it, the anat, snomed and rxnorm are not separate runs. The args line I sent earlier is for a single run that will create a dictionary with snomed and rxnorm terms. The anatomy tui list has a special use in correctly processing snomed codes.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 3:27 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ok, hopefully one last question.

Based on your example everything runs; however, the Anat and Snomed runs don't produce any valid CUIs, but RXNorm does. I'm not sure if this has anything to do with it, but every UMLS source read is against MRSTY.

Here's my command

java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 -umls /path/to/UMLS/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -ol path o ileUmls2015.bsv

Any suggestions?

Thanks again,
Brandon


-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 3:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, that will make the rare word dictionary in a memory-based hsql database - the same as the default for the dictionary-lookup-fast module.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 2:42 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thanks Sean, much appreciated. To clarify, would the example below create the dictionary used by the rare word approach?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 2:16 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I just checked in a bin/dictionarytool.zip. It should have everything that you need (.jar, lib/, data/).
java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 [args]
should do the trick.

To recreate a 2015 version of the current ctakes dictionary, the arguments are:
-umls my/path/to/2015AA/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -db jdbc:hsqldb:file:my/path/to/snorx2015 -tbl CUI_TERMS

Create my/path/to/snorx2015 by copying
resources/memdbtemplate/ctakesumls.properties to my/path/to/snorx2015.properties - there is a resources/README about this.

Before populating a DB, I usually do a trial run first, writing to a flat file. Replace "-db ... -tbl ..." with "-ol my/path/to/testout.bsv"


Sean
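
One practical note on the command above: the `;` classpath separator is the Windows form, while Unix-like systems use `:` (and a Unix shell treats an unquoted `;` as the end of the command). A minimal sketch of building the same invocation portably is below. The helper name `dictionary_creator_command` is invented for illustration; the class name and arguments are the ones quoted in this message.

```python
import os


def dictionary_creator_command(umls_meta, out_args, pathsep=os.pathsep):
    """Build the DictionaryCreator2 invocation as an argument list.
    os.pathsep is ';' on Windows and ':' elsewhere."""
    classpath = pathsep.join(["dictionarytool.jar", "lib/*"])
    return [
        "java", "-cp", classpath,
        "org.apache.ctakes.dictionarytool.DictionaryCreator2",
        "-umls", umls_meta,
        "-fd", "./data/tiny",
        "-atui", "./data/tiny/CtakesAnatTuis.txt",
        "-tui", "./data/tiny/CtakesSnomedTuis.txt",
        *out_args,
    ]


# Trial run to a flat file, then the real run into the hsql database.
trial = dictionary_creator_command(
    "my/path/to/2015AA/META", ["-ol", "my/path/to/testout.bsv"])
real = dictionary_creator_command(
    "my/path/to/2015AA/META",
    ["-db", "jdbc:hsqldb:file:my/path/to/snorx2015", "-tbl", "CUI_TERMS"])
print(" ".join(trial))
```

Passing such a list to `subprocess.run(trial, check=True)` from the unzipped tool directory avoids shell quoting issues, since `lib/*` reaches java unexpanded.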

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:49 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

That'd be great.

I think I'm building it incorrectly because after I build the jar and try to run specifying DictionaryCreator2 as the main class it says it can't find it. I'm not too familiar with Java and building projects/jars so it could be my ignorance causing the problem.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 1:45 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I can send you a jar or commit one pre-built. What goes wrong when you try to build the tool?

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:23 PM
To: dev@ctakes.apache.org
Subject: Fast Dictionary Update

Does someone have the DictionaryTool jar available? I'm having trouble creating the jar file from the project and would like to be able to create an updated UMLS fast dictionary for 2015.

Thanks,
Brandon



RE: Fast Dictionary Update

Posted by Tomasz Oliwa <ol...@uchicago.edu>.
Hi,

The built-in cTAKES Fast Dictionary (UMLS 2011) contains about 490,000 rows (each synonym of the same CUI counted as a row), while the 2015 UMLS Fast Dictionary created via the dictionarytool has about 660,000 rows.

I noticed that CUIs tend to have more synonyms in the 2015 UMLS Fast Dictionary, which I suppose is what leads to the increase in rows. For instance, the CUI C0231749 "knee pain" has 15 rows in the default cTAKES UMLS dictionary but 24 rows in the 2015 one.

How can I control which subset of synonyms is taken by the dictionarytool per CUI when the Fast Dictionary is created?

The UMLS Metathesaurus itself has (many) more than 24 synonyms for C0231749, so I imagine the subset can be set up somewhere in the dictionarytool?

Thanks,
Tomasz 
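
To see which CUIs account for the growth, one option is to export both dictionaries as flat files and compare rows per CUI. The sketch below assumes each line of the .bsv output starts with the CUI followed by a `|`; that layout is an assumption, and the helper names and sample rows are invented for illustration.

```python
from collections import Counter


def rows_per_cui(bsv_lines):
    """Count rows (synonyms) per CUI, taking the CUI as the first bar-separated field."""
    counts = Counter()
    for line in bsv_lines:
        line = line.strip()
        if line:
            counts[line.split("|", 1)[0]] += 1
    return counts


def largest_increases(old_counts, new_counts, top=10):
    """Return the CUIs whose row count grew the most between two dictionaries."""
    deltas = {cui: new_counts[cui] - old_counts[cui] for cui in new_counts}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]


# Invented sample rows standing in for the 2011 and 2015 exports.
old = rows_per_cui(["C0231749|knee pain", "C0231749|pain in knee"])
new = rows_per_cui([
    "C0231749|knee pain", "C0231749|pain in knee",
    "C0231749|knee joint pain", "C0231749|painful knee",
    "C0000000|example term",
])
print(largest_increases(old, new))
```

This only measures the difference; it says nothing about how the dictionarytool chooses synonyms.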


________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: Monday, October 19, 2015 9:02 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

Good catch, and thanks for letting me know.  Feel free to check in a fix, otherwise it will probably be a while before I get to it.

Thanks,
Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Monday, October 19, 2015 8:50 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

I finally had a chance to look at the SNOMEDCT issue further regarding the codingScheme populating using the default value.  What I found was in the dictionary tool when running the CodeMapCreator, when the CuiCodesDbWriter is called, the collection uses the name passed into the method, which is SNOMEDCT.  However, if you are using SNOMEDCT_US the collection name is SNOMEDCT_US instead of SNOMEDCT, so it never populates the hsqldb.  Obviously an easy change to make, but thought it might be helpful feedback.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Monday, September 21, 2015 10:39 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

Sorry for the late reply - I've been out for an extended weekend.

The coding scheme change is fairly simply explained (imo).  The plain old CUI is not a snomed code.  If the snomed codes are reported by ctakes (uncomment the snomed line in ctakesHsql.xml ) then their UmlsConcept entries in the ontology array have the coding scheme name "SNOMEDCT".
            <!-- Optional tables for optional term info.
            Uncommenting these lines alone may not persist term information;
            persistence depends upon the TermConsumer.  -->
            <property key="snomedTable" value="snomedct"/>

Basically, the "CTAKES" name indicates that the scheme only contains Umls Cuis that have TUIs of the default ctakes configuration.  ctakes does not use all umls tuis, therefore I did not name the scheme "UMLS".  If you make a custom scheme (etc.) you can change the name in cTakesHsql.xml or in a custom .xml
          <!-- Depending upon the consumer, the value of codingScheme may or may not be used.  With the packaged consumers,
          codingScheme is a default value used only for cuis that do not have secondary codes (snomed, rxnorm, etc.)  -->
         <property key="codingScheme" value="CTAKES"/>
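
The default-versus-secondary-code behavior described in that comment can be sketched as follows. This is a hypothetical illustration, not the packaged consumer's code: the function name, the dictionary layout, and the sample code value are invented.

```python
DEFAULT_SCHEME = "CTAKES"  # stands in for the codingScheme property value


def concept_entries(cui, secondary_codes, default_scheme=DEFAULT_SCHEME):
    """Return (cui, scheme, code) entries for one term.

    secondary_codes maps a scheme name (e.g. "SNOMEDCT") to a list of codes.
    The default scheme is used only when the CUI has no secondary codes.
    """
    entries = [
        (cui, scheme, code)
        for scheme, codes in secondary_codes.items()
        for code in codes
    ]
    if not entries:
        entries = [(cui, default_scheme, None)]
    return entries


# "12345" is a placeholder, not a real SNOMED code.
print(concept_entries("C0231749", {"SNOMEDCT": ["12345"]}))
print(concept_entries("C0231749", {}))
```

Under this reading, seeing "CTAKES" on every concept suggests that the secondary code tables were empty or not enabled, rather than that the scheme itself changed.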


The "RelationsExtractor" in the dictionary creator tool is completely experimental and unfinished - but perhaps some day it will throw umls relations into a format that ctakes can directly use.  For the time being it should be avoided.

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 10:23 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

You can disregard my question about the relation extraction as I fixed this by building the new dictionary with the default data files in the dictionarytool.  I am curious about the SNOMED change still though.

Thanks,
Brandon

-----Original Message-----
From: Geise, Brandon D.
Sent: Thursday, September 17, 2015 9:40 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: RE: Fast Dictionary Update

Thanks Dmitriy.  I was referring to the RelationsExtractor class found in the dictionarytool.  On a similar note, the coding scheme for all SNOMEDCT codes for the new dictionary is CTAKES compared to SNOMED with the UMLS version packaged with cTakes.  Is there something else I need to run for the dictionary creation that I'm missing?

Thanks,
Brandon

-----Original Message-----
From: Dligach, Dmitriy [mailto:Dmitriy.Dligach@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 8:42 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: Re: Fast Dictionary Update

Hi Brandon,

Relation extraction at the moment only handles two specific relation types: LocationOf and DegreeOf. You are welcome to run it if you need these specific relations.


Dima

--
Dmitriy (Dima) Dligach, Ph.D.
Boston Children's Hospital and Harvard Medical School
(617) 651-0397



On Sep 17, 2015, at 17:08, Geise, Brandon D. <bd...@geisinger.edu> wrote:

Does the RelationsExtractor need to be run in order to generate information on relationships from cTakes?  When running with 2011 UMLS dictionary I'm able to get relationships for BodyLocationMentions but with the dictionary I created I am not getting this information.  Any advice?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 1:18 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

It claims that the database is connected, and the preceding line of dots was printed during loading, which took ~3-4 seconds (so something was there):
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

Strange.  I don't really know what to tell you right now.  Perhaps something will click with me later ...


Did you also run org.apache.ctakes.dictionarytool.CodeMapCreator ?  It isn't strictly necessary but it stores the tuis in the database so that cTakes can identify the semantic group of a mention.




-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 1:02 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Not specifically loaded.  Here's what I see when loading the pipeline:

17 Sep 2015 12:58:54  INFO JdbcConnectionFactory - Connecting to jdbc:hsqldb:file:path/to/ctakes/ctakes-dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/UMLS2015/snorx2015:
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 12:57 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Making an alternate copy of cTakesHsql.xml and pointing to the new dictionary is all that is necessary.  Do you see a message in the initialization output indicating that the dictionary db has been loaded?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 12:54 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Great, thanks. Both seemed to work for populating the script table.

Besides the path to the new dictionary needing to be changed in cTakesHsql.xml, does anything else need to be modified to use the new dictionary?  My pipeline runs; however, there aren't any annotations related to the UMLS concepts.  The only annotations I'm seeing are date, roman numeral, or modifier related. (My pipeline is UMLSFastProcessor with additions for modifiers and templatefiller.)  Any suggestions would be appreciated.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 10:40 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Correct, Hsql should automatically read the .log file upon first use, and then perform the inserts into the .script file.

In case you want to play it safe, check the README in the resources/ directory (where you got the hsqldb template).  The last paragraph indicates how you can launch a simple sql tool to play with the db.  You will need to change the name of the db accordingly.  Upon first launch of the sql tool everything should be moved from the .log to the .script file.   It is a strange setup/workflow, but it seems to work.

Sean
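
Before opening the database, a quick check that the tool actually wrote rows is to count the INSERT statements in the .log file. The sketch below assumes the log holds one SQL statement per line, possibly prefixed by a session marker such as `/*C1*/`; the helper name and the sample statements (including their column values) are invented for illustration.

```python
import re
from collections import Counter

# Optional session marker, then the target table name of the INSERT.
INSERT_RE = re.compile(r"^(?:/\*C\d+\*/)?INSERT INTO (\S+)", re.IGNORECASE)


def inserts_per_table(log_lines):
    """Count INSERT statements per table in an hsqldb .log file."""
    counts = Counter()
    for line in log_lines:
        match = INSERT_RE.match(line)
        if match:
            counts[match.group(1).upper()] += 1
    return counts


sample_log = [
    "/*C1*/SET SCHEMA PUBLIC",
    "INSERT INTO CUI_TERMS VALUES(231749,0,2,'knee pain','knee')",
    "INSERT INTO CUI_TERMS VALUES(231749,1,2,'pain in knee','knee')",
    "COMMIT",
]
print(inserts_per_table(sample_log))
```

For a real file, pass an open file handle instead of the sample list, for example `inserts_per_table(open("my/path/to/snorx2015.log", encoding="utf-8"))`.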

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 10:31 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

When I run the tool it outputs a file with a .log extension that has all the insert statements.  Do I copy this to the .script template from memcachedb in the dictionarytool project or should the inserts be put into the .script file by default on the program execution?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:59 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Excellent!

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 9:55 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

No, I had changed it on the Tiny source file.  I just changed the default file and it looks to be running as expected now.

Thank you for all your help and patience,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:35 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Did you add it to data/default/CtakesSources.txt ?

If not then you need to specify -src ./data/tiny/CtakesSources.txt

Sorry for any confusion.

As soon as my inet isn't overloaded I'll download 2015AA and see if I can build a dictionary.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 8:14 PM
To: dev@ctakes.apache.org; dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Sean,

I added that and still had the same issue.

Thanks,
Brandon
_____________________________
From: Finan, Sean <se...@childrens.harvard.edu>
Sent: Wednesday, September 16, 2015 7:56 PM
Subject: RE: Fast Dictionary Update
To: <de...@ctakes.apache.org>


And you added "SNOMEDCT_US" to data/tiny/CtakesSources.txt ?

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu]
Sent: Wednesday, September 16, 2015 7:13 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I have exactly the same problem with the tool.

A grep on MRCONSO.RRF for "SNOMEDCT" or for "SNOMEDCT_US" shows many lines.
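
Note that a plain grep for "SNOMEDCT" also matches "SNOMEDCT_US", so it cannot tell the two names apart. A minimal sketch that counts exact source abbreviations instead is below. It relies on the standard MRCONSO.RRF layout, where SAB is the 12th pipe-delimited field; the function name and the sample rows are invented for illustration.

```python
from collections import Counter

SAB_COLUMN = 11  # 12th pipe-delimited field of MRCONSO.RRF (source abbreviation)


def count_sources(lines):
    """Count MRCONSO rows per exact source abbreviation (SAB)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > SAB_COLUMN:
            counts[fields[SAB_COLUMN]] += 1
    return counts


# Invented sample rows; only the column layout is real.
sample = [
    "C0231749|ENG|P|L0000001|PF|S0000001|Y|A0000001||1234||SNOMEDCT_US|PT|1234|Knee pain|9|N||",
    "C0231749|ENG|S|L0000002|PF|S0000002|Y|A0000002||||MSH|ET|D000001|Pain in knee|0|N||",
]
counts = count_sources(sample)
print(counts["SNOMEDCT"], counts["SNOMEDCT_US"])
```

Running `count_sources(open("MRCONSO.RRF", encoding="utf-8"))` on a real release shows which exact names are present, and therefore what belongs in CtakesSources.txt.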

________________________________________
From: Geise, Brandon D. [bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 5:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, it finds "SNOMEDCT_US".

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 5:17 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ah, now I see what you mean. Can you do a grep on your MRCONSO.RRF for "SNOMEDCT" ?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 4:04 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I tried changing as suggested.

Below is what I see for the snomed piece, but for RXNorm it writes terms at the end.

Reading list of Source Types from ./data/default/CtakesSources.txt
File Lines 1 list of Source Types 1
Reading list of Tuis from ./data/tiny/CtakesSnomedTuis.txt
File Lines 24 list of Tuis 24
Compiling list of Cuis with wanted Tuis using /patto/UMLS_Current_Version/META/MRSTY.RRF
File Line 200000 Cuis 60895
File Line 300000 Cuis 85750
File Line 400000 Cuis 135098
File Line 600000 Cuis 183925
File Line 1700000 Cuis 376338
File Line 1800000 Cuis 471009
File Line 1900000 Cuis 568375
File Line 2100000 Cuis 674715
File Line 2800000 Cuis 903583
File Line 3300000 Cuis 973791
File Lines 3370173 Cuis 999451
..................................................File Line 100000 Valid Cuis 0
[... one progress line per 100000 file lines, all Valid Cuis 0 ...]
..................................................File Line 8800000 Valid Cuis 0
.............File Lines 8827152 Valid Cuis 0
Compiling map of Umls Cuis and Texts
..................................................File Line 100000 Terms 0
[... one progress line per 100000 file lines, all Terms 0 ...]
..................................................File Line 3800000 Terms 0
..................................................File Line 3900000<tel:3900000> Terms 0 ..................................................File Line 4000000<tel:4000000> Terms 0 ..................................................File Line 4100000<tel:4100000> Terms 0 ..................................................File Line 4200000<tel:4200000> Terms 0 ..................................................File Line 4300000<tel:4300000> Terms 0 ..................................................File Line 4400000<tel:4400000> Terms 0 ..................................................File Line 4500000<tel:4500000> Terms 0 ..................................................File Line 4600000<tel:4600000> Terms 0 ..................................................File Line 4700000<tel:4700000> Terms 0 ..................................................File Line 4800000<tel:4800000> Terms 0 ..................................................File Line 4900000<tel:4900000> Terms 0 ..................................................File Line 5000000<tel:5000000> Terms 0 ..................................................File Line 5100000<tel:5100000> Terms 0 ..................................................File Line 5200000<tel:5200000> Terms 0 ..................................................File Line 5300000<tel:5300000> Terms 0 ..................................................File Line 5400000<tel:5400000> Terms 0 ..................................................File Line 5500000<tel:5500000> Terms 0 ..................................................File Line 5600000<tel:5600000> Terms 0 ..................................................File Line 5700000<tel:5700000> Terms 0 ..................................................File Line 5800000<tel:5800000> Terms 0 ..................................................File Line 5900000<tel:5900000> Terms 0 ..................................................File Line 6000000<tel:6000000> Terms 0 
..................................................File Line 6100000<tel:6100000> Terms 0 ..................................................File Line 6200000<tel:6200000> Terms 0 ..................................................File Line 6300000<tel:6300000> Terms 0 ..................................................File Line 6400000<tel:6400000> Terms 0 ..................................................File Line 6500000<tel:6500000> Terms 0 ..................................................File Line 6600000<tel:6600000> Terms 0 ..................................................File Line 6700000<tel:6700000> Terms 0 ..................................................File Line 6800000<tel:6800000> Terms 0 ..................................................File Line 6900000<tel:6900000> Terms 0 ..................................................File Line 7000000<tel:7000000> Terms 0 ..................................................File Line 7100000<tel:7100000> Terms 0 ..................................................File Line 7200000<tel:7200000> Terms 0 ..................................................File Line 7300000<tel:7300000> Terms 0 ..................................................File Line 7400000<tel:7400000> Terms 0 ..................................................File Line 7500000<tel:7500000> Terms 0 ..................................................File Line 7600000<tel:7600000> Terms 0 ..................................................File Line 7700000<tel:7700000> Terms 0 ..................................................File Line 7800000<tel:7800000> Terms 0 ..................................................File Line 7900000<tel:7900000> Terms 0 ..................................................File Line 8000000<tel:8000000> Terms 0 ..................................................File Line 8100000<tel:8100000> Terms 0 ..................................................File Line 8200000<tel:8200000> Terms 0 
..................................................File Line 8300000<tel:8300000> Terms 0 ..................................................File Line 8400000<tel:8400000> Terms 0 ..................................................File Line 8500000<tel:8500000> Terms 0 ..................................................File Line 8600000<tel:8600000> Terms 0 ..................................................File Line 8700000<tel:8700000> Terms 0 ..................................................File Line 8800000<tel:8800000> Terms 0 .............File Line 8827152<tel:8827152> Terms 0 Writing map of Cuis and Texts to pathtoUmls2015.bsv

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 4:00 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thank you! I believe that was a change post 2011! You should actually be ok with both SNOMEDCT and SNOMEDCT_US in CtakesSources.txt

Cheers,
Sean

-----Original Message-----
From: Maite Meseure Hugues [mailto:meseure.maite@gmail.com]
Sent: Wednesday, September 16, 2015 3:43 PM
To: dev@ctakes.apache.org
Subject: Re: Fast Dictionary Update

If this helps, I had to replace 'SNOMEDCT' with 'SNOMEDCT_US' in CtakesSources.txt.

On Wed, Sep 16, 2015 at 2:33 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:

I'm not sure that I understand your question. As I sent it, the anat, snomed and rxnorm are not separate runs. The args line I sent earlier is for a single run that will create a dictionary with snomed and rxnorm terms. The anatomy tui list has a special use in correctly processing snomed codes.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 3:27 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ok, hopefully one last question.

Based on your example everything runs, however the Anat and Snomed runs don't produce any valid CUIs but RXNorm does. I'm not sure if this has anything to do with it but every UMLS source read is against MRSTY.

Here's my command

java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 -umls /path/to/UMLS/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -ol path\to\fileUmls2015.bsv

Any suggestions?

Thanks again,
Brandon


-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 3:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, that will make the rare word dictionary in a memory-based hsql database - the same as the default for the dictionary-lookup-fast module.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 2:42 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thanks Sean, much appreciated. To clarify, would the example below create the dictionary for use with the rare word approach?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 2:16 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I just checked in a bin/dictionarytool.zip. It should have everything that you need (.jar, lib/, data/).
java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 [args]
should do the trick.

To recreate a 2015 version of the current ctakes dictionary, the arguments are:
-umls my/path/to/2015AA/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -db jdbc:hsqldb:file:my/path/to/snorx2015 -tbl CUI_TERMS

Create my/path/to/snorx2015 by copying
resources/memdbtemplate/ctakesumls.properties to my/path/to/snorx2015.properties - there is a resources/README about this.

Before populating a DB, I usually do a trial run first, writing to a flat file. Replace "-db ... -tbl ..." with "-ol my/path/to/testout.bsv"
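
A trial-run output file can be sanity-checked before populating a DB. The following is a minimal sketch, not part of the dictionary tool: it assumes the .bsv file is bar-separated with the CUI in the first column, and the class and method names (BsvCheck, countDistinctFirstColumn) are hypothetical. A result of 0 corresponds to a run that reported "Terms 0".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of cTAKES: counts distinct first-column
// values in a bar-separated file (assumed layout: CUI|text).
final class BsvCheck {

    static int countDistinctFirstColumn(BufferedReader reader) throws IOException {
        Set<String> firstColumn = new HashSet<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int bar = line.indexOf('|');
            // If a line has no bar, treat the whole line as the key.
            firstColumn.add(bar < 0 ? line : line.substring(0, bar));
        }
        return firstColumn.size();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path given to -ol, e.g. my/path/to/testout.bsv
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            System.out.println("Distinct first-column values: " + countDistinctFirstColumn(reader));
        }
    }
}
```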


Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:49 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

That'd be great.

I think I'm building it incorrectly because after I build the jar and try to run specifying DictionaryCreator2 as the main class it says it can't find it. I'm not too familiar with Java and building projects/jars so it could be my ignorance causing the problem.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 1:45 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I can send you a jar or commit one pre-built. What goes wrong when you try to build the tool?

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:23 PM
To: 'dev@ctakes.apache.org'
Subject: Fast Dictionary Update

Does someone have the DictionaryTool jar available? I'm having trouble creating the jar file from the project and would like to be able to create an updated UMLS fast dictionary for 2015.

Thanks,
Brandon



RE: Fast Dictionary Update

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Brandon,

Good catch, and thanks for letting me know.  Feel free to check in a fix, otherwise it will probably be a while before I get to it.

Thanks,
Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu] 
Sent: Monday, October 19, 2015 8:50 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

I finally had a chance to look at the SNOMEDCT issue further regarding the codingScheme populating using the default value.  What I found was in the dictionary tool when running the CodeMapCreator, when the CuiCodesDbWriter is called, the collection uses the name passed into the method, which is SNOMEDCT.  However, if you are using SNOMEDCT_US the collection name is SNOMEDCT_US instead of SNOMEDCT, so it never populates the hsqldb.  Obviously an easy change to make, but thought it might be helpful feedback.
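
One way to make such a lookup tolerant of both source names is sketched below. This is only an illustration of the idea, not the actual CodeMapCreator or CuiCodesDbWriter code: the class SourceNameSketch, its alias table, and both methods are hypothetical, and the map of codes per source stands in for whatever collection the tool really uses.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not cTAKES code: treat SNOMEDCT_US as an alias of SNOMEDCT
// so that a lookup by either name finds the same codes.
final class SourceNameSketch {

    // Assumed alias table; extend it if other sources are renamed between UMLS releases.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("SNOMEDCT_US", "SNOMEDCT");
    }

    // Maps a source name to its canonical form; unknown names pass through unchanged.
    static String canonical(String sourceName) {
        return ALIASES.getOrDefault(sourceName, sourceName);
    }

    // Collects the codes of every source whose canonical name matches the wanted one.
    static List<String> codesFor(Map<String, Collection<String>> codesBySource, String wanted) {
        List<String> codes = new ArrayList<>();
        String wantedCanonical = canonical(wanted);
        for (Map.Entry<String, Collection<String>> entry : codesBySource.entrySet()) {
            if (canonical(entry.getKey()).equals(wantedCanonical)) {
                codes.addAll(entry.getValue());
            }
        }
        return codes;
    }
}
```

With this kind of normalization, asking for "SNOMEDCT" would still return codes that were collected under "SNOMEDCT_US", which is the mismatch described above.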

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu] 
Sent: Monday, September 21, 2015 10:39 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

Sorry for the late reply - I've been out for an extended weekend.

The coding scheme change is fairly simply explained (imo).  The plain old CUI is not a snomed code.  If the snomed codes are reported by ctakes (uncomment the snomed line in cTakesHsql.xml) then their UmlsConcept entries in the ontology array have the coding scheme name "SNOMEDCT".
            <!-- Optional tables for optional term info.
            Uncommenting these lines alone may not persist term information;
            persistence depends upon the TermConsumer.  -->
            <property key="snomedTable" value="snomedct"/>

Basically, the "CTAKES" name indicates that the scheme only contains Umls Cuis that have TUIs of the default ctakes configuration.  ctakes does not use all umls tuis, therefore I did not name the scheme "UMLS".  If you make a custom scheme (etc.) you can change the name in cTakesHsql.xml or in a custom .xml
          <!-- Depending upon the consumer, the value of codingScheme may or may not be used.  With the packaged consumers,
          codingScheme is a default value used only for cuis that do not have secondary codes (snomed, rxnorm, etc.)  -->
         <property key="codingScheme" value="CTAKES"/>


The "RelationsExtractor" in the dictionary creator tool is completely experimental and unfinished - but perhaps some day it will throw umls relations into a format that ctakes can directly use.  For the time being it should be avoided.

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu] 
Sent: Thursday, September 17, 2015 10:23 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

You can disregard my question about the relation extraction as I fixed this by building the new dictionary with the default data files in the dictionarytool.  I am curious about the SNOMED change still though.

Thanks,
Brandon

-----Original Message-----
From: Geise, Brandon D. 
Sent: Thursday, September 17, 2015 9:40 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: RE: Fast Dictionary Update

Thanks Dmitriy.  I was referring to the RelationsExtractor class found in the dictionarytool.  On a similar note, the coding scheme for all SNOMEDCT codes for the new dictionary is CTAKES compared to SNOMED with the UMLS version packaged with cTakes.  Is there something else I need to run for the dictionary creation that I'm missing?

Thanks,
Brandon

-----Original Message-----
From: Dligach, Dmitriy [mailto:Dmitriy.Dligach@childrens.harvard.edu] 
Sent: Thursday, September 17, 2015 8:42 PM
To: cTAKES Developer list <de...@ctakes.apache.org>
Subject: Re: Fast Dictionary Update

Hi Brandon,

Relation extraction at the moment only handles two specific relation types: LocationOf and DegreeOf. You are welcome to run it if you need these specific relations.


Dima

--
Dmitriy (Dima) Dligach, Ph.D.
Boston Children's Hospital and Harvard Medical School
(617) 651-0397



On Sep 17, 2015, at 17:08, Geise, Brandon D. <bd...@geisinger.edu> wrote:

Does the RelationsExtractor need to be run in order to generate information on relationships from cTakes?  When running with 2011 UMLS dictionary I'm able to get relationships for BodyLocationMentions but with the dictionary I created I am not getting this information.  Any advice?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 1:18 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

It claims that the database is connected, and the preceding line of dots is spat out during loading, which took ~3-4 seconds (so something was there):
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

Strange.  I don't really know what to tell you right now.  Perhaps something will click with me later ...


Did you also run org.apache.ctakes.dictionarytool.CodeMapCreator ?  It isn't strictly necessary but it stores the tuis in the database so that cTakes can identify the semantic group of a mention.




-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 1:02 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Not specifically loaded.  Here's what I see when loading the pipeline:

17 Sep 2015 12:58:54  INFO JdbcConnectionFactory - Connecting to jdbc:hsqldb:file:path/to/ctakes/ctakes-dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/UMLS2015/snorx2015:
............
17 Sep 2015 12:58:58  INFO JdbcConnectionFactory -  Database connected

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 12:57 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Making an alternate copy of cTakesHsql.xml and pointing to the new dictionary is all that is necessary.  Do you see a message in the initialization output indicating that the dictionary db has been loaded?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 12:54 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Great, thanks. Both seemed to work for populating the script table.

Besides the path to the new dictionary needing to be changed in cTakesHsql.xml, does anything else need to be modified to use the new dictionary?  My pipeline runs; however, there aren't any annotations related to the UMLS concepts.  The only annotations I'm seeing are date, roman numeral, or modifier related. (My pipeline is UMLSFastProcessor with additions for modifiers and templatefiller.)  Any suggestions would be appreciated.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, September 17, 2015 10:40 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Correct, Hsql should automatically read the .log file upon first use, and then perform the inserts into the .script file.

In case you want to play it safe, check the README in the resources/ directory (where you got the hsqldb template).  The last paragraph indicates how you can launch a simple sql tool to play with the db.  You will need to change the name of the db accordingly.  Upon first launch of the sql tool everything should be moved from the .log to the .script file.  It is a strange setup/workflow, but it seems to work.

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Thursday, September 17, 2015 10:31 AM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

When I run the tool it outputs a file with a .log extension that has all the insert statements.  Do I copy this to the .script template from memcachedb in the dictionarytool project or should the inserts be put into the .script file by default on the program execution?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:59 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Excellent!

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 9:55 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

No, I had changed it on the Tiny source file.  I just changed the default file and it looks to be running as expected now.

Thank you for all your help and patience, Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 9:35 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Did you add it to data/default/CtakesSources.txt ?

If not then you need to specify -src ./data/tiny/CtakesSources.txt

Sorry for any confusion.

As soon as my inet isn't overloaded I'll download 2015AA and see if I can build a dictionary.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 8:14 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Sean,

I added that and still had the same issue.

Thanks,
Brandon
_____________________________
From: Finan, Sean <se...@childrens.harvard.edu>
Sent: Wednesday, September 16, 2015 7:56 PM
Subject: RE: Fast Dictionary Update
To: <de...@ctakes.apache.org>


And you added "SNOMEDCT_US" to data/tiny/CtakesSources.txt ?

-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu]
Sent: Wednesday, September 16, 2015 7:13 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I have exactly the same problem with the tool.

A grep on MRCONSO.RRF for "SNOMEDCT" or for "SNOMEDCT_US" shows many lines.
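
For a per-source count rather than raw grep output, something along these lines could be used. It is a minimal sketch that relies on MRCONSO.RRF being pipe-delimited with the source abbreviation (SAB) in the twelfth field; the class name SabCounter and its method are made up for illustration and are not part of the dictionary tool.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of cTAKES: tallies MRCONSO.RRF rows per source
// abbreviation (SAB) for every SAB that starts with a given prefix.
final class SabCounter {

    // Zero-based index of the SAB field in a pipe-delimited MRCONSO.RRF row.
    static final int SAB_COLUMN = 11;

    static Map<String, Integer> countSources(BufferedReader reader, String prefix) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            // Limit -1 keeps trailing empty fields so column positions stay stable.
            String[] fields = line.split("\\|", -1);
            if (fields.length > SAB_COLUMN && fields[SAB_COLUMN].startsWith(prefix)) {
                counts.merge(fields[SAB_COLUMN], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to MRCONSO.RRF
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            countSources(reader, "SNOMEDCT")
                    .forEach((sab, n) -> System.out.println(sab + " " + n));
        }
    }
}
```

If the only key printed is SNOMEDCT_US, then a CtakesSources.txt that lists just SNOMEDCT would match no rows, which is consistent with the "Valid Cuis 0" output.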

________________________________________
From: Geise, Brandon D. [bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 5:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, it finds "SNOMEDCT_US".

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 5:17 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ah, now I see what you mean. Can you do a grep on your MRCONSO.RRF for "SNOMEDCT" ?

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 4:04 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

I tried changing as suggested.

Below is what I see for the snomed piece, but for RXNorm it writes terms at the end.

Reading list of Source Types from ./data/default/CtakesSources.txt
File Lines 1 list of Source Types 1
Reading list of Tuis from ./data/tiny/CtakesSnomedTuis.txt
File Lines 24 list of Tuis 24
Compiling list of Cuis with wanted Tuis using /patto/UMLS_Current_Version/META/MRSTY.RRF
File Line 200000 Cuis 60895
File Line 300000 Cuis 85750
File Line 400000 Cuis 135098
File Line 600000 Cuis 183925
File Line 1700000 Cuis 376338
File Line 1800000 Cuis 471009
File Line 1900000 Cuis 568375
File Line 2100000 Cuis 674715
File Line 2800000 Cuis 903583
File Line 3300000 Cuis 973791
File Lines 3370173 Cuis 999451
[... File Line 100000 through 8800000: Valid Cuis 0 at every step ...]
File Lines 8827152 Valid Cuis 0
Compiling map of Umls Cuis and Texts
[... File Line 100000 through 3800000: Terms 0 at every step ...]
..................................................File Line 3900000<tel:3900000> Terms 0 ..................................................File Line 4000000<tel:4000000> Terms 0 ..................................................File Line 4100000<tel:4100000> Terms 0 ..................................................File Line 4200000<tel:4200000> Terms 0 ..................................................File Line 4300000<tel:4300000> Terms 0 ..................................................File Line 4400000<tel:4400000> Terms 0 ..................................................File Line 4500000<tel:4500000> Terms 0 ..................................................File Line 4600000<tel:4600000> Terms 0 ..................................................File Line 4700000<tel:4700000> Terms 0 ..................................................File Line 4800000<tel:4800000> Terms 0 ..................................................File Line 4900000<tel:4900000> Terms 0 ..................................................File Line 5000000<tel:5000000> Terms 0 ..................................................File Line 5100000<tel:5100000> Terms 0 ..................................................File Line 5200000<tel:5200000> Terms 0 ..................................................File Line 5300000<tel:5300000> Terms 0 ..................................................File Line 5400000<tel:5400000> Terms 0 ..................................................File Line 5500000<tel:5500000> Terms 0 ..................................................File Line 5600000<tel:5600000> Terms 0 ..................................................File Line 5700000<tel:5700000> Terms 0 ..................................................File Line 5800000<tel:5800000> Terms 0 ..................................................File Line 5900000<tel:5900000> Terms 0 ..................................................File Line 6000000<tel:6000000> Terms 0 
..................................................File Line 6100000<tel:6100000> Terms 0 ..................................................File Line 6200000<tel:6200000> Terms 0 ..................................................File Line 6300000<tel:6300000> Terms 0 ..................................................File Line 6400000<tel:6400000> Terms 0 ..................................................File Line 6500000<tel:6500000> Terms 0 ..................................................File Line 6600000<tel:6600000> Terms 0 ..................................................File Line 6700000<tel:6700000> Terms 0 ..................................................File Line 6800000<tel:6800000> Terms 0 ..................................................File Line 6900000<tel:6900000> Terms 0 ..................................................File Line 7000000<tel:7000000> Terms 0 ..................................................File Line 7100000<tel:7100000> Terms 0 ..................................................File Line 7200000<tel:7200000> Terms 0 ..................................................File Line 7300000<tel:7300000> Terms 0 ..................................................File Line 7400000<tel:7400000> Terms 0 ..................................................File Line 7500000<tel:7500000> Terms 0 ..................................................File Line 7600000<tel:7600000> Terms 0 ..................................................File Line 7700000<tel:7700000> Terms 0 ..................................................File Line 7800000<tel:7800000> Terms 0 ..................................................File Line 7900000<tel:7900000> Terms 0 ..................................................File Line 8000000<tel:8000000> Terms 0 ..................................................File Line 8100000<tel:8100000> Terms 0 ..................................................File Line 8200000<tel:8200000> Terms 0 
..................................................File Line 8300000<tel:8300000> Terms 0 ..................................................File Line 8400000<tel:8400000> Terms 0 ..................................................File Line 8500000<tel:8500000> Terms 0 ..................................................File Line 8600000<tel:8600000> Terms 0 ..................................................File Line 8700000<tel:8700000> Terms 0 ..................................................File Line 8800000<tel:8800000> Terms 0 .............File Line 8827152<tel:8827152> Terms 0 Writing map of Cuis and Texts to pathtoUmls2015.bsv

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 4:00 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thank you! I believe that was a change made after 2011. You should actually be OK with both SNOMEDCT and SNOMEDCT_US in CtakesSources.txt.

Cheers,
Sean

-----Original Message-----
From: Maite Meseure Hugues [mailto:meseure.maite@gmail.com]
Sent: Wednesday, September 16, 2015 3:43 PM
To: dev@ctakes.apache.org
Subject: Re: Fast Dictionary Update

If this helps, I had to replace 'SNOMEDCT' with 'SNOMEDCT_US' in CtakesSources.txt.
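
One way to see which SNOMED source name a given UMLS release uses is to count the source abbreviations directly in MRCONSO.RRF. The sketch below assumes the standard pipe-delimited RRF layout, where the source abbreviation (SAB) is the 12th field; the function name count_snomed_sabs and the MRCONSO placeholder path are made up for illustration and are not part of the dictionary tool.

```shell
#!/bin/sh
# count_snomed_sabs FILE
# Prints each SNOMED-like source abbreviation found in an MRCONSO.RRF file,
# followed by the number of rows that carry it.
# Assumption: SAB is the 12th pipe-delimited field of MRCONSO.RRF.
count_snomed_sabs() {
  awk -F'|' '$12 ~ /^SNOMEDCT/ { n[$12]++ } END { for (s in n) print s, n[s] }' "$1"
}

# Placeholder location; point MRCONSO at your own META directory.
MRCONSO="${MRCONSO:-/path/to/UMLS/META/MRCONSO.RRF}"
if [ -f "$MRCONSO" ]; then
  count_snomed_sabs "$MRCONSO"
else
  echo "MRCONSO.RRF not found at $MRCONSO" >&2
fi
```

If the only name printed is SNOMEDCT_US, a CtakesSources.txt that lists only SNOMEDCT would match no SNOMED rows, which is consistent with the change described above.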

On Wed, Sep 16, 2015 at 2:33 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:

I'm not sure that I understand your question. As I sent it, the anat, snomed and rxnorm are not separate runs. The args line I sent earlier is for a single run that will create a dictionary with snomed and rxnorm terms. The anatomy tui list has a special use in correctly processing snomed codes.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 3:27 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Ok, hopefully one last question.

Based on your example, everything runs; however, the Anat and Snomed runs don't produce any valid CUIs, but RXNorm does. I'm not sure if this has anything to do with it, but every UMLS source read is against MRSTY.

Here's my command

java -cp dictionarytool.jar;lib/*
org.apache.ctakes.dictionarytool.DictionaryCreator2 -umls /path/to/UMLS/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -ol path o ileUmls2015.bsv

Any suggestions?

Thanks again,
Brandon


-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 3:05 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Yes, that will make the rare word dictionary in a memory-based hsql database - the same as the default for the dictionary-lookup-fast module.

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 2:42 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Thanks Sean, much appreciated. To clarify, would the example below create the dictionary used by the rare word approach?

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 2:16 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I just checked in a bin/dictionarytool.zip. It should have everything that you need (.jar, lib/, data/). The following should do the trick:
java -cp dictionarytool.jar;lib/* org.apache.ctakes.dictionarytool.DictionaryCreator2 [args]

To recreate a 2015 version of the current ctakes dictionary, the arguments are:
-umls my/path/to/2015AA/META -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -db jdbc:hsqldb:file:my/path/to/snorx2015 -tbl CUI_TERMS

Create my/path/to/snorx2015 by copying resources/memdbtemplate/ctakesumls.properties to my/path/to/snorx2015.properties. There is a resources/README about this.

Before populating a DB, I usually do a trial run first, writing to a flat file. Replace "-db ... -tbl ..." with "-ol my/path/to/testout.bsv"
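
The trial run can be sketched as a small dry-run script that only prints the command line, so it can be checked before anything is executed. The helper name build_trial_cmd and the UMLS_META, OUT, SEP and CP variables are illustrative, not part of the tool; the arguments are the ones listed above, with "-db ... -tbl ..." swapped for "-ol". Note that ';' is the Windows classpath separator; on Linux or macOS the separator is ':' and the classpath should be quoted so the shell neither splits the command at ';' nor expands 'lib/*'.

```shell
#!/bin/sh
# Dry run: print the DictionaryCreator2 trial command instead of executing it.
# All paths are placeholders taken from the example arguments above.
UMLS_META="my/path/to/2015AA/META"
OUT="my/path/to/testout.bsv"

# Classpath separator: defaults to ':' (Linux/macOS); set SEP=';' on Windows.
SEP="${SEP:-:}"
CP="dictionarytool.jar${SEP}lib/*"

# Emits the full command with -ol (flat-file output) in place of -db/-tbl.
build_trial_cmd() {
  printf 'java -cp "%s" org.apache.ctakes.dictionarytool.DictionaryCreator2 -umls %s -fd ./data/tiny -atui ./data/tiny/CtakesAnatTuis.txt -tui ./data/tiny/CtakesSnomedTuis.txt -ol %s\n' \
    "$CP" "$UMLS_META" "$OUT"
}

build_trial_cmd
```

Once the printed command looks right, it can be pasted into a terminal from the directory where dictionarytool.zip was unpacked. If the flat file comes out with real terms in it, the same arguments with "-db ... -tbl ..." restored should be safe to run against the database.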


Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:49 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Sean,

That'd be great.

I think I'm building it incorrectly: after I build the jar and try to run it with DictionaryCreator2 specified as the main class, it says it can't find the class. I'm not too familiar with Java or with building projects and jars, so it could be my ignorance causing the problem.

Thanks,
Brandon

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Wednesday, September 16, 2015 1:45 PM
To: dev@ctakes.apache.org
Subject: RE: Fast Dictionary Update

Hi Brandon,

I can send you a jar or commit one pre-built. What goes wrong when you try to build the tool?

Sean

-----Original Message-----
From: Geise, Brandon D. [mailto:bdgeise@geisinger.edu]
Sent: Wednesday, September 16, 2015 1:23 PM
To: dev@ctakes.apache.org
Subject: Fast Dictionary Update

Does someone have the DictionaryTool jar available? I'm having trouble creating the jar file from the project and would like to be able to create an updated UMLS fast dictionary for 2015.

Thanks,
Brandon

