Posted to dev@ctakes.apache.org by Oranit Dror <or...@algotec.co.il> on 2015/06/21 10:37:02 UTC

The fast dictionary pipeline vs. the regular one

Hello,

I am using cTAKES 3.2.2 with the regular pipeline. Recently, I tested the fast dictionary pipeline, and it is indeed much faster.
However, I have encountered several quality differences in the returned annotations. For example:


1. With the fast pipeline, the term "GBM" is annotated as "glioblastoma multiforme", while in the regular pipeline it is annotated as "glioblastoma".
Note that according to the UMLS DB, the concept of "GBM" is "glioblastoma" and "glioblastoma multiforme" is mapped to a narrower concept.


2. The word "cm" in a phrase like "5.5 cm X 2.6 cm" is annotated by the regular pipeline as "Cutaneous Mastocytosis", while in the fast pipeline it is not annotated as a medical term (as expected and as in UMLS).


Any explanation for the differences?

Thank you,
Oranit.




RE: The fast dictionary pipeline vs. the regular one

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Oranit,

>" Each is the Preferred Term in at least one of the >150 sources in the Metathesaurus. Neither is from a WHO vocabulary source. The terms are related in that Glioblastoma is the Broader term (RB) of the 2 and Glioblastoma Multiforme is the Narrower term (RN)."

Hmmm, I'm not sure why they assigned narrower and broader ... The two come from different source dictionaries and are not related in such a manner. Again, the WHO term is from the MeSH and NCI sources, while the full GBM spell-out is from CSP. Neither is from the source named WHO (which covers adverse drug reactions). See http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

The WHO classification scheme does not have glioblastoma multiforme at all, just glioblastoma. Hence there cannot be a hierarchical relationship in that ontology. Check the paper on the latest WHO classification of brain tumours: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1929165/
Or check the definition from the American Brain Tumor Association's Tumor Types page: http://www.abta.org/brain-tumor-information/types-of-tumors/
" Astrocytoma Grade IV (also called Glioblastoma, previously named “Glioblastoma Multiforme,” “Grade IV Glioblastoma,” and “GBM”)— There are two types of astrocytoma grade IV—primary, or de novo, and secondary. Primary tumors are very aggressive and the most common form of astrocytoma grade IV. The secondary tumors are those which originate as a lower-grade tumor and evolve into a grade IV tumor."

Keep in mind that the UMLS is a living document and corrections are made all the time. It is not flawless, and this might be a case that should be reported.


> In the regular pipeline, the concept array of "gbm" contains the CUI of "Glioblastoma" only, while in the fast pipeline, the concept array of "GBM" contains the CUIs of both "Glioblastoma" and "Glioblastoma Multiforme".

Another thing to keep in mind is that the regular pipeline does not always provide the best discoveries. In this case, if it is not giving you glioblastoma multiforme for GBM then it is providing incomplete information, as glioblastoma multiforme is exactly what GBM stands for and that CUI should be provided when GBM is discovered. Otherwise, if a researcher (possibly more inclined to use ...multiforme than a clinician) is searching for the ...multiforme CUI, they will not find what they are looking for and may think that a GBM does not exist.


I hope that this clears the air,
Sean
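
To make the argument in the message above concrete, here is a minimal sketch (not cTAKES code) of a lookup in which one surface form maps to several CUIs. The two CUIs and their source vocabularies are the ones cited in this thread; the dictionary layout and the function names lookup_all and lookup_first are assumptions made purely for illustration.

```python
# Toy term dictionary: one surface form may map to several UMLS CUIs.
# The CUIs and source vocabularies are those cited in this thread; the
# data layout and function names are hypothetical, not the cTAKES API.
TERM_TO_CONCEPTS = {
    "gbm": [
        ("C0017636", "Glioblastoma", ("MeSH", "NCI")),
        ("C1621958", "Glioblastoma Multiforme", ("CSP",)),
    ],
}


def lookup_all(term):
    """Return every CUI registered for the term (case-insensitive)."""
    concepts = TERM_TO_CONCEPTS.get(term.lower(), [])
    return [cui for cui, _name, _sources in concepts]


def lookup_first(term):
    """Return at most one CUI for the term, dropping any others."""
    return lookup_all(term)[:1]


def found_by_cui(wanted_cui, term, lookup):
    """Would a search for wanted_cui find a mention of the term?"""
    return wanted_cui in lookup(term)
```

With this toy data, found_by_cui("C1621958", "GBM", lookup_all) is True, while the same search against lookup_first is False. That is the situation described above: a lookup that keeps a single CUI hides the mention from anyone searching by the other CUI.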


-----Original Message-----
From: Oranit Dror [mailto:oranit@algotec.co.il] 
Sent: Monday, June 29, 2015 4:44 AM
To: dev@ctakes.apache.org
Subject: RE: The fast dictionary pipeline vs. the regular one

Hi,

Thank you all for the detailed replies.

Regarding the "Glioblastoma" and "Glioblastoma Multiforme" terms, I contacted NLM with my question, and their answer was as follows:
"Each is the Preferred Term in at least one of the >150 sources in the Metathesaurus. Neither is from a WHO vocabulary source. The terms are related in that Glioblastoma is the Broader term (RB) of the 2 and Glioblastoma Multiforme is the Narrower term (RN)."

In the regular pipeline, the concept array of "gbm" contains the CUI of "Glioblastoma" only, while in the fast pipeline, the concept array of "GBM" contains the CUIs of both "Glioblastoma" and "Glioblastoma Multiforme".

Best,
Oranit.





RE: The fast dictionary pipeline vs. the regular one

Posted by Oranit Dror <or...@algotec.co.il>.
Hi,

Thank you all for the detailed replies.

Regarding the "Glioblastoma" and "Glioblastoma Multiforme" terms, I contacted NLM with my question, and their answer was as follows:
"Each is the Preferred Term in at least one of the >150 sources in the Metathesaurus. Neither is from a WHO vocabulary source. The terms are related in that Glioblastoma is the Broader term (RB) of the 2 and Glioblastoma Multiforme is the Narrower term (RN)."

In the regular pipeline, the concept array of "gbm" contains the CUI of "Glioblastoma" only, while in the fast pipeline, the concept array of "GBM" contains the CUIs of both "Glioblastoma" and "Glioblastoma Multiforme".

Best,
Oranit.






-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu] 
Sent: Monday, June 22, 2015 5:13 PM
To: dev@ctakes.apache.org
Subject: RE: The fast dictionary pipeline vs. the regular one

Hi all,

I’m glad that there continues to be interest in the fast alternative to the dictionary lookup and I welcome all testing.

GBM actually is Glioblastoma Multiforme, hence the “M”. The WHO name is the abbreviated “Glioblastoma”, but they are not (as far as I can discern) different things. If you check the Metathesaurus 2011AB, GBM brings up both Glioblastoma C0017636 and Glioblastoma Multiforme C1621958. The first comes from MeSH and NCI, the second from CSP. If you look at the definitions they are synonymous: “malignant form of astrocytoma histologically characterized by pleomorphism of cells, nuclear atypia, microhemorrhage and necrosis; may arise in any region of the central nervous system, with a predilection for the cerebral hemispheres, basal ganglia, and commissural pathways.” Mapping to a different CUI in the UMLS does not always mean that they are truly different concepts. It often means that they came from 2 different source dictionaries (such as in this case). Also check https://en.wikipedia.org/wiki/Glioblastoma_multiforme

But I am a little confused: are you saying that you got only Glioblastoma Multiforme C1621958 and not Glioblastoma C0017636? When I run it I get both returns …

Britt is correct (thank you) in that if you change the default minimum span from 3 to 2 you will get Cutaneous Mastocytosis C1136033 within “5.5 cm”. The default minimum span is 3 (not 2) to prevent things like the obviously garbage return of Cutaneous Mastocytosis for every “cm”. However, feel free to change it to fit your purposes. 2 characters is the minimum: you cannot look up 1-character terms with the default dictionary. You can do so with a custom dictionary if you like, which might be useful if you just have 1 or 2 single-character terms.

Sean

From: britt fitch [mailto:britt.fitch@wiredinformatics.com]
Sent: Monday, June 22, 2015 9:24 AM
To: dev@ctakes.apache.org
Subject: Re: The fast dictionary pipeline vs. the regular one

Regarding the miss on “cm” in #2, you might want to check out the dictionary XML descriptor or uimaFIT wiring, depending on which you are using, for the parameter “minimumSpan”. If I recall correctly, the default minimum span is 3 characters; however, you can reduce it to 2 if desired.

Cheers,

Britt









Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

On Jun 21, 2015, at 2:45 PM, Miller, Timothy <Ti...@childrens.harvard.edu> wrote:

Sean wrote the fast version and may be able to answer your specific questions. But in general, the fast dictionary does not reproduce the regular dictionary's results exactly: it does not implement an equivalent search and it uses different indexing methods. We are happy to receive reports of what seem like bugs, though; any new software is likely to have some. What I will say is that I know Sean has run some (as yet unpublished) experiments and we believe that in the aggregate the new system output is at least as high quality as the older one.
Tim


________________________________________
From: Oranit Dror [oranit@algotec.co.il]
Sent: Sunday, June 21, 2015 4:37 AM
To: dev@ctakes.apache.org
Subject: The fast dictionary pipeline vs. the regular one

Hello,

I am using cTAKES 3.2.2 with the regular pipeline. Recently, I tested the fast dictionary pipeline, and it is indeed much faster.
However, I have encountered several quality differences in the returned annotations. For example:


1. With the fast pipeline, the term "GBM" is annotated as "glioblastoma multiforme", while in the regular pipeline it is annotated as "glioblastoma".
Note that according to the UMLS DB, the concept of "GBM" is "glioblastoma" and "glioblastoma multiforme" is mapped to a narrower concept.


2. The word "cm" in a phrase like "5.5 cm X 2.6 cm" is annotated by the regular pipeline as "Cutaneous Mastocytosis", while in the fast pipeline it is not annotated as a medical term (as expected and as in UMLS).


Any explanation for the differences?

Thank you,
Oranit.




RE: The fast dictionary pipeline vs. the regular one

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi all,

I’m glad that there continues to be interest in the fast alternative to the dictionary lookup and I welcome all testing.

GBM actually is Glioblastoma Multiforme, hence the “M”. The WHO name is the abbreviated “Glioblastoma”, but they are not (as far as I can discern) different things. If you check the Metathesaurus 2011AB, GBM brings up both Glioblastoma C0017636 and Glioblastoma Multiforme C1621958. The first comes from MeSH and NCI, the second from CSP. If you look at the definitions they are synonymous: “malignant form of astrocytoma histologically characterized by pleomorphism of cells, nuclear atypia, microhemorrhage and necrosis; may arise in any region of the central nervous system, with a predilection for the cerebral hemispheres, basal ganglia, and commissural pathways.” Mapping to a different CUI in the UMLS does not always mean that they are truly different concepts. It often means that they came from 2 different source dictionaries (such as in this case). Also check https://en.wikipedia.org/wiki/Glioblastoma_multiforme

But I am a little confused: are you saying that you got only Glioblastoma Multiforme C1621958 and not Glioblastoma C0017636? When I run it I get both returns …

Britt is correct (thank you) in that if you change the default minimum span from 3 to 2 you will get Cutaneous Mastocytosis C1136033 within “5.5 cm”. The default minimum span is 3 (not 2) to prevent things like the obviously garbage return of Cutaneous Mastocytosis for every “cm”. However, feel free to change it to fit your purposes. 2 characters is the minimum: you cannot look up 1-character terms with the default dictionary. You can do so with a custom dictionary if you like, which might be useful if you just have 1 or 2 single-character terms.

Sean
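
The effect of the minimum span described above can be sketched in a few lines. This is a toy model, not the cTAKES implementation: the one-entry dictionary, the regex tokenizer, and the annotate function with its minimum_span argument are assumptions for illustration. Only the CUI for Cutaneous Mastocytosis and the defaults of 3 and 2 come from this thread.

```python
import re

# Hypothetical one-entry dictionary; the CUI is the one quoted in the
# thread for Cutaneous Mastocytosis.
TOY_DICTIONARY = {"cm": "C1136033"}


def annotate(text, minimum_span=3):
    """Return (start, end, cui) for dictionary hits of at least minimum_span characters."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        token = match.group()
        if len(token) < minimum_span:
            # Too short to be looked up at all, so "cm" is never matched
            # when the minimum span is 3.
            continue
        cui = TOY_DICTIONARY.get(token.lower())
        if cui is not None:
            hits.append((match.start(), match.end(), cui))
    return hits
```

Calling annotate("5.5 cm X 2.6 cm") with the default of 3 returns an empty list, whereas passing minimum_span=2 returns a Cutaneous Mastocytosis hit for each "cm". In this sketch the threshold trades recall on genuine two-letter terms for fewer spurious hits on units.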





Re: The fast dictionary pipeline vs. the regular one

Posted by britt fitch <br...@wiredinformatics.com>.
Regarding the miss on “cm” in #2, you might want to check out the dictionary XML descriptor or uimaFIT wiring, depending on which you are using, for the parameter “minimumSpan”. If I recall correctly, the default minimum span is 3 characters; however, you can reduce it to 2 if desired.

Cheers,

Britt


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com



RE: The fast dictionary pipeline vs. the regular one

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
Sean wrote the fast version and may be able to answer your specific questions. But in general, the fast dictionary does not reproduce the regular dictionary's results exactly: it does not implement an equivalent search and it uses different indexing methods. We are happy to receive reports of what seem like bugs, though; any new software is likely to have some. What I will say is that I know Sean has run some (as yet unpublished) experiments and we believe that in the aggregate the new system output is at least as high quality as the older one.
Tim

