Posted to user@ctakes.apache.org by Dennis Lee Hon Kit <dl...@uvic.ca> on 2013/08/21 20:09:31 UTC
Concept annotation questions
Hi Everyone,
We are new to cTAKES, so please bear with our questions. We are using cTAKES to annotate things like encounter diagnoses and referral notes and are especially interested in the SNOMED CT encodings, but we are not sure how to make sense of all the outputs.
Example #1
In the example below, “cancer of colon, lung and liver” has been encoded with SNOMED CT, and additional concepts that do not apply have been removed (e.g., the general “cancer” concept and the lung, colon and liver structures). The results are plotted by their begin/end positions. If the terms do not align, it's probably because the email only accepts plain text and a mono-spaced font is not the default.
cancer of colon, lung and liver
cancer of colon, lung and liver 93870000|Malignant neoplasm of liver (disorder)|
cancer of colon, lung 363358000|Malignant tumor of lung (disorder)|
cancer of colon 363406005|Malignant tumor of colon (disorder)|
Question (1) – We had to do quite a bit of post-processing to remove inactive concepts, subtype concepts, concepts that are part of the defining attributes, etc. Is there a set of guidelines to help sort out the CUIs or SNOMED CT codes that have been identified?
Question (2) – How can we determine that “93870000|Malignant neoplasm of liver (disorder)|” refers to “cancer of liver” as opposed to using the begin/end string, which points to “cancer of colon, lung and liver”? Certainly we can try to do additional parsing but there are a lot of different scenarios to take into account.
Question (3) – This relates to question 2: are we able to identify the original terms that were used for the concept matching, or the exact description that was returned from the UMLS? While the CUI is helpful, a CUI can refer to tens or even hundreds of descriptions.
--------------------------------------------------------------------------------
Example #2
Switching the position of colon, lung and liver can result in different encodings. Once again, after removing additional concepts not needed (i.e., “cancer” and “colon structure”), we get the following. What happened to liver and lung cancer?
cancer of colon, liver and lung
cancer of colon 363406005|Malignant tumor of colon (disorder)|
lung 39607008|Lung structure (body structure)|
We have more questions but will start with these. Thank you in advance.
Regards,
Dennis
Re: Concept annotation questions
Posted by Dennis Lee Hon Kit <dl...@uvic.ca>.
Hi James,
Thank you very much. I will try it out this weekend and let you know how it goes. Thank you again for putting together the patch.
Regards,
Dennis
-----Original Message-----
From: Masanz, James J.
Sent: Friday, October 04, 2013 2:45 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Hi Dennis,
I put a patch at
http://people.apache.org/~james-masanz/patches/ctakes-3.1.0-patches/
Note this patch is not an official release of the ASF or the Apache cTAKES project.
This patch includes a new feature called originalTextAsDelimitedString which contains the original text, delimited by a vertical bar. This is likely not how this function will be implemented in a future release but gives you something to start with. (For discussion of future implementation, refer to [1])
For text like this, with a typo of "colons" instead of just "colon"
Cancer of lower left colons
the DiseaseDisorderMention that covers that entire string will have
originalTextAsDelimitedString ="Cancer|of|colons"
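As a rough illustration of how this value might be consumed downstream, the sketch below splits the delimited string and compares it with the covered text to list the words that did not take part in the match. It assumes the feature value is already available as a plain string, that the matched words appear in text order, and that whitespace tokenization is good enough; the function name `unmatched_words` is hypothetical and not part of cTAKES or the patch.

```python
def unmatched_words(covered_text, delimited):
    """Return (matched, unmatched) word lists.

    covered_text: the text spanned by the annotation (begin..end).
    delimited:    the vertical-bar-delimited string of matched words.
    Assumes matched words occur in the same order as in the text.
    """
    matched = delimited.split("|")
    remaining = list(matched)
    unmatched = []
    for word in covered_text.split():
        if remaining and word == remaining[0]:
            # This word was used in the dictionary match.
            remaining.pop(0)
        else:
            # This word lies inside the span but was not matched.
            unmatched.append(word)
    return matched, unmatched


matched, unmatched = unmatched_words(
    "Cancer of lower left colons", "Cancer|of|colons"
)
print(matched)    # words that matched the dictionary entry
print(unmatched)  # words inside the span that were skipped
```

For the example above, the skipped words are "lower" and "left", which is the kind of qualifier the begin/end offsets alone cannot reveal.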
To use the patch, download it, verify the signature, and extract the contents to the same directory where you have Apache cTAKES installed so that you have a new subdirectory called
ctakes-3.1.0.patch.01
After extracting the contents, your CTAKES_HOME should contain the following subdirectories:
bin
config
ctakes-3.1.0.patch.01
desc
lib
resources
Within ctakes-3.1.0.patch.01 are two .bat files showing how the classpath can be set to pick up the patch. If you use one of those .bat files, please edit the file and replace YourUmlsUserIdHere and YourUmlsPasswordHere with your UMLS user ID and password.
If you post any questions to the email list, please make sure to indicate that you are using a modified version of Apache cTAKES.
-- James
[1] http://markmail.org/message/tvwbkfxuamiiwi7s
From: user-return-302-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-302-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Monday, September 23, 2013 11:13 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James,
We use the binary release. Thank you for your help, we look forward to the update.
Regards,
Dennis
From: Masanz, James J.
Sent: Friday, September 20, 2013 12:27 PM
To: user@ctakes.apache.org
Subject: RE: Concept annotation questions
Dennis,
Now that I have the code written, one more question. Are you using the binary, or do you download source and compile yourself?
-- James
From: user-return-286-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-286-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Tuesday, September 10, 2013 12:19 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James,
Thank you for your email. We are currently using cTAKES 3.0 but will upgrade to whichever version you issue the patch for. Thank you for taking the time out of your busy schedule to work on the patch.
Regards,
Dennis
From: Masanz, James J.
Sent: Monday, September 09, 2013 7:44 AM
To: user@ctakes.apache.org
Subject: RE: Concept annotation questions
Which version of cTAKES are you using or planning to use?
cTAKES 3.1 has been approved, and once the apache.org infrastructure team completes some administrative tasks, the process of updating the Apache mirrors with 3.1 should start.
I want to target the release that will be most useful for you for this patch first.
From: user-return-267-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-267-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Friday, August 30, 2013 1:11 AM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James,
Thank you for your reply.
If you could create the patch for identifying the words used in the matching that would be great. We understand you have other priorities and will wait until you have time to do it.
Thank you for logging the issue with the incorrect chunking as well.
Regards,
Dennis
-----Original Message-----
From: Masanz, James J.
Sent: Thursday, August 29, 2013 8:38 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
I created JIRA issue CTAKES-231 for this, as the code in trunk and in the cTAKES 3.1 branch also gets the chunking wrong.
https://issues.apache.org/jira/browse/CTAKES-231
Thanks,
-- James
From: user-return-258-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-258-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Masanz, James J.
Sent: Thursday, August 29, 2013 9:19 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Hi Dennis,
Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
As for the chunking, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.
I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
-- James
From: user-return-257-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 28, 2013 2:33 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James & Pei,
Thank you for your replies and sorry for my late reply as I have been away.
Q1 – The longest span could work and is one of the options we are looking at, but it can get complicated when there are overlaps. In the following example, the longest span would work: we can start with 01, ignore 02 and 03 because their start positions fall before the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.
00 Invasive ductal carcinoma of the left breast with bone metastases.
01 Invasive ductal carcinoma of the left breast 408643008|Infiltrating duct carcinoma of breast (disorder)|
02 breast with bone 56873002|Bone structure of sternum (body structure)|
03 breast with bone metastases 94297009|Secondary malignant neoplasm of female breast (disorder)|
04 bone metastases 94222008|Secondary malignant neoplasm of bone (disorder)|
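The selection described above can be sketched in a few lines. This is only an illustration of the left-to-right, longest-first idea, not how cTAKES itself resolves overlaps; the `Span` type and `select_left_to_right` function are hypothetical names, and the character offsets were worked out by hand from the example sentence.

```python
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # offset of the first character
    end: int    # offset one past the last character
    code: str   # SNOMED CT concept id


def select_left_to_right(spans):
    """Keep the longest span at each position and skip any span
    that starts before the previously kept span has ended."""
    kept = []
    last_end = 0
    # Earlier start first; for equal starts, longer span first.
    for s in sorted(spans, key=lambda s: (s.begin, -(s.end - s.begin))):
        if s.begin >= last_end:
            kept.append(s)
            last_end = s.end
    return kept


text = "Invasive ductal carcinoma of the left breast with bone metastases."
spans = [
    Span(0, 44, "408643008"),   # 01 Invasive ductal carcinoma of the left breast
    Span(38, 54, "56873002"),   # 02 breast with bone
    Span(38, 65, "94297009"),   # 03 breast with bone metastases
    Span(50, 65, "94222008"),   # 04 bone metastases
]

selected = select_left_to_right(spans)
for s in selected:
    print(s.code, repr(text[s.begin:s.end]))
```

On this input the sketch keeps 01 and 04 and drops 02 and 03. It only looks at offsets, so it inherits the limitation raised above: it cannot tell which words inside a span were actually matched.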
Q2 – As beginners, we are not yet comfortable modifying cTAKES, or even sure where to begin, but that could be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions across all languages. Examples of English terms include:
• CA - Liver cancer
• Cancer of Liver
• cancer of the liver
• Cancer, Hepatic
• CANCER, HEPATOCELLULAR
• Malignant hepatic neoplasm
• Malignant liver tumor
• Malignant liver tumour
• Malignant neoplasm of liver
• malignant neoplasm of liver (diagnosis)
• Malignant neoplasm of liver unspecified
• Malignant neoplasm of liver unspecified (disorder)
• Malignant neoplasm of liver, not specified as primary or secondary
• Malignant neoplasm of liver, NOS
• Malignant neoplasm of liver, unspecified
• malignant neosplasm of the liver
• Malignant tumor of liver
• Malignant tumor of liver (disorder)
• Malignant tumour of liver
It would be suboptimal to go through each of the descriptions to work out which UMLS term was used in the coding. It is important for us to know which part of the string was matched. For example, “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched so that we can post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When other qualifiers such as severity, chronicity and episodicity are ignored during matching, we would like to capture them at the level of granularity specified in the original text.
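To make the post-coordination step concrete, here is a minimal sketch that assembles the expression quoted above from its three parts. The helper names `concept` and `refine` are hypothetical, the concept ids and terms are the ones given in this message, and the sketch handles only a single attribute=value refinement; deciding that the unmatched word “left” should map to the left breast structure is assumed to happen elsewhere.

```python
def concept(sctid, term):
    """Format a concept reference as id|term|."""
    return f"{sctid}|{term}|"


def refine(focus, attribute, value):
    """Attach one attribute=value refinement to a focus concept."""
    return f"{focus}:{attribute}={value}"


expression = refine(
    concept("408643008", "Infiltrating duct carcinoma of breast (disorder)"),
    concept("363698007", "Finding site (attribute)"),
    concept("80248007", "Left breast structure (body structure)"),
)
print(expression)
```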
In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
• NP: cancer of colon, lung and liver
• PP: of
• NP: colon, lung and liver
For “cancer of colon, liver and lung” here is what I see:
• NP: cancer of colon,
• PP: of
• NP: colon
• O: liver
• O: and
• NP: lung
Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
Regards,
Dennis
From: Chen, Pei
Sent: Thursday, August 22, 2013 12:27 PM
To: user@ctakes.apache.org
Subject: RE: Concept annotation questions
Also,
> 3)… or the exact description that was returned in the UMLS?
I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request; see: https://issues.apache.org/jira/browse/CTAKES-224
--Pei
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Thursday, August 22, 2013 3:24 PM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Welcome to the cTAKES community.
Q1 – some people use the longest span.
Q2 & Q3 – can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
For “cancer of colon, liver and lung”, can you look at the chunk tag for liver? If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
-- James
Re: Concept annotation questions
Posted by Dennis Lee Hon Kit <dl...@uvic.ca>.
Hi James,
We use the binary release. Thank you for your help, we look forward to the update.
Regards,
Dennis
From: Masanz, James J.
Sent: Friday, September 20, 2013 12:27 PM
To: mailto:user@ctakes.apache.org
Subject: RE: Concept annotation questions
Dennis,
Now that I have the code written, one more question. Are you using the binary, or do you download source and compile yourself?
-- James
From: user-return-286-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-286-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Tuesday, September 10, 2013 12:19 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James,
Thank you for your email. We are currently using cTakes 3.0 but will upgrade to which ever version you issue the patch for. Thank you for taking the time out of your busy schedule to work on the patch.
Regards,
Dennis
From: Masanz, James J.
Sent: Monday, September 09, 2013 7:44 AM
To: mailto:user@ctakes.apache.org
Subject: RE: Concept annotation questions
Which version of cTAKES are you using or planning to use.
cTAKES 3.1 has been approved and once the apache.org infrastructure team does some administrative-like tasks the process of having the apache mirrors updated with 3.1 should start.
I want to target the release that will be most useful for you for this patch first.
From: user-return-267-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-267-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Friday, August 30, 2013 1:11 AM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James,
Thank you for your reply.
If you could create the patch for identifying the words used in the matching that would be great. We understand you have other priorities and will wait until you have time to do it.
Thank you for logging the issue with the incorrect chunking as well.
Regards,
Dennis
-----Original Message-----
From: Masanz, James J.
Sent: Thursday, August 29, 2013 8:38 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
I created JIRA issue CTAKES-231 for this as the code in trunk and in the cTAKES 3.1 branch also get the chunking wrong.
https://issues.apache.org/jira/browse/CTAKES-231
Thanks,
-- James
From: user-return-258-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-258-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Masanz, James J.
Sent: Thursday, August 29, 2013 9:19 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Hi Dennis,
Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
As far as the chunking, the fact “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”
I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
-- James
From: user-return-257-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 28, 2013 2:33 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James & Pei,
Thank you for your replies and sorry for my late reply as I have been away.
Q1 – The longest span could work and is one of the options we are looking at, but when there are overlaps it can get complicated. In the following example, the longest span would work. We can start with 01, ignore 02 and 03 because their start positions fall before the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.
00 Invasive ductal carcinoma of the left breast with bone metastases.
01 Invasive ductal carcinoma of the left breast 408643008|Infiltrating duct carcinoma of breast (disorder)|
02                                       breast with bone 56873002|Bone structure of sternum (body structure)|
03                                       breast with bone metastases 94297009|Secondary malignant neoplasm of female breast (disorder)|
04                                                   bone metastases 94222008|Secondary malignant neoplasm of bone (disorder)|
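The "start with 01, skip 02 and 03, continue with 04" procedure described above can be sketched in a few lines of Python. This is only an illustration of a greedy longest-span filter, not anything cTAKES provides: the `Annotation` tuple, `keep_longest_spans`, and the `span` helper are hypothetical names, and the offsets are computed from the example sentence rather than taken from real cTAKES output.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    begin: int  # offset of the first character of the span
    end: int    # offset just past the last character of the span
    code: str   # SNOMED CT concept id attached to the span


def keep_longest_spans(annotations: List[Annotation]) -> List[Annotation]:
    """Greedy left-to-right filter: earliest start first, longest span on ties."""
    ordered = sorted(annotations, key=lambda a: (a.begin, -(a.end - a.begin)))
    kept: List[Annotation] = []
    for ann in ordered:
        # Keep a span only if it starts at or after the end of the last kept span.
        if not kept or ann.begin >= kept[-1].end:
            kept.append(ann)
    return kept


text = "Invasive ductal carcinoma of the left breast with bone metastases."


def span(phrase: str, code: str) -> Annotation:
    # Hypothetical helper: derive begin/end offsets from the example sentence.
    begin = text.index(phrase)
    return Annotation(begin, begin + len(phrase), code)


annotations = [
    span("Invasive ductal carcinoma of the left breast", "408643008"),
    span("breast with bone", "56873002"),
    span("breast with bone metastases", "94297009"),
    span("bone metastases", "94222008"),
]

kept_codes = [a.code for a in keep_longest_spans(annotations)]
print(kept_codes)
```

On this sentence the filter keeps lines 01 and 04. A greedy pass like this always favours the leftmost span, so it can still pick the wrong concept when a shorter, later span is the better reading; it addresses the overlap bookkeeping only, not the question of which words were actually matched.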
Q2 – As we are beginners, we are not at the level where we are comfortable with modifying cTakes or even know where to begin modifying cTakes but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages. Examples of English terms include:
• CA - Liver cancer
• Cancer of Liver
• cancer of the liver
• Cancer, Hepatic
• CANCER, HEPATOCELLULAR
• Malignant hepatic neoplasm
• Malignant liver tumor
• Malignant liver tumour
• Malignant neoplasm of liver
• malignant neoplasm of liver (diagnosis)
• Malignant neoplasm of liver unspecified
• Malignant neoplasm of liver unspecified (disorder)
• Malignant neoplasm of liver, not specified as primary or secondary
• Malignant neoplasm of liver, NOS
• Malignant neoplasm of liver, unspecified
• malignant neosplasm of the liver
• Malignant tumor of liver
• Malignant tumor of liver (disorder)
• Malignant tumour of liver
It would seem suboptimal to go through each of the descriptions to try to determine which UMLS term was used in the coding. It is important for us to know which part of the string was matched, because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched so that we can post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers, such as severity, chronicity and episodicity, that may be ignored when matching, we would like to capture them at the level of granularity specified in the original text.
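To make the post-coordination step concrete, here is a minimal Python sketch that assembles the expression quoted above from a focus concept and attribute/value refinements. The helper names `concept` and `post_coordinate` are made up for illustration; the sketch only formats strings in the `focus:attribute=value` shape used in this message and does not validate the result against SNOMED CT.

```python
from typing import Dict


def concept(code: str, term: str) -> str:
    """Render a concept reference as code|term|, as written in this thread."""
    return f"{code}|{term}|"


def post_coordinate(focus: str, refinements: Dict[str, str]) -> str:
    """Join a focus concept with attribute=value refinements."""
    if not refinements:
        return focus
    # Multiple refinements are separated by commas after the colon.
    pairs = ",".join(f"{attr}={value}" for attr, value in refinements.items())
    return f"{focus}:{pairs}"


expression = post_coordinate(
    concept("408643008", "Infiltrating duct carcinoma of breast (disorder)"),
    {
        concept("363698007", "Finding site (attribute)"):
            concept("80248007", "Left breast structure (body structure)"),
    },
)
print(expression)
```

The hard part remains knowing that “left” was left unmatched in the first place; once that is known, building the refined expression is mechanical.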
In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
• NP: cancer of colon, lung and liver
• PP: of
• NP: colon, lung and liver
For “cancer of colon, liver and lung” here is what I see:
• NP: cancer of colon,
• PP: of
• NP: colon
• O: liver
• O: and
• NP: lung
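The effect of the two chunkings can be illustrated with a small Python sketch. It assumes, as a simplification, that dictionary lookup only considers words that fall inside a single noun-phrase (NP) chunk and ignores word order; the real cTAKES lookup component is configurable and more involved, and `lookup_windows` and `can_match` are hypothetical names.

```python
import re
from typing import List, Tuple

Chunk = Tuple[str, str]  # (chunk tag, chunk text)


def tokens(text: str) -> List[str]:
    """Lower-case alphabetic words, punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())


def lookup_windows(chunks: List[Chunk]) -> List[List[str]]:
    # Assumption: only NP chunks serve as lookup windows.
    return [tokens(text) for tag, text in chunks if tag == "NP"]


def can_match(term_words: List[str], chunks: List[Chunk]) -> bool:
    """True if every word of the term occurs inside one lookup window."""
    return any(all(w in window for w in term_words)
               for window in lookup_windows(chunks))


lung_and_liver = [
    ("NP", "cancer of colon, lung and liver"),
    ("PP", "of"),
    ("NP", "colon, lung and liver"),
]
liver_and_lung = [
    ("NP", "cancer of colon,"),
    ("PP", "of"),
    ("NP", "colon"),
    ("O", "liver"),
    ("O", "and"),
    ("NP", "lung"),
]

print(can_match(["cancer", "liver"], lung_and_liver))  # words share an NP
print(can_match(["cancer", "liver"], liver_and_lung))  # "liver" is an O-chunk
print(can_match(["lung"], liver_and_lung))             # "lung" alone is an NP
```

Under that assumption, “cancer” and “liver” share a window only in the first chunking, which is consistent with Example #2 returning colon cancer and the bare lung structure but no liver or lung cancer.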
Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
Regards,
Dennis
From: Chen, Pei
Sent: Thursday, August 22, 2013 12:27 PM
To: user@ctakes.apache.org
Subject: RE: Concept annotation questions
Also,
> 3)… or the exact description that was returned in the UMLS?
I presume you mean saving the preferred name from UMLS? If so, this seems to be a common request; see: https://issues.apache.org/jira/browse/CTAKES-224
--Pei
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Thursday, August 22, 2013 3:24 PM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Welcome to the cTAKES community.
Q1 – some people use the longest span.
Q2 & Q3 – Can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
For “cancer of colon, liver and lung”, can you look at the chunk tag for liver? If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
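Until cTAKES itself reports the matched words, one rough workaround is to test each description of the CUI against the annotated span and keep those whose words all occur in it. The Python sketch below does that with exact, case-insensitive word comparison; `candidate_terms` is a hypothetical helper, the description list is a small hand-picked subset of the terms for C0345904 quoted elsewhere in this thread, and any word normalization cTAKES applies before lookup is deliberately ignored.

```python
import re
from typing import List


def words(text: str) -> List[str]:
    """Lower-case alphabetic words, punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(span_text: str, descriptions: List[str]) -> List[str]:
    """Descriptions whose words all occur somewhere in the annotated span."""
    span_words = set(words(span_text))
    return [d for d in descriptions if set(words(d)) <= span_words]


# A few of the English descriptions listed for CUI C0345904 in this thread.
descriptions = [
    "CA - Liver cancer",
    "Cancer of Liver",
    "cancer of the liver",
    "Cancer, Hepatic",
    "Malignant neoplasm of liver",
    "Malignant tumor of liver",
]

matches = candidate_terms("cancer of colon, lung and liver", descriptions)
print(matches)
```

For the span “cancer of colon, lung and liver” only “Cancer of Liver” survives, which matches the guess above. Several descriptions can survive for other spans, and variants that differ only by inflection or spelling will be missed, so this narrows the candidates rather than identifying the term that was actually used.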
-- James
RE: Concept annotation questions
Posted by "Masanz, James J." <Ma...@mayo.edu>.
Dennis,
Now that I have the code written, one more question. Are you using the binary, or do you download source and compile yourself?
-- James
From: user-return-286-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-286-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Tuesday, September 10, 2013 12:19 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James,
Thank you for your email. We are currently using cTakes 3.0 but will upgrade to whichever version you issue the patch for. Thank you for taking the time out of your busy schedule to work on the patch.
Regards,
Dennis
From: Masanz, James J.<ma...@mayo.edu>
Sent: Monday, September 09, 2013 7:44 AM
To: mailto:user@ctakes.apache.org
Subject: RE: Concept annotation questions
Which version of cTAKES are you using or planning to use?
cTAKES 3.1 has been approved, and once the apache.org infrastructure team completes some administrative tasks, the process of updating the Apache mirrors with 3.1 should start.
I want to target the release that will be most useful for you for this patch first.
RE: Concept annotation questions
Posted by "Masanz, James J." <Ma...@mayo.edu>.
Hi Dennis,
I will create it for 3.1 then.
3.1 has been approved and will be released soon; we are just waiting on some administrative/infrastructure work.
-- James
Re: Concept annotation questions
Posted by Dennis Lee Hon Kit <dl...@uvic.ca>.
Hi James,
Thank you for your email. We are currently using cTakes 3.0 but will upgrade to whichever version you issue the patch for. Thank you for taking the time out of your busy schedule to work on the patch.
Regards,
Dennis
From: Masanz, James J.
Sent: Monday, September 09, 2013 7:44 AM
To: mailto:user@ctakes.apache.org
Subject: RE: Concept annotation questions
Which version of cTAKES are you using or planning to use.
cTAKES 3.1 has been approved and once the apache.org infrastructure team does some administrative-like tasks the process of having the apache mirrors updated with 3.1 should start.
I want to target the release that will be most useful for you for this patch first.
From: user-return-267-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-267-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Friday, August 30, 2013 1:11 AM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James,
Thank you for your reply.
If you could create the patch for identifying the words used in the matching that would be great. We understand you have other priorities and will wait until you have time to do it.
Thank you for logging the issue with the incorrect chunking as well.
Regards,
Dennis
-----Original Message-----
From: Masanz, James J.
Sent: Thursday, August 29, 2013 8:38 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
I created JIRA issue CTAKES-231 for this as the code in trunk and in the cTAKES 3.1 branch also get the chunking wrong.
https://issues.apache.org/jira/browse/CTAKES-231
Thanks,
-- James
From: user-return-258-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-258-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Masanz, James J.
Sent: Thursday, August 29, 2013 9:19 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Hi Dennis,
Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
As for the chunking, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.
I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
-- James
From: user-return-257-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 28, 2013 2:33 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James & Pei,
Thank you for your replies and sorry for my late reply as I have been away.
Q1 – The longest span could work and is one of the options we are looking at, but when there are overlaps it can get complicated. In the following example, the longest span would work. We can start with 01, ignore 02 and 03 because their start positions fall before the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.
00 Invasive ductal carcinoma of the left breast with bone metastases.
01 Invasive ductal carcinoma of the left breast 408643008|Infiltrating duct carcinoma of breast (disorder)|
02 breast with bone 56873002|Bone structure of sternum (body structure)|
03 breast with bone metastases 94297009|Secondary malignant neoplasm of female breast (disorder)|
04 bone metastases 94222008|Secondary malignant neoplasm of bone (disorder)|
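The selection rule described for this example could be sketched as follows. This is a minimal Python illustration under stated assumptions, not cTAKES code: the `Annotation` tuple and the `span_of` and `keep_longest_spans` helpers are hypothetical names, and the offsets are computed from the sentence text rather than taken from real cTAKES output.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    begin: int  # character offset of the first matched word
    end: int    # character offset just past the last matched word
    code: str   # SNOMED CT concept id


def span_of(text: str, phrase: str, code: str) -> Annotation:
    """Build an annotation from the first occurrence of phrase in text."""
    begin = text.index(phrase)
    return Annotation(begin, begin + len(phrase), code)


def keep_longest_spans(annotations: List[Annotation]) -> List[Annotation]:
    """Greedy left-to-right filter: at each position prefer the longest span,
    then skip every span that starts before the kept span ends."""
    ordered = sorted(annotations, key=lambda a: (a.begin, -(a.end - a.begin)))
    kept: List[Annotation] = []
    last_end = -1
    for ann in ordered:
        if ann.begin >= last_end:
            kept.append(ann)
            last_end = ann.end
    return kept


sentence = "Invasive ductal carcinoma of the left breast with bone metastases."
annotations = [
    span_of(sentence, "Invasive ductal carcinoma of the left breast", "408643008"),
    span_of(sentence, "breast with bone", "56873002"),
    span_of(sentence, "breast with bone metastases", "94297009"),
    span_of(sentence, "bone metastases", "94222008"),
]
for ann in keep_longest_spans(annotations):
    print(ann.code, repr(sentence[ann.begin:ann.end]))
```

The same rule would not help with “cancer of colon, lung and liver”: all three spans there begin at the same offset, so only the longest one (the liver concept) would survive.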
Q2 – As we are beginners, we are not at the level where we are comfortable with modifying cTakes or even know where to begin modifying cTakes but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages. Examples of English terms include:
• CA - Liver cancer
• Cancer of Liver
• cancer of the liver
• Cancer, Hepatic
• CANCER, HEPATOCELLULAR
• Malignant hepatic neoplasm
• Malignant liver tumor
• Malignant liver tumour
• Malignant neoplasm of liver
• malignant neoplasm of liver (diagnosis)
• Malignant neoplasm of liver unspecified
• Malignant neoplasm of liver unspecified (disorder)
• Malignant neoplasm of liver, not specified as primary or secondary
• Malignant neoplasm of liver, NOS
• Malignant neoplasm of liver, unspecified
• malignant neosplasm of the liver
• Malignant tumor of liver
• Malignant tumor of liver (disorder)
• Malignant tumour of liver
It would seem suboptimal to go through each of the descriptions to try and determine which was the UMLS term that was used in the coding. It is important for us to know which part of the string is matched because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched and would like to post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers like severity, chronicity and episodicity that may be ignored when matching, we would like to capture it at the level of granularity specified in the original text.
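Once the unmatched qualifier is known, assembling a post-coordinated expression like the one above is mostly string composition. The Python sketch below is only an illustration: `concept` and `post_coordinate` are hypothetical helper names, and joining several refinements with commas is an assumption based on the SNOMED CT compositional grammar rather than something stated in this thread.

```python
from typing import List, Tuple


def concept(code: str, term: str) -> str:
    """Render a concept as code|term|, the notation used in this thread."""
    return f"{code}|{term}|"


def post_coordinate(focus: str, refinements: List[Tuple[str, str]]) -> str:
    """Append attribute=value refinements to a focus concept.
    Multiple refinements are comma-separated (assumed grammar)."""
    if not refinements:
        return focus
    parts = [f"{attribute}={value}" for attribute, value in refinements]
    return focus + ":" + ",".join(parts)


expression = post_coordinate(
    concept("408643008", "Infiltrating duct carcinoma of breast (disorder)"),
    [(
        concept("363698007", "Finding site (attribute)"),
        concept("80248007", "Left breast structure (body structure)"),
    )],
)
print(expression)
```

Deciding which attribute and value a word such as “left” should map to is the hard part and is not covered here.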
In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
• NP: cancer of colon, lung and liver
• PP: of
• NP: colon, lung and liver
For “cancer of colon, liver and lung” here is what I see:
• NP: cancer of colon,
• PP: of
• NP: colon
• O: liver
• O: and
• NP: lung
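One way to picture the effect of the two chunkings is the toy model below. It is a deliberately simplified Python sketch, not how cTAKES actually performs dictionary lookup: it assumes that lookup only considers words that fall inside the same noun-phrase window, and the four-entry `TOY_DICTIONARY` and the `lookup` function are made up for illustration.

```python
import re
from typing import Dict, FrozenSet, List

# Hypothetical miniature dictionary: set of required words -> concept.
TOY_DICTIONARY: Dict[FrozenSet[str], str] = {
    frozenset({"cancer", "colon"}): "363406005|Malignant tumor of colon (disorder)|",
    frozenset({"cancer", "lung"}): "363358000|Malignant tumor of lung (disorder)|",
    frozenset({"cancer", "liver"}): "93870000|Malignant neoplasm of liver (disorder)|",
    frozenset({"lung"}): "39607008|Lung structure (body structure)|",
}


def lookup(windows: List[str]) -> List[str]:
    """Return every concept whose words all occur inside a single window."""
    found: List[str] = []
    for window in windows:
        words = set(re.findall(r"[a-z]+", window.lower()))
        for term_words, concept in TOY_DICTIONARY.items():
            if term_words <= words and concept not in found:
                found.append(concept)
    return found


# First sentence: one NP covers the whole phrase.
print(lookup(["cancer of colon, lung and liver"]))
# Second sentence: "liver" and "and" are O-chunks, so they sit in no window.
print(lookup(["cancer of colon,", "lung"]))
```

Under this assumption, “cancer” and “lung” end up in different windows in the second sentence, so only the colon cancer and lung structure concepts are found.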
Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
Regards,
Dennis
From: Chen, Pei
Sent: Thursday, August 22, 2013 12:27 PM
To: user@ctakes.apache.org
Subject: RE: Concept annotation questions
Also,
> 3)… or the exact description that was returned in the UMLS?
I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request; see: https://issues.apache.org/jira/browse/CTAKES-224
--Pei
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Thursday, August 22, 2013 3:24 PM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Welcome to the cTAKES community.
Q1 – some people use the longest span.
Q2 & Q3 – Can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
For “cancer of colon, liver and lung”, can you look at the chunk tag for liver? If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
-- James
From: user-return-248-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 21, 2013 1:10 PM
To: user@ctakes.apache.org
Subject: Concept annotation questions
Hi Everyone,
We are new to cTakes so please bear with our questions. We are using cTakes to annotate things like encounter diagnoses and referral notes and are especially interested in the SNOMED CT encodings. But we are not sure how to make sense of all the outputs.
Example #1
In the example below, “cancer of colon, lung and liver” has been encoded with SNOMED CT and additional concepts that do not apply have been removed (e.g., general “cancer” concept, lung, colon and liver structures, etc.). They have been plotted out by the begin/end positions. If the terms do not align, it’s probably because the email only accepts plain text and a mono-spaced font is not the default.
cancer of colon, lung and liver
cancer of colon, lung and liver 93870000|Malignant neoplasm of liver (disorder)|
cancer of colon, lung 363358000|Malignant tumor of lung (disorder)|
cancer of colon 363406005|Malignant tumor of colon (disorder)|
Question (1) – We had to do quite a bit of post-processing to remove inactive concepts, subtype concepts, concepts that are part of the defining attributes, etc. Is there a set of guidelines to help sort out the CUI or SNOMED CT codes that have been identified?
Question (2) – How can we determine that “93870000|Malignant neoplasm of liver (disorder)|” refers to “cancer of liver” as opposed to using the begin/end string, which points to “cancer of colon, lung and liver”? Certainly we can try to do additional parsing but there are a lot of different scenarios to take into account.
Question (3) – This relates to question 2, are we able to identify the original terms that were used for the concept matching or the exact description that was returned in the UMLS? While the CUI is helpful, the CUI can refer to tens or even hundreds of descriptions.
________________________________________
Example #2
Switching the position of colon, lung and liver can result in different encodings. Once again, after removing additional concepts not needed (i.e., “cancer” and “colon structure”), we get the following. What happened to liver and lung cancer?
cancer of colon, liver and lung
cancer of colon 363406005|Malignant tumor of colon (disorder)|
lung 39607008|Lung structure (body structure)|
We have more questions but will start with these. Thank you in advance.
Regards,
Dennis
RE: Concept annotation questions
Posted by "Masanz, James J." <Ma...@mayo.edu>.
Which version of cTAKES are you using or planning to use?
cTAKES 3.1 has been approved, and once the apache.org infrastructure team completes some administrative tasks, the Apache mirrors should start being updated with 3.1.
I want to target the release that will be most useful for you for this patch first.
Re: Concept annotation questions
Posted by Dennis Lee Hon Kit <dl...@uvic.ca>.
Hi James,
Thank you for your reply.
If you could create the patch for identifying the words used in the matching that would be great. We understand you have other priorities and will wait until you have time to do it.
Thank you for logging the issue with the incorrect chunking as well.
Regards,
Dennis
RE: Concept annotation questions
Posted by "Masanz, James J." <Ma...@mayo.edu>.
I created JIRA issue CTAKES-231 for this, as the code in trunk and in the cTAKES 3.1 branch also gets the chunking wrong.
https://issues.apache.org/jira/browse/CTAKES-231
Thanks,
-- James
From: user-return-258-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-258-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Masanz, James J.
Sent: Thursday, August 29, 2013 9:19 AM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Hi Dennis,
Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
As far as the chunking, the fact “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”
I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
-- James
From: user-return-257-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 28, 2013 2:33 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James & Pei,
Thank you for your replies and sorry for my late reply as I have been away.
Q1 – The longest span could work and is one of the options we are looking at but when there are overlaps it can get complicated. In the following example, the longest would work. We can take start with 01, and ignore 02 and 03 because their start positions overlap the end position of 01, and then continue with 04. But I don’t think it will always be this straight forward as the being/end string positions may not always be a good indicator of what exactly in the original text was coded.
00 Invasive ductal carcinoma of the left breast with bone metastases.
01 Invasive ductal carcinoma of the left breast 408643008|Infiltrating duct carcinoma of breast (disorder)|
02 breast with bone 56873002|Bone structure of sternum (body structure)|
03 breast with bone metastases 94297009|Secondary malignant neoplasm of female breast (disorder)|
04 bone metastases 94222008|Secondary malignant neoplasm of bone (disorder)|
Q2 – As we are beginners, we are not at the level where we are comfortable with modifying cTakes or even know where to begin modifying cTakes but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages. Examples of English terms include:
• CA - Liver cancer
• Cancer of Liver
• cancer of the liver
• Cancer, Hepatic
• CANCER, HEPATOCELLULAR
• Malignant hepatic neoplasm
• Malignant liver tumor
• Malignant liver tumour
• Malignant neoplasm of liver
• malignant neoplasm of liver (diagnosis)
• Malignant neoplasm of liver unspecified
• Malignant neoplasm of liver unspecified (disorder)
• Malignant neoplasm of liver, not specified as primary or secondary
• Malignant neoplasm of liver, NOS
• Malignant neoplasm of liver, unspecified
• malignant neosplasm of the liver
• Malignant tumor of liver
• Malignant tumor of liver (disorder)
• Malignant tumour of liver
It would seem suboptimal to go through each of the descriptions to try and determine which was the UMLS term that was used in the coding. It is important for us to know which part of the string is matched because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched and would like to post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers like severity, chronicity and episodicity that may be ignored when matching, we would like to capture it at the level of granularity specified in the original text.
In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
• NP: cancer of colon, lung and liver
• PP: of
• NP: colon, lung and liver
For “cancer of colon, liver and lung” here is what I see:
• NP: cancer of colon,
• PP: of
• NP: colon
• O: liver
• O: and
• NP: lung
Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
Regards,
Dennis
From: Chen, Pei
Sent: Thursday, August 22, 2013 12:27 PM
To: user@ctakes.apache.org
Subject: RE: Concept annotation questions
Also,
> 3)… or the exact description that was returned in the UMLS?
I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request; see: https://issues.apache.org/jira/browse/CTAKES-224
--Pei
From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Thursday, August 22, 2013 3:24 PM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Welcome to the cTAKES community.
Q1 – some people use the longest span.
Q2 & Q3 – can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
For “cancer of colon, liver and lung”, can you look at the chunk tag for liver? If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
-- James
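For Q1, the longest-span idea can be sketched as a greedy filter: sort the coded spans by length and keep a span only if it does not overlap one that has already been kept. This is a rough illustration in plain Java, not cTAKES code. The LongestSpanFilter and Span names are hypothetical, and offsets are assumed to be begin-inclusive, end-exclusive character positions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of "keep the longest span"; not part of cTAKES. */
final class LongestSpanFilter {

    /** A coded span: begin inclusive, end exclusive, plus the code it was mapped to. */
    static final class Span {
        final int begin;
        final int end;
        final String code;

        Span(int begin, int end, String code) {
            this.begin = begin;
            this.end = end;
            this.code = code;
        }

        int length() {
            return end - begin;
        }

        boolean overlaps(Span other) {
            return begin < other.end && other.begin < end;
        }
    }

    /** Keeps the longest spans first, dropping any span that overlaps a kept one. */
    static List<Span> keepLongest(List<Span> spans) {
        List<Span> byLength = new ArrayList<>(spans);
        // Longest first; ties are broken by the earlier begin offset.
        byLength.sort(Comparator.comparingInt(Span::length).reversed()
                .thenComparingInt((Span s) -> s.begin));

        List<Span> kept = new ArrayList<>();
        for (Span candidate : byLength) {
            boolean clash = false;
            for (Span k : kept) {
                if (candidate.overlaps(k)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                kept.add(candidate);
            }
        }
        // Return the survivors in document order.
        kept.sort(Comparator.comparingInt((Span s) -> s.begin));
        return kept;
    }
}
```

A filter like this only uses begin/end offsets, so it cannot tell which words inside a span were actually matched; that limitation is what Q2 and Q3 are about.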
Re: Concept annotation questions and keep JCas results in a file
Posted by samir chabou <sa...@yahoo.com>.
Many thanks, Pei, that is helpful to know.
Samir
Re: Concept annotation questions and keep JCas results in a file
Posted by Pei Chen <ch...@apache.org>.
Samir,
xcas will eventually be deprecated/replaced with the preferred/more compact xmi format:
/*
 *******************************************************************************************
 * N O T E : The XML format (XCAS) that this Cas Consumer outputs, is eventually
 * being superceeded by the more standardized and compact XMI format. However
 * it is used currently as the expected form for remote services, and there is
 * existing tooling for doing stand-alone component development and debugging
 * that uses this format to populate an initial CAS. So it is not
 * deprecated yet; it is also being kept for compatibility with older versions.
 *
 * New code should consider using the XmiWriterCasConsumer where possible,
 * which uses the current XMI format for XML externalizations of the CAS
 *******************************************************************************************
 */
On Fri, Sep 6, 2013 at 11:34 PM, samir chabou <sa...@yahoo.com> wrote:
> Hi Richard,
> I had a look at these methods and they will let me implement my requirement. Is there a reason to prefer readXCas/writeXCas over readXmi/writeXmi, or is it just a matter of supporting different file formats?
> Thanks
> Samir
>
>
> ------------------------------
> *From:* Richard Eckart de Castilho <re...@apache.org>
> *To:* user@ctakes.apache.org; samir chabou <sa...@yahoo.com>
> *Sent:* Friday, September 6, 2013 3:29:19 AM
> *Subject:* Re: Concept annotation questions and keep JCas results in a file
>
> Hi,
>
> you might want to take a look at convenience methods in the recently
> released Apache uimaFIT 2.0.0:
>
> CasIOUtil
> readXCas(JCas, File)
> readXmi(JCas, File)
> writeXCas(JCas, File)
> writeXmi(JCas, File)
>
> Cheers,
>
> -- Richard
>
> On 06.09.2013, at 06:28, samir chabou <sa...@yahoo.com> wrote:
>
> > Hi Tim, Pei and James
> > 1) I tried List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) and it meets my requirement perfectly. Thanks, Tim.
> > 2) Now I need to run a medical question through the clinical pipeline and keep the resulting JCas in a file, or persist it some other way, because I need it later in my processing. Is it possible to save the JCas and load it again later?
> >
> > Thanks
> > Samir
> > From: samir chabou <sa...@yahoo.com>
> > To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
> > Sent: Thursday, August 29, 2013 2:51:12 PM
> > Subject: Re: Concept annotation questions
> >
> > Thanks Tim,
> > That looks like a better and cleaner way. It means that List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) will give me the intersection between the BaseTokens and the IdentifiedAnnotations: if my base token is in the list, then the base token is also an IdentifiedAnnotation. I'll give it a try some time next week and let you know.
> > Thanks
> > Samir
> >
> >
> > From: Tim Miller <ti...@childrens.harvard.edu>
> > To: user@ctakes.apache.org
> > Sent: Thursday, August 29, 2013 1:07:58 PM
> > Subject: Re: Concept annotation questions
> >
> > Samir,
> > You may be able to use the JCasUtil class from uimaFIT to do something like the following:
> >
> > for each IdentifiedAnnotation i:
> >     List l = JCasUtil.selectCovered(jcas, BaseToken.class, i)
> >
> >
> > (This is Java-ish pseudocode, obviously.) The tokens in the list you get back will then all have the same type as the IdentifiedAnnotation i. Would that solve your problem?
> > Tim
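The idea behind Tim's suggestion is offset containment: a token belongs to an annotation when its begin and end offsets fall inside the annotation's span, so each covered token can inherit the annotation's type. The following plain-Java sketch shows that idea only. It does not use or reproduce uimaFIT; CoveredTokens, Span and coveredBy are hypothetical names, and in a real pipeline JCasUtil.selectCovered does this lookup against the CAS.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of covered-token lookup; not the uimaFIT implementation. */
final class CoveredTokens {

    /** A labelled span: begin inclusive, end exclusive. */
    static final class Span {
        final int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Returns the tokens whose offsets lie entirely inside the annotation. */
    static List<Span> coveredBy(List<Span> tokens, Span annotation) {
        List<Span> covered = new ArrayList<>();
        for (Span token : tokens) {
            if (token.begin >= annotation.begin && token.end <= annotation.end) {
                covered.add(token);
            }
        }
        return covered;
    }
}
```

Each token returned for an annotation can then be tagged with that annotation's label, which avoids comparing covered text strings token by token.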
> >
> > On 08/29/2013 12:29 PM, samir chabou wrote:
> >> Hi James and Pei,
> >> I also need to know the medical type (symptom, drug, procedure, relation) of a given word token. In the type system hierarchy, WordToken is not in the same inheritance tree as IdentifiedAnnotation, so I am currently iterating over all WordTokens and comparing each WordToken's covered text with the covered text of each IdentifiedAnnotation. I find this a lengthy process. James, do you think the patch you are planning (“I could create a patch for you that would help with determining which words from the text matched a dictionary entry”) would also cover this requirement? Or can you suggest something better than what I am currently doing?
> >>
> >> Thanks
> >> Samir
> >>
> >> From: "Masanz, James J." <Ma...@mayo.edu>
> >> To: "'user@ctakes.apache.org'" <us...@ctakes.apache.org>
> >> Sent: Thursday, August 29, 2013 10:18:40 AM
> >> Subject: RE: Concept annotation questions
> >>
> >> Hi Dennis,
> >>
> >> Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
> >>
> >> As far as the chunking, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.
> >>
> >> I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
> >>
> >> -- James
> >>
Re: Concept annotation questions and keep JCas results in a file
Posted by samir chabou <sa...@yahoo.com>.
Hi Richard,
I had a look at these methods; they should allow me to implement my requirement. Do you know whether there is a reason to prefer readXCas/writeXCas over readXmi/writeXmi, or is it just a matter of being able to read from and write to different file formats?
Thanks
Samir
________________________________
From: Richard Eckart de Castilho <re...@apache.org>
To: user@ctakes.apache.org; samir chabou <sa...@yahoo.com>
Sent: Friday, September 6, 2013 3:29:19 AM
Subject: Re: Concept annotation questions and keep JCas results in a file
Hi,
You might want to take a look at the convenience methods in the recently
released Apache uimaFIT 2.0.0:
CasIOUtil
    readXCas(JCas, File)
    readXmi(JCas, File)
    writeXCas(JCas, File)
    writeXmi(JCas, File)
Cheers,
-- Richard
On 06.09.2013, at 06:28, samir chabou <sa...@yahoo.com> wrote:
> Hi Tim, Pei and James
> 1) I tried List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) and it meets my requirement perfectly, thanks Tim.
> 2) Now I need to run a medical question through the clinical pipeline and keep the resulting JCas in a file, or persist it some other way, because I need to use it later in my processing. Is it possible to do this, and to reload the JCas later in my processing?
>
> Thanks
> Samir
> From: samir chabou <sa...@yahoo.com>
> To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
> Sent: Thursday, August 29, 2013 2:51:12 PM
> Subject: Re: Concept annotation questions
>
> Thanks Tim,
> It looks like a better and cleaner way. It means that List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) will give me the intersection between the BaseTokens and the IdentifiedAnnotations: if my base token is in the list, then the base token is also covered by an IdentifiedAnnotation. I'll give it a try some time next week and let you know.
> Thanks
> Samir
>
>
> From: Tim Miller <ti...@childrens.harvard.edu>
> To: user@ctakes.apache.org
> Sent: Thursday, August 29, 2013 1:07:58 PM
> Subject: Re: Concept annotation questions
>
> Samir,
> You may be able to use the JCasUtil class from uimaFIT to do something like the following:
>
> for each IdentifiedAnnotation i:
>     List l = JCasUtil.selectCovered(jcas, BaseToken.class, i)
>
>
> (This is Java-ish pseudocode, obviously.) The tokens in the list you get back are all covered by the IdentifiedAnnotation i, so they share its type. Would that solve your problem?
> Tim
>
> On 08/29/2013 12:29 PM, samir chabou wrote:
>> Hi James and Pei,
>> I also need to know the medical type (symptom, drug, procedure, relation) of a given word token. In the type system hierarchy, WordToken is not in the same inheritance tree as IdentifiedAnnotation, so I’m currently iterating over all WordTokens and comparing each wordToken.CoveredText to the annotations' CoveredText in the IdentifiedAnnotations. I find this a slow process. James, do you think the patch that you are planning to create (<<I could create a patch for you that would help with determining which words from the text matched a dictionary entry>>) would also cover this requirement? Or can you suggest something better than what I’m currently doing?
>>
>> Thanks
>> Samir
>>
>> From: "Masanz, James J." <Ma...@mayo.edu>
>> To: "'user@ctakes.apache.org'" <us...@ctakes.apache.org>
>> Sent: Thursday, August 29, 2013 10:18:40 AM
>> Subject: RE: Concept annotation questions
>>
>> Hi Dennis,
>>
>> Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
>>
>> As far as the chunking goes, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.
>>
>> I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
>>
>> -- James
>>
>> From: user-return-257-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
>> Sent: Wednesday, August 28, 2013 2:33 PM
>> To: user@ctakes.apache.org
>> Subject: Re: Concept annotation questions
>>
>> Hi James & Pei,
>>
>> Thank you for your replies and sorry for my late reply as I have been away.
>>
>> Q1 – The longest span could work and is one of the options we are looking at, but when there are overlaps it can get complicated. In the following example, the longest span would work: we can start with 01, ignore 02 and 03 because their start positions fall before the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.
>>
>> 00 Invasive ductal carcinoma of the left breast with bone metastases.
>> 01 Invasive ductal carcinoma of the left breast   408643008|Infiltrating duct carcinoma of breast (disorder)|
>> 02                                       breast with bone   56873002|Bone structure of sternum (body structure)|
>> 03                                       breast with bone metastases   94297009|Secondary malignant neoplasm of female breast (disorder)|
>> 04                                                   bone metastases   94222008|Secondary malignant neoplasm of bone (disorder)|
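
The longest-span rule discussed in Q1 can be tried out in plain Java. This is only an illustrative sketch, not cTAKES code: the names LongestSpanFilter, Span and keepLongest are hypothetical, and the character offsets are counted by hand from sentence 00 above. Spans are sorted by begin offset (longest first on ties), and a span is kept only if it starts at or after the end of the last kept span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LongestSpanFilter {

    /** Hypothetical stand-in for a concept annotation: offsets [begin, end) plus a code. */
    static final class Span {
        final int begin;
        final int end;
        final String code;

        Span(int begin, int end, String code) {
            this.begin = begin;
            this.end = end;
            this.code = code;
        }

        int length() {
            return end - begin;
        }
    }

    /** Scans left to right and drops every span that starts inside an already kept span. */
    static List<Span> keepLongest(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        // Earlier begin first; for equal begins, the longer span first.
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.length(), a.length()));

        List<Span> kept = new ArrayList<>();
        int lastEnd = 0;
        for (Span span : sorted) {
            if (span.begin >= lastEnd) {
                kept.add(span);
                lastEnd = span.end;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Offsets for rows 01 to 04 of the breast carcinoma example (assumed, counted by hand).
        List<Span> spans = Arrays.asList(
                new Span(0, 44, "408643008"),
                new Span(38, 54, "56873002"),
                new Span(38, 65, "94297009"),
                new Span(50, 65, "94222008"));
        for (Span span : keepLongest(spans)) {
            System.out.println(span.code);
        }
    }
}
```

For this example the filter keeps rows 01 and 04 and drops 02 and 03, as described in Q1. It says nothing about which words inside a kept span actually matched the dictionary entry, which is the harder problem raised in Q2.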
>>
>> Q2 – As we are beginners, we are not at the level where we are comfortable with modifying cTakes or even know where to begin modifying cTakes but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages. Examples of English terms include:
>> • CA - Liver cancer
>> • Cancer of Liver
>> • cancer of the liver
>> • Cancer, Hepatic
>> • CANCER, HEPATOCELLULAR
>> • Malignant hepatic neoplasm
>> • Malignant liver tumor
>> • Malignant liver tumour
>> • Malignant neoplasm of liver
>> • malignant neoplasm of liver (diagnosis)
>> • Malignant neoplasm of liver unspecified
>> • Malignant neoplasm of liver unspecified (disorder)
>> • Malignant neoplasm of liver, not specified as primary or secondary
>> • Malignant neoplasm of liver, NOS
>> • Malignant neoplasm of liver, unspecified
>> • malignant neosplasm of the liver
>> • Malignant tumor of liver
>> • Malignant tumor of liver (disorder)
>> • Malignant tumour of liver
>> It would seem suboptimal to go through each of the descriptions to try and determine which was the UMLS term that was used in the coding. It is important for us to know which part of the string is matched because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched and would like to post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers like severity, chronicity and episodicity that may be ignored when matching, we would like to capture it at the level of granularity specified in the original text.
>>
>> In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
>> • NP: cancer of colon, lung and liver
>> • PP: of
>> • NP: colon, lung and liver
>> For “cancer of colon, liver and lung” here is what I see:
>> • NP: cancer of colon,
>> • PP: of
>> • NP: colon
>> • O: liver
>> • O: and
>> • NP: lung
>> Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
>>
>> Regards,
>> Dennis
>>
>> From: Chen, Pei
>> Sent: Thursday, August 22, 2013 12:27 PM
>> To: user@ctakes.apache.org
>> Subject: RE: Concept annotation questions
>>
>> Also,
>> > 3)… or the exact description that was returned in the UMLS?
>> I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request- see: https://issues.apache.org/jira/browse/CTAKES-224
>>
>> --Pei
>>
>> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
>> Sent: Thursday, August 22, 2013 3:24 PM
>> To: 'user@ctakes.apache.org'
>> Subject: RE: Concept annotation questions
>>
>>
>> Welcome to the cTAKES community.
>>
>> Q1 – some people use the longest span.
>> Q2 & Q3 – Can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
>>
>> For “cancer of colon, liver and lung”, can you look at the chunk tag for liver? If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
>>
>> -- James
>>
>> From: user-return-248-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
>> Sent: Wednesday, August 21, 2013 1:10 PM
>> To: user@ctakes.apache.org
>> Subject: Concept annotation questions
>>
>> Hi Everyone,
>>
>> We are new to cTakes so please bear with our questions. We are using cTakes to annotate things like encounter diagnoses and referral notes and are especially interested in the SNOMED CT encodings. But we are not sure how to make sense of all the outputs.
>>
>> Example #1
>>
>> In the example below, “cancer of colon, lung and liver” has been encoded with SNOMED CT and additional concepts that do not apply have been removed (e.g., general “cancer” concept, lung, colon and liver structures, etc). They have been plotted out by the begin/end positions. If the terms do not align, it's probably because the email only accepts plain text and a mono-spaced font is not the default.
>>
>> cancer of colon, lung and liver
>> cancer of colon, lung and liver   93870000|Malignant neoplasm of liver (disorder)|
>> cancer of colon, lung             363358000|Malignant tumor of lung (disorder)|
>> cancer of colon                   363406005|Malignant tumor of colon (disorder)|
>>
>> Question (1) – We had to do quite a bit of post-processing to remove inactive concepts, subtype concepts, concepts that are part of the defining attributes, etc. Is there a set of guidelines to help sort out the CUI or SNOMED CT codes that have been identified?
>> Question (2) – How can we determine that “93870000|Malignant neoplasm of liver (disorder)|” refers to “cancer of liver” as opposed to using the begin/end string, which points to “cancer of colon, lung and liver”? Certainly we can try to do additional parsing but there are a lot of different scenarios to take into account.
>> Question (3) – This relates to question 2, are we able to identify the original terms that were used for the concept matching or the exact description that was returned in the UMLS? While the CUI is helpful, the CUI can refer to tens or even hundreds of descriptions.
>>
>> Example #2
>>
>> Switching the position of colon, lung and liver can result in different encodings. Once again, after removing additional concepts not needed (i.e., “cancer” and “colon structure”), we get the following. What happened to liver and lung cancer?
>>
>> cancer of colon, liver and lung
>> cancer of colon                   363406005|Malignant tumor of colon (disorder)|
>>                            lung   39607008|Lung structure (body structure)|
>>
>> We have more questions but will start with these. Thank you in advance.
>>
>> Regards,
>> Dennis
Re: Concept annotation questions and keep JCas results in a file
Posted by Richard Eckart de Castilho <re...@apache.org>.
Hi,
You might want to take a look at the convenience methods in the recently
released Apache uimaFIT 2.0.0:
CasIOUtil
    readXCas(JCas, File)
    readXmi(JCas, File)
    writeXCas(JCas, File)
    writeXmi(JCas, File)
Cheers,
-- Richard
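
A minimal sketch of how these methods might be used to keep a JCas in a file and load it again, assuming uimaFIT 2.0.0 is on the classpath and the signatures are as listed above (later uimaFIT releases may differ). The names JCasPersistenceSketch, save and load are made up for illustration. JCasFactory.createJCas() relies on uimaFIT discovering the type system automatically; if the cTAKES types are not found that way, the empty JCas would have to be created from an explicit type system description instead.

```java
import java.io.File;
import java.io.IOException;

import org.apache.uima.UIMAException;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.CasIOUtil;
import org.apache.uima.jcas.JCas;

class JCasPersistenceSketch {

    /** Writes the annotations and document text of a processed JCas to an XMI file. */
    static void save(JCas jcas, File file) throws IOException {
        CasIOUtil.writeXmi(jcas, file);
    }

    /** Creates an empty JCas and fills it from a previously written XMI file. */
    static JCas load(File file) throws UIMAException, IOException {
        // Assumption: the type system used when saving is discoverable here as well.
        JCas jcas = JCasFactory.createJCas();
        CasIOUtil.readXmi(jcas, file);
        return jcas;
    }
}
```

The XCAS variants (writeXCas/readXCas) would be used the same way; only the file format differs.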
Concept annotation questions and keep JCas results in a file
Posted by samir chabou <sa...@yahoo.com>.
Hi Tim, Pei and James
1) I tried List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) and it meets my requirement perfectly, thanks Tim.
2) Now I need to run a medical question through the clinical pipeline and keep the resulting JCas in a file, or persist it some other way, because I need to use it later in my processing. Is it possible to do this, and to reload the JCas later in my processing?
Thanks
Samir
________________________________
From: samir chabou <sa...@yahoo.com>
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Sent: Thursday, August 29, 2013 2:51:12 PM
Subject: Re: Concept annotation questions
Thanks Tim,
It looks like a better and cleaner way. It means that List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) will give me the intersection between the BaseTokens and the IdentifiedAnnotations: if my base token is in the list, then the base token is also covered by an IdentifiedAnnotation. I'll give it a try some time next week and let you know.
Thanks
Samir
________________________________
From: Tim Miller <ti...@childrens.harvard.edu>
To: user@ctakes.apache.org
Sent: Thursday, August 29, 2013 1:07:58 PM
Subject: Re: Concept annotation questions
Samir,
You may be able to use the JCasUtil class from uimaFIT to do
something like the following:

for each IdentifiedAnnotation i:
    List l = JCasUtil.selectCovered(jcas, BaseToken.class, i)

(This is Java-ish pseudocode, obviously.) The tokens in the list you
get back are all covered by the IdentifiedAnnotation i, so they share
its type. Would that solve your problem?
Tim
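
For readers without a UIMA setup at hand, the idea behind the pseudocode can be tried in plain Java. This sketch does not use uimaFIT: Span, selectCovered and typesByToken are hypothetical stand-ins that mimic the covering test on character offsets, and the two type labels are just example strings. In a real pipeline the inner call would be JCasUtil.selectCovered(jcas, BaseToken.class, i) as in the pseudocode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CoveredTokensSketch {

    /** Hypothetical stand-in for a token or an annotation: offsets [begin, end) plus a label. */
    static final class Span {
        final int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Returns the tokens whose offsets lie entirely inside the covering span. */
    static List<Span> selectCovered(List<Span> tokens, Span cover) {
        List<Span> covered = new ArrayList<>();
        for (Span token : tokens) {
            if (token.begin >= cover.begin && token.end <= cover.end) {
                covered.add(token);
            }
        }
        return covered;
    }

    /** Maps each token's begin offset to the labels of all annotations that cover it. */
    static Map<Integer, List<String>> typesByToken(List<Span> tokens, List<Span> annotations) {
        Map<Integer, List<String>> types = new LinkedHashMap<>();
        for (Span annotation : annotations) {
            for (Span token : selectCovered(tokens, annotation)) {
                types.computeIfAbsent(token.begin, k -> new ArrayList<>()).add(annotation.label);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // Tokens of "cancer of colon" with two made-up annotations over them.
        List<Span> tokens = Arrays.asList(
                new Span(0, 6, "cancer"), new Span(7, 9, "of"), new Span(10, 15, "colon"));
        List<Span> annotations = Arrays.asList(
                new Span(0, 15, "DiseaseDisorderMention"),
                new Span(10, 15, "AnatomicalSiteMention"));
        System.out.println(typesByToken(tokens, annotations));
    }
}
```

Looping over the annotations once and collecting the covered tokens avoids comparing the covered text of every token against every annotation.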
On 08/29/2013 12:29 PM, samir chabou wrote:
Hi James and Pei,
>I also need to know the medical type (symptom, drug, procedure, relation) of a given word token. In the type system hierarchy, WordToken is not in the same inheritance tree as IdentifiedAnnotation, so I’m currently iterating over all WordTokens and comparing each wordToken.CoveredText to the annotations' CoveredText in the IdentifiedAnnotations. I find this a slow process. James, do you think the patch that you are planning to create (<<I could create a patch for you that would help with determining which words from the text matched a dictionary entry>>) would also cover this requirement? Or can you suggest something better than what I’m currently doing?
>
>Thanks
>Samir
>
>
>
>________________________________
> From: "Masanz, James J." <Ma...@mayo.edu>
>To: "'user@ctakes.apache.org'" <us...@ctakes.apache.org>
>Sent: Thursday, August 29, 2013 10:18:40 AM
>Subject: RE: Concept annotation questions
>
>
>
>
>Hi Dennis,
>
>Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
>
>As far as the chunking goes, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.
>
>I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
>
>-- James
>
>From:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
>Sent: Wednesday, August 28, 2013 2:33 PM
>To: user@ctakes.apache.org
>Subject: Re: Concept annotation questions
>
>Hi James & Pei,
>
>Thank you for your replies and sorry for my late reply as I have been away.
>
>Q1 – The longest span could work and is one of the options we are looking at, but when there are overlaps it can get complicated. In the following example, the longest span would work: we can start with 01, ignore 02 and 03 because their start positions fall before the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.
>
>00 Invasive ductal carcinoma of the left breast with bone metastases.
>01 Invasive ductal carcinoma of the left breast   408643008|Infiltrating duct carcinoma of breast (disorder)|
>02                                       breast with bone   56873002|Bone structure of sternum (body structure)|
>03                                       breast with bone metastases   94297009|Secondary malignant neoplasm of female breast (disorder)|
>04                                                   bone metastases   94222008|Secondary malignant neoplasm of bone (disorder)|
>
>Q2 – As we are beginners, we are not at the level where we are comfortable with modifying cTakes or even know where to begin modifying cTakes but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages. Examples of English terms include:
> * CA - Liver cancer
> * Cancer of Liver
> * cancer of the liver
> * Cancer, Hepatic
> * CANCER, HEPATOCELLULAR
> * Malignant hepatic neoplasm
> * Malignant liver tumor
> * Malignant liver tumour
> * Malignant neoplasm of liver
> * malignant neoplasm of liver (diagnosis)
> * Malignant neoplasm of liver unspecified
> * Malignant neoplasm of liver unspecified (disorder)
> * Malignant neoplasm of liver, not specified as primary or secondary
> * Malignant neoplasm of liver, NOS
> * Malignant neoplasm of liver, unspecified
> * malignant neosplasm of the liver
> * Malignant tumor of liver
> * Malignant tumor of liver (disorder)
> * Malignant tumour of liver
>It would seem suboptimal to go through each of the descriptions to try and determine which was the UMLS term that was used in the coding. It is important for us to know which part of the string is matched because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched and would like to post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers like severity, chronicity and episodicity that may be ignored when matching, we would like to capture it at the level of granularity specified in the original text.
>
>In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
> * NP: cancer of colon, lung and liver
> * PP: of
> * NP: colon, lung and liver
>For “cancer of colon, liver and lung” here is what I see:
> * NP: cancer of colon,
> * PP: of
> * NP: colon
> * O: liver
> * O: and
> * NP: lung
>Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
>
>Regards,
>Dennis
>
>From:Chen, Pei
>Sent:Thursday, August 22, 2013 12:27 PM
>To:user@ctakes.apache.org
>Subject:RE: Concept annotation questions
>
>Also,
>>3)… or the exact description that was returned in the UMLS?
>I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request- see:https://issues.apache.org/jira/browse/CTAKES-224
>
>--Pei
>
>From:Masanz, James J. [mailto:Masanz.James@mayo.edu]
>Sent: Thursday, August 22, 2013 3:24 PM
>To: 'user@ctakes.apache.org'
>Subject: RE: Concept annotation questions
>
>
>Welcome to the cTAKES community.
>
>Q1 – some people use the longest span.
>Q2 & Q3 – Can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
>
>For “cancer of colon, liver and lung”, can you look at the chunk tag for liver? If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
>
>-- James
>
>From:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
>Sent: Wednesday, August 21, 2013 1:10 PM
>To: user@ctakes.apache.org
>Subject: Concept annotation questions
>
>Hi Everyone,
>
>We are new to cTakes so please bear with our questions. We are using cTakes to annotate things like encounter diagnoses and referral notes and are especially interested with the SNOMED CT encodings. But we are not sure how to make sense of all the outputs.
>
>Example #1
>
>In the example below, “cancer of colon, lung and liver” has been encoded with SNOMED CT and additional concepts that do not apply have been removed (e.g., general “cancer” concept, lung, colon and liver structures, etc). They have been plotted out by the begin/end positions. If the terms to do not align, its probably because the email only accepts plain text and a mono-spaced font is not the default.
>
>cancer of colon, lung and liver
>cancer of colon, lung and liver 93870000|Malignant neoplasm of liver (disorder)|
>cancer of colon, lung 363358000|Malignant tumor of lung (disorder)|
>cancer of colon 363406005|Malignant tumor of colon (disorder)|
>
>Question (1) – We had to do quite a bit of post-processing to remove inactive concepts, subtype concepts, concepts that are part of the defining attributes, etc. Are there a set of guidelines to help sort out the CUI or SNOMED CT codes that have been identified?
>Question (2) – How can we determine that “93870000|Malignant neoplasm of liver (disorder)|” refers to “cancer of liver” as opposed to using the begin/end string, which points to “cancer of colon, lung and liver”? Certainly we can try to do additional parsing but there are a lot of different scenarios to take into account.
>Question (3) – This relates to question 2, are we able to identify the original terms that were used for the concept matching or the exact description that was returned in the UMLS? While the CUI is helpful, the CUI can refer to tens or even hundreds of descriptions.
>
>
>________________________________
>
>Example #2
>
>Switching the position of colon, lung and liver can result in different encodings. Once again, after removing additional concepts not needed (i.e., “cancer” and “colon structure”), we get the following. What happened to liver and lung cancer?
>
>cancer of colon, liver and lung
>cancer of colon 363406005|Malignant tumor of colon (disorder)|
> lung 39607008|Lung structure (body structure)|
>
>We have more questions but will start with these. Thank you in advance.
>
>Regards,
>Dennis
>
>
Re: Concept annotation questions
Posted by samir chabou <sa...@yahoo.com>.
Thanks Tim,
That looks like a better and cleaner way. If I understand correctly, List l = JCasUtil.selectCovered(jcas, BaseToken.class, i) gives me the BaseTokens that fall within the span of IdentifiedAnnotation i, so if my base token is in that list, it is part of that IdentifiedAnnotation. I'll give it a try some time next week and let you know.
Thanks
Samir
Re: Concept annotation questions
Posted by Tim Miller <ti...@childrens.harvard.edu>.
Samir,
You may be able to use the JCasUtil class from uimaFIT to do something
like the following:
for (IdentifiedAnnotation i : JCasUtil.select(jcas, IdentifiedAnnotation.class)) {
    List<BaseToken> l = JCasUtil.selectCovered(jcas, BaseToken.class, i);
}
(this is a Java-ish sketch, imports omitted). Each list l then holds the
tokens covered by the IdentifiedAnnotation i, so they can all be given the
same type as i. Would that solve your problem?
Tim
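
To make the "covered" idea concrete without a UIMA pipeline, here is a minimal, self-contained sketch of the containment test that a covered-span selection performs. The Span class and the local selectCovered method are hypothetical stand-ins written for this illustration; they are not the uimaFIT or cTAKES types, and the "AnatomicalSite" label is just an example type name.

```java
import java.util.ArrayList;
import java.util.List;

final class CoveredTokens {
    /** A text span with character offsets; a stand-in for a UIMA annotation. */
    static final class Span {
        final int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Returns the tokens that lie entirely inside the given annotation's span. */
    static List<Span> selectCovered(List<Span> tokens, Span annotation) {
        List<Span> covered = new ArrayList<>();
        for (Span t : tokens) {
            if (t.begin >= annotation.begin && t.end <= annotation.end) {
                covered.add(t);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        String text = "cancer of colon";
        List<Span> tokens = List.of(
                new Span(0, 6, "cancer"),
                new Span(7, 9, "of"),
                new Span(10, 15, "colon"));
        // Hypothetical annotation: an anatomical-site mention over "colon".
        Span site = new Span(10, 15, "AnatomicalSite");
        for (Span t : selectCovered(tokens, site)) {
            // Each covered token inherits the type of the annotation that covers it.
            System.out.println(text.substring(t.begin, t.end) + " -> " + site.label);
        }
    }
}
```

Looping over the annotations and asking for their covered tokens avoids comparing covered text strings token by token, which is the slow part of the approach Samir describes.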
Re: Concept annotation questions
Posted by samir chabou <sa...@yahoo.com>.
Hi James and Pei,
I also need to know the medical type (symptom, drug, procedure, relation)
of a given word token. In the type system hierarchy, WordToken is not in the
same inheritance tree as IdentifiedAnnotation, so I’m currently iterating over
all WordTokens and comparing each WordToken's covered text to the covered text
of every IdentifiedAnnotation. I find this a long process. James, do you think
the patch you are planning to create ("I could create a patch for you that
would help with determining which words from the text matched a dictionary
entry") would also cover this requirement? Or can you suggest something better
than what I’m currently doing?
Thanks
Samir
________________________________
From: "Masanz, James J." <Ma...@mayo.edu>
To: "'user@ctakes.apache.org'" <us...@ctakes.apache.org>
Sent: Thursday, August 29, 2013 10:18:40 AM
Subject: RE: Concept annotation questions
Hi Dennis,
Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
As for the chunking, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.
I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
-- James
From:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-257-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 28, 2013 2:33 PM
To: user@ctakes.apache.org
Subject: Re: Concept annotation questions
Hi James & Pei,
Thank you for your replies and sorry for my late reply as I have been away.
Q1 – The longest span could work and is one of the options we are looking at, but when there are overlaps it can get complicated. In the following example, the longest span would work: we can start with 01, ignore 02 and 03 because their start positions fall before the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.
00 Invasive ductal carcinoma of the left breast with bone metastases.
01 Invasive ductal carcinoma of the left breast                       408643008|Infiltrating duct carcinoma of breast (disorder)|
02                                       breast with bone             56873002|Bone structure of sternum (body structure)|
03                                       breast with bone metastases  94297009|Secondary malignant neoplasm of female breast (disorder)|
04                                                   bone metastases  94222008|Secondary malignant neoplasm of bone (disorder)|
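
The left-to-right procedure described above (take 01, skip 02 and 03, continue with 04) can be sketched as a small greedy filter. This is only an illustration of that post-processing idea, not anything cTAKES provides: the Mention class and keepLongest method are hypothetical names, and the offsets are counted by hand from sentence 00.

```java
import java.util.ArrayList;
import java.util.List;

final class LongestSpanFilter {
    /** A coded mention: character offsets in the sentence plus its SNOMED CT code. */
    static final class Mention {
        final int begin;
        final int end;
        final String code;

        Mention(int begin, int end, String code) {
            this.begin = begin;
            this.end = end;
            this.code = code;
        }
    }

    /** Greedy pass: sort by start (longest first on ties), keep a mention only if it starts after the last kept one ends. */
    static List<Mention> keepLongest(List<Mention> mentions) {
        List<Mention> sorted = new ArrayList<>(mentions);
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.end, a.end)); // same start: longer span first
        List<Mention> kept = new ArrayList<>();
        int lastEnd = 0;
        for (Mention m : sorted) {
            if (m.begin >= lastEnd) {
                kept.add(m);
                lastEnd = m.end;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Offsets into "Invasive ductal carcinoma of the left breast with bone metastases."
        List<Mention> mentions = List.of(
                new Mention(0, 44, "408643008"),  // 01
                new Mention(38, 54, "56873002"),  // 02
                new Mention(38, 65, "94297009"),  // 03
                new Mention(50, 65, "94222008")); // 04
        for (Mention m : keepLongest(mentions)) {
            System.out.println(m.begin + "-" + m.end + " " + m.code);
        }
    }
}
```

The same filter shows why span offsets alone are not enough: in "cancer of colon, lung and liver" all three disorder mentions start at offset 0, so this pass would keep only the liver mention and discard the lung and colon ones.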
Q2 – As we are beginners, we are not at the level where we are comfortable with modifying cTakes or even know where to begin modifying cTakes but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages. Examples of English terms include:
* CA - Liver cancer
* Cancer of Liver
* cancer of the liver
* Cancer, Hepatic
* CANCER, HEPATOCELLULAR
* Malignant hepatic neoplasm
* Malignant liver tumor
* Malignant liver tumour
* Malignant neoplasm of liver
* malignant neoplasm of liver (diagnosis)
* Malignant neoplasm of liver unspecified
* Malignant neoplasm of liver unspecified (disorder)
* Malignant neoplasm of liver, not specified as primary or secondary
* Malignant neoplasm of liver, NOS
* Malignant neoplasm of liver, unspecified
* malignant neosplasm of the liver
* Malignant tumor of liver
* Malignant tumor of liver (disorder)
* Malignant tumour of liver
It would seem suboptimal to go through each of the descriptions to try and determine which was the UMLS term that was used in the coding. It is important for us to know which part of the string is matched because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched and would like to post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers like severity, chronicity and episodicity that may be ignored when matching, we would like to capture them at the level of granularity specified in the original text.
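
As a small illustration of the post-coordinated form above, the sketch below assembles the same expression string from its three parts (focus concept, attribute, value). The class and method names are hypothetical helpers for this example only; deciding which unmatched word, such as “left”, maps to which refinement is the hard part and is not attempted here.

```java
final class PostCoordination {
    /** Formats a concept reference as id|term|. */
    static String concept(String id, String term) {
        return id + "|" + term + "|";
    }

    /** Builds a single-refinement expression of the form focus:attribute=value. */
    static String refine(String focus, String attribute, String value) {
        return focus + ":" + attribute + "=" + value;
    }

    public static void main(String[] args) {
        String focus = concept("408643008", "Infiltrating duct carcinoma of breast (disorder)");
        String findingSite = concept("363698007", "Finding site (attribute)");
        String leftBreast = concept("80248007", "Left breast structure (body structure)");
        System.out.println(refine(focus, findingSite, leftBreast));
    }
}
```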
In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
* NP: cancer of colon, lung and liver
* PP: of
* NP: colon, lung and liver
For “cancer of colon, liver and lung” here is what I see:
* NP: cancer of colon,
* PP: of
* NP: colon
* O: liver
* O: and
* NP: lung
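
To show why the two chunkings above lead to different encodings, here is a deliberately simplified model in which a dictionary entry is found only when all of its words occur inside the same lookup window (one window per outer noun phrase). This is an assumption made for illustration, not the actual cTAKES lookup algorithm, and the toy dictionary and names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ChunkWindowLookup {
    // Toy dictionary: concept name -> words that must all occur in one window.
    static final Map<String, Set<String>> DICTIONARY = new LinkedHashMap<>();
    static {
        DICTIONARY.put("Malignant tumor of colon", Set.of("cancer", "colon"));
        DICTIONARY.put("Malignant tumor of lung", Set.of("cancer", "lung"));
        DICTIONARY.put("Malignant neoplasm of liver", Set.of("cancer", "liver"));
    }

    /** Returns the concepts whose words all appear within a single window. */
    static List<String> lookup(List<List<String>> windows) {
        List<String> found = new ArrayList<>();
        for (List<String> window : windows) {
            for (Map.Entry<String, Set<String>> entry : DICTIONARY.entrySet()) {
                if (window.containsAll(entry.getValue())) {
                    found.add(entry.getKey());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // "cancer of colon, lung and liver": one NP covers the whole phrase.
        List<List<String>> oneChunk = List.of(
                List.of("cancer", "of", "colon", "lung", "and", "liver"));
        // "cancer of colon, liver and lung": "liver" and "and" are tagged O,
        // so only "cancer of colon" and "lung" form windows.
        List<List<String>> splitChunks = List.of(
                List.of("cancer", "of", "colon"),
                List.of("lung"));
        System.out.println(lookup(oneChunk));
        System.out.println(lookup(splitChunks));
    }
}
```

Under this model the single-NP phrase yields all three cancers, while the split phrase yields only the colon one, because “cancer” never shares a window with “liver” or “lung”.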
Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
Regards,
Dennis
From:Chen, Pei
Sent:Thursday, August 22, 2013 12:27 PM
To:user@ctakes.apache.org
Subject:RE: Concept annotation questions
Also,
>3)… or the exact description that was returned in the UMLS?
I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request; see: https://issues.apache.org/jira/browse/CTAKES-224
--Pei
From:Masanz, James J. [mailto:Masanz.James@mayo.edu]
Sent: Thursday, August 22, 2013 3:24 PM
To: 'user@ctakes.apache.org'
Subject: RE: Concept annotation questions
Welcome to the cTAKES community.
Q1 – some people use the longest span.
Q2 & Q3 – can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
For “cancer of colon, liver and lung”, can you look at the chunk tag for liver? If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
-- James
From:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org [mailto:user-return-248-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Dennis Lee Hon Kit
Sent: Wednesday, August 21, 2013 1:10 PM
To: user@ctakes.apache.org
Subject: Concept annotation questions
Hi Everyone,
We are new to cTakes so please bear with our questions. We are using cTakes to annotate things like encounter diagnoses and referral notes and are especially interested in the SNOMED CT encodings. But we are not sure how to make sense of all the outputs.
Example #1
In the example below, “cancer of colon, lung and liver” has been encoded with SNOMED CT and additional concepts that do not apply have been removed (e.g., general “cancer” concept, lung, colon and liver structures, etc.). They have been plotted out by the begin/end positions. If the terms do not align, it's probably because the email only accepts plain text and a mono-spaced font is not the default.
cancer of colon, lung and liver
cancer of colon, lung and liver 93870000|Malignant neoplasm of liver (disorder)|
cancer of colon, lung           363358000|Malignant tumor of lung (disorder)|
cancer of colon                 363406005|Malignant tumor of colon (disorder)|
Question (1) – We had to do quite a bit of post-processing to remove inactive concepts, subtype concepts, concepts that are part of the defining attributes, etc. Are there a set of guidelines to help sort out the CUI or SNOMED CT codes that have been identified?
Question (2) – How can we determine that “93870000|Malignant neoplasm of liver (disorder)|” refers to “cancer of liver” as opposed to using the begin/end string, which points to “cancer of colon, lung and liver”? Certainly we can try to do additional parsing but there are a lot of different scenarios to take into account.
Question (3) – This relates to question 2, are we able to identify the original terms that were used for the concept matching or the exact description that was returned in the UMLS? While the CUI is helpful, the CUI can refer to tens or even hundreds of descriptions.
________________________________
Example #2
Switching the position of colon, lung and liver can result in different encodings. Once again, after removing additional concepts not needed (i.e., “cancer” and “colon structure”), we get the following. What happened to liver and lung cancer?
cancer of colon, liver and lung
cancer of colon 363406005|Malignant tumor of colon (disorder)|
lung 39607008|Lung structure (body structure)|
We have more questions but will start with these. Thank you in advance.
Regards,
Dennis
RE: Concept annotation questions
Posted by "Masanz, James J." <Ma...@mayo.edu>.
Hi Dennis,
Thanks for explaining why you are interested in finding out which words in the original text cause a particular concept to be annotated. We are currently working on getting Apache cTAKES 3.1 out. Depending on your timeline, after that is done, perhaps I could create a patch for you that would help with determining which words from the text matched a dictionary entry, rather than just the begin offset of the first word and the end offset of the last word.
As far as the chunking goes, the fact that “liver” and “and” are being tagged as O-chunks explains why the dictionary lookup component is not finding liver cancer or lung cancer in “cancer of colon, liver and lung”.
I’ll try that sentence with the latest chunker model (which will be in cTAKES 3.1) and see if it assigns correct chunk tags for that sentence.
-- James
Re: Concept annotation questions
Posted by Dennis Lee Hon Kit <dl...@uvic.ca>.
Hi James & Pei,
Thank you for your replies and sorry for my late reply as I have been away.
Q1 – The longest span could work and is one of the options we are looking at, but when there are overlaps it can get complicated. In the following example, the longest would work. We can start with 01, ignore 02 and 03 because their start positions fall before the end position of 01, and then continue with 04. But I don’t think it will always be this straightforward, as the begin/end string positions may not always be a good indicator of what exactly in the original text was coded.
00 Invasive ductal carcinoma of the left breast with bone metastases.
01 Invasive ductal carcinoma of the left breast 408643008|Infiltrating duct carcinoma of breast (disorder)|
02 breast with bone 56873002|Bone structure of sternum (body structure)|
03 breast with bone metastases 94297009|Secondary malignant neoplasm of female breast (disorder)|
04 bone metastases 94222008|Secondary malignant neoplasm of bone (disorder)|
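For readers following along, the longest-span idea described above can be sketched in a few lines of Python. This is only an illustration and not something cTAKES provides: the Mention type, the keep_longest function and the use of str.find to recover offsets are all assumptions made for this example, and the codes are the ones listed in rows 01 to 04.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    begin: int   # character offset of the first matched character
    end: int     # character offset just past the last matched character
    code: str    # SNOMED CT code reported for the span


def keep_longest(mentions):
    """Greedy filter: visit spans longest first, keep a span only if it
    does not overlap any span that has already been kept."""
    kept = []
    for m in sorted(mentions, key=lambda m: (-(m.end - m.begin), m.begin)):
        if all(m.end <= k.begin or m.begin >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.begin)


text = "Invasive ductal carcinoma of the left breast with bone metastases."


def mention(span_text, code):
    # Hypothetical helper: recover offsets by searching the sentence.
    begin = text.find(span_text)
    return Mention(begin, begin + len(span_text), code)


mentions = [
    mention("Invasive ductal carcinoma of the left breast", "408643008"),
    mention("breast with bone", "56873002"),
    mention("breast with bone metastases", "94297009"),
    mention("bone metastases", "94222008"),
]

for m in keep_longest(mentions):
    print(m.code, text[m.begin:m.end])
```

On this sentence the filter keeps rows 01 and 04 and drops 02 and 03. It still works purely on begin/end offsets, so it does not address the concern that those offsets may not show which words were actually coded.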
Q2 – As we are beginners, we are not at the level where we are comfortable with modifying cTakes or even know where to begin modifying cTakes but that would be an option in the future. Going back to the example of “cancer of liver” and using the begin/end position of the string that was used to identify the concept, the original string would be “cancer of colon, lung and liver.” The CUI that was identified was C0345904, which has 209 (137 unique) descriptions for all languages. Examples of English terms include:
* CA - Liver cancer
* Cancer of Liver
* cancer of the liver
* Cancer, Hepatic
* CANCER, HEPATOCELLULAR
* Malignant hepatic neoplasm
* Malignant liver tumor
* Malignant liver tumour
* Malignant neoplasm of liver
* malignant neoplasm of liver (diagnosis)
* Malignant neoplasm of liver unspecified
* Malignant neoplasm of liver unspecified (disorder)
* Malignant neoplasm of liver, not specified as primary or secondary
* Malignant neoplasm of liver, NOS
* Malignant neoplasm of liver, unspecified
* malignant neosplasm of the liver
* Malignant tumor of liver
* Malignant tumor of liver (disorder)
* Malignant tumour of liver
It would seem suboptimal to go through each of the descriptions to try to determine which UMLS term was used in the coding. It is important for us to know which part of the string is matched because something like “Invasive ductal carcinoma of the left breast” will be matched to the SNOMED CT concept “408643008|Infiltrating duct carcinoma of breast (disorder)|”, but we would like to know that “left” was not matched and would like to post-coordinate the expression to indicate the left breast, i.e.: 408643008|Infiltrating duct carcinoma of breast (disorder)|:363698007|Finding site (attribute)|=80248007|Left breast structure (body structure)|. When there are other qualifiers like severity, chronicity and episodicity that may be ignored when matching, we would like to capture them at the level of granularity specified in the original text.
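As a rough workaround until the matched words are exposed directly, one could compare the words in the reported span against the descriptions of the CUI and see which description is fully covered. The sketch below is a heuristic written for this thread, not the cTAKES lookup algorithm; the function names are made up for the example, and the candidate list is a handful of the English descriptions listed above.

```python
import re


def tokens(s):
    """Lower-case alphabetic words; punctuation and digits are ignored."""
    return re.findall(r"[a-z]+", s.lower())


def guess_matched_term(span, candidate_terms):
    """Return (term, leftover_words): the candidate whose words all occur
    in the span (preferring the one with the most words), plus the span
    words that this candidate does not account for."""
    span_tokens = tokens(span)
    span_set = set(span_tokens)
    covered = [t for t in candidate_terms
               if tokens(t) and set(tokens(t)) <= span_set]
    if not covered:
        return None, span_tokens
    best = max(covered, key=lambda t: len(set(tokens(t))))
    best_set = set(tokens(best))
    leftover = [w for w in span_tokens if w not in best_set]
    return best, leftover


span = "cancer of colon, lung and liver"
terms = [
    "CA - Liver cancer",
    "Cancer of Liver",
    "cancer of the liver",
    "Cancer, Hepatic",
    "Malignant neoplasm of liver",
    "Malignant tumor of liver",
]

term, leftover = guess_matched_term(span, terms)
print(term)      # Cancer of Liver
print(leftover)  # ['colon', 'lung', 'and']
```

The leftover words are the ones the chosen description does not explain, so they are candidates for qualifiers that would need to be post-coordinated. The guess ignores word order and word variants, so it should be treated as a hint rather than a record of what the dictionary lookup actually matched.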
In terms of the chunking, here is what I see for “cancer of colon, lung and liver”:
* NP: cancer of colon, lung and liver
* PP: of
* NP: colon, lung and liver
For “cancer of colon, liver and lung” here is what I see:
* NP: cancer of colon,
* PP: of
* NP: colon
* O: liver
* O: and
* NP: lung
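To make the effect of these chunk tags concrete, here is a small Python sketch under one simplifying assumption: that a multi-word dictionary term can only be found when all of its words fall inside a single NP chunk. That assumption is a simplified reading of the explanation given on this list, not a description of the actual cTAKES code, and the function names are invented for the example.

```python
import re


def words(s):
    """Set of lower-case alphabetic words in a chunk's text."""
    return set(re.findall(r"[a-z]+", s.lower()))


def np_window_contains(chunks, needed):
    """True if some single NP chunk contains every needed word."""
    return any(tag == "NP" and set(needed) <= words(text)
               for tag, text in chunks)


# Chunks reported for "cancer of colon, lung and liver"
chunks_a = [
    ("NP", "cancer of colon, lung and liver"),
    ("PP", "of"),
    ("NP", "colon, lung and liver"),
]

# Chunks reported for "cancer of colon, liver and lung"
chunks_b = [
    ("NP", "cancer of colon,"),
    ("PP", "of"),
    ("NP", "colon"),
    ("O", "liver"),
    ("O", "and"),
    ("NP", "lung"),
]

for organ in ("colon", "lung", "liver"):
    print(organ,
          np_window_contains(chunks_a, ["cancer", organ]),
          np_window_contains(chunks_b, ["cancer", organ]))
```

Under that assumption, all three organs share an NP with “cancer” in the first sentence, while in the second sentence only “colon” does. That is consistent with the second sentence yielding colon cancer but only a lung structure concept.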
Q3 – To answer Pei’s question, we are not looking at the preferred name from the UMLS, just which term was used.
Regards,
Dennis
RE: Concept annotation questions
Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
Also,
> 3)… or the exact description that was returned in the UMLS?
I presume you mean to save the preferred name from UMLS? If so, this seems to be a common request; see: https://issues.apache.org/jira/browse/CTAKES-224
--Pei
RE: Concept annotation questions
Posted by "Masanz, James J." <Ma...@mayo.edu>.
Welcome to the cTAKES community.
Q1 – some people use the longest span.
Q2 & Q3 – Can you just use the text from the dictionary, “Malignant neoplasm of liver (disorder)”? Alternatively, you could modify cTAKES to save the text of the words that it matches when it is performing dictionary lookup. I would guess there is a term in the UMLS dictionary with the same code as Malignant neoplasm of liver (disorder) that just has the words “cancer of liver”, but there isn’t anything in cTAKES to give that to you just through a configuration change.
For “cancer of colon, liver and lung”, can you look at the chunk tag for liver? If it’s in a separate noun phrase (NP) from “cancer of colon”, that would account for why cancer is not getting tied to liver in that case (but wouldn’t account for why the chunker is creating it as a separate noun phrase).
-- James