Posted to dev@ctakes.apache.org by Tomasz Oliwa <ol...@uchicago.edu> on 2015/07/15 20:49:37 UTC

UmlsConcept subject

Hi,

I think there is a regression in the way cTAKES discovers the subject status ("patient", "family_member", etc.) of a UmlsConcept. Using cTAKES 3.2.2 and the AggregatePlaintextFastUMLSProcessor in the CVD, I get the following results:

1. "Patient's brother has a myocardial infarction." 
"myocardial infarction" and "infarction" have subject = "patient"

2. "Father had a myocardial infarction."
"myocardial infarction" and "infarction" have subject = "patient"

3. "Sister was diagnosed with a myocardial infarction."
"myocardial infarction" and "infarction" have subject = "patient"

4. "Family member had a myocardial infarction."
"myocardial infarction" and "infarction" have subject = "family_member" (this is correct)

I am looking at the code of the SubjectCleartkAnalysisEngine. Is this the class responsible for inferring the subject?
How can this be fixed? Should I open a JIRA ticket?
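
The four observations above could be captured as regression cases. What follows is only a minimal sketch, not cTAKES code: the class SubjectRegressionCases and the classifier function (sentence in, subject string out) are hypothetical stand-ins for whatever runs the real pipeline, and the expected value "family_member" for sentences 1-3 is an assumption, since only sentence 4 is confirmed with that label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Regression cases built from the four sentences reported above. */
final class SubjectRegressionCases {

    /** One sentence plus the subject value a correct classifier is assumed to return. */
    static final class Case {
        final String sentence;
        final String expectedSubject;

        Case(String sentence, String expectedSubject) {
            this.sentence = sentence;
            this.expectedSubject = expectedSubject;
        }
    }

    // ASSUMPTION: sentences 1-3 should yield "family_member";
    // only sentence 4 is confirmed with that label.
    static List<Case> cases() {
        return Arrays.asList(
            new Case("Patient's brother has a myocardial infarction.", "family_member"),
            new Case("Father had a myocardial infarction.", "family_member"),
            new Case("Sister was diagnosed with a myocardial infarction.", "family_member"),
            new Case("Family member had a myocardial infarction.", "family_member"));
    }

    /**
     * Runs every case through the given classifier and returns the sentences
     * whose predicted subject differs from the expected one.
     * The classifier is a HYPOTHETICAL stand-in, not a cTAKES API.
     */
    static List<String> failures(Function<String, String> classifier) {
        List<String> failed = new ArrayList<>();
        for (Case c : cases()) {
            if (!c.expectedSubject.equals(classifier.apply(c.sentence))) {
                failed.add(c.sentence);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Stub that mimics the behaviour reported above for cTAKES 3.2.2.
        Function<String, String> reported =
            s -> s.startsWith("Family member") ? "family_member" : "patient";
        for (String sentence : failures(reported)) {
            System.out.println("FAIL: " + sentence);
        }
    }
}
```

In a real unit test the stub would be replaced by code that runs the sentence through the pipeline and reads the subject of the resulting annotation.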

Thanks,
Tomasz

RE: UmlsConcept subject

Posted by Tomasz Oliwa <ol...@uchicago.edu>.

Pei,

I added some example sentences to the JIRA ticket regarding the subject status.

It would be great if they could be used together with the existing sentences in a training set.

Regards,
Tomasz

________________________________________
From: Pei Chen [chenpei@apache.org]
Sent: Thursday, July 30, 2015 4:07 PM
To: dev@ctakes.apache.org
Subject: Re: UmlsConcept subject

Tomasz,
IIRC, the code in SubjectCleartkAnalysisEngine.java should contain the
feature extractors used. I believe there is an enum of feature
presets, but I do not recall exactly which one performed best on the
test set; it is probably best to check the source code.

I think adding the plain example sentences in Jira would be a great
help, since we can use them for unit testing at a minimum.
Currently, there is no really easy way to 'append' training data, so one
has to create a new set that includes the examples.  The code used for
training is also in the project; it should be in the **/eval/*
namespaces.  I believe the gold standard was created in XML (either
Knowtator or Anafora).

Hope that helps.
--Pei


Re: UmlsConcept subject

Posted by Pei Chen <ch...@apache.org>.

Tomasz,
IIRC, the code in SubjectCleartkAnalysisEngine.java should contain the
feature extractors used. I believe there is an enum of feature
presets, but I do not recall exactly which one performed best on the
test set; it is probably best to check the source code.

I think adding the plain example sentences in Jira would be a great
help, since we can use them for unit testing at a minimum.
Currently, there is no really easy way to 'append' training data, so one
has to create a new set that includes the examples.  The code used for
training is also in the project; it should be in the **/eval/*
namespaces.  I believe the gold standard was created in XML (either
Knowtator or Anafora).

Hope that helps.
--Pei

On Thu, Jul 23, 2015 at 10:33 AM, Tomasz Oliwa <ol...@uchicago.edu> wrote:
> What format (features, labels) is most suitable for additional training examples?
>
> The SubjectCleartkAnalysisEngine class loads /org/apache/ctakes/assertion/models/subject/model.jar, which contains a liblinear ClearTK model.
>
> The model has 3 features, label 12 3.
>
> But what exactly are the features, and how are they derived?
>
> What does the target class look like? Is it really differentiating between "patient", "brother", "sister", etc., or is it a binary decision model between "patient" and "family_history" (the latter is what it looks like to me)?
>
> This is not documented.
>
> Tomasz

RE: UmlsConcept subject

Posted by Tomasz Oliwa <ol...@uchicago.edu>.

What format (features, labels) is most suitable for additional training examples?

The SubjectCleartkAnalysisEngine class loads /org/apache/ctakes/assertion/models/subject/model.jar, which contains a liblinear ClearTK model.

The model has 3 features, label 12 3. 
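
For reference, a liblinear text model starts with a short header (solver_type, nr_class, label, nr_feature, bias) followed by a line containing only "w" and then the weights. A minimal sketch of reading that header is below. The class name LiblinearHeader is hypothetical, the sample text in main is made up for illustration and is NOT the content of the cTAKES subject model, and the sketch assumes the model file has already been extracted from the jar as text. It does not map the integer labels back to outcome strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Reads the header of a liblinear text model (the lines before the "w" marker). */
final class LiblinearHeader {

    /** Returns the header entries in file order, e.g. "nr_class" -> "2". */
    static Map<String, String> parse(String modelText) {
        Map<String, String> header = new LinkedHashMap<>();
        for (String raw : modelText.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.equals("w")) {
                break; // the weight vector follows; the header is complete
            }
            if (line.isEmpty()) {
                continue;
            }
            int space = line.indexOf(' ');
            String key = space < 0 ? line : line.substring(0, space);
            String value = space < 0 ? "" : line.substring(space + 1).trim();
            header.put(key, value);
        }
        return header;
    }

    public static void main(String[] args) {
        // MADE-UP sample for illustration only; not the cTAKES subject model.
        String sample = "solver_type L2R_LR\n"
            + "nr_class 2\n"
            + "label 1 2\n"
            + "nr_feature 5\n"
            + "bias 1\n"
            + "w\n"
            + "0.1\n";
        for (Map.Entry<String, String> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

The "label" entry lists the integer class labels and "nr_feature" the number of feature dimensions, which is where the values quoted above come from.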

But what exactly are the features, and how are they derived?

What does the target class look like? Is it really differentiating between "patient", "brother", "sister", etc., or is it a binary decision model between "patient" and "family_history" (the latter is what it looks like to me)?

This is not documented.

Tomasz

Re: UmlsConcept subject

Posted by jay vyas <ja...@gmail.com>.

Yup, a JIRA ticket would be great, thanks Tomasz. Maybe even a unit test to
reproduce this?

On Wed, Jul 22, 2015 at 12:16 PM, Chen, Pei <Pe...@childrens.harvard.edu>
wrote:

> Tomasz,
> Thanks for bringing those up.  It would be great if you could log the real
> examples in the Jira ticket so they can be incorporated into test cases
> going forward (it will most likely need more training examples).
> Also, FYI: if I recall correctly, there was nothing previously in cTAKES
> that explicitly populated the subject attribute.  The closest thing was
> the regex that was lumped together with history.
>
> I hope that helps...
> --Pei



-- 
jay vyas

RE: UmlsConcept subject

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Tomasz,
Thanks for bringing those up.  It would be great if you could log the real examples in the Jira ticket so they can be incorporated into test cases going forward (it will most likely need more training examples).
Also, FYI: if I recall correctly, there was nothing previously in cTAKES that explicitly populated the subject attribute.  The closest thing was the regex that was lumped together with history.

I hope that helps...
--Pei
-----Original Message-----
From: Tomasz Oliwa [mailto:oliwa@uchicago.edu] 
Sent: Wednesday, July 22, 2015 11:35 AM
To: dev@ctakes.apache.org
Subject: RE: UmlsConcept subject

Pei,

The SubjectCleartkAnalysisEngine is currently broken in cTAKES: I tried it with more examples, and it just returns "patient" as the subject.

You mentioned that this is the new Subject Classifier. 

1. What was the old module that was capturing the subject of a UmlsConcept? 

2. How can this old module be enabled in the clinical pipeline until this new Subject Classifier is fixed?

Thanks,
Tomasz

RE: UmlsConcept subject

Posted by Tomasz Oliwa <ol...@uchicago.edu>.

Pei,

The SubjectCleartkAnalysisEngine is currently broken in cTAKES: I tried it with more examples, and it just returns "patient" as the subject.

You mentioned that this is the new Subject Classifier. 

1. What was the old module that was capturing the subject of a UmlsConcept? 

2. How can this old module be enabled in the clinical pipeline until this new Subject Classifier is fixed?

Thanks,
Tomasz

RE: UmlsConcept subject

Posted by Tomasz Oliwa <ol...@uchicago.edu>.

https://issues.apache.org/jira/browse/CTAKES-369 is open now.

Thanks for looking into this. If there is something I could additionally test, let me know.

RE: UmlsConcept subject

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Tomasz,
Yes, please feel free to open a Jira ticket for this. Also, be sure to include the version of cTAKES and the pipeline you're using.
It is possible that the new Subject Classifier isn't classifying this...
