Posted to dev@ctakes.apache.org by Steven Bethard <st...@Colorado.EDU> on 2012/11/18 11:14:10 UTC

type system changes needed to read SHARP data

I've finished another pass through the reader that takes the SHARP Knowtator data and reads it into the cTAKES UIMA type system. The class is:

org.apache.ctakes.core.ae.SHARPKnowtatorXMLReader

If you take a look at that, you'll see a ton of TODO notes and warnings, where I couldn't figure out how to map the Knowtator annotations to the cTAKES UIMA annotations. Here's a list of issues:

* I couldn't find an entity type for "Clinical_attribute", "Devices", "Lab", "Phenomena"

* I couldn't find a modifier type (or alternatively, an Annotation subclass) for the Knowtator annotations "generic_class", "conditional_class", "uncertainty_indicator_class", "distal_or_proximal", "Person", "negation_indicator_class", "historyOf_indicator_class", "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral", "method_class", "device_class", "allergy_indicator_class", "Route", "Form", "Strength", "Strength number", "Strength unit", "Frequency", "Frequency number", "Frequency unit", "Value", "Value number", "Value unit", "estimated_flag_indicator", "reference_range", "Date", "Status change", "Duration", "Dosage".

* I couldn't find a place for the normalized value of "generic_class", "conditional_class", "uncertainty_indicator_class", "distal_or_proximal", "Person", "negation_indicator_class", "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral", "device_class", "allergy_indicator_class", "lab_interpretation_indicator", "estimated_flag_indicator"

* I couldn't find a place for the "associatedCode" of a "Person" or "historyOf_indicator_class"

* There were several things in the Knowtator annotations that I couldn't even guess what they meant: "Attributes_lab", "Temporal", ":THING", "Entities".

After working with this data I think we should consider having separate UIMA Annotation sub-types for each of the things that are Modifiers now. For example, if we have a real Severity Annotation for textual mentions of severity, then the CAS makes it easy to select these. We have exactly this use case in the relation extractor: we need just the Severity modifiers, excluding all the other modifiers. Basically, I think the principle we should follow in UIMA is:

"If you could imagine searching the CAS for something, then that something should have its own Annotation sub-type."
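
To make the principle concrete, here is a minimal Java sketch of the difference between filtering a single Modifier type by category and selecting a dedicated sub-type. Everything in it (ToyCas, SpanAnnotation, CategoryModifier, SeverityModifier, NegationModifier, and the "severity"/"negation" category strings) is a hypothetical stand-in written for illustration; none of it is the UIMA or cTAKES API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a UIMA Annotation: just a text span. NOT the real class.
class SpanAnnotation {
    final int begin;
    final int end;

    SpanAnnotation(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

// Current style (assumed): one Modifier type, told apart by a category value.
class CategoryModifier extends SpanAnnotation {
    final String category; // hypothetical category label

    CategoryModifier(int begin, int end, String category) {
        super(begin, end);
        this.category = category;
    }
}

// Proposed style: each kind of modifier gets its own sub-type.
class SeverityModifier extends SpanAnnotation {
    SeverityModifier(int begin, int end) {
        super(begin, end);
    }
}

class NegationModifier extends SpanAnnotation {
    NegationModifier(int begin, int end) {
        super(begin, end);
    }
}

// Toy stand-in for the CAS annotation index.
class ToyCas {
    private final List<SpanAnnotation> annotations = new ArrayList<>();

    void add(SpanAnnotation annotation) {
        annotations.add(annotation);
    }

    // Return every stored annotation that is an instance of the given type.
    <T extends SpanAnnotation> List<T> select(Class<T> type) {
        List<T> result = new ArrayList<>();
        for (SpanAnnotation annotation : annotations) {
            if (type.isInstance(annotation)) {
                result.add(type.cast(annotation));
            }
        }
        return result;
    }
}

class ModifierSelectionSketch {
    // Single Modifier type: select all modifiers, then filter by hand.
    static List<CategoryModifier> severityByCategory(ToyCas cas) {
        List<CategoryModifier> result = new ArrayList<>();
        for (CategoryModifier modifier : cas.select(CategoryModifier.class)) {
            if ("severity".equals(modifier.category)) {
                result.add(modifier);
            }
        }
        return result;
    }

    // Dedicated sub-type: the type itself is the query.
    static List<SeverityModifier> severityByType(ToyCas cas) {
        return cas.select(SeverityModifier.class);
    }

    public static void main(String[] args) {
        ToyCas cas = new ToyCas();
        cas.add(new CategoryModifier(0, 4, "severity"));
        cas.add(new CategoryModifier(10, 13, "negation"));
        cas.add(new SeverityModifier(20, 26));
        cas.add(new NegationModifier(30, 32));
        System.out.println("by category: " + severityByCategory(cas).size());
        System.out.println("by type: " + severityByType(cas).size());
    }
}
```

In the category-based version every consumer has to know the category values and repeat the filter; in the sub-type version the selection is a single call on the type.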

So, I think we need Annotation sub-types (not TOP sub-types) for:

// linguistic phenomena
Generic
Conditional
Negation
Uncertainty
Estimated
HistoryOf
Person

// for disease/disorder/sign/symptom
Course
BodyLaterality (covering distal_or_proximal, superior_or_inferior, etc.)
BodySide

// for procedure
ProcedureMethod
ProcedureDevice

// for medication
MedicationAllergyIndicator
MedicationDosage
MedicationDuration
MedicationForm
MedicationFrequency
MedicationRoute
MedicationStartDate (maybe?)
MedicationStatusChange
MedicationStrength

// for lab
LabValue
LabInterpretation
LabReferenceRange

Steve

P.S. SHARPKnowtatorXMLReader can parse all the UMLS_CEM data that's on the cloud right now. So once all these type system issues get sorted out, it should be pretty much ready to go.

RE: SHARPKnowtatorXMLReader

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

> Who has the authority to add these, as appears to have been agreed upon
> by everyone?  I am not clear on the process for managing the type system
> these days...
I would suggest the "Apache way" here: open a Jira issue, discuss on ctakes-dev as appropriate (call a vote if there is contention), then commit the code.
Any committer should have access to make the changes to the code.

--Pei


> -----Original Message-----
> From: Wu, Stephen T., Ph.D. [mailto:Wu.Stephen@mayo.edu]
> Sent: Wednesday, December 12, 2012 1:45 PM
> To: ctakes-dev@incubator.apache.org
> Subject: SHARPKnowtatorXMLReader
> 
> So back to the issue of reading in Knowtator XML data...
> 
> I've been looking at your (Steve B's) code, and it seems like it's been written
> so that everything hinges on Annotations subtypes being created for
> everything.  Are these type system barriers the main problems with
> connecting relations and attributes to their corresponding NEs?  (There is a
> layer that does not seem to be happening -- namely, that the values of NE
> attributes/relations don't get populated.  I don't fully understand the
> DelayedFeatures but assume that they would work if the types were set?)
> 
>   1. We haven't created the LabMention, ProcedureMention,
> AnatomicalSiteMention, DiseaseDisorderMention, and
> SignSymptomMention types that we had planned to, yet!  Or at least, they're
> not checked in.
> 
>   2. We also haven't created comprehensive Annotations for modifiers.
> 
> Who has the authority to add these, as appears to have been agreed upon
> by everyone?  I am not clear on the process for managing the type system
> these days...
>  - SHARP SDG (in original SHARP plans -- though I'm not a member)?
>  - Apache ctakes-dev (as most other stuff these days)?
>  - Me and James (what it's mostly been for the type system thus far)?
> 
> E.g., could Steve B. or anyone else on this list add them, esp. since they've
> been discussed?  I could do it too, but it seems like relation/attribute
> functionality in SHARPKnowtatorXMLReader has been held up a long time
> because we're confused about who can change the type system.
> 
> stephen

Re: type system changes needed to read SHARP data

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

This was our hesitation.  We didn't believe that the annotation schema
should unequivocally set the schema of a "common" type system.

Unfortunately, there are multiple people trying to define the semantics of
what gets stored.  The annotations people went ahead and defined an
annotation schema very early on because it was necessary in order to get
actual annotations out.  But I think other people (i.e., the SHARP CEM and
data norm people) have continued tinkering under the hood.  So we could end
up with more types that were not in the original Knowtator annotations.

Currently, all of our stuff (CEM->type system, annotation schema->type
system, task needs->type system, type system->documentation) requires manual
work because there are so many sources, and all conflicts would have to be
mediated anyways.  That's frustrating, and I don't know how to change it.
Suggestions are welcome.

stephen

On 12/6/12 1:20 PM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

> Hi Steven,
> +1 it seems reasonable.
> 
> Just taking a step back,  should there always be a 1-1 mapping between human
> annotated data (Knowtator schema) and the System annotated data (cTAKES type
> system)?
> If this is true, then should they really share the schema then?  i.e. Can the
> annotation tool(s) be auto generated/based off the type system schema or vice
> versa then?  Just thinking of ways we may save time with mappings...
> 
> --Pei
> 
>> -----Original Message-----
>> From: Wu, Stephen T., Ph.D. [mailto:Wu.Stephen@mayo.edu]
>> Sent: Wednesday, December 05, 2012 3:37 PM
>> To: ctakes-dev@incubator.apache.org
>> Subject: Re: type system changes needed to read SHARP data
>> 
>> Sorry for the delayed response, Steve.  The type system was not designed to
>> house the annotations, but rather the later results of processing.  It makes
>> sense to do both.
>> 
>> Takeaways, first, then point-by-point response.
>> For 3.1.0 the type system should include more than just "LabMention,
>> ProcedureMention, SignSymptomMention, DiseaseDisorderMention,
>> AnatomicalSiteMention."  It should also include the exhaustive list of
>> attributes, which would come as subtypes of Modifier.
>> 
>> 
>> Let me hear some +1s and we'll make it happen...
>> 
>> stephen
>> 
>> 
>>>> "Clinical_attribute" -- is this what you're looking for:
>>>> org.apache.ctakes.typesystem.type.refsem.Attribute
>>>> It inherits from Element.
>>> But Attribute is a TOP and we need an Annotation here. (An added
>>> concern is, does it really make sense to have a raw Attribute, and not
>>> some specific sub-type like BodyLaterality or BodySide?)
>> To capture the Knowtator annotations, yes, we do need an Annotation --
>> namely Modifier subtypes, as you've suggested.
>> Attribute is not really meant to be instantiated, it is just meant to be a
>> super-type that could feasibly provide easier indexing.
>> 
>>>> Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab
>>> But Lab is a TOP, and we need an Annotation here.
>> Again, for the case of reading in Knowtator, yes.  I think the addition of
>> LabMention, etc, were slated for 3.1.0, right james?
>> 
>>>> Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with
>>>> the "category" feature.
>>> Should there be constants for each of these categories?
>> There are constants in
>> /ctakes-type-system/src/main/java/org/apache/ctakes/typesystem/type/constants/CONST.java
>> 
>>>> "Person", --> Entity
>>> But Entity is a TOP, not an Annotation.
>> This is an interesting question.  Person was not previously included in a
>> CEM, so it doesn't have a semantic TOP subtype.  Therefore, it also doesn't
>> have an Annotation subtype.  For now we'll just leave it be.
>> 
>>>>> After working with this data I think we should consider having
>>>>> separate UIMA Annotation sub-types for each of the things that are
>>>>> Modifiers now. For example, if we have a real Severity Annotation
>>>>> for textual mentions of severity, then the CAS makes it easy to select
>>>>> these.
>> I think we're lining up with you on this now.
>> 
>>> The types we're talking about are not
>>> used locally within a single AnalysisEngine. They're read in from the
>>> SHARPKnowtatorXMLReader AnalysisEngine, and used separately...
>>> So they can't be local to a
>>> single AnalysisEngine, and they must be in the CAS.
>> Agreed, because of the gold standard representation issue.
>> 
>>> That's exactly what I'm talking about with the severity modifiers. We
>>> have a severity modifier extraction annotator, and we *do* need to
>>> evaluate its performance by comparing the severity modifiers it
>>> extracts to those in the annotated data... So we really do want
>>> everything that's in the Knowtator XML annotations to be loaded and
>>> accessible to all our UIMA AnalysisEngines.
>> Ok.  There is a slight difference in finding modifiers because, for the most
>> part annotators wouldn't mark e.g., a negation term that didn't modify
>> anything clinically interesting.  But there are enough cases where an
>> attribute should be searched for and evaluated on its own that I suppose
>> it's worth it to add all these Modifier subtypes.
>> 
>>>> 2) Will these modifiers be reusable downstream?
>>> I'm not sure what you mean here. Are you suggesting that the type
>>> system should only have types for things that external users of cTAKES
>>> might need, and that we shouldn't have types for things that must be
>>> passed between different cTAKES AnalysisEngines?
>> Sorry for being unclear: "downstream" in this context meant "to other UIMA
>> components in the NLP pipeline."
>

RE: type system changes needed to read SHARP data

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Hi Steven,
+1 it seems reasonable.

Just taking a step back,  should there always be a 1-1 mapping between human annotated data (Knowtator schema) and the System annotated data (cTAKES type system)?
If this is true, then should they really share the schema then?  i.e. Can the annotation tool(s) be auto generated/based off the type system schema or vice versa then?  Just thinking of ways we may save time with mappings...

--Pei

> -----Original Message-----
> From: Wu, Stephen T., Ph.D. [mailto:Wu.Stephen@mayo.edu]
> Sent: Wednesday, December 05, 2012 3:37 PM
> To: ctakes-dev@incubator.apache.org
> Subject: Re: type system changes needed to read SHARP data
> 
> Sorry for the delayed response, Steve.  The type system was not designed to
> house the annotations, but rather the later results of processing.  It makes
> sense to do both.
> 
> Takeaways, first, then point-by-point response.
> For 3.1.0 the type system should include more than just "LabMention,
> ProcedureMention, SignSymptomMention, DiseaseDisorderMention,
> AnatomicalSiteMention."  It should also include the exhaustive list of
> attributes, which would come as subtypes of Modifier.
> 
> 
> Let me hear some +1s and we'll make it happen...
> 
> stephen
> 
> 
> >> "Clinical_attribute" -- is this what you're looking for:
> >> org.apache.ctakes.typesystem.type.refsem.Attribute
> >> It inherits from Element.
> > But Attribute is a TOP and we need an Annotation here. (An added
> > concern is, does it really make sense to have a raw Attribute, and not
> > some specific sub-type like BodyLaterality or BodySide?)
> To capture the Knowtator annotations, yes, we do need an Annotation --
> namely Modifier subtypes, as you've suggested.
> Attribute is not really meant to be instantiated, it is just meant to be a
> super-type that could feasibly provide easier indexing.
> 
> >> Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab
> > But Lab is a TOP, and we need an Annotation here.
> Again, for the case of reading in Knowtator, yes.  I think the addition of
> LabMention, etc, were slated for 3.1.0, right james?
> 
> >> Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with
> >> the "category" feature.
> > Should there be constants for each of these categories?
> There are constants in
> /ctakes-type-system/src/main/java/org/apache/ctakes/typesystem/type/constants/CONST.java
> 
> >> "Person", --> Entity
> > But Entity is a TOP, not an Annotation.
> This is an interesting question.  Person was not previously included in a CEM,
> so it doesn't have a semantic TOP subtype.  Therefore, it also doesn't have an
> Annotation subtype.  For now we'll just leave it be.
> 
> >>> After working with this data I think we should consider having
> >>> separate UIMA Annotation sub-types for each of the things that are
> >>> Modifiers now. For example, if we have a real Severity Annotation
> >>> for textual mentions of severity, then the CAS makes it easy to select
> >>> these.
> I think we're lining up with you on this now.
> 
> > The types we're talking about are not
> > used locally within a single AnalysisEngine. They're read in from the
> > SHARPKnowtatorXMLReader AnalysisEngine, and used separately...
> > So they can't be local to a
> > single AnalysisEngine, and they must be in the CAS.
> Agreed, because of the gold standard representation issue.
> 
> > That's exactly what I'm talking about with the severity modifiers. We
> > have a severity modifier extraction annotator, and we *do* need to
> > evaluate its performance by comparing the severity modifiers it
> > extracts to those in the annotated data... So we really do want
> > everything that's in the Knowtator XML annotations to be loaded and
> > accessible to all our UIMA AnalysisEngines.
> Ok.  There is a slight difference in finding modifiers because, for the most
> part annotators wouldn't mark e.g., a negation term that didn't modify
> anything clinically interesting.  But there are enough cases where an attribute
> should be searched for and evaluated on its own that I suppose it's worth it
> to add all these Modifier subtypes.
> 
> >> 2) Will these modifiers be reusable downstream?
> > I'm not sure what you mean here. Are you suggesting that the type
> > system should only have types for things that external users of cTAKES
> > might need, and that we shouldn't have types for things that must be
> > passed between different cTAKES AnalysisEngines?
> Sorry for being unclear: "downstream" in this context meant "to other UIMA
> components in the NLP pipeline."

RE: type system changes needed to read SHARP data

Posted by "Masanz, James J." <Ma...@mayo.edu>.

CU Boulder (Martha Palmer et al) is working on an annotation tool. I forwarded a link to our thread to them for comments.

-- James


> -----Original Message-----
> From: ctakes-dev-return-958-Masanz.James=mayo.edu@incubator.apache.org
> [mailto:ctakes-dev-return-958-
> Masanz.James=mayo.edu@incubator.apache.org] On Behalf Of Chen, Pei
> Sent: Wednesday, December 12, 2012 10:24 AM
> To: ctakes-dev@incubator.apache.org
> Subject: RE: type system changes needed to read SHARP data
> 
> > Yeah, agreed that web based annotation tools are the way to go. I
> > would love to see a BRAT-like tool that could work directly from a
> > UIMA type system schema. But I'm not going to hold my breath. ;-)
> 
> It sounds like a comprehensive annotation tool (BRAT on top of UIMA)
> that works directly from a UIMA type system schema would be a common
> tool that would benefit the entire UIMA community; not just OpenNLP or
> cTAKES.  Perhaps we can combine our efforts.
> 
> --Pei
> 
> > -----Original Message-----
> > From: Steven Bethard [mailto:steven.bethard@Colorado.EDU]
> > Sent: Saturday, December 08, 2012 10:08 AM
> > To: ctakes-dev@incubator.apache.org
> > Subject: Re: type system changes needed to read SHARP data
> >
> > On Dec 7, 2012, at 6:14 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> > > Anyway to do an annotation project efficiently a web based tool like
> > > brat is better than the Cas Editor, but brat is not easy to
> > > integrate with UIMA currently. For now we are doing it half-half,
> > > the first annotation work is done with the Cas Editor (layout,
> > > sentences, tokens, named entities), and the more advanced tasks are
> > > done with brat (e.g. relations, coref, disambiguation).
> >
> > Yeah, agreed that web based annotation tools are the way to go. I
> > would love to see a BRAT-like tool that could work directly from a
> > UIMA type system schema. But I'm not going to hold my breath. ;-)
> >
> > Steve

Re: type system changes needed to read SHARP data

Posted by Jörn Kottmann <ko...@gmail.com>.

On 12/12/2012 05:24 PM, Chen, Pei wrote:
>> Yeah, agreed that web based annotation tools are the way to go. I would
>> love to see a BRAT-like tool that could work directly from a UIMA type
>> system schema. But I'm not going to hold my breath. ;-)
> It sounds like a comprehensive annotation tool (BRAT on top of UIMA) that works directly from a UIMA type system schema would be a common tool that would benefit the entire UIMA community; not just OpenNLP or cTAKES.  Perhaps we can combine our efforts.
>

Over at OpenNLP we are working on the Corpus Server, which can be used to
host a set of XMI files and share them between a group of annotators.
It would be really nice if we could find a way to attach BRAT to this
server. UIMA-based trainers can be connected to the Corpus Server via
a Collection Reader which fetches the training material from it.

Jörn

RE: type system changes needed to read SHARP data

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

> Yeah, agreed that web based annotation tools are the way to go. I would
> love to see a BRAT-like tool that could work directly from a UIMA type
> system schema. But I'm not going to hold my breath. ;-)

It sounds like a comprehensive annotation tool (BRAT on top of UIMA) that works directly from a UIMA type system schema would be a common tool that would benefit the entire UIMA community; not just OpenNLP or cTAKES.  Perhaps we can combine our efforts.

--Pei

> -----Original Message-----
> From: Steven Bethard [mailto:steven.bethard@Colorado.EDU]
> Sent: Saturday, December 08, 2012 10:08 AM
> To: ctakes-dev@incubator.apache.org
> Subject: Re: type system changes needed to read SHARP data
> 
> On Dec 7, 2012, at 6:14 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> > Anyway to do an annotation project efficiently a web based tool like
> > brat is better than the Cas Editor, but brat is not easy to integrate
> > with UIMA currently. For now we are doing it half-half, the first
> > annotation work is done with the Cas Editor (layout, sentences,
> > tokens, named entities), and the more advanced tasks are done with
> > brat (e.g. relations, coref, disambiguation).
> 
> Yeah, agreed that web based annotation tools are the way to go. I would
> love to see a BRAT-like tool that could work directly from a UIMA type
> system schema. But I'm not going to hold my breath. ;-)
> 
> Steve

Re: type system changes needed to read SHARP data

Posted by Steven Bethard <st...@Colorado.EDU>.

On Dec 7, 2012, at 6:14 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> Anyway to do an annotation project efficiently a web based tool like
> brat is better than the Cas Editor, but brat is not easy to integrate
> with UIMA currently. For now we are doing it half-half, the first
> annotation work is done with the Cas Editor (layout, sentences,
> tokens, named entities), and the more advanced tasks are done with
> brat (e.g. relations, coref, disambiguation).

Yeah, agreed that web based annotation tools are the way to go. I would love to see a BRAT-like tool that could work directly from a UIMA type system schema. But I'm not going to hold my breath… ;-)

Steve

Re: type system changes needed to read SHARP data

Posted by Jörn Kottmann <ko...@gmail.com>.

On 12/07/2012 06:02 PM, Chen, Pei wrote:
>> CAS Editor doesn't count - that's not usable for any real large-scale complex annotation
> Can we extend the existing UIMA CAS Editor? Or update Knowtator to work directly off the UIMA type system file (objects)? Or even use something like BRAT (http://brat.nlplab.org/) that could work directly off a UIMA type system file/objects?

I extended the Cas Editor with some plugins to suit my annotation needs.
Some of them are Open Source, like the OpenNLP integration or the
connector to the Corpus Server.
Anyway to do an annotation project efficiently a web based tool like
brat is better than the Cas Editor, but brat is not easy to integrate
with UIMA currently. For now we are doing it half-half, the first
annotation work is done with the Cas Editor (layout, sentences,
tokens, named entities), and the more advanced tasks are done with
brat (e.g. relations, coref, disambiguation).

For my next annotation project I will probably try to do the named
entities also with brat, but currently it's too slow compared to the
Cas Editor with the OpenNLP support.

Jörn

RE: type system changes needed to read SHARP data

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

> What we'd really want is an annotation tool that works directly off of a UIMA type system file
I was thinking the same... I think it's fine to have additional types that are not used, since they can then be easily subsetted.  For example, CTS could be a subset of this larger schema.  But I think the key would be sharing the same underlying common types/schema/objects, so any new types created in the annotation tool/schema could be reused by the system automatically (or subsetted, which is an easier problem to solve than mapping).
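
A small Java sketch of why subsetting is the easier problem, under assumptions: with shared type names, checking a subset is plain set containment, while separate schemas need a hand-maintained mapping that can silently miss names. The class SchemaSubsetSketch and its methods are hypothetical, and the pairings of Knowtator names with type names below are illustrative guesses, not an agreed mapping.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SchemaSubsetSketch {
    // Shared schema: a task schema is valid if every one of its type names
    // already exists in the larger schema.
    static boolean isSubset(Set<String> subset, Set<String> fullSchema) {
        return fullSchema.containsAll(subset);
    }

    // Separate schemas: every annotation name needs an explicit mapping entry.
    // Returns the names that have no entry yet, in sorted order.
    static Set<String> unmapped(Set<String> annotationNames, Map<String, String> mapping) {
        Set<String> missing = new TreeSet<>(annotationNames);
        missing.removeAll(mapping.keySet());
        return missing;
    }

    public static void main(String[] args) {
        Set<String> fullSchema =
            new HashSet<>(Arrays.asList("Negation", "Uncertainty", "MedicationDosage"));
        Set<String> taskSubset =
            new HashSet<>(Arrays.asList("Negation", "MedicationDosage"));
        System.out.println("subset ok: " + isSubset(taskSubset, fullSchema));

        // Illustrative mapping only; the real pairings would have to be agreed on.
        Set<String> knowtatorNames =
            new HashSet<>(Arrays.asList("negation_indicator_class", "Dosage", "Route"));
        Map<String, String> mapping = new HashMap<>();
        mapping.put("negation_indicator_class", "Negation");
        mapping.put("Dosage", "MedicationDosage");
        System.out.println("unmapped: " + unmapped(knowtatorNames, mapping));
    }
}
```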

> CAS Editor doesn't count - that's not usable for any real large-scale complex annotation
Can we extend the existing UIMA CAS Editor? Or update Knowtator to work directly off the UIMA type system file (objects)? Or even use something like BRAT (http://brat.nlplab.org/) that could work directly off a UIMA type system file/objects?



> -----Original Message-----
> From: Steven Bethard [mailto:steven.bethard@Colorado.EDU]
> Sent: Friday, December 07, 2012 5:33 AM
> To: ctakes-dev@incubator.apache.org
> Subject: Re: type system changes needed to read SHARP data
> 
> On Dec 5, 2012, at 9:36 PM, "Wu, Stephen T., Ph.D."
> <Wu...@mayo.edu> wrote:
> > For 3.1.0 the type system should include more than just "LabMention,
> > ProcedureMention, SignSymptomMention, DiseaseDisorderMention,
> > AnatomicalSiteMention."  It should also include the exhaustive list of
> > attributes, which would come as subtypes of Modifier.
> 
> +1
> 
> On Dec 6, 2012, at 8:20 PM, "Chen, Pei" <Pe...@childrens.harvard.edu>
> wrote:
> > Just taking a step back,  should there always be a 1-1 mapping between
> > human annotated data (Knowtator schema) and the System annotated data
> > (cTAKES type system)?
> > If this is true, then should they really share the schema then?  i.e.
> > Can the annotation tool(s) be auto generated/based off the type system
> > schema or vice versa then?  Just thinking of ways we may save time
> > with mappings...
> 
> Ideally yes, they should share exactly the same schema. The main problem
> here is annotation tools. What we'd really want is an annotation tool that
> works directly off of a UIMA type system file. But I don't know of any such
> tool. (And no, the CAS Editor doesn't count - that's not usable for any real
> large-scale complex annotation.)
> 
> Steve

Re: type system changes needed to read SHARP data

Posted by Steven Bethard <st...@Colorado.EDU>.

On Dec 5, 2012, at 9:36 PM, "Wu, Stephen T., Ph.D." <Wu...@mayo.edu> wrote:
> For 3.1.0 the type system should include more than just "LabMention,
> ProcedureMention, SignSymptomMention, DiseaseDisorderMention,
> AnatomicalSiteMention."  It should also include the exhaustive list of
> attributes, which would come as subtypes of Modifier.

+1

On Dec 6, 2012, at 8:20 PM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
> Just taking a step back,  should there always be a 1-1 mapping between human annotated data (Knowtator schema) and the System annotated data (cTAKES type system)?
> If this is true, then should they really share the schema then?  i.e. Can the annotation tool(s) be auto generated/based off the type system schema or vice versa then?  Just thinking of ways we may save time with mappings…

Ideally yes, they should share exactly the same schema. The main problem here is annotation tools. What we'd really want is an annotation tool that works directly off of a UIMA type system file. But I don't know of any such tool. (And no, the CAS Editor doesn't count - that's not usable for any real large-scale complex annotation.)

Steve

Re: type system changes needed to read SHARP data

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

Maybe I should've added some additional considerations around
SHARPKnowtatorXMLReader to the discussion...

The previous way to handle all these modifiers was to directly map them to
the named entities that they're associated with.  Again taking negation as
an example, we hadn't been creating a Modifier subtype for polarity, but
just set the value of a named entity as negated.

Storing all of these attributes as Modifier subtypes in
SHARPKnowtatorXMLReader does not eliminate the need to map these subtypes to
NEs.  The Knowtator data includes both the spans of modifiers AND the
assignment of values to the NEs.
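
As a rough illustration of that two-part structure, the sketch below keeps a negation modifier as its own span and also writes the resulting value onto the named entity it points to. ToyEntityMention, ToyNegation, the polarity field, and the 1/-1 convention are assumptions made for this example, not the cTAKES types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a named-entity mention; not the cTAKES type.
class ToyEntityMention {
    final int begin;
    final int end;
    int polarity = 1; // assumed convention: 1 = asserted, -1 = negated

    ToyEntityMention(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical stand-in for a negation modifier: it has its own text span
// and a link to the entity it modifies.
class ToyNegation {
    final int begin;
    final int end;
    final ToyEntityMention target;

    ToyNegation(int begin, int end, ToyEntityMention target) {
        this.begin = begin;
        this.end = end;
        this.target = target;
    }
}

class NegationMappingSketch {
    // The modifier spans stay available for separate evaluation, while the
    // value they imply is also recorded on the modified entity.
    static void applyToEntities(List<ToyNegation> negations) {
        for (ToyNegation negation : negations) {
            negation.target.polarity = -1;
        }
    }

    public static void main(String[] args) {
        // "no pneumonia": "no" covers offsets 0-2, "pneumonia" covers 3-12.
        ToyEntityMention pneumonia = new ToyEntityMention(3, 12);
        ToyEntityMention cough = new ToyEntityMention(20, 25);
        List<ToyNegation> negations = Arrays.asList(new ToyNegation(0, 2, pneumonia));
        applyToEntities(negations);
        System.out.println("pneumonia polarity: " + pneumonia.polarity);
        System.out.println("cough polarity: " + cough.polarity);
    }
}
```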

So there are some modifiers that you'd never be interested in evaluating on
their own apart from the NEs.  However, I'm agreeing with the previous
proposition because there are other modifiers that are interesting to
evaluate apart from NEs, and we should just keep things consistent.

stephen


On 12/5/12 2:36 PM, "Stephen Wu" <wu...@mayo.edu> wrote:

> Sorry for the delayed response, Steve.  The type system was not designed to
> house the annotations, but rather the later results of processing.  It makes
> sense to do both.
> 
> Takeaways, first, then point-by-point response.
> For 3.1.0 the type system should include more than just "LabMention,
> ProcedureMention, SignSymptomMention, DiseaseDisorderMention,
> AnatomicalSiteMention."  It should also include the exhaustive list of
> attributes, which would come as subtypes of Modifier.
> 
> Let me hear some +1s and we'll make it happen...
> 
> stephen
> 
> 
>>> "Clinical_attribute" -- is this what you're looking for:
>>> org.apache.ctakes.typesystem.type.refsem.Attribute
>>> It inherits from Element.
>> But Attribute is a TOP and we need an Annotation here. (An added concern is,
>> does it really make sense to have a raw Attribute, and not some specific
>> sub-type like BodyLaterality or BodySide?)
> To capture the Knowtator annotations, yes, we do need an Annotation --
> namely Modifier subtypes, as you've suggested.
> Attribute is not really meant to be instantiated, it is just meant to be a
> super-type that could feasibly provide easier indexing.
> 
>>> Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab
>> But Lab is a TOP, and we need an Annotation here.
> Again, for the case of reading in Knowtator, yes.  I think the addition of
> LabMention, etc, were slated for 3.1.0, right james?
> 
>>> Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with the
>>> "category" feature.
>> Should there be constants for each of these categories?
> There are constants in
> /ctakes-type-system/src/main/java/org/apache/ctakes/typesystem/type/constants/CONST.java
> 
>>> "Person", --> Entity
>> But Entity is a TOP, not an Annotation.
> This is an interesting question.  Person was not previously included in a
> CEM, so it doesn't have a semantic TOP subtype.  Therefore, it also doesn't
> have an Annotation subtype.  For now we'll just leave it be.
> 
>>>> After working with this data I think we should consider having separate
>>>> UIMA Annotation sub-types for each of the things that are Modifiers now.
>>>> For example, if we have a real Severity Annotation for textual mentions of
>>>> severity, then the CAS makes it easy to select these.
> I think we're lining up with you on this now.
> 
>> The types we're talking about are not
>> used locally within a single AnalysisEngine. They're read in from the
>> SHARPKnowtatorXMLReader AnalysisEngine, and used separately...
>> So they can't be local to a
>> single AnalysisEngine, and they must be in the CAS.
> Agreed, because of the gold standard representation issue.
> 
>> That's exactly what I'm talking about with the severity modifiers. We have a
>> severity modifier extraction annotator, and we *do* need to evaluate its
>> performance by comparing the severity modifiers it extracts to those in the
>> annotated data... So we really do want everything that's in the Knowtator XML
>> annotations to be loaded and accessible to all our UIMA AnalysisEngines.
> Ok.  There is a slight difference in finding modifiers because, for the most
> part annotators wouldn't mark e.g., a negation term that didn't modify
> anything clinically interesting.  But there are enough cases where an
> attribute should be searched for and evaluated on its own that I suppose
> it's worth it to add all these Modifier subtypes.
>  
>>> 2) Will these modifiers be reusable downstream?
>> I'm not sure what you mean here. Are you suggesting that the type system
>> should only have types for things that external users of cTAKES might need,
>> and that we shouldn't have types for things that must be passed between
>> different cTAKES AnalysisEngines?
> Sorry for being unclear: "downstream" in this context meant "to other UIMA
> components in the NLP pipeline."
>

Re: type system changes needed to read SHARP data

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

Sorry for the delayed response, Steve.  The type system was not designed to
house the annotations, but rather the later results of processing.  It makes
sense to do both.  

Takeaways, first, then point-by-point response.
For 3.1.0 the type system should include more than just "LabMention,
ProcedureMention, SignSymptomMention, DiseaseDisorderMention,
AnatomicalSiteMention."  It should also include the exhaustive list of
attributes, which would come as subtypes of Modifier.

Let me hear some +1s and we'll make it happen...

stephen


>> "Clinical_attribute" -- is this what you're looking for:
>> org.apache.ctakes.typesystem.type.refsem.Attribute
>> It inherits from Element.
> But Attribute is a TOP and we need an Annotation here. (An added concern is,
> does it really make sense to have a raw Attribute, and not some specific
> sub-type like BodyLaterality or BodySide?)
To capture the Knowtator annotations, yes, we do need an Annotation --
namely Modifier subtypes, as you've suggested.
Attribute is not really meant to be instantiated, it is just meant to be a
super-type that could feasibly provide easier indexing.

>> Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab
> But Lab is a TOP, and we need an Annotation here.
Again, for the case of reading in Knowtator, yes.  I think the addition of
LabMention, etc., was slated for 3.1.0, right, James?

>> Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with the
>> "category" feature.
> Should there be constants for each of these categories?
There are constants in
/ctakes-type-system/src/main/java/org/apache/ctakes/typesystem/type/constants/CONST.java

>> "Person", --> Entity
> But Entity is a TOP, not an Annotation.
This is an interesting question.  Person was not previously included in a
CEM, so it doesn't have a semantic TOP subtype.  Therefore, it also doesn't
have an Annotation subtype.  For now we'll just leave it be.

>>> After working with this data I think we should consider having separate UIMA
>>> Annotation sub-types for each of the things that are Modifiers now. For
>>> example, if we have a real Severity Annotation for textual mentions of
>>> severity, then the CAS makes it easy to select these.
I think we're lining up with you on this now.

> The types we're talking about are not
> used locally within a single AnalysisEngine. They're read in from the
> SHARPKnowtatorXMLReader AnalysisEngine, and used separately...
> So they can't be local to a
> single AnalysisEngine, and they must be in the CAS.
Agreed, because of the gold standard representation issue.

> That's exactly what I'm talking about with the severity modifiers. We have a
> severity modifier extraction annotator, and we *do* need to evaluate its
> performance by comparing the severity modifiers it extracts to those in the
> annotated data... So we really do want everything that's in the Knowtator XML
> annotations to be loaded and accessible to all our UIMA AnalysisEngines.
Ok.  There is a slight difference in finding modifiers because, for the most
part annotators wouldn't mark e.g., a negation term that didn't modify
anything clinically interesting.  But there are enough cases where an
attribute should be searched for and evaluated on its own that I suppose
it's worth it to add all these Modifier subtypes.
 
>> 2) Will these modifiers be reusable downstream?
> I'm not sure what you mean here. Are you suggesting that the type system
> should only have types for things that external users of cTAKES might need,
> and that we shouldn't have types for things that must be passed between
> different cTAKES AnalysisEngines?
Sorry for being unclear: "downstream" in this context meant "to other UIMA
components in the NLP pipeline."

Re: type system changes needed to read SHARP data

Posted by Steven Bethard <st...@Colorado.EDU>.

A point of clarification: Almost everything we get from the SHARP human annotations is associated with a span of text by the annotators. And we need to recover those spans of text with our machine learning models. So in most cases, we need subtypes of Annotation, not subtypes of TOP. This is perhaps the biggest issue with the current type system: the TOP subtypes contain most of what we need, but the Annotation subtypes are often too impoverished to capture the SHARP annotations.

On Nov 26, 2012, at 9:28 PM, "Wu, Stephen T., Ph.D." <Wu...@mayo.edu> wrote:
>> * I couldn't find an entity type for "Clinical_attribute", "Devices", "Lab",
>> "Phenomena"
> "Devices" and "Phenomena" don't exist yet because they were not part of the
> CEM models.  I need input from someone on CEMs if we're to add these.
> 
> "Clinical_attribute" -- is this what you're looking for:
> org.apache.ctakes.typesystem.type.refsem.Attribute
> It inherits from Element.

But Attribute is a TOP and we need an Annotation here. (An added concern is, does it really make sense to have a raw Attribute, and not some specific sub-type like BodyLaterality or BodySide?)

> Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab

But Lab is a TOP, and we need an Annotation here.

>> * I couldn't find a modifier type (or alternatively, an Annotation subclass)
>> for the Knowtator annotations "generic_class", "conditional_class",
>> "uncertainty_indicator_class", "distal_or_proximal", "Person",
>> "negation_indicator_class", "historyOf_indicator_class",
>> "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral",
>> "method_class", "device_class", "allergy_indicator_class", "Route", "Form",
>> "Strength", "Strength number", "Strength unit", "Frequency", "Frequency
>> number", "Frequency unit", "Value", "Value number", "Value unit",
>> "estimated_flag_indicator", "reference_range", "Date", "Status change",
>> "Duration", "Dosage".
> Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with the
> "category" feature.

Should there be constants for each of these categories?

>> * I couldn't find a place for the normalized value of
> "generic_class", --> IdentifiedAnnotation:generic
> "conditional_class",  --> IdentifiedAnnotation:conditional
> "uncertainty_indicator_class", --> IdentifiedAnnotation:uncertainty
> "negation_indicator_class",  --> IdentifiedAnnotation:polarity

Ok.

> "distal_or_proximal", --> BodyLaterality:value
> "superior_or_inferior", --> BodyLaterality:value
> "dorsal_or_ventral", --> BodyLaterality:value
> "medial_or_lateral", --> BodyLaterality:value
> "device_class", --> ProcedureDevice:value

And then set the Modifier.normalizedForm to BodyLaterality or ProcedureDevice? Ok.

> "Person", --> Entity

But Entity is a TOP, not an Annotation.

>> After working with this data I think we should consider having separate UIMA
>> Annotation sub-types for each of the things that are Modifiers now. For
>> example, if we have a real Severity Annotation for textual mentions of
>> severity, then the CAS makes it easy to select these. We have exactly this use
>> case in relation extractor - we need just the Severity modifiers, excluding
>> all the other modifiers. Basically, I think the principle we should follow in
>> UIMA is:
>> 
>> "If you could imagine searching the CAS for something, then that something
>> should have its own Annotation sub-type."
>> 
> It's a good point, and a relatively good principle, but we have decided
> against it in the past.  The reason is a countering principle:
> 
> "Do not put locally used (component-specific) types in the CAS."

This principle is not relevant here. The types we're talking about are not used locally within a single AnalysisEngine. They're read in from the SHARPKnowtatorXMLReader AnalysisEngine, and used separately in the ModifierExtractorAnnotator AnalysisEngine, the DegreeOfRelationExtractorAnnotator AnalysisEngine, EventAnnotator AnalysisEngine, TimeAnnotator AnalysisEngine, etc. So they can't be local to a single AnalysisEngine, and they must be in the CAS.

> There is no garbage collection in UIMA (despite things being deleted from
> the index) and extra types will bloat the CAS system, though admittedly
> the bloating is not too terrible.

I don't see how garbage collection is relevant here. We're going to create exactly the same number of Modifiers. It's just whether we create them as raw Modifiers or Modifier sub types. Are you saying there's some significant extra cost to having extra types, even when the total number of instances across all types is constant?
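
To make the comparison concrete, here is a rough sketch in plain Java (not the UIMA or cTAKES API; the class names, the category field and the constant values are all made up for illustration). It builds one set of modifier instances and selects the severity ones twice, once by category code and once by subtype. The number of instances is the same either way; only the lookup differs.

```java
import java.util.ArrayList;
import java.util.List;

final class ModifierSelectionSketch {
    // Hypothetical category codes; real values would come from CONST.java.
    static final int CATEGORY_SEVERITY = 1;
    static final int CATEGORY_BODY_SIDE = 2;

    // Stand-in for a span-bearing annotation with a "category" feature.
    static class Modifier {
        final int begin, end, category;
        Modifier(int begin, int end, int category) {
            this.begin = begin;
            this.end = end;
            this.category = category;
        }
    }

    // Hypothetical subtypes: one class per kind of modifier.
    static class SeverityModifier extends Modifier {
        SeverityModifier(int begin, int end) { super(begin, end, CATEGORY_SEVERITY); }
    }

    static class BodySideModifier extends Modifier {
        BodySideModifier(int begin, int end) { super(begin, end, CATEGORY_BODY_SIDE); }
    }

    // Design 1: every caller filters raw Modifiers by category code.
    static List<Modifier> selectByCategory(List<Modifier> all, int category) {
        List<Modifier> out = new ArrayList<>();
        for (Modifier m : all) {
            if (m.category == category) out.add(m);
        }
        return out;
    }

    // Design 2: callers ask for a type and get back only that type.
    static <T extends Modifier> List<T> selectByType(List<Modifier> all, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (Modifier m : all) {
            if (type.isInstance(m)) out.add(type.cast(m));
        }
        return out;
    }

    // Three modifiers in total, two of which are severity modifiers.
    static List<Modifier> sample() {
        List<Modifier> all = new ArrayList<>();
        all.add(new SeverityModifier(0, 4));
        all.add(new BodySideModifier(10, 14));
        all.add(new SeverityModifier(20, 28));
        return all;
    }

    public static void main(String[] args) {
        List<Modifier> all = sample();
        System.out.println("instances: " + all.size());
        System.out.println("severity by category: " + selectByCategory(all, CATEGORY_SEVERITY).size());
        System.out.println("severity by type: " + selectByType(all, SeverityModifier.class).size());
    }
}
```

Both selections return the same two instances out of three; the second one just hands back correctly typed objects without the caller needing to know a category code.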

> Two doubts that could change my mind:
> 1) Do we envision evaluation of the Modifiers/attributes -- apart from the
> Named Entities they're attached to?  If so, we need to preserve this
> information right at the beginning.

That's exactly what I'm talking about with the severity modifiers. We have a severity modifier extraction annotator, and we *do* need to evaluate its performance by comparing the severity modifiers it extracts to those in the annotated data. (We need this annotator, just like we need the UMLS entity annotator, so that our relation extraction annotator can find relations between severities and UMLS entities.)

The same is essentially true for everything annotated in SHARP. It's all annotated with the intention of training machine learning models to reproduce those annotations. So we really do want everything that's in the Knowtator XML annotations to be loaded and accessible to all our UIMA AnalysisEngines.
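
For what it's worth, the kind of comparison I have in mind looks roughly like the sketch below. It is plain Java, not our actual evaluation code: the Span class and the exact-match criterion are assumptions made for illustration. The point is only that both the gold spans and the system spans have to be available as span-bearing objects for this to work.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SpanEvaluationSketch {
    // Hypothetical stand-in for an annotation: a pair of character offsets.
    static final class Span {
        final int begin, end;
        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
        @Override public boolean equals(Object o) {
            return o instanceof Span && ((Span) o).begin == begin && ((Span) o).end == end;
        }
        @Override public int hashCode() {
            return 31 * begin + end;
        }
    }

    // Number of system spans whose offsets exactly match a gold span.
    static int countExactMatches(Set<Span> gold, Set<Span> system) {
        int matches = 0;
        for (Span s : system) {
            if (gold.contains(s)) matches++;
        }
        return matches;
    }

    // Fraction of system spans that are correct (0.0 if the system found nothing).
    static double precision(Set<Span> gold, Set<Span> system) {
        return system.isEmpty() ? 0.0 : (double) countExactMatches(gold, system) / system.size();
    }

    // Fraction of gold spans that the system found (0.0 if there is no gold).
    static double recall(Set<Span> gold, Set<Span> system) {
        return gold.isEmpty() ? 0.0 : (double) countExactMatches(gold, system) / gold.size();
    }

    public static void main(String[] args) {
        // Gold severity spans as read from the annotated data (made-up offsets).
        Set<Span> gold = new HashSet<>(Arrays.asList(
            new Span(0, 6), new Span(20, 28), new Span(40, 44)));
        // Severity spans produced by the extractor (made-up offsets).
        Set<Span> system = new HashSet<>(Arrays.asList(
            new Span(0, 6), new Span(20, 28), new Span(50, 55), new Span(60, 66)));
        System.out.println("precision = " + precision(gold, system));
        System.out.println("recall    = " + recall(gold, system));
    }
}
```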

> 2) Will these modifiers be reusable downstream?

I'm not sure what you mean here. Are you suggesting that the type system should only have types for things that external users of cTAKES might need, and that we shouldn't have types for things that must be passed between different cTAKES AnalysisEngines?

If that's the case, I think this would be a step in a very wrong direction. In UIMA, anything that has to be passed between AnalysisEngines should be declared in the type system. And the whole point of having a type system is to ease the passing of this information. So hobbling the types that we pass between cTAKES annotators just to reduce the size of the type system for external users just doesn't make sense.

Steve

Re: type system changes needed to read SHARP data

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

Thanks for all your work, Steve.

> * I couldn't find an entity type for "Clinical_attribute", "Devices", "Lab",
> "Phenomena"
"Devices" and "Phenomena" don't exist yet because they were not part of the
CEM models.  I need input from someone on CEMs if we're to add these.

"Clinical_attribute" -- is this what you're looking for:
 org.apache.ctakes.typesystem.type.refsem.Attribute
It inherits from Element.  If you see something other than known subtypes
(e.g., BodyLaterality, BodySide, Course, Severity, Procedure*, Lab*,
Medication*), we should probably extend.

Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab


> * I couldn't find a modifier type (or alternatively, an Annotation subclass)
> for the Knowtator annotations "generic_class", "conditional_class",
> "uncertainty_indicator_class", "distal_or_proximal", "Person",
> "negation_indicator_class", "historyOf_indicator_class",
> "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral",
> "method_class", "device_class", "allergy_indicator_class", "Route", "Form",
> "Strength", "Strength number", "Strength unit", "Frequency", "Frequency
> number", "Frequency unit", "Value", "Value number", "Value unit",
> "estimated_flag_indicator", "reference_range", "Date", "Status change",
> "Duration", "Dosage".
Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with the
"category" feature.

> * I couldn't find a place for the normalized value of
"generic_class", --> IdentifiedAnnotation:generic
"conditional_class",  --> IdentifiedAnnotation:conditional
"uncertainty_indicator_class", --> IdentifiedAnnotation:uncertainty
"negation_indicator_class",  --> IdentifiedAnnotation:polarity
"distal_or_proximal", --> BodyLaterality:value
"superior_or_inferior", --> BodyLaterality:value
"dorsal_or_ventral", --> BodyLaterality:value
"medial_or_lateral", --> BodyLaterality:value
"device_class", --> ProcedureDevice:value
"Person", --> Entity
"allergy_indicator_class", --> ?
"lab_interpretation_indicator", --> ?
"estimated_flag_indicator"--> ?

Value should be set according to the constants in
src/main/java/org/apache/ctakes/typesystem/type/constants/CONST.java

These are to my best estimation.  We may need to add the three question
marks, plus things like "Device"... But let's hold off on that for now.

> * I couldn't find a place for the "associatedCode" of a "Person" or
> "historyOf_indicator_class"
> * There were several things in the Knowtator annotations that I couldn't even
> guess what they meant: "Attributes_lab", "Temporal", ":THING", "Entities".
Attributes_lab should probably include the attributes value number,
reference range, delta flag, and ordinal.

Someone else who knows the annotation schema (e.g., Guergana) needs to weigh
in on this.  I'm not sure what most of the rest of these are intended to be.

> After working with this data I think we should consider having separate UIMA
> Annotation sub-types for each of the things that are Modifiers now. For
> example, if we have a real Severity Annotation for textual mentions of
> severity, then the CAS makes it easy to select these. We have exactly this use
> case in relation extractor - we need just the Severity modifiers, excluding
> all the other modifiers. Basically, I think the principle we should follow in
> UIMA is:
> 
> "If you could imagine searching the CAS for something, then that something
> should have its own Annotation sub-type."
> 
It's a good point, and a relatively good principle, but we have decided
against it in the past.  The reason is a countering principle:

 "Do not put locally used (component-specific) types in the CAS."
There is no garbage collection in UIMA (despite things being deleted from
the index) and extra types will bloat the CAS system, though admittedly
the bloating is not too terrible.

Currently, the idea would be to create local objects or types.

Another reason for this is that we didn't want to be making changes to the
type system quite so frequently, and anybody can look for something locally
that nobody else cares about -- we shouldn't make full type system changes
for those.

Two doubts that could change my mind:
 1) Do we envision evaluation of the Modifiers/attributes -- apart from the
Named Entities they're attached to?  If so, we need to preserve this
information right at the beginning.
 2) Will these modifiers be reusable downstream?

stephen

Re: SHARPKnowtatorXMLReader

Posted by Steven Bethard <st...@Colorado.EDU>.

On Dec 12, 2012, at 7:43 PM, "Wu, Stephen T., Ph.D." <Wu...@mayo.edu> wrote:
> I've been looking at your (Steve B's) code, and it seems like it's been
> written so that everything hinges on Annotation subtypes being created for
> everything.

Yep, that's right. To train a severity modifier classifier, for example, we need some Annotation sub-type that marks the spans of severity modifiers.

> Are these type system barriers the main problems with
> connecting relations and attributes to their corresponding NEs?

That's one of the big problems, yes.

>  (There is a
> layer that does not seem to be happening -- namely, that the values of NE
> attributes/relations don't get populated.  I don't fully understand the
> DelayedFeatures but assume that they would work if the types were set?)

This is the other big problem. Basically, we need to be able to say something like:

diseaseDisorderMention.setBodySide(bodySideMention)

And so on, for all the mention types. Once such methods exist, I can update the DelayedFeature implementations to actually set those features.

Ideally, we'd set the inheritance hierarchy up correctly so that all Mentions that have a bodySide feature (DiseaseDisorder, Procedure, SignSymptom) have some common superclass, all Mentions that have a course (DiseaseDisorder, SignSymptom) have some common superclass, etc. If this is not possible, we can get around it with instanceof and casting, but if it is possible, it would be great.
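
Here's a rough sketch of what I mean, in plain Java rather than generated JCas classes. Everything in it is hypothetical: the interface names, the modifier classes and the linkBodySide helper are made up for illustration. UIMA types have a single supertype, so the real type system could not use Java interfaces this way directly; the sketch only shows the shape of the idea.

```java
final class MentionHierarchySketch {
    // Hypothetical span-bearing modifier annotations.
    static final class BodySideModifier {
        final String text;
        BodySideModifier(String text) { this.text = text; }
    }

    static final class CourseModifier {
        final String text;
        CourseModifier(String text) { this.text = text; }
    }

    // Hypothetical shared supertypes, one per feature.
    interface HasBodySide {
        void setBodySide(BodySideModifier side);
        BodySideModifier getBodySide();
    }

    interface HasCourse {
        void setCourse(CourseModifier course);
        CourseModifier getCourse();
    }

    // Has both a body side and a course.
    static class DiseaseDisorderMention implements HasBodySide, HasCourse {
        private BodySideModifier bodySide;
        private CourseModifier course;
        public void setBodySide(BodySideModifier side) { this.bodySide = side; }
        public BodySideModifier getBodySide() { return bodySide; }
        public void setCourse(CourseModifier course) { this.course = course; }
        public CourseModifier getCourse() { return course; }
    }

    // Has a body side but no course.
    static class ProcedureMention implements HasBodySide {
        private BodySideModifier bodySide;
        public void setBodySide(BodySideModifier side) { this.bodySide = side; }
        public BodySideModifier getBodySide() { return bodySide; }
    }

    // Has neither feature.
    static class LabMention { }

    // One check against the shared supertype replaces a branch per concrete
    // mention class. Returns false when the mention has no bodySide feature.
    static boolean linkBodySide(Object mention, BodySideModifier side) {
        if (mention instanceof HasBodySide) {
            ((HasBodySide) mention).setBodySide(side);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        BodySideModifier left = new BodySideModifier("left");
        DiseaseDisorderMention disease = new DiseaseDisorderMention();
        System.out.println("disease linked: " + linkBodySide(disease, left));
        System.out.println("lab linked: " + linkBodySide(new LabMention(), left));
    }
}
```

Without the shared supertype, linkBodySide would need a separate instanceof branch and cast for each of DiseaseDisorder, Procedure and SignSymptom, which is the workaround I mentioned.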

On Dec 12, 2012, at 7:57 PM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
> I would suggest the "Apache way" here? : Open a Jira, Discuss on ctakes-dev as appropriate (call a Vote if there are contentions), Commit the code.
> Any committers should have access to make the changes to the code.

I've added some comments to the issue (https://issues.apache.org/jira/browse/CTAKES-57) as well. Stephen, let me know if you plan to make these changes or if you'd like me to.

Thanks,

Steve

SHARPKnowtatorXMLReader

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

So back to the issue of reading in the Knowtator XML data...

I've been looking at your (Steve B's) code, and it seems like it's been
written so that everything hinges on Annotation subtypes being created for
everything.  Are these type system barriers the main problems with
connecting relations and attributes to their corresponding NEs?  (There is a
layer that does not seem to be happening -- namely, that the values of NE
attributes/relations don't get populated.  I don't fully understand the
DelayedFeatures but assume that they would work if the types were set?)

  1. We haven't yet created the LabMention, ProcedureMention,
AnatomicalSiteMention, DiseaseDisorderMention, and SignSymptomMention types
that we had planned to add.  Or at least, they're not checked in.

  2. We also haven't created comprehensive Annotations for modifiers.

Who has the authority to add these, as appears to have been agreed upon by
everyone?  I am not clear on the process for managing the type system these
days...  
 - SHARP SDG (in original SHARP plans -- though I'm not a member)?
 - Apache ctakes-dev (as most other stuff these days)?
 - Me and James (what it's mostly been for the type system thus far)?

E.g., could Steve B. or anyone else on this list add them, esp. since
they've been discussed?  I could do it too, but it seems like
relation/attribute functionality in SHARPKnowtatorXMLReader has been held up
a long time because we're confused about who can change the type system.

stephen