Posted to user@ctakes.apache.org by Zuo Yiming <yi...@gmail.com> on 2016/10/19 14:03:50 UTC

Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Hi everyone,

I've spent the last few months working on a clinical NLP project using
cTAKES. It's a very complex system to me, and every time I dig into it I
discover something new. Since last week, I have been trying to figure out
which analysis engines do a good job of handling cases like negation,
family history, uncertainty, etc. I now have some experience and would
like to share it with the community.

The best combination for me is to use assertionMiniPipelineAnalysisEngine
for negation, uncertainty, generic and subject detection, and
HistoryCleartkAnalysisEngine for history detection. Both engines are in
the desc/ctakes-assertion folder. The assertionMiniPipelineAnalysisEngine
also claims to be useful for conditional detection, which I haven't
verified on my test files yet.

I'm using the AggregatePlaintextFastUMLSProcessor at the top level. The
default analysis engines in AggregatePlaintextFastUMLSProcessor for
negation, uncertainty, generic, etc. are StatusAnnotator +
NegationAnnotator + PolarityCleartkAnalysisEngine +
SubjectCleartkAnalysisEngine + UncertaintyCleartkAnalysisEngine +
GenericCleartkAnalysisEngine + HistoryCleartkAnalysisEngine. It looks like
in the node part, StatusAnnotator and NegationAnnotator are commented out,
so only the remaining five analysis engines are actually used, and all of
them are in the same desc/ctakes-assertion folder. These five analysis
engines were not effective on my test files, and I'm still confused about
their relationship to the assertionaAnalysisEngine,
conceptConverterAnalysisEngine, GenericAttributeAnalysisEngine and
SubjectAttributeAnalysisEngine used in assertionMiniPipelineAnalysisEngine.
It looks to me like the "Clear" in their names indicates something, but I
couldn't figure it out without going through the Java code, which I would
rather not do at this stage.

That's pretty much all of it for now. Anyone familiar with this topic is
welcome to jump in with insights or corrections. Hopefully, we can
have a nice discussion that will be useful to other users and developers.

P.S. The reason for using AggregatePlaintextFastUMLSProcessor rather than
AggregatePlaintextProcessor is that I find the preferred words property in
the former very useful, while the latter does not provide it.

Best,
Yiming
-- 
Yiming Zuo <https://sites.google.com/site/yimingzuo/>
Georgetown U. Medical Center:
Dr. Ressom's Omics Lab <http://omics.georgetown.edu/>
ECE Department of Virginia Tech:
Computational Bioinformatics & Bio-imaging Laboratory
<http://www.cbil.ece.vt.edu/>

RE: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Yiming,

I don't know if anybody will add functionality to the older (non-fast) dictionary lookup.  It is practically deprecated.

The -fast annotator has several options.  One is the "style" of lookup.  Two styles are provided in cTAKES: one called "Default" and the other "Overlap".  Overlap has a looser matching algorithm than Default, which requires a strict match of text in the note to text in the dictionary.  Because of the looser language in clinical notes, Overlap might work better for you.  Default is better for literature.  I should probably change the names to "Exact" and "Approximate" ...  You can find a little information at the bottom of the page here: https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+-+Fast+Dictionary+Lookup

We could also make the Overlap the default.  The non-fast module only uses Overlap.

Anyway, you can try editing the resources/.../fast/cTakesHsql..xml file and enabling Overlap instead of Default to see how it changes your results.
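
To make the difference between a strict and a looser lookup style concrete, here is a toy sketch in Python. It is not the cTAKES implementation: the function names, the token-skipping rule and the max_skips parameter are all assumptions chosen only to illustrate why a looser match can find terms that a strict match misses in clinical text.

```python
from typing import List


def exact_match(note_tokens: List[str], term_tokens: List[str]) -> bool:
    """Strict style: the term's tokens must appear contiguously in the note."""
    n, m = len(note_tokens), len(term_tokens)
    if m == 0:
        return False
    return any(note_tokens[i:i + m] == term_tokens for i in range(n - m + 1))


def loose_match(note_tokens: List[str], term_tokens: List[str],
                max_skips: int = 2) -> bool:
    """Looser style (hypothetical rule): the term's tokens must appear in
    order, with at most max_skips other note tokens in between in total."""
    if not term_tokens:
        return False
    for start, token in enumerate(note_tokens):
        if token != term_tokens[0]:
            continue
        pos, skips, matched = start + 1, 0, 1
        while matched < len(term_tokens) and pos < len(note_tokens):
            if note_tokens[pos] == term_tokens[matched]:
                matched += 1
            else:
                skips += 1
                if skips > max_skips:
                    break
            pos += 1
        if matched == len(term_tokens):
            return True
    return False


note = "abdominal and back pain".split()
term = "abdominal pain".split()
print("exact:", exact_match(note, term))
print("loose:", loose_match(note, term))
```

With this toy rule, the dictionary term "abdominal pain" is not found by the strict match in "abdominal and back pain" but is found by the looser one, which also shows how looser matching can trade precision for recall.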

Sean

-----Original Message-----
From: Zuo Yiming [mailto:yimingzuo@gmail.com] 
Sent: Wednesday, October 19, 2016 12:22 PM
To: dev@ctakes.apache.org
Subject: Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Hi Sean and Timothy,

Thanks for your clarification about the ClearTK tools. I'm amazed by the power of cTAKES and by the resources and community you have worked to build. I will certainly be happy to provide more feedback as my project moves on.

For Timothy,

By rule-based system, do you mean the assertion annotator? How about the old negation annotator and the status annotator, are they also rule-based systems? I get the feeling that the assertion annotator and the ClearTK system are currently favored over the negation annotator and the status annotator in cTAKES for some reason.

Regarding the ClearTK system on my test files, the negation, history and uncertainty modules work just as well as the assertion annotator. I only have a few test files, so it's really hard to tell which one is better. The main difference shows up when detecting the subject and generic properties. On my limited test files, the ClearTK system doesn't work at all for these. It assigns patient as the subject for all detected phrases, even when it is the patient's family member who has diabetes. The generic property has the same problem: the ClearTK system assigns false as the generic property for all detected phrases. The paper mentioned by you and Sean seems interesting; I will take a look later.

As for further questions, can you give me some suggestions on where to find public gold standard datasets, so I can conduct an independent evaluation of cTAKES using metrics like precision, recall and F1 score?
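
For reference, the kind of evaluation meant here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout (document id, begin offset, end offset, attribute) and the example annotations are hypothetical, not a cTAKES output format or real data, and a prediction counts as correct only if it matches a gold annotation exactly.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(gold: Set[Hashable],
                        predicted: Set[Hashable]) -> Tuple[float, float, float]:
    """Compare predicted annotations with gold annotations by exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, begin, end, attribute value).
gold = {
    ("note1", 10, 18, "negated"),
    ("note1", 40, 52, "negated"),
    ("note2", 5, 13, "family_member"),
}
predicted = {
    ("note1", 10, 18, "negated"),
    ("note2", 5, 13, "family_member"),
    ("note2", 30, 41, "negated"),
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In practice the gold set would come from a manually annotated corpus and the predicted set from the pipeline output, with one such comparison per attribute (negation, subject, generic, and so on).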

Lastly, a minor suggestion from the user perspective would be to add the preferred words property to the AggregatePlaintextUMLSProcessor. As I pointed out briefly in my first email, AggregatePlaintextFastUMLSProcessor gives us the preferred words for detected phrases, but AggregatePlaintextUMLSProcessor does not. This is very helpful when the detected phrases are acronyms, such as pt for patient. In my experience, AggregatePlaintextUMLSProcessor tends to detect more clinically relevant phrases than AggregatePlaintextFastUMLSProcessor. It would be really nice to have the same preferred words property in AggregatePlaintextUMLSProcessor in a future cTAKES release.

Best,
Yiming

On Wed, Oct 19, 2016 at 11:11 AM, Miller, Timothy < Timothy.Miller@childrens.harvard.edu> wrote:

> I can second Sean's thank you, it is good to have this feedback. The 
> ClearTK machine learning models were made the default after we ran 
> some experiments that found it performed better across a range of 
> standard datasets than rule-based algorithms or the existing cTAKES 
> module ( http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774 ).
> Since making them the default, though, we have heard from people and 
> had our own experience conflict with those experiments. And certainly 
> the errors in the rule-based system are easier to understand.
>
> Just curious, are you able to characterize the errors you see from the 
> ClearTK system? I did some experiments recently on a new dataset 
> comparing negex with the cleartk negation module and found that there 
> was a precision/recall tradeoff but almost identical F1 scores. But 
> for that dataset the tradeoff negex provided was preferred by our 
> collaborators. (I think negex had better recall of negated terms but worse precision).
>
> Tim
>
>
>
> ________________________________________
> From: Finan, Sean <Se...@childrens.harvard.edu>
> Sent: Wednesday, October 19, 2016 10:53 AM
> To: dev@ctakes.apache.org
> Subject: RE: Best combination of analysis engines to consider 
> negation, family history, uncertainty, etc.
>
> Hi Yiming,
>
>
>
> Thank you very much for letting the community know what has and has 
> not worked for you.  I have also had better results with the Assertion 
> annotators than the ClearTk alternatives, but that could be because of 
> the note types/formats that I am using.
>
>
>
> Regarding the "Clear" in names, it is because ClearTk (Clear ToolKit) 
> is used to train machine learning models for detection of the 
> indicated property.  You can find information on ClearTk starting here:
> http://clear.colorado.edu/compsem/
>
>
>
> If you prefer to read a paper, you can check out 
> http://www.lrec-conf.org/proceedings/lrec2014/pdf/218_Paper.pdf
>
>
>
> Others on the dev list can provide much more information than I can, so 
> you could post a question if you like.
>
>
>
> Cheers,
>
> Sean

Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
Very exciting to hear about this!

If you need any members of an Advisory Board or anything I would be happy to help.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-502
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 

On 10/20/16, 7:27 AM, "Savova, Guergana" <Gu...@childrens.harvard.edu> wrote:

    I guess I cannot send attachments to the apache list. Pasted the text below:
    
    ****************************************************************
    Executive Summary: Health Natural Language Processing Center (hNLP Center)
    
    Language in its digital form is the most ubiquitous human product nowadays. The amounts of health-related text such as the clinical narrative from the electronic medical records, the text from online health communities and media as well as the biomedical scholarly literature have been growing exponentially. Coupled with the amazing advances in computational methods and hardware, this firehose stream presents the tech community the unique opportunity to be a major player in biomedical discoveries and healthcare personalization by syphoning the unwieldy into informational nuggets.
    
    The Health Natural Language Processing (hNLP) Center targets a key challenge to current hNLP research and health-related human language technology development: the lack of health-related language data. Without shared data, the research community cannot build on each other’s scientific progress as they do in other disciplines where massive amounts of data are available. Even worse, the stakeholders and consumers of hNLP technology in health discovery and care have little access to robust hNLP technology, and are left with needing to implement all methods from scratch on their own data.
    
    The Center builds on the rich experience of its founders – Prof. Guergana Savova (Harvard), Prof. Martha Palmer (University of Colorado) and Prof. Noemie Elhadad (Columbia University) – in the area of natural language processing. The Center follows the tradition of other successful data dissemination centers such as the Linguistic Data Consortium (LDC) (www.ldc.upenn.edu) and the European Language Resources Association (ELRA) (http://www.elra.info/en/) for general language resources. The hNLP Center goes beyond these initiatives in addressing the very critical need of clinical text availability to advance health IT in general. The Center’s Mission is to support health language-related education, research and technology development by creating and sharing curated linguistic textual resources based on the principle that broad access to data drives innovation. The Center’s organizational structure is a not-for-profit consortium of members. Its Advisory Board includes representatives from government, industry, academia and the stakeholders to strategically guide its trajectory. Its fee-based membership is similar to LDC’s and will ensure its sustainability. The highly sensitive clinical narrative is distributed through a meticulously thought-out process. Industry members can obtain a commercial license to allow the embedding of models built from the Center’s data into products. The Center’s primary activities are to (1) provide a repository and data curation, distribution and management point for health-related language resources, (2) support sponsored research programs and health-related language-based technology evaluations, (3) engage in collaborations with US and foreign researchers, institutions and data centers, (4) host and participate in various workshops.
    
    The Center’s roadmap for the near future defines the distribution of about 2M words of clinical text with layers of linguistic gold annotation (constituency trees, dependency trees, coreference, temporal relations, events) and domain gold annotations (clinical entities with mappings to ontologies, clinical data elements). These datasets have already been created with funding from the National Institutes of Health. Portions of the datasets have been used in shared tasks such as CLEF/ShARe (http://clefehealth2014.dcu.ie/), SemEval Analysis of Clinical Text (http://alt.qcri.org/semeval2014/task7/ ; http://alt.qcri.org/semeval2015/task14/) , SemEval Clinical TempEval (http://alt.qcri.org/semeval2015/task6/; http://alt.qcri.org/semeval2016/task12/ ). This 2M-word dataset is the largest gold annotated clinical narrative dataset in the world so far.
    
    The Center addresses the emphasis on methodological robustness and reproducibility which are now required by the National Institutes of Health and the National Science Foundation. It also aligns with bold and ambitious national initiatives such as the cancer moonshot and the call for predictive models for personalized and precision medicine.
    ***************************************************************************************
    
    --Guergana
    
    
    Guergana Savova, PhD, FACMI
    Associate Professor
    PI Natural Language Processing Lab
    Boston Children's Hospital and Harvard Medical School
    300 Longwood Avenue
    Mailstop: BCH3092
    Enders 144.1
    Boston, MA 02115
    Tel: (617) 919-2972
    Fax: (617) 730-0817
    Harvard Scholar: http://scholar.harvard.edu/guergana_k_savova/biocv
    
    
    -----Original Message-----
    From: Zuo Yiming [mailto:yimingzuo@gmail.com] 
    Sent: Thursday, October 20, 2016 9:56 AM
    To: dev@ctakes.apache.org
    Subject: Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.
    
    Hi Sean and Guergana,
    
    Thanks for your reply about the fast and non-fast dictionary look-up, and the testing dataset. Originally, I thought the fast annotator is fast because it only takes a portion of the whole dictionary. Now I realize the fast annotator is the more powerful one. That's very helpful.
    
    For Guergana,
    
    Were you also trying to attach the exec summary? I couldn't see it from the email.
    
    Best,
    Yiming
    
    On Wed, Oct 19, 2016 at 1:03 PM, Savova, Guergana < Guergana.Savova@childrens.harvard.edu> wrote:
    
    > Hi Yiming,
    > Re your question about gold standard datasets. In parallel with 
    > releasing the best-performing methods in cTAKES, we have generated several 
    > gold standard datasets. Our plan is to start distributing them through 
    > a unified effort
    > -- a health NLP Center. See attached exec summary. We hope to have the 
    > Center running in the very near future.
    >
    > Cheers,
    > --Guergana


RE: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.
I guess I cannot send attachments to the apache list. Pasted the text below:

****************************************************************
Executive Summary: Health Natural Language Processing Center (hNLP Center)

Language in its digital form is the most ubiquitous human product nowadays. The amounts of health-related text such as the clinical narrative from the electronic medical records, the text from online health communities and media as well as the biomedical scholarly literature have been growing exponentially. Coupled with the amazing advances in computational methods and hardware, this firehose stream presents the tech community the unique opportunity to be a major player in biomedical discoveries and healthcare personalization by syphoning the unwieldy into informational nuggets.

The Health Natural Language Processing (hNLP) Center targets a key challenge to current hNLP research and health-related human language technology development: the lack of health-related language data. Without shared data, researchers cannot build on each other’s scientific progress as they do in other disciplines where massive amounts of data are available. Even worse, the stakeholders and consumers of hNLP technology in health discovery and care have little access to robust hNLP technology and are left to implement all methods from scratch on their own data.

The Center builds on the rich experience of its founders – Prof. Guergana Savova (Harvard), Prof. Martha Palmer (University of Colorado) and Prof. Noemie Elhadad (Columbia University) – in the area of natural language processing. The Center follows the tradition of other successful data dissemination centers such as the Linguistic Data Consortium (LDC) (www.ldc.upenn.edu) and the European Language Resources Association (ELRA) (http://www.elra.info/en/) for general language resources. The hNLP Center goes beyond these initiatives in addressing the very critical need of clinical text availability to advance health IT in general.

The Center’s Mission is to support health language-related education, research and technology development by creating and sharing curated linguistic textual resources based on the principle that broad access to data drives innovation. The Center’s organizational structure is a not-for-profit consortium of members. Its Advisory Board includes representatives from government, industry, academia and the stakeholders to strategically guide its trajectory. Its fee-based membership is similar to LDC’s and will ensure its sustainability. The highly sensitive clinical narrative is distributed through a meticulously thought-out process. Industry members can obtain a commercial license to allow the embedding of models built from the Center’s data into products.

The Center’s primary activities are to (1) provide a repository and data curation, distribution and management point for health-related language resources, (2) support sponsored research programs and health-related language-based technology evaluations, (3) engage in collaborations with US and foreign researchers, institutions and data centers, (4) host and participate in various workshops.

The Center’s roadmap for the near future defines the distribution of about 2M words of clinical text with layers of linguistic gold annotation (constituency trees, dependency trees, coreference, temporal relations, events) and domain gold annotations (clinical entities with mappings to ontologies, clinical data elements). These datasets have already been created with funding from the National Institutes of Health. Portions of the datasets have been used in shared tasks such as CLEF/ShARe (http://clefehealth2014.dcu.ie/), SemEval Analysis of Clinical Text (http://alt.qcri.org/semeval2014/task7/; http://alt.qcri.org/semeval2015/task14/), and SemEval Clinical TempEval (http://alt.qcri.org/semeval2015/task6/; http://alt.qcri.org/semeval2016/task12/). This 2M-word dataset is the largest gold-annotated clinical narrative dataset in the world so far.

The Center addresses the emphasis on methodological robustness and reproducibility, which are now required by the National Institutes of Health and the National Science Foundation. It also aligns with bold and ambitious national initiatives such as the cancer moonshot and the call for predictive models for personalized and precision medicine.
***************************************************************************************

--Guergana


Guergana Savova, PhD, FACMI
Associate Professor
PI Natural Language Processing Lab
Boston Children's Hospital and Harvard Medical School
300 Longwood Avenue
Mailstop: BCH3092
Enders 144.1
Boston, MA 02115
Tel: (617) 919-2972
Fax: (617) 730-0817
Harvard Scholar: http://scholar.harvard.edu/guergana_k_savova/biocv



Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Posted by Zuo Yiming <yi...@gmail.com>.
Hi Sean and Guergana,

Thanks for your replies about the fast and non-fast dictionary look-up and
the testing dataset. Originally, I thought the fast annotator was fast
because it used only a portion of the whole dictionary. Now I realize the
fast annotator is the more powerful one. That's very helpful.

For Guergana,

Were you also trying to attach the exec summary? I couldn't see it in the
email.

Best,
Yiming


RE: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.
Hi Yiming,
Re your question about gold standard datasets. In parallel with releasing the best-performing methods in cTAKES, we have generated several gold standard datasets. Our plan is to start distributing them through a unified effort -- a health NLP Center. See attached exec summary. We hope to have the Center running in the very near future.

Cheers,
--Guergana

-----Original Message-----
From: Zuo Yiming [mailto:yimingzuo@gmail.com] 
Sent: Wednesday, October 19, 2016 12:22 PM
To: dev@ctakes.apache.org
Subject: Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Hi Sean and Timothy,

Thanks for your clarification about the ClearTK tools. I'm amazed by the power of cTAKES and by the resources and community you have worked to build. I will certainly be happy to provide more feedback as my project moves on.

For Timothy,

By rule-based system, do you mean the assertion annotator? What about the old negation annotator and the status annotator: are they also rule-based systems? I get the feeling that the assertion annotator and the ClearTK system are currently favored over the negation annotator and the status annotator in cTAKES for some reason.

Regarding the ClearTK system on my test files, the negation, history, and uncertainty modules work just as well as the assertion annotator. I have only a few test files, so it's really hard to tell which one is better. The main difference shows up when detecting the subject and generic properties. On my limited test files, the ClearTK system doesn't work at all for these. It assigns patient as the subject for all detected phrases, even when it's the patient's family member who has diabetes. The generic property has the same problem: the ClearTK system assigns false as the generic property for all detected phrases. The paper mentioned by you and Sean seems interesting; I will take a look later.
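
One way to characterize a failure like this is to tally expected against predicted attribute values per mention. The following is only a minimal sketch in Python, not cTAKES code: the label lists and the values "patient" and "family_member" are hypothetical stand-ins for whatever a pipeline's output and a hand-checked reference are exported to.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold, predicted) label pairs for mentions aligned one-to-one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


# Hypothetical subject labels for four detected mentions of "diabetes".
gold_subject = ["patient", "family_member", "family_member", "patient"]
pred_subject = ["patient", "patient", "patient", "patient"]

counts = confusion_counts(gold_subject, pred_subject)
for (gold_label, pred_label), n in sorted(counts.items()):
    print(f"gold={gold_label:<14} predicted={pred_label:<14} count={n}")
```

A system that always answers "patient" shows up as a table with a single predicted value, which makes the pattern easy to report back to the list.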

As a further question, can you give me some suggestions on where to find public gold standard datasets, so I can conduct an independent evaluation of cTAKES with metrics like precision, recall, and F1 score?
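
For reference, the metrics mentioned here can be computed from aligned gold and predicted labels roughly as follows. This is a minimal sketch under stated assumptions: the function name, the label strings "negated" and "affirmed", and the one-label-per-mention alignment are all illustrative and are not part of cTAKES.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one target label over aligned label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical polarity labels for five mentions.
gold = ["negated", "affirmed", "negated", "affirmed", "negated"]
pred = ["negated", "negated", "affirmed", "affirmed", "negated"]

p, r, f = precision_recall_f1(gold, pred, positive="negated")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because F1 is the harmonic mean of precision and recall, two systems can trade one for the other and still end up with nearly the same F1, so it helps to report all three numbers.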

Finally, a minor suggestion from the user perspective would be to add the preferred words property to the AggregatePlaintextUMLSProcessor. As I pointed out briefly in my first email, with AggregatePlaintextFastUMLSProcessor we can get the preferred words for detected phrases, but not with AggregatePlaintextUMLSProcessor. This is very helpful when the detected phrases are acronyms, such as pt for patient. In my experience, AggregatePlaintextUMLSProcessor tends to detect more clinically relevant phrases than AggregatePlaintextFastUMLSProcessor. It would be really nice to have the same preferred words property in AggregatePlaintextUMLSProcessor in a future cTAKES release.

Best,
Yiming

On Wed, Oct 19, 2016 at 11:11 AM, Miller, Timothy < Timothy.Miller@childrens.harvard.edu> wrote:

> I can second Sean's thank you; it is good to have this feedback. The 
> ClearTK machine learning models were made the default after we ran 
> some experiments that found they performed better across a range of 
> standard datasets than rule-based algorithms or the existing cTAKES 
> module ( http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774 ).
> Since making them the default, though, we have heard reports from 
> people, and had experience of our own, that conflict with those 
> experiments. And certainly the errors in the rule-based system are 
> easier to understand.
>
> Just curious, are you able to characterize the errors you see from the 
> ClearTK system? I did some experiments recently on a new dataset 
> comparing negex with the cleartk negation module and found that there 
> was a precision/recall tradeoff but almost identical F1 scores. But 
> for that dataset the tradeoff negex provided was preferred by our 
> collaborators. (I think negex had better recall of negated terms but worse precision).
>
> Tim
>
>
>
> ________________________________________
> From: Finan, Sean <Se...@childrens.harvard.edu>
> Sent: Wednesday, October 19, 2016 10:53 AM
> To: dev@ctakes.apache.org
> Subject: RE: Best combination of analysis engines to consider 
> negation, family history, uncertainty, etc.
>
> Hi Yiming,
>
>
>
> Thank you very much for letting the community know what has and has 
> not worked for you.  I have also had better results with the Assertion 
> annotators than the ClearTk alternatives, but that could be because of 
> the note types/formats that I am using.
>
>
>
> Regarding the "Clear" in names, it is because ClearTk (Clear ToolKit) 
> is used to train machine learning models for detection of the 
> indicated property.  You can find information on ClearTk starting here:
> http://clear.colorado.edu/compsem/
>
>
>
> If you prefer to read a paper, you can check out 
> http://www.lrec-conf.org/proceedings/lrec2014/pdf/218_Paper.pdf
>
>
>
> Others on the dev list can provide much more information than I can, so 
> you could post a question there if you like.
>
>
>
> Cheers,
>
> Sean
>
>
>
> -----Original Message-----
>
> From: Zuo Yiming [mailto:yimingzuo@gmail.com]
>
> Sent: Wednesday, October 19, 2016 10:04 AM
>
> To: user@ctakes.apache.org; dev@ctakes.apache.org
>
> Subject: Best combination of analysis engines to consider negation, 
> family history, uncertainty, etc.
>
>
>
> Hi everyone,
>
>
>
> I've spent the last a few months working on a clinical NLP project 
> using cTAKES. It's a very complex system to me and every time I dig 
> into it some new discoveries will come out. Since last week, I tried 
> to figure out which analysis engine can help to do a good job to 
> consider cases like negation, family history, uncertainty, etc. By 
> now, I had some experience and would like to share with the community.
>
>
>
> The best combination for me is to use 
> assertionMiniPipelineAnalysisEngine
>
> for negation, uncertainty, generic and subject detection, and 
> HistoryCleartkAnalysisEngine for history detection. Both engines are 
> in desc/ctakes-assertion folder. The 
> assertionMiniPipelineAnalysisEngine
> also claims to be useful for conditional detection, which I haven't 
> verified using my test files yet.
>
>
>
> I'm using the AggregatePlaintextFastUMLSProcessor on the higher level.
> The default analysis engines in AggregatePlaintextFastUMLSProcessor 
> for negation, uncertainty, generic, etc. are StatusAnnotator + 
> NegationAnnotator + PolarityCleartkAnalysisEngine + 
> SubjectCleartkAnalysisEngine + UncertaintyCleartkAnalysisEngine + 
> GenericCleartkAnalysisEngine + HistoryCleartkAnalysisEngine. It looks 
> like in the node part, StatusAnnotator and NegationAnnotator are 
> commented out, so only the remaining five analysis engines are 
> actually used and all of them are in the same desc/ctakes-assertion 
> folder. These five analysis engines were not effective in my test 
> files and I'm still confused by their relationship to the 
> assertionaAnalysisEngine, conceptConverterAnalysisEngine, 
> GenericAttributeAnalysisEngine and SubjectAttributeAnalysisEngine used in assertionMiniPipelineAnalysisEngine.
>
> It looks to me the Clear in their names indicate something but I 
> couldn't figure it out without going through the java code, which I 
> intend not to do at this level.
>
>
>
> That's pretty much all of it for now. Anyone familiar with this topic 
> are welcome to jump in to provide my insights or correction. 
> Hopefully, we can have a nice discussion that can be useful to other users and developers.
>
>
>
> ps. The reason for using AggregatePlaintextFastUMLSProcessor rather 
> than AggregatePlaintextProcessor is that I find the preferred words 
> property in the former very useful while it can't be detected using the latter.
>
>
>
> Best,
>
> Yiming
>
> --
> Yiming Zuo <https://sites.google.com/site/yimingzuo/>
> Georgetown U. Medical Center:
> Dr. Ressom's Omics Lab <http://omics.georgetown.edu/>
> ECE Department of Virginia Tech:
> Computational Bioinformatics & Bio-imaging Laboratory
> <http://www.cbil.ece.vt.edu/>
>


--
Yiming Zuo <https://sites.google.com/site/yimingzuo/>
Georgetown U. Medical Center:
Dr. Ressom's Omics Lab <http://omics.georgetown.edu/>
ECE Department of Virginia Tech:
Computational Bioinformatics & Bio-imaging Laboratory
<http://www.cbil.ece.vt.edu/>

Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Posted by Zuo Yiming <yi...@gmail.com>.
Hi Sean and Timothy,

Thanks for your clarification about the ClearTK tools. I'm amazed by the
power of cTAKES and by the resources and community you have made the effort
to build. I will certainly be happy to provide more feedback as my project
moves on.

For Timothy,

By rule-based system, do you mean the assertion annotator? What about
the old negation annotator and the status annotator: are they also
rule-based systems? I get the feeling that the assertion annotator and the
ClearTK system are currently favored in cTAKES over the negation annotator
and the status annotator for some reason.

Regarding the ClearTK system on my test files, the negation, history and
uncertainty modules work just as well as the assertion annotator. I have
only a few test files, so it's really hard to tell which one is better. The
main difference comes when detecting the subject and generic properties. On
my limited test files, the ClearTK system doesn't work at all for these. It
assigns patient as the subject for all detected phrases, even when it's the
patient's family member who has diabetes. The generic property has the same
problem: the ClearTK system assigns false as the generic property for all
detected phrases. The paper mentioned by you and Sean seems interesting; I
will take a look later.
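
One way to make this kind of failure visible is to tally the values a
pipeline assigns to an attribute across all detected phrases. The sketch
below is only an illustration: it assumes the mentions have already been
exported from the CAS into plain dictionaries, and the field names
(subject, generic) and sample values are hypothetical, not real cTAKES
output.

```python
from collections import Counter


def attribute_distribution(mentions, attribute):
    """Count how often each value of `attribute` occurs across mentions."""
    return Counter(m[attribute] for m in mentions)


def is_constant(mentions, attribute):
    """Return True if every mention carries the same value for `attribute`."""
    return len(attribute_distribution(mentions, attribute)) <= 1


# Hypothetical exported mentions (illustrative only, not real cTAKES output).
mentions = [
    {"text": "diabetes", "subject": "patient", "generic": False},
    {"text": "hypertension", "subject": "patient", "generic": False},
    {"text": "asthma", "subject": "patient", "generic": False},
]

for attr in ("subject", "generic"):
    counts = dict(attribute_distribution(mentions, attr))
    print(attr, counts, "constant:", is_constant(mentions, attr))
```

If an attribute comes out constant on notes that are known to mention
family members or generic statements, that points to the module not firing
at all rather than to occasional misclassification.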

As a further question, can you suggest where to find public gold-standard
datasets so I can conduct an independent evaluation of cTAKES with metrics
like precision, recall and F1 score?
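
For reference, the metrics themselves are simple to compute once gold and
system labels are lined up mention by mention. The following is a minimal
sketch under that assumption; the function name and the sample labels are
made up for illustration, and aligning system mentions with gold mentions
(the hard part in practice) is not shown.

```python
def precision_recall_f1(gold, predicted, positive=True):
    """Compute precision, recall and F1 for one label value.

    `gold` and `predicted` are equal-length sequences of labels that are
    already aligned mention by mention.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: True means the mention is negated.
gold = [True, True, False, False, True]
predicted = [True, False, False, True, True]

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same function can be run once per attribute (negation, uncertainty,
subject, and so on) by choosing the appropriate positive value.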

Lastly, a minor suggestion from the user perspective would be to add the
preferred words property to the AggregatePlaintextUMLSProcessor. As I
pointed out briefly in my first email, with
AggregatePlaintextFastUMLSProcessor we can get the preferred words for
detected phrases, but not with AggregatePlaintextUMLSProcessor. This is very
helpful when the detected phrases are acronyms such as pt for patient. In
my experience, AggregatePlaintextUMLSProcessor tends to detect more
clinically relevant phrases than AggregatePlaintextFastUMLSProcessor. It
would be really nice to have the same preferred words property in
AggregatePlaintextUMLSProcessor in a future cTAKES release.

Best,
Yiming

On Wed, Oct 19, 2016 at 11:11 AM, Miller, Timothy <
Timothy.Miller@childrens.harvard.edu> wrote:

> I can second Sean's thank you, it is good to have this feedback. The
> ClearTK machine learning models were made the default after we ran some
> experiments that found it performed better across a range of standard
> datasets than rule-based algorithms or the existing cTAKES module (
> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774).
> Since making them the default, though, we have heard from people and had
> our own experience conflict with those experiments. And certainly the
> errors in the rule-based system are easier to understand.
>
> Just curious, are you able to characterize the errors you see from the
> ClearTK system? I did some experiments recently on a new dataset comparing
> negex with the cleartk negation module and found that there was a
> precision/recall tradeoff but almost identical F1 scores. But for that
> dataset the tradeoff negex provided was preferred by our collaborators. (I
> think negex had better recall of negated terms but worse precision).
>
> Tim
>
>
>


-- 
Yiming Zuo <https://sites.google.com/site/yimingzuo/>
Georgetown U. Medical Center:
Dr. Ressom's Omics Lab <http://omics.georgetown.edu/>
ECE Department of Virginia Tech:
Computational Bioinformatics & Bio-imaging Laboratory
<http://www.cbil.ece.vt.edu/>

Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
I can second Sean's thank you; it is good to have this feedback. The ClearTK machine learning models were made the default after we ran some experiments that found they performed better across a range of standard datasets than rule-based algorithms or the existing cTAKES module (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774). Since making them the default, though, we have heard from people whose experience conflicts with those experiments, and so has our own. And certainly the errors in the rule-based system are easier to understand.

Just curious, are you able to characterize the errors you see from the ClearTK system? I did some experiments recently on a new dataset comparing NegEx with the ClearTK negation module and found that there was a precision/recall tradeoff but almost identical F1 scores. For that dataset, though, our collaborators preferred the tradeoff NegEx provided. (I think NegEx had better recall of negated terms but worse precision.)

Tim




RE: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Yiming,

Thank you very much for letting the community know what has and has not worked for you. I have also had better results with the Assertion annotators than with the ClearTK alternatives, but that could be because of the note types/formats that I am using.

Regarding the "Clear" in the names: it is there because ClearTK (Clear ToolKit) is used to train the machine learning models that detect the indicated property. You can find information on ClearTK starting here: http://clear.colorado.edu/compsem/

If you prefer to read a paper, you can check out http://www.lrec-conf.org/proceedings/lrec2014/pdf/218_Paper.pdf

Others on the dev list can provide much more information than I can, so you could post a question there if you like.

Cheers,
Sean
