Posted to dev@ctakes.apache.org by Peter Abramowitsch <pa...@gmail.com> on 2022/01/03 22:03:55 UTC

Performance of the cleartk history module

Hi All

I've noticed that the HistoryCleartkAnalysisEngine misses many common forms
of subject history, including the obvious "h/o" prefix.  Looking into the
distribution, there's a model.jar and what appears to be a weights file
containing trigger words:
resources/org/apache/ctakes/assertion/models/history.txt
where h, o, and / are all given their own weights.  But I'm not sure that
they're actually used in this way: see below.  However, there's also a tiny file:
/org/apache/ctakes/assertion/semantic_classes/history.txt
which does contain a few entries including "h/o".  I assume it is used for
training, but it is never referred to anywhere.

Here's the behavior I'm seeing:
example input             | condition                                            | term found | history feature marked | range text
history of pregnancies    | "history of" included in the cu_term and prefterm    | yes        | no                     | history of pregnancies
history of adenopathy     | "history of" not included in the cu_term or prefterm | yes        | yes                    | adenopathy
H/O postpartum psychosis  | "h/o" not included in the prefterm or cu_term        | yes        | yes                    | postpartum psychosis
H/O: postpartum psychosis | "h/o" not included in the prefterm or cu_term        | yes        | no                     | postpartum psychosis
H/O pregnancies           | "h/o" included in the cu_term                        | yes        | no                     | h/o pregnancies

You can see that it is quite perverse -  there is a pattern suggesting that
if the concept definition occupies the history words, then they cannot be
seen by the history annotation engine.

Has anyone else noticed this - and have they done anything about it?

Peter

Re: Performance of the cleartk history module [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Peter,

Your indexing solution sounds pretty slick.  I am practically salivating at the prospect of testing and using another negation engine!

Tim,

I think that you have done negation testing before (PLOS ONE, '14?).  Do you still have a test configuration (i2b2, sharpn, mipacq) lurking somewhere?

Sean
________________________________________
From: Peter Abramowitsch <pa...@gmail.com>
Sent: Wednesday, January 5, 2022 12:51 AM
To: dev@ctakes.apache.org
Subject: Re: Performance of the cleartk history module [EXTERNAL]

Hi Tim,
The performance boost was the frosting on the cake:  I had to make changes
(at least for our team) because Negex was not working correctly in
sentences with multiple identified annotations only some of which were
meant to be negated.  Negex became over-eager - applying negation when it
shouldn't have.  But even in its original version it was much more
effective than the cleartk polarity module.     Shifting from Polarity to
the original Negex was decidedly slower - you could feel it.

However, you're right it would be good to benchmark it and get some real
numbers.  But as I say, it was the need to fix some of its problems that
was the primary issue.  I suspect that the regex CPU load wasn't a big
issue in the early days of Negex when testing on grammatical biomedical
text and there were only a few negex trigger patterns.   But with 310
potential patterns and extremely dense notes it can make a real
difference.   The compiled regex from each pattern is fairly complex as
well.

I don't like code that does unnecessary work (literally billions of times
in my case) - and in a large suite like cTAKES all the little coding
shortcuts that waste CPU do add up.

I'll do a test and publish the results when I check in the code.

On Tue, Jan 4, 2022 at 8:54 PM Miller, Timothy <
Timothy.Miller@childrens.harvard.edu> wrote:

> Peter,
> That sounds really useful! Were you able to benchmark it for runtime on a
> reasonably sized sample of your notes? Just curious because I wouldn't have
> expected regex to be that much of a bottleneck.
> Tim
>
>
> On Tue, 2022-01-04 at 17:36 -0800, Peter Abramowitsch wrote:
>
> Thank you for the fulsome and humorous response.  Yes, I understand
> perfectly.  We definitely think along the same lines.  One of the drawbacks
> of static, simple-to-understand utility functions like JCasUtil's is
> that one can just slap things together without getting to grips with the
> wastage of resources that sometimes occurs.
>
> This brings me to the topic of Negex.  I've made a lot of improvements to
> it, including after I sent you that version last year.  It has been well tested
> on over 100 million notes, so I think I can check it in.  But back to
> performance - it used to execute 200+ regular expressions multiple times on
> every sentence covering an identified annotation regardless of whether
> there was any hope of any of them matching.  My solution was to build an
> inverted index of the compiled expressions keyed on unique words found in
> the expressions, so based on the sentence, I could look up and execute
> only the expressions that might match.  This might cut the number of regex
> operations down to 5 or 10 and sometimes none at all.  There were many
> other changes that related to negation detection, of course.  For instance,
> handling sentences that switch between negating and non-negating phrases
> within the same sentence.
>
> Peter
>
> On Tue, Jan 4, 2022 at 10:47 AM Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
>
>
> Great question.
>
> The package name "windowed" isn't helpfully self-descriptive.  It contains
> yet another bit of code that I wrote as quickly as possible to help
> somebody in real-time with a problem.
> * There is only a 'procedural' difference between the two.  The models and
> methods are the same.
>
> The assertion engine has a bunch of objects delegating to objects
> delegating to more objects.  Each object calls one or more
> JCasUtil.select() frequently for the same types.  They also redundantly
> call JCasUtil.selectCovered() and selectCovering() for the same types.
>
>
> process( jcas ) {
>   Collection<..> sentences = ...select(..);
>   delegateA.do( sentences );
> }
>
> class DelegateA {
>   void do( Collection<..> sentences ) {
>     for ( Sentence sentence : sentences ) {
>       // One covered-annotation lookup per sentence.
>       Collection<Token> tokens = JCasUtil.selectCovered( jcas, Token.class, sentence );
>       delegateB.use( tokens );
>     }
>   }
> }
>
> class DelegateB {
>   void use( Collection<..> tokens ) {
>     // Looks the covering sentence up again, although the caller already had it.
>     Collection<Sentence> sentence = JCasUtil.selectCovering( jcas, Sentence.class, tokens );
>     ...
>   }
> }
>
>
> The above isn't an exact representation, but you get the point.
> The problem with code like this is repeated traversal of the (object)
> array in the cas.  Every JCasUtil.select* pours through the whole thing.
> For a small document with a small cas (or early in a pipeline), that array
> may be small and the traversal fast.  However, when people are
> (inadvisably) processing a single document whose size is in the gigabyte
> range, repeatedly going through the cas takes a long time.
>
> So, what I did was create a single container object that holds Collections
> of the types of interest and their covering relationships, populate all
> that stuff once per process( jcas ) and pass that container through to each
> delegate object.  Basically, a jcas lite.  The biggest culprit in the
> assertion engines was repeatedly iterating over the array for covered and
> covering windows, hence the subpackage name "windowed".
>
> Is it faster for smaller docs?  Not so much.  Does it instantaneously
> process the Encyclopedia Britannica as one text?  Of course not.  Is it
> orders of magnitude faster on such onerous docs?  In my tests, yes.
>
> Going through my delegating example above, the end delegate is the same.
> Hence the processing is the same and repeatable.  In my tests on both small
> and gargantuan documents the windowed version and the original version
> produced the same output.
>
> Sean
>
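
The "populate once, pass the container along" idea in Sean's message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the windowed package: Span and WindowCache are hypothetical names, and Span is a plain stand-in for a UIMA annotation so that the example runs without cTAKES or uimaFIT on the classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a UIMA annotation: just a character span.
final class Span {
   final int begin;
   final int end;

   Span( final int begin, final int end ) {
      this.begin = begin;
      this.end = end;
   }

   boolean covers( final Span other ) {
      return begin <= other.begin && other.end <= end;
   }
}

// Hypothetical container: built once per document, then handed to every
// delegate so that none of them has to query the CAS again.
final class WindowCache {
   // Keys are compared by identity, which is enough when the same objects are reused.
   private final Map<Span, List<Span>> tokensBySentence = new HashMap<>();
   private final Map<Span, Span> sentenceByToken = new HashMap<>();

   WindowCache( final List<Span> sentences, final List<Span> tokens ) {
      // The covering relationships are computed here, once, rather than in each delegate.
      for ( Span sentence : sentences ) {
         final List<Span> covered = new ArrayList<>();
         for ( Span token : tokens ) {
            if ( sentence.covers( token ) ) {
               covered.add( token );
               sentenceByToken.put( token, sentence );
            }
         }
         tokensBySentence.put( sentence, covered );
      }
   }

   // Replaces a per-delegate "select covered" query with a map lookup.
   List<Span> coveredTokens( final Span sentence ) {
      return tokensBySentence.getOrDefault( sentence, Collections.emptyList() );
   }

   // Replaces a per-delegate "select covering" query with a map lookup.
   Span coveringSentence( final Span token ) {
      return sentenceByToken.get( token );
   }
}
```

A delegate would then take the WindowCache as a parameter and call coveredTokens() or coveringSentence() instead of going back to the CAS.  The nested loop in the constructor is deliberately naive; the point is only that its cost is paid once per document.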
> ________________________________________
> From: Peter Abramowitsch <pabramowitsch@gmail.com>
> Sent: Tuesday, January 4, 2022 11:39 AM
> To: dev@ctakes.apache.org
> Subject: Re: Performance of the cleartk history module [EXTERNAL]
>
> Hi Sean
> Ok..  I was confused whether I was meant to find it in the sources.
> But while you're reading this, is there a brief way to describe the
> difference between the older package
>
> org.apache.ctakes.assertion.medfacts.cleartk
> and
> org.apache.ctakes.assertion.medfacts.cleartk.windowed
>
> Peter
>
> On Tue, Jan 4, 2022 at 7:47 AM Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
>
> Hi Peter,
>
> I created a second engine that just used text matching or regular
> expressions given the discovered events.  It also uses covering section
> types, formatted text and other things, but the text match might be the
> most impactful item.
>
> You are an accomplished developer so the email scratch below is for the
> benefit of others who search archives.
>
>
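> import java.util.Arrays;
> import org.apache.ctakes.typesystem.type.constants.CONST;
> import org.apache.ctakes.typesystem.type.textsem.EventMention;
> import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
> import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
> import org.apache.uima.fit.util.JCasUtil;
> import org.apache.uima.jcas.JCas;
>
> class LazyHistoryFinder extends JCasAnnotator_ImplBase {
>   private static final String[] HISTORY = { "history of", "h/o", "h / o" };
>
>   // True when the event's covered text begins with one of the history prefixes.
>   private boolean isHistory( EventMention event ) {
>     String text = event.getCoveredText().toLowerCase();
>     return Arrays.stream( HISTORY ).anyMatch( text::startsWith );
>   }
>
>   @Override
>   public void process( JCas jcas ) throws AnalysisEngineProcessException {
>     // Mark every event whose span starts with a history prefix.
>     JCasUtil.select( jcas, EventMention.class )
>             .stream()
>             .filter( this::isHistory )
>             .forEach( e -> e.setHistoryOf( CONST.NE_HISTORY_OF_PRESENT ) );
>   }
> }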
>
>
> It requires a stroll through the monstrous cas array and it certainly
> isn't sexy, but it gets the job done.
>
> Sean
>
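
Sean mentions "text matching or regular expressions"; the regular-expression variant of the prefix test might look like the sketch below.  It is an assumption-laden illustration rather than anything from cTAKES: HistoryPrefix and startsWithHistory are hypothetical names, and the pattern is just one way to tolerate the spacing and trailing-colon variants ("h / o", "H/O:") that show up in the first message of this thread.  Like the startsWith version, it can only help when the history words fall inside the span text being tested.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is an illustrative guess at useful variants.
final class HistoryPrefix {
   // Accepts "history of" and "h/o" with optional whitespace around the slash,
   // case-insensitively, at the start of the text.  The trailing \b keeps
   // "h/other" from matching and still allows a following colon or space.
   private static final Pattern PREFIX = Pattern.compile(
         "^\\s*(?:history\\s+of|h\\s*/\\s*o)\\b", Pattern.CASE_INSENSITIVE );

   static boolean startsWithHistory( final String coveredText ) {
      return coveredText != null && PREFIX.matcher( coveredText ).find();
   }
}
```

In the LazyHistoryFinder scratch above, isHistory() could delegate to such a method in place of the startsWith check.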
> ________________________________________
> From: Peter Abramowitsch <pabramowitsch@gmail.com>
> Sent: Monday, January 3, 2022 10:23 PM
> To: dev@ctakes.apache.org
> Subject: Re: Performance of the cleartk history module [EXTERNAL]
>
> Thanks Sean
>
> By "following engine", do you mean a second instance of the history engine
> that uses only the event spans, or did you modify the current one to traverse
> the event-span within the context window?  I see you made some source
> changes in that area and will check tomorrow.
>
> Peter
>
> On Mon, Jan 3, 2022 at 2:26 PM Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
>
> Hi Peter,
>
> I have noticed this and just added a following engine that recognized text
> within event spans.  It is a lazy solution, but it fit my needs and
> available time.
>
> Sean

Re: Performance of the cleartk history module [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.
Hi Tim,
The performance boost was the frosting on the cake:  I had to make changes
(at least for our team) because Negex was not working correctly in
sentences with multiple identified annotations only some of which were
meant to be negated.  Negex became over-eager - applying negation when it
shouldn't have.  But even in its original version it was much more
effective than the cleartk polarity module.     Shifting from Polarity to
the original Negex was decidedly slower - you could feel it.

However, you're right it would be good to benchmark it and get some real
numbers.  But as I say, it was the need to fix some of its problems that
was the primary issue.  I suspect that the regex CPU load wasn't a big
issue in the early days of Negex when testing on grammatical biomedical
text and there were only a few negex trigger patterns.   But with 310
potential patterns and extremely dense notes it can make a real
difference.   The compiled regex from each pattern is fairly complex as
well.

I don't like code that does unnecessary work (literally billions of times
in my case) - and in a large suite like cTAKES all the little coding
shortcuts that waste CPU do add up.

I'll do a test and publish the results when I check in the code.
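
For readers of the archive, here is a minimal sketch of the inverted-index idea described in this thread: key each compiled expression on a word it cannot match without, then run only the expressions whose key word occurs in the sentence.  It is not the Negex code under discussion.  TriggerIndex and its methods are hypothetical names, the triggers are assumed to be plain alphabetic phrases, and choosing the longest word as the key is an arbitrary stand-in for "a reasonably rare word".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration; assumes trigger phrases made of alphabetic words.
final class TriggerIndex {
   // One literal word of each trigger phrase -> the compiled patterns that require it.
   private final Map<String, List<Pattern>> patternsByWord = new HashMap<>();

   TriggerIndex( final List<String> triggerPhrases ) {
      for ( String phrase : triggerPhrases ) {
         final String[] words = phrase.trim().toLowerCase( Locale.ROOT ).split( "\\s+" );
         // Compile the phrase with flexible whitespace between its words.
         final StringBuilder regex = new StringBuilder( "\\b" );
         for ( int i = 0; i < words.length; i++ ) {
            if ( i > 0 ) {
               regex.append( "\\s+" );
            }
            regex.append( Pattern.quote( words[ i ] ) );
         }
         regex.append( "\\b" );
         final Pattern pattern = Pattern.compile( regex.toString(), Pattern.CASE_INSENSITIVE );
         // A match needs every word of the phrase, so indexing under one word is enough.
         patternsByWord.computeIfAbsent( longest( words ), k -> new ArrayList<>() ).add( pattern );
      }
   }

   private static String longest( final String[] words ) {
      String best = words[ 0 ];
      for ( String word : words ) {
         if ( word.length() > best.length() ) {
            best = word;
         }
      }
      return best;
   }

   // Only the patterns whose key word occurs in the sentence are returned.
   List<Pattern> candidates( final String sentence ) {
      final Set<Pattern> found = new LinkedHashSet<>();
      for ( String token : sentence.toLowerCase( Locale.ROOT ).split( "\\W+" ) ) {
         final List<Pattern> patterns = patternsByWord.get( token );
         if ( patterns != null ) {
            found.addAll( patterns );
         }
      }
      return new ArrayList<>( found );
   }

   // Runs the (usually few) candidate patterns instead of the whole trigger list.
   boolean anyMatch( final String sentence ) {
      for ( Pattern pattern : candidates( sentence ) ) {
         if ( pattern.matcher( sentence ).find() ) {
            return true;
         }
      }
      return false;
   }
}
```

With the triggers "no evidence of", "denies", "without" and "ruled out", a sentence such as "Patient reports chest pain." yields no candidates at all, so no regex is executed for it.  Whether this pays off in practice depends on the trigger list and the notes, which is what the benchmark mentioned above would show.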


Re: Performance of the cleartk history module [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
Peter,
That sounds really useful! Were you able to benchmark it for runtime on a reasonably sized sample of your notes? Just curious because I wouldn't have expected regex to be that much of a bottleneck.
Tim


Re: Performance of the cleartk history module [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.
Thank you for the fulsome and humorous response.  Yes, I understand
perfectly.  We definitely think along the same lines.  One of the drawbacks
of static, easy-to-understand utility functions like JCasUtil's is
that one can just slap things together without getting to grips with the
wastage of resources that sometimes occurs.

This brings me to the topic of Negex.  I've made a lot of improvements to
it, including after I sent you that version last year.  It has been well tested
on over 100 million notes, so I think I can check it in.  But back to
performance - it used to execute 200+ regular expressions multiple times on
every sentence covering an identified annotation, regardless of whether
there was any hope of any of them matching.  My solution was to build an
inverted index of the compiled expressions keyed on unique words found in
the expressions, so that, based on the sentence, I could look up and execute
only the expressions that might match.  This might cut the number of regex
operations down to 5 or 10, and sometimes to none at all.  There were many
other changes that related to negation detection, of course.  For instance,
handling sentences that switch between negating and non-negating phrases.
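
For readers of the archive, the inverted-index idea above can be sketched roughly as follows.  This is a minimal illustration and not the actual Negex code: the class name RegexWordIndex, its methods, and the choice to have the caller supply the trigger word for each expression are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch: names are illustrative and not taken from the Negex sources.
final class RegexWordIndex {
    private static final Pattern NON_WORD = Pattern.compile("[^a-z]+");
    // Maps a lowercase trigger word to the compiled patterns registered under it.
    private final Map<String, List<Pattern>> byWord = new HashMap<>();

    // Register a pattern under a literal word that any match must contain (assumed to be known by the caller).
    void add(String triggerWord, String regex) {
        byWord.computeIfAbsent(triggerWord.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
              .add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    // Collect only the patterns whose trigger word occurs in the sentence.
    Set<Pattern> candidates(String sentence) {
        Set<Pattern> found = new LinkedHashSet<>();
        for (String word : NON_WORD.split(sentence.toLowerCase(Locale.ROOT))) {
            List<Pattern> patterns = byWord.get(word);
            if (patterns != null) {
                found.addAll(patterns);
            }
        }
        return found;
    }

    // Run just the candidate patterns instead of every registered expression.
    boolean anyMatch(String sentence) {
        for (Pattern pattern : candidates(sentence)) {
            if (pattern.matcher(sentence).find()) {
                return true;
            }
        }
        return false;
    }
}
```

A sentence that contains none of the trigger words yields an empty candidate set, so no regular expression is executed for it at all.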

Peter


Re: Performance of the cleartk history module [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Great question.

The package name "windowed" isn't helpfully self-descriptive.  It contains yet another bit of code that I wrote as quickly as possible to help somebody in real-time with a problem.
* There is only a 'procedural' difference between the two.  The models and methods are the same.

The assertion engine has a bunch of objects delegating to objects delegating to more objects.  Each object calls JCasUtil.select() one or more times, frequently for the same types.  They also redundantly call JCasUtil.selectCovered() and selectCovering() for the same types.

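// Illustrative only: Sentence and Token stand for whatever annotation types the engine uses.
public void process( JCas jcas ) {
  Collection<Sentence> sentences = JCasUtil.select( jcas, Sentence.class );
  delegateA.handle( jcas, sentences );
}

class DelegateA {
  void handle( JCas jcas, Collection<Sentence> sentences ) {
    for ( Sentence sentence : sentences ) {
      // A fresh JCasUtil lookup for every sentence.
      Collection<Token> tokens = JCasUtil.selectCovered( jcas, Token.class, sentence );
      delegateB.use( jcas, tokens );
    }
  }
}

class DelegateB {
  void use( JCas jcas, Collection<Token> tokens ) {
    for ( Token token : tokens ) {
      // Looks up again the sentence that the caller already had in hand.
      Collection<Sentence> covering = JCasUtil.selectCovering( jcas, Sentence.class, token );
      ...
    }
  }
}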

The above isn't an exact representation, but you get the point.
The problem with code like this is repeated traversal of the (object) array in the cas.  Every JCasUtil.select* pours through the whole thing.  For a small document with a small cas (or early in a pipeline), that array may be small and the traversal fast.  However, when people are (inadvisably) processing a single document in the gigabyte size range, repeatedly going through the cas takes a long time.

So, what I did was create a single container object that holds Collections of the types of interest and their covering relationships, populate all that stuff once per process( jcas ) and pass that container through to each delegate object.  Basically, a jcas lite.  The biggest culprit in the assertion engines was repeatedly iterating over the array for covered and covering windows, hence the subpackage name "windowed".
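
For archive readers, here is a minimal sketch of the "populate once, pass the container along" idea.  It is not the code in the windowed package: Span and WindowCache are hypothetical stand-ins written in plain Java so the example runs without UIMA, and the one-time population is a simple nested loop for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an annotation: just a labelled character span.
final class Span {
    final String label;
    final int begin;
    final int end;

    Span(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    boolean covers(Span other) {
        return begin <= other.begin && other.end <= end;
    }
}

// Hypothetical "jcas lite": built once per document, then handed to every delegate.
final class WindowCache {
    private final Map<Span, List<Span>> tokensBySentence = new HashMap<>();
    private final Map<Span, Span> sentenceByToken = new HashMap<>();

    // The covered and covering relationships are computed a single time here.
    WindowCache(List<Span> sentences, List<Span> tokens) {
        for (Span sentence : sentences) {
            List<Span> covered = new ArrayList<>();
            for (Span token : tokens) {
                if (sentence.covers(token)) {
                    covered.add(token);
                    sentenceByToken.put(token, sentence);
                }
            }
            tokensBySentence.put(sentence, covered);
        }
    }

    // Delegates call these map lookups instead of searching the document again.
    List<Span> tokensIn(Span sentence) {
        return tokensBySentence.getOrDefault(sentence, Collections.emptyList());
    }

    Span sentenceOf(Span token) {
        return sentenceByToken.get(token);
    }
}
```

Each delegate then receives the same WindowCache and asks it for covered tokens or the covering sentence, so the per-delegate work is a map lookup rather than another pass over the document.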

Is it faster for smaller docs?  Not so much.  Does it instantaneously process the Encyclopedia Britannica as one text?  Of course not.  Is it orders of magnitude faster on such onerous docs?  In my tests, yes.

Going through my delegating example above, the end delegate is the same.  Hence the processing is the same and repeatable.  In my tests on both small and gargantuan documents the windowed version and the original version produced the same output.

Sean


   



Re: Performance of the cleartk history module [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.
Hi Sean
OK.  I was confused about whether I was meant to find it in the sources.
But while you're reading this, is there a brief way to describe the
difference between the older package

org.apache.ctakes.assertion.medfacts.cleartk;
and
org.apache.ctakes.assertion.medfacts.cleartk.windowed

Peter






Re: Performance of the cleartk history module [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Peter,

I created a second engine that just used text matching or regular expressions given the discovered events.  It also uses covering section types, formatted text and other things, but the text match might be the most impactful item.

You are an accomplished developer so the email scratch below is for the benefit of others who search archives. 

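import java.util.Arrays;
import org.apache.ctakes.typesystem.type.constants.CONST;
import org.apache.ctakes.typesystem.type.textsem.EventMention;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

public class LazyHistoryFinder extends JCasAnnotator_ImplBase {
  // Lowercase prefixes that mark an event span as patient history.
  private static final String[] HISTORY = { "history of", "h/o", "h / o" };

  // True if the event's covered text starts with one of the history prefixes.
  private boolean isHistory( EventMention event ) {
    final String text = event.getCoveredText().toLowerCase();
    return Arrays.stream( HISTORY ).anyMatch( text::startsWith );
  }

  @Override
  public void process( JCas jcas ) throws AnalysisEngineProcessException {
    JCasUtil.select( jcas, EventMention.class )
            .stream()
            .filter( this::isHistory )
            .forEach( e -> e.setHistoryOf( CONST.NE_HISTORY_OF_PRESENT ) );
  }
}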

It requires a stroll through the monstrous cas array and it certainly isn't sexy, but it gets the job done.  

Sean



Re: Performance of the cleartk history module [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.
Thanks Sean

By "following engine", do you mean a second instance of the history engine
that uses only the event spans, or did you modify the current one to traverse
the event span within the context window?  I see you made some source
changes in that area and will check tomorrow.

Peter


Re: Performance of the cleartk history module [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Peter,

I have noticed this and just added a following engine that recognized text within event spans.  It is a lazy solution, but it fit my needs and available time.

Sean