Posted to dev@ctakes.apache.org by "Miller, Timothy" <Ti...@childrens.harvard.edu.INVALID> on 2022/10/21 18:02:49 UTC

Best practices for documenting NLP versions

We’ve recently been using cTAKES for some internal projects where we make modifications, often through the REST server, combined with an open-source Python client that makes the REST server's output easy to post-process:
https://github.com/Machine-Learning-for-Medical-Language/ctakes-client-py
It was written by my colleagues Andy McMurry and Mike Terry and is pip installable. The output is then either converted to FHIR or written to whatever format is convenient.

But for a given run on a given project, it’s useful to know which NLP configuration produced the output. Obviously there are version numbers, but cTAKES is highly configurable, our post-processing libraries have their own versions, and we may run trunk or an earlier commit rather than a release, so things get complicated quickly. Does anyone have an existing solution they are willing to share, or any thoughts on this topic? The question goes slightly beyond cTAKES, but cTAKES accounts for much of the complexity because it is the most configurable component.

Thanks
Tim
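
As an illustration of the question above, here is a minimal Python sketch of a per-run provenance record that gathers the pieces Tim lists: a cTAKES version, a source commit, the pipeline definition, and the versions of the Python post-processing packages. Every field name, and the idea of bundling them this way, is an assumption made for illustration; nothing here is an existing cTAKES or ctakes-client-py API.

```python
import json
from dataclasses import dataclass, asdict, field
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a Python distribution, or "unknown"."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "unknown"


@dataclass
class RunProvenance:
    # All field names are hypothetical, chosen only for this sketch.
    ctakes_version: str   # release number or snapshot identifier
    ctakes_commit: str    # source revision when not running a release
    piper_file: str       # name of the pipeline definition that was used
    python_packages: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize with sorted keys so identical runs compare equal."""
        return json.dumps(asdict(self), sort_keys=True)


def build_provenance(ctakes_version, ctakes_commit, piper_file, packages):
    """Look up each listed Python package and assemble one record."""
    versions = {name: installed_version(name) for name in packages}
    return RunProvenance(ctakes_version, ctakes_commit, piper_file, versions)
```

A record like this could be written once per run next to the output, so that results can later be grouped by the configuration that produced them.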

Re: Best practices for documenting NLP versions [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu.INVALID>.

Hi all,

This versioning topic has come up at least once before, so I thought that I'd give it a shot before it fell off my radar.

I checked in some new stuff and rebuilt the snapshot, so you should be able to use one solution as of now.

I couldn't find a good way to do versioning at compile time alone.  Class file dates aren't reliable, and I tried four different purported Maven solutions for writing properties files, but none of them worked.

What did work was placing build information in the manifest files of the ctakes jar files during the package phase.  This means that if you are running in an IDE or directly from compiled classes, the version will not be available.  If you run from a jar file (built via Maven), then each ctakes jar file will contain the following:

Manifest-Version: 1.0
Implementation-Title: Apache cTAKES core
Implementation-Version: 4.0.1-SNAPSHOT
Specification-Vendor: The Apache Software Foundation
Specification-Title: Apache cTAKES core
Build-Jdk-Spec: 1.8
Created-By: Maven JAR Plugin 3.3.0
Specification-Version: 4.0
Implementation-Vendor: The Apache Software Foundation
Implementation-Build-Date: 2022-10-26 17:02

Above is the manifest of ctakes-core-4.0.1-20221026.172844-167.jar, the current snapshot build at https://repository.apache.org/content/repositories/snapshots/org/apache/ctakes/ctakes-core/4.0.1-SNAPSHOT/
which is what you get if you use ctakes as a dependency in your own project.

If you run the Maven package phase locally, everything will be the same except for the Implementation-Build-Date at the bottom of the list.

You can get to this information manually or programmatically.  I added a static public method named getBuildInfo() to the FinishedLogger (ctakes-core util.log.FinishedLogger.java) that returns a jar build version and build date.  If you are running outside of a jar then it returns an empty string.
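
For anyone post-processing on the Python side, the same manifest attributes can also be read straight out of the jar. The sketch below is not the getBuildInfo() implementation; it is a minimal, assumed client-side equivalent that relies only on the standard manifest layout (one "Name: value" pair per line, with continuation lines starting with a single space). The function names are hypothetical.

```python
import zipfile


def parse_manifest(text: str) -> dict:
    """Parse the main section of a JAR manifest into a dict of attributes."""
    attributes = {}
    last_key = None
    for line in text.splitlines():
        if line == "":
            break  # a blank line ends the main section
        if line.startswith(" ") and last_key is not None:
            # Continuation line: long values wrap onto lines with a leading space.
            attributes[last_key] += line[1:]
            continue
        key, _, value = line.partition(":")
        last_key = key.strip()
        attributes[last_key] = value.lstrip(" ")
    return attributes


def read_build_info(jar_path: str) -> dict:
    """Return the version and build date recorded in a jar's manifest.

    Missing attributes come back as empty strings. A jar without a
    manifest raises KeyError from ZipFile.read.
    """
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
    manifest = parse_manifest(text)
    return {
        "version": manifest.get("Implementation-Version", ""),
        "build_date": manifest.get("Implementation-Build-Date", ""),
    }
```

Pointing read_build_info at a Maven-built ctakes-core jar should yield the Implementation-Version and Implementation-Build-Date values shown in the manifest above.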

In case you have no idea what FinishedLogger is, it prints stats on time (start, end, init, process, per note), and now it also prints build information.  To use it, put this line in your piper: "add util.log.FinishedLogger"

I also added the build information to the ctakes banner at the "welcome to ctakes" step.  In case you have no idea what the banners are about, add "set WriteBanner=yes" to any piper that uses a collection reader that extends the AbstractFileTreeReader.  If you don't know what collection reader you are using, then you are probably using FileTreeReader, which does extend it.

I hope that this is useful to somebody.

Sean


Re: Best practices for documenting NLP versions

Posted by Greg Silverman <gm...@umn.edu.INVALID>.

It was an off-the-cuff suggestion. Devil is obviously in the details.

-- 
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
gms@umn.edu

Re: Best practices for documenting NLP versions

Posted by Peter Abramowitsch <pa...@gmail.com>.

Interesting, but it would depend on how the Docker image is set up.  Ours,
for instance, encapsulates all the code and imported jars, as you imply,
but the piper and other runtime configuration such as section regex, negex,
bsvs, etc. are imported from a mounted FS at container runtime.
Freezing them into the Docker images would proliferate vast numbers
of image tars with 99% redundant data.  Or do you have a cleverer
solution?

Peter
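
One possible middle ground, offered here only as an assumption and not something proposed in the thread, is to keep the configuration on the mounted file system and record a content hash of that directory alongside the image tag. A minimal Python sketch (the function name and the choice of SHA-256 are illustrative):

```python
import hashlib
from pathlib import Path


def config_fingerprint(config_dir: str) -> str:
    """Return a SHA-256 hex digest covering every file under config_dir.

    Files are visited in sorted order, and each relative path is hashed
    together with the file contents, so renames and edits both change
    the result.
    """
    root = Path(config_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        digest.update(relative.encode("utf-8"))
        digest.update(b"\0")  # separator between path and contents
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

The image tag would then identify the code and jars, while the fingerprint identifies the piper and other runtime files, without baking them into additional images.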

Re: Best practices for documenting NLP versions

Posted by Greg Silverman <gm...@umn.edu.INVALID>.

Why not use Docker and versioning by tags? See C. Boettiger, "An
introduction to Docker for reproducible research," SIGOPS Oper. Syst. Rev. 49
(2015) 71–79. doi:10.1145/2723872.2723882.




Re: Best practices for documenting NLP versions

Posted by Peter Abramowitsch <pa...@gmail.com>.

Well, obviously, covering the full range of permutations of all source files,
all annotators, and pre- and post-ctakes code would require a huge amount of
commit information on thousands of files, and not only ctakes files.
Recently I made some pretty significant changes to the ZonerCli library,
which is only a dependency of the ctakes distribution. How would all that
commit info be used to tag the end results?  I think the answer is that
it's simply not feasible or useful, so we haven't gone to those lengths.

The furthest we go at the UCs is to version the piper file and then write
the versioned_name of the piper back into the JSON object returned for each
note. We have our own REST service and our own Java and Python clients, but
they don't touch the internals of the message in a way that interferes with
the clinical informatics.  The note concept collection object, with its
piper version, is then persisted in our data store.

The server jar also has a version, which is written to a log and updated
whenever any significant framework changes are implemented, but the server
version is not written into the data store.

Not sure if any of this was helpful
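
The piper-tagging approach described above can be sketched in a few lines of Python. This is not the UCs' actual code: the key name "piper_version", the "name_version" format, and the function name are all assumptions made for illustration.

```python
import copy


def tag_with_piper(note_result: dict, piper_name: str, piper_version: str) -> dict:
    """Return a copy of one note's JSON result with the piper version attached.

    Only a single top-level key is added; the annotations inside the
    result are left untouched.
    """
    tagged = copy.deepcopy(note_result)
    # Hypothetical key and format, standing in for the "versioned_name".
    tagged["piper_version"] = f"{piper_name}_{piper_version}"
    return tagged
```

Adding one top-level key and leaving the rest alone mirrors the point that the clients do not touch the internals of the message.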
