Posted to user@ctakes.apache.org by "Baas,Leah" <Le...@SanfordHealth.org> on 2019/01/29 17:58:48 UTC

Processing large batches of files in cTAKES

Hi all,

I would like to process a batch of 13,414 files (avg file size 6.2 KB) using the default clinical pipeline. I am new to cTAKES and computer programming, and I’m looking for guidance on how to process these files with maximum time/CPU efficiency. I am currently running my program on an Ubuntu VM with 3 CPUs. It takes me 28 seconds (real time) to process one 6.0 KB file. I’m reading up on parallel processing strategies, but would be grateful for any suggestions, tips, etc. that you might have!

Thanks,

Leah


-----------------------------------------------------------------------
Confidentiality Notice: This e-mail message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
privileged and confidential information.  Any unauthorized review, use,
disclosure or distribution is prohibited.  If you are not the intended
recipient, please contact the sender by reply e-mail and destroy
all copies of the original message.

Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Baas,Leah" <Le...@SanfordHealth.org>.

Fantastic. Thank you all for your help!

Leah

From: Greg Silverman <gm...@umn.edu>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Tuesday, January 29, 2019 at 3:37 PM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Yes, one batch! We had no problems with 10 K plain text files.

On Tue, Jan 29, 2019 at 3:33 PM Baas,Leah <Le...@sanfordhealth.org> wrote:
Ah, I see. Yes—I will change the pre-processing step to write plaintext instead of xml files. Thank you so much for the tip!

Once I’ve fixed the pre-processing code, do you anticipate that I should be able to process all of the input files in one batch?

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Tuesday, January 29, 2019 at 3:23 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

OK, if you can see XML tags in the right pane, that means cTAKES is trying to process the XML markup as well as the text. Can you change your Python pre-processing step to write plain text files containing only the text of the note, with no XML, and then process those? I think there are probably cases where having XML in the text would confuse some of the modules and cause them to run slowly. You will also get weird outputs: I've seen "<span>" get annotated as a "body measurement finding" when we accidentally processed some HTML once.
Tim
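
A minimal sketch of the pre-processing change suggested above, assuming each XML file holds one note and that everything worth keeping is in the text nodes of the tree. The directory names and both helper functions are hypothetical; the actual element layout produced by the ElementTree script is not shown in this thread, so the extraction step may need to target specific elements instead.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_plaintext(xml_path: Path) -> str:
    """Return only the text content of an XML file, with all markup dropped."""
    root = ET.parse(xml_path).getroot()
    # itertext() walks the tree and yields text nodes only, never tags.
    pieces = (piece.strip() for piece in root.itertext())
    return "\n".join(piece for piece in pieces if piece)


def convert_directory(xml_dir: Path, txt_dir: Path) -> int:
    """Write one .txt file per .xml file; return the number of files written."""
    txt_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for xml_path in sorted(xml_dir.glob("*.xml")):
        out_path = txt_dir / (xml_path.stem + ".txt")
        out_path.write_text(xml_to_plaintext(xml_path), encoding="utf-8")
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical directory names; point these at your own folders.
    if Path("notes_xml").is_dir():
        written = convert_directory(Path("notes_xml"), Path("notes_txt"))
        print(written, "files written")
```

The resulting directory of .txt files is what the pipeline script would then be pointed at as its input directory.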

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu>, user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 21:15:54 +0000

Yes, I’ve been following those instructions to view the .xmi files in the CVD. The right pane shows the text of the XML file.

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 3:00 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

So after you process all the notes do you follow the instructions on the wiki page that say:
You can view information in the XMI files using the UIMA Cas Visual Debugger (CVD).

Execute bin/runctakesCVD
Select File > Read Type System File
Select TypeSystem.xml in resources/org/apache/ctakes/typesystem/types/
Select File > Read XMI CAS File
Select any .xmi file in your outputDirectory

and look at that .xmi file? If so, what do you see in the right pane? The text of the note or the text of an xml file?
Tim
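
The same check can be roughed out without opening the CVD. This is only a sketch: it assumes the XMI output stores the document text in a `sofaString` attribute, as UIMA's XMI serializer normally does, and the tag pattern is a simple heuristic that can misfire on text such as "a<b and c>d". The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic: something that looks like an opening or closing tag.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")


def document_text(xmi_path):
    """Return the document text stored in an XMI file, or None if not found."""
    for element in ET.parse(xmi_path).iter():
        # Assumption: the text lives in a "sofaString" attribute.
        text = element.get("sofaString")
        if text is not None:
            return text
    return None


def looks_like_markup(text):
    """True if the text appears to contain XML or HTML tags."""
    return TAG_PATTERN.search(text) is not None
```

If `looks_like_markup(document_text(path))` is true for an output file, the pipeline most likely received markup rather than the bare note text.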

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu>, user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:45:58 +0000

It is not CDA format. I used Python’s ElementTree module to generate XML files containing the clinical notes for each subject in my dataset. When I run the Default Clinical Pipeline, I can successfully generate XMI output files for each XML file in my input directory. The following WARNING message appears multiple times over the course of the processing (not sure if this is at all related to the issue at hand):

Jan 29, 2019 2:02:56 PM org.apache.uima.util.MessageReport decreasingWithTrace(51)
WARNING: Message count: 1; Feature org.apache.ctakes.typesystem.type.textsem.Predicate:relations is marked multipleReferencesAllowed=false, but it has multiple references.  These will be serialized in duplicate. Message count indicates messages skipped to avoid potential flooding. Set FINE logging level for stacktrace.

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 2:28 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Well, if you're processing XML files, that will likely cause a problem with this script; it's expecting plain text files in a directory. Maybe Sean can chime in on whether it's possible to use an XML collection reader with the runClinicalPipeline.sh script? Is it CDA format?
Tim

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu>, user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:21:17 +0000

Hi Tim,

Thanks again for working through this with me. I hadn’t read through the time stamps carefully enough to notice the one-time cost of startup.

I did replicate your setup by copying/pasting 7 of my XML input files into an empty directory. Here’s what I saw:

  1.  For the startup: 20 seconds between the first time-stamped log message:

29 Jan 2019 14:02:35  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

      and the first log message doing processing:

29 Jan 2019 14:02:55  INFO SentenceDetector - Starting processing.

  2.  Once started up, 12 seconds to process the notes:

29 Jan 2019 14:03:07  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

Does this help narrow things down?

Leah
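
For a rough sense of scale, the figures in this message can be extrapolated to the full batch. This is a back-of-the-envelope sketch, not a benchmark: it assumes the startup cost is paid once and that the remaining 13,414 files process at the same average rate as the 7 sample files, which may not hold. The function name is hypothetical.

```python
def estimate_batch_seconds(n_files, startup_s, sample_s, sample_files):
    """Extrapolate total run time from one small timed sample."""
    per_file = sample_s / sample_files  # average seconds per note
    return startup_s + n_files * per_file


# Figures from this thread: 20 s startup, 12 s for 7 notes, 13,414 notes.
single_run = estimate_batch_seconds(13414, 20, 12, 7)
print(round(single_run / 3600, 1), "hours in one run")

# For comparison: 28 s per file if startup were paid again for every file.
print(round(13414 * 28 / 3600, 1), "hours at 28 s per file")
```

Under these assumptions a single run comes to roughly 6.4 hours, against about 104 hours at 28 seconds per file, so most of the original per-file time would be startup rather than processing.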

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 1:58 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I haven't used that script myself, but I just tried it now on some notes from mtsamples. Maybe you can try to replicate that setup? I just copy/pasted the 7 allergy/immunology notes [1] into 7 text files in an empty directory. Here's what I see:

1) It is pretty slow to start up -- but this is a one-time cost (~50 seconds). I'm looking at the time between the very first time-stamped log message:
29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

and the first log message doing processing:

29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing

2) Once started up, it processes the notes in about 14s. This is actually slower than expected, but it is a lot faster than you were seeing. I'm looking at the time between the start of processing just above and the last log message before it quits:

29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

If you can replicate this input/output setup and approximate timing in your VM first, then we can see whether it's a function of your notes or your setup.

Tim
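
The two intervals described above can be read off the log lines with a few lines of Python. A small sketch, assuming every log line starts with a timestamp in the "29 Jan 2019 14:51:51" form shown in this message and that month names are in English; the helper names are hypothetical.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%d %b %Y %H:%M:%S"


def log_time(line):
    """Parse the leading timestamp (day, month, year, time) of a log line."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def seconds_between(first_line, second_line):
    """Elapsed seconds from one log line to a later one."""
    return (log_time(second_line) - log_time(first_line)).total_seconds()


model_loaded = ("29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector "
                "model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip")
started = "29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing"
finished = "29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing"

print("startup:", seconds_between(model_loaded, started))    # 49.0
print("processing:", seconds_between(started, finished))     # 14.0
```

Separating the two numbers this way shows how much of a run is one-time startup and how much is per-note work.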

[1] https://www.mtsamples.com/site/pages/browse.asp?type=3-Allergy%20/%20Immunology

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: user@ctakes.apache.org, Timothy.Miller@childrens.harvard.edu
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 19:33:34 +0000

Hi again Tim,

I am trying to check which version of the dictionary I am using when running the Default Clinical Pipeline. I have been running the pipeline according to the instructions detailed here <https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline>. However, I haven’t been able to find documentation specifying which dictionary version is built into this pipeline. There must be a simple way to check; I am just ignorant. Could you enlighten me?

Thanks,

Leah

From: "Baas,Leah" <Le...@SanfordHealth.org>
Date: Tuesday, January 29, 2019 at 12:23 PM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Tim,

Thanks for your quick response! Probably unsurprisingly, I’ll have to do some googling to learn how to check those things. If you could point me in the right direction, that’d be great!

Thanks again,

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Tuesday, January 29, 2019 at 12:14 PM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I am able to process that number of files in a reasonable amount of time (maybe an hour) on an average desktop. Luckily, debugging your setup should be much easier than doing a scaleout. A few possibilities:

* You are running the old (slow) dictionary instead of the new fast one
* Your document has extremely long sentences
* Your VM is _extremely_ resource constrained and is thrashing constantly

Do you know how to check these things?
Tim
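
One of the possibilities above, extremely long sentences, can be screened for with a crude script. This is a sketch under stated assumptions: it splits on punctuation and blank lines rather than using the cTAKES sentence detector, so it is only a proxy, and the 1000-character threshold and function names are arbitrary, hypothetical choices.

```python
import re
from pathlib import Path

# Rough sentence boundary: terminal punctuation plus whitespace, or a blank line.
SENTENCE_END = re.compile(r"[.!?]\s+|\n\s*\n")


def longest_sentence_chars(text):
    """Length of the longest stretch of text between rough sentence boundaries."""
    return max(len(chunk) for chunk in SENTENCE_END.split(text))


def flag_long_sentences(txt_dir, threshold=1000):
    """List (file name, longest stretch) for files exceeding the threshold."""
    flagged = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        longest = longest_sentence_chars(text)
        if longest > threshold:
            flagged.append((path.name, longest))
    return flagged
```

Files that show up in the flagged list are reasonable candidates to time on their own before running the whole batch.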

--
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Cardiovascular Informatics <http://www.med.umn.edu/cardiology/>
University of Minnesota
gms@umn.edu

 ›  evaluate-it.org  ‹

Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by Greg Silverman <gm...@umn.edu>.

Yes, one batch! We had no problems with 10 K plain text files.


-- 
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Cardiovascular Informatics <http://www.med.umn.edu/cardiology/>
University of Minnesota
gms@umn.edu

 ›  evaluate-it.org  ‹

Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Baas,Leah" <Le...@SanfordHealth.org>.

Ah, I see. Yes—I will change the pre-processing step to write plaintext instead of xml files. Thank you so much for the tip!

Once I’ve fixed the pre-processing code, do you anticipate that I should be able to process all of the input files in one batch?

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Tuesday, January 29, 2019 at 3:23 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

OK, if you can see xml tags in the right pane, that means that ctakes is trying to process the xml markup as well as the text. Can you change your python pre-process to just write plaintext files with only the text from the note, and not xml? And then process that? I think there are probably cases where having xml in the text would confuse some of the  modules and cause them to run slowly. You also will get weird outputs, I've seen "<span>" get annotated as a "body measurement finding" when we accidentally processed some html once.
Tim

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org<mailto:%22Baas,Leah%22%20%3cLeah.Baas@SanfordHealth.org%3e>>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu<mailto:%22Miller,%20Timothy%22%20%3cTimothy.Miller@childrens.harvard.edu%3e>>, user@ctakes.apache.org <user@ctakes.apache.org<mailto:%22user@ctakes.apache.org%22%20%3cuser@ctakes.apache.org%3e>>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 21:15:54 +0000

Yes, I’ve been following those instructions to view the .xmi files in the CVD.  The right pane shows the text of the XML file.

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 3:00 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

So after you process all the notes do you follow the instructions on the wiki page that say:
You can view information in the XMI files using the UIMA Cas Visual Debugger (CVD).

Execute bin/runctakesCVD
Select File > Read Type System File
Select TypeSystem.xml in resources/org/apache/ctakes/typesystem/types/
Select File > Read XMI CAS File
Select any .xmi file in your outputDirectory

and look at that .xmi file? If so, what do you see in the right pane? The text of the note or the text of an xml file?
Tim

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org<mailto:%22Baas,Leah%22%20%3cLeah.Baas@SanfordHealth.org%3e>>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu<mailto:%22Miller,%20Timothy%22%20%3cTimothy.Miller@childrens.harvard.edu%3e>>, user@ctakes.apache.org <user@ctakes.apache.org<mailto:%22user@ctakes.apache.org%22%20%3cuser@ctakes.apache.org%3e>>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:45:58 +0000

It is not CDA format. I used Python’s ElementTree module to generate XML files containing the clinical notes for each subject in my dataset. When I run the Default Clinical Pipeline, I can successfully generate XMI output files for each XML file in my input directory. The following WARNING message appears multiple times over the course of the processing (not sure if this is at all related to the issue at hand):

Jan 29, 2019 2:02:56 PM org.apache.uima.util.MessageReport decreasingWithTrace(51)
WARNING: Message count: 1; Feature org.apache.ctakes.typesystem.type.textsem.Predicate:relations is marked multipleReferencesAllowed=false, but it has multiple references.  These will be serialized in duplicate. Message count indicates messages skipped to avoid potential flooding. Set FINE logging level for stacktrace.

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 2:28 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Well if you're processing XML files that will likely cause a problem with this script, it's expecting plain text files in a directory. Maybe Sean can chime in on whether it's possible to use an XML collection reader with the runClinicalPipeline.sh script? Is it CDA format?
Tim

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org<mailto:%22Baas,Leah%22%20%3cLeah.Baas@SanfordHealth.org%3e>>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu<mailto:%22Miller,%20Timothy%22%20%3cTimothy.Miller@childrens.harvard.edu%3e>>, user@ctakes.apache.org <user@ctakes.apache.org<mailto:%22user@ctakes.apache.org%22%20%3cuser@ctakes.apache.org%3e>>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:21:17 +0000

Hi Tim,

Thanks again for working through this with me. I hadn’t read through the time stamps carefully enough to notice the one-time cost of startup.

I did replicate your setup by copying/pasting 7 of my XML input files into an empty directory. Here’s what I saw:

  1.  For the startup-- 20 seconds between the first time-stamped log message:

29 Jan 2019 14:02:35  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

                and the first log message doing processing:
29 Jan 2019 14:02:55  INFO SentenceDetector - Starting processing.

  1.  Once started up, 12 seconds to process the notes.

29 Jan 2019 14:03:07  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

Does this help narrow things down?

Leah
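
A rough way to see why the one-time startup matters is to extrapolate from the figures quoted in this thread. The sketch below is only back-of-the-envelope arithmetic: it assumes processing time scales linearly with the number of files, and it uses the 12 seconds measured on 7 XML inputs, so plain-text notes may well behave differently. The function names are made up for illustration.

```python
# Figures quoted in this thread; the linear scaling is an assumption.
N_FILES = 13_414          # size of the full batch
SINGLE_RUN_SECONDS = 28   # wall-clock time for one run on one 6.0 KB file
STARTUP_SECONDS = 20      # one-time startup read off the log timestamps
BATCH_SECONDS = 12        # processing time for the 7-file test batch
BATCH_FILES = 7


def hours(seconds: float) -> float:
    """Convert seconds to hours."""
    return seconds / 3600.0


def one_run_per_file(n_files: int) -> float:
    """Total seconds if every file pays the full 28 s, startup included."""
    return n_files * SINGLE_RUN_SECONDS


def single_batch(n_files: int) -> float:
    """Total seconds if startup is paid once and the per-file cost
    matches the 7-file test (an assumption, not a measurement)."""
    per_file = BATCH_SECONDS / BATCH_FILES
    return STARTUP_SECONDS + n_files * per_file


if __name__ == "__main__":
    print(f"one run per file: {hours(one_run_per_file(N_FILES)):.1f} h")
    print(f"single batch:     {hours(single_batch(N_FILES)):.1f} h")
```

Under these assumptions the estimate is roughly 104 hours when each file is run separately, against roughly 6.4 hours for one batch.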

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 1:58 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I haven't used that script myself, but I just tried it now on some notes from mtsamples. Maybe you can try to replicate that setup? I just copy/pasted the 7 allergy/immunology notes [1] into 7 text files in an empty directory. Here's what I see:

1) It is pretty slow to start up, but this is a one-time cost (~50 seconds). I'm looking at the time between the very first time-stamped log message:
29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

and the first log message doing processing:

29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing

2) Once started up, it processes the notes in about 14s. This is actually slower than I expected, but it is a lot faster than what you were seeing. I'm looking at the time between the start of processing just above and the last log message before it quits:

29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

If you can replicate this input/output setup and approximate timing in your VM first, then we can see whether it's a function of your notes or your setup.

Tim

[1] https://www.mtsamples.com/site/pages/browse.asp?type=3-Allergy%20/%20Immunology
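
The timing comparison described above can also be done mechanically. This is a minimal sketch, assuming every log line starts with a timestamp in the "29 Jan 2019 14:51:51" form seen in this thread and that the month abbreviations are English; the helper names are hypothetical and not part of cTAKES.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%d %b %Y %H:%M:%S"  # e.g. "29 Jan 2019 14:51:51"


def log_time(line: str) -> datetime:
    """Parse the timestamp at the start of a log line."""
    # The timestamp is the first four whitespace-separated fields.
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def seconds_between(first: str, second: str) -> float:
    """Elapsed seconds from one log line to a later one."""
    return (log_time(second) - log_time(first)).total_seconds()


# The three log lines quoted in the message above.
model_loaded = ("29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector "
                "model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip")
started = "29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing"
finished = "29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing"

startup_seconds = seconds_between(model_loaded, started)
processing_seconds = seconds_between(started, finished)

if __name__ == "__main__":
    print("startup:   ", startup_seconds, "s")
    print("processing:", processing_seconds, "s")
```

For the lines quoted here this gives 49 seconds of startup and 14 seconds of processing, which matches the "~50 seconds" and "about 14s" figures.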

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: user@ctakes.apache.org, Timothy.Miller@childrens.harvard.edu
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 19:33:34 +0000

Hi again Tim,

I am trying to check which version of the dictionary I am using when running the Default Clinical Pipeline. I have been running the pipeline according to the instructions detailed here <https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline>. However, I haven’t been able to find documentation specifying which dictionary version is built into this pipeline. There must be a simple way to check; I am just ignorant. Could you enlighten me?

Thanks,

Leah

From: "Baas,Leah" <Le...@SanfordHealth.org>
Date: Tuesday, January 29, 2019 at 12:23 PM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Tim,

Thanks for your quick response! Probably unsurprisingly, I’ll have to do some googling to learn how to check those things. If you could point me in the right direction, that’d be great!

Thanks again,

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Tuesday, January 29, 2019 at 12:14 PM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I am able to process that number of files in a reasonable amount of time (maybe an hour) on an average desktop. Luckily, debugging your setup should be much easier than doing a scaleout. A few possibilities:

* You are running the old (slow) dictionary instead of the new fast one
* Your document has extremely long sentences
* Your VM is _extremely_ resource constrained and is thrashing constantly

Do you know how to check these things?
Tim
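
For the second possibility, one rough way to check is to scan the input notes for unusually long sentences. The sketch below is only a heuristic: it splits on sentence-ending punctuation and blank lines, which is not how the cTAKES sentence detector segments text, and the 100-word threshold, the ".txt" pattern, and the function names are all assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Rough stand-in for sentence boundaries: ., ! or ? followed by whitespace,
# or a blank line. The cTAKES sentence detector will segment differently.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+|\n\s*\n")


def longest_sentence_words(text: str) -> int:
    """Word count of the longest heuristic 'sentence' in the text."""
    sentences = SENTENCE_BREAK.split(text)
    return max((len(s.split()) for s in sentences), default=0)


def report_long_sentences(input_dir: str, threshold: int = 100) -> list:
    """Return (file name, longest sentence length) for notes over the threshold."""
    flagged = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        longest = longest_sentence_words(text)
        if longest > threshold:
            flagged.append((path.name, longest))
    return flagged
```

Calling report_long_sentences on the input directory returns the notes worth a closer look; an empty list suggests sentence length is probably not the cause.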

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
Reply-to: <us...@ctakes.apache.org>
To: user@ctakes.apache.org
Subject: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 17:58:48 +0000

Hi all,

I would like to process a batch of 13,414 files (avg file size 6.2 KB) using the default clinical pipeline. I am new to cTAKES and computer programming, and I’m looking for guidance on how to process these files with maximum time/CPU efficiency. I am currently running my program on an Ubuntu VM with 3 CPUs. It takes me 28 seconds (real time) to process one 6.0 KB file. I’m reading up on parallel processing strategies, but would be grateful for any suggestions, tips, etc. that you might have!

Thanks,

Leah

Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.

OK, if you can see XML tags in the right pane, that means that cTAKES is trying to process the XML markup as well as the text. Can you change your Python pre-processing step to write plaintext files containing only the text of the note, with no XML, and then process those? I think there are probably cases where having XML in the text would confuse some of the modules and cause them to run slowly. You will also get weird outputs: I've seen "<span>" get annotated as a "body measurement finding" when we accidentally processed some HTML once.
Tim
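
The change suggested above might look something like the following. It is a minimal sketch, not the actual pre-processing script: the layout of the XML files is not shown in this thread, so it simply keeps all element text and drops the tags, and the directory arguments and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_note_to_text(xml_path: Path) -> str:
    """Return only the text content of an XML file, with all tags dropped."""
    root = ET.parse(str(xml_path)).getroot()
    # itertext() yields the text inside every element, in document order.
    pieces = (piece.strip() for piece in root.itertext())
    return "\n".join(piece for piece in pieces if piece)


def convert_directory(xml_dir: str, text_dir: str) -> int:
    """Write one .txt file per .xml file and return how many were converted."""
    out_dir = Path(text_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        text = xml_note_to_text(xml_path)
        (out_dir / (xml_path.stem + ".txt")).write_text(text, encoding="utf-8")
        count += 1
    return count
```

The resulting directory of .txt files can then be used as the input directory for the pipeline. If the note text lives in one specific element, selecting that element instead of the whole tree would avoid pulling in IDs or other metadata.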

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu>, user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 21:15:54 +0000

Yes, I’ve been following those instructions to view the .xmi files in the CVD.  The right pane shows the text of the XML file.

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 3:00 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

So after you process all the notes do you follow the instructions on the wiki page that say:
You can view information in the XMI files using the UIMA Cas Visual Debugger (CVD).

Execute bin/runctakesCVD
Select File > Read Type System File
Select TypeSystem.xml in resources/org/apache/ctakes/typesystem/types/
Select File > Read XMI CAS File
Select any .xmi file in your outputDirectory

and look at that .xmi file? If so, what do you see in the right pane? The text of the note or the text of an xml file?
Tim

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu>, user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:45:58 +0000

It is not CDA format. I used Python’s ElementTree module to generate XML files containing the clinical notes for each subject in my dataset. When I run the Default Clinical Pipeline, I can successfully generate XMI output files for each XML file in my input directory. The following WARNING message appears multiple times over the course of the processing (not sure if this is at all related to the issue at hand):

Jan 29, 2019 2:02:56 PM org.apache.uima.util.MessageReport decreasingWithTrace(51)
WARNING: Message count: 1; Feature org.apache.ctakes.typesystem.type.textsem.Predicate:relations is marked multipleReferencesAllowed=false, but it has multiple references.  These will be serialized in duplicate. Message count indicates messages skipped to avoid potential flooding. Set FINE logging level for stacktrace.

Leah

Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Baas,Leah" <Le...@SanfordHealth.org>.

Yes, I’ve been following those instructions to view the .xmi files in the CVD.  The right pane shows the text of the XML file.

Leah

Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.

So after you process all the notes do you follow the instructions on the wiki page that say:
You can view information in the XMI files using the UIMA Cas Visual Debugger (CVD).

Execute bin/runctakesCVD
Select File > Read Type System File
Select TypeSystem.xml in resources/org/apache/ctakes/typesystem/types/
Select File > Read XMI CAS File
Select any .xmi file in your outputDirectory

and look at that .xmi file? If so, what do you see in the right pane? The text of the note or the text of an xml file?
Tim


From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Tuesday, January 29, 2019 at 12:14 PM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I am able to process that number of files in a reasonable amount of time (maybe an hour) on an average desktop. Luckily, debugging your setup should be much easier than doing a scaleout. A few possibilities:

* You are running the old (slow) dictionary instead of the new fast one
* Your document has extremely long sentences
* Your VM is _extremely_ resource constrained and is thrashing constantly

Do you know how to check these things?
Tim



-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org<mailto:%22Baas,Leah%22%20%3cLeah.Baas@SanfordHealth.org%3e>>
Reply-to: <us...@ctakes.apache.org>
To: user@ctakes.apache.org <user@ctakes.apache.org<mailto:%22user@ctakes.apache.org%22%20%3cuser@ctakes.apache.org%3e>>
Subject: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 17:58:48 +0000

Hi all,

I would like to process a batch of 13,414 files (avg file size 6.2 KB) using the default clinical pipeline. I am new to cTAKES and computer programming, and I’m looking for guidance on how to process these files with maximum time/CPU efficiency. I am currently running my program on an Ubuntu VM with 3 CPUs. It takes me 28 seconds (real time) to process one 6.0 KB file. I’m reading up on parallel processing strategies, but would be grateful for any suggestions, tips, etc. that you might have!

Thanks,

Leah



-----------------------------------------------------------------------
Confidentiality Notice: This e-mail message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
privileged and confidential information.  Any unauthorized review, use,
disclosure or distribution is prohibited.  If you are not the intended
recipient, please contact the sender by reply e-mail and destroy
all copies of the original message.

Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Baas,Leah" <Le...@SanfordHealth.org>.

It is not CDA format. I used Python’s ElementTree module to generate XML files containing the clinical notes for each subject in my dataset. When I run the Default Clinical Pipeline, I can successfully generate XMI output files for each XML file in my input directory. The following WARNING message appears multiple times over the course of the processing (not sure if this is at all related to the issue at hand):

Jan 29, 2019 2:02:56 PM org.apache.uima.util.MessageReport decreasingWithTrace(51)
WARNING: Message count: 1; Feature org.apache.ctakes.typesystem.type.textsem.Predicate:relations is marked multipleReferencesAllowed=false, but it has multiple references.  These will be serialized in duplicate. Message count indicates messages skipped to avoid potential flooding. Set FINE logging level for stacktrace.

Leah

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 2:28 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

Well, if you're processing XML files, that will likely cause a problem with this script; it's expecting plain text files in a directory. Maybe Sean can chime in on whether it's possible to use an XML collection reader with the runClinicalPipeline.sh script? Is it CDA format?
Tim


Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.

Well, if you're processing XML files, that will likely cause a problem with this script; it's expecting plain text files in a directory. Maybe Sean can chime in on whether it's possible to use an XML collection reader with the runClinicalPipeline.sh script? Is it CDA format?
Tim
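
Since the script expects plain text, one option is to convert the XML files before running the pipeline. The following is only a minimal sketch under stated assumptions: the layout of the XML files is not described in this thread, so it simply concatenates every text node with `itertext()`, and the names `element_text`, `convert_directory`, `xml_dir` and `txt_dir` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def element_text(root: ET.Element) -> str:
    """Return all text inside an element, with the tags removed."""
    # itertext() yields every text node below root in document order.
    pieces = (piece.strip() for piece in root.itertext())
    return "\n".join(piece for piece in pieces if piece)


def convert_directory(xml_dir: str, txt_dir: str) -> int:
    """Write one .txt file per .xml file and return how many were written."""
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        # ET.parse reads the file itself, so the XML encoding declaration is honored.
        root = ET.parse(str(xml_path)).getroot()
        (out / (xml_path.stem + ".txt")).write_text(element_text(root), encoding="utf-8")
        count += 1
    return count
```

If only some elements hold note text, `root.iter("tag")` with the relevant (assumed) tag name would be a narrower choice than `itertext()` over the whole tree.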

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: "Miller, Timothy" <Timothy.Miller@childrens.harvard.edu>, user@ctakes.apache.org
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 20:21:17 +0000

Hi Tim,

Thanks again for working through this with me. I hadn’t read through the time stamps carefully enough to notice the one-time cost of startup.

I did replicate your setup by copying/pasting 7 of my XML input files into an empty directory. Here’s what I saw:

  1.  For the startup-- 20 seconds between the first time-stamped log message:

29 Jan 2019 14:02:35  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

                and the first log message doing processing:
29 Jan 2019 14:02:55  INFO SentenceDetector - Starting processing.

  2.  Once started up, 12 seconds to process the notes.

29 Jan 2019 14:03:07  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

Does this help narrow things down?

Leah


Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Baas,Leah" <Le...@SanfordHealth.org>.

Hi Tim,

Thanks again for working through this with me. I hadn’t read through the time stamps carefully enough to notice the one-time cost of startup.

I did replicate your setup by copying/pasting 7 of my XML input files into an empty directory. Here’s what I saw:


  1.  For the startup: 20 seconds between the first time-stamped log message:

29 Jan 2019 14:02:35  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

      and the first log message doing processing:

29 Jan 2019 14:02:55  INFO SentenceDetector - Starting processing.

  2.  Once started up, 12 seconds to process the notes:

29 Jan 2019 14:03:07  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

Does this help narrow things down?

Leah
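
For a rough sense of what these numbers mean for the full batch, the timings can be extrapolated linearly. This is only back-of-the-envelope arithmetic: it assumes per-file cost stays constant across all 13,414 files, that the 28-second single-file figure was paid on every run (it likely included the one-time startup), and the function name `estimate_batch_hours` is made up for illustration.

```python
def estimate_batch_hours(n_files, sample_seconds, sample_files, startup_seconds=0.0):
    """Linear extrapolation from a timed sample to a whole batch, in hours."""
    per_file = sample_seconds / sample_files
    return (startup_seconds + n_files * per_file) / 3600.0


# One pipeline launch per file, 28 s each (the original measurement).
single_run = estimate_batch_hours(13_414, 28, 1)

# One launch for the whole batch: 20 s startup, then 12 s per 7 files.
batch_run = estimate_batch_hours(13_414, 12, 7, startup_seconds=20)

print(f"{single_run:.1f} h if every file costs 28 s")
print(f"{batch_run:.1f} h at 12 s per 7 files plus one 20 s startup")
```

The gap between the two estimates comes almost entirely from paying the startup cost once instead of once per file.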

From: "Miller, Timothy" <Ti...@childrens.harvard.edu>
Date: Tuesday, January 29, 2019 at 1:58 PM
To: "Baas,Leah" <Le...@SanfordHealth.org>, "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]

I haven't used that script myself, but I just tried it now on some notes from mtsamples. Maybe you can try to replicate that setup? I just copy/pasted the 7 allergy/immunology notes [1] into 7 text files in an empty directory. Here's what I see:

1) It is pretty slow to start up -- but this is a one-time cost (~50 seconds). I'm looking at the time between the very first time-stamped log message:
29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

and the first log message doing processing:

29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing

2) Once started up, it processes the notes in about 14s. This is actually slower than I expected, but it is a lot faster than what you were seeing. I'm looking at the time between the start of processing just above and the last log message before it quits:

29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

If you can replicate this input/output setup and approximate timing in your VM first, then we can see whether it's a function of your notes or your setup.

Tim


[1] https://www.mtsamples.com/site/pages/browse.asp?type=3-Allergy%20/%20Immunology


Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.

I haven't used that script myself, but I just tried it now on some notes from mtsamples. Maybe you can try to replicate that setup? I just copy/pasted the 7 allergy/immunology notes [1] into 7 text files in an empty directory. Here's what I see:

1) It is pretty slow to start up -- but this is a one-time cost (~50 seconds). I'm looking at the time between the very first time-stamped log message:
29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip

and the first log message doing processing:

29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing

2) Once started up, it processes the notes in about 14s. This is actually slower than I expected, but it is a lot faster than what you were seeing. I'm looking at the time between the start of processing just above and the last log message before it quits:

29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing

If you can replicate this input/output setup and approximate timing in your VM first, then we can see whether it's a function of your notes or your setup.

Tim


[1] https://www.mtsamples.com/site/pages/browse.asp?type=3-Allergy%20/%20Immunology
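
The same comparison can be done without reading timestamps by eye. Below is a small sketch that assumes every log line starts with a timestamp in the form shown above (day, abbreviated English month, year, time); the helper names `log_time` and `elapsed_seconds` are hypothetical, and the three lines are copied from the log excerpts in this message.

```python
from datetime import datetime

# Matches timestamps such as "29 Jan 2019 14:51:51".
# %b is locale-dependent; an English locale is assumed here.
LOG_TIME_FORMAT = "%d %b %Y %H:%M:%S"


def log_time(line: str) -> datetime:
    """Parse the timestamp made up of the first four whitespace-separated fields."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def elapsed_seconds(first_line: str, second_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    return (log_time(second_line) - log_time(first_line)).total_seconds()


model_loaded = "29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip"
started = "29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing"
finished = "29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing"

startup = elapsed_seconds(model_loaded, started)   # 49.0, the "~50 seconds" above
processing = elapsed_seconds(started, finished)    # 14.0
print(startup, processing)
```

Pointing the same two helpers at the first and last lines of a saved log gives the startup and processing split for any run.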

-----Original Message-----
From: "Baas,Leah" <Leah.Baas@SanfordHealth.org>
To: user@ctakes.apache.org, Timothy.Miller@childrens.harvard.edu
Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
Date: Tue, 29 Jan 2019 19:33:34 +0000

Hi again Tim,

I am trying to check which version of the dictionary I am using when running the Default Clinical Pipeline. I have been running the pipeline according to the instructions detailed here<https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline>. However, I haven’t been able to find documentation specifying which dictionary version is built into this pipeline. There must be a simple way to check -- I am just ignorant. Could you enlighten me?

Thanks,

Leah


Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Baas,Leah" <Le...@SanfordHealth.org>.

Hi again Tim,

I am trying to check which version of the dictionary I am using when running the Default Clinical Pipeline. I have been running the pipeline according to the instructions detailed here<https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline>. However, I haven’t been able to find documentation specifying which dictionary version is built into this pipeline. There must be a simple way to check—I am just ignorant. Could you enlighten me?

Thanks,

Leah


Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Baas,Leah" <Le...@SanfordHealth.org>.

Tim,

Thanks for your quick response! Probably unsurprisingly, I’ll have to do some googling to learn how to check those things. If you could point me in the right direction, that’d be great!

Thanks again,

Leah


Re: Processing large batches of files in cTAKES [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.

I am able to process that number of files in a reasonable amount of time (maybe an hour) on an average desktop. Luckily, debugging your setup should be much easier than doing a scaleout. A few possibilities:

* You are running the old (slow) dictionary instead of the new fast one
* Your document has extremely long sentences
* Your VM is _extremely_ resource constrained and is thrashing constantly

Do you know how to check these things?
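One rough way to check the second possibility is to look for files that contain very long unbroken runs of text. The sketch below is only a proxy: it assumes the input is a folder of UTF-8 `.txt` files, and it splits on sentence-ending punctuation and line breaks, which is not how cTAKES itself detects sentences. The names `longest_run` and `rank_files` are made up for this illustration.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Crude stand-in for a sentence boundary: whitespace that follows ., ! or ?,
# or any run of line breaks. This is an assumption for illustration only and
# does not reproduce the cTAKES sentence detector.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n+")


def longest_run(text: str) -> int:
    """Return the length, in characters, of the longest unbroken run of text."""
    return max((len(chunk) for chunk in _BOUNDARY.split(text)), default=0)


def rank_files(input_dir: str, top: int = 10) -> List[Tuple[int, str]]:
    """Return up to `top` (length, file name) pairs, longest run first."""
    results = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results.append((longest_run(text), path.name))
    return sorted(results, reverse=True)[:top]


# Example use (the directory name is hypothetical):
#   for length, name in rank_files("input_notes"):
#       print(length, name)
```

Files whose longest run is thousands of characters are worth opening by hand. For the third possibility, standard Linux tools such as `free -h` and `top` show how much memory is free and whether the VM is swapping while the pipeline runs.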
Tim




Re: Processing large batches of files in cTAKES

Posted by Greg Silverman <gm...@umn.edu>.

That seems like a long time. We managed to push through all 10000 MIMIC
files in cTAKES in a little over 2.5 hours (and this was on a VM) using the
default settings.
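For scale, the figures quoted in this thread can be compared directly. This is a back-of-the-envelope sketch: it assumes "a little over 2.5 hours" is roughly 2.5 hours, that the 28 seconds per file would hold for every file in the 13,414-file batch, and that files are processed one at a time. The function names are made up for illustration.

```python
def seconds_per_file(total_seconds: float, n_files: int) -> float:
    """Average wall-clock seconds spent on each file."""
    return total_seconds / n_files


def batch_hours(sec_per_file: float, n_files: int) -> float:
    """Hours needed to process n_files sequentially at the given rate."""
    return sec_per_file * n_files / 3600.0


# 10,000 MIMIC files in about 2.5 hours (figure from this message).
reported_rate = seconds_per_file(2.5 * 3600, 10_000)

# 13,414 files at the 28 s/file measured in the original question.
hours_at_28s = batch_hours(28.0, 13_414)

# The same batch at the rate reported in this message.
hours_at_reported_rate = batch_hours(reported_rate, 13_414)

print(f"reported rate: {reported_rate:.2f} s/file")
print(f"13,414 files at 28 s/file: {hours_at_28s:.1f} h")
print(f"13,414 files at the reported rate: {hours_at_reported_rate:.1f} h")
```

Under those assumptions the gap is roughly 0.9 seconds versus 28 seconds per file, which is why the setup looks worth debugging before any parallelization.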

Greg--



-- 
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Cardiovascular Informatics <http://www.med.umn.edu/cardiology/>
University of Minnesota
gms@umn.edu

 ›  evaluate-it.org  ‹

Re: Processing large batches of files in cTAKES

Posted by gandhi rajan <ga...@gmail.com>.

The cTAKES web REST module (REST service) is now available as part of the
cTAKES SVN codebase. All you have to do is use the dictionary generator GUI
to add a dictionary of your choice, get the corresponding scripts, and load
the DB. Then use the custom dictionary XML file in the cTAKES web REST module
to point to the DB.



-- 
Regards,
Gandhi

"The best way to find urself is to lose urself in the service of others !!!"

RE: Processing large batches of files in cTAKES

Posted by Steve Evans <st...@duke.edu>.

No, not using the ctakes rest service. I ended up rolling our own so I could better tailor the web service.

I also built a generic Java API wrapping ctakes which delivers content in a more easily consumable structure to be used by researchers here at Duke. That Java API is also used internally by the Tomcat web app which delivers the same structure via web service.

I use my own JSON-based API so that we can have a standard response to NLP requests regardless of internal implementation (meaning if we swapped out ctakes for some other NLP engine, the JSON could remain the same).
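To make the idea of an engine-independent response concrete, here is a minimal sketch. It is not the API described in this message, which was not shared in the thread; every class and field name below (`Mention`, `NlpResponse`, `concept_code`, and so on) is hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Mention:
    """One extracted concept, described without engine-specific types."""
    text: str
    begin: int          # character offset where the mention starts
    end: int            # character offset just past the mention
    concept_code: str   # placeholder for whatever code the engine returns
    negated: bool = False


@dataclass
class NlpResponse:
    """Stable response shape that callers depend on."""
    document_id: str
    mentions: List[Mention] = field(default_factory=list)


def to_json(response: NlpResponse) -> str:
    """Serialize the response; callers never see the underlying NLP engine."""
    return json.dumps(asdict(response), sort_keys=True)


example = NlpResponse("doc-1", [Mention("aspirin", 10, 17, "CODE-123")])
print(to_json(example))
```

The point of the design is that only the code that fills in `NlpResponse` knows which engine produced the annotations, so swapping engines means rewriting that adapter and nothing downstream.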

I am (was) new to Docker but found it a really good tool for hiding implementation complexity and for easy deployment. Plus you can spin up as many containers as your hardware can bear for parallel processing.

I don’t want to understate the time it took to put all this together. It did take a lot of effort, but I am happy with the result.

Happy to share thoughts/code with anyone – beware: NO DOC!

Steve



RE: Processing large batches of files in cTAKES

Posted by jb...@yahoo.com.

Steve, are you using the ctakes-rest-service (https://github.com/GoTeamEpsilon/ctakes-rest-service)?

If so, do you have any pointers on how to configure a custom dictionary (such as for ICD10) after installing the ctakes-rest-service? Because the ctakes-rest-service installation procedure involves building ctakes from source, the runCustomDictionary tool does not work. I have logged an issue here: https://github.com/GoTeamEpsilon/ctakes-rest-service/issues/56

I was wondering if you had any pointers in this regard, or know anything about how to implement their suggested solution of creating custom tables etc.


RE: Processing large batches of files in cTAKES

Posted by Steve Evans <st...@duke.edu>.

Leah,

I run my ctakes workload using docker containers.

I have built a container that serves ctakes requests via tomcat webservices. That’s not for the faint of heart and not for non-programmer types. But you might be able to install the ctakes software in a container with the input/output directories on the host and then run in parallel using file input/output.

I run 10 containers to get the throughput we need (5/second). This is on a 16-CPU, 64 GB host (each container consumes about 2 GB of RAM).

Not a slam-dunk answer, but I thought it might help generate ideas.
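The file-based variant of this idea needs the input split into one folder per worker. The sketch below shows one way to do that split; it is an assumption about how such a setup could be arranged, not the setup described above. The function name `shard_files` and the `shard_NN` folder naming are hypothetical, and how each container or process is pointed at its folder is left out because it depends on the local installation.

```python
import shutil
from pathlib import Path
from typing import List


def shard_files(input_dir: str, output_root: str, n_shards: int) -> List[Path]:
    """Copy the files in input_dir into n_shards folders, round-robin.

    Returns the shard folders so each one can be handed to a separate
    worker (for example, one container or one pipeline process each).
    """
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")

    # List the files first so folders created below are never picked up.
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())

    shard_dirs = []
    for i in range(n_shards):
        shard = Path(output_root) / f"shard_{i:02d}"
        shard.mkdir(parents=True, exist_ok=True)
        shard_dirs.append(shard)

    # Round-robin keeps shard sizes within one file of each other.
    for index, path in enumerate(files):
        shutil.copy2(path, shard_dirs[index % n_shards] / path.name)

    return shard_dirs
```

On a 3-CPU VM, three shards would be a natural starting point, provided there is enough memory for three workers at roughly 2 GB each. As a rough check on the figures above, 13,414 files at 5 files per second is about 2,683 seconds, or roughly 45 minutes.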

Steve

