Posted to user@ctakes.apache.org by John Doe <lu...@gmail.com> on 2020/11/17 16:47:32 UTC

Scaling out cTAKES

Hello,

I'm new to cTAKES and was wondering what the options are for scaling out
the default clinical pipeline. I'm running it on a large number of clinical
notes using runClinicalPipeline.bat and specifying the input directory with
the notes. What are the best options for doing this in a more scalable way?
For example, can I parallelize it with UIMA-AS? Or should I manually use
multiple command prompts to run the clinical pipeline on different sets of
clinical notes in parallel? I'm not sure if there is any built-in solution
or community resource that uses EMR/Spark or some other method to achieve
this.

Thank you for your help.
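
A minimal Java sketch of the second option (splitting the notes into batches and launching one clinical pipeline process per batch) follows. The runClinicalPipeline.sh path and the -i/-o flags are assumptions, so check the script options in your cTAKES install before relying on them.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

/**
 * Hypothetical sketch: partition a directory of notes into N batches and run
 * one cTAKES clinical pipeline process per batch. The script name and the
 * -i/-o flags are assumptions; verify them against your install.
 */
public class ParallelPipelineRunner {

    public static void main(String[] args) throws IOException, InterruptedException {
        Path notesDir = Paths.get(args[0]);      // directory with all clinical notes
        Path workDir = Paths.get(args[1]);       // where batch dirs and output go
        int batches = Integer.parseInt(args[2]); // e.g. number of CPU cores

        // Round-robin the note files into <batches> sub-directories.
        List<Path> batchDirs = new ArrayList<>();
        for (int i = 0; i < batches; i++) {
            batchDirs.add(Files.createDirectories(workDir.resolve("batch" + i)));
        }
        int n = 0;
        try (DirectoryStream<Path> notes = Files.newDirectoryStream(notesDir)) {
            for (Path note : notes) {
                Files.copy(note, batchDirs.get(n++ % batches).resolve(note.getFileName()));
            }
        }

        // Launch one pipeline process per batch and wait for all of them.
        List<Process> procs = new ArrayList<>();
        for (int i = 0; i < batches; i++) {
            Path out = Files.createDirectories(workDir.resolve("out" + i));
            ProcessBuilder pb = new ProcessBuilder(
                    "bin/runClinicalPipeline.sh",         // assumed script location
                    "-i", batchDirs.get(i).toString(),    // assumed input-dir flag
                    "-o", out.toString());                // assumed output-dir flag
            pb.inheritIO();
            procs.add(pb.start());
        }
        for (Process p : procs) {
            p.waitFor();
        }
    }
}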

Re: Scaling out cTAKES

Posted by "Schenk, Gundolf" <Gu...@ucsf.edu>.
Hi John,

There have been a couple of presentations at the recent ApacheCon:
https://www.apachecon.com/acah2020/tracks/ctakes.html

Cheers,
Gundolf.

From: John Doe <lu...@gmail.com>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Tuesday, November 17, 2020 at 08:48
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Scaling out cTAKES

Hello,

I'm new to cTAKES and was wondering what the options are for scaling out the default clinical pipeline. I'm running it on a large number of clinical notes using runClinicalPipeline.bat and specifying the input directory with the notes. What are the best options for doing this in a more scalable way? For example, can I parallelize it with UIMA-AS? Or should I manually use multiple command prompts to run the clinical pipeline on different sets of clinical notes in parallel? I'm not sure if there is any built-in solution or community resource that uses EMR/Spark or some other method to achieve this.

Thank you for your help.

Re: Scaling out cTAKES

Posted by Olga Patterson <ol...@utah.edu>.
At the VA, we use cTAKES with UIMA-AS.
Here is a very simple example of how it can be implemented:
http://decipher.chpc.utah.edu/gitblit/summary/?r=examples/ctakes-test.git
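
The core of such a client is small. Here is a rough sketch that sends each note to a remote cTAKES service over UIMA-AS; the broker URL (tcp://localhost:61616) and queue name (ctakesQueue) are placeholders for whatever your deployment uses.

import java.util.HashMap;
import java.util.Map;

import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

/**
 * Rough UIMA-AS client sketch: send each note to a remote cTAKES service.
 * The broker URL and queue name below are placeholders for your deployment.
 */
public class CtakesAsClient {

    public static void main(String[] args) throws Exception {
        UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();

        Map<String, Object> ctx = new HashMap<>();
        ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616"); // placeholder broker
        ctx.put(UimaAsynchronousEngine.ENDPOINT, "ctakesQueue");            // placeholder queue
        ctx.put(UimaAsynchronousEngine.CasPoolSize, 4);
        engine.initialize(ctx);

        for (String noteText : args) {       // in practice, read notes from files or a database
            CAS cas = engine.getCAS();
            cas.setDocumentText(noteText);
            engine.sendAndReceiveCAS(cas);   // synchronous round trip per CAS
            // ... pull annotations out of the returned CAS here ...
            cas.release();
        }

        engine.collectionProcessingComplete();
        engine.stop();
    }
}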


--
Olga


From: John Doe <lu...@gmail.com>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Wednesday, November 18, 2020 at 7:23 AM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Cc: Serguei Pakhomov <pa...@umn.edu>, Raymond Finzel <fi...@umn.edu>
Subject: Re: Scaling out cTAKES

Thank you all for the responses. For now, I am going to learn more about how UIMA-AS works to determine if this will work for my use case. If not, I will check out your other suggestions.

On Tue, Nov 17, 2020 at 5:22 PM Greg Silverman <gm...@umn.edu> wrote:
FYI, I just doubled the number of backends and clients and increased the throughput to ~1000 docs/second. Server utilization is now minimal.

I should note that, unlike a Spark cluster, this is running on two old servers and a VM. The nice thing about Kubernetes is that you can easily scale the number of instances up or down using horizontal pod autoscaling. Plus, it's a lot easier to manage than a Spark cluster.

We just started running the cTAKES pipeline on this, so it's an experiment in progress.

So far, the results are very decent. I'll scale it up even more in a day or so.

Greg--



On Tue, Nov 17, 2020 at 11:10 AM Greg Silverman <gm...@umn.edu> wrote:
We at the UMN NLP/IE Lab have developed NLP-ADAPT-kube to scale out four UIMA NLP annotators using Kubernetes/UIMA-AS, including cTAKES, CLAMP, MetaMap (using the UIMA wrapper), and our own homegrown BioMedICUS. Our project is here: https://github.com/nlpie/nlp-adapt-kube

There are two versions: one for CPM, which includes QuickUMLS, and the other for UIMA-AS. The AS versions are under the docker folder and the argo-k8s folder, and use the four engines mentioned above. There is a project wiki (but it is slightly out of date). We are in the process of working non-UIMA engines (like QuickUMLS and our new version of BioMedICUS) into the AS workflow (we're using AMQ for message queuing).

We're currently running cTAKES using the Kubernetes HPA with 6 backends and 2 clients across 3 compute nodes, getting very decent throughput (~150 docs/second). We could definitely scale it up even further.

For a sense of how well this scales, we were running 64 MetaMap backends with 16 clients and getting ~40 docs/second for very large clinical documents (which for MetaMap is very decent). This was across 5 compute nodes.

If you're interested, we can assist with implementation. The client does require some customization based on the backend database you're using: https://github.com/nlpie/nlp-adapt-kube/tree/master/docker/as/client, but that is pretty straightforward.

Best!

Greg--






On Tue, Nov 17, 2020 at 10:47 AM John Doe <lu...@gmail.com> wrote:
Hello,

I'm new to cTAKES and was wondering what the options are for scaling out the default clinical pipeline. I'm running it on a large number of clinical notes using runClinicalPipeline.bat and specifying the input directory with the notes. What are the best options for doing this in a more scalable way? For example, can I parallelize it with UIMA-AS? Or should I manually use multiple command prompts to run the clinical pipeline on different sets of clinical notes in parallel? I'm not sure if there is any built-in solution or community resource that uses EMR/Spark or some other method to achieve this.

Thank you for your help.


--
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
gms@umn.edu



--
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
gms@umn.edu


Re: Scaling out cTAKES

Posted by John Doe <lu...@gmail.com>.
Thank you all for the responses. For now, I am going to learn more about how
UIMA-AS works to determine if this will work for my use case. If not, I
will check out your other suggestions.

On Tue, Nov 17, 2020 at 5:22 PM Greg Silverman <gm...@umn.edu> wrote:

> FYI, I just doubled the number of backends and clients and increased the
> throughput to ~1000 docs/second. Server utilization is now minimal.
>
> I should note that, unlike a Spark cluster, this is running on two old
> servers and a VM. The nice thing about Kubernetes is that you can easily
> scale the number of instances up or down using horizontal pod autoscaling.
> Plus, it's a lot easier to manage than a Spark cluster.
>
> We just started running the cTAKES pipeline on this, so it's an experiment
> in progress.
>
> So far, the results are very decent. I'll scale it up even more in a day
> or so.
>
> Greg--
>
>
>
> On Tue, Nov 17, 2020 at 11:10 AM Greg Silverman <gm...@umn.edu> wrote:
>
>> We at the UMN NLP/IE Lab have developed NLP-ADAPT-kube to scale out
>> four UIMA NLP annotators using Kubernetes/UIMA-AS, including cTAKES,
>> CLAMP, MetaMap (using the UIMA wrapper), and our own homegrown BioMedICUS.
>> Our project is here: https://github.com/nlpie/nlp-adapt-kube
>>
>> There are two versions: one for CPM, which includes QuickUMLS, and the
>> other for UIMA-AS. The AS versions are under the docker folder and the
>> argo-k8s folder, and use the four engines mentioned above. There is a
>> project wiki (but it is slightly out of date). We are in the process of
>> working non-UIMA engines (like QuickUMLS and our new version of
>> BioMedICUS) into the AS workflow (we're using AMQ for message queuing).
>>
>> We're currently running cTAKES using the Kubernetes HPA with 6 backends
>> and 2 clients across 3 compute nodes, getting very decent throughput
>> (~150 docs/second). We could definitely scale it up even further.
>>
>> For a sense of how well this scales, we were running 64 MetaMap backends
>> with 16 clients and getting ~40 docs/second for very large clinical
>> documents (which for MetaMap is very decent). This was across 5 compute
>> nodes.
>>
>> If you're interested, we can assist with implementation. The client does
>> require some customization based on the backend database you're using:
>> https://github.com/nlpie/nlp-adapt-kube/tree/master/docker/as/client,
>> but that is pretty straightforward.
>>
>> Best!
>>
>> Greg--
>>
>>
>>
>>
>>
>>
>> On Tue, Nov 17, 2020 at 10:47 AM John Doe <lu...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I'm new to cTAKES and was wondering what the options are for scaling out
>>> the default clinical pipeline. I'm running it on a large number of clinical
>>> notes using runClinicalPipeline.bat and specifying the input directory with
>>> the notes. What are the best options for doing this in a more scalable way?
>>> For example, can I parallelize it with UIMA-AS? Or should I manually use
>>> multiple command prompts to run the clinical pipeline on different sets of
>>> clinical notes in parallel? I'm not sure if there is any built-in solution
>>> or community resource that uses EMR/Spark or some other method to achieve
>>> this.
>>>
>>> Thank you for your help.
>>>
>>
>>
>> --
>> Greg M. Silverman
>> Senior Systems Developer
>> NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
>> Department of Surgery
>> University of Minnesota
>> gms@umn.edu
>>
>>
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
> Department of Surgery
> University of Minnesota
> gms@umn.edu
>
>

Re: Scaling out cTAKES

Posted by Greg Silverman <gm...@umn.edu>.
FYI, I just doubled the number of backends and clients and increased the
throughput to ~1000 docs/second. Server utilization is now minimal.

I should note that, unlike a Spark cluster, this is running on two old
servers and a VM. The nice thing about Kubernetes is that you can easily
scale the number of instances up or down using horizontal pod autoscaling.
Plus, it's a lot easier to manage than a Spark cluster.

We just started running the cTAKES pipeline on this, so it's an experiment
in progress.

So far, the results are very decent. I'll scale it up even more in a day or
so.

Greg--



On Tue, Nov 17, 2020 at 11:10 AM Greg Silverman <gm...@umn.edu> wrote:

> We at the UMN NLP/IE Lab have developed NLP-ADAPT-kube to scale out four
> UIMA NLP annotators using Kubernetes/UIMA-AS, including cTAKES, CLAMP,
> MetaMap (using the UIMA wrapper), and our own homegrown BioMedICUS. Our
> project is here: https://github.com/nlpie/nlp-adapt-kube
>
> There are two versions: one for CPM, which includes QuickUMLS, and the
> other for UIMA-AS. The AS versions are under the docker folder and the
> argo-k8s folder, and use the four engines mentioned above. There is a
> project wiki (but it is slightly out of date). We are in the process of
> working non-UIMA engines (like QuickUMLS and our new version of BioMedICUS)
> into the AS workflow (we're using AMQ for message queuing).
>
> We're currently running cTAKES using the Kubernetes HPA with 6 backends and
> 2 clients across 3 compute nodes, getting very decent throughput (~150
> docs/second). We could definitely scale it up even further.
>
> For a sense of how well this scales, we were running 64 MetaMap backends
> with 16 clients and getting ~40 docs/second for very large clinical
> documents (which for MetaMap is very decent). This was across 5 compute
> nodes.
>
> If you're interested, we can assist with implementation. The client does
> require some customization based on the backend database you're using:
> https://github.com/nlpie/nlp-adapt-kube/tree/master/docker/as/client, but
> that is pretty straightforward.
>
> Best!
>
> Greg--
>
>
>
>
>
>
> On Tue, Nov 17, 2020 at 10:47 AM John Doe <lu...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm new to cTAKES and was wondering what the options are for scaling out
>> the default clinical pipeline. I'm running it on a large number of clinical
>> notes using runClinicalPipeline.bat and specifying the input directory with
>> the notes. What are the best options for doing this in a more scalable way?
>> For example, can I parallelize it with UIMA-AS? Or should I manually use
>> multiple command prompts to run the clinical pipeline on different sets of
>> clinical notes in parallel? I'm not sure if there is any built-in solution
>> or community resource that uses EMR/Spark or some other method to achieve
>> this.
>>
>> Thank you for your help.
>>
>
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
> Department of Surgery
> University of Minnesota
> gms@umn.edu
>
>

-- 
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
gms@umn.edu

Re: Scaling out cTAKES

Posted by Greg Silverman <gm...@umn.edu>.
We at the UMN NLP/IE Lab have developed NLP-ADAPT-kube to scale out four
UIMA NLP annotators using Kubernetes/UIMA-AS, including cTAKES, CLAMP,
MetaMap (using the UIMA wrapper), and our own homegrown BioMedICUS. Our
project is here: https://github.com/nlpie/nlp-adapt-kube

There are two versions: one for CPM, which includes QuickUMLS, and the
other for UIMA-AS. The AS versions are under the docker folder and the
argo-k8s folder, and use the four engines mentioned above. There is a
project wiki (but it is slightly out of date). We are in the process of
working non-UIMA engines (like QuickUMLS and our new version of BioMedICUS)
into the AS workflow (we're using AMQ for message queuing).

We're currently running cTAKES using the Kubernetes HPA with 6 backends and
2 clients across 3 compute nodes, getting very decent throughput (~150
docs/second). We could definitely scale it up even further.

For a sense of how well this scales, we were running 64 MetaMap backends
with 16 clients and getting ~40 docs/second for very large clinical
documents (which for MetaMap is very decent). This was across 5 compute
nodes.

If you're interested, we can assist with implementation. The client does
require some customization based on the backend database you're using:
https://github.com/nlpie/nlp-adapt-kube/tree/master/docker/as/client, but
that is pretty straightforward.
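
As a rough illustration of that customization, a client that pulls notes from a relational store mostly needs a small JDBC loop feeding an already-initialized UIMA-AS engine. The JDBC URL, table, and column names below are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.cas.CAS;

/**
 * Hypothetical sketch of a database-backed feeder: stream notes out of a
 * relational store and hand each one to an already-initialized UIMA-AS
 * engine. The JDBC URL, table, and column names are made up.
 */
public class DbNoteFeeder {

    public static void feed(UimaAsynchronousEngine engine) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/notes_db", "user", "password"); // placeholder URL
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                "SELECT note_id, note_text FROM clinical_notes")) {          // placeholder schema
            while (rs.next()) {
                CAS cas = engine.getCAS();
                cas.setDocumentText(rs.getString("note_text"));
                engine.sendAndReceiveCAS(cas);   // blocks until the service returns the CAS
                // ... write annotations back to the database keyed by note_id ...
                cas.release();
            }
        }
    }
}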

Best!

Greg--






On Tue, Nov 17, 2020 at 10:47 AM John Doe <lu...@gmail.com> wrote:

> Hello,
>
> I'm new to cTAKES and was wondering what the options are for scaling out
> the default clinical pipeline. I'm running it on a large number of clinical
> notes using runClinicalPipeline.bat and specifying the input directory with
> the notes. What are the best options for doing this in a more scalable way?
> For example, can I parallelize it with UIMA-AS? Or should I manually use
> multiple command prompts to run the clinical pipeline on different sets of
> clinical notes in parallel? I'm not sure if there is any built-in solution
> or community resource that uses EMR/Spark or some other method to achieve
> this.
>
> Thank you for your help.
>


-- 
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
gms@umn.edu