Posted to dev@ctakes.apache.org by "Finan, Sean" <Se...@childrens.harvard.edu> on 2021/08/02 15:41:27 UTC

Re: Can you store cTAKES in an S3 bucket so you can use it with EMR for parallel processing? [EXTERNAL]

Hi John,

I am not completely sure that I understand what you are asking, and I think that this is more of an s3 question than a ctakes question, but here are a couple of comments:

> the cTAKES part of it relies on CTAKES_HOME being set
- Is this requirement on your side?  I never bother to set CTAKES_HOME.

> So I need to store cTAKES in a shared location
- I am not sure why you need to do this when it is possible to spin up multiple machines, each with its own ctakes "installation."

> Usually, in EMR, you would use S3 for this 
- This seems to be quite a blanket statement

> cTAKES relies on a hierarchical file structure
- ok ...

> such as storing cTAKES on S3 instead
- I have [essentially] done this.  If I remember correctly I didn't need to venture too far outside my comfort zone.

> altering cTAKES to work with a flat file structure using the S3
- I haven't touched it for many years, but the flat file structure is essentially internal to S3: objects can still be referenced via a complete "hierarchical path", it is just that the object key is "bob/likes/ice.cream"
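To make Sean's point concrete, here is a small stand-alone sketch (no AWS tooling involved; the key names are made up for illustration) showing that "directories" in S3 are just key prefixes over a flat namespace:

```shell
# S3 has no real directories: an object's full "path" is simply its key.
# These made-up keys illustrate a ctakes-like layout stored flat:
keys="ctakes/resources/dictionary/customDictionary.xml
ctakes/resources/models/pos-tagger.bin
ctakes/desc/myPipeline.piper"

# "Listing the resources directory" is nothing more than a prefix filter:
printf '%s\n' "$keys" | grep '^ctakes/resources/'
```

So a tool that asks S3 for everything under `ctakes/resources/` gets back the same set of files a hierarchical listing would, even though no directory objects exist.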

Again, I haven't needed to work with this for about 5 years, so what I did might be completely irrelevant.  I would hope that implementation is now simpler, examples more prevalent and documentation better than back in the day.

Sean

________________________________________
From: John Doe <lu...@gmail.com>
Sent: Sunday, July 25, 2021 3:28 PM
To: dev@ctakes.apache.org
Subject: Can you store cTAKES in an S3 bucket so you can use it with EMR for parallel processing? [EXTERNAL]

* External Email - Caution *


I'm working on a solution for running cTAKES in an Amazon EMR environment
with Apache Spark so I can run multiple instances of cTAKES in parallel for
processing a bunch of notes. However, the cTAKES part of it relies on
CTAKES_HOME being set on every machine for locating model files and such.
So I need to store cTAKES in a shared location so every node can set
CTAKES_HOME to that location. Usually, in EMR, you would use S3 for this
but it seems that cTAKES relies on a hierarchical file structure for
loading in files (model files, dictionary files, etc.). My current solution
uses EFS as an alternative. Is there a better approach to integrating
cTAKES with EMR? I know there are alternative non-EMR
approaches to parallelizing cTAKES, but I may not have those technologies
available. I'm wondering if there is a good way around using EFS such as
storing cTAKES on S3 instead, but it seems like altering cTAKES to work
with a flat file structure using the S3 API may be a pretty big task.
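One way to avoid altering cTAKES at all is to copy the install down from S3 once per node and point CTAKES_HOME at the local copy. The following bootstrap-style sketch uses made-up bucket names and paths; the `aws s3 sync` command is shown as a comment because it requires AWS credentials and a real bucket:

```shell
# Hypothetical EMR bootstrap action (sketch): copy a ctakes install from S3
# to local disk once per node, then point CTAKES_HOME at it.
CTAKES_S3_PREFIX="s3://my-bucket/ctakes-4.0.0"   # placeholder bucket/prefix
CTAKES_LOCAL_DIR="/opt/ctakes"

# aws s3 sync rebuilds a normal directory tree from the flat S3 keys:
#   aws s3 sync "$CTAKES_S3_PREFIX" "$CTAKES_LOCAL_DIR"

export CTAKES_HOME="$CTAKES_LOCAL_DIR"
echo "CTAKES_HOME=$CTAKES_HOME"
```

Because `aws s3 sync` materializes the flat keys as real directories, cTAKES's hierarchical file loading works unchanged, at the cost of one copy per cluster start.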

Re: Can you store cTAKES in an S3 bucket so you can use it with EMR for parallel processing? [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Tom,

That is great news!  I am always happy to hear that people are using ctakes and have gotten it to run in novel ways!

- and I am happy that ctakes-dockhand is useful in your solution!

Cheers,
Sean


________________________________________
From: Thomas W Loehfelm <tw...@ucdavis.edu.INVALID>
Sent: Thursday, August 12, 2021 1:58 PM
To: dev@ctakes.apache.org
Subject: Re: Can you store cTAKES in an S3 bucket so you can use it with EMR for parallel processing? [EXTERNAL]

* External Email - Caution *


For parallel processing, consider using the ctakes-dockhand component. You can run ctakes as a docker service and then scale it using docker swarm to replicate that service across many nodes to expand processing capacity.
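The swarm scaling described above can be sketched as a stack file. The image name, port, and replica count below are illustrative assumptions, not ctakes-dockhand's actual published configuration:

```yaml
# Hypothetical docker swarm stack file (sketch)
version: "3.8"
services:
  ctakes:
    image: my-registry/ctakes-rest:latest   # placeholder image name
    ports:
      - "8080:8080"
    deploy:
      replicas: 8           # swarm schedules 8 copies across the cluster
      restart_policy:
        condition: on-failure
```

A stack file like this would be deployed with `docker stack deploy -c stack.yml ctakes` and rescaled later with `docker service scale ctakes_ctakes=16`.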

I am happy to share my experience doing it and the code I use, although I am not a java expert and so can’t necessarily explain WHY my stuff is the way it is, just that it gets the job done.

The problem I wanted to address was:

  1.  Use a custom dictionary
  2.  Define a custom processing pipeline (piper file)
  3.  Set up an API endpoint to receive the text of a report
  4.  Return a custom subset of ctakes output in JSON format
  5.  Achieve the desired processing scale/throughput via docker

Thanks to Sean and many others on this forum, we are up and running. If that general workflow is what you are looking for, I am happy to help in any way I can.

Tom



From: John Doe <lu...@gmail.com>
Date: Tuesday, August 3, 2021 at 1:24 PM
To: dev@ctakes.apache.org <de...@ctakes.apache.org>
Subject: Re: Can you store cTAKES in an S3 bucket so you can use it with EMR for parallel processing? [EXTERNAL]
Hello,

Thanks for the response. The reason we are using a shared location for
ctakes is so that we have everything in one place. If we need to add our
own components, dictionaries, etc., we can do it all in one spot. It also
saves us from having to download ctakes on every machine every time we
start up a cluster. I didn't know the regular java file API would still
work with S3 but will have to give that a try. I am relying on CTAKES_HOME
being set since ctakes is stored on EFS so the node wouldn't be able to
find it on its own local file system. I'm basically mounting the EFS folder
holding ctakes onto each node and setting CTAKES_HOME to that so it can
find all the files it needs to. For us anyway, S3 has come up as the
primary means of storage for EMR and I'm not sure if EFS will be available,
which is why I'm trying to see if I can do it on S3.
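The per-node EFS setup described above can be sketched as follows. The filesystem DNS name and mount point are placeholders, and the actual mount command is shown as a comment because it requires root and a real EFS filesystem:

```shell
# Sketch of the per-node setup: mount the shared EFS filesystem, then point
# CTAKES_HOME at the ctakes tree it holds.
EFS_DNS="fs-12345678.efs.us-east-1.amazonaws.com"   # placeholder filesystem
MOUNT_POINT="/mnt/efs"

# The actual NFS mount (requires root; shown here as a comment):
#   sudo mount -t nfs4 -o nfsvers=4.1 "$EFS_DNS:/" "$MOUNT_POINT"

# Every node then sees the same ctakes tree at the same path:
export CTAKES_HOME="$MOUNT_POINT/ctakes"
echo "CTAKES_HOME=$CTAKES_HOME"
```

The appeal of this layout is that a single copy of cTAKES (plus any custom dictionaries and components) is maintained in one place and every node resolves CTAKES_HOME to it identically.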

**CONFIDENTIALITY NOTICE** This e-mail communication and any attachments are for the sole use of the intended recipient and may contain information that is confidential and privileged under state and federal privacy laws. If you received this e-mail in error, be aware that any unauthorized use, disclosure, copying, or distribution is strictly prohibited. If you received this e-mail in error, please contact the sender immediately and destroy/delete all copies of this message.

