Posted to user@ctakes.apache.org by Michael Trepanier <mi...@metistream.com> on 2017/07/25 23:53:35 UTC

Implementation Improvements for cTAKES on top of Spark

Hi,

I am currently running cTAKES inside Apache Spark and have
written a function that takes a single clinical note as a string
and does the following:

1) Sets the UMLS system properties.
2) Instantiates a JCas object.
3) Runs the default pipeline.
4) (Not shown below) Grabs the annotations and places them in a JSON
object for each note.

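  import org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory
  import org.apache.uima.fit.factory.JCasFactory
  import org.apache.uima.fit.pipeline.SimplePipeline

  def generateAnnotations(paragraph: String): String = {
    // UMLS credentials, passed to cTAKES via system properties
    System.setProperty("ctakes.umlsuser", "MY_UMLS_USERNAME")
    System.setProperty("ctakes.umlspw", "MY_UMLS_PASSWORD")

    // A new JCas and a new pipeline description are created on every call
    val jcas = JCasFactory.createJCas("org.apache.ctakes.typesystem.types.TypeSystem")
    val aed = ClinicalPipelineFactory.getDefaultPipeline()
    jcas.setDocumentText(paragraph)
    SimplePipeline.runPipeline(jcas, aed)
    ...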

This function is implemented as a UDF and applied to a
Spark DataFrame with one clinical note per row. I have two
implementation questions:

1) When cTAKES is applied iteratively to clinical notes, is it
necessary to instantiate a new JCas object for every note being
annotated? Or can the same JCas object be reused, with only the
document text changing?
2) On each call of this function, the
UmlsDictionaryLookupAnnotator has to connect to UMLS using the
provided UMLS credentials. Is there any way to perform this step
locally instead, i.e., ingest UMLS and place it either in HDFS or just
mount it somewhere on each node? I'm worried about spamming the UMLS
server in this step, and about how long it seems to take.

Thanks,

Mike


-- 

Mike Trepanier| Big Data Engineer | MetiStream, Inc. |
mike@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com

Re: Implementation Improvements for cTAKES on top of Spark

Posted by "Abramowitsch, Peter" <pa...@hearst.com>.

I've done the same thing. The CAS has a reset method that lets you repopulate it with new patient data each time.
My server-side function looks like this:


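private void runPipeline(spark.Request req, spark.Response res)
        throws AnalysisEngineProcessException,
               ResourceInitializationException, SAXException, IOException {
    // Load the request body into the shared JCas
    _jcas.setDocumentText(req.body());
    // Run the already-built engines (_xxx and _yyy are placeholder names)
    _xxx.process(_jcas);
    _yyy.process(_jcas);
    res.header("Content-Type", "application/json");
    // Write the CAS to the HTTP response as JSON
    JsonCasSerializer.jsonSerialize(_jcas.getCas(), res.raw().getOutputStream());
    // Clear the CAS so it can be reused for the next request
    _jcas.reset();
}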
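
A minimal, library-free sketch of the set-text, process, reset cycle shown above. ReusableDoc and ReusePattern are hypothetical stand-ins, not UIMA or cTAKES classes; the only behaviour assumed from the function above is that the container can be cleared with a reset call and then refilled. The one change is that the reset sits in a finally block, so a note that fails during processing does not leave stale text behind for the next one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a reusable document container (not the UIMA API). */
final class ReusableDoc {
    private String text;                               // null means "empty, ready for the next note"
    final List<String> annotations = new ArrayList<>();

    void setDocumentText(String newText) {
        // Refuse a second document until reset() has been called.
        if (text != null) {
            throw new IllegalStateException("document text already set; call reset() first");
        }
        text = newText;
    }

    String getDocumentText() {
        return text;
    }

    void reset() {
        text = null;
        annotations.clear();
    }
}

final class ReusePattern {
    /** Processes one note with the shared container and always leaves it reset. */
    static List<String> annotate(ReusableDoc doc, Consumer<ReusableDoc> engine, String note) {
        try {
            doc.setDocumentText(note);
            engine.accept(doc);
            return new ArrayList<>(doc.annotations);   // copy results out before clearing
        } finally {
            doc.reset();                               // runs even if the engine throws
        }
    }

    public static void main(String[] args) {
        ReusableDoc doc = new ReusableDoc();
        // Toy "engine": one annotation per whitespace-separated token.
        Consumer<ReusableDoc> engine = d -> {
            for (String token : d.getDocumentText().trim().split("\\s+")) {
                d.annotations.add(token);
            }
        };
        System.out.println(annotate(doc, engine, "chest pain"));
        System.out.println(annotate(doc, engine, "no acute distress"));
    }
}
```

With the real libraries, the stand-in container would be the shared JCas and the engine would be the pre-built analysis engines.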


- Peter
From: Harsh Mishra <mn...@gmail.com>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Wednesday, July 26, 2017 at 11:29 AM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Implementation Improvements for cTAKES on top of Spark

Hi Mike,

Thanks for posting this here. I just want to know if you have found answers to these questions.
I am facing the same issue and want to resolve the same questions.

If yes, please share them with me.

Thanks,

Harsha




Re: Implementation Improvements for cTAKES on top of Spark

Posted by Harsh Mishra <mn...@gmail.com>.

Hi Mike,

Thanks for posting this here. I just want to know if you have found answers to
these questions.
I am facing the same issue and want to resolve the same questions.

If yes, please share them with me.

Thanks,

Harsha




Re: Implementation Improvements for cTAKES on top of Spark

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.

Awesome, Mike, that is so cool to hear. Thank you for building it on top of our software system. Really awesome.


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


From: Michael Trepanier <mi...@metistream.com>
Reply-To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Date: Friday, August 11, 2017 at 11:08 AM
To: "user@ctakes.apache.org" <us...@ctakes.apache.org>
Subject: Re: Implementation Improvements for cTAKES on top of Spark

Chris,

We actually based our implementation on https://github.com/selinachu/SparkStreamingCTK (which I believe came from your team), but updated it for cTAKES 4.0. For those trying to do this, you'll likely run into issues tied to the lvg annotator, outlined here: https://issues.apache.org/jira/browse/CTAKES-445

The comments provide a solution (essentially, ensure the cTAKES resources zip is on your classpath). In a cluster environment, this means putting it on every node at that same classpath location. Fingers crossed that a future release of cTAKES can simply be packaged as a fat jar with no issues.

Mike


Re: Implementation Improvements for cTAKES on top of Spark

Posted by Michael Trepanier <mi...@metistream.com>.

Chris,

We actually based our implementation on
https://github.com/selinachu/SparkStreamingCTK (which I believe came from
your team), but updated it for cTAKES 4.0. For those trying to do this,
you'll likely run into issues tied to the lvg annotator, outlined here:
https://issues.apache.org/jira/browse/CTAKES-445

The comments provide a solution (essentially, ensure the cTAKES resources
zip is on your classpath). In a cluster environment, this means putting
it on every node at that same classpath location. Fingers crossed
that a future release of cTAKES can simply be packaged as a
fat jar with no issues.
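
Before submitting a job, it can help to confirm on each node that the required resources are actually visible to the class loader. The following is a small sketch using only ClassLoader.getResource; ClasspathCheck and missingResources are illustrative names, and the resource names are supplied by the caller, since the exact paths inside the cTAKES resources zip are not given in this thread.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

final class ClasspathCheck {
    /** Returns the resource names that the given class loader cannot find. */
    static List<String> missingResources(ClassLoader loader, String... names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            URL url = loader.getResource(name);   // null when the resource is not on the classpath
            if (url == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Resource names to check are passed on the command line.
        List<String> missing = missingResources(ClassLoader.getSystemClassLoader(), args);
        if (missing.isEmpty()) {
            System.out.println("All " + args.length + " resource(s) found on the classpath.");
        } else {
            System.out.println("Missing from the classpath: " + missing);
        }
    }
}
```

Running the same check on every node, with the same classpath the executors use, would show whether the resources were distributed consistently.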

Mike

On Fri, Jul 28, 2017 at 1:17 PM, Mattmann, Chris A (3010) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> FYI for interest, my JPL team implemented a prototype of this in 2015:
>
> https://www.mail-archive.com/user@ctakes.apache.org/msg01082.html

Re: Implementation Improvements for cTAKES on top of Spark

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.

FYI for interest, my JPL team implemented a prototype of this in 2015:

https://www.mail-archive.com/user@ctakes.apache.org/msg01082.html



 

On 7/28/17, 11:19 AM, "Michael Trepanier" <mi...@metistream.com> wrote:

    That's an excellent suggestion! In a Spark implementation, build the
    pipeline outside of the map function and pass the aed in as an input.
    Then just ensure the jcas object persists between each mapping
    iteration and leverage the reset method.
    

Re: Implementation Improvements for cTAKES on top of Spark

Posted by Michael Trepanier <mi...@metistream.com>.

That's an excellent suggestion! In a Spark implementation, build the
pipeline outside of the map function and pass the aed in as an input.
Then just ensure the jcas object persists between each mapping
iteration and leverage the reset method.
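
One way to picture "build outside of the map function" is a per-partition function: the expensive object is created once, then applied to every note in that partition. The sketch below is plain Java with no Spark or cTAKES dependency; PerPartition, annotatePartition, and the toy annotator are illustrative names. It is similar in shape to the functions passed to Spark's mapPartitions, which is an assumption about how one might wire it up rather than something stated in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

final class PerPartition {
    /**
     * Annotates every note in one partition. The annotator is built once
     * up front and then reused for each note, instead of once per note.
     */
    static List<String> annotatePartition(Iterator<String> notes,
                                          Supplier<Function<String, String>> buildAnnotator) {
        Function<String, String> annotator = buildAnnotator.get();  // expensive step, done once
        List<String> results = new ArrayList<>();
        while (notes.hasNext()) {
            results.add(annotator.apply(notes.next()));
        }
        return results;
    }

    public static void main(String[] args) {
        AtomicInteger builds = new AtomicInteger();
        // Toy factory: counts how often it is called and returns a trivial "annotator".
        Supplier<Function<String, String>> factory = () -> {
            builds.incrementAndGet();
            return note -> "len=" + note.length();
        };
        List<String> notes = Arrays.asList("chest pain", "fever", "no acute distress");
        System.out.println(annotatePartition(notes.iterator(), factory));
        System.out.println("annotator built " + builds.get() + " time(s)");
    }
}
```

Building on the executor side like this also avoids having to serialize the pipeline objects from the driver, which may or may not be possible depending on the classes involved.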

On Fri, Jul 28, 2017 at 11:11 AM, Abramowitsch, Peter
<pa...@hearst.com> wrote:
> About your second question on UMLS: you can build the pipeline once
> up front, at which point it verifies the license info, and then just
> reuse the pipeline on each call.

Re: Implementation Improvements for cTAKES on top of Spark

Posted by "Abramowitsch, Peter" <pa...@hearst.com>.

About your second question on UMLS: you can build the pipeline once
up front, at which point it verifies the license info, and then just
reuse the pipeline on each call.
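
If the pipeline cannot easily be passed around, one option is to build it lazily once per JVM and route every call through that single instance. The sketch below uses the standard Java initialization-on-demand holder idiom; PipelineHolder and its toy build() method are hypothetical, and the string transformation merely stands in for whatever slow setup (license check, model loading) the real pipeline performs.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.UnaryOperator;

final class PipelineHolder {
    /** Counts how many times the expensive build step has run (for illustration only). */
    static final AtomicInteger BUILD_COUNT = new AtomicInteger();

    private PipelineHolder() { }

    // The JVM initializes this nested class lazily, exactly once, and in a
    // thread-safe way, the first time Lazy.PIPELINE is accessed.
    private static final class Lazy {
        static final UnaryOperator<String> PIPELINE = build();
    }

    /** Stand-in for the slow one-time setup. */
    private static UnaryOperator<String> build() {
        BUILD_COUNT.incrementAndGet();
        return note -> note.trim().toLowerCase();
    }

    /** Every call reuses the same pipeline instance. */
    static String process(String note) {
        return Lazy.PIPELINE.apply(note);
    }

    public static void main(String[] args) {
        System.out.println(process("  Chest Pain "));
        System.out.println(process("FEVER"));
        System.out.println("pipeline built " + BUILD_COUNT.get() + " time(s)");
    }
}
```

Any credentials the setup needs would have to be in place before the first call to process, since that call is what triggers the build.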


