Posted to dev@ctakes.apache.org by Robert Spurrier <ro...@explorys.com> on 2013/09/09 15:25:35 UTC

Creating Runnable .JARs From A Subset of cTAKES Maven Modules

Good Morning!

I am trying to use cTAKES tools on a distributed computing platform. I would rather not ship the entire compiled cTAKES package (~1.5 GB) out to the shared cache when I only need a few annotators and their resources at a time.

I should first mention that I am not very familiar with Maven. I recently upgraded cTAKES from v2.5.0, where I was configuring smaller pipelines using Ant build files. That process was cumbersome, however, and I can appreciate the new modular Maven project layout. I just do not know how to use it effectively in a flexible way.

Does anyone have any advice on how I can package subsets of cTAKES annotator modules and their dependencies/resources, so I can create 'thinner' custom pipelines that are geared towards specific tasks?

For example, I might ultimately want a pipeline .JAR that contains the tools to extract Left Ventricular Ejection Fraction measurements from free text using regular expressions. In such a .JAR I would not need any of the dictionary resources or negation annotators, so they could be excluded.

It looks like I could create Maven assembly plugin descriptors to generate these custom .JARs, but I would like to see if anyone here has any advice/caveats before I pursue this route.


Thanks,
Robert Spurrier

Re: Creating Runnable .JARs From A Subset of cTAKES Maven Modules

Posted by Robert Spurrier <ro...@explorys.com>.
Hello Pei,

Since my usage is purely local for now, I created an external cTAKES copy
from SVN. Any customizations or changes I make are packaged and deployed
to my local Maven repo. I then use these local/custom dependencies in my
'Pipelines' project.
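
For illustration, a minimal sketch of how a pipeline module's pom.xml might
reference such a locally built artifact, assuming the customized build was
placed in the local repository (for example with "mvn install"). The version
string below is a hypothetical placeholder, not a version taken from this
thread.

```xml
<!-- Sketch only: the version is a hypothetical placeholder for a locally built artifact -->
<dependencies>
  <dependency>
    <groupId>org.apache.ctakes</groupId>
    <artifactId>ctakes-core</artifactId>
    <version>3.1.0-local</version>
  </dependency>
</dependencies>
```

Using a distinct version string is one way to keep a customized build from
being confused with the released artifact of the same name.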

Thanks,
Rob

On 10/1/13 2:48 PM, "Pei Chen" <ch...@apache.org> wrote:

>Rob,
>Are you pulling the existing cTAKES dependencies from Maven Central, or
>did you have to recreate the cTAKES modules in a local repo of some sort?
>It would be good to make cTAKES flexible enough to do what you described
>(hence separating out modules and resources into their own modules).
>--Pei



Re: Creating Runnable .JARs From A Subset of cTAKES Maven Modules

Posted by Pei Chen <ch...@apache.org>.
Rob,
Are you pulling the existing cTAKES dependencies from Maven Central, or
did you have to recreate the cTAKES modules in a local repo of some sort?
It would be good to make cTAKES flexible enough to do what you described
(hence separating out modules and resources into their own modules).
--Pei


On Tue, Oct 1, 2013 at 2:06 PM, Robert Spurrier <
robert.spurrier@explorys.com> wrote:

> It's been a while, but just to update in case anyone is watching this:
>
> My goal was to create a project full of annotators (both cTAKES and
> home-grown), and "cherry-pick" from them at will to create smaller
> pipelines that could be launched on a Hadoop grid via MapReduce.

Re: Creating Runnable .JARs From A Subset of cTAKES Maven Modules

Posted by Robert Spurrier <ro...@explorys.com>.
It's been a while, but just to update in case anyone is watching this:

My goal was to create a project full of annotators (both cTAKES and
home-grown), and "cherry-pick" from them at will to create smaller
pipelines that could be launched on a Hadoop grid via MapReduce.

My final setup consisted of two Maven aggregator projects, Annotators and
Pipelines.

Annotators is an aggregator project containing all of the annotators and
their resources.  I am essentially following the cTAKES layout for this
one. One annotator, one module.
E.g.:
Annotators
        -ctakes-core-annotator
                Pom.xml
        -ctakes-pos-tagger-annotator
                Pom.xml
        -custom-annotator-one
                Pom.xml
ParentPom.xml
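
A minimal sketch of what the parent POM for such an aggregator might look
like. The groupId, artifactId and version are hypothetical placeholders; the
module names are the ones from the layout above. Note that Maven looks for a
file named pom.xml in each directory by default.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: groupId, artifactId and version are hypothetical placeholders -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.example.nlp</groupId>
  <artifactId>annotators</artifactId>
  <version>1.0-SNAPSHOT</version>

  <!-- An aggregator produces no artifact of its own, so its packaging is "pom" -->
  <packaging>pom</packaging>

  <!-- Each module is a subdirectory containing its own pom.xml -->
  <modules>
    <module>ctakes-core-annotator</module>
    <module>ctakes-pos-tagger-annotator</module>
    <module>custom-annotator-one</module>
  </modules>
</project>
```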


Pipelines is another aggregator project containing the source code to
generate the pipelines, and the job files that utilize the pipelines on
the Hadoop grid (effectively serving as the input reader & CAS consumer).
Each pipeline is its own Maven module, and spits out a .jar that contains
all of the classes I need to run a UIMA-MapReduce job for that specific
pipeline. It also creates a resource archive (model files, etc.) that I
ship off to the Hadoop DistributedCache.
E.g.:
Pipelines
        -custom-base-pipeline
                Pom.xml
        -observation-pipeline
                Pom.xml
ParentPom.xml



Notes:
-I modified the cTAKES pom to put all of the descriptors, as well as the
classes, into each individual annotator jar, just so they can
conveniently be called by name. The "heavier" resources are put on the
DistributedCache.

-I create individual pipeline distributions in the Pipelines project by
using Maven's reactor options (-pl and -am) at the parent project level,
e.g. "mvn package -pl custom-base-pipeline -am". This builds
custom-base-pipeline along with all of the modules it depends on, and all
of the necessary resources.

-Each pipeline has its own Maven assembly to specify what should be
included with that pipeline's distribution and resources.
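
As a rough sketch of the last note, a per-pipeline assembly descriptor might
look like the following. This is not the descriptor used in the setup
described here; the id and the excluded artifact are illustrative
assumptions, chosen only to show how unneeded dependencies could be kept out
of a pipeline jar.

```xml
<!-- Sketch only: the id and the excluded artifact are illustrative assumptions -->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
  <id>pipeline</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>

  <dependencySets>
    <dependencySet>
      <!-- Unpack the module and its runtime dependencies into a single jar -->
      <outputDirectory>/</outputDirectory>
      <unpack>true</unpack>
      <scope>runtime</scope>
      <useProjectArtifact>true</useProjectArtifact>
      <!-- Leave out dependencies this pipeline does not need (groupId:artifactId) -->
      <excludes>
        <exclude>org.apache.ctakes:ctakes-dictionary-lookup-res</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
</assembly>
```

A second descriptor with a zip or tar.gz format could collect the model
files into the separate resource archive in the same way.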


The point of this was to maximize modularity, pipeline flexibility,
runtime speed, and to keep my pipeline jars as lightweight as possible.
Though it has many awesome features, I did not want to run every part of
cTAKES every time.


Cheers,
Rob










On 9/9/13 11:23 AM, "Robert Spurrier" <ro...@explorys.com> wrote:

>Actually after poking around in Maven documentation I think I have just
>figured out an approach I like.
>
>For each pipeline I wish to create, I will generate a Maven assembly
>descriptor. I will put each assembly file in the cTAKES root pom.xml.
>Hopefully this will create each pipeline for me when I run 'package'. This
>approach will still tie in nicely with the project object model/lifecycle
>of cTAKES, and generate all my custom jars as well.
>
>I will try it out and update this thread with the results
>
>Thanks,
>Rob



Re: Creating Runnable .JARs From A Subset of cTAKES Maven Modules

Posted by Robert Spurrier <ro...@explorys.com>.
Actually, after poking around in the Maven documentation, I think I have
just figured out an approach I like.

For each pipeline I wish to create, I will generate a Maven assembly
descriptor. I will reference each assembly file from the cTAKES root pom.xml.
Hopefully this will create each pipeline for me when I run 'package'. This
approach will still tie in nicely with the project object model/lifecycle
of cTAKES, and generate all my custom jars as well.
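
A minimal sketch of how this idea could be wired into a pom.xml, assuming
the maven-assembly-plugin is bound to the package phase. The descriptor
paths and the execution id are hypothetical, and whether this behaves as
hoped from a root aggregator POM depends on the project layout.

```xml
<!-- Sketch only: descriptor paths and the execution id are hypothetical -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <executions>
        <execution>
          <id>build-pipelines</id>
          <!-- Bind to the package phase so "mvn package" also builds the assemblies -->
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
          <configuration>
            <!-- One descriptor per pipeline; each one produces its own archive -->
            <descriptors>
              <descriptor>src/main/assembly/lvef-pipeline.xml</descriptor>
              <descriptor>src/main/assembly/discharge-summary-pipeline.xml</descriptor>
            </descriptors>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```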

I will try it out and update this thread with the results.

Thanks,
Rob


On 9/9/13 10:38 AM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

>Hi Robert,
>
>Are you planning a process to build everything from source?
>Or were you planning to have a build process that combines the ctakes-***
>jars with your custom application jars?
>
>--Pei



Re: Creating Runnable .JARs From A Subset of cTAKES Maven Modules

Posted by Robert Spurrier <ro...@explorys.com>.
Hello Pei,

My plan is to use cTAKES source and expand upon it with additional custom
annotator modules.

So I would like a build process where I can selectively define what parts
I want to use, and then compile a jar from source with just those items
(and all their dependencies of course). In the program that runs the
pipelines, I am thinking I will use uimaFIT to instantiate the objects I
need, which are located in the jar, and then pass text into those objects
for processing.

Essentially, I would like to have Maven build files for each of my custom
pipelines. For example, my end goal is to be able to compile my LVEF
pipeline, my Discharge Summary pipeline, and my Lab Results pipeline, all
from the same set of source modules, but generate three different jars
that contain only the resources I need for each respective pipeline.

It seems that the general cTAKES project object model is completely based
around creating the 'ctakes-clinical-pipeline'. So maybe it doesn't make sense
for me to try to shimmy my custom build files in with the cTAKES project.
What do you think?


Thanks,
Rob




On 9/9/13 10:38 AM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

>Hi Robert,
>
>Are you planning a process to build everything from source?
>Or were you planning to have a build process that combines the ctakes-***
>jars with your custom application jars?
>
>--Pei



RE: Creating Runnable .JARs From A Subset of cTAKES Maven Modules

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
Hi Robert,

Are you planning a process to build everything from source?
Or were you planning to have a build process that combines the ctakes-*** jars with your custom application jars?

--Pei
