Posted to dev@ctakes.apache.org by "Chen, Pei" <Pe...@childrens.harvard.edu> on 2012/09/25 00:10:18 UTC

ctakes-resources

During the updating of the project names, I left the resources folders intact.

What do folks think about the resources?

1)      Should we leave the resources as is (within each project)? Or

2)      Create a ctakes-resources module that contains all of the resource files?

For example:
ctakes-resources/
    ctakes-chunker/desc
    ctakes-chunker/models
    ctakes-dictionary-lookup/somelookupbinariesfolder


Either way, I think any startup script should be able to easily reference the resources and should be completely transparent to end-users.

--Pei

Re: ctakes-resources

Posted by Steven Bethard <st...@Colorado.EDU>.
On Sep 25, 2012, at 7:52 AM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
> What about those larger resources such as LVG or Dictionary Lookup?  In my opinion, they probably shouldn't even be versioned in the same spot as the java srcs... Perhaps an /incr directory outside of trunk?

My suggestion (2) already addressed this. Create a separate dictionary-lookup-resources (or whatever you want to name it) module alongside dictionary-lookup.

An added benefit of making the resources modules real Maven modules is that you can then write tests that test your code using the models, etc. from the resources modules. If you have them somewhere else, like "/incr", then it means your tests won't run once they're packaged in a jar file.

On Sep 25, 2012, at 8:50 AM, "Bleeker, Troy C." <Bl...@mayo.edu> wrote:
> For the best first experience, is it best to include an all-in-one package, including the LVG and dictionary lookup, even though it will take longer to download? If so, shouldn't they be co-located with the other resources?

Well, as long as we make the resource packages real Maven modules (as I suggest above), then it's nearly trivial to write another Maven module that just aggregates the other dependencies. So no, all your resources don't need to be in the same place if the only reason you're doing that is to make an easy single-file download.

On Sep 25, 2012, at 9:15 AM, "Coarr, Matt" <mc...@mitre.org> wrote:
> Could we have two top level directories in svn -- main-modules and
> resource-modules?

I wouldn't recommend this approach. Having both code modules and resource modules at the same level makes it easier to aggregate everything into a single-file download (which I gather we'll still want at some point).

On Sep 25, 2012, at 3:22 PM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
> I think Brandyn and others had a reasonable suggestion to use the name instead of location paths in those pesky descriptor xml files- Assuming they're in the classpath (placed in src/main/resources).

Awesome. Thanks!

Steve

RE: ctakes-resources

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
> 1. If referencing UIMA descriptors via name, make sure the descriptor name
> attribute and the filename are the same (not including the .xml).
> 2.  In DictionaryLookupAnnotatorUMLS.xml there is a reference to the UMLS
> hsqldb via file path.  To change it to class path, see the changes below

Thanks Brandyn and Co. for that excellent suggestion! It worked like a charm...
Since the resources and type systems have been automagically added to the classpath/jars by maven, we can reference them by name easily now without having to worry about those relative ../../.. locations.  We'll see if we could do the same for the AnalysisEngines in the future as well.

        <import name="org.apache.ctakes.assertion.types.TypeSystem"/>
        <import name="org.apache.ctakes.typesystem.types.TypeSystem"/>
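A minimal sketch of why the descriptor name and the filename need to match when importing by name. As I understand UIMA's import-by-name rule, the dotted name is turned into a path by replacing the periods with slashes and appending .xml, and that path is then looked up on the classpath. The class and method names below are hypothetical helpers for illustration, not part of UIMA or cTAKES.

```java
public class DescriptorNames {

    /**
     * Converts a dotted import name into the classpath location it is
     * expected to resolve to (assumption: dots become slashes, ".xml" is appended).
     */
    static String toResourcePath(String importName) {
        return importName.replace('.', '/') + ".xml";
    }

    /** Returns true if a descriptor with that name is visible to this class's class loader. */
    static boolean isOnClasspath(String importName) {
        ClassLoader loader = DescriptorNames.class.getClassLoader();
        return loader.getResource(toResourcePath(importName)) != null;
    }

    public static void main(String[] args) {
        String name = "org.apache.ctakes.typesystem.types.TypeSystem";
        System.out.println(toResourcePath(name));
        System.out.println("found on classpath: " + isOnClasspath(name));
    }
}
```

Running a check like this before starting a pipeline is one way to catch a descriptor that was renamed without its file, since the lookup would then simply fail to find it.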


> -----Original Message-----
> From: Kusenda, Brandyn J [mailto:brandyn-kusenda@uiowa.edu]
> Sent: Wednesday, September 26, 2012 11:05 AM
> To: ctakes-dev@incubator.apache.org
> Subject: RE: ctakes-resources
> 
> Hello,
> 
> I have some of the classpath related changes working on my local copy of
> cTAKES.  Here are a couple of things I discovered, that might save you some
> time.
> 
> 1. If referencing UIMA descriptors via name, make sure the descriptor name
> attribute and the filename are the same (not including the .xml).
> 
> 2.  In DictionaryLookupAnnotatorUMLS.xml there is a reference to the UMLS
> hsqldb via file path.  To change it to class path, see the changes below
> 
> Original using file path:
> --SNIP--
> <nameValuePair>
>        <name>URL</name>
>        <value>
>               <string>jdbc:hsqldb:file:../dictionary lookup/resources/lookup/umls2011ab/umls</string>
>        </value>
> </nameValuePair>
> --SNIP--
> 
> Updated to use class path if the lookup directory is in
> /src/main/resources/ (notice the use of 'res' as opposed to 'file' in the
> connection string)
> --SNIP--
>  <nameValuePair>
>         <name>URL</name>
>         <value>
>               <string>jdbc:hsqldb:res:lookup/umls2011ab/umls</string>
>        </value>
> </nameValuePair>
> --SNIP--
> 
> Thanks,
> Brandyn
> 
> ________________________________________
> From: Chen, Pei [Pei.Chen@childrens.harvard.edu]
> Sent: Tuesday, September 25, 2012 4:22 PM
> To: ctakes-dev@incubator.apache.org
> Subject: RE: ctakes-resources
> 
> I think Brandyn and others had a reasonable suggestion to use the name
> instead of location paths in those pesky descriptor xml files- Assuming
> they're in the classpath (placed in src/main/resources).
> I'll take a stab at this later this week if I get a chance.  But trunk *should*
> compile now in case others want to start working on it.  Apologies in advance
> if I broke someone's build at this point- it was a pretty large change...
> 
> Troy: If we do this correctly, this should be transparent (if not easier) to end-
> users.
> 
> > -----Original Message-----
> > From: Coarr, Matt [mailto:mcoarr@mitre.org]
> > Sent: Tuesday, September 25, 2012 11:16 AM
> > To: ctakes-dev@incubator.apache.org
> > Subject: Re: ctakes-resources
> >
> > Trying to distill some of the suggestions...
> >
> > Could we have two top level directories in svn -- main-modules and
> > resource-modules?  And then below resource-modules we could have
> > dictionary-lookup-resources, lvg-resources, etc.
> >
> > * (ctakes svn trunk)
> >   * main-modules
> >     * core
> >     * clinical-documents-pipeline
> >     * dictionary-lookup
> >     * lvg
> >   * resource-modules
> >     * dictionary-lookup-resources
> >     * lvg-resources
> >
> > This means you could just checkout main-modules and keep it pretty
> > small (mostly just code and descriptors).  Then you can checkout
> > "main-modules" a few times to have different working directories
> > without taking up too much space (perhaps a working copy for trunk
> > development, another for clean copy of latest release, a third for an
> > experimental branch, and perhaps a fourth for a branch to patch the
> > last release).  But they could all use the same resources.
> >
> > And ideally, most users and developers could pull those resources from
> > the maven repository as an artifact, unless they were working on
> > packaging up a new version of those resources.
> >
> > Matt


RE: ctakes-resources

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
Hi Brandyn,
This is a useful tip.  Thanks for that!
Just out of curiosity, how is cTAKES currently being used at uiowa?

--Pei



RE: ctakes-resources

Posted by "Kusenda, Brandyn J" <br...@uiowa.edu>.
Hello,

I have some of the classpath related changes working on my local copy of cTAKES.  Here are a couple of things I discovered that might save you some time.

1. If referencing UIMA descriptors via name, make sure the descriptor name attribute and the filename are the same (not including the .xml).

2.  In DictionaryLookupAnnotatorUMLS.xml there is a reference to the UMLS hsqldb via file path.  To change it to class path, see the changes below

Original using file path:
--SNIP--
<nameValuePair>
       <name>URL</name>
       <value>
              <string>jdbc:hsqldb:file:../dictionary lookup/resources/lookup/umls2011ab/umls</string>
       </value>
</nameValuePair>
--SNIP--

Updated to use class path if the lookup directory is in /src/main/resources/ (notice the use of 'res' as opposed to 'file' in the connection string)
--SNIP--
<nameValuePair>
       <name>URL</name>
       <value>
              <string>jdbc:hsqldb:res:lookup/umls2011ab/umls</string>
       </value>
</nameValuePair>
--SNIP--

Thanks,
Brandyn

________________________________________
From: Chen, Pei [Pei.Chen@childrens.harvard.edu]
Sent: Tuesday, September 25, 2012 4:22 PM
To: ctakes-dev@incubator.apache.org
Subject: RE: ctakes-resources

I think Brandyn and others had a reasonable suggestion to use the name instead of location paths in those pesky descriptor xml files- Assuming they're in the classpath (placed in src/main/resources).
I'll take a stab at this later this week if I get a chance.  But trunk *should* compile now in case others want to start working on it.  Apologies in advance if I broke someone's build at this point- it was a pretty large change...

Troy: If we do this correctly, this should be transparent (if not easier) to end-users.

> -----Original Message-----
> From: Coarr, Matt [mailto:mcoarr@mitre.org]
> Sent: Tuesday, September 25, 2012 11:16 AM
> To: ctakes-dev@incubator.apache.org
> Subject: Re: ctakes-resources
>
> Trying to distill some of the suggestions...
>
> Could we have two top level directories in svn -- main-modules and
> resource-modules?  And then below resource-modules we could have
> dictionary-lookup-resources, lvg-resources, etc.
>
> * (ctakes svn trunk)
>   * main-modules
>     * core
>     * clinical-documents-pipeline
>     * dictionary-lookup
>     * lvg
>   * resource-modules
>     * dictionary-lookup-resources
>     * lvg-resources
>
> This means you could just checkout main-modules and keep it pretty small
> (mostly just code and descriptors).  Then you can checkout "main-modules"
> a few times to have different working directories without taking up too
> much space (perhaps a working copy for trunk development, another for
> clean copy of latest release, a third for an experimental branch, and perhaps
> a fourth for a branch to patch the last release).  But they could all use the
> same resources.
>
> And ideally, most users and developers could pull those resources from the
> maven repository as an artifact, unless they were working on packaging up a
> new version of those resources.
>
> Matt


RE: ctakes-resources

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
I think Brandyn and others had a reasonable suggestion to use the name instead of location paths in those pesky descriptor xml files- Assuming they're in the classpath (placed in src/main/resources).
I'll take a stab at this later this week if I get a chance.  But trunk *should* compile now in case others want to start working on it.  Apologies in advance if I broke someone's build at this point- it was a pretty large change...

Troy: If we do this correctly, this should be transparent (if not easier) to end-users.



Re: ctakes-resources

Posted by "Coarr, Matt" <mc...@mitre.org>.
Trying to distill some of the suggestions...

Could we have two top level directories in svn -- main-modules and
resource-modules?  And then below resource-modules we could have
dictionary-lookup-resources, lvg-resources, etc.

* (ctakes svn trunk)
  * main-modules
    * core
    * clinical-documents-pipeline
    * dictionary-lookup
    * lvg
  * resource-modules
    * dictionary-lookup-resources
    * lvg-resources

This means you could just checkout main-modules and keep it pretty small
(mostly just code and descriptors).  Then you can checkout "main-modules"
a few times to have different working directories without taking up too
much space (perhaps a working copy for trunk development, another for
clean copy of latest release, a third for an experimental branch, and
perhaps a fourth for a branch to patch the last release).  But they could
all use the same resources.

And ideally, most users and developers could pull those resources from the
maven repository as an artifact, unless they were working on packaging up
a new version of those resources.

Matt


RE: ctakes-resources

Posted by "Bleeker, Troy C." <Bl...@mayo.edu>.
This discussion speaks to the install cases and the vote we took in the SDG meetings. I agree with putting the resources in one place. There is still a use case for the easiest adoption. That is, probably a first time user that wants to kick the tires. For the best first experience, is it best to include an all-in-one package, including the LVG and dictionary lookup, even though it will take longer to download? If so, shouldn't they be co-located with the other resources?

Thanks
Troy
-----Original Message-----
From: ctakes-dev-return-426-Bleeker.Troy=mayo.edu@incubator.apache.org [mailto:ctakes-dev-return-426-Bleeker.Troy=mayo.edu@incubator.apache.org] On Behalf Of Chen, Pei
Sent: Tuesday, September 25, 2012 8:52 AM
To: ctakes-dev@incubator.apache.org
Subject: RE: ctakes-resources

The downside of putting descriptors inside a jar/war is that you would need to recompile/re-jar for simple xml changes.  I think some might consider xml a configuration change and not necessarily a code change.  I wonder if that is the reason why, when you create a new UIMA nature, it puts descriptors in a desc directory (we could dig into it a little bit, but I could probably go either way on that).

What about those larger resources such as LVG or Dictionary Lookup?  In my opinion, they probably shouldn't even be versioned in the same spot as the java srcs... Perhaps an /incr directory outside of trunk?  I would imagine something like /dictionarylook2011ab, /dictionaryloookup2012aa doesn't need to be inside a particular release.

> -----Original Message-----
> From: Steven Bethard [mailto:steven.bethard@Colorado.EDU]
> Sent: Tuesday, September 25, 2012 3:20 AM
> To: ctakes-dev@incubator.apache.org
> Subject: Re: ctakes-resources
> 
> On Sep 25, 2012, at 12:10 AM, "Chen, Pei"
> <Pe...@childrens.harvard.edu> wrote:
> > During the updating of the project names, I left the resources folders
> > intact.
> >
> > What do folks think about the resources?
> >
> > 1)      Should we leave the resources as is (within each project)? Or
> >
> > 2)      Create a ctakes-resources module that contains all of the resource
> > files?
> >
> > For example:
> > ctakes-resources/
> >     ctakes-chunker/desc
> >     ctakes-chunker/models
> >     ctakes-dictionary-lookup/somelookupbinariesfolder
> 
> In an ideal world, all resources (descriptors, models, etc.) would be 
> distributed in the jar file and referenced via Java's 
> Class.getResource mechanism. In Maven, this means putting them into src/main/resources.
> 
> So my vote would be one of:
> 
> (1) In the src/main/resources directory of the ctakes-XXX project.
> 
> (2) Descriptors that don't refer to specific models would go into
> src/main/resources of ctakes-XXX, but the models and the descriptors
> that refer to them would go into src/main/resources of a separate
> ctakes-XXX-models project. This would allow people to exclude the
> models if they don't need them, which can make a lot of sense for the big models.
> 
> Steve

RE: ctakes-resources

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
The downside of putting descriptors inside a jar/war is that you would need to recompile/re-jar for simple xml changes.  I think some might consider xml a configuration change and not necessarily a code change.  I wonder if that is the reason why, when you create a new UIMA nature, it puts descriptors in a desc directory (we could dig into it a little bit, but I could probably go either way on that).

What about those larger resources such as LVG or Dictionary Lookup?  In my opinion, they probably shouldn't even be versioned in the same spot as the java srcs... Perhaps an /incr directory outside of trunk?  I would imagine something like /dictionarylook2011ab, /dictionaryloookup2012aa doesn't need to be inside a particular release.

> -----Original Message-----
> From: Steven Bethard [mailto:steven.bethard@Colorado.EDU]
> Sent: Tuesday, September 25, 2012 3:20 AM
> To: ctakes-dev@incubator.apache.org
> Subject: Re: ctakes-resources
> 
> On Sep 25, 2012, at 12:10 AM, "Chen, Pei"
> <Pe...@childrens.harvard.edu> wrote:
> > During the updating of the project names, I left the resources folders
> > intact.
> >
> > What do folks think about the resources?
> >
> > 1)      Should we leave the resources as is (within each project)? Or
> >
> > 2)      Create a ctakes-resources module that contains all of the resource
> > files?
> >
> > For example:
> > ctakes-resources/
> >     ctakes-chunker/desc
> >     ctakes-chunker/models
> >     ctakes-dictionary-lookup/somelookupbinariesfolder
> 
> In an ideal world, all resources (descriptors, models, etc.) would be
> distributed in the jar file and referenced via Java's Class.getResource
> mechanism. In Maven, this means putting them into src/main/resources.
> 
> So my vote would be one of:
> 
> (1) In the src/main/resources directory of the ctakes-XXX project.
> 
> (2) Descriptors that don't refer to specific models would go into
> src/main/resources of ctakes-XXX, but the models and the descriptors that
> refer to them would go into src/main/resources of a separate
> ctakes-XXX-models project. This would allow people to exclude the models
> if they don't need them, which can make a lot of sense for the big models.
> 
> Steve

RE: ctakes-resources

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
On September 25, 2012 3:20 AM, Steve Bethard wrote:

>In an ideal world, all resources (descriptors, models, etc.) would be distributed in the jar file and referenced via Java's Class.getResource mechanism. 

I agree with Steve: the Class.getResource method is a very useful tool.  Just as a point of clarity, it only requires that the resource be on the classpath, not that it be in the same jar as the code.  If you implement a custom classloader then you have to be careful with this ...
In the installation of cTAKES that I have, the jar is roughly 15 MB, while the resource directory is roughly 1.4 GB.  This could be a faulty installation - I don't know.
In my experience the code and resources should not be in a single jar, especially if not all resources are necessary for all "modules" of an app.  However, I come from a different background where this kind of thing mattered a lot more than it does for a (smaller) app like cTAKES. There, both code and resources were separated into jars, directories and files in such a manner that they were only installed and available as needed.
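A rough sketch of the point that a resource only has to be reachable through a class loader, not packaged in the same jar as the code. It writes a small file into a temporary directory standing in for an external resources folder, then finds it through a URLClassLoader rooted at that directory. The class name, directory name and file name are made up for the demonstration.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeparateResourceRoot {

    /** Creates a temporary directory that plays the role of a separate resources folder. */
    static Path createResourceRoot() throws IOException {
        Path root = Files.createTempDirectory("resources-demo");
        Path models = Files.createDirectories(root.resolve("models"));
        Files.write(models.resolve("demo.txt"), "demo model".getBytes(StandardCharsets.UTF_8));
        return root;
    }

    /** Looks a resource up by name through a class loader rooted at the given directory. */
    static URL find(Path resourceRoot, String name) throws IOException {
        // toUri() on an existing directory ends with '/', which URLClassLoader
        // treats as a directory of classes and resources rather than a jar.
        URL[] roots = { resourceRoot.toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(roots)) {
            return loader.getResource(name);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = createResourceRoot();
        System.out.println(find(root, "models/demo.txt"));
    }
}
```

The same lookup would work if the directory were simply listed on the application classpath at startup, which is one way to keep a small code jar and a large resource directory apart.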

Unfortunately I don't have enough knowledge of cTAKES to post an educated vote on this matter; I just wanted to put a little information in the thread in case it is useful to others.

Sean

Re: ctakes-resources

Posted by Steven Bethard <st...@Colorado.EDU>.
On Sep 25, 2012, at 12:10 AM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
> During the updating of the project names, I left the resources folders intact.
> 
> What do folks think about the resources?
> 
> 1)      Should we leave the resources as is (within each project)? Or
> 
> 2)      Create a ctakes-resources module that contains all of the resource files?
> 
> For example:
> ctakes-resources/
>     ctakes-chunker/desc
>     ctakes-chunker/models
>     ctakes-dictionary-lookup/somelookupbinariesfolder

In an ideal world, all resources (descriptors, models, etc.) would be distributed in the jar file and referenced via Java's Class.getResource mechanism. In Maven, this means putting them into src/main/resources.

So my vote would be one of:

(1) In the src/main/resources directory of the ctakes-XXX project.

(2) Descriptors that don't refer to specific models would go into src/main/resources of ctakes-XXX, but the models and the descriptors that refer to them would go into src/main/resources of a separate ctakes-XXX-models project. This would allow people to exclude the models if they don't need them, which can make a lot of sense for the big models.
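For readers less familiar with the Class.getResource mechanism mentioned above, here is a small sketch of reading a resource through it. A name with a leading slash is resolved from the root of the classpath, and a name without one is resolved relative to the package of the class used for the lookup. The helper class is hypothetical, and readAllBytes assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;

public class ResourceLoading {

    /**
     * Reads a classpath resource fully into memory.
     * Fails with an IOException instead of returning null when the resource is missing.
     */
    static byte[] read(Class<?> anchor, String name) throws IOException {
        try (InputStream in = anchor.getResourceAsStream(name)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + name);
            }
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        // String.class is used only because its class file is always available;
        // in a real module the anchor would be a class from that module.
        byte[] bytes = read(String.class, "String.class");
        System.out.println("read " + bytes.length + " bytes");
    }
}
```

Because anything under src/main/resources ends up on the classpath when Maven builds or tests a module, the same call works unchanged from an IDE, from the test phase, and from the packaged jar.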

Steve