Posted to user@uima.apache.org by Jens Grivolla <j+...@grivolla.net> on 2013/05/22 18:31:45 UTC

managing resources for UIMA?

Hi, while not strictly a UIMA issue, we have a problem that seems very 
relevant in the context of UIMA analysis engines: how to manage large 
binary resources such as trained models used by an AE, etc.

So far, we have managed to achieve a good separation between code 
development and the actual AEs, using Maven (and git for version 
control). An AE thus consists only of a POM referencing the code, the AE 
descriptor, and the resources used for the AE. The AE POMs are
configured to generate PEAR archives that include all dependencies and
resources.

At this point we have the code in git, along with each AE's POM and
descriptor, while we manually copy the resources into the project's
resources directory before running `mvn package` (and exclude those
resources from git). We're missing a way to manage those resources,
including versioning etc.

I'm guessing that this is a rather typical problem, so what solutions
do you use? We're thinking of keeping all resources in a Maven
repository as well (e.g. Artifactory) so we can reference them by a
unique identifier and version. This would also help us when moving to
more complex pipeline assemblies with uimaFIT, instead of generating
individual PEARs for each component, in order to create complete
packages.

Btw, we have only a few core developers, with most of the team made up
of linguists, so we want to make it easy for them to save versions of
the resources they create and to assemble AEs by just referencing the
algorithm and resource (e.g. "create a new OpenNLP POStagger using
spanish-pos-model.bin, version 1.2.3").
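
Ideally, assembling such a component would boil down to something like
the following uimaFIT sketch (just an illustration: the OpenNlpPosTagger
wrapper class and its PARAM_MODEL_LOCATION parameter are made-up names
here, and the model file is assumed to arrive on the classpath via a
versioned Maven dependency, e.g. org.example:spanish-pos-model:1.2.3):

// Sketch only: "OpenNlpPosTagger" and "PARAM_MODEL_LOCATION" are hypothetical names for a
// uimaFIT-enabled wrapper; the model JAR is assumed to be a normal versioned Maven dependency.
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.resource.ResourceInitializationException;

public class SpanishPipelineAssembly {
    public static AnalysisEngineDescription spanishPosTagger()
            throws ResourceInitializationException {
        // "Create a new OpenNLP POStagger using spanish-pos-model.bin, version 1.2.3":
        // the version is pinned by the Maven dependency, the descriptor only names the file.
        return createEngineDescription(OpenNlpPosTagger.class,
                OpenNlpPosTagger.PARAM_MODEL_LOCATION,
                "classpath:/models/spanish-pos-model.bin");
    }
}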

Thanks for sharing your experiences with this...

Jens


Re: managing resources for UIMA?

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
On 28.05.2013, at 18:26, Jens Grivolla <j+...@grivolla.net> wrote:

> Thanks for your pointers, I think this will be very helpful.
> 
> However, we use various components and wrappers from outside sources (such as OpenNLP) and don't always control how resources are loaded by the AE. Sometimes it might be sufficient to have the resource on the classpath (in a JAR referenced by Maven), and I think it is acceptable for us to hard-code the resource reference in the AE descriptor. Having a fully automatic resource resolution as in DKPro would be nice but is not an immediate necessity.

As mentioned in [1], we write all wrappers ourselves.

> We face a more complicated situation with some components that do not resolve resources from the classpath, e.g. C++ and Python components that need to reference the actual resource files. For those situations we would need to unpack the resource when building the AE and bundle it so it can be accessed with a file path. We currently manually copy the resource file into a /resources folder that gets included in the PEAR packages we generate (and thus unpacked when installing the PEAR).

As mentioned in [1], we implemented some helpers that extract resources to the file system when required. These helpers try to be smart, so that extraction is done only once per JVM, not over and over again per pipeline execution or per CAS.
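
The idea is roughly the following (a simplified sketch, not our actual
helper code; class and method names are made up):

// Sketch: extract a classpath resource to a temporary file once per JVM and
// reuse the extracted copy afterwards (names here are illustrative only).
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public final class ResourceExtractor {

    // Resources already extracted in this JVM, keyed by classpath location.
    private static final ConcurrentMap<String, Path> CACHE = new ConcurrentHashMap<>();

    private ResourceExtractor() {
    }

    /** Returns a file-system path for the given classpath resource, extracting it on first use. */
    public static Path extract(String classpathLocation) {
        return CACHE.computeIfAbsent(classpathLocation, location -> {
            try (InputStream in = ResourceExtractor.class.getResourceAsStream(location)) {
                if (in == null) {
                    throw new IllegalArgumentException("Resource not found: " + location);
                }
                Path target = Files.createTempFile("uima-resource-", ".bin");
                target.toFile().deleteOnExit();
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                return target;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}

A wrapper then calls something like
ResourceExtractor.extract("/models/spanish-pos-model.bin") once and
passes the resulting path to the wrapped tool.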

> This probably leads to a different problem, which is doing more advanced custom tasks with Maven, such as unpacking archives, moving files around when packaging, etc. I see that you actually use Ant to package your resource Maven artifacts, probably due to the difficulty of doing such tasks directly in Maven.

I did this in Ant because I wanted an easy way for users on any platform to be able to run these scripts themselves, without having to worry about Maven, Java or whatnot. Ant is platform-independent and more accepted among non-expert programmers than any other solution that came to my mind.

I didn't do it in Maven, not even calling the Ant scripts from Maven, because they are not part of the build. Resource JARs are built independently at irregular intervals and deployed to our Maven repository. We update these scripts when new models become available or when we upgrade to newer versions of the wrapped tools.

> Would Gradle be a better option, giving us Maven's dependency management while making it easier to define custom manipulations of resources for packaging? Is it possible to generate PEAR packages from Gradle? There are, afaik, plugins for Maven and Ant, so would we then call an Ant task from Gradle? (I'll split this part off as a more general thread about Gradle, I think.)

I don't know Gradle. It appears to be a declarative DSL for builds with the option of adding imperative sections and with the aim of having a more concise syntax than Maven. In principle, it doesn't seem to make things possible that wouldn't be possible with Maven or Maven + ant-run plugin or Maven + groovy plugin. 

In the environment where I am working, it was already hard enough to establish Maven as a standard, even though it comes with quite a good Eclipse POM editor and artifact search capabilities. Since Gradle appears much more flexible, I suspect that similarly convenient GUI editors for Gradle builds are not available. 

I'd instead try to push as much as possible of the difficult Maven configuration up into a parent POM and handle it with Maven profiles (this is how it is done in UIMA right now). That way, the module POMs can remain quite concise. Where necessary, I'd write a Maven plugin.

Cheers,

-- Richard

[1] http://markmail.org/thread/quufbip5cz5jrgb6

Re: managing resources for UIMA?

Posted by Jens Grivolla <j+...@grivolla.net>.
Thanks for your pointers, I think this will be very helpful.

However, we use various components and wrappers from outside sources
(such as OpenNLP) and don't always control how resources are loaded by
the AE. Sometimes it might be sufficient to have the resource on the
classpath (in a JAR referenced by Maven), and I think it is acceptable
for us to hard-code the resource reference in the AE descriptor.
Having a fully automatic resource resolution as in DKPro would be nice
but is not an immediate necessity.
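
For the classpath case, loading such a model is straightforward; a
minimal sketch using the standard OpenNLP API, assuming the model JAR
puts the file under /models/ on the classpath:

import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class ClasspathModelExample {
    public static POSTaggerME loadSpanishTagger() throws IOException {
        // The model ships inside a versioned JAR (a normal Maven dependency),
        // so it is read from the classpath; no file path is hard-coded.
        try (InputStream in = ClasspathModelExample.class
                .getResourceAsStream("/models/spanish-pos-model.bin")) {
            if (in == null) {
                throw new IOException("Model not found on the classpath");
            }
            return new POSTaggerME(new POSModel(in));
        }
    }
}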

We face a more complicated situation with some components that do not 
resolve resources from the classpath, e.g. C++ and Python components 
that need to reference the actual resource files. For those situations 
we would need to unpack the resource when building the AE and bundle it 
so it can be accessed with a file path. We currently manually copy the 
resource file into a /resources folder that gets included in the PEAR 
packages we generate (and thus unpacked when installing the PEAR).
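
For such wrappers we could at least resolve the bundled file through
the UimaContext before handing it to the external tool; a sketch (the
resource key "PosModelFile" is made up and would have to be declared
and bound in the AE descriptor):

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceAccessException;
import org.apache.uima.resource.ResourceInitializationException;

public class NativePosTaggerWrapper extends JCasAnnotator_ImplBase {

    private String modelPath;

    @Override
    public void initialize(UimaContext aContext) throws ResourceInitializationException {
        super.initialize(aContext);
        try {
            // Resolve the bundled model to an absolute file path so it can be handed
            // to an external (C++/Python) process that cannot read from the classpath.
            modelPath = aContext.getResourceFilePath("PosModelFile"); // hypothetical key
        } catch (ResourceAccessException e) {
            throw new ResourceInitializationException(e);
        }
        if (modelPath == null) {
            throw new ResourceInitializationException(
                    new IllegalStateException("PosModelFile does not resolve to a local file"));
        }
    }

    @Override
    public void process(JCas aJCas) throws AnalysisEngineProcessException {
        // Pass modelPath to the external tool here (command line, pipe, etc.).
    }
}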

This probably leads to a different problem, which is doing more
advanced custom tasks with Maven, such as unpacking archives, moving
files around when packaging, etc. I see that you actually use Ant to
package your resource Maven artifacts, probably due to the difficulty
of doing such tasks directly in Maven.

Would Gradle be a better option, giving us Maven's dependency
management while making it easier to define custom manipulations of
resources for packaging? Is it possible to generate PEAR packages from
Gradle? There are, afaik, plugins for Maven and Ant, so would we then
call an Ant task from Gradle? (I'll split this part off as a more
general thread about Gradle, I think.)

Thanks,
Jens


Re: managing resources for UIMA?

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
For some additional half-decent documentation on how and why we do the model packaging as we do it, see:

http://code.google.com/p/dkpro-core-asl/wiki/ResourceProviderAPI
http://code.google.com/p/dkpro-core-asl/wiki/PackagingResources

-- Richard


Re: managing resources for UIMA?

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
Hi Jens,

for DKPro Core [1], we have packaged a large number of models as Maven artifacts and host them in our public Maven repository [2]. We have had good experiences with this approach. Please do feel free to make use of these packages.

To package models, we use a set of ant-macros [3] in different Ant scripts that download the original models from their original sites and wrap them up in a standard layout and naming scheme; see [4] for an example.

Cheers,

-- Richard

[1] http://code.google.com/p/dkpro-core-asl
[2] http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-model-releases-local
[3] http://code.google.com/p/dkpro-core-asl/source/browse/built-ant-macros/trunk/ant-macros.xml
[4] https://dkpro-core-gpl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-gpl/trunk/de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl/src/scripts/build.xml
