Posted to user@uima.apache.org by Jens Grivolla <j+...@grivolla.net> on 2013/05/22 18:31:45 UTC
managing resources for UIMA?
Hi, while not strictly a UIMA issue, we have a problem that seems very
relevant in the context of UIMA analysis engines: how to manage large
binary resources, such as the trained models used by an AE.
So far, we have managed to achieve a good separation between code
development and the actual AEs, using Maven (and git for version
control). An AE thus consists only of a POM referencing the code, the AE
descriptor, and the resources used for the AE. The AE POMs are
configured to generate PEAR archives that include all dependencies and
resources.
At this point we have the code in git, along with the AEs' POMs and
descriptors, while we manually copy the resources into place before
running `mvn package` (and exclude those resources from git). What we
are missing is a way to manage those resources, including versioning.
I'm guessing that this is a rather typical problem, so what solutions do
you use? We're thinking of keeping all resources in Maven as well (e.g.
in Artifactory) so we can reference them with a unique identifier and
version. This would also help us when moving to more complex pipeline
assemblies using uimaFIT, instead of generating individual PEARs for
each component, in order to create complete packages.
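As a sketch, a model published that way could be referenced like any other dependency. The coordinates below are hypothetical, just to illustrate the idea:

```xml
<!-- Hypothetical coordinates: a POS model packaged as a versioned JAR
     and deployed to an internal repository such as Artifactory -->
<dependency>
  <groupId>org.example.models</groupId>
  <artifactId>opennlp-pos-spanish</artifactId>
  <version>1.2.3</version>
</dependency>
```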
By the way, we are just a few core developers, with most of the team
made up of linguists, so we want to make it easy for them to save
versions of resources they create and to assemble AEs by just
referencing the algorithm and resource (e.g. "create a new OpenNLP
POStagger using spanish-pos-model.bin, version 1.2.3").
Thanks for sharing your experiences with this...
Jens
Re: managing resources for UIMA?
Posted by Richard Eckart de Castilho <ri...@gmail.com>.
On 28.05.2013, at 18:26, Jens Grivolla <j+...@grivolla.net> wrote:
> Thanks for your pointers, I think this will be very helpful.
>
> However, we use various components and wrappers from outside sources (such as OpenNLP) and don't always control how resources are loaded by the AE. Sometimes it might be sufficient to have the resource on the classpath (in a JAR referenced by Maven), and I think it is acceptable for us to hard-code the resource reference in the AE descriptor. Having fully automatic resource resolution as in DKPro would be nice but is not an immediate necessity.
As mentioned in [1], we write all wrappers ourselves.
> We face a more complicated situation with some components that do not resolve resources from the classpath, e.g. C++ and Python components that need to reference the actual resource files. For those situations we would need to unpack the resource when building the AE and bundle it so it can be accessed with a file path. We currently manually copy the resource file into a /resources folder that gets included in the PEAR packages we generate (and thus unpacked when installing the PEAR).
As mentioned in [1], we implemented some helpers that extract resources to the file system when required. These helpers try to be smart, so that extraction is done only once per JVM, not over and over again per pipeline execution or per CAS.
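Such a helper could look roughly like this. A minimal sketch only; the class and method names are mine, not DKPro's actual API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of a once-per-JVM resource extractor: copies a classpath
 * resource to a temporary file and caches the resulting path, so
 * repeated calls (per pipeline run, per CAS) reuse the same copy.
 */
public final class ResourceExtractor {

    // Cache shared across the whole JVM: resource name -> extracted file
    private static final Map<String, Path> CACHE = new ConcurrentHashMap<>();

    public static Path extract(String resourcePath) throws IOException {
        Path cached = CACHE.get(resourcePath);
        if (cached != null) {
            return cached; // already extracted in this JVM
        }
        try (InputStream in = ResourceExtractor.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + resourcePath);
            }
            Path tmp = Files.createTempFile("extracted-", ".bin");
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            // If another thread extracted concurrently, keep the first copy
            Path prev = CACHE.putIfAbsent(resourcePath, tmp);
            return prev != null ? prev : tmp;
        }
    }
}
```

A C++ or Python component can then simply be handed `extract(...).toString()` as a plain file path.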
> This probably leads to a different problem, which is doing more advanced custom tasks with Maven, such as unpacking archives, moving files around when packaging, etc. I see that you actually use Ant to package your resource Maven artifacts, probably due to the difficulty of doing such task directly in Maven.
I did this in Ant because I wanted an easy way for users on any platform to run these scripts themselves, without having to worry about Maven, Java, or whatnot. Ant is platform-independent and more accepted among non-expert programmers than any other solution that came to my mind.
I didn't do it in Maven, nor even call the Ant scripts from Maven, because they are not part of the build. Resource JARs are built independently at irregular intervals and deployed to our Maven repository. We update these scripts when new models become available or when we upgrade to newer versions of the wrapped tools.
> Would Gradle be a better option to have the dependency management from Maven while being able to more easily define custom manipulations of resources to help with packaging? Is it possible to generate PEAR packages from Gradle? There are afaik plugins for Maven and Ant, so would we then reference an Ant task from Gradle? (I'll split this part off as a more general thread about Gradle, I think.)
I don't know Gradle. It appears to be a declarative DSL for builds with the option of adding imperative sections and with the aim of having a more concise syntax than Maven. In principle, it doesn't seem to make things possible that wouldn't be possible with Maven or Maven + ant-run plugin or Maven + groovy plugin.
In the environment where I am working, it was already hard enough to establish Maven as a standard, even though it comes with quite a good Eclipse POM editor and artifact search capabilities. Since Gradle appears much more flexible, I suspect that similarly convenient GUI editors for Gradle builds are not available.
I'd instead try to push as much as possible of the difficult Maven configuration up into a parent POM and handle it with Maven profiles (this is done in UIMA right now). That way, the module POMs can remain quite concise. Where necessary, I'd write a Maven plugin.
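A hypothetical parent-POM fragment along those lines, putting PEAR packaging behind a profile so that module POMs only need to activate it (the UIMA PearPackagingMavenPlugin exists; its configuration is omitted here):

```xml
<profile>
  <id>pear</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.uima</groupId>
        <artifactId>PearPackagingMavenPlugin</artifactId>
        <!-- plugin configuration omitted -->
      </plugin>
    </plugins>
  </build>
</profile>
```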
Cheers,
-- Richard
[1] http://markmail.org/thread/quufbip5cz5jrgb6
Re: managing resources for UIMA?
Posted by Jens Grivolla <j+...@grivolla.net>.
Thanks for your pointers, I think this will be very helpful.
However, we use various components and wrappers from outside sources
(such as OpenNLP) and don't always control how resources are loaded by
the AE. Sometimes it might be sufficient to have the resource on the
classpath (in a JAR referenced by Maven), and I think it is acceptable
for us to hard-code the resource reference in the AE descriptor.
Having fully automatic resource resolution as in DKPro would be nice
but is not an immediate necessity.
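For illustration, such a hard-coded reference in the descriptor could look like the fragment below. The parameter name and the classpath-style location are assumptions; whether they work depends on how the wrapped component resolves them:

```xml
<configurationParameterSettings>
  <nameValuePair>
    <name>modelLocation</name>
    <value>
      <string>classpath:/models/spanish-pos-model.bin</string>
    </value>
  </nameValuePair>
</configurationParameterSettings>
```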
We face a more complicated situation with some components that do not
resolve resources from the classpath, e.g. C++ and Python components
that need to reference the actual resource files. For those situations
we would need to unpack the resource when building the AE and bundle it
so it can be accessed with a file path. We currently manually copy the
resource file into a /resources folder that gets included in the PEAR
packages we generate (and thus unpacked when installing the PEAR).
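That manual copy step could presumably be automated. A sketch using the standard maven-dependency-plugin to unpack model JARs into the folder the PEAR build includes (the model group id is hypothetical):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>unpack-models</id>
      <phase>generate-resources</phase>
      <goals>
        <goal>unpack-dependencies</goal>
      </goals>
      <configuration>
        <!-- only unpack our (hypothetical) model artifacts -->
        <includeGroupIds>org.example.models</includeGroupIds>
        <outputDirectory>${project.build.directory}/resources</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```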
This probably leads to a different problem, which is doing more advanced
custom tasks with Maven, such as unpacking archives, moving files around
when packaging, etc. I see that you actually use Ant to package your
resource Maven artifacts, probably due to the difficulty of doing such
tasks directly in Maven.
Would Gradle be a better option, giving us the dependency management
from Maven while making it easier to define custom manipulations of
resources to help with packaging? Is it possible to generate PEAR
packages from Gradle? There are, AFAIK, plugins for Maven and Ant, so
would we then reference an Ant task from Gradle? (I'll split this part
off into a more general thread about Gradle, I think.)
Thanks,
Jens
On 05/22/2013 06:45 PM, Richard Eckart de Castilho wrote:
> For some additional half-decent documentation on how and why we do the model packaging as we do it, see:
>
> http://code.google.com/p/dkpro-core-asl/wiki/ResourceProviderAPI
> http://code.google.com/p/dkpro-core-asl/wiki/PackagingResources
>
> -- Richard
Re: managing resources for UIMA?
Posted by Richard Eckart de Castilho <ri...@gmail.com>.
For some additional half-decent documentation on how and why we do the model packaging as we do it, see:
http://code.google.com/p/dkpro-core-asl/wiki/ResourceProviderAPI
http://code.google.com/p/dkpro-core-asl/wiki/PackagingResources
-- Richard
On 22.05.2013 at 18:43, Richard Eckart de Castilho <ri...@gmail.com> wrote:
> Hi Jens,
>
> for DKPro Core [1], we have packaged a large number of models as Maven artifacts and host them in our public Maven repository [2]. We have had good experience with this approach. Please do feel free to make use of these packages.
>
> To package models, we use a set of Ant macros [3] in various Ant scripts that download the original models from their original sites and wrap them up in a standard layout and naming scheme; see for example [4].
>
> Cheers,
>
> -- Richard
>
> [1] http://code.google.com/p/dkpro-core-asl
> [2] http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-model-releases-local
> [3] http://code.google.com/p/dkpro-core-asl/source/browse/built-ant-macros/trunk/ant-macros.xml
> [4] https://dkpro-core-gpl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-gpl/trunk/de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl/src/scripts/build.xml
Re: managing resources for UIMA?
Posted by Richard Eckart de Castilho <ri...@gmail.com>.
Hi Jens,
for DKPro Core [1], we have packaged a large number of models as Maven artifacts and host them in our public Maven repository [2]. We have had good experience with this approach. Please do feel free to make use of these packages.
To package models, we use a set of Ant macros [3] in various Ant scripts that download the original models from their original sites and wrap them up in a standard layout and naming scheme; see for example [4].
Cheers,
-- Richard
[1] http://code.google.com/p/dkpro-core-asl
[2] http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-model-releases-local
[3] http://code.google.com/p/dkpro-core-asl/source/browse/built-ant-macros/trunk/ant-macros.xml
[4] https://dkpro-core-gpl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-gpl/trunk/de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl/src/scripts/build.xml