Posted to user@uima.apache.org by John David Osborne <oz...@uab.edu> on 2012/04/27 21:35:35 UTC

Running UIMA on a cluster

Hello,

Is there any best practice documentation out there for running
UIMA/UIMA-AS on a cluster? I have only run single machine instances of
UIMA (mostly through Eclipse) and have not investigated the ability to
perform multiple simultaneous analyses in order to process large document
collections.

It's not clear to me how UIMA would operate in a cluster environment. Do
people really do message passing using JMS? I'm guessing this is the case,
as I am not seeing references to MPICH, SGE or other things I am more used to.
I've looked through some of the documentation (including all the Overview
& SDK setup) but am not finding anything helpful. I've also tried googling
but I am not getting much except this:
http://comments.gmane.org/gmane.comp.apache.uima.general/2131 which makes
me think it is possible.

Currently, with my level of confusion, I think it may be best to run
multiple instances of UIMA on a cluster, submit jobs processing
discrete document sets to our SGE cluster, and ignore whatever scaling
features are actually present in UIMA, since the document processing I plan
to do is data parallel.
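
A minimal sketch of that data-parallel approach, assuming an SGE array job
(submitted with something like "qsub -t 1-4 job.sh"). The file docs.txt and
the function select_batch are hypothetical names for illustration, and the
echo line stands in for whatever actually launches the UIMA pipeline:

```shell
# Sketch only: split a document list across SGE array tasks.
# Assumed (not from the thread): docs.txt holds one input path per line.

# Print the lines of file $1 that belong to task $2 out of $3 tasks,
# assigned round robin (task 1 gets lines 1, 1+n, 1+2n, ...).
select_batch() {
    awk -v id="$2" -v n="$3" '(NR - 1) % n == id - 1' "$1"
}

# Inside an array job, SGE sets SGE_TASK_ID to this task's index and
# SGE_TASK_LAST to the last index; outside one, SGE_TASK_ID is unset
# or "undefined", so this part is skipped.
if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
    select_batch docs.txt "$SGE_TASK_ID" "$SGE_TASK_LAST" |
    while IFS= read -r doc; do
        echo "would process: $doc"   # placeholder for the real pipeline call
    done
fi
```

Round-robin assignment keeps the batch sizes within one document of each
other without needing to know the collection size in advance.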

-John

Re: Running UIMA on a cluster

Posted by Eric Riebling <er...@cs.cmu.edu>.

I'd like to point out also that the best UIMA-AS documentation is actually
not where one might first go looking (in docs, html, or pdf files) but rather
the README file at the top level of the UIMA-AS distribution.  That's where
to find the good stuff. :)

On 4/27/2012 3:47 PM, Thomas Ginter wrote:
> UIMA-AS was created to handle the message passing, job distribution, etc.  Try going through the UIMA-AS documentation first.  We have had pretty good success using it here.
>
> Thanks,
>
> Thomas Ginter
> 801-448-7676
> thomas.ginter@utah.edu

-- 
Eric Riebling                 Senior Systems Programmer
http://ericriebling.com       CMU Language Technologies Institute

Re: Running UIMA on a cluster

Posted by Thomas Ginter <th...@utah.edu>.

UIMA-AS was created to handle the message passing, job distribution, etc.  Try going through the UIMA-AS documentation first.  We have had pretty good success using it here.

Thanks,

Thomas Ginter
801-448-7676
thomas.ginter@utah.edu

Re: Running UIMA on a cluster

Posted by Eddie Epstein <ea...@gmail.com>.

The UIMA-AS framework doesn't have any support for deploying processes
across a cluster. SGE could be used to play that role.

Because UIMA-AS services register with a JMS broker, and the UIMA-AS client
communicates with these services via the broker, it doesn't matter
where they run.

Eddie

On Fri, Apr 27, 2012 at 5:48 PM, John David Osborne <oz...@uab.edu> wrote:
> Very helpful responses from you and Thomas, thanks guys!  The README in
> the 2.3.1 documentation is very useful.
>
> I'm still confused about one thing, and I am dreading the answer. How does
> UIMA-AS play with pre-existing tools like SGE? I'm under the impression
> that it is basically going to ignore SGE and try to start jobs on the
> compute nodes by itself. Is everybody running UIMA on dedicated clusters,
> more or less?
>
> I'm in a situation where I'm looking to run on a cluster shared pretty
> much University-wide, for which SGE is the main (probably only) job
> submission method.
>
>  -John

Re: Running UIMA on a cluster

Posted by Thomas Ginter <th...@utah.edu>.

UIMA-AS is still at 2.3.1. For that reason we have not upgraded our core to 2.4.0 yet.

You are so right about the README though.

Thanks,

Tom

Sent from my iPhone

On Apr 27, 2012, at 2:21 PM, "Eric Riebling" <er...@cs.cmu.edu> wrote:

> Oops, sorry, spoke too soon.  It's not in README any more, as of 2.4.0.  D'oh!

Re: Running UIMA on a cluster

Posted by Eric Riebling <er...@cs.cmu.edu>.

Oops, sorry, spoke too soon.  It's not in README any more, as of 2.4.0.  D'oh!

Re: Running UIMA on a cluster

Posted by John David Osborne <oz...@uab.edu>.

Very helpful responses from you and Thomas, thanks guys!  The README in
the 2.3.1 documentation is very useful.

I'm still confused about one thing, and I am dreading the answer. How does
UIMA-AS play with pre-existing tools like SGE? I'm under the impression
that it is basically going to ignore SGE and try to start jobs on the
compute nodes by itself. Is everybody running UIMA on dedicated clusters,
more or less?

I'm in a situation where I'm looking to run on a cluster shared pretty
much University-wide, for which SGE is the main (probably only) job
submission method.

 -John



On 4/27/12 2:59 PM, "Eric Riebling" <er...@cs.cmu.edu> wrote:

>We've had success deploying annotators on cluster nodes (using UIMA-AS
>deployment descriptors) registered to a UIMA-AS broker running on the
>head node.  If the cluster uses shared data folders, you only need to
>put the code in one place for it to 'appear' on all nodes.
>
>Then we run a collection reader and CAS consumer on the head node,
>with the amount of scale-out specified on the command line of
>runRemoteAsyncAE.sh, something like this:
>
>   $UIMA_HOME/bin/runRemoteAsyncAE.sh -c (path.to)XmiCollectionReader.xml \
>       tcp://localhost:61616 (name of deployed service) \
>       -p (number of nodes) -o output_foldername
>
>With enough scale-out, the limiting factor becomes the speed of the CR
>and CC on the head node.  This is the briefest explanation I can give,
>not sure it's a 'best practice' but it works. :)

Re: Running UIMA on a cluster

Posted by Eric Riebling <er...@cs.cmu.edu>.

We've had success deploying annotators on cluster nodes (using UIMA-AS
deployment descriptors) registered to a UIMA-AS broker running on the
head node.  If the cluster uses shared data folders, you only need to
put the code in one place for it to 'appear' on all nodes.

Then we run a collection reader and CAS consumer on the head node,
with the amount of scale-out specified on the command line of
runRemoteAsyncAE.sh, something like this:

   $UIMA_HOME/bin/runRemoteAsyncAE.sh -c (path.to)XmiCollectionReader.xml \
       tcp://localhost:61616 (name of deployed service) \
       -p (number of nodes) -o output_foldername

With enough scale-out, the limiting factor becomes the speed of the CR
and CC on the head node.  This is the briefest explanation I can give,
not sure it's a 'best practice' but it works. :)
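
That bottleneck can be illustrated with a simple model (an assumption for
illustration, not a measurement): overall throughput is capped by the
slowest of the collection reader, the combined annotator nodes, and the
CAS consumer. The function name and all rates below are hypothetical:

```shell
# Sketch only: a simple pipeline model, not a benchmark.
# Throughput (documents/sec) is the minimum of:
#   reader rate, node count * per-node annotator rate, consumer rate.
pipeline_throughput() {
    # $1 = CR rate, $2 = per-node annotator rate, $3 = node count, $4 = CC rate
    awk -v cr="$1" -v ae="$2" -v n="$3" -v cc="$4" 'BEGIN {
        t = ae * n              # combined annotator capacity
        if (cr < t) t = cr      # capped by the collection reader
        if (cc < t) t = cc      # capped by the CAS consumer
        print t
    }'
}

pipeline_throughput 100 5 4 80     # prints 20: the annotators limit throughput
pipeline_throughput 100 5 40 80    # prints 80: the head-node CC limits throughput
```

Under this model, adding nodes beyond the point where the second case
applies gives no further speed-up.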

-- 
Eric Riebling                 Senior Systems Programmer
http://ericriebling.com       CMU Language Technologies Institute