Posted to user@uima.apache.org by Marshall Schor <ms...@schor.com> on 2013/07/18 16:49:10 UTC

Re: Using uima pipeline as an API

On 7/18/2013 3:22 AM, swirl wrote:
> Hi,
>
> I have this particular requirement for an API that we wrap over a UIMA
> pipeline.
>
>  public List<String> analyse(String inputFolderPath, String modelName);
>
> This method is supposed to accept a collection of files (residing in the
> inputFolderPath), run each file (as a CAS) through a pipeline of UIMA AEs, and
> return the results (one String per CAS).
>
> To return the strings, I will need to somehow access the CAS after the AEs 
> have finished their job and transform/extract whatever inside the CAS into 
> the string that I will return to the caller of this method.
>
> But if I run the AEs using SimplePipeline.runPipeline(),
> how can I get hold of the CASes coming out of the AEs?
> Do I attach a CAS Consumer at the tail of the pipeline and read the CAS
> contents at that point? Then I would append each result to the List<String>
> that I constructed at the beginning.

Please look at the code in the uimaj-examples project, in the class
org.apache.uima.examples.ExampleApplication.  It does exactly what you
describe, and you can see how it works.  For more details, read the chapter
in our documentation about using UIMA via its APIs:
http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.application
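
In case a concrete sketch helps, the pattern that example uses is roughly the
following.  (The descriptor file names, the "InputDirectory" parameter, and the
extractResult method below are placeholders; adapt them to your own reader,
pipeline, and result format.)

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.collection.CollectionReader;
    import org.apache.uima.util.XMLInputSource;

    public class AnalyseFolder {

      public List<String> analyse(String inputFolderPath, String modelName)
          throws Exception {
        // Placeholder descriptors; modelName could select the AE descriptor.
        CollectionReader reader = UIMAFramework.produceCollectionReader(
            UIMAFramework.getXMLParser().parseCollectionReaderDescription(
                new XMLInputSource(new File("FileSystemCollectionReader.xml"))));
        reader.setConfigParameterValue("InputDirectory", inputFolderPath);
        reader.reconfigure();

        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(
            UIMAFramework.getXMLParser().parseAnalysisEngineDescription(
                new XMLInputSource(new File("MyPipeline.xml"))));

        List<String> results = new ArrayList<String>();
        CAS cas = ae.newCAS();              // one CAS, reused for every file
        while (reader.hasNext()) {
          reader.getNext(cas);              // fill the CAS with the next document
          ae.process(cas);                  // run the pipeline on it
          results.add(extractResult(cas));  // pull your result String out of the CAS
          cas.reset();                      // recycle the CAS for the next document
        }
        reader.close();
        ae.destroy();
        return results;
      }

      // Placeholder: turn the annotations in the CAS into the String you return.
      private String extractResult(CAS cas) {
        return cas.getDocumentText();
      }
    }

Note that one CAS is created and reset between documents, so the CAS itself does
not grow with the size of the collection; only the result strings you
accumulate do.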

>
> If so, is this scalable? 
> If I have thousands of files in the inputFolderPath, and if the strings are 
> very large, would I run out of memory soon?
> Is there a more scalable way to do this?

UIMA has multiple ways of scaling.  For processing thousands of documents,
depending on how much processing each one needs, you may want to scale this
out over hundreds of machines running in a cluster.

UIMA-AS extends UIMA to support this kind of deployment; see
http://uima.apache.org/doc-uimaas-what.html. 
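
For example, with the UIMA-AS client API your application sends each CAS to a
remote service instead of processing it in the same JVM.  A rough sketch (the
broker URL and queue name below are placeholders for your own deployment):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.uima.aae.client.UimaAsynchronousEngine;
    import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
    import org.apache.uima.cas.CAS;

    public class RemoteAnalyse {
      public static void main(String[] args) throws Exception {
        UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();

        Map<String, Object> ctx = new HashMap<String, Object>();
        ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://broker-host:61616"); // placeholder
        ctx.put(UimaAsynchronousEngine.ENDPOINT, "myAnalysisQueue");          // placeholder
        ctx.put(UimaAsynchronousEngine.CasPoolSize, 5);
        engine.initialize(ctx);

        CAS cas = engine.getCAS();        // take a CAS from the client's pool
        cas.setDocumentText("Some input text.");
        engine.sendAndReceiveCAS(cas);    // synchronous round trip to the service
        // ... read your results out of the CAS here ...
        cas.release();                    // return the CAS to the pool

        engine.stop();
      }
    }

sendAndReceiveCAS is the simple synchronous form; the client also supports
asynchronous sends with a callback listener, which is what you would use to
keep many remote service instances busy at once.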

A new component (not yet released, but soon (!)) called DUCC (Distributed UIMA
Cluster Controller) supports managing clusters of machines for UIMA "jobs".  It
schedules these with a notion of "fairness" and pays attention to memory
requirements, so that the jobs it schedules simultaneously on a single machine
fit within that machine's RAM, avoiding paging/swapping.  It also comes with a
web command/control interface giving various overviews of the cluster and the
jobs running on it.

-Marshall