Posted to user@uima.apache.org by swirl <sw...@yahoo.com> on 2013/07/18 09:22:29 UTC

Using uima pipeline as an API

Hi,

I have this particular requirement for an API that we wrap over a UIMA 
pipeline.

 public List<String> analyse(String inputFolderPath, String modelName);

This method is supposed to accept a collection of files (residing in the 
inputFolderPath), run the files (as CAS) through a pipeline of UIMA AEs, and 
return the results (one String per CAS).

To return the strings, I will need to somehow access the CAS after the AEs 
have finished their job and transform/extract whatever inside the CAS into 
the string that I will return to the caller of this method.

But if I run the AEs using SimplePipeline.runPipeline(), 
how can I get hold of the CASes that are coming out of the AEs?
Do I attach a CAS Consumer at the tail of the pipeline and read the CAS 
contents at that point? Then I append each result to the List<String> that I 
constructed at the beginning.

If so, is this scalable? 
If I have thousands of files in the inputFolderPath, and if the strings are 
very large, would I run out of memory soon?
Is there a more scalable way to do this?

Re: Using uima pipeline as an API

Posted by Marshall Schor <ms...@schor.com>.

On 7/18/2013 3:22 AM, swirl wrote:
> Hi,
>
> I have this particular requirement for an API that we wrap over a UIMA 
> pipeline.
>
>  public List<String> analyse(String inputFolderPath, String modelName);
>
> This method is supposed to accept a collection of files (residing in the 
> inputFolderPath), run the files (as CAS) through a pipeline of UIMA AEs, and 
> return the results (one String per CAS).
>
> To return the strings, I will need to somehow access the CAS after the AEs 
> have finished their job and transform/extract whatever inside the CAS into 
> the string that I will return to the caller of this method.
>
> But if I run the AEs using SimplePipeline.runPipeline(), 
> how can I get hold of the CASes that are coming out of the AEs?
> Do I attach a CAS Consumer at the tail of the pipeline and read the CAS 
> contents at that point? Then I append each result to the List<String> that I 
> constructed at the beginning.

Please look at the code in the uimaj-examples project, in the class
org.apache.uima.examples.ExampleApplication.java.  This does exactly what you
describe, and you can see how it does it.  For more details, you can read the
chapter in our documentation about using UIMA via APIs, here:
http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.application

>
> If so, is this scalable? 
> If I have thousands of files in the inputFolderPath, and if the strings are 
> very large, would I run out of memory soon?
> Is there a more scalable way to do this?

UIMA has multiple ways of scaling.  For processing thousands of things,
depending on the amount of processing you need to do per thing, you may want to
scale this out over hundreds of machines, running in a cluster.

UIMA-AS extends UIMA to support this kind of deployment; see
http://uima.apache.org/doc-uimaas-what.html. 

A new component (not yet released, but soon (!)) called DUCC (Distributed UIMA
Cluster Controller) supports managing clusters of machines for UIMA "jobs",
scheduling these with a notion of "fairness" and paying attention to memory
requirements (so that it only schedules simultaneously running jobs on a single
machine up to the limit of that machine's RAM, to avoid paging/swapping);  it
comes with a web command/control interface giving various overviews to the
cluster and the jobs running on it.

-Marshall
>
>

Re: Using uima pipeline as an API

Posted by Richard Eckart de Castilho <ri...@gmail.com>.

> On 25.07.2013 at 04:15, swirl <sw...@yahoo.com> wrote:
>> Hi Richard,
>> I was reading your reference for using JCasIterable 
>> (https://code.google.com/p/dkpro-core-asl/wiki/GroovyRecipies#OpenNLP_Part-of-speech_tagging_pipeline_using_JCasIterable_and_c),
>> but I have some questions.
>> 
>> I assume that createReaderDescription() and createEngineDescription() 
>> return CollectionReaderDescription and AnalysisEngineDescription 
>> respectively. But when I looked at the constructor for JCasIterable, it only 
>> accepts a CollectionReader and an AnalysisEngine array:
>> JCasIterable(final CollectionReader aReader, final AnalysisEngine... aEngines)
>> 
>> Why is this so?

There is another answer as well: this example is not an original uimaFIT
example but an example of using uimaFIT to build a pipeline using DKPro
Core UIMA components. These examples are currently bleeding edge using
the unreleased DKPro Core 1.5.0-SNAPSHOT, the unreleased 
uimaFIT 2.0.0-SNAPSHOT and even the unreleased UIMA 2.4.1-SNAPSHOT.

In fact, some issues (like the JCasIterable not supporting descriptions)
cropped up while writing these examples. For this reason, the example(s)
have changed quite a bit during the last weeks. 

Releases for all of these frameworks are upcoming. Once they are out,
the examples will become stable. Independently of that, the uimaFIT
docs need to be improved with examples of its own.

Cheers,

-- Richard

Re: Using uima pipeline as an API

Posted by Richard Eckart de Castilho <ri...@gmail.com>.

On 25.07.2013 at 04:15, swirl <sw...@yahoo.com> wrote:
> Hi Richard,
> I was reading your reference for using JCasIterable 
> (https://code.google.com/p/dkpro-core-asl/wiki/GroovyRecipies#OpenNLP_Part-of-speech_tagging_pipeline_using_JCasIterable_and_c),
> but I have some questions.
> 
> Your example creates a JCasIterable using the following code:
> 
> def pipeline = new JCasIterable(
>  createReaderDescription(TextReader,
>    TextReader.PARAM_PATH, args[0],
>    TextReader.PARAM_LANGUAGE, args[1],
>    TextReader.PARAM_PATTERNS, ["[+]*.txt"]),
>  createEngineDescription(OpenNlpSegmenter),
>  createEngineDescription(OpenNlpPosTagger));
> 
> I assume that createReaderDescription() and createEngineDescription() 
> return CollectionReaderDescription and AnalysisEngineDescription 
> respectively. But when I looked at the constructor for JCasIterable, it only 
> accepts a CollectionReader and an AnalysisEngine array:
> JCasIterable(final CollectionReader aReader, final AnalysisEngine... aEngines)
> 
> Why is this so?


In the days of yore, uutuc/uimaFIT devs/users used instances (CollectionReader, AnalysisEngine)
more often. Later, we figured out that in those cases we had to take care of sending all the
life-cycle events (collectionProcessComplete, destroy) ourselves. It also had other potentially
unexpected effects, such as that a CollectionReader could not be re-used in several pipelines
because after the first pipeline was through, it would be "empty" (hasNext() = false).

Today, it is considered a best practice to stick to descriptors as long as possible and
instantiate only when necessary. If possible, leave instantiation to a runtime engine like
SimplePipeline or CPE.
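
The difference between passing an instance and passing a description can be illustrated without any UIMA classes. The following is a plain-Java sketch under stated assumptions: an Iterator<String> stands in for a CollectionReader instance, a Supplier<Iterator<String>> stands in for a description, and the class and method names are hypothetical, not part of UIMA or uimaFIT.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Plain-Java analogy only; none of these names exist in UIMA or uimaFIT. */
final class DescriptionVsInstanceDemo {

    /** Stand-in for running a pipeline over a reader instance: drains it completely. */
    static List<String> runWithInstance(Iterator<String> reader) {
        List<String> seen = new ArrayList<>();
        while (reader.hasNext()) {
            seen.add(reader.next());
        }
        return seen;
    }

    /** Stand-in for running from a description: a fresh reader is created for each run. */
    static List<String> runWithDescription(Supplier<Iterator<String>> readerDescription) {
        return runWithInstance(readerDescription.get());
    }
}
```

Running runWithInstance twice on the same iterator yields an empty list the second time, because the instance is already exhausted. Running runWithDescription twice yields the full list both times, because each run instantiates its own reader.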

In uimaFIT 1.4.0 the JCasIterable only accepts CollectionReader and AnalysisEngine….

In uimaFIT 2.0.0, this changes to CollectionReaderDescription and AnalysisEngineDescription….

See also:

- UIMA-3041 [1] - JCasIterable should have signature accepting descriptors

- UIMA-3097 [2] - Split JCasIterable into iterable and iterator parts

Cheers,

-- Richard

[1] https://issues.apache.org/jira/browse/UIMA-3041

[2] https://issues.apache.org/jira/browse/UIMA-3097

Re: Using uima pipeline as an API

Posted by swirl <sw...@yahoo.com>.

Richard Eckart de Castilho <ri...@...> writes:

> 
> You should take a look at the JCasIterable (cf. [1] - Example in Groovy, but
> JCasIterable is a Java class and works nicely in Java too, just I have no 
> example in Java).
> 
> JCasIterable basically allows you to iterate over the CASes produced by your
> pipeline. In such a loop, you can extract and collect the data you need from
> the CASes, e.g. putting into a List<String> and returning it. Mind that you
> should *not* try to keep hold of full CASes, FeatureStructure (including
> Annotations and stuff). You need to copy the data from the CAS, otherwise
> it will be corrupted.

Hi Richard,
I was reading your reference for using JCasIterable 
(https://code.google.com/p/dkpro-core-asl/wiki/GroovyRecipies#OpenNLP_Part-of-speech_tagging_pipeline_using_JCasIterable_and_c),
but I have some questions.

Your example creates a JCasIterable using the following code:

def pipeline = new JCasIterable(
  createReaderDescription(TextReader,
    TextReader.PARAM_PATH, args[0],
    TextReader.PARAM_LANGUAGE, args[1],
    TextReader.PARAM_PATTERNS, ["[+]*.txt"]),
  createEngineDescription(OpenNlpSegmenter),
  createEngineDescription(OpenNlpPosTagger));

I assume that createReaderDescription() and createEngineDescription() 
return CollectionReaderDescription and AnalysisEngineDescription 
respectively. But when I looked at the constructor for JCasIterable, it only 
accepts a CollectionReader and an AnalysisEngine array:
 JCasIterable(final CollectionReader aReader, final AnalysisEngine... aEngines)

Why is this so?

Re: Using uima pipeline as an API

Posted by Richard Eckart de Castilho <ri...@gmail.com>.

> I have this particular requirement for an API that we wrap over a UIMA 
> pipeline.
> 
> public List<String> analyse(String inputFolderPath, String modelName);
> 
> This method is supposed to accept a collection of files (residing in the 
> inputFolderPath), run the files (as CAS) through a pipeline of UIMA AEs, and 
> return the results (one String per CAS).
> 
> To return the strings, I will need to somehow access the CAS after the AEs 
> have finished their job and transform/extract whatever inside the CAS into 
> the string that I will return to the caller of this method.
> 
> But if I run the AEs using SimplePipeline.runPipeline(), 
> how can I get hold of the CASes that are coming out of the AEs?
> Do I attach a CAS Consumer at the tail of the pipeline and read the CAS 
> contents at that point? Then I append each result to the List<String> that I 
> constructed at the beginning.

You should take a look at the JCasIterable (cf. [1] - Example in Groovy, but
JCasIterable is a Java class and works nicely in Java too, I just have no 
example in Java).

JCasIterable basically allows you to iterate over the CASes produced by your
pipeline. In such a loop, you can extract and collect the data you need from
the CASes, e.g. putting it into a List<String> and returning it. Mind that you
should *not* try to keep hold of full CASes or FeatureStructures (including
Annotations and the like). You need to copy the data out of the CAS, otherwise
it will be corrupted.
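
Why copying matters can be shown with a plain-Java stand-in, assuming (as the warning above implies) that the pipeline resets and refills one container for every document. The Container class and the pipeline() method below are hypothetical illustrations, not UIMA or uimaFIT API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Plain-Java analogy only; Container is a stand-in for a CAS, not a UIMA class. */
final class ReusedContainerDemo {

    /** A mutable container that the "pipeline" resets and refills for each document. */
    static final class Container {
        private final StringBuilder text = new StringBuilder();

        void reset(String newText) {
            text.setLength(0);
            text.append(newText);
        }

        /** Returns an independent String copy of the current content. */
        String getText() {
            return text.toString();
        }
    }

    /** Hands out the SAME Container object for every document, refilled each time. */
    static Iterable<Container> pipeline(List<String> documents) {
        Container shared = new Container();
        return () -> new Iterator<Container>() {
            private final Iterator<String> docs = documents.iterator();

            @Override
            public boolean hasNext() {
                return docs.hasNext();
            }

            @Override
            public Container next() {
                shared.reset(docs.next());
                return shared;
            }
        };
    }

    /** Safe: copies the data out of the container inside the loop. */
    static List<String> collectCopies(List<String> documents) {
        List<String> results = new ArrayList<>();
        for (Container c : pipeline(documents)) {
            results.add(c.getText());
        }
        return results;
    }

    /** Unsafe: keeps references to the container, which is overwritten later. */
    static List<Container> collectReferences(List<String> documents) {
        List<Container> results = new ArrayList<>();
        for (Container c : pipeline(documents)) {
            results.add(c);
        }
        return results;
    }
}
```

With collectCopies every document's text survives. With collectReferences every list entry is the same object, so all of them show only the content of the last document.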

> If so, is this scalable? 

Well… up to a point, but not in general.

> If I have thousands of files in the inputFolderPath, and if the strings are 
> very large, would I run out of memory soon?
> Is there a more scalable way to do this?

You could write your strings to a file and then return an implementation of 
List<String> which directly accesses the file. Depending on how much you want
to scale, you'll have to look into different solutions. The easiest would be
to buy more memory, the most complex would probably be porting your stuff to
some kind of cluster. The latter will most likely require a change of API,
possibly even of the whole processing paradigm. List<String> most probably
won't do then ;)
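
The file-backed List<String> idea could look roughly like the following sketch. It uses only the Java standard library; the class name FileBackedStringList and its record layout (a 4-byte length prefix followed by UTF-8 bytes) are assumptions made for illustration. Only one file offset per entry is kept in memory, so the strings themselves never accumulate on the heap. The sketch is not thread-safe.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

/** Append-only List<String> whose elements live in a file (hypothetical helper). */
final class FileBackedStringList extends AbstractList<String> implements AutoCloseable {
    private final RandomAccessFile file;
    private final List<Long> offsets = new ArrayList<>(); // start offset of each record

    FileBackedStringList(Path path) {
        try {
            file = new RandomAccessFile(path.toFile(), "rw");
            file.setLength(0); // start from an empty file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience factory that stores the data in a new temporary file. */
    static FileBackedStringList inTempFile() {
        try {
            return new FileBackedStringList(Files.createTempFile("results", ".bin"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one record to the end of the file and remembers where it starts. */
    @Override
    public boolean add(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        try {
            long start = file.length();
            file.seek(start);
            file.writeInt(bytes.length); // length prefix
            file.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        offsets.add(file_offset(bytes.length));
        modCount++;
        return true;
    }

    // Computes the start offset of the record just written from the current file length.
    private long file_offset(int payloadLength) {
        try {
            return file.length() - payloadLength - Integer.BYTES;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the requested record back from the file on every call. */
    @Override
    public String get(int index) {
        long start = offsets.get(index); // throws IndexOutOfBoundsException for a bad index
        try {
            file.seek(start);
            byte[] bytes = new byte[file.readInt()];
            file.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int size() {
        return offsets.size();
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An analyse() method could add one extracted string per CAS inside the processing loop and return the list as a List<String>; the caller would then need to close it when done. Every get() goes to disk, so this trades access speed for a small memory footprint.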

Cheers,

-- Richard


[1] http://code.google.com/p/dkpro-core-asl/wiki/GroovyRecipies#OpenNLP_Part-of-speech_tagging_pipeline_using_JCasIterable_and_c