Posted to user@uima.apache.org by ThanhDK <th...@gmail.com> on 2013/10/03 07:14:27 UTC

Best approach for analyzing a set of documents

Hi all,

I am new to UIMA and from what I see, the concept of an AE is very
single-document centric. My question is, from the UIMA point of view, what is
the standard way to write an analysis component whose input is a set of
documents? For instance, a clustering engine that groups similar documents
into the same basket, or a trending topic detector that detects new topics
from a set of documents.

I had a look at the CPE before, but it looks to me like just an iterator that
collects documents one by one, sends them through the AEs and collects the output.

Regards

Re: Best approach for analyzing a set of documents

Posted by Richard Eckart de Castilho <re...@apache.org>.

Hi,

An AE/CasConsumer can keep state and use it to aggregate information over all the CASes it sees. When the last document in the set produced by the reader has been processed, collectionProcessComplete() is called on the AE. This is the point where the aggregated information can be evaluated further or where the results can be persisted somewhere.

Mind that AEs can by default be deployed multiple times, meaning that each instance sees only part of the data, while CCs by default cannot be deployed multiple times, meaning that the single instance sees every CAS.
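
To make the lifecycle concrete, here is a minimal sketch in plain Java. It deliberately does not use the UIMA API: DocumentComponent, TokenCounter and run() are hypothetical stand-ins, with method names chosen to mirror the per-CAS process() callback and the collectionProcessComplete() event mentioned above. It only illustrates the pattern of accumulating state per document and evaluating it once at the end.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AggregatingComponentSketch {

    /** Hypothetical stand-in for the two lifecycle callbacks; not a UIMA type. */
    interface DocumentComponent {
        void process(String documentText);   // once per document (per CAS in UIMA)
        void collectionProcessComplete();    // once, after the last document
    }

    /** Accumulates token counts across documents and publishes them at the end. */
    static class TokenCounter implements DocumentComponent {
        private final Map<String, Integer> counts = new TreeMap<>();
        private Map<String, Integer> finalCounts;   // null until the collection is complete

        @Override
        public void process(String documentText) {
            for (String token : documentText.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        @Override
        public void collectionProcessComplete() {
            // Collection-level evaluation happens here, after all documents were seen.
            finalCounts = new TreeMap<>(counts);
        }

        Map<String, Integer> finalCounts() {
            return finalCounts;
        }
    }

    /** Plays the role of reader plus framework: feeds every document to ONE instance. */
    public static Map<String, Integer> run(List<String> documents) {
        TokenCounter counter = new TokenCounter();
        for (String doc : documents) {
            counter.process(doc);
        }
        counter.collectionProcessComplete();
        return counter.finalCounts();
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("UIMA processes documents", "documents form a collection")));
    }
}
```

The sketch uses a single instance on purpose. If the component were deployed several times, each copy would hold only the counts for the documents it happened to receive.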

-- Richard

On 03.10.2013, at 07:14, ThanhDK <th...@gmail.com> wrote:

> Hi all,
> 
> I am new to UIMA and from what I see, the concept of an AE is very
> single-document centric. My question is, from the UIMA point of view, what is
> the standard way to write an analysis component whose input is a set of
> documents? For instance, a clustering engine that groups similar documents
> into the same basket, or a trending topic detector that detects new topics
> from a set of documents.
> 
> I had a look at the CPE before, but it looks to me like just an iterator that
> collects documents one by one, sends them through the AEs and collects the output.
> 
> Regards

Re: Best approach for analyzing a set of documents

Posted by Marshall Schor <ms...@schor.com>.

On 10/3/2013 11:54 AM, Richard Eckart de Castilho wrote:
> Even though the CasMultiplier is the ultimate component, for learning UIMA, I believe that the distinction between readers, analysis engines, and consumers is quite instructive.
Right.  I think this is why they are not deprecated. 

-Marshall
>
> -- Richard
>
> On 03.10.2013, at 16:09, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> On 10/03/2013 03:57 PM, Marshall Schor wrote:
>>> Later, it became clear that the Collection Reader and Cas Consumer were just
>>> parameterizations of normal Analysis Engines, so they were replaced by those.
>>> The older classes still work, though.
>> We should deprecate them and communicate this better to our users.
>>
>> Jörn
>

Re: Best approach for analyzing a set of documents

Posted by Richard Eckart de Castilho <re...@apache.org>.

Even though the CasMultiplier is the ultimate component, for learning UIMA, I believe that the distinction between readers, analysis engines, and consumers is quite instructive.

-- Richard

On 03.10.2013, at 16:09, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/03/2013 03:57 PM, Marshall Schor wrote:
>> Later, it became clear that the Collection Reader and Cas Consumer were just
>> parameterizations of normal Analysis Engines, so they were replaced by those.
>> The older classes still work, though.
> 
> We should deprecate them and communicate this better to our users.
> 
> Jörn

Re: Best approach for analyzing a set of documents

Posted by Steven Bethard <st...@gmail.com>.

On Thu, Oct 3, 2013 at 9:09 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 10/03/2013 03:57 PM, Marshall Schor wrote:
>> Later, it became clear that the Collection Reader and Cas Consumer were
>> just
>> parameterizations of normal Analysis Engines, so they were replaced by
>> those.
>> The older classes still work, though.
>
> We should deprecate them and communicate this better to our users.

Yes, please. I'm still unclear on how to translate a CollectionReader
into an AnalysisEngine. The documentation should really hide
CollectionReaders and CasConsumers entirely (e.g. in a "deprecated"
section of the docs) and only show how to do the currently recommended
approach.

Steve

Re: Best approach for analyzing a set of documents

Posted by Jörn Kottmann <ko...@gmail.com>.

On 10/03/2013 03:57 PM, Marshall Schor wrote:
> Later, it became clear that the Collection Reader and Cas Consumer were just
> parameterizations of normal Analysis Engines, so they were replaced by those.
> The older classes still work, though.

We should deprecate them and communicate this better to our users.

Jörn

Re: Best approach for analyzing a set of documents

Posted by Marshall Schor <ms...@schor.com>.

On 10/4/2013 2:09 AM, ThanhDK wrote:
> Thanks Marshall for your detailed response. Really appreciate it.
>
> I have a few more inquiries:
>
>> Later, UIMA introduced the concept of a CAS Multiplier.  This generalized the
>> Collection Reader a bit, allowing it to be anywhere in a pipeline, not just at
>> the beginning.
> Thanks for the info. I had a look at the CAS Multiplier and saw that it
> implements the interface AnalysisComponent
> http://uima.apache.org/d/uimaj-2.4.2/apidocs/org/apache/uima/analysis_component/AnalysisComponent.html
>
> So my question is what is the relationship between this interface and the
> AnalysisEngine interface
> http://uima.apache.org/d/uimaj-2.4.2/apidocs/org/apache/uima/analysis_engine/AnalysisEngine.html
>
> Conceptually speaking, AE should be a subclass of AC, but this doesn't seem to
> be the case?
UIMA, as a framework, has 2 "sides".  On one side, it supports "components". 
These components have to be structured to follow the component interfaces.

On the other side, there's the caller of the UIMA Framework.  That's, for
instance, a Java "Main" class, or a Servlet, or ...

From that side, we have other interfaces, and the AnalysisEngine is one of them.

If you read the docs, this chapter,
http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.aae, is
about the first side - writing the components.
This chapter,
http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.application,
is about the other side - the Application side, which calls the UIMA framework.
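
A rough way to picture the two sides is the plain-Java sketch below. It is only an illustration under my own assumptions, not UIMA's actual class layout: Component stands in for the component-side interface (what an annotator author implements) and Engine stands in for the application-side handle (what the caller holds). The point is that the engine wraps a component rather than extending it, which is why no subclass relationship is needed between the two.

```java
public class TwoSidesSketch {

    /** Component side: what an annotator author implements (stand-in, not a UIMA type). */
    interface Component {
        String process(String text);
    }

    /** Application side: the handle the calling code uses (stand-in, not a UIMA type). */
    static final class Engine {
        private final Component delegate;   // the wrapped user component
        private int processedCount;         // bookkeeping owned by the "framework"

        Engine(Component delegate) {
            this.delegate = delegate;
        }

        String process(String text) {
            processedCount++;               // framework-side concern
            return delegate.process(text);  // component-side logic
        }

        int processedCount() {
            return processedCount;
        }
    }

    /** The "application": builds an engine around a component and calls it. */
    public static String run(String text) {
        Component upperCaser = t -> t.toUpperCase();
        Engine engine = new Engine(upperCaser);
        return engine.process(text);
    }

    public static void main(String[] args) {
        System.out.println(run("uima"));
    }
}
```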

HTH. -Marshall
>
>> Later, it became clear that the Collection Reader and Cas Consumer were just
>> parameterizations of normal Analysis Engines, so they were replaced by those. 
>> The older classes still work, though.
> Do you mind elaborating on the "parameterizations" part?
>
>> So the current way to do what you're asking is to use an Analysis Engine
>> specified as a Cas Multiplier to generate the CASes flowing in the pipeline,
>> and to use an Analysis Engine set up like a Cas Consumer (for instance, specify
>> the properties in the <operationalProperties> element to indicate that
>> multipleDeploymentAllowed is false, which causes all the CASes to flow into
>> this one instance, if that's what's needed).
> Again, when you say AE specified as a CAS Multiplier, how does the
> inheritance relationship work?
>
> Thanks again for your help.
>
> Regards

Re: Best approach for analyzing a set of documents

Posted by ThanhDK <th...@gmail.com>.

Thanks Marshall for your detailed response. Really appreciate it.

I have a few more inquiries:

> Later, UIMA introduced the concept of a CAS Multiplier.  This generalized the
> Collection Reader a bit, allowing it to be anywhere in a pipeline, not just at
> the beginning.

Thanks for the info. I had a look at the CAS Multiplier and saw that it
implements the interface AnalysisComponent
http://uima.apache.org/d/uimaj-2.4.2/apidocs/org/apache/uima/analysis_component/AnalysisComponent.html

So my question is what is the relationship between this interface and the
AnalysisEngine interface
http://uima.apache.org/d/uimaj-2.4.2/apidocs/org/apache/uima/analysis_engine/AnalysisEngine.html

Conceptually speaking, AE should be a subclass of AC, but this doesn't seem to
be the case?

> 
> Later, it became clear that the Collection Reader and Cas Consumer were just
> parameterizations of normal Analysis Engines, so they were replaced by those. 
> The older classes still work, though.

Do you mind elaborating on the "parameterizations" part?

> So the current way to do what you're asking is to use an Analysis Engine
> specified as a Cas Multiplier to generate the CASes flowing in the pipeline,
> and to use an Analysis Engine set up like a Cas Consumer (for instance, specify
> the properties in the <operationalProperties> element to indicate that
> multipleDeploymentAllowed is false, which causes all the CASes to flow into
> this one instance, if that's what's needed).

Again, when you say AE specified as a CAS Multiplier, how does the
inheritance relationship work?

Thanks again for your help.

Regards

Re: Best approach for analyzing a set of documents

Posted by Marshall Schor <ms...@schor.com>.

On 10/3/2013 1:14 AM, ThanhDK wrote:
> Hi all,
>
> I am new to UIMA and from what I see, the concept of an AE is very
> single-document centric. My question is, from the UIMA point of view, what is
> the standard way to write an analysis component whose input is a set of
> documents? For instance, a clustering engine that groups similar documents
> into the same basket, or a trending topic detector that detects new topics
> from a set of documents.
>
> I had a look at the CPE before, but it looks to me like just an iterator that
> collects documents one by one, sends them through the AEs and collects the output.
Hi,

A bit of history may be helpful.

In the beginning, UIMA had Collection Readers and Cas Consumers.  These were
conceptually intended to go at the beginning and end of pipelines.  The
Collection Readers would read "work-items" (e.g., documents - but UIMA can
process things other than documents, for instance, video clips, etc.) and push
those through the pipeline.  And Cas Consumers would do something with the
results of the analysis (e.g., write them to a file, a database, etc.).

Later, UIMA introduced the concept of a CAS Multiplier.  This generalized the
Collection Reader a bit, allowing it to be anywhere in a pipeline, not just at
the beginning.

Later, it became clear that the Collection Reader and Cas Consumer were just
parameterizations of normal Analysis Engines, so they were replaced by those. 
The older classes still work, though.

So the current way to do what you're asking is to use an Analysis Engine specified
as a Cas Multiplier to generate the CASes flowing in the pipeline, and to use an
Analysis Engine set up like a Cas Consumer (for instance, specify the properties
in the <operationalProperties> element to indicate that
multipleDeploymentAllowed is false, which causes all the CASes to flow into this
one instance, if that's what's needed).
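
As a rough picture of that arrangement, here is a sketch in plain Java. It does not use the UIMA API: ParagraphSplitter and run() are hypothetical stand-ins. The splitter plays the multiplier role (one input in, zero or more work items out, pulled via hasNext()/next(), names chosen to echo the multiplier contract), and a single list plays the one consumer-like instance that sees every item.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class MultiplierSketch {

    /** Stand-in for a multiplier: accepts one input and emits several work items. */
    static class ParagraphSplitter {
        private final Deque<String> pending = new ArrayDeque<>();

        /** Splits the input on blank lines and queues each non-empty paragraph. */
        void process(String input) {
            for (String paragraph : input.split("\\n\\s*\\n")) {
                String trimmed = paragraph.trim();
                if (!trimmed.isEmpty()) {
                    pending.add(trimmed);
                }
            }
        }

        boolean hasNext() {
            return !pending.isEmpty();
        }

        String next() {
            return pending.remove();
        }
    }

    /**
     * Drives the splitter and hands every produced item to ONE collector,
     * mimicking a consumer-like component that is not deployed multiple times.
     */
    public static List<String> run(String input) {
        ParagraphSplitter splitter = new ParagraphSplitter();
        List<String> seenByConsumer = new ArrayList<>();
        splitter.process(input);
        while (splitter.hasNext()) {
            seenByConsumer.add(splitter.next());
        }
        return seenByConsumer;
    }

    public static void main(String[] args) {
        System.out.println(run("first paragraph\n\nsecond paragraph\n \nthird paragraph"));
    }
}
```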

This approach enables the same pipeline to be run on a laptop for testing, and
then scaled up (e.g. using UIMA-AS) to a big cluster of machines (for processing
very large document collections).  The CPE was a first implementation of
scaleout; the current, more flexible and powerful version is UIMA-AS.

-Marshall
>
> Regards