Posted to user@uima.apache.org by "Sergeant, Alan" <al...@sap.com> on 2010/11/30 15:14:30 UTC

Scale out using multiple Collection Readers and Cas Consumers

Hi,
I have been looking through the UIMA AS Getting Started guide (http://uima.apache.org/doc-uimaas-what.html), and am particularly interested in the architecture discussed in Figure 5 (Scale out using multiple Collection Readers and Cas Consumers). I am wondering whether there is any example code, like RunRemoteAsyncAE, available for this, or is this a suggested architecture that has not been implemented yet?

Regards,
Alan Sergeant


Re: Scale out using multiple Collection Readers and Cas Consumers

Posted by Eddie Epstein <ea...@gmail.com>.
Wow, that's a lot of questions ... and here we go ...

On Thu, Dec 2, 2010 at 1:09 AM, Greg Holmberg <ho...@comcast.net> wrote:
> Hi Eddie--
>
>
>
>> My experiences with UIMA AS are mostly with applications deployed
>> on a single cluster of multi-core machines interconnected with a high
>> performance network.
>
> By "high performance" you mean something more than gigabit ethernet, like
> Infiniband or 10 GB optical fiber?

For us, just 1 GigE and 10 GigE so far.

>
>> The largest cluster we have worked with is several
>> hundred nodes. We see hundreds of MB/sec of data flowing between
>> clients and services thru a single broker. The load is evenly distributed
>> among all instances of a service type. Client requests are processed
>> in the order they are queued.
>
> I'm having trouble picturing this system landscape--could you describe how
> the various pieces of data (content, control messages, status messages,
> etc.) move through the system, from document source (or app) to result
> database?  I'd like to see where the network I/O is and where the disk I/O
> is, and what data formats are used.

Several different systems, each one different. In one case, a multimodal
speech-to-speech system, the SofaURI was used to flow audio data via a
separate audio media controller, while the CAS flow carried control and
result information. In other systems the CASes contain pointers to data
on NFS. Others just put all the data in the CAS.

>
> By "broker" do you mean Active MQ?

Yes.

>
> How do clients submit requests to the cluster?  Do you support non-Java
> clients?  What does a request contain?  Can the client monitor the progress
> of a request?

The UIMA AS client API is only in Java as of now. It is possible to create
clients in other programming languages because AMQ client code exists
for several other languages. We use the AMQ-C++ client code for UIMA C++
services, but as yet have had no need for C++ clients. Interactive
applications typically use HTTP servlets as clients to backend UIMA AS
services.
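
A minimal, untested sketch of that client pattern in Java, modeled on the
RunRemoteAsyncAE example that ships with UIMA AS (the broker URL and queue
name are placeholders, and the context-key constants should be checked
against the RunRemoteAsyncAE source in your release):

import java.util.HashMap;
import java.util.Map;

import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

public class SimpleSyncClient {
  public static void main(String[] args) throws Exception {
    UimaAsynchronousEngine uimaAsEngine = new BaseUIMAAsynchronousEngine_impl();

    // Connection settings; broker URL and queue name are placeholders.
    Map<String, Object> appCtx = new HashMap<String, Object>();
    appCtx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616");
    appCtx.put(UimaAsynchronousEngine.ENDPOINT, "MeetingDetectorTaeQueue");
    appCtx.put(UimaAsynchronousEngine.CasPoolSize, 2);
    appCtx.put(UimaAsynchronousEngine.Timeout, 60000); // process timeout, ms
    uimaAsEngine.initialize(appCtx);

    // Get a CAS from the client's pool, fill it, and process it synchronously.
    CAS cas = uimaAsEngine.getCAS();
    cas.setDocumentText("Let's schedule a meeting for Tuesday at 10.");
    uimaAsEngine.sendAndReceiveCAS(cas);
    // ... read annotations from the returned CAS here ...
    cas.release(); // return the CAS to the client's pool

    uimaAsEngine.collectionProcessingComplete();
    uimaAsEngine.stop();
  }
}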

>
> Is the broker a bottle-neck?  Does all content pass through it?  How many
> times does each document (in one form or another) pass through the broker?

Although conceptually the broker is a bottleneck, we have not seen it become
one in practice. There are workarounds; for example, it is easy to deploy
different services using different brokers.

>
> How does a web crawler fit into the system?
>
> Does one request have to completely finish before another can start?  Are
> there priorities?  What about requests from interactive applications, where
> the user is waiting?

The UIMA AS client API has both synchronous and asynchronous interfaces to
process(CAS). As with any interactive application, the number of service
instances must be sufficient to meet latency requirements. Client-side
process timeouts are available.

>
> Given that document processing time varies significantly, and different
> requests may use different aggregate engines, how do you manage to keep all
> the CPUs equally (and hopefully fully) busy?

Ideally this is simple: 1. configure each server node so that CPU utilization
is maximized when all service instances are busy; 2. make sure the CAS pools
in the clients are large enough to keep all service instances busy.
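
Concretely, point 2 mostly comes down to how many CASes the client keeps in
flight. A rough sketch, assuming an engine already initialized as in the
earlier snippet (the blocking behavior of getCAS() noted in the comments is
how I understand the client CAS pool to work; verify against your release):

import java.util.List;

import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.cas.CAS;

public class AsyncFeeder {
  // Assumes the engine was initialized with CasPoolSize set to roughly
  // (number of service instances) x (a small factor), so enough CASes can
  // be outstanding to keep every instance busy.
  public static void feed(UimaAsynchronousEngine engine, List<String> docs)
      throws Exception {
    for (String text : docs) {
      // getCAS() blocks when the pool is exhausted, so the pool size
      // effectively bounds the number of in-flight requests.
      CAS cas = engine.getCAS();
      cas.setDocumentText(text);
      engine.sendCAS(cas); // asynchronous; results arrive via a callback listener
    }
    engine.collectionProcessingComplete(); // returns after outstanding CASes finish
  }
}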

>
> How does a client get the annotators that it needs deployed into the
> cluster?

See other threads on service life-cycle management. Short answer: this is
currently outside the scope of the UIMA AS code.

>
> Is every machine performing the same function, or do they specialize in a
> particular annotator?  That is, is an aggregate engine self-contained in a
> single JVM, or is it split over multiple machines?

All of these are true; it depends on the analytics.

>
> If a machine crashes, can there be data loss?  How do you recover?
>
> Can you increase or decrease the capacity of the system without disrupting
> service?

Interactive systems are designed with redundancy and no single point of
failure. If a UIMA AS request times out it should be resubmitted. UIMA AS
service instances can be added or removed dynamically at runtime.
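
For the resubmission case, one simple pattern is a callback listener
(registered with addStatusCallbackListener) that remembers which CASes came
back with an exception status, such as a process timeout, so the driving
loop can send them again. An untested sketch; the retry policy is purely
illustrative:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.uima.aae.client.UimaAsBaseCallbackListener;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.EntityProcessStatus;

public class CollectFailuresListener extends UimaAsBaseCallbackListener {
  private final Queue<String> failedDocs = new ConcurrentLinkedQueue<String>();

  @Override
  public void entityProcessComplete(CAS aCas, EntityProcessStatus aStatus) {
    if (aStatus != null && aStatus.isException()) {
      System.err.println("CAS failed: " + aStatus.getExceptions());
      failedDocs.add(aCas.getDocumentText()); // remember it for resubmission
    }
    // On success the client returns the CAS to its pool; nothing to do here.
  }

  public Queue<String> getFailedDocs() {
    return failedDocs;
  }
}

After collectionProcessingComplete() returns, anything left in getFailedDocs()
can be pushed through the same sendCAS loop a second time.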

>
> So many questions, I know.  But I think these are legitimate issues when
> building a system, and I don't see how AS handles them.  Someone really
> needs to write a paper...

There is a paper on the use of UIMA AS for GALE. I'm sure more will come.

>
>> The strength of UIMA AS is to easily scale out pipelines that
>> exceed the processing resources of individual nodes with no changes to
>> annotator and flow controller code or descriptors. Achieving high
>> CPU utilization may require a bit of sophistication, as always, but
>> UIMA AS includes the tools to facilitate that process.
>
> Really? To me AS seems more like a box of Legos and a picture (but no
> instructions) of a really cool airplane you can build if you've got the time
> and expertise.

Definitely some truth to this.

>
> Sorry about that.  I'm just having a really hard time seeing how to build a
> reliable, scalable, efficient document processing service on AS.  It seems
> more theoretical than practical.
>
>
> Greg Holmberg
>

It would be nice to have a turnkey application for UIMA AS. So far we have been
focusing on getting full coverage for all UIMA functionality, as well as
maximizing performance at all levels of the runtime system.

Eddie

Re: Scale out using multiple Collection Readers and Cas Consumers

Posted by Greg Holmberg <ho...@comcast.net>.
Hi Eddie--



> My experiences with UIMA AS are mostly with applications deployed
> on a single cluster of multi-core machines interconnected with a high
> performance network.

By "high performance" you mean something more than gigabit ethernet, like  
Infiniband or 10 GB optical fiber?

> The largest cluster we have worked with is several
> hundred nodes. We see hundreds of MB/sec of data flowing between
> clients and services thru a single broker. The load is evenly distributed
> among all instances of a service type. Client requests are processed
> in the order they are queued.

I'm having trouble picturing this system landscape--could you describe how  
the various pieces of data (content, control messages, status messages,  
etc.) move through the system, from document source (or app) to result  
database?  I'd like to see where the network I/O is and where the disk
I/O is, and what data formats are used.

By "broker" do you mean Active MQ?

How do clients submit requests to the cluster?  Do you support non-Java  
clients?  What does a request contain?  Can the client monitor the  
progress of a request?

Is the broker a bottle-neck?  Does all content pass through it?  How many  
times does each document (in one form or another) pass through the broker?

How does a web crawler fit into the system?

Does one request have to completely finish before another can start?  Are  
there priorities?  What about requests from interactive applications, where
the user is waiting?

Given that document processing time varies significantly, and different  
requests may use different aggregate engines, how do you manage to keep  
all the CPUs equally (and hopefully fully) busy?

How does a client get the annotators that it needs deployed into the  
cluster?

Is every machine performing the same function, or do they specialize in a  
particular annotator?  That is, is an aggregate engine self-contained in a  
single JVM, or is it split over multiple machines?

If a machine crashes, can there be data loss?  How do you recover?

Can you increase or decrease the capacity of the system without disrupting  
service?

So many questions, I know.  But I think these are legitimate issues when  
building a system, and I don't see how AS handles them.  Someone really  
needs to write a paper...

> The strength of UIMA AS is to easily scale out pipelines that
> exceed the processing resources of individual nodes with no changes to
> annotator and flow controller code or descriptors. Achieving high
> CPU utilization may require a bit of sophistication, as always, but
> UIMA AS includes the tools to facilitate that process.

Really? To me AS seems more like a box of Legos and a picture (but no  
instructions) of a really cool airplane you can build if you've got the  
time and expertise.

Sorry about that.  I'm just having a really hard time seeing how to build  
a reliable, scalable, efficient document processing service on AS.  It
seems more theoretical than practical.


Greg Holmberg

Re: Scale out using multiple Collection Readers and Cas Consumers

Posted by Burn Lewis <bu...@gmail.com>.
The choice between a single or multiple collection readers depends a lot on
the application.  If populating the initial input CASes is not expensive, it
could be implemented as a UIMA-AS client similar to RunRemoteAsyncAE in
Figure 3, with the load balancing provided by the multiple service instances
consuming CASes from the input queue.  If the application has multiple
services, each handling a stream of job requests, then each could be a
UIMA-AS client and send CASes to the same input queue.

Note that CasMultipliers are a flexible replacement for Collection Readers,
since the collection definition can be provided dynamically in the input
CAS rather than in a configuration file or via some side channel.  So,
similar to Figure 5, an application could scale out multiple aggregates on a
cluster of machines, each aggregate starting with a CasMultiplier that gets
its collection definition (a directory or list of documents) from a CAS
placed on the shared input queue and creates the document CASes to be
processed by the AEs in the rest of the aggregate.  Some of these AEs could
be scaled locally, or could be remote AS services shared by all of the
scaled-out aggregates.
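
A rough sketch of what such a CasMultiplier could look like, in the spirit of
the SimpleTextSegmenter example. Here I assume the collection definition
arrives as the input CAS's document text (a directory path); a real design
might carry it in a dedicated feature structure instead:

import java.io.File;
import java.nio.file.Files;

import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.jcas.JCas;

// Reads a directory path from the incoming CAS and emits one child CAS per file.
public class DirectoryCasMultiplier extends JCasMultiplier_ImplBase {
  private File[] files;
  private int next;

  @Override
  public void process(JCas aJCas) throws AnalysisEngineProcessException {
    // Assumption: the collection definition arrives as the document text.
    File dir = new File(aJCas.getDocumentText().trim());
    files = dir.listFiles();
    next = 0;
  }

  @Override
  public boolean hasNext() throws AnalysisEngineProcessException {
    return files != null && next < files.length;
  }

  @Override
  public AbstractCas next() throws AnalysisEngineProcessException {
    JCas childCas = getEmptyJCas();
    try {
      String text = new String(Files.readAllBytes(files[next++].toPath()), "UTF-8");
      childCas.setDocumentText(text);
      return childCas;
    } catch (Exception e) {
      childCas.release();
      throw new AnalysisEngineProcessException(e);
    }
  }
}

The multiplier's descriptor also needs outputsNewCASes set to true in its
operational properties so the framework routes the child CASes onward.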

In practice it may be sufficient to scale just the delegates inside an AS
aggregate, deploying multiple instances of any slow components and providing
a CAS pool large enough to keep all of the local and remote delegate
instances busy.

One advantage of designing an application as a deployed UIMA-AS aggregate
with some of its AEs deployed as remote services is that it is relatively
easy to start with a simple synchronous single-threaded UIMA aggregate and
later add the UIMA-AS deployments and scaleout.

~Burn

Fwd: Scale out using multiple Collection Readers and Cas Consumers

Posted by Eddie Epstein <ea...@gmail.com>.
---------- Forwarded message ----------
From: Eddie Epstein <ea...@gmail.com>
Date: Wed, Dec 1, 2010 at 9:00 AM
Subject: Re: Scale out using multiple Collection Readers and Cas Consumers
To: Greg Holmberg <ho...@comcast.net>


Ooops :)

> We see hundreds of GB/sec of data flowing between
> clients and services thru a single broker.

That would be hundreds of MB/sec

Eddie

Re: Scale out using multiple Collection Readers and Cas Consumers

Posted by Eddie Epstein <ea...@gmail.com>.
Hi Greg,

> OK, so, bottom line, if we want to process terabytes of text contained in
> millions of files, and we want to do it in a cluster of hundreds of
> machines, and we want that cluster to scale linearly and infinitely without
> bottle-necks, and we want to use UIMA-AS to do it, then we've got a lot of
> work ahead of us?  There's no existing example configuration or code that
> shows how to do this?
>
> If we did do that work, are you confident that AS doesn't have any inherent
> bottle-necks that would prevent scaling to that level?  Was it designed to
> do that kind of thing?  The multiple Collection Reader idea wouldn't really
> be able to do that, would it?

There are no claims that UIMA AS will scale indefinitely. The design in
Figure 5 simply eliminates the bottleneck of a single collection reader. As
yet, no code for distributed readers is offered.
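
To make the Figure 5 idea concrete, the simplest form of "multiple collection
readers" is just N independent client processes, each walking the same file
collection but keeping only its own share and feeding the shared input queue.
A sketch of one way to partition; this scheme is my own illustration, not
something UIMA AS provides:

import java.io.File;

// Run N copies of the driving client, each with a different clientId in
// 0..numClients-1. Every file belongs to exactly one client, so no
// coordination between the readers is needed.
public class StaticPartitioner {
  public static boolean mine(File f, int clientId, int numClients) {
    return Math.floorMod(f.getName().hashCode(), numClients) == clientId;
  }

  public static void main(String[] args) {
    int clientId = Integer.parseInt(args[0]);
    int numClients = Integer.parseInt(args[1]);
    for (File f : new File(args[2]).listFiles()) {
      if (mine(f, clientId, numClients)) {
        // A real client would getCAS()/sendCAS() here instead of printing.
        System.out.println("would send: " + f);
      }
    }
  }
}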

>
> What if there's no obvious way to partition the file set?  Say, for example,
> we're crawling a web site, like amazon.com?
>
> What if the file set is not known (and so can't be partitioned), such as if
> we have an on-demand service that is receiving a steady series of random job
> submissions from different clients, each wanting to process different doc
> sets from different repositories?  How could AS be configured to ensure
> efficient use of the hardware (load balanced, all CPU cores at 100%)?  And
> fairness to the competing clients?
>
> The AS architecture has always been a bit fuzzy to me.  Any insights on how
> to achieve extreme scalability with AS would be appreciated.

My experiences with UIMA AS are mostly with applications deployed
on a single cluster of multi-core machines interconnected with a high
performance network. The largest cluster we have worked with is several
hundred nodes. We see hundreds of GB/sec of data flowing between
clients and services thru a single broker. The load is evenly distributed
among all instances of a service type. Client requests are processed
in the order they are queued.

The strength of UIMA AS is to easily scale out pipelines that
exceed the processing resources of individual nodes with no changes to
annotator and flow controller code or descriptors. Achieving high
CPU utilization may require a bit of sophistication, as always, but
UIMA AS includes the tools to facilitate that process.

Eddie

Re: Scale out using multiple Collection Readers and Cas Consumers

Posted by Greg Holmberg <ho...@comcast.net>.
On Tue, 30 Nov 2010 13:23:27 -0800, Eddie Epstein <ea...@gmail.com>  
wrote:

> I agree with Jerry that there is no code in UIMA packages explicitly
> for this. I'd suggest looking at
> examples/src/org/apache/uima/examples/casMultiplier/SimpleTextSegmenter.java
> for an example CasMultiplier that can easily be adapted. Another
> suggestion is to assemble and test the aggregate before deploying it
> as a service. Much easier to debug.
>

OK, so, bottom line, if we want to process terabytes of text contained in  
millions of files, and we want to do it in a cluster of hundreds of  
machines, and we want that cluster to scale linearly and infinitely  
without bottle-necks, and we want to use UIMA-AS to do it, then we've got  
a lot of work ahead of us?  There's no existing example configuration or
code that shows how to do this?

If we did do that work, are you confident that AS doesn't have any  
inherent bottle-necks that would prevent scaling to that level?  Was it  
designed to do that kind of thing?  The multiple Collection Reader idea  
wouldn't really be able to do that, would it?

What if there's no obvious way to partition the file set?  Say, for  
example, we're crawling a web site, like amazon.com?

What if the file set is not known (and so can't be partitioned), such as  
if we have an on-demand service that is receiving a steady series of  
random job submissions from different clients, each wanting to process  
different doc sets from different repositories?  How could AS be  
configured to ensure efficient use of the hardware (load balanced, all CPU  
cores at 100%)?  And fairness to the competing clients?

The AS architecture has always been a bit fuzzy to me.  Any insights on  
how to achieve extreme scalability with AS would be appreciated.

Greg Holmberg

Re: Scale out using multiple Collection Readers and Cas Consumers

Posted by Eddie Epstein <ea...@gmail.com>.
I agree with Jerry that there is no code in UIMA packages explicitly
for this. I'd suggest looking at
examples/src/org/apache/uima/examples/casMultiplier/SimpleTextSegmenter.java
for an example CasMultiplier that can easily be adapted. Another
suggestion is to assemble and test the aggregate before deploying it
as a service. Much easier to debug.
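
For the second suggestion, a short sketch of driving the aggregate in-process
with the core UIMA API before any AS deployment is involved (the descriptor
path and document text are placeholders):

import java.io.File;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.analysis_engine.CasIterator;
import org.apache.uima.cas.CAS;
import org.apache.uima.util.XMLInputSource;

public class TestAggregateLocally {
  public static void main(String[] args) throws Exception {
    // Parse and instantiate the aggregate descriptor (path is a placeholder).
    AnalysisEngineDescription desc = UIMAFramework.getXMLParser()
        .parseAnalysisEngineDescription(
            new XMLInputSource(new File("desc/MyAggregateWithCasMultiplier.xml")));
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc);

    CAS cas = ae.newCAS();
    cas.setDocumentText("input appropriate to the first component of the aggregate");

    // If the aggregate starts with a CasMultiplier, process() may emit child CASes.
    CasIterator it = ae.processAndOutputNewCASes(cas);
    while (it.hasNext()) {
      CAS childCas = it.next();
      // ... inspect annotations here, set breakpoints, etc. ...
      childCas.release(); // release each child CAS back to the framework
    }
    ae.collectionProcessComplete();
    ae.destroy();
  }
}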

Eddie


On Tue, Nov 30, 2010 at 9:14 AM, Sergeant, Alan <al...@sap.com> wrote:
> Hi,
> I have been looking through the UIMA AS Getting Started guide (http://uima.apache.org/doc-uimaas-what.html), and am particularly interested in the architecture discussed in Figure 5 (Scale out using multiple Collection Readers and Cas Consumers). I am wondering whether there is any example code, like RunRemoteAsyncAE, available for this, or is this a suggested architecture that has not been implemented yet?
>
> Regards,
> Alan Sergeant
>
>

Re: Scale out using multiple Collection Readers and Cas Consumers

Posted by Jaroslaw Cwiklik <ui...@gmail.com>.
Hi, I don't think there is an explicit example that demonstrates a UIMA AS
setup as in Figure 5. However, with some effort you can create it yourself.
UIMA ships with a number of example components, including
FileSystemCollectionReader and XmiWriterCasConsumer among others. Check the
UIMA distribution under $UIMA_HOME/examples/descriptors. The above
CollectionReader reads a specified collection of documents from a file
system; supported document formats are text and XML. If your collection is
not segmented, perhaps the easiest way to get started is to use
RunRemoteAsyncAE (the client) and run it with a CollectionReader that points
to a collection of text/xml files. You can find a small collection of
documents under $UIMA_HOME/examples/data.

As for the service, you can modify the Deploy_MeetingDetectorTAE.xml example.
Just add to the AE descriptor a CasConsumer that writes the results of
analysis to the file system in XMI format. As mentioned, there is an example
CC for that: XmiWriterCasConsumer.

If your collection is already segmented, you can add a CasMultiplier
(CollectionReader) to the MeetingDetectorTAE descriptor and configure it to
point to the collection chunk you want to process.
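
Putting that together, a rough sketch of the client-side loop
(RunRemoteAsyncAE already does essentially this; the descriptor path, broker
URL, and queue name below are placeholders to adjust for your setup):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.uima.UIMAFramework;
import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionReader;
import org.apache.uima.collection.CollectionReaderDescription;
import org.apache.uima.util.XMLInputSource;

public class ReaderDrivenClient {
  public static void main(String[] args) throws Exception {
    // Instantiate the example FileSystemCollectionReader from its descriptor.
    CollectionReaderDescription crDesc = UIMAFramework.getXMLParser()
        .parseCollectionReaderDescription(new XMLInputSource(new File(
            "examples/descriptors/collection_reader/FileSystemCollectionReader.xml")));
    CollectionReader reader = UIMAFramework.produceCollectionReader(crDesc);

    // Connect to the queue served by the deployed MeetingDetector service.
    UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();
    Map<String, Object> appCtx = new HashMap<String, Object>();
    appCtx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616");
    appCtx.put(UimaAsynchronousEngine.ENDPOINT, "MeetingDetectorTaeQueue");
    appCtx.put(UimaAsynchronousEngine.CasPoolSize, 4);
    engine.initialize(appCtx);

    // Pump every document from the reader into the shared input queue.
    while (reader.hasNext()) {
      CAS cas = engine.getCAS();
      reader.getNext(cas);
      engine.sendCAS(cas);
    }
    engine.collectionProcessingComplete();
    engine.stop();
  }
}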

Regards, Jerry


On Tue, Nov 30, 2010 at 9:14 AM, Sergeant, Alan <al...@sap.com> wrote:

> Hi,
> I have been looking through the UIMA AS Getting Started guide (
> http://uima.apache.org/doc-uimaas-what.html), and am particularly
> interested in the architecture discussed in Figure 5 (Scale out using
> multiple Collection Readers and Cas Consumers). I am wondering whether
> there is any example code, like RunRemoteAsyncAE, available for this, or
> is this a suggested architecture that has not been implemented yet?
>
> Regards,
> Alan Sergeant
>
>