Posted to dev@uima.apache.org by Balkrishnan V <ba...@gmail.com> on 2009/02/18 06:12:47 UTC

Channel Usability Analytics using UIMA : Request for info on the Log-File Size..

Hi,

I am working on a UIMA solution for web-server log-analysis, to identify the
user-patterns.

I am new to the UIMA framework, so can you please give me some pointers on
the point in my statistics run at which I would encounter operating-system
issues/constraints?

For example, is there any limit to the size of the log files (.txt) that I can
feed to UIMA? If so, can you please give me some details?

TIA.

Regards,
Balkrishnan.V


Re: Channel Usability Analytics using UIMA : Request for info on the Log-File Size..

Posted by Marshall Schor <ms...@schor.com>.
UIMA uses its CAS (Common Analysis Structure) to pass information from one
annotator to the next. If the annotators are co-located, you can think of
the CAS as a set of memory-resident data structures, passed by reference.
If the annotators run on different hosts in a network, CASes are
serialized, sent over one of various network transports, and deserialized
at the receiving end.

The CAS can also be thought of as the "unit of work" for UIMA.  A very
large input (or indeed an "infinite" one, such as a real-time continuous
feed) can be broken up into units of work by a CAS Multiplier (or a
collection reader) component.  This is typically done for systems doing
things like real-time speech or video analysis.  The component typically
does the initial basic analysis to decide where logical units of work
occur; for instance, in the case of audio, it might break the stream on
"silence" boundaries.
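
For a web-server log, the natural boundary is the end of a line. The
following is a minimal plain-Java sketch of that splitting step, not UIMA
API code: LogChunker and its maxChars parameter are hypothetical names
chosen for illustration. In a real pipeline, logic like this would sit
inside a collection reader or CAS Multiplier, and each chunk would become
the document text of one CAS.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper (not part of UIMA): splits a log stream into bounded
// chunks on line boundaries, so that each chunk could become one CAS.
class LogChunker implements Iterator<String> {
    private final BufferedReader reader;
    private final int maxChars;
    private String pendingLine; // line already read but not yet in a chunk

    LogChunker(Reader source, int maxChars) {
        this.reader = new BufferedReader(source);
        this.maxChars = maxChars;
        this.pendingLine = readLine();
    }

    private String readLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingLine != null;
    }

    @Override
    public String next() {
        if (pendingLine == null) {
            throw new NoSuchElementException();
        }
        StringBuilder chunk = new StringBuilder();
        // Always take at least one line, even if it alone exceeds maxChars.
        do {
            chunk.append(pendingLine).append('\n');
            pendingLine = readLine();
        } while (pendingLine != null
                 && chunk.length() + pendingLine.length() + 1 <= maxChars);
        return chunk.toString();
    }

    public static void main(String[] args) {
        String log = "GET /a 200\nGET /b 404\nGET /c 200\n";
        Iterator<String> chunks = new LogChunker(new StringReader(log), 22);
        while (chunks.hasNext()) {
            System.out.print("--- chunk ---\n" + chunks.next());
        }
    }
}
```

Only one chunk is held in memory at a time, so the size of the log file
itself stops being the limiting factor; the chunk size is.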

The CAS also contains the "subject of analysis".  Often, this is a
string of characters representing a document to be analyzed, or a
string of bytes representing audio, etc.  Instead of the literal subject
of analysis, however, a CAS can contain a reference to an external
source for it.
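
The "reference instead of literal content" idea can be sketched in plain
Java as follows. This is an illustration only and does not use the UIMA
API: LogSliceRef, its fields, and load() are hypothetical names. The
point is that the object passed around is a few bytes in size, and the
log text is read from disk only by the component that needs it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical value type (not part of UIMA): identifies a slice of a log
// file by location instead of carrying the text itself.
class LogSliceRef {
    final Path file;
    final long offset; // byte offset of the slice within the file
    final int length;  // number of bytes in the slice

    LogSliceRef(Path file, long offset, int length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    // Reads the referenced bytes on demand and decodes them as UTF-8.
    String load() throws IOException {
        byte[] buf = new byte[length];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            raf.readFully(buf);
        }
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("access", ".log");
        Files.write(tmp, "GET /a 200\nGET /b 404\n".getBytes(StandardCharsets.UTF_8));
        // Refer to the second log line: it starts at byte 11, 10 bytes long.
        LogSliceRef ref = new LogSliceRef(tmp, 11, 10);
        System.out.println(ref.load());
        Files.delete(tmp);
    }
}
```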

The bottom-line answer to your question:  the CAS is a unit of work and
is kept in memory, so available memory can be the limit.  Users of UIMA
often break very large pieces of work into multiple CASes to control
this; the actual subject of analysis can be held literally in the CAS or
just referenced from it.

We have seen applications running on 64-bit operating systems with
64-bit JVMs that routinely support multi-gigabyte CASes, so CASes can
be quite large on today's hardware.
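
Since the practical ceiling is the JVM heap, a rough pre-flight check can
help decide whether a log file is better split up. The sketch below is
plain Java with hypothetical names (HeapCheck, likelyFits), and the
overhead factor of 4 is an assumption for illustration, not a figure from
UIMA; measure your own pipeline before relying on any such number.

```java
// Hypothetical helper (not part of UIMA): a rough check on whether an input
// of a given size is likely to fit in the JVM heap as a single CAS.
class HeapCheck {
    // overheadFactor is an assumed multiplier covering in-memory character
    // encoding and annotation data; tune it by measuring your own pipeline.
    static boolean likelyFits(long inputBytes, long maxHeapBytes, int overheadFactor) {
        // Divide rather than multiply so very large sizes cannot overflow.
        return inputBytes <= maxHeapBytes / overheadFactor;
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use (typically set with -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        long inputBytes = 2L * 1024 * 1024 * 1024; // e.g. a 2 GB log file
        System.out.println("Max heap bytes: " + maxHeap);
        System.out.println("Likely fits as one CAS: "
                + likelyFits(inputBytes, maxHeap, 4));
    }
}
```

If the check fails, splitting the input into several CASes is usually a
simpler fix than raising the heap size.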

HTH.  -Marshall
