Posted to user@uima.apache.org by Arthit Suriyawongkul <ar...@gmail.com> on 2007/06/25 12:48:39 UTC

UIMA document loading strategy

Hi,

How does UIMA load a document into memory?
Does it load the whole document at once, or does it read the document
partially (in a stream-like fashion)?

I'm currently using GATE and sometimes run into problems when a document
is very large, because GATE tries to load the whole document into memory
first and convert it to its own representation.
My application doesn't need knowledge of the whole document (as with DOM),
but only takes data from a small window (e.g. fewer than 100
characters) at a time.

cheers,
Art

-- 
:: Freedom Against Censorship Thailand
:: http://facthai.wordpress.com/sign/

Re: UIMA document loading strategy

Posted by Adam Lally <al...@alum.rpi.edu>.

Hi Art,

UIMA is flexible with respect to this.  You can provide a
CollectionReader that populates a CAS with however much text is
appropriate for your application.  So a single document could be split
across many CASes in order to decrease the overall memory
requirements.
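
As a rough sketch of that first approach, the plain-Java iterator below
hands out a document in fixed-size segments, so only one segment is held
in memory at a time. SegmentIterator and the segment size are hypothetical
names for illustration, not UIMA classes. In a real CollectionReader the
same logic would presumably sit behind hasNext() and getNext(CAS), with
each segment passed to the CAS via setDocumentText.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical helper, not part of UIMA: yields a document as fixed-size text segments. */
final class SegmentIterator implements Iterator<String> {
    private final Reader reader;
    private final char[] buffer;
    private String next; // look-ahead segment; null once the input is exhausted

    SegmentIterator(Reader reader, int segmentSize) {
        if (segmentSize <= 0) {
            throw new IllegalArgumentException("segmentSize must be positive");
        }
        this.reader = reader;
        this.buffer = new char[segmentSize];
        this.next = readSegment();
    }

    private String readSegment() {
        try {
            int filled = 0;
            // Reader.read may return fewer chars than requested, so loop until full or EOF.
            while (filled < buffer.length) {
                int n = reader.read(buffer, filled, buffer.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            return filled == 0 ? null : new String(buffer, 0, filled);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        next = readSegment();
        return current;
    }

    public static void main(String[] args) {
        // Each segment would become the text of one CAS.
        Iterator<String> it = new SegmentIterator(new StringReader("abcdefghij"), 4);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

A fixed character count can cut a word (or a surrogate pair) in half, so a
real reader would more likely break at whitespace or sentence boundaries.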

It's also possible to split a CAS into smaller CASes, do annotation on
each, and then merge the results.  The kind of component that does the
split and merge is called a "CAS Multiplier".  There's an example of
this in the uimaj-examples project that comes with the download - see
descriptors/cas_multiplier/Segment_Annotate_Merge_AE.  This is
described in the "CAS Multiplier Developer's Guide" section of the
documentation.
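
The split / annotate / merge idea can be illustrated without any UIMA
classes. The sketch below is not the code from the uimaj-examples project;
SegmentAnnotateMerge, Span, and the regex-based "annotator" are all
hypothetical stand-ins. It shows the one detail that merging has to get
right: annotation offsets found in a segment are relative to that segment
and must be shifted back into document coordinates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java illustration of split / annotate / merge; names are hypothetical, not UIMA API. */
final class SegmentAnnotateMerge {

    /** A half-open character span [begin, end). */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /** Stand-in annotator: marks every regex match; offsets are relative to the segment. */
    static List<Span> annotate(String segment, Pattern pattern) {
        List<Span> spans = new ArrayList<>();
        Matcher m = pattern.matcher(segment);
        while (m.find()) {
            spans.add(new Span(m.start(), m.end()));
        }
        return spans;
    }

    /** Splits text into segments of at most segmentSize chars, annotates each, and merges. */
    static List<Span> run(String text, int segmentSize, Pattern pattern) {
        List<Span> merged = new ArrayList<>();
        for (int start = 0; start < text.length(); start += segmentSize) {
            int end = Math.min(start + segmentSize, text.length());
            for (Span s : annotate(text.substring(start, end), pattern)) {
                // Shift segment-relative offsets back to document offsets.
                merged.add(new Span(start + s.begin, start + s.end));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(run("a1 b22 c333 d4", 7, Pattern.compile("\\d+")));
    }
}
```

Note that a match straddling a segment boundary would be split in two
here, so segment boundaries need to be chosen where no annotation can
cross them, or the segments need to overlap.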

Another option is to consider using a "remote Sofa" (Sofa = subject of
analysis).  In this case the CAS just contains a URL to where the
actual document lives, not the document text itself.  See the
"Annotations, Artifacts, and Sofas" section of the documentation.
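
To illustrate the idea behind a remote Sofa, the plain-Java sketch below
keeps only a URI and opens a stream on demand to read a small window of
text. RemoteDocument and window() are hypothetical names, and the UTF-8
encoding is an assumption. In UIMA itself the corresponding CAS calls are,
as far as I know, setSofaDataURI and getSofaDataStream; check the
documentation section named above before relying on that.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for a document reference; the text itself is never stored. */
final class RemoteDocument {
    private final URI uri;

    RemoteDocument(URI uri) {
        this.uri = uri;
    }

    /** Reads up to length chars starting at char offset; assumes UTF-8 content. */
    String window(long offset, int length) {
        try (Reader r = new BufferedReader(
                new InputStreamReader(uri.toURL().openStream(), StandardCharsets.UTF_8))) {
            long skipped = 0;
            while (skipped < offset) {
                long n = r.skip(offset - skipped);
                if (n <= 0) {
                    return ""; // offset lies beyond the end of the document
                }
                skipped += n;
            }
            char[] buf = new char[length];
            int filled = 0;
            // Loop because read may deliver fewer chars than requested.
            while (filled < length) {
                int n = r.read(buf, filled, length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            return new String(buf, 0, filled);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each call reopens the stream and skips forward, which keeps memory use
flat but is slow for many windows; an annotator that moves through the
document in order would more likely keep one stream open.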

-Adam