Posted to user@uima.apache.org by Arthit Suriyawongkul <ar...@gmail.com> on 2007/06/25 12:48:39 UTC
UIMA document loading strategy
Hi,
How does UIMA load a document into memory?
Does it load the whole document at once, or does it read the document
partially (sometimes stream-like)?
I'm currently using GATE, and I sometimes run into problems when a document
is very large, because GATE tries to load the whole document into memory
first and convert it to its own representation.
My application doesn't need knowledge of the whole document (as with DOM);
it only takes data from a small window (e.g. fewer than 100
characters) at a time.
cheers,
Art
--
:: Freedom Against Censorship Thailand
:: http://facthai.wordpress.com/sign/
Re: UIMA document loading strategy
Posted by Adam Lally <al...@alum.rpi.edu>.
Hi Art,
UIMA is flexible with respect to this. You can provide a
CollectionReader that populates a CAS with however much text is
appropriate for your application. So a single document could be split
across many CASes in order to decrease the overall memory
requirements.
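To illustrate the idea of feeding a large document through in pieces, here
is a minimal sketch in plain Java. It does not use the UIMA API: the class
name ChunkedDocumentSketch and the method forEachChunk are hypothetical, and
the fixed-size split is an assumption made for simplicity. In a real
CollectionReader, each chunk would become the document text of one CAS.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper, not part of UIMA: streams a document in bounded pieces.
public class ChunkedDocumentSketch {

    /**
     * Reads the input in pieces of at most chunkSize characters and passes
     * each piece to the handler. Only one piece is held in memory at a time.
     * Returns the number of pieces delivered.
     */
    public static int forEachChunk(Reader in, int chunkSize, Consumer<String> handler) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] buf = new char[chunkSize];
        int chunks = 0;
        try {
            while (true) {
                // Fill the buffer; a single read() may return fewer chars than asked.
                int filled = 0;
                while (filled < chunkSize) {
                    int n = in.read(buf, filled, chunkSize - filled);
                    if (n < 0) {
                        break; // end of input
                    }
                    filled += n;
                }
                if (filled == 0) {
                    return chunks;
                }
                handler.accept(new String(buf, 0, filled));
                chunks++;
                if (filled < chunkSize) {
                    return chunks; // last, partial chunk
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader doc = new StringReader("A long document that should not be loaded at once.");
        int count = forEachChunk(doc, 16, chunk -> System.out.println("[" + chunk + "]"));
        System.out.println(count + " chunks");
    }
}
```

A fixed-size split can cut a word in half, so a real reader would more
likely break on sentence or paragraph boundaries.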
It's also possible to split a CAS into smaller CASes, do annotation on
each, and then merge the results. The kind of component that does the
split and merge is called a "CAS Multiplier". There's an example of
this in the uimaj-examples project that comes with the download - see
descriptors/cas_multiplier/Segment_Annotate_Merge_AE. This is
described in the "CAS Multiplier Developer's Guide" section of the
documentation.
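The split / annotate / merge flow can be sketched in plain Java as below.
This is not the code from uimaj-examples and it does not use the UIMA API:
the class SegmentAnnotateMergeSketch, its methods, the line-based
segmentation, and the keyword "annotator" are all assumptions chosen to keep
the example small. The point it shows is that offsets found inside a segment
must be shifted by the segment's start position when results are merged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of segment -> annotate -> merge; not UIMA code.
public class SegmentAnnotateMergeSketch {

    /** Stand-in "annotator": begin offsets of keyword within one segment. */
    public static List<Integer> annotate(String segment, String keyword) {
        if (keyword.isEmpty()) {
            throw new IllegalArgumentException("keyword must not be empty");
        }
        List<Integer> offsets = new ArrayList<>();
        int from = 0;
        while (true) {
            int hit = segment.indexOf(keyword, from);
            if (hit < 0) {
                break;
            }
            offsets.add(hit);
            from = hit + keyword.length();
        }
        return offsets;
    }

    /**
     * Splits the document into line segments, annotates each segment on its
     * own, and merges the results back into document-level offsets.
     * Assumes the keyword itself contains no line break.
     */
    public static List<Integer> segmentAnnotateMerge(String document, String keyword) {
        List<Integer> merged = new ArrayList<>();
        int segmentStart = 0;
        while (segmentStart < document.length()) {
            int newline = document.indexOf('\n', segmentStart);
            int segmentEnd = (newline < 0) ? document.length() : newline + 1;
            String segment = document.substring(segmentStart, segmentEnd);
            for (int local : annotate(segment, keyword)) {
                // Shift the segment-local offset into document coordinates.
                merged.add(segmentStart + local);
            }
            segmentStart = segmentEnd;
        }
        return merged;
    }

    public static void main(String[] args) {
        String document = "UIMA loads\nCAS by CAS\nvia UIMA";
        System.out.println(segmentAnnotateMerge(document, "CAS"));
    }
}
```

Here the whole document is held in a String only to keep the demo short; in
practice the segments would come from a reader so that only one segment is
in memory at a time.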
Another option is to consider using a "remote Sofa" (Sofa = subject of
analysis). In this case the CAS just contains a URL to where the
actual document lives, not the document text itself. See the
"Annotations, Artifacts, and Sofas" section of the documentation.
-Adam