Posted to user@uima.apache.org by Olivier Terrier <ol...@temis.com> on 2008/07/21 16:34:21 UTC

UIMA chunking

Hi all,
Sometimes we face the problem of processing collections of "big" documents.
This can lead to instability in the processing chain: out-of-memory errors, timeouts, etc.
Moreover, it is not very efficient in terms of load balancing (we use CPEs with analysis engines deployed as Vinci remote services on several machines).
We would like to solve this problem by implementing a kind of UIMA document chunking where
big documents are split into reasonable chunks (according to a given block size, for example) at the beginning of the processing chain and merged back into one CAS at the end.
In our view, the splitting phase is quite straightforward: a CAS multiplier
splits the input document into N text blocks and produces N CASes.
Chunking information such as:
- document identifier
- current part number
- total part number
- text offset
is stored in each CAS.
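To make the splitting phase concrete, here is a plain-Java sketch of it. This is only an illustration: the names (`ChunkSplitter`, `Chunk`, `split`) are made up, and in a real CAS multiplier the `Chunk` fields would become features on a chunk-info feature structure in each output CAS.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the splitting phase. In a real CAS multiplier the
// Chunk fields would be stored as features on a chunk-info annotation in
// each output CAS; the names here are illustrative only.
public class ChunkSplitter {

    /** Metadata carried by each chunk, mirroring the list above. */
    public static final class Chunk {
        public final String docId;   // document identifier
        public final int part;       // current part number (1-based)
        public int totalParts;       // total part number
        public final int offset;     // text offset of this chunk in the full document
        public final String text;

        Chunk(String docId, int part, int offset, String text) {
            this.docId = docId;
            this.part = part;
            this.offset = offset;
            this.text = text;
        }
    }

    /** Split on a target block size, backing up to the last whitespace so
     *  a token is never cut in half. */
    public static List<Chunk> split(String docId, String text, int blockSize) {
        List<Chunk> chunks = new ArrayList<>();
        int start = 0, part = 1;
        while (start < text.length()) {
            int end = Math.min(start + blockSize, text.length());
            if (end < text.length()) {
                int ws = text.lastIndexOf(' ', end);
                if (ws > start) end = ws + 1;  // keep the space with the left chunk
            }
            chunks.add(new Chunk(docId, part++, start, text.substring(start, end)));
            start = end;
        }
        for (Chunk c : chunks) c.totalParts = chunks.size();
        return chunks;
    }
}
```

Note that the chunk offsets are chosen so that concatenating the chunk texts in part order reproduces the original document exactly, which is what makes the merge-time offset arithmetic possible.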
The merging phase is much more complicated: a CAS consumer is responsible for intercepting each "part" and storing it somewhere (in memory, or serialized on the filesystem). When the last part of the document comes in, all the annotations of the CAS parts are merged back, taking the offsets into account.
As we use a CPE, the merger CAS consumer can't "produce" a new CAS. What we have in mind is to create a new Sofa "fullDocumentView" in the last CAS "part" to store the text of the full document along with its associated annotations.
Another idea is to use sofa mappings so that our existing CAS consumers (which are sofa-unaware) and come after the merger in the CPE flow can be left unchanged.
      CPE flow:
      
    CAS SPLITTER
_InitialView: text part_i
fullDocumentView: empty
          |
         AE1  
_InitialView: text part_i + annotations AE1
fullDocumentView: empty
          |
        ...
          |
         AEn
_InitialView: text partN + annotations AE1+...+AEn
fullDocumentView: empty
          |
     CAS MERGER
_InitialView: text part_i + annotations AE1+...+AEn
fullDocumentView: if not last part = empty
                  if last part = text + annotations merged part1+...+partN
          |
      CONSUMER (sofa-unaware)
MAPPING cpe sofa : fullDocumentView => component sofa : _InitialView
_InitialView: text + annotations merged part1+...+partN
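The core of the merge step in the diagram above is simple offset arithmetic: an annotation spanning (begin, end) in part i maps to (begin + offset_i, end + offset_i) in the full document. A plain-Java sketch of that arithmetic (the `Span` and `AnnotationMerger` names are illustrative; in UIMA these spans would be Annotation feature structures copied into the fullDocumentView):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the offset arithmetic in the merge step.
// In a real merger the spans would be UIMA Annotations copied from each
// part CAS into the fullDocumentView Sofa.
public class AnnotationMerger {

    public static final class Span {
        public final int begin, end;
        public final String type;
        public Span(int begin, int end, String type) {
            this.begin = begin; this.end = end; this.type = type;
        }
    }

    /** Shift the spans of one part by that part's text offset in the full document. */
    public static List<Span> remap(List<Span> partSpans, int partOffset) {
        List<Span> merged = new ArrayList<>();
        for (Span s : partSpans) {
            merged.add(new Span(s.begin + partOffset, s.end + partOffset, s.type));
        }
        return merged;
    }
}
```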

The tricky operations are:
- caching/storing the CAS 'parts' in the merger: how (XCAS, XMI, etc.)? where (memory, disk, ...)?
- merging the annotations of the CAS 'parts' into the full-document CAS.
- error management: what happens in case of errors on some parts?
We would like to hear the thoughts/opinions of the UIMA community regarding this problem and possible solutions.
Do you think our approach is a good one?
Has anybody already faced a similar problem?
As far as possible we don't want to reinvent the wheel, and we would give priority to a generic and ideally UIMA-built-in implementation. We are of course ready to contribute to this development if the community finds a generic solution.
Regards
Olivier Terrier - TEMIS 

Re: UIMA chunking

Posted by Burn Lewis <bu...@gmail.com>.
One approach is to use another CAS Multiplier for the merge ... it would
take in the N chunks and produce an output CAS only when the N-th has been
processed.  Any later processing would be independent of the chunking that
preceded it.  This merging CM could also handle any out-of-order segments
that can occur if you scale out your annotators.  The CasCopier class makes
it relatively easy to copy all FeatureStructures and update their offsets as
necessary.
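The bookkeeping inside such a merging CAS multiplier can be sketched in plain Java as below. The `ChunkMerger` and `Part` names are made up for illustration, and the sketch merges only the document text; the real component would extend a UIMA CAS-multiplier base class and use CasCopier to carry the annotations over (with their offsets shifted), as Burn describes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the bookkeeping inside a merging CAS multiplier: buffer parts
// (which may arrive out of order when annotators are scaled out) and
// release the merged document only once all N parts are in.
public class ChunkMerger {

    public static final class Part {
        public final String docId;
        public final int partNum;     // 1-based
        public final int totalParts;
        public final String text;
        public Part(String docId, int partNum, int totalParts, String text) {
            this.docId = docId; this.partNum = partNum;
            this.totalParts = totalParts; this.text = text;
        }
    }

    private final Map<String, Map<Integer, Part>> pending = new HashMap<>();

    /** Returns the merged text when this part completes its document, else null. */
    public String add(Part p) {
        Map<Integer, Part> parts =
                pending.computeIfAbsent(p.docId, k -> new HashMap<>());
        parts.put(p.partNum, p);
        if (parts.size() < p.totalParts) {
            return null;  // still waiting for sibling parts
        }
        StringBuilder merged = new StringBuilder();
        for (int i = 1; i <= p.totalParts; i++) {
            merged.append(parts.get(i).text);  // reassemble in part order
        }
        pending.remove(p.docId);  // free the buffer for this document
        return merged.toString();
    }
}
```

Keying the buffer by document identifier is what lets parts of several documents interleave, which happens naturally once the annotators between splitter and merger are scaled out.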

Burn.