Posted to user@uima.apache.org by Ingo Thon <is...@gmx.de> on 2013/07/19 11:03:17 UTC

custom FSRepository(?)

Hi List-Members,

I'm using UIMA in a very large project. For two reasons I would like to store annotations (in part) and the SofaText/SofaStream:
1.) The workflow of our application is roughly as follows:
First, a UIMA AE is used to add metadata to the artifacts. Second, the artifacts are stored together with the metadata in a
DB and accessed by several other (non-UIMA) components. From time to time a second UIMA AE is used to produce
additional metadata or a representation for the user. This means the CAS has to be serialized and deserialized quite often.

2.) For some documents the number of annotations is too large to fit into memory. Currently this means UIMA crashes my application from time to time
with an OutOfMemoryError. Even though I was able to harden our application, I cannot apply UIMA to all documents.
As my annotators access the feature structures sequentially (using FSIterators),
streaming annotations from HD or an external DB would easily solve the problem.


I dug a little into the UIMA code, and it currently looks to me like the best option would be to build a
custom CAS/JCas where I replace the FSRepository for certain types, and also the methods getSofaStream and getSofaText().

However, I was wondering whether this is really the best option for my problem.


best regards,
Ingo Thon

Re: custom FSRepository(?)

Posted by Thilo Goetz <tw...@gmx.de>.

A solution I have employed in the past is to split very long documents
into pieces before UIMA processing, and put them back together
afterwards.  Of course, whether this is an option depends on your
particular application (i.e., are you doing mostly local, sentence-based
processing, or do you need the entire document for annotation?).
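
As a rough illustration of the split-and-merge idea (not code from this
thread), the sketch below cuts a text into chunks at sentence boundaries,
runs an annotator on each chunk, and shifts the resulting offsets back
into the coordinates of the full document. It is plain Java with no UIMA
dependency: ChunkedAnnotation, Chunk, Span, ChunkAnnotator and
capitalizedWords are hypothetical names, and the annotator is only a
stand-in for running an analysis engine on one chunk. It assumes that no
annotation needs to cross a chunk boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkedAnnotation {

    /** A labelled span; offsets are relative to the text it was produced from. */
    static final class Span {
        final int begin;
        final int end;
        final String label;
        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** One piece of the document plus its start offset in the full text. */
    static final class Chunk {
        final int offset;
        final String text;
        Chunk(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    /** Hypothetical stand-in for running an analysis engine on one chunk. */
    interface ChunkAnnotator {
        List<Span> annotate(String chunkText);
    }

    /** Cuts after the last ". " that fits into maxChars, else makes a hard cut. */
    static List<Chunk> split(String text, int maxChars) {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be >= 1");
        }
        List<Chunk> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Last sentence break whose following position is still <= end.
                int dot = text.lastIndexOf(". ", end - 2);
                if (dot >= start) {
                    end = dot + 2;
                }
            }
            chunks.add(new Chunk(start, text.substring(start, end)));
            start = end;
        }
        return chunks;
    }

    /** Annotates chunk by chunk and maps chunk-local offsets to document offsets. */
    static List<Span> process(String text, int maxChars, ChunkAnnotator annotator) {
        List<Span> merged = new ArrayList<>();
        for (Chunk chunk : split(text, maxChars)) {
            for (Span s : annotator.annotate(chunk.text)) {
                merged.add(new Span(s.begin + chunk.offset, s.end + chunk.offset, s.label));
            }
        }
        return merged;
    }

    /** Toy annotator: marks every capitalized word in the chunk. */
    static List<Span> capitalizedWords(String chunkText) {
        List<Span> spans = new ArrayList<>();
        Matcher m = Pattern.compile("\\b[A-Z][a-z]+").matcher(chunkText);
        while (m.find()) {
            spans.add(new Span(m.start(), m.end(), "Capitalized"));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Alpha beta. Gamma delta. Epsilon zeta.";
        for (Span s : process(text, 15, ChunkedAnnotation::capitalizedWords)) {
            System.out.println(s.begin + "-" + s.end + " " + text.substring(s.begin, s.end));
        }
    }
}
```

Only one chunk's annotations are produced at a time, although this sketch
still collects the merged result in memory; a real pipeline could write
each chunk's spans to the DB instead. Annotations that span a chunk
boundary would need extra handling, for example overlapping chunks or
cutting only at paragraph breaks.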

--Thilo