Posted to user@uima.apache.org by Kirk True <ki...@mustardgrain.com> on 2007/05/17 19:52:16 UTC

UIMA internals memory footprint

Hi all,

I have begun seeing heavy memory use when processing largish
documents through a UIMA pipeline. I wanted to make sure what I'm
seeing with regard to UIMA's internal memory use is on par with
expectations.

It looks like for either a 1,500,000 byte or a 15,000,000 byte document
with the same annotations (100,000 10-character annotations), we incur
a ~13 MB "overhead" for internal UIMA data structures. Is this in line
with expectations?

Details:

In the interest of narrowing down the issue, I made a very simple test
annotator which mimics what my annotators do. The annotator creates a
document of N bytes which is set in a view in the CAS, then it
transforms the bytes to an HTML string that is then set in a view in
the CAS. Next, for each view, the annotator creates 50,000 annotations.
Each annotation has two 5-character attributes. I profiled my
application using two profilers (JProbe and YourKit) and took heap
snapshots before and after processing was performed and saw similar
results.

I know there's a lot going on under the hood, so I'm trying to get an
idea of what kind of size factor I can expect for a given document
size. Right now, according to my calculations and verified by the
profiler, the expected memory usage for just my data (i.e. the two
views of the document and the strings making up the annotations) is:

For a 1,500,000 byte document:

    Original document         1,500,000
    HTML document             2,800,000
    TestCaseAnnotation        1,600,000
    Annotation strings        4,800,000 
    Annotation char[]s        2,400,000
    Integer                   1,600,000 (UIMA internal (Annotation))
    int[]                     9,300,000 (UIMA internal)
    java.util.HashMap$Entry   2,400,000 (UIMA internal)
    -----------------------------------
                             26,400,000

For a 15,000,000 byte document:

    Original document        15,000,000
    HTML document            28,000,000
    TestCaseAnnotation        1,600,000
    Annotation strings        4,800,000 
    Annotation char[]s        2,400,000
    Integer                   1,600,000 (UIMA internal (Annotation))
    int[]                     9,300,000 (UIMA internal)
    java.util.HashMap$Entry   2,400,000 (UIMA internal)
    -----------------------------------
                             65,100,000
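
As a cross-check of the two tables above, here is a minimal Java sketch
that re-adds the rows. The class and method names are made up for
illustration; the byte counts are copied from the tables, and the split
into "UIMA internal" rows follows the labels used there.

```java
// Re-adds the profiler figures from the two tables above.
// All names here are hypothetical; only the numbers come from the post.
final class FootprintTotals {
    // Rows that are identical in both tables (independent of document size).
    static final long ANNOTATION_OBJECTS = 1_600_000L; // TestCaseAnnotation
    static final long ANNOTATION_STRINGS = 4_800_000L;
    static final long ANNOTATION_CHARS   = 2_400_000L;

    // Rows labelled "UIMA internal" in the tables.
    static final long INTEGERS         = 1_600_000L;
    static final long INT_ARRAYS       = 9_300_000L;
    static final long HASH_MAP_ENTRIES = 2_400_000L;

    // Sum of the rows labelled "UIMA internal".
    static long uimaInternal() {
        return INTEGERS + INT_ARRAYS + HASH_MAP_ENTRIES;
    }

    // Grand total for one table, given the sizes of the two document views.
    static long total(long originalDoc, long htmlDoc) {
        return originalDoc + htmlDoc
                + ANNOTATION_OBJECTS + ANNOTATION_STRINGS + ANNOTATION_CHARS
                + uimaInternal();
    }

    public static void main(String[] args) {
        System.out.println("UIMA internal:     " + uimaInternal());
        System.out.println("1.5 MB doc total:  " + total(1_500_000L, 2_800_000L));
        System.out.println("15 MB doc total:   " + total(15_000_000L, 28_000_000L));
    }
}
```

The "UIMA internal" rows add up to the same figure in both cases, which
is where the "~13 MB" estimate comes from.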

I can post the code for the test cases if it helps.

Thanks,
Kirk

Re: UIMA internals memory footprint

Posted by Marshall Schor <ms...@schor.com>.
Kirk True wrote:
> Hi Marshall,
>
>   
>> This reduces 4.6 MB down to 1 MB overhead for 100K annotations.
>>     
>
> That's awesome - thanks so much for looking into this! 
>
> Just to double-check - will this make it into the 2.2 release?
>
> Thanks again,
> Kirk 
>
>
>   
Yes, it should be there.

-Marshall

Re: UIMA internals memory footprint

Posted by Kirk True <ki...@mustardgrain.com>.
Hi Marshall,

> This reduces 4.6 MB down to 1 MB overhead for 100K annotations.

That's awesome - thanks so much for looking into this! 

Just to double-check - will this make it into the 2.2 release?

Thanks again,
Kirk 

Re: UIMA internals memory footprint

Posted by Marshall Schor <ms...@schor.com>.
Hi Kirk -

Thanks for posting your test case.  It did point up an inefficiency in 
how one of the hash maps was being used - this is now being improved. 

Your basic question concerned expectations for storing things in the 
CAS.  Here's the basic model.

The CAS stores most things in int[] arrays.

Feature Structures take up a number of entries in the int[]:  one plus 
the number of features.  So an annotation takes 1 + 3 = 4 words.  (1 
word = 4 bytes). Your annotation type added 2 more features, so these 
take 6 words in the int[].
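
The word count above can be sketched in a few lines of Java. This is
only an illustration of the arithmetic (one word for the type plus one
word per feature, 4 bytes per word); the class and method names are
hypothetical and are not part of the UIMA API.

```java
// Illustrates the "one plus the number of features" rule described above.
// Hypothetical helper, not UIMA code.
final class FsHeapWords {
    static final int BYTES_PER_WORD = 4;

    // One word for the type, plus one word per feature.
    static int wordsFor(int featureCount) {
        return 1 + featureCount;
    }

    // Bytes used in the int[] by the given number of feature structures.
    static long bytesFor(int featureCount, long instances) {
        return (long) wordsFor(featureCount) * BYTES_PER_WORD * instances;
    }

    public static void main(String[] args) {
        int baseFeatures = 3;                 // a plain annotation, as stated above
        int customFeatures = baseFeatures + 2; // the test type adds two features
        System.out.println("plain annotation words:  " + wordsFor(baseFeatures));
        System.out.println("custom annotation words: " + wordsFor(customFeatures));
        System.out.println("bytes for 100,000:       " + bytesFor(customFeatures, 100_000L));
    }
}
```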

Feature Structure features which are Strings take another 4 slots in the 
int[] arrays (16 bytes), per string being referenced, in addition to 
whatever storage Java uses for strings.  Sun Java 6_01 appears to use 32 
+ 2 * number_of_characters to store a string.  In your test case, each 
String in Java took 42 bytes. 
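
Here is a rough Java sketch of that per-string arithmetic. The 32-byte
fixed cost is the figure observed above for Sun Java 6_01 and will
differ on other JVMs; the class name, method names, and the count of
200,000 strings (100,000 annotations with two string features each) are
assumptions for illustration.

```java
// Per-string cost model taken from the figures above (Sun Java 6_01).
// Hypothetical helper, not UIMA code; other JVMs will give other numbers.
final class StringFeatureCost {
    static final int CAS_SLOTS_PER_STRING = 4;   // extra int[] slots per string
    static final int BYTES_PER_SLOT = 4;
    static final int JAVA_STRING_FIXED_BYTES = 32; // observed, not guaranteed

    // Java-side storage for one string of the given length.
    static int javaStringBytes(int chars) {
        return JAVA_STRING_FIXED_BYTES + 2 * chars;
    }

    // int[] storage used to reference one string.
    static int casSlotBytes() {
        return CAS_SLOTS_PER_STRING * BYTES_PER_SLOT;
    }

    // Combined cost for a number of strings of equal length.
    static long totalBytes(int chars, long strings) {
        return (long) (javaStringBytes(chars) + casSlotBytes()) * strings;
    }

    public static void main(String[] args) {
        System.out.println("Java bytes per 5-char string: " + javaStringBytes(5));
        System.out.println("int[] bytes per string:       " + casSlotBytes());
        System.out.println("total for 200,000 strings:    " + totalBytes(5, 200_000L));
    }
}
```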

Indexes take one word per annotation indexed, typically.

All of the int[] objects grow as needed, by quantum jumps, so at any 
particular time, the number of words allocated is often larger than the 
number used. 

To reference CAS objects from a Java program, one of 2 interfaces is 
used: the "JCas" interface or the plain "CAS" Java interface.  Both of 
these create a 2 field Java object for each referenced CAS object.  In 
Sun's Java 6_01, there is an always-present overhead of 8 bytes per Java 
object, so these Java objects take 16 bytes.
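
The 16-byte figure can be reproduced with a small Java sketch. It
assumes an 8-byte object header, one 4-byte int field, and one
reference field whose size is passed in (4 bytes on the 32-bit JVM
discussed here); alignment padding is deliberately ignored, and the
names are hypothetical.

```java
// Size model for the 2-field wrapper object described above.
// Hypothetical helper; ignores JVM alignment padding.
final class WrapperObjectSize {
    static final int OBJECT_HEADER_BYTES = 8; // as observed for Sun Java 6_01
    static final int INT_FIELD_BYTES = 4;

    // Bytes for one wrapper: header + int field + one reference field.
    static int wrapperBytes(int referenceBytes) {
        return OBJECT_HEADER_BYTES + INT_FIELD_BYTES + referenceBytes;
    }

    // Bytes for one wrapper per annotation.
    static long totalBytes(int referenceBytes, long annotations) {
        return (long) wrapperBytes(referenceBytes) * annotations;
    }

    public static void main(String[] args) {
        System.out.println("bytes per wrapper (4-byte refs): " + wrapperBytes(4));
        System.out.println("bytes for 100,000 wrappers:      " + totalBytes(4, 100_000L));
    }
}
```

For 100,000 annotations this gives 1,600,000 bytes, which matches the
TestCaseAnnotation row in the original tables.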

In addition, the JCas implementation keeps a hash map where the keys are 
the CAS object reference (an int), and the values are the corresponding 
JCas object, once it is created.  This hash map takes additional space: 
in the runs you did, this took about 46 bytes per entry.  We've done 
some redesign and reduced this to about 10 bytes additional, per 
annotation.

So your numbers are basically correct, except we've now changed the 
entries due to the JCas hash map overhead from:

    Integer                   1,600,000 (UIMA internal (Annotation))
    java.util.HashMap$Entry   2,400,000 (UIMA internal)
    Object[]                    600,000 (a guess - for the table part of the hash table:
                                         100K entries (1 per annotation) * 4 bytes, plus
                                         some for the 75% load factor of this table, plus
                                         extra due to the table expanding by a factor of 2
                                         when growing)


The new implementation makes these numbers look more like this:

    Integer                   0         (UIMA internal (Annotation)) - no longer needed
    java.util.HashMap$Entry   0         (UIMA internal) - no longer used
    Object[]                  1,000,000 (an approximation - for the table part of the hash table:
                                         100K entries (1 per annotation) * 4 bytes, plus
                                         some for the 50% load factor of this table)

This reduces 4.6 MB down to 1 MB overhead for 100K annotations.
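
The before/after comparison can be written out as a short Java sketch.
The byte counts are the ones listed in the two tables above (including
the guessed Object[] size); the class and method names are hypothetical.

```java
// Adds up the JCas hash map rows before and after the redesign,
// using the figures from the tables above. Hypothetical helper.
final class JCasMapOverhead {
    static final long ANNOTATIONS = 100_000L;

    // Old layout: Integer keys + HashMap$Entry objects + table array (a guess).
    static long before() {
        return 1_600_000L + 2_400_000L + 600_000L;
    }

    // New layout: only the table array remains (an approximation).
    static long after() {
        return 1_000_000L;
    }

    // Average overhead per annotation for a given total.
    static long perAnnotation(long totalBytes) {
        return totalBytes / ANNOTATIONS;
    }

    public static void main(String[] args) {
        System.out.println("before: " + before() + " bytes, "
                + perAnnotation(before()) + " per annotation");
        System.out.println("after:  " + after() + " bytes, "
                + perAnnotation(after()) + " per annotation");
    }
}
```

The per-annotation results (46 and 10 bytes) agree with the per-entry
figures quoted earlier in this message.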

When I measured the int[] use for Java 6_01 from Sun, using Sun's 
heapDump and "jhat" tool, I found the int[] size to be
8.1 MB (versus your measurement of 9.3 MB).  I'm not sure where the 
difference comes from. 

If I add all these up, I get a "UIMA overhead" of 8.1 MB + 1MB = 
9.1MB.   Before the "fix" for the JCas instance hashmap,
this would be closer to your reported overhead:  8.1 MB + 4.6 MB = 12.7 
MB.  So your reporting triggered a significant
improvement in the implementation (will be in the next release) - thank you!

I hope this helps in having a better model of what to expect in terms of 
space utilization.

-Marshall


Re: UIMA internals memory footprint

Posted by Adam Lally <al...@alum.rpi.edu>.
On 5/22/07, Kirk True <ki...@mustardgrain.com> wrote:
> If it helps, here's the source code for the annotator and the shim
> application from which it is run:
>
>     http://www.mustardgrain.com/files/testcaseannotator.zip
>

Kirk,

Just so we have all the legal bases covered, could you attach this to
the JIRA issue UIMA-412
(https://issues.apache.org/jira/browse/UIMA-412) and check the box
that says you grant license to the ASF?

Thanks,
  -Adam

Re: UIMA internals memory footprint

Posted by Kirk True <ki...@mustardgrain.com>.
Hi Marshall,

> The indexes use int[] arrays. 
> 
> Kirk - what indexes do you have defined (if any)?  Do you 
> "addToIndexes..." any of
> the annotations you create?

Yes - I'm adding all annotations to the indexes.

If it helps, here's the source code for the annotator and the shim
application from which it is run:

    http://www.mustardgrain.com/files/testcaseannotator.zip

Thanks for all the feedback!

Kirk


Re: UIMA internals memory footprint

Posted by Marshall Schor <ms...@schor.com>.
The indexes use int[] arrays. 

Kirk - what indexes do you have defined (if any)?  Do you 
"addToIndexes..." any of
the annotations you create?

-Marshall


Re: UIMA internals memory footprint

Posted by Adam Lally <al...@alum.rpi.edu>.
On 5/18/07, Thilo Goetz <tw...@gmx.de> wrote:
> You can estimate data use on the heap as follows.  Each FS uses at least one
> int for the type information, plus whatever features it has.  So a vanilla
> annotation is 3 ints, one for the type, and one for the start and end features,
> respectively.  If you have two additional features, that's 5 ints, so 20 bytes.
> If you use the JCas, you incur an additional overhead of a Java object for
> each annotation.  It's small, but I can't say off the top of my head how small
> exactly.  Plus, the JCas objects are held in a HashMap (or some such, Marshall
> correct me if I'm wrong), which incurs additional memory overhead.
>
> In my experience, the CAS can easily reach 10 to 20 times the size of the input
> document.  If you have information-rich token annotations, that's not really
> surprising.  (And this is without using JCas).  Imagine you were to manually
> create Java objects that carry the same information, you would see roughly
> the same kind of overhead.
>

Using these numbers can we account for the 9,300,000 bytes of integer arrays?

100,000 annotations of size 5 cells = 500,000 ints, which is exactly
the default heap size.  But with the Sofa FS this will exceed the
default heap size.  It will grow by another 500,000 (I think).

So that accounts for 1,000,000 ints = 4,000,000 bytes.

Where are the other 5,300,000?
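
To make the accounting above concrete, here is a minimal Java sketch.
It ASSUMES the heap grows in fixed increments of 500,000 cells, which
is only the guess stated above ("I think"), not confirmed behavior. The
single extra cell standing in for the Sofa FS is also a placeholder, and
all names are hypothetical.

```java
// Models a cell heap that starts at an initial size and grows in fixed
// increments. ASSUMPTION: fixed-increment growth; the real policy may differ.
final class CasHeapGrowth {
    static final int BYTES_PER_CELL = 4;

    // Cells allocated once usedCells have been requested.
    static long allocatedCells(long usedCells, long initialCells, long incrementCells) {
        if (initialCells <= 0 || incrementCells <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (usedCells <= initialCells) {
            return initialCells;
        }
        long overflow = usedCells - initialCells;
        long increments = (overflow + incrementCells - 1) / incrementCells; // ceiling division
        return initialCells + increments * incrementCells;
    }

    public static void main(String[] args) {
        long used = 100_000L * 5 + 1; // 100,000 annotations of 5 cells, plus a placeholder cell
        long cells = allocatedCells(used, 500_000L, 500_000L);
        long bytes = cells * BYTES_PER_CELL;
        System.out.println("allocated cells: " + cells);
        System.out.println("allocated bytes: " + bytes);
        System.out.println("unexplained:     " + (9_300_000L - bytes));
    }
}
```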



Likewise, what about the 1,600,000 bytes of Integers.  The JCAS hash
map only accounts for one per annotation, which in this case should
only be 400,000 bytes.

Maybe it would be useful to get Kirk's test case so we can take a look
at where exactly the memory is being used.  I think it would need to
be attached to a JIRA issue with the grant license to Apache box
checked?

-Adam

Re: UIMA internals memory footprint

Posted by Thilo Goetz <tw...@gmx.de>.
Marshall Schor wrote:
> Thilo Goetz wrote:
[...]
>> If you use the JCas, 
> or you create FeatureStructure Java objects (which are Java Objects),

True.  My point was (which maybe I should have mentioned ;-) that JCas
objects stick around, while plain old FeatureStructures can get
garbage collected after each annotator has run.  So JCas objects behave
like the rest of the CAS in that respect, and unlike FeatureStructure
objects.  Not beating on the JCas, just trying to explain sources of
memory consumption in the final analysis, after processing, so to speak.

--Thilo


Re: UIMA internals memory footprint

Posted by Marshall Schor <ms...@schor.com>.
Thilo Goetz wrote:
> Kirk True wrote:
>> Hi Adam,
>>
>>> Kirk,
>>>
>>> In this test are you running a CPE or just an AnalysisEngine?  If it
>>> is a CPE do you know what your CAS Pool size is?
>>
>> It's an AnalysisEngine.
>>
>>> When a CAS is created it does allocate a large heap which is then
>>> filled as you create annotations.  By default I believe this is
>>> 500,000 cells (2MB) per CAS, but this can be overridden (see
>>> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
>>> definitely be one source of memory overhead.  As you saw it does not
>>> grow with larger documents, it will only grow if you create enough
>>> annotations to fill up the allocated space.
>>
>> I noticed that this is tweak-able and set it to something insanely
>> small (like 100). But, as you said, it grows as the number of
>> annotations grows. Since the parameter is under the umbrella of
>> performance, I'd assume that it would actually be better to
>> pre-allocate close to what we're going to use.
> [...]
>
> Yes.
>
> You can estimate data use on the heap as follows.  Each FS uses at 
> least one
> int for the type information, plus whatever features it has.  So a 
> vanilla
> annotation is 3 ints, one for the type, and one for the start and end 
> features,
> respectively.  If you have two additional features, that's 5 ints, so 
> 20 bytes.
> If you use the JCas, 
or you create FeatureStructure Java objects (which are Java Objects),
> you incur an additional overhead of a Java object for
> each annotation.  It's small, but I can't say off the top of my head 
> how small
> exactly.  
Both the Feature Structure Java object and the JCas Java Object have 2 
fields:
a Java "int" (4 bytes) and a Java reference (4 bytes, unless it's a 64 
bit Java, I think).
Plus you have to add the Java overhead for an object, which might be 8 
bytes, but I'm
not sure.
> Plus, the JCas objects are held in a HashMap (or some such, Marshall
> correct me if I'm wrong), which incurs additional memory overhead.
True.  The key is a wrapped "int", the value is a Java "ref", and then 
you have the
hash table overhead. 
>
> In my experience, the CAS can easily reach 10 to 20 times the size of 
> the input
> document.  If you have information-rich token annotations, that's not 
> really surprising.  (And this is without using JCas).  Imagine you 
> were to manually
> create Java objects that carry the same information, you would see 
> roughly
> the same kind of overhead.
Two more points:

If you have variable sized documents, you might want to consider 
"chunking" - that is, breaking
very large documents up into multiple CASes.   A CAS Consumer can 
collect the chunks at the
end of the processing pipeline and re-assemble things.
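
As a very rough illustration of the chunking idea, here is a plain-Java
sketch that splits a document string into fixed-size pieces, each of
which could then be processed in its own CAS. It is not UIMA code: the
class and method names are hypothetical, and it cuts at arbitrary
character offsets, so a real splitter would want to respect sentence or
markup boundaries and keep track of offsets for re-assembly.

```java
import java.util.ArrayList;
import java.util.List;

// Naive fixed-size splitter; a stand-in for "breaking very large
// documents up into multiple CASes". Hypothetical helper, not UIMA code.
final class DocumentChunker {
    // Splits text into consecutive pieces of at most maxChars characters.
    static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChars) {
            int end = Math.min(text.length(), start + maxChars);
            chunks.add(text.substring(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = chunk("abcdefghij", 4);
        System.out.println(chunks.size() + " chunks: " + chunks);
        // Concatenating the chunks in order restores the original text.
        System.out.println(String.join("", chunks));
    }
}
```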

Finally, when you "reset" a CAS, if it had expanded itself due to an 
unusually large number of feature structures, it
will gradually shrink back down to a more nominal size.  There is code 
in the reset that does this adjustment.

-Marshall


Re: UIMA internals memory footprint

Posted by Thilo Goetz <tw...@gmx.de>.
Kirk True wrote:
> Hi Adam,
> 
>> Kirk,
>>
>> In this test are you running a CPE or just an AnalysisEngine?  If it
>> is a CPE do you know what your CAS Pool size is?
> 
> It's an AnalysisEngine.
> 
>> When a CAS is created it does allocate a large heap which is then
>> filled as you create annotations.  By default I believe this is
>> 500,000 cells (2MB) per CAS, but this can be overridden (see
>> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
>> definitely be one source of memory overhead.  As you saw it does not
>> grow with larger documents, it will only grow if you create enough
>> annotations to fill up the allocated space.
> 
> I noticed that this is tweak-able and set it to something insanely
> small (like 100). But, as you said, it grows as the number of
> annotations grows. Since the parameter is under the umbrella of
> performance, I'd assume that it would actually be better to
> pre-allocate close to what we're going to use.
[...]

Yes.

You can estimate data use on the heap as follows.  Each FS uses at least one
int for the type information, plus whatever features it has.  So a vanilla
annotation is 3 ints, one for the type, and one for the start and end features,
respectively.  If you have two additional features, that's 5 ints, so 20 bytes.
If you use the JCas, you incur an additional overhead of a Java object for
each annotation.  It's small, but I can't say off the top of my head how small
exactly.  Plus, the JCas objects are held in a HashMap (or some such, Marshall
correct me if I'm wrong), which incurs additional memory overhead.

In my experience, the CAS can easily reach 10 to 20 times the size of the input
document.  If you have information-rich token annotations, that's not really 
surprising.  (And this is without using JCas).  Imagine you were to manually
create Java objects that carry the same information, you would see roughly
the same kind of overhead.

--Thilo

Re: UIMA internals memory footprint

Posted by Kirk True <ki...@mustardgrain.com>.
Hi Adam,

> Kirk,
> 
> In this test are you running a CPE or just an AnalysisEngine?  If it
> is a CPE do you know what your CAS Pool size is?

It's an AnalysisEngine.

> When a CAS is created it does allocate a large heap which is then
> filled as you create annotations.  By default I believe this is
> 500,000 cells (2MB) per CAS, but this can be overridden (see
> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
> definitely be one source of memory overhead.  As you saw it does not
> grow with larger documents, it will only grow if you create enough
> annotations to fill up the allocated space.

I noticed that this is tweak-able and set it to something insanely
small (like 100). But, as you said, it grows as the number of
annotations grows. Since the parameter is under the umbrella of
performance, I'd assume that it would actually be better to
pre-allocate close to what we're going to use.
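
A minimal sketch of how such a pre-allocation size might be estimated
in Java is below. The words-per-annotation value and the safety margin
are inputs you would have to supply yourself; the example uses 6 words
per annotation (the count given elsewhere in this thread for this test
type) and a 10% margin picked arbitrarily. The names are hypothetical,
and the result would still need to be passed to UIMA through the
performance tuning properties mentioned above.

```java
// Estimates an initial heap size (in cells) from an expected annotation
// count. Hypothetical helper; the inputs are assumptions, not UIMA defaults.
final class InitialHeapEstimate {
    static final int BYTES_PER_CELL = 4;

    // Expected cells plus a percentage margin to avoid an early expansion.
    static long suggestedCells(long expectedAnnotations, int wordsPerAnnotation, int marginPercent) {
        if (expectedAnnotations < 0 || wordsPerAnnotation <= 0 || marginPercent < 0) {
            throw new IllegalArgumentException("arguments out of range");
        }
        long base = expectedAnnotations * wordsPerAnnotation;
        return base + (base * marginPercent) / 100;
    }

    public static void main(String[] args) {
        long cells = suggestedCells(100_000L, 6, 10);
        System.out.println("suggested cells: " + cells);
        System.out.println("suggested bytes: " + cells * BYTES_PER_CELL);
    }
}
```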

Thanks!
Kirk


Re: UIMA internals memory footprint

Posted by Adam Lally <al...@alum.rpi.edu>.
Kirk,

In this test are you running a CPE or just an AnalysisEngine?  If it
is a CPE do you know what your CAS Pool size is?

When a CAS is created it does allocate a large heap which is then
filled as you create annotations.  By default I believe this is
500,000 cells (2MB) per CAS, but this can be overridden (see
UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
definitely be one source of memory overhead.  As you saw it does not
grow with larger documents, it will only grow if you create enough
annotations to fill up the allocated space.

-Adam

