Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2012/09/11 06:06:49 UTC

Re: [jira] [Created] (UIMA-2460) Binary deserialization inefficient

On 8/27/2012 1:59 PM, Thilo Goetz wrote:
> Once you're done, I'd be interested to know if this has any measurable effect
> beyond the noise level.

Here are my back-of-the-envelope calculations on this:

Each string object has a ref to a char array (4-8 bytes) + an offset (4) and a
length (4).
Each char array object has a length (4) + the space for the chars.

In addition, there's "object" overhead of 8 - 16 bytes per object (depends on 32
or 64 bit Java, etc.).

In the case of a string using an individual char array, the space is approx
(using 8 byte (64 bit) address refs and 16 bytes of overhead per object):

2 obj overheads + string obj fields + char array length field + char length * 2 =
32 + 16 + 4 + char length * 2 = 52 + char length * 2.

With this patch, the per-string cost becomes
1 obj overhead + string obj fields + char length * 2 =
16 + 16 + char length * 2 = 32 + char length * 2.

So, the savings amount to about 20 bytes / string; how much that matters in
relative terms depends on the "average" size of the character strings.
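
To make the arithmetic easy to re-run with other assumptions, here is a rough
sketch in Java. The class and method names are made up for illustration, and
the constants are just the estimates from this message (16-byte object
overhead, 8-byte refs), not measured values.

```java
/** Back-of-the-envelope heap estimate for one string, using the figures above. */
final class StringFootprint {
    static final int OBJECT_OVERHEAD = 16; // assumed per-object overhead (64-bit JVM)
    static final int REF = 8;              // assumed size of the ref to the char array
    static final int INT = 4;              // offset, length, and array-length fields

    /** String with its own char array: two objects per string. */
    static long separateArrayBytes(int charCount) {
        int stringFields = REF + INT + INT; // ref + offset + length = 16
        int arrayLengthField = INT;         // 4
        return 2L * OBJECT_OVERHEAD + stringFields + arrayLengthField + 2L * charCount;
    }

    /** String sharing one big char array: only the string object is per-string. */
    static long sharedArrayBytes(int charCount) {
        int stringFields = REF + INT + INT;
        return OBJECT_OVERHEAD + stringFields + 2L * charCount;
    }

    public static void main(String[] args) {
        int charCount = 50; // example length only
        System.out.println(separateArrayBytes(charCount)); // prints 152
        System.out.println(sharedArrayBytes(charCount));   // prints 132
    }
}
```

The difference is 20 bytes whatever the string length, since only the fixed
per-string costs change.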

It's somewhat hard to say what typical CASes have as strings versus other space,
and what the average string length might be.
For one set I looked at, strings made up about 1/2 the space, but the average
string length was about 50 chars.

With this, the total string space might have started out as
n * (52 + 50 * 2) = n * 152.
After this patch, the space would be n * (32 + 50 * 2) = n * 132.

If the string space accounted for 50% of the CAS, the savings in CAS space would
be 20/152 divided by 2, or about 6.6%.
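
The same estimate as a small Java sketch, so the two inputs (share of the CAS
taken by strings, average string length) can be varied. The names are
hypothetical, and the fixed per-string costs of 52 and 32 bytes are the
estimates from above, not measurements.

```java
/** Rough estimate of the fraction of CAS space saved, per the figures above. */
final class CasSavings {
    static final int FIXED_BEFORE = 52; // estimated fixed bytes per string, before
    static final int FIXED_AFTER = 32;  // estimated fixed bytes per string, after

    /**
     * @param stringShare fraction of CAS space taken by strings (0.0 to 1.0)
     * @param avgChars    average string length in chars (2 bytes each)
     */
    static double savingsFraction(double stringShare, int avgChars) {
        double before = FIXED_BEFORE + 2.0 * avgChars;
        double after = FIXED_AFTER + 2.0 * avgChars;
        return stringShare * (before - after) / before;
    }

    public static void main(String[] args) {
        // The case discussed above: strings are half the CAS, 50 chars on average.
        System.out.println(savingsFraction(0.5, 50));
        // Shorter strings make the fixed overhead matter more.
        System.out.println(savingsFraction(0.5, 10));
    }
}
```

For the 50% / 50 chars case this comes out a bit over 6.5%; with 10-char
strings it is roughly double that.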

If the strings accounted for more than 50% of the CAS space, or the average
string length was less than 50 chars (100 bytes), the % of savings (in CAS
size) would be larger.  Of course, there are other things that take up space
in a UIMA application, besides the CAS; counting all of that will reduce the
overall % effect when measured as a percent of the total space used by the
entire application.

So, I would tend to agree that the savings is probably not all that large, in
most "typical" cases.

-Marshall

>
> On 27.08.2012 16:29, Marshall Schor (JIRA) wrote:
>> Marshall Schor created UIMA-2460:
>> ------------------------------------
>>
>>               Summary: Binary deserialization inefficient
>>                   Key: UIMA-2460
>>                   URL: https://issues.apache.org/jira/browse/UIMA-2460
>>               Project: UIMA
>>            Issue Type: Improvement
>>            Components: Core Java Framework
>>              Reporter: Marshall Schor
>>              Assignee: Marshall Schor
>>              Priority: Minor
>>               Fix For: 2.4.1SDK
>>
>>
>> The CAS binary deserialization code can be made (much) more space efficient. 
>> Currently, the char data that is used in the strings is read into a char
>> array; each string is represented as an offset into this char array + a
>> length; and new Java strings are created using new String(chararray, offset,
>> length).  This works, but it allocates a new char array for each string being
>> created, and copies from the original char array.  This results in new char
>> array objects for each string object.
>>
>> The alternative is to reuse the original char array object, and not allocate
>> any other char array objects.  This can be done by:
>> * making a temporary string from the entire char array object, and then
>> * making the new strings using tempString.substring(offset, offset + length)
>>
>> For 1000 strings, this will save 999 char array object overheads (probably
>> about 16 bytes per).
>>
>> An additional space savings is possible by reusing the same string object for
>> equal strings.
>>
>>
>
>
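
For reference, the two approaches described in the quoted issue could be
sketched roughly as follows. This is not the actual UIMA deserialization code;
the class, method, and parameter names are invented for illustration. Note
also that whether substring() shares the backing char array depends on the JDK:
as far as I know, JDKs from 7u6 onward copy the chars in substring(), so the
space saving would only apply on older JVMs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the string-table decoding described in the issue. */
final class StringTableDecoder {

    /** Current approach per the issue: each string gets its own char array copy. */
    static List<String> decodeByCopy(char[] chars, int[] offsets, int[] lengths) {
        List<String> out = new ArrayList<>(offsets.length);
        for (int i = 0; i < offsets.length; i++) {
            out.add(new String(chars, offsets[i], lengths[i]));
        }
        return out;
    }

    /** Proposed approach: one temporary string over all chars, then substring. */
    static List<String> decodeBySubstring(char[] chars, int[] offsets, int[] lengths) {
        String temp = new String(chars);
        List<String> out = new ArrayList<>(offsets.length);
        for (int i = 0; i < offsets.length; i++) {
            // Shares temp's char array only on JDKs where substring() does not copy.
            out.add(temp.substring(offsets[i], offsets[i] + lengths[i]));
        }
        return out;
    }

    /** Additional saving mentioned in the issue: reuse one object for equal strings. */
    static List<String> dedupe(List<String> in) {
        Map<String, String> seen = new HashMap<>();
        List<String> out = new ArrayList<>(in.size());
        for (String s : in) {
            String previous = seen.putIfAbsent(s, s);
            out.add(previous != null ? previous : s);
        }
        return out;
    }

    public static void main(String[] args) {
        char[] chars = "tokensentencetoken".toCharArray(); // made-up sample data
        int[] offsets = {0, 5, 13};
        int[] lengths = {5, 8, 5};
        List<String> strings = dedupe(decodeBySubstring(chars, offsets, lengths));
        System.out.println(strings);
        System.out.println(strings.get(0) == strings.get(2)); // same object after dedupe
    }
}
```

Both decode methods return equal strings; they differ only in how many char
array objects may end up on the heap.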