Posted to dev@uima.apache.org by "Marshall Schor (JIRA)" <de...@uima.apache.org> on 2012/08/27 16:29:07 UTC

[jira] [Created] (UIMA-2460) Binary deserialization inefficient

Marshall Schor created UIMA-2460:
------------------------------------

             Summary: Binary deserialization inefficient
                 Key: UIMA-2460
                 URL: https://issues.apache.org/jira/browse/UIMA-2460
             Project: UIMA
          Issue Type: Improvement
          Components: Core Java Framework
            Reporter: Marshall Schor
            Assignee: Marshall Schor
            Priority: Minor
             Fix For: 2.4.1SDK


The CAS binary deserialization code can be made (much) more space efficient.  Currently, the char data used in the strings is read into a single char array; each string is represented as an offset into this char array plus a length; and new Java strings are created using new String(chararray, offset, length).  This works, but it allocates a new char array object for each string being created and copies into it from the original char array.

The alternative is to reuse the original char array object, and not allocate any other char array objects.  This can be done by:
* making a temporary string from the entire char array object, and then
* making the new strings using tempString.substring(offset, offset + length)

For 1000 strings, this will save 999 char array object overheads (probably about 16 bytes each).
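
A minimal sketch of the two approaches, assuming the offsets and lengths are available as parallel int arrays. The class and method names are hypothetical and are not taken from the UIMA code. Whether substring actually shares the backing char array depends on the JDK: it did in the JDKs current when this issue was filed, but from Java 7 update 6 on, substring copies its characters, so on newer JDKs the second form saves nothing.

```java
import java.util.ArrayList;
import java.util.List;

final class StringSlices {
    // Current approach: each new String gets its own freshly copied char array.
    static List<String> copyEach(char[] chars, int[] offsets, int[] lengths) {
        List<String> out = new ArrayList<>(offsets.length);
        for (int i = 0; i < offsets.length; i++) {
            out.add(new String(chars, offsets[i], lengths[i]));
        }
        return out;
    }

    // Proposed approach: one temporary String over all the char data, then
    // substrings of it. new String(char[]) copies the array once; on JDKs
    // where substring shares the backing array, no further arrays are made.
    static List<String> substringEach(char[] chars, int[] offsets, int[] lengths) {
        String tempString = new String(chars);
        List<String> out = new ArrayList<>(offsets.length);
        for (int i = 0; i < offsets.length; i++) {
            out.add(tempString.substring(offsets[i], offsets[i] + lengths[i]));
        }
        return out;
    }
}
```

Both methods return equal strings; only the number of char array objects behind them differs.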

An additional space savings is possible by reusing the same string object for equal strings.
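
One way the reuse of equal strings could look, as a sketch only: a map from each string to the first instance seen with that value. The StringPool class and its canonical method are hypothetical names for illustration; the issue does not say how the reuse would be implemented.

```java
import java.util.HashMap;
import java.util.Map;

final class StringPool {
    // Maps each distinct string value to the first instance seen with that value.
    private final Map<String, String> seen = new HashMap<>();

    // Returns the previously stored instance equal to s, or stores and returns s.
    String canonical(String s) {
        String prior = seen.get(s);
        if (prior != null) {
            return prior;
        }
        seen.put(s, s);
        return s;
    }
}
```

The pool would only need to live for the duration of one deserialization; after that the map itself can be dropped.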

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Created] (UIMA-2460) Binary deserialization inefficient

Posted by Marshall Schor <ms...@schor.com>.
On 8/27/2012 1:59 PM, Thilo Goetz wrote:
> Once you're done, I'd be interested to know if this has any measurable effect
> beyond the noise level.

Here are my back-of-the-envelope calculations on this:

Each string object has a ref to a char array (4-8 bytes) + an offset (4) and
length (4).
Each char array object has a length (4) + the space for the chars.

In addition, there's "object" overhead of 8 - 16 bytes per object (depends on 32
or 64 bit Java, etc.).

In the case of a string using an individual char array, the space is approx
(using 8 byte (64 bit) address refs):

2 obj overheads + string obj + char array obj + char length * 2 =
32 + 16 + 4 + char length * 2 = 52 + char length * 2.

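With this patch, it becomes (the one shared char array is amortized over all
the strings):
1 obj overhead + string obj + char length * 2 =
16 + 16 + char length * 2 = 32 + char length * 2.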

So, the savings might amount to 20 bytes / string; how much that matters
relative to the total depends on the "average" size of the character strings.
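
The arithmetic above can be written out as a small sketch. The constants are the assumptions from this message (16 bytes of object overhead, 8-byte refs, 4-byte ints, 2 bytes per char), not measured values, and the class and method names are hypothetical.

```java
final class StringSpaceEstimate {
    static final int OBJECT_OVERHEAD = 16;            // assumed, 64-bit JVM
    static final int REF = 8;                         // assumed 8-byte address refs
    static final int INT = 4;
    static final int STRING_FIELDS = REF + INT + INT; // ref to array + offset + length
    static final int ARRAY_LENGTH_FIELD = INT;

    // String that owns its own char array: two objects per string.
    static int bytesWithOwnArray(int charCount) {
        return 2 * OBJECT_OVERHEAD + STRING_FIELDS + ARRAY_LENGTH_FIELD + 2 * charCount;
    }

    // String that points into one shared char array: one object per string,
    // with the single shared array's overhead ignored as amortized.
    static int bytesWithSharedArray(int charCount) {
        return OBJECT_OVERHEAD + STRING_FIELDS + 2 * charCount;
    }
}
```

Under these assumptions the difference between the two methods is 20 bytes for every string length.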

It's somewhat hard to say what typical CASes have as strings versus other space,
and what the average string length might be.
For one set I looked at, strings made up about 1/2 the space, but the average
string length was about 50 chars.

With this, the total string space might have started out as
n * (52 + 50 * 2) = n * 152.
After this patch, the space would be n * (32 + 50 * 2) = n * 132.

If the string space accounted for 50% of the CAS, the savings in CAS space would
be 20/152 divided by 2 or about 6.5 %.

If the strings accounted for more than 50 % of the CAS space, or the average
string length was less than 50 chars (100 bytes), the % of savings (in CAS
size) would be larger.  Of course, there are other things that take up space
in a UIMA application besides the CAS; counting all of that will reduce the
overall % effect when measured as a percent of the total space used by the
entire application.
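
To see how the percentage moves with the inputs, the estimate can be expressed as a tiny helper. This is a sketch only; the class and method names are hypothetical, and the byte counts passed in are the per-string estimates from this message.

```java
final class CasSavings {
    // Fraction of total CAS space saved, given the estimated bytes per string
    // before and after the patch and the share of the CAS occupied by strings.
    static double casSavingsFraction(int bytesBefore, int bytesAfter, double stringShare) {
        return stringShare * (bytesBefore - bytesAfter) / bytesBefore;
    }
}
```

For example, casSavingsFraction(152, 132, 0.5) is the 50-char case above; a 10-char average under the same assumptions would be casSavingsFraction(72, 52, 0.5), which is roughly twice as large.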

So, I would tend to agree that the savings is probably not all that large, in
most "typical" cases.

-Marshall



Re: [jira] [Created] (UIMA-2460) Binary deserialization inefficient

Posted by Thilo Goetz <tw...@gmx.de>.
Once you're done, I'd be interested to know if this has any measurable 
effect beyond the noise level.



[jira] [Updated] (UIMA-2460) Binary deserialization inefficient

Posted by "Marshall Schor (JIRA)" <de...@uima.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marshall Schor updated UIMA-2460:
---------------------------------

    Affects Version/s: 2.4.0SDK
    
