Posted to user@avro.apache.org by "Irving, Dave" <da...@baml.com> on 2012/03/01 14:48:44 UTC

Utf8 byte[] reuse

Hi,

I'm using a BinaryDecoder to read some Utf8s - and I'm reusing the same Utf8 instance.
I found there was a huge amount of allocation going on - and tracked it down to Utf8#setByteLength:

...
  public Utf8 setByteLength(int newLength) {
    if (this.length < newLength) {
      byte[] newBytes = new byte[newLength];
      System.arraycopy(bytes, 0, newBytes, 0, this.length);
      this.bytes = newBytes;
    }
    this.length = newLength;
    this.string = null;
    return this;
  }
...

So, say I've got 4 Utf8s A,B,C and D lined up, with byte lengths A=1, B=10, C=5 and D=6 respectively, and do a read reusing the same Utf8 instance each time.
Read A: Causes allocation from empty buffer to buffer size 1 (ok)
Read B: Causes allocation from buffer size 1 to 10 (ok)
Read C: Reuses the buffer (ok)
Read D: Reallocates a buffer again, even though we've already got a 10 byte buffer (???)
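
To make the sequence above concrete, here is a minimal standalone sketch. It is not Avro code: the class name LengthCheckDemo, the allocation counter and the capacity() helper are made up for illustration, and only the length check is copied from the setByteLength snippet above.

```java
// Hypothetical stand-in for Utf8 (illustration only, not the Avro class).
// It keeps the same "this.length < newLength" check as the snippet above
// and counts how often a new byte[] is allocated.
class LengthCheckDemo {
  private byte[] bytes = new byte[0];
  private int length;
  int allocations; // added for the demo, not part of Avro

  LengthCheckDemo setByteLength(int newLength) {
    if (this.length < newLength) { // compares against the logical length
      byte[] newBytes = new byte[newLength];
      System.arraycopy(bytes, 0, newBytes, 0, this.length);
      this.bytes = newBytes;
      allocations++;
    }
    this.length = newLength;
    return this;
  }

  int capacity() {
    return bytes.length;
  }

  public static void main(String[] args) {
    LengthCheckDemo buf = new LengthCheckDemo();
    for (int n : new int[] {1, 10, 5, 6}) { // byte lengths of A, B, C, D
      buf.setByteLength(n);
      System.out.println("length=" + n + " capacity=" + buf.capacity()
          + " allocations=" + buf.allocations);
    }
  }
}
```

With these four lengths the sketch allocates three times, and the read of D replaces the 10 byte array with a 6 byte one.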

A simple 'fix' would be to compare the byte[] length rather than this.length before doing a reallocation.
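
A rough sketch of that comparison, under the same caveat: CapacityCheckDemo, its allocation counter and capacity() are hypothetical names for illustration, not the Avro class and not necessarily what a real patch would look like.

```java
// Hypothetical stand-in for Utf8 (illustration only, not the Avro class).
// The only change from the quoted method is that the check uses the
// capacity of the backing array (bytes.length) instead of this.length.
class CapacityCheckDemo {
  private byte[] bytes = new byte[0];
  private int length;
  int allocations; // added for the demo, not part of Avro

  CapacityCheckDemo setByteLength(int newLength) {
    if (this.bytes.length < newLength) { // compares against the capacity
      byte[] newBytes = new byte[newLength];
      System.arraycopy(bytes, 0, newBytes, 0, this.length);
      this.bytes = newBytes;
      allocations++;
    }
    this.length = newLength;
    return this;
  }

  int capacity() {
    return bytes.length;
  }

  public static void main(String[] args) {
    CapacityCheckDemo buf = new CapacityCheckDemo();
    for (int n : new int[] {1, 10, 5, 6}) { // byte lengths of A, B, C, D
      buf.setByteLength(n);
      System.out.println("length=" + n + " capacity=" + buf.capacity()
          + " allocations=" + buf.allocations);
    }
  }
}
```

For the A, B, C, D lengths this allocates twice (for A and B), and the 10 byte array is reused for both C and D.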
The only issue I can see with this, though, is that the byte[] for the largest Utf8 you've read with that instance stays in memory. If that's a concern, you could always provide a 'limit' on construction of the Utf8 (if the allocated byte[] grows beyond this, drop it and reallocate on the next resize < limit).
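
One possible reading of the 'limit' idea, as a sketch only. Everything here is an assumption: the class LimitedBufferDemo, the constructor argument, treating a resize of exactly limit bytes as within the limit, and shrinking to the requested length rather than to the limit itself. No such option exists in the code quoted above.

```java
// Hypothetical stand-in for Utf8 with a capacity limit (illustration only).
// An array larger than the limit is dropped as soon as a resize arrives
// that fits within the limit.
class LimitedBufferDemo {
  private final int limit; // assumed constructor parameter
  private byte[] bytes = new byte[0];
  private int length;
  int allocations; // added for the demo

  LimitedBufferDemo(int limit) {
    this.limit = limit;
  }

  LimitedBufferDemo setByteLength(int newLength) {
    boolean tooSmall = bytes.length < newLength;
    boolean oversized = bytes.length > limit && newLength <= limit;
    if (tooSmall || oversized) {
      byte[] newBytes = new byte[newLength];
      // Keep as many existing bytes as fit in the new array.
      System.arraycopy(bytes, 0, newBytes, 0, Math.min(this.length, newLength));
      this.bytes = newBytes;
      allocations++;
    }
    this.length = newLength;
    return this;
  }

  int capacity() {
    return bytes.length;
  }

  public static void main(String[] args) {
    LimitedBufferDemo buf = new LimitedBufferDemo(16);
    for (int n : new int[] {4, 100, 8, 6}) {
      buf.setByteLength(n);
      System.out.println("length=" + n + " capacity=" + buf.capacity()
          + " allocations=" + buf.allocations);
    }
  }
}
```

With a limit of 16, the 100 byte array is released when the 8 byte value arrives, and the 6 byte value then reuses the 8 byte array. Shrinking to the limit instead of the requested length would trade a little memory for fewer reallocations afterwards.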

Is this something that would be considered for changing if I submit a patch / jira?

Many thanks in advance,

Dave


----------------------------------------------------------------------
This message w/attachments (message) is intended solely for the use of the intended recipient(s) and may contain information that is privileged, confidential or proprietary. If you are not an intended recipient, please notify the sender, and then please delete and destroy all copies and attachments, and be advised that any review or dissemination of, or the taking of any action in reliance on, the information contained in or attached to this message is prohibited. 
Unless specifically indicated, this message is not an offer to sell or a solicitation of any investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Sender. Subject to applicable law, Sender may intercept, monitor, review and retain e-communications (EC) traveling through its networks/systems and may produce any such EC to regulators, law enforcement, in litigation and as required by law. 
The laws of the country of each sender/recipient may impact the handling of EC, and EC may be archived, supervised and produced in countries other than the country in which you are located. This message cannot be guaranteed to be secure or free of errors or viruses. 

References to "Sender" are references to any subsidiary of Bank of America Corporation. Securities and Insurance Products: * Are Not FDIC Insured * Are Not Bank Guaranteed * May Lose Value * Are Not a Bank Deposit * Are Not a Condition to Any Banking Service or Activity * Are Not Insured by Any Federal Government Agency. Attachments that are part of this EC may have additional important disclosures and disclaimers, which you should read. This message is subject to terms available at the following link: 
http://www.bankofamerica.com/emaildisclaimer. By messaging with Sender you consent to the foregoing.

RE: Utf8 byte[] reuse

Posted by "Irving, Dave" <da...@baml.com>.
...

>> Is this something that would be considered for changing if I submit a 
>> patch / jira?

> Yes, please do.

I've raised https://issues.apache.org/jira/browse/AVRO-1041 and attached a patch with the bug fix.
I haven't done anything about the limit yet - perhaps this could be addressed as a separate feature if desired?

> Thanks!

> Doug

Cheers,

Dave


Re: Utf8 byte[] reuse

Posted by Doug Cutting <cu...@apache.org>.
On 03/01/2012 05:48 AM, Irving, Dave wrote:
> Read D: Reallocates a buffer again, even though we’ve already got a 10
> byte buffer (???)

This is a bug.

> A simple ‘fix’ would be to compare the byte[] length rather than
> this.length before doing a reallocation.

That was the intent.

> The only issue I can see with this though is that you cause a byte[] of
> the largest utf you’ve read with that instance to stay in memory. If
> that’s a concern though, you could always provide a ‘limit’ on
> construction of the Utf8 (if the allocated byte[] goes greater than
> this, drop it and reallocate on the next resize < limit).

That may be a useful feature to add.

> Is this something that would be considered for changing if I submit a
> patch / jira?

Yes, please do.

Thanks!

Doug