
Posted to user@avro.apache.org by "Irving, Dave" <da...@baml.com> on 2012/04/19 11:09:58 UTC

Specific/GenericDatumReader performance and resolving decoders

Hi,

Recently I've been looking at the performance of Avro's SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it quite usual for reader / writer schemas to be identical. Interestingly, GenericDatumReader bakes the use of ResolvingDecoders right into its core, so even if constructed with a single (reader/writer) schema, a ResolvingDecoder is still used.
I experimented a little, and wrote a SpecificDatumReader which, instead of being hard-wired to a ResolvingDecoder, uses a DecodeStrategy - leaving the reader dealing only with Decoders directly.
Details follow - but for 'same schema' decodes - the performance difference is impressive. For the types of records I deal with, a decode with reader schema == writer schema using this approach is about 1.6x faster than a standard SpecificDatumReader decode.


interface DecodeStrategy
{
  Decoder configureForRead(Decoder in) throws IOException;

  void readComplete() throws IOException;

  void decodeRecordFields(Object old, SpecificRecord record, Schema expected, Decoder in, SpecificDatumReader2 reader) throws IOException;
}

The idea is that when we hit a record, instead of getting field order from a ResolvingDecoder directly, we just let the decode strategy do it for us (calling back for each field to the reader - allowing recursion).
For example, when we know reader / writer schemas are identical, and we don't want validation - an IdentitySchemaDecodeStrategy#decodeRecordFields can just pull the fields directly from the provided record schema (calling back on the reader for each one):

...

void decodeRecordFields(......)
{
  List<Field> fields = expected.getFields();
  for (int i=0, len = fields.size(); i<len; ++i)
  {
    Field field = fields.get(i); // fields are read in schema order
    reader.readField(old, in, field, record);
  }
}

...

The resolving decoder impl of this strategy just does a 'readFieldOrder' like GenericDatumReader does today.

For each read (given a Decoder), the datum reader lets the decode strategy return the actual decoder to be used (via #configureForRead). This means that a resolving implementation can use this hook to configure the ResolvingDecoder and return it.
The result is that the datum reader can work with same schema / validated schema / resolved schemas seamlessly without caring about the difference.
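
The shape of this design can be illustrated with a toy sketch. It is not the actual patch and it uses no Avro classes: ToyDecoder, ToyReader, ToyDecodeStrategy, IdentityStrategy and ReorderingStrategy are all hypothetical names invented for illustration, records are plain Object[] arrays, and schemas are reduced to lists of field names. It only shows how a reader can delegate field ordering to a strategy.

```java
import java.util.Arrays;
import java.util.List;

// Toy stand-in for an Avro Decoder: yields already-decoded values in wire order.
final class ToyDecoder {
    private final List<Object> values;
    private int pos = 0;

    ToyDecoder(Object... values) {
        this.values = Arrays.asList(values);
    }

    Object next() {
        return values.get(pos++);
    }
}

// Hypothetical analogue of the DecodeStrategy interface described in the post.
interface ToyDecodeStrategy {
    ToyDecoder configureForRead(ToyDecoder in);

    void readComplete();

    void decodeRecordFields(Object[] record, List<String> expected,
                            ToyDecoder in, ToyReader reader);
}

// The reader never decides field order itself; it delegates to the strategy.
final class ToyReader {
    private final List<String> readerFields;
    private final ToyDecodeStrategy strategy;

    ToyReader(List<String> readerFields, ToyDecodeStrategy strategy) {
        this.readerFields = readerFields;
        this.strategy = strategy;
    }

    Object[] read(ToyDecoder in) {
        ToyDecoder actual = strategy.configureForRead(in);
        Object[] record = new Object[readerFields.size()];
        strategy.decodeRecordFields(record, readerFields, actual, this);
        strategy.readComplete();
        return record;
    }

    // Callback used by strategies, one call per field.
    void readField(Object[] record, int readerPos, ToyDecoder in) {
        record[readerPos] = in.next();
    }
}

// Same schema on both sides: walk the expected fields in order.
final class IdentityStrategy implements ToyDecodeStrategy {
    public ToyDecoder configureForRead(ToyDecoder in) { return in; }

    public void readComplete() { }

    public void decodeRecordFields(Object[] record, List<String> expected,
                                   ToyDecoder in, ToyReader reader) {
        for (int i = 0, len = expected.size(); i < len; ++i) {
            reader.readField(record, i, in);
        }
    }
}

// Different writer schema: follow the writer's order, skip writer-only fields.
// Reader-only fields are left null here; real resolution would apply defaults.
final class ReorderingStrategy implements ToyDecodeStrategy {
    private final List<String> writerFields;

    ReorderingStrategy(List<String> writerFields) {
        this.writerFields = writerFields;
    }

    public ToyDecoder configureForRead(ToyDecoder in) { return in; }

    public void readComplete() { }

    public void decodeRecordFields(Object[] record, List<String> expected,
                                   ToyDecoder in, ToyReader reader) {
        for (String name : writerFields) {
            int readerPos = expected.indexOf(name);
            if (readerPos >= 0) {
                reader.readField(record, readerPos, in);
            } else {
                in.next(); // consume and discard a field the reader does not know
            }
        }
    }
}
```

In this sketch, a ToyReader built with IdentityStrategy and one built with ReorderingStrategy share the same read() path; only the strategy differs, which mirrors the point that the datum reader does not need to care which case it is in.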

I thought I'd share the approach before working on a full patch: Is this an approach you'd be interested in taking back to core avro? Or is it a little niche? :)

Cheers,

Dave

----------------------------------------------------------------------
This message w/attachments (message) is intended solely for the use of the intended recipient(s) and may contain information that is privileged, confidential or proprietary. If you are not an intended recipient, please notify the sender, and then please delete and destroy all copies and attachments, and be advised that any review or dissemination of, or the taking of any action in reliance on, the information contained in or attached to this message is prohibited. 
Unless specifically indicated, this message is not an offer to sell or a solicitation of any investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Sender. Subject to applicable law, Sender may intercept, monitor, review and retain e-communications (EC) traveling through its networks/systems and may produce any such EC to regulators, law enforcement, in litigation and as required by law. 
The laws of the country of each sender/recipient may impact the handling of EC, and EC may be archived, supervised and produced in countries other than the country in which you are located. This message cannot be guaranteed to be secure or free of errors or viruses. 

References to "Sender" are references to any subsidiary of Bank of America Corporation. Securities and Insurance Products: * Are Not FDIC Insured * Are Not Bank Guaranteed * May Lose Value * Are Not a Bank Deposit * Are Not a Condition to Any Banking Service or Activity * Are Not Insured by Any Federal Government Agency. Attachments that are part of this EC may have additional important disclosures and disclaimers, which you should read. This message is subject to terms available at the following link: 
http://www.bankofamerica.com/emaildisclaimer. By messaging with Sender you consent to the foregoing.

Re: Specific/GenericDatumReader performance and resolving decoders

Posted by Scott Carey <sc...@apache.org>.

I think this approach makes sense; reader=writer is common.  In addition to
record fields, unions are affected.

I have been thinking for a while about the issue that resolving records is
slower than not resolving them.  In theory, it could be just as fast because
you can pre-compute the steps needed and bake that into the reading logic.
This seems like a reasonable way to avoid the cost for the case where the
schemas are equal.
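
As a rough illustration of "pre-compute the steps", and not a description of how Avro's
ResolvingDecoder works, the sketch below computes a writer-to-reader position table once
and then reuses it for every record. ResolutionPlan is a hypothetical name, schemas are
reduced to lists of field names, and reader-only fields are simply left null.

```java
import java.util.List;

// Hypothetical precomputed plan: resolution work is done once, up front.
final class ResolutionPlan {
    // target[i] is the reader position of the i-th writer field, or -1 to skip it.
    private final int[] target;
    private final int readerFieldCount;

    ResolutionPlan(List<String> writerFields, List<String> readerFields) {
        target = new int[writerFields.size()];
        for (int i = 0; i < target.length; i++) {
            target[i] = readerFields.indexOf(writerFields.get(i));
        }
        readerFieldCount = readerFields.size();
    }

    // Per-record work is a single array walk with no name lookups.
    Object[] apply(Object[] wireValues) {
        Object[] record = new Object[readerFieldCount];
        for (int i = 0; i < target.length; i++) {
            if (target[i] >= 0) {
                record[target[i]] = wireValues[i];
            }
        }
        return record;
    }
}
```

When writer and reader fields are identical, the table is just 0, 1, 2, ..., so the
per-record loop does the same work as a straight in-order read.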

Please open a JIRA ticket and put your preliminary thoughts there.  It is a
good place to discuss the technical bits of the issue even before you have a
patch.

On 4/19/12 2:09 AM, "Irving, Dave" <da...@baml.com> wrote:

> Hi,
>  
> Recently I've been looking at the performance of Avro's
> SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it
> quite usual for reader / writer schemas to be identical. Interestingly,
> GenericDatumReader bakes the use of ResolvingDecoders right into its core,
> so even if constructed with a single (reader/writer) schema, a
> ResolvingDecoder is still used.
> I experimented a little, and wrote a SpecificDatumReader which, instead of
> being hard-wired to a ResolvingDecoder, uses a DecodeStrategy - leaving the
> reader dealing only with Decoders directly.
> Details follow - but for 'same schema' decodes - the performance difference is
> impressive. For the types of records I deal with, a decode with reader schema
> == writer schema using this approach is about 1.6x faster than a standard
> SpecificDatumReader decode.
>  
>  
> interface DecodeStrategy
> {
>   Decoder configureForRead(Decoder in) throws IOException;
>  
>   void readComplete() throws IOException;
>  
>   void decodeRecordFields(Object old, SpecificRecord record, Schema expected,
> Decoder in, SpecificDatumReader2 reader) throws IOException;
> }
>  
> The idea is that when we hit a record, instead of getting field order from a
> ResolvingDecoder directly, we just let the decode strategy do it for us
> (calling back for each field to the reader - allowing recursion).
> For example, when we know reader / writer schemas are identical, and we don't
> want validation - an IdentitySchemaDecodeStrategy#decodeRecordFields can just
> pull the fields directly from the provided record schema (calling back on the
> reader for each one):
>  
> ...
>  
> void decodeRecordFields(......)
> {
>   List<Field> fields = expected.getFields();
>   for (int i=0, len = fields.size(); i<len; ++i)
>   {
>     Field field = fields.get(i); // fields are read in schema order
>     reader.readField(old, in, field, record);
>   }
> }
>  
> ...
>  
> The resolving decoder impl of this strategy just does a 'readFieldOrder' like
> GenericDatumReader does today.
>  
> For each read (given a Decoder), the datum reader lets the decode strategy
> return the actual decoder to be used (via #configureForRead). This means
> that a resolving implementation can use this hook to configure the
> ResolvingDecoder and return it.
> The result is that the datum reader can work with same schema / validated
> schema / resolved schemas seamlessly without caring about the difference.
>  
> I thought I'd share the approach before working on a full patch: Is this an
> approach you'd be interested in taking back to core avro? Or is it a little
> niche? :)
>  
> Cheers,
>  
> Dave
>  