Posted to dev@avro.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2010/04/22 00:42:50 UTC
[jira] Commented: (AVRO-519) Efficient sparse optional fields support

    [ https://issues.apache.org/jira/browse/AVRO-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859566#action_12859566 ] 

Doug Cutting commented on AVRO-519:
-----------------------------------

In general, I am hesitant to add new fundamental types, as they break bi-directional compatibility.

> Avro does have support for arrays and maps however both of these require homogeneous types.

The type of an array or map can be a union.

A map would meet your needs, although it would be a bit larger, since field names would be repeated in the data of every record.

An array of a union could be used, with a record in the union for each field.  Then the overhead per item present would be the union's integer dispatch tag, i.e., one byte for up to 64 fields, since the tag is written as a zigzag varint long.  This has similar overhead to the approach you propose when random subsets of fields are used.  One could annotate the schema of such arrays with a metadata property that suggests they be represented in memory as a sparse record, but they'd still be compatible with implementations that know nothing of sparse records.
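
To make the size comparison between the map and the array-of-union layouts concrete, here is a rough sketch that hand-encodes both in Avro's binary format. The field names in FIELDS, the helper names, and the choice of a single string value per field record are all assumptions made for illustration; they are not part of the issue or of any Avro API.

```python
def zigzag_varint(n):
    """Encode a signed integer the way Avro writes a long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def avro_string(s):
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_block(items):
    """Array or map body: one counted block of items, then a zero-count terminator."""
    if not items:
        return zigzag_varint(0)
    return zigzag_varint(len(items)) + b"".join(items) + zigzag_varint(0)


# Hypothetical optional fields; each union branch is assumed to be a record
# holding one string, so a branch's payload is just that string.
FIELDS = ["author", "language", "license"]


def as_union_array(present):
    """Each present field costs its union branch index plus its value."""
    items = [zigzag_varint(FIELDS.index(name)) + avro_string(value)
             for name, value in present.items()]
    return encode_block(items)


def as_map(present):
    """Each present field costs its full name as the map key plus its value."""
    items = [avro_string(name) + avro_string(value)
             for name, value in present.items()]
    return encode_block(items)


if __name__ == "__main__":
    doc = {"license": "MIT"}
    print(len(as_union_array(doc)), "bytes as array of union")
    print(len(as_map(doc)), "bytes as map")
```

In this sketch the array form pays one tag byte per present field where the map form pays for the whole field name, which is the trade-off described above.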

> Efficient sparse optional fields support
> ----------------------------------------
>
>                 Key: AVRO-519
>                 URL: https://issues.apache.org/jira/browse/AVRO-519
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: John Plevyak
>
> One of the nice features of protobuf is efficient support for very sparse optional fields,
> for example, a large number of tags potentially associated with a document, the vast
> majority of which are empty.
> Avro does support optional fields as part of differing specifications, but not on a per-record
> level after a protocol has been agreed upon.  Avro does have support for arrays and maps
> however both of these require homogeneous types.
> I would suggest adding an additional field attribute:
>    * "optional" - with values "true"/"false" (where "false" is assumed)
> For the encoding I would suggest that any record which includes optional fields
> would be prefixed by a presence map which would be a sequence of int8 x* where:
>   x > 0 : the lower 7 bits are presence bits for the next 7 optional fields (low bit first)
>   -128 < x < 0 : the next present field is position x + 135 (as x runs from -1 to -127 and the first 7
>               must be empty, otherwise we would use the x > 0 encoding)
>   x == -128: no optional fields present in the next 134 optional fields
>   x == 0 : end of sequence
>   further, if the map has covered all the options, the end-of-sequence marker can be
>   elided.  For example, a type with 3 optional fields would require only a single byte. 
> This will permit encoding at 8/7 of a bit per present entry (worst case) and at a cost of
> 8/134 (0.06) bits/entry per all but last not-present (7.5 bytes / 1000 optional fields).
> This encoding is backward compatible, as schemas which do not contain optional
> elements do not have the presence map and the encoding is therefore identical.  Backward
> compatibility can be maintained by simply using the default value for not-present fields.
> Language APIs:
> Efficient support could include either an explicit presence test or a function which returns the value
> or default value (if the field is not present).
>  
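
For reference, the presence map proposed in the quoted description can be sketched as below. This is one reading of the proposal, not a tested or agreed format: it assumes positions are 1-based relative to the first optional field not yet covered, and that a relative-position byte also consumes the present field it points at. The names encode_presence and decode_presence are made up for this sketch.

```python
def encode_presence(num_fields, present):
    """Build the presence map for 0-based indices of present optional fields."""
    present = sorted(set(present))
    if present and not (0 <= present[0] and present[-1] < num_fields):
        raise ValueError("present index out of range")
    out = []
    pos = 0  # first optional field not yet covered by the map
    i = 0    # next entry of `present` to encode
    while pos < num_fields:
        if i == len(present):
            out.append(0)  # x == 0: end of sequence
            break
        gap = present[i] - pos  # empty fields before the next present one
        if gap < 7:
            # x > 0: bitmask over the next 7 optional fields, low bit first
            bits = 0
            while i < len(present) and present[i] < pos + 7:
                bits |= 1 << (present[i] - pos)
                i += 1
            out.append(bits)
            pos += 7
        elif gap < 134:
            # -128 < x < 0: next present field is at 1-based position x + 135
            out.append(gap + 1 - 135)
            pos += gap + 1
            i += 1
        else:
            out.append(-128)  # none of the next 134 optional fields is present
            pos += 134
    # If the loop ended because every field was covered, no end marker is written.
    return bytes(x & 0xFF for x in out)


def decode_presence(num_fields, data):
    """Recover the sorted 0-based indices of present fields from a presence map."""
    present = []
    pos = 0
    it = iter(data)
    while pos < num_fields:
        b = next(it)
        x = b - 256 if b > 127 else b  # reinterpret the byte as int8
        if x == 0:
            break
        if x > 0:
            present.extend(pos + k for k in range(7) if (x >> k) & 1)
            pos += 7
        elif x == -128:
            pos += 134
        else:
            p = x + 135
            present.append(pos + p - 1)
            pos += p
    return present


if __name__ == "__main__":
    print(encode_presence(3, [1]))            # a single bitmask byte, no end marker
    print(len(encode_presence(1000, [999])))  # skip bytes plus one position byte
```

Under this reading, a record with 3 optional fields always needs exactly one map byte, and a single present field at the end of 1000 optional fields costs 8 bytes, in line with the estimates in the description.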
