Posted to user@avro.apache.org by Sohail Khan <ks...@gmail.com> on 2020/06/25 23:14:19 UTC

AVRO Best Practices for Sparse object storage

Hello Team,

I am trying to serialize data in Avro format and store it in a database, to
bring down the disk requirement of the table. Currently we are storing it
in JSON format.
I have a very large POJO with approximately 100 string-typed fields, but
for a given instance hardly 20 or 30 have values; the rest are null. I call
it a sparse object. I am currently achieving approximately a 20 percent
improvement. Any suggestions on how to take it further? What are the best
practices for handling null values?

Thanks and Regards
Sohail Khan

Re: AVRO Best Practices for Sparse object storage

Posted by roger peppe <ro...@gmail.com>.
Assuming each field is represented as a union {null, string}, 70 null
fields would take about 70 bytes (one byte for the discriminator for each
union). One way to reduce that overhead might be to put a bunch of the
fields that are very commonly null into a possibly-null sub-record. That
way you'd need to store just one byte if all its fields are null (although
it would use an extra byte if any of the fields inside it are present).
Another way to save some space would be to avoid using a {null, string}
union where an empty string is sufficient to tell that the field isn't
present. A string is prefixed by its length, so an empty string costs one
byte, the same as a null union, while every non-null field drops its
one-byte discriminator. That could potentially save you 20 or 30 bytes.
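
A rough way to check the arithmetic above is to hand-encode the three
layouts and count bytes. This is only a sketch: it reimplements the Avro
binary encoding rules (zigzag varint lengths and union indexes) in plain
Python rather than calling the Avro library, and the helper names, the
30/70 split and the value "value" are made up for illustration.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer as Avro does for int/long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def enc_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def enc_nullable_string(value) -> bytes:
    """Union {null, string}: branch index, then the value (null has no body)."""
    if value is None:
        return zigzag_varint(0)
    return zigzag_varint(1) + enc_string(value)


def flat_unions(fields) -> bytes:
    """Every field is its own {null, string} union."""
    return b"".join(enc_nullable_string(v) for v in fields)


def grouped(common, rare) -> bytes:
    """Common fields stay as unions; rare fields live in one nullable sub-record."""
    body = flat_unions(common)
    if all(v is None for v in rare):
        return body + zigzag_varint(0)           # sub-record is null: one byte
    return body + zigzag_varint(1) + flat_unions(rare)


def plain_strings(fields) -> bytes:
    """No unions at all: an empty string stands in for 'not present'."""
    return b"".join(enc_string(v or "") for v in fields)


# Hypothetical sparse object: 30 populated fields, 70 nulls.
common = ["value"] * 30
rare = [None] * 70

print(len(flat_unions(common + rare)))    # 280
print(len(grouped(common, rare)))         # 211
print(len(plain_strings(common + rare)))  # 250
```

With one of the 70 rare fields populated, grouped() comes out one byte
larger than flat_unions(), which is the extra byte mentioned above.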

  cheers,
    rog.

Re: AVRO Best Practices for Sparse object storage

Posted by Doug Cutting <cu...@gmail.com>.
A map schema might be appropriate.  Another idea might be to define a
record for every field, then use an array whose values are a union of all
these records.  This is a bit more complicated but would probably use the
least space.
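
The two layouts suggested above can be compared by hand-encoding them and
counting bytes. This is a sketch under assumptions, not a measurement: it
reimplements the Avro binary encoding rules in plain Python instead of
using the Avro library, it assumes each per-field record holds a single
string, and the field names f0..f99 and the value "value" are invented.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer as Avro does for int/long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def enc_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def enc_map(entries: dict) -> bytes:
    """Map of string to string: one block (count, key/value pairs), then a 0 terminator."""
    out = bytearray()
    if entries:
        out += zigzag_varint(len(entries))
        for key, value in entries.items():
            out += enc_string(key) + enc_string(value)
    out += zigzag_varint(0)
    return bytes(out)


def enc_array_of_field_records(present: dict, branch_index: dict) -> bytes:
    """Array whose items are a union of one-field records.

    Each item is the union branch index for that field's record type,
    followed by the record body (here a single string).
    """
    out = bytearray()
    if present:
        out += zigzag_varint(len(present))
        for name, value in present.items():
            out += zigzag_varint(branch_index[name]) + enc_string(value)
    out += zigzag_varint(0)
    return bytes(out)


# Hypothetical schema with 100 fields, of which the first 30 are populated.
branch_index = {f"f{i}": i for i in range(100)}
present = {f"f{i}": "value" for i in range(30)}

print(len(enc_map(present)))                                   # 292
print(len(enc_array_of_field_records(present, branch_index)))  # 212
```

The map pays for every key name, so its size depends heavily on how long
the real field names are. The array pays one byte per item for branch
indexes below 64 and two bytes for indexes from 64 upward, so putting the
most frequently populated fields first in the union keeps it smallest.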

Doug
