Posted to user@avro.apache.org by Da...@parc.com on 2013/08/16 20:22:23 UTC

Using AVRO C with a large schema

We have a C program that prepares many GB of data for analysis at a later time.  We'd like to serialize this data using AVRO C.  Here are some statements that I hope are wrong.



1. There's a 1:1 relationship between schema and file.  You can't mix different schemas in the same file.

2. Each value written to a file represents the file's full schema.  You can't write pieces of a schema.

3. AVRO C cannot write values that are bigger than the file writer's specified block_size.  I don't think there's enough memory to hold both the original structures and a gigantic block_size.



What's my best course of action?  Split the structures and arrays into multiple files?



Thanks,



Dan



Re: Using AVRO C with a large schema

Posted by Douglas Creager <do...@creagertino.net>.
>> 3. AVRO C cannot write values that are bigger than the file writer's
>> specified block_size.  I don't think there's enough memory to hold both the
>> original structures and a gigantic block_size.
> 
> I don't know enough about the C implementation to verify this one and
> will leave it to others.

#3 is also true.  AVRO-724 [1] is the relevant issue.  Matt Massie's
comment on that issue discusses a couple of proposed solutions.

If your data structures are large arrays, then one option is to have a
separate file for each array, and have the array elements correspond to
records in the file.  Then it's only each individual array element that
needs to fit in the file's block_size.


Re: Using AVRO C with a large schema

Posted by Doug Cutting <cu...@apache.org>.
On Fri, Aug 16, 2013 at 11:22 AM,  <Da...@parc.com> wrote:
> 1. There's a 1:1 relationship between schema and file.  You can't mix
> different schemas in the same file.
>
> 2. Each value written to a file represents the file's full schema.  You
> can't write pieces of a schema.

These are both correct.  If you want to intermix, use a union as the
file's schema.
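
As an illustration only, a union file schema might look like the
following.  The record names Header and Sample and their fields are
hypothetical, not taken from the original post.  Every value written to
the file is then one branch of the union.

```json
[
  {"type": "record", "name": "Header",
   "fields": [{"name": "run_id", "type": "long"}]},
  {"type": "record", "name": "Sample",
   "fields": [{"name": "index", "type": "long"},
              {"name": "value", "type": "double"}]}
]
```

With the Avro C value API, the branch for each value would typically be
selected with avro_value_set_branch before filling in its fields.  The
file still has exactly one schema (the union), and each value still has
to fit in the writer's block_size.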

> 3. AVRO C cannot write values that are bigger than the file writer's
> specified block_size.  I don't think there's enough memory to hold both the
> original structures and a gigantic block_size.

I don't know enough about the C implementation to verify this one and
will leave it to others.

Doug