Posted to user@avro.apache.org by Joe Crobak <jo...@gmail.com> on 2011/05/05 19:29:57 UTC

resolving schemas in multiple avro data files

We've recently come across a situation where we have two data files with
different schemas that we'd like to process together using
GenericDatumReader.  One schema is promotable to the other, but not vice
versa.  We'd like to programmatically determine which of the schemas to use.
I looked briefly through the javadoc and tests, but I couldn't find any
examples of checking whether one schema is promotable to the other.  Has anyone
else come across this?


For some context, we're considering patching AvroStorage [1] to remove the
assumption that all files have the same schema.  In our case, our schema has
evolved in that a field that was an int was promoted to a long.
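
To make the question concrete, here is a minimal sketch of the kind of check we have in mind, restricted to primitive types. It does not use the Avro API: the PromotionCheck class, the Prim enum, and both method names are hypothetical and exist only for illustration. The table encodes the numeric promotions listed in the schema resolution section of the Avro specification (int to long, float, or double; long to float or double; float to double). A real version would have to walk records, unions, and other complex types as well.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the Avro API.
class PromotionCheck {
    // A small subset of Avro primitive types, enough to show the idea.
    enum Prim { BOOLEAN, INT, LONG, FLOAT, DOUBLE }

    // Writer type -> reader types that may read it via numeric promotion.
    private static final Map<Prim, EnumSet<Prim>> PROMOTIONS = new EnumMap<>(Prim.class);
    static {
        PROMOTIONS.put(Prim.BOOLEAN, EnumSet.noneOf(Prim.class));
        PROMOTIONS.put(Prim.INT, EnumSet.of(Prim.LONG, Prim.FLOAT, Prim.DOUBLE));
        PROMOTIONS.put(Prim.LONG, EnumSet.of(Prim.FLOAT, Prim.DOUBLE));
        PROMOTIONS.put(Prim.FLOAT, EnumSet.of(Prim.DOUBLE));
        PROMOTIONS.put(Prim.DOUBLE, EnumSet.noneOf(Prim.class));
    }

    // True if data written as 'writer' can be read as 'reader'.
    static boolean isPromotable(Prim writer, Prim reader) {
        return writer == reader || PROMOTIONS.get(writer).contains(reader);
    }

    // Picks whichever of the two types can read data written as the other,
    // or returns empty if neither direction works.
    static Optional<Prim> pickReader(Prim a, Prim b) {
        if (isPromotable(a, b)) return Optional.of(b);
        if (isPromotable(b, a)) return Optional.of(a);
        return Optional.empty();
    }
}
```

For our int and long files, pickReader(Prim.INT, Prim.LONG) would select LONG as the reader type, regardless of the order in which the files are seen.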


Thanks,

Joe



[1] https://issues.apache.org/jira/browse/PIG-1748

Re: resolving schemas in multiple avro data files

Posted by Joe Crobak <jo...@gmail.com>.
> Such a method does not yet exist in Avro, but should not be difficult to
> add.  Please file an issue in Jira if this sounds of interest.
>
Thanks for the response -- I suspect you're right about the schema-superset
method. I've filed https://issues.apache.org/jira/browse/AVRO-816

Thanks,
Joe

Re: resolving schemas in multiple avro data files

Posted by Doug Cutting <cu...@apache.org>.
On 05/05/2011 10:29 AM, Joe Crobak wrote:
> We've recently come across a situation where we have two data files with
> different schemas that we'd like to process together using
> GenericDatumReader.  One schema is promotable to the other, but not vice
> versa.  We'd like to programmatically determine which of the schemas to
> use.  I did a brief look through javadoc and tests, and I couldn't find
> any examples of checking if one schema is promotable to the other.  Has
> anyone else come across this?
> 
> For some context, we're considering patching AvroStorage [1] to remove
> the assumption that all files have the same schema.  In our case, our
> schema has evolved in that a field that was an int was promoted to a long.

A boolean method that tells you if one schema is promotable to another
would work in this case, but would not help in cases where, e.g.,
different fields had changed in different versions.  For example, in
branched development, two branches might each add a distinct symbol to
an enum.  So I think you might be better off with a method that, given
two schemas, returns their superset, a schema that can read data written
by either.
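
As a rough sketch of the enum case only: since Avro resolves enum symbols by name, a reader enum holding the union of both writers' symbols can read data from either branch. The code below is plain Java over symbol lists and does not touch the Avro API; the EnumSuperset class and unionSymbols method are hypothetical names used for illustration. A full superset method would also need to handle records, unions, and numeric promotion.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Avro API.
class EnumSuperset {
    // Returns the symbols of 'a' followed by any symbols of 'b' not in 'a'.
    // LinkedHashSet drops duplicates while preserving first-seen order.
    static List<String> unionSymbols(List<String> a, List<String> b) {
        Set<String> merged = new LinkedHashSet<>(a);
        merged.addAll(b);
        return new ArrayList<>(merged);
    }
}
```

If one branch wrote symbols A, B, C and the other wrote A, B, D, the merged list is A, B, C, D, which covers every symbol either writer could have emitted.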

Such a method does not yet exist in Avro, but should not be difficult to
add.  Please file an issue in Jira if this sounds of interest.

Doug