Posted to dev@avro.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2015/10/07 21:49:26 UTC

[jira] [Commented] (AVRO-1699) AutoMap field values between Avro objects with different schemas

    [ https://issues.apache.org/jira/browse/AVRO-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947466#comment-14947466 ] 

Ryan Blue commented on AVRO-1699:
---------------------------------

[~pmazak], thanks for taking the time to post this feature. It's an interesting take on the problem.

Avro already has support for some of what AutoMapper is doing by using read schemas. When you read encoded data, you always need the schema it was written with to decode the fields (which is why schemas are embedded in file headers). Readers also let you request a schema (the "read schema"): the schema that the objects handed to you will have. That schema needs to follow the [well-defined rules|https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution] in order to be valid, but you can add columns, remove columns, resolve column aliases, and widen types (int to long, but not float to int). There are also validations that verify a schema can read data written with another schema.
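To make that concrete, here is a minimal sketch of in-band schema resolution with the generic API (the schemas and field names are made up for illustration): the record is encoded with one schema, then decoded with a read schema that widens "id" to long, drops "name", and adds a defaulted "email" field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ReadSchemaDemo {
    static GenericRecord resolve() throws Exception {
        // Writer schema: an int "id" and a string "name".
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

        // Read schema: "id" widened to long, "name" removed,
        // and a new nullable "email" field with a default.
        Schema readSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        // Encode a record with the writer schema.
        GenericRecord rec = new GenericData.Record(writerSchema);
        rec.put("id", 42);
        rec.put("name", "paul");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(rec, enc);
        enc.flush();

        // Decode with BOTH schemas; resolution happens here.
        BinaryDecoder dec = DecoderFactory.get()
            .binaryDecoder(new ByteArrayInputStream(out.toByteArray()), null);
        return new GenericDatumReader<GenericRecord>(writerSchema, readSchema)
            .read(null, dec);
    }

    public static void main(String[] args) throws Exception {
        GenericRecord resolved = resolve();
        System.out.println(resolved.get("id"));     // widened to a long
        System.out.println(resolved.get("email"));  // filled from the default
    }
}
```

The key point is the two-argument GenericDatumReader constructor: it takes the writer's schema and the read schema, and the spec's resolution rules are applied during decoding.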

There are two limitations to the current approach:
1. Resolution or translation is done when reading from encoded data. That works great if you control the reader and can set the read schema, but it gets a little annoying when you forget to pass a writer's schema all the way back to the reader... which is actually a problem I ran into recently.
2. Schema resolution on read gives you a view of the original data and isn't intended to help with data transformations that you might need in business logic, like parsing numbers from strings, updating structures, or converting units.

Right now, I think the AutoMapper in this patch is mixing those two use cases. I'd really like to see the ability to identify when an object passed to a writer has a different schema and translate it. I think it is also interesting to discuss a library for generic transformations. But I don't think those two should be done in one class because too much is going on without the caller's knowledge, especially when schema resolution is well-defined in the Avro spec.

What do you think about working on those separately? I'm betting that the in-memory resolution would be a really useful feature for you that would handle most of your use cases.

> AutoMap field values between Avro objects with different schemas
> ----------------------------------------------------------------
>
>                 Key: AVRO-1699
>                 URL: https://issues.apache.org/jira/browse/AVRO-1699
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>    Affects Versions: 1.7.6
>            Reporter: Paul Mazak
>         Attachments: AVRO-1699.patch
>
>
> There are a few use cases for this:
> *Various Avro input data to one common output*
> You want to pick up Avro files in different schemas and normalize them into one. You might wish to transform to the superset of the input schemas.
> *Aggregating Raw Data*
> You want to rewrite data, grouped by some fields and aggregated.  The output Avro in this case would be a subset of the input Avro, where at least the group-by fields are in both input and output schemas.
> *Alternate Views*
> You have Avro data that you want to trim different ways to create subsets that would be useful for views in Hive or exports for SQL tables.
> *Schema Migration*
> You've added fields to a schema and you are storing data in both the old and new schemas.  You have Avro data in the old schema that you can't process together with Avro data in the new schema (using Pig or Java map-reduce).  AutoMapping would up-convert your old data by setting the newly added fields to null, so all data ends up in the new schema.  This was [asked|http://stackoverflow.com/questions/27131942/is-it-possible-to-retrieve-schema-from-avro-data-and-use-them-in-mapreduce] about on StackOverflow.
> _Considerations:_
>  * Loop over the source schema's fields available to automap and return any that could not be mapped.
>  * Allow mappings between compatible types. For example going from integers to longs, floats to strings, etc.
>  * Field names match case-sensitively.
>  * Make use of aliases in the schema when considering fields to automap.
>  * Deep copy nested structures like arrays and maps.
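For reference, the first consideration above (loop over the source fields, return the unmapped ones) could be sketched roughly as follows. This is a hypothetical illustration, not the patch's actual implementation: the autoMap name and its signature are made up, matching is case-sensitive by name only, and no type widening, alias handling, or deep copying is attempted.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AutoMapSketch {
    /** Copies same-named fields from source to target; returns the names that could not be mapped. */
    static List<String> autoMap(GenericRecord source, GenericRecord target) {
        List<String> unmapped = new ArrayList<>();
        for (Schema.Field f : source.getSchema().getFields()) {
            Schema.Field t = target.getSchema().getField(f.name()); // case-sensitive lookup
            if (t == null) {
                unmapped.add(f.name());
            } else {
                target.put(t.name(), source.get(f.name())); // shallow copy, no widening
            }
        }
        return unmapped;
    }

    public static void main(String[] args) {
        Schema src = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"A\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"extra\",\"type\":\"string\"}]}");
        Schema dst = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"B\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"}]}");
        GenericRecord s = new GenericData.Record(src);
        s.put("id", 7);
        s.put("extra", "x");
        GenericRecord d = new GenericData.Record(dst);
        System.out.println(autoMap(s, d)); // "extra" has no counterpart in B
        System.out.println(d.get("id"));
    }
}
```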



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)