Posted to user@avro.apache.org by Mason <ma...@verbasoftware.com> on 2013/05/22 19:34:25 UTC

ETL in face of column renames

dear list,

I have what I imagine is a standard setup: a web application generates 
data in MySQL, which I want to analyze in Hadoop; I run a nightly 
process to extract tables of interest, Avroize, and dump into HDFS.

This has worked great so far because the tools I'm using make it easy to 
load a directory tree of Avros with the same schema.

The issue is what to do when schema changes occur in the SQL database. I 
believe column additions and deletions are handled automatically by the 
Avro loaders I'm using, but I need to deal with a column rename.

My thinking is: I could bake the table schemas at time of ETL into the 
Avros, for historical record, but then manually copy that schema out as 
a "master" schema and apply it to all Avros for which it's appropriate; 
then when a column rename occurs, go back and edit the master schema.

I've never used an external schema before, so please correct me if I 
misunderstand how they work.

Anyone have wisdom to share on this topic? I'd love to hear from anyone 
who has done this, or has a better solution.

-Mason

Re: ETL in face of column renames

Posted by Scott Carey <sc...@apache.org>.

On 5/22/13 10:34 AM, "Mason" <ma...@verbasoftware.com> wrote:

>dear list,
>
>I have what I imagine is a standard setup: a web application generates
>data in MySQL, which I want to analyze in Hadoop; I run a nightly
>process to extract tables of interest, Avroize, and dump into HDFS.
>
>This has worked great so far because the tools I'm using make it easy to
>load a directory tree of Avros with the same schema.
>
>The issue is what to do when schema changes occur in the SQL database. I
>believe column additions and deletions are handled automatically by the
>Avro loaders I'm using, but I need to deal with a column rename.
>
>My thinking is: I could bake the table schemas at time of ETL into the
>Avros, for historical record, but then manually copy that schema out as
>a "master" schema and apply it to all Avros for which it's appropriate;
>then when a column rename occurs, go back and edit the master schema.
>
>I've never used an external schema before, so please correct if I
>misunderstand how they work.
>
>Anyone have wisdom to share on this topic? I'd love to hear from anyone
>who has done this, or has a better solution.

The first thing that comes to mind is the alias feature for field names:
http://avro.apache.org/docs/current/spec.html#Aliases

If you are using Avro data files, they already contain the schema as it 
was at the time of writing, which gives you that "historical record" for free.

The trick is being able to distinguish a column that was renamed from
"foo" to "fubar" from a case where "foo" was removed and "fubar" was
added.  To do this, one has to have knowledge of the SQL database's DDL
changes.

Once you have this, you can choose your reader schema appropriately --
most likely the 'latest' schema, decorated with field aliases where
appropriate, though there are other options.


>
>-Mason