Posted to user@avro.apache.org by Markus Resch <ma...@adtech.de> on 2012/03/20 10:27:37 UTC

Globbing several AVRO files with different (extended) schemes

Hi guys,

Thanks again for your awesome hint about sqoop. 

I have another question: the data I'm working with is stored as Avro
files in Hadoop. When I glob them, everything works perfectly. But
when I extend the schema of a single data file while the others remain
unchanged, everything breaks:

"currently we assume all avro files under the same "location" 
     * share the same schema and will throw exception if not." 

(e.g. I add a new data field.) The behavior I would expect: if I'm
globbing several files with slightly different schemas, the result of the
LOAD would either be the intersection of the fields common to all
schemas, or the missing fields would be set to null.
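
To make the two outcomes concrete, here is a small library-free sketch. It models schemas as lists of field names and records as plain dicts; it does not use the Avro or Pig APIs, and the function names and sample fields (id, url, referrer) are made up for illustration.

```python
def intersect_fields(schemas):
    """Return the field names present in every schema, in the first schema's order."""
    common = set(schemas[0]).intersection(*schemas[1:])
    return [name for name in schemas[0] if name in common]


def union_fields(schemas):
    """Return every field name seen in any schema, in first-seen order."""
    seen = []
    for schema in schemas:
        for name in schema:
            if name not in seen:
                seen.append(name)
    return seen


def project(record, fields):
    """Keep only `fields`; a field absent from the record becomes None."""
    return {name: record.get(name) for name in fields}


# Hypothetical data: the newer file adds a "referrer" field.
old_schema = ["id", "url"]
new_schema = ["id", "url", "referrer"]
old_record = {"id": 1, "url": "/a"}
new_record = {"id": 2, "url": "/b", "referrer": "/a"}

schemas = [old_schema, new_schema]
records = [old_record, new_record]

# Option A: only the fields common to all schemas survive.
option_a = [project(r, intersect_fields(schemas)) for r in records]

# Option B: all fields survive; missing ones are nulled.
option_b = [project(r, union_fields(schemas)) for r in records]
```

Option A silently drops the new field from every row, while option B keeps it and fills older rows with nulls, so B is usually the less lossy choice when a field is only ever added.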

How could I handle this properly?

Thanks 

Markus





Fwd: Globbing several AVRO files with different (extended) schemes

Posted by Russell Jurney <ru...@gmail.com>.
Anyone interested in doing this?

---------- Forwarded message ----------
From: Scott Carey <sc...@apache.org>
Date: Tue, Mar 20, 2012 at 2:08 PM
Subject: Re: Globbing several AVRO files with different (extended) schemes
To: user@avro.apache.org


I'm assuming you are using Pig's AvroStorage function. It appears that it
does not support schema migration, but it certainly could do so.  A
collection of avro files can be 'viewed' as if they all are of one schema
provided they can all resolve to it.  I have several tools that do this
successfully with MapReduce/Pig/Hive.
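
The idea of viewing many files through one schema can be sketched roughly as follows. This is a toy model in plain Python, not the Avro library and not the tools mentioned above: it assumes flat records, matches fields by name only, and the names resolve, reader_fields and _MISSING are invented for the example. It follows the record rules of Avro schema resolution in spirit: a field the reader expects but the writer lacks takes the reader's default, a writer-only field is ignored, and a missing field without a default is an error.

```python
_MISSING = object()  # sentinel: distinguishes "no default" from a default of None


def resolve(record, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields.

    reader_fields maps field name -> default value, or _MISSING if the
    reader declares no default for that field.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in writer_fields:
            resolved[name] = record[name]
        elif default is not _MISSING:
            resolved[name] = default  # field absent in this file: use the default
        else:
            raise ValueError(f"cannot resolve: no default for field {name!r}")
    return resolved  # fields known only to the writer are dropped


# Hypothetical reader schema: "referrer" was added later with a null default.
reader_fields = {"id": _MISSING, "url": _MISSING, "referrer": None}

# Each "file" carries its own writer schema, as Avro data files do.
files = [
    (["id", "url"], [{"id": 1, "url": "/a"}]),
    (["id", "url", "referrer"], [{"id": 2, "url": "/b", "referrer": "/a"}]),
]

rows = [
    resolve(record, writer_fields, reader_fields)
    for writer_fields, records in files
    for record in records
]
```

Every row ends up with the same three fields, so downstream code sees a single schema no matter which file a record came from.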

The Pig AvroStorage tool is maintained by the Apache Pig project; you will
need to ask there for more details.

-Scott



-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Globbing several AVRO files with different (extended) schemes

Posted by Russell Jurney <ru...@gmail.com>.
Supporting schema migration is a badly needed feature in AvroStorage.  I'm
not able to add it in the near future.  Anyone else interested?

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Globbing several AVRO files with different (extended) schemes

Posted by Scott Carey <sc...@apache.org>.
I'm assuming you are using Pig's AvroStorage function. It appears that it
does not support schema migration, but it certainly could do so.  A
collection of avro files can be 'viewed' as if they all are of one schema
provided they can all resolve to it.  I have several tools that do this
successfully with MapReduce/Pig/Hive.

The Pig AvroStorage tool is maintained by the Apache Pig project; you will
need to ask there for more details.

-Scott


