Posted to user@pig.apache.org by IGZ Nick <ig...@gmail.com> on 2012/03/28 22:22:07 UTC

Working with changing schemas (avro) in Pig

Hi guys,

I use Pig to process some clickstream data. I need to track a new field, so
I added it to my Avro schema and changed my Pig script accordingly. The
script works fine with the new files (which have the new column), but it
breaks when I run it on my old files, which do not have that column in
their schema (since Avro stores the schema in the data files themselves). I
was expecting Pig to treat the field as null if it does not exist, but
instead I am having to maintain separate scripts to process the old and new
files. Is there any workaround for this? I figure I'll be adding new
columns frequently, and I don't want to maintain a separate script for each
window where the schema is constant.

Thanks,
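
The failure mode above can be reproduced in miniature. Each Avro data file
embeds the schema it was written with, so a script that projects the new
column finds nothing to project in older files. A toy sketch (plain Python
dicts standing in for Avro records; the field names are hypothetical):

```python
# Toy illustration of the problem: dicts stand in for Avro records.
# Old files were written before the "referrer" column existed.
old_file = [{"member_id": 1}, {"member_id": 2}]      # old schema
new_file = [{"member_id": 3, "referrer": "search"}]  # new schema

def project(records, column):
    # Strict projection, like a script that assumes the column exists.
    return [r[column] for r in records]

print(project(new_file, "referrer"))  # ['search']
try:
    project(old_file, "referrer")     # breaks, as the Pig script does
except KeyError as e:
    print("old file breaks:", e)
```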

Re: Working with changing schemas (avro) in Pig

Posted by Bill Graham <bi...@gmail.com>.
Yes, and Avro has similar functionality: it can return a default value when
the schema evolves.

Re your comment:

Seems like it can handle these cases with ease?


I thought the "these cases" you were referring to was the issue of
referencing a schema definition file at runtime, which is what AvroStorage
does. Thrift and Protobufs via EB don't have this issue, since they deal
with it at compile time with codegen.
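
The default-value behavior mentioned above can be sketched as follows. This
is not the real Avro reader, just an illustration of its schema-resolution
rule for added fields; the schema and field names are hypothetical:

```python
import json

# Reader (new) schema: "referrer" was added later with a default, so the
# reader can fill it in for records written under the old schema.
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Click",
  "fields": [
    {"name": "member_id", "type": "long"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}
""")

def resolve(record, reader_schema):
    # For each reader-schema field: take the written value if present,
    # otherwise fall back to the field's declared default.
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError("no value and no default for " + field["name"])
    return out

old_record = {"member_id": 42}  # written before "referrer" existed
print(resolve(old_record, READER_SCHEMA))
```

This is essentially what allows a single (new) schema file to read both
pre- and post-evolution data.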



On Mon, Apr 2, 2012 at 10:18 AM, Alex Rovner <al...@gmail.com> wrote:

> Bill,
>
> It would seem that you will hit the same issues though. Imagine you are
> processing log files from an application. As your schema changes, you
> certainly do not want to reprocess all the historic logs. I believe
> Protobufs and Thrift handle these cases gracefully by inserting nulls into
> expected columns that are not found?
>
> Alex
>
>
> On Mon, Apr 2, 2012 at 1:08 AM, Bill Graham <bi...@gmail.com> wrote:
>
>> Elephantbird has functionality to integrate with Protobufs and Thrift but
>> not Avro. When reading and writing messages of either type, EB expects
>> classes to be generated via schema definitions at build-time. It doesn't
>> read schema defs at run-time to dynamically generate messages like one
>> would do with Avro. Hence EB takes a different approach and doesn't have
>> to
>> deal with the evolving schema file in the same way as AvroStorage does.
>>
>>
>> On Sun, Apr 1, 2012 at 9:32 AM, Alex Rovner <al...@gmail.com> wrote:
>>
>> > Anyone have any experience with elephantbird? Seems like it can handle
>> > these cases with ease?
>> >
>> > Sent from my iPhone
>> >
>> > On Mar 30, 2012, at 12:59 AM, Bill Graham <bi...@gmail.com> wrote:
>> >
>> > > In the TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
>> > > there's an example:
>> > >
>> > > STORE avro2 INTO 'output_dir'
>> > > USING org.apache.pig.piggybank.storage.avro.AvroStorage (
>> > > '{"schema_file": "/path/to/schema/file" ,
>> > > "field0": "def:member_id",
>> > > "field1": "def:browser_id",
>> > > "field3": "def:act_content" }'
>> > > );
>> > >
>> > > You specify the file that contains the schema, then you have to map
>> the
>> > > tuple fields to the name of the field in the avro schema. This mapping
>> > is a
>> > > drag, but it's currently required.
>> > >
>> > > Note that only the json-style constructor (as opposed to the string
>> array
>> > > approach) supports schema_file without this uncommitted patch:
>> > > https://issues.apache.org/jira/browse/PIG-2257
>> > >
>> > >
>> > > thanks,
>> > > Bill
>> > >
>> > > On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <ig...@gmail.com>
>> wrote:
>> > >
>> > >> That's nice! Can you give me an example of how to use it? I am not
>> able
>> > to
>> > >> figure it out from the code. The schemaManager is only used at one
>> place
>> > >> after that, and that is when the params contains a "field<number>"
>> key.
>> > I
>> > >> don't understand that part. Is there a way I can call it simply like
>> > STORE
>> > >> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
>> > >>
>> > >>
>> > >>
>> > >> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <bi...@gmail.com>
>> > wrote:
>> > >>
>> > >>> Yes, the schema can be in HDFS but the documentation for this is
>> > lacking.
>> > >>> Search for 'schema_file' here:
>> > >>>
>> > >>>
>> > >>>
>> >
>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>> > >>>
>> > >>> and here:
>> > >>>
>> > >>>
>> > >>>
>> >
>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>> > >>>
>> > >>> And be aware of this open JIRA:
>> > >>> https://issues.apache.org/jira/browse/PIG-2257
>> > >>>
>> > >>> And this closed one:
>> > >>> https://issues.apache.org/jira/browse/PIG-2195
>> > >>>
>> > >>> :)
>> > >>>
>> > >>> thanks,
>> > >>> Bill
>> > >>>
>> > >>>
>> > >>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <ig...@gmail.com>
>> wrote:
>> > >>>
>> > >>>> The schema has to be written in the script right? I don't think
>> there
>> > is
>> > >>>> any way the schema can be in a file outside the script. That was
>> the
>> > >>>> messiness I was talking about. Or is there a way I can write the
>> > schema in
>> > >>>> a separate file? One way I see is to create and store a dummy file
>> > with the
>> > >>>> schema
>> > >>>>
>> > >>>>
>> > >>>> Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <bi...@gmail.com>
>> > wrote:
>> > >>>>
>> > >>>>> The default value will be part of the new Avro schema definition
>> and
>> > >>>>> Avro should return it to you, so there shouldn't be any code
>> > messiness with
>> > >>>>> that approach.
>> > >>>>>
>> > >>>>>
>> > >>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com>
>> > wrote:
>> > >>>>>
>> > >>>>>> Ok.. you mean I can just use the newer schema to read the old
>> schema
>> > >>>>>> as well, by populating some default value for the missing field.
>> I
>> > think
>> > >>>>>> that should work, messy code though!
>> > >>>>>>
>> > >>>>>> Thanks!
>> > >>>>>>
>> > >>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <
>> billgraham@gmail.com
>> > >wrote:
>> > >>>>>>
>> > >>>>>>> If you evolved your schema to just add fields, then you should
>> be
>> > >>>>>>> able to
>> > >>>>>>> use a single schema descriptor file to read both pre- and
>> > >>>>>>> post-evolved data
>> > >>>>>>> objects. This is because one of the rules of new fields in Avro
>> is
>> > >>>>>>> that
>> > >>>>>>> they have to have a default value and be non-null. AvroStorage
>> > should
>> > >>>>>>> pick
>> > >>>>>>> that default field up for the old objects. If it doesn't, then
>> > that's
>> > >>>>>>> a bug.
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com>
>> > >>>>>>> wrote:
>> > >>>>>>>
>> > >>>>>>>> @Bill,
>> > >>>>>>>> I did look at the option of providing input as a parameter
>> while
>> > >>>>>>>> initializing AvroStorage(). But even then, I'll still need to
>> > >>>>>>> change my
>> > >>>>>>>> script to handle the two files because I'll still need to have
>> > >>>>>>> separate
>> > >>>>>>>> schemas right?
>> > >>>>>>>>
>> > >>>>>>>> @Stan,
>> > >>>>>>>> Thanks for pointing me to it, it is a useful feature. But in my
>> > >>>>>>> case, I
>> > >>>>>>>> would never have two input files with different schemas. The
>> input
>> > >>>>>>> will
>> > >>>>>>>> always have only one of the schemas, but I want my new script
>> > (with
>> > >>>>>>> the
>> > >>>>>>>> additional column) to be able to process the old data as well,
>> > even
>> > >>>>>>> if the
>> > >>>>>>>> input only contains data with the older schema.
>> > >>>>>>>>
>> > >>>>>>>> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
>> > >>>>>>> stan.rosenberg@gmail.com
>> > >>>>>>>>> wrote:
>> > >>>>>>>>
>> > >>>>>>>>> There is a patch for Avro to deal with this use case:
>> > >>>>>>>>> https://issues.apache.org/jira/browse/PIG-2579
>> > >>>>>>>>> (See the attached pig example which loads two avro input files
>> > >>>>>>> with
>> > >>>>>>>>> different schemas.)
>> > >>>>>>>>>
>> > >>>>>>>>> Best,
>> > >>>>>>>>>
>> > >>>>>>>>> stan
>> > >>>>>>>>>
>> > >>>>>>>>> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <
>> igznick01@gmail.com>
>> > >>>>>>> wrote:
>> > >>>>>>>>>> Hi guys,
>> > >>>>>>>>>>
>> > >>>>>>>>>> I use Pig to process some clickstream data. I need to track a
>> > >>>>>>> new
>> > >>>>>>>> field,
>> > >>>>>>>>> so
>> > >>>>>>>>>> I added a new field to my avro schema, and changed my Pig
>> script
>> > >>>>>>>>>> accordingly. It works fine with the new files (which have
>> that
>> > >>>>>>> new
>> > >>>>>>>>> column)
>> > >>>>>>>>>> but it breaks when I run it on my old files which do not have
>> > >>>>>>> that
>> > >>>>>>>> column
>> > >>>>>>>>>> in the schema (since avro stores schema in the data files
>> > >>>>>>> itself). I
>> > >>>>>>>> was
>> > >>>>>>>>>> expecting that Pig will assume the field to be null if that
>> > >>>>>>> particular
>> > >>>>>>>>>> field does not exist. But now I am having to maintain
>> separate
>> > >>>>>>> scripts
>> > >>>>>>>> to
>> > >>>>>>>>>> process the old and new files. Is there any workaround this?
>> > >>>>>>> Because I
>> > >>>>>>>>>> figure I'll have to add new column frequently and I don't
>> want
>> > >>>>>>> to
>> > >>>>>>>>> maintain
>> > >>>>>>>>>> a separate script for each window where the schema is
>> constant.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Thanks,
>> > >>>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> --
>> > >>>>>>> *Note that I'm no longer using my Yahoo! email address. Please
>> > email
>> > >>>>>>> me at
>> > >>>>>>> billgraham@gmail.com going forward.*
>> > >>>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>> --
>> > >>>>> *Note that I'm no longer using my Yahoo! email address. Please
>> email
>> > >>>>> me at billgraham@gmail.com going forward.*
>> > >>>>>
>> > >>>>
>> > >>>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> *Note that I'm no longer using my Yahoo! email address. Please
>> email me
>> > >>> at billgraham@gmail.com going forward.*
>> > >>>
>> > >>
>> > >>
>> > >
>> > >
>> > > --
>> > > *Note that I'm no longer using my Yahoo! email address. Please email
>> me
>> > at
>> > > billgraham@gmail.com going forward.*
>> >
>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> billgraham@gmail.com going forward.*
>>
>
>


-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: Working with changing schemas (avro) in Pig

Posted by Alex Rovner <al...@gmail.com>.
Bill,

It would seem that you will hit the same issues though. Imagine you are
processing log files from an application. As your schema changes, you
certainly do not want to reprocess all the historic logs. I believe
Protobufs and Thrift handle these cases gracefully by inserting nulls into
expected columns that are not found?

Alex
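
The null-insertion behavior described for Protobufs and Thrift amounts to a
lenient projection: absent fields come back as null instead of failing. A
minimal sketch (dicts as stand-ins, not the real wire formats; field names
hypothetical):

```python
# Lenient projection: missing columns yield None rather than an error,
# roughly analogous to readers tolerating fields absent from old data.
def project_lenient(records, column):
    return [r.get(column) for r in records]

old_file = [{"member_id": 1}, {"member_id": 2}]  # written before the new column
print(project_lenient(old_file, "referrer"))     # [None, None]
```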

On Mon, Apr 2, 2012 at 1:08 AM, Bill Graham <bi...@gmail.com> wrote:

> Elephantbird has functionality to integrate with Protobufs and Thrift but
> not Avro. When reading and writing messages of either type, EB expects
> classes to be generated via schema definitions at build-time. It doesn't
> read schema defs at run-time to dynamically generate messages like one
> would do with Avro. Hence EB takes a different approach and doesn't have to
> deal with the evolving schema file in the same way as AvroStorage does.
>
>
> On Sun, Apr 1, 2012 at 9:32 AM, Alex Rovner <al...@gmail.com> wrote:
>
> > Anyone have any experience with elephantbird? Seems like it can handle
> > these cases with ease?
> >
> > Sent from my iPhone
> >
> > On Mar 30, 2012, at 12:59 AM, Bill Graham <bi...@gmail.com> wrote:
> >
> > > In the TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
> > > there's an example:
> > >
> > > STORE avro2 INTO 'output_dir'
> > > USING org.apache.pig.piggybank.storage.avro.AvroStorage (
> > > '{"schema_file": "/path/to/schema/file" ,
> > > "field0": "def:member_id",
> > > "field1": "def:browser_id",
> > > "field3": "def:act_content" }'
> > > );
> > >
> > > You specify the file that contains the schema, then you have to map the
> > > tuple fields to the name of the field in the avro schema. This mapping
> > is a
> > > drag, but it's currently required.
> > >
> > > Note that only the json-style constructor (as opposed to the string
> array
> > > approach) supports schema_file without this uncommitted patch:
> > > https://issues.apache.org/jira/browse/PIG-2257
> > >
> > >
> > > thanks,
> > > Bill
> > >
> > > On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <ig...@gmail.com> wrote:
> > >
> > >> That's nice! Can you give me an example of how to use it? I am not
> able
> > to
> > >> figure it out from the code. The schemaManager is only used at one
> place
> > >> after that, and that is when the params contains a "field<number>"
> key.
> > I
> > >> don't understand that part. Is there a way I can call it simply like
> > STORE
> > >> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
> > >>
> > >>
> > >>
> > >> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <bi...@gmail.com>
> > wrote:
> > >>
> > >>> Yes, the schema can be in HDFS but the documentation for this is
> > lacking.
> > >>> Search for 'schema_file' here:
> > >>>
> > >>>
> > >>>
> >
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
> > >>>
> > >>> and here:
> > >>>
> > >>>
> > >>>
> >
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
> > >>>
> > >>> And be aware of this open JIRA:
> > >>> https://issues.apache.org/jira/browse/PIG-2257
> > >>>
> > >>> And this closed one:
> > >>> https://issues.apache.org/jira/browse/PIG-2195
> > >>>
> > >>> :)
> > >>>
> > >>> thanks,
> > >>> Bill
> > >>>
> > >>>
> > >>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <ig...@gmail.com>
> wrote:
> > >>>
> > >>>> The schema has to be written in the script right? I don't think
> there
> > is
> > >>>> any way the schema can be in a file outside the script. That was the
> > >>>> messiness I was talking about. Or is there a way I can write the
> > schema in
> > >>>> a separate file? One way I see is to create and store a dummy file
> > with the
> > >>>> schema
> > >>>>
> > >>>>
> > >>>> Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <bi...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> The default value will be part of the new Avro schema definition
> and
> > >>>>> Avro should return it to you, so there shouldn't be any code
> > messiness with
> > >>>>> that approach.
> > >>>>>
> > >>>>>
> > >>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com>
> > wrote:
> > >>>>>
> > >>>>>> Ok.. you mean I can just use the newer schema to read the old
> schema
> > >>>>>> as well, by populating some default value for the missing field. I
> > think
> > >>>>>> that should work, messy code though!
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <
> billgraham@gmail.com
> > >wrote:
> > >>>>>>
> > >>>>>>> If you evolved your schema to just add fields, then you should be
> > >>>>>>> able to
> > >>>>>>> use a single schema descriptor file to read both pre- and
> > >>>>>>> post-evolved data
> > >>>>>>> objects. This is because one of the rules of new fields in Avro
> is
> > >>>>>>> that
> > >>>>>>> they have to have a default value and be non-null. AvroStorage
> > should
> > >>>>>>> pick
> > >>>>>>> that default field up for the old objects. If it doesn't, then
> > that's
> > >>>>>>> a bug.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> @Bill,
> > >>>>>>>> I did look at the option of providing input as a parameter while
> > >>>>>>>> initializing AvroStorage(). But even then, I'll still need to
> > >>>>>>> change my
> > >>>>>>>> script to handle the two files because I'll still need to have
> > >>>>>>> separate
> > >>>>>>>> schemas right?
> > >>>>>>>>
> > >>>>>>>> @Stan,
> > >>>>>>>> Thanks for pointing me to it, it is a useful feature. But in my
> > >>>>>>> case, I
> > >>>>>>>> would never have two input files with different schemas. The
> input
> > >>>>>>> will
> > >>>>>>>> always have only one of the schemas, but I want my new script
> > (with
> > >>>>>>> the
> > >>>>>>>> additional column) to be able to process the old data as well,
> > even
> > >>>>>>> if the
> > >>>>>>>> input only contains data with the older schema.
> > >>>>>>>>
> > >>>>>>>> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
> > >>>>>>> stan.rosenberg@gmail.com
> > >>>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> There is a patch for Avro to deal with this use case:
> > >>>>>>>>> https://issues.apache.org/jira/browse/PIG-2579
> > >>>>>>>>> (See the attached pig example which loads two avro input files
> > >>>>>>> with
> > >>>>>>>>> different schemas.)
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>>
> > >>>>>>>>> stan
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <igznick01@gmail.com
> >
> > >>>>>>> wrote:
> > >>>>>>>>>> Hi guys,
> > >>>>>>>>>>
> > >>>>>>>>>> I use Pig to process some clickstream data. I need to track a
> > >>>>>>> new
> > >>>>>>>> field,
> > >>>>>>>>> so
> > >>>>>>>>>> I added a new field to my avro schema, and changed my Pig
> script
> > >>>>>>>>>> accordingly. It works fine with the new files (which have that
> > >>>>>>> new
> > >>>>>>>>> column)
> > >>>>>>>>>> but it breaks when I run it on my old files which do not have
> > >>>>>>> that
> > >>>>>>>> column
> > >>>>>>>>>> in the schema (since avro stores schema in the data files
> > >>>>>>> itself). I
> > >>>>>>>> was
> > >>>>>>>>>> expecting that Pig will assume the field to be null if that
> > >>>>>>> particular
> > >>>>>>>>>> field does not exist. But now I am having to maintain separate
> > >>>>>>> scripts
> > >>>>>>>> to
> > >>>>>>>>>> process the old and new files. Is there any workaround this?
> > >>>>>>> Because I
> > >>>>>>>>>> figure I'll have to add new column frequently and I don't want
> > >>>>>>> to
> > >>>>>>>>> maintain
> > >>>>>>>>>> a separate script for each window where the schema is
> constant.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> *Note that I'm no longer using my Yahoo! email address. Please
> > email
> > >>>>>>> me at
> > >>>>>>> billgraham@gmail.com going forward.*
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> *Note that I'm no longer using my Yahoo! email address. Please
> email
> > >>>>> me at billgraham@gmail.com going forward.*
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> *Note that I'm no longer using my Yahoo! email address. Please email
> me
> > >>> at billgraham@gmail.com going forward.*
> > >>>
> > >>
> > >>
> > >
> > >
> > > --
> > > *Note that I'm no longer using my Yahoo! email address. Please email me
> > at
> > > billgraham@gmail.com going forward.*
> >
>
>
>
> --
> *Note that I'm no longer using my Yahoo! email address. Please email me at
> billgraham@gmail.com going forward.*
>

Re: Working with changing schemas (avro) in Pig

Posted by Bill Graham <bi...@gmail.com>.
Elephantbird has functionality to integrate with Protobufs and Thrift but
not Avro. When reading and writing messages of either type, EB expects
classes to be generated via schema definitions at build-time. It doesn't
read schema defs at run-time to dynamically generate messages like one
would do with Avro. Hence EB takes a different approach and doesn't have to
deal with the evolving schema file in the same way as AvroStorage does.


On Sun, Apr 1, 2012 at 9:32 AM, Alex Rovner <al...@gmail.com> wrote:

> Anyone have any experience with elephantbird? Seems like it can handle
> these cases with ease?
>
> Sent from my iPhone
>
> On Mar 30, 2012, at 12:59 AM, Bill Graham <bi...@gmail.com> wrote:
>
> > In the TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
> > there's an example:
> >
> > STORE avro2 INTO 'output_dir'
> > USING org.apache.pig.piggybank.storage.avro.AvroStorage (
> > '{"schema_file": "/path/to/schema/file" ,
> > "field0": "def:member_id",
> > "field1": "def:browser_id",
> > "field3": "def:act_content" }'
> > );
> >
> > You specify the file that contains the schema, then you have to map the
> > tuple fields to the name of the field in the avro schema. This mapping
> is a
> > drag, but it's currently required.
> >
> > Note that only the json-style constructor (as opposed to the string array
> > approach) supports schema_file without this uncommitted patch:
> > https://issues.apache.org/jira/browse/PIG-2257
> >
> >
> > thanks,
> > Bill
> >
> > On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <ig...@gmail.com> wrote:
> >
> >> That's nice! Can you give me an example of how to use it? I am not able
> to
> >> figure it out from the code. The schemaManager is only used at one place
> >> after that, and that is when the params contains a "field<number>" key.
> I
> >> don't understand that part. Is there a way I can call it simply like
> STORE
> >> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
> >>
> >>
> >>
> >> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <bi...@gmail.com>
> wrote:
> >>
> >>> Yes, the schema can be in HDFS but the documentation for this is
> lacking.
> >>> Search for 'schema_file' here:
> >>>
> >>>
> >>>
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
> >>>
> >>> and here:
> >>>
> >>>
> >>>
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
> >>>
> >>> And be aware of this open JIRA:
> >>> https://issues.apache.org/jira/browse/PIG-2257
> >>>
> >>> And this closed one:
> >>> https://issues.apache.org/jira/browse/PIG-2195
> >>>
> >>> :)
> >>>
> >>> thanks,
> >>> Bill
> >>>
> >>>
> >>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <ig...@gmail.com> wrote:
> >>>
> >>>> The schema has to be written in the script right? I don't think there
> is
> >>>> any way the schema can be in a file outside the script. That was the
> >>>> messiness I was talking about. Or is there a way I can write the
> schema in
> >>>> a separate file? One way I see is to create and store a dummy file
> with the
> >>>> schema
> >>>>
> >>>>
> >>>> Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <bi...@gmail.com>
> wrote:
> >>>>
> >>>>> The default value will be part of the new Avro schema definition and
> >>>>> Avro should return it to you, so there shouldn't be any code
> messiness with
> >>>>> that approach.
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Ok.. you mean I can just use the newer schema to read the old schema
> >>>>>> as well, by populating some default value for the missing field. I
> think
> >>>>>> that should work, messy code though!
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <billgraham@gmail.com
> >wrote:
> >>>>>>
> >>>>>>> If you evolved your schema to just add fields, then you should be
> >>>>>>> able to
> >>>>>>> use a single schema descriptor file to read both pre- and
> >>>>>>> post-evolved data
> >>>>>>> objects. This is because one of the rules of new fields in Avro is
> >>>>>>> that
> >>>>>>> they have to have a default value and be non-null. AvroStorage
> should
> >>>>>>> pick
> >>>>>>> that default field up for the old objects. If it doesn't, then
> that's
> >>>>>>> a bug.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> @Bill,
> >>>>>>>> I did look at the option of providing input as a parameter while
> >>>>>>>> initializing AvroStorage(). But even then, I'll still need to
> >>>>>>> change my
> >>>>>>>> script to handle the two files because I'll still need to have
> >>>>>>> separate
> >>>>>>>> schemas right?
> >>>>>>>>
> >>>>>>>> @Stan,
> >>>>>>>> Thanks for pointing me to it, it is a useful feature. But in my
> >>>>>>> case, I
> >>>>>>>> would never have two input files with different schemas. The input
> >>>>>>> will
> >>>>>>>> always have only one of the schemas, but I want my new script
> (with
> >>>>>>> the
> >>>>>>>> additional column) to be able to process the old data as well,
> even
> >>>>>>> if the
> >>>>>>>> input only contains data with the older schema.
> >>>>>>>>
> >>>>>>>> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
> >>>>>>> stan.rosenberg@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> There is a patch for Avro to deal with this use case:
> >>>>>>>>> https://issues.apache.org/jira/browse/PIG-2579
> >>>>>>>>> (See the attached pig example which loads two avro input files
> >>>>>>> with
> >>>>>>>>> different schemas.)
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> stan
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>> Hi guys,
> >>>>>>>>>>
> >>>>>>>>>> I use Pig to process some clickstream data. I need to track a
> >>>>>>> new
> >>>>>>>> field,
> >>>>>>>>> so
> >>>>>>>>>> I added a new field to my avro schema, and changed my Pig script
> >>>>>>>>>> accordingly. It works fine with the new files (which have that
> >>>>>>> new
> >>>>>>>>> column)
> >>>>>>>>>> but it breaks when I run it on my old files which do not have
> >>>>>>> that
> >>>>>>>> column
> >>>>>>>>>> in the schema (since avro stores schema in the data files
> >>>>>>> itself). I
> >>>>>>>> was
> >>>>>>>>>> expecting that Pig will assume the field to be null if that
> >>>>>>> particular
> >>>>>>>>>> field does not exist. But now I am having to maintain separate
> >>>>>>> scripts
> >>>>>>>> to
> >>>>>>>>>> process the old and new files. Is there any workaround this?
> >>>>>>> Because I
> >>>>>>>>>> figure I'll have to add new column frequently and I don't want
> >>>>>>> to
> >>>>>>>>> maintain
> >>>>>>>>>> a separate script for each window where the schema is constant.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> *Note that I'm no longer using my Yahoo! email address. Please
> email
> >>>>>>> me at
> >>>>>>> billgraham@gmail.com going forward.*
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> *Note that I'm no longer using my Yahoo! email address. Please email
> >>>>> me at billgraham@gmail.com going forward.*
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> *Note that I'm no longer using my Yahoo! email address. Please email me
> >>> at billgraham@gmail.com going forward.*
> >>>
> >>
> >>
> >
> >
> > --
> > *Note that I'm no longer using my Yahoo! email address. Please email me
> at
> > billgraham@gmail.com going forward.*
>



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: Working with changing schemas (avro) in Pig

Posted by Alex Rovner <al...@gmail.com>.
Anyone have any experience with elephantbird? Seems like it can handle these cases with ease?

Sent from my iPhone

On Mar 30, 2012, at 12:59 AM, Bill Graham <bi...@gmail.com> wrote:

> In the TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
> there's an example:
> 
> STORE avro2 INTO 'output_dir'
> USING org.apache.pig.piggybank.storage.avro.AvroStorage (
> '{"schema_file": "/path/to/schema/file" ,
> "field0": "def:member_id",
> "field1": "def:browser_id",
> "field3": "def:act_content" }'
> );
> 
> You specify the file that contains the schema, then you have to map the
> tuple fields to the name of the field in the avro schema. This mapping is a
> drag, but it's currently required.
> 
> Note that only the json-style constructor (as opposed to the string array
> approach) supports schema_file without this uncommitted patch:
> https://issues.apache.org/jira/browse/PIG-2257
> 
> 
> thanks,
> Bill
> 
> On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <ig...@gmail.com> wrote:
> 
>> That's nice! Can you give me an example of how to use it? I am not able to
>> figure it out from the code. The schemaManager is only used at one place
>> after that, and that is when the params contains a "field<number>" key. I
>> don't understand that part. Is there a way I can call it simply like STORE
>> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
>> 
>> 
>> 
>> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <bi...@gmail.com> wrote:
>> 
>>> Yes, the schema can be in HDFS but the documentation for this is lacking.
>>> Search for 'schema_file' here:
>>> 
>>> 
>>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>>> 
>>> and here:
>>> 
>>> 
>>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>>> 
>>> And be aware of this open JIRA:
>>> https://issues.apache.org/jira/browse/PIG-2257
>>> 
>>> And this closed one:
>>> https://issues.apache.org/jira/browse/PIG-2195
>>> 
>>> :)
>>> 
>>> thanks,
>>> Bill
>>> 
>>> 
>>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <ig...@gmail.com> wrote:
>>> 
>>>> The schema has to be written in the script right? I don't think there is
>>>> any way the schema can be in a file outside the script. That was the
>>>> messiness I was talking about. Or is there a way I can write the schema in
>>>> a separate file? One way I see is to create and store a dummy file with the
>>>> schema
>>>> 
>>>> 
>>>> Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <bi...@gmail.com> wrote:
>>>> 
>>>>> The default value will be part of the new Avro schema definition and
>>>>> Avro should return it to you, so there shouldn't be any code messiness with
>>>>> that approach.
>>>>> 
>>>>> 
>>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com> wrote:
>>>>> 
>>>>>> Ok.. you mean I can just use the newer schema to read the old schema
>>>>>> as well, by populating some default value for the missing field. I think
>>>>>> that should work, messy code though!
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <bi...@gmail.com>wrote:
>>>>>> 
>>>>>>> If you evolved your schema to just add fields, then you should be
>>>>>>> able to
>>>>>>> use a single schema descriptor file to read both pre- and
>>>>>>> post-evolved data
>>>>>>> objects. This is because one of the rules of new fields in Avro is
>>>>>>> that
>>>>>>> they have to have a default value and be non-null. AvroStorage should
>>>>>>> pick
>>>>>>> that default field up for the old objects. If it doesn't, then that's
>>>>>>> a bug.
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> @Bill,
>>>>>>>> I did look at the option of providing input as a parameter while
>>>>>>>> initializing AvroStorage(). But even then, I'll still need to
>>>>>>> change my
>>>>>>>> script to handle the two files because I'll still need to have
>>>>>>> separate
>>>>>>>> schemas right?
>>>>>>>> 
>>>>>>>> @Stan,
>>>>>>>> Thanks for pointing me to it, it is a useful feature. But in my
>>>>>>> case, I
>>>>>>>> would never have two input files with different schemas. The input
>>>>>>> will
>>>>>>>> always have only one of the schemas, but I want my new script (with
>>>>>>> the
>>>>>>>> additional column) to be able to process the old data as well, even
>>>>>>> if the
>>>>>>>> input only contains data with the older schema.
>>>>>>>> 
>>>>>>>> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
>>>>>>> stan.rosenberg@gmail.com
>>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> There is a patch for Avro to deal with this use case:
>>>>>>>>> https://issues.apache.org/jira/browse/PIG-2579
>>>>>>>>> (See the attached pig example which loads two avro input files
>>>>>>> with
>>>>>>>>> different schemas.)
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> stan
>>>>>>>>> 
>>>>>>>>> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> Hi guys,
>>>>>>>>>> 
>>>>>>>>>> I use Pig to process some clickstream data. I need to track a
>>>>>>> new
>>>>>>>> field,
>>>>>>>>> so
>>>>>>>>>> I added a new field to my avro schema, and changed my Pig script
>>>>>>>>>> accordingly. It works fine with the new files (which have that
>>>>>>> new
>>>>>>>>> column)
>>>>>>>>>> but it breaks when I run it on my old files which do not have
>>>>>>> that
>>>>>>>> column
>>>>>>>>>> in the schema (since avro stores schema in the data files
>>>>>>> itself). I
>>>>>>>> was
>>>>>>>>>> expecting that Pig will assume the field to be null if that
>>>>>>> particular
>>>>>>>>>> field does not exist. But now I am having to maintain separate
>>>>>>> scripts
>>>>>>>> to
>>>>>>>>>> process the old and new files. Is there any workaround this?
>>>>>>> Because I
>>>>>>>>>> figure I'll have to add new column frequently and I don't want
>>>>>>> to
>>>>>>>>> maintain
>>>>>>>>>> a separate script for each window where the schema is constant.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> *Note that I'm no longer using my Yahoo! email address. Please email
>>>>>>> me at
>>>>>>> billgraham@gmail.com going forward.*
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> *Note that I'm no longer using my Yahoo! email address. Please email
>>>>> me at billgraham@gmail.com going forward.*
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> *Note that I'm no longer using my Yahoo! email address. Please email me
>>> at billgraham@gmail.com going forward.*
>>> 
>> 
>> 
> 
> 
> -- 
> *Note that I'm no longer using my Yahoo! email address. Please email me at
> billgraham@gmail.com going forward.*

Re: Working with changing schemas (avro) in Pig

Posted by Bill Graham <bi...@gmail.com>.
In the TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
there's an example:

STORE avro2 INTO 'output_dir'
USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    '{"schema_file": "/path/to/schema/file",
      "field0": "def:member_id",
      "field1": "def:browser_id",
      "field3": "def:act_content"}'
);

You specify the file that contains the schema, then map each tuple field to
the name of the corresponding field in the Avro schema. This mapping is a
drag, but it's currently required.

Note that only the json-style constructor (as opposed to the string array
approach) supports schema_file without this uncommitted patch:
https://issues.apache.org/jira/browse/PIG-2257
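When a relation has many columns, the field-index mapping above can be generated instead of typed by hand. A minimal sketch (the column names and the sequential indices here are assumptions; real indices must match the positions of the fields in your tuple):

```python
import json

# Build the JSON constructor argument for AvroStorage programmatically.
# Column names and sequential indices are hypothetical examples.
columns = ["member_id", "browser_id", "act_content"]
args = {"schema_file": "/path/to/schema/file"}
for i, name in enumerate(columns):
    args["field%d" % i] = "def:" + name

print(json.dumps(args, sort_keys=True, indent=1))
```

The resulting string can then be pasted (or templated) into the `USING AvroStorage('...')` clause.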


thanks,
Bill

On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <ig...@gmail.com> wrote:

> That's nice! Can you give me an example of how to use it? I am not able to
> figure it out from the code. The schemaManager is only used at one place
> after that, and that is when the params contains a "field<number>" key. I
> don't understand that part. Is there a way I can call it simply like STORE
> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
>
>
>
> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <bi...@gmail.com> wrote:
>
>> Yes, the schema can be in HDFS but the documentation for this is lacking.
>> Search for 'schema_file' here:
>>
>>
>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>>
>> and here:
>>
>>
>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>>
>> And be aware of this open JIRA:
>> https://issues.apache.org/jira/browse/PIG-2257
>>
>> And this closed one:
>> https://issues.apache.org/jira/browse/PIG-2195
>>
>> :)
>>
>> thanks,
>> Bill
>>
>>
>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <ig...@gmail.com> wrote:
>>
>>> The schema has to be written in the script right? I don't think there is
>>> any way the schema can be in a file outside the script. That was the
>>> messyness I was talking about. Or is there a way I can write the schema in
>>> a separate file? One way I see is to create and store a dummy file with the
>>> schema
>>>
>>>
>>> Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <bi...@gmail.com> wrote:
>>>
>>>> The default value will be part of the new Avro schema definition and
>>>> Avro should return it to you, so there shouldn't be any code messyness with
>>>> that approach.
>>>>
>>>>
>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com> wrote:
>>>>
>>>>> Ok.. you mean I can just use the newer schema to read the old schema
>>>>> as well, by populating some default value for the missing field. I think
>>>>> that should work, messy code though!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <bi...@gmail.com>wrote:
>>>>>
>>>>>>  If you evolved your schema to just add fields, then you should be
>>>>>> able to
>>>>>> use a single schema descriptor file to read both pre- and
>>>>>> post-evolved data
>>>>>> objects. This is because one of the rules of new fields in Avro is
>>>>>> that
>>>>>> they have to have a default value and be non-null. AvroStorage should
>>>>>> pick
>>>>>> that default field up for the old objects. If it doesn't, then that's
>>>>>> a bug.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> > @Bill,
>>>>>> > I did look at the option of providing input as a parameter while
>>>>>> > initializing AvroStorage(). But even then, I'll still need to
>>>>>> change my
>>>>>> > script to handle the two files because I'll still need to have
>>>>>> separate
>>>>>> > schemas right?
>>>>>> >
>>>>>> > @Stan,
>>>>>> > Thanks for pointing me to it, it is a useful feature. But in my
>>>>>> case, I
>>>>>> > would never have two input files with different schemas. The input
>>>>>> will
>>>>>> > always have only one of the schemas, but I want my new script (with
>>>>>> the
>>>>>> > additional column) to be able to process the old data as well, even
>>>>>> if the
>>>>>> > input only contains data with the older schema.
>>>>>> >
>>>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
>>>>>> stan.rosenberg@gmail.com
>>>>>> > >wrote:
>>>>>> >
>>>>>> > > There is a patch for Avro to deal with this use case:
>>>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>>>> > > (See the attached pig example which loads two avro input files
>>>>>> with
>>>>>> > > different schemas.)
>>>>>> > >
>>>>>> > > Best,
>>>>>> > >
>>>>>> > > stan
>>>>>> > >
>>>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com>
>>>>>> wrote:
>>>>>> > > > Hi guys,
>>>>>> > > >
>>>>>> > > > I use Pig to process some clickstream data. I need to track a
>>>>>> new
>>>>>> > field,
>>>>>> > > so
>>>>>> > > > I added a new field to my avro schema, and changed my Pig script
>>>>>> > > > accordingly. It works fine with the new files (which have that
>>>>>> new
>>>>>> > > column)
>>>>>> > > > but it breaks when I run it on my old files which do not have
>>>>>> that
>>>>>> > column
>>>>>> > > > in the schema (since avro stores schema in the data files
>>>>>> itself). I
>>>>>> > was
>>>>>> > > > expecting that Pig will assume the field to be null if that
>>>>>> particular
>>>>>> > > > field does not exist. But now I am having to maintain separate
>>>>>> scripts
>>>>>> > to
>>>>>> > > > process the old and new files. Is there any workaround this?
>>>>>> Because I
>>>>>> > > > figure I'll have to add new column frequently and I don't want
>>>>>> to
>>>>>> > > maintain
>>>>>> > > > a separate script for each window where the schema is constant.
>>>>>> > > >
>>>>>> > > > Thanks,
>>>>>> > >
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Note that I'm no longer using my Yahoo! email address. Please email
>>>>>> me at
>>>>>> billgraham@gmail.com going forward.*
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Note that I'm no longer using my Yahoo! email address. Please email
>>>> me at billgraham@gmail.com going forward.*
>>>>
>>>
>>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me
>> at billgraham@gmail.com going forward.*
>>
>
>



Re: Working with changing schemas (avro) in Pig

Posted by IGZ Nick <ig...@gmail.com>.
That's nice! Can you give me an example of how to use it? I am not able to
figure it out from the code. The schemaManager is only used at one place
after that, and that is when the params contain a "field<number>" key. I
don't understand that part. Is there a way I can call it simply like STORE
xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?


On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <bi...@gmail.com> wrote:

> Yes, the schema can be in HDFS but the documentation for this is lacking.
> Search for 'schema_file' here:
>
>
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>
> and here:
>
>
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>
> And be aware of this open JIRA:
> https://issues.apache.org/jira/browse/PIG-2257
>
> And this closed one:
> https://issues.apache.org/jira/browse/PIG-2195
>
> :)
>
> thanks,
> Bill
>
>
> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <ig...@gmail.com> wrote:
>
>> The schema has to be written in the script right? I don't think there is
>> any way the schema can be in a file outside the script. That was the
>> messyness I was talking about. Or is there a way I can write the schema in
>> a separate file? One way I see is to create and store a dummy file with the
>> schema
>>
>>
>> Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <bi...@gmail.com> wrote:
>>
>>> The default value will be part of the new Avro schema definition and
>>> Avro should return it to you, so there shouldn't be any code messyness with
>>> that approach.
>>>
>>>
>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com> wrote:
>>>
>>>> Ok.. you mean I can just use the newer schema to read the old schema as
>>>> well, by populating some default value for the missing field. I think that
>>>> should work, messy code though!
>>>>
>>>> Thanks!
>>>>
>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <bi...@gmail.com>wrote:
>>>>
>>>>>  If you evolved your schema to just add fields, then you should be
>>>>> able to
>>>>> use a single schema descriptor file to read both pre- and post-evolved
>>>>> data
>>>>> objects. This is because one of the rules of new fields in Avro is that
>>>>> they have to have a default value and be non-null. AvroStorage should
>>>>> pick
>>>>> that default field up for the old objects. If it doesn't, then that's
>>>>> a bug.
>>>>>
>>>>>
>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com> wrote:
>>>>>
>>>>> > @Bill,
>>>>> > I did look at the option of providing input as a parameter while
>>>>> > initializing AvroStorage(). But even then, I'll still need to change
>>>>> my
>>>>> > script to handle the two files because I'll still need to have
>>>>> separate
>>>>> > schemas right?
>>>>> >
>>>>> > @Stan,
>>>>> > Thanks for pointing me to it, it is a useful feature. But in my
>>>>> case, I
>>>>> > would never have two input files with different schemas. The input
>>>>> will
>>>>> > always have only one of the schemas, but I want my new script (with
>>>>> the
>>>>> > additional column) to be able to process the old data as well, even
>>>>> if the
>>>>> > input only contains data with the older schema.
>>>>> >
>>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
>>>>> stan.rosenberg@gmail.com
>>>>> > >wrote:
>>>>> >
>>>>> > > There is a patch for Avro to deal with this use case:
>>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>>> > > (See the attached pig example which loads two avro input files with
>>>>> > > different schemas.)
>>>>> > >
>>>>> > > Best,
>>>>> > >
>>>>> > > stan
>>>>> > >
>>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com>
>>>>> wrote:
>>>>> > > > Hi guys,
>>>>> > > >
>>>>> > > > I use Pig to process some clickstream data. I need to track a new
>>>>> > field,
>>>>> > > so
>>>>> > > > I added a new field to my avro schema, and changed my Pig script
>>>>> > > > accordingly. It works fine with the new files (which have that
>>>>> new
>>>>> > > column)
>>>>> > > > but it breaks when I run it on my old files which do not have
>>>>> that
>>>>> > column
>>>>> > > > in the schema (since avro stores schema in the data files
>>>>> itself). I
>>>>> > was
>>>>> > > > expecting that Pig will assume the field to be null if that
>>>>> particular
>>>>> > > > field does not exist. But now I am having to maintain separate
>>>>> scripts
>>>>> > to
>>>>> > > > process the old and new files. Is there any workaround this?
>>>>> Because I
>>>>> > > > figure I'll have to add new column frequently and I don't want to
>>>>> > > maintain
>>>>> > > > a separate script for each window where the schema is constant.
>>>>> > > >
>>>>> > > > Thanks,
>>>>> > >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Note that I'm no longer using my Yahoo! email address. Please email
>>>>> me at
>>>>> billgraham@gmail.com going forward.*
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Note that I'm no longer using my Yahoo! email address. Please email me
>>> at billgraham@gmail.com going forward.*
>>>
>>
>>
>
>
> --
> *Note that I'm no longer using my Yahoo! email address. Please email me
> at billgraham@gmail.com going forward.*
>

Re: Working with changing schemas (avro) in Pig

Posted by Bill Graham <bi...@gmail.com>.
Yes, the schema can be in HDFS, but the documentation for this is lacking.
Search for 'schema_file' here:

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java

and here:

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java

And be aware of this open JIRA:
https://issues.apache.org/jira/browse/PIG-2257

And this closed one:
https://issues.apache.org/jira/browse/PIG-2195

:)

thanks,
Bill

On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <ig...@gmail.com> wrote:

> The schema has to be written in the script right? I don't think there is
> any way the schema can be in a file outside the script. That was the
> messyness I was talking about. Or is there a way I can write the schema in
> a separate file? One way I see is to create and store a dummy file with the
> schema
>
>
> Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <bi...@gmail.com> wrote:
>
>> The default value will be part of the new Avro schema definition and Avro
>> should return it to you, so there shouldn't be any code messyness with that
>> approach.
>>
>>
>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com> wrote:
>>
>>> Ok.. you mean I can just use the newer schema to read the old schema as
>>> well, by populating some default value for the missing field. I think that
>>> should work, messy code though!
>>>
>>> Thanks!
>>>
>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <bi...@gmail.com>wrote:
>>>
>>>>  If you evolved your schema to just add fields, then you should be able
>>>> to
>>>> use a single schema descriptor file to read both pre- and post-evolved
>>>> data
>>>> objects. This is because one of the rules of new fields in Avro is that
>>>> they have to have a default value and be non-null. AvroStorage should
>>>> pick
>>>> that default field up for the old objects. If it doesn't, then that's a
>>>> bug.
>>>>
>>>>
>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com> wrote:
>>>>
>>>> > @Bill,
>>>> > I did look at the option of providing input as a parameter while
>>>> > initializing AvroStorage(). But even then, I'll still need to change
>>>> my
>>>> > script to handle the two files because I'll still need to have
>>>> separate
>>>> > schemas right?
>>>> >
>>>> > @Stan,
>>>> > Thanks for pointing me to it, it is a useful feature. But in my case,
>>>> I
>>>> > would never have two input files with different schemas. The input
>>>> will
>>>> > always have only one of the schemas, but I want my new script (with
>>>> the
>>>> > additional column) to be able to process the old data as well, even
>>>> if the
>>>> > input only contains data with the older schema.
>>>> >
>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
>>>> stan.rosenberg@gmail.com
>>>> > >wrote:
>>>> >
>>>> > > There is a patch for Avro to deal with this use case:
>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>> > > (See the attached pig example which loads two avro input files with
>>>> > > different schemas.)
>>>> > >
>>>> > > Best,
>>>> > >
>>>> > > stan
>>>> > >
>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com>
>>>> wrote:
>>>> > > > Hi guys,
>>>> > > >
>>>> > > > I use Pig to process some clickstream data. I need to track a new
>>>> > field,
>>>> > > so
>>>> > > > I added a new field to my avro schema, and changed my Pig script
>>>> > > > accordingly. It works fine with the new files (which have that new
>>>> > > column)
>>>> > > > but it breaks when I run it on my old files which do not have that
>>>> > column
>>>> > > > in the schema (since avro stores schema in the data files
>>>> itself). I
>>>> > was
>>>> > > > expecting that Pig will assume the field to be null if that
>>>> particular
>>>> > > > field does not exist. But now I am having to maintain separate
>>>> scripts
>>>> > to
>>>> > > > process the old and new files. Is there any workaround this?
>>>> Because I
>>>> > > > figure I'll have to add new column frequently and I don't want to
>>>> > > maintain
>>>> > > > a separate script for each window where the schema is constant.
>>>> > > >
>>>> > > > Thanks,
>>>> > >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> *Note that I'm no longer using my Yahoo! email address. Please email me
>>>> at
>>>> billgraham@gmail.com going forward.*
>>>>
>>>
>>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me
>> at billgraham@gmail.com going forward.*
>>
>
>



Re: Working with changing schemas (avro) in Pig

Posted by IGZ Nick <ig...@gmail.com>.
The schema has to be written in the script right? I don't think there is
any way the schema can be in a file outside the script. That was the
messiness I was talking about. Or is there a way I can write the schema in
a separate file? One way I see is to create and store a dummy file with the
schema.

On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <bi...@gmail.com> wrote:

> The default value will be part of the new Avro schema definition and Avro
> should return it to you, so there shouldn't be any code messyness with that
> approach.
>
>
> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com> wrote:
>
>> Ok.. you mean I can just use the newer schema to read the old schema as
>> well, by populating some default value for the missing field. I think that
>> should work, messy code though!
>>
>> Thanks!
>>
>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <bi...@gmail.com>wrote:
>>
>>>  If you evolved your schema to just add fields, then you should be able
>>> to
>>> use a single schema descriptor file to read both pre- and post-evolved
>>> data
>>> objects. This is because one of the rules of new fields in Avro is that
>>> they have to have a default value and be non-null. AvroStorage should
>>> pick
>>> that default field up for the old objects. If it doesn't, then that's a
>>> bug.
>>>
>>>
>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com> wrote:
>>>
>>> > @Bill,
>>> > I did look at the option of providing input as a parameter while
>>> > initializing AvroStorage(). But even then, I'll still need to change my
>>> > script to handle the two files because I'll still need to have separate
>>> > schemas right?
>>> >
>>> > @Stan,
>>> > Thanks for pointing me to it, it is a useful feature. But in my case, I
>>> > would never have two input files with different schemas. The input will
>>> > always have only one of the schemas, but I want my new script (with the
>>> > additional column) to be able to process the old data as well, even if
>>> the
>>> > input only contains data with the older schema.
>>> >
>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
>>> stan.rosenberg@gmail.com
>>> > >wrote:
>>> >
>>> > > There is a patch for Avro to deal with this use case:
>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>> > > (See the attached pig example which loads two avro input files with
>>> > > different schemas.)
>>> > >
>>> > > Best,
>>> > >
>>> > > stan
>>> > >
>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com>
>>> wrote:
>>> > > > Hi guys,
>>> > > >
>>> > > > I use Pig to process some clickstream data. I need to track a new
>>> > field,
>>> > > so
>>> > > > I added a new field to my avro schema, and changed my Pig script
>>> > > > accordingly. It works fine with the new files (which have that new
>>> > > column)
>>> > > > but it breaks when I run it on my old files which do not have that
>>> > column
>>> > > > in the schema (since avro stores schema in the data files itself).
>>> I
>>> > was
>>> > > > expecting that Pig will assume the field to be null if that
>>> particular
>>> > > > field does not exist. But now I am having to maintain separate
>>> scripts
>>> > to
>>> > > > process the old and new files. Is there any workaround this?
>>> Because I
>>> > > > figure I'll have to add new column frequently and I don't want to
>>> > > maintain
>>> > > > a separate script for each window where the schema is constant.
>>> > > >
>>> > > > Thanks,
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>> *Note that I'm no longer using my Yahoo! email address. Please email me
>>> at
>>> billgraham@gmail.com going forward.*
>>>
>>
>>
>
>
> --
> *Note that I'm no longer using my Yahoo! email address. Please email me
> at billgraham@gmail.com going forward.*
>

Re: Working with changing schemas (avro) in Pig

Posted by Bill Graham <bi...@gmail.com>.
The default value will be part of the new Avro schema definition and Avro
should return it to you, so there shouldn't be any code messiness with that
approach.
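For example, an evolved schema might declare its added field with a null default. A sketch (the record and field names are hypothetical, loosely borrowed from the test example in this thread):

```python
import json

# Sketch of an evolved Avro record schema (hypothetical names): the field
# added in v2 declares a default, so readers can fill it in for old records.
schema_v2 = json.loads("""
{"type": "record",
 "name": "ClickEvent",
 "fields": [
     {"name": "member_id",   "type": "long"},
     {"name": "act_content", "type": ["null", "string"], "default": null}
 ]}
""")

added_field = schema_v2["fields"][1]
print("default" in added_field)  # old records resolve to None here
```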


On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <ig...@gmail.com> wrote:

> Ok.. you mean I can just use the newer schema to read the old schema as
> well, by populating some default value for the missing field. I think that
> should work, messy code though!
>
> Thanks!
>
> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <bi...@gmail.com> wrote:
>
>>  If you evolved your schema to just add fields, then you should be able to
>> use a single schema descriptor file to read both pre- and post-evolved
>> data
>> objects. This is because one of the rules of new fields in Avro is that
>> they have to have a default value and be non-null. AvroStorage should pick
>> that default field up for the old objects. If it doesn't, then that's a
>> bug.
>>
>>
>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com> wrote:
>>
>> > @Bill,
>> > I did look at the option of providing input as a parameter while
>> > initializing AvroStorage(). But even then, I'll still need to change my
>> > script to handle the two files because I'll still need to have separate
>> > schemas right?
>> >
>> > @Stan,
>> > Thanks for pointing me to it, it is a useful feature. But in my case, I
>> > would never have two input files with different schemas. The input will
>> > always have only one of the schemas, but I want my new script (with the
>> > additional column) to be able to process the old data as well, even if
>> the
>> > input only contains data with the older schema.
>> >
>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
>> stan.rosenberg@gmail.com
>> > >wrote:
>> >
>> > > There is a patch for Avro to deal with this use case:
>> > > https://issues.apache.org/jira/browse/PIG-2579
>> > > (See the attached pig example which loads two avro input files with
>> > > different schemas.)
>> > >
>> > > Best,
>> > >
>> > > stan
>> > >
>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com>
>> wrote:
>> > > > Hi guys,
>> > > >
>> > > > I use Pig to process some clickstream data. I need to track a new
>> > field,
>> > > so
>> > > > I added a new field to my avro schema, and changed my Pig script
>> > > > accordingly. It works fine with the new files (which have that new
>> > > column)
>> > > > but it breaks when I run it on my old files which do not have that
>> > column
>> > > > in the schema (since avro stores schema in the data files itself). I
>> > was
>> > > > expecting that Pig will assume the field to be null if that
>> particular
>> > > > field does not exist. But now I am having to maintain separate
>> scripts
>> > to
>> > > > process the old and new files. Is there any workaround this?
>> Because I
>> > > > figure I'll have to add new column frequently and I don't want to
>> > > maintain
>> > > > a separate script for each window where the schema is constant.
>> > > >
>> > > > Thanks,
>> > >
>> >
>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> billgraham@gmail.com going forward.*
>>
>
>



Re: Working with changing schemas (avro) in Pig

Posted by IGZ Nick <ig...@gmail.com>.
OK, you mean I can just use the newer schema to read the old data as well,
by populating a default value for the missing field. I think that should
work, though the code will be messy!

Thanks!

On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <bi...@gmail.com> wrote:

> If you evolved your schema to just add fields, then you should be able to
> use a single schema descriptor file to read both pre- and post-evolved data
> objects. This is because one of the rules of new fields in Avro is that
> they have to have a default value and be non-null. AvroStorage should pick
> that default field up for the old objects. If it doesn't, then that's a
> bug.
>
>
> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com> wrote:
>
> > @Bill,
> > I did look at the option of providing input as a parameter while
> > initializing AvroStorage(). But even then, I'll still need to change my
> > script to handle the two files because I'll still need to have separate
> > schemas right?
> >
> > @Stan,
> > Thanks for pointing me to it, it is a useful feature. But in my case, I
> > would never have two input files with different schemas. The input will
> > always have only one of the schemas, but I want my new script (with the
> > additional column) to be able to process the old data as well, even if
> the
> > input only contains data with the older schema.
> >
> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <
> stan.rosenberg@gmail.com
> > >wrote:
> >
> > > There is a patch for Avro to deal with this use case:
> > > https://issues.apache.org/jira/browse/PIG-2579
> > > (See the attached pig example which loads two avro input files with
> > > different schemas.)
> > >
> > > Best,
> > >
> > > stan
> > >
> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com> wrote:
> > > > Hi guys,
> > > >
> > > > I use Pig to process some clickstream data. I need to track a new
> > field,
> > > so
> > > > I added a new field to my avro schema, and changed my Pig script
> > > > accordingly. It works fine with the new files (which have that new
> > > column)
> > > > but it breaks when I run it on my old files which do not have that
> > column
> > > > in the schema (since avro stores schema in the data files itself). I
> > was
> > > > expecting that Pig will assume the field to be null if that
> particular
> > > > field does not exist. But now I am having to maintain separate
> scripts
> > to
> > > > process the old and new files. Is there any workaround this? Because
> I
> > > > figure I'll have to add new column frequently and I don't want to
> > > maintain
> > > > a separate script for each window where the schema is constant.
> > > >
> > > > Thanks,
> > >
> >
>
>
>
> --
> *Note that I'm no longer using my Yahoo! email address. Please email me at
> billgraham@gmail.com going forward.*
>

Re: Working with changing schemas (avro) in Pig

Posted by Bill Graham <bi...@gmail.com>.
If you evolved your schema only to add fields, then you should be able to
use a single schema descriptor file to read both pre- and post-evolved data
objects. This is because Avro requires every field added to a schema to
declare a default value (which may be null when the field's type is a union
with null). AvroStorage should pick that default up for the old objects. If
it doesn't, then that's a bug.
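Avro's schema-resolution rule can be sketched in a few lines. This is an illustrative simulation, not the avro library itself, and the field names and defaults are assumptions:

```python
# Illustrative sketch of Avro-style schema resolution (not the avro library):
# a reader schema whose new fields carry defaults can resolve records
# written with an older schema. Field names/defaults here are hypothetical.

READER_FIELDS = [
    {"name": "member_id", "type": "long"},                     # present in v1
    {"name": "act_content", "type": "string", "default": ""},  # added in v2
]

def resolve(record, reader_fields):
    """Return a record with every reader field, filling in missing ones
    from their declared defaults (Avro errors out if no default exists)."""
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError("no default for new field: " + field["name"])
    return out

old_record = {"member_id": 42}  # written with the v1 schema
print(resolve(old_record, READER_FIELDS))
# {'member_id': 42, 'act_content': ''}
```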


On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <ig...@gmail.com> wrote:

> @Bill,
> I did look at the option of providing input as a parameter while
> initializing AvroStorage(). But even then, I'll still need to change my
> script to handle the two files because I'll still need to have separate
> schemas right?
>
> @Stan,
> Thanks for pointing me to it, it is a useful feature. But in my case, I
> would never have two input files with different schemas. The input will
> always have only one of the schemas, but I want my new script (with the
> additional column) to be able to process the old data as well, even if the
> input only contains data with the older schema.
>
> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <stan.rosenberg@gmail.com
> >wrote:
>
> > There is a patch for Avro to deal with this use case:
> > https://issues.apache.org/jira/browse/PIG-2579
> > (See the attached pig example which loads two avro input files with
> > different schemas.)
> >
> > Best,
> >
> > stan
> >
> > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com> wrote:
> > > Hi guys,
> > >
> > > I use Pig to process some clickstream data. I need to track a new
> field,
> > so
> > > I added a new field to my avro schema, and changed my Pig script
> > > accordingly. It works fine with the new files (which have that new
> > column)
> > > but it breaks when I run it on my old files which do not have that
> column
> > > in the schema (since avro stores schema in the data files itself). I
> was
> > > expecting that Pig will assume the field to be null if that particular
> > > field does not exist. But now I am having to maintain separate scripts
> to
> > > process the old and new files. Is there any workaround this? Because I
> > > figure I'll have to add new column frequently and I don't want to
> > maintain
> > > a separate script for each window where the schema is constant.
> > >
> > > Thanks,
> >
>




Re: Working with changing schemas (avro) in Pig

Posted by IGZ Nick <ig...@gmail.com>.
@Bill,
I did look at the option of providing input as a parameter while
initializing AvroStorage(). But even then, I'll still need to change my
script to handle the two files because I'll still need to have separate
schemas, right?

@Stan,
Thanks for pointing me to it; it is a useful feature. But in my case, I
would never have two input files with different schemas. The input will
always have only one of the schemas, but I want my new script (with the
additional column) to be able to process the old data as well, even if the
input only contains data with the older schema.

On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <st...@gmail.com>wrote:

> There is a patch for Avro to deal with this use case:
> https://issues.apache.org/jira/browse/PIG-2579
> (See the attached pig example which loads two avro input files with
> different schemas.)
>
> Best,
>
> stan
>
> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com> wrote:

Re: Working with changing schemas (avro) in Pig

Posted by Stan Rosenberg <st...@gmail.com>.
There is a patch for AvroStorage to deal with this use case:
https://issues.apache.org/jira/browse/PIG-2579
(See the attached Pig example, which loads two Avro input files with
different schemas.)

Best,

stan
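
A related Pig-level workaround (independent of that patch) is UNION
ONSCHEMA, which unions relations by field name and pads fields missing from
one side with null. The paths and field names below are invented:

```
old_clicks = LOAD 'clicks/old'
             USING org.apache.pig.piggybank.storage.avro.AvroStorage();
             -- schema: (url:chararray)
new_clicks = LOAD 'clicks/new'
             USING org.apache.pig.piggybank.storage.avro.AvroStorage();
             -- schema: (url:chararray, referrer:chararray)

-- union by column name; rows from old_clicks get referrer = null
all_clicks = UNION ONSCHEMA old_clicks, new_clicks;
```

This only helps when both old and new files are loaded in the same script,
which is a different situation from running one script against inputs that
are all-old or all-new.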

On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <ig...@gmail.com> wrote:

Re: Working with changing schemas (avro) in Pig

Posted by Bill Graham <bi...@gmail.com>.
AvroStorage supports different modes for loading the schema definition. One
is to get it from the Avro record itself, which would cause problems as the
schema evolves, but you can also specify a schema file. Which are you
using? Can you attach the snippet of your script that initializes
AvroStorage?
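
For reference, the two modes might look like this in a script. The paths
are invented, and the 'schema' parameter name is from the piggybank
AvroStorage and should be checked against the version in use:

```
REGISTER piggybank.jar;

-- Mode 1: take the schema from the Avro data files themselves
clicks = LOAD 'clicks/2012-03-*'
         USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Mode 2: supply an explicit schema definition (JSON elided), so all
-- files are read against one agreed-upon schema
clicks = LOAD 'clicks/2012-03-*'
         USING org.apache.pig.piggybank.storage.avro.AvroStorage(
             'schema',
             '{"type": "record", "name": "ClickEvent", "fields": [...]}');
```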



On Wed, Mar 28, 2012 at 1:22 PM, IGZ Nick <ig...@gmail.com> wrote:



