Posted to user@pig.apache.org by "Enns, Steven" <sa...@a9.com> on 2013/04/26 01:01:26 UTC

Override input schema in AvroStorage

Hi everyone,

I would like to override the input schema in AvroStorage to make a pig
script robust to schema evolution.  For example, suppose a new field is
added to an avro schema with a default value of null.  If the input to a
pig script using this field includes both old and new data, AvroStorage
will merge the input schemas from the old and new data.  However, if the
input includes only old data, the new schema will not be available to
AvroStorage and pig will fail to interpret the script with an error such
as "projected field [newField] does not exist in schema".  If AvroStorage
accepted an input schema, the script would be valid for both the new and
old data.  Is there any plan to implement this?

Thanks,
Steve
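
For illustration only, here is a minimal Python sketch of why an explicitly supplied input (reader) schema would help. It is not AvroStorage code: the `event` record, the `id` field, and the `resolve` helper are hypothetical, and only `newField` comes from the error message above. It mimics, with plain dictionaries, the Avro schema-resolution rule that a field missing from the written data takes the reader schema's default.

```python
# Hypothetical reader schema: the "new" schema, supplied by the script itself
# rather than inferred from whatever data files happen to be in the input.
READER_SCHEMA = {
    "type": "record",
    "name": "event",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "newField", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, reader_schema):
    """Project a decoded record onto reader_schema, filling in defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field absent from the old data: use the reader schema's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


old_record = {"id": 1}                    # written before newField existed
new_record = {"id": 2, "newField": "x"}   # written with the new schema

old_resolved = resolve(old_record, READER_SCHEMA)
new_resolved = resolve(new_record, READER_SCHEMA)
```

With the reader schema fixed up front, `newField` is present in every resolved record, so a projection on it stays valid even when the input contains only old data.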


Re: AvroStorage Default values are set to null even if they are specified

Posted by Cheolsoo Park <pi...@gmail.com>.
Hi Viraj,

Yes, that's a known bug. Here is what happens:

1) Let's say there are two schemas, X and Y.
2) AvroStorage creates a tuple whose size == max( sizeOf(X), sizeOf(Y) ).
3) Fields are filled in as values are read. If a record has no value for a
field, that field is left as null instead of being set to its default.

If you'd like to fix it, please take a look at PigAvroRecordReader.java:
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java

In particular, see how mProtoTuple is initialized and updated.
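
The behaviour described above can be sketched in a few lines of Python. This is a rough model under stated assumptions, not the actual PigAvroRecordReader logic: the function names are hypothetical, and the field list and defaults are copied from the Employee schemas and the `describe` output quoted in this thread. It contrasts a prototype tuple initialised with nulls against one initialised with the schema defaults.

```python
# (field name, default) pairs in the order reported by `describe employee`.
# Defaults are taken from the Employee6 schema quoted in this thread.
MERGED_FIELDS = [
    ("name", "NU"),
    ("age", 0),
    ("dept", "DU"),
    ("lastname", "LNU"),
    ("salary", 0),
    ("office", "OU"),
]


def proto_with_nulls(fields):
    """Prototype tuple as described above: every slot starts as null."""
    return [None] * len(fields)


def proto_with_defaults(fields):
    """Hypothetical fix: every slot starts as the field's declared default."""
    return [default for _, default in fields]


def fill(proto, fields, record):
    """Copy the prototype, then overwrite the slots the record has values for."""
    row = list(proto)
    for index, (name, _) in enumerate(fields):
        if name in record:
            row[index] = record[name]
    return row


# A record written with the three-field Employee3 schema.
milo = {"name": "Milo", "age": 30, "dept": "DH"}

buggy = fill(proto_with_nulls(MERGED_FIELDS), MERGED_FIELDS, milo)
fixed = fill(proto_with_defaults(MERGED_FIELDS), MERGED_FIELDS, milo)
```

Here `buggy` corresponds to the reported row (Milo,30,DH,,,), while `fixed` carries the declared defaults in the three missing columns.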

Thanks,
Cheolsoo





On Thu, May 2, 2013 at 8:34 PM, Viraj Bhat <vi...@yahoo-inc.com> wrote:

> Hi Cheolsoo/Pig User Group,
>   I am using the Pig 0.11 piggybank AvroStorage. When merging multiple
> schemas whose fields specify default values in the Avro schema,
> AvroStorage puts nulls in the merged data set.
> Is this a known bug in the current implementation of AvroStorage?
> Here is an example provided by one of my colleagues. The final dataset
> should contain "LNU", 0, "OU" for all values where the columns do not exist.
> ==> Employee3.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "age", "type" : "int", "default" : 0 },
>         {"name" : "dept", "type": "string", "default" : "DU"}
> ]
> }
>
> ==> Employee4.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "age", "type" : "int", "default" : 0},
>         {"name" : "dept", "type": "string", "default" : "DU"},
>         {"name" : "office", "type": "string", "default" : "OU"}
> ]
> }
>
> ==> Employee6.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "lastname", "type": "string", "default" : "LNU"},
>         {"name" : "age", "type" : "int","default" : 0},
>         {"name" : "salary", "type": "int", "default" : 0},
>         {"name" : "dept", "type": "string","default" : "DU"},
>         {"name" : "office", "type": "string","default" : "OU"}
> ]
> }
>
> The pig script:
> employee = load '$input' using
> org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
> describe employee;
> dump employee;
>
> The call:
> dump_employees.pig employee{3,4,6}.ser
>
> The output:
> employee: {name: chararray,age: int,dept: chararray,lastname:
> chararray,salary: int,office: chararray}
>
> (Milo,30,DH,,,)
> (Asmya,34,PQ,,,)
> (Baljit,23,RS,,,)
> (Pune,60,Astrophysics,Warriors,5466,UTA)
> (Rajsathan,20,Biochemistry,Royals,1378,Stanford)
> (Chennai,50,Microbiology,Superkings,7338,Hopkins)
> (Mumbai,20,Applied Math,Indians,4468,UAH)
> (Praj,54,RMX,,,Champaign)
> (Buba,767,HD,,,Sunnyvale)
> (Manku,375,MS,,,New York)
> Regards
> Viraj
>

AvroStorage Default values are set to null even if they are specified

Posted by Viraj Bhat <vi...@yahoo-inc.com>.
Hi Cheolsoo/Pig User Group,
  I am using the Pig 0.11 piggybank AvroStorage. When merging multiple schemas whose fields specify default values in the Avro schema, AvroStorage puts nulls in the merged data set.
Is this a known bug in the current implementation of AvroStorage? Here is an example provided by one of my colleagues. The final dataset should contain "LNU", 0, "OU" for all values where the columns do not exist.
==> Employee3.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
        {"name" : "name", "type" : "string", "default" : "NU"},
        {"name" : "age", "type" : "int", "default" : 0 },
        {"name" : "dept", "type": "string", "default" : "DU"}
]
}

==> Employee4.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
        {"name" : "name", "type" : "string", "default" : "NU"},
        {"name" : "age", "type" : "int", "default" : 0},
        {"name" : "dept", "type": "string", "default" : "DU"},
        {"name" : "office", "type": "string", "default" : "OU"}
]
}

==> Employee6.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
        {"name" : "name", "type" : "string", "default" : "NU"},
        {"name" : "lastname", "type": "string", "default" : "LNU"},
        {"name" : "age", "type" : "int","default" : 0},
        {"name" : "salary", "type": "int", "default" : 0},
        {"name" : "dept", "type": "string","default" : "DU"},
        {"name" : "office", "type": "string","default" : "OU"}
]
}

The pig script:
employee = load '$input' using org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
describe employee;
dump employee;

The call:
dump_employees.pig employee{3,4,6}.ser 

The output:
employee: {name: chararray,age: int,dept: chararray,lastname: chararray,salary: int,office: chararray}

(Milo,30,DH,,,)
(Asmya,34,PQ,,,)
(Baljit,23,RS,,,)
(Pune,60,Astrophysics,Warriors,5466,UTA)
(Rajsathan,20,Biochemistry,Royals,1378,Stanford)
(Chennai,50,Microbiology,Superkings,7338,Hopkins)
(Mumbai,20,Applied Math,Indians,4468,UAH)
(Praj,54,RMX,,,Champaign)
(Buba,767,HD,,,Sunnyvale)
(Manku,375,MS,,,New York)
Regards
Viraj

-----Original Message-----
From: Cheolsoo Park [mailto:piaozhexiu@gmail.com] 
Sent: Tuesday, April 30, 2013 9:10 PM
To: user@pig.apache.org
Cc: Qi, Runping
Subject: Re: Override input schema in AvroStorage

Hi Steven,

The new AvroStorage will let you specify the input schema:
https://issues.apache.org/jira/browse/PIG-3015

In fact, somebody made the same request in a comment on the JIRA, which I am copying and pasting below:

> Furthermore, we occasionally have issues with pig jobs picking the old
> schema when we have a schema update. Manually specifying the schema
> would fix this and give us more flexibility in defining the data we
> want pig to pull from a file.


This JIRA is a work in progress, but hopefully it will be in the next major release.

Thanks,
Cheolsoo




Re: Override input schema in AvroStorage

Posted by "Enns, Steven" <sa...@a9.com>.
Cool thanks!

On 4/30/13 9:10 PM, "Cheolsoo Park" <pi...@gmail.com> wrote:

>Hi Steven,
>
>The new AvroStorage will let you specify the input schema:
>https://issues.apache.org/jira/browse/PIG-3015
>
>In fact, somebody made the same request in a comment on the JIRA, which I
>am copying and pasting below:
>
>> Furthermore, we occasionally have issues with pig jobs picking the old
>> schema when we have a schema update. Manually specifying the schema would
>> fix this and give us more flexibility in defining the data we want pig to
>> pull from a file.
>
>
>This JIRA is a work in progress, but hopefully it will be in the next major
>release.
>
>Thanks,
>Cheolsoo
>
>
>


Re: Override input schema in AvroStorage

Posted by Cheolsoo Park <pi...@gmail.com>.
Hi Steven,

The new AvroStorage will let you specify the input schema:
https://issues.apache.org/jira/browse/PIG-3015

In fact, somebody made the same request in a comment on the JIRA, which I
am copying and pasting below:

> Furthermore, we occasionally have issues with pig jobs picking the old
> schema when we have a schema update. Manually specifying the schema would
> fix this and give us more flexibility in defining the data we want pig to
> pull from a file.


This JIRA is a work in progress, but hopefully it will be in the next major
release.

Thanks,
Cheolsoo



On Sat, Apr 27, 2013 at 3:24 PM, Enns, Steven <sa...@a9.com> wrote:

> Resending now that I am subscribed :)
>
> On 4/25/13 4:01 PM, "Enns, Steven" <sa...@a9.com> wrote:
>
> >Hi everyone,
> >
> >I would like to override the input schema in AvroStorage to make a pig
> >script robust to schema evolution.  For example, suppose a new field is
> >added to an avro schema with a default value of null.  If the input to a
> >pig script using this field includes both old and new data, AvroStorage
> >will merge the input schemas from the old and new data.  However, if the
> >input includes only old data, the new schema will not be available to
> >AvroStorage and pig will fail to interpret the script with an error such
> >as "projected field [newField] does not exist in schema".  If AvroStorage
> >accepted an input schema, the script would be valid for both the new and
> >old data.  Is there any plan to implement this?
> >
> >Thanks,
> >Steve
> >
>
>

Re: Override input schema in AvroStorage

Posted by "Enns, Steven" <sa...@a9.com>.
Resending now that I am subscribed :)

On 4/25/13 4:01 PM, "Enns, Steven" <sa...@a9.com> wrote:

>Hi everyone,
>
>I would like to override the input schema in AvroStorage to make a pig
>script robust to schema evolution.  For example, suppose a new field is
>added to an avro schema with a default value of null.  If the input to a
>pig script using this field includes both old and new data, AvroStorage
>will merge the input schemas from the old and new data.  However, if the
>input includes only old data, the new schema will not be available to
>AvroStorage and pig will fail to interpret the script with an error such
>as "projected field [newField] does not exist in schema".  If AvroStorage
>accepted an input schema, the script would be valid for both the new and
>old data.  Is there any plan to implement this?
>
>Thanks,
>Steve
>

