Posted to user@pig.apache.org by Milind Vaidya <ka...@gmail.com> on 2013/01/09 15:49:01 UTC

Pig Avrostorage Issue regarding Schema evaluation

Environment:

Pig version: 0.11
Hadoop 0.23.6.0.1301071353


Script:


REGISTER /homes/immilind/HadoopLocal/Jars/avro-1.7.1.jar
REGISTER /homes/immilind/HadoopLocal/Jars/jackson-all-1.8.10.jar
REGISTER /homes/immilind/HadoopLocal/Jars/jackson-core-asl-1.8.10.jar
REGISTER /homes/immilind/HadoopLocal/Jars/jackson-jaxrs-1.8.10.jar
REGISTER /homes/immilind/HadoopLocal/Jars/jackson-mapper-asl-1.8.10.jar
REGISTER /homes/immilind/HadoopLocal/Jars/jackson-xc-1.8.10.jar
REGISTER /home/gs/pig/current/lib-hadoop23/piggybank.jar

employee = LOAD '/user/immilind/AvroData' USING
org.apache.pig.piggybank.storage.avro.AvroStorage();
DUMP employee;


Schemas :

{
"type" : "record",
"name" : "employee",
"fields":[
    {"name" : "name", "type" : "string", "default" : "NU"},
    {"name" : "age", "type" : "int","default" : 0},
    {"name" : "dept", "type": "string","default" : "DU"},
    {"name" : "office", "type": "string","default" : "OU"},
    {"name" : "salary", "type": "float","default" : 0.0}
]
}

{
"type" : "record",
"name" : "employee",
"fields":[
    {"name" : "name", "type" : "string", "default" : "NU"},
    {"name" : "age", "type" : "int","default" : 0},
    {"name" : "dept", "type": "string","default" : "DU"},
    {"name" : "office", "type": "string","default" : "OU"},
    {"name" : "salary", "type": "int", "default" : 0}
]
}


The two schemas differ in only one field: "salary" is a float in the first
and an int in the second. As per the schema evolution/merging rules, I
expect the "int" values to be loaded as "float". Instead, the job fails
due to a field mismatch.
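
The expectation above can be sketched in a few lines of Python. This is
only an illustration of "widen to the larger numeric type" applied field
by field. The widening order int < long < float < double follows the
promotions allowed by Avro schema resolution; it is not copied from
AvroStorageUtils.mergeType, and the function names are made up for this
sketch. Field defaults are left out for brevity.

```python
# Illustrative widening order for numeric Avro primitives (an assumption for
# this sketch, not a copy of AvroStorageUtils.mergeType).
NUMERIC_RANK = {"int": 0, "long": 1, "float": 2, "double": 3}


def merge_type(x, y):
    """Return the merged primitive type, or raise if the pair cannot be merged."""
    if x == y:
        return x
    if x in NUMERIC_RANK and y in NUMERIC_RANK:
        # Widen to whichever type ranks higher, e.g. int + float -> float.
        return x if NUMERIC_RANK[x] > NUMERIC_RANK[y] else y
    raise ValueError(f"cannot merge {x!r} with {y!r}")


def merge_records(a, b):
    """Merge two record schemas that list the same field names in the same order."""
    names_a = [f["name"] for f in a["fields"]]
    names_b = [f["name"] for f in b["fields"]]
    if names_a != names_b:
        raise ValueError("field lists differ")
    fields = [
        {"name": fa["name"], "type": merge_type(fa["type"], fb["type"])}
        for fa, fb in zip(a["fields"], b["fields"])
    ]
    return {"type": "record", "name": a["name"], "fields": fields}


def employee_schema(salary_type):
    """Build the employee schema from this message with the given salary type."""
    return {
        "type": "record",
        "name": "employee",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
            {"name": "dept", "type": "string"},
            {"name": "office", "type": "string"},
            {"name": "salary", "type": salary_type},
        ],
    }


merged = merge_records(employee_schema("float"), employee_schema("int"))
print(merged["fields"][-1])  # {'name': 'salary', 'type': 'float'}
```

Under these assumptions the merged "salary" field comes out as float,
which is the result expected from the loader.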

I am referring to :

Similar thread named "Working with changing schemas (avro) in Pig"
https://mail-archives.apache.org/mod_mbox/pig-user/201204.mbox/%3CCAB-acjM6B39OMtWYpYijBBoJxL8MuyRjDzrxmXfBjmtxXcMsOQ@mail.gmail.com%3E

JIRA:
https://issues.apache.org/jira/browse/PIG-2579
How do I use the "multiple_schemas" option with AvroStorage, as suggested
by this JIRA?

The mergeType function, which defines the merging rules for primitive types:
https://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java



Can anybody suggest what is going wrong?

Re: Pig Avrostorage Issue regarding Schema evaluation

Posted by Russell Jurney <ru...@gmail.com>.
No problem, I included it myself for ages before someone pointed that out :)

Russell Jurney http://datasyndrome.com


Re: Pig Avrostorage Issue regarding Schema evaluation

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi Russell,

You're absolutely right. Jackson is not needed. Thanks for pointing that out!

Cheolsoo



Re: Pig Avrostorage Issue regarding Schema evaluation

Posted by Russell Jurney <ru...@gmail.com>.
Jackson is no longer needed, right? Or is it coming back in 0.11?

Russell Jurney http://datasyndrome.com


Re: Pig Avrostorage Issue regarding Schema evaluation

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi Milind,

Please try this:

REGISTER build/ivy/lib/Pig/avro-1.7.1.jar
REGISTER build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER build/ivy/lib/Pig/jackson-mapper-asl-1.8.8.jar
REGISTER build/ivy/lib/Pig/jackson-core-asl-1.8.8.jar
REGISTER contrib/piggybank/java/piggybank.jar

employee = LOAD '/home/cheolsoo/workspace/avro/emplyees' USING
org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
DESCRIBE employee;
DUMP employee;

I have two Avro files in my input directory:

$ java -jar /home/cheolsoo/workspace/avro/avro-tools-1.7.1.jar tojson record_employee.avro
{"name":"a","age":0,"dept":"b","office":"c","salary":0.0}

$ java -jar /home/cheolsoo/workspace/avro/avro-tools-1.7.1.jar tojson record_employee2.avro
{"name":"a","age":0,"dept":"b","office":"c","salary":0}

record_employee.avro contains a float, and record_employee2.avro contains
an int.

The output looks as follows:

...
employee: {name: chararray,age: int,dept: chararray,office: chararray,salary: float}
...
(a,0,b,c,0.0)
(a,0,b,c,0)
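
To double-check which schema each input file was actually written with,
one option is to read the "avro.schema" entry from the file header. The
following is a rough sketch in Python using only the standard library. It
assumes the layout described in the Avro object container file
specification (4-byte magic, then a metadata map, then a sync marker); the
function names are made up for this sketch.

```python
import io
import json


def read_long(stream):
    """Decode one Avro long (zigzag-encoded varint) from a binary stream."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data inside a varint")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    # Undo the zigzag mapping: 0, 1, 2, 3 -> 0, -1, 1, -2
    return (raw >> 1) ^ -(raw & 1)


def read_bytes(stream):
    """Read one length-prefixed byte string."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data inside a byte string")
    return data


def schema_from_stream(stream):
    """Return the writer schema stored in an Avro container file header."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break  # a zero count terminates the metadata map
        if count < 0:
            read_long(stream)  # block size in bytes, not needed here
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return json.loads(meta["avro.schema"].decode("utf-8"))


def schema_from_bytes(data):
    """Convenience wrapper for data that is already in memory."""
    return schema_from_stream(io.BytesIO(data))


def schema_from_file(path):
    """Open an Avro file and return its writer schema."""
    with open(path, "rb") as handle:
        return schema_from_stream(handle)
```

Calling schema_from_file on record_employee.avro and record_employee2.avro
should then show whether "salary" was written as a float or as an int in
each file.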

Thanks,
Cheolsoo



