Posted to user@pig.apache.org by Andrew Kenworthy <ad...@yahoo.com> on 2011/11/16 15:25:31 UTC

Difficulties loading avro-generated data with AvroStorage

Hello,

I'm a little confused as to how to load avro data into pig using AvroStorage. I have a map-reduce job that writes an AvroKey<Long>/AvroValue<GenericRecord> K/V pair, producing a schema that looks like this:

{ "fields" : [ { "doc" : "",
        "name" : "key",
        "type" : "long"
      },
      { "doc" : "",
        "name" : "value",
        "order" : "ignore",
        "type" : { "fields" : [ { "name" : "logid",
                  "type" : "long"
                },
                { "name" : "my_data",
                  "type" : [ "null",
                      { "avro.java.string" : "String",
                        "type" : "map",
                        "values" : [ "null",
                            { "avro.java.string" : "String",
                              "type" : "string"
                            }
                          ]
                      }
                    ]
                }, {...} etc.etc. ],
            "name" : "my_log",
            "namespace" : "x.y.z.log.avro",
            "type" : "record"
          }
      }
    ],
  "name" : "Pair",
  "namespace" : "org.apache.avro.mapred",
  "type" : "record"
}

i.e. a Pair schema including an avro.java.string field. When I load a datafile with this schema using AvroStorage, I get the following exception:

java.io.IOException: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8
    at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:251)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:455)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8
    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readString(PigAvroDatumReader.java:154)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:150)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:135)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
    at org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80)
    at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:249)

which seems to be because my Avro string value cannot be cast to an Avro Utf8 object.

My questions are:

1. Is it legitimate to load a Pair schema, or should I be loading a schema that just consists of my generic record?
2. Why is Avro unable to load data it produced itself, seeing as AvroStorage.getSchema reads out the same schema from my input Avro data file as the basis for parsing the input?

(I'm sorry I can't be more specific... it's difficult to debug this, so I'm guessing at the cause.)

regards,
Andrew Kenworthy

Re: Difficulties loading avro-generated data with AvroStorage

Posted by Andrew Kenworthy <ad...@yahoo.com>.
OK, I think I have an explanation.....

"1. is it legitimate to load a Pair schema, or should I be loading a schema that just consists of my generic record?"

- yes, that works for me with a Pair schema.

"2. why is Avro unable to load data it produced itself, seeing as AvroStorage.getSchema reads out the same schema from my input avro data file as the basis for parsing the input?"

The problem was that I had created my Avro schema by reflection from a Thrift entity, and in doing so had set the string property on the schema field to "avro.java.string" instead of "avro.util.Utf8" (although I was writing Utf8).

When AvroStorage reads this data, the PigAvroDatumReader.read method calls GenericDatumReader.readString, which decides, on the basis of the property mentioned above, whether to read a String value or a Utf8 value from the Decoder. In the String case the Decoder reads Utf8 but returns the .toString() result, and that causes the cast exception when the value is cast back to Utf8.

i.e. it seems from the explicit cast in PigAvroDatumReader.readString that only Utf8 values in string fields can be processed by AvroStorage.
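
To make the failure mode concrete, here is a small self-contained sketch of what I think is going on. It does not use Avro or piggybank at all: FakeUtf8 is a hypothetical stand-in for org.apache.avro.util.Utf8, and readString, castingConvert and tolerantConvert are made-up names that only model the behaviour described above (a reader that returns either a String or a Utf8-like object depending on a schema property, and a caller that casts unconditionally). Treat it as an illustration of the cast problem, not as the actual library code.

```java
import java.nio.charset.StandardCharsets;

class StringCastSketch {

    // Hypothetical stand-in for org.apache.avro.util.Utf8: a CharSequence
    // built from UTF-8 bytes. This is NOT the real Avro class.
    static final class FakeUtf8 implements CharSequence {
        private final String decoded;

        FakeUtf8(byte[] bytes) {
            this.decoded = new String(bytes, StandardCharsets.UTF_8);
        }

        @Override public int length() { return decoded.length(); }
        @Override public char charAt(int i) { return decoded.charAt(i); }
        @Override public CharSequence subSequence(int s, int e) { return decoded.subSequence(s, e); }
        @Override public String toString() { return decoded; }
    }

    // Models a reader that chooses the Java type of a string datum from a
    // schema property: "String" yields java.lang.String, anything else
    // (including null) yields the Utf8-like object.
    static Object readString(byte[] raw, String javaStringProp) {
        FakeUtf8 utf8 = new FakeUtf8(raw);
        return "String".equals(javaStringProp) ? utf8.toString() : utf8;
    }

    // Models a caller that assumes every string datum is Utf8-like.
    // Throws ClassCastException when handed a java.lang.String.
    static String castingConvert(Object datum) {
        return ((FakeUtf8) datum).toString();
    }

    // A more tolerant conversion: both String and FakeUtf8 are CharSequences,
    // so casting to the common interface works for either.
    static String tolerantConvert(Object datum) {
        return ((CharSequence) datum).toString();
    }

    public static void main(String[] args) {
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        Object asUtf8 = readString(raw, null);
        Object asString = readString(raw, "String");

        System.out.println(castingConvert(asUtf8));    // works: datum is Utf8-like
        System.out.println(tolerantConvert(asString)); // works: datum is a String
        try {
            castingConvert(asString);                  // String forced into the Utf8-like cast
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, same kind of failure as in the trace");
        }
    }
}
```

If this model is right, there are two ways out: write the data without the "avro.java.string" property so the reader hands back Utf8, or have the consuming code go through CharSequence instead of casting to Utf8. I have only tried the first.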

Andrew


