Posted to dev@parquet.apache.org by Telco Phone <te...@yahoo.com.INVALID> on 2018/04/12 15:09:48 UTC

Incorrect Avro-Parquet

Summary of the issue:
Using ParquetWriter vs. insert overwrite (Hive SQL) to convert Avro to Parquet produces different Parquet schemas.

In some versions of Hive, the columns do not line up.
Presto also does not seem to like the output of ParquetWriter.


Using the 1.8.1 Maven package for the Java code below.


Is this a bug?

I am using the following to convert Avro -> Parquet:
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Open the Avro input (args[0]) and read its embedded schema.
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File(args[0]), datumReader);
Schema schema = dataFileReader.getSchema();

// Override the embedded schema with the one from /var/tmp/1.avsc.
byte[] schemaBytes = Files.readAllBytes(Paths.get("/var/tmp/1.avsc"));
String schemaString = new String(schemaBytes, StandardCharsets.UTF_8);
schema = new Schema.Parser().parse(schemaString);
System.out.println(schema.toString(true));

// compressionCodecName, blockSize, and pageSize are defined elsewhere.
ParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(
    new org.apache.hadoop.fs.Path(args[1]), schema, compressionCodecName, blockSize, pageSize);

// Copy every record from the Avro file into the Parquet file.
GenericRecord record = null;
while (dataFileReader.hasNext()) {
  record = dataFileReader.next(record);
  writer.write(record);
}
writer.close();  // without this the Parquet footer is never written
dataFileReader.close();
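
For reference, I believe the same writer can also be constructed through the builder API that ships in 1.8.x; this should be equivalent to the constructor call above:

// Equivalent construction via the AvroParquetWriter builder (1.8.x).
ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(new org.apache.hadoop.fs.Path(args[1]))
    .withSchema(schema)
    .withCompressionCodec(compressionCodecName)
    .withRowGroupSize(blockSize)  // Parquet "block size" is the row group size
    .withPageSize(pageSize)
    .build();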

I am getting an error when querying the converted data using Hive.
When I convert the Avro to Parquet using insert overwrite, it works.
The difference between the two files is shown below.
Note: the object in Java/Avro is an array of structs.


Using the ParquetWriter:

optional group rtb_bidders (LIST) {
  repeated group array {                        <------------- This does not appear to work
    optional binary bidder_id (UTF8);
    optional binary result (UTF8);
    optional double bid_cpm;
    optional int64 bid_time;
    optional binary creative_url (UTF8);
    optional binary third_party_cookie_id (UTF8);
    optional binary deal_id (UTF8);
    optional int32 error_code;
    optional int32 campaign_id;
    optional binary rtb_creative_id (UTF8);
    optional binary rtb_creative_url (UTF8);
    optional double advised_floor_lift;
    optional binary advised_floor_source (UTF8);
    optional double winning_price_paid;
    optional binary seat_id (UTF8);
    optional binary first_adinstance (UTF8);
    optional binary tag_key (UTF8);
    optional binary rtb_creative_size (UTF8);
    optional int32 rtb_creative_width;
    optional int32 rtb_creative_height;
  }
}
Using insert overwrite Hive SQL to convert to Parquet:

optional group rtb_bidders (LIST) {
  repeated group bag {                          <------------------ (this looks correct)
    optional group array_element {
      optional binary bidder_id (UTF8);
      optional binary result (UTF8);
      optional double bid_cpm;
      optional int64 bid_time;
      optional binary creative_url (UTF8);
      optional binary third_party_cookie_id (UTF8);
      optional binary deal_id (UTF8);
      optional int32 error_code;
      optional int32 campaign_id;
      optional binary rtb_creative_id (UTF8);
      optional binary rtb_creative_url (UTF8);
      optional double advised_floor_lift;
      optional binary advised_floor_source (UTF8);
      optional double winning_price_paid;
      optional binary seat_id (UTF8);
      optional binary first_adinstance (UTF8);
      optional binary tag_key (UTF8);
      optional binary rtb_creative_size (UTF8);
      optional int32 rtb_creative_width;
      optional int32 rtb_creative_height;
    }
  }
}
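
Should I be setting parquet.avro.write-old-list-structure to false? From what I can tell, parquet-avro 1.8.x writes the legacy two-level list layout (repeated group array) by default, and that setting switches it to the standard three-level layout. A rough, untested sketch (the inner groups come out named list/element rather than bag/array_element, which I would expect Hive and Presto readers to still accept):

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroWriteSupport;

// Untested sketch: switch parquet-avro from the legacy 2-level list
// layout to the standard 3-level layout before building the writer.
Configuration conf = new Configuration();
conf.setBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, false);
// ...equivalent to: conf.set("parquet.avro.write-old-list-structure", "false");

ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(new org.apache.hadoop.fs.Path(args[1]))
    .withSchema(schema)
    .withConf(conf)
    .build();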