Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/02 19:37:46 UTC

[GitHub] [iceberg] rdblue edited a comment on issue #1152: Why do we need two avro record readers & writers ?

rdblue edited a comment on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653185375


   Readers and writers are specific to an in-memory representation, which I'll interchangeably refer to as an object model. The Avro readers and writers you're using here are for different object models: Iceberg generics and Avro.
   
   The `data.avro.DataReader` and `data.avro.DataWriter` classes were written for the Iceberg generics data model. That model uses Iceberg's generic record class, Java 8 date/time types, BigDecimal, ByteBuffer, and byte[]. This representation is intended for application authors working directly with the Iceberg API. That's why it uses standard Java representations for most types.
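
   To make that concrete, here is a minimal round-trip sketch using the generics object model. It assumes the builder API in `org.apache.iceberg.avro.Avro` and local files via `org.apache.iceberg.Files`; the exact builder methods may vary between releases.

   ```java
   import java.io.File;
   import org.apache.iceberg.Files;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.avro.Avro;
   import org.apache.iceberg.avro.AvroIterable;
   import org.apache.iceberg.data.GenericRecord;
   import org.apache.iceberg.data.Record;
   import org.apache.iceberg.data.avro.DataReader;
   import org.apache.iceberg.data.avro.DataWriter;
   import org.apache.iceberg.io.FileAppender;
   import org.apache.iceberg.types.Types;

   public class GenericsRoundTrip {
     public static void main(String[] args) throws Exception {
       Schema schema = new Schema(
           Types.NestedField.required(1, "id", Types.LongType.get()),
           Types.NestedField.optional(2, "data", Types.StringType.get()));

       // Iceberg's generic record class, holding standard Java values
       GenericRecord record = GenericRecord.create(schema);
       record.setField("id", 1L);
       record.setField("data", "a");

       File file = File.createTempFile("generics", ".avro");
       file.delete(); // the output file must not already exist

       // write with the generics writer function
       try (FileAppender<Record> appender = Avro.write(Files.localOutput(file))
           .schema(schema)
           .createWriterFunc(DataWriter::create)
           .build()) {
         appender.add(record);
       }

       // read back as Iceberg generic records
       try (AvroIterable<Record> reader = Avro.read(Files.localInput(file))
           .project(schema)
           .createReaderFunc(DataReader::create)
           .build()) {
         reader.forEach(System.out::println);
       }
     }
   }
   ```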
   
   The `avro.GenericAvroReader` and `avro.GenericAvroWriter` classes are for working with Avro's generic or specific records. That's why they produce `GenericData.Record` or instances of specific classes that all implement Avro's `IndexedRecord`. These classes are intended for internal use -- the internal implementations of `DataFile` and `ManifestFile` use `IndexedRecord` -- so they produce the internal representations, like `BigDecimal`, long microseconds from epoch, or int days from epoch.
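
   To show the difference between the two object models, here is a small sketch of the same row built both ways. It assumes `AvroSchemaUtil.convert` for deriving the Avro schema; the timestamp values follow the representations described above.

   ```java
   import java.time.Instant;
   import java.time.OffsetDateTime;
   import java.time.ZoneOffset;
   import java.time.temporal.ChronoUnit;
   import org.apache.avro.generic.GenericData;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.avro.AvroSchemaUtil;
   import org.apache.iceberg.data.GenericRecord;
   import org.apache.iceberg.types.Types;

   public class ObjectModels {
     public static void main(String[] args) {
       Schema schema = new Schema(
           Types.NestedField.required(1, "id", Types.LongType.get()),
           Types.NestedField.required(2, "ts", Types.TimestampType.withZone()));

       // Iceberg generics: standard Java representations
       GenericRecord generic = GenericRecord.create(schema);
       generic.setField("id", 1L);
       generic.setField("ts", OffsetDateTime.ofInstant(Instant.now(), ZoneOffset.UTC));

       // Avro generics: internal representations, e.g. long microseconds from epoch
       org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schema, "row");
       GenericData.Record avroRecord = new GenericData.Record(avroSchema);
       avroRecord.put("id", 1L);
       avroRecord.put("ts", ChronoUnit.MICROS.between(Instant.EPOCH, Instant.now()));
     }
   }
   ```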
   
   Unfortunately, early on we left the generic Avro reader/writer implementation public in core, and it has some downstream uses, like the original Netflix Flink sink. I think we should eventually remove Avro from the public API. I would also like to make `DataFile` and `ManifestFile` implement our own `Record` API instead of Avro's `IndexedRecord`, but we would need a reader that produces the internal value representations.
   
   For Flink, we should build reader/writer classes that produce and consume its in-memory representation. That's what we do for Spark and Pig. Based on the Parquet support in #1125, it looks like Flink uses Java 8 date/time classes, BigDecimal, and ByteBuffer, so it would make sense to base the readers on Iceberg generics. In that case, basing your implementation on `data.avro.DataReader` and `data.avro.DataWriter` should work fine.
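
   If Flink's row type really does use the same Java types as Iceberg generics, the simplest possible bridge is to copy fields into a generic record and reuse `DataWriter`. This is only a sketch under that assumption -- `RowToRecord` is a hypothetical helper, not an existing class, and a real implementation would read and write Flink rows directly.

   ```java
   import org.apache.flink.types.Row;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.data.GenericRecord;
   import org.apache.iceberg.data.Record;

   public class RowToRecord {
     // hypothetical helper: copies positional Row fields into an Iceberg
     // generic record so the generics-based DataWriter can serialize it;
     // this only works while the per-field value types line up
     public static Record convert(Schema schema, Row row) {
       GenericRecord record = GenericRecord.create(schema);
       for (int i = 0; i < row.getArity(); i++) {
         record.set(i, row.getField(i));
       }
       return record;
     }
   }
   ```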

