Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/01 09:26:52 UTC

[GitHub] [iceberg] openinx opened a new issue #1152: Why do we need two avro record readers & writers ?

openinx opened a new issue #1152:
URL: https://github.com/apache/iceberg/issues/1152


   When implementing the Flink Avro reader & writer, I built `FlinkAvroReader` and `FlinkAvroWriter` by extending the `GenericAvroReader` & `GenericAvroWriter` classes. But when I ran the unit test, it reported that the data generated by `org.apache.iceberg.flink.data.RandomData` could not be written into the Avro file, because of the following:
   
   ```
   java.lang.ClassCastException: java.time.LocalDate cannot be cast to java.lang.Integer
   org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: java.time.LocalDate cannot be cast to java.lang.Integer
   	at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:317)
   	at org.apache.iceberg.avro.AvroFileAppender.add(AvroFileAppender.java:52)
   	at org.apache.iceberg.io.FileAppender.addAll(FileAppender.java:32)
   	at org.apache.iceberg.io.FileAppender.addAll(FileAppender.java:37)
   	at org.apache.iceberg.flink.data.TestFlinkAvroReaderWriter.testCorrectness(TestFlinkAvroReaderWriter.java:52)
   	at org.apache.iceberg.flink.data.TestFlinkAvroReaderWriter.testNormalData(TestFlinkAvroReaderWriter.java:71)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.ClassCastException: java.time.LocalDate cannot be cast to java.lang.Integer
   	at org.apache.iceberg.avro.ValueWriters$IntegerWriter.write(ValueWriters.java:177)
   	at org.apache.iceberg.avro.ValueWriters$StructWriter.write(ValueWriters.java:497)
   	at org.apache.iceberg.avro.ValueWriters$OptionWriter.write(ValueWriters.java:398)
   	at org.apache.iceberg.avro.ValueWriters$StructWriter.write(ValueWriters.java:497)
   	at org.apache.iceberg.avro.ValueWriters$StructWriter.write(ValueWriters.java:497)
   	at org.apache.iceberg.avro.GenericAvroWriter.write(GenericAvroWriter.java:50)
   	at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:314)
   	... 54 more
   ```
   
   I took a closer look at the code and found that `GenericAvroWriter` uses the `IntegerWriter` to encode the `date` data type, while `org.apache.iceberg.flink.data.RandomData` generates a `LocalDate` instance for it:
   
   ```java
       @Override
       public Object primitive(Type.PrimitiveType primitive) {
         Object result = randomValue(primitive, random);
         switch (primitive.typeId()) {
           case BINARY:
             return ByteBuffer.wrap((byte[]) result);
           case UUID:
             return UUID.nameUUIDFromBytes((byte[]) result);
           case DATE:
             return EPOCH_DAY.plusDays((Integer) result);
   ```
   
   I was wondering why the same data set generated by `RandomDataGenerator` could not be written into Avro files while it could be written into Parquet files; there had to be something wrong.
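   The mismatch is easy to reproduce without any Iceberg classes. Below is a minimal sketch of the failure mode; the helper method is illustrative, not the actual `IntegerWriter` code:

   ```java
   import java.time.LocalDate;

   public class CastMismatch {
       // Mimics what GenericAvroWriter's IntegerWriter effectively does for a
       // DATE field: it expects the raw Integer (days from epoch), not a LocalDate.
       static Integer writeAsInt(Object value) {
           return (Integer) value; // throws ClassCastException for a LocalDate input
       }

       public static void main(String[] args) {
           Object fromRandomData = LocalDate.of(2020, 7, 1); // what RandomData produces
           try {
               writeAsInt(fromRandomData);
           } catch (ClassCastException e) {
               System.out.println("reproduced: " + e);
           }
           System.out.println(writeAsInt(18444)); // the representation the writer expects
       }
   }
   ```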
   
   Finally, I found that we have two Avro readers/writers: one is `org.apache.iceberg.data.avro.DataWriter` and the other is `org.apache.iceberg.avro.GenericAvroWriter`. Most of the code is the same except for several data type mappings, such as:
   1. the `GenericAvroWriter` encodes `date` with `IntegerWriter`, while the `DataWriter` encodes it with `DateWriter`;
   2. the `GenericAvroWriter` encodes `time-micros` with `LongWriter`, while the `DataWriter` encodes it with `TimeWriter`;
   3. the `GenericAvroWriter` encodes `timestamp-micros` with `LongWriter`, while the `DataWriter` encodes it with `TimestamptzWriter` or `TimestampWriter`.
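   For reference, the extra conversions performed on the `DataWriter` side before handing values to the primitive encoders boil down to plain `java.time` arithmetic. This is a sketch using only JDK classes; the method names are illustrative and not copied from Iceberg:

   ```java
   import java.time.Instant;
   import java.time.LocalDate;
   import java.time.LocalTime;
   import java.time.OffsetDateTime;
   import java.time.ZoneOffset;
   import java.time.temporal.ChronoUnit;

   public class AvroLogicalTypeMappings {
       // date: LocalDate -> int days from 1970-01-01
       static int toEpochDays(LocalDate date) {
           return (int) date.toEpochDay();
       }

       // time-micros: LocalTime -> long microseconds from midnight
       static long toMicrosOfDay(LocalTime time) {
           return time.toNanoOfDay() / 1000;
       }

       // timestamp-micros: OffsetDateTime -> long microseconds from epoch
       static long toEpochMicros(OffsetDateTime ts) {
           return ChronoUnit.MICROS.between(Instant.EPOCH, ts.toInstant());
       }

       public static void main(String[] args) {
           System.out.println(toEpochDays(LocalDate.of(2020, 7, 1)));   // 18444
           System.out.println(toMicrosOfDay(LocalTime.of(9, 26, 52)));  // 34012000000
           System.out.println(toEpochMicros(
               OffsetDateTime.of(2020, 7, 1, 9, 26, 52, 0, ZoneOffset.UTC)));
       }
   }
   ```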
   
   So why do we need two such similar readers & writers? It's quite confusing when trying to implement the Flink Avro reader and writer (it seems I need to make `FlinkAvroWriter` extend the `DataWriter`). Is it possible to remove one of them and just keep the other?
   
   FYI @rdblue .
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdsr commented on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
rdsr commented on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653619954


   I think there's some confusion in how we name things. Reader/writer funcs in IcebergGenerics for ORC and Parquet are called `OrcGeneric[Reader|Writer]` and `GenericParquet[Reader|Writer]`, whereas for Avro they are called `Data[Reader|Writer]`. On top of that, the `GenericAvro[Reader|Writer]` classes are internal and not used by IcebergGenerics.




[GitHub] [iceberg] rdblue commented on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653621848


   @rdsr, I agree. The naming should definitely be fixed because we're inconsistent. At least the `ParquetAvroWriter` makes sense.




[GitHub] [iceberg] chenjunjiedada commented on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653896854


   I just thought about this again. Since `DataReader` and `DataWriter` are used downstream, we need to consider how to remove them from the public API without impacting downstream clients. How about we copy `DataReader` and `DataWriter` to iceberg-data, rename them to `AvroGenericReader`/`AvroGenericWriter`, and add a `@Deprecated` annotation to the existing `DataWriter` and `DataReader` at first? @openinx @rdblue, what do you think?
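   The deprecate-and-alias step could look roughly like this (class bodies and names here are illustrative, not the actual Iceberg code):

   ```java
   // Proposed new home for the implementation (name from the discussion above):
   class AvroGenericWriter<T> {
       String write(T datum) {
           // stand-in for the real Avro encoding logic
           return "written:" + datum;
       }
   }

   /**
    * Thin alias kept so downstream callers keep compiling.
    *
    * @deprecated use {@link AvroGenericWriter} instead; removable after a few releases.
    */
   @Deprecated
   class DataWriter<T> extends AvroGenericWriter<T> {
   }

   public class DeprecationSketch {
       public static void main(String[] args) {
           // The old entry point still works, but now raises a deprecation
           // warning at compile time for downstream users.
           System.out.println(new DataWriter<String>().write("record-1"));
       }
   }
   ```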




[GitHub] [iceberg] openinx commented on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653993749


   @chenjunjiedada, thank you for following up on this issue. Deprecating the existing `DataWriter` and `DataReader` looks good to me. I guess we don't need to copy all the code from `DataWriter` to `AvroGenericWriter`, because we can abstract out the common logic, for example by putting it in a `BaseAvroWriter` that both `DataWriter` and `AvroGenericWriter` extend, similar to the patch https://github.com/apache/iceberg/commit/41cc1334bfe50c7dccb58baeb664b236732f0e2b. After several releases, we could remove the deprecated `DataWriter` interfaces. Thanks.
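   A minimal sketch of that factoring, using the `date` case from above (all names here, including `BaseAvroWriter`, are illustrative):

   ```java
   import java.time.LocalDate;

   // Shared traversal/encoding logic would live in the base class; only the
   // value-representation hooks differ between the two subclasses.
   abstract class BaseAvroWriter {
       final String writeDate(Object value) {
           return "int(" + toEpochDays(value) + ")";
       }
       abstract int toEpochDays(Object value);
   }

   // Avro-generic object model: DATE already arrives as the raw int.
   class GenericAvroStyle extends BaseAvroWriter {
       @Override int toEpochDays(Object value) {
           return (Integer) value;
       }
   }

   // Iceberg-generics object model: DATE arrives as java.time.LocalDate.
   class DataStyle extends BaseAvroWriter {
       @Override int toEpochDays(Object value) {
           return (int) ((LocalDate) value).toEpochDay();
       }
   }

   public class BaseAvroWriterSketch {
       public static void main(String[] args) {
           // Same encoding, different in-memory inputs:
           System.out.println(new GenericAvroStyle().writeDate(18444));
           System.out.println(new DataStyle().writeDate(LocalDate.of(2020, 7, 1)));
       }
   }
   ```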




[GitHub] [iceberg] rdblue commented on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653185375


   Readers and writers are specific to an in-memory representation, which I'll interchangeably refer to as an object model. The Avro readers and writers you're using here are for different object models: Iceberg generics and Avro.
   
   The `data.avro.DataReader` and `data.avro.DataWriter` classes were written for the Iceberg generics data model. That uses Iceberg's generic record class, Java 8 date/time types, BigDecimal, ByteBuffer, and byte[]. This representation is intended for application authors working directly with Iceberg API. That's why it uses standard Java representations for most types.
   
   The `avro.GenericAvroReader` and `avro.GenericAvroWriter` classes are for working with Avro's generic or specific records. That's why this produces `GenericData.Record` or instances of specific classes that all implement Avro's `IndexedRecord`. This is intended for internal use -- internal implementations of `DataFile` and `ManifestFile` use `IndexedRecord` -- so it produces the internal representations, like `BigDecimal`, long microseconds from epoch, or int days from epoch.
   
   Unfortunately, early on we left the generic Avro reader/writer implementation public in core and have some downstream uses, like the original Netflix Flink sink. I think we should eventually remove Avro from the public API.
   
   For Flink, we should build reader/writer classes that produce and consume its in-memory representation. That's what we do for Spark and Pig. Based on the Parquet support in #1125, I thought that Flink uses Java 8 date/time classes, BigDecimal, and ByteBuffer, so it would make sense to base the readers on Iceberg generics. In that case, basing your implementation on `data.avro.DataReader` and `data.avro.DataWriter` should work fine.




[GitHub] [iceberg] chenjunjiedada commented on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653313170


   Thanks @openinx for bringing this up. I will try to work on this.




[GitHub] [iceberg] rdblue commented on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653602970


   @chenjunjiedada, what do you plan to change?




[GitHub] [iceberg] chenjunjiedada commented on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653702927


   @openinx and I had a discussion yesterday. Basically, we think we need to refactor the two Avro readers/writers into one base Avro reader/writer, and then have two different versions in iceberg-core and iceberg-generic, just like what #1162 does. Also, since `DataReader`/`DataWriter` are in iceberg-core, I'm wondering whether we could move them into the iceberg-data module.
   
   Another thing is to build a reader that uses the generics data model, so as to replace the `IndexedRecord` interface of `DataFile` and `ManifestFile` with `Record`, as you described above.
   
   For naming, do we want to use the pattern `{target file format} + {data model} + reader/writer`?




[GitHub] [iceberg] rdsr edited a comment on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
rdsr edited a comment on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653619954


   I think there's some confusion in how we name things. Reader/writer funcs in IcebergGenerics for ORC and Parquet are called `OrcGeneric[Reader|Writer]` and `GenericParquet[Reader|Writer]`, whereas for Avro they are called `Data[Reader|Writer]`. On top of that, the `GenericAvro[Reader|Writer]` classes are internal and not used by IcebergGenerics.




[GitHub] [iceberg] rdblue edited a comment on issue #1152: Why do we need two avro record readers & writers ?

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on issue #1152:
URL: https://github.com/apache/iceberg/issues/1152#issuecomment-653185375


   Readers and writers are specific to an in-memory representation, which I'll interchangeably refer to as an object model. The Avro readers and writers you're using here are for different object models: Iceberg generics and Avro.
   
   The `data.avro.DataReader` and `data.avro.DataWriter` classes were written for the Iceberg generics data model. That uses Iceberg's generic record class, Java 8 date/time types, BigDecimal, ByteBuffer, and byte[]. This representation is intended for application authors working directly with Iceberg API. That's why it uses standard Java representations for most types.
   
   The `avro.GenericAvroReader` and `avro.GenericAvroWriter` classes are for working with Avro's generic or specific records. That's why this produces `GenericData.Record` or instances of specific classes that all implement Avro's `IndexedRecord`. This is intended for internal use -- internal implementations of `DataFile` and `ManifestFile` use `IndexedRecord` -- so it produces the internal representations, like `BigDecimal`, long microseconds from epoch, or int days from epoch.
   
   Unfortunately, early on we left the generic Avro reader/writer implementation public in core and have some downstream uses, like the original Netflix Flink sink. I think we should eventually remove Avro from the public API. I would also like to remove it and make `DataFile` and `ManifestFile` implement our own `Record` API, but we would need to have a reader that produces the internal value representations.
   
   For Flink, we should build reader/writer classes that produce and consume its in-memory representation. That's what we do for Spark and Pig. Based on the Parquet support in #1125, I thought that Flink uses Java 8 date/time classes, BigDecimal, and ByteBuffer, so it would make sense to base the readers on Iceberg generics. In that case, basing your implementation on `data.avro.DataReader` and `data.avro.DataWriter` should work fine.

