Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2021/10/20 16:18:16 UTC

[GitHub] [flink] JingGe edited a comment on pull request #17520: [FLINK-24565][avro] Port avro file format factory to BulkReaderFormatFactory

JingGe edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-947822402


   > @slinkydeveloper There are four reasons why I did not choose `StreamFormat`.
   > 
   > 1. The biggest concern is that `StreamFormatAdapter.Reader#readBatch` stores all results of a batch in heap memory. This is risky because Avro is a format that supports compression: you never know how much data will be stuffed into heap memory after inflation.
   > 2. `StreamFormat`, by its concept, is for a stream of bytes where each record is shipped independently. Avro is a file format that organizes records into its own blocks, so the two do not match conceptually. I would say the csv format is more suitable for `StreamFormat`.
   > 3. `StreamFormatAdapter` cuts batches by counting the number of bytes read from the file stream. If the sync size of Avro is 2MB, it will read 2M bytes from the file in one go and produce a batch containing no records. However, this only happens at the beginning of reading a file, so it might be OK.
   > 4. Both the orc and parquet formats have implemented `BulkFormat` instead of `StreamFormat`; so why was `StreamFormat` not chosen for them?
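   As a side note on points 1 and 3, the batch-cutting behaviour can be sketched with a simplified model in plain Java (these are not the real Flink classes; the per-batch byte budget and pull costs are made up for illustration). A pull from the stream that exceeds the remaining quota ends the batch, so a large first block can yield an empty first batch:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   /**
    * Simplified model (not the real Flink classes) of a
    * StreamFormatAdapter-style reader that cuts batches by counting the
    * bytes pulled from the file stream. pullCost[i] is the number of
    * stream bytes that must be consumed before record i is available,
    * e.g. a 2 MB Avro block pull followed by near-free in-memory decodes.
    */
   class BatchCuttingModel {
       static List<List<Integer>> readAll(long[] pullCost, long fetchSize) {
           List<List<Integer>> batches = new ArrayList<>();
           List<Integer> batch = new ArrayList<>();
           long budget = fetchSize;
           int record = 0;
           long remaining = pullCost.length > 0 ? pullCost[0] : 0;
           while (record < pullCost.length) {
               if (remaining > budget) {
                   // The pull exhausts this batch's quota mid-read: the
                   // batch ends (possibly with zero records) and the
                   // pull resumes in the next batch.
                   remaining -= budget;
                   batches.add(batch);
                   batch = new ArrayList<>();
                   budget = fetchSize;
               } else {
                   budget -= remaining;
                   batch.add(record);
                   record++;
                   remaining = record < pullCost.length ? pullCost[record] : 0;
               }
           }
           batches.add(batch);
           return batches;
       }
   }
   ```

   With a 2 MB pull cost against a 1 MB fetch size, the first batch comes back empty and the records arrive in the second batch, matching the behaviour described in point 3.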
   
   The reasoning behind your solution is great, thanks for your contribution! Let me share my understanding so that we can discuss the design together. Correct me if I am wrong.
   
   For point 1, the uncompressed data size should be controlled by `StreamFormat.FETCH_IO_SIZE`. It may not bound the heap size precisely, since the last read can exceed the quota by a little, but that is acceptable. Speaking of compression, `StreamFormatAdapter` has built-in support for compressed input. Does this PR's implementation have the same support?
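   Whether the quota effectively bounds compressed or decompressed bytes is the crux of point 1. If the quota were counted on compressed bytes only, a back-of-the-envelope worst case (the class, method name, and the 20x ratio below are made up for illustration) would look like this:

   ```java
   /**
    * Back-of-the-envelope sketch (hypothetical, not Flink code): if the
    * fetch quota only capped the bytes read from the compressed stream,
    * the decompressed records held on the heap per batch could exceed
    * it by the compression ratio.
    */
   class HeapInflationSketch {
       static long worstCaseHeapBytes(long fetchIoSizeBytes, double compressionRatio) {
           // Decompressed size = compressed bytes read * compression ratio.
           return (long) (fetchIoSizeBytes * compressionRatio);
       }
   }
   ```

   For example, a 1 MiB quota combined with a 20x compression ratio could put roughly 20 MiB of records on the heap for a single batch.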
   
   For point 2, `StreamFormat` defines how each record is read. The granularity of a record can be controlled via the generic type in `StreamFormat.Reader<T>`, so there is plenty of room to play with if a single Avro record is too small in this case.
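   To illustrate that "room to play", here is a sketch in plain Java (it mirrors, but does not import, the `StreamFormat.Reader<T>` contract; `RecordSource` and the chunk size are hypothetical): choosing `T` as a small list of records lets one `StreamFormat` "record" carry several underlying Avro records.

   ```java
   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;

   /** Hypothetical source of decoded records; returns null at end of input. */
   interface RecordSource {
       String next() throws IOException;
   }

   /**
    * Sketch of a reader whose record type T is List<String>: each call
    * to read() hands the framework a chunk of underlying records,
    * analogous to StreamFormat.Reader#read() with a coarser generic type.
    */
   class ChunkingReader {
       private final RecordSource source;
       private final int chunkSize;

       ChunkingReader(RecordSource source, int chunkSize) {
           this.source = source;
           this.chunkSize = chunkSize;
       }

       List<String> read() throws IOException {
           List<String> chunk = new ArrayList<>(chunkSize);
           while (chunk.size() < chunkSize) {
               String r = source.next();
               if (r == null) break;
               chunk.add(r);
           }
           return chunk.isEmpty() ? null : chunk; // null signals end of input
       }
   }
   ```

   With five records and a chunk size of two, the framework would see three "records" of sizes 2, 2, and 1, then end of input.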
   
   For point 4, that is a good question; we should take a deep dive into the code. Might it make sense to refactor the orc and parquet formats to `StreamFormat` too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org