Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2021/10/20 03:07:10 UTC

[GitHub] [flink] tsreaper commented on pull request #17520: [FLINK-24565][avro] Port avro file format factory to BulkReaderFormatFactory

tsreaper commented on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-947280429


   @slinkydeveloper There are three reasons why I did not choose `StreamFormat`.
   1. The biggest concern is that `StreamFormatAdapter.Reader#readBatch` stores all results of a batch in heap memory. This is bad because Avro is a format that supports compression: you never know how much data will be stuffed into heap memory after inflation.
   2. `StreamFormatAdapter` cuts batches by counting the number of bytes read from the file stream. If the Avro sync size is 2 MB, it will read 2 MB from the file in one go and produce a batch containing no records. However, this only happens at the beginning of reading a file, so it might be OK.
   3. Both the ORC and Parquet formats implement `BulkFormat` rather than `StreamFormat`. If `StreamFormat` were the better fit, why wasn't it chosen for them?
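
   To make point 1 concrete, here is a small self-contained sketch (plain `java.util.zip`, not Flink code) of the inflation problem: an adapter that cuts batches by counting *compressed* bytes read from the stream has no idea how much *decompressed* data those bytes become in heap. The 8 MiB repetitive payload is a hypothetical stand-in for a highly compressible Avro block.

   ```java
   import java.util.Arrays;
   import java.util.zip.Deflater;

   public class InflationDemo {
       public static void main(String[] args) {
           // Hypothetical stand-in for a compressed Avro block:
           // a highly repetitive 8 MiB payload.
           byte[] raw = new byte[8 * 1024 * 1024];
           Arrays.fill(raw, (byte) 'a');

           // Compress it with deflate (one of Avro's supported codecs).
           Deflater deflater = new Deflater();
           deflater.setInput(raw);
           deflater.finish();
           byte[] buf = new byte[raw.length];
           int compressedLen = 0;
           while (!deflater.finished()) {
               compressedLen += deflater.deflate(buf, compressedLen, buf.length - compressedLen);
           }
           deflater.end();

           // A batching adapter that counts bytes read from the *file stream*
           // observes only `compressedLen` bytes, while the heap must hold all
           // `raw.length` decompressed bytes before the batch is cut.
           System.out.println("compressed bytes read from stream: " + compressedLen);
           System.out.println("decompressed bytes held in heap:   " + raw.length);
           System.out.println("inflation factor: " + (raw.length / compressedLen) + "x");
       }
   }
   ```

   With a `BulkFormat` implementation the format itself decides where batches end, so it can bound batches by decompressed size instead.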


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org