Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2021/10/21 08:26:38 UTC

[GitHub] [flink] slinkydeveloper edited a comment on pull request #17520: [FLINK-24565][avro] Port avro file format factory to BulkReaderFormatFactory

slinkydeveloper edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-948377409


   @tsreaper I'm starting to get your point about laziness, but now I wonder: why is Avro different from, for example, Parquet? In the Parquet format, I see that when you invoke `readBatch` the actual data is loaded into memory and an iterator over those records is returned, so the I/O is performed within the `readBatch` invocation. The same holds for all the other formats I've looked at.
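To make the "eager" pattern concrete, here is a minimal, self-contained sketch (hypothetical names, not the real Flink `BulkFormat` API): all I/O happens inside `readBatch`, and the returned iterator only walks records already in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of an "eager" batch reader: readBatch() performs
// all the I/O up front, so the iterator it returns does no I/O at all.
public class EagerBatchReader {
    private final Iterator<String> source; // stands in for the file stream
    private final int batchSize;

    public EagerBatchReader(Iterator<String> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    /** Reads up to batchSize records NOW; the returned iterator is purely in-memory. */
    public Iterator<String> readBatch() {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next()); // the read happens here, inside readBatch()
        }
        return batch.isEmpty() ? null : batch.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> file = List.of("a", "b", "c").iterator();
        Iterator<String> batch = new EagerBatchReader(file, 2).readBatch();
        StringBuilder sb = new StringBuilder();
        while (batch.hasNext()) sb.append(batch.next());
        System.out.println(sb); // prints "ab": both records were read eagerly
    }
}
```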
   
   Your lazy reading changes the threading model of the source: with this PR the "consumer" thread (the one invoking `SourceReaderBase::pollNext`, in particular) becomes the thread that performs the actual I/O. Why is Avro special in this sense? Why don't we do the same for Parquet? And are we sure this is how the source architecture is intended to work? Couldn't this cause issues, given that the original design might not have considered performing I/O on the consumer thread?
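For contrast, the "lazy" alternative can be sketched like this (again hypothetical names, not the real API): `readBatch` returns immediately and the I/O is deferred until the batch is iterated, which is exactly what moves the reads onto whichever thread consumes the batch.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a "lazy" batch reader: readBatch() does no I/O
// itself; records are pulled from the underlying stream only when the
// consuming thread iterates the returned batch.
public class LazyBatchReader {
    private final Iterator<String> source; // stands in for the file stream

    public LazyBatchReader(Iterator<String> source) {
        this.source = source;
    }

    /** Returns immediately; I/O is deferred to hasNext()/next(). */
    public Iterator<String> readBatch(int batchSize) {
        return new Iterator<String>() {
            private int served = 0;

            @Override
            public boolean hasNext() {
                return served < batchSize && source.hasNext();
            }

            @Override
            public String next() {
                served++;
                // The read happens HERE, on whichever thread calls next() --
                // with lazy reading that is the consumer thread.
                return source.next();
            }
        };
    }

    public static void main(String[] args) {
        LazyBatchReader reader = new LazyBatchReader(List.of("a", "b", "c").iterator());
        Iterator<String> batch = reader.readBatch(2);
        StringBuilder sb = new StringBuilder();
        while (batch.hasNext()) sb.append(batch.next());
        System.out.println(sb); // prints "ab", read record-by-record on demand
    }
}
```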
   
   Even with laziness, I still think we don't need concurrency primitives to serialize `nextBatch` invocations, because each `SourceOperator` has its own instance of `SourceReader` (look at the method `SourceOperator::initReader`), and it is from `SourceOperator` that `SourceReaderBase::pollNext` is invoked, which triggers the actual reading of Avro records from your lazy reader.
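A toy sketch of that argument (simulated operators and readers, not the real Flink classes): each operator instance owns its own reader, and only that operator's single task thread ever polls it, so the reader's plain unsynchronized state is safe by thread confinement.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: per-operator readers need no locks because each
// reader is only ever touched by its own operator's single task thread.
public class PerOperatorReaderDemo {
    // Plain, unsynchronized state -- safe because it is confined to one thread.
    static class SimpleReader {
        private final Iterator<String> records;
        SimpleReader(Iterator<String> records) { this.records = records; }
        String pollNext() { return records.hasNext() ? records.next() : null; }
    }

    static class SimpleOperator {
        private final SimpleReader reader; // one reader per operator instance
        SimpleOperator(List<String> split) { this.reader = new SimpleReader(split.iterator()); }
        String runToCompletion() {
            StringBuilder out = new StringBuilder();
            String rec;
            while ((rec = reader.pollNext()) != null) out.append(rec);
            return out.toString();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two operators on two threads -- but each thread only touches its own reader,
        // so pollNext() calls on any single reader are naturally serialized.
        SimpleOperator op1 = new SimpleOperator(List.of("a", "b"));
        SimpleOperator op2 = new SimpleOperator(List.of("x", "y"));
        String[] results = new String[2];
        Thread t1 = new Thread(() -> results[0] = op1.runToCompletion());
        Thread t2 = new Thread(() -> results[1] = op2.runToCompletion());
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(results[0] + " " + results[1]); // prints "ab xy"
    }
}
```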


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org