Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 16:10:58 UTC

[GitHub] [beam] kennknowles opened a new issue, #18034: Execute some bounded source reads via composite transform

kennknowles opened a new issue, #18034:
URL: https://github.com/apache/beam/issues/18034

   The BoundedSource API is intended for cases where the source can provide meaningful progress reporting, dynamic splitting, and size estimation. For example, it is a good fit for processing a moderate number of large files, or a key-value table.
   
   However, existing runners have scalability limitations on how many bundles a BoundedSource can split into, which makes it a very poor fit for processing many small files: the source ends up splitting into too many bundles (at least one per file), overwhelming the runner.
   
   This is a frequent use case, and it does not need the power of the BoundedSource API: small files don't need to be dynamically split, progress estimation is not needed, and size estimation is a "nice-to-have" but not strictly necessary.
   
   In this case, it'd be better to run the read not as a raw Read.from(BoundedSource) executed natively by the runner, but as a ParDo(splitIntoBundles), followed by a fusion break, followed by a ParDo(read each bundle). That way the bundles end up as elements of a simple PCollection with no scalability limitations, and most likely with much smaller per-bundle overhead.
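
To make the two-stage shape concrete, here is a minimal sketch in plain Java. It deliberately does not use the Beam SDK: `CompositeReadSketch`, `Bundle`, `splitIntoBundles`, `readBundle`, and `run` are hypothetical names invented for illustration, and the in-memory list of bundles only stands in for the PCollection that would sit at the fusion break. It assumes each small file becomes exactly one bundle that is never split further.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the proposed composite read. None of these types are Beam classes. */
public class CompositeReadSketch {

    /** Hypothetical stand-in for one bundle, here one whole small file. */
    public static final class Bundle {
        final String fileName;
        final List<String> lines;

        Bundle(String fileName, List<String> lines) {
            this.fileName = fileName;
            this.lines = lines;
        }
    }

    /** Stage 1, analogous to ParDo(splitIntoBundles): emit one bundle per file. */
    public static List<Bundle> splitIntoBundles(Map<String, List<String>> files) {
        List<Bundle> bundles = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            bundles.add(new Bundle(file.getKey(), file.getValue()));
        }
        return bundles;
    }

    /** Stage 2, analogous to ParDo(read each bundle): read every record of one bundle. */
    public static List<String> readBundle(Bundle bundle) {
        return new ArrayList<>(bundle.lines);
    }

    /** Runs both stages; the materialized bundle list plays the role of the fusion break. */
    public static List<String> run(Map<String, List<String>> files) {
        List<Bundle> bundles = splitIntoBundles(files); // bundles are now ordinary elements
        List<String> records = new ArrayList<>();
        for (Bundle bundle : bundles) {
            records.addAll(readBundle(bundle));
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", List.of("a1", "a2"));
        files.put("b.txt", List.of("b1"));
        System.out.println(run(files)); // prints [a1, a2, b1]
    }
}
```

In a real pipeline the runner would redistribute the bundle elements between the two stages. The sketch only shows that, once bundles are plain data, reading them needs no source-specific support from the runner.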
   
   Implementation options:
   - The BoundedSource API could provide a hint method telling Read.from() to expand in this way
   - Individual connectors, such as TextIO.Read, could switch between expanding into Read.from() or into this composite transform depending on parameters (e.g. TextIO.Read.withCompressionType(GZ) would always expand into the composite transform, because the BoundedSource API is unnecessary for compressed files)
   - Something else?
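
As a rough illustration of how the first two options could combine, the sketch below picks an expansion from two inputs. Everything in it is an assumption made for the example: `ReadExpansionSketch`, the `Expansion` enum, and the `sourcePrefersComposite` flag (standing in for a possible hint method on the source) do not exist in Beam under these names.

```java
/** Hypothetical decision logic for choosing how a read expands. Not Beam API. */
public class ReadExpansionSketch {

    /** The two expansions discussed in this issue. */
    public enum Expansion { NATIVE_READ, COMPOSITE_PARDO }

    /**
     * Assumed policy: compressed input always uses the composite expansion,
     * since it gains nothing from dynamic splitting. Otherwise the source's
     * own hint decides.
     */
    public static Expansion choose(boolean compressed, boolean sourcePrefersComposite) {
        if (compressed) {
            return Expansion.COMPOSITE_PARDO;
        }
        return sourcePrefersComposite ? Expansion.COMPOSITE_PARDO : Expansion.NATIVE_READ;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false));  // prints COMPOSITE_PARDO
        System.out.println(choose(false, false)); // prints NATIVE_READ
    }
}
```

Keeping the decision in one place like this would let a connector override it per parameter while the source-level hint acts as the default.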
   
   Imported from Jira [BEAM-521](https://issues.apache.org/jira/browse/BEAM-521). Original Jira may contain additional context.
   Reported by: jkff.

