Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 15:07:45 UTC

[GitHub] [beam] damccorm opened a new issue, #17850: FileBasedSource/IOChannelFactory: Custom glob expansion

damccorm opened a new issue, #17850:
URL: https://github.com/apache/beam/issues/17850

   Many cloud and distributed filesystems are eventually consistent, for instance Amazon S3 and Google Cloud Storage.
   
   To work around this, many systems that produce files, such as Beam's FileBasedSink or Google BigQuery, provide a way to determine the number and set of files produced. E.g.,
   
   * Beam FileBasedSink uses -00000-of-NNNNN
   * BigQuery export jobs use -000000 -000001 -000002 ... until an empty file is produced
   * Another system may produce a file with a .filelist suffix that contains a list of all files.
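
As a rough illustration of the first convention, the sketch below derives the full expected file set from a single observed shard name, so a reader does not have to rely on a listing that may be incomplete. The function name `expand_sharded_name` and the regular expression are assumptions made for this example, not part of Beam's API.

```python
import re

# Matches a shard suffix of the form "-<index>-of-<total>", e.g. "out-00003-of-00010.txt".
_SHARD_RE = re.compile(r"^(?P<prefix>.*)-(?P<index>\d+)-of-(?P<total>\d+)(?P<suffix>.*)$")


def expand_sharded_name(observed: str) -> list[str]:
    """Return every expected shard name, given one observed shard name (hypothetical helper)."""
    match = _SHARD_RE.match(observed)
    if match is None:
        raise ValueError(f"not a sharded file name: {observed!r}")
    prefix = match.group("prefix")
    suffix = match.group("suffix")
    total_text = match.group("total")
    width = len(match.group("index"))  # keep the original zero padding
    return [
        f"{prefix}-{i:0{width}d}-of-{total_text}{suffix}"
        for i in range(int(total_text))
    ]
```

A caller could compare this list against what the filesystem currently reports and wait or fail if any expected shard is not yet visible.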
   
   Users should be able to supply a glob to FileBasedSource and additionally supply a "glob expander" that provides a custom implementation of file expansion. That way, e.g., Beam pipelines can be run back-to-back-to-back, where each consumes the output of the previous, on an eventually consistent filesystem without data loss.
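
One possible shape for such an expander, sketched here for the BigQuery-style numbering: it probes files by name instead of listing them, and stops at the empty file. Everything in it is hypothetical: the name `expand_sequential`, the `size_of` callback (returns a size in bytes, or `None` if the file is not visible), and the choice to exclude the empty file from the result are assumptions for illustration, not an existing Beam interface.

```python
from typing import Callable, Optional


def expand_sequential(
    prefix: str,
    size_of: Callable[[str], Optional[int]],
    width: int = 6,
    max_files: int = 100_000,
) -> list[str]:
    """Expand "<prefix>-000000", "<prefix>-000001", ... up to an empty sentinel file."""
    files: list[str] = []
    for i in range(max_files):
        name = f"{prefix}-{i:0{width}d}"
        size = size_of(name)
        if size is None:
            # The sentinel has not been seen yet, so a missing file means the
            # filesystem view is incomplete; fail instead of silently stopping.
            raise FileNotFoundError(f"expected file not visible yet: {name}")
        if size == 0:
            return files  # assumption: the empty file only marks the end
        files.append(name)
    raise RuntimeError(f"no empty sentinel file within {max_files} files")
```

The point of the design is that a missing file is treated as an error to retry rather than as the end of the set, which is what distinguishes it from plain glob matching on an eventually consistent filesystem.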
   
   Imported from Jira [BEAM-60](https://issues.apache.org/jira/browse/BEAM-60). Original Jira may contain additional context.
   Reported by: dhalperi.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm closed issue #17850: FileBasedSource/IOChannelFactory: Custom glob expansion

Posted by GitBox <gi...@apache.org>.
damccorm closed issue #17850: FileBasedSource/IOChannelFactory: Custom glob expansion
URL: https://github.com/apache/beam/issues/17850

