Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 15:33:11 UTC

[GitHub] [beam] damccorm opened a new issue, #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

damccorm opened a new issue, #20137:
URL: https://github.com/apache/beam/issues/20137

   As a workaround, we could introduce a way to skip size estimation when reading large globs. For example, the Java SDK has a withHintMatchesManyFiles() option.
   
   [https://github.com/apache/beam/blob/850e8469de798d45ec535fe90cb2dc5dbda4974a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L371](https://github.com/apache/beam/blob/850e8469de798d45ec535fe90cb2dc5dbda4974a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L371)
   
   Additionally, it seems that size estimation is repeated when the same PCollection read from a file-based source is applied to multiple PTransforms.
   
   See the following for more details:
   
   [https://stackoverflow.com/questions/60874942/avoid-recomputing-size-of-all-cloud-storage-files-in-gcsio-beam-python-sdk](https://stackoverflow.com/questions/60874942/avoid-recomputing-size-of-all-cloud-storage-files-in-gcsio-beam-python-sdk)
   
   Imported from Jira [BEAM-9620](https://issues.apache.org/jira/browse/BEAM-9620). Original Jira may contain additional context.
   Reported by: chamikara.
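
The proposed workaround can be illustrated with a small, standard-library-only sketch. Everything in it (`GlobSource`, `fake_list_sizes`, the `hint_matches_many_files` flag, the bucket paths) is hypothetical and is not part of the Beam API. It only shows the general idea of skipping the expensive listing when the caller hints that the glob is large, and it does not model how the Java hint is actually implemented.

```python
from typing import Callable, Dict, List, Optional


class GlobSource:
    """Toy file-based source (hypothetical, not an Apache Beam class)."""

    def __init__(
        self,
        pattern: str,
        list_sizes: Callable[[str], Dict[str, int]],
        hint_matches_many_files: bool = False,
    ) -> None:
        self._pattern = pattern
        self._list_sizes = list_sizes
        self._hint = hint_matches_many_files

    def estimate_size(self) -> Optional[int]:
        # With the hint set, report "size unknown" instead of listing
        # every file that matches the glob.
        if self._hint:
            return None
        return sum(self._list_sizes(self._pattern).values())


# Stand-in for a storage listing call; records how often it is invoked.
calls: List[str] = []


def fake_list_sizes(pattern: str) -> Dict[str, int]:
    calls.append(pattern)
    return {"gs://bucket/a.txt": 10, "gs://bucket/b.txt": 32}


eager = GlobSource("gs://bucket/*.txt", fake_list_sizes)
hinted = GlobSource(
    "gs://bucket/*.txt", fake_list_sizes, hint_matches_many_files=True
)

eager_size = eager.estimate_size()    # triggers one listing
hinted_size = hinted.estimate_size()  # triggers no listing
```

The trade-off is that a caller who skips estimation gives up whatever planning decisions would have used the size, so a hint like this would only make sense for globs known to be very large.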


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] Abacn commented on issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

Posted by GitBox <gi...@apache.org>.
Abacn commented on issue #20137:
URL: https://github.com/apache/beam/issues/20137#issuecomment-1315776418

   Bumping the priority to P2 because we keep seeing this question. The latest one: https://stackoverflow.com/questions/74234085/how-to-parallelize-properly-dataflow-job-over-11m-of-files-stored-on-gcs-using-f




[GitHub] [beam] Abacn commented on issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

Posted by GitBox <gi...@apache.org>.
Abacn commented on issue #20137:
URL: https://github.com/apache/beam/issues/20137#issuecomment-1194124960

   @rviscomi From the source code you linked, it seems there is a two-stage ReadAllFromText() when `input_file` is set. Nevertheless, there are things that could be optimized on the Beam side. When validating that at least one file will be read:
   https://github.com/apache/beam/blob/54b0784da7ccba738deff22bd83fbc374ad21d2e/sdks/python/apache_beam/io/filebasedsource.py#L187
   
   the current gcsio will essentially try to read all files, because it returns a dict instead of using lazy evaluation:
   
   https://github.com/apache/beam/blob/54b0784da7ccba738deff22bd83fbc374ad21d2e/sdks/python/apache_beam/io/gcp/gcsio.py#L611
   
   which causes duplicate operations.
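
The difference between returning a dict and evaluating lazily can be sketched without Beam. The helper names below (`list_eager`, `list_lazy`, `has_match_eager`, `has_match_lazy`) and the file names are made up for illustration and are not gcsio functions. The `touched` list stands in for the metadata requests a real listing would make.

```python
from typing import Dict, Iterator, List, Tuple


def list_eager(names: List[str], touched: List[str]) -> Dict[str, int]:
    """Builds the full name-to-size dict before returning anything."""
    result: Dict[str, int] = {}
    for name in names:
        touched.append(name)      # one simulated metadata request
        result[name] = len(name)  # fake size
    return result


def list_lazy(names: List[str], touched: List[str]) -> Iterator[Tuple[str, int]]:
    """Yields (name, size) pairs one at a time, only as they are consumed."""
    for name in names:
        touched.append(name)
        yield name, len(name)


def has_match_eager(names: List[str], touched: List[str]) -> bool:
    return len(list_eager(names, touched)) > 0


def has_match_lazy(names: List[str], touched: List[str]) -> bool:
    # Stops after the first result: enough to validate "at least one file".
    return next(list_lazy(names, touched), None) is not None


names = [f"gs://bucket/file-{i}.txt" for i in range(1000)]
eager_touched: List[str] = []
lazy_touched: List[str] = []

eager_ok = has_match_eager(names, eager_touched)
lazy_ok = has_match_lazy(names, lazy_touched)
```

In this toy, both checks give the same answer, but the eager version touches all 1000 entries while the lazy version touches only the first.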




[GitHub] [beam] rviscomi commented on issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

Posted by GitBox <gi...@apache.org>.
rviscomi commented on issue #20137:
URL: https://github.com/apache/beam/issues/20137#issuecomment-1166396704

   /sub
   
   I originally filed the question on Stack Overflow and I'm still encountering this issue with `ReadAllFromText`. It estimates the size of 10M files on GCS **four times**, taking hours before anything can even be processed. I'd love to see this feature gap closed between the Java and Python SDKs!
   
   <img width="513" alt="image" src="https://user-images.githubusercontent.com/1120896/175796221-5e7de5b2-c3b6-444a-a102-f16a9950a418.png">
   




[GitHub] [beam] rviscomi commented on issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

Posted by GitBox <gi...@apache.org>.
rviscomi commented on issue #20137:
URL: https://github.com/apache/beam/issues/20137#issuecomment-1192644625

   Hi @Abacn, here's our [source code](https://github.com/HTTPArchive/data-pipeline/blob/13607ea26755a3366b50a10fa6b1edbc3a835a7e/modules/transformation.py#L24) for reference.
   
   cc @giancarloaf




[GitHub] [beam] Abacn commented on issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #20137:
URL: https://github.com/apache/beam/issues/20137#issuecomment-1708894519

   I'm going to close this issue, as the 4 size estimations reported in https://github.com/apache/beam/issues/20137#issuecomment-1166396704 should be reduced to a single one.




[GitHub] [beam] Abacn commented on issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

Posted by GitBox <gi...@apache.org>.
Abacn commented on issue #20137:
URL: https://github.com/apache/beam/issues/20137#issuecomment-1191989114

   @rviscomi `ReadAllFromText` would query for file metadata only once here: https://github.com/apache/beam/blob/f2f239a44f490f4ca811361473754d07bc98b6c6/sdks/python/apache_beam/io/filebasedsource.py#L354
   
   Unless your pipeline failed and restarted 3 more times (a batch pipeline will retry 3 times), could you please provide more detail about which PTransforms are involved in your pipeline?




[GitHub] [beam] Abacn commented on issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

Posted by GitBox <gi...@apache.org>.
Abacn commented on issue #20137:
URL: https://github.com/apache/beam/issues/20137#issuecomment-1324331946

   Another duplicated blob-listing operation happens here
   https://github.com/apache/beam/blob/9da27671cdc8b3df2c548d92a4b2e34f5e0aaa0f/sdks/python/apache_beam/io/filebasedsource.py#L144
   and
   https://github.com/apache/beam/blob/9da27671cdc8b3df2c548d92a4b2e34f5e0aaa0f/sdks/python/apache_beam/io/filebasedsource.py#L202
   
   For FileBasedSource, get_range_tracker first calls _get_concat_source, which fetches the file list once. Then estimate_size does another fetch. (If validate is set to True, there is even one more fetch.)
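
One way to picture removing these duplicate fetches is to memoize the listing so that every consumer shares one result. The sketch below is a standalone toy, not the actual FileBasedSource code: `CachedGlobSource`, `_match`, `file_names`, and `counting_list_sizes` are hypothetical names, and `file_names` merely stands in for the step that builds the per-file sources.

```python
from typing import Callable, Dict, List, Optional


class CachedGlobSource:
    """Toy source that lists its glob at most once (hypothetical)."""

    def __init__(
        self, pattern: str, list_sizes: Callable[[str], Dict[str, int]]
    ) -> None:
        self._pattern = pattern
        self._list_sizes = list_sizes
        self._cached: Optional[Dict[str, int]] = None

    def _match(self) -> Dict[str, int]:
        # The first caller pays for the listing; later callers reuse it.
        if self._cached is None:
            self._cached = self._list_sizes(self._pattern)
        return self._cached

    def validate(self) -> None:
        if not self._match():
            raise IOError(f"No files found for pattern {self._pattern}")

    def estimate_size(self) -> int:
        return sum(self._match().values())

    def file_names(self) -> List[str]:
        return sorted(self._match())


# Stand-in for a storage listing call; records how often it is invoked.
list_calls: List[str] = []


def counting_list_sizes(pattern: str) -> Dict[str, int]:
    list_calls.append(pattern)
    return {"gs://bucket/a.txt": 10, "gs://bucket/b.txt": 32}


source = CachedGlobSource("gs://bucket/*.txt", counting_list_sizes)
source.validate()                   # first and only listing
total_size = source.estimate_size()  # reuses the cached result
matched = source.file_names()        # reuses the cached result
```

A cache like this assumes the set of matching files does not change between calls; a real implementation would have to decide whether a stale listing is acceptable.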




[GitHub] [beam] Abacn closed issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn closed issue #20137: textio (and fileio in general) takes too long to estimate sizes of large globs
URL: https://github.com/apache/beam/issues/20137

