Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 23:03:00 UTC

[GitHub] [beam] kennknowles opened a new issue, #19238: Slow DownloaderStream when reading from GCS

kennknowles opened a new issue, #19238:
URL: https://github.com/apache/beam/issues/19238

   DownloaderStream inherits from io.RawIOBase, whose readall() reads in io.DEFAULT_BUFFER_SIZE chunks by default. This causes extremely slow performance when calling read() on handles returned by GcsIO().open().
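
To make the chunking concrete, here is a minimal sketch that uses only the standard library and does not touch Beam or GCS. `CountingRaw` and `count_readall_calls` are hypothetical names for illustration: an in-memory raw stream that counts how often the inherited readall() calls back into it. If each such call on a DownloaderStream turns into a separate remote request, which the timings in this report suggest but this sketch does not prove, the call count is what drives the slowdown.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical in-memory raw stream that counts readinto() calls."""

    def __init__(self, payload):
        super().__init__()
        self._payload = payload
        self._pos = 0
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buf):
        # RawIOBase.read(n) is implemented on top of readinto().
        self.calls += 1
        chunk = self._payload[self._pos:self._pos + len(buf)]
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def count_readall_calls(size):
    """Return how many underlying reads one read() of `size` bytes needs."""
    stream = CountingRaw(b"x" * size)
    data = stream.read()  # read() without a size delegates to readall()
    assert len(data) == size
    return stream.calls


if __name__ == "__main__":
    size = 2 * 1024 * 1024
    print(io.DEFAULT_BUFFER_SIZE, count_readall_calls(size))
```

The count is one call per io.DEFAULT_BUFFER_SIZE chunk plus a final call that returns no data. On interpreters where that constant is 8192, a 2 MiB read therefore takes 257 calls.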
   
   The following code can take ~60 seconds to download a single 2MB file:
   
   ```
   import time

   from apache_beam.io.gcp.gcsio import GcsIO

   gcs = GcsIO()
   t = time.time()
   path = 'gs://my-bucket/my-2MB-file'
   with gcs.open(path) as f:
     f.read()
   duration = time.time() - t
   ```
   
   
   This monkey patch makes the same download code take <1 second:
   
   ```
   from apache_beam.io.gcp import gcsio
   from apache_beam.io.filesystemio import DownloaderStream

   def downloader_stream_readall(self):
       """Read until EOF, using multiple read() calls."""
       res = bytearray()
       while True:
           # Request large chunks instead of io.DEFAULT_BUFFER_SIZE ones.
           data = self.read(gcsio.DEFAULT_READ_BUFFER_SIZE)
           if not data:
               break
           res += data
       if res:
           return bytes(res)
       else:
           # Nothing was read: pass through the last read() result.
           return data

   DownloaderStream.readall = downloader_stream_readall
   ```
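
The effect of overriding readall() can be checked without GCS. The sketch below is an illustration under stated assumptions, not Beam code: `FakeRemoteStream`, `chunked_readall`, `requests_for` and `CHUNK_SIZE` are hypothetical names, and the 16 MiB value merely stands in for gcsio.DEFAULT_READ_BUFFER_SIZE, whose real value is not given in this report. Each readinto() call on the fake stream represents one remote request.

```python
import io

# Hypothetical stand-in for gcsio.DEFAULT_READ_BUFFER_SIZE (value assumed).
CHUNK_SIZE = 16 * 1024 * 1024


class FakeRemoteStream(io.RawIOBase):
    """Hypothetical raw stream; each readinto() models one remote request."""

    def __init__(self, payload):
        super().__init__()
        self._payload = payload
        self._pos = 0
        self.requests = 0

    def readable(self):
        return True

    def readinto(self, buf):
        self.requests += 1
        chunk = self._payload[self._pos:self._pos + len(buf)]
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def chunked_readall(self):
    """Read until EOF in CHUNK_SIZE pieces (simplified form of the patch)."""
    res = bytearray()
    while True:
        data = self.read(CHUNK_SIZE)
        if not data:
            break
        res += data
    return bytes(res)


def requests_for(payload, patched):
    """Count simulated requests for one full read(), with or without the patch."""
    # Patch a throwaway subclass so the base class stays unmodified.
    cls = type("Stream", (FakeRemoteStream,), {})
    if patched:
        cls.readall = chunked_readall
    stream = cls(payload)
    assert stream.read() == payload
    return stream.requests


if __name__ == "__main__":
    payload = b"x" * (2 * 1024 * 1024)
    print(requests_for(payload, patched=False), requests_for(payload, patched=True))
```

With a payload smaller than the chunk size, the patched stream needs two simulated requests: one that returns all the data and one that signals EOF. The unpatched stream needs one per io.DEFAULT_BUFFER_SIZE chunk plus the EOF read.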
   
   
   Imported from Jira [BEAM-6027](https://issues.apache.org/jira/browse/BEAM-6027). Original Jira may contain additional context.
   Reported by: andreasjansson.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] closed issue #19238: Slow DownloaderStream when reading from GCS

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed issue #19238: Slow DownloaderStream when reading from GCS
URL: https://github.com/apache/beam/issues/19238




[GitHub] [beam] Abacn commented on issue #19238: Slow DownloaderStream when reading from GCS

Posted by GitBox <gi...@apache.org>.
Abacn commented on issue #19238:
URL: https://github.com/apache/beam/issues/19238#issuecomment-1161843953

   BEAM-6027 is fixed in #8553
   
   .close-issue

