Posted to issues@beam.apache.org by "Andreas Jansson (JIRA)" <ji...@apache.org> on 2018/11/08 20:19:00 UTC

[jira] [Created] (BEAM-6027) Slow DownloaderStream when reading from GCS

Andreas Jansson created BEAM-6027:
-------------------------------------

             Summary: Slow DownloaderStream when reading from GCS
                 Key: BEAM-6027
                 URL: https://issues.apache.org/jira/browse/BEAM-6027
             Project: Beam
          Issue Type: Bug
          Components: sdk-py-core
            Reporter: Andreas Jansson
            Assignee: Ahmet Altay


DownloaderStream inherits from io.RawIOBase, whose default .readall() implementation reads in chunks of io.DEFAULT_BUFFER_SIZE. This makes read() extremely slow on handles returned by GcsIO().open().

The following code can take ~60 seconds to download a single 2MB file:

{code:python}
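import time

from apache_beam.io.gcp.gcsio import GcsIO

gcs = GcsIO()
t = time.time()
path = 'gs://my-bucket/my-2MB-file'
with gcs.open(path) as f:
    f.read()  # read() with no size argument delegates to readall()
duration = time.time() - t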
{code}

This monkey patch makes the same download code take <1 second:

{code:python}
from apache_beam.io.gcp import gcsio
from apache_beam.io.filesystemio import DownloaderStream

def downloader_stream_readall(self):
    """Read until EOF, using multiple read() calls."""
    res = bytearray()
    while True:
        data = self.read(gcsio.DEFAULT_READ_BUFFER_SIZE)
        if not data:
            break
        res += data
    if res:
        return bytes(res)
    else:
        return data

DownloaderStream.readall = downloader_stream_readall
{code}
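
The chunking behaviour can be observed locally, without GCS. The following is a minimal sketch only: CountingStream and readall_with_buffer are hypothetical names, not Beam code, and the 16 MiB buffer is an assumed stand-in for gcsio.DEFAULT_READ_BUFFER_SIZE. It counts how many readinto() calls a 2 MiB read needs with the default readall() versus a loop that uses a larger buffer.

```python
import io


class CountingStream(io.RawIOBase):
    """Hypothetical in-memory stand-in for a remote stream; counts readinto() calls."""

    def __init__(self, payload):
        super().__init__()
        self._payload = payload
        self._pos = 0
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Each call here stands in for one fetch from the remote source.
        self.calls += 1
        chunk = self._payload[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def readall_with_buffer(stream, buffer_size):
    """Read until EOF using read(buffer_size), mirroring the monkey patch above."""
    res = bytearray()
    while True:
        data = stream.read(buffer_size)
        if not data:
            break
        res += data
    return bytes(res)


payload = b"x" * (2 * 1024 * 1024)

# Default path: read() with no size calls RawIOBase.readall().
default_stream = CountingStream(payload)
default_data = default_stream.read()

# Larger buffer (assumed 16 MiB for illustration).
large_stream = CountingStream(payload)
large_data = readall_with_buffer(large_stream, 16 * 1024 * 1024)

print("default readall() calls:", default_stream.calls)
print("large-buffer calls:", large_stream.calls)
```

If each readinto() call corresponds to a separate round trip, as the timings in this report suggest, the difference in call count would account for the slowdown.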



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)