Posted to issues@beam.apache.org by "Andreas Jansson (JIRA)" <ji...@apache.org> on 2018/11/08 20:19:00 UTC
[jira] [Created] (BEAM-6027) Slow DownloaderStream when reading from GCS
Andreas Jansson created BEAM-6027:
-------------------------------------
Summary: Slow DownloaderStream when reading from GCS
Key: BEAM-6027
URL: https://issues.apache.org/jira/browse/BEAM-6027
Project: Beam
Issue Type: Bug
Components: sdk-py-core
Reporter: Andreas Jansson
Assignee: Ahmet Altay
DownloaderStream inherits from io.RawIOBase, whose default .readall() reads in chunks of io.DEFAULT_BUFFER_SIZE. Because read() without a size argument delegates to readall(), this causes extremely slow performance when invoking read() on handles returned by GcsIO().open().
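The chunking behaviour can be observed without GCS or Beam. The following is a minimal sketch, not Beam code: CountingRaw is a hypothetical in-memory stream invented for illustration, and it only records the buffer size that each readinto() call receives when read() is called with no size argument.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical in-memory raw stream that records each readinto() request size."""

    def __init__(self, payload):
        super().__init__()
        self._payload = payload
        self._pos = 0
        self.request_sizes = []

    def readable(self):
        return True

    def readinto(self, b):
        # len(b) is the number of bytes the caller asked for in this request.
        self.request_sizes.append(len(b))
        chunk = self._payload[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


payload = b"x" * (2 * 1024 * 1024)  # 2 MB, matching the file size in this report
raw = CountingRaw(payload)
data = raw.read()  # no size argument, so RawIOBase delegates to readall()

print("requests:", len(raw.request_sizes))
print("distinct request sizes:", set(raw.request_sizes))
print("io.DEFAULT_BUFFER_SIZE:", io.DEFAULT_BUFFER_SIZE)
```

If each of those small requests translates into a separate round trip on the underlying downloader, which is an assumption this sketch does not verify, the total latency would add up quickly for even a small file.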
The following code can take ~60 seconds to download a single 2MB file:
{code:python}
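import time

from apache_beam.io.gcp.gcsio import GcsIO

gcs = GcsIO()
t = time.time()
path = 'gs://my-bucket/my-2MB-file'
with gcs.open(path) as f:
    f.read()  # no size argument, so this ends up in readall()
duration = time.time() - t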
{code}
This monkey patch makes the same download code take <1 second:
{code:python}
from apache_beam.io.gcp import gcsio
from apache_beam.io.filesystemio import DownloaderStream

def downloader_stream_readall(self):
    """Read until EOF, using multiple read() calls."""
    res = bytearray()
    while True:
        data = self.read(gcsio.DEFAULT_READ_BUFFER_SIZE)
        if not data:
            break
        res += data
    if res:
        return bytes(res)
    else:
        return data

DownloaderStream.readall = downloader_stream_readall
{code}
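To illustrate why a larger chunk size helps, here is a self-contained sketch that counts underlying requests with the default readall() and with a large-chunk readall(). FakeRemoteStream, PatchedStream, and LARGE_CHUNK are hypothetical names made up for this example; LARGE_CHUNK is only an assumed stand-in for gcsio.DEFAULT_READ_BUFFER_SIZE, and the sketch simplifies the patch above by always returning bytes.

```python
import io

# Assumed stand-in for gcsio.DEFAULT_READ_BUFFER_SIZE; the real value may differ.
LARGE_CHUNK = 16 * 1024 * 1024


class FakeRemoteStream(io.RawIOBase):
    """Hypothetical stream where every readinto() call models one remote request."""

    def __init__(self, payload):
        super().__init__()
        self._payload = payload
        self._pos = 0
        self.requests = 0

    def readable(self):
        return True

    def readinto(self, b):
        self.requests += 1
        chunk = self._payload[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


class PatchedStream(FakeRemoteStream):
    """Same stream, but readall() asks for LARGE_CHUNK bytes per read() call."""

    def readall(self):
        res = bytearray()
        while True:
            data = self.read(LARGE_CHUNK)
            if not data:
                break
            res += data
        return bytes(res)


payload = b"x" * (2 * 1024 * 1024)  # 2 MB

default_stream = FakeRemoteStream(payload)
default_data = default_stream.read()

patched_stream = PatchedStream(payload)
patched_data = patched_stream.read()

print("default readall requests:", default_stream.requests)
print("patched readall requests:", patched_stream.requests)
```

With the large chunk, the whole payload fits in one request and a second request detects EOF, whereas the default readall() issues one request per io.DEFAULT_BUFFER_SIZE bytes.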