Posted to issues@beam.apache.org by "Mark Liu (JIRA)" <ji...@apache.org> on 2018/11/30 01:12:00 UTC
[jira] [Commented] (BEAM-6154) Gcsio batch delete broken in Python 3
[ https://issues.apache.org/jira/browse/BEAM-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704106#comment-16704106 ]
Mark Liu commented on BEAM-6154:
--------------------------------
This can be reproduced by simply calling:
{code}
from apache_beam.io.gcp import gcsio
gcsio.GcsIO().delete_batch(['gs://my/gcs/file'])
{code}
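For context, the TypeError in the trace quoted below comes from concatenating a str header with a bytes response body before MIME parsing. The following is a minimal sketch of that failure mode using only the standard library; it does not call apitools or GCS. The helper name parse_batch_response, the utf-8 default, and the sample payload are assumptions made for illustration, not the actual apitools fix.

```python
from email.parser import Parser


def parse_batch_response(header, content, encoding="utf-8"):
    """Hypothetical helper: parse a MIME batch response whose body may be bytes.

    Decoding before concatenation avoids the str + bytes TypeError on Python 3.
    The utf-8 default is an assumption; a real client should use the charset
    reported by the server.
    """
    if isinstance(content, bytes):
        content = content.decode(encoding)
    return Parser().parsestr(header + content)


# Illustrative stand-ins for the header apitools builds and for response.content.
header = "content-type: multipart/mixed; boundary=batch_x\r\n\r\n"
body = (
    b"--batch_x\r\n"
    b"Content-Type: application/http\r\n"
    b"\r\n"
    b"HTTP/1.1 204 No Content\r\n"
    b"\r\n"
    b"--batch_x--\r\n"
)

# Same shape as the failing call in the trace: str + bytes raises on Python 3.
try:
    Parser().parsestr(header + body)
    concat_failed = False
except TypeError:
    concat_failed = True

# With the body decoded first, the multipart response parses normally.
message = parse_batch_response(header, body)
```

The exact wording of the TypeError differs between Python 3 versions, so the sketch only checks that the exception is raised.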
> Gcsio batch delete broken in Python 3
> -------------------------------------
>
> Key: BEAM-6154
> URL: https://issues.apache.org/jira/browse/BEAM-6154
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Mark Liu
> Assignee: Ahmet Altay
> Priority: Major
>
> I'm running the Python SDK against GCP on Python 3.5 and got the following gcsio error while deleting files:
> {code}
> File "/usr/local/lib/python3.5/site-packages/apache_beam/io/iobase.py", line 1077, in <genexpr>
> window.TimestampedValue(v, timestamp.MAX_TIMESTAMP) for v in outputs)
> File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", line 315, in finalize_write
> num_threads)
> File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/util.py", line 145, in run_using_threadpool
> return pool.map(fn_to_execute, inputs)
> File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 266, in map
> return self._map_async(func, iterable, mapstar, chunksize).get()
> File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 644, in get
> raise self._value
> File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 119, in worker
> result = (True, func(*args, **kwds))
> File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
> return list(map(*args))
> File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", line 299, in _rename_batch
> FileSystems.rename(source_files, destination_files)
> File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filesystems.py", line 252, in rename
> return filesystem.rename(source_file_names, destination_file_names)
> File "/usr/local/lib/python3.5/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 229, in rename
> copy_statuses = gcsio.GcsIO().copy_batch(batch)
> File "/usr/local/lib/python3.5/site-packages/apache_beam/io/gcp/gcsio.py", line 322, in copy_batch
> api_calls = batch_request.Execute(self.client._http) # pylint: disable=protected-access
> File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 222, in Execute
> batch_http_request.Execute(http)
> File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 480, in Execute
> self._Execute(http)
> File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 450, in _Execute
> mime_response = parser.parsestr(header + response.content)
> TypeError: Can't convert 'bytes' object to str implicitly
> {code}
> After looking into the related code in the apitools library, I found that response.content, as returned by the HTTP request to GCS, is bytes, and apitools doesn't handle this case. This can block any pipeline that depends on gcsio, and apparently blocks all Dataflow jobs in Python 3.
> This could be another reason to move off the apitools dependency, as tracked in [BEAM-4850|https://issues.apache.org/jira/browse/BEAM-4850].
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)