You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "Ahmet Altay (JIRA)" <ji...@apache.org> on 2017/08/03 04:14:00 UTC
[jira] [Commented] (BEAM-1874) Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

    [ https://issues.apache.org/jira/browse/BEAM-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112160#comment-16112160 ] 

Ahmet Altay commented on BEAM-1874:
-----------------------------------

Can we close this issue?

> Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header
> -----------------------------------------------------------------------------------------
>
>                 Key: BEAM-1874
>                 URL: https://issues.apache.org/jira/browse/BEAM-1874
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>    Affects Versions: 0.6.0
>            Reporter: Samuli Holopainen
>            Assignee: Charles Chen
>
> We have gzipped text files in Google Cloud Storage that have the following metadata headers set:
> Content-Encoding: gzip
> Content-Type: application/octet-stream
> Trying to read these with apache_beam.io.ReadFromText yields the following error:
> ERROR:root:Exception while fetching 341565 bytes from position 0 of gs://...-c72fa25a-5d8a-4801-a0b4-54b58c4723ce.gz: Cannot have start index greater than total size
> Traceback (most recent call last):
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in _fetch_to_queue
>     value = func(*args)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 610, in _get_segment
>     downloader.GetRange(start, end)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 477, in GetRange
>     progress, end_byte = self.__NormalizeStartEnd(start, end)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 340, in __NormalizeStartEnd
>     'Cannot have start index greater than total size')
> TransferInvalidError: Cannot have start index greater than total size
> WARNING:root:Task failed: Traceback (most recent call last):
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py", line 300, in __call__
>     result = evaluator.finish_bundle()
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 206, in finish_bundle
>     bundles = _read_values_to_bundles(reader)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 196, in _read_values_to_bundles
>     read_result = [GlobalWindows.windowed_value(e) for e in reader]
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/concat_source.py", line 79, in read
>     range_tracker.sub_range_tracker(source_ix)):
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 155, in read_records
>     read_buffer)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 245, in _read_record
>     sep_bounds = self._find_separator_bounds(file_to_read, read_buffer)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 190, in _find_separator_bounds
>     file_to_read, read_buffer, current_pos + 1):
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 212, in _try_to_ensure_num_bytes_in_buffer
>     read_data = file_to_read.read(self._buffer_size)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 460, in read
>     self._fetch_to_internal_buffer(num_bytes)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 420, in _fetch_to_internal_buffer
>     buf = self._file.read(self._read_size)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 472, in read
>     return self._read_inner(size=size, readline=False)
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 516, in _read_inner
>     self._fetch_next_if_buffer_exhausted()
>   File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 577, in _fetch_next_if_buffer_exhausted
>     raise exn
> TransferInvalidError: Cannot have start index greater than total size
> After removing the Content-Encoding header the read works fine.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)