Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 18:29:29 UTC

[GitHub] [beam] kennknowles opened a new issue, #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

kennknowles opened a new issue, #18390:
URL: https://github.com/apache/beam/issues/18390

   We have gzipped text files in Google Cloud Storage that have the following metadata headers set:
   
   ```
   
   Content-Encoding: gzip
   Content-Type: application/octet-stream
   
   ```
   
   
   Trying to read these with apache_beam.io.ReadFromText yields the following error:
   
   ```
   
   ERROR:root:Exception while fetching 341565 bytes from position 0 of gs://...-c72fa25a-5d8a-4801-a0b4-54b58c4723ce.gz: Cannot have start index greater than total size
   Traceback (most recent call last):
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in _fetch_to_queue
       value = func(*args)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 610, in _get_segment
       downloader.GetRange(start, end)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 477, in GetRange
       progress, end_byte = self.__NormalizeStartEnd(start, end)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 340, in __NormalizeStartEnd
       'Cannot have start index greater than total size')
   TransferInvalidError: Cannot have start index greater than total size
   
   WARNING:root:Task failed: Traceback (most recent call last):
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py", line 300, in __call__
       result = evaluator.finish_bundle()
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 206, in finish_bundle
       bundles = _read_values_to_bundles(reader)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 196, in _read_values_to_bundles
       read_result = [GlobalWindows.windowed_value(e) for e in reader]
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/concat_source.py", line 79, in read
       range_tracker.sub_range_tracker(source_ix)):
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 155, in read_records
       read_buffer)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 245, in _read_record
       sep_bounds = self._find_separator_bounds(file_to_read, read_buffer)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 190, in _find_separator_bounds
       file_to_read, read_buffer, current_pos + 1):
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 212, in _try_to_ensure_num_bytes_in_buffer
       read_data = file_to_read.read(self._buffer_size)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 460, in read
       self._fetch_to_internal_buffer(num_bytes)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 420, in _fetch_to_internal_buffer
       buf = self._file.read(self._read_size)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 472, in read
       return self._read_inner(size=size, readline=False)
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 516, in _read_inner
       self._fetch_next_if_buffer_exhausted()
     File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 577, in _fetch_next_if_buffer_exhausted
       raise exn
   TransferInvalidError: Cannot have start index greater than total size
   
   ```
   
   
   After removing the Content-Encoding header, the read works fine.
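   
   For illustration, one way to clear that metadata with the official `google-cloud-storage` client (a sketch only; bucket and object names are placeholders):
   
   ```python
   from google.cloud import storage
   
   # Sketch: remove the Content-Encoding metadata so Beam's ranged reads
   # line up with the stored (compressed) object size.
   client = storage.Client()
   blob = client.bucket("my-bucket").blob("myfile.gz")  # hypothetical names
   blob.content_encoding = None
   blob.patch()  # persists the metadata change
   ```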
   
   Imported from Jira [BEAM-1874](https://issues.apache.org/jira/browse/BEAM-1874). Original Jira may contain additional context.
   Reported by: smphhh.




[GitHub] [beam] chavdaparas commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by "chavdaparas (via GitHub)" <gi...@apache.org>.
chavdaparas commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1422729486

   - Per best practices, you can upload the object to GCS with the `Content-Type` set to indicate compression and no `Content-Encoding` at all:
   
   `Content-Encoding:` (not set)
   `Content-Type: application/gzip`
   
   In this case the only thing immediately known about the object is that it is gzip-compressed, with no information about the underlying object type. Moreover, the object is not eligible for decompressive transcoding.
   Reference: https://cloud.google.com/storage/docs/transcoding
   
   Beam's `ReadFromText` with `compression_type=CompressionTypes.GZIP` works fine with the above option:
   
   ```python
   import apache_beam as beam
   from apache_beam.io.filesystem import CompressionTypes
   
   p | "Read GCS File" >> beam.io.ReadFromText(
       file_pattern=file_path,
       compression_type=CompressionTypes.GZIP,
       skip_header_lines=int(skip_header))
   ```
   
   Ways to compress the file:
   1. Implicitly, by specifying `gsutil cp -Z <filename> <bucket>`
   2. Explicitly, by compressing the file first (e.g. `gzip <filename>`) and then uploading it to GCS
   
   For more details on which combinations work, see the table screenshot: https://user-images.githubusercontent.com/27141543/217565746-58d45245-a890-4b43-805a-50a225de77cc.png
   
   
   
   




[GitHub] [beam] kennknowles commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by GitBox <gi...@apache.org>.
kennknowles commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1179313964

   Bringing over some context from https://cloud.google.com/storage/docs/transcoding, it seems like there are the following consistent situations:
   
   1. GCS transcodes and Beam works with this transparently.
      - `Content-encoding: gzip`
      - `Content-type: X`
      - Beam's IO reads it expecting contents to be X. I believe the problem is that GCS serves metadata that results in wrong splits.
   2. GCS does not transcode because the metadata is set to not transcode (current recommendation)
       - `Content-encoding: <empty>`
       - `Content-type: gzip`
       - Beam's IO reads it, and the user specifies gzip or it is autodetected by the IO
   3. GCS does not transcode because the Beam IO requests no transcoding
       - `Content-encoding: gzip`
       - `Content-type: X`
       - Beam's IO passes the header `Accept-Encoding: gzip`
   
   I believe 2 is the only one that works today. I am not sure if 1 is possible. I do think that 3 should be able to work, but needs some implementation.
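   
   As an illustration of what situation 3 looks like outside Beam, here is a minimal sketch using the official `google-cloud-storage` client, whose `raw_download` option is intended to fetch the stored bytes without decompressive transcoding (bucket and object names are hypothetical):
   
   ```python
   import gzip
   
   from google.cloud import storage
   
   # Sketch only: fetch the stored gzip bytes as-is (no decompressive
   # transcoding) and decompress locally, which is roughly what Beam's IO
   # would need to do in situation 3.
   client = storage.Client()
   blob = client.bucket("my-bucket").blob("myfile.txt.gz")  # hypothetical names
   blob.download_to_filename("myfile.txt.gz", raw_download=True)
   
   with gzip.open("myfile.txt.gz", "rt") as f:
       for line in f:
           print(line.rstrip("\n"))
   ```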




[GitHub] [beam] BjornPrime commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by "BjornPrime (via GitHub)" <gi...@apache.org>.
BjornPrime commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1708903300

   Having encountered this while migrating the GCS client, I do not believe the migration will resolve this issue on its own. It seems to be related to how GCSFileSystem handles compressed files.




[GitHub] [beam] daniels-cysiv commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by GitBox <gi...@apache.org>.
daniels-cysiv commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1377705004

   This is still an issue with 2.43.0. Does anyone have a workaround that does not require changing metadata in GCS, and isn't "use the Java SDK"? 




[GitHub] [beam] BjornPrime commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by "BjornPrime (via GitHub)" <gi...@apache.org>.
BjornPrime commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1504051025

   .take-issue




[GitHub] [beam] kennknowles commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by "kennknowles (via GitHub)" <gi...@apache.org>.
kennknowles commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1711642759

   I haven't thought about this in a while, but is there a problem with always passing `Accept-Encoding: gzip`?
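   
   For reference, a rough sketch of what always passing that header would mean at the HTTP level against the GCS JSON API (URL, names, and token are placeholders); per the transcoding docs, a request carrying `Accept-Encoding: gzip` is served the stored compressed bytes:
   
   ```python
   import requests
   
   # Hedged sketch: media download with Accept-Encoding: gzip so that GCS
   # skips decompressive transcoding and returns the stored gzip bytes.
   url = "https://storage.googleapis.com/storage/v1/b/MY_BUCKET/o/MY_OBJECT?alt=media"
   resp = requests.get(
       url,
       headers={
           "Authorization": "Bearer ACCESS_TOKEN",  # placeholder credential
           "Accept-Encoding": "gzip",
       },
       stream=True,
   )
   # decode_content=False keeps the payload compressed, exactly as stored.
   raw_gzip_bytes = resp.raw.read(decode_content=False)
   ```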




Re: [I] Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header [beam]

Posted by "chaitanya1293 (via GitHub)" <gi...@apache.org>.
chaitanya1293 commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-2023956126

   I am encountering a similar issue when uploading my SQL files from GitHub via CI; not sure if this issue has been fixed yet. I tried setting the parameter `headers: |-` with `content-type: application/octet-stream`, but it didn't make any change in the error.




[GitHub] [beam] sqlboy commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by GitBox <gi...@apache.org>.
sqlboy commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1378006456

   The way to fix this is to just use the official Python GCS library and not the GCS client in Beam, assuming you can and it's not some internal usage by Beam. Also, unlike the Beam implementation, the official GCS client is thread-safe; it looks like it has been moved off httplib2.
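   
   For anyone wanting a concrete version of that workaround, here is a hypothetical sketch (names invented, assumes each file fits in memory) that reads a gzip object with the official client inside a `DoFn` instead of going through `ReadFromText`:
   
   ```python
   import gzip
   import io
   
   import apache_beam as beam
   from google.cloud import storage
   
   class ReadGzipFromGcs(beam.DoFn):
       """Hypothetical workaround: fetch and decompress a gzip object with
       the official GCS client, bypassing Beam's GCS IO entirely."""
   
       def process(self, gcs_path):
           bucket_name, blob_name = gcs_path[len("gs://"):].split("/", 1)
           blob = storage.Client().bucket(bucket_name).blob(blob_name)
           # raw_download=True requests the stored bytes, avoiding
           # decompressive transcoding on the server side.
           data = blob.download_as_bytes(raw_download=True)
           with gzip.open(io.BytesIO(data), "rt") as f:
               for line in f:
                   yield line.rstrip("\n")
   ```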




[GitHub] [beam] liferoad commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by "liferoad (via GitHub)" <gi...@apache.org>.
liferoad commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1504054292

   @BjornPrime  is working on fixing #25676, which might fix this issue as well.




[GitHub] [beam] linamartensson commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by GitBox <gi...@apache.org>.
linamartensson commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1177976870

   Is there an update on this? It looks like it has been an issue for years, and while there is a workaround, it's not very satisfying and we don't want to set the content-encoding to the wrong value on GCS.




[GitHub] [beam] Murli16 commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by "Murli16 (via GitHub)" <gi...@apache.org>.
Murli16 commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1425123123

   Hi @kennknowles @sqlboy ,
   
   The option that works correctly so far is as below:
   1. Do an explicit compression of the file: `gzip <file>`
   2. Upload the file to GCS with the correct **content type - application/gzip**
   ```
   gsutil -h "Content-Type:application/gzip" cp sample.csv.gz gs://gcp-sandbox-1-359004/scn4/
   ```
   3. Content-Encoding will not be set:
   ```
   gcloud storage objects describe gs://gcp-sandbox-1-359004/scn4/sample.csv.gz
   
   bucket: gcp-sandbox-1-359004
   contentType: application/gzip
   crc32c: v1lNUQ==
   etag: CLnDx+CIif0CEAE=
   generation: '1675967308358073'
   ```
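   
   For reference, a sketch of steps 1 and 2 from Python with the official `google-cloud-storage` client (bucket and file names taken from the example above):
   
   ```python
   from google.cloud import storage
   
   # Sketch of steps 1-2: upload a file already compressed with `gzip`,
   # setting only Content-Type (no Content-Encoding), mirroring the
   # gsutil command above.
   client = storage.Client()
   bucket = client.bucket("gcp-sandbox-1-359004")
   blob = bucket.blob("scn4/sample.csv.gz")
   blob.upload_from_filename("sample.csv.gz", content_type="application/gzip")
   ```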
   The only caveat here is that the user will not get the benefit of transcoding: when they attempt to download from the bucket, they will get a .gz file.
   
   While we explore this caveat with the client, we wanted to check whether Option 1 mentioned in this comment (https://github.com/apache/beam/issues/18390#issuecomment-1179313964) can be fixed.
   
   That option would give the best of both worlds: Dataflow would be able to read a compressed file, and the user could still take advantage of transcoding.
   
   Please let me know if there is any alternate suggestion.
   
   
   
   
   




[GitHub] [beam] kennknowles commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by GitBox <gi...@apache.org>.
kennknowles commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1379365322

   Thanks for the updates. Seems like the thing that would make this "just work", at some cost on the Dataflow side but saving bandwidth, would be option 3. This should be a fairly easy thing for someone to do as a first issue without knowing Beam too much.




[GitHub] [beam] sqlboy commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Posted by GitBox <gi...@apache.org>.
sqlboy commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1328103970

   Guys this is a major issue.

