Posted to github@beam.apache.org by "Murli16 (via GitHub)" <gi...@apache.org> on 2023/02/10 03:24:14 UTC

[GitHub] [beam] Murli16 commented on issue #18390: Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

Murli16 commented on issue #18390:
URL: https://github.com/apache/beam/issues/18390#issuecomment-1425123123

   Hi @kennknowles @sqlboy ,
   
   The option that works correctly so far is as follows:
   1. Compress the file explicitly: gzip <file>
   2. Upload the file to GCS with the correct **content type: application/gzip**
   ```
   gsutil -h "Content-Type:application/gzip" cp sample.csv.gz gs://gcp-sandbox-1-359004/scn4/
   ```
   3. Content-Encoding will not be set on the object, as the describe output shows:
   ```
   gcloud storage objects describe gs://gcp-sandbox-1-359004/scn4/sample.csv.gz
   
   bucket: gcp-sandbox-1-359004
   contentType: application/gzip
   crc32c: v1lNUQ==
   etag: CLnDx+CIif0CEAE=
   generation: '1675967308358073'
   ```
   The only caveat is that the user loses the benefit of transcoding: anyone who downloads the object from the bucket gets the .gz file and has to decompress it themselves.
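   
   A minimal local sketch of step 1 and of what the caveat means in practice. The file name and contents are placeholders chosen for illustration, and nothing here touches GCS; it only shows that the object is plain gzip bytes which the consumer must decompress by hand:
   ```shell
   # Illustrative sample file; the name and contents are placeholders.
   printf 'id,name\n1,alpha\n2,beta\n' > sample.csv

   # Step 1: compress explicitly, keeping the original for comparison.
   gzip -c sample.csv > sample.csv.gz

   # Check archive integrity before uploading.
   gzip -t sample.csv.gz

   # Without transcoding, a downloader has to decompress by hand.
   gzip -dc sample.csv.gz > roundtrip.csv
   cmp sample.csv roundtrip.csv
   ```
   If `cmp` exits silently, the decompressed copy matches the original byte for byte.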
   
   While we explore this caveat with the client, we wanted to check whether Option 1 from this comment (https://github.com/apache/beam/issues/18390#issuecomment-1179313964) can be fixed.
   
   That option would give the best of both worlds: Dataflow would be able to read the compressed file, and the user could still benefit from transcoding.
   
   Please let me know if you have any alternative suggestions.
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org