Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 23:51:04 UTC

[GitHub] [beam] kennknowles opened a new issue, #19373: DataflowRunner does not scale when reading gzip file

kennknowles opened a new issue, #19373:
URL: https://github.com/apache/beam/issues/19373

   Hi,
   
   I have a pipeline that uses ReadFromText() to read a 700 MB gzip file from a GCS bucket.
   
   It then parses the JSON, creates BigQuery rows, and writes them with WriteToBigQuery.
   
   This pipeline does not scale. If I specify 2 workers at startup, Dataflow scales the job down to 1 worker and the throughput stays the same. The job takes 30 minutes.
   
   The exact same pipeline, reading the same data as an uncompressed 11 GB file from the same location, scales very well. That job takes only 5 minutes.
   
   Imported from Jira [BEAM-7094](https://issues.apache.org/jira/browse/BEAM-7094). Original Jira may contain additional context.
   Reported by: moander2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] DataflowRunner does not scale when reading gzip file [beam]

Posted by "mareksuscak (via GitHub)" <gi...@apache.org>.
mareksuscak commented on issue #19373:
URL: https://github.com/apache/beam/issues/19373#issuecomment-1779194520

   Cloud Storage [does not support range requests](https://cloud.google.com/storage/docs/transcoding#range) when files are transcoded by its built-in on-the-fly transcoding feature. I researched this a while ago, and although I am not 100% sure now, I recall concluding that this was the main culprit. Plain text files are likely splittable because individual Beam workers can use range requests to fetch only their part of the file.
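The splitting idea in the comment above can be illustrated locally. The following is a minimal sketch that uses only the Python standard library on in-memory data; it is not Beam's or Cloud Storage's actual implementation, and the helper names `split_ranges` and `read_lines_in_range` are hypothetical. It shows how independent readers could each take one byte range of an uncompressed newline-delimited file, and why a reader handed an arbitrary offset into a gzip stream has nothing to resume from.

```python
import gzip
import io
import zlib


def split_ranges(size, num_shards):
    """Divide [0, size) into at most num_shards contiguous byte ranges."""
    step = max(1, -(-size // num_shards))  # ceiling division
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def read_lines_in_range(data, start, end):
    """Return the lines that begin inside [start, end) of an uncompressed buffer.

    A reader whose range starts mid-line skips to the next line start; the
    reader that owns the start of a line reads it fully, even past `end`.
    """
    buf = io.BytesIO(data)
    if start > 0:
        buf.seek(start - 1)
        buf.readline()  # drop the rest of the line that began before `start`
    lines = []
    while buf.tell() < end:
        line = buf.readline()
        if not line:
            break
        lines.append(line)
    return lines


# Hypothetical sample input: 1000 newline-delimited JSON records.
plain = b"".join(b'{"id": %d}\n' % i for i in range(1000))

# Uncompressed: four readers each handle one byte range independently,
# and together they cover every line exactly once.
ranges = split_ranges(len(plain), 4)
shards = [read_lines_in_range(plain, s, e) for s, e in ranges]
covers_all = b"".join(line for shard in shards for line in shard) == plain

# Gzip: a reader starting at an arbitrary offset sees neither the gzip
# header nor the earlier data that later back-references depend on.
compressed = gzip.compress(plain)
mid = len(compressed) // 2
try:
    zlib.decompressobj(31).decompress(compressed[mid:])  # 31 = expect gzip framing
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False

print("plain text shards cover the file:", covers_all)
print("gzip decodable from mid-stream:", mid_stream_ok)
```

A real runner would fetch each range over the network rather than slice a local buffer, but the ownership rule (a line belongs to the range in which it starts) is the part that lets uncompressed input be divided among workers.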

