You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by "lostluck (via GitHub)" <gi...@apache.org> on 2023/03/18 00:38:14 UTC

[GitHub] [beam] lostluck opened a new issue, #25892: [Feature Request][Go SDK]: Improve TextIO splitting performance

lostluck opened a new issue, #25892:
URL: https://github.com/apache/beam/issues/25892

   ### What would you like to happen?
   
   Go TextIO (and future FileIOs) should be updated to perform splitting to match the Java and Python SDF splitting strategies. In particular, have a non-linear chunk size.
   
   See https://github.com/apache/beam/pull/25871/files and https://github.com/apache/beam/blob/abc8099d71b5edad9493c669b5f467e46013b204/sdks/python/apache_beam/io/iobase.py#L894
   
   Specifically:
   A simple linear split isn't appropriate for most data processing. Splitting a multi-gigabyte file into a large number of fixed sized chunks can lead to poor performance as each restriction may need to start the file from scratch.
   
   The java and python linked above employ a square root to get a rough "rule of thumb" for splits to avoid over splitting and have more meaningful chunk sizes, and as a fallback, a 64MB chunk size.
   
   From the Java, where the sizes are in bytes:
   
   ```
   // 1mb --> 1 shard; 1gb --> 32 shards; 1tb --> 1000 shards, 1pb --> 32k shards
   desiredChunkSize = Math.max(1 << 20, (long) (1000 * Math.sqrt(estimatedSize)));
   ```
   
   This allows for larger files to have larger chunks, and reduce overheads related to opening the file repeatedly and similar.
   
   This approach bottoms out at a 1MB chunk size, but 64MB get eight 8MB chunks, while a 6.4GB file gets eighty 80MB chunks, and similar.
   
   ### Issue Priority
   
   Priority: 2 (default / most feature requests should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [X] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org