You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 21:46:21 UTC

[GitHub] [beam] damccorm opened a new issue, #21180: Dataflow fails to materialize elements over 2GB

damccorm opened a new issue, #21180:
URL: https://github.com/apache/beam/issues/21180

   In some cases when we have some big individual element (e.g. after a Combine) and given a combination of side-inputs afterwards Dataflow might decide to materialize PCollection into temporary storage (on GCS using avrofile with simple "bytes" schema) – this process fails if our element is more than 2GB in size.
   
   Job ID: 2021-09-15_02_14_12-14362936082336076824
   Stacktrace:
   ```
   Error message from worker: Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 651, in do_work work_executor.execute() File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 181, in execute op.finish() File "dataflow_worker/native_operations.py", line 93, in dataflow_worker.native_operations.NativeWriteOperation.finish File "dataflow_worker/native_operations.py", line 94, in dataflow_worker.native_operations.NativeWriteOperation.finish File "dataflow_worker/native_operations.py", line 95, in dataflow_worker.native_operations.NativeWriteOperation.finish File "/usr/local/lib/python3.7/site-packages/dataflow_worker/nativeavroio.py", line 308, in __exit__ self._data_file_writer.flush() File "fastavro/_write.pyx", line 664, in fastavro._write.Writer.flush File "fastavro/_write.pyx", line 639, in fastavro._write.Writer.dump File "fastavro/_write.pyx", line 451, in fastavro._write.snappy_write_bloc
 k File "fastavro/_write.pyx", line 458, in fastavro._write.snappy_write_block File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystemio.py", line 200, in write self._uploader.put(b) File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 720, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset **** size]) File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes header = struct.pack("!i", n) struct.error: 'i' format requires -2147483648 <= number <= 2147483647
   ```
   
   This can be solved via a Reshuffle which forces it to be materialized on a shuffling service instead, which doesn't have this limitation.
   
   Imported from Jira [BEAM-12900](https://issues.apache.org/jira/browse/BEAM-12900). Original Jira may contain additional context.
   Reported by: sadovnychyi.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org