Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/05 00:30:01 UTC

[GitHub] [beam] damccorm opened a new issue, #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

damccorm opened a new issue, #21598:
URL: https://github.com/apache/beam/issues/21598

   When I run a job with many workers (100 or more) and large shuffle sizes (millions of records and/or several GB), my workers fail unexpectedly with
   ```
   python -m apache_beam.runners.worker.sdk_worker_main
   E0308 12:59:18.067442934     724 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
   Traceback (most recent call last):
     File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
       return _run_code(code, main_globals, None,
     File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
       exec(code, run_globals)
     File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 264, in <module>
       main(sys.argv)
     File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main
       sdk_harness.run()
     File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 234, in run
       for work_request in self._control_stub.Control(get_responses()):
     File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
       return self._next()
     File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
       raise self
   grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
           status = StatusCode.UNAVAILABLE
           details = "Socket closed"
           debug_error_string = "{"created":"@1646744358.118371750","description":"Error received from peer ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
   >
   ```
   
   This is probably related to or even the same as BEAM-12448 or BEAM-6258, but since one of them is already marked as fixed in a previous version and both reports have large tails of unreadable auto-generated comments, I decided to create a new issue.
   
   There is not much more information I can give you, since this is all the error output I get. It's really hard to debug and with the large number of workers I don't even know if the worker reporting the error is actually the one experiencing it.
   
   Imported from Jira [BEAM-14070](https://issues.apache.org/jira/browse/BEAM-14070). Original Jira may contain additional context.
   Reported by: phoerious.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] HuangXingBo commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
HuangXingBo commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1323399311

   We hit the same problem in PyFlink (which is based on the Beam portability framework). It took me some time to locate and reproduce it. The root cause is that the BDP ping period is decided locally on each side, so the client can end up pinging more often than the server tolerates. We can either reduce the `KEEP_ALIVE_TIME` on the server side or increase `grpc.keepalive_time_ms` on the client side. Here is the PR with the fix I made in PyFlink:
   https://github.com/apache/flink/pull/21363
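   The client-side half of that suggestion can be sketched as follows. This is an illustrative example, not Beam's or PyFlink's actual code: gRPC channel options are plain `(name, value)` tuples, so raising the client keepalive interval only means passing a different options list when the channel is created (the 60000 ms value is the one reported to work later in this thread).

   ```python
   # Hypothetical client-side keepalive settings (names are standard gRPC
   # channel arguments; the values are assumptions for illustration).
   CLIENT_KEEPALIVE_OPTIONS = [
       # Ping at most every 60 s instead of the 20 s visible in the
       # "Current keepalive time (before throttling): 20000ms" log line.
       ("grpc.keepalive_time_ms", 60_000),
       # How long to wait for a ping ack before the transport is closed.
       ("grpc.keepalive_timeout_ms", 20_000),
       # Allow keepalive pings even when no RPC is in flight.
       ("grpc.keepalive_permit_without_calls", 1),
   ]

   # The list would then be handed to the channel constructor, e.g.:
   # channel = grpc.insecure_channel("localhost:34305",
   #                                 options=CLIENT_KEEPALIVE_OPTIONS)
   ```

   In Beam's Python SDK the analogous place to pass such options would be `channel_factory.py`, as discussed further down in this thread.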
   




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1360973383

   No luck. During shuffle, it still fails with
   
   ```
   E1220 18:51:26.523480166     219 chttp2_transport.cc:1031]   ipv6:%5B::1%5D:45461: Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings". Current keepalive time (before throttling): 20000ms
   ```




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by "phoerious (via GitHub)" <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1525354582

   Whatever you do, please leave enough wiggle room. Very often these timeouts occur because individual nodes respond slower than usual due to dying hardware, network congestion, etc., which can introduce delays of several seconds or even minutes.




[GitHub] [beam] thdesc commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by "thdesc (via GitHub)" <gi...@apache.org>.
thdesc commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1525306893

   According to this [documentation](https://github.com/grpc/grpc/blob/master/doc/keepalive.md), the server may send GOAWAY with ENHANCE_YOUR_CALM to the client if "the client's `GRPC_ARG_KEEPALIVE_TIME_MS` setting is lower than the server's `GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS`." Therefore, it seems we should consider decreasing `GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS` to 19 seconds, instead of modifying the `KEEP_ALIVE_TIME_SEC` on the server side. What do you think @HuangXingBo? Alternatively, we could increase the value of `grpc.keepalive_time_ms` in `channel_factory.py` to a value higher than `GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS`, which defaults to 300,000 ms. In my case, I chose to set it to 300,001 ms, and I have not encountered the error again.
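   The invariant described in that documentation can be sketched in a few lines. This is a model of the rule only, not gRPC's implementation; the default value is taken from the keepalive guide, and the 300001 ms figure is the one chosen above.

   ```python
   # Default for GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS,
   # per the gRPC keepalive documentation.
   SERVER_MIN_RECV_PING_INTERVAL_MS = 300_000
   # Client value chosen in the comment above (just past the server minimum).
   CLIENT_KEEPALIVE_TIME_MS = 300_001

   def ping_is_a_strike(client_interval_ms: int, server_min_interval_ms: int) -> bool:
       """A client ping that arrives sooner than the server's minimum receive
       interval (with no data in flight) counts as a 'ping strike'; enough
       strikes make the server send GOAWAY with ENHANCE_YOUR_CALM and the
       "too_many_pings" debug string seen throughout this thread."""
       return client_interval_ms < server_min_interval_ms

   # The 20000 ms keepalive time observed in the logs violates the default:
   assert ping_is_a_strike(20_000, SERVER_MIN_RECV_PING_INTERVAL_MS)
   # 300001 ms does not, which matches the report that the error disappeared:
   assert not ping_is_a_strike(CLIENT_KEEPALIVE_TIME_MS, SERVER_MIN_RECV_PING_INTERVAL_MS)
   ```

   Either side of the inequality can be adjusted, which is why both the server-side and client-side fixes discussed in this thread work.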




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1331899009

   The latest version of Beam is much more stable in this regard, but it still happens at times, particularly if a node is a bit oversubscribed.




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1380328984

   > I have played around with all sorts of values in channel_factory.py without success so far. I'm trying grpc.keepalive_time_ms with a value of 60000 now. Let's see. I suppose ServerFactory should also set permitKeepAliveWithoutCalls(true).
   
   60000 seems to be working. Can we make this change upstream, please?




[GitHub] [beam] cozos commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
cozos commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1331619714

   Also running into this with the Spark RDD Runner on the Python SDK Harness:
   
   ```
   22/11/30 03:35:04 INFO Executor: Finished task 505.0 in stage 8.0 (TID 1145). 18572 bytes result sent to driver
   22/11/30 03:35:55 INFO Executor: Finished task 434.0 in stage 8.0 (TID 1074). 18572 bytes result sent to driver
   E1130 03:36:33.423173032    2378 chttp2_transport.cc:1016]             ipv4:127.0.0.1:38291: Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings". Current keepalive time (before throttling): 20000ms
   22/11/30 03:36:33 ERROR py:641: Failed to read inputs in the data plane.
   Traceback (most recent call last):
     File "/databricks/python3/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 634, in _read_inputs
       for elements in elements_iterator:
     File "/databricks/python3/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
       return self._next()
     File "/databricks/python3/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
       raise self
   grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
           status = StatusCode.UNAVAILABLE
           details = "Socket closed"
           debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:38291 {grpc_message:"Socket closed", grpc_status:14, created_time:"2022-11-30T03:36:33.423534531+00:00"}"
   > Traceback (most recent call last):
     File "/databricks/python3/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 634, in _read_inputs
       for elements in elements_iterator:
     File "/databricks/python3/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
       return self._next()
     File "/databricks/python3/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
       raise self
   grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
           status = StatusCode.UNAVAILABLE
           details = "Socket closed"
           debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:38291 {grpc_message:"Socket closed", grpc_status:14, created_time:"2022-11-30T03:36:33.423534531+00:00"}"
   ```
   
   I do have a shuffle/GroupBy, but I don't understand why that would cause this. As I understand it the SDK Harness is only used to execute DoFns, and the GroupBy/shuffle is done on the Spark side, which shouldn't affect the SDK Harness.




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1251139176

   Recent versions of Beam already set `grpc.keepalive_timeout_ms` to 300000 ms, so that can't be the only thing. The issue does depend on overall processing speed, however: if a node is a bit slow or the shuffle just takes a while to process, the job crashes.




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1359538055

   Any progress? Beam is so brittle, it's almost unusable. There should never be a situation where any part of Beam fails a job because some magical number of pings was reached.




[GitHub] [beam] HuangXingBo commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
HuangXingBo commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1332040373

   I think we can decrease the `KEEP_ALIVE_TIME_SEC` of `ServerFactory.java` or increase the `grpc.keepalive_time_ms` in `channel_factory.py` to solve this problem.




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1288690638

   Folks, this is a deal breaker. It happens ALL THE TIME. It is literally impossible to run any large job on Beam with Apache Flink and I have no idea how to fix it. I would submit a pull request, but I need a pointer where to start looking.




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1251120741

   Bump. This is happening to me constantly. If I have too many workers or shuffle a lot of data, it's pretty much impossible to get a job past the first stages as they keep failing with
   
   ```
   E0919 14:35:00.934340277     558 chttp2_transport.cc:1079]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
   2022/09/19 14:35:00 Received signal: terminated
   2022/09/19 14:35:00 Python (worker 2-1) exited: signal: terminated
   ```
   
   either with or without a stacktrace.




[GitHub] [beam] phoerious commented on issue #21598: Beam worker closing gRPC connection with many workers and large shuffle sizes

Posted by GitBox <gi...@apache.org>.
phoerious commented on issue #21598:
URL: https://github.com/apache/beam/issues/21598#issuecomment-1251271798

   I checked the gRPC keepalive guide (https://github.com/grpc/grpc/blob/master/doc/keepalive.md) and tried adding `("grpc.http2.max_ping_strikes", 0)` to the default options in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/channel_factory.py#L24, but without success. I am still seeing these GOAWAY messages when I have too many workers.

