Posted to github@beam.apache.org by "buraktokman (via GitHub)" <gi...@apache.org> on 2023/02/02 07:56:09 UTC

[GitHub] [beam] buraktokman opened a new issue, #25273: [Bug]:

buraktokman opened a new issue, #25273:
URL: https://github.com/apache/beam/issues/25273

   ### What happened?
   
   We have a pipeline that extracts embeddings (feature vectors) from `images` stored in a Cloud Storage bucket and inserts them into a BigQuery table.
   
   We're consistently getting `SDK harness sdk-0-1 disconnected.` errors when the Dataflow job runs on **N1** type VM instances.
   
       Error message from worker: 
       Data channel closed, unable to send additional data to SDK sdk-0-3
       SDK harness sdk-0-1 disconnected.
       SDK harness sdk-0-2 disconnected.
       SDK harness sdk-0-0 disconnected.
       Data channel closed, unable to receive additional data from SDK sdk-0-3
       SDK harness sdk-0-1 disconnected.
       SDK harness sdk-0-2 disconnected.
       Data channel closed, unable to receive additional data from SDK sdk-0-1
   
   
   **Notes**
   
   **N2** machines work fine but **N1** machines fail, which is somewhat surprising because **N1** is the Google default machine type.
   
   - Jobs run slower on **N1** machines and sometimes appear to fail due to these errors.
   
   - Using a larger VM (more memory, CPU and disk) didn't resolve the errors.
   
   - We also have *another* pipeline that extracts embeddings from `text` using the [LaBSE][1] model, which shows the same errors on both **N1** and **N2** machines.
   
   - Diagnostics tab: `No errors found during this interval.`
   
   We're creating Dataflow job templates (Apache Beam 2.40, Python), storing them on Cloud Storage and using the API to launch new jobs.
   
   - We're **batching** the items before passing them to the stage where embeddings are extracted. Reducing the batch size made no difference.
   - Changing the pipeline option `sdk_worker_parallelism` from 0 (default) to 1 made no difference.
   
   - **Auto-scaling** disabled (`max_worker=1`): same errors.
   
   - **Reshuffle** stage removed from the pipeline:
     - There are still disconnect errors, e.g. `SDK harness sdk-0-0 disconnected.`, but no data channel errors, e.g. `Data channel closed, unable to send additional data to SDK sdk-0-3`.
   
     [1]: https://huggingface.co/sentence-transformers/LaBSE
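
For reference, the worker settings varied in the notes above could be expressed as pipeline flags roughly as follows. This is only a sketch, not the reporter's actual configuration: the helper name `dataflow_debug_flags` and its default values are hypothetical, `--max_num_workers` is assumed to be the option meant by `max_worker`, and the flag spellings should be checked against the installed Beam SDK.

```python
def dataflow_debug_flags(machine_type="n2-standard-4", max_workers=1, sdk_processes=1):
    """Build a flag list that pins the worker shape for a debugging run.

    Hypothetical helper: the defaults are illustrative and are not taken
    from the original report.
    """
    return [
        "--runner=DataflowRunner",
        f"--machine_type={machine_type}",            # e.g. an N1 or N2 machine type
        "--autoscaling_algorithm=NONE",              # disable autoscaling
        f"--num_workers={max_workers}",
        f"--max_num_workers={max_workers}",          # assumed meaning of `max_worker`
        f"--sdk_worker_parallelism={sdk_processes}",
    ]


flags = dataflow_debug_flags(machine_type="n1-standard-4")
# With Beam installed, the list could then be passed on, for example:
#   from apache_beam.options.pipeline_options import PipelineOptions
#   options = PipelineOptions(flags)
```

Keeping every knob in one place makes it easier to change a single setting per run (machine family, worker count, SDK process count) while comparing failures.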
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [X] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Bug]: Dataflow: SDK harness disconnected errors [beam]

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25273:
URL: https://github.com/apache/beam/issues/25273#issuecomment-1745138906

   `SDK harness sdk-0-1 disconnected` messages are a symptom, not a root cause. One needs to look at the other logs preceding the disconnection event.




[GitHub] [beam] tvalentyn commented on issue #25273: [Bug]: Dataflow: SDK harness disconnected errors

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25273:
URL: https://github.com/apache/beam/issues/25273#issuecomment-1424671466

   If you'd like someone to have a closer look at the pipeline or logs, you can reach out to Dataflow customer support.




Re: [I] [Bug]: Dataflow: SDK harness disconnected errors [beam]

Posted by "viniciusdsmello (via GitHub)" <gi...@apache.org>.
viniciusdsmello commented on issue #25273:
URL: https://github.com/apache/beam/issues/25273#issuecomment-1744816884

   Hello folks, has anyone here figured out the root cause?




Re: [I] [Bug]: Dataflow: SDK harness disconnected errors [beam]

Posted by "liferoad (via GitHub)" <gi...@apache.org>.
liferoad commented on issue #25273:
URL: https://github.com/apache/beam/issues/25273#issuecomment-1745788006

   It would be better to open a Google Cloud support ticket so that the team has more details to help debug.




[GitHub] [beam] tvalentyn commented on issue #25273: [Bug]: Dataflow: SDK harness disconnected errors

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25273:
URL: https://github.com/apache/beam/issues/25273#issuecomment-1424669598

   In some cases there may be logs in the worker-startup or worker logger preceding the crash.
   It's difficult to determine what exactly the root cause is without more information about the pipeline or a reproducible example.
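
One way to narrow the search is to pull only the worker and worker-startup entries from the minutes before a disconnect. The sketch below builds a Cloud Logging filter string for that window. The function name, the ten-minute lookback, and the project and job IDs are hypothetical, and the resource type and log names reflect how Dataflow logs are assumed to be labelled, so they should be checked against the Logs Explorer for the actual job.

```python
from datetime import datetime, timedelta, timezone


def worker_log_filter(project_id, job_id, disconnect_time, lookback_minutes=10):
    """Return a Cloud Logging filter for logs preceding a disconnect.

    Assumes Dataflow worker logs use resource type `dataflow_step` and log
    names under `dataflow.googleapis.com/`; `disconnect_time` must be
    timezone-aware.
    """
    start = disconnect_time - timedelta(minutes=lookback_minutes)
    log_names = " OR ".join(
        f'"projects/{project_id}/logs/dataflow.googleapis.com%2F{name}"'
        for name in ("worker", "worker-startup")
    )
    # Separate lines in a Logging filter are combined with an implicit AND.
    return "\n".join([
        'resource.type="dataflow_step"',
        f'resource.labels.job_id="{job_id}"',
        f"logName=({log_names})",
        f'timestamp>="{start.isoformat()}"',
        f'timestamp<="{disconnect_time.isoformat()}"',
    ])


# Placeholder values; substitute the real project, job ID and error timestamp.
when = datetime(2023, 2, 2, 7, 56, 9, tzinfo=timezone.utc)
print(worker_log_filter("my-project", "my-job-id", when))
```

The resulting string can be pasted into the Logs Explorer query box; widening `lookback_minutes` helps when the crash is preceded by slow memory growth rather than a single failing element.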
   




Re: [I] [Bug]: Dataflow: SDK harness disconnected errors [beam]

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25273:
URL: https://github.com/apache/beam/issues/25273#issuecomment-1745140143

   Closing this since there is not enough actionable information on this ticket.




[GitHub] [beam] tvalentyn commented on issue #25273: [Bug]: Dataflow: SDK harness disconnected errors

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25273:
URL: https://github.com/apache/beam/issues/25273#issuecomment-1424659392

   The `SDK harness sdk-0-0 disconnected` error means that something caused the SDK harness process to crash. This is the process that runs the pipeline's user code and where the bulk of the processing happens. The investigation should focus on identifying what causes the crash.
   It can be an OOM event, a crash in a C extension or third-party library, or something else. If processing a particular element causes the process to crash, such elements could potentially be filtered out by using `.with_exception_handling(use_subprocess=True)`, see: https://github.com/apache/beam/blob/64e40d2c018f8e906f4bec32ef67f02734a95721/sdks/python/apache_beam/transforms/core.py#L1441
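
The idea behind subprocess isolation can be illustrated without Beam. The sketch below is not Beam's implementation: it is a standard-library stand-in in which each element is processed in a child interpreter, so a hard crash (simulated here with `os._exit`) ends only the child and the parent can route the element to a "bad" list. The names `WORKER_SOURCE`, `process_isolated` and `split_good_bad`, and the `"poison"` element, are hypothetical.

```python
import subprocess
import sys

# Hypothetical per-element worker. The "poison" input makes it exit abruptly,
# standing in for a segfault in a C extension or an OOM kill: no Python
# exception is raised, so a try/except inside the worker could not catch it.
WORKER_SOURCE = """
import os, sys
element = sys.argv[1]
if element == "poison":
    os._exit(139)
print(element.upper())
"""


def process_isolated(element, timeout_s=30):
    """Process one element in a child interpreter; return (ok, payload)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, element],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if result.returncode != 0:
        return False, f"child exited with code {result.returncode}"
    return True, result.stdout.strip()


def split_good_bad(elements):
    """Separate results from elements whose processing crashed the child."""
    good, bad = [], []
    for element in elements:
        ok, payload = process_isolated(element)
        if ok:
            good.append(payload)
        else:
            bad.append((element, payload))
    return good, bad


good, bad = split_good_bad(["a", "poison", "b"])
```

Starting a new interpreter per element is far too slow for production use; the point is only that the parent survives and learns which input triggered the crash, which is the information needed to find the root cause.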




Re: [I] [Bug]: Dataflow: SDK harness disconnected errors [beam]

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn closed issue #25273: [Bug]: Dataflow: SDK harness disconnected errors
URL: https://github.com/apache/beam/issues/25273

