Posted to user@flink.apache.org by Yang Liu <y....@fetchrewards.com> on 2023/01/20 22:25:33 UTC

Blob server connection problem

Hello,

Is anyone familiar with the "blob server connection"? We constantly see the
"Error while executing BLOB connection" error. When there are too many
connection errors, a job sometimes gets stuck in the middle of a run and
eventually fails. Most of the time the streaming job recovers from that
failure in subsequent runs, but this slows down the entire process. We tried
adjusting blob.fetch.num-concurrent and some other blob parameters, but that
did not help much, so we would like to know what the root cause might be.
Are there any Flink metrics or tools to help us monitor the blob server
connections?

We use:

   - Flink Kubernetes Operator
   - Flink 1.15.3 and 1.16.0
   - Kafka, filesystem(S3)
   - Hudi 0.11.1

Full error message:

java.io.IOException: Unknown operation 71
	at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:116) [flink-dist-1.15.3.jar:1.15.3]
2023-01-19 16:44:37,448 ERROR org.apache.flink.runtime.blob.BlobServerConnection           [] - Error while executing BLOB connection.


Best regards,
Yang

Re: Blob server connection problem

Posted by Matthias Pohl via user <us...@flink.apache.org>.
We had issues like that in the past (e.g. FLINK-24923 [1], FLINK-10683
[2]). The error you're observing is caused by an unexpected byte being read
from the socket. The BlobServer protocol expects the header of each new
message block to be either 0 (for put messages) or 1 (for get messages) [3].
Reading a different value might mean that some other process is sending
data to the port the BlobServer is listening on. Could you check your
network traffic?

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-24923
[2] https://issues.apache.org/jira/browse/FLINK-10683
[3] https://github.com/apache/flink/blob/ab264e4ab5a3bc6961a5128b1c7e19752508a7ca/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServerConnection.java#L115
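
As a rough aid for checking the traffic, the header check described in the
reply above can be sketched in a few lines of Python. The sketch collects the
"Unknown operation N" values from JobManager log lines and decodes each one as
an ASCII character where possible. The 0/1 operation codes come from the
reply; the helper names and the table of "likely senders" are assumptions
made for illustration and are not part of Flink. Byte 71 happens to be ASCII
'G', which would be consistent with something sending a plain-text request
such as an HTTP GET to the blob port, but that is only a hint, not a
diagnosis.

```python
import re

# Operation codes as described in the reply: 0 = put, 1 = get.
PUT_OPERATION = 0
GET_OPERATION = 1

# Hypothetical lookup table (an assumption, not Flink code): first bytes
# of a few common protocols that might hit the port by mistake.
LIKELY_SOURCES = {
    ord("G"): "HTTP GET request (e.g. a health check or scraper)",
    ord("P"): "HTTP POST or PUT request",
    ord("H"): "HTTP HEAD request",
    0x16: "TLS handshake sent to a plaintext port",
}

# Matches the exception message shown in the log excerpt.
UNKNOWN_OP = re.compile(r"Unknown operation (\d+)")


def describe_header_byte(value):
    """Classify the first byte of a message block."""
    if value == PUT_OPERATION:
        return "valid: put"
    if value == GET_OPERATION:
        return "valid: get"
    guess = LIKELY_SOURCES.get(value, "unknown sender")
    if 32 <= value < 127:  # printable ASCII range
        return f"invalid: byte {value} is ASCII {chr(value)!r}; possibly {guess}"
    return f"invalid: byte {value}; possibly {guess}"


def summarize_log(lines):
    """Count each 'Unknown operation N' value and attach a description."""
    counts = {}
    for line in lines:
        match = UNKNOWN_OP.search(line)
        if match:
            value = int(match.group(1))
            counts[value] = counts.get(value, 0) + 1
    return {v: (n, describe_header_byte(v)) for v, n in counts.items()}


if __name__ == "__main__":
    sample = ["java.io.IOException: Unknown operation 71"]
    for value, (count, note) in summarize_log(sample).items():
        print(f"{count}x {note}")
```

If the same byte value shows up repeatedly, it may point to a single
misdirected client (for example a probe or load balancer check configured
against the wrong port). Pinning the port with the blob.server.port option
can make that easier to verify in network captures.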
