Posted to issues@flink.apache.org by "Thai Luong (Jira)" <ji...@apache.org> on 2024/01/15 10:47:00 UTC

[jira] [Comment Edited] (FLINK-20284) Error happens in TaskExecutor when closing JobMaster connection if there was a python UDF

    [ https://issues.apache.org/jira/browse/FLINK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806739#comment-17806739 ] 

Thai Luong edited comment on FLINK-20284 at 1/15/24 10:46 AM:
--------------------------------------------------------------

Dear [~hxbks2ks], [~dian.fu], [~zhuzh],

This issue still occurs with FlinkRunner 1.16 and Flink 1.16.3.

I am sharing my `docker-compose` file below for further investigation.

```
version: '3.8'
services:
  jobmanager:
    image: flink:1.16.3-java11
    networks:
      - flink-network
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=jobmanager.rpc.address: jobmanager
        jobmanager.memory.process.size: 2g
      - BEAM_WORKER_POOL_IN_DOCKER_VM=1
      - DOCKER_MAC_CONTAINER=1

  taskmanager:
    image: flink:1.16.3-java11
    networks:
      - flink-network
    ports:
      - "8100-8200:8100-8200"
    depends_on:
      - jobmanager
      - beamworker
    command: taskmanager
    scale: 1
    environment:
      - |
        FLINK_PROPERTIES=jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 1
        taskmanager.memory.process.size: 2g
      - BEAM_WORKER_POOL_IN_DOCKER_VM=1
      - DOCKER_MAC_CONTAINER=1

  beamworker:
    image: apache/beam_python3.11_sdk:latest
    networks:
      - flink-network
    entrypoint: /opt/apache/beam/boot
    command: --worker_pool
    volumes:
      - ./data/sample.txt:/data/sample.txt
      - ./results:/results

networks:
  flink-network:
    name: flink-network
```
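
For reference, a pipeline could be pointed at this setup roughly as follows. This is only a sketch under assumptions not stated in the comment: the job is submitted from the host through the published port 8081, and the worker pool listens on port 50000, which I believe is the default for the SDK container's `--worker_pool` mode. Since `taskmanager` and `beamworker` share `flink-network`, the task manager would reach the pool at `beamworker:50000`. The names `flink_runner_args` and `run_sample` are hypothetical.

```python
def flink_runner_args(flink_master="localhost:8081",
                      worker_pool="beamworker:50000"):
    """Build Beam pipeline flags for the compose setup above.

    Assumptions: the job is submitted from the host via the published
    8081 port, and the worker pool listens on port 50000.
    """
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        # EXTERNAL points the runner at an already-running worker pool.
        "--environment_type=EXTERNAL",
        f"--environment_config={worker_pool}",
    ]


def run_sample(args):
    """Run a tiny pipeline containing one Python function (the Map step)."""
    # Imported lazily so the flag builder works without apache_beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(args)) as pipeline:
        (pipeline
         | beam.Create(["a", "b", "c"])
         | beam.Map(str.upper))
```

Calling `run_sample(flink_runner_args())` would submit the job. The worker pool address has to be resolvable from the task manager container, which is why the service name is used rather than `localhost`.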


was (Author: JIRAUSER303803):
Dear [~hxbks2ks], [~dian.fu],

This issue still occurs with FlinkRunner 1.16 and Flink 1.16.3.

I am sharing my `docker-compose` file below for further investigation.

```
version: '3.8'
services:
  jobmanager:
    image: flink:1.16.3-java11
    networks:
      - flink-network
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=jobmanager.rpc.address: jobmanager
        jobmanager.memory.process.size: 2g
      - BEAM_WORKER_POOL_IN_DOCKER_VM=1
      - DOCKER_MAC_CONTAINER=1

  taskmanager:
    image: flink:1.16.3-java11
    networks:
      - flink-network
    ports:
      - "8100-8200:8100-8200"
    depends_on:
      - jobmanager
      - beamworker
    command: taskmanager
    scale: 1
    environment:
      - |
        FLINK_PROPERTIES=jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 1
        taskmanager.memory.process.size: 2g
      - BEAM_WORKER_POOL_IN_DOCKER_VM=1
      - DOCKER_MAC_CONTAINER=1

  beamworker:
    image: apache/beam_python3.11_sdk:latest
    networks:
      - flink-network
    entrypoint: /opt/apache/beam/boot
    command: --worker_pool
    volumes:
      - ./data/sample.txt:/data/sample.txt
      - ./results:/results

networks:
  flink-network:
    name: flink-network
```

> Error happens in TaskExecutor when closing JobMaster connection if there was a python UDF
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-20284
>                 URL: https://issues.apache.org/jira/browse/FLINK-20284
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Python
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Assignee: Huang Xingbo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.3, 1.12.0
>
>
> When a TaskExecutor has successfully finished running a Python UDF task and disconnects from the JobMaster, the errors below occur. This error, however, does not seem to affect job execution at the moment.
> {code:java}
> 2020-11-20 17:05:21,932 INFO  org.apache.beam.runners.fnexecution.logging.GrpcLoggingService [] - 1 Beam Fn Logging clients still connected during shutdown.
> 2020-11-20 17:05:21,938 WARN  org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer        [] - Hanged up for unknown endpoint.
> 2020-11-20 17:05:22,126 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: Custom Source -> select: (f0) -> select: (add_one(f0) AS a) -> to: Tuple2 -> Sink: Streaming select table sink (1/1)#0 (b0c2104dd8f87bb1caf0c83586c22a51) switched from RUNNING to FINISHED.
> 2020-11-20 17:05:22,126 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Freeing task resources for Source: Custom Source -> select: (f0) -> select: (add_one(f0) AS a) -> to: Tuple2 -> Sink: Streaming select table sink (1/1)#0 (b0c2104dd8f87bb1caf0c83586c22a51).
> 2020-11-20 17:05:22,128 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Un-registering task and sending final execution state FINISHED to JobManager for task Source: Custom Source -> select: (f0) -> select: (add_one(f0) AS a) -> to: Tuple2 -> Sink: Streaming select table sink (1/1)#0 b0c2104dd8f87bb1caf0c83586c22a51.
> 2020-11-20 17:05:22,156 INFO  org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=384.000mb (402653174 bytes), taskOffHeapMemory=0 bytes, managedMemory=512.000mb (536870920 bytes), networkMemory=128.000mb (134217730 bytes)}, allocationId: b67c3307dcf93757adfb4f0f9f7b8c7b, jobId: d05f32162f38ec3ec813c4621bc106d9).
> 2020-11-20 17:05:22,157 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job d05f32162f38ec3ec813c4621bc106d9 from job leader monitoring.
> 2020-11-20 17:05:22,157 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close JobManager connection for job d05f32162f38ec3ec813c4621bc106d9.
> 2020-11-20 17:05:23,064 ERROR org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.rejectedExecution [] - Failed to submit a listener notification task. Event loop shut down?
> java.lang.NoClassDefFoundError: org/apache/beam/vendor/grpc/v1p26p0/io/netty/util/concurrent/GlobalEventExecutor$2
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.GlobalEventExecutor.startThread(GlobalEventExecutor.java:227) ~[blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.GlobalEventExecutor.execute(GlobalEventExecutor.java:215) ~[blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:841) [blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:498) [blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) [blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604) [blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.setSuccess(DefaultPromise.java:96) [blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1089) [blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [blob_p-bd7a5d615512eb8a2e856e7c1630a0c22fca7cf3-ff27946fda7e2b8cb24ea56d505b689e:1.12-SNAPSHOT]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_261]
> Caused by: java.lang.ClassNotFoundException: org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.GlobalEventExecutor$2
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_261]
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_261]
>         at org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:63) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>         at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:72) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>         at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:49) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_261]
>         ... 11 more
> {code}
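
To check whether a TaskManager log contains the same error, something like the following could be used. It is a minimal sketch, not part of the original report: the function name `find_shutdown_errors` is hypothetical, and the pattern assumes the error keeps the same shape as in the quoted trace, with only the vendored gRPC version (`v1p26p0` above) varying between Beam releases.

```python
import re

# Signature of the error in the quoted trace, for any vendored gRPC version.
SIGNATURE = re.compile(
    r"NoClassDefFoundError: org/apache/beam/vendor/grpc/v\d+p\d+p\d+/"
    r"io/netty/util/concurrent/GlobalEventExecutor"
)


def find_shutdown_errors(log_lines):
    """Return (line_number, line) pairs that match the error signature."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if SIGNATURE.search(line)
    ]
```

It accepts any iterable of lines, so an open log file can be passed directly, for example `find_shutdown_errors(open(path))` with `path` pointing at the TaskManager log.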



--
This message was sent by Atlassian Jira
(v8.20.10#820010)