
Posted to user@beam.apache.org by Janek Bevendorff <ja...@uni-weimar.de> on 2022/01/27 14:40:18 UTC

Python pipeline on Flink not finishing cleanly

Hi,

When I run a Python pipeline with multiple concurrent TaskManagers on
Flink, the job hardly ever (if ever) finishes properly. At the end,
Beam (or Flink?) always throws a seemingly random gRPC
IllegalStateException after my last GlobalCombine. Beam then goes into
some weird error-handling mode and eventually fails the job, even though
it should have finished cleanly.

This is only reproducible with parallelism set to at least 5 or 8. With
1-4, I cannot reproduce it reliably, if at all. It looks like a similar
issue has already been reported on Jira 
(https://issues.apache.org/jira/browse/BEAM-8980), but it got marked as 
stale. Is anyone else seeing this? Is there anything I can do? I don't
want my job to restart after it has finished, and I want a clean exit
status. Otherwise I can't really tell whether everything succeeded, and
I don't want to comb through hundreds of log files to find out.

I have added a comment with the stack traces I am getting to the
above-mentioned issue:
https://issues.apache.org/jira/browse/BEAM-8980?focusedCommentId=17483174&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17483174

Janek