Posted to user@flink.apache.org by Caio Aoque <ca...@gmail.com> on 2019/10/30 00:24:57 UTC

Flink batch app occasionally hang

Hi, I've been running some Flink Scala applications on an AWS EMR cluster
(version 5.26.0 with Flink 1.8.0 for Scala 2.11) for a while, and I have
recently started to run into some issues.

I have a Flink app that reads some files from S3, processes them, and saves
some files to S3 and also some records to a database.

The application is not very complex: it has one source that reads a
directory (multiple files) and another that reads a single file, followed
by some grouping and mapping and a left outer join between these 2 sources.

The issue is that occasionally the application gets stuck with only two
tasks running, one finished, and the others never started. The 2 tasks
that keep running forever are source1 (the directory with multiple files)
and the leftouterjoin; source2 (input from a single file) is the one
that finishes. One interesting thing is that there should be several tasks
between source1 and this leftouterjoin, but they remain in the CREATED
state. When the app gets stuck, I usually just kill it and run it again,
which works. The issue used to be infrequent but is becoming more and more
frequent; it's happening almost every day now.

I also have a DEBUG log from a job that didn't work and another one from a
job that worked.

Thanks.

Re: Flink batch app occasionally hang

Posted by vino yang <ya...@gmail.com>.
Hi Caio,

Since the job interacts with external systems, it would be better if you
could provide the full logs.

Best,
Vino

Re: Flink batch app occasionally hang

Posted by Zhu Zhu <re...@gmail.com>.
Hi Caio,

Did you check whether there are enough resources to launch the other nodes?

Could you attach the logs you mentioned, and elaborate on how the tasks are
connected in the topology?


Thanks,
Zhu Zhu
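
One way to gather the task states asked about above is to group the job's
vertices by status. The following is only a minimal sketch, not something
posted in this thread: it assumes the JobManager REST API is reachable, and
that the job details returned under "/jobs/<job id>" contain a "vertices"
list whose entries carry "name" and "status" fields. The function names and
the sample data are hypothetical; the sample simply mirrors the stuck job
described in the original post.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_job_details(rest_url, job_id):
    """Fetch job details as JSON (assumes the REST endpoint is reachable)."""
    with urlopen(f"{rest_url}/jobs/{job_id}") as resp:
        return json.load(resp)


def vertices_by_status(job_details):
    """Group vertex names by reported status, e.g. RUNNING or CREATED."""
    grouped = defaultdict(list)
    for vertex in job_details.get("vertices", []):
        status = vertex.get("status", "UNKNOWN")
        grouped[status].append(vertex.get("name", "?"))
    return dict(grouped)


# Hypothetical response shaped like the stuck job described in the thread;
# the vertex names are made up for illustration.
sample = {
    "vertices": [
        {"name": "source1 (directory)", "status": "RUNNING"},
        {"name": "source2 (single file)", "status": "FINISHED"},
        {"name": "map", "status": "CREATED"},
        {"name": "group", "status": "CREATED"},
        {"name": "leftOuterJoin", "status": "RUNNING"},
    ]
}

for status, names in sorted(vertices_by_status(sample).items()):
    print(status, names)
```

Against a live cluster, the result of fetch_job_details(rest_url, job_id)
would be passed in instead of the sample. Comparing the output of a stuck run
with that of a healthy run shows which vertices never left CREATED.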
