Posted to dev@flink.apache.org by Gyula Fóra <gy...@apache.org> on 2016/11/22 09:03:23 UTC

Kafka Sink stuck in cancelling

Hi,

Has anyone ever experienced the Kafka producer getting stuck in cancelling?

I am aware that there were problems with the Kafka consumer before, but I
haven't seen this one yet. It happened to 3 of my jobs simultaneously last
night: they were stuck from about 8 pm to 8 am (not exact times, but you get
the idea of how long).

The JobManager logs don't seem to be very helpful; they just show that all
tasks start cancelling and then switch to cancelled, except for one Kafka sink
task. That one goes into cancelling but only gets cancelled 12 hours later.
On one of the TaskManagers, though, I found this:

2016-11-21 20:22:52,220 INFO  org.apache.flink.yarn.YarnTaskManager                      - Un-registering task and sending final execution state CANCELED to JobManager for task Execute EventProcessors (f030e71787a6dbd7a543e9745c42289d)

2016-11-22 08:49:35,181 WARN  org.apache.kafka.common.network.Selector                      - Error in I/O with kafka17.sto.midasplayer.com/172.25.82.212
java.io.EOFException
	at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
	at org.apache.kafka.common.network.Selector.poll(Selector.java:248)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:192)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:135)
	at java.lang.Thread.run(Thread.java:745)
2016-11-22 08:49:35,183 INFO  org.apache.flink.runtime.taskmanager.Task                     - Sink: Kafka output (2/8) switched to CANCELED


There might have been some network/Kafka issue that caused all 3 jobs to get
stuck at the same time, but I don't know what actually happened.

Any ideas?
Gyula
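
The log timing (the Kafka sender thread's EOFException at 08:49:35,181, then the sink switching to CANCELED two milliseconds later) suggests, though it does not prove, that cancellation was waiting on a blocked Kafka network call. A generic mitigation is to bound how long closing a client may block. The Java sketch below is illustrative only and is not Flink or Kafka code: closeWithTimeout is a hypothetical helper, and the Runnable stands in for something like the producer's close(), assumed here to be able to block on an unresponsive broker.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class BoundedClose {

    /**
     * Hypothetical helper: runs a possibly blocking close action on a helper
     * thread and waits at most timeoutMillis for it. Returns true if the close
     * finished (normally or by throwing) in time, false if it timed out.
     */
    public static boolean closeWithTimeout(Runnable blockingClose, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-close");
            t.setDaemon(true); // a stuck close must not keep the JVM alive
            return t;
        });
        try {
            Future<?> pending = executor.submit(blockingClose);
            try {
                pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // Interrupt the helper; a close blocked in socket I/O may ignore
                // the interrupt, in which case the daemon thread is abandoned.
                pending.cancel(true);
                return false;
            } catch (ExecutionException e) {
                // The close threw, but it did finish; callers may log e.getCause().
                return true;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated close that hangs, e.g. waiting on a broker that never answers.
        Runnable stuck = () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // interrupted by cancel(true); just return
            }
        };
        System.out.println(closeWithTimeout(stuck, 200));     // false
        System.out.println(closeWithTimeout(() -> { }, 200)); // true
    }
}
```

The trade-off is that a timed-out close is abandoned rather than completed, so the connection and any buffered records may leak; the pattern only keeps the cancelling thread from waiting indefinitely.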

Re: Kafka Sink stuck in cancelling

Posted by Stephan Ewen <se...@apache.org>.
Hi Gyula!

Not sure what is happening there. I found that the Kafka code had a few
issues with cancellation and concurrent closing.
Those mostly affected the producer, though.

What Kafka version are you using? 0.8 or 0.9?

Greetings,
Stephan
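
To illustrate the kind of cancellation/concurrent-closing problem mentioned above: when both the cancelling thread and the task thread may close the same client, a simple guard makes close idempotent so the underlying close runs exactly once. This is a generic, hypothetical sketch (CloseOnce is not a Flink or Kafka class), not the actual fix in the Flink Kafka connector.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public final class CloseOnce implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final Runnable closeAction;

    public CloseOnce(Runnable closeAction) {
        this.closeAction = closeAction;
    }

    /** Safe to call from several threads; only the first caller runs closeAction. */
    @Override
    public void close() {
        if (closed.compareAndSet(false, true)) {
            closeAction.run();
        }
    }

    public boolean isClosed() {
        return closed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger closeCalls = new AtomicInteger();
        CloseOnce resource = new CloseOnce(closeCalls::incrementAndGet);

        // Release a "canceller" and a "task" thread at the same moment so they race.
        CountDownLatch start = new CountDownLatch(1);
        Thread canceller = new Thread(() -> { awaitQuietly(start); resource.close(); });
        Thread taskThread = new Thread(() -> { awaitQuietly(start); resource.close(); });
        canceller.start();
        taskThread.start();
        start.countDown();
        canceller.join();
        taskThread.join();

        System.out.println(closeCalls.get()); // 1
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The guard only prevents double closing; it does not help if the single close itself blocks, which is a separate concern.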


Re: Kafka Sink stuck in cancelling

Posted by Gyula Fóra <gy...@gmail.com>.
Ah, sorry, I completely missed the version details. I am using Flink 1.1.3
with the Kafka 0.8 producer.

We haven't had issues with the consumers yet, and this is also the first time
this has happened.

Gyula

On Tue, Nov 22, 2016, 12:15 Till Rohrmann <tr...@apache.org> wrote:

> Hi Gyula,
>
> I'm not aware of any recent issues with the Kafka Producer. However there
> was one with the Kafka Consumer which prevented the proper cancellation (
> https://issues.apache.org/jira/browse/FLINK-5048).
>
> Which version of Flink and which Kafka Producer were you using?
>
> Cheers,
> Till
>

Re: Kafka Sink stuck in cancelling

Posted by Till Rohrmann <tr...@apache.org>.
Hi Gyula,

I'm not aware of any recent issues with the Kafka producer. However, there
was one with the Kafka consumer that prevented proper cancellation
(https://issues.apache.org/jira/browse/FLINK-5048).

Which version of Flink and which Kafka Producer were you using?

Cheers,
Till
