Posted to jira@kafka.apache.org by "Gwen Shapira (JIRA)" <ji...@apache.org> on 2018/07/10 21:47:00 UTC
[jira] [Commented] (KAFKA-7121) Intermittently, Connectors fail to assign tasks and keep retrying every second forever.

    [ https://issues.apache.org/jira/browse/KAFKA-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539264#comment-16539264 ] 

Gwen Shapira commented on KAFKA-7121:
-------------------------------------

Oh, sorry [~yuzhihong@gmail.com], I forgot to update:
We used the 1.1.0 release.
We resolved the issue by setting advertised.host for the Connect workers. The real problem was that the Connect workers couldn't reach the leader over HTTP.
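
For anyone hitting the same symptom, a minimal sketch of the worker settings involved. I am assuming that "advertised.host" above corresponds to the worker property rest.advertised.host.name; the hostname and port values below are placeholders, not the ones from our deployment.

```properties
# Distributed worker config (sketch; values are placeholders).
# Hostname that other workers should use to reach this worker's REST API.
# Without it, the worker may advertise an address its peers cannot reach.
rest.advertised.host.name=connect-worker-1.example.internal
# Port advertised to other workers; assumed here to match the REST listener port.
rest.advertised.port=8083
```

Each worker needs its own value for the hostname, since the point is that the other workers can resolve and reach it.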

There are a few layers of problems here:
1. When the advertised host isn't set, workers end up picking the wrong IP to advertise.
2. When workers can't talk to the leader, the error is completely misleading. We assume that the only reason you can't find the leader is a rebalance, but this is a distributed system: there are 500 reasons why two nodes can't talk to each other.
3. We keep retrying forever in this scenario (and logging 10 times per second). I'm not sure this is the right thing to do.
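
To make point 3 concrete, here is a rough sketch of one alternative to a fixed-interval retry: a capped exponential backoff. This is not what DistributedHerder does today, and the class name, method name, and parameters are hypothetical, chosen only for illustration.

```java
/** Hypothetical helper: capped exponential backoff for retrying a failed request. */
public final class RetryBackoff {
    private RetryBackoff() {}

    /**
     * Returns the delay in milliseconds before retry number `attempt` (0-based).
     * The delay starts at initialMs and doubles per attempt, never exceeding maxMs.
     */
    public static long delayMs(long initialMs, long maxMs, int attempt) {
        if (initialMs <= 0 || maxMs < initialMs || attempt < 0) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = initialMs;
        // Stop doubling once the cap is reached so large attempt counts stay cheap.
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }
}
```

With an initial delay of 1 second and a cap of 60 seconds, the retries would be spaced 1s, 2s, 4s and so on up to 60s, which would also cut down the log volume considerably.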

> Intermittently, Connectors fail to assign tasks and keep retrying every second forever.
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7121
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7121
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: Gwen Shapira
>            Assignee: Konstantine Karantasis
>            Priority: Major
>
> We started a connector, and even though it is in RUNNING status, tasks are not getting assigned:
> {"name":"prod-xxx-v2","connector":{"state":"RUNNING","worker_id":"0.0.0.0:8083"},"tasks":[],"type":"sink"}
> Other connectors are running without issues.
> An attempt to restart the connector returned a 409 status.
> The logs show the following messages, which kept repeating for hours:
> [2018-06-29 20:23:19,288] ERROR Task reconfiguration for prod-xxx-v2 failed unexpectedly, this connector will not be properly reconfigured unless manually triggered. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:956)
> [2018-06-29 20:23:19,289] INFO 10.200.149.201 - - [29/Jun/2018:20:23:19 +0000] "POST /connectors/prod-xxx-v2/tasks?forward=false HTTP/1.1" 409 113 0 (org.apache.kafka.connect.runtime.rest.RestServer:60)
> [2018-06-29 20:23:19,289] INFO 10.200.149.201 - - [29/Jun/2018:20:23:19 +0000] "POST /connectors/prod-xxx-v2/tasks?forward=true HTTP/1.1" 409 113 1 (org.apache.kafka.connect.runtime.rest.RestServer:60)
> [2018-06-29 20:23:19,289] INFO 10.200.149.201 - - [29/Jun/2018:20:23:19 +0000] "POST /connectors/prod-xxx-v2/tasks HTTP/1.1" 409 113 1 (org.apache.kafka.connect.runtime.rest.RestServer:60)
> [2018-06-29 20:23:19,289] ERROR Request to leader to reconfigure connector tasks failed (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1018)
> org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: Cannot complete request because of a conflicting operation (e.g. worker rebalance)
>  at org.apache.kafka.connect.runtime.rest.RestServer.httpRequest(RestServer.java:229)
>  at org.apache.kafka.connect.runtime.distributed.DistributedHerder$18.run(DistributedHerder.java:1015)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)


