Posted to jira@kafka.apache.org by "Randall Hauch (Jira)" <ji...@apache.org> on 2020/01/08 19:20:00 UTC

[jira] [Commented] (KAFKA-9385) Connect cluster: connector task repeat like a splitbrain cluster problem

    [ https://issues.apache.org/jira/browse/KAFKA-9385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010955#comment-17010955 ] 

Randall Hauch commented on KAFKA-9385:
--------------------------------------

[~kaikai.hou], what version of Apache Kafka Connect are you using? AK 2.4.0 includes several fixes that avoid splitbrain and zombie tasks (see KAFKA-9184), and although those fixes have been backported to the {{2.3}} branch, AK 2.3.2 has not yet been released.

If you used an AK version prior to 2.4.0, could you try AK 2.4.0 and see whether the same problem persists?
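As a quick check, the running worker's version is reported by the Connect REST API root endpoint (e.g. {{GET http://<worker>:8083/}}). A minimal sketch that parses such a response; the JSON body below is a made-up sample, not output from the reporter's cluster:

```python
import json

# Made-up sample of the Connect REST root-endpoint response; the real
# response contains the actual worker version and commit hash.
sample = '{"version": "2.3.1", "commit": "abcdef0123456789"}'

info = json.loads(sample)
# Compare only the major.minor components against 2.4.
version = tuple(int(p) for p in info["version"].split(".")[:2])
if version < (2, 4):
    print(f"Worker reports {info['version']}; the KAFKA-9184 fixes ship in AK 2.4.0+")
```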

If you did use AK 2.4.0, then this might be an issue that was not fixed in KAFKA-9184, and to properly identify and solve the problem we'd need more information:
# What is your worker configuration? Ideally you can provide sanitized worker config properties files, or if that's not practical, the log lines from each worker process that show the worker config.
# Have you seen INFO-level log messages that include "Broker coordinator was unreachable" and/or DEBUG-level log messages that include phrases like "lost tasks"?
# Upload a DEBUG-level log from all workers covering the problematic split-brain episode, plus some number of lines before and after (see KAFKA-9184 for a similar summary).
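For reference, the duplicate assignment described in the report can be confirmed mechanically once you know which tasks each pod is actually running (e.g. from each pod's log output). A minimal sketch; the pod and task names below simply echo the reporter's description and are illustrative, not real data:

```python
from collections import defaultdict

# Task-to-pod assignments as observed on each worker; illustrative values
# echoing the reporter's description.
observed = {
    "PodA": ["task_connector_1_0", "task_connector_3_0"],
    "PodB": ["task_connector_1_0", "task_connector_2_0", "task_connector_3_0"],
}

# After a healthy rebalance each task runs on exactly one worker, so any
# task owned by more than one pod is a duplicate ("zombie") task.
owners = defaultdict(set)
for pod, tasks in observed.items():
    for task in tasks:
        owners[task].add(pod)

duplicates = {task: sorted(pods) for task, pods in owners.items() if len(pods) > 1}
print(duplicates)
```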

Thanks in advance!
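For context, the environment variables in the quoted DeploymentConfig translate into a distributed-worker properties file roughly like the one below (values taken from the report; the JSON converter classes are the usual Debezium image defaults and are an assumption, not confirmed by the reporter):

{code}
# Sanitized distributed worker configuration (illustrative)
bootstrap.servers=192.168.100.228:9092
group.id=test-cloud
config.storage.topic=base.test-cloud.config
offset.storage.topic=base.test-cloud.offset
status.storage.topic=base.test-cloud.status
# Converter classes assumed from the Debezium connect image defaults
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
producer.max.request.size=20971520
{code}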

> Connect cluster: connector task repeat like a splitbrain cluster problem 
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-9385
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9385
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: kaikai.hou
>            Priority: Major
>         Attachments: 12_31_d8c7j_1.jpg
>
>
> I am using Debezium and found a duplicated-task problem. [Jump|https://issues.redhat.com/browse/DBZ-1573?jql=key%20in%20watchedIssues()]
>  
> 1. I push the Debezium image to our private image repository.
> 2. Deploy the connect cluster with the following *Deployment Config*:
> {code:yaml}
> apiVersion: apps.openshift.io/v1
> kind: DeploymentConfig
> metadata:
>   annotations:
>     openshift.io/generated-by: OpenShiftWebConsole
>   creationTimestamp: '2019-10-14T07:45:41Z'
>   generation: 29
>   labels:
>     app: debezium-test-cloud
>   name: debezium-test-cloud
>   namespace: test
>   resourceVersion: '168496156'
>   selfLink: >-
>     /apis/apps.openshift.io/v1/namespaces/test/deploymentconfigs/debezium-test-cloud
>   uid: 9f4f8f4d-ee56-11e9-a5a1-00163e0e008f
> spec:
>   replicas: 2
>   selector:
>     app: debezium-test-cloud
>     deploymentconfig: debezium-test-cloud
>   strategy:
>     activeDeadlineSeconds: 21600
>     resources: {}
>     rollingParams:
>       intervalSeconds: 1
>       maxSurge: 25%
>       maxUnavailable: 25%
>       timeoutSeconds: 600
>       updatePeriodSeconds: 1
>     type: Rolling
>   template:
>     metadata:
>       annotations:
>         openshift.io/generated-by: OpenShiftWebConsole
>       creationTimestamp: null
>       labels:
>         app: debezium-test-cloud
>         deploymentconfig: debezium-test-cloud
>     spec:
>       containers:
>         - env:
>             - name: BOOTSTRAP_SERVERS
>               value: '192.168.100.228:9092'
>             - name: GROUP_ID
>               value: test-cloud
>             - name: CONFIG_STORAGE_TOPIC
>               value: base.test-cloud.config
>             - name: OFFSET_STORAGE_TOPIC
>               value: base.test-cloud.offset
>             - name: STATUS_STORAGE_TOPIC
>               value: base.test-cloud.status
>             - name: CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE
>               value: 'true'
>             - name: CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE
>               value: 'true'
>             - name: CONNECT_PRODUCER_MAX_REQUEST_SIZE
>               value: '20971520'
>             - name: CONNECT_DATABASE_HISTORY_KAFKA_RECOVERY_POLL_INTERVAL_MS
>               value: '1000'
>             - name: HEAP_OPTS
>               value: '-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0'
>           image: 'registry.cn-hangzhou.aliyuncs.com/eshine/debeziumconnect:1.0.0.Beta2'
>           imagePullPolicy: IfNotPresent
>           name: debezium-test-cloud
>           ports:
>             - containerPort: 8083
>               protocol: TCP
>             - containerPort: 8778
>               protocol: TCP
>             - containerPort: 9092
>               protocol: TCP
>             - containerPort: 9779
>               protocol: TCP
>           resources:
>             limits:
>               cpu: 400m
>               memory: 1Gi
>             requests:
>               cpu: 200m
>               memory: 1Gi
>           terminationMessagePath: /dev/termination-log
>           terminationMessagePolicy: File
>           volumeMounts:
>             - mountPath: /kafka/config
>               name: debezium-test-cloud-1
>             - mountPath: /kafka/data
>               name: debezium-test-cloud-2
>             - mountPath: /kafka/logs
>               name: debezium-test-cloud-3
>       dnsPolicy: ClusterFirst
>       restartPolicy: Always
>       schedulerName: default-scheduler
>       securityContext: {}
>       terminationGracePeriodSeconds: 30
>       volumes:
>         - emptyDir: {}
>           name: debezium-test-cloud-1
>         - emptyDir: {}
>           name: debezium-test-cloud-2
>         - emptyDir: {}
>           name: debezium-test-cloud-3
>   test: false
>   triggers:
>     - type: ConfigChange
> status:
>   availableReplicas: 2
>   conditions:
>     - lastTransitionTime: '2019-11-25T06:44:30Z'
>       lastUpdateTime: '2019-11-25T06:44:44Z'
>       message: replication controller "debezium-test-cloud-15" successfully rolled out
>       reason: NewReplicationControllerAvailable
>       status: 'True'
>       type: Progressing
>     - lastTransitionTime: '2019-12-31T10:06:23Z'
>       lastUpdateTime: '2019-12-31T10:06:23Z'
>       message: Deployment config has minimum availability.
>       status: 'True'
>       type: Available
>   details:
>     causes:
>       - type: Manual
>     message: manual change
>   latestVersion: 15
>   observedGeneration: 29
>   readyReplicas: 2
>   replicas: 2
>   unavailableReplicas: 0
>   updatedReplicas: 2
> {code}
> 3. Connect cluster in OpenShift: one service with two pods (PodA and PodB).
> 4. Observed sequence:
>      a) task_connector_1_0 and task_connector_3_0 were running in PodA; task_connector_2_0 was running in PodB.
>      b) Then PodA's console showed the error log in attachment "12_31_d8c7j_1.jpg":
>         !12_31_d8c7j_1.jpg!
>      c) Then a rebalance started.
>      d) However, in PodB all tasks (task_connector_1_0, task_connector_2_0, task_connector_3_0) were running, while PodA was still running task_connector_1_0 and task_connector_3_0.
>      e) So the duplicated tasks appeared.
>  
>     



--
This message was sent by Atlassian Jira
(v8.3.4#803005)