Posted to user@flink.apache.org by "Sathya Hariesh Prakash (sathypra)" <sa...@cisco.com> on 2017/11/08 03:02:47 UTC
Flink HA Zookeeper Connection Timeout
Hi, we’re currently testing Flink HA and running into a ZooKeeper connection timeout. The error log is below.
Is there a production checklist, or any information on the Flink HA parameters that I need to pay attention to?
Any pointers would really help. Please let me know if any additional information is needed. Thanks!
NOTE: I see multiple connection timeout messages, with different elapsed times.
{
"timeMillis":1510095254557,
"thread":"Curator-Framework-0",
"level":"ERROR",
"loggerName":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
"message":"Connection timed out for connection string (zookeeper.system.svc.cluster.local:2181) and timeout (15000) / elapsed (15004)",
"thrown":{
"commonElementCount":0,
"localizedMessage":"KeeperErrorCode = ConnectionLoss",
"message":"KeeperErrorCode = ConnectionLoss",
"name":"org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException",
"extendedStackTrace":[
{
"class":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
"method":"checkTimeouts",
"file":"ConnectionState.java",
"line":197,
"exact":true,
"location":"flink-runtime_2.10-1.2.jar",
"version":"1.2"
},
{
"class":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
"method":"getZooKeeper",
"file":"ConnectionState.java",
"line":87,
"exact":true,
"location":"flink-runtime_2.10-1.2.jar",
"version":"1.2"
},
{
"class":"org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient",
"method":"getZooKeeper",
"file":"CuratorZookeeperClient.java",
"line":115,
"exact":true,
"location":"flink-runtime_2.10-1.2.jar",
"version":"1.2"
},
{
"class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
"method":"performBackgroundOperation",
"file":"CuratorFrameworkImpl.java",
"line":806,
"exact":true,
"location":"flink-runtime_2.10-1.2.jar",
"version":"1.2"
},
{
"class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
"method":"backgroundOperationsLoop",
"file":"CuratorFrameworkImpl.java",
"line":792,
"exact":true,
"location":"flink-runtime_2.10-1.2.jar",
"version":"1.2"
},
{
"class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
"method":"access$300",
"file":"CuratorFrameworkImpl.java",
"line":62,
"exact":true,
"location":"flink-runtime_2.10-1.2.jar",
"version":"1.2"
},
{
"class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4",
"method":"call",
"file":"CuratorFrameworkImpl.java",
"line":257,
"exact":true,
"location":"flink-runtime_2.10-1.2.jar",
"version":"1.2"
},
{
"class":"java.util.concurrent.FutureTask",
"method":"run",
"file":"FutureTask.java",
"line":266,
"exact":true,
"location":"?",
"version":"1.8.0_66"
},
{
"class":"java.util.concurrent.ThreadPoolExecutor",
"method":"runWorker",
"file":"ThreadPoolExecutor.java",
"line":1142,
"exact":true,
"location":"?",
"version":"1.8.0_66"
},
{
"class":"java.util.concurrent.ThreadPoolExecutor$Worker",
"method":"run",
"file":"ThreadPoolExecutor.java",
"line":617,
"exact":true,
"location":"?",
"version":"1.8.0_66"
},
{
"class":"java.lang.Thread",
"method":"run",
"file":"Thread.java",
"line":745,
"exact":true,
"location":"?",
"version":"1.8.0_66"
}
]
},
"endOfBatch":false,
"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger",
"threadId":258,
"threadPriority":5
}
Re: Flink HA Zookeeper Connection Timeout
Posted by Patrick Lucas <pa...@data-artisans.com>.
Hi Sathya,
Here are two JIRA issues that may be related: FLINK-5996
<https://issues.apache.org/jira/browse/FLINK-5996>, FLINK-7021
<https://issues.apache.org/jira/browse/FLINK-7021>
Are there any logs from your ZK cluster that may be of use? Since you're on
Kubernetes, do you have liveness/readiness probes on ZK, and if so, do they
show any problems? For example, a failed readiness probe could result in the
node temporarily being dropped from the K8s Service, resulting in a timeout
from Flink.
Actually, it's probably a good idea to avoid using a Service altogether
with ZooKeeper in Kubernetes and address the pods directly. For this you
could use a StatefulSet
<https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/>
which gives you hostnames like zookeeper-0, zookeeper-1 etc., avoiding the
indirection of a Service and allowing the client library to do its own
failure resolution since it knows where to find each ZooKeeper.
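
A minimal sketch of what addressing the pods directly could look like: it
builds a comma-separated host:port list with one entry per StatefulSet pod,
in the format a ZooKeeper connection string uses (in Flink this would
typically go into high-availability.zookeeper.quorum). The function name,
the replica count, and the optional DNS suffix are illustrative assumptions,
not taken from the setup in this thread; the port 2181 is the one shown in
the error log.

```python
def zookeeper_quorum(name: str, replicas: int, domain: str = "", port: int = 2181) -> str:
    """Build a ZooKeeper connection string with one entry per StatefulSet pod.

    name     -- StatefulSet name; pods are named <name>-0, <name>-1, ...
    replicas -- number of ZooKeeper pods (assumed, adjust to your cluster)
    domain   -- optional DNS suffix appended to each pod hostname
                (hypothetical, depends on your headless Service and namespace)
    port     -- ZooKeeper client port
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    suffix = f".{domain}" if domain else ""
    # List every pod explicitly so the client can fail over between servers
    # itself instead of relying on a single Service address.
    return ",".join(f"{name}-{i}{suffix}:{port}" for i in range(replicas))


if __name__ == "__main__":
    print(zookeeper_quorum("zookeeper", 3))
```

With every server listed, the ZooKeeper client can try another host when one
becomes unreachable, rather than waiting on the single Service address.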
--
Patrick Lucas
Re: Flink HA Zookeeper Connection Timeout
Posted by Nico Kruber <ni...@data-artisans.com>.
Hi Sathya,
Have you checked this yet?
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html
I'm no expert on the HA setup. Have you also tried Flink 1.3, just in case?
Nico