Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2015/10/22 06:34:27 UTC

[jira] [Created] (MESOS-3790) Zk connection should retry on EAI_NONAME

Neil Conway created MESOS-3790:
----------------------------------

             Summary: Zk connection should retry on EAI_NONAME
                 Key: MESOS-3790
                 URL: https://issues.apache.org/jira/browse/MESOS-3790
             Project: Mesos
          Issue Type: Bug
            Reporter: Neil Conway
            Assignee: Neil Conway
            Priority: Minor


The ZooKeeper interface is designed to retry (once per second for up to ten minutes) if one or more of the ZooKeeper hostnames can't be resolved (see [MESOS-1326] and [MESOS-1523]).

However, the current implementation assumes that a DNS resolution failure is indicated by zookeeper_init() returning NULL with errno set to EINVAL (Zk translates getaddrinfo() failures into errno values). That assumption does not hold for every resolution failure, because the current Zk code does:

{code}
static int getaddrinfo_errno(int rc) {
    switch(rc) {
    case EAI_NONAME:
// ZOOKEEPER-1323 EAI_NODATA and EAI_ADDRFAMILY are deprecated in FreeBSD.
#if defined EAI_NODATA && EAI_NODATA != EAI_NONAME
    case EAI_NODATA:
#endif
        return ENOENT;
    case EAI_MEMORY:
        return ENOMEM;
    default:
        return EINVAL;
    }
}
{code}

getaddrinfo() returns EAI_NONAME when "the node or service is not known"; per discussion in [MESOS-2186], this seems to happen intermittently due to DNS failures.

Proposed fix: looking at errno is always going to be somewhat fragile, but if we're going to continue doing that, we should check for ENOENT as well as EINVAL.
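
A minimal sketch of the proposed check, for illustration only. The names sketch_getaddrinfo_errno() and is_retryable_zk_init_errno() are hypothetical and do not exist in Mesos or ZooKeeper; the first mirrors the Zk mapping quoted above, and the second shows the errno test a caller might apply after zookeeper_init() returns NULL.

```c
#include <errno.h>
#include <netdb.h>
#include <stdbool.h>

/* Hypothetical copy of the Zk mapping quoted above, reproduced so that
 * this sketch is self-contained. */
static int sketch_getaddrinfo_errno(int rc) {
    switch (rc) {
    case EAI_NONAME:
#if defined EAI_NODATA && EAI_NODATA != EAI_NONAME
    case EAI_NODATA:
#endif
        return ENOENT;
    case EAI_MEMORY:
        return ENOMEM;
    default:
        return EINVAL;
    }
}

/* Hypothetical helper: treat both EINVAL and ENOENT as "hostname could
 * not be resolved, retry later". An EINVAL-only check misses the
 * EAI_NONAME case, which the mapping above turns into ENOENT. */
static bool is_retryable_zk_init_errno(int err) {
    return err == EINVAL || err == ENOENT;
}
```

With this check, an EAI_NONAME failure (mapped to ENOENT) would take the same retry path as the failures that are mapped to EINVAL. Whether other errno values, such as ENOMEM, should also be retried is left open here.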


