Posted to issues@trafodion.apache.org by "Hans Zeller (JIRA)" <ji...@apache.org> on 2017/07/20 20:35:00 UTC

[jira] [Updated] (TRAFODION-2692) Monitor fails to start when node names are not of the right form

     [ https://issues.apache.org/jira/browse/TRAFODION-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Zeller updated TRAFODION-2692:
-----------------------------------
    Environment: I tried this on an OpenStack cluster, using Hortonworks HDP 5.4. This is the code with the new elasticity feature.  (was: I tried this on an OpenStack cluster, using Hortonworks HDP 5.4)

> Monitor fails to start when node names are not of the right form
> ----------------------------------------------------------------
>
>                 Key: TRAFODION-2692
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2692
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.2-incubating
>         Environment: I tried this on an OpenStack cluster, using Hortonworks HDP 5.4. This is the code with the new elasticity feature.
>            Reporter: Hans Zeller
>
> When trying to install Trafodion on a cluster, I ran into various situations where the monitor failed to start, based on how host names were configured and specified. I used three kinds of names:
> NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the mistake of adding only the nickname, not the actual host name, to the /etc/hosts line.
> LN - a local, non-qualified name that is also the OpenStack instance name and the host name.
> FQDN - the fully qualified domain host name
> {noformat}
> Case  Name specified  hostname command  sqconfig  What happened
>       in HDP          returns           contains
> ----  --------------  ----------------  --------  --------------------------
>   1   nickname        local name        nickname  sqstart returned an error,
>                                                   saying that sqstart must
>                                                   be executed on one of the
>                                                   nodes of the cluster
>   2   local name      local name        FQDN?     monitor core dump (1)
>   3   local name      FQDN              FQDN      monitor abends (2)
>   4   FQDN            FQDN              FQDN      install succeeds
> {noformat}
> Notes: (1) The core dump happened because of the following code in file core/sqf/monitor/linux/cluster.cxx:
> {noformat}
>     // Build the monitor's configured view of the cluster
>     if ( IsRealCluster )
>     {   // Map node name to physical node id
>         // (for virtual nodes physical node equals "rank" (previously set))
>         MyPNID = clusterConfig->GetPNid( Node_name );
>     }
>     Nodes->AddNodes( );
>     MyNode = Nodes->GetNode(MyPNID);
>     Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
> {noformat}
> Node_name is a local name, but the nodes in the "Nodes" list are named by their FQDN, so the lookup does not find the node and MyPNID is set to -1. MyNode is then a NULL pointer, and dereferencing it causes the core dump.
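
One way to turn this crash into a diagnosable error is to check the lookup result before using it. The following is only a minimal sketch under stated assumptions: Node, GetPNid, GetNode and FindLocalNode are hypothetical stand-ins written for this illustration and are not the real Trafodion monitor classes or their signatures.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a configured node (not the Trafodion class).
struct Node {
    int pnid;
    std::string name;
};

// Returns the physical node id for `name`, or -1 if the name is not configured.
int GetPNid(const std::vector<Node>& nodes, const std::string& name) {
    for (const Node& n : nodes) {
        if (n.name == name) return n.pnid;
    }
    return -1;
}

// Returns the node with the given id, or nullptr if there is none.
const Node* GetNode(const std::vector<Node>& nodes, int pnid) {
    for (const Node& n : nodes) {
        if (n.pnid == pnid) return &n;
    }
    return nullptr;
}

// Looks up the local node and fills `error` on failure, so the caller can
// report the problem and exit cleanly instead of dereferencing a null pointer.
const Node* FindLocalNode(const std::vector<Node>& nodes,
                          const std::string& localName,
                          std::string& error) {
    const int pnid = GetPNid(nodes, localName);
    const Node* node = (pnid >= 0) ? GetNode(nodes, pnid) : nullptr;
    if (node == nullptr) {
        error = "node name '" + localName +
                "' is not in the cluster configuration";
    }
    return node;
}
```

With a configuration that lists only FQDNs, looking up the local name "mynode1" returns nullptr and sets the error text, while "mynode1.novalocal" returns the node. This only improves the failure mode; it does not resolve the name mismatch itself.
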
> (2) The third case is the same as the second, with two modifications: I used the "hostname" command to set the host name to the FQDN, and I edited /etc/hosts to put the FQDN first in each line and the local name second (case 2 had them the other way round). This time, we get past the problem described in case 2, but we get an error from MPI, which is unable to communicate with all the nodes (sorry, I didn't record the exact error message).
> This is how the lines in /etc/hosts look (same layout for all nodes of the cluster):
> {noformat}
> # case 1
> 1.2.3.4 nickname1
> 1.2.3.5 nickname2
> # case 2
> 1.2.3.4 mynode1 mynode1.novalocal
> 1.2.3.5 mynode2 mynode2.novalocal
> # cases 3 and 4
> 1.2.3.4 mynode1.novalocal mynode1
> 1.2.3.5 mynode2.novalocal mynode2
> {noformat}
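
The order of names on each line matters because, in the hosts file format, the first name after the address is the canonical host name and any further names are aliases. The sketch below illustrates that rule on the lines above. HostsEntry and ParseHostsLine are hypothetical names introduced for this example; this is not how the monitor or the system resolver is implemented.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One parsed /etc/hosts line (hypothetical helper type).
struct HostsEntry {
    std::string address;
    std::string canonicalName;         // first name after the address
    std::vector<std::string> aliases;  // any remaining names on the line
};

// Parses a single hosts-format line into `entry`.
// Returns false for blank lines, comment-only lines, and lines without a name.
bool ParseHostsLine(const std::string& line, HostsEntry& entry) {
    // Everything from '#' to the end of the line is a comment.
    const std::string content = line.substr(0, line.find('#'));
    std::istringstream in(content);

    HostsEntry parsed;
    if (!(in >> parsed.address >> parsed.canonicalName)) {
        return false;
    }
    std::string alias;
    while (in >> alias) {
        parsed.aliases.push_back(alias);
    }
    entry = parsed;
    return true;
}
```

For the case 2 line "1.2.3.4 mynode1 mynode1.novalocal" the canonical name is the local name and the FQDN is an alias; for cases 3 and 4 the roles are swapped. The case 1 lines have a canonical name only, and it is the nickname.
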
> My suggestion would be to identify the places where we read node names that are provided by the user and where such node names are compared, and to provide a comparison method that tolerates equivalent forms of names.
> There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.
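
A comparison method of the kind suggested above might look roughly like the following. It is a sketch under explicit assumptions, not a proposed patch: SameHost and its helpers are hypothetical names, and treating an unqualified name as equivalent to any FQDN with the same first label is a heuristic that could give false matches on sites where the same short name exists in several domains.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Lower-cases a host name; host names are compared case-insensitively.
std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns the part of the name before the first '.', or the whole name.
std::string FirstLabel(const std::string& name) {
    return name.substr(0, name.find('.'));
}

// True if the name contains a domain part.
bool IsQualified(const std::string& name) {
    return name.find('.') != std::string::npos;
}

// Heuristic (assumption): two names denote the same host if they are equal
// ignoring case, or if exactly one of them is unqualified and it equals the
// first label of the other.
bool SameHost(const std::string& a, const std::string& b) {
    const std::string la = ToLower(a);
    const std::string lb = ToLower(b);
    if (la == lb) return true;
    // Two different fully qualified names are treated as different hosts.
    if (IsQualified(la) && IsQualified(lb)) return false;
    return FirstLabel(la) == FirstLabel(lb);
}
```

Under this heuristic, "mynode1" and "mynode1.novalocal" compare equal, which covers the mismatch in case 2. It does not help with case 1: a nickname such as "nickname1" shares no label with "mynode1", so recognizing it would additionally require resolving aliases, for example through /etc/hosts or the system resolver.
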



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)