Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/04/14 21:13:23 UTC

[GitHub] [accumulo] ctubbsii edited a comment on issue #2016: Github QA occasionally hangs while running unit-tests

ctubbsii edited a comment on issue #2016:
URL: https://github.com/apache/accumulo/issues/2016#issuecomment-819793131


   Okay, so fixing the timeout works, and I was able to get the logs. It looks like the services start up okay but cannot talk to each other. The services register themselves using the local host name, which is determined by a reverse DNS lookup on the local IP address. When services are reached on localhost, everything works fine (e.g. services can talk to ZooKeeper on `localhost:33647` just fine). I don't see any errors sending to the tracer service, but I do see it listening on an IP address (`[tracer.AsyncSpanReceiver] INFO : starting span receiver with hostname 10.1.0.83`) instead of a resolved hostname.
   
   Tservers and the master in 1.10 (the build I was testing) show that they are listening on hostname `fv-az95-160`, but when the master tries to talk to the tservers, it fails to connect and times out:
   
   ```
   2021-04-14T18:10:40,775 [manager.Manager] INFO : New servers: [fv-az95-160:45343[100000b12f00006], fv-az95-160:36911[100000b12f00002]]
   2021-04-14T18:10:40,794 [manager.EventCoordinator] INFO : There are now 2 tablet servers            
   2021-04-14T18:10:40,803 [manager.Manager] INFO : tserver availability check disabled, continuing with-2 servers. To enable, set manager.startup.tserver.avail.min.count
   2021-04-14T18:10:40,956 [server.ServerUtil] WARN : System swappiness setting is greater than ten (60) which can cause time-sensitive operations to be delayed. Accumulo is time sensitive because it needs to maintain distributed lock agreement.
   2021-04-14T18:10:40,980 [manager.Manager] INFO : Setting manager lock data to fv-az95-160:35861     
   2021-04-14T18:10:41,040 [metrics.ManagerMetricsFactory] INFO : Registered replication metrics module   
   2021-04-14T18:10:41,061 [metrics.ManagerMetricsFactory] INFO : Registered FATE metrics module       
   2021-04-14T18:10:41,061 [manager.Manager] INFO : All metrics modules registered                        
   2021-04-14T18:10:41,330 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.core.spi.balancer.SimpleLoadBalancer for table +r
   2021-04-14T18:10:41,331 [manager.Manager] INFO : Assigning 1 tablets                                
   2021-04-14T18:11:20,829 [rpc.ThriftUtil] WARN : Failed to open transport to fv-az95-160:36911          
   2021-04-14T18:11:20,830 [rpc.ThriftUtil] WARN : Failed to open transport to fv-az95-160:45343          
   2021-04-14T18:11:20,830 [manager.Manager] ERROR: unable to get tablet server status fv-az95-160:36911[100000b12f00002] org.apache.thrift.transport.TTransportException: java.nio.channels.ClosedByInterruptException
   ```
   
   There is an additional stack trace further along, but it adds nothing new: it only shows that the attempt to connect to the tserver timed out.
   
   So, either there is a problem with DNS/rDNS mapping between the hostname and IP address of the runner, or there is some other security / firewall policy preventing services from talking on the non-localhost IP address.
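   
   To tell those two cases apart on a runner, something like the following could be run as a standalone check. This is only a rough sketch and not part of Accumulo: the class name `HostnameCheck` and the helper `canConnect` are made up for illustration, and it assumes a plain TCP connect is a fair stand-in for what the Thrift client does. It prints the local host name, the IP it maps to, and the reverse lookup result, then listens on an ephemeral port and tries to connect to it through each address the name resolves to, and through loopback for comparison.
   
   ```java
   import java.io.IOException;
   import java.net.InetAddress;
   import java.net.InetSocketAddress;
   import java.net.ServerSocket;
   import java.net.Socket;

   // Hypothetical diagnostic, not part of Accumulo.
   class HostnameCheck {

     // Returns true if a TCP connection to addr:port opens within timeoutMillis.
     static boolean canConnect(InetAddress addr, int port, int timeoutMillis) {
       try (Socket socket = new Socket()) {
         socket.connect(new InetSocketAddress(addr, port), timeoutMillis);
         return true;
       } catch (IOException e) {
         // Covers refused connections, timeouts, and unreachable networks.
         return false;
       }
     }

     public static void main(String[] args) throws IOException {
       InetAddress local = InetAddress.getLocalHost();
       // Reverse lookup of the local IP; Java returns the textual IP if it fails.
       String canonical = local.getCanonicalHostName();
       System.out.println("local host name:  " + local.getHostName());
       System.out.println("local IP address: " + local.getHostAddress());
       System.out.println("reverse lookup:   " + canonical);

       // Listen on all interfaces on an ephemeral port.
       try (ServerSocket server = new ServerSocket(0)) {
         int port = server.getLocalPort();
         // Try every address the reverse-lookup name resolves to.
         for (InetAddress addr : InetAddress.getAllByName(canonical)) {
           System.out.println(addr + " reachable: " + canConnect(addr, port, 5000));
         }
         System.out.println("loopback reachable: "
             + canConnect(InetAddress.getLoopbackAddress(), port, 5000));
       }
     }
   }
   ```
   
   If the name lookup itself fails, or the name resolves to an address that is not one of the runner's own, that would point at DNS/rDNS. If the addresses look right but only loopback is reachable, that would point at something blocking traffic on the non-localhost address.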
   
   This is clearly the result of some change in GitHub Actions runners, and not in our code, since it also affects minicluster in 1.10.
   
   The most likely change I can think of that could have caused this is the switch of `ubuntu-latest` from mapping to `ubuntu-18.04` to `ubuntu-20.04`. However, I don't have an Ubuntu instance to experiment with at the moment, so this is where I'm stuck for now.
   
   There are a few options going forward, if it is an issue with Ubuntu 20.04:
   1. ~Force using ubuntu-18.04 instead of ubuntu-latest~ (didn't work; the test still hung and timed out with the same connection errors)
   2. ~Disable firewalld (or other firewall, if it's running on Ubuntu 20.04 runners)~ (didn't work; the firewall is already inactive)
   3. ~Add firewall rules (if the problem is the firewall)~ (the firewall is already inactive)
   4. Fix rDNS name lookup for the local IP address, either by adding an entry to `/etc/hosts` or by doing something in `/etc/nsswitch.conf` to use the `myhostname` local name service rather than DNS and hostnamectl (if the problem is DNS)
   5. Force minicluster services to use localhost
   6. Get GitHub to fix it, if it's a problem with the ubuntu-20.04 image
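   
   For option 4, the hosts entry would just map the runner's non-loopback IP to its host name. As a rough sketch (the class `HostsEntry` and its `format` helper are hypothetical, and this only prints candidate lines, it does not edit `/etc/hosts`), the candidates could be listed like this:
   
   ```java
   import java.io.IOException;
   import java.net.Inet4Address;
   import java.net.InetAddress;
   import java.net.NetworkInterface;
   import java.util.Collections;
   import java.util.Enumeration;

   // Hypothetical helper, not part of Accumulo.
   class HostsEntry {

     // Formats one hosts-file line: "<ip> <hostname>".
     static String format(String ip, String hostname) {
       return ip + " " + hostname;
     }

     public static void main(String[] args) throws IOException {
       String hostname = InetAddress.getLocalHost().getHostName();
       Enumeration<NetworkInterface> nifs = NetworkInterface.getNetworkInterfaces();
       if (nifs == null) {
         return; // no network interfaces found
       }
       for (NetworkInterface nif : Collections.list(nifs)) {
         // Skip loopback and interfaces that are down.
         if (nif.isLoopback() || !nif.isUp()) {
           continue;
         }
         for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
           // Only IPv4 here, to keep the sketch short.
           if (addr instanceof Inet4Address) {
             System.out.println(format(addr.getHostAddress(), hostname));
           }
         }
       }
     }
   }
   ```
   
   With the values from the logs above, and assuming the tracer's IP and the tserver host name belong to the same runner, `format("10.1.0.83", "fv-az95-160")` gives `10.1.0.83 fv-az95-160`.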
   

