Posted to dev@curator.apache.org by "Francis Simon (Jira)" <ji...@apache.org> on 2022/05/02 20:03:00 UTC

[jira] [Updated] (CURATOR-638) Curator disconnect from zookeeper when IPs change

     [ https://issues.apache.org/jira/browse/CURATOR-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francis Simon updated CURATOR-638:
----------------------------------
    Description: 
Blocking usage of ZooKeeper in production. I tried a few versions and all had the issue. Affects any recipes that use ephemeral nodes. Example attached.

We use multiple Apache Curator recipes in our system, which runs in Docker and Kubernetes. The behavior I am seeing is that Curator appears to resolve to the IP addresses of the containers rather than staying tied to their DNS names. I have seen old tickets on this, but the behavior is reproducible on the latest code release.
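For reference, the clients are created with the ZooKeeper DNS names in the connection string, roughly like this (a minimal sketch, not the exact code from the attached zkissue.zip; hostnames, port and retry policy are illustrative):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ClientSketch {
    public static CuratorFramework newClient() {
        // The connection string uses the container/service hostnames, not IPs.
        // Despite that, the client appears to keep using the IPs these names
        // resolved to at startup, which is the heart of this report.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();
        return client;
    }
}
{code}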

We are running ZooKeeper in containers on Kubernetes. In Kubernetes many things can cause a container to move hosts; the pod disruption budget ensures that a quorum is always present. But with this bug, if all nodes move for any reason and get new IP addresses, clients disconnect when they shouldn't. Disconnecting has the bad side effect that all ephemeral nodes are lost. For us this affects coordination, distributed locking, and service discovery. It causes production downtime, so it is marked as a Blocker.
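To make the ephemeral-node point concrete: the recipes involved here ultimately keep ephemeral znodes alive for as long as the client session lasts, along these lines (illustrative path, not code from the attachment):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;

public class EphemeralSketch {
    // ZooKeeper deletes an ephemeral znode as soon as its session ends, so a
    // spurious disconnect silently drops lock ownership and registrations.
    public static void createMembership(CuratorFramework client) throws Exception {
        client.create()
              .creatingParentsIfNeeded()
              .withMode(CreateMode.EPHEMERAL)
              .forPath("/example/membership/member-1", new byte[0]);
    }
}
{code}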

I have a simple sample that just uses the service discovery recipe to register a bunch of services in ZooKeeper. I run the example in Docker Compose. It is 100% reproducible.
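Each server in the sample registers itself with the service discovery recipe roughly as follows (a simplified sketch, not the attached code; the service name and base path match the znode path in the stack trace further down, everything else is illustrative):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;

public class RegistrationSketch {
    public static ServiceDiscovery<Void> register(CuratorFramework client,
                                                  String host, int port) throws Exception {
        ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
                .name("test")              // matches /myservices/test/<uuid> in the trace
                .address(host)
                .port(port)
                .build();

        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
                .client(client)
                .basePath("/myservices")
                .thisInstance(instance)
                .build();
        discovery.start();                 // writes the ephemeral registration znode
        return discovery;
    }
}
{code}

The docker-compose steps below then reproduce the failure: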

 
{code:bash}
# Stand up zookeeper and wait for it to be healthy
docker-compose up -d zookeeper1 zookeeper2 zookeeper3

# Stand up a server and make sure it is connected and working as expected
docker-compose up -d server1

# Take down a single zookeeper node and stand up another agent.
# The agent will grab the old zookeeper node's IP address
docker-compose rm -s zookeeper1
docker-compose up -d server2

# Bring the zookeeper node back up.
# Wait for it to be healthy
docker-compose up -d zookeeper1

# Then take down the next zookeeper node and stand up another agent.
# The agent will grab the old zookeeper node's IP address
docker-compose rm -s zookeeper2
docker-compose up -d server3

# Bring the zookeeper node back up.
# Wait for it to be healthy
docker-compose up -d zookeeper2

# Then take down the next zookeeper node and stand up another agent.
# The agent will grab the old zookeeper node's IP address
docker-compose rm -s zookeeper3
docker-compose up -d server4

# Bring the zookeeper node back up.
# Wait for it to be healthy
docker-compose up -d zookeeper3{code}
 

At the time the 3rd ZooKeeper node is taken down, the first server that was stood up (server1) receives a disconnected status, because the IPs of all three nodes have now changed from the original IP addresses.
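The state change can also be observed directly with a connection state listener (again a minimal sketch, not the attached code):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

public class StateLoggingSketch {
    public static void logStateChanges(CuratorFramework client) {
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                // SUSPENDED and then LOST are reported here once every address
                // from the original connection string has gone stale.
                System.out.println("Connection state: " + newState);
            }
        });
    }
}
{code}

In the sample, the failure surfaces on server1 the next time it uses the connection, here while closing its service discovery instance: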

 
{code:java}
server1_1     | Query instances for servicetest
server1_1     | Exception in thread "main" java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /myservices/test/62e23a0b-dfdb-46f5-966f-8dc7a4978c70
server1_1     | 	at org.apache.curator.shaded.com.google.common.base.Throwables.propagate(Throwables.java:241)
server1_1     | 	at org.apache.curator.utils.ExceptionAccumulator.propagate(ExceptionAccumulator.java:38)
server1_1     | 	at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:171)
server1_1     | 	at org.apache.curator.shaded.com.google.common.io.Closeables.close(Closeables.java:78)
server1_1     | 	at org.apache.curator.utils.CloseableUtils.closeQuietly(CloseableUtils.java:59)
server1_1     | 	at zkissue.App.main(App.java:72)
server1_1     | Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /myservices/test/62e23a0b-dfdb-46f5-966f-8dc7a4978c70
server1_1     | 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
server1_1     | 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
server1_1     | 	at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)
server1_1     | 	at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:274)
server1_1     | 	at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:268)
server1_1     | 	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
server1_1     | 	at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:265)
server1_1     | 	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:249)
server1_1     | 	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:34)
server1_1     | 	at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalUnregisterService(ServiceDiscoveryImpl.java:520)
server1_1     | 	at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:157)
server1_1     | 	... 3 more
{code}
 

This causes it to disconnect and lose its discovery state, which can be seen from the other services: server1 no longer appears in their listing.

 
{code:java}
server2_1     | Query instances for servicetest
server2_1     | test
server2_1     | 	service description: http://server-4:57456
server2_1     | 	service description: http://server-3:37740
server2_1     | 	service description: http://server-2:40219{code}
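
For reference, that listing on server2 comes from a query along these lines (illustrative; the sample presumably also builds an http URI from the instance data):

{code:java}
import java.util.Collection;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceInstance;

public class QuerySketch {
    public static void printInstances(ServiceDiscovery<Void> discovery) throws Exception {
        // server1 is absent from this list once its ephemeral registration
        // znode has been dropped along with the lost session.
        Collection<ServiceInstance<Void>> instances = discovery.queryForInstances("test");
        for (ServiceInstance<Void> instance : instances) {
            System.out.println("\tservice description: "
                    + instance.getAddress() + ":" + instance.getPort());
        }
    }
}
{code}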
 

 

 

> Curator disconnect from zookeeper when IPs change
> -------------------------------------------------
>
>                 Key: CURATOR-638
>                 URL: https://issues.apache.org/jira/browse/CURATOR-638
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client, Recipes
>    Affects Versions: 5.2.1
>         Environment: Docker or Kubernetes, docker example provided
>            Reporter: Francis Simon
>            Priority: Blocker
>         Attachments: zkissue.zip
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)