Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2022/11/10 18:58:42 UTC

[GitHub] [solr-operator] ramayer opened a new issue, #498: Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns)

ramayer opened a new issue, #498:
URL: https://github.com/apache/solr-operator/issues/498

   Solr Operator works great in about half of the Kubernetes environments I'm testing, but fails in the other half.
   
   It fails for me on Ubuntu 22.04 using a Kubernetes environment started with:
   
       minikube start
   
   where it seems each Solr instance can communicate with the other two just fine, but appears to hit a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but am unable to create any collection that has as many shards as Solr pod instances.
   
   It works fine for me on the same Ubuntu 22.04 host using:
   
       minikube start --container-runtime=containerd --cpus 4 --mount-string=$HOME/proj/kube/persistent_volumes:/mnt/host --mount
   
   It fails for me on macOS using a Kubernetes environment created with:
   
       colima start --cpu 4 --memory 8 --kubernetes
   
   where it seems like the ZooKeeper cluster never reaches a quorum, apparently timing out when the second ZooKeeper node attempts to connect to example-solrcloud-zookeeper-client:2181. It seems as if the default networking of colima's Kubernetes (k3s, I think) does not allow connections to that service until the service is ready (which never seems to happen), but I don't know how to debug this further.
   
   It works fine for me on the same macOS host using a Kubernetes environment created with:
   
       podman machine init -m 16000 --cpus 4 -v "$HOME:$HOME" --rootful
       podman machine start
       minikube start --driver=podman --cpus 4 --memory 12000 --profile=minikube-on-podman
   
   It works fine for me on Microsoft Azure's AKS using the instructions [here](https://learn.microsoft.com/en-us/azure/developer/terraform/create-k8s-cluster-with-tf-and-aks).
   
   
   In all cases, after creating the Kubernetes environment, I'm attempting to create the Solr cluster with:
   
       kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/cloud/deploy.yaml
       kubectl create -f https://solr.apache.org/operator/downloads/crds/v0.6.0/all-with-dependencies.yaml
       helm install solr-operator apache-solr/solr-operator --version 0.6.0
       helm install example-solr apache-solr/solr --version 0.6.0 \
         --set image.tag=9.0 \
         --set solrOptions.security.authenticationType="Basic" \
         --set solrOptions.javaMemory="-Xms300m -Xmx300m" \
         --set addressability.external.method=Ingress \
         --set addressability.external.domainName="ing.local.domain" \
         --set addressability.external.useExternalAddress="true" \
         --set ingressOptions.ingressClassName="nginx"
   
   I think most of the failure modes are related to when, during startup, Kubernetes exposes enough information (DNS? IP addresses?) to the nodes, but I don't know Kubernetes networking well enough to debug this.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr-operator] ramayer commented on issue #498: Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns)

Posted by GitBox <gi...@apache.org>.
ramayer commented on issue #498:
URL: https://github.com/apache/solr-operator/issues/498#issuecomment-1379658068

   For the minikube failure mode mentioned above, I think it's related to this minikube GitHub issue: https://github.com/kubernetes/minikube/issues/13370. The workaround from the comments there includes adding `--cni=bridge` to the minikube startup line.
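   
   As a minimal sketch of that workaround: only the `--cni=bridge` flag comes from the linked issue. The `start_minikube_bridge` helper name and its dry-run default are assumptions added for illustration.
   
   ```shell
   # Hypothetical helper: builds the minikube startup line with the bridge CNI.
   # By default it only prints the command; set RUN=1 to actually execute it.
   start_minikube_bridge() {
     set -- minikube start --cni=bridge "$@"
     if [ "${RUN:-0}" = "1" ]; then
       "$@"
     else
       echo "$*"
     fi
   }
   
   # Dry run: prints the command that would be executed.
   start_minikube_bridge --cpus 4
   ```
   
   Any extra arguments (such as `--cpus 4` here) are passed through unchanged, so the same helper can be reused with other startup options.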
   
   I still didn't have any luck finding workarounds for the `colima --kubernetes` distribution of Kubernetes.
   
   Some comments there suggest that iptables-based implementations of services can have trouble when a pod tries to connect to itself through a service URL. I wonder if this operator could be tweaked so that pods do not try to reach themselves through the service name, but I don't understand enough to know if that makes sense.
   




[GitHub] [solr-operator] risdenk commented on issue #498: Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns)

Posted by GitBox <gi...@apache.org>.
risdenk commented on issue #498:
URL: https://github.com/apache/solr-operator/issues/498#issuecomment-1311991399

   @ramayer can you capture any of the error logs from:
   
   > where it seems like the zookeeper cluster never reaches a quorum; apparently timing out when the second zookeeper node attempts to connect to example-solrcloud-zookeeper-client:2181 
   
   and
   
   > where it seems each Solr instance can communicate with the other two just fine, but appears to have a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but am unable to create any collection that has as many shards as solr pod instances.
   
   and any other cases where you seem to have an idea of what is happening. I know you shared how to reproduce (and that is helpful) - any log messages from ZK or Solr would potentially help as well.




[GitHub] [solr-operator] janhoy commented on issue #498: Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns)

Posted by GitBox <gi...@apache.org>.
janhoy commented on issue #498:
URL: https://github.com/apache/solr-operator/issues/498#issuecomment-1310766128

   > ...where it seems like the zookeeper cluster never reaches a quorum; apparently timing out when the second zookeeper node attempts to connect to example-solrcloud-zookeeper-client:2181
   
   I see the same on macOS on Apple M1, using Docker Desktop 4.9.1. There is some communication issue between the ZK nodes, so I always run with only 1 ZK node when debugging on my dev MacBook.




[GitHub] [solr-operator] ramayer commented on issue #498: Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns)

Posted by GitBox <gi...@apache.org>.
ramayer commented on issue #498:
URL: https://github.com/apache/solr-operator/issues/498#issuecomment-1311989527

   @janhoy 
   
   Thanks. Adding `--set zk.provided.replicas=1` worked for me on the macOS/colima environment. However, the minikube-on-Ubuntu-without-containerd example I mentioned above still fails: there seems to be a different communication failure between the Solr nodes when trying to create a 3-shard collection on a 3-node cluster.
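   
   For reference, a sketch of the install line with that override. The chart, version, and release name are the ones from the original report; the `solr_install_cmd` helper and its dry-run default are illustrative assumptions.
   
   ```shell
   # Hypothetical helper: builds the helm install line from the original report,
   # overriding the ZooKeeper ensemble size to a single node.
   # Prints the command by default; set RUN=1 to execute it.
   solr_install_cmd() {
     set -- helm install example-solr apache-solr/solr --version 0.6.0 \
       --set zk.provided.replicas=1 "$@"
     if [ "${RUN:-0}" = "1" ]; then
       "$@"
     else
       echo "$*"
     fi
   }
   
   # Dry run with one of the other overrides from the original report.
   solr_install_cmd --set image.tag=9.0
   ```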
   
   




[GitHub] [solr-operator] ramayer commented on issue #498: Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns)

Posted by GitBox <gi...@apache.org>.
ramayer commented on issue #498:
URL: https://github.com/apache/solr-operator/issues/498#issuecomment-1343370987

   Using minikube on Ubuntu, log files show messages like this:
   
   ```
   java.net.UnknownHostException: example-solrcloud-zookeeper-1.example-solrcloud-zookeeper-headless.default.svc.cluster.local
   	at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
   	at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
   	at java.base/java.net.InetAddress.getAllByName(Unknown Source)
   	at java.base/java.net.InetAddress.getAllByName(Unknown Source)
   	at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:88)
   	at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:141)
   	at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:368)
   	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1197)
   ```
   
   It looks like the node `example-solrcloud-zookeeper-1` is not able to resolve itself by the name `example-solrcloud-zookeeper-1.example-solrcloud-zookeeper-headless.default.svc.cluster.local`.
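   
   One way to check this from inside the pod, as a sketch: the helper below only assembles the per-pod DNS name that a headless service gives a StatefulSet pod (`<pod>.<service>.<namespace>.svc.<cluster-domain>`). The `pod_fqdn` name is hypothetical, and the commented `kubectl exec` line assumes `getent` is available in the ZooKeeper image.
   
   ```shell
   # Hypothetical helper: assemble the DNS name of a StatefulSet pod behind a
   # headless service. Namespace defaults to "default", domain to cluster.local.
   pod_fqdn() {
     pod=$1; service=$2; namespace=${3:-default}; domain=${4:-cluster.local}
     echo "${pod}.${service}.${namespace}.svc.${domain}"
   }
   
   name=$(pod_fqdn example-solrcloud-zookeeper-1 example-solrcloud-zookeeper-headless)
   echo "$name"
   
   # To test resolution from inside the pod itself (assumes getent exists in the image):
   #   kubectl exec example-solrcloud-zookeeper-1 -- getent hosts "$name"
   ```
   
   If the lookup fails inside the pod but succeeds from another pod, that would point at the pod's own DNS view rather than at the headless service records.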

