Posted to users@solr.apache.org by Jan Høydahl <ja...@cominvent.com> on 2023/02/02 11:32:25 UTC

Re: Slow / unresponsive admin ui since solr 9.1 update

Hi,

Following up on this. I'd still say that the issue here seems to be that your zookeeper config lists 0.0.0.0 as the IP address for client connections.

>>> The problem is related to the fact that we run solr and the zookeeper
>>> ensemble dockerized. As we cannot bind zookeeper from docker to its host's
>>> external ip address, we have to use "0.0.0.0" as the server address

I don't know how you run these ZKs in Docker, but I'd look for a workaround where you can configure the correct address in ZooKeeper's configuration. Then Solr will be happy.
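For what it's worth, ZooKeeper itself has a knob that may help here: with `quorumListenOnAllIPs=true` each node binds its quorum/election ports on all interfaces, so the server list can advertise the real addresses even though the container cannot bind them directly. A sketch only, using the addresses from the earlier mails as placeholders; note the client port also binds to all interfaces by default when no clientPortAddress is set:

```properties
# zoo.cfg sketch (assumption: addresses taken from the earlier mails).
# quorumListenOnAllIPs=true lets each node bind its quorum/election ports
# on 0.0.0.0 while still advertising its real address to the ensemble.
quorumListenOnAllIPs=true
clientPort=2181
server.1=192.168.0.109:2888:3888
server.2=192.168.0.126:2888:3888
server.3=192.168.0.2:2888:3888
```

With a config like this, /zookeeper/config would no longer contain 0.0.0.0 entries for Solr to stumble over.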

Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181 returns immediately instead of after 30s?
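As a side note, the fast-fail vs 30 s difference can be reproduced outside Solr by imitating the status handler's probe with a plain socket and an explicit connect timeout. A sketch only: the real handler additionally sends a four-letter-word command such as `ruok` or `mntr` after connecting, and host, port, and timeout here are placeholders:

```python
import socket
import time

def probe(host: str, port: int, timeout_s: float = 2.0) -> str:
    """Attempt a TCP connect the way a status probe would, with a bounded timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except OSError as exc:
        return f"failed after {time.monotonic() - start:.1f}s: {exc}"

# A near-instant "Connection refused" matches the 8.11 behaviour described
# below; a failure only after timeout_s means the connect attempt hangs.
print(probe("0.0.0.0", 2181))
```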

Jan

> On 15 Dec 2022, at 07:10, michael dürr <du...@gmail.com> wrote:
> 
> Hi Jan,
> 
> Thanks for answering!
> 
> I'm pretty sure the reason is related to the problem that solr tries to
> connect to "0.0.0.0" as it reads that IP from the /zookeeper/config znode
> of the zookeeper ensemble.
> The connection I'm talking about is when
> ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String
> fourLetterWordCommand) tries to open a Socket to "0.0.0.0:2181".
> After a while the connect fails, but as said this takes a long time. I did
> not debug any deeper, as at that point it is already JDK code.
> 
> The timings for the valid zookeeper addresses (i.e. those from the static
> configuration string) are listed later. What causes problems is the attempt
> to connect to 0.0.0.0:2181:
> 
> /opt/solr-9.1.0$ export ZK_HOST=0.0.0.0:2181
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd get /zookeeper/config
> WARN  - 2022-12-15 06:57:44.828; org.apache.solr.common.cloud.SolrZkClient;
> Using default ZkCredentialsInjector. ZkCredentialsInjector is not secure,
> it creates an empty list of credentials which leads to 'OPEN_ACL_UNSAFE'
> ACLs to Zookeeper nodes
> INFO  - 2022-12-15 06:57:44.852;
> org.apache.solr.common.cloud.ConnectionManager; Waiting up to 30000ms for
> client to connect to ZooKeeper
> Exception in thread "main" org.apache.solr.common.SolrException:
> java.util.concurrent.TimeoutException: Could not connect to ZooKeeper
> 0.0.0.0:2181 within 30000 ms
>        at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:225)
>        at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:137)
>        at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:120)
>        at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:260)
> Caused by: java.util.concurrent.TimeoutException: Could not connect to
> ZooKeeper 0.0.0.0:2181 within 30000 ms
>        at
> org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:297)
>        at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:216)
>        ... 3 more
> 
> real    0m31.728s
> user    0m3.284s
> sys     0m0.226s
> 
> Of course this call is bound to fail, but that was not a problem before
> (solr 8.11.1): the call also failed, but it returned fast.
> 
> Here are the timings you are interested in, for each of my 3 zookeeper
> nodes (adjusted to my setup). The interesting part is the result of
> fetching /zookeeper/config, as it shows the server configurations that
> include the "0.0.0.0" addresses:
> 
> /opt/solr-9.1.0$ export ZK_HOST=192.168.0.109:2181
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd get /zookeeper/config
> server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181
> server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
> server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
> version=0
> 
> real    0m0.810s
> user    0m3.142s
> sys     0m0.148s
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd ls /solr/live_nodes
> /solr/live_nodes (2)
> /solr/live_nodes/192.168.0.222:8983_solr (0)
> /solr/live_nodes/192.168.0.223:8983_solr (0)
> 
> real    0m0.838s
> user    0m3.166s
> sys     0m0.210s
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # ...
> 
> real    0m0.836s
> user    0m3.121s
> sys     0m0.173s
> 
> /opt/solr-9.1.0$ export ZK_HOST=192.168.0.126:2181
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd get /zookeeper/config
> server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
> server.2=0.0.0.0:2888:3888:participant;0.0.0.0:2181
> server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
> version=0
> 
> real    0m0.843s
> user    0m3.300s
> sys     0m0.183s
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd ls /solr/live_nodes
> /solr/live_nodes (2)
> /solr/live_nodes/192.168.0.222:8983_solr (0)
> /solr/live_nodes/192.168.0.223:8983_solr (0)
> 
> real    0m0.807s
> user    0m3.035s
> sys     0m0.164s
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # ...
> 
> real    0m0.859s
> user    0m3.354s
> sys     0m0.177s
> 
> export ZK_HOST=192.168.0.2:2181
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd get /zookeeper/config
> server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
> server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
> server.3=0.0.0.0:2888:3888:participant;0.0.0.0:2181
> version=0
> 
> real    0m0.790s
> user    0m2.838s
> sys     0m0.154s
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd ls /solr/live_nodes
> /solr/live_nodes (2)
> /solr/live_nodes/192.168.0.222:8983_solr (0)
> /solr/live_nodes/192.168.0.223:8983_solr (0)
> 
> real    0m0.861s
> user    0m3.201s
> sys     0m0.169s
> 
> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # ...
> 
> real    0m0.779s
> user    0m3.081s
> sys     0m0.184s
> 
> Thanks,
> Michael
> 
> On Wed, Dec 14, 2022 at 10:08 PM Jan Høydahl <ja...@cominvent.com> wrote:
> 
>> Hi,
>> 
>> We always check how the zookeeper ensemble is configured, and this
>> check does not depend on whether dynamic reconfiguration is possible or
>> not,
>> it is simply to detect the common mistake that a 3 node ensemble is
>> addressed
>> with only one of the hosts in the static config, or with wrong host names.
>> 
>> Sounds like your problem is not with how Solr talks to ZK, but with how
>> you have configured your network. You say
>> 
>>> But this will cause the socket connect to block when resolving
>>> "0.0.0.0" which makes everything very slow.
>> 
>> Can you elaborate on exactly which connection you are talking about
>> here, and why/where it is blocking? Can you perhaps attempt a few commands
>> from the command line to illustrate your point?
>> 
>> Assuming you are on Linux, and have the 'time' command available, try this
>> 
>> export ZK_HOST=my-zookeeper:2181
>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get
>> /zookeeper/config
>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /live_nodes
>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get
>> /configs/_default/stopwords.txt
>> 
>> What kind of timings do you see?
>> 
>> Jan
>> 
>>> On 14 Dec 2022, at 13:23, michael dürr <du...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Since we have updated to Solr 9.1, the admin ui has become pretty slow.
>>> 
>>> The problem is related to the fact that we run solr and the zookeeper
>>> ensemble dockerized. As we cannot bind zookeeper from docker to its
>> host's
>>> external ip address, we have to use "0.0.0.0" as the server address which
>>> causes problems when solr tries to get the zookeeper status (via
>>> /solr/admin/zookeeper/status)
>>> 
>>> Some debugging showed that ZookeeperStatusHandler.getZkStatus() always
>>> tries to get the dynamic configuration from zookeeper in order to check
>>> whether it contains all hosts of solr's static zookeeper configuration
>>> string. But this will cause the socket connect to block when resolving
>>> "0.0.0.0" which makes everything very slow.
>>> 
>>> The approach of checking whether zookeeper allows dynamic
>>> reconfiguration based on the existence of the znode /zookeeper/config
>>> does not seem to be a good one, as this znode will exist even when the
>>> zookeeper ensemble does not allow dynamic reconfiguration
>>> (reconfigEnabled=false).
>>> 
>>> Can anybody suggest some simple action to avoid that blocking (i.e. the
>>> dynamic configuration check) in order to make the status request return
>>> fast again?
>>> 
>>> It would be nice to have a configuration parameter that disables this
>>> check independently of the zookeeper ensemble status, especially as
>>> reconfigEnabled=false is the default setting for zookeeper.
>>> 
>>> Thanks,
>>> Michael
>> 
>> 


Re: Slow / unresponsive admin ui since solr 9.1 update

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

Zookeeper exposes its configuration in that znode regardless of whether users have enabled dynamic re-configuration. Solr does not support, and does not care about, dynamic configuration. But the UI does compare Zookeeper's actual configuration to the static ZK_HOST string provided on the solr side and looks for problems. So say you have 3 zk nodes but your ZK_HOST lists only one. If that one zookeeper host responds, the UI will use the zk-provided configuration as the authoritative source and obtain status for every single node. Then it will display a warning on that page that you should really configure all addresses for fault tolerance.
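The check described above can be sketched roughly like this (my own illustrative reconstruction, not the actual ZookeeperStatusHandler code; the helper names are made up):

```python
def parse_dynamic_config(config_text: str) -> set[str]:
    """Extract client host:port pairs from the /zookeeper/config payload.

    A server line looks like:
        server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
    where the part after ';' is the client address (or just a port).
    """
    hosts = set()
    for line in config_text.splitlines():
        if line.startswith("server.") and ";" in line:
            client = line.split(";", 1)[1].strip()
            hosts.add(client if ":" in client else f"0.0.0.0:{client}")
    return hosts

def check_zk_host(zk_host: str, dynamic_hosts: set[str]) -> list[str]:
    """Return ensemble members missing from the static ZK_HOST string."""
    static_hosts = set(zk_host.split("/")[0].split(","))  # drop any chroot
    return sorted(dynamic_hosts - static_hosts)

config = """server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181
server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
version=0"""
missing = check_zk_host("192.168.0.109:2181", parse_dynamic_config(config))
print(missing)  # → ['0.0.0.0:2181']
```

The warning would then fire whenever `missing` is non-empty; with a 0.0.0.0 client address in the znode, the probe list ends up containing an address that nothing outside the container can connect to.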

An improvement we could perhaps make is to explicitly detect the 0.0.0.0 IP and issue a separate warning in the UI for it, and fall back to only checking status for the host/IP names in ZK_HOST, since we know the others won't resolve. Then you would not get the 30s delay.
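The proposed fallback could look something like this (again only a sketch of the idea, not Solr code):

```python
def hosts_to_probe(static_hosts: list[str], dynamic_hosts: list[str]) -> list[str]:
    """Prefer the ensemble's own view of itself, but skip wildcard
    addresses that cannot be connected to from outside; if nothing
    usable remains, fall back to the ZK_HOST entries."""
    usable = [h for h in dynamic_hosts if not h.startswith("0.0.0.0")]
    return usable if usable else list(static_hosts)

print(hosts_to_probe(["192.168.0.109:2181"],
                     ["0.0.0.0:2181", "0.0.0.0:2181", "0.0.0.0:2181"]))
# → ['192.168.0.109:2181'] — no 30 s hang on the wildcard address
```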

If you want to open a JIRA and submit a PR I'll help review.

Jan

> On 3 Feb 2023, at 06:30, michael dürr <du...@gmail.com> wrote:
> 
> Hi Jan,
> 
> Thanks for answering!
> 
>> I don't know how you run these zk's dockerized, but I'd look for a
> workaround where you can configure the correct address in zk's
> configuration. Then Solr will be happy.
> 
> There exists a workaround where you can assign certain address ranges to a
> docker host so that each of its containers runs with an external ip the
> container can bind to:
> 
> https://solr.apache.org/guide/solr/latest/deployment-guide/docker-networking.html
> 
> Anyway, this is pretty cumbersome and not really necessary, as the
> communication between zookeeper nodes works perfectly when binding locally
> to "0.0.0.0".
> The problem only occurs when using the admin ui, because (as mentioned in
> my previous email) solr tries to connect to "0.0.0.0" as it reads that IP
> from the /zookeeper/config znode of the zookeeper ensemble.
> That causes solr to call ZookeeperStatusHandler.getZkRawResponse(String
> zkHostPort, String fourLetterWordCommand) with "0.0.0.0:2181".
> This does not happen in any other scenario, as solr won't use "0.0.0.0"
> (dynamic configuration is disabled by default) but only the IPs from the
> configured zkHost string (which does not mention "0.0.0.0", only valid
> IPs).
> 
> So I'm just wondering why the solr ui tries to use IPs read
> from the /zookeeper/config znode even when the zookeeper config
> explicitly disables dynamic reconfiguration (reconfigEnabled=false).
> I'd expect solr to respect the zookeeper config and not try to resolve
> addresses other than those configured in the zkHost string.
> 
>> Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181 returns
> immediately instead of after 30s?
> 
> Yes, I tested it now with 8.11.1: the connection attempt immediately
> raises an exception (trace below).
> The zkcli.sh call as a whole still returns only after the 30 second
> timeout, but in contrast to solr 9.1, each connect under 8.11 fails
> immediately with "Connection refused" instead of hanging. So the solr ui
> also stays responsive.
> 
> root@solr1:/opt/solr#  export ZK_HOST=0.0.0.0:2181
> 
> root@solr1:/opt/solr#  time server/scripts/cloud-scripts/zkcli.sh -z
> $ZK_HOST -cmd get /zookeeper/config
> INFO  - 2023-02-03 06:19:30.748;
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
> connect to ZooKeeper
> WARN  - 2023-02-03 06:19:30.759; org.apache.zookeeper.ClientCnxn; Session
> 0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting
> reconnect except it is a SessionExpiredException. =>
> java.net.ConnectException: Connection refused
>       at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native
> Method)
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
>       at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
> ~[?:?]
>       at
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
> ~[zookeeper-3.6.2.jar:3.6.2]
>       at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275)
> ~[zookeeper-3.6.2.jar:3.6.2]
> WARN  - 2023-02-03 06:19:31.867; org.apache.zookeeper.ClientCnxn; Session
> 0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting
> reconnect except it is a SessionExpiredException. =>
> 
> === same exception multiple times and then... ===
> 
> WARN  - 2023-02-03 06:20:01.692; org.apache.zookeeper.ClientCnxn; An
> exception was thrown while closing send thread for session 0x0. =>
> java.net.ConnectException: Connection refused
>       at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native
> Method)
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
>       at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
> ~[?:?]
>       at
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
> ~[zookeeper-3.6.2.jar:3.6.2]
>       at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275)
> ~[zookeeper-3.6.2.jar:3.6.2]
> Exception in thread "main" org.apache.solr.common.SolrException:
> java.util.concurrent.TimeoutException: Could not connect to ZooKeeper
> 0.0.0.0:2181 within 30000 ms
>       at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:195)
>       at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:119)
>       at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:109)
>       at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:196)
> Caused by: java.util.concurrent.TimeoutException: Could not connect to
> ZooKeeper 0.0.0.0:2181 within 30000 ms
>       at
> org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:251)
> 
>       at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:186)
>       ... 3 more
> 
> real    0m31.693s
> user    0m2.375s
> sys     0m0.199s
> 
> Thanks,
> Michael


Re: Slow / unresponsive admin ui since solr 9.1 update

Posted by michael dürr <du...@gmail.com>.
Hi Jan,

Thanks for answering!

> I don't know how you run these zk's dockerized, but I'd look for a
workaround where you can configure the correct address in zk's
configuration. Then Solr will be happy.

There exists a workaround where you can assign certain address ranges to a
docker host to run each of its containers with an external ip the container
can bind to:

https://solr.apache.org/guide/solr/latest/deployment-guide/docker-networking.html

Anyways, this is pretty cumbersome and not really necessary as the
communication between zookeeper nodes perfectly works by locally binding to
"0.0.0.0".
The problem only occurs when using the admin ui, because of the fact that
(as mentioned in my former email) solr tries to connect to "0.0.0.0" as it
reads that IP from the /zookeeper/config znode of the zookeeper ensemble.
That causes solr to call ZookeeperStatusHandler.getZkRawResponse(String
zkHostPort, String fourLetterWordCommand) with "0.0.0.0:2181".
This does not happen for any other scenarios as solr won't use "0.0.0.0"
(dynamic configuration is disabled by default) but only the IPs from the
configured zkHost string (which does not mention "0.0.0.0" but only valid
IPs).

So I'm just wondering why solr ui tries to use IPs read
from  /zookeeper/config znode even in case the zookeeper config does
explicitly disable dynamic reconfiguration (reconfigEnabled=false).
I'd expect solr to respect the zookeeper config and not try to resolve
addresses other than those configured in zkHost string.

> Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181 returns
immediately instead of after 30s?

Yes, I tested it now with 8.11.1 and the call immediately returns an
exception (trace below).
After 30 seconds the call returns due to a timeout. But in contrast to solr
9.1. the call for 8.11 immediately returns an exception. So the solr ui
also gets immediately responsive.

root@solr1:/opt/solr#  export ZK_HOST=0.0.0.0:2181

root@solr1:/opt/solr#  time server/scripts/cloud-scripts/zkcli.sh -z
$ZK_HOST -cmd get /zookeeper/config
INFO  - 2023-02-03 06:19:30.748;
org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
connect to ZooKeeper
WARN  - 2023-02-03 06:19:30.759; org.apache.zookeeper.ClientCnxn; Session
0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting
reconnect except it is a SessionExpiredException. =>
java.net.ConnectException: Connection refused
       at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native
Method)
java.net.ConnectException: Connection refused
       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
       at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
~[?:?]
       at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
~[zookeeper-3.6.2.jar:3.6.2]
       at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275)
~[zookeeper-3.6.2.jar:3.6.2]
WARN  - 2023-02-03 06:19:31.867; org.apache.zookeeper.ClientCnxn; Session
0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting
reconnect except it is a SessionExpiredException. =>

=== same exception multiple times and then... ===

WARN  - 2023-02-03 06:20:01.692; org.apache.zookeeper.ClientCnxn; An
exception was thrown while closing send thread for session 0x0. =>
java.net.ConnectException: Connection refused
       at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native
Method)
java.net.ConnectException: Connection refused
       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
       at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
~[?:?]
       at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
~[zookeeper-3.6.2.jar:3.6.2]
       at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275)
~[zookeeper-3.6.2.jar:3.6.2]
Exception in thread "main" org.apache.solr.common.SolrException:
java.util.concurrent.TimeoutException: Could not connect to ZooKeeper
0.0.0.0:2181 within 30000 ms
       at
org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:195)
       at
org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:119)
       at
org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:109)
       at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:196)
Caused by: java.util.concurrent.TimeoutException: Could not connect to
ZooKeeper 0.0.0.0:2181 within 30000 ms
       at
org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:251)

       at
org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:186)
       ... 3 more

real    0m31.693s
user    0m2.375s
sys     0m0.199s

Thanks,
Michael

On Thu, Feb 2, 2023 at 12:32 PM Jan Høydahl <ja...@cominvent.com> wrote:

> Hi,
>
> Following up on this. I'd still say that the issue here seems to be that
> your zookeeper config lists 0.0.0.0 as ip address for client connections.
>
> >>> The problem is related to the fact that we run solr and the zookeeper
> >>> ensemble dockerized. As we cannot bind zookeeper from docker to its
> host's
> >>> external ip address, we have to use "0.0.0.0" as the server address
>
> I don't know how you run these zk's dockerized, but I'd look for a
> workaround where you can configure the correct address in zk's
> configuration. Then Solr will be happy.
>
> Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181
> returns immediately instead of after 30s?
>
> Jan
>
> > 15. des. 2022 kl. 07:10 skrev michael dürr <du...@gmail.com>:
> >
> > Hi Jan,
> >
> > Thanks for answering!
> >
> > I'm pretty sure the reason is related to the problem that solr tries to
> > connect to "0.0.0.0" as it reads that IP from the /zookeeper/config znode
> > of the zookeeper ensemble.
> > The connection I'm talking about is when
> > ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String
> > fourLetterWordCommand) tries to open a Socket to "0.0.0.0:2181".
> > After a while the connect fails but as said this takes a long time. I did
> > not debug deeper as this already is jdk code then.
> >
> > The timings for the valid zookeeper addresses (i.e. those from the static
> > configuration string) are listed later. What causes problems is the
> attempt
> > to connect to 0.0.0.0:2181:
> >
> > /opt/solr-9.1.0$ export ZK_HOST=0.0.0.0:2181
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd get /zookeeper/config
> > WARN  - 2022-12-15 06:57:44.828;
> org.apache.solr.common.cloud.SolrZkClient;
> > Using default ZkCredentialsInjector. ZkCredentialsInjector is not secure,
> > it creates an empty list of credentials which leads to 'OPEN_ACL_UNSAFE'
> > ACLs to Zookeeper nodes
> > INFO  - 2022-12-15 06:57:44.852;
> > org.apache.solr.common.cloud.ConnectionManager; Waiting up to 30000ms for
> > client to connect to ZooKeeper
> > Exception in thread "main" org.apache.solr.common.SolrException:
> > java.util.concurrent.TimeoutException: Could not connect to ZooKeeper
> > 0.0.0.0:2181 within 30000 ms
> >        at
> > org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:225)
> >        at
> > org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:137)
> >        at
> > org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:120)
> >        at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:260)
> > Caused by: java.util.concurrent.TimeoutException: Could not connect to
> > ZooKeeper 0.0.0.0:2181 within 30000 ms
> >        at
> > org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:297)
> >        at
> > org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:216)
> >        ... 3 more
> >
> > real    0m31.728s
> > user    0m3.284s
> > sys     0m0.226s
> >
> > Of course this call fails, but that was not a problem before (solr 8.11.1):
> > the call also failed, but it returned fast.
> >
> > Here are the timings you asked about for each of my 3 zookeeper nodes
> > (adjusted to my setup). The interesting part is the result of fetching
> > /zookeeper/config, as it shows the server configurations that include
> > the "0.0.0.0" addresses:
> >
> > /opt/solr-9.1.0$ export ZK_HOST=192.168.0.109:2181
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd get /zookeeper/config
> > server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181
> > server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
> > server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
> > version=0
> >
> > real    0m0.810s
> > user    0m3.142s
> > sys     0m0.148s
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd ls /solr/live_nodes
> > /solr/live_nodes (2)
> > /solr/live_nodes/192.168.0.222:8983_solr (0)
> > /solr/live_nodes/192.168.0.223:8983_solr (0)
> >
> > real    0m0.838s
> > user    0m3.166s
> > sys     0m0.210s
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
> > # Licensed to the Apache Software Foundation (ASF) under one or more
> > # ...
> >
> > real    0m0.836s
> > user    0m3.121s
> > sys     0m0.173s
> >
> > /opt/solr-9.1.0$ export ZK_HOST=192.168.0.126:2181
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd get /zookeeper/config
> > server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
> > server.2=0.0.0.0:2888:3888:participant;0.0.0.0:2181
> > server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
> > version=0
> >
> > real    0m0.843s
> > user    0m3.300s
> > sys     0m0.183s
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd ls /solr/live_nodes
> > /solr/live_nodes (2)
> > /solr/live_nodes/192.168.0.222:8983_solr (0)
> > /solr/live_nodes/192.168.0.223:8983_solr (0)
> >
> > real    0m0.807s
> > user    0m3.035s
> > sys     0m0.164s
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
> > # Licensed to the Apache Software Foundation (ASF) under one or more
> > # ...
> >
> > real    0m0.859s
> > user    0m3.354s
> > sys     0m0.177s
> >
> > export ZK_HOST=192.168.0.2:2181
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd get /zookeeper/config
> > server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
> > server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
> > server.3=0.0.0.0:2888:3888:participant;0.0.0.0:2181
> > version=0
> >
> > real    0m0.790s
> > user    0m2.838s
> > sys     0m0.154s
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd ls /solr/live_nodes
> > /solr/live_nodes (2)
> > /solr/live_nodes/192.168.0.222:8983_solr (0)
> > /solr/live_nodes/192.168.0.223:8983_solr (0)
> >
> > real    0m0.861s
> > user    0m3.201s
> > sys     0m0.169s
> >
> > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
> > -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
> > # Licensed to the Apache Software Foundation (ASF) under one or more
> > # ...
> >
> > real    0m0.779s
> > user    0m3.081s
> > sys     0m0.184s
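The server lines in those /zookeeper/config dumps are where the 0.0.0.0 client addresses come from. Extracting the client part can be sketched as follows (illustrative only, not Solr's actual parsing code):

```python
def client_addresses(config_text: str) -> list[str]:
    """Pull the client host:port out of each 'server.N=...' line of
    ZooKeeper's dynamic config (format: host:2888:3888:role;client:port)."""
    addrs = []
    for line in config_text.splitlines():
        if line.startswith("server.") and ";" in line:
            addrs.append(line.split(";", 1)[1])
    return addrs

config = """\
server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181
server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
version=0"""
print(client_addresses(config))  # -> ['0.0.0.0:2181', '0.0.0.0:2181']
```

Any 0.0.0.0 entry in that list is an address a status probe cannot meaningfully dial from outside the container.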
> >
> > Thanks,
> > Michael
> >
> > On Wed, Dec 14, 2022 at 10:08 PM Jan Høydahl <ja...@cominvent.com> wrote:
> >
> >> Hi,
> >>
> >> We always check how the zookeeper ensemble is configured. This check
> >> does not depend on whether dynamic reconfiguration is possible or not;
> >> it is simply there to detect the common mistake that a 3 node ensemble
> >> is addressed with only one of the hosts in the static config, or with
> >> wrong host names.
> >>
> >> Sounds like your problem is not with how Solr talks to ZK, but with how
> >> you have configured your network. You say
> >>
> >>> But this will cause the socket connect to block when resolving
> >>> "0.0.0.0" which makes everything very slow.
> >>
> >> Can you elaborate on exactly which connection you are talking about
> >> here, and why/where it is blocking? Can you perhaps attempt a few
> >> commands from the command line to illustrate your point?
> >>
> >> Assuming you are on Linux, and have the 'time' command available, try
> >> this:
> >>
> >> export ZK_HOST=my-zookeeper:2181
> >> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
> >> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /live_nodes
> >> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /configs/_default/stopwords.txt
> >>
> >> What kind of timings do you see?
> >>
> >> Jan
> >>
> >>> On 14 Dec 2022 at 13:23, michael dürr <du...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Since we have updated to Solr 9.1, the admin ui has become pretty slow.
> >>>
> >>> The problem is related to the fact that we run solr and the zookeeper
> >>> ensemble dockerized. As we cannot bind zookeeper from docker to its
> >>> host's external ip address, we have to use "0.0.0.0" as the server
> >>> address, which causes problems when solr tries to get the zookeeper
> >>> status (via /solr/admin/zookeeper/status).
> >>>
> >>> Some debugging showed that ZookeeperStatusHandler.getZkStatus() always
> >>> tries to get the dynamic configuration from zookeeper in order to check
> >>> whether it contains all hosts of solr's static zookeeper configuration
> >>> string. But this will cause the socket connect to block when resolving
> >>> "0.0.0.0", which makes everything very slow.
> >>>
> >>> The approach of checking whether zookeeper allows dynamic
> >>> reconfiguration based on the existence of the znode /zookeeper/config
> >>> does not seem to be a good one, as this znode will exist even when the
> >>> zookeeper ensemble does not allow dynamic reconfiguration
> >>> (reconfigEnabled=false).
> >>>
> >>> Can anybody suggest some simple action to avoid that blocking (i.e. the
> >>> dynamic configuration check) in order to make the status request return
> >>> fast again?
> >>>
> >>> It would be nice to have a configuration parameter that disables this
> >>> check independent of the zookeeper ensemble status, especially as
> >>> reconfigEnabled=false is the default setting for zookeeper.
> >>>
> >>> Thanks,
> >>> Michael
> >>
> >>
>
>