Posted to solr-user@lucene.apache.org by Chetas Joshi <ch...@gmail.com> on 2016/11/19 00:51:50 UTC

CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

Hi,

I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The
SolrCloud is having difficulties talking to ZK when I am ingesting data
into the collections. At that time I am also running queries (that return
millions of docs). The ingest job is failing with the following exception:

org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
from server at http://xxx/solr/collection1_shard15_replica1: Cannot talk to
ZooKeeper - Updates are disabled.

I think this is happening when the ingest job tries to update the
clusterstate.json file while a query is reading from that file and thus
holds some kind of lock on it. Are there any factors that would cause the
"READ" to hold a lock for a long time? Is my understanding correct? I am
using the cursor approach with SolrJ to get results back from Solr.
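For reference, the cursor protocol itself is simple: start with cursorMark=*, feed each response's nextCursorMark into the next request, and stop when Solr echoes back the mark you sent. A minimal sketch of that control flow, with the actual SolrJ call (a CloudSolrClient.query using CursorMarkParams.CURSOR_MARK_PARAM and QueryResponse.getNextCursorMark()) abstracted behind a function so the loop is self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class CursorLoop {
    /** One page of results plus the cursor mark to send on the next request. */
    public static final class Page {
        final List<String> docIds;
        final String nextCursorMark;
        public Page(List<String> docIds, String nextCursorMark) {
            this.docIds = docIds;
            this.nextCursorMark = nextCursorMark;
        }
    }

    /**
     * Drives the cursor protocol: start at "*" (CURSOR_MARK_START), pass each
     * response's nextCursorMark into the next request, and stop when the mark
     * Solr returns equals the one we sent -- the end-of-results signal.
     */
    public static List<String> fetchAll(Function<String, Page> queryPage) {
        List<String> all = new ArrayList<>();
        String cursorMark = "*";
        while (true) {
            Page page = queryPage.apply(cursorMark);
            all.addAll(page.docIds);
            if (cursorMark.equals(page.nextCursorMark)) {
                break; // Solr echoed our mark back: no more results
            }
            cursorMark = page.nextCursorMark;
        }
        return all;
    }
}
```

In real SolrJ code, queryPage would set CursorMarkParams.CURSOR_MARK_PARAM on a SolrQuery whose sort includes the uniqueKey field, and read QueryResponse.getNextCursorMark().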

How often is the ZK updated with the latest cluster state and what
parameter governs that? Should I just increase the ZK client timeout so
that it retries connecting to the ZK for a longer period of time (right now
it is 15 seconds)?

Thanks!

Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

Posted by Chetas Joshi <ch...@gmail.com>.
Thanks Erick and Shawn.

I have reduced the number of rows per page from 500K to 100K.
I also increased zkClientTimeout to 30 seconds so that I don't run into
ZK timeout issues. The ZK cluster has been deployed on hosts other
than the SolrCloud hosts.
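As a sketch, the zkClientTimeout setting lives in the <solrcloud> section of solr.xml; a 30-second value (in milliseconds) would look roughly like this, where the zkHost value and the system-property defaults are placeholders:

```xml
<!-- solr.xml (sketch): raise the ZK session timeout to 30s -->
<solrcloud>
  <str name="zkHost">${zkHost:}</str>
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
</solrcloud>
```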

However, I was trying to increase the number of rows per page for the
following reason: running ingestion at the same time as queries has
increased the time it takes to read results from Solr using the
cursor approach by a factor of 5. I am able to read 1M sorted documents in 1 hour
(88 bytes of data per document).

What could be the reason behind the slow query execution? I am
running the Solr servers with heap=16g and off-heap=16g. The off-heap memory is
used as the block cache. Do ingestion and query execution both use a lot of
block cache? Should I increase the block cache size in order to improve
query performance? Should I increase slab.count or maxDirectMemorySize?
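For reference, a sketch of the HDFS block cache knobs in solrconfig.xml; the values below are placeholders, not recommendations. Each slab is blocksperbank x 8KB blocks, so total cache size is roughly slab.count x blocksperbank x 8KB, and with direct memory allocation enabled it must fit inside the JVM's -XX:MaxDirectMemorySize:

```xml
<!-- solrconfig.xml (sketch): HDFS block cache sizing; values are placeholders -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <!-- cache size ~= slab.count * blocksperbank * 8KB; here 16 * 16384 * 8KB = 2GB -->
  <int name="solr.hdfs.blockcache.slab.count">16</int>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
</directoryFactory>
```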

Thanks!

On Sat, Nov 19, 2016 at 8:13 AM, Erick Erickson <er...@gmail.com>
wrote:

> Returning 500K rows is, as Shawn says, not Solr's sweet spot.
>
> My guess: All the work you're doing trying to return that many
> rows, particularly in SolrCloud mode is simply overloading
> your system to the point that the ZK connection times out. Don't
> do that. If you need that many rows, either Shawn's cursorMark
> option or export/streaming aggregation is a much better
> choice.
>
> Consider what happens on a sharded request:
> - the initial node sends a sub-request to a replica for each shard.
> - each replica returns its candidate topN (doc ID and sort criteria)
> - the initial node sorts these lists (1M from each replica in your
> example) to get the true top N
> - the initial node requests the docs from each replica that made it
> into the true top N
> - each replica goes to disk, decompresses the doc and pulls out the fields
> - each replica sends its portion of the top N to the initial node
> - an enormous packet containing all 1M final docs is assembled and
> returned to the client.
> - this sucks up bandwidth and resources
> - that's bad enough, but especially if your ZK nodes are on the same
> box as your Solr nodes they're even more likely to have a timeout issue.
>
>
> Best,
> Erick
>
> On Fri, Nov 18, 2016 at 8:45 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> > On 11/18/2016 6:50 PM, Chetas Joshi wrote:
> >> The numFound is millions but I was also trying with rows= 1 Million. I
> will reduce it to 500K.
> >>
> >> I am sorry. It is state.json. I am using Solr 5.5.0
> >>
> >> One of the things I am not able to understand is why my ingestion job is
> >> complaining about "Cannot talk to ZooKeeper - Updates are disabled."
> >>
> >> I have a spark streaming job that continuously ingests into Solr. My
> shards are always up and running. The moment I start a query on SolrCloud
> it starts running into this exception. However as you said ZK will only
> update the state of the cluster when the shards go down. Then why is my job
> trying to contact ZK when the cluster is up, and why is the exception about
> updating ZK?
> >
> > SolrCloud and SolrJ (CloudSolrClient) both maintain constant connections
> > to all the zookeeper servers they are configured to use.  If zookeeper
> > quorum is lost, SolrCloud will go read-only -- no updating is possible.
> > That is what is meant by "updates are disabled."
> >
> > Solr and Lucene are optimized for very low rowcounts, typically two or
> > three digits.  Asking for hundreds of thousands of rows is problematic.
> > The cursorMark feature is designed for efficient queries when paging
> > deeply into results, but it assumes your rows value is relatively small,
> > and that you will be making many queries to get a large number of
> > results, each of which will be fast and won't overload the server.
> >
> > Since it appears you are having a performance issue, here's a few things
> > I have written on the topic:
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Thanks,
> > Shawn
> >
>

Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

Posted by Erick Erickson <er...@gmail.com>.
Returning 500K rows is, as Shawn says, not Solr's sweet spot.

My guess: All the work you're doing trying to return that many
rows, particularly in SolrCloud mode is simply overloading
your system to the point that the ZK connection times out. Don't
do that. If you need that many rows, either Shawn's cursorMark
option or export/streaming aggregation is a much better
choice.

Consider what happens on a sharded request:
- the initial node sends a sub-request to a replica for each shard.
- each replica returns its candidate topN (doc ID and sort criteria)
- the initial node sorts these lists (1M from each replica in your
example) to get the true top N
- the initial node requests the docs from each replica that made it
into the true top N
- each replica goes to disk, decompresses the doc and pulls out the fields
- each replica sends its portion of the top N to the initial node
- an enormous packet containing all 1M final docs is assembled and
returned to the client.
- this sucks up bandwidth and resources
- that's bad enough, but especially if your ZK nodes are on the same
box as your Solr nodes they're even more likely to have a timeout issue.
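The merge step in the list above is essentially the following (a self-contained sketch, not Solr's actual code): each shard contributes its own sorted candidate list, and the coordinating node re-sorts the union to find the true top N. With rows=1M, every shard must ship 1M (id, sort value) pairs just for this step.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopNMerge {
    /** A candidate from one shard: document id plus its sort value (e.g. score). */
    public static final class Candidate {
        final int docId;
        final int sortValue;
        public Candidate(int docId, int sortValue) {
            this.docId = docId;
            this.sortValue = sortValue;
        }
    }

    /**
     * The coordinator's merge: pool every shard's candidates (each shard has
     * already sent its own top N), sort descending by sort value, and keep the
     * true top N. Note the coordinator holds shards * N candidates in memory.
     */
    public static List<Integer> mergeTopN(List<List<Candidate>> perShard, int n) {
        List<Candidate> pool = new ArrayList<>();
        perShard.forEach(pool::addAll);
        pool.sort(Comparator.comparingInt((Candidate c) -> c.sortValue).reversed());
        List<Integer> topIds = new ArrayList<>();
        for (int i = 0; i < Math.min(n, pool.size()); i++) {
            topIds.add(pool.get(i).docId);
        }
        return topIds;
    }
}
```

Only after this merge does the coordinator go back to each replica for the stored fields of the winners, which is why huge rows values multiply both the merge cost and the final response size.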


Best,
Erick

On Fri, Nov 18, 2016 at 8:45 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 11/18/2016 6:50 PM, Chetas Joshi wrote:
>> The numFound is millions but I was also trying with rows= 1 Million. I will reduce it to 500K.
>>
>> I am sorry. It is state.json. I am using Solr 5.5.0
>>
>> One of the things I am not able to understand is why my ingestion job is
>> complaining about "Cannot talk to ZooKeeper - Updates are disabled."
>>
>> I have a spark streaming job that continuously ingests into Solr. My shards are always up and running. The moment I start a query on SolrCloud it starts running into this exception. However, as you said, ZK will only update the state of the cluster when the shards go down. Then why is my job trying to contact ZK when the cluster is up, and why is the exception about updating ZK?
>
> SolrCloud and SolrJ (CloudSolrClient) both maintain constant connections
> to all the zookeeper servers they are configured to use.  If zookeeper
> quorum is lost, SolrCloud will go read-only -- no updating is possible.
> That is what is meant by "updates are disabled."
>
> Solr and Lucene are optimized for very low rowcounts, typically two or
> three digits.  Asking for hundreds of thousands of rows is problematic.
> The cursorMark feature is designed for efficient queries when paging
> deeply into results, but it assumes your rows value is relatively small,
> and that you will be making many queries to get a large number of
> results, each of which will be fast and won't overload the server.
>
> Since it appears you are having a performance issue, here's a few things
> I have written on the topic:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
>

Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

Posted by Shawn Heisey <ap...@elyograg.org>.
On 11/18/2016 6:50 PM, Chetas Joshi wrote:
> The numFound is millions but I was also trying with rows= 1 Million. I will reduce it to 500K.
>
> I am sorry. It is state.json. I am using Solr 5.5.0
>
> One of the things I am not able to understand is why my ingestion job is
> complaining about "Cannot talk to ZooKeeper - Updates are disabled."
>
> I have a spark streaming job that continuously ingests into Solr. My shards are always up and running. The moment I start a query on SolrCloud it starts running into this exception. However, as you said, ZK will only update the state of the cluster when the shards go down. Then why is my job trying to contact ZK when the cluster is up, and why is the exception about updating ZK?

SolrCloud and SolrJ (CloudSolrClient) both maintain constant connections
to all the zookeeper servers they are configured to use.  If zookeeper
quorum is lost, SolrCloud will go read-only -- no updating is possible. 
That is what is meant by "updates are disabled."

Solr and Lucene are optimized for very low rowcounts, typically two or
three digits.  Asking for hundreds of thousands of rows is problematic. 
The cursorMark feature is designed for efficient queries when paging
deeply into results, but it assumes your rows value is relatively small,
and that you will be making many queries to get a large number of
results, each of which will be fast and won't overload the server.

Since it appears you are having a performance issue, here are a few things
I have written on the topic:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

Posted by Chetas Joshi <ch...@gmail.com>.
Thanks Erick.

The numFound is millions, but I was also trying with rows=1 million. I will
reduce it to 500K.

I am sorry. It is state.json. I am using Solr 5.5.0

One of the things I am not able to understand is why my ingestion job is
complaining about "Cannot talk to ZooKeeper - Updates are disabled."

I have a spark streaming job that continuously ingests into Solr. My shards
are always up and running. The moment I start a query on SolrCloud it
starts running into this exception. However, as you said, ZK will only update
the state of the cluster when the shards go down. Then why is my job trying
to contact ZK when the cluster is up, and why is the exception about
updating ZK?


On Fri, Nov 18, 2016 at 5:11 PM, Erick Erickson <er...@gmail.com>
wrote:

> The clusterstate on Zookeeper shouldn't be changing
> very often, only when nodes come and go.
>
> bq: At that time I am also running queries (that return
> millions of docs).
>
> As in rows=millions? This is an anti-pattern; if that's true
> then you're probably network-saturated and the like. If
> you mean your numFound is millions, then this is unlikely
> to be a problem.
>
> You say "clusterstate.json", which indicates you're on
> 4.x? This has been changed to a state.json for
> each collection, so either you upgraded at some point and
> didn't transform your ZK (there's a command to do that),
> or can you upgrade?
>
> What I'm guessing is that you have too much going on
> somehow and you're overloading your system and
> getting a timeout. So increasing the timeout
> is definitely a possibility, or reducing the ingestion load
> as a test.
>
> Best,
> Erick
>
> On Fri, Nov 18, 2016 at 4:51 PM, Chetas Joshi <ch...@gmail.com>
> wrote:
> > Hi,
> >
> > I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The
> > SolrCloud is having difficulties talking to ZK when I am ingesting data
> > into the collections. At that time I am also running queries (that return
> > millions of docs). The ingest job is failing with the following
> > exception:
> >
> > org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
> > from server at http://xxx/solr/collection1_shard15_replica1: Cannot
> talk to
> > ZooKeeper - Updates are disabled.
> >
> > I think this is happening when the ingest job is trying to update the
> > clusterstate.json file but the query is reading from that file and thus
> has
> > some kind of a lock on that file. Are there any factors that will cause
> the
> > "READ" to acquire lock for a long time? Is my understanding correct? I am
> > using the cursor approach using SolrJ to get back results from Solr.
> >
> > How often is the ZK updated with the latest cluster state and what
> > parameter governs that? Should I just increase the ZK client timeout so
> > that it retries connecting to the ZK for a longer period of time (right
> now
> > it is 15 seconds)?
> >
> > Thanks!
>

Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

Posted by Erick Erickson <er...@gmail.com>.
The clusterstate on Zookeeper shouldn't be changing
very often, only when nodes come and go.

bq: At that time I am also running queries (that return
millions of docs).

As in rows=millions? This is an anti-pattern; if that's true
then you're probably network-saturated and the like. If
you mean your numFound is millions, then this is unlikely
to be a problem.

You say "clusterstate.json", which indicates you're on
4.x? This has been changed to a state.json for
each collection, so either you upgraded at some point and
didn't transform your ZK (there's a command to do that),
or can you upgrade?

What I'm guessing is that you have too much going on
somehow and you're overloading your system and
getting a timeout. So increasing the timeout
is definitely a possibility, or reducing the ingestion load
as a test.

Best,
Erick

On Fri, Nov 18, 2016 at 4:51 PM, Chetas Joshi <ch...@gmail.com> wrote:
> Hi,
>
> I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The
> SolrCloud is having difficulties talking to ZK when I am ingesting data
> into the collections. At that time I am also running queries (that return
> millions of docs). The ingest job is failing with the following exception:
>
> org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
> from server at http://xxx/solr/collection1_shard15_replica1: Cannot talk to
> ZooKeeper - Updates are disabled.
>
> I think this is happening when the ingest job is trying to update the
> clusterstate.json file but the query is reading from that file and thus has
> some kind of a lock on that file. Are there any factors that will cause the
> "READ" to acquire lock for a long time? Is my understanding correct? I am
> using the cursor approach using SolrJ to get back results from Solr.
>
> How often is the ZK updated with the latest cluster state and what
> parameter governs that? Should I just increase the ZK client timeout so
> that it retries connecting to the ZK for a longer period of time (right now
> it is 15 seconds)?
>
> Thanks!