Posted to solr-user@lucene.apache.org by Wei <we...@gmail.com> on 2020/05/08 23:34:15 UTC

Re: Unbalanced shard requests

Update:  after I removed the shards.preference parameter from
solrconfig.xml, the issue is gone and internal shard requests are now
balanced. The same parameter works fine with Solr 7.6.  I'm still not sure
of the root cause, but I observed a strange coincidence: the node most
frequently picked for shard requests in each shard is the first node for
that shard returned from the CLUSTERSTATUS API.  It seems something is
wrong with shuffling equally ranked nodes when shards.preference is set.
I will report back if I find more.
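
For reference, the per-shard replica ordering I compared came from the
Collections API, e.g. (host and collection name below are placeholders):

  curl "http://solr-host:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection"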

On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:

> Hi Eric,
>
> I am measuring the number of shard requests, and it's for queries only, no
> indexing requests.  I have an external load balancer and see that each node
> receives about the same number of external queries. However, for the
> internal shard queries the distribution is uneven: 6 nodes (one in each
> shard, some of them leaders and some non-leaders) get about 80% of the
> shard requests, while the other 54 nodes get about 20%.  I checked a few
> other parameters that are set:
>
> -Dsolr.disable.shardsWhitelist=true
> shards.preference=replica.location:local,replica.type:TLOG
>
> None of these seems to explain the strange behavior.  Any suggestions on
> how to debug this?
>
> -Wei
>
>
> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <er...@gmail.com>
> wrote:
>
>> Wei:
>>
>> How are you measuring utilization here? The number of incoming requests
>> or CPU?
>>
>> The leaders for each shard are certainly handling all of the indexing
>> requests since they’re TLOG replicas, so that’s one thing that might be
>> skewing your measurements.
>>
>> Best,
>> Erick
>>
>> > On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com> wrote:
>> >
>> > Hi everyone,
>> >
>> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
>> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
>> one
>> > of the replicas in each shard is handling most of the distributed shard
>> > requests, so 6 nodes are heavily loaded while other nodes are idle.
>> There
>> > is no change in shard handler configuration:
>> >
>> > <shardHandlerFactory name="shardHandlerFactory" class=
>> > "HttpShardHandlerFactory">
>> >
>> >    <int name="socketTimeout">30000</int>
>> >
>> >    <int name="connTimeout">30000</int>
>> >
>> >    <int name="maxConnectionsPerHost">500</int>
>> >
>> > </shardHandlerFactory>
>> >
>> >
>> > What could cause the unbalanced internal distributed request?
>> >
>> >
>> > Thanks in advance.
>> >
>> >
>> >
>> > Wei
>>
>>

Re: Unbalanced shard requests

Posted by Wei <we...@gmail.com>.
Hi Michael,

I also verified the patch in SOLR-14471 with 8.4.1 and it fixed the issue
with shards.preference=replica.location:local,replica.type:TLOG in my
setting.  Thanks!
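
In case it helps others reproduce this, the same preference can also be
passed per request instead of as a solrconfig.xml default, e.g. (host and
collection name below are placeholders):

  curl "http://solr-host:8983/solr/mycollection/select?q=*:*&shards.preference=replica.location:local,replica.type:TLOG"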

Wei

On Thu, May 21, 2020 at 12:09 PM Phill Campbell
<Si...@yahoo.com.invalid> wrote:

> Yes, JVM heap settings.
>
> > On May 19, 2020, at 10:59 AM, Wei <we...@gmail.com> wrote:
> >
> > Hi Phill,
> >
> > What is the RAM config you are referring to, JVM size? How is that
> related
> > to the load balancing, if each node has the same configuration?
> >
> > Thanks,
> > Wei
> >
> > On Mon, May 18, 2020 at 3:07 PM Phill Campbell
> > <Si...@yahoo.com.invalid> wrote:
> >
> >> In my previous report I was configured to use as much RAM as possible.
> >> With that configuration it seemed it was not load balancing.
> >> So, I reconfigured and redeployed to use 1/4 the RAM. What a difference
> >> for the better!
> >>
> >> 10.156.112.50   load average: 13.52, 10.56, 6.46
> >> 10.156.116.34   load average: 11.23, 12.35, 9.63
> >> 10.156.122.13   load average: 10.29, 12.40, 9.69
> >>
> >> Very nice.
> >> My tool that tests records RPS. In the “bad” configuration it was less
> >> than 1 RPS.
> >> NOW it is showing 21 RPS.
> >>
> >>
> >>
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >> {
> >>  "responseHeader":{
> >>    "status":0,
> >>    "QTime":161},
> >>  "metrics":{
> >>    "solr.core.BTS.shard1.replica_n2":{
> >>      "QUERY./select.requestTimes":{
> >>        "count":5723,
> >>        "meanRate":6.8163888639859085,
> >>        "1minRate":11.557013215119536,
> >>        "5minRate":8.760356217628159,
> >>        "15minRate":4.707624230995833,
> >>        "min_ms":0.131545,
> >>        "max_ms":388.710848,
> >>        "mean_ms":30.300492048215947,
> >>        "median_ms":6.336654,
> >>        "stddev_ms":51.527164088667035,
> >>        "p75_ms":35.427943,
> >>        "p95_ms":140.025957,
> >>        "p99_ms":230.533099,
> >>        "p999_ms":388.710848}}}}
> >>
> >>
> >>
> >>
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >> {
> >>  "responseHeader":{
> >>    "status":0,
> >>    "QTime":11},
> >>  "metrics":{
> >>    "solr.core.BTS.shard2.replica_n8":{
> >>      "QUERY./select.requestTimes":{
> >>        "count":6469,
> >>        "meanRate":7.502581801189549,
> >>        "1minRate":12.211423085368564,
> >>        "5minRate":9.445681397767322,
> >>        "15minRate":5.216209798637846,
> >>        "min_ms":0.154691,
> >>        "max_ms":701.657394,
> >>        "mean_ms":34.2734699171445,
> >>        "median_ms":5.640378,
> >>        "stddev_ms":62.27649205954566,
> >>        "p75_ms":39.016371,
> >>        "p95_ms":156.997982,
> >>        "p99_ms":288.883028,
> >>        "p999_ms":538.368031}}}}
> >>
> >>
> >>
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >> {
> >>  "responseHeader":{
> >>    "status":0,
> >>    "QTime":67},
> >>  "metrics":{
> >>    "solr.core.BTS.shard3.replica_n16":{
> >>      "QUERY./select.requestTimes":{
> >>        "count":7109,
> >>        "meanRate":7.787524673806184,
> >>        "1minRate":11.88519763582083,
> >>        "5minRate":9.893315557386755,
> >>        "15minRate":5.620178363676527,
> >>        "min_ms":0.150887,
> >>        "max_ms":472.826462,
> >>        "mean_ms":32.184282366621204,
> >>        "median_ms":6.977733,
> >>        "stddev_ms":55.729908615189196,
> >>        "p75_ms":36.655011,
> >>        "p95_ms":151.12627,
> >>        "p99_ms":251.440162,
> >>        "p999_ms":472.826462}}}}
> >>
> >>
> >> Compare that to the previous report and you can see the improvement.
> >> So, note to myself. Figure out the sweet spot for RAM usage. Use too
> much
> >> and strange behavior is noticed. While using too much all the load
> focused
> >> on one box and query times slowed.
> >> I did not see any OOM errors during any of this.
> >>
> >> Regards
> >>
> >>
> >>
> >>> On May 18, 2020, at 3:23 PM, Phill Campbell
> >> <Si...@yahoo.com.INVALID> wrote:
> >>>
> >>> I have been testing 8.5.2 and it looks like the load has moved but is
> >> still on one machine.
> >>>
> >>> Setup:
> >>> 3 physical machines.
> >>> Each machine hosts 8 instances of Solr.
> >>> Each instance of Solr hosts one replica.
> >>>
> >>> Another way to say it:
> >>> Number of shards = 8. Replication factor = 3.
> >>>
> >>> Here is the cluster state. You can see that the leaders are well
> >> distributed.
> >>>
> >>> {"TEST_COLLECTION":{
> >>>   "pullReplicas":"0",
> >>>   "replicationFactor":"3",
> >>>   "shards":{
> >>>     "shard1":{
> >>>       "range":"80000000-9fffffff",
> >>>       "state":"active",
> >>>       "replicas":{
> >>>         "core_node3":{
> >>>           "core":"TEST_COLLECTION_shard1_replica_n1",
> >>>           "base_url":"http://10.156.122.13:10007/solr",
> >>>           "node_name":"10.156.122.13:10007_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node5":{
> >>>           "core":"TEST_COLLECTION_shard1_replica_n2",
> >>>           "base_url":"http://10.156.112.50:10002/solr",
> >>>           "node_name":"10.156.112.50:10002_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false",
> >>>           "leader":"true"},
> >>>         "core_node7":{
> >>>           "core":"TEST_COLLECTION_shard1_replica_n4",
> >>>           "base_url":"http://10.156.112.50:10006/solr",
> >>>           "node_name":"10.156.112.50:10006_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"}}},
> >>>     "shard2":{
> >>>       "range":"a0000000-bfffffff",
> >>>       "state":"active",
> >>>       "replicas":{
> >>>         "core_node9":{
> >>>           "core":"TEST_COLLECTION_shard2_replica_n6",
> >>>           "base_url":"http://10.156.112.50:10003/solr",
> >>>           "node_name":"10.156.112.50:10003_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node11":{
> >>>           "core":"TEST_COLLECTION_shard2_replica_n8",
> >>>           "base_url":"http://10.156.122.13:10004/solr",
> >>>           "node_name":"10.156.122.13:10004_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false",
> >>>           "leader":"true"},
> >>>         "core_node12":{
> >>>           "core":"TEST_COLLECTION_shard2_replica_n10",
> >>>           "base_url":"http://10.156.116.34:10008/solr",
> >>>           "node_name":"10.156.116.34:10008_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"}}},
> >>>     "shard3":{
> >>>       "range":"c0000000-dfffffff",
> >>>       "state":"active",
> >>>       "replicas":{
> >>>         "core_node15":{
> >>>           "core":"TEST_COLLECTION_shard3_replica_n13",
> >>>           "base_url":"http://10.156.122.13:10008/solr",
> >>>           "node_name":"10.156.122.13:10008_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node17":{
> >>>           "core":"TEST_COLLECTION_shard3_replica_n14",
> >>>           "base_url":"http://10.156.116.34:10005/solr",
> >>>           "node_name":"10.156.116.34:10005_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node19":{
> >>>           "core":"TEST_COLLECTION_shard3_replica_n16",
> >>>           "base_url":"http://10.156.116.34:10002/solr",
> >>>           "node_name":"10.156.116.34:10002_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false",
> >>>           "leader":"true"}}},
> >>>     "shard4":{
> >>>       "range":"e0000000-ffffffff",
> >>>       "state":"active",
> >>>       "replicas":{
> >>>         "core_node20":{
> >>>           "core":"TEST_COLLECTION_shard4_replica_n18",
> >>>           "base_url":"http://10.156.122.13:10001/solr",
> >>>           "node_name":"10.156.122.13:10001_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node23":{
> >>>           "core":"TEST_COLLECTION_shard4_replica_n21",
> >>>           "base_url":"http://10.156.116.34:10004/solr",
> >>>           "node_name":"10.156.116.34:10004_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node25":{
> >>>           "core":"TEST_COLLECTION_shard4_replica_n22",
> >>>           "base_url":"http://10.156.112.50:10001/solr",
> >>>           "node_name":"10.156.112.50:10001_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false",
> >>>           "leader":"true"}}},
> >>>     "shard5":{
> >>>       "range":"0-1fffffff",
> >>>       "state":"active",
> >>>       "replicas":{
> >>>         "core_node27":{
> >>>           "core":"TEST_COLLECTION_shard5_replica_n24",
> >>>           "base_url":"http://10.156.116.34:10007/solr",
> >>>           "node_name":"10.156.116.34:10007_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node29":{
> >>>           "core":"TEST_COLLECTION_shard5_replica_n26",
> >>>           "base_url":"http://10.156.122.13:10006/solr",
> >>>           "node_name":"10.156.122.13:10006_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node31":{
> >>>           "core":"TEST_COLLECTION_shard5_replica_n28",
> >>>           "base_url":"http://10.156.116.34:10006/solr",
> >>>           "node_name":"10.156.116.34:10006_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false",
> >>>           "leader":"true"}}},
> >>>     "shard6":{
> >>>       "range":"20000000-3fffffff",
> >>>       "state":"active",
> >>>       "replicas":{
> >>>         "core_node33":{
> >>>           "core":"TEST_COLLECTION_shard6_replica_n30",
> >>>           "base_url":"http://10.156.122.13:10002/solr",
> >>>           "node_name":"10.156.122.13:10002_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false",
> >>>           "leader":"true"},
> >>>         "core_node35":{
> >>>           "core":"TEST_COLLECTION_shard6_replica_n32",
> >>>           "base_url":"http://10.156.112.50:10008/solr",
> >>>           "node_name":"10.156.112.50:10008_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node37":{
> >>>           "core":"TEST_COLLECTION_shard6_replica_n34",
> >>>           "base_url":"http://10.156.116.34:10003/solr",
> >>>           "node_name":"10.156.116.34:10003_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"}}},
> >>>     "shard7":{
> >>>       "range":"40000000-5fffffff",
> >>>       "state":"active",
> >>>       "replicas":{
> >>>         "core_node39":{
> >>>           "core":"TEST_COLLECTION_shard7_replica_n36",
> >>>           "base_url":"http://10.156.122.13:10003/solr",
> >>>           "node_name":"10.156.122.13:10003_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false",
> >>>           "leader":"true"},
> >>>         "core_node41":{
> >>>           "core":"TEST_COLLECTION_shard7_replica_n38",
> >>>           "base_url":"http://10.156.122.13:10005/solr",
> >>>           "node_name":"10.156.122.13:10005_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node43":{
> >>>           "core":"TEST_COLLECTION_shard7_replica_n40",
> >>>           "base_url":"http://10.156.112.50:10004/solr",
> >>>           "node_name":"10.156.112.50:10004_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"}}},
> >>>     "shard8":{
> >>>       "range":"60000000-7fffffff",
> >>>       "state":"active",
> >>>       "replicas":{
> >>>         "core_node45":{
> >>>           "core":"TEST_COLLECTION_shard8_replica_n42",
> >>>           "base_url":"http://10.156.112.50:10007/solr",
> >>>           "node_name":"10.156.112.50:10007_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"},
> >>>         "core_node47":{
> >>>           "core":"TEST_COLLECTION_shard8_replica_n44",
> >>>           "base_url":"http://10.156.112.50:10005/solr",
> >>>           "node_name":"10.156.112.50:10005_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false",
> >>>           "leader":"true"},
> >>>         "core_node48":{
> >>>           "core":"TEST_COLLECTION_shard8_replica_n46",
> >>>           "base_url":"http://10.156.116.34:10001/solr",
> >>>           "node_name":"10.156.116.34:10001_solr",
> >>>           "state":"active",
> >>>           "type":"NRT",
> >>>           "force_set_state":"false"}}}},
> >>>   "router":{"name":"compositeId"},
> >>>   "maxShardsPerNode":"1",
> >>>   "autoAddReplicas":"false",
> >>>   "nrtReplicas":"3",
> >>>   "tlogReplicas":"0”}}
> >>>
> >>>
> >>> Running TOP on each machine while load tests have been running for 60
> >> minutes.
> >>>
> >>> 10.156.112.50 load average: 0.08, 0.35, 1.65
> >>> 10.156.116.34 load average: 24.71, 24.20, 20.65
> >>> 10.156.122.13 load average: 5.37, 3.21, 4.04
> >>>
> >>>
> >>>
> >>> Here are the stats from each shard leader.
> >>>
> >>>
> >>
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >>> {
> >>> "responseHeader":{
> >>>   "status":0,
> >>>   "QTime":2},
> >>> "metrics":{
> >>>   "solr.core.BTS.shard1.replica_n2":{
> >>>     "QUERY./select.requestTimes":{
> >>>       "count":805,
> >>>       "meanRate":0.4385455794526838,
> >>>       "1minRate":0.5110237122383522,
> >>>       "5minRate":0.4671091682458005,
> >>>       "15minRate":0.4057871940723353,
> >>>       "min_ms":0.14047,
> >>>       "max_ms":12424.589645,
> >>>       "mean_ms":796.2194458711818,
> >>>       "median_ms":10.534906,
> >>>       "stddev_ms":2567.655224710497,
> >>>       "p75_ms":22.893306,
> >>>       "p95_ms":8316.33323,
> >>>       "p99_ms":12424.589645,
> >>>       "p999_ms":12424.589645}}}}
> >>>
> >>>
> >>
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >>> {
> >>> "responseHeader":{
> >>>   "status":0,
> >>>   "QTime":2},
> >>> "metrics":{
> >>>   "solr.core.BTS.shard2.replica_n8":{
> >>>     "QUERY./select.requestTimes":{
> >>>       "count":791,
> >>>       "meanRate":0.4244162938316224,
> >>>       "1minRate":0.4869749626003825,
> >>>       "5minRate":0.45856412657687656,
> >>>       "15minRate":0.3948063845907493,
> >>>       "min_ms":0.168369,
> >>>       "max_ms":11022.763933,
> >>>       "mean_ms":2572.0670957974603,
> >>>       "median_ms":1490.222885,
> >>>       "stddev_ms":2718.1710938804276,
> >>>       "p75_ms":4292.490478,
> >>>       "p95_ms":8487.18506,
> >>>       "p99_ms":8855.936617,
> >>>       "p999_ms":9589.218502}}}}
> >>>
> >>>
> >>
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >>> {
> >>> "responseHeader":{
> >>>   "status":0,
> >>>   "QTime":83},
> >>> "metrics":{
> >>>   "solr.core.BTS.shard3.replica_n16":{
> >>>     "QUERY./select.requestTimes":{
> >>>       "count":840,
> >>>       "meanRate":0.4335334453288775,
> >>>       "1minRate":0.5733683837779382,
> >>>       "5minRate":0.4931753679028527,
> >>>       "15minRate":0.42241330274699623,
> >>>       "min_ms":0.155939,
> >>>       "max_ms":18125.516406,
> >>>       "mean_ms":7097.942850416767,
> >>>       "median_ms":8136.862825,
> >>>       "stddev_ms":2382.041897221542,
> >>>       "p75_ms":8497.844088,
> >>>       "p95_ms":9642.430475,
> >>>       "p99_ms":9993.694346,
> >>>       "p999_ms":12207.982291}}}}
> >>>
> >>>
> >>
> http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >>> {
> >>> "responseHeader":{
> >>>   "status":0,
> >>>   "QTime":3},
> >>> "metrics":{
> >>>   "solr.core.BTS.shard4.replica_n22":{
> >>>     "QUERY./select.requestTimes":{
> >>>       "count":873,
> >>>       "meanRate":0.43420303985137254,
> >>>       "1minRate":0.4284437786865815,
> >>>       "5minRate":0.44020640429418745,
> >>>       "15minRate":0.40860871277629196,
> >>>       "min_ms":0.136658,
> >>>       "max_ms":11345.407699,
> >>>       "mean_ms":511.28573906464504,
> >>>       "median_ms":9.063677,
> >>>       "stddev_ms":2038.8104673512248,
> >>>       "p75_ms":20.270605,
> >>>       "p95_ms":8418.131442,
> >>>       "p99_ms":8904.78616,
> >>>       "p999_ms":10447.78365}}}}
> >>>
> >>>
> >>
> http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >>> {
> >>> "responseHeader":{
> >>>   "status":0,
> >>>   "QTime":4},
> >>> "metrics":{
> >>>   "solr.core.BTS.shard5.replica_n28":{
> >>>     "QUERY./select.requestTimes":{
> >>>       "count":863,
> >>>       "meanRate":0.4419375762840668,
> >>>       "1minRate":0.44487242228317025,
> >>>       "5minRate":0.45927613542085916,
> >>>       "15minRate":0.41056066296443494,
> >>>       "min_ms":0.158855,
> >>>       "max_ms":16669.411989,
> >>>       "mean_ms":6513.057114006753,
> >>>       "median_ms":8033.386692,
> >>>       "stddev_ms":3002.7487311308896,
> >>>       "p75_ms":8446.147616,
> >>>       "p95_ms":9888.641316,
> >>>       "p99_ms":13624.11926,
> >>>       "p999_ms":13624.11926}}}}
> >>>
> >>>
> >>
> http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >>> {
> >>> "responseHeader":{
> >>>   "status":0,
> >>>   "QTime":2},
> >>> "metrics":{
> >>>   "solr.core.BTS.shard6.replica_n30":{
> >>>     "QUERY./select.requestTimes":{
> >>>       "count":893,
> >>>       "meanRate":0.43301141185981046,
> >>>       "1minRate":0.4011485529441132,
> >>>       "5minRate":0.447654905093643,
> >>>       "15minRate":0.41489193746842407,
> >>>       "min_ms":0.161571,
> >>>       "max_ms":14716.828978,
> >>>       "mean_ms":2932.212133523417,
> >>>       "median_ms":1289.686481,
> >>>       "stddev_ms":3426.22045100954,
> >>>       "p75_ms":6230.031884,
> >>>       "p95_ms":8109.408506,
> >>>       "p99_ms":12904.515311,
> >>>       "p999_ms":12904.515311}}}}
> >>>
> >>>
> >>>
> >>>
> >>
> http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >>> {
> >>> "responseHeader":{
> >>>   "status":0,
> >>>   "QTime":16},
> >>> "metrics":{
> >>>   "solr.core.BTS.shard7.replica_n36":{
> >>>     "QUERY./select.requestTimes":{
> >>>       "count":962,
> >>>       "meanRate":0.46572438680661055,
> >>>       "1minRate":0.4974893681625287,
> >>>       "5minRate":0.49072296556429784,
> >>>       "15minRate":0.44138205926188756,
> >>>       "min_ms":0.164803,
> >>>       "max_ms":12481.82656,
> >>>       "mean_ms":2606.899631183513,
> >>>       "median_ms":1457.505387,
> >>>       "stddev_ms":3083.297183477969,
> >>>       "p75_ms":4072.543679,
> >>>       "p95_ms":8562.456178,
> >>>       "p99_ms":9351.230895,
> >>>       "p999_ms":10430.483813}}}}
> >>>
> >>>
> >>
> http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> <
> >>
> http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >>>
> >>> {
> >>> "responseHeader":{
> >>>   "status":0,
> >>>   "QTime":3},
> >>> "metrics":{
> >>>   "solr.core.BTS.shard8.replica_n44":{
> >>>     "QUERY./select.requestTimes":{
> >>>       "count":904,
> >>>       "meanRate":0.4356001115451976,
> >>>       "1minRate":0.42906831311171356,
> >>>       "5minRate":0.4651312663377039,
> >>>       "15minRate":0.41812847342709225,
> >>>       "min_ms":0.089738,
> >>>       "max_ms":10857.092832,
> >>>       "mean_ms":304.52127270799156,
> >>>       "median_ms":7.098736,
> >>>       "stddev_ms":1544.5378594679773,
> >>>       "p75_ms":15.599817,
> >>>       "p95_ms":93.818662,
> >>>       "p99_ms":8510.757117,
> >>>       "p999_ms":9353.844994}}}}
> >>>
> >>> I restart all of the instances on “34” so that there are no leaders on
> >> it. The load somewhat goes to the other box.
> >>>
> >>> 10.156.112.50 load average: 0.00, 0.16, 0.47
> >>> 10.156.116.34 load average: 17.00, 16.16, 17.07
> >>> 10.156.122.13 load average: 17.86, 17.49, 14.74
> >>>
> >>> Box “50” is still doing nothing AND it is the leader of 4 of the 8
> >> shards.
> >>> Box “13” is the leader of the remaining 4 shards.
> >>> Box “34” is not the leader of any shard.
> >>>
> >>> I will continue to test, who knows, it may be something I am doing.
> >> Maybe not enough RAM, etc…, so I am definitely leaving this open to the
> >> possibility that I am not well configured for 8.5.
> >>>
> >>> Regards
> >>>
> >>>
> >>>
> >>>
> >>>> On May 16, 2020, at 5:08 PM, Tomás Fernández Löbbe <
> >> tomasflobbe@gmail.com> wrote:
> >>>>
> >>>> I just backported Michael’s fix to be released in 8.5.2
> >>>>
> >>>> On Fri, May 15, 2020 at 6:38 AM Michael Gibney <
> >> michael@michaelgibney.net>
> >>>> wrote:
> >>>>
> >>>>> Hi Wei,
> >>>>> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
> >>>>> Thanks for reporting the problem!
> >>>>> Michael
> >>>>>
> >>>>> On Mon, May 11, 2020 at 7:51 PM Wei <we...@gmail.com> wrote:
> >>>>>>
> >>>>>> Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no
> other
> >>>>> type
> >>>>>> of replicas, and each Tlog replica is an individual solr instance on
> >> its
> >>>>>> own physical machine.  In the jira you mentioned 'when "last place
> >>>>> matches"
> >>>>>> == "first place matches" – e.g. when shards.preference specified
> >> matches
> >>>>>> *all* available replicas'.   My setting is
> >>>>>> shards.preference=replica.location:local,replica.type:TLOG,
> >>>>>> I also tried just shards.preference=replica.location:local and it
> >> still
> >>>>> has
> >>>>>> the issue. Can you explain a bit more?
> >>>>>>
> >>>>>> On Mon, May 11, 2020 at 12:26 PM Michael Gibney <
> >>>>> michael@michaelgibney.net>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> FYI: https://issues.apache.org/jira/browse/SOLR-14471
> >>>>>>> Wei, assuming you have only TLOG replicas, your "last place"
> matches
> >>>>>>> (to which the random fallback ordering would not be applied -- see
> >>>>>>> above issue) would be the same as the "first place" matches
> selected
> >>>>>>> for executing distributed requests.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> >>>>>>> <mi...@michaelgibney.net> wrote:
> >>>>>>>>
> >>>>>>>> Wei, probably no need to answer my earlier questions; I think I
> see
> >>>>>>>> the problem here, and believe it is indeed a bug, introduced in
> 8.3.
> >>>>>>>> Will file an issue and submit a patch shortly.
> >>>>>>>> Michael
> >>>>>>>>
> >>>>>>>> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> >>>>>>>> <mi...@michaelgibney.net> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Wei,
> >>>>>>>>>
> >>>>>>>>> In considering this problem, I'm stumbling a bit on terminology
> >>>>>>>>> (particularly, where you mention "nodes", I think you're
> referring
> >>>>> to
> >>>>>>>>> "replicas"?). Could you confirm that you have 10 TLOG replicas
> per
> >>>>>>>>> shard, for each of 6 shards? How many *nodes* (i.e., running solr
> >>>>>>>>> server instances) do you have, and what is the replica placement
> >>>>> like
> >>>>>>>>> across those nodes? What, if any, non-TLOG replicas do you have
> per
> >>>>>>>>> shard (not that it's necessarily relevant, but just to get a
> >>>>> complete
> >>>>>>>>> picture of the situation)?
> >>>>>>>>>
> >>>>>>>>> If you're able without too much trouble, can you determine what
> the
> >>>>>>>>> behavior is like on Solr 8.3? (there were different changes
> >>>>> introduced
> >>>>>>>>> to potentially relevant code in 8.3 and 8.4, and knowing whether
> >>>>> the
> >>>>>>>>> behavior you're observing manifests on 8.3 would help narrow down
> >>>>>>>>> where to look for an explanation).
> >>>>>>>>>
> >>>>>>>>> Michael
> >>>>>>>>>
> >>>>>>>>> On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Update:  after I remove the shards.preference parameter from
> >>>>>>>>>> solrconfig.xml,  issue is gone and internal shard requests are
> >>>>> now
> >>>>>>>>>> balanced. The same parameter works fine with solr 7.6.  Still
> not
> >>>>>>> sure of
> >>>>>>>>>> the root cause, but I observed a strange coincidence: the nodes
> >>>>> that
> >>>>>>> are
> >>>>>>>>>> most frequently picked for shard requests are the first node in
> >>>>> each
> >>>>>>> shard
> >>>>>>>>>> returned from the CLUSTERSTATUS api.  Seems something wrong with
> >>>>>>> shuffling
> >>>>>>>>>> equally compared nodes when shards.preference is set.  Will
> >>>>> report
> >>>>>>> back if
> >>>>>>>>>> I find more.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Eric,
> >>>>>>>>>>>
> >>>>>>>>>>> I am measuring the number of shard requests, and it's for query
> >>>>>>> only, no
> >>>>>>>>>>> indexing requests.  I have an external load balancer and see
> >>>>> each
> >>>>>>> node
> >>>>>>>>>>> received about the equal number of external queries. However
> >>>>> for
> >>>>>>> the
> >>>>>>>>>>> internal shard queries,  the distribution is uneven:    6 nodes
> >>>>>>> (one in
> >>>>>>>>>>> each shard,  some of them are leaders and some are non-leaders
> >>>>> )
> >>>>>>> gets about
> >>>>>>>>>>> 80% of the shard requests, the other 54 nodes gets about 20% of
> >>>>>>> the shard
> >>>>>>>>>>> requests.   I checked a few other parameters set:
> >>>>>>>>>>>
> >>>>>>>>>>> -Dsolr.disable.shardsWhitelist=true
> >>>>>>>>>>> shards.preference=replica.location:local,replica.type:TLOG
> >>>>>>>>>>>
> >>>>>>>>>>> Nothing seems to cause the strange behavior.  Any suggestions
> >>>>> how
> >>>>>>> to
> >>>>>>>>>>> debug this?
> >>>>>>>>>>>
> >>>>>>>>>>> -Wei
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
> >>>>>>> erickerickson@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Wei:
> >>>>>>>>>>>>
> >>>>>>>>>>>> How are you measuring utilization here? The number of incoming
> >>>>>>> requests
> >>>>>>>>>>>> or CPU?
> >>>>>>>>>>>>
> >>>>>>>>>>>> The leader for each shard are certainly handling all of the
> >>>>>>> indexing
> >>>>>>>>>>>> requests since they’re TLOG replicas, so that’s one thing that
> >>>>>>> might
> >>>>>>>>>>>> skewing your measurements.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Erick
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
> >>>>>>> cloud has 6
> >>>>>>>>>>>>> shards with 10 TLOG replicas each shard.  After upgrade I
> >>>>>>> noticed that
> >>>>>>>>>>>> one
> >>>>>>>>>>>>> of the replicas in each shard is handling most of the
> >>>>>>> distributed shard
> >>>>>>>>>>>>> requests, so 6 nodes are heavily loaded while other nodes
> >>>>> are
> >>>>>>> idle.
> >>>>>>>>>>>> There
> >>>>>>>>>>>>> is no change in shard handler configuration:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <shardHandlerFactory name="shardHandlerFactory" class=
> >>>>>>>>>>>>> "HttpShardHandlerFactory">
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <int name="socketTimeout">30000</int>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <int name="connTimeout">30000</int>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <int name="maxConnectionsPerHost">500</int>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> </shardHandlerFactory>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What could cause the unbalanced internal distributed
> >>>>> request?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks in advance.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Wei
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> >>
> >>
>
>

Re: Unbalanced shard requests

Posted by Phill Campbell <Si...@yahoo.com.INVALID>.
Yes, JVM heap settings.
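
For example (the sizes here are illustrative, not the values I used), the
heap can be set in solr.in.sh or on the command line:

  SOLR_HEAP="8g"          # solr.in.sh
  bin/solr start -m 8g    # equivalent command-line form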

> On May 19, 2020, at 10:59 AM, Wei <we...@gmail.com> wrote:
> 
> Hi Phill,
> 
> What is the RAM config you are referring to, JVM size? How is that related
> to the load balancing, if each node has the same configuration?
> 
> Thanks,
> Wei
> 
> On Mon, May 18, 2020 at 3:07 PM Phill Campbell
> <Si...@yahoo.com.invalid> wrote:
> 
>> In my previous report I was configured to use as much RAM as possible.
>> With that configuration it seemed it was not load balancing.
>> So, I reconfigured and redeployed to use 1/4 the RAM. What a difference
>> for the better!
>> 
>> 10.156.112.50   load average: 13.52, 10.56, 6.46
>> 10.156.116.34   load average: 11.23, 12.35, 9.63
>> 10.156.122.13   load average: 10.29, 12.40, 9.69
>> 
>> Very nice.
>> My tool that tests records RPS. In the “bad” configuration it was less
>> than 1 RPS.
>> NOW it is showing 21 RPS.
>> 
>> 
>> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>> {
>>  "responseHeader":{
>>    "status":0,
>>    "QTime":161},
>>  "metrics":{
>>    "solr.core.BTS.shard1.replica_n2":{
>>      "QUERY./select.requestTimes":{
>>        "count":5723,
>>        "meanRate":6.8163888639859085,
>>        "1minRate":11.557013215119536,
>>        "5minRate":8.760356217628159,
>>        "15minRate":4.707624230995833,
>>        "min_ms":0.131545,
>>        "max_ms":388.710848,
>>        "mean_ms":30.300492048215947,
>>        "median_ms":6.336654,
>>        "stddev_ms":51.527164088667035,
>>        "p75_ms":35.427943,
>>        "p95_ms":140.025957,
>>        "p99_ms":230.533099,
>>        "p999_ms":388.710848}}}}
>> 
>> 
>> 
>> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>> {
>>  "responseHeader":{
>>    "status":0,
>>    "QTime":11},
>>  "metrics":{
>>    "solr.core.BTS.shard2.replica_n8":{
>>      "QUERY./select.requestTimes":{
>>        "count":6469,
>>        "meanRate":7.502581801189549,
>>        "1minRate":12.211423085368564,
>>        "5minRate":9.445681397767322,
>>        "15minRate":5.216209798637846,
>>        "min_ms":0.154691,
>>        "max_ms":701.657394,
>>        "mean_ms":34.2734699171445,
>>        "median_ms":5.640378,
>>        "stddev_ms":62.27649205954566,
>>        "p75_ms":39.016371,
>>        "p95_ms":156.997982,
>>        "p99_ms":288.883028,
>>        "p999_ms":538.368031}}}}
>> 
>> 
>> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>> {
>>  "responseHeader":{
>>    "status":0,
>>    "QTime":67},
>>  "metrics":{
>>    "solr.core.BTS.shard3.replica_n16":{
>>      "QUERY./select.requestTimes":{
>>        "count":7109,
>>        "meanRate":7.787524673806184,
>>        "1minRate":11.88519763582083,
>>        "5minRate":9.893315557386755,
>>        "15minRate":5.620178363676527,
>>        "min_ms":0.150887,
>>        "max_ms":472.826462,
>>        "mean_ms":32.184282366621204,
>>        "median_ms":6.977733,
>>        "stddev_ms":55.729908615189196,
>>        "p75_ms":36.655011,
>>        "p95_ms":151.12627,
>>        "p99_ms":251.440162,
>>        "p999_ms":472.826462}}}}
>> 
>> 
>> Compare that to the previous report and you can see the improvement.
>> So, note to myself. Figure out the sweet spot for RAM usage. Use too much
>> and strange behavior is noticed. While using too much all the load focused
>> on one box and query times slowed.
>> I did not see any OOM errors during any of this.
>> 
>> Regards
>> 
>> 
>> 
>>> On May 18, 2020, at 3:23 PM, Phill Campbell
>> <Si...@yahoo.com.INVALID> wrote:
>>> 
>>> I have been testing 8.5.2 and it looks like the load has moved but is
>> still on one machine.
>>> 
>>> Setup:
>>> 3 physical machines.
>>> Each machine hosts 8 instances of Solr.
>>> Each instance of Solr hosts one replica.
>>> 
>>> Another way to say it:
>>> Number of shards = 8. Replication factor = 3.
>>> 
>>> Here is the cluster state. You can see that the leaders are well
>> distributed.
>>> 
>>> {"TEST_COLLECTION":{
>>>   "pullReplicas":"0",
>>>   "replicationFactor":"3",
>>>   "shards":{
>>>     "shard1":{
>>>       "range":"80000000-9fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node3":{
>>>           "core":"TEST_COLLECTION_shard1_replica_n1",
>>>           "base_url":"http://10.156.122.13:10007/solr",
>>>           "node_name":"10.156.122.13:10007_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node5":{
>>>           "core":"TEST_COLLECTION_shard1_replica_n2",
>>>           "base_url":"http://10.156.112.50:10002/solr",
>>>           "node_name":"10.156.112.50:10002_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node7":{
>>>           "core":"TEST_COLLECTION_shard1_replica_n4",
>>>           "base_url":"http://10.156.112.50:10006/solr",
>>>           "node_name":"10.156.112.50:10006_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}},
>>>     "shard2":{
>>>       "range":"a0000000-bfffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node9":{
>>>           "core":"TEST_COLLECTION_shard2_replica_n6",
>>>           "base_url":"http://10.156.112.50:10003/solr",
>>>           "node_name":"10.156.112.50:10003_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node11":{
>>>           "core":"TEST_COLLECTION_shard2_replica_n8",
>>>           "base_url":"http://10.156.122.13:10004/solr",
>>>           "node_name":"10.156.122.13:10004_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node12":{
>>>           "core":"TEST_COLLECTION_shard2_replica_n10",
>>>           "base_url":"http://10.156.116.34:10008/solr",
>>>           "node_name":"10.156.116.34:10008_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}},
>>>     "shard3":{
>>>       "range":"c0000000-dfffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node15":{
>>>           "core":"TEST_COLLECTION_shard3_replica_n13",
>>>           "base_url":"http://10.156.122.13:10008/solr",
>>>           "node_name":"10.156.122.13:10008_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node17":{
>>>           "core":"TEST_COLLECTION_shard3_replica_n14",
>>>           "base_url":"http://10.156.116.34:10005/solr",
>>>           "node_name":"10.156.116.34:10005_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node19":{
>>>           "core":"TEST_COLLECTION_shard3_replica_n16",
>>>           "base_url":"http://10.156.116.34:10002/solr",
>>>           "node_name":"10.156.116.34:10002_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"}}},
>>>     "shard4":{
>>>       "range":"e0000000-ffffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node20":{
>>>           "core":"TEST_COLLECTION_shard4_replica_n18",
>>>           "base_url":"http://10.156.122.13:10001/solr",
>>>           "node_name":"10.156.122.13:10001_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node23":{
>>>           "core":"TEST_COLLECTION_shard4_replica_n21",
>>>           "base_url":"http://10.156.116.34:10004/solr",
>>>           "node_name":"10.156.116.34:10004_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node25":{
>>>           "core":"TEST_COLLECTION_shard4_replica_n22",
>>>           "base_url":"http://10.156.112.50:10001/solr",
>>>           "node_name":"10.156.112.50:10001_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"}}},
>>>     "shard5":{
>>>       "range":"0-1fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node27":{
>>>           "core":"TEST_COLLECTION_shard5_replica_n24",
>>>           "base_url":"http://10.156.116.34:10007/solr",
>>>           "node_name":"10.156.116.34:10007_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node29":{
>>>           "core":"TEST_COLLECTION_shard5_replica_n26",
>>>           "base_url":"http://10.156.122.13:10006/solr",
>>>           "node_name":"10.156.122.13:10006_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node31":{
>>>           "core":"TEST_COLLECTION_shard5_replica_n28",
>>>           "base_url":"http://10.156.116.34:10006/solr",
>>>           "node_name":"10.156.116.34:10006_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"}}},
>>>     "shard6":{
>>>       "range":"20000000-3fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node33":{
>>>           "core":"TEST_COLLECTION_shard6_replica_n30",
>>>           "base_url":"http://10.156.122.13:10002/solr",
>>>           "node_name":"10.156.122.13:10002_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node35":{
>>>           "core":"TEST_COLLECTION_shard6_replica_n32",
>>>           "base_url":"http://10.156.112.50:10008/solr",
>>>           "node_name":"10.156.112.50:10008_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node37":{
>>>           "core":"TEST_COLLECTION_shard6_replica_n34",
>>>           "base_url":"http://10.156.116.34:10003/solr",
>>>           "node_name":"10.156.116.34:10003_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}},
>>>     "shard7":{
>>>       "range":"40000000-5fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node39":{
>>>           "core":"TEST_COLLECTION_shard7_replica_n36",
>>>           "base_url":"http://10.156.122.13:10003/solr",
>>>           "node_name":"10.156.122.13:10003_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node41":{
>>>           "core":"TEST_COLLECTION_shard7_replica_n38",
>>>           "base_url":"http://10.156.122.13:10005/solr",
>>>           "node_name":"10.156.122.13:10005_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node43":{
>>>           "core":"TEST_COLLECTION_shard7_replica_n40",
>>>           "base_url":"http://10.156.112.50:10004/solr",
>>>           "node_name":"10.156.112.50:10004_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}},
>>>     "shard8":{
>>>       "range":"60000000-7fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node45":{
>>>           "core":"TEST_COLLECTION_shard8_replica_n42",
>>>           "base_url":"http://10.156.112.50:10007/solr",
>>>           "node_name":"10.156.112.50:10007_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node47":{
>>>           "core":"TEST_COLLECTION_shard8_replica_n44",
>>>           "base_url":"http://10.156.112.50:10005/solr",
>>>           "node_name":"10.156.112.50:10005_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node48":{
>>>           "core":"TEST_COLLECTION_shard8_replica_n46",
>>>           "base_url":"http://10.156.116.34:10001/solr",
>>>           "node_name":"10.156.116.34:10001_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}}},
>>>   "router":{"name":"compositeId"},
>>>   "maxShardsPerNode":"1",
>>>   "autoAddReplicas":"false",
>>>   "nrtReplicas":"3",
>>>   "tlogReplicas":"0”}}
>>> 
>>> 
>>> Running TOP on each machine while load tests have been running for 60
>> minutes.
>>> 
>>> 10.156.112.50 load average: 0.08, 0.35, 1.65
>>> 10.156.116.34 load average: 24.71, 24.20, 20.65
>>> 10.156.122.13 load average: 5.37, 3.21, 4.04
>>> 
>>> 
>>> 
>>> Here are the stats from each shard leader.
>>> 
>>> 
>> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":2},
>>> "metrics":{
>>>   "solr.core.BTS.shard1.replica_n2":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":805,
>>>       "meanRate":0.4385455794526838,
>>>       "1minRate":0.5110237122383522,
>>>       "5minRate":0.4671091682458005,
>>>       "15minRate":0.4057871940723353,
>>>       "min_ms":0.14047,
>>>       "max_ms":12424.589645,
>>>       "mean_ms":796.2194458711818,
>>>       "median_ms":10.534906,
>>>       "stddev_ms":2567.655224710497,
>>>       "p75_ms":22.893306,
>>>       "p95_ms":8316.33323,
>>>       "p99_ms":12424.589645,
>>>       "p999_ms":12424.589645}}}}
>>> 
>>> 
>> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":2},
>>> "metrics":{
>>>   "solr.core.BTS.shard2.replica_n8":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":791,
>>>       "meanRate":0.4244162938316224,
>>>       "1minRate":0.4869749626003825,
>>>       "5minRate":0.45856412657687656,
>>>       "15minRate":0.3948063845907493,
>>>       "min_ms":0.168369,
>>>       "max_ms":11022.763933,
>>>       "mean_ms":2572.0670957974603,
>>>       "median_ms":1490.222885,
>>>       "stddev_ms":2718.1710938804276,
>>>       "p75_ms":4292.490478,
>>>       "p95_ms":8487.18506,
>>>       "p99_ms":8855.936617,
>>>       "p999_ms":9589.218502}}}}
>>> 
>>> 
>> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":83},
>>> "metrics":{
>>>   "solr.core.BTS.shard3.replica_n16":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":840,
>>>       "meanRate":0.4335334453288775,
>>>       "1minRate":0.5733683837779382,
>>>       "5minRate":0.4931753679028527,
>>>       "15minRate":0.42241330274699623,
>>>       "min_ms":0.155939,
>>>       "max_ms":18125.516406,
>>>       "mean_ms":7097.942850416767,
>>>       "median_ms":8136.862825,
>>>       "stddev_ms":2382.041897221542,
>>>       "p75_ms":8497.844088,
>>>       "p95_ms":9642.430475,
>>>       "p99_ms":9993.694346,
>>>       "p999_ms":12207.982291}}}}
>>> 
>>> 
>> http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":3},
>>> "metrics":{
>>>   "solr.core.BTS.shard4.replica_n22":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":873,
>>>       "meanRate":0.43420303985137254,
>>>       "1minRate":0.4284437786865815,
>>>       "5minRate":0.44020640429418745,
>>>       "15minRate":0.40860871277629196,
>>>       "min_ms":0.136658,
>>>       "max_ms":11345.407699,
>>>       "mean_ms":511.28573906464504,
>>>       "median_ms":9.063677,
>>>       "stddev_ms":2038.8104673512248,
>>>       "p75_ms":20.270605,
>>>       "p95_ms":8418.131442,
>>>       "p99_ms":8904.78616,
>>>       "p999_ms":10447.78365}}}}
>>> 
>>> 
>> http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":4},
>>> "metrics":{
>>>   "solr.core.BTS.shard5.replica_n28":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":863,
>>>       "meanRate":0.4419375762840668,
>>>       "1minRate":0.44487242228317025,
>>>       "5minRate":0.45927613542085916,
>>>       "15minRate":0.41056066296443494,
>>>       "min_ms":0.158855,
>>>       "max_ms":16669.411989,
>>>       "mean_ms":6513.057114006753,
>>>       "median_ms":8033.386692,
>>>       "stddev_ms":3002.7487311308896,
>>>       "p75_ms":8446.147616,
>>>       "p95_ms":9888.641316,
>>>       "p99_ms":13624.11926,
>>>       "p999_ms":13624.11926}}}}
>>> 
>>> 
>> http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":2},
>>> "metrics":{
>>>   "solr.core.BTS.shard6.replica_n30":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":893,
>>>       "meanRate":0.43301141185981046,
>>>       "1minRate":0.4011485529441132,
>>>       "5minRate":0.447654905093643,
>>>       "15minRate":0.41489193746842407,
>>>       "min_ms":0.161571,
>>>       "max_ms":14716.828978,
>>>       "mean_ms":2932.212133523417,
>>>       "median_ms":1289.686481,
>>>       "stddev_ms":3426.22045100954,
>>>       "p75_ms":6230.031884,
>>>       "p95_ms":8109.408506,
>>>       "p99_ms":12904.515311,
>>>       "p999_ms":12904.515311}}}}
>>> 
>>> 
>>> 
>>> 
>> http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":16},
>>> "metrics":{
>>>   "solr.core.BTS.shard7.replica_n36":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":962,
>>>       "meanRate":0.46572438680661055,
>>>       "1minRate":0.4974893681625287,
>>>       "5minRate":0.49072296556429784,
>>>       "15minRate":0.44138205926188756,
>>>       "min_ms":0.164803,
>>>       "max_ms":12481.82656,
>>>       "mean_ms":2606.899631183513,
>>>       "median_ms":1457.505387,
>>>       "stddev_ms":3083.297183477969,
>>>       "p75_ms":4072.543679,
>>>       "p95_ms":8562.456178,
>>>       "p99_ms":9351.230895,
>>>       "p999_ms":10430.483813}}}}
>>> 
>>> 
>> http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> <
>> http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> 
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":3},
>>> "metrics":{
>>>   "solr.core.BTS.shard8.replica_n44":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":904,
>>>       "meanRate":0.4356001115451976,
>>>       "1minRate":0.42906831311171356,
>>>       "5minRate":0.4651312663377039,
>>>       "15minRate":0.41812847342709225,
>>>       "min_ms":0.089738,
>>>       "max_ms":10857.092832,
>>>       "mean_ms":304.52127270799156,
>>>       "median_ms":7.098736,
>>>       "stddev_ms":1544.5378594679773,
>>>       "p75_ms":15.599817,
>>>       "p95_ms":93.818662,
>>>       "p99_ms":8510.757117,
>>>       "p999_ms":9353.844994}}}}
>>> 
>>> I restart all of the instances on “34” so that there are no leaders on
>> it. The load somewhat goes to the other box.
>>> 
>>> 10.156.112.50 load average: 0.00, 0.16, 0.47
>>> 10.156.116.34 load average: 17.00, 16.16, 17.07
>>> 10.156.122.13 load average: 17.86, 17.49, 14.74
>>> 
>>> Box “50” is still doing nothing AND it is the leader of 4 of the 8
>> shards.
>>> Box “13” is the leader of the remaining 4 shards.
>>> Box “34” is not the leader of any shard.
>>> 
>>> I will continue to test, who knows, it may be something I am doing.
>> Maybe not enough RAM, etc…, so I am definitely leaving this open to the
>> possibility that I am not well configured for 8.5.
>>> 
>>> Regards
>>> 
>>> 
>>> 
>>> 
>>>> On May 16, 2020, at 5:08 PM, Tomás Fernández Löbbe <
>> tomasflobbe@gmail.com> wrote:
>>>> 
>>>> I just backported Michael’s fix to be released in 8.5.2
>>>> 
>>>> On Fri, May 15, 2020 at 6:38 AM Michael Gibney <
>> michael@michaelgibney.net>
>>>> wrote:
>>>> 
>>>>> Hi Wei,
>>>>> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
>>>>> Thanks for reporting the problem!
>>>>> Michael
>>>>> 
>>>>> On Mon, May 11, 2020 at 7:51 PM Wei <we...@gmail.com> wrote:
>>>>>> 
>>>>>> Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other
>>>>> type
>>>>>> of replicas, and each Tlog replica is an individual solr instance on
>> its
>>>>>> own physical machine.  In the jira you mentioned 'when "last place
>>>>> matches"
>>>>>> == "first place matches" – e.g. when shards.preference specified
>> matches
>>>>>> *all* available replicas'.   My setting is
>>>>>> shards.preference=replica.location:local,replica.type:TLOG,
>>>>>> I also tried just shards.preference=replica.location:local and it
>> still
>>>>> has
>>>>>> the issue. Can you explain a bit more?
>>>>>> 
>>>>>> On Mon, May 11, 2020 at 12:26 PM Michael Gibney <
>>>>> michael@michaelgibney.net>
>>>>>> wrote:
>>>>>> 
>>>>>>> FYI: https://issues.apache.org/jira/browse/SOLR-14471
>>>>>>> Wei, assuming you have only TLOG replicas, your "last place" matches
>>>>>>> (to which the random fallback ordering would not be applied -- see
>>>>>>> above issue) would be the same as the "first place" matches selected
>>>>>>> for executing distributed requests.
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
>>>>>>> <mi...@michaelgibney.net> wrote:
>>>>>>>> 
>>>>>>>> Wei, probably no need to answer my earlier questions; I think I see
>>>>>>>> the problem here, and believe it is indeed a bug, introduced in 8.3.
>>>>>>>> Will file an issue and submit a patch shortly.
>>>>>>>> Michael
>>>>>>>> 
>>>>>>>> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
>>>>>>>> <mi...@michaelgibney.net> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Wei,
>>>>>>>>> 
>>>>>>>>> In considering this problem, I'm stumbling a bit on terminology
>>>>>>>>> (particularly, where you mention "nodes", I think you're referring
>>>>> to
>>>>>>>>> "replicas"?). Could you confirm that you have 10 TLOG replicas per
>>>>>>>>> shard, for each of 6 shards? How many *nodes* (i.e., running solr
>>>>>>>>> server instances) do you have, and what is the replica placement
>>>>> like
>>>>>>>>> across those nodes? What, if any, non-TLOG replicas do you have per
>>>>>>>>> shard (not that it's necessarily relevant, but just to get a
>>>>> complete
>>>>>>>>> picture of the situation)?
>>>>>>>>> 
>>>>>>>>> If you're able without too much trouble, can you determine what the
>>>>>>>>> behavior is like on Solr 8.3? (there were different changes
>>>>> introduced
>>>>>>>>> to potentially relevant code in 8.3 and 8.4, and knowing whether
>>>>> the
>>>>>>>>> behavior you're observing manifests on 8.3 would help narrow down
>>>>>>>>> where to look for an explanation).
>>>>>>>>> 
>>>>>>>>> Michael
>>>>>>>>> 
>>>>>>>>> On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Update:  after I remove the shards.preference parameter from
>>>>>>>>>> solrconfig.xml,  issue is gone and internal shard requests are
>>>>> now
>>>>>>>>>> balanced. The same parameter works fine with solr 7.6.  Still not
>>>>>>> sure of
>>>>>>>>>> the root cause, but I observed a strange coincidence: the nodes
>>>>> that
>>>>>>> are
>>>>>>>>>> most frequently picked for shard requests are the first node in
>>>>> each
>>>>>>> shard
>>>>>>>>>> returned from the CLUSTERSTATUS api.  Seems something wrong with
>>>>>>> shuffling
>>>>>>>>>> equally compared nodes when shards.preference is set.  Will
>>>>> report
>>>>>>> back if
>>>>>>>>>> I find more.
>>>>>>>>>> 
>>>>>>>>>> On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Eric,
>>>>>>>>>>> 
>>>>>>>>>>> I am measuring the number of shard requests, and it's for query
>>>>>>> only, no
>>>>>>>>>>> indexing requests.  I have an external load balancer and see
>>>>> each
>>>>>>> node
>>>>>>>>>>> received about the equal number of external queries. However
>>>>> for
>>>>>>> the
>>>>>>>>>>> internal shard queries,  the distribution is uneven:    6 nodes
>>>>>>> (one in
>>>>>>>>>>> each shard,  some of them are leaders and some are non-leaders
>>>>> )
>>>>>>> gets about
>>>>>>>>>>> 80% of the shard requests, the other 54 nodes gets about 20% of
>>>>>>> the shard
>>>>>>>>>>> requests.   I checked a few other parameters set:
>>>>>>>>>>> 
>>>>>>>>>>> -Dsolr.disable.shardsWhitelist=true
>>>>>>>>>>> shards.preference=replica.location:local,replica.type:TLOG
>>>>>>>>>>> 
>>>>>>>>>>> Nothing seems to cause the strange behavior.  Any suggestions
>>>>> how
>>>>>>> to
>>>>>>>>>>> debug this?
>>>>>>>>>>> 
>>>>>>>>>>> -Wei
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
>>>>>>> erickerickson@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Wei:
>>>>>>>>>>>> 
>>>>>>>>>>>> How are you measuring utilization here? The number of incoming
>>>>>>> requests
>>>>>>>>>>>> or CPU?
>>>>>>>>>>>> 
>>>>>>>>>>>> The leader for each shard are certainly handling all of the
>>>>>>> indexing
>>>>>>>>>>>> requests since they’re TLOG replicas, so that’s one thing that
>>>>>>> might
>>>>>>>>>>>> skewing your measurements.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Erick
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com>
>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
>>>>>>> cloud has 6
>>>>>>>>>>>>> shards with 10 TLOG replicas each shard.  After upgrade I
>>>>>>> noticed that
>>>>>>>>>>>> one
>>>>>>>>>>>>> of the replicas in each shard is handling most of the
>>>>>>> distributed shard
>>>>>>>>>>>>> requests, so 6 nodes are heavily loaded while other nodes
>>>>> are
>>>>>>> idle.
>>>>>>>>>>>> There
>>>>>>>>>>>>> is no change in shard handler configuration:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <shardHandlerFactory name="shardHandlerFactory" class=
>>>>>>>>>>>>> "HttpShardHandlerFactory">
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <int name="socketTimeout">30000</int>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <int name="connTimeout">30000</int>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <int name="maxConnectionsPerHost">500</int>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> </shardHandlerFactory>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What could cause the unbalanced internal distributed
>>>>> request?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Wei
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
>> 


Re: Unbalanced shard requests

Posted by Wei <we...@gmail.com>.
Hi Phill,

What is the RAM config you are referring to, JVM size? How is that related
to the load balancing, if each node has the same configuration?

Thanks,
Wei

On Mon, May 18, 2020 at 3:07 PM Phill Campbell
<Si...@yahoo.com.invalid> wrote:

> In my previous report I was configured to use as much RAM as possible.
> With that configuration it seemed it was not load balancing.
> So, I reconfigured and redeployed to use 1/4 the RAM. What a difference
> for the better!
>
> 10.156.112.50   load average: 13.52, 10.56, 6.46
> 10.156.116.34   load average: 11.23, 12.35, 9.63
> 10.156.122.13   load average: 10.29, 12.40, 9.69
>
> Very nice.
> My load-test tool records RPS. In the “bad” configuration it was less
> than 1 RPS.
> NOW it is showing 21 RPS.
>
>
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":161},
>   "metrics":{
>     "solr.core.BTS.shard1.replica_n2":{
>       "QUERY./select.requestTimes":{
>         "count":5723,
>         "meanRate":6.8163888639859085,
>         "1minRate":11.557013215119536,
>         "5minRate":8.760356217628159,
>         "15minRate":4.707624230995833,
>         "min_ms":0.131545,
>         "max_ms":388.710848,
>         "mean_ms":30.300492048215947,
>         "median_ms":6.336654,
>         "stddev_ms":51.527164088667035,
>         "p75_ms":35.427943,
>         "p95_ms":140.025957,
>         "p99_ms":230.533099,
>         "p999_ms":388.710848}}}}
>
>
>
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":11},
>   "metrics":{
>     "solr.core.BTS.shard2.replica_n8":{
>       "QUERY./select.requestTimes":{
>         "count":6469,
>         "meanRate":7.502581801189549,
>         "1minRate":12.211423085368564,
>         "5minRate":9.445681397767322,
>         "15minRate":5.216209798637846,
>         "min_ms":0.154691,
>         "max_ms":701.657394,
>         "mean_ms":34.2734699171445,
>         "median_ms":5.640378,
>         "stddev_ms":62.27649205954566,
>         "p75_ms":39.016371,
>         "p95_ms":156.997982,
>         "p99_ms":288.883028,
>         "p999_ms":538.368031}}}}
>
>
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":67},
>   "metrics":{
>     "solr.core.BTS.shard3.replica_n16":{
>       "QUERY./select.requestTimes":{
>         "count":7109,
>         "meanRate":7.787524673806184,
>         "1minRate":11.88519763582083,
>         "5minRate":9.893315557386755,
>         "15minRate":5.620178363676527,
>         "min_ms":0.150887,
>         "max_ms":472.826462,
>         "mean_ms":32.184282366621204,
>         "median_ms":6.977733,
>         "stddev_ms":55.729908615189196,
>         "p75_ms":36.655011,
>         "p95_ms":151.12627,
>         "p99_ms":251.440162,
>         "p999_ms":472.826462}}}}
>
>
> Compare that to the previous report and you can see the improvement.
> So, note to myself: figure out the sweet spot for RAM usage. Use too much
> and strange behavior shows up: all the load focused on one box and query
> times slowed.
> I did not see any OOM errors during any of this.
>
> Regards
>
>
>
> > On May 18, 2020, at 3:23 PM, Phill Campbell
> <Si...@yahoo.com.INVALID> wrote:
> >
> > I have been testing 8.5.2 and it looks like the load has moved but is
> still on one machine.
> >
> > Setup:
> > 3 physical machines.
> > Each machine hosts 8 instances of Solr.
> > Each instance of Solr hosts one replica.
> >
> > Another way to say it:
> > Number of shards = 8. Replication factor = 3.
> >
> > Here is the cluster state. You can see that the leaders are well
> distributed.
> >
> > {"TEST_COLLECTION":{
> >    "pullReplicas":"0",
> >    "replicationFactor":"3",
> >    "shards":{
> >      "shard1":{
> >        "range":"80000000-9fffffff",
> >        "state":"active",
> >        "replicas":{
> >          "core_node3":{
> >            "core":"TEST_COLLECTION_shard1_replica_n1",
> >            "base_url":"http://10.156.122.13:10007/solr",
> >            "node_name":"10.156.122.13:10007_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node5":{
> >            "core":"TEST_COLLECTION_shard1_replica_n2",
> >            "base_url":"http://10.156.112.50:10002/solr",
> >            "node_name":"10.156.112.50:10002_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false",
> >            "leader":"true"},
> >          "core_node7":{
> >            "core":"TEST_COLLECTION_shard1_replica_n4",
> >            "base_url":"http://10.156.112.50:10006/solr",
> >            "node_name":"10.156.112.50:10006_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"}}},
> >      "shard2":{
> >        "range":"a0000000-bfffffff",
> >        "state":"active",
> >        "replicas":{
> >          "core_node9":{
> >            "core":"TEST_COLLECTION_shard2_replica_n6",
> >            "base_url":"http://10.156.112.50:10003/solr",
> >            "node_name":"10.156.112.50:10003_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node11":{
> >            "core":"TEST_COLLECTION_shard2_replica_n8",
> >            "base_url":"http://10.156.122.13:10004/solr",
> >            "node_name":"10.156.122.13:10004_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false",
> >            "leader":"true"},
> >          "core_node12":{
> >            "core":"TEST_COLLECTION_shard2_replica_n10",
> >            "base_url":"http://10.156.116.34:10008/solr",
> >            "node_name":"10.156.116.34:10008_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"}}},
> >      "shard3":{
> >        "range":"c0000000-dfffffff",
> >        "state":"active",
> >        "replicas":{
> >          "core_node15":{
> >            "core":"TEST_COLLECTION_shard3_replica_n13",
> >            "base_url":"http://10.156.122.13:10008/solr",
> >            "node_name":"10.156.122.13:10008_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node17":{
> >            "core":"TEST_COLLECTION_shard3_replica_n14",
> >            "base_url":"http://10.156.116.34:10005/solr",
> >            "node_name":"10.156.116.34:10005_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node19":{
> >            "core":"TEST_COLLECTION_shard3_replica_n16",
> >            "base_url":"http://10.156.116.34:10002/solr",
> >            "node_name":"10.156.116.34:10002_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false",
> >            "leader":"true"}}},
> >      "shard4":{
> >        "range":"e0000000-ffffffff",
> >        "state":"active",
> >        "replicas":{
> >          "core_node20":{
> >            "core":"TEST_COLLECTION_shard4_replica_n18",
> >            "base_url":"http://10.156.122.13:10001/solr",
> >            "node_name":"10.156.122.13:10001_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node23":{
> >            "core":"TEST_COLLECTION_shard4_replica_n21",
> >            "base_url":"http://10.156.116.34:10004/solr",
> >            "node_name":"10.156.116.34:10004_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node25":{
> >            "core":"TEST_COLLECTION_shard4_replica_n22",
> >            "base_url":"http://10.156.112.50:10001/solr",
> >            "node_name":"10.156.112.50:10001_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false",
> >            "leader":"true"}}},
> >      "shard5":{
> >        "range":"0-1fffffff",
> >        "state":"active",
> >        "replicas":{
> >          "core_node27":{
> >            "core":"TEST_COLLECTION_shard5_replica_n24",
> >            "base_url":"http://10.156.116.34:10007/solr",
> >            "node_name":"10.156.116.34:10007_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node29":{
> >            "core":"TEST_COLLECTION_shard5_replica_n26",
> >            "base_url":"http://10.156.122.13:10006/solr",
> >            "node_name":"10.156.122.13:10006_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node31":{
> >            "core":"TEST_COLLECTION_shard5_replica_n28",
> >            "base_url":"http://10.156.116.34:10006/solr",
> >            "node_name":"10.156.116.34:10006_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false",
> >            "leader":"true"}}},
> >      "shard6":{
> >        "range":"20000000-3fffffff",
> >        "state":"active",
> >        "replicas":{
> >          "core_node33":{
> >            "core":"TEST_COLLECTION_shard6_replica_n30",
> >            "base_url":"http://10.156.122.13:10002/solr",
> >            "node_name":"10.156.122.13:10002_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false",
> >            "leader":"true"},
> >          "core_node35":{
> >            "core":"TEST_COLLECTION_shard6_replica_n32",
> >            "base_url":"http://10.156.112.50:10008/solr",
> >            "node_name":"10.156.112.50:10008_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node37":{
> >            "core":"TEST_COLLECTION_shard6_replica_n34",
> >            "base_url":"http://10.156.116.34:10003/solr",
> >            "node_name":"10.156.116.34:10003_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"}}},
> >      "shard7":{
> >        "range":"40000000-5fffffff",
> >        "state":"active",
> >        "replicas":{
> >          "core_node39":{
> >            "core":"TEST_COLLECTION_shard7_replica_n36",
> >            "base_url":"http://10.156.122.13:10003/solr",
> >            "node_name":"10.156.122.13:10003_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false",
> >            "leader":"true"},
> >          "core_node41":{
> >            "core":"TEST_COLLECTION_shard7_replica_n38",
> >            "base_url":"http://10.156.122.13:10005/solr",
> >            "node_name":"10.156.122.13:10005_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node43":{
> >            "core":"TEST_COLLECTION_shard7_replica_n40",
> >            "base_url":"http://10.156.112.50:10004/solr",
> >            "node_name":"10.156.112.50:10004_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"}}},
> >      "shard8":{
> >        "range":"60000000-7fffffff",
> >        "state":"active",
> >        "replicas":{
> >          "core_node45":{
> >            "core":"TEST_COLLECTION_shard8_replica_n42",
> >            "base_url":"http://10.156.112.50:10007/solr",
> >            "node_name":"10.156.112.50:10007_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"},
> >          "core_node47":{
> >            "core":"TEST_COLLECTION_shard8_replica_n44",
> >            "base_url":"http://10.156.112.50:10005/solr",
> >            "node_name":"10.156.112.50:10005_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false",
> >            "leader":"true"},
> >          "core_node48":{
> >            "core":"TEST_COLLECTION_shard8_replica_n46",
> >            "base_url":"http://10.156.116.34:10001/solr",
> >            "node_name":"10.156.116.34:10001_solr",
> >            "state":"active",
> >            "type":"NRT",
> >            "force_set_state":"false"}}}},
> >    "router":{"name":"compositeId"},
> >    "maxShardsPerNode":"1",
> >    "autoAddReplicas":"false",
> >    "nrtReplicas":"3",
> >    "tlogReplicas":"0"}}
> >
> >
> > Running TOP on each machine while load tests have been running for 60
> minutes.
> >
> > 10.156.112.50 load average: 0.08, 0.35, 1.65
> > 10.156.116.34 load average: 24.71, 24.20, 20.65
> > 10.156.122.13 load average: 5.37, 3.21, 4.04
> >
> >
> >
> > Here are the stats from each shard leader.
> >
> >
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":2},
> >  "metrics":{
> >    "solr.core.BTS.shard1.replica_n2":{
> >      "QUERY./select.requestTimes":{
> >        "count":805,
> >        "meanRate":0.4385455794526838,
> >        "1minRate":0.5110237122383522,
> >        "5minRate":0.4671091682458005,
> >        "15minRate":0.4057871940723353,
> >        "min_ms":0.14047,
> >        "max_ms":12424.589645,
> >        "mean_ms":796.2194458711818,
> >        "median_ms":10.534906,
> >        "stddev_ms":2567.655224710497,
> >        "p75_ms":22.893306,
> >        "p95_ms":8316.33323,
> >        "p99_ms":12424.589645,
> >        "p999_ms":12424.589645}}}}
> >
> >
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":2},
> >  "metrics":{
> >    "solr.core.BTS.shard2.replica_n8":{
> >      "QUERY./select.requestTimes":{
> >        "count":791,
> >        "meanRate":0.4244162938316224,
> >        "1minRate":0.4869749626003825,
> >        "5minRate":0.45856412657687656,
> >        "15minRate":0.3948063845907493,
> >        "min_ms":0.168369,
> >        "max_ms":11022.763933,
> >        "mean_ms":2572.0670957974603,
> >        "median_ms":1490.222885,
> >        "stddev_ms":2718.1710938804276,
> >        "p75_ms":4292.490478,
> >        "p95_ms":8487.18506,
> >        "p99_ms":8855.936617,
> >        "p999_ms":9589.218502}}}}
> >
> >
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":83},
> >  "metrics":{
> >    "solr.core.BTS.shard3.replica_n16":{
> >      "QUERY./select.requestTimes":{
> >        "count":840,
> >        "meanRate":0.4335334453288775,
> >        "1minRate":0.5733683837779382,
> >        "5minRate":0.4931753679028527,
> >        "15minRate":0.42241330274699623,
> >        "min_ms":0.155939,
> >        "max_ms":18125.516406,
> >        "mean_ms":7097.942850416767,
> >        "median_ms":8136.862825,
> >        "stddev_ms":2382.041897221542,
> >        "p75_ms":8497.844088,
> >        "p95_ms":9642.430475,
> >        "p99_ms":9993.694346,
> >        "p999_ms":12207.982291}}}}
> >
> >
> http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":3},
> >  "metrics":{
> >    "solr.core.BTS.shard4.replica_n22":{
> >      "QUERY./select.requestTimes":{
> >        "count":873,
> >        "meanRate":0.43420303985137254,
> >        "1minRate":0.4284437786865815,
> >        "5minRate":0.44020640429418745,
> >        "15minRate":0.40860871277629196,
> >        "min_ms":0.136658,
> >        "max_ms":11345.407699,
> >        "mean_ms":511.28573906464504,
> >        "median_ms":9.063677,
> >        "stddev_ms":2038.8104673512248,
> >        "p75_ms":20.270605,
> >        "p95_ms":8418.131442,
> >        "p99_ms":8904.78616,
> >        "p999_ms":10447.78365}}}}
> >
> >
> http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":4},
> >  "metrics":{
> >    "solr.core.BTS.shard5.replica_n28":{
> >      "QUERY./select.requestTimes":{
> >        "count":863,
> >        "meanRate":0.4419375762840668,
> >        "1minRate":0.44487242228317025,
> >        "5minRate":0.45927613542085916,
> >        "15minRate":0.41056066296443494,
> >        "min_ms":0.158855,
> >        "max_ms":16669.411989,
> >        "mean_ms":6513.057114006753,
> >        "median_ms":8033.386692,
> >        "stddev_ms":3002.7487311308896,
> >        "p75_ms":8446.147616,
> >        "p95_ms":9888.641316,
> >        "p99_ms":13624.11926,
> >        "p999_ms":13624.11926}}}}
> >
> >
> http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":2},
> >  "metrics":{
> >    "solr.core.BTS.shard6.replica_n30":{
> >      "QUERY./select.requestTimes":{
> >        "count":893,
> >        "meanRate":0.43301141185981046,
> >        "1minRate":0.4011485529441132,
> >        "5minRate":0.447654905093643,
> >        "15minRate":0.41489193746842407,
> >        "min_ms":0.161571,
> >        "max_ms":14716.828978,
> >        "mean_ms":2932.212133523417,
> >        "median_ms":1289.686481,
> >        "stddev_ms":3426.22045100954,
> >        "p75_ms":6230.031884,
> >        "p95_ms":8109.408506,
> >        "p99_ms":12904.515311,
> >        "p999_ms":12904.515311}}}}
> >
> >
> >
> >
> http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":16},
> >  "metrics":{
> >    "solr.core.BTS.shard7.replica_n36":{
> >      "QUERY./select.requestTimes":{
> >        "count":962,
> >        "meanRate":0.46572438680661055,
> >        "1minRate":0.4974893681625287,
> >        "5minRate":0.49072296556429784,
> >        "15minRate":0.44138205926188756,
> >        "min_ms":0.164803,
> >        "max_ms":12481.82656,
> >        "mean_ms":2606.899631183513,
> >        "median_ms":1457.505387,
> >        "stddev_ms":3083.297183477969,
> >        "p75_ms":4072.543679,
> >        "p95_ms":8562.456178,
> >        "p99_ms":9351.230895,
> >        "p999_ms":10430.483813}}}}
> >
> >
> http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> <
> http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":3},
> >  "metrics":{
> >    "solr.core.BTS.shard8.replica_n44":{
> >      "QUERY./select.requestTimes":{
> >        "count":904,
> >        "meanRate":0.4356001115451976,
> >        "1minRate":0.42906831311171356,
> >        "5minRate":0.4651312663377039,
> >        "15minRate":0.41812847342709225,
> >        "min_ms":0.089738,
> >        "max_ms":10857.092832,
> >        "mean_ms":304.52127270799156,
> >        "median_ms":7.098736,
> >        "stddev_ms":1544.5378594679773,
> >        "p75_ms":15.599817,
> >        "p95_ms":93.818662,
> >        "p99_ms":8510.757117,
> >        "p999_ms":9353.844994}}}}
> >
> > I restart all of the instances on “34” so that there are no leaders on
> it. The load somewhat goes to the other box.
> >
> > 10.156.112.50 load average: 0.00, 0.16, 0.47
> > 10.156.116.34 load average: 17.00, 16.16, 17.07
> > 10.156.122.13 load average: 17.86, 17.49, 14.74
> >
> > Box “50” is still doing nothing AND it is the leader of 4 of the 8
> shards.
> > Box “13” is the leader of the remaining 4 shards.
> > Box “34” is not the leader of any shard.
> >
> > I will continue to test, who knows, it may be something I am doing.
> Maybe not enough RAM, etc…, so I am definitely leaving this open to the
> possibility that I am not well configured for 8.5.
> >
> > Regards
> >
> >
> >
> >
> >> On May 16, 2020, at 5:08 PM, Tomás Fernández Löbbe <
> tomasflobbe@gmail.com> wrote:
> >>
> >> I just backported Michael’s fix to be released in 8.5.2
> >>
> >> On Fri, May 15, 2020 at 6:38 AM Michael Gibney <
> michael@michaelgibney.net>
> >> wrote:
> >>
> >>> Hi Wei,
> >>> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
> >>> Thanks for reporting the problem!
> >>> Michael
> >>>
> >>> On Mon, May 11, 2020 at 7:51 PM Wei <we...@gmail.com> wrote:
> >>>>
> >>>> Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other
> >>> type
> >>>> of replicas, and each Tlog replica is an individual solr instance on
> its
> >>>> own physical machine.  In the jira you mentioned 'when "last place
> >>> matches"
> >>>> == "first place matches" – e.g. when shards.preference specified
> matches
> >>>> *all* available replicas'.   My setting is
> >>>> shards.preference=replica.location:local,replica.type:TLOG,
> >>>> I also tried just shards.preference=replica.location:local and it
> still
> >>> has
> >>>> the issue. Can you explain a bit more?
> >>>>
> >>>> On Mon, May 11, 2020 at 12:26 PM Michael Gibney <
> >>> michael@michaelgibney.net>
> >>>> wrote:
> >>>>
> >>>>> FYI: https://issues.apache.org/jira/browse/SOLR-14471
> >>>>> Wei, assuming you have only TLOG replicas, your "last place" matches
> >>>>> (to which the random fallback ordering would not be applied -- see
> >>>>> above issue) would be the same as the "first place" matches selected
> >>>>> for executing distributed requests.
> >>>>>
> >>>>>
> >>>>> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> >>>>> <mi...@michaelgibney.net> wrote:
> >>>>>>
> >>>>>> Wei, probably no need to answer my earlier questions; I think I see
> >>>>>> the problem here, and believe it is indeed a bug, introduced in 8.3.
> >>>>>> Will file an issue and submit a patch shortly.
> >>>>>> Michael
> >>>>>>
> >>>>>> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> >>>>>> <mi...@michaelgibney.net> wrote:
> >>>>>>>
> >>>>>>> Hi Wei,
> >>>>>>>
> >>>>>>> In considering this problem, I'm stumbling a bit on terminology
> >>>>>>> (particularly, where you mention "nodes", I think you're referring
> >>> to
> >>>>>>> "replicas"?). Could you confirm that you have 10 TLOG replicas per
> >>>>>>> shard, for each of 6 shards? How many *nodes* (i.e., running solr
> >>>>>>> server instances) do you have, and what is the replica placement
> >>> like
> >>>>>>> across those nodes? What, if any, non-TLOG replicas do you have per
> >>>>>>> shard (not that it's necessarily relevant, but just to get a
> >>> complete
> >>>>>>> picture of the situation)?
> >>>>>>>
> >>>>>>> If you're able without too much trouble, can you determine what the
> >>>>>>> behavior is like on Solr 8.3? (there were different changes
> >>> introduced
> >>>>>>> to potentially relevant code in 8.3 and 8.4, and knowing whether
> >>> the
> >>>>>>> behavior you're observing manifests on 8.3 would help narrow down
> >>>>>>> where to look for an explanation).
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>> On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Update:  after I remove the shards.preference parameter from
> >>>>>>>> solrconfig.xml,  issue is gone and internal shard requests are
> >>> now
> >>>>>>>> balanced. The same parameter works fine with solr 7.6.  Still not
> >>>>> sure of
> >>>>>>>> the root cause, but I observed a strange coincidence: the nodes
> >>> that
> >>>>> are
> >>>>>>>> most frequently picked for shard requests are the first node in
> >>> each
> >>>>> shard
> >>>>>>>> returned from the CLUSTERSTATUS api.  Seems something wrong with
> >>>>> shuffling
> >>>>>>>> equally compared nodes when shards.preference is set.  Will
> >>> report
> >>>>> back if
> >>>>>>>> I find more.
> >>>>>>>>
> >>>>>>>> On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Eric,
> >>>>>>>>>
> >>>>>>>>> I am measuring the number of shard requests, and it's for query
> >>>>> only, no
> >>>>>>>>> indexing requests.  I have an external load balancer and see
> >>> each
> >>>>> node
> >>>>>>>>> received about the equal number of external queries. However
> >>> for
> >>>>> the
> >>>>>>>>> internal shard queries,  the distribution is uneven:    6 nodes
> >>>>> (one in
> >>>>>>>>> each shard,  some of them are leaders and some are non-leaders
> >>> )
> >>>>> gets about
> >>>>>>>>> 80% of the shard requests, the other 54 nodes gets about 20% of
> >>>>> the shard
> >>>>>>>>> requests.   I checked a few other parameters set:
> >>>>>>>>>
> >>>>>>>>> -Dsolr.disable.shardsWhitelist=true
> >>>>>>>>> shards.preference=replica.location:local,replica.type:TLOG
> >>>>>>>>>
> >>>>>>>>> Nothing seems to cause the strange behavior.  Any suggestions
> >>> how
> >>>>> to
> >>>>>>>>> debug this?
> >>>>>>>>>
> >>>>>>>>> -Wei
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
> >>>>> erickerickson@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Wei:
> >>>>>>>>>>
> >>>>>>>>>> How are you measuring utilization here? The number of incoming
> >>>>> requests
> >>>>>>>>>> or CPU?
> >>>>>>>>>>
> >>>>>>>>>> The leader for each shard are certainly handling all of the
> >>>>> indexing
> >>>>>>>>>> requests since they’re TLOG replicas, so that’s one thing that
> >>>>> might
> >>>>>>>>>> skewing your measurements.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Erick
> >>>>>>>>>>
> >>>>>>>>>>> On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com>
> >>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
> >>>>> cloud has 6
> >>>>>>>>>>> shards with 10 TLOG replicas each shard.  After upgrade I
> >>>>> noticed that
> >>>>>>>>>> one
> >>>>>>>>>>> of the replicas in each shard is handling most of the
> >>>>> distributed shard
> >>>>>>>>>>> requests, so 6 nodes are heavily loaded while other nodes
> >>> are
> >>>>> idle.
> >>>>>>>>>> There
> >>>>>>>>>>> is no change in shard handler configuration:
> >>>>>>>>>>>
> >>>>>>>>>>> <shardHandlerFactory name="shardHandlerFactory" class=
> >>>>>>>>>>> "HttpShardHandlerFactory">
> >>>>>>>>>>>
> >>>>>>>>>>>  <int name="socketTimeout">30000</int>
> >>>>>>>>>>>
> >>>>>>>>>>>  <int name="connTimeout">30000</int>
> >>>>>>>>>>>
> >>>>>>>>>>>  <int name="maxConnectionsPerHost">500</int>
> >>>>>>>>>>>
> >>>>>>>>>>> </shardHandlerFactory>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> What could cause the unbalanced internal distributed
> >>> request?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks in advance.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Wei
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>
> >>>
> >
>
>

Re: Unbalanced shard requests

Posted by Phill Campbell <Si...@yahoo.com.INVALID>.
In my previous report I was configured to use as much RAM as possible. With that configuration it seemed it was not load balancing.
So, I reconfigured and redeployed to use 1/4 the RAM. What a difference for the better!

10.156.112.50	load average: 13.52, 10.56, 6.46
10.156.116.34	load average: 11.23, 12.35, 9.63
10.156.122.13	load average: 10.29, 12.40, 9.69

Very nice.
My load-test tool records RPS. In the “bad” configuration it was less than 1 RPS.
NOW it is showing 21 RPS.
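
A quick way to cross-check that figure against Solr's own counters is to
sample the request count twice and divide by the interval. The sketch below
is only illustrative: the host/port is one of the instances listed further
down, the metric names are the ones visible in the JSON responses, and the
60-second window is arbitrary. It measures a single core's rate (internal
shard requests included), so it will not match the client tool exactly, but
it is handy for before/after comparisons.

import json
import time
import urllib.request

URL = ("http://10.156.112.50:10002/solr/admin/metrics"
       "?group=core&prefix=QUERY./select.requestTimes")

def select_count(url):
    # Response shape matches the JSON pasted below:
    # metrics -> <core name> -> QUERY./select.requestTimes -> count
    with urllib.request.urlopen(url) as resp:
        metrics = json.load(resp)["metrics"]
    core_metrics = next(iter(metrics.values()))
    return core_metrics["QUERY./select.requestTimes"]["count"]

before = select_count(URL)
time.sleep(60)
after = select_count(URL)
print("approx. requests/sec on this core:", (after - before) / 60.0)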

http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":161},
  "metrics":{
    "solr.core.BTS.shard1.replica_n2":{
      "QUERY./select.requestTimes":{
        "count":5723,
        "meanRate":6.8163888639859085,
        "1minRate":11.557013215119536,
        "5minRate":8.760356217628159,
        "15minRate":4.707624230995833,
        "min_ms":0.131545,
        "max_ms":388.710848,
        "mean_ms":30.300492048215947,
        "median_ms":6.336654,
        "stddev_ms":51.527164088667035,
        "p75_ms":35.427943,
        "p95_ms":140.025957,
        "p99_ms":230.533099,
        "p999_ms":388.710848}}}}


http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":11},
  "metrics":{
    "solr.core.BTS.shard2.replica_n8":{
      "QUERY./select.requestTimes":{
        "count":6469,
        "meanRate":7.502581801189549,
        "1minRate":12.211423085368564,
        "5minRate":9.445681397767322,
        "15minRate":5.216209798637846,
        "min_ms":0.154691,
        "max_ms":701.657394,
        "mean_ms":34.2734699171445,
        "median_ms":5.640378,
        "stddev_ms":62.27649205954566,
        "p75_ms":39.016371,
        "p95_ms":156.997982,
        "p99_ms":288.883028,
        "p999_ms":538.368031}}}}

http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":67},
  "metrics":{
    "solr.core.BTS.shard3.replica_n16":{
      "QUERY./select.requestTimes":{
        "count":7109,
        "meanRate":7.787524673806184,
        "1minRate":11.88519763582083,
        "5minRate":9.893315557386755,
        "15minRate":5.620178363676527,
        "min_ms":0.150887,
        "max_ms":472.826462,
        "mean_ms":32.184282366621204,
        "median_ms":6.977733,
        "stddev_ms":55.729908615189196,
        "p75_ms":36.655011,
        "p95_ms":151.12627,
        "p99_ms":251.440162,
        "p999_ms":472.826462}}}}


Compare that to the previous report and you can see the improvement.
So, note to myself: figure out the sweet spot for RAM usage. Use too much and strange behavior shows up: all the load focused on one box and query times slowed.
I did not see any OOM errors during any of this.
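
To repeat this kind of comparison without clicking through each node, a
small script can pull the same numbers from every leader core side by side.
This is just a sketch: the three URLs are the ones shown above, and the
remaining leaders follow the same host:port pattern.

import json
import urllib.request

LEADER_METRIC_URLS = [
    "http://10.156.112.50:10002/solr/admin/metrics"
    "?group=core&prefix=QUERY./select.requestTimes",
    "http://10.156.122.13:10004/solr/admin/metrics"
    "?group=core&prefix=QUERY./select.requestTimes",
    "http://10.156.116.34:10002/solr/admin/metrics"
    "?group=core&prefix=QUERY./select.requestTimes",
]

for url in LEADER_METRIC_URLS:
    with urllib.request.urlopen(url) as resp:
        metrics = json.load(resp)["metrics"]
    for core, core_metrics in metrics.items():
        timings = core_metrics["QUERY./select.requestTimes"]
        # 1minRate approximates current requests/sec, p95_ms the tail latency
        print("%-35s 1minRate=%7.2f  p95_ms=%10.2f"
              % (core, timings["1minRate"], timings["p95_ms"]))

A roughly even distribution should show similar 1minRate values across the
leaders, which is what the numbers above now look like.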

Regards



> On May 18, 2020, at 3:23 PM, Phill Campbell <Si...@yahoo.com.INVALID> wrote:
> 
> I have been testing 8.5.2 and it looks like the load has moved but is still on one machine.
> 
> Setup:
> 3 physical machines.
> Each machine hosts 8 instances of Solr.
> Each instance of Solr hosts one replica.
> 
> Another way to say it:
> Number of shards = 8. Replication factor = 3.
> 
> Here is the cluster state. You can see that the leaders are well distributed. 
> 
> {"TEST_COLLECTION":{
>    "pullReplicas":"0",
>    "replicationFactor":"3",
>    "shards":{
>      "shard1":{
>        "range":"80000000-9fffffff",
>        "state":"active",
>        "replicas":{
>          "core_node3":{
>            "core":"TEST_COLLECTION_shard1_replica_n1",
>            "base_url":"http://10.156.122.13:10007/solr",
>            "node_name":"10.156.122.13:10007_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node5":{
>            "core":"TEST_COLLECTION_shard1_replica_n2",
>            "base_url":"http://10.156.112.50:10002/solr",
>            "node_name":"10.156.112.50:10002_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false",
>            "leader":"true"},
>          "core_node7":{
>            "core":"TEST_COLLECTION_shard1_replica_n4",
>            "base_url":"http://10.156.112.50:10006/solr",
>            "node_name":"10.156.112.50:10006_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"}}},
>      "shard2":{
>        "range":"a0000000-bfffffff",
>        "state":"active",
>        "replicas":{
>          "core_node9":{
>            "core":"TEST_COLLECTION_shard2_replica_n6",
>            "base_url":"http://10.156.112.50:10003/solr",
>            "node_name":"10.156.112.50:10003_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node11":{
>            "core":"TEST_COLLECTION_shard2_replica_n8",
>            "base_url":"http://10.156.122.13:10004/solr",
>            "node_name":"10.156.122.13:10004_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false",
>            "leader":"true"},
>          "core_node12":{
>            "core":"TEST_COLLECTION_shard2_replica_n10",
>            "base_url":"http://10.156.116.34:10008/solr",
>            "node_name":"10.156.116.34:10008_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"}}},
>      "shard3":{
>        "range":"c0000000-dfffffff",
>        "state":"active",
>        "replicas":{
>          "core_node15":{
>            "core":"TEST_COLLECTION_shard3_replica_n13",
>            "base_url":"http://10.156.122.13:10008/solr",
>            "node_name":"10.156.122.13:10008_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node17":{
>            "core":"TEST_COLLECTION_shard3_replica_n14",
>            "base_url":"http://10.156.116.34:10005/solr",
>            "node_name":"10.156.116.34:10005_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node19":{
>            "core":"TEST_COLLECTION_shard3_replica_n16",
>            "base_url":"http://10.156.116.34:10002/solr",
>            "node_name":"10.156.116.34:10002_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false",
>            "leader":"true"}}},
>      "shard4":{
>        "range":"e0000000-ffffffff",
>        "state":"active",
>        "replicas":{
>          "core_node20":{
>            "core":"TEST_COLLECTION_shard4_replica_n18",
>            "base_url":"http://10.156.122.13:10001/solr",
>            "node_name":"10.156.122.13:10001_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node23":{
>            "core":"TEST_COLLECTION_shard4_replica_n21",
>            "base_url":"http://10.156.116.34:10004/solr",
>            "node_name":"10.156.116.34:10004_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node25":{
>            "core":"TEST_COLLECTION_shard4_replica_n22",
>            "base_url":"http://10.156.112.50:10001/solr",
>            "node_name":"10.156.112.50:10001_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false",
>            "leader":"true"}}},
>      "shard5":{
>        "range":"0-1fffffff",
>        "state":"active",
>        "replicas":{
>          "core_node27":{
>            "core":"TEST_COLLECTION_shard5_replica_n24",
>            "base_url":"http://10.156.116.34:10007/solr",
>            "node_name":"10.156.116.34:10007_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node29":{
>            "core":"TEST_COLLECTION_shard5_replica_n26",
>            "base_url":"http://10.156.122.13:10006/solr",
>            "node_name":"10.156.122.13:10006_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node31":{
>            "core":"TEST_COLLECTION_shard5_replica_n28",
>            "base_url":"http://10.156.116.34:10006/solr",
>            "node_name":"10.156.116.34:10006_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false",
>            "leader":"true"}}},
>      "shard6":{
>        "range":"20000000-3fffffff",
>        "state":"active",
>        "replicas":{
>          "core_node33":{
>            "core":"TEST_COLLECTION_shard6_replica_n30",
>            "base_url":"http://10.156.122.13:10002/solr",
>            "node_name":"10.156.122.13:10002_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false",
>            "leader":"true"},
>          "core_node35":{
>            "core":"TEST_COLLECTION_shard6_replica_n32",
>            "base_url":"http://10.156.112.50:10008/solr",
>            "node_name":"10.156.112.50:10008_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node37":{
>            "core":"TEST_COLLECTION_shard6_replica_n34",
>            "base_url":"http://10.156.116.34:10003/solr",
>            "node_name":"10.156.116.34:10003_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"}}},
>      "shard7":{
>        "range":"40000000-5fffffff",
>        "state":"active",
>        "replicas":{
>          "core_node39":{
>            "core":"TEST_COLLECTION_shard7_replica_n36",
>            "base_url":"http://10.156.122.13:10003/solr",
>            "node_name":"10.156.122.13:10003_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false",
>            "leader":"true"},
>          "core_node41":{
>            "core":"TEST_COLLECTION_shard7_replica_n38",
>            "base_url":"http://10.156.122.13:10005/solr",
>            "node_name":"10.156.122.13:10005_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node43":{
>            "core":"TEST_COLLECTION_shard7_replica_n40",
>            "base_url":"http://10.156.112.50:10004/solr",
>            "node_name":"10.156.112.50:10004_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"}}},
>      "shard8":{
>        "range":"60000000-7fffffff",
>        "state":"active",
>        "replicas":{
>          "core_node45":{
>            "core":"TEST_COLLECTION_shard8_replica_n42",
>            "base_url":"http://10.156.112.50:10007/solr",
>            "node_name":"10.156.112.50:10007_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"},
>          "core_node47":{
>            "core":"TEST_COLLECTION_shard8_replica_n44",
>            "base_url":"http://10.156.112.50:10005/solr",
>            "node_name":"10.156.112.50:10005_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false",
>            "leader":"true"},
>          "core_node48":{
>            "core":"TEST_COLLECTION_shard8_replica_n46",
>            "base_url":"http://10.156.116.34:10001/solr",
>            "node_name":"10.156.116.34:10001_solr",
>            "state":"active",
>            "type":"NRT",
>            "force_set_state":"false"}}}},
>    "router":{"name":"compositeId"},
>    "maxShardsPerNode":"1",
>    "autoAddReplicas":"false",
>    "nrtReplicas":"3",
>    "tlogReplicas":"0"}}
> 
> 
> Running TOP on each machine while load tests have been running for 60 minutes.
> 
> 10.156.112.50	load average: 0.08, 0.35, 1.65
> 10.156.116.34	load average: 24.71, 24.20, 20.65
> 10.156.122.13	load average: 5.37, 3.21, 4.04
> 
> 
> 
> Here are the stats from each shard leader.
> 
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes <http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes>
> {
>  "responseHeader":{
>    "status":0,
>    "QTime":2},
>  "metrics":{
>    "solr.core.BTS.shard1.replica_n2":{
>      "QUERY./select.requestTimes":{
>        "count":805,
>        "meanRate":0.4385455794526838,
>        "1minRate":0.5110237122383522,
>        "5minRate":0.4671091682458005,
>        "15minRate":0.4057871940723353,
>        "min_ms":0.14047,
>        "max_ms":12424.589645,
>        "mean_ms":796.2194458711818,
>        "median_ms":10.534906,
>        "stddev_ms":2567.655224710497,
>        "p75_ms":22.893306,
>        "p95_ms":8316.33323,
>        "p99_ms":12424.589645,
>        "p999_ms":12424.589645}}}}
> 
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes <http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes>
> {
>  "responseHeader":{
>    "status":0,
>    "QTime":2},
>  "metrics":{
>    "solr.core.BTS.shard2.replica_n8":{
>      "QUERY./select.requestTimes":{
>        "count":791,
>        "meanRate":0.4244162938316224,
>        "1minRate":0.4869749626003825,
>        "5minRate":0.45856412657687656,
>        "15minRate":0.3948063845907493,
>        "min_ms":0.168369,
>        "max_ms":11022.763933,
>        "mean_ms":2572.0670957974603,
>        "median_ms":1490.222885,
>        "stddev_ms":2718.1710938804276,
>        "p75_ms":4292.490478,
>        "p95_ms":8487.18506,
>        "p99_ms":8855.936617,
>        "p999_ms":9589.218502}}}}
> 
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes <http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes>
> {
>  "responseHeader":{
>    "status":0,
>    "QTime":83},
>  "metrics":{
>    "solr.core.BTS.shard3.replica_n16":{
>      "QUERY./select.requestTimes":{
>        "count":840,
>        "meanRate":0.4335334453288775,
>        "1minRate":0.5733683837779382,
>        "5minRate":0.4931753679028527,
>        "15minRate":0.42241330274699623,
>        "min_ms":0.155939,
>        "max_ms":18125.516406,
>        "mean_ms":7097.942850416767,
>        "median_ms":8136.862825,
>        "stddev_ms":2382.041897221542,
>        "p75_ms":8497.844088,
>        "p95_ms":9642.430475,
>        "p99_ms":9993.694346,
>        "p999_ms":12207.982291}}}}
> 
> http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes <http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes>
> {
>  "responseHeader":{
>    "status":0,
>    "QTime":3},
>  "metrics":{
>    "solr.core.BTS.shard4.replica_n22":{
>      "QUERY./select.requestTimes":{
>        "count":873,
>        "meanRate":0.43420303985137254,
>        "1minRate":0.4284437786865815,
>        "5minRate":0.44020640429418745,
>        "15minRate":0.40860871277629196,
>        "min_ms":0.136658,
>        "max_ms":11345.407699,
>        "mean_ms":511.28573906464504,
>        "median_ms":9.063677,
>        "stddev_ms":2038.8104673512248,
>        "p75_ms":20.270605,
>        "p95_ms":8418.131442,
>        "p99_ms":8904.78616,
>        "p999_ms":10447.78365}}}}
> 
> http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes <http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes>
> {
>  "responseHeader":{
>    "status":0,
>    "QTime":4},
>  "metrics":{
>    "solr.core.BTS.shard5.replica_n28":{
>      "QUERY./select.requestTimes":{
>        "count":863,
>        "meanRate":0.4419375762840668,
>        "1minRate":0.44487242228317025,
>        "5minRate":0.45927613542085916,
>        "15minRate":0.41056066296443494,
>        "min_ms":0.158855,
>        "max_ms":16669.411989,
>        "mean_ms":6513.057114006753,
>        "median_ms":8033.386692,
>        "stddev_ms":3002.7487311308896,
>        "p75_ms":8446.147616,
>        "p95_ms":9888.641316,
>        "p99_ms":13624.11926,
>        "p999_ms":13624.11926}}}}
> 
> http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes <http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes>
> {
>  "responseHeader":{
>    "status":0,
>    "QTime":2},
>  "metrics":{
>    "solr.core.BTS.shard6.replica_n30":{
>      "QUERY./select.requestTimes":{
>        "count":893,
>        "meanRate":0.43301141185981046,
>        "1minRate":0.4011485529441132,
>        "5minRate":0.447654905093643,
>        "15minRate":0.41489193746842407,
>        "min_ms":0.161571,
>        "max_ms":14716.828978,
>        "mean_ms":2932.212133523417,
>        "median_ms":1289.686481,
>        "stddev_ms":3426.22045100954,
>        "p75_ms":6230.031884,
>        "p95_ms":8109.408506,
>        "p99_ms":12904.515311,
>        "p999_ms":12904.515311}}}}
> 
> 
> 
> http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes <http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes>
> {
>  "responseHeader":{
>    "status":0,
>    "QTime":16},
>  "metrics":{
>    "solr.core.BTS.shard7.replica_n36":{
>      "QUERY./select.requestTimes":{
>        "count":962,
>        "meanRate":0.46572438680661055,
>        "1minRate":0.4974893681625287,
>        "5minRate":0.49072296556429784,
>        "15minRate":0.44138205926188756,
>        "min_ms":0.164803,
>        "max_ms":12481.82656,
>        "mean_ms":2606.899631183513,
>        "median_ms":1457.505387,
>        "stddev_ms":3083.297183477969,
>        "p75_ms":4072.543679,
>        "p95_ms":8562.456178,
>        "p99_ms":9351.230895,
>        "p999_ms":10430.483813}}}}
> 
> http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes <http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes>
> {
>  "responseHeader":{
>    "status":0,
>    "QTime":3},
>  "metrics":{
>    "solr.core.BTS.shard8.replica_n44":{
>      "QUERY./select.requestTimes":{
>        "count":904,
>        "meanRate":0.4356001115451976,
>        "1minRate":0.42906831311171356,
>        "5minRate":0.4651312663377039,
>        "15minRate":0.41812847342709225,
>        "min_ms":0.089738,
>        "max_ms":10857.092832,
>        "mean_ms":304.52127270799156,
>        "median_ms":7.098736,
>        "stddev_ms":1544.5378594679773,
>        "p75_ms":15.599817,
>        "p95_ms":93.818662,
>        "p99_ms":8510.757117,
>        "p999_ms":9353.844994}}}}
> 
> I restart all of the instances on “34” so that there are no leaders on it. The load somewhat goes to the other box.
> 
> 10.156.112.50	load average: 0.00, 0.16, 0.47
> 10.156.116.34	load average: 17.00, 16.16, 17.07
> 10.156.122.13	load average: 17.86, 17.49, 14.74
> 
> Box “50” is still doing nothing AND it is the leader of 4 of the 8 shards.
> Box “13” is the leader of the remaining 4 shards.
> Box “34” is not the leader of any shard.
> 
> I will continue to test, who knows, it may be something I am doing. Maybe not enough RAM, etc…, so I am definitely leaving this open to the possibility that I am not well configured for 8.5.
> 
> Regards
> 
> 
> 
> 
>> On May 16, 2020, at 5:08 PM, Tomás Fernández Löbbe <to...@gmail.com> wrote:
>> 
>> I just backported Michael’s fix to be released in 8.5.2
>> 
>> On Fri, May 15, 2020 at 6:38 AM Michael Gibney <mi...@michaelgibney.net>
>> wrote:
>> 
>>> Hi Wei,
>>> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
>>> Thanks for reporting the problem!
>>> Michael
>>> 
>>> On Mon, May 11, 2020 at 7:51 PM Wei <we...@gmail.com> wrote:
>>>> 
>>>> Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other
>>> type
>>>> of replicas, and each Tlog replica is an individual solr instance on its
>>>> own physical machine.  In the jira you mentioned 'when "last place
>>> matches"
>>>> == "first place matches" – e.g. when shards.preference specified matches
>>>> *all* available replicas'.   My setting is
>>>> shards.preference=replica.location:local,replica.type:TLOG,
>>>> I also tried just shards.preference=replica.location:local and it still
>>> has
>>>> the issue. Can you explain a bit more?
>>>> 
>>>> On Mon, May 11, 2020 at 12:26 PM Michael Gibney <
>>> michael@michaelgibney.net>
>>>> wrote:
>>>> 
>>>>> FYI: https://issues.apache.org/jira/browse/SOLR-14471
>>>>> Wei, assuming you have only TLOG replicas, your "last place" matches
>>>>> (to which the random fallback ordering would not be applied -- see
>>>>> above issue) would be the same as the "first place" matches selected
>>>>> for executing distributed requests.
>>>>> 
>>>>> 
>>>>> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
>>>>> <mi...@michaelgibney.net> wrote:
>>>>>> 
>>>>>> Wei, probably no need to answer my earlier questions; I think I see
>>>>>> the problem here, and believe it is indeed a bug, introduced in 8.3.
>>>>>> Will file an issue and submit a patch shortly.
>>>>>> Michael
>>>>>> 
>>>>>> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
>>>>>> <mi...@michaelgibney.net> wrote:
>>>>>>> 
>>>>>>> Hi Wei,
>>>>>>> 
>>>>>>> In considering this problem, I'm stumbling a bit on terminology
>>>>>>> (particularly, where you mention "nodes", I think you're referring
>>> to
>>>>>>> "replicas"?). Could you confirm that you have 10 TLOG replicas per
>>>>>>> shard, for each of 6 shards? How many *nodes* (i.e., running solr
>>>>>>> server instances) do you have, and what is the replica placement
>>> like
>>>>>>> across those nodes? What, if any, non-TLOG replicas do you have per
>>>>>>> shard (not that it's necessarily relevant, but just to get a
>>> complete
>>>>>>> picture of the situation)?
>>>>>>> 
>>>>>>> If you're able without too much trouble, can you determine what the
>>>>>>> behavior is like on Solr 8.3? (there were different changes
>>> introduced
>>>>>>> to potentially relevant code in 8.3 and 8.4, and knowing whether
>>> the
>>>>>>> behavior you're observing manifests on 8.3 would help narrow down
>>>>>>> where to look for an explanation).
>>>>>>> 
>>>>>>> Michael
>>>>>>> 
>>>>>>> On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Update:  after I remove the shards.preference parameter from
>>>>>>>> solrconfig.xml,  issue is gone and internal shard requests are
>>> now
>>>>>>>> balanced. The same parameter works fine with solr 7.6.  Still not
>>>>> sure of
>>>>>>>> the root cause, but I observed a strange coincidence: the nodes
>>> that
>>>>> are
>>>>>>>> most frequently picked for shard requests are the first node in
>>> each
>>>>> shard
>>>>>>>> returned from the CLUSTERSTATUS api.  Seems something wrong with
>>>>> shuffling
>>>>>>>> equally compared nodes when shards.preference is set.  Will
>>> report
>>>>> back if
>>>>>>>> I find more.
>>>>>>>> 
>>>>>>>> On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Eric,
>>>>>>>>> 
>>>>>>>>> I am measuring the number of shard requests, and it's for query
>>>>> only, no
>>>>>>>>> indexing requests.  I have an external load balancer and see
>>> each
>>>>> node
>>>>>>>>> received about the equal number of external queries. However
>>> for
>>>>> the
>>>>>>>>> internal shard queries,  the distribution is uneven:    6 nodes
>>>>> (one in
>>>>>>>>> each shard,  some of them are leaders and some are non-leaders
>>> )
>>>>> gets about
>>>>>>>>> 80% of the shard requests, the other 54 nodes gets about 20% of
>>>>> the shard
>>>>>>>>> requests.   I checked a few other parameters set:
>>>>>>>>> 
>>>>>>>>> -Dsolr.disable.shardsWhitelist=true
>>>>>>>>> shards.preference=replica.location:local,replica.type:TLOG
>>>>>>>>> 
>>>>>>>>> Nothing seems to cause the strange behavior.  Any suggestions
>>> how
>>>>> to
>>>>>>>>> debug this?
>>>>>>>>> 
>>>>>>>>> -Wei
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
>>>>> erickerickson@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Wei:
>>>>>>>>>> 
>>>>>>>>>> How are you measuring utilization here? The number of incoming
>>>>> requests
>>>>>>>>>> or CPU?
>>>>>>>>>> 
>>>>>>>>>> The leader for each shard are certainly handling all of the
>>>>> indexing
>>>>>>>>>> requests since they’re TLOG replicas, so that’s one thing that
>>>>> might
>>>>>>>>>> skewing your measurements.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Erick
>>>>>>>>>> 
>>>>>>>>>>> On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com>
>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>> 
>>>>>>>>>>> I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
>>>>> cloud has 6
>>>>>>>>>>> shards with 10 TLOG replicas each shard.  After upgrade I
>>>>> noticed that
>>>>>>>>>> one
>>>>>>>>>>> of the replicas in each shard is handling most of the
>>>>> distributed shard
>>>>>>>>>>> requests, so 6 nodes are heavily loaded while other nodes
>>> are
>>>>> idle.
>>>>>>>>>> There
>>>>>>>>>>> is no change in shard handler configuration:
>>>>>>>>>>> 
>>>>>>>>>>> <shardHandlerFactory name="shardHandlerFactory" class=
>>>>>>>>>>> "HttpShardHandlerFactory">
>>>>>>>>>>> 
>>>>>>>>>>>  <int name="socketTimeout">30000</int>
>>>>>>>>>>> 
>>>>>>>>>>>  <int name="connTimeout">30000</int>
>>>>>>>>>>> 
>>>>>>>>>>>  <int name="maxConnectionsPerHost">500</int>
>>>>>>>>>>> 
>>>>>>>>>>> </shardHandlerFactory>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> What could cause the unbalanced internal distributed
>>> request?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Wei
>>>>>>>>>> 
>>>>>>>>>> 
>>>>> 
>>> 
> 


Re: Unbalanced shard requests

Posted by Phill Campbell <Si...@yahoo.com.INVALID>.
I have been testing 8.5.2, and it looks like the load has moved, but it is still concentrated on one machine.

Setup:
3 physical machines.
Each machine hosts 8 instances of Solr.
Each instance of Solr hosts one replica.

Another way to say it:
Number of shards = 8. Replication factor = 3.
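
For reference, a layout like this can be created with a Collections API call along these lines (host and port are placeholders, not the actual ones from this cluster):

http://localhost:8983/solr/admin/collections?action=CREATE&name=TEST_COLLECTION&numShards=8&replicationFactor=3&maxShardsPerNode=1

maxShardsPerNode=1 is what keeps each Solr instance down to a single replica, matching the setup above.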

Here is the cluster state. You can see that the leaders are well distributed. 

{"TEST_COLLECTION":{
    "pullReplicas":"0",
    "replicationFactor":"3",
    "shards":{
      "shard1":{
        "range":"80000000-9fffffff",
        "state":"active",
        "replicas":{
          "core_node3":{
            "core":"TEST_COLLECTION_shard1_replica_n1",
            "base_url":"http://10.156.122.13:10007/solr",
            "node_name":"10.156.122.13:10007_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node5":{
            "core":"TEST_COLLECTION_shard1_replica_n2",
            "base_url":"http://10.156.112.50:10002/solr",
            "node_name":"10.156.112.50:10002_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false",
            "leader":"true"},
          "core_node7":{
            "core":"TEST_COLLECTION_shard1_replica_n4",
            "base_url":"http://10.156.112.50:10006/solr",
            "node_name":"10.156.112.50:10006_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"}}},
      "shard2":{
        "range":"a0000000-bfffffff",
        "state":"active",
        "replicas":{
          "core_node9":{
            "core":"TEST_COLLECTION_shard2_replica_n6",
            "base_url":"http://10.156.112.50:10003/solr",
            "node_name":"10.156.112.50:10003_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node11":{
            "core":"TEST_COLLECTION_shard2_replica_n8",
            "base_url":"http://10.156.122.13:10004/solr",
            "node_name":"10.156.122.13:10004_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false",
            "leader":"true"},
          "core_node12":{
            "core":"TEST_COLLECTION_shard2_replica_n10",
            "base_url":"http://10.156.116.34:10008/solr",
            "node_name":"10.156.116.34:10008_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"}}},
      "shard3":{
        "range":"c0000000-dfffffff",
        "state":"active",
        "replicas":{
          "core_node15":{
            "core":"TEST_COLLECTION_shard3_replica_n13",
            "base_url":"http://10.156.122.13:10008/solr",
            "node_name":"10.156.122.13:10008_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node17":{
            "core":"TEST_COLLECTION_shard3_replica_n14",
            "base_url":"http://10.156.116.34:10005/solr",
            "node_name":"10.156.116.34:10005_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node19":{
            "core":"TEST_COLLECTION_shard3_replica_n16",
            "base_url":"http://10.156.116.34:10002/solr",
            "node_name":"10.156.116.34:10002_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false",
            "leader":"true"}}},
      "shard4":{
        "range":"e0000000-ffffffff",
        "state":"active",
        "replicas":{
          "core_node20":{
            "core":"TEST_COLLECTION_shard4_replica_n18",
            "base_url":"http://10.156.122.13:10001/solr",
            "node_name":"10.156.122.13:10001_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node23":{
            "core":"TEST_COLLECTION_shard4_replica_n21",
            "base_url":"http://10.156.116.34:10004/solr",
            "node_name":"10.156.116.34:10004_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node25":{
            "core":"TEST_COLLECTION_shard4_replica_n22",
            "base_url":"http://10.156.112.50:10001/solr",
            "node_name":"10.156.112.50:10001_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false",
            "leader":"true"}}},
      "shard5":{
        "range":"0-1fffffff",
        "state":"active",
        "replicas":{
          "core_node27":{
            "core":"TEST_COLLECTION_shard5_replica_n24",
            "base_url":"http://10.156.116.34:10007/solr",
            "node_name":"10.156.116.34:10007_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node29":{
            "core":"TEST_COLLECTION_shard5_replica_n26",
            "base_url":"http://10.156.122.13:10006/solr",
            "node_name":"10.156.122.13:10006_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node31":{
            "core":"TEST_COLLECTION_shard5_replica_n28",
            "base_url":"http://10.156.116.34:10006/solr",
            "node_name":"10.156.116.34:10006_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false",
            "leader":"true"}}},
      "shard6":{
        "range":"20000000-3fffffff",
        "state":"active",
        "replicas":{
          "core_node33":{
            "core":"TEST_COLLECTION_shard6_replica_n30",
            "base_url":"http://10.156.122.13:10002/solr",
            "node_name":"10.156.122.13:10002_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false",
            "leader":"true"},
          "core_node35":{
            "core":"TEST_COLLECTION_shard6_replica_n32",
            "base_url":"http://10.156.112.50:10008/solr",
            "node_name":"10.156.112.50:10008_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node37":{
            "core":"TEST_COLLECTION_shard6_replica_n34",
            "base_url":"http://10.156.116.34:10003/solr",
            "node_name":"10.156.116.34:10003_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"}}},
      "shard7":{
        "range":"40000000-5fffffff",
        "state":"active",
        "replicas":{
          "core_node39":{
            "core":"TEST_COLLECTION_shard7_replica_n36",
            "base_url":"http://10.156.122.13:10003/solr",
            "node_name":"10.156.122.13:10003_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false",
            "leader":"true"},
          "core_node41":{
            "core":"TEST_COLLECTION_shard7_replica_n38",
            "base_url":"http://10.156.122.13:10005/solr",
            "node_name":"10.156.122.13:10005_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node43":{
            "core":"TEST_COLLECTION_shard7_replica_n40",
            "base_url":"http://10.156.112.50:10004/solr",
            "node_name":"10.156.112.50:10004_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"}}},
      "shard8":{
        "range":"60000000-7fffffff",
        "state":"active",
        "replicas":{
          "core_node45":{
            "core":"TEST_COLLECTION_shard8_replica_n42",
            "base_url":"http://10.156.112.50:10007/solr",
            "node_name":"10.156.112.50:10007_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"},
          "core_node47":{
            "core":"TEST_COLLECTION_shard8_replica_n44",
            "base_url":"http://10.156.112.50:10005/solr",
            "node_name":"10.156.112.50:10005_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false",
            "leader":"true"},
          "core_node48":{
            "core":"TEST_COLLECTION_shard8_replica_n46",
            "base_url":"http://10.156.116.34:10001/solr",
            "node_name":"10.156.116.34:10001_solr",
            "state":"active",
            "type":"NRT",
            "force_set_state":"false"}}}},
    "router":{"name":"compositeId"},
    "maxShardsPerNode":"1",
    "autoAddReplicas":"false",
    "nrtReplicas":"3",
    "tlogReplicas":"0”}}


Running TOP on each machine while load tests have been running for 60 minutes.

10.156.112.50	load average: 0.08, 0.35, 1.65
10.156.116.34	load average: 24.71, 24.20, 20.65
10.156.122.13	load average: 5.37, 3.21, 4.04



Here are the stats from each shard leader.

http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "metrics":{
    "solr.core.BTS.shard1.replica_n2":{
      "QUERY./select.requestTimes":{
        "count":805,
        "meanRate":0.4385455794526838,
        "1minRate":0.5110237122383522,
        "5minRate":0.4671091682458005,
        "15minRate":0.4057871940723353,
        "min_ms":0.14047,
        "max_ms":12424.589645,
        "mean_ms":796.2194458711818,
        "median_ms":10.534906,
        "stddev_ms":2567.655224710497,
        "p75_ms":22.893306,
        "p95_ms":8316.33323,
        "p99_ms":12424.589645,
        "p999_ms":12424.589645}}}}

http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "metrics":{
    "solr.core.BTS.shard2.replica_n8":{
      "QUERY./select.requestTimes":{
        "count":791,
        "meanRate":0.4244162938316224,
        "1minRate":0.4869749626003825,
        "5minRate":0.45856412657687656,
        "15minRate":0.3948063845907493,
        "min_ms":0.168369,
        "max_ms":11022.763933,
        "mean_ms":2572.0670957974603,
        "median_ms":1490.222885,
        "stddev_ms":2718.1710938804276,
        "p75_ms":4292.490478,
        "p95_ms":8487.18506,
        "p99_ms":8855.936617,
        "p999_ms":9589.218502}}}}

http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":83},
  "metrics":{
    "solr.core.BTS.shard3.replica_n16":{
      "QUERY./select.requestTimes":{
        "count":840,
        "meanRate":0.4335334453288775,
        "1minRate":0.5733683837779382,
        "5minRate":0.4931753679028527,
        "15minRate":0.42241330274699623,
        "min_ms":0.155939,
        "max_ms":18125.516406,
        "mean_ms":7097.942850416767,
        "median_ms":8136.862825,
        "stddev_ms":2382.041897221542,
        "p75_ms":8497.844088,
        "p95_ms":9642.430475,
        "p99_ms":9993.694346,
        "p999_ms":12207.982291}}}}

http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "metrics":{
    "solr.core.BTS.shard4.replica_n22":{
      "QUERY./select.requestTimes":{
        "count":873,
        "meanRate":0.43420303985137254,
        "1minRate":0.4284437786865815,
        "5minRate":0.44020640429418745,
        "15minRate":0.40860871277629196,
        "min_ms":0.136658,
        "max_ms":11345.407699,
        "mean_ms":511.28573906464504,
        "median_ms":9.063677,
        "stddev_ms":2038.8104673512248,
        "p75_ms":20.270605,
        "p95_ms":8418.131442,
        "p99_ms":8904.78616,
        "p999_ms":10447.78365}}}}

http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":4},
  "metrics":{
    "solr.core.BTS.shard5.replica_n28":{
      "QUERY./select.requestTimes":{
        "count":863,
        "meanRate":0.4419375762840668,
        "1minRate":0.44487242228317025,
        "5minRate":0.45927613542085916,
        "15minRate":0.41056066296443494,
        "min_ms":0.158855,
        "max_ms":16669.411989,
        "mean_ms":6513.057114006753,
        "median_ms":8033.386692,
        "stddev_ms":3002.7487311308896,
        "p75_ms":8446.147616,
        "p95_ms":9888.641316,
        "p99_ms":13624.11926,
        "p999_ms":13624.11926}}}}

http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "metrics":{
    "solr.core.BTS.shard6.replica_n30":{
      "QUERY./select.requestTimes":{
        "count":893,
        "meanRate":0.43301141185981046,
        "1minRate":0.4011485529441132,
        "5minRate":0.447654905093643,
        "15minRate":0.41489193746842407,
        "min_ms":0.161571,
        "max_ms":14716.828978,
        "mean_ms":2932.212133523417,
        "median_ms":1289.686481,
        "stddev_ms":3426.22045100954,
        "p75_ms":6230.031884,
        "p95_ms":8109.408506,
        "p99_ms":12904.515311,
        "p999_ms":12904.515311}}}}



http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":16},
  "metrics":{
    "solr.core.BTS.shard7.replica_n36":{
      "QUERY./select.requestTimes":{
        "count":962,
        "meanRate":0.46572438680661055,
        "1minRate":0.4974893681625287,
        "5minRate":0.49072296556429784,
        "15minRate":0.44138205926188756,
        "min_ms":0.164803,
        "max_ms":12481.82656,
        "mean_ms":2606.899631183513,
        "median_ms":1457.505387,
        "stddev_ms":3083.297183477969,
        "p75_ms":4072.543679,
        "p95_ms":8562.456178,
        "p99_ms":9351.230895,
        "p999_ms":10430.483813}}}}

http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "metrics":{
    "solr.core.BTS.shard8.replica_n44":{
      "QUERY./select.requestTimes":{
        "count":904,
        "meanRate":0.4356001115451976,
        "1minRate":0.42906831311171356,
        "5minRate":0.4651312663377039,
        "15minRate":0.41812847342709225,
        "min_ms":0.089738,
        "max_ms":10857.092832,
        "mean_ms":304.52127270799156,
        "median_ms":7.098736,
        "stddev_ms":1544.5378594679773,
        "p75_ms":15.599817,
        "p95_ms":93.818662,
        "p99_ms":8510.757117,
        "p999_ms":9353.844994}}}}

I restarted all of the instances on “34” so that there are no leaders on it. Some of the load shifts to the other box.

10.156.112.50	load average: 0.00, 0.16, 0.47
10.156.116.34	load average: 17.00, 16.16, 17.07
10.156.122.13	load average: 17.86, 17.49, 14.74

Box “50” is still doing nothing AND it is the leader of 4 of the 8 shards.
Box “13” is the leader of the remaining 4 shards.
Box “34” is not the leader of any shard.

I will continue to test; who knows, it may be something I am doing, maybe not enough RAM, etc., so I am definitely leaving open the possibility that I am not well configured for 8.5.

Regards




> On May 16, 2020, at 5:08 PM, Tomás Fernández Löbbe <to...@gmail.com> wrote:
> 
> I just backported Michael’s fix to be released in 8.5.2
> 
> On Fri, May 15, 2020 at 6:38 AM Michael Gibney <mi...@michaelgibney.net>
> wrote:
> 
>> Hi Wei,
>> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
>> Thanks for reporting the problem!
>> Michael
>> 
>> On Mon, May 11, 2020 at 7:51 PM Wei <we...@gmail.com> wrote:
>>> 
>>> Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other
>> type
>>> of replicas, and each Tlog replica is an individual solr instance on its
>>> own physical machine.  In the jira you mentioned 'when "last place
>> matches"
>>> == "first place matches" – e.g. when shards.preference specified matches
>>> *all* available replicas'.   My setting is
>>> shards.preference=replica.location:local,replica.type:TLOG,
>>> I also tried just shards.preference=replica.location:local and it still
>> has
>>> the issue. Can you explain a bit more?
>>> 
>>> On Mon, May 11, 2020 at 12:26 PM Michael Gibney <
>> michael@michaelgibney.net>
>>> wrote:
>>> 
>>>> FYI: https://issues.apache.org/jira/browse/SOLR-14471
>>>> Wei, assuming you have only TLOG replicas, your "last place" matches
>>>> (to which the random fallback ordering would not be applied -- see
>>>> above issue) would be the same as the "first place" matches selected
>>>> for executing distributed requests.
>>>> 
>>>> 
>>>> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
>>>> <mi...@michaelgibney.net> wrote:
>>>>> 
>>>>> Wei, probably no need to answer my earlier questions; I think I see
>>>>> the problem here, and believe it is indeed a bug, introduced in 8.3.
>>>>> Will file an issue and submit a patch shortly.
>>>>> Michael
>>>>> 
>>>>> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
>>>>> <mi...@michaelgibney.net> wrote:
>>>>>> 
>>>>>> Hi Wei,
>>>>>> 
>>>>>> In considering this problem, I'm stumbling a bit on terminology
>>>>>> (particularly, where you mention "nodes", I think you're referring
>> to
>>>>>> "replicas"?). Could you confirm that you have 10 TLOG replicas per
>>>>>> shard, for each of 6 shards? How many *nodes* (i.e., running solr
>>>>>> server instances) do you have, and what is the replica placement
>> like
>>>>>> across those nodes? What, if any, non-TLOG replicas do you have per
>>>>>> shard (not that it's necessarily relevant, but just to get a
>> complete
>>>>>> picture of the situation)?
>>>>>> 
>>>>>> If you're able without too much trouble, can you determine what the
>>>>>> behavior is like on Solr 8.3? (there were different changes
>> introduced
>>>>>> to potentially relevant code in 8.3 and 8.4, and knowing whether
>> the
>>>>>> behavior you're observing manifests on 8.3 would help narrow down
>>>>>> where to look for an explanation).
>>>>>> 
>>>>>> Michael
>>>>>> 
>>>>>> On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Update:  after I remove the shards.preference parameter from
>>>>>>> solrconfig.xml,  issue is gone and internal shard requests are
>> now
>>>>>>> balanced. The same parameter works fine with solr 7.6.  Still not
>>>> sure of
>>>>>>> the root cause, but I observed a strange coincidence: the nodes
>> that
>>>> are
>>>>>>> most frequently picked for shard requests are the first node in
>> each
>>>> shard
>>>>>>> returned from the CLUSTERSTATUS api.  Seems something wrong with
>>>> shuffling
>>>>>>> equally compared nodes when shards.preference is set.  Will
>> report
>>>> back if
>>>>>>> I find more.
>>>>>>> 
>>>>>>> On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi Eric,
>>>>>>>> 
>>>>>>>> I am measuring the number of shard requests, and it's for query
>>>> only, no
>>>>>>>> indexing requests.  I have an external load balancer and see
>> each
>>>> node
>>>>>>>> received about the equal number of external queries. However
>> for
>>>> the
>>>>>>>> internal shard queries,  the distribution is uneven:    6 nodes
>>>> (one in
>>>>>>>> each shard,  some of them are leaders and some are non-leaders
>> )
>>>> gets about
>>>>>>>> 80% of the shard requests, the other 54 nodes gets about 20% of
>>>> the shard
>>>>>>>> requests.   I checked a few other parameters set:
>>>>>>>> 
>>>>>>>> -Dsolr.disable.shardsWhitelist=true
>>>>>>>> shards.preference=replica.location:local,replica.type:TLOG
>>>>>>>> 
>>>>>>>> Nothing seems to cause the strange behavior.  Any suggestions
>> how
>>>> to
>>>>>>>> debug this?
>>>>>>>> 
>>>>>>>> -Wei
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
>>>> erickerickson@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Wei:
>>>>>>>>> 
>>>>>>>>> How are you measuring utilization here? The number of incoming
>>>> requests
>>>>>>>>> or CPU?
>>>>>>>>> 
>>>>>>>>> The leader for each shard are certainly handling all of the
>>>> indexing
>>>>>>>>> requests since they’re TLOG replicas, so that’s one thing that
>>>> might
>>>>>>>>> skewing your measurements.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Erick
>>>>>>>>> 
>>>>>>>>>> On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com>
>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi everyone,
>>>>>>>>>> 
>>>>>>>>>> I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
>>>> cloud has 6
>>>>>>>>>> shards with 10 TLOG replicas each shard.  After upgrade I
>>>> noticed that
>>>>>>>>> one
>>>>>>>>>> of the replicas in each shard is handling most of the
>>>> distributed shard
>>>>>>>>>> requests, so 6 nodes are heavily loaded while other nodes
>> are
>>>> idle.
>>>>>>>>> There
>>>>>>>>>> is no change in shard handler configuration:
>>>>>>>>>> 
>>>>>>>>>> <shardHandlerFactory name="shardHandlerFactory" class=
>>>>>>>>>> "HttpShardHandlerFactory">
>>>>>>>>>> 
>>>>>>>>>>   <int name="socketTimeout">30000</int>
>>>>>>>>>> 
>>>>>>>>>>   <int name="connTimeout">30000</int>
>>>>>>>>>> 
>>>>>>>>>>   <int name="maxConnectionsPerHost">500</int>
>>>>>>>>>> 
>>>>>>>>>> </shardHandlerFactory>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> What could cause the unbalanced internal distributed
>> request?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks in advance.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Wei
>>>>>>>>> 
>>>>>>>>> 
>>>> 
>> 


Re: Unbalanced shard requests

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
I just backported Michael’s fix to be released in 8.5.2

On Fri, May 15, 2020 at 6:38 AM Michael Gibney <mi...@michaelgibney.net>
wrote:

> Hi Wei,
> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
> Thanks for reporting the problem!
> Michael
>
> On Mon, May 11, 2020 at 7:51 PM Wei <we...@gmail.com> wrote:
> >
> > Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other
> type
> > of replicas, and each Tlog replica is an individual solr instance on its
> > own physical machine.  In the jira you mentioned 'when "last place
> matches"
> > == "first place matches" – e.g. when shards.preference specified matches
> > *all* available replicas'.   My setting is
> > shards.preference=replica.location:local,replica.type:TLOG,
> > I also tried just shards.preference=replica.location:local and it still
> has
> > the issue. Can you explain a bit more?
> >
> > On Mon, May 11, 2020 at 12:26 PM Michael Gibney <
> michael@michaelgibney.net>
> > wrote:
> >
> > > FYI: https://issues.apache.org/jira/browse/SOLR-14471
> > > Wei, assuming you have only TLOG replicas, your "last place" matches
> > > (to which the random fallback ordering would not be applied -- see
> > > above issue) would be the same as the "first place" matches selected
> > > for executing distributed requests.
> > >
> > >
> > > On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> > > <mi...@michaelgibney.net> wrote:
> > > >
> > > > Wei, probably no need to answer my earlier questions; I think I see
> > > > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > > > Will file an issue and submit a patch shortly.
> > > > Michael
> > > >
> > > > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> > > > <mi...@michaelgibney.net> wrote:
> > > > >
> > > > > Hi Wei,
> > > > >
> > > > > In considering this problem, I'm stumbling a bit on terminology
> > > > > (particularly, where you mention "nodes", I think you're referring
> to
> > > > > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > > > > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > > > > server instances) do you have, and what is the replica placement
> like
> > > > > across those nodes? What, if any, non-TLOG replicas do you have per
> > > > > shard (not that it's necessarily relevant, but just to get a
> complete
> > > > > picture of the situation)?
> > > > >
> > > > > If you're able without too much trouble, can you determine what the
> > > > > behavior is like on Solr 8.3? (there were different changes
> introduced
> > > > > to potentially relevant code in 8.3 and 8.4, and knowing whether
> the
> > > > > behavior you're observing manifests on 8.3 would help narrow down
> > > > > where to look for an explanation).
> > > > >
> > > > > Michael
> > > > >
> > > > > On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
> > > > > >
> > > > > > Update:  after I remove the shards.preference parameter from
> > > > > > solrconfig.xml,  issue is gone and internal shard requests are
> now
> > > > > > balanced. The same parameter works fine with solr 7.6.  Still not
> > > sure of
> > > > > > the root cause, but I observed a strange coincidence: the nodes
> that
> > > are
> > > > > > most frequently picked for shard requests are the first node in
> each
> > > shard
> > > > > > returned from the CLUSTERSTATUS api.  Seems something wrong with
> > > shuffling
> > > > > > equally compared nodes when shards.preference is set.  Will
> report
> > > back if
> > > > > > I find more.
> > > > > >
> > > > > > On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Eric,
> > > > > > >
> > > > > > > I am measuring the number of shard requests, and it's for query
> > > only, no
> > > > > > > indexing requests.  I have an external load balancer and see
> each
> > > node
> > > > > > > received about the equal number of external queries. However
> for
> > > the
> > > > > > > internal shard queries,  the distribution is uneven:    6 nodes
> > > (one in
> > > > > > > each shard,  some of them are leaders and some are non-leaders
> )
> > > gets about
> > > > > > > 80% of the shard requests, the other 54 nodes gets about 20% of
> > > the shard
> > > > > > > requests.   I checked a few other parameters set:
> > > > > > >
> > > > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > > > >
> > > > > > > Nothing seems to cause the strange behavior.  Any suggestions
> how
> > > to
> > > > > > > debug this?
> > > > > > >
> > > > > > > -Wei
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
> > > erickerickson@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Wei:
> > > > > > >>
> > > > > > >> How are you measuring utilization here? The number of incoming
> > > requests
> > > > > > >> or CPU?
> > > > > > >>
> > > > > > >> The leader for each shard are certainly handling all of the
> > > indexing
> > > > > > >> requests since they’re TLOG replicas, so that’s one thing that
> > > might
> > > > > > >> skewing your measurements.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Erick
> > > > > > >>
> > > > > > >> > On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com>
> wrote:
> > > > > > >> >
> > > > > > >> > Hi everyone,
> > > > > > >> >
> > > > > > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
> > > cloud has 6
> > > > > > >> > shards with 10 TLOG replicas each shard.  After upgrade I
> > > noticed that
> > > > > > >> one
> > > > > > >> > of the replicas in each shard is handling most of the
> > > distributed shard
> > > > > > >> > requests, so 6 nodes are heavily loaded while other nodes
> are
> > > idle.
> > > > > > >> There
> > > > > > >> > is no change in shard handler configuration:
> > > > > > >> >
> > > > > > >> > <shardHandlerFactory name="shardHandlerFactory" class=
> > > > > > >> > "HttpShardHandlerFactory">
> > > > > > >> >
> > > > > > >> >    <int name="socketTimeout">30000</int>
> > > > > > >> >
> > > > > > >> >    <int name="connTimeout">30000</int>
> > > > > > >> >
> > > > > > >> >    <int name="maxConnectionsPerHost">500</int>
> > > > > > >> >
> > > > > > >> > </shardHandlerFactory>
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > What could cause the unbalanced internal distributed
> request?
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Thanks in advance.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Wei
> > > > > > >>
> > > > > > >>
> > >
>

Re: Unbalanced shard requests

Posted by Michael Gibney <mi...@michaelgibney.net>.
Hi Wei,
SOLR-14471 has been merged, so this issue should be fixed in 8.6.
Thanks for reporting the problem!
Michael

On Mon, May 11, 2020 at 7:51 PM Wei <we...@gmail.com> wrote:
>
> Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other type
> of replicas, and each Tlog replica is an individual solr instance on its
> own physical machine.  In the jira you mentioned 'when "last place matches"
> == "first place matches" – e.g. when shards.preference specified matches
> *all* available replicas'.   My setting is
> shards.preference=replica.location:local,replica.type:TLOG,
> I also tried just shards.preference=replica.location:local and it still has
> the issue. Can you explain a bit more?
>
> On Mon, May 11, 2020 at 12:26 PM Michael Gibney <mi...@michaelgibney.net>
> wrote:
>
> > FYI: https://issues.apache.org/jira/browse/SOLR-14471
> > Wei, assuming you have only TLOG replicas, your "last place" matches
> > (to which the random fallback ordering would not be applied -- see
> > above issue) would be the same as the "first place" matches selected
> > for executing distributed requests.
> >
> >
> > On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> > <mi...@michaelgibney.net> wrote:
> > >
> > > Wei, probably no need to answer my earlier questions; I think I see
> > > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > > Will file an issue and submit a patch shortly.
> > > Michael
> > >
> > > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> > > <mi...@michaelgibney.net> wrote:
> > > >
> > > > Hi Wei,
> > > >
> > > > In considering this problem, I'm stumbling a bit on terminology
> > > > (particularly, where you mention "nodes", I think you're referring to
> > > > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > > > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > > > server instances) do you have, and what is the replica placement like
> > > > across those nodes? What, if any, non-TLOG replicas do you have per
> > > > shard (not that it's necessarily relevant, but just to get a complete
> > > > picture of the situation)?
> > > >
> > > > If you're able without too much trouble, can you determine what the
> > > > behavior is like on Solr 8.3? (there were different changes introduced
> > > > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > > > behavior you're observing manifests on 8.3 would help narrow down
> > > > where to look for an explanation).
> > > >
> > > > Michael
> > > >
> > > > On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
> > > > >
> > > > > Update:  after I remove the shards.preference parameter from
> > > > > solrconfig.xml,  issue is gone and internal shard requests are now
> > > > > balanced. The same parameter works fine with solr 7.6.  Still not
> > sure of
> > > > > the root cause, but I observed a strange coincidence: the nodes that
> > are
> > > > > most frequently picked for shard requests are the first node in each
> > shard
> > > > > returned from the CLUSTERSTATUS api.  Seems something wrong with
> > shuffling
> > > > > equally compared nodes when shards.preference is set.  Will report
> > back if
> > > > > I find more.
> > > > >
> > > > > On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
> > > > >
> > > > > > Hi Eric,
> > > > > >
> > > > > > I am measuring the number of shard requests, and it's for query
> > only, no
> > > > > > indexing requests.  I have an external load balancer and see each
> > node
> > > > > > received about the equal number of external queries. However for
> > the
> > > > > > internal shard queries,  the distribution is uneven:    6 nodes
> > (one in
> > > > > > each shard,  some of them are leaders and some are non-leaders )
> > gets about
> > > > > > 80% of the shard requests, the other 54 nodes gets about 20% of
> > the shard
> > > > > > requests.   I checked a few other parameters set:
> > > > > >
> > > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > > >
> > > > > > Nothing seems to cause the strange behavior.  Any suggestions how
> > to
> > > > > > debug this?
> > > > > >
> > > > > > -Wei
> > > > > >
> > > > > >
> > > > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
> > erickerickson@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> Wei:
> > > > > >>
> > > > > >> How are you measuring utilization here? The number of incoming
> > requests
> > > > > >> or CPU?
> > > > > >>
> > > > > >> The leader for each shard are certainly handling all of the
> > indexing
> > > > > >> requests since they’re TLOG replicas, so that’s one thing that
> > might
> > > > > >> skewing your measurements.
> > > > > >>
> > > > > >> Best,
> > > > > >> Erick
> > > > > >>
> > > > > >> > On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > Hi everyone,
> > > > > >> >
> > > > > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
> > cloud has 6
> > > > > >> > shards with 10 TLOG replicas each shard.  After upgrade I
> > noticed that
> > > > > >> one
> > > > > >> > of the replicas in each shard is handling most of the
> > distributed shard
> > > > > >> > requests, so 6 nodes are heavily loaded while other nodes are
> > idle.
> > > > > >> There
> > > > > >> > is no change in shard handler configuration:
> > > > > >> >
> > > > > >> > <shardHandlerFactory name="shardHandlerFactory" class=
> > > > > >> > "HttpShardHandlerFactory">
> > > > > >> >
> > > > > >> >    <int name="socketTimeout">30000</int>
> > > > > >> >
> > > > > >> >    <int name="connTimeout">30000</int>
> > > > > >> >
> > > > > >> >    <int name="maxConnectionsPerHost">500</int>
> > > > > >> >
> > > > > >> > </shardHandlerFactory>
> > > > > >> >
> > > > > >> >
> > > > > >> > What could cause the unbalanced internal distributed request?
> > > > > >> >
> > > > > >> >
> > > > > >> > Thanks in advance.
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > Wei
> > > > > >>
> > > > > >>
> >

Re: Unbalanced shard requests

Posted by Wei <we...@gmail.com>.
Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other type
of replicas, and each Tlog replica is an individual solr instance on its
own physical machine.  In the jira you mentioned 'when "last place matches"
== "first place matches" – e.g. when shards.preference specified matches
*all* available replicas'.   My setting is
shards.preference=replica.location:local,replica.type:TLOG,
I also tried just shards.preference=replica.location:local and it still has
the issue. Can you explain a bit more?
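
For context, the preference lives in solrconfig.xml as a request handler default, roughly like this (a trimmed sketch; the exact handler setup may differ):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards.preference">replica.location:local,replica.type:TLOG</str>
  </lst>
</requestHandler>

The same parameter can also be passed per request, e.g. (placeholder host and collection name):

http://localhost:8983/solr/mycollection/select?q=*:*&shards.preference=replica.location:local,replica.type:TLOG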

On Mon, May 11, 2020 at 12:26 PM Michael Gibney <mi...@michaelgibney.net>
wrote:

> FYI: https://issues.apache.org/jira/browse/SOLR-14471
> Wei, assuming you have only TLOG replicas, your "last place" matches
> (to which the random fallback ordering would not be applied -- see
> above issue) would be the same as the "first place" matches selected
> for executing distributed requests.
>
>
> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> <mi...@michaelgibney.net> wrote:
> >
> > Wei, probably no need to answer my earlier questions; I think I see
> > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > Will file an issue and submit a patch shortly.
> > Michael
> >
> > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> > <mi...@michaelgibney.net> wrote:
> > >
> > > Hi Wei,
> > >
> > > In considering this problem, I'm stumbling a bit on terminology
> > > (particularly, where you mention "nodes", I think you're referring to
> > > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > > server instances) do you have, and what is the replica placement like
> > > across those nodes? What, if any, non-TLOG replicas do you have per
> > > shard (not that it's necessarily relevant, but just to get a complete
> > > picture of the situation)?
> > >
> > > If you're able without too much trouble, can you determine what the
> > > behavior is like on Solr 8.3? (there were different changes introduced
> > > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > > behavior you're observing manifests on 8.3 would help narrow down
> > > where to look for an explanation).
> > >
> > > Michael
> > >
> > > On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
> > > >
> > > > Update:  after I remove the shards.preference parameter from
> > > > solrconfig.xml,  issue is gone and internal shard requests are now
> > > > balanced. The same parameter works fine with solr 7.6.  Still not
> sure of
> > > > the root cause, but I observed a strange coincidence: the nodes that
> are
> > > > most frequently picked for shard requests are the first node in each
> shard
> > > > returned from the CLUSTERSTATUS api.  Seems something wrong with
> shuffling
> > > > equally compared nodes when shards.preference is set.  Will report
> back if
> > > > I find more.
> > > >
> > > > On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
> > > >
> > > > > Hi Eric,
> > > > >
> > > > > I am measuring the number of shard requests, and it's for query
> only, no
> > > > > indexing requests.  I have an external load balancer and see each
> node
> > > > > received about the equal number of external queries. However for
> the
> > > > > internal shard queries,  the distribution is uneven:    6 nodes
> (one in
> > > > > each shard,  some of them are leaders and some are non-leaders )
> gets about
> > > > > 80% of the shard requests, the other 54 nodes gets about 20% of
> the shard
> > > > > requests.   I checked a few other parameters set:
> > > > >
> > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > >
> > > > > Nothing seems to cause the strange behavior.  Any suggestions how
> to
> > > > > debug this?
> > > > >
> > > > > -Wei
> > > > >
> > > > >
> > > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
> erickerickson@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Wei:
> > > > >>
> > > > >> How are you measuring utilization here? The number of incoming
> requests
> > > > >> or CPU?
> > > > >>
> > > > >> The leader for each shard are certainly handling all of the
> indexing
> > > > >> requests since they’re TLOG replicas, so that’s one thing that
> might
> > > > >> skewing your measurements.
> > > > >>
> > > > >> Best,
> > > > >> Erick
> > > > >>
> > > > >> > On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com> wrote:
> > > > >> >
> > > > >> > Hi everyone,
> > > > >> >
> > > > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
> cloud has 6
> > > > >> > shards with 10 TLOG replicas each shard.  After upgrade I
> noticed that
> > > > >> one
> > > > >> > of the replicas in each shard is handling most of the
> distributed shard
> > > > >> > requests, so 6 nodes are heavily loaded while other nodes are
> idle.
> > > > >> There
> > > > >> > is no change in shard handler configuration:
> > > > >> >
> > > > >> > <shardHandlerFactory name="shardHandlerFactory" class=
> > > > >> > "HttpShardHandlerFactory">
> > > > >> >
> > > > >> >    <int name="socketTimeout">30000</int>
> > > > >> >
> > > > >> >    <int name="connTimeout">30000</int>
> > > > >> >
> > > > >> >    <int name="maxConnectionsPerHost">500</int>
> > > > >> >
> > > > >> > </shardHandlerFactory>
> > > > >> >
> > > > >> >
> > > > >> > What could cause the unbalanced internal distributed request?
> > > > >> >
> > > > >> >
> > > > >> > Thanks in advance.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > Wei
> > > > >>
> > > > >>
>

Re: Unbalanced shard requests

Posted by Michael Gibney <mi...@michaelgibney.net>.
FYI: https://issues.apache.org/jira/browse/SOLR-14471
Wei, assuming you have only TLOG replicas, your "last place" matches
(to which the random fallback ordering would not be applied -- see
above issue) would be the same as the "first place" matches selected
for executing distributed requests.
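
One way to see the effect in practice is to compare the QUERY./select.requestTimes "count" across the replicas of a single shard via the metrics API, e.g. (placeholder host and port; run one request against each replica's node):

http://localhost:8983/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes

If the random tie-break is not being applied, one replica's count will dwarf the others'.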


On Mon, May 11, 2020 at 1:49 PM Michael Gibney
<mi...@michaelgibney.net> wrote:
>
> Wei, probably no need to answer my earlier questions; I think I see
> the problem here, and believe it is indeed a bug, introduced in 8.3.
> Will file an issue and submit a patch shortly.
> Michael
>
> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> <mi...@michaelgibney.net> wrote:
> >
> > Hi Wei,
> >
> > In considering this problem, I'm stumbling a bit on terminology
> > (particularly, where you mention "nodes", I think you're referring to
> > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > server instances) do you have, and what is the replica placement like
> > across those nodes? What, if any, non-TLOG replicas do you have per
> > shard (not that it's necessarily relevant, but just to get a complete
> > picture of the situation)?
> >
> > If you're able without too much trouble, can you determine what the
> > behavior is like on Solr 8.3? (there were different changes introduced
> > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > behavior you're observing manifests on 8.3 would help narrow down
> > where to look for an explanation).
> >
> > Michael
> >
> > On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
> > >
> > > Update:  after I remove the shards.preference parameter from
> > > solrconfig.xml,  issue is gone and internal shard requests are now
> > > balanced. The same parameter works fine with solr 7.6.  Still not sure of
> > > the root cause, but I observed a strange coincidence: the nodes that are
> > > most frequently picked for shard requests are the first node in each shard
> > > returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> > > equally compared nodes when shards.preference is set.  Will report back if
> > > I find more.
> > >
> > > On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
> > >
> > > > Hi Eric,
> > > >
> > > > I am measuring the number of shard requests, and it's for query only, no
> > > > indexing requests.  I have an external load balancer and see each node
> > > > received about the equal number of external queries. However for the
> > > > internal shard queries,  the distribution is uneven:    6 nodes (one in
> > > > each shard,  some of them are leaders and some are non-leaders ) gets about
> > > > 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> > > > requests.   I checked a few other parameters set:
> > > >
> > > > -Dsolr.disable.shardsWhitelist=true
> > > > shards.preference=replica.location:local,replica.type:TLOG
> > > >
> > > > Nothing seems to cause the strange behavior.  Any suggestions how to
> > > > debug this?
> > > >
> > > > -Wei
> > > >
> > > >
> > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <er...@gmail.com>
> > > > wrote:
> > > >
> > > >> Wei:
> > > >>
> > > >> How are you measuring utilization here? The number of incoming requests
> > > >> or CPU?
> > > >>
> > > >> The leader for each shard are certainly handling all of the indexing
> > > >> requests since they’re TLOG replicas, so that’s one thing that might
> > > >> skewing your measurements.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >> > On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com> wrote:
> > > >> >
> > > >> > Hi everyone,
> > > >> >
> > > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
> > > >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
> > > >> one
> > > >> > of the replicas in each shard is handling most of the distributed shard
> > > >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> > > >> There
> > > >> > is no change in shard handler configuration:
> > > >> >
> > > >> > <shardHandlerFactory name="shardHandlerFactory" class=
> > > >> > "HttpShardHandlerFactory">
> > > >> >
> > > >> >    <int name="socketTimeout">30000</int>
> > > >> >
> > > >> >    <int name="connTimeout">30000</int>
> > > >> >
> > > >> >    <int name="maxConnectionsPerHost">500</int>
> > > >> >
> > > >> > </shardHandlerFactory>
> > > >> >
> > > >> >
> > > >> > What could cause the unbalanced internal distributed request?
> > > >> >
> > > >> >
> > > >> > Thanks in advance.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Wei
> > > >>
> > > >>

Re: Unbalanced shard requests

Posted by Michael Gibney <mi...@michaelgibney.net>.
Wei, probably no need to answer my earlier questions; I think I see
the problem here, and believe it is indeed a bug, introduced in 8.3.
Will file an issue and submit a patch shortly.
Michael

On Mon, May 11, 2020 at 12:49 PM Michael Gibney
<mi...@michaelgibney.net> wrote:
>
> Hi Wei,
>
> In considering this problem, I'm stumbling a bit on terminology
> (particularly, where you mention "nodes", I think you're referring to
> "replicas"?). Could you confirm that you have 10 TLOG replicas per
> shard, for each of 6 shards? How many *nodes* (i.e., running solr
> server instances) do you have, and what is the replica placement like
> across those nodes? What, if any, non-TLOG replicas do you have per
> shard (not that it's necessarily relevant, but just to get a complete
> picture of the situation)?
>
> If you're able without too much trouble, can you determine what the
> behavior is like on Solr 8.3? (there were different changes introduced
> to potentially relevant code in 8.3 and 8.4, and knowing whether the
> behavior you're observing manifests on 8.3 would help narrow down
> where to look for an explanation).
>
> Michael
>
> On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
> >
> > Update:  after I remove the shards.preference parameter from
> > solrconfig.xml,  issue is gone and internal shard requests are now
> > balanced. The same parameter works fine with solr 7.6.  Still not sure of
> > the root cause, but I observed a strange coincidence: the nodes that are
> > most frequently picked for shard requests are the first node in each shard
> > returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> > equally compared nodes when shards.preference is set.  Will report back if
> > I find more.
> >
> > On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
> >
> > > Hi Eric,
> > >
> > > I am measuring the number of shard requests, and it's for query only, no
> > > indexing requests.  I have an external load balancer and see each node
> > > received about the equal number of external queries. However for the
> > > internal shard queries,  the distribution is uneven:    6 nodes (one in
> > > each shard,  some of them are leaders and some are non-leaders ) gets about
> > > 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> > > requests.   I checked a few other parameters set:
> > >
> > > -Dsolr.disable.shardsWhitelist=true
> > > shards.preference=replica.location:local,replica.type:TLOG
> > >
> > > Nothing seems to cause the strange behavior.  Any suggestions how to
> > > debug this?
> > >
> > > -Wei
> > >
> > >
> > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <er...@gmail.com>
> > > wrote:
> > >
> > >> Wei:
> > >>
> > >> How are you measuring utilization here? The number of incoming requests
> > >> or CPU?
> > >>
> > >> The leader for each shard are certainly handling all of the indexing
> > >> requests since they’re TLOG replicas, so that’s one thing that might
> > >> skewing your measurements.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> > On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com> wrote:
> > >> >
> > >> > Hi everyone,
> > >> >
> > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
> > >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
> > >> one
> > >> > of the replicas in each shard is handling most of the distributed shard
> > >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> > >> There
> > >> > is no change in shard handler configuration:
> > >> >
> > >> > <shardHandlerFactory name="shardHandlerFactory" class=
> > >> > "HttpShardHandlerFactory">
> > >> >
> > >> >    <int name="socketTimeout">30000</int>
> > >> >
> > >> >    <int name="connTimeout">30000</int>
> > >> >
> > >> >    <int name="maxConnectionsPerHost">500</int>
> > >> >
> > >> > </shardHandlerFactory>
> > >> >
> > >> >
> > >> > What could cause the unbalanced internal distributed request?
> > >> >
> > >> >
> > >> > Thanks in advance.
> > >> >
> > >> >
> > >> >
> > >> > Wei
> > >>
> > >>

Re: Unbalanced shard requests

Posted by Michael Gibney <mi...@michaelgibney.net>.
Hi Wei,

In considering this problem, I'm stumbling a bit on terminology
(particularly, where you mention "nodes", I think you're referring to
"replicas"?). Could you confirm that you have 10 TLOG replicas per
shard, for each of 6 shards? How many *nodes* (i.e., running solr
server instances) do you have, and what is the replica placement like
across those nodes? What, if any, non-TLOG replicas do you have per
shard (not that it's necessarily relevant, but just to get a complete
picture of the situation)?

If you're able without too much trouble, can you determine what the
behavior is like on Solr 8.3? (there were different changes introduced
to potentially relevant code in 8.3 and 8.4, and knowing whether the
behavior you're observing manifests on 8.3 would help narrow down
where to look for an explanation).

Michael

On Fri, May 8, 2020 at 7:34 PM Wei <we...@gmail.com> wrote:
>
> Update:  after I remove the shards.preference parameter from
> solrconfig.xml,  issue is gone and internal shard requests are now
> balanced. The same parameter works fine with solr 7.6.  Still not sure of
> the root cause, but I observed a strange coincidence: the nodes that are
> most frequently picked for shard requests are the first node in each shard
> returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> equally compared nodes when shards.preference is set.  Will report back if
> I find more.
>
> On Mon, Apr 27, 2020 at 5:59 PM Wei <we...@gmail.com> wrote:
>
> > Hi Eric,
> >
> > I am measuring the number of shard requests, and it's for query only, no
> > indexing requests.  I have an external load balancer and see each node
> > received about the equal number of external queries. However for the
> > internal shard queries,  the distribution is uneven:    6 nodes (one in
> > each shard,  some of them are leaders and some are non-leaders ) gets about
> > 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> > requests.   I checked a few other parameters set:
> >
> > -Dsolr.disable.shardsWhitelist=true
> > shards.preference=replica.location:local,replica.type:TLOG
> >
> > Nothing seems to cause the strange behavior.  Any suggestions how to
> > debug this?
> >
> > -Wei
> >
> >
> > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <er...@gmail.com>
> > wrote:
> >
> >> Wei:
> >>
> >> How are you measuring utilization here? The number of incoming requests
> >> or CPU?
> >>
> >> The leader for each shard are certainly handling all of the indexing
> >> requests since they’re TLOG replicas, so that’s one thing that might
> >> skewing your measurements.
> >>
> >> Best,
> >> Erick
> >>
> >> > On Apr 27, 2020, at 7:13 PM, Wei <we...@gmail.com> wrote:
> >> >
> >> > Hi everyone,
> >> >
> >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
> >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
> >> one
> >> > of the replicas in each shard is handling most of the distributed shard
> >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> >> There
> >> > is no change in shard handler configuration:
> >> >
> >> > <shardHandlerFactory name="shardHandlerFactory" class=
> >> > "HttpShardHandlerFactory">
> >> >
> >> >    <int name="socketTimeout">30000</int>
> >> >
> >> >    <int name="connTimeout">30000</int>
> >> >
> >> >    <int name="maxConnectionsPerHost">500</int>
> >> >
> >> > </shardHandlerFactory>
> >> >
> >> >
> >> > What could cause the unbalanced internal distributed request?
> >> >
> >> >
> >> > Thanks in advance.
> >> >
> >> >
> >> >
> >> > Wei
> >>
> >>