Posted to solr-user@lucene.apache.org by Arturas Mazeika <ma...@gmail.com> on 2018/06/22 12:50:59 UTC

top 10 query overall vs shard

Hi Solr-Team,

I am familiarizing myself with SolrCloud and am trying out and comparing
different processing setups. Short story: a terms query run on a single shard
gives higher counts than querying the complete index. I wonder why.



Long story:

I grabbed the 7.2.1 version of Solr, created a 4-node setup with
replication factor 2 on Windows using [1], restarted the setup with
2GB for each node [2], inserted the HTML docs from the German Wikipedia
archive [3], and obtained the top 10 terms for the whole collection vs one
specific shard:

http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":5287},
  "terms":{
    "text":[
      "8",670564,
      "application",670564,
      "articles",670564,
      "charset",670564,
      "de",670564,
      "f",670564,
      "utf",670564,
      "wiki",670564,
      "xhtml",670564,
      "xml",670564]}}

http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json&shards=localhost:9999/solr/de_wiki_all_shard1_replica_n1&shards.qt=de_wiki_all_shard1_replica_n1

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":20274},
  "terms":{
    "text":{
      "8":671396,
      "application":671396,
      "articles":671396,
      "charset":671396,
      "de":671396,
      "f":671396,
      "utf":671396,
      "wiki":671396,
      "xhtml":671396,
      "xml":671396}}}

reveals:
(1) querying one shard takes 20 secs vs 5 secs for the whole index

(2) the counts for one shards are higher than for the whole index

(3) the f: hard drive is samsung SSD 850 evo 4TB (CrystalDeiskMark shows
~500MB/s seq and ~300MBs random read/writes), CPU:i7-6400 @3.4GHz. Querying
for 20 secs shows that java process is neither being pushed on the CPU nor
on the SDD side to the limits. What is the bottleneck in this computation?

(4) the output format is slightly different (compare ',' vs ':' and vector
vs list). I wonder why

The findings are a bit counter intuitive. Could you comment on those?
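Regarding (4): a client parsing these responses has to handle both shapes. A minimal normalizer (just a sketch, using the counts from the responses above) could look like:

```python
def terms_to_dict(payload):
    """Normalize the 'text' entry of a terms response to {term: count},
    whether it arrived as a flat [term, count, term, count, ...] list
    (first response) or as a {term: count} map (second response)."""
    if isinstance(payload, dict):
        return dict(payload)
    pairs = iter(payload)          # pair up term, count, term, count, ...
    return dict(zip(pairs, pairs))

print(terms_to_dict(["8", 670564, "de", 670564]))
print(terms_to_dict({"8": 671396, "de": 671396}))
```

Both calls yield a plain {term: count} dict, so the rest of the client code does not have to care which shape the server chose.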

Cheers,
Arturas



References:

[1] Create cluster

F:\solr_server\solr-7.2.1>bin\solr.cmd start -e cloud

Welcome to the SolrCloud example!

This interactive session will help you launch a SolrCloud cluster on
your local workstation.
To begin, how many Solr nodes would you like to run in your local
cluster? (specify 1-4 nodes) [2]:
4
Ok, let's start up 4 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:
9999
Please enter the port for node2 [7574]:
9998
Please enter the port for node3 [8984]:
9997
Please enter the port for node4 [7575]:
9996
Creating Solr home directory F:\solr_server\solr-7.2.1\example\cloud\node1\solr
Cloning F:\solr_server\solr-7.2.1\example\cloud\node1 into
   F:\solr_server\solr-7.2.1\example\cloud\node2
Cloning F:\solr_server\solr-7.2.1\example\cloud\node1 into
   F:\solr_server\solr-7.2.1\example\cloud\node3
Cloning F:\solr_server\solr-7.2.1\example\cloud\node1 into
   F:\solr_server\solr-7.2.1\example\cloud\node4

Starting up Solr on port 9999 using command:
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -cloud -p 9999 -s
"F:\solr_server\solr-7.2.1\example\cloud\node1\solr"

Waiting up to 30 to see Solr running on port 9999
Started Solr server on port 9999. Happy searching!

Starting up Solr on port 9998 using command:
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -cloud -p 9998 -s
"F:\solr_server\solr-7.2.1\example\cloud\node2\solr" -z
localhost:10999

Waiting up to 30 to see Solr running on port 9998

Starting up Solr on port 9997 using command:
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -cloud -p 9997 -s
"F:\solr_server\solr-7.2.1\example\cloud\node3\solr" -z
localhost:10999

Started Solr server on port 9998. Happy searching!
Waiting up to 30 to see Solr running on port 9997

Starting up Solr on port 9996 using command:
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -cloud -p 9996 -s
"F:\solr_server\solr-7.2.1\example\cloud\node4\solr" -z
localhost:10999

Started Solr server on port 9997. Happy searching!
Waiting up to 30 to see Solr running on port 9996
Started Solr server on port 9996. Happy searching!
INFO  - 2018-06-21 15:38:16.239;
org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider;
Cluster at localhost:10999 ready

Now let's create a new collection for indexing documents in your 4-node cluster.
Please provide a name for your new collection: [gettingstarted]
de_wiki_all
How many shards would you like to split de_wiki_all into? [2]
4
How many replicas per shard would you like to create? [2]
2
Please choose a configuration for the de_wiki_all collection,
available options are:
_default or sample_techproducts_configs [_default]
sample_techproducts_configs
Created collection 'de_wiki_all' with 4 shard(s), 2 replica(s) with
config-set 'de_wiki_all'

Enabling auto soft-commits with maxTime 3 secs using the Config API

POSTing request to Config API: http://localhost:9999/solr/de_wiki_all/config
{"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
Successfully set-property updateHandler.autoSoftCommit.maxTime to 3000


SolrCloud example running, please visit: http://localhost:9999/solr


F:\solr_server\solr-7.2.1>





[2] Restart with 2GB

"F:\solr_server\solr-7.2.1\bin\solr.cmd" stop -all

"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -m 2g -cloud -p 9999 -s
"F:\solr_server\solr-7.2.1\example\cloud\node1\solr"
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -m 2g -cloud -p 9998 -s
"F:\solr_server\solr-7.2.1\example\cloud\node2\solr" -z
localhost:10999
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -m 2g -cloud -p 9997 -s
"F:\solr_server\solr-7.2.1\example\cloud\node3\solr" -z
localhost:10999
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -m 2g -cloud -p 9996 -s
"F:\solr_server\solr-7.2.1\example\cloud\node4\solr" -z
localhost:10999




[3] Insert wikipedia files

java  -jar -Durl=http://localhost:9999/solr/de_wiki_all/update  -Dauto
-Drecursive example\exampledocs\post.jar f:\wiki\de\articles\*

2681612 files indexed.
COMMITting Solr index changes to
http://localhost:9999/solr/de_wiki_all/update...
Time spent: 12:32:53.843

Re: top 10 query overall vs shard

Posted by Arturas Mazeika <ma...@gmail.com>.
Hi Shawn et al,

Thanks a lot for the prompt answer.

It looks like I made quite a few mistakes in formulating those Solr
queries. Setting shards.qt to the name of the core was completely wrong. I
tried to search for shards.qt in http://lucene.apache.org/solr/guide/7_3/
but it did not give any answers. Googling for shards.qt was more successful
(I found an explanation of what it means in two books, plus pointers and
numerous usage examples in the top 40 results). So I would suggest that
adding a sentence about how to use shards.qt somewhere in the
documentation would not hurt :-)

Recomputing the queries:

http://localhost:9999/solr/de_wiki_all_shard1_replica_n1/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false
returns
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":3400},
  "terms":{
    "text":[
      "8",671396,
      "application",671396,
      "articles",671396,
      "charset",671396,
      "de",671396,
      "f",671396,
      "utf",671396,
      "wiki",671396,
      "xhtml",671396,
      "xml",671396]}}

http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json

returns

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":2584},
  "terms":{
    "text":[
      "8",670564,
      "application",670564,
      "articles",670564,
      "charset",670564,
      "de",670564,
      "f",670564,
      "utf",670564,
      "wiki",670564,
      "xhtml",670564,
      "xml",670564]}}

LOG in CORE1:

INFO  -
2018-06-22 15:27:40.779; [c:de_wiki_all s:shard1 r:core_node3
x:de_wiki_all_shard1_replica_n1] org.apache.solr.core.SolrCore;
[de_wiki_all_shard1_replica_n1]  webapp=/solr path=/terms
params={distrib=false&terms.fl=text&terms.limit=10&wt=json}
status=0 QTime=3027
INFO  - 2018-06-22 15:27:42.059; [c:de_wiki_all
s:shard3 r:core_node11 x:de_wiki_all_shard3_replica_n8]
org.apache.solr.core.SolrCore; [de_wiki_all_shard3_replica_n8]
webapp=/solr path=/terms
params={distrib=false&terms.fl=text&terms.limit=10&wt=json}
status=0 QTime=2608

The numbers also did not change after

http://localhost:9999/solr/de_wiki_all/update?commit=true

(you correctly assumed that the collection is not getting any updates).


After I fired this query:

http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json&distrib=true

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":70245},
  "terms":{
    "text":{
      "8":2681402,
      "application":2681402,
      "articles":2681402,
      "charset":2681402,
      "de":2681402,
      "f":2681402,
      "utf":2681402,
      "wiki":2681402,
      "xhtml":2681402,
      "xml":2681402}}}

with the log line:

INFO  - 2018-06-22 15:32:54.805; [c:de_wiki_all s:shard1 r:core_node3
x:de_wiki_all_shard1_replica_n1] org.apache.solr.core.SolrCore;
[de_wiki_all_shard1_replica_n1]  webapp=/solr path=/terms
params={distrib=true&terms.fl=text&terms.limit=10&wt=json} status=0
QTime=70245

even the 1st query started returning the same results (shouldn't the
query be faster in the distributed setting?):

http://localhost:9999/solr/de_wiki_all_shard1_replica_n1/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":3438},
  "terms":{
    "text":[
      "8",671396,
      "application",671396,
      "articles",671396,
      "charset",671396,
      "de",671396,
      "f",671396,
      "utf",671396,
      "wiki",671396,
      "xhtml",671396,
      "xml",671396]}}


http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":3325},
  "terms":{
    "text":[
      "8",671396,
      "application",671396,
      "articles",671396,
      "charset",671396,
      "de",671396,
      "f",671396,
      "utf",671396,
      "wiki",671396,
      "xhtml",671396,
      "xml",671396]}}

Also

http://localhost:9997/solr/de_wiki_all_shard2_replica_n4/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":2637},
  "terms":{
    "text":[
      "8",670221,
      "application",670221,
      "articles",670221,
      "charset",670221,
      "de",670221,
      "f",670221,
      "utf",670221,
      "wiki",670221,
      "xhtml",670221,
      "xml",670221]}}

http://localhost:9997/solr/de_wiki_all_shard4_replica_n12/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":2536},
  "terms":{
    "text":[
      "8",669221,
      "application",669221,
      "articles",669221,
      "charset",669221,
      "de",669221,
      "f",669221,
      "utf",669221,
      "wiki",669221,
      "xhtml",669221,
      "xml",669221]}}

http://localhost:9997/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":2405},
  "terms":{
    "text":[
      "8",669221,
      "application",669221,
      "articles",669221,
      "charset",669221,
      "de",669221,
      "f",669221,
      "utf",669221,
      "wiki",669221,
      "xhtml",669221,
      "xml",669221]}}

which means that the de_wiki_all/terms query is redirected to only one of
the cores and computed locally.

On the performance part: the PC has 32GB of RAM with some 10GB left for the
OS to cache things. The complete index is ~40GB (the complete collection as
text documents was ~40GB), and each replica is around 3.5GB (shown e.g.
in http://172.16.203.123:9999/solr/#/de_wiki_all_shard1_replica_n1). What
would be the easiest way to get all indexes/replicas listed with their
corresponding sizes in bytes?




What is the complexity of this terms query? Does Solr need to go through
the individual inverted lists, or does it only need to scan the list of
terms (is the number of IDs in each inverted list precomputed)?
This part of the question is particularly interesting because I was not
able to compute
de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json&distrib=true with
2GB of memory per core (due to "unable to allocate memory in java heap" I
had to increase every instance to 3GB).
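To make the question concrete: if the term dictionary stores each term's
document frequency, the top-10 computation would be just a single scan over
(term, docFreq) pairs with a small heap, never touching any postings lists.
An illustrative sketch of that idea (not Solr's actual code):

```python
import heapq

def top_terms(term_dict, k=10):
    """term_dict: iterable of (term, doc_freq) pairs, as a stand-in for
    one shard's term dictionary for a field. Selects the k terms with
    the highest document frequency in a single pass."""
    return heapq.nlargest(k, term_dict, key=lambda tc: tc[1])

print(top_terms([("utf", 671396), ("rare", 3), ("de", 670564)], k=2))
```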


Cheers,

Arturas



On Fri, Jun 22, 2018 at 4:28 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 6/22/2018 8:12 AM, Shawn Heisey wrote:
>
>> I wonder if having an invalid handler contributed to the speed.
>>
>
> Further thought about this:
>
> I can't say whether having an invalid handler name would cause speed
> problems, but based on my limited understanding of the code involved, I
> don't think it would.
>
> I'm guessing that with a shards.qt value that doesn't start with a slash,
> that the request gets sent to /select, with a qt parameter set to the
> value.  Solr would most likely ignore any qt value, because the
> handleSelect setting on requestDispatcher in solrconfig.xml has defaulted
> to false for many versions.
>
> Another possibility is that the OS had cached the information in a
> different replica for the full distributed query, and this made that query
> fast, but when the query directed to a specific shard replica was made,
> that data wasn't cached, and so Solr had to read the disk to satisfy the
> query, which is going to REALLY slow it down.  I would imagine that if you
> repeated the single-shard query multiple times, especially using the
> different URL that I gave you, the speed discrepancy might disappear.
>
> Thanks,
> Shawn
>
>

Re: top 10 query overall vs shard

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/22/2018 8:12 AM, Shawn Heisey wrote:
> I wonder if having an invalid handler contributed to the speed.

Further thought about this:

I can't say whether having an invalid handler name would cause speed 
problems, but based on my limited understanding of the code involved, I 
don't think it would.

I'm guessing that with a shards.qt value that doesn't start with a 
slash, that the request gets sent to /select, with a qt parameter set to 
the value.  Solr would most likely ignore any qt value, because the 
handleSelect setting on requestDispatcher in solrconfig.xml has 
defaulted to false for many versions.

Another possibility is that the OS had cached the information in a 
different replica for the full distributed query, and this made that 
query fast, but when the query directed to a specific shard replica was 
made, that data wasn't cached, and so Solr had to read the disk to 
satisfy the query, which is going to REALLY slow it down.  I would 
imagine that if you repeated the single-shard query multiple times, 
especially using the different URL that I gave you, the speed 
discrepancy might disappear.

Thanks,
Shawn


Re: top 10 query overall vs shard

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/22/2018 6:50 AM, Arturas Mazeika wrote:
> I grabbed the 7.2.1 version of Solr, created a 4-node setup with
> replication factor 2 on windows using [1], I've restarted the setup with
> 2GB for each node [2], inserted the html docs from the german wikipedia
> archive [3], and obtained top 10 terms for the whole collection vs one
> specific shard:
> http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json
> {
>   "responseHeader":{
>     "zkConnected":true,
>     "status":0,
>     "QTime":5287},
>   "terms":{
>     "text":[
>        "8",670564,
>        "application",670564,
>        "articles",670564,
>        "charset",670564,
>        "de",670564,
>        "f",670564,
>        "utf",670564,
>        "wiki",670564,
>        "xhtml",670564,
>        "xml",670564]}}
>
> http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json&shards=localhost:9999/solr/de_wiki_all_shard1_replica_n1&shards.qt=de_wiki_all_shard1_replica_n1
>
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":20274},
>    "terms":{
>      "text":{
>        "8":671396,
>        "application":671396,
>        "articles":671396,
>        "charset":671396,
>        "de":671396,
>        "f":671396,
>        "utf":671396,
>        "wiki":671396,
>        "xhtml":671396,
>        "xml":671396}}}

The value of 'shards.qt' should be /terms, not the name of a core.  
Here's something you might want to try instead for the second query, so 
you won't need shards.qt at all:

http://localhost:9999/solr/de_wiki_all_shard1_replica_n1/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false

You might actually want to add shards.qt=/terms to the first query, or 
even to the definition of the /terms handler in solrconfig.xml so that 
all distributed queries are sent to the same handler instead of going to 
/select.
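What that solrconfig.xml addition could look like (a sketch based on the
/terms handler as it ships in the sample configs; verify against your own
solrconfig.xml before applying):

```xml
<!-- Default shards.qt on the /terms handler so that distributed
     sub-requests also go to /terms instead of falling back to /select. -->
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
    <str name="shards.qt">/terms</str>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
```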

> reveals:
> (1) querying one shard takes 20 secs vs 5 secs for the whole index

That is strange.  With the shards.qt parameter set to a core name, I'm 
surprised you got anything at all on the second query, but maybe when it 
couldn't find a handler with that name, it just defaulted to /select 
like it would if you didn't include the parameter.  I wonder if having 
an invalid handler contributed to the speed.

> (2) the counts for one shard are higher than for the whole index

If you're not changing the index between the requests, and it doesn't 
sound like you are, I have no idea why that might happen.

> (3) the f: drive is a Samsung SSD 850 EVO 4TB (CrystalDiskMark shows
> ~500MB/s sequential and ~300MB/s random reads/writes), CPU: i7-6400 @3.4GHz.
> During the 20-second query, the java process is pushed to its limits neither
> on the CPU nor on the SSD side. What is the bottleneck in this computation?

If the amount of memory in the system (NOT talking about heap size here) 
is not sufficient to effectively cache the index, then Solr must 
actually hit the disk to satisfy a query.  Even an SSD is not as fast as 
memory.  You haven't indicated how much disk space is being consumed by 
the eight index cores or how much total memory the system has.  A little 
more than 8GB of the system's memory is being taken up by the four Solr 
processes.  Because you've asked for two replicas, there are two 
complete copies of the index on the system, and both copies will count 
in the total amount of resources that are required.

If there *is* sufficient memory for effective index caching, then the 
disk will barely see any usage during queries, because Solr will get 
most of the data it needs from the OS disk cache (system memory).  This 
will also reduce the impact on the CPU, because it will not be waiting 
for I/O.

Running a query is not going to read the entire index.  If it did, Solr 
would not be fast.

> (4) the output format is slightly different (compare ',' vs ':' and list
> vs map). I wonder why.

That I cannot explain.  The first response doesn't look right to me.  It 
passes RFC 4627 validation, but the software parsing the response would 
have to be very different for each of the output formats.

Thanks,
Shawn