Posted to solr-user@lucene.apache.org by Justin Babuscio <jb...@linchpinsoftware.com> on 2012/06/19 22:06:02 UTC

Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

Solr v3.5.0
8 Master Shards
2 Slaves Per Master

Confirming that there are no active records being written, the "numFound"
value is decreasing as we page through the results.

For example,
Page1 - numFound = 3683
Page2 - numFound = 3683
Page3 - numFound = 3683
Page4 - numFound = 2866
Page5 - numFound = 2419
Page6 - numFound = 1898
Page7 - numFound = 1898
...
PageN - numFound = 1898



It looks like it eventually settles on the real count.  Is this a
limitation of distributed clusters, or is numFound only intended to give
an approximation, similar to how Google reports total hits?


I also increased start to higher than 1898 (made it 3000), and it returned 0
results with numFound = 1898.
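
For reference, here is a rough sketch of how we page through results (the
host and the query below are placeholders, not our real setup):

import json
import urllib.request

# Hypothetical round-robin VIP in front of the slaves; q is a placeholder.
SOLR = "http://search-vip.example.com/solr/select"
ROWS = 100

start = 0
while True:
    url = "%s?q=*:*&wt=json&rows=%d&start=%d" % (SOLR, ROWS, start)
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    num_found = body["response"]["numFound"]
    print("start=%d -> numFound=%d" % (start, num_found))
    if start >= num_found:  # past the reported end; the docs list comes back empty
        break
    start += ROWS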


Thanks ahead,

-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com

Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

Posted by Justin Babuscio <jb...@linchpinsoftware.com>.
As I understand your problem, it sounds like you were using your master as
part of your search cluster, so the two distributed queries were returning
conflicting numbers.

In my scenario, our eight Masters are used for /updates & /deletes only.
 There are no queries issued to these nodes.  When the distributed query is
executed, it is possible for the two slaves to be out of sync (i.e. one
replicated faster than the other).

What may rule this out is that there is no activity on my servers right
now, and the search counts have stabilized.

It is consistently returning fewer results as I execute the query with a
"start" URL param > 400.



On Tue, Jun 19, 2012 at 5:05 PM, Shawn Heisey <so...@elyograg.org> wrote:

> On 6/19/2012 2:32 PM, Justin Babuscio wrote:
>
>> 2) For the shards, we use the URL
>> parameters, shards=s1/solr,s2/solr,s3/solr,...,s8/solr
>>     where s# point to a baremetal load balancer that routes the requests
>> to
>> one of the two slave shards.
>>
>
> This most likely has nothing to do with your question about changing
> numFound, just a side issue that I wanted to comment on.  I was at one time
> using a similar method where I had each shard as an entry in the load
> balancer.  This led to an unusual occasional problem.
>
> As you may know, a distributed query results in two queries being sent to
> each shard -- the first one finds the documents on each shard, then once
> Solr has gathered those results, it makes another request that retrieves
> the document.
>
> Imagine that you have just updated your master server, and you make a
> query that will include one or more of the new documents in the results.
>  If you make that query just after the master server gets updated, but
> before the slave has had a chance to copy and commit the changes, you can
> run into this:  The first (search) query goes to the master server and will
> see the new document.  The second (retrieval) query will then go to the
> slave, requesting a document that does not yet exist there.  This *will*
> happen eventually.  I would run into it at least once a day on a monitoring
> system that checked the age of the newest document.
>
> Here's one way to deal with that: I have a dedicated core on each server
> that has the shards parameter included in the request handler.  This core
> does not have an index of its own, it exists only to act as a search
> broker, pointing at all the cores with the data.  The name of this core is
> ncmain, and its standard request handler contains the following:
>
> <str name="shards">idxa2.example.com:8981/solr/inclive,idxa1.example.com:8981/solr/s0live,idxa1.example.com:8981/solr/s1live,idxa1.example.com:8981/solr/s2live,idxa2.example.com:8981/solr/s3live,idxa2.example.com:8981/solr/s4live,idxa2.example.com:8981/solr/s5live</str>
>
> On the servers for chain A (idxa1, idxa2), the shards parameter references
> only chain A server cores.  On the servers for chain B (idxb1, idxb2), the
> shards parameter references only chain B server cores.
>
> The load balancer only talks to these broker cores, not the cores with the
> actual indexes.  Neither the client nor the load balancer needs to use (or
> even know about) the shards parameter.  That is handled entirely within the
> Solr configuration.
>
> Thanks,
> Shawn
>
>


-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com

Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

Posted by Shawn Heisey <so...@elyograg.org>.
On 6/19/2012 2:32 PM, Justin Babuscio wrote:
> 2) For the shards, we use the URL
> parameters, shards=s1/solr,s2/solr,s3/solr,...,s8/solr
>      where s# point to a baremetal load balancer that routes the requests to
> one of the two slave shards.

This most likely has nothing to do with your question about changing 
numFound, just a side issue that I wanted to comment on.  I was at one 
time using a similar method where I had each shard as an entry in the 
load balancer.  This led to an unusual occasional problem.

As you may know, a distributed query results in two queries being sent 
to each shard -- the first one finds the documents on each shard, then 
once Solr has gathered those results, it makes another request that 
retrieves the document.

Imagine that you have just updated your master server, and you make a 
query that will include one or more of the new documents in the 
results.  If you make that query just after the master server gets 
updated, but before the slave has had a chance to copy and commit the 
changes, you can run into this:  The first (search) query goes to the 
master server and will see the new document.  The second (retrieval) 
query will then go to the slave, requesting a document that does not yet 
exist there.  This *will* happen eventually.  I would run into it at 
least once a day on a monitoring system that checked the age of the 
newest document.
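
To make that concrete, here is a toy sketch of the two phases and the race
(pure illustration, not Solr's actual code; the data is made up):

# Toy model: each "replica" is a dict of uniqueKey -> (score, stored doc).
def phase1_search(replica, rows):
    """First query: return (key, score) pairs for the replica's top docs."""
    ranked = sorted(replica.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(key, score) for key, (score, _doc) in ranked[:rows]]

def phase2_retrieve(replica, key):
    """Second query: fetch stored fields by key; None if the doc is missing."""
    entry = replica.get(key)
    return entry[1] if entry else None

# The master already has the new doc; the slave has not replicated it yet.
master = {"new": (2.0, {"id": "new"}), "old": (1.0, {"id": "old"})}
slave = {"old": (1.0, {"id": "old"})}

# Phase 1 happens to hit the master through the load balancer...
top = phase1_search(master, rows=2)
# ...but phase 2 lands on the stale slave:
print([phase2_retrieve(slave, key) for key, _score in top])
# -> [None, {'id': 'old'}]: the new doc cannot be retrieved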

Here's one way to deal with that: I have a dedicated core on each server 
that has the shards parameter included in the request handler.  This 
core does not have an index of its own, it exists only to act as a 
search broker, pointing at all the cores with the data.  The name of 
this core is ncmain, and its standard request handler contains the 
following:

<str 
name="shards">idxa2.example.com:8981/solr/inclive,idxa1.example.com:8981/solr/s0live,idxa1.example.com:8981/solr/s1live,idxa1.example.com:8981/solr/s2live,idxa2.example.com:8981/solr/s3live,idxa2.example.com:8981/solr/s4live,idxa2.example.com:8981/solr/s5live</str>

On the servers for chain A (idxa1, idxa2), the shards parameter 
references only chain A server cores.  On the servers for chain B 
(idxb1, idxb2), the shards parameter references only chain B server cores.

The load balancer only talks to these broker cores, not the cores with 
the actual indexes.  Neither the client nor the load balancer needs to 
use (or even know about) the shards parameter.  That is handled entirely 
within the Solr configuration.
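
With that in place, a client query is just an ordinary request to the broker
core; a quick sketch (the query itself is a placeholder):

import json
import urllib.request

# The broker core (ncmain) fans the query out itself; no shards param needed.
url = "http://idxa1.example.com:8981/solr/ncmain/select?q=*:*&wt=json"
with urllib.request.urlopen(url) as resp:
    print(json.load(resp)["response"]["numFound"])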

Thanks,
Shawn


Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

Posted by Justin Babuscio <jb...@linchpinsoftware.com>.
1) We have 1 core and we use the default search handler.

2) For the shards, we use the URL
parameters, shards=s1/solr,s2/solr,s3/solr,...,s8/solr
    where s# point to a baremetal load balancer that routes the requests to
one of the two slave shards.

   There is definitely the chance that on each page, the load balancer is
mixing the shards used for searching.  Is it possible that Master1's two
slaves have two different counts?  This could explain it.

3) For the URL to execute the aggregating search, we have a virtual IP that
round robins to all 16 slaves in the cluster.

On a given search, any one of the 16 slaves may field the aggregation
request; the single-shard queries are then fired off to the load balancer,
which may mix up the nodes.


Other than the load balancing, the only difference between searches is the
start parameter in the URL that moves to the next page.


On Tue, Jun 19, 2012 at 4:20 PM, Yury Kats <yu...@yahoo.com> wrote:

> On 6/19/2012 4:06 PM, Justin Babuscio wrote:
> > Solr v3.5.0
> > 8 Master Shards
> > 2 Slaves Per Master
> >
> > Confirming that there are no active records being written, the "numFound"
> > value is decreasing as we page through the results.
> >
> > For example,
> > Page1 - numFound = 3683
> > Page2 - numFound = 3683
> > Page3 - numFound = 3683
> > Page4 - numFound = 2866
> > Page5 - numFound = 2419
> > Page6 - numFound = 1898
> > Page7 - numFound = 1898
> > ...
> > PageN - numFound = 1898
> >
> >
> >
> > It looks like it eventually settles on the real count.  Is this a
> > limitation of distributed clusters, or is numFound only intended to give
> > an approximation, similar to how Google reports total hits?
>
> numFound should return the real count for any given query.
> How are you specifying which shards/cores to use for each query?
> Does this change between queries?
>
>


-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com

Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

Posted by Yury Kats <yu...@yahoo.com>.
On 6/19/2012 4:06 PM, Justin Babuscio wrote:
> Solr v3.5.0
> 8 Master Shards
> 2 Slaves Per Master
> 
> Confirming that there are no active records being written, the "numFound"
> value is decreasing as we page through the results.
> 
> For example,
> Page1 - numFound = 3683
> Page2 - numFound = 3683
> Page3 - numFound = 3683
> Page4 - numFound = 2866
> Page5 - numFound = 2419
> Page6 - numFound = 1898
> Page7 - numFound = 1898
> ...
> PageN - numFound = 1898
> 
> 
> 
> It looks like it eventually settles on the real count.  Is this a
> limitation of distributed clusters, or is numFound only intended to give
> an approximation, similar to how Google reports total hits?

numFound should return the real count for any given query.
How are you specifying which shards/cores to use for each query?
Does this change between queries?


Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

Posted by Justin Babuscio <jb...@linchpinsoftware.com>.
I believe that is the issue.

We recently lost a physical server, and a misinformed (due to the weekend
fire) sysadmin moved one of the master shards.  This caused the automated
deployment scripts to change the order of publishing.  When a rebuild
followed the next day, we essentially wrote the same records to multiple
servers, causing the inflated counts.

Thank you for all the feedback in resolving this.  We are going to delete
our entire index, rebuild from scratch (achievable for our user base), and
it should clear up any discrepancies.

Justin

On Tue, Jun 19, 2012 at 5:40 PM, Chris Hostetter
<ho...@fucit.org> wrote:

> : Confirming that there are no active records being written, the "numFound"
> : value is decreasing as we page through the results.
>
> 1) check that the "clones" of each shard are in fact identical (just look
> at the index files on each machine and make sure they are the same).
>
> 2) distributed searching relies heavily on using a uniqueKey, and can
> behave oddly if documents with identical keys exist in multiple shards.
>
>
> http://wiki.apache.org/solr/DistributedSearch?#Distributed_Searching_Limitations
>
> If I remember correctly, what you are describing sounds like one of the
> things that can happen if you violate the uniqueKey rule across different
> shards when indexing.
>
> I *think* what you are seeing is that in the distributed request for
> page #1, the coordinator sums up the numFound from all shards and merges
> results 1-$rows according to the sort, and likewise for pages 2 & 3.  When
> you get to page #4, it suddenly sees that doc #9876543 is included in the
> responses from 3 different shards, and it subtracts 2 from the numFound,
> and so on as you page farther through the results.  The more documents
> with duplicate uniqueKeys it finds in the results as it pages through,
> the lower the cumulative numFound gets.
>
> : For example,
> : Page1 - numFound = 3683
> : Page2 - numFound = 3683
> : Page3 - numFound = 3683
> : Page4 - numFound = 2866
> : Page5 - numFound = 2419
> : Page6 - numFound = 1898
> : Page7 - numFound = 1898
> : ...
> : PageN - numFound = 1898
>
> -Hoss
>



-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com

Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

Posted by Chris Hostetter <ho...@fucit.org>.
: Confirming that there are no active records being written, the "numFound"
: value is decreasing as we page through the results.

1) check that the "clones" of each shard are in fact identical (just look 
at the index files on each machine and make sure they are the same).

2) distributed searching relies heavily on using a uniqueKey, and can
behave oddly if documents with identical keys exist in multiple shards.  

http://wiki.apache.org/solr/DistributedSearch?#Distributed_Searching_Limitations

If I remember correctly, what you are describing sounds like one of the
things that can happen if you violate the uniqueKey rule across different
shards when indexing.

I *think* what you are seeing is that in the distributed request for
page #1, the coordinator sums up the numFound from all shards and merges
results 1-$rows according to the sort, and likewise for pages 2 & 3.  When
you get to page #4, it suddenly sees that doc #9876543 is included in the
responses from 3 different shards, and it subtracts 2 from the numFound,
and so on as you page farther through the results.  The more documents
with duplicate uniqueKeys it finds in the results as it pages through,
the lower the cumulative numFound gets.

: For example,
: Page1 - numFound = 3683
: Page2 - numFound = 3683
: Page3 - numFound = 3683
: Page4 - numFound = 2866
: Page5 - numFound = 2419
: Page6 - numFound = 1898
: Page7 - numFound = 1898
: ...
: PageN - numFound = 1898
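
A toy model of that bookkeeping (my reading of the behavior, not the actual
Solr code):

def page_num_found(shard_hits, rows, page):
    """shard_hits: one (uniqueKey, score) list per shard, sorted by score.
    Returns the numFound the coordinator would report for the given page."""
    num_found = sum(len(hits) for hits in shard_hits)  # naive sum across shards
    merged = sorted((h for hits in shard_hits for h in hits),
                    key=lambda h: h[1], reverse=True)
    seen, returned = set(), 0
    for key, _score in merged:
        if key in seen:
            num_found -= 1   # duplicate spotted while merging: shrink the count
            continue
        seen.add(key)
        returned += 1
        if returned >= rows * page:
            break            # duplicates deeper in the results are not seen yet
    return num_found

# Three shards that all indexed doc "d" (a uniqueKey violation):
shards = [[("a", 9), ("d", 5)], [("b", 8), ("d", 5)], [("c", 7), ("d", 5)]]
for p in (1, 2, 3):
    print("page", p, "-> numFound =", page_num_found(shards, rows=2, page=p))
# page 1 -> numFound = 6, page 2 -> numFound = 6, page 3 -> numFound = 4

The count only drops once paging is deep enough for the merge to run into
the duplicates, which matches the decreasing numbers quoted above.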

-Hoss