Posted to users@solr.apache.org by Shawn Heisey <ap...@elyograg.org> on 2023/08/26 16:50:38 UTC

Weird issue -- pulling results with cursorMark gets fewer documents than numFound

Source Solr 4.7 SolrCloud, 3 shards, 7 replicas in the collection.
Target Solr 9.1.1 SolrCloud, 3 shards and 3 replicas.

Source version is a custom 4.7.0 version that mentions it includes 
SOLR-5875, which is a very small patch.  Target version is unmodified 
Solr 9.1.1.  The client on this is unwilling to change versions.

Schema meets the requirements for Atomic Update, so we are doing a 
migration by querying the old cluster and writing to the new cluster. 
We are doing it in batches by filtering on one of the fields, and using 
cursorMark to efficiently page through the results.
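For readers unfamiliar with cursorMark paging, the loop can be sketched roughly as below. This is a minimal Python illustration, not the migration code discussed in this thread (which uses SolrJ). The names `page_with_cursor`, `make_http_fetch`, and `fetch`, the `id` uniqueKey field, and the select URL are all assumptions made for the example. What it relies on from Solr: the first request sends cursorMark=*, the sort must include the uniqueKey field, each response carries nextCursorMark, and paging is finished when the returned mark equals the one that was sent.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, List


def make_http_fetch(select_url: str) -> Callable[[Dict[str, str]], dict]:
    """Build a fetch function for a /select URL (hypothetical helper)."""
    def fetch(params: Dict[str, str]) -> dict:
        # wt=json is requested explicitly so the response can be parsed as JSON.
        query = urllib.parse.urlencode({**params, "wt": "json"})
        with urllib.request.urlopen(f"{select_url}?{query}") as resp:
            return json.load(resp)
    return fetch


def page_with_cursor(fetch: Callable[[Dict[str, str]], dict],
                     query: str,
                     filter_query: str,
                     unique_key: str = "id",
                     rows: int = 10000) -> Iterator[List[dict]]:
    """Yield batches of documents until the cursor stops advancing."""
    cursor = "*"  # starting cursorMark value
    while True:
        response = fetch({
            "q": query,
            "fq": filter_query,
            "rows": str(rows),
            "sort": f"{unique_key} asc",  # sort must include the uniqueKey
            "cursorMark": cursor,
        })
        docs = response["response"]["docs"]
        if docs:
            yield docs
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # same mark came back: nothing left
            break
        cursor = next_cursor
```

Summing the batch lengths and comparing the total with numFound from the first response gives the kind of check described in this message.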

The query thread gets batches of 10000 documents and dumps them on a 
queue, which is then processed by indexing threads.  The query side uses 
Http2SolrClient with a URL, the target uses CloudHttp2SolrClient with zk 
info, and sets the option to send only to shard leaders.  The source 
collection is NRT because that's all that 4.7 supports; the target is 
TLOG.  Both SolrClient objects are set to use HTTP 1.1.

One of the batches always indexes 5 fewer documents than numFound.  It's 
consistent -- always 5 documents.  Updates are paused during the 
migration.  On the last run, numFound for this batch was 3824942 and the 
indexed count was 3824937.

The query batches are always 10000 except for the last one, which is 
4937.  The index batches are always 1000 except for the last one, which 
is 937.
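As a quick arithmetic check on the figures above (a small Python sketch; the function name `batch_sizes` is made up for the example): both last-batch sizes match the indexed count of 3824937, not the numFound of 3824942. That would be consistent with the 5 documents never being returned by the cursorMark queries, rather than being lost between the queue and the indexing threads.

```python
def batch_sizes(total: int, batch: int) -> tuple:
    """Return (number of full batches, size of the final partial batch)."""
    return divmod(total, batch)


indexed = 3824937
num_found = 3824942

print(batch_sizes(indexed, 10000))    # (382, 4937)  query batches
print(batch_sizes(indexed, 1000))     # (3824, 937)  index batches
print(batch_sizes(num_found, 10000))  # (382, 4942)  what numFound would imply
print(num_found - indexed)            # 5
```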

It probably doesn't matter, but the queue size is 500000.  There are two 
index threads.

I don't think there is a problem with the migration code.  The other 
batches (created with a filter query) are all working properly ... the 
number of documents indexed matches the numFound.  Total number of 
documents is a little over 30 million, so this batch is a little over 10 
percent of the total.

Has anyone seen a problem on 4.7.0 where numFound doesn't match the 
total document count retrieved with cursorMark?  The only thing I can 
imagine that would cause this is having a different numDocs count in 
each replica, but we have verified that these counts are all the same in 
every replica of each shard.

The other idea I have is that there could be a uniqueKey value that 
appears in more than one shard.  This doesn't seem likely, as the 
compositeId router should keep that from happening.  Is there a way to 
detect this situation?  I have an idea for a SolrJ program that would 
detect it; I am just hoping that Solr 4.7 might have something built in.
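One way such a check could work, sketched in Python rather than SolrJ and not taken from any existing tool: collect the uniqueKey values from one replica core of each shard (for example with fl=id and distrib=false, so each core answers only for itself), then report any value seen in more than one shard. The function name `find_cross_shard_duplicates` and the shard labels are hypothetical. Holding every ID in memory is assumed to be acceptable here; for tens of millions of documents that assumption may need revisiting.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_cross_shard_duplicates(
        ids_by_shard: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each ID found in more than one shard to the shards holding it."""
    shards_for_id = defaultdict(set)
    for shard, ids in ids_by_shard.items():
        for doc_id in ids:
            shards_for_id[doc_id].add(shard)
    return {
        doc_id: sorted(shards)
        for doc_id, shards in shards_for_id.items()
        if len(shards) > 1
    }


if __name__ == "__main__":
    sample = {
        "shard1": ["a!1", "a!2", "b!7"],
        "shard2": ["c!3", "b!7"],
        "shard3": ["d!4"],
    }
    print(find_cross_shard_duplicates(sample))  # {'b!7': ['shard1', 'shard2']}
```

A distributed query counts a document once per shard in numFound but merges duplicates by uniqueKey in the returned results, so an ID present in two shards is one plausible way to get a numFound higher than the number of documents retrieved.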

Thanks,
Shawn

Re: Weird issue -- pulling results with cursorMark gets fewer documents than numFound

Posted by Shawn Heisey <ap...@elyograg.org>.

On 8/28/23 11:42, Chris Hostetter wrote:
> I assume you mean one of the batches always indexes 5 fewer documents than
> 'rows=N' param (ie: the query batch size) ... correct?
> 
> You're talking about the total numFound being higher than the index count?

The query uses rows=10000, which is configurable via a commandline option.

The source collection's numFound is 5 higher than the number of 
documents indexed to the target.  I was assured that all updates to the 
source collection were paused during the most recent migration test.

> Also possible is that some shards are out of sync with their leader -- ie:
> for some shardX, replica1 has a doc that replica2 doesn't, and replica1 is
> used for the initial phase of the request to get the "top N sorted doc
> uniqueKey at cursorMark=ZZZ" but replica2 is used in the second phase to
> fetch all of the field values.  (but if that were the case, you'd expect
> that at least some of the time you'd get "lucky" and the two phases would
> both hit replicas that agreed with each other -- even if they didn't agree
> with the leader -- and the problem wouldn't reliably reproduce every time)

We did make sure that the numDocs was the same on all replicas for each 
shard.  A comprehensive check of ID values across replicas has not been 
done.  I should be able to write a program to do that.

> : should keep that from happening.  Is there a way to detect this situation?  I
> 
> I would log every cursorMark request URL and the number of docs in the
> response.

It has been verified that each cursorMark batch is 10000 docs except the 
last batch, by checking the size of the SolrDocumentList object 
retrieved from the response.  Added some debug-level logging to show 
that along with the cursorMark value.

I have finished my SolrJ program using Http2SolrClient that will look 
for IDs that exist in more than one shard.  I had hoped to have it get 
the list of core URLs from ZK, but couldn't figure that out, so now the 
commandline options accept multiple core-specific URLs, with the idea 
that one replica core from each shard will be presented.  I have tested 
it against my little Solr install, with the first URL pointing at the 
collection alias and the second pointing at the real core.  It's a 
single-shard collection on a single node.  As expected, it reported that 
every ID was duplicated.  We'll try it for real in the wee hours of the 
morning.

I put the program on github if anyone is interested in taking a look.

https://github.com/elyograg/shard_duplicate_finder

Thanks,
Shawn

Re: Weird issue -- pulling results with cursorMark gets fewer documents than numFound

Posted by Chris Hostetter <ho...@fucit.org>.

: Schema meets the requirements for Atomic Update, so we are doing a migration
: by querying the old cluster and writing to the new cluster. We are doing it in
: batches by filtering on one of the fields, and using cursorMark to efficiently
: page through the results.
	...
: The query thread gets batches of 10000 documents and dumps them on a 
	...
: One of the batches always indexes 5 fewer documents than numFound.  It's
: consistent -- always 5 documents.  Updates are paused during the migration.
: On the last run, numFound for this batch was 3824942 and the indexed count was
: 3824937.

I assume you mean one of the batches always indexes 5 fewer documents than 
'rows=N' param (ie: the query batch size) ... correct?   

You're talking about the total numFound being higher than the index count?

: The other idea I have is that there could be a uniqueKey value that appears in
: more than one shard.  This doesn't seem likely, as the compositeId router

Also possible is that some shards are out of sync with their leader -- ie: 
for some shardX, replica1 has a doc that replica2 doesn't, and replica1 is 
used for the initial phase of the request to get the "top N sorted doc 
uniqueKey at cursorMark=ZZZ" but replica2 is used in the second phase to 
fetch all of the field values.  (but if that were the case, you'd expect 
that at least some of the time you'd get "lucky" and the two phases would 
both hit replicas that agreed with each other -- even if they didn't agree 
with the leader -- and the problem wouldn't reliably reproduce every time)

: should keep that from happening.  Is there a way to detect this situation?  I

I would log every cursorMark request URL and the number of docs in the 
response.

If, at the end of the run, you see a cursorMark value that didn't return 
the same number of docs as your rows param (ignoring the last batch which 
you expect to be smaller) then go manually re-run that query against every 
replica of every shard using `distrib=false` and diff the responses from 
each replica of the same shard.
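The diff step could look something like the following Python sketch. It assumes the uniqueKey values returned by each replica of one shard have already been collected by re-running the suspect query with distrib=false against each core. The names `with_distrib_false`, `diff_replicas`, and the replica labels are invented for the example.

```python
from typing import Dict, Iterable, List


def with_distrib_false(params: Dict[str, str]) -> Dict[str, str]:
    """Copy request params and restrict the query to the core it is sent to."""
    return {**params, "distrib": "false"}


def diff_replicas(ids_by_replica: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """For replicas of ONE shard, list the IDs each replica is missing
    relative to the union of IDs seen across all of them."""
    id_sets = {replica: set(ids) for replica, ids in ids_by_replica.items()}
    all_ids = set().union(*id_sets.values())
    missing = {}
    for replica, ids in id_sets.items():
        absent = all_ids - ids
        if absent:
            missing[replica] = sorted(absent)
    return missing


if __name__ == "__main__":
    print(with_distrib_false({"q": "*:*", "cursorMark": "*"}))
    print(diff_replicas({
        "shard1_replica1": ["a", "b", "c"],
        "shard1_replica2": ["a", "c"],
    }))  # {'shard1_replica2': ['b']}
```

An empty result means the replicas returned the same set of IDs for that query; any entry names a replica that lacks documents its siblings have.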



-Hoss
http://www.lucidworks.com/