Posted to solr-user@lucene.apache.org by Per Steffensen <st...@designware.dk> on 2012/08/28 08:39:02 UTC

Searching and sorting over multiple collections

Hi

Based on what we have seen in recent tests, I have come to doubt how Solr 
search is actually supposed to behave.

* Searching with 
"distrib=true&q=*:*&rows=10&collection=x,y,z&sort=timestamp asc"
** Is Solr supposed to return the 10 documents with the lowest timestamp 
across all documents in all slices of collections x, y and z, or is it 
supposed to pick 10 arbitrary documents from those slices and sort only 
those 10?
** Put another way: is this search supposed to be consistent, 
returning exactly the same set of documents when performed several times 
(with no documents updated between consecutive searches)?

* A search returns a "numFound" field telling how many documents in 
total match the search criteria, even though not all of those documents are 
returned by the search. It may be a crazy question to ask, but I will ask it 
anyway because we actually see a problem with this. Isn't it correct that 
two searches that differ only in the "rows" number (documents to be 
returned) should always return the same value for "numFound"?

Thanks!

Regards, Steff

Re: Searching and sorting over multiple collections

Posted by Per Steffensen <st...@designware.dk>.
Per Steffensen wrote:
> Hi
>
> Based on what we have seen in recent tests, I have come to doubt how Solr 
> search is actually supposed to behave.
>
> * Searching with 
> "distrib=true&q=*:*&rows=10&collection=x,y,z&sort=timestamp asc"
> ** Is Solr supposed to return the 10 documents with the lowest 
> timestamp across all documents in all slices of collections x, y and z, 
> or is it supposed to pick 10 arbitrary documents from those slices 
> and sort only those 10?
> ** Put another way: is this search supposed to be consistent, 
> returning exactly the same set of documents when performed several 
> times (with no documents updated between consecutive searches)?
Fortunately, I believe the answer is that it ought to "return the 10 
documents with the lowest timestamp across all documents in all slices 
of collection x, y and z". The reason I asked was that I got 
different responses for consecutive similar requests. Now I believe this 
can be explained by the bug described below. I guess that when you do 
cross-collection/shard searches, the "request-handling" Solr forwards 
the query to all involved shards simultaneously and merges the sub-results 
into the final result as they are returned from the shards. Because of 
the "consider documents with the same id as the same document even though 
they come from different collections" bug, it is more or less random 
(depending on which shards respond first/last) which collection the 
document with a given id is taken from. And if documents with 
the same id from different collections have different timestamps, it is 
random where that document ends up in the final sorted result.

So I believe this inconsistency can be explained by the bug described below.
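
To make the suspected mechanism concrete, here is a minimal Python sketch of a merge that keeps only the first document seen for each id. It is a simplified model of what I guess is happening, not Solr's actual merge code; the function name merge_shard_responses and the sample documents are made up for illustration.

```python
def merge_shard_responses(responses, rows):
    """Merge per-collection results in arrival order.

    Simplified model (an assumption, not Solr's real code): the first
    document seen for a given id wins, later ones with the same id are
    dropped, and the survivors are sorted by timestamp ascending.
    """
    seen = {}
    for collection, docs in responses:  # list order = arrival order
        for doc in docs:
            if doc["id"] not in seen:
                seen[doc["id"]] = {**doc, "collection": collection}
    merged = sorted(seen.values(), key=lambda d: d["timestamp"])
    return merged[:rows]


# Hypothetical data: id "1" exists in both collections with different timestamps.
x = ("x", [{"id": "1", "timestamp": 10}, {"id": "2", "timestamp": 25}])
y = ("y", [{"id": "1", "timestamp": 30}, {"id": "3", "timestamp": 20}])

x_answers_first = merge_shard_responses([x, y], rows=2)
y_answers_first = merge_shard_responses([y, x], rows=2)

print([d["id"] for d in x_answers_first])
print([d["id"] for d in y_answers_first])
```

With this data the two arrival orders give different top-2 lists, because the copy of id "1" that survives carries a different timestamp in each case.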
>
> * A search returns a "numFound" field telling how many documents in 
> total match the search criteria, even though not all of those 
> documents are returned by the search. It may be a crazy question to ask, 
> but I will ask it anyway because we actually see a problem with this. 
> Isn't it correct that two searches that differ only in the 
> "rows" number (documents to be returned) should always return the same 
> value for "numFound"?
Well, I found out myself what the problem is (or seems to be) - see:
http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-td2460645.html
http://lucene.472066.n3.nabble.com/numFound-inconsistent-for-different-rows-param-td3997269.html
http://lucene.472066.n3.nabble.com/Solr-v3-5-0-numFound-changes-when-paging-through-results-on-8-shard-cluster-td3990400.html

Until 4.0 this "bug" could be "ignored" because it was OK for a 
cross-shard search to consider documents with identical ids as duplicates 
and therefore return/count only one of them. In 4.0 this is still 
OK within the same collection, but across collections identical ids 
should not be considered duplicates and should not reduce the number of 
documents returned/counted. So I believe this "feature" has now become a 
bug in 4.0 when it comes to cross-collection searches.
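
A rough Python sketch of why "numFound" could then depend on "rows". It assumes a simplified model: each shard reports its own match count but sends back only its top "rows" ids, and the merging node subtracts one for every duplicate id it happens to see. The function merged_num_found and the per_collection switch are hypothetical and only show what I would expect across collections; they are not taken from Solr's code or from the resolution of SOLR-3765.

```python
def merged_num_found(shard_results, rows, per_collection=False):
    """Sum per-shard match counts, minus duplicates visible in the top `rows`.

    shard_results: list of (collection, ids), ids already in sort order.
    per_collection: if True, ids are only duplicates within one collection.
    """
    total = 0
    seen = set()
    for collection, ids in shard_results:
        total += len(ids)  # the shard's own count of matching documents
        for doc_id in ids[:rows]:  # only the top `rows` ids reach the merger
            key = (collection, doc_id) if per_collection else doc_id
            if key in seen:
                total -= 1  # duplicate spotted: counted only once
            else:
                seen.add(key)
    return total


# Hypothetical data: ids "a" and "b" exist in both collections.
results = [("x", ["a", "b", "c"]), ("y", ["a", "b", "d"])]

by_id = [merged_num_found(results, rows) for rows in (1, 2, 3)]
by_collection_and_id = [
    merged_num_found(results, rows, per_collection=True) for rows in (1, 2, 3)
]
print(by_id, by_collection_and_id)
```

In this model the count changes with "rows" only because a larger "rows" exposes more duplicate ids to the merging node; keying on collection plus id keeps the count stable.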

Created an issue/bug: SOLR-3765
>
> Thanks!
>
> Regards, Steff
>