Posted to solr-user@lucene.apache.org by Pablo Anzorena <an...@gmail.com> on 2017/04/01 17:04:50 UTC
Re: Pagination bug? when sorting by a field (not unique field)

Excellent guys, thank you very much!

On Mar 29, 2017 18:09, "Erick Erickson" <er...@gmail.com> wrote:

> You might be helped by "distributed IDF".
> see: SOLR-1632
>
> On Wed, Mar 29, 2017 at 1:56 PM, Chris Hostetter
> <ho...@fucit.org> wrote:
> >
> > The thing to keep in mind is that w/o a fully deterministic sort,
> > the underlying problem statement "doc may appear on multiple pages" can
> > exist even in a single node solr index, even if no documents are
> > added/deleted between page requests: because background merges /
> > searcher re-opening may happen in between those page requests.
> >
> > The best practice, if you really care about ensuring no (non-updated) doc
> > is ever returned twice in subsequent pages, is to use a fully
> > deterministic sort, with a "tie breaker" clause that is unique to every
> > document (i.e., the uniqueKey field)
> >
> >
> >
> > : Date: Wed, 29 Mar 2017 23:14:22 +0300
> > : From: Mikhail Khludnev <mk...@apache.org>
> > : Reply-To: solr-user@lucene.apache.org
> > : To: solr-user <so...@lucene.apache.org>
> > : Subject: Re: Pagination bug? when sorting by a field (not unique field)
> > :
> > : Great explanation, Alessandro!
> > :
> > : Let me briefly explain my experience. I have a tiny test with 2 shards and
> > : 2 replicas, indexing about a hundred docs. When I fully paginate
> > : search results with score ranking, I get duplicates across pages.
> > : The reason is deletes, which probably occur due to update/failover. Every
> > : paging request lands on a different replica. There are a few workarounds:
> > : send consecutive requests to the same replicas; <optimize> also fixes
> > : duplicates; but tie-breaking is the best way for sure.
> > :
> > : On Wed, Mar 29, 2017 at 7:10 PM, alessandro.benedetti <a.benedetti@sease.io>
> > : wrote:
> > :
> > : > The reason Mikhail mentioned that is probably related to:
> > : >
> > : > *The way how number of document calculated is changed (LUCENE-6711)*
> > : > /The number of documents (docCount) is used to calculate term specificity
> > : > (idf) and average document length (avdl). Prior to LUCENE-6711,
> > : > collectionStats.maxDoc() was used for the statistics. Now,
> > : > collectionStats.docCount() is used whenever possible, if not maxDocs() is
> > : > used.
> > : > Assume that a collection contains 100 documents, and 50 of them have
> > : > "keywords" field. In this example, maxDocs is 100 while docCount is 50 for
> > : > the "keywords" field. The total number of tokens for "keywords" field is
> > : > divided by docCount to obtain avdl. Therefore, docCount which is the total
> > : > number of documents that have at least one term for the field, is a more
> > : > precise metric for optional fields.
> > : > DefaultSimilarity does not leverage avdl, so this change would have
> > : > relatively minor change in the result list. Because relative idf values of
> > : > terms will remain same. However, when combined with other factors such as
> > : > term frequency, relative ranking of documents could change. Some Similarity
> > : > implementations (such as the ones instantiated with NormalizationH2 and
> > : > BM25) take account into avdl and would have notable change in ranked list.
> > : > Especially if you have a collection of documents with varying lengths.
> > : > Because NormalizationH2 tends to punish documents longer than avdl./
> > : >
> > : > This means that if you are load balancing, the page 2 query could go to
> > : > another replica, where the doc is scored differently, ending up in a
> > : > different position (and maybe appearing again as a final effect).
> > : > This scenario applies to score ranking, so it will not affect sorting
> > : > (and I believe in your initial mail you were not referring to sorting)
> > : >
> > : > Cheers
> > : >
> > : >
> > : > Pablo wrote
> > : > > Mikhail,
> > : > >
> > : > > indeed, maxDocs and deletedDocs are different, but numDocs are
> > : > > ok.
> > : > >
> > : > > I don't really get it, but can that be the problem?
> > : >
> > : >
> > : >
> > : >
> > : >
> > : > -----
> > : > ---------------
> > : > Alessandro Benedetti
> > : > Search Consultant, R&D Software Engineer, Director
> > : > Sease Ltd. - www.sease.io
> > : > --
> > : > View this message in context:
> > : > http://lucene.472066.n3.nabble.com/Pagination-bug-when-sorting-by-a-field-not-unique-field-tp4327408p4327461.html
> > : > Sent from the Solr - User mailing list archive at Nabble.com.
> > : >
> > :
> > :
> > :
> > : --
> > : Sincerely yours
> > : Mikhail Khludnev
> > :
> >
> > -Hoss
> > http://www.lucidworks.com/
>
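
A minimal sketch of the tie-breaker advice from this thread, written as a
plain Python simulation rather than actual Solr code. It assumes two
replicas that hold the same four documents in a different internal order
(as can happen after merges or deletes), and that each page request lands
on a different replica. The field names "price" and "id" and the helper
page() are hypothetical; "id" stands in for the uniqueKey field. In Solr
terms the two cases would correspond roughly to sort=price asc versus
sort=price asc, id asc.

```python
from typing import Dict, List

Doc = Dict[str, object]


def page(docs: List[Doc], sort_keys: List[str], start: int, rows: int) -> List[str]:
    """Return the ids of one page of results from one replica.

    Python's sort is stable, so documents that tie on every sort key keep
    the replica's internal order, which mimics a non-deterministic sort.
    """
    ordered = sorted(docs, key=lambda d: tuple(d[k] for k in sort_keys))
    return [str(d["id"]) for d in ordered[start:start + rows]]


# Same four documents, all tied on "price", stored in a different
# internal order on each replica.
replica_a: List[Doc] = [{"id": str(i), "price": 10} for i in range(1, 5)]
replica_b: List[Doc] = list(reversed(replica_a))

# Sort on the non-unique field only: page 1 from replica A, page 2 from B.
naive_p1 = page(replica_a, ["price"], start=0, rows=2)
naive_p2 = page(replica_b, ["price"], start=2, rows=2)

# Same requests with the unique key appended as a tie breaker.
tb_p1 = page(replica_a, ["price", "id"], start=0, rows=2)
tb_p2 = page(replica_b, ["price", "id"], start=2, rows=2)

print("price only:      ", naive_p1, naive_p2)
print("price, then id:  ", tb_p1, tb_p2)
```

With the sort on price alone, the two pages overlap and two documents are
never returned; with the id tie breaker, both replicas agree on one total
order, so the pages are disjoint and complete. This only covers the
sort-by-field case; score differences between replicas are a separate
issue, as discussed above.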