Posted to java-user@lucene.apache.org by Tomás Fernández Löbbe <to...@gmail.com> on 2021/04/16 16:51:01 UTC

Re: Impact and WAND

I was looking at the nightly benchmarks[1] and noticed a big jump in
performance for conjunction queries when LUCENE-8060 was merged. I was
puzzled because I didn't expect BMW to help with this type of query, but
I guess that's the "other optimizations" you were talking about? Do you
have any pointers to those?


[1] https://home.apache.org/~mikemccand/lucenebench

On Thu, Jul 11, 2019 at 6:02 AM Atri Sharma <at...@linux.com> wrote:

> Note that any other scoring mode (COMPLETE or COMPLETE_NO_SCORES) will
> mandatorily visit all hits, so there is no scope for skipping and hence
> no point in using impacts.
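>
> A rough way to see this from the search API (a sketch only, assuming the
> Lucene 8.x TopScoreDocCollector.create(numHits, totalHitsThreshold)
> signature; the concrete numbers are arbitrary):
>
>   import java.io.IOException;
>   import org.apache.lucene.search.IndexSearcher;
>   import org.apache.lucene.search.Query;
>   import org.apache.lucene.search.TopDocs;
>   import org.apache.lucene.search.TopScoreDocCollector;
>
>   class ScoreModeExample {
>     // Top-k search with an approximate hit count: the collector reports
>     // ScoreMode.TOP_SCORES, so block-max WAND may skip non-competitive
>     // blocks once the priority queue is full enough.
>     static TopDocs topHits(IndexSearcher searcher, Query query) throws IOException {
>       TopScoreDocCollector collector = TopScoreDocCollector.create(10, 1000);
>       searcher.search(query, collector);
>       return collector.topDocs();
>     }
>
>     // Requesting an exact total hit count forces COMPLETE, so every match
>     // is visited and the impacts cannot help.
>     static TopDocs exactCount(IndexSearcher searcher, Query query) throws IOException {
>       TopScoreDocCollector collector =
>           TopScoreDocCollector.create(10, Integer.MAX_VALUE);
>       searcher.search(query, collector);
>       return collector.topDocs();
>     }
>   }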
>
> On Thu, Jul 11, 2019 at 8:51 AM Wu,Yunfeng <wu...@baidu.com> wrote:
> >
> >
> > @Adrien Grand <jp...@gmail.com>, thanks for your reply.
> >
> > The explanation `skip low-scoring matches` is great. I looked up some docs
> > and inspected some related code.
> >
> > I noticed the `block-max WAND` mode only works when ScoreMode.TOP_SCORES is
> > used, is that right? (The basic TermQuery generates an ImpactsDISI when the
> > scoreMode is TOP_SCORES.)
> >
> > Lucene computes a max score per block and caches it in `MaxScoreCache`;
> > this means we can skip low-scoring blocks (currently one block is 128 doc
> > IDs), while within a competitive block we still need to score every doc ID
> > we see. I am confused by `MaxScoreCache#getMaxScoreForLevel(int level)`:
> > what does the level mean? The skip level? (Somewhere this method is invoked
> > with an integer upTo parameter.)
> >
> > Thanks Lucene Team
> >
> >
> > On Jul 10, 2019, at 10:52 PM, Adrien Grand <jpountz@gmail.com> wrote:
> >
> > To clarify, the scoring process is not accelerated because we
> > terminate early but because we can skip low-scoring matches (there
> > might be competitive hits at the very end of the index).
> >
> > CompetitiveImpactAccumulator is indeed related to WAND. It helps store
> > the maximum score impacts per block of documents in postings lists.
> > Then this information is leveraged by block-max WAND in order to skip
> > low-scoring blocks.
> >
> > This indeed helps avoid reading not only norms, but also document IDs
> > and term frequencies.
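> >
> > Concretely, a block's maximum score can be derived from the stored (freq,
> > norm) pairs by scoring each pair with the similarity and keeping the
> > largest value. This is only a sketch of the idea (the real logic lives in
> > MaxScoreCache and ImpactsDISI), but it shows how impacts turn into a score
> > upper bound:
> >
> >   import java.util.List;
> >   import org.apache.lucene.index.Impact;
> >   import org.apache.lucene.search.similarities.Similarity;
> >
> >   class BlockMaxSketch {
> >     // Upper bound on the score that any doc of the block can reach: each
> >     // accumulated (freq, norm) impact is scored and the maximum is kept.
> >     static float maxScoreOfBlock(List<Impact> impacts, Similarity.SimScorer scorer) {
> >       float max = 0;
> >       for (Impact impact : impacts) {
> >         max = Math.max(max, scorer.score(impact.freq, impact.norm));
> >       }
> >       return max;
> >     }
> >   }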
> >
> > On Wed, Jul 10, 2019 at 4:10 PM Wu,Yunfeng <wuyunfeng01@baidu.com> wrote:
> >
> > Hi,
> >
> > We discussed some topics on https://github.com/apache/lucene-solr/pull/595.
> > As Atri Sharma proposed, I am bringing the discussion to the Java dev list.
> >
> >
> > Impacts record `frequency` and `norm` just to accelerate the scoring
> > process by terminating early, right?
> >
> > In impact mode, `CompetitiveImpactAccumulator` records (freq, norm) pairs,
> > which are stored at the index level. I also noted that
> > `CompetitiveImpactAccumulator` is documented with `This class accumulates
> > the (freq, norm) pairs that may produce competitive scores`; maybe this is
> > related to `WAND`?
> >
> >
> > The norm value is produced and consumed by `Lucene80NormsFormat`.
> >
> > With impacts, can we avoid reading norms from `Lucene80NormsProducer`,
> > which may generate extra IO (the norm value is then stored twice by
> > Lucene), and take full advantage of the WAND method?
> >
> >
> >
> > --
> > Adrien
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Impact and WAND

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
Ah, great. Thanks!

On Fri, Apr 16, 2021 at 10:57 AM Adrien Grand <jp...@gmail.com> wrote:

> Hi,
>
> Indeed BMW is only about disjunctions but the paper
> (http://engineering.nyu.edu/~suel/papers/bmw.pdf) briefly describes how
> block-max indexes can be used to speed up conjunctions as well using a
> simple algorithm that they call Block Max And, which we implemented in the
> BlockMaxConjunctionScorer class.

Re: Impact and WAND

Posted by Adrien Grand <jp...@gmail.com>.
Hi,

Indeed BMW is only about disjunctions but the paper
(http://engineering.nyu.edu/~suel/papers/bmw.pdf) briefly describes how
block-max indexes can be used to speed up conjunctions as well using a
simple algorithm that they call Block Max And, which we implemented in the
BlockMaxConjunctionScorer class.
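
The core of the trick is roughly the following (a simplified sketch, not the
actual BlockMaxConjunctionScorer code; the names are illustrative): within a
conjunction, the best score any document in the current blocks can reach is
the sum of the per-clause block maximum scores, so a whole window of doc IDs
can be skipped when that sum cannot beat the current minimum competitive
score.

  // Illustrative Block Max And bound: clauseBlockMaxScores[i] stands for the
  // maximum score that clause i can contribute within the current block.
  static boolean windowMayContainCompetitiveHit(float[] clauseBlockMaxScores,
                                                float minCompetitiveScore) {
    double upperBound = 0;
    for (float clauseMax : clauseBlockMaxScores) {
      upperBound += clauseMax;
    }
    // If even the sum of the per-clause maxima cannot beat the threshold,
    // no doc in this window can make it into the top hits.
    return upperBound > minCompetitiveScore;
  }

The block-level bound is what lets whole windows of doc IDs be skipped
without decoding or scoring them.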

On Fri, Apr 16, 2021 at 6:51 PM Tomás Fernández Löbbe <to...@gmail.com> wrote:

> I was looking at the nightly benchmarks[1] and noticed a big jump in
> performance for conjunction queries when LUCENE-8060 was merged. I was
> puzzled because I didn't expect BMW to help with this type of query, but
> I guess that's the "other optimizations" you were talking about? Do you
> have any pointers to those?
>
>
> [1] https://home.apache.org/~mikemccand/lucenebench